Optimise reductions and dotprod with more vectorisation

Turns out that if you don't supply -ffast-math, the C compiler will faithfully reproduce your linear reduction order, which is rather disastrous for parallelisation with vector units. This commit changes the summation order in the reductions and dot product so that they can be vectorised, which means numerical results might differ slightly. Accordingly, the test suite needed adjustment.
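
To illustrate the point about reduction order, here is a minimal sketch in C. It is not the code from this commit; the names `sum_linear` and `sum_4acc` are hypothetical. The first function has a single dependency chain, which the compiler has to preserve because floating-point addition is not associative. The second spells out four independent partial sums, which is one common way to give the compiler and the vector units something to work with without -ffast-math.

```c
#include <stddef.h>

/* Strictly left-to-right sum: every addition depends on the previous one,
 * and without -ffast-math the compiler must keep exactly this order. */
double sum_linear(const double *xs, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += xs[i];
    }
    return acc;
}

/* Hypothetical alternative: four independent partial sums that are only
 * combined at the end. The summation order differs from sum_linear. */
double sum_4acc(const double *xs, size_t n) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += xs[i];
        a1 += xs[i + 1];
        a2 += xs[i + 2];
        a3 += xs[i + 3];
    }
    double acc = (a0 + a1) + (a2 + a3);
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        acc += xs[i];
    }
    return acc;
}
```

Both functions agree exactly when every intermediate sum is exactly representable (small integers, say), but on general inputs they can differ in the last few bits because the rounding happens at different points. That is the kind of discrepancy the test suite has to tolerate.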
author: Tom Smeding <tom@tomsmeding.com> 2025-03-14 21:57:56 +0100
committer: Tom Smeding <tom@tomsmeding.com> 2025-03-14 21:58:51 +0100
commit: 6276ed3c7bcd20c8b860e1275386ecd068671bcc (patch)
tree: b2710f261d12a7a1b73962691c187752663543f6 /src/Data/Array
parent: 308ca9fac150cd28d62afef852f26ae4c40fa5a0 (diff)
1 files changed, 0 insertions, 1 deletions
diff --git a/src/Data/Array/Mixed/Internal/Arith/Foreign.hs b/src/Data/Array/Mixed/Internal/Arith/Foreign.hs
index 15fbc79..969a25a 100644
--- a/src/Data/Array/Mixed/Internal/Arith/Foreign.hs
+++ b/src/Data/Array/Mixed/Internal/Arith/Foreign.hs
@@ -20,7 +20,6 @@ $(do
         ,("reducefull_" ++ tyn,              [t| CInt -> Int64 -> Ptr Int64 -> Ptr Int64 -> Ptr $ttyp -> IO $ttyp |])
         ,("extremum_min_" ++ tyn,            [t| Ptr Int64 -> Int64 -> Ptr Int64 -> Ptr Int64 -> Ptr $ttyp -> IO () |])
         ,("extremum_max_" ++ tyn,            [t| Ptr Int64 -> Int64 -> Ptr Int64 -> Ptr Int64 -> Ptr $ttyp -> IO () |])
-        ,("dotprod_" ++ tyn,                 [t| Int64 -> Ptr $ttyp -> Ptr $ttyp -> IO $ttyp |])
         ,("dotprod_" ++ tyn ++ "_strided",   [t| Int64 -> Int64 -> Int64 -> Ptr $ttyp -> Int64 -> Int64 -> Ptr $ttyp -> IO $ttyp |])
         ,("dotprodinner_" ++ tyn,            [t| Int64 -> Ptr Int64 -> Ptr $ttyp -> Ptr Int64 -> Ptr $ttyp -> Ptr Int64 -> Ptr $ttyp -> IO () |])
         ]