julia-users › Fast vector element-wise multiplication
15 posts by 7 authors

Lionel du Peloux  10/6/15

Dear all,

I'm looking for the fastest way to do element-wise vector multiplication in Julia. The best I could do is the following implementation, which still runs 1.5x slower than the dot product. I assume the dot product includes such an operation ... and then does a cumulative sum over the element-wise products. The MKL lib includes such an operation (v?Mul), but it seems OpenBLAS does not.

So my questions are:

1) Is there any chance I can do vector element-wise multiplication faster than the actual dot product?
2) Why is the built-in element-wise multiplication operator (.*) so much slower than my own implementation for such a basic linalg operation (full Julia)?

Thank you,
Lionel

Best custom implementation:

    function xpy!{T<:Number}(A::Vector{T}, B::Vector{T})
        n = size(A)[1]
        if n == size(B)[1]
            for i = 1:n
                @inbounds A[i] *= B[i]
            end
        end
        return A
    end

Benchmark results (JuliaBox, A = randn(300000)):

    function                            CPU (s)     GC (%)   ALLOCATION (bytes)   CPU (x)
    dot(A,B)                            1.58e-04    0.00     16                   1.0
    xpy!(A,B)                           2.31e-04    0.00     80                   1.5
    NumericExtensions.multiply!(P,Q)    3.60e-04    0.00     80                   2.3
    xpy!(A,B) - no @inbounds check      4.36e-04    0.00     80                   2.8
    P .* Q                              2.52e-03    50.36    2400512              16.0

Tomas Lycken  10/6/15

I made some simple changes to your `xpy!`, and managed to get it to allocate nothing at all, while performing very close to the speed of `dot`. I don't know anything about e.g. `@simd` instructions, but I imagine they could help speed this up even further.

The most significant change was switching `size(A)[1]` to `size(A,1)` (and similarly for `B`) - the former has to construct and index into a tuple, while the latter won't have to do that. `length(A)` would have worked too.

Notebook (also produced on JuliaBox, running Julia 0.4-rc2): http://nbviewer.ipython.org/github/tlycken/IJulia-Notebooks/blob/master/dot%20vs%20xpy%21.ipynb

// T

Patrick Kofod Mogensen  10/6/15

Well, I guess your table pretty much shows it, right? It seems like it allocates a lot of temporary memory to carry out the calculations.

Christoph Ortner  10/6/15

a *= b is equivalent to a = a * b, which allocates a temporary variable, I think? Try

    @fastmath @inbounds @simd for i = 1:n
        A[i] *= B[i]
    end

Christoph Ortner  10/6/15

or, possibly,

    A[i] = A[i] * B[i]

I'm not sure whether @simd automatically translates *= into what it needs.

Steven G. Johnson  10/6/15

On Tuesday, October 6, 2015 at 12:29:04 PM UTC-4, Christoph Ortner wrote:
> a *= b is equivalent to a = a * b, which allocates a temporary variable, I think?

A *= A only allocates memory on the heap if A is an array or some other heap-allocated datatype. For A[i] *= B[i], where A[i] and B[i] are small scalar types like Float64, no temporary is allocated; the compiler just puts the result in a register.

Patrick Kofod Mogensen  10/6/15

That was supposed to be "A *= B only allocates...", right?

Steven G. Johnson  10/6/15

On Tuesday, October 6, 2015 at 2:23:33 PM UTC-4, Patrick Kofod Mogensen wrote:
> That was supposed to be "A *= B only allocates...", right?

Yes.

Steven G. Johnson  10/6/15

Note that the BLAS dot product probably uses all sorts of tricks to squeeze the last cycle of SIMD performance out of the CPU. E.g. here is the OpenBLAS ddot function for Sandy Bridge, which is hand-coded in assembly:

https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/ddot_microk_sandy-2.c

Getting the last 30% or so of this performance can be extremely tricky.

Lionel du Peloux  10/6/15

Thank you for all of your suggestions. The @simd macro effectively gives a very slightly improved performance (5%).
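(For reference, here is a minimal sketch pulling the thread's suggestions together: length instead of size(A)[1], a dimension check, and the @fastmath @inbounds @simd loop. The name xpy_simd! is made up for illustration, using the Julia 0.4-era generic syntax the thread is written in; it is a sketch of the combination discussed above, not a canonical implementation.)

    # In-place element-wise product: A[i] = A[i] * B[i].
    # Sketch assembled from the suggestions in this thread (Julia 0.4 syntax).
    function xpy_simd!{T<:Number}(A::Vector{T}, B::Vector{T})
        n = length(A)  # cheaper than size(A)[1], which builds and indexes a tuple
        n == length(B) || throw(DimensionMismatch("A and B must have equal length"))
        @fastmath @inbounds @simd for i = 1:n
            A[i] *= B[i]  # scalar *= on Float64 stays in a register, no allocation
        end
        return A
    end

    # Usage:
    A, B = randn(300000), randn(300000)
    xpy_simd!(A, B)  # A now holds the element-wise product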
chobb...@gmail.com  Jun 20

I have the same question regarding how to calculate the entry-wise vector product, and found this thread. As a novice, I wonder if the following code snippet is still the standard for entry-wise vector multiplication that one should stick to in practice? Thanks!

    @fastmath @inbounds @simd for i = 1:n
        A[i] *= B[i]
    end

Chris Rackauckas  Jun 20

I think that for medium-size (but not large) arrays in v0.5 you may want to use threads (from the threading branch), and then for really large arrays you may want to use @parallel. But you'd have to test some timings.

chobb...@gmail.com  Jun 20

Thanks! I'm still using v0.4.5. In this case, is the code I highlighted above still the best choice for doing the job?

Chris Rackauckas  Jun 20

Most likely. I would also time it with and without @simd at your problem size. For some reason I've had some simple loops do better without @simd.

chobb...@gmail.com  Jun 20

Thanks for the confirmation! Yes, I need more tests to see what the best practice is for my particular problem.
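(For reference, a rough way to run the timing comparison Chris suggests, with and without @simd. The function names here are hypothetical, and a single @elapsed call is a crude stand-in for a proper benchmark with many repetitions; results will vary with problem size and hardware.)

    # Same in-place kernel, with and without @simd, for timing comparison.
    function mul_simd!(A, B)
        @fastmath @inbounds @simd for i = 1:length(A)
            A[i] *= B[i]
        end
        return A
    end

    function mul_plain!(A, B)
        @inbounds for i = 1:length(A)
            A[i] *= B[i]
        end
        return A
    end

    n = 300000
    A, B = randn(n), randn(n)
    mul_simd!(copy(A), B)                 # warm-up: compile both variants first
    mul_plain!(copy(A), B)
    A1, A2 = copy(A), copy(A)             # fresh inputs so both time the same work
    t_simd  = @elapsed mul_simd!(A1, B)
    t_plain = @elapsed mul_plain!(A2, B)
    println("with @simd: ", t_simd, " s; without: ", t_plain, " s")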