julia-users › SIMD multicore (7 posts by 4 authors)

Jason Eckstein Apr 16

I noticed that in Julia 0.4, if you call A + B where A and B are matrices of equal size, the LLVM code shows vectorization, indicating it is equivalent to writing my own function with an @simd-tagged for loop. I still notice, though, that it uses a single core to maximum capacity but never spreads an SIMD loop out over multiple cores. In contrast, if I use BLAS functions like gemm!, or even just A * B, it will use every core of the processor. I'm not sure whether these linear algebra operations also use SIMD vectorization, but I imagine they do, since BLAS is very optimized. Is there a way to write an SIMD loop that spreads the data out across all processor cores, not just the multiple functional units of a single core?

Valentin Churavy Apr 16

BLAS uses a combination of SIMD and multi-core processing. Multi-core threading is coming in Julia v0.5 as an experimental feature.

Jason Eckstein Apr 16

I often use Julia's multicore features with pmap and @parallel for loops. So the best way to achieve this is to split the array up into parts for each core and then run SIMD loops on each parallel process? Will there ever be a time when you can add a tag like @simd that will have the compiler do this automatically, the way it happens for BLAS functions?

Chris Rackauckas Apr 16

BLAS functions are painstakingly developed to be beautiful bastions of parallelism because of how ubiquitous their use is. The closest I think you can get is ParallelAccelerator.jl's @acc, which applies a lot of optimizations all together. However, it still won't match BLAS in terms of efficiency, since BLAS is really well optimized by hand. But give ParallelAccelerator a try; it's a great tool for getting things to run fast with little work.
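[For readers unfamiliar with the package Chris mentions: below is a minimal sketch of the usage pattern ParallelAccelerator.jl documented at the time, where @acc is applied to a whole function so the package can fuse and parallelize the pointwise array operations inside it. The function name scaled_add and the sizes are made up for illustration, and the sketch assumes the package is installed on a 2016-era Julia.]

    using ParallelAccelerator   # Pkg.add("ParallelAccelerator")

    # @acc compiles the annotated function, fusing the elementwise
    # operations and running them in parallel where it can.
    @acc function scaled_add(a, X, Y)
        a .* X .+ Y
    end

    X = rand(Float32, 5000, 5000)
    Y = rand(Float32, 5000, 5000)
    scaled_add(2.0f0, X, Y)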
Jiahao Chen Apr 16

Yes, optimized BLAS implementations like MKL and OpenBLAS use vectorization heavily.

Note that matrix addition A + B is fundamentally a very different beast from matrix multiplication A * B. In the former you have O(N^2) work and O(N^2) data, so the ratio of work to data is O(1). It is very likely that the operation is memory bound, in which case there is little to gain from optimizing the computations. In the latter you have O(N^3) work and O(N^2) data, so the ratio of work to data is O(N). There is a good chance the operation is compute bound, and so there is a payoff to optimizing such computations.

Thanks,

Jiahao Chen
Research Scientist, Julia Lab
jiahao.github.io

Jason Eckstein Apr 17

There's also a BLAS operation for a*X + Y, which is axpy!(a, X, Y). I tried it with the following lines in a normal interactive session:

    X = rand(Float32, 5000, 5000)
    Y = rand(Float32, 5000, 5000)
    a = 2.0f0
    for i = 1:100
        BLAS.axpy!(a, X, Y)
    end

and noticed that all the cores were in use, near 100% CPU utilization, so even for matrix addition BLAS uses parallel processing and SIMD. For that reason I think any SIMD for loop that applies a single simple function to an array would benefit from similar memory splits across the cores. Also, when I use code_llvm on BLAS operations I never see the vectorized instructions in the output, but I do for native Julia functions like X + Y. Is that because Julia is calling a precompiled library and doesn't directly see the byte code?

Jiahao Chen Apr 17

> Is that because Julia is calling a precompiled library and doesn't directly see the byte code?

Yes.
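[On Jason's earlier question about splitting an array into parts per core and running SIMD loops on each parallel process: with the process-based model available in Julia 0.4, one hedged way to sketch that is a SharedArray plus an @parallel loop, so each worker vectorizes its own block of columns. The names and sizes here are illustrative, not from the thread.]

    addprocs(Sys.CPU_CORES - 1)         # one worker per remaining core

    X = SharedArray(Float32, (5000, 5000))
    Y = SharedArray(Float32, (5000, 5000))
    rand!(X); rand!(Y)                  # fill with random data

    # @parallel splits the column range across the workers; each
    # worker's inner loop can still vectorize on its own core.
    @sync @parallel for j in 1:size(X, 2)
        @inbounds @simd for i in 1:size(X, 1)
            Y[i, j] += X[i, j]
        end
    end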
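[And with the experimental threading Valentin mentions, a v0.5 version of the same idea might look like the sketch below: threads split the columns, @simd vectorizes each thread's inner loop. threaded_add! is a hypothetical name, and JULIA_NUM_THREADS must be set in the environment before Julia starts.]

    # Requires, e.g., JULIA_NUM_THREADS=4 in the environment (Julia v0.5+).
    function threaded_add!(C, A, B)
        Threads.@threads for j in 1:size(A, 2)     # one block of columns per thread
            @inbounds @simd for i in 1:size(A, 1)  # SIMD within each thread
                C[i, j] = A[i, j] + B[i, j]
            end
        end
        return C
    end

    A = rand(Float32, 5000, 5000)
    B = rand(Float32, 5000, 5000)
    C = similar(A)
    threaded_add!(C, A, B)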