julia-users › SIMD long inner expression
3 posts by 3 authors

Chris Rackauckas  Feb 4

Hi, I am trying to optimize a code which just adds a bunch of things. My first instinct was to unravel the loops and run it as SIMD, like so:

dx = 1/400
addprocs(3)
imin = -6
jmin = -2

# Some SIMD
@time res = @sync @parallel (+) for i = imin:dx:0
  tmp = 0
  for j = jmin:dx:0
    ans = 0
    @simd for k = 1:24
      @inbounds ans += coefs[k]*i^powz[k]*j^poww[k]
    end
    tmp += abs(ans) < 1
  end
  tmp
end
res = res*12/(length(imin:dx:0)*length(jmin:dx:0))
println(res) # Mathematica gives 2.98

(Coefficient arrays are posted at the bottom for completeness.) On a 4-core i7 this runs in 4.68 seconds. I was able to beat this by optimizing the power calculations like so:

# Full unravel
@time res = @sync @parallel (+) for i = imin:dx:0
  tmp = 0
  isq2 = i*i;       isq3 = i*isq2;  isq4 = isq2*isq2; isq5 = i*isq4
  isq6 = isq4*isq2; isq7 = i*isq6;  isq8 = isq4*isq4
  for j = jmin:dx:0
    jsq2 = j*j; jsq4 = jsq2*jsq2; jsq6 = jsq2*jsq4; jsq8 = jsq4*jsq4
    @inbounds tmp += abs(coefs[1]*jsq2  + coefs[2]*jsq4        + coefs[3]*jsq6       + coefs[4]*jsq8 +
                         coefs[5]*i     + coefs[6]*i*jsq2      + coefs[7]*i*jsq4     + coefs[8]*i*jsq6 +
                         coefs[9]*isq2  + coefs[10]*isq2*jsq2  + coefs[11]*isq2*jsq4 + coefs[12]*isq2*jsq6 +
                         coefs[13]*isq3 + coefs[14]*isq3*jsq2  + coefs[15]*isq3*jsq4 +
                         coefs[16]*isq4 + coefs[17]*isq4*jsq2  + coefs[18]*isq4*jsq4 +
                         coefs[19]*isq5 + coefs[20]*isq5*jsq2  +
                         coefs[21]*isq6 + coefs[22]*isq6*jsq2  +
                         coefs[23]*isq7 + coefs[24]*isq8) < 1
  end
  tmp
end
res = res*12/(length(imin:dx:0)*length(jmin:dx:0))
println(res) # Mathematica gives 2.98

This is the best version and gets 1.83 seconds (FYI, it beats a version where I distributed out by the isq terms). However, when I put this into a function and call it with code_llvm, I get:

define %jl_value_t* @julia_simdCheck_3161() {
top:
  %0 = alloca [8 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [8 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 2
  %2 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 3
  store %jl_value_t* inttoptr (i64 12 to %jl_value_t*), %jl_value_t** %.sub, align 8
  %3 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 1
  %4 = load %jl_value_t*** @jl_pgcstack, align 8
  %.c = bitcast %jl_value_t** %4 to %jl_value_t*
  store %jl_value_t* %.c, %jl_value_t** %3, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %1, align 8
  store %jl_value_t* null, %jl_value_t** %2, align 8
  %5 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %5, align 8
  %6 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %7 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 6
  store %jl_value_t* null, %jl_value_t** %7, align 8
  %8 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 7
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = call %jl_value_t* @julia_sync_begin_535()
  store %jl_value_t* inttoptr (i64 2274649264 to %jl_value_t*), %jl_value_t** %2, align 8
  %10 = load %jl_value_t* (%jl_value_t*, %jl_value_t**, i32)** inttoptr (i64 2274649264 to %jl_value_t* (%jl_value_t*, %jl_value_t**, i32)**), align 16
  %11 = load %jl_value_t** inttoptr (i64 2212282536 to %jl_value_t**), align 8
  store %jl_value_t* %11, %jl_value_t** %5, align 8
  %12 = load %jl_value_t** inttoptr (i64 2197909240 to %jl_value_t**), align 8
  store %jl_value_t* %12, %jl_value_t** %6, align 8
  %13 = load %jl_value_t** inttoptr (i64 2197909288 to %jl_value_t**), align 8
  store %jl_value_t* %13, %jl_value_t** %7, align 8
  %14 = load %jl_value_t** inttoptr (i64 2212418552 to %jl_value_t**), align 8
  store %jl_value_t* %14, %jl_value_t** %8, align 8
  %15 = call %jl_value_t* %10(%jl_value_t* inttoptr (i64 2274649264 to %jl_value_t*), %jl_value_t** %5, i32 4)
  store %jl_value_t* %15, %jl_value_t** %1, align 8
  call void @julia_sync_end_541()
  %16 = load %jl_value_t** %3, align 8
  %17 = getelementptr inbounds %jl_value_t* %16, i64 0, i32 0
  store %jl_value_t** %17, %jl_value_t*** @jl_pgcstack, align 8
  ret %jl_value_t* %15
}

This shows me that LLVM is not auto-vectorizing the inner part. Is there a way to tell LLVM to SIMD-vectorize the big summation in this more optimized form?

For completeness, here are the coefficient arrays (note: coefs should be treated as variable; in principle those zeros could change, so there's no manual optimizing to be done there):

coefs = [1.3333333333333328, 0.16666666666666669, 0.0, 0.0,
         2.0, 2.333333333333335, 0.06250000000000001, 0.0,
         2.0, 1.1666666666666659, 0.019097222222222224, 0.0,
         1.0, 3.469446951953614e-18, 0.0,
         0.25, 0.010416666666666666, 0.0,
         0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

powz = [0.0, 0.0, 0.0, 0.0,
        1.0, 1.0, 1.0, 1.0,
        2.0, 2.0, 2.0, 2.0,
        3.0, 3.0, 3.0,
        4.0, 4.0, 4.0,
        5.0, 5.0,
        6.0, 6.0,
        7.0,
        8.0]

poww = [2.0, 4.0, 6.0, 8.0,
        0.0, 2.0, 4.0, 6.0,
        0.0, 2.0, 4.0, 6.0,
        0.0, 2.0, 4.0,
        0.0, 2.0, 4.0,
        0.0, 2.0,
        0.0, 2.0,
        0.0,
        0.0]
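For readers who want to reproduce the computation, here is a self-contained serial sketch of the first ("Some SIMD") version, with the addprocs/@parallel scaffolding dropped. The function name and keyword defaults are illustrative, not from the thread, and without the parallelism the timing will not match the 4-core numbers above:

# Serial sketch of the "Some SIMD" loop; coefs/powz/poww as given at
# the bottom of the post. Counts grid points where |ans| < 1, then
# scales by the 12 = 6*2 box area, as in the post's final line.
function region_area(coefs, powz, poww; dx = 1/400, imin = -6, jmin = -2)
    count = 0
    for i = imin:dx:0, j = jmin:dx:0
        ans = 0.0
        @simd for k = 1:24
            @inbounds ans += coefs[k]*i^powz[k]*j^poww[k]
        end
        count += abs(ans) < 1
    end
    return count*12/(length(imin:dx:0)*length(jmin:dx:0))
end

println(region_area(coefs, powz, poww)) # the post reports ~2.98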
alan souza  Feb 4

Hi. I think that the number of variables in the inner loop is too high and the compiler can't vectorize the code because of an insufficient number of available registers (see https://en.wikipedia.org/wiki/Register_allocation). Even if the compiler could vectorize this loop, the speedup would be at best two times for SSE code (could you use Float32?). I believe it would be easier for the compiler if you split your computation: calculate the arrays with the powers, then do a pointwise multiplication, and finally a reduction.

Erik Schnetter  Feb 4

Chris

To get good performance, you need to put your code into a function. You seem to be evaluating it directly at the REPL -- this will be slow. See the Performance Tips in the manual.

The LLVM code you see is not your kernel code. Instead, it contains a call statement, presumably to a function that contains all the code in the parallel region. You can try creating a function for your actual kernel, which goes from tmp = 0 to the tmp near the bottom, and running code_llvm on it.

-erik

--
Erik Schnetter <schn...@gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/
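To make the two replies concrete, here is a hedged sketch that combines them: the kernel is pulled out into its own function, so that code_llvm shows the loop itself rather than the @parallel wrapper, and the summation is split into a precomputed power table, a pointwise multiply, and a reduction. The names inner_sum and fill_powers! and the zpow/wpow buffers are illustrative, not from the thread:

# Reduction over precomputed power tables; inspect the generated code
# with @code_llvm inner_sum(coefs, zpow, wpow)
function inner_sum(coefs, zpow, wpow)
    ans = 0.0
    @simd for k = 1:24
        @inbounds ans += coefs[k]*zpow[k]*wpow[k]
    end
    return ans
end

# Fill buf[k] = base^exps[k]; exps holds small integral exponents
function fill_powers!(buf, base, exps)
    @inbounds for k = 1:24
        buf[k] = base^exps[k]
    end
    return buf
end

# Usage inside the i/j loops, with the buffers allocated once:
#   zpow = zeros(24); wpow = zeros(24)
#   fill_powers!(zpow, i, powz); fill_powers!(wpow, j, poww)
#   tmp += abs(inner_sum(coefs, zpow, wpow)) < 1

With the kernel isolated this way, the code_llvm output should show directly whether LLVM emits <2 x double> (SSE) or <4 x double> (AVX) vector operations for the reduction, instead of the opaque call seen in the wrapper IR above.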