julia-users › SIMD long inner expression
3 posts by 3 authors

Chris Rackauckas  Feb 4

Hi, I am trying to optimize some code which just adds up a bunch of things. My first instinct was to unravel the loops and run the inner one as SIMD, like so:

dx = 1/400
addprocs(3)
imin = -6
jmin = -2

# Some SIMD
@time res = @sync @parallel (+) for i = imin:dx:0
    tmp = 0
    for j = jmin:dx:0
        ans = 0
        @simd for k = 1:24
            @inbounds ans += coefs[k]*(i^(powz[k]))*(j^(poww[k]))
        end
        tmp += abs(ans) < 1
    end
    tmp
end
res = res*(12/(length(imin:dx:0)*length(jmin:dx:0)))
println(res) # Mathematica gives 2.98

(Coefficient arrays are posted at the bottom for completeness.) On a 4-core i7 this runs in ~4.68 seconds. I was able to beat this by optimizing the power calculations like so:

## Full unravel
@time res = @sync @parallel (+) for i = imin:dx:0
    tmp = 0
    isq2 = i*i;       isq3 = i*isq2; isq4 = isq2*isq2; isq5 = i*isq4
    isq6 = isq4*isq2; isq7 = i*isq6; isq8 = isq4*isq4
    for j = jmin:dx:0
        jsq2 = j*j; jsq4 = jsq2*jsq2; jsq6 = jsq2*jsq4; jsq8 = jsq4*jsq4
        @inbounds tmp += abs(coefs[1]*jsq2  + coefs[2]*jsq4       + coefs[3]*jsq6       + coefs[4]*jsq8 +
                             coefs[5]*i     + coefs[6]*i*jsq2     + coefs[7]*i*jsq4     + coefs[8]*i*jsq6 +
                             coefs[9]*isq2  + coefs[10]*isq2*jsq2 + coefs[11]*isq2*jsq4 + coefs[12]*isq2*jsq6 +
                             coefs[13]*isq3 + coefs[14]*isq3*jsq2 + coefs[15]*isq3*jsq4 +
                             coefs[16]*isq4 + coefs[17]*isq4*jsq2 + coefs[18]*isq4*jsq4 +
                             coefs[19]*isq5 + coefs[20]*isq5*jsq2 +
                             coefs[21]*isq6 + coefs[22]*isq6*jsq2 +
                             coefs[23]*isq7 + coefs[24]*isq8) < 1
    end
    tmp
end
res = res*(12/(length(imin:dx:0)*length(jmin:dx:0)))
println(res) # Mathematica gives 2.98

This is the best version and gets ~1.83 seconds (FYI, it beats a version where I distributed out the isq terms). However, when I put this into a function and call @code_llvm on it, I get:

define %jl_value_t* @julia_simdCheck_3161() {
top:
  %0 = alloca [8 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [8 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 2
  %2 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 3
  store %jl_value_t* inttoptr (i64 12 to %jl_value_t*), %jl_value_t** %.sub, align 8
  %3 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 1
  %4 = load %jl_value_t*** @jl_pgcstack, align 8
  %.c = bitcast %jl_value_t** %4 to %jl_value_t*
  store %jl_value_t* %.c, %jl_value_t** %3, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %1, align 8
  store %jl_value_t* null, %jl_value_t** %2, align 8
  %5 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %5, align 8
  %6 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %7 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 6
  store %jl_value_t* null, %jl_value_t** %7, align 8
  %8 = getelementptr [8 x %jl_value_t*]* %0, i64 0, i64 7
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = call %jl_value_t* @julia_sync_begin_535()
  store %jl_value_t* inttoptr (i64 2274649264 to %jl_value_t*), %jl_value_t** %2, align 8
  %10 = load %jl_value_t* (%jl_value_t*, %jl_value_t**, i32)** inttoptr (i64 2274649264 to %jl_value_t* (%jl_value_t*, %jl_value_t**, i32)**), align 16
  %11 = load %jl_value_t** inttoptr (i64 2212282536 to %jl_value_t**), align 8
  store %jl_value_t* %11, %jl_value_t** %5, align 8
  %12 = load %jl_value_t** inttoptr (i64 2197909240 to %jl_value_t**), align 8
  store %jl_value_t* %12, %jl_value_t** %6, align 8
  %13 = load %jl_value_t** inttoptr (i64 2197909288 to %jl_value_t**), align 8
  store %jl_value_t* %13, %jl_value_t** %7, align 8
  %14 = load %jl_value_t** inttoptr (i64 2212418552 to %jl_value_t**), align 8
  store %jl_value_t* %14, %jl_value_t** %8, align 8
  %15 = call %jl_value_t* %10(%jl_value_t* inttoptr (i64 2274649264 to %jl_value_t*), %jl_value_t** %5, i32 4)
  store %jl_value_t* %15, %jl_value_t** %1, align 8
  call void @julia_sync_end_541()
  %16 = load %jl_value_t** %3, align 8
  %17 = getelementptr inbounds %jl_value_t* %16, i64 0, i32 0
  store %jl_value_t** %17, %jl_value_t*** @jl_pgcstack, align 8
  ret %jl_value_t* %15
}

This shows me that LLVM is not auto-vectorizing the inner part. Is there a way to tell LLVM to SIMD-vectorize the big summation in this more optimized form?

For completeness, here are the coefficient arrays (note: coefs should be treated as variable; in principle those zeros could change, so there's no manual optimization to be done there):

coefs = Float64[1.3333333333333328, 0.16666666666666669, 0.0, 0.0, 2.0, 2.333333333333335,
                0.06250000000000001, 0.0, 2.0, 1.1666666666666659, 0.019097222222222224, 0.0,
                1.0, 3.469446951953614e-18, 0.0, 0.25, 0.010416666666666666, 0.0, 0.0, 0.0,
                0.0, 0.0, 0.0, 0.0]

powz = Float64[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0,
               3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 5.0, 5.0, 6.0, 6.0, 7.0, 8.0]

poww = Float64[2.0, 4.0, 6.0, 8.0, 0.0, 2.0, 4.0, 6.0, 0.0, 2.0, 4.0, 6.0,
               0.0, 2.0, 4.0, 0.0, 2.0, 4.0, 0.0, 2.0, 0.0, 2.0, 0.0, 0.0]
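For readers following the thread: both versions estimate the area of the region in [-6, 0] x [-2, 0] where the 24-term polynomial p(i, j) satisfies |p| < 1, by counting grid hits and scaling by the 12-unit domain area. A minimal serial sketch of that computation (the name area_estimate and the serial structure are illustrative, not code from the thread):

# Serial reference sketch (illustrative, not from the thread): estimate the
# area of {(i,j) in [-6,0]x[-2,0] : |p(i,j)| < 1} on a grid with spacing dx,
# where p is the 24-term polynomial defined by coefs, powz, poww.
function area_estimate(coefs, powz, poww, dx, imin, jmin)
    hits = 0
    for i in imin:dx:0, j in jmin:dx:0
        ans = 0.0
        for k in 1:24
            @inbounds ans += coefs[k] * i^powz[k] * j^poww[k]
        end
        hits += abs(ans) < 1   # count grid points inside the region
    end
    # fraction of hits times the 12-unit area of the sampling domain
    return hits * (12 / (length(imin:dx:0) * length(jmin:dx:0)))
end

area_estimate(coefs, powz, poww, 1/400, -6, -2)   # Mathematica gives 2.98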
alan souza  Feb 4

Hi. I think the number of variables in the inner loop is too high, and the compiler can't vectorize the code because of an insufficient number of available registers (see Wikipedia: Register allocation). Even if the compiler could vectorize this loop, the speedup would be at best two times for SSE code with Float64 (could you use Float32?). I believe it would be easier for the compiler if you split your computation: calculate the arrays of powers, then a pointwise multiplication, and finally a reduction.

Erik Schnetter  Feb 4

Chris

To get good performance, you need to put your code into a function. You seem to be evaluating it directly at the REPL -- this will be slow. See the "Performance Tips" section in the manual.

The LLVM code you see is not your kernel code. Instead, it contains a "call" statement, presumably to a function that contains all the code in the @parallel region. You can try creating a function for your actual kernel, which goes from "tmp = 0" to "tmp" near the bottom, and running @code_llvm on it.

-erik

--
Erik Schnetter
http://www.perimeterinstitute.ca/personal/eschnetter/
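A minimal sketch of the split alan souza suggests above -- power arrays first, then a pointwise multiplication and a reduction -- might look like the following; the function name count_row and the per-row layout are assumptions, not code from the thread:

# Sketch of the suggested split: once the power arrays are filled, the inner
# @simd loop is a plain multiply-accumulate over contiguous vectors.
function count_row(coefs, powz, poww, i, jmin, dx)
    ipow = [i^p for p in powz]              # powers of i, computed once per row
    jpow = similar(poww)
    tmp = 0
    for j in jmin:dx:0
        for k in 1:24
            @inbounds jpow[k] = j^poww[k]   # powers of j for this column
        end
        ans = 0.0
        @simd for k in 1:24                 # pointwise multiply + reduction
            @inbounds ans += coefs[k] * ipow[k] * jpow[k]
        end
        tmp += abs(ans) < 1
    end
    return tmp
end

Each row's count could then feed the same (+) reduction already used with @parallel in the thread.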
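And a sketch of what Erik describes -- pulling the kernel (the code from "tmp = 0" down to "tmp") into its own function, so that @code_llvm shows the actual loop body rather than @parallel's remote-call setup; the name kernel and its argument list are assumptions:

# Sketch of Erik's advice: the extracted per-i kernel, identical to the
# "full unravel" loop body above, now inspectable in isolation.
function kernel(coefs, i, jmin, dx)
    tmp = 0
    isq2 = i*i;       isq3 = i*isq2; isq4 = isq2*isq2; isq5 = i*isq4
    isq6 = isq4*isq2; isq7 = i*isq6; isq8 = isq4*isq4
    for j in jmin:dx:0
        jsq2 = j*j; jsq4 = jsq2*jsq2; jsq6 = jsq2*jsq4; jsq8 = jsq4*jsq4
        @inbounds tmp += abs(coefs[1]*jsq2  + coefs[2]*jsq4       + coefs[3]*jsq6       + coefs[4]*jsq8 +
                             coefs[5]*i     + coefs[6]*i*jsq2     + coefs[7]*i*jsq4     + coefs[8]*i*jsq6 +
                             coefs[9]*isq2  + coefs[10]*isq2*jsq2 + coefs[11]*isq2*jsq4 + coefs[12]*isq2*jsq6 +
                             coefs[13]*isq3 + coefs[14]*isq3*jsq2 + coefs[15]*isq3*jsq4 +
                             coefs[16]*isq4 + coefs[17]*isq4*jsq2 + coefs[18]*isq4*jsq4 +
                             coefs[19]*isq5 + coefs[20]*isq5*jsq2 +
                             coefs[21]*isq6 + coefs[22]*isq6*jsq2 +
                             coefs[23]*isq7 + coefs[24]*isq8) < 1
    end
    return tmp
end

# Called with concrete argument types, this shows the real loop IR, making it
# possible to check whether the big summation actually vectorized.
@code_llvm kernel(coefs, -6.0, -2.0, 1/400)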