julia-users › tanh() speed / multi-threading
14 posts by 4 authors

Carlos Becker (5/18/14):

This is probably related to OpenBLAS, but it seems that tanh() is not multi-threaded, which prevents a considerable speed improvement. For example, MATLAB does multi-thread it and gets around a 3x speed-up over the single-threaded version. For example:

```julia
x = rand(100000, 200);
@time y = tanh(x);
```

yields:

- 0.71 sec in Julia
- 0.76 sec in MATLAB with -singleCompThread
- 0.09 sec in MATLAB (which uses multi-threading by default)

The good news is that Julia (with OpenBLAS) is competitive with the single-threaded MATLAB version, though setting the environment variable OPENBLAS_NUM_THREADS has no effect on the timings, nor do I see higher CPU usage with 'top'. Is there an override for OPENBLAS_NUM_THREADS in Julia? What am I missing?

Carlos Becker (5/18/14):

Forgot to add versioninfo():

```julia
julia> versioninfo()
Julia Version 0.3.0-prerelease+2921
Commit ea70e4d* (2014-05-07 17:56 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU X5690 @ 3.47GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm
```

Carlos Becker (5/18/14):

Now that I think about it, maybe OpenBLAS has nothing to do with this, since @which tanh(y) leads to a call to vectorize_1arg(). If that's the case, wouldn't it be advantageous to have a vectorize_1arg_openmp() function (defined in C/C++) that works for element-wise operations on scalar arrays, multi-threading with OpenMP?

Tobias Knopp (5/18/14):

Hi Carlos,

I am working on something that will allow multithreading of Julia functions (https://github.com/JuliaLang/julia/pull/6741). Implementing vectorize_1arg_openmp is actually a lot less trivial, as the Julia runtime is not thread-safe (yet).

Your example is great. I first got a 10x slowdown because the example revealed a locking issue.
With a little trick I now get a speedup of 1.75 on a 2-core machine. Not too bad, taking into account that memory allocation cannot be parallelized. The tweaked code looks like:

```julia
function tanh_core(x, y, i)
    N = length(x)
    # integer division; a plain N/2 would give a non-integer range
    for l = 1:div(N, 2)
        y[l + i*div(N, 2)] = tanh(x[l + i*div(N, 2)])
    end
end

function ptanh(x; numthreads = 2)
    y = similar(x)
    parapply(tanh_core, (x, y), 0:1, numthreads = numthreads)
    y
end
```

I actually want this to also be fast for:

```julia
function tanh_core(x, y, i)
    y[i] = tanh(x[i])
end

function ptanh(x; numthreads = 2)
    y = similar(x)
    N = length(x)
    parapply(tanh_core, (x, y), 1:N, numthreads = numthreads)
    y
end
```

Carlos Becker (5/18/14):

Hi Tobias,

I saw your pull request and have been following it closely, nice work ;) Though, in the case of element-wise matrix operations like tanh, there is no need for extra allocations, since the buffer should be allocated only once.

From your first code snippet, is Julia smart enough to pre-compute i*N/2? In such cases, creating a kind of array view on the original data would probably be faster, right? (Though I don't know how allocations work here.)

As for vectorize_1arg_openmp, I was thinking of "hard-coding" it for known operations such as the trigonometric ones, which benefit a lot from multi-threading. I know this is a hack, but it is quick to implement and brings an amazing speed-up (8x in the case of the code I posted above).

Carlos Becker (5/18/14):

By the way, does the code you just sent work as-is with your pull request branch?

Tobias Knopp (5/18/14):

Sure, though the function is Base.parapply. I had imported it explicitly.
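For readers following along today: parapply from PR #6741 was experimental and never landed in that form, but later Julia versions (1.x) expose the same chunked-loop pattern through the built-in Threads.@threads macro. A minimal sketch of the ptanh idea in that API (the name ptanh_threads is hypothetical; start Julia with julia -t N to get N threads):

```julia
# Sketch of the ptanh idea using the built-in threading of Julia 1.x
# (Threads.@threads plays the role of the experimental parapply here).
function ptanh_threads(x::AbstractArray)
    y = similar(x)
    Threads.@threads for i in eachindex(x, y)
        @inbounds y[i] = tanh(x[i])   # element-wise; iterations split across threads
    end
    return y
end

x = rand(100_000, 200)
y = ptanh_threads(x)
```

As in Tobias's version, the output buffer is allocated once outside the parallel loop, so the loop body itself is allocation-free.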
In the case of vectorize_1arg it would be great to automatically parallelize comprehensions. If someone could tell me where the actual looping happens, that would be great; I have not found it yet. It seems to be somewhere in the parser.

Carlos Becker (5/18/14):

Sounds great! I just gave it a try, and with 16 threads I get 0.07 sec, which is impressive.

That is when I tried it in isolated code. When put together with other Julia code I have, it segfaults. Have you experienced this as well?

Tobias Knopp (5/18/14):

Well, when I started I got segfaults all the time :-) Could you please send me a minimal code example that segfaults? That would be great, as it is the only way we can get this stable.

Andreas Noack (5/18/14):

The computation of tanh is done in openlibm, not OpenBLAS, and it is not multithreaded. Probably MATLAB uses the vectorized mathematical functions (VML) in Intel's MKL. If you have MKL you can do that yourself. It makes a big difference, as you also saw in MATLAB. With openlibm I get:

```julia
julia> @time y = tanh(x);
elapsed time: 1.229392453 seconds (160000096 bytes allocated)
```

and with VML I get:

```julia
julia> @time (ymkl = similar(x); ccall((:vdtanh_, Base.libblas_name), Void,
           (Ptr{Int}, Ptr{Float64}, Ptr{Float64}), &length(x), x, ymkl))
elapsed time: 0.086282489 seconds (160000112 bytes allocated)
```

It appears that we can get something similar with Tobias' work, which is cool.

Tobias Knopp (5/18/14):

And I am pretty excited that it seems to scale so well on your setup.
I have only 2 cores, so I could not see whether it scales to more cores.

Carlos Becker (5/18/14):

Great to see that Tobias' PR rocks ;) I am still getting a weird segfault, and cannot reproduce it in simpler code. I will keep working on it and post it as soon as I nail it down.

Tobias: any pointers to possible incompatibilities in the current state of the PR? Thanks.

Tobias Knopp (5/19/14):

Are any Julia tasks involved when the segfault happens? Or worker processes? Is there any allocation happening in the threaded code?

There should not be incompatibilities in the PR. But as you noticed, this is experimental work, and I have likely not found all the cases where race conditions can happen. If I can reproduce such cases, I can debug them and put locks around functions that should not be called in parallel.

I should also mention that the GC is disabled while the threads are running. So although allocations are possible, it is not too difficult to claim all system memory, rendering the computer unusable. But in such performance-critical code, allocations should be mitigated anyway. A call to gc() after invoking parapply is probably not a bad idea.

To answer your initial question: it would of course be possible to have C-coded OpenMP versions of special functions like tanh. But this would somewhat violate the Julia principle of keeping the C runtime very small and implementing all "base" stuff in pure Julia. There are exceptions like BLAS, but that is actually not in the Julia core, and there is also interest in having a pure-Julia BLAS in base. A further complication is that clang/LLVM, which is used on OS X, does not currently support OpenMP (but will in the future).
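The allocation advice above generalizes: if the GC cannot run while worker threads are active, the safest pattern is to allocate every buffer once, up front, and fill it in place. A hedged sketch of that pattern (the function name tanh_into! is made up for illustration; this is plain serial Julia, independent of the PR):

```julia
# Allocation-mitigation sketch: allocate the output buffer once, then
# fill it in place, so nothing is allocated inside the hot loop that a
# threaded runtime with the GC disabled would have to survive.
function tanh_into!(y::AbstractArray, x::AbstractArray)
    length(y) == length(x) || throw(DimensionMismatch("y and x must have equal length"))
    for i in eachindex(x)
        @inbounds y[i] = tanh(x[i])   # in-place write, no allocation
    end
    return y
end

x = rand(1000)
y = similar(x)        # single up-front allocation, reusable across calls
tanh_into!(y, x)
```

Reusing y across repeated calls also removes the per-call allocation that showed up in the @time reports earlier in the thread.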
Simon Kornblith (5/19/14):

According to my VML.jl benchmarks, VML tanh is ~5x faster for large arrays even with only a single core. (VML.jl is currently only single-threaded because I haven't figured out how to get multithreading without ccalling into MKL, which could lead to problems if Julia is linked against OpenBLAS.) It would be great to have similar performance in base Julia, but I don't know of any open-source vector math libraries that provide <1 ulp accuracy.
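On the <1 ulp point: one quick way to gauge the accuracy of a tanh implementation is to compare it against a BigFloat reference and express the error in units in the last place. An illustrative sketch in Julia 1.x syntax (this is not VML.jl's actual benchmark, and a rigorous ulp measure needs more care near exponent boundaries):

```julia
# Rough ulp-error estimate for Float64 tanh against a BigFloat reference.
function ulp_error(x::Float64)
    ref = Float64(tanh(big(x)))          # high-precision reference, rounded to Float64
    ref == 0 && return abs(tanh(x)) / eps(0.0)
    abs(tanh(x) - ref) / eps(abs(ref))   # error in units of the last place at ref
end

xs = range(-5.0, 5.0; length = 101)
worst = maximum(ulp_error, xs)           # expect a small value (~1 ulp) for a good libm
```

The same harness works for any candidate implementation by swapping the scalar tanh call for the function under test.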