Julia and Spark

This thread is for further discussion of Julia - Spark integration. It continues the discussion from https: forum !topic julia-users LeCnTmOvUbw (the thread topic there was misleading and we could not change it).

To summarize briefly, here are a few interesting packages:
- https: d9w Spark-jl
- https: jey Spock-jl
- https: benhamner MachineLearning-jl
- packages at https: JuliaParallel

We can discuss approaches and coordinate efforts towards whichever looks promising.

Steven Sagaert 4/15/15
I've been contemplating writing a high-level wrapper for Spark myself, since I'm interested in both Julia and Spark, but I was waiting for Julia 0.4 to be finalized before even starting.

One can do the integration on several levels:
1. Simply wrap the Spark Java API via JavaCall. This is the low-level approach. BTW, I've experimented with JavaCall and found it unstable and also lacking functionality (e.g. there's no way to shut down the JVM or to create a pool of JVMs analogous to DB connections), so that might need some work before trying the Spark integration.
2. Spark 1.3 now has new high-level interfaces: the dataframe API for accessing data in the form of distributed dataframes, and the pipeline API for composing algorithms via a pipeline framework. By wrapping the Spark dataframe with a Julia dataframe you would quickly have a high-level, data-scientist-level interface to Spark. BTW, Spark dataframes are actually also FASTER than the lower-level approaches (like Java/Scala method calls) or the intermediate-level Spark SQL, because Spark itself can do more optimizations (this is similar to how PyData Blaze works). By wrapping the pipeline API one could quickly compose Spark algorithms to create new ones.
3. As an intermediate approach: wrap the Spark SQL API and use SQL to query the system.

Personally I would start with the dataframe and pipeline APIs. Maybe later on, if needed, add the Spark SQL API, and only do the low-level stuff last, if needed.
But before interfacing Spark dataframes with Julia ones, the Julia dataframe should become more powerful: at least && and || should be allowed in indexing, for richer querying like in R dataframes.

wil... 4/15/15
> 1 simply wrap the Spark java API via JavaCall. This is the low level approach. BTW I've experimented with javaCall and found it was unstable & also lacking functionality e.g. there's no way to shutdown the jvm or create a pool of JVM analogous to DB connections so that might need some work before trying the Spark integration.

Using JavaCall is not an option, especially since the JVM became closed-source, see https: aviks JavaCall-jl issues 7. The Python bindings are done through Py4J, which is RPC to the JVM. If you look at SparkR, it is done in the same way. SparkR uses an RPC interface to communicate with a Netty-based Spark JVM backend that translates R calls into JVM calls, keeps the SparkContext on the JVM side, and ships serialized data to and from R. So it is just a matter of writing a Julia RPC to the JVM and wrapping the necessary Spark methods in a Julia-friendly way.

Steven Sagaert 4/16/15
Yes, that's a solid approach. For my personal Julia - Java integrations I also run the JVM in a separate process.

wil... 4/16/15
However, I wonder how hard it would be to implement RDD in Julia? From the RDD paper, it looks straightforward to implement. It is a robust abstraction that can be used in any parallel computation.

Andrei Zh 4/16/15
Julia bindings for Spark would provide much more than just RDD: they would give us access to multiple big data components for streaming, machine learning, SQL capabilities and much more.

wil... 4/17/15
Of course, Spark's data access infrastructure is unbeatable, due to mature JVM-based libraries for accessing various data sources and formats (Avro, Parquet, HDFS). That includes SQL support as well. But look at the Python and R bindings: these are just facades for JVM calls.
MLlib is written in Scala, the Streaming API as well, and then all of this is called from Python or R; all data transformations happen at the JVM level. It would be more efficient to write code in Scala than to use any non-JVM bindings. Think of the overhead of RPC and data serialization over the huge volumes of data that need to be processed, and you'll understand why Dpark exists. BTW, machine learning libraries in the JVM, good luck. It only works because of the large computational resources used, but even that has its limits.

Tanmay K. Mohapatra 4/18/15
There was some attempt made towards a pure Julia RDD in Spark.jl (https: d9w Spark-jl). We also have DistributedArrays (https: JuliaParallel DistributedArrays-jl), Blocks (https: JuliaParallel Blocks-jl) and https: JuliaStats DataFrames-jl. I wonder if it is possible to leverage any of these for a pure Julia RDD. And MachineLearning.jl or something similar could probably be the equivalent of MLlib.

wil... 4/20/15
Unfortunately, Spark.jl is an incorrect RDD implementation. Instead of creating transformations as independent abstract operations with lazy evaluation, the package executes all transformations immediately when they are called. This completely undermines the whole purpose of RDD as a fault-tolerant parallel data structure.

ssarkaray... 10/31/15
Is there any implementation with streams of RDDs for Julia?

Jey Kottalam 10/31/15 Re: julia-users Re: Julia and Spark
Could you please define streams of RDDs?

Sisyphuss 11/1/15
http: citation.cfm?id 2228301

Jey Kottalam 11/1/15
Are you asking about Spark Streaming support?

ssarkaray... 11/1/15
Yes.

Frank 11/14/15
Hi, I would have expected more interest in a Spark & Julia integration.
Is the lack of interest due to (a) missing use cases, or (b) the fact that both Spark and Julia are very new, relatively speaking? What do you think?

Thanks,
Frank

On Wednesday, April 15, 2015 at 11:37:50 AM UTC+2, Tanmay K. Mohapatra wrote:

Christof Stocker 11/14/15
Personally, I think the most progress is made if some person has a huge interest in doing it. I for one have a big interest in using Julia for ML, but I myself am not particularly interested in using Spark from Julia. I just don't feel like it would be useful to me for anything. In the situations where I do use Spark, I don't feel like I would gain anything from using it from Julia.

That of course doesn't mean that it wouldn't be very useful to others, but it does mean that it is unlikely that I will spend any of my time on it in the near future. Maybe other people are in similar situations.

What I can leave you with is this: I think open source is a place in which one person can make all the difference in the world, if he/she sets his/her mind to it. So if someone is interested in doing it, go for it. I don't think it's too far-fetched to assume that once the functionality is available and reasonably mature, people will gravitate towards it.

Andrei Zh 11/14/15
A small number of use cases is an important reason. I see many people interested in Julia & Spark integration, but almost nobody interested enough to invest time into its development.

Another reason is that Julia infrastructure, and especially Julia-Java integration, is not mature enough for integrations of this level. Instability of JNI, inconsistency between Java and Scala, serialization issues in Julia: these are just a few of the difficulties I faced while working on Sparta.jl. Many people do great work to fix such issues, but at the moment Julia is far behind, say, Python.

Finally, it's just a huge amount of work.
I don't mean basic functionality like map and reduce operations over a text file, but the whole variety of supported data formats, DataFrames, subprojects like Spark Streaming and MLlib, etc. And without these features we get back to paragraph 1: nobody is interested enough to invest time when there's already PySpark and SparkR.

All of this makes me think that a similar framework for big data analytics written in pure Julia could bypass many of these issues and generate more interest in the Julia community. I wonder if somebody would want to take part in such a challenge.

Christof Stocker 11/14/15
A pure Julia ML ecosystem is something that is actively being worked towards. Naturally, large problem sizes are one big reason that people are interested in working on this. It just takes time to flesh it out into something that people are used to from languages like Python or R. As I see it, Julia is different and offers unique opportunities for connecting research, education, and application. It is an interesting journey to figure out how to get the most out of what the language has to offer.

Andrei Zh 11/14/15
I wouldn't restrict big data analytics to machine learning: it also includes SQL queries and visualization, real-time data enrichment, ETL, etc. I don't think the Julia ML ecosystem by itself will ever expand to these areas.

Christof Stocker 11/14/15
You are right. My tunnel vision strikes again. That's what happens when one works with a hammer all day; all I see is nails :-)
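
Editorial illustration: the objection raised earlier in the thread, that an RDD should record transformations and evaluate them lazily rather than execute them the moment they are called, can be sketched in a few lines. This is a rough, single-process sketch in Python (not Julia), and `LazyDataset` with its methods is a hypothetical name made up here; it is not Spark's API and not how any of the packages mentioned above are written. It only shows the idea that `map` and `filter` extend a recorded lineage, while an action such as `collect` replays that lineage from the source, which is also what would make recomputation after a failure possible.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical class, for illustration only)."""

    def __init__(self, source, ops=()):
        # source: zero-argument callable that returns a fresh iterable of the input,
        # so the data can be re-read whenever the lineage has to be replayed.
        self._source = source
        # ops: recorded transformations, i.e. the lineage; nothing is executed here.
        self._ops = tuple(ops)

    def map(self, fn):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, not executed.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _compute(self):
        # Replay the lineage over a fresh copy of the source data.
        data = iter(self._source())
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):
        # Action: triggers the actual computation.
        return list(self._compute())

    def reduce(self, fn):
        # Action: triggers the computation and folds the results.
        return reduce(fn, self._compute())


calls = []


def square(x):
    calls.append(x)  # record that the function really ran
    return x * x


numbers = LazyDataset(lambda: range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
ran_before_action = list(calls)   # still empty: only the lineage was built
result = evens_squared.collect()  # the squares are computed only now
```

An eager implementation would have called `square` inside `map` itself, leaving nothing to replay if a partition were lost; a real RDD additionally partitions the data and ships the recorded steps to workers, which this sketch does not attempt.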