Spark is the flavour of the season when it comes to distributed computing frameworks, and I have been caught up in the excitement. I have narrowed down the set of Spark resources to three, which I plan to use over the next 3 months to learn Spark:
- The first is Advanced Analytics with Spark, which is hot off the presses, but has already gotten very good reviews on Amazon. Not surprising given that Sean Owen is one of the authors.
- The second, of which I have already read the first few chapters but which I intend to systematically re-read over the next month, is Learning Spark. I must admit, I am somewhat intrigued by how much Matei Zaharia has managed to achieve at (what I am guessing is) a relatively young age.
- Lastly, there is a brand new edX.org course on Spark that has just started. It is a little disappointing that the course uses Python 2.7 and Spark 1.3, instead of moving to the impending release of Spark 1.4, which also supports SparkR and Python 3.4. I think that the Spark guys are holding off on releasing the latest version of Spark till the Spark Summit later this month.
In any case, Spark is an important new technology for data analysis, and a significant improvement over the disk-only storage model of Hadoop MapReduce. That is not to say that Spark is the only in-memory distributed computing framework that can do ad hoc querying, machine learning, and graph processing on big data; there is also Apache Flink, newly promoted to a top-level Apache project. Still, Spark definitely appears to have a significant head start. I look forward to learning more about Apache Spark.