Apache Spark for Scientific Data at Scale
Modern, robust, general-purpose cluster computing for data science
Presentation at SEA Class and Hands-on Workshop, 2017-06-22
Neal McBurnett, Independent Consultant
Note: I welcome questions, comments, corrections anytime
Example of a lineage of RDDs. Also shows the job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. In this case, stage 1’s output RDD (B) is already in RAM, so the scheduler runs stage 2 and then 3.
If a node dies, the partitions it was responsible for can be recomputed from the lineage and the immutable inputs.
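A toy illustration of recomputing a lost partition from lineage, in plain Python rather than Spark. The `ToyRDD` class and its `lose_partition` method are hypothetical names invented for this sketch and are not part of any Spark API. Each dataset remembers only its parent and the function that derives it, so a dropped partition can be rebuilt from the immutable inputs.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: either immutable source
    partitions, or a parent dataset plus a per-element function."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # tuple of tuples: immutable input partitions
        self.parent = parent  # lineage: the dataset this one derives from
        self.fn = fn          # lineage: how each element is derived
        self.cache = {}       # partition index -> values held "in memory"

    def map(self, fn):
        # Record lineage only; nothing is computed here.
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, index):
        # Rebuild the partition from lineage if it is not in memory.
        if index not in self.cache:
            if self.parent is None:
                self.cache[index] = list(self.source[index])
            else:
                parent_values = self.parent.compute(index)
                self.cache[index] = [self.fn(x) for x in parent_values]
        return self.cache[index]

    def lose_partition(self, index):
        # Simulate a node failure by dropping one in-memory partition.
        self.cache.pop(index, None)

    def collect(self):
        return [x for i in range(self.num_partitions())
                for x in self.compute(i)]


inputs = ((1, 2), (3, 4), (5, 6))   # three immutable input partitions
base = ToyRDD(source=inputs)
squares = base.map(lambda x: x * x)

before = squares.collect()

# The "node" holding partition 1 dies: its copies are gone.
squares.lose_partition(1)
base.lose_partition(1)

# Only partition 1 is recomputed; partitions 0 and 2 are still cached.
after = squares.collect()
```

In this sketch `before` and `after` are identical, because the lost partition depends only on its lineage and the unchanged inputs. Real Spark also has to handle wide dependencies such as shuffles, which this toy ignores.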
Real-time graph of the DAGs in your jobs, stage progress, storage, environment, nodes
Something like compilers optimizing for pipelined CPUs
Spark mashes functional programming up with SQL optimizing technology
Code describes Directed Acyclic Graph
action
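A minimal sketch of the idea that code only describes the graph and that an action is what triggers execution, again as a plain-Python toy rather than Spark itself. `LazyDataset` is a hypothetical class made up for illustration, and it models only a linear chain of steps, not a general DAG. Transformations (`map`, `filter`) return a new plan; the `collect` action runs the plan and pipelines the steps element by element.

```python
class LazyDataset:
    """Hypothetical lazy dataset: transformations extend a plan,
    actions execute it."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps  # tuple of (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a new plan, computes nothing.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new plan, computes nothing.
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def _run(self):
        # Pipeline all recorded steps over each element in turn.
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        # Action: actually runs the plan.
        return list(self._run())


calls = []  # records every element the mapped function actually sees


def traced_double(x):
    calls.append(x)
    return 2 * x


# Building the plan does no work on the data.
plan = LazyDataset(range(5)).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)

# The action triggers execution of the whole chain.
result = plan.collect()
calls_after_action = len(calls)
```

Because nothing runs until the action, the whole plan is visible before execution starts, which is what gives an engine room to reorder or fuse steps before doing any work.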
You can do lots of things with your data in Spark
Not the Intergovernmental Panel on Climate Change....
The other IPCC: Intel Parallel Computing Center at Lawrence Berkeley Lab
IPCC is working to adapt Spark, which came from the world of data center clusters, to the supercomputer world.
Note: Yellowstone: no local disks on nodes
SDSC Comet (via XSEDE) does have local disks, for workloads that benefit
This presentation: https://tinyurl.com/sparksciencescale (Creative Commons Attribution-ShareAlike 4.0 International License)
Applying Apache Hadoop to NASA’s Big Climate Data: Use Cases and Lessons Learned - Glenn Tamkin (NASA/CSC), for MERRA: ApacheCon, 2015-04-13
Data Science and Engineering with Spark | edX - 5 free MOOC courses, 18 weeks, starts May 2016
Paco Nathan, Just Enough Math - O'Reilly Media with some free video on functional programming and monoids for cluster computation
CU Research Computing Intro to Spark, by Dan Milroy with IPython notebook examples.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - USENIX NSDI 2012 best paper, Matei Zaharia et al. Video of Talk