Example of an RDD lineage graph, also showing the job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. In this case, stage 1’s output RDD (B) is already in RAM, so the scheduler runs stage 2 and then stage 3.
If a node dies, the partitions it was responsible for can be recomputed on another node, because the lineage is known and the inputs are immutable.
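The recovery idea can be sketched in a few lines of plain Python. This is a toy model, not Spark's API: the `ToyRDD` class, its `compute` and `lose_partition` methods, and the sample data are all hypothetical names invented for illustration, and only one-to-one (narrow) dependencies are modeled.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned, immutable, and aware of its parent."""

    def __init__(self, num_partitions, parent=None, fn=None, source=None):
        self.num_partitions = num_partitions
        self.parent = parent    # lineage: the dataset this one was derived from
        self.fn = fn            # function applied to the parent's partition
        self.source = source    # immutable input partitions (root datasets only)
        self.cache = {}         # partition index -> data currently "in memory"

    def map(self, f):
        # A transformation only records lineage; nothing is computed here.
        return ToyRDD(self.num_partitions, parent=self,
                      fn=lambda part: [f(x) for x in part])

    def compute(self, i):
        # Return partition i, rebuilding it from the lineage if it is missing.
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            data = list(self.source[i])
        else:
            data = self.fn(self.parent.compute(i))
        self.cache[i] = data
        return data

    def lose_partition(self, i):
        # Simulate the node holding partition i dying.
        self.cache.pop(i, None)


# Immutable input: two partitions stored as tuples.
source = ((1, 2, 3), (4, 5, 6))
base = ToyRDD(2, source=source)
squared = base.map(lambda x: x * x)

before = [squared.compute(i) for i in range(2)]

# The "node" holding partition 1 of both datasets dies.
squared.lose_partition(1)
base.lose_partition(1)

# Only the lost partition is rebuilt, by replaying the lineage on the input.
recovered = squared.compute(1)
```

Partition 0 is never touched during recovery, which is the point: only the lost piece is replayed, and no replica of the derived data is needed.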
The scheduler works something like a compiler optimizing for a pipelined CPU
Like a compiler, planning the whole job as a DAG helps avoid failures and deadlock
Spark combines functional programming with SQL query-optimization technology
Your code describes a Directed Acyclic Graph (DAG) of operations
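A minimal sketch of what "code describes a DAG" means, in plain Python rather than Spark itself. The `Node` class and its `map`, `filter`, `union`, `collect`, and `describe` methods are hypothetical names that loosely mimic the shape of the RDD interface; the real scheduler does far more, such as splitting the graph into stages. Transformations only build graph nodes, and nothing runs until an action (`collect`) is called.

```python
class Node:
    """One vertex in a toy execution plan."""

    def __init__(self, op, parents=(), fn=None, data=None):
        self.op = op                   # operation name, for display only
        self.parents = tuple(parents)  # edges to upstream nodes
        self.fn = fn                   # combines the parents' outputs
        self.data = data               # input data (source nodes only)

    # Transformations: return a new node, compute nothing.
    def map(self, f):
        return Node("map", [self], fn=lambda ins: [f(x) for x in ins[0]])

    def filter(self, pred):
        return Node("filter", [self], fn=lambda ins: [x for x in ins[0] if pred(x)])

    def union(self, other):
        return Node("union", [self, other], fn=lambda ins: ins[0] + ins[1])

    # Action: walk the graph and actually run it.
    def collect(self):
        if self.op == "source":
            return list(self.data)
        return self.fn([p.collect() for p in self.parents])

    def describe(self, depth=0):
        # Indented listing of the plan, one line per node.
        lines = ["  " * depth + self.op]
        for p in self.parents:
            lines.extend(p.describe(depth + 1))
        return lines


calls = []

def traced_double(x):
    calls.append(x)  # record when the function really runs
    return 2 * x

numbers = Node("source", data=[1, 2, 3, 4])
evens = numbers.filter(lambda x: x % 2 == 0)
doubled = numbers.map(traced_double)
plan = evens.union(doubled)          # two branches share one source: a DAG

calls_before_action = len(calls)     # still 0: only the plan exists so far
result = plan.collect()              # the action triggers execution
```

Because the whole plan exists before anything runs, a scheduler is free to inspect and reorder it first, which is the compiler-like behavior these notes point to.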
You can do lots of things with your data in Spark
Not the Intergovernmental Panel on Climate Change....
The other IPCC: Intel Parallel Computing Center at Lawrence Berkeley Lab
IPCC is working to adapt Spark, which came from the world of data center clusters, to the supercomputer world.
Note: Yellowstone nodes have no local disks
SDSC Comet (via XSEDE) does have local disks, for workloads that benefit
"Artificial Intelligence, deep learning, machine learning — whatever you’re doing if you don’t understand it — learn it. Because otherwise you’re going to be a dinosaur within 3 years." — Mark Cuban, 2017
Get on board with machine learning
SEA Class and Hands-on Workshop: Spark and TensorFlow (NCAR) - related presentations and code, more on TensorFlow
Data Science and Engineering with Spark | edX - 5 free MOOC courses, 18 weeks, starts May 2016
Paco Nathan, Just Enough Math - O'Reilly Media, with some free video on functional programming and monoids for cluster computation
CU Research Computing Intro to Spark, by Dan Milroy with IPython notebook examples.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - USENIX NSDI 2012 best paper, Matei Zaharia et al. Video of talk