Apache Spark for Scientific Data at Scale

Modern, robust, general-purpose cluster computing for data science

UCAR Software Engineering Assembly Conference, 2016-04-04

Neal McBurnett Independent Consultant


  • Apache Spark is a productive tool for many scientific applications
    • Easier to write Spark code than MPI, OpenMP
    • Programming model is more expressive and helps compilers optimize for the cluster
  • Builds on convergence of data analysis technologies:
    • Databases: SQL, much research into query optimization
    • HPC: MPI, OpenMP, etc
    • Interactive exploratory data analysis: S, S++, R
    • MapReduce: Hadoop, Spark, with huge investment from business world
  • Related APIs and projects support many other applications
  • Many Scientific use cases

Note: I welcome questions anytime

Apache Spark: popular, modern, elegant, productive

  • The largest open source project in data processing
    • Over 750 contributors from 200+ organizations, still growing
    • Big turnout for our free online edX MOOC last year
    • IBM helping to "educate one million data scientists and data engineers on Spark"
  • APIs for Python (2 or 3), SQL, Java, Scala, R
  • Evolved from Hadoop MapReduce
  • Brought in-memory processing for fast iteration, then much more
  • Handles distributing data to nodes for you, and flexible caching
  • Can read only the data needed from smart upstream data sources: "predicate pushdown" of filters

Core concept: RDD (Resilient Distributed Dataset)

  • RDD abstraction (Resilient Distributed Datasets), core aspect of Spark programming model
    • Matei Zaharia ACM Doctoral Dissertation Award
    • New abstraction that led to Spark
  • RDD is a collection of data, partitioned across the cluster
    • But definition includes lineage - how each RDD depends on others
    • In general, a Directed Acyclic Graph (DAG)
    • Immutable
  • Two types of RDD operations
    • Actions return values
      • count, reduce, lookup, save
    • Transformations: define new RDDs based on extisting ones
      • filter, map, groupByKey, join, sort
  • Provides fault tolerance via recalculation, and avoids data replication
  • RDDs can be cached to make them more persistent in memory, or temporarily saved to local disk:

Example of a lineage of RDDs. Also shows the job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. In this case, stage 1’s output RDD is already in RAM, so the scheduler runs stage 2 and then 3.

If a node dies, the partitions it was responsible for can be recalculated, knowing the lineage and the immutable inputs

Spark code compiled into efficient execution plan for the cluster

  • Something like compilers optimizing for pipelined CPUs

  • Spark mashes functional programming up with SQL optimizing technology

    • Catalyst compiler for parallel computations on your cluster
  • Code describes Directed Acyclic Graph

  • Remember Lazy evalution - not evaluated until you perform an action
  • Compiler can use full DAG, desired action, and even data contents, to make an optimized execution plan
  • Lack of side effects in functional programming enormously useful in parallel processing
  • Abstract algebra!
  • Lazy evaluation makes exploration, debugging quicker

Even higher-level API: Datasets and DataFrames

  • DataFrame API: Higher level than RDD, inspired by Pandas and R data frames, combined with SQL engine
  • More optimizations possible than with RDD
  • Unified with more general Datasets API in Spark 2.0: DataFrame = Dataset[Row]
    • Note confusing name for an API....
  • Internally leverage the Spark SQL logical optimizer. Intelligently plan the physical execution of operations to work well on large datasets. This planning permeates all the way into physical storage, where optimizations such as predicate pushdown are applied based on analysis of user programs.
  • Create Datasets directly from Hive tables, Parquet files, Spark SQL, etc.

Even More Higher-level APIs

You can do lots of things with your data in Spark

  • ML - elegant API for machine learning
  • MLlib - original Spark Machine Learning API, with more specialized algorithms
  • Spark SQL make Spark accessible to many more analysts
  • GraphX to bring graph processing to big data sets
  • Streaming, unified with batch processing

Scientific use cases for Spark

  • Exploring data interactively with tools like Jupyter, IPython, Scala, Zeppelin, Databricks
    • The same code used for research and exploration can be evolved into an efficient batch production system
    • See e.g. SciSpark from NASA: Automated Search for Mesoscale Convective Complexes
  • Machine learning at scale
  • Combining streaming and batch processing for real-time data products
  • Pre-process datasets: Extract, Transform, Load (ETL) with modern tools
    • NASA use of MapReduce in Hadoop for MERRA/AS (Modern Era Retrospective-Analysis for Research and Applications Analytic Services). Augments netCDF files with indexes in HDFS. Could integrate with Spark.
  • Post-process data
  • Rethinking paper with case studies in astronomy (Spark-mAdd) and genomics (ADAM): faster than HPC state-of-the-art for some tasks, organizing around data formats like Parquet, designed for parallelization and compression.


  • Most of the runtime and libraries are Scala or Java, cleverly integrated with Python
  • Lots of Java running behind the scenes
  • If partition in an RDD is larger than a node's memory, it spills to disk: slow
    • So factor into application design, or have SSDs on your nodes, or size your cluster to match your data

"HPC systems are the mirrored image of data centers.

In a data center, the file system is optimized for latency (with local disks) and the network is optimized for bandwidth.

In a supercomputer, the file system is optimized for bandwidth (without local disk) while the network is optimized for latency." - IPCC

The other IPCC: Intel Parallel Computing Center at

IPCC is working to adapt Spark, which came from the world of data center clusters, to the supercomputer world.

Note: Yellowstone: no local disks on nodes

SDSC Comet (via XSEDE) does have local disks, for workloads that benefit

  • Apache Parquet: columnar storage format with great parallelization, compression, flexibility
  • Ceph: highly parallel object storage ala Amazon S3, along with other capabilities.
  • Apache Arrow: columnar in-memory data layer between Spark, Pandas, Parquet, Drill, etc: reduce serialization overhead
  • Apache Beam: Batch and strEAMing dataflow model and API from Google.
  • Tensorflow: open source library for numerical computation using data flow graphs
  • Apache Giraph: iterative graph processing, with similarities to MPI


  • Spark has many applications for scientific computing
  • It helps you be more productive at writing code and analyzing data
  • Spark optimizers make your code faster
  • Popular due to speed, flexibility, blend of interactive investigative analytics, batch data analytics, streaming updates
  • Lots of training opportunities and experience out there due to popularity
  • Runs in lots of environments, easier to develop on laptop, test on small cluster, redeploy on big one or in the cloud

Thanks to:

  • Insights and motivation from the Boulder Data Detectives LinkedIn Group (Wednesdays at BJ's)
  • Deep insights from Sameer Farooqui, Brian Clapper, and the rest of the team at Databricks Training
  • Zebula Sampedro, Dan Milroy and Davide Del Vento for spark-setup-scripts, for running Spark on Janus and Yellowstone </p>

Explore more.... Questions?     Neal McBurnett     http:/neal.mcburnett.org