SEA Class and Hands-on Workshop: Spark and TensorFlow

Neal McBurnett, Independent Consultant in Data Science and Election Integrity

Thursday June 22 2017

Introductions
Apache Spark for Scientific Data at Scale - Neal McBurnett
Break
Setting up Spark in your CISL environment: download and untar Anderson's tar.gz on Cheyenne or Yellowstone
Lunch
Explore tutorials and notebooks - Neal McBurnett
El Niño 3.4 Index Calculation (from the tar.gz above) - Anderson Banihirwe

Friday June 23 2017

Intro to Tensorflow: Liping Yang (blog). To access the presentation, enter this password: "liping.yang_NCAR" at this link: Intro to Tensorflow
Lunch
Spatial Representations of Weather Data with Generative Adversarial Networks (GANs) - David John Gagne
Python 3 - Neal McBurnett
Discussion of good use cases for Spark and TensorFlow
Exploration of tutorials and use cases with Spark and TensorFlow
Optional: Unconference: submit ideas for small group discussions, and break up into groups to discuss them

Shared folder for SEA Class and Hands-on Workshop: Spark and TensorFlow

Exploring further on your own

To continue exploring Spark on your own, you have many options. You can use the Cheyenne or Yellowstone environments you set up during the class. You can also run Spark on your own laptop or desktop computer, either by downloading Apache Spark or by installing PySpark from Anaconda Cloud, both of which are pretty easy. And of course starting up Jupyter notebooks is a lot easier in a local environment.
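As a quick way to confirm that a local install works, here is a minimal sketch (not part of the class materials). It assumes PySpark and a Java runtime are already installed on your machine. The function names and the app name "local-smoke-test" are made up for illustration.

```python
def expected_sum_of_squares(n):
    """Closed-form reference value for 1^2 + 2^2 + ... + n^2."""
    return n * (n + 1) * (2 * n + 1) // 6


def spark_sum_of_squares(n):
    """Compute the same sum with Spark running in local mode.

    Assumes PySpark and Java are installed. The import lives inside the
    function so this file can be loaded before PySpark is set up.
    """
    from pyspark.sql import SparkSession

    # "local[*]" runs Spark inside this process, using all available cores.
    # The app name is a hypothetical label chosen for this sketch.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("local-smoke-test")
        .getOrCreate()
    )
    try:
        # Distribute the numbers 1..n, square each one, and add them up.
        numbers = spark.sparkContext.parallelize(range(1, n + 1))
        return numbers.map(lambda x: x * x).sum()
    finally:
        # Always shut the local Spark session down, even on errors.
        spark.stop()
```

From a Python prompt or a Jupyter notebook, `spark_sum_of_squares(100) == expected_sum_of_squares(100)` should evaluate to `True` once the install is working. If the import fails, PySpark is not yet visible to the Python interpreter you are using.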

The 2015 edX MOOC courses that I talked about are no longer offered via edX, but the videos and notebooks from the 2015 "Introduction to Big Data" and "Scalable Machine Learning" courses are still available. They are of course dated, but high quality, and include a fun but challenging principal component analysis (PCA) of neural activity in a larval zebrafish brain.

Another option, which also gets you access to more good-quality training materials, is Databricks Community Edition. The Community Edition provides a convenient online web service to start up a free, but very small, cluster (6 GB). It has a nice intro to Spark, and an "Apache Spark on Databricks for Data Scientists" notebook that covers some machine learning basics. See also their "Analyzing 1000 Genomes with Spark and Hail" notebook, for more science on Spark. You can also run the MOOC class notebooks for free there.

You can import notebooks into Databricks, like the "Introducing Deep Learning Pipelines for Apache Spark" example combining Spark and TensorFlow that I briefly showed you.

If you want to try bigger clusters, you can get a free 14-day trial of the full Databricks environment, which you hook up to an Amazon Web Services (AWS) account.

Thanks again for coming to learn about Spark and TensorFlow. Happy computing! --Neal McBurnett