20151015 zagreb spark_notebooks

© 2015 IBM Corporation
Spark and Notebooks

• Big Data Developers and Apache Spark meetups
• I also participate in a number of Moscow and Ljubljana meetups
Hello Zagreb

• Goal – to get you started on Spark & Notebooks
• Overview of the data science workflow
• General overview of notebooks
• Recap what Spark is
• Comparing existing technologies
• Languages & libraries
• Demo
Goal & Agenda

Skillset of the Data Scientist
Statistician
Software Engineer
Business Analyst
Process Automation
Parallel Computing
Software Development
Database Systems
Mathematics Background
Analytic Mindset
Domain Expertise
Business Focus
Effective Communication

Iterative Cycle of Data Science
Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding →
Data Preparation → Modelling → Evaluation → Deployment → Feedback

• A data scientist needs an interactive environment to work in
• It has to be responsive
• It has to support:
– literate programming
– reproducibility and easy publishing
– code together with its description
Why we need a notebook

• In our context: an interactive web environment
• You enter your code in cells
• Or Markdown text
• Outputs are displayed on the page
• Outputs are generally saved with the notebook
What is a notebook (cont.)

• Notebook server
• For large amounts of data: a parallel processing engine
• Spark in our case (no alternatives?)
• Libraries (depending on the programming language)
– Machine learning
– Data munging
– Visualisation / plotting
What do you need to run a notebook

An Apache Foundation open source project.
An in-memory compute engine that works with data.
Enables highly iterative analysis on large volumes of data at scale
Unified environment for data scientists, developers and data engineers
Radically simplifies the process of developing intelligent apps fueled by data.
Spark in simple words

If you don’t know Spark yet,
here is how you learn
https://github.com/spark-mooc/mooc-setup

What does IBM have to do with Spark?

Resilient distributed datasets (RDDs)
– Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost
– Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
– Can be cached across parallel operations
Parallel operations on RDDs
– Reduce, collect, count, save, …
Spark Programming Model
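
The ideas on this slide can be tried out without a cluster. Below is a toy, single-process sketch in plain Python. It is not the Spark API: the class `ToyRDD`, its methods, and the sample log lines are all hypothetical and only mimic lazy transformations, caching, and rebuilding a lost dataset from its lineage.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, compute, parent=None, op="source"):
        self._compute = compute        # zero-argument function producing the records
        self._parent = parent          # lineage: the dataset this one was derived from
        self._op = op                  # name of the operation that created it
        self._cache_enabled = False
        self._cached = None

    # Transformations: build a new dataset, compute nothing yet.
    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._records()], self, "map")

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._records() if pred(x)], self, "filter")

    def cache(self):
        self._cache_enabled = True
        return self

    def drop_cache(self):
        """Simulate losing the cached data (e.g. a failed worker)."""
        self._cached = None

    # Actions: these trigger the actual computation.
    def collect(self):
        return list(self._records())

    def count(self):
        return len(self._records())

    def lineage(self):
        node, ops = self, []
        while node is not None:
            ops.append(node._op)
            node = node._parent
        return list(reversed(ops))

    def _records(self):
        if not self._cache_enabled:
            return self._compute()
        if self._cached is None:       # first use, or cache was lost: rebuild from lineage
            self._cached = self._compute()
        return self._cached


# Made-up data standing in for a file in stable storage.
LOG = ["ERROR foo failed", "INFO started", "ERROR bar timed out", "ERROR foo again"]
reads = {"n": 0}


def read_log():
    reads["n"] += 1                    # count how often "storage" is read
    return list(LOG)


lines = ToyRDD(read_log)
errors = lines.filter(lambda s: s.startswith("ERROR")).cache()
foo_count = errors.filter(lambda s: "foo" in s).count()
bar_count = errors.filter(lambda s: "bar" in s).count()
reads_before_loss = reads["n"]         # both counts shared one read of the source

errors.drop_cache()                    # lose the cached data ...
rebuilt_count = errors.count()         # ... and recompute it from the lineage
```

Nothing is read until the first `count()`, the second query reuses the cached errors, and dropping the cache forces one more read of the source.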

Iterative & Pipeline Analysis using Spark
[Figure: two iterations compared. MapReduce: every iteration reads its input from disk and writes its output back to disk. SystemML & Spark: a single disk read, then data stays in memory between iterations.]

Spark Programming Model - Example
val lines = spark.textFile("hdfs://...")             // base RDD; spark is the SparkContext (sc in spark-shell)
val messages = lines.filter(_.startsWith("ERROR"))  // transformed RDD
val cachedMsgs = messages.cache()                    // cached RDD
cachedMsgs.filter(_.contains("foo")).count()         // parallel operation (action)
cachedMsgs.filter(_.contains("bar")).count()         // second query reuses the cached RDD
[Figure: the driver sends tasks to three workers and collects results; each worker holds one block of the file (Block 1–3) and its cache (Cache 1–3).]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

• Zeppelin
• Jupyter
• IPython
• spark-notebook
• scala-notebook
Notebook servers

• Grew out of IPython
• Julia, Python, R
• Now supports many more languages (40)
• https://try.jupyter.org/
• Markdown support
• MathJax support
Jupyter project

• The simplest way is to use the Anaconda Python distribution
• https://www.continuum.io/downloads
• Otherwise, read the installation docs
• Start pyspark with IPython
• PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'" ./bin/pyspark
• Open the browser
Jupyter – installation with Spark
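
The launch command above can also be scripted. The following is a minimal sketch that sets the same two environment variables from Python. The variable names and option string are copied from the slide; the function names and the `spark_home` parameter are assumptions made for illustration, and whether `ipython notebook` is still accepted depends on the installed IPython/Jupyter version.

```python
import os
import subprocess


def notebook_env(port=12000, ip="0.0.0.0", base_env=None):
    """Return a copy of the environment with the PySpark driver variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = "ipython"
    # Same option string as on the slide, with port and ip filled in.
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = (
        "notebook --no-browser --port {} --ip='{}'".format(port, ip)
    )
    return env


def launch_pyspark_notebook(spark_home, port=12000, ip="0.0.0.0"):
    """Run bin/pyspark under spark_home (hypothetical path); blocks until it exits."""
    pyspark = os.path.join(spark_home, "bin", "pyspark")
    return subprocess.call([pyspark], env=notebook_env(port, ip))


# Inspect the variables without starting anything.
env = notebook_env(base_env={})
```

Calling `launch_pyspark_notebook` with the directory of a Spark installation would start the notebook server; building the environment separately makes it easy to check the settings first.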

• Not as easy
• Install a Scala kernel
• https://github.com/alexarchambault/jupyter-scala
• I use cloud services for Scala (see later)
Jupyter – installing with Scala

• Use keyboard shortcuts
• Use Markdown and the Markdown help
• MathJax for formulas
Jupyter usage - basics

• Richest set of features
• Matplotlib and seaborn libraries for data visualisation
• scikit-learn, NumPy, pandas
Languages - Python

• Create subplots or just plot
• Plot series
• Seaborn simplifies many tasks
Matplotlib / seaborn basics

• Fast schema creation:
– Create a pandas DataFrame from a small subset of the data
– Convert it to a Spark DataFrame
– Extract the schema
• Preview a Spark DataFrame: sparkDF.limit(10).toPandas()
Pandas / Spark tips
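
A minimal sketch of the two tips above. The helper names `schema_from_sample` and `preview` are hypothetical; `createDataFrame`, `.schema`, `.limit()` and `.toPandas()` are PySpark calls. The `spark` argument is assumed to be a SparkSession (a SQLContext on Spark 1.x exposes `createDataFrame` as well). Schema inference from a small sample can guess wrong, for example when the sample contains missing values, so the result is worth checking.

```python
def schema_from_sample(spark, sample_pdf):
    """Let Spark infer a schema from a small pandas DataFrame.

    Only the sample is converted, so this is cheap even when the
    full dataset is large.
    """
    return spark.createDataFrame(sample_pdf).schema


def preview(spark_df, n=10):
    """Bring only the first n rows to the driver as a pandas DataFrame."""
    return spark_df.limit(n).toPandas()


# Typical use in a notebook (needs pandas and pyspark; the file path is
# hypothetical, and spark.read.csv assumes Spark 2.0 or later):
#
#   import pandas as pd
#   sample_pdf = pd.read_csv("data/events.csv", nrows=1000)
#   schema = schema_from_sample(spark, sample_pdf)
#   full_df = spark.read.csv("data/events.csv", header=True, schema=schema)
#   preview(full_df)
```

Passing an explicit schema to the reader avoids a separate inference pass over the full file, and `preview` keeps the amount of data pulled into the notebook small.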

• Works better with Zeppelin
• Fewer libraries for plotting
Languages - Scala

• Widely popular statistical language
• SparkR
• ggplot2
• I tried it with Data Scientist Workbench
Languages - R

• A number of sandboxes are available
• I recommend using Vagrant
• https://github.com/vykhand/spark-vagrant
• Spark edX MOOC
Running locally

• Register for Bluemix
• Create a Spark as a Service boilerplate
• Upload files to object storage
Running Jupyter in the cloud – Spark as a Service

• Rapidly developing product
• Notebooks
• Data wrangling
• RStudio
• Check it out – available for preview
Running Jupyter in the cloud – Data Scientist Workbench

Demo

• Very promising development
• Very easy, interactive visualization
• Not very mature (still incubating)
• My tool of choice is still Jupyter
Zeppelin

• The fastest way is this Vagrant box
• http://arjon.es/2015/08/23/vagrant-spark-zeppelin-a-toolbox-to-the-data-analyst/
• https://github.com/arjones/vagrant-spark-zeppelin
• Install Vagrant
• Install VirtualBox
• git clone https://github.com/arjones/vagrant-spark-zeppelin
• vagrant up
Zeppelin – getting started

• Very pretty
• Wide choice of interpreters
• Many interpreters per page
• Configure dependencies and execution parameters via the GUI
Things I like

• Fragile
• Sometimes counter-intuitive
• No obvious way to control notebook execution
Things I don’t like

demo


More from Andrey Vykhodtsev
