© 2015 IBM Corporation
Spark and Notebooks
IBM Spark © 2015 IBM Corporation
Hello Zagreb
• Big Data Developers and Apache Spark meetups
• I also participate in a number of Moscow and Ljubljana meetups
Goal & Agenda
• Goal: get you started with Spark and notebooks
• Overview of the data science workflow
• General overview of notebooks
• Recap of what Spark is
• Comparing existing technologies
• Languages & libraries
• Demo
Skillset of the Data Scientist
• Roles: Statistician, Software Engineer, Business Analyst
• Skills: Process Automation, Parallel Computing, Software Development, Database Systems, Mathematics Background, Analytic Mindset, Domain Expertise, Business Focus, Effective Communication
Iterative Cycle of Data Science
Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment → Feedback
Why we need a notebook
• A data scientist needs an interactive environment to work in
• It has to be responsive
• It has to support
– Literate programming
– Reproducibility and easy publishing
– Code together with its description
What is a notebook (cont.)
• In our context: an interactive web environment
• You enter your code in cells
• Or Markdown text
• Outputs are displayed on the page
• Outputs are generally saved with the notebook
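To make "cells" and "outputs saved with the notebook" concrete, here is a minimal sketch of a notebook document built by hand. It assumes the JSON layout of the Jupyter notebook format (version 4); the cell contents are made up for illustration and are not from the talk.

```python
import json

# A hand-written notebook document, assuming the Jupyter nbformat v4 layout.
# The cell contents below are illustrative only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            # A Markdown cell: description that lives next to the code
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Example\n", "A short description of the analysis."],
        },
        {
            # A code cell: the source plus the outputs captured when it ran
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

# A notebook file on disk is this structure serialized as JSON.
serialized = json.dumps(notebook, indent=1)
restored = json.loads(serialized)

cell_types = [cell["cell_type"] for cell in restored["cells"]]
saved_output = "".join(restored["cells"][1]["outputs"][0]["text"])
print(cell_types, repr(saved_output))
```

Because the output is stored inside the file, a reader can see the result without re-running the code, which is what makes notebooks easy to publish.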
What do you need to run a notebook
• Notebook server
• For large amounts of data: a parallel processing engine
– Spark in our case (no alternatives?)
• Libraries (depending on the programming language)
– Machine learning
– Data munging
– Visualisation / plotting
Spark in simple words
An Apache Software Foundation open source project.
An in-memory compute engine for data processing.
Enables highly iterative analysis on large volumes of data.
A unified environment for data scientists, developers and data engineers.
Radically simplifies the process of developing intelligent apps fueled by data.
If you don’t know Spark yet, here is how to learn it:
https://github.com/spark-mooc/mooc-setup
What does IBM have to do with Spark?
Spark Programming Model
Resilient distributed datasets (RDDs)
• Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost
• Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
• Can be cached across parallel operations
Parallel operations on RDDs
• Reduce, collect, count, save, …
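The ideas above (lazy transformations, caching, and rebuilding lost data from its lineage) can be illustrated with a toy, single-process model in plain Python. This is only a sketch for intuition: `ToyRDD` and all of its methods are hypothetical names, there is no partitioning or cluster, and it is not how Spark is implemented.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers how to compute its data (lineage)."""

    def __init__(self, compute):
        self._compute = compute   # function that (re)builds the data
        self._use_cache = False
        self._cache = None
        self.evaluations = 0      # how many times the lineage was replayed

    # Transformations: return a new ToyRDD and do no work yet (lazy).
    def map(self, func):
        return ToyRDD(lambda: (func(x) for x in self._iterate()))

    def filter(self, predicate):
        return ToyRDD(lambda: (x for x in self._iterate() if predicate(x)))

    def cache(self):
        self._use_cache = True
        return self

    def drop_cache(self):
        """Simulate losing the cached data; it can be rebuilt from lineage."""
        self._cache = None

    def _iterate(self):
        if self._use_cache:
            if self._cache is None:
                self.evaluations += 1
                self._cache = list(self._compute())
            return iter(self._cache)
        self.evaluations += 1
        return iter(self._compute())

    # Actions: these actually trigger the computation.
    def count(self):
        return sum(1 for _ in self._iterate())

    def collect(self):
        return list(self._iterate())


# Made-up log lines standing in for a file in stable storage.
log = ["ERROR foo failed", "INFO ok", "ERROR bar failed", "ERROR foo again"]

lines = ToyRDD(lambda: log)
errors = lines.filter(lambda s: s.startswith("ERROR")).cache()

foo_count = errors.filter(lambda s: "foo" in s).count()
bar_count = errors.filter(lambda s: "bar" in s).count()
evaluations_with_cache = errors.evaluations   # the filter ran only once

errors.drop_cache()                            # "partition lost"
rebuilt_count = errors.count()                 # recomputed from the source
evaluations_after_loss = errors.evaluations
```

Nothing is computed until `count()` is called, the second query reuses the cached result, and after the cache is dropped the data is rebuilt by replaying the same transformations.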
Iterative & Pipeline Analysis using Spark
[Diagram, two iterations compared: MapReduce reads from disk and writes to disk around every iteration; SystemML & Spark read from disk once and pass data between iterations in memory]
Spark Programming Model - Example
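// `sc` is the SparkContext that spark-shell and most notebook kernels provide
val lines = sc.textFile("hdfs://...")                 // Base RDD
val messages = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
val cachedMsgs = messages.cache()                     // Cached RDD
cachedMsgs.filter(_.contains("foo")).count()          // Parallel operation (action)
cachedMsgs.filter(_.contains("bar")).count()          // Runs against the cached data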
[Diagram: the driver sends tasks to three workers and collects results; each worker reads one block of the data (Block 1 to 3) and keeps it in its own cache (Cache 1 to 3)]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Notebook servers
• Zeppelin
• Jupyter
• IPython
• spark-notebook
• scala-notebook
Jupyter project
• Grew out of IPython
• Julia, Python, R
• Now supports many more languages (40)
• https://try.jupyter.org/
• Markdown support
• MathJax support
Jupyter – installation with Spark
• The simplest way is to use the Anaconda Python distribution
– https://www.continuum.io/downloads
• Otherwise, read the installation docs
• Start pyspark with IPython
– PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'" ./bin/pyspark
• Open a browser
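If you prefer to start the notebook from a script rather than typing the command each time, the same settings can be prepared in Python. This is a minimal sketch: the helper names and the `/opt/spark` install path are assumptions for illustration, and the environment variables and options are the ones from the command above.

```python
import os


def pyspark_notebook_env(port=12000, ip="0.0.0.0", base_env=None):
    """Return a copy of the environment with the notebook driver settings added."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = "ipython"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = (
        f"notebook --no-browser --port {port} --ip='{ip}'"
    )
    return env


def pyspark_command(spark_home):
    """Build the path to the pyspark launcher inside a Spark installation."""
    return [os.path.join(spark_home, "bin", "pyspark")]


env = pyspark_notebook_env(base_env={})
command = pyspark_command("/opt/spark")  # hypothetical install location

# To actually launch (requires a local Spark installation):
#   import subprocess
#   subprocess.call(command, env=pyspark_notebook_env())
print(command, env["PYSPARK_DRIVER_PYTHON_OPTS"])
```

After the server starts, open a browser at the chosen port, as in the last step above.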
Jupyter – installing with Scala
• Not as easy
• Install a Scala kernel
– https://github.com/alexarchambault/jupyter-scala
• I use cloud services for Scala (see later)
Jupyter usage - basics
• Use keyboard shortcuts
• Use Markdown and the Markdown help
• Use MathJax for formulas
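As a small illustration of the MathJax point: in a Markdown cell, LaTeX written between double dollar signs is rendered as a displayed formula. The formula below (a sample mean) is chosen only as an example and is not from the talk.

```latex
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
```

Single dollar signs, as in `$\bar{x}$`, render the formula inline within a sentence.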
Languages - Python
• Richest set of features
• Matplotlib and seaborn libraries for data visualisation
• scikit-learn, NumPy, pandas
Matplotlib / seaborn basics
• Create subplots or just plot
• Plot series
• Seaborn simplifies many tasks
Pandas / Spark tips
• Fast schema creation
– Create a pandas frame from a small subset of the data
– Convert it to a Spark DataFrame
– Extract the schema
• Preview Spark data as a pandas frame: sparkDF.limit(10).toPandas()
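The schema tip can be sketched as three small helpers. This is an illustration under assumptions: the function names are hypothetical, `spark` is taken to be a SparkSession (the entry point in later Spark releases; in Spark 1.x the equivalent calls lived on `sqlContext`), and the input is assumed to be a CSV file with a header row. Type inference from a small sample can guess wrong, for example when a column is empty in the sample, so check the schema before relying on it.

```python
def infer_spark_schema(spark, csv_path, sample_rows=1000):
    """Infer a Spark schema from the first rows of a CSV file via pandas."""
    import pandas as pd  # imported here so pandas is only needed when called

    sample = pd.read_csv(csv_path, nrows=sample_rows)   # small subset only
    return spark.createDataFrame(sample).schema          # pandas -> Spark -> schema


def load_with_schema(spark, csv_path, schema):
    """Read the full file with the pre-computed schema (skips inference)."""
    return spark.read.csv(csv_path, schema=schema, header=True)


def preview(spark_df, rows=10):
    """Bring a few rows back to the driver as a pandas frame for display."""
    return spark_df.limit(rows).toPandas()
```

A typical sequence would be `schema = infer_spark_schema(spark, path)`, then `df = load_with_schema(spark, path, schema)`, then `preview(df)` in a notebook cell, where `path` is your own file location.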
Languages - Scala
• Better with Zeppelin
• Fewer libraries for plotting
Languages - R
• Widely popular statistical language
• SparkR
• ggplot2
• I tried it with Data Scientist Workbench
Running locally
• A number of sandboxes are available
• I recommend using Vagrant
– https://github.com/vykhand/spark-vagrant
• Spark edX MOOC
Running Jupyter in the cloud – Spark as a service
• Register for Bluemix
• Create the Spark as a Service boilerplate
• Upload files to object storage
Running Jupyter in the cloud – Data Scientist Workbench
• Rapidly developing product
• Notebooks
• Data wrangling
• RStudio
• Check it out: it is available for preview
Demo
Zeppelin
• Very promising project
• Very easy, interactive visualization
• Not very mature (still incubating)
• My tool of choice is still Jupyter
Zeppelin – getting started
• The fastest way is this Vagrant box
– http://arjon.es/2015/08/23/vagrant-spark-zeppelin-a-toolbox-to-the-data-analyst/
– https://github.com/arjones/vagrant-spark-zeppelin
• Install Vagrant
• Install VirtualBox
• git clone
• vagrant up
Things I like
• Very pretty
• Wide choice of interpreters
• Many interpreters per page
• Dependencies and execution parameters are configured via the GUI
Things I don’t like
• Fragile
• Sometimes counter-intuitive
• No obvious way to control notebook execution
Demo

4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
Building a Quantum Computer Neutral Atom.pdf
Building a Quantum Computer Neutral Atom.pdfBuilding a Quantum Computer Neutral Atom.pdf
Building a Quantum Computer Neutral Atom.pdf
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
 
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 

20151015 zagreb spark_notebooks

  • 1. Spark and Notebooks (© 2015 IBM Corporation)
  • 2. Hello Zagreb: Big Data Developers and Apache Spark meetups; I also participate in a number of meetups in Moscow and Ljubljana.
  • 3. Goal & Agenda: the goal is to get you started with Spark and notebooks. Agenda: overview of the data science workflow; general overview of notebooks; recap of what Spark is; comparison of existing technologies; languages and libraries; demo.
  • 4. Skillset of the Data Scientist (diagram). Roles: Statistician, Software Engineer, Business Analyst. Skills: process automation, parallel computing, software development, database systems, mathematics background, analytic mindset, domain expertise, business focus, effective communication.
  • 5. Iterative Cycle of Data Science (diagram): Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment → Feedback.
  • 6. Why we need a notebook: a data scientist needs an interactive environment to work in; it has to be responsive; it has to support literate programming (code together with its description), reproducibility, and easy publishing.
  • 7. What is a notebook (cont.): in our context, an interactive web environment; you enter code or Markdown text in cells; outputs are displayed on the page; outputs are generally saved with the notebook.
  • 8. What do you need to run a notebook: a notebook server; for large amounts of data, a parallel processing engine (Spark in our case; no alternatives?); libraries (depending on the programming language) for machine learning, data munging, and visualisation/plotting.
  • 9. Spark in simple words: an Apache Software Foundation open source project; an in-memory compute engine that works with data; enables highly iterative analysis on large volumes of data at scale; a unified environment for data scientists, developers and data engineers; radically simplifies the process of developing intelligent apps fueled by data.
  • 10. If you don't know Spark yet, here is how to learn it: https://github.com/spark-mooc/mooc-setup
  • 11. What does IBM have to do with Spark?
  • 12. Spark Programming Model. Resilient distributed datasets (RDDs): immutable collections partitioned across the cluster that can be rebuilt if a partition is lost; created by transforming data in stable storage using data flow operators (map, filter, group-by, …); can be cached across parallel operations. Parallel operations on RDDs: reduce, collect, count, save, …
  • 13. Iterative & Pipeline Analysis using Spark (diagram). MapReduce: Iteration 1 and Iteration 2 are each preceded by a disk read and followed by a disk write, so data passes between iterations through disk. SystemML & Spark: a single disk read before Iteration 1, after which data passes between iterations in memory.
  • 14. Spark Programming Model - Example:
         val lines = spark.textFile("hdfs://...")             // Base RDD
         val messages = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
         val cachedMsgs = messages.cache()                    // Cached RDD
         cachedMsgs.filter(_.contains("foo")).count           // Parallel operation
         cachedMsgs.filter(_.contains("bar")).count
       Diagram: the driver sends tasks to three workers and receives results; each worker holds one block of the input (Block 1 to 3) and the matching cache (Cache 1 to 3). Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).
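The behaviour behind slides 12 to 14 (transformations are lazy, actions trigger work, and a cached dataset is computed only once) can be sketched without a cluster. The following is a toy model in plain Python, not Spark's API: `ToyRDD`, `read_log` and the sample log lines are all hypothetical names invented for this illustration.

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazy transformations, optional caching."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function returning an iterable
        self._use_cache = False    # set by cache()
        self._cached = None        # materialised rows after the first action

    def filter(self, predicate):
        # Transformation: builds a new ToyRDD, nothing is evaluated yet.
        return ToyRDD(lambda: (row for row in self._rows() if predicate(row)))

    def cache(self):
        # Ask for the rows to be kept in memory after the first evaluation.
        self._use_cache = True
        return self

    def count(self):
        # Action: forces evaluation of the whole chain of transformations.
        return sum(1 for _ in self._rows())

    def _rows(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return iter(self._cached)
        return self._compute()


# Hypothetical in-memory "log file" standing in for the HDFS input.
LOG_LINES = [
    "ERROR foo failed",
    "INFO all good",
    "ERROR bar failed",
    "ERROR foo and bar failed",
]
reads = {"count": 0}


def read_log():
    # Counts how often the underlying source is actually read.
    reads["count"] += 1
    return iter(LOG_LINES)


lines = ToyRDD(read_log)                                   # base dataset
messages = lines.filter(lambda s: s.startswith("ERROR"))   # transformed
cached_msgs = messages.cache()                             # cached

foo_count = cached_msgs.filter(lambda s: "foo" in s).count()
bar_count = cached_msgs.filter(lambda s: "bar" in s).count()
print(foo_count, bar_count, reads["count"])  # prints: 2 2 1
```

The source is read once even though two actions run, which is the same effect the slide attributes to `cache()`; without the `cache()` call the toy source would be read once per action.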
  • 15. Notebook servers: Zeppelin; Jupyter; IPython; spark-notebook; scala-notebook.
  • 16. Jupyter project: grew out of IPython; Julia, Python, R; now many more languages (40); https://try.jupyter.org/; Markdown support; MathJax support.
  • 17. Jupyter – installation with Spark: the simplest way is to use the Anaconda Python distribution (https://www.continuum.io/downloads), otherwise read the installation docs; start pyspark with IPython: PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'" ./bin/pyspark; then open the browser.
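The launch line on slide 17 only sets two environment variables before calling `./bin/pyspark`. A minimal sketch of the same setup driven from Python follows; the function name `build_pyspark_notebook_launch` and the `/opt/spark` location are assumptions made for illustration, while the variable names and option string are taken from the slide.

```python
import os


def build_pyspark_notebook_launch(spark_home, port=12000, ip="0.0.0.0"):
    """Return (command, env) reproducing the launch line from the slide."""
    env = dict(os.environ)
    # Tell pyspark to use IPython as the driver front end.
    env["PYSPARK_DRIVER_PYTHON"] = "ipython"
    # Options handed to IPython: start the notebook server without a browser.
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = (
        "notebook --no-browser --port {} --ip='{}'".format(port, ip)
    )
    command = [os.path.join(spark_home, "bin", "pyspark")]
    return command, env


# "/opt/spark" is a placeholder; use the directory where Spark is unpacked.
command, env = build_pyspark_notebook_launch("/opt/spark")
print(command[0])
print(env["PYSPARK_DRIVER_PYTHON_OPTS"])

# To actually start the server (requires a local Spark installation):
#   import subprocess
#   subprocess.call(command, env=env)
```

Building the environment explicitly keeps the settings out of the shell profile, so a plain `./bin/pyspark` elsewhere still starts the ordinary console.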
  • 18. Jupyter – installing with Scala: not as easy; install a Scala kernel (https://github.com/alexarchambault/jupyter-scala); I use cloud services for Scala (see later).
  • 19. Jupyter usage - basics: use keyboard shortcuts; use Markdown and the Markdown help; use MathJax for formulas.
  • 20. Languages - Python: richest set of features; Matplotlib and seaborn libraries for data visualisation; scikit-learn, NumPy, pandas.
  • 21. Matplotlib / seaborn basics: create subplots or just plot; plot series; seaborn simplifies many tasks.
  • 22. Pandas / Spark tips: fast schema creation: create a pandas frame from a small subset, convert it to a Spark DataFrame, extract the schema; sparkDF.limit(10).toPandas() brings a small sample back into pandas.
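The schema tip on slide 22 might look roughly like the sketch below. It assumes `spark` is a `SparkSession` (the Spark 2.x+ entry point, which is newer than this 2015 talk) and that the data is a CSV file with a header row; the function names and the `sample_rows` default are hypothetical. pandas is imported inside the second helper so that the first one can be read and exercised without it.

```python
def schema_from_sample(spark, sample_pdf):
    """Infer a Spark schema from a small pandas DataFrame.

    spark      -- assumed to be a pyspark.sql.SparkSession
    sample_pdf -- a pandas DataFrame holding a small subset of the data
    """
    # Converting the small frame is cheap; only its schema is kept.
    return spark.createDataFrame(sample_pdf).schema


def load_with_sampled_schema(spark, csv_path, sample_rows=1000):
    """Read a large CSV with a schema taken from its first rows."""
    import pandas as pd  # local import: only needed by this helper

    # Step 1: pandas frame from a small subset of the file.
    sample_pdf = pd.read_csv(csv_path, nrows=sample_rows)
    # Steps 2 and 3: convert to a Spark DataFrame and extract the schema.
    schema = schema_from_sample(spark, sample_pdf)
    # Reuse the schema so Spark does not have to scan the file to infer types.
    return spark.read.csv(csv_path, schema=schema, header=True)
```

With a running session this would be called as `load_with_sampled_schema(spark, "some_file.csv")`, where the path is a placeholder. A small sample can mislead the type inference (for example an integer column whose first rows contain no missing values), so the extracted schema is worth a quick look before it is reused.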
  • 23. Languages - Scala: better with Zeppelin; fewer libraries for plotting.
  • 24. Languages - R: widely popular statistical language; SparkR; ggplot2; I tried it with Data Scientist Workbench.
  • 25. Running locally: a number of sandboxes are available; I recommend using Vagrant (https://github.com/vykhand/spark-vagrant); Spark edX MOOC.
  • 26. Running Jupyter in the cloud – Spark as a Service: register for Bluemix; create the Spark as a Service boilerplate; upload files to object storage.
  • 27. Running Jupyter in the cloud – Data Scientist Workbench: a rapidly developing product; notebooks; data wrangling; RStudio; check it out, it is available for preview.
  • 28. Demo
  • 29. Zeppelin: a very promising development; very easy and interactive visualization; not very mature (still incubating); my tool of choice is still Jupyter.
  • 30. Zeppelin – getting started: the fastest way is this Vagrant box: http://arjon.es/2015/08/23/vagrant-spark-zeppelin-a-toolbox-to-the-data-analyst/ (https://github.com/arjones/vagrant-spark-zeppelin); install Vagrant; install VirtualBox; git clone; vagrant up.
  • 31. Things I like: very pretty; a choice of multiple interpreters; many interpreters per page; dependencies and execution parameters can be configured via the GUI.
  • 32. Things I don't like: fragile; sometimes counter-intuitive; no obvious way to control notebook execution.
  • 33. Demo