Data Science with Spark & Zeppelin
Ofer Mendelevitch
Vinay Shukla
Moon Soo Lee
Page 2 © Hortonworks Inc. 2014
Data Science with iPython
Ofer Mendelevitch
The Data Science Workflow…
[Workflow diagram; stages in the order they appear on the slide]
• Plan: What is the question I'm answering? What data will I need?
• Acquire the data
• Clean data: analyze data quality, reformat, impute, etc.
• Analyze data: visualize
• Create model
• Evaluate results
• Create features
• Create report
• Deploy in production
• Publish & share
Diagram markers: "Start here", "End here"; stage labels: Script, Visualize, Script
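The "Impute" item of the Clean Data stage can be illustrated with a minimal sketch. It assumes mean imputation over a single numeric column; the slide does not say which imputation method is meant, and the function name `impute_mean` and the sample values are hypothetical.

```python
from statistics import mean
from typing import List, Optional


def impute_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing readings.
ages = [30.0, None, 40.0, 50.0, None]
cleaned = impute_mean(ages)
print(cleaned)  # [30.0, 40.0, 40.0, 50.0, 40.0]
```

Mean imputation is only one option; it keeps the column average unchanged but shrinks its variance, which is worth checking in the "Analyze data quality" step.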
Introducing Apache Zeppelin
Lee Moon Soo,
Vinay Shukla
Apache Zeppelin
• A web-based notebook for interactive analytics
• Deeply integrated with Spark and Hadoop
• Supports multiple language backends
• Incubating
Use cases for Zeppelin
• Data exploration & discovery
• Visualization - tables, graphs, charts
• Interactive snippet-at-a-time experience
• Collaboration and publishing
“Modern Data Science Studio”
DEMO I
A day in the life of a data scientist with Zeppelin
Apache Spark Integration
• Supports Scala, PySpark and Spark SQL
• SparkContext injected automatically
• Supports third-party dependencies
• Spark-on-YARN and Spark standalone modes
• Full Spark interpreter configuration
• Multiple Spark interpreter profiles
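"SparkContext injected automatically" means that notebook code can use a context variable (named `sc` in Zeppelin) without constructing it. The toy sketch below illustrates the idea in plain Python. It is not Zeppelin's implementation: `FakeSparkContext` and `run_paragraph` are hypothetical names, and the fake context only mimics the shape of a `parallelize` call.

```python
class FakeSparkContext:
    """Hypothetical stand-in for a real SparkContext; not part of Spark."""

    def parallelize(self, items):
        # A real context would return a distributed dataset; a list is enough here.
        return list(items)


def run_paragraph(code: str, injected: dict) -> dict:
    """Execute one notebook paragraph with pre-bound names available."""
    namespace = dict(injected)  # copy, so a paragraph cannot remove the bindings
    exec(code, namespace)
    return namespace


# The notebook runtime, not the user, creates the context and binds it to `sc`.
injected = {"sc": FakeSparkContext()}

# User code refers to `sc` directly and never constructs it.
paragraph = "total = sum(sc.parallelize(range(1, 5)))"
result = run_paragraph(paragraph, injected)
print(result["total"])  # 10
```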
DEMO II
Apache Spark using Zeppelin
Support for multiple back-ends
• Scala, Python, Spark SQL
• Hive, Tajo, Ignite, MySQL, …
• Apache Flink
• Markdown, shell
Driven by the community - thank you!
How is this so easy to do?
Zeppelin Interpreter Architecture
An interpreter is the connector between Zeppelin and a backend data processing system.
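[Architecture diagram]
• ZeppelinServer talks over Thrift to an InterpreterGroup
• Each InterpreterGroup runs as a separate JVM process and hosts several Interpreters
• Spark InterpreterGroup: Spark, PySpark, SparkSQL, Dep
• The Spark interpreters share a single SparkDriver, connected to the Spark cluster
• Dep loads libraries from a Maven repository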
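The grouping idea, several interpreters in one process sharing one driver, can be modelled in a few lines. This is a toy sketch under stated assumptions: the classes `Driver`, `Interpreter` and `InterpreterGroup` below are hypothetical Python stand-ins and do not reproduce Zeppelin's actual classes or its Thrift protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Driver:
    """Hypothetical stand-in for the single shared SparkDriver."""
    jobs: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Interpreter:
    name: str
    driver: Driver

    def interpret(self, code: str) -> str:
        # Every interpreter submits its work to the same driver object.
        self.driver.jobs.append((self.name, code))
        return f"{self.name} ran: {code}"


class InterpreterGroup:
    """Interpreters that live in one process and share one driver."""

    def __init__(self, names: List[str]) -> None:
        self.driver = Driver()
        self.interpreters: Dict[str, Interpreter] = {
            n: Interpreter(n, self.driver) for n in names
        }


group = InterpreterGroup(["spark", "pyspark", "sql", "dep"])
group.interpreters["spark"].interpret("val x = 1")
group.interpreters["sql"].interpret("select 1")
print(len(group.driver.jobs))  # 2
```

Because both calls land on the same `Driver` instance, state created through one interpreter is visible to the others, which is the point of sharing a single driver.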
Notebook - Interpreter Selection
[Diagram]
• Spark interpreter group: spark, pyspark, sql, dep
• The interpreters share a single SparkDriver, connected to the Spark cluster
• dep loads libraries from a Maven repository
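In a Zeppelin notebook, a paragraph picks its interpreter with a leading directive such as `%pyspark` or `%sql`. The sketch below shows one way such a directive could be parsed. It is an illustration, not Zeppelin's parser: the function `select_interpreter`, the fallback to `spark` when no directive is present, and the table name `bank` are assumptions made for this example.

```python
from typing import Tuple

KNOWN = ("spark", "pyspark", "sql", "dep")  # interpreter names from the slide
DEFAULT = "spark"                            # assumption for this sketch


def select_interpreter(paragraph: str) -> Tuple[str, str]:
    """Split a paragraph into (interpreter name, code body)."""
    text = paragraph.lstrip()
    if not text.startswith("%"):
        return DEFAULT, paragraph
    parts = text.split(None, 1)  # the directive, then everything after it
    name = parts[0][1:]          # drop the leading '%'
    body = parts[1] if len(parts) > 1 else ""
    if name not in KNOWN:
        raise ValueError(f"unknown interpreter: {name}")
    return name, body


print(select_interpreter("%sql\nselect count(*) from bank"))
# ('sql', 'select count(*) from bank')
print(select_interpreter("val n = 1 + 1"))
# ('spark', 'val n = 1 + 1')
```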
DEMO III
Interpreter Deep Dive
Join the community
• Try out Apache Zeppelin today
• https://zeppelin.incubator.apache.org/
• Join us on the community discussions
• Help define how we shape the roadmap and features
• Let's get this party started!
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
[Diagram: Cloud of your choice; Storage; YARN: Data Operating System; Resource Management; Governance; Security; Operations]
Questions?
Thank you

Data Science in the Cloud with Spark, Zeppelin, and Cloudbreak
