Data Science lifecycle with Apache Zeppelin
http://zeppelin.apache.org
Moon
Creator of Apache Zeppelin
Co-founder NFLabs
Zeppelin
2012. 12 Data analytics solution based on AMP Lab Spark/Shark
Zeppelin
2012. 12 Data analytics solution based on AMP Lab Spark/Shark
2013. 10 Opensource interactive analytics feature as ‘Zeppelin’
2013. 10 2014. 08
Zeppelin
2012. 12 Data analytics solution based on AMP Lab Spark/Shark
2013. 10 Opensource interactive analytics feature as ‘Zeppelin’
2014. 12 ASF incubation
Incubation Status http://incubator.apache.org/projects/zeppelin.html
Zeppelin
2012. 12 Data analytics solution based on AMP Lab Spark/Shark
2013. 10 Opensource interactive analytics feature as ‘Zeppelin’
2014. 12 ASF incubation
2016. 10 157 Contributors world wide
2071 Stars on github repo
6 Releases
One of the most popular project in ASF
Collect
ETL /
Process
Analysis
Report
Data
Product
Life cycle of big data
Data
Engineer
Data
Scientist
Business user
Customer
Zeppelin
A web-based notebook that enables interactive data analytics. You can
make beautiful data-driven, interactive and collaborative documents with
SQL, Scala and more.
Zeppelin
JDBC
Markdown > _ Shell
Interpreter : pluggable layer for language / processing backend integration
20+ interpreters are supported officially
2016. 03. Interpreters in Zeppelin source tree. Does not include 3rd party interpreters
Zeppelin
Interpreter : pluggable layer for language / processing backend integration
Zeppelin
Interpreter : Easy to extend
public abstract class Interpreter {
public void open();
public void close();
public InterpreterResult interpret(String st, InterpreterContext context);
public void cancel(InterpreterContext context);
public int getProgress(InterpreterContext context);
public List<String> completion(String buf, int cursor);
public FormType getFormType();
public Scheduler getScheduler();
}
{Must have
{Good to have
Advanced {
Zeppelin
Notebook Repo : pluggable layer for notebook persistence
5+ Notebook repos are supported officially
2016. 03. Notebook repos in Zeppelin source tree. Does not include 3rd party interpreters
ZeppelinHub
Zeppelin
Notebook Repo : Easy to extend
public interface NotebookRepo {
public List<NoteInfo> list() throws IOException;
public Note get(String noteId) throws IOException;
public void save(Note note) throws IOException;
public void remove(String noteId) throws IOException;
public void checkpoint(String noteId, String checkPointName) throws IOException;
public void close();
}
Zeppelin
Visualizations : 6 Built-in visualizations comes with pivot
Table Bar Pie Area Line Scatter
Free to draw any customized visualizations inside of notebook
…
He liumHe
2
Platform for data analytics application that
makes visualization pluggable and more.
http://issues.apache.org/jira/browse/ZEPPELIN-533
https://cwiki.apache.org/confluence/display/ZEPPELIN/Helium+proposal
Proposal
Umbrella issue
Makes Zeppelin fly!
He liumHe
2
RESTful API Websocket
Interpreter Notebook Storage
Spark
Flink
Geode
JDBC
…
FileSystem
AmazonS3
Git
…
ZeppelinServer
Interpreters and Notebook storage are pluggable
He liumHe
2
Interpreter Notebook Storage
Spark
Flink
Geode
JDBC
…
FileSystem
AmazonS3
Git
…
ZeppelinServer
Visualizations
Map
WordCloud
…
We want visualization be pluggable
He liumHe
2
Interpreter Notebook Storage
Spark
Flink
Geode
JDBC …
FileSystem
AmazonS3
Git
…
Application
Visualizations
Map
WordCloud
…
Resource Pool
SparkContext Flink Environment JDBC connection …
Analytics
…
…
User object
Extend pluggable visualization to pluggable analytics application
Helium Application: Easy to extend
public abstract class Application {
public Application(ApplicationContext context);
public abstract void run(ResourceSet args);
public abstract void unload();
}
He liumHe
2
Launcher: Suggest application according to data type in ResourcePool
He liumHe
2
& Enterprise
Jongyoul Lee
PMC of Apache Zeppelin
Software Development Engineer at NFLabs
& Enterprise
More than 1000 employers use Apache Zeppelin
Supports Apache Zeppelin as an internal service
Recommendation team uses Apache Zeppelin
Monitors their infrastructures via Apache Zeppelin
& Enterprise
& Enterprise History
~ 0.6
• NOTHING!!!
0.6.x
• Authentication & Authorization
• Note level permission
• Note level isolation
• Partially supported by Livy
& Enterprise Future
0.7.0
• Enterprise Support
• Multi users environment
• Impersonation on Spark/JDBC interpreter
• Job management
• Interpreter
• Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization
& Enterprise Future
0.7.0
• Enterprise Support
• Multi users environment
• Impersonation on Spark/JDBC interpreter
• Job management
• Interpreter
• Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization
& Enterprise
RESTful API Websocket
Interpreter Notebook Storage
Spark
Flink
Geode
JDBC
…
FileSystem
AmazonS3
Git
…
ZeppelinServer
Multi-tenancy
& Enterprise
RESTful API Websocket
Interpreter Notebook Storage
Spark
Flink
Geode
JDBC
…
FileSystem
AmazonS3
Git
…
ZeppelinServer
NO USER
Multi-tenancy
Shared, Isolated, Scoped
& Enterprise
ZeppelinServer
SparkInterpreter
Run P1 on NoteA
Run SparkInterpreter for P1
User1
Multi-tenancy
& Enterprise
ZeppelinServer
SparkInterpreter
Run P1 on NoteA
Run SparkInterpreter for P1
User1
User2
Run P2 on NoteB Run SparkInterpreter for P2
Multi-tenancy
& Enterprise
• Originally implemented
• Pros
• Simple structure
• Predictable behavior
• Cons
• All resources shared
• Interference among users
Multi-tenancy
Shared
& Enterprise
ZeppelinServer
SparkInterpreter
Run P1 on NoteA
Run SparkInterpreter for P1
User1
User2
Run P2 on NoteB
Run SparkInterpreter for P2 SparkInterpreter
Multi-tenancy
& Enterprise
• Pros
• No pending
• No resources shared
• Cons
• Lots of memory
• Inefficiency of using memory
• Limited by resources
Multi-tenancy
Isolated
& Enterprise
ZeppelinServer
SparkInterpreter
Run P1 on NoteA
Run SparkInterpreter for P1
User1
User2
Run P2 on NoteB
Run SparkInterpreter for P2 SparkInterpreter
Multi-tenancy
& Enterprise
ZeppelinServer
JDBCInterpreter
Run P2 on NoteA
Run SparkInterpreter for P2
User1
User2
Run P3 on NoteB
Run SparkInterpreter for P3 JDBCInterpreter
Multi-tenancy
& Enterprise
ZeppelinServer
JDBCInterpreter
Run P2 on NoteA
Run SparkInterpreter for P2
User1
User2
Run P3 on NoteB Run SparkInterpreter for P3
Multi-tenancy
JDBCInstance
User1
JDBCInstance
User2
& Enterprise
• Pros
• Less memory
• Some resources Isolated
• Cons
• Some resources shared
• Big single process
Multi-tenancy
Scoped
& Enterprise Future
0.7.0
• Enterprise Support
• Multi users environment
• Impersonation on Spark/JDBC interpreter
• Job management
• Interpreter
• Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization
& Enterprise Impersonation
What if all users use different credentials?
& Enterprise Impersonation
& Enterprise Impersonation
Credentials
• Already merged by Twitter at Mar. 2016
• Never used in any interpreter
& Enterprise Impersonation
• JDBC
• Set user and password in properties
• https://issues.apache.org/jira/browse/ZEPPELIN-1567
• Spark
• Adopt ugi.doAs()
• https://issues.apache.org/jira/browse/ZEPPELIN-1572
& Enterprise Impersonation
& Enterprise Future
0.7.0
• Enterprise Support
• Multi users environment
• Impersonation on Spark/JDBC interpreter
• Job management
• Interpreter
• Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization
& Enterprise Job mgmt
& Enterprise Future
0.7.0
• Enterprise Support
• Multi users environment
• Impersonation on Spark/JDBC interpreter
• Job management
• Interpreter
• Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization
• JDBC
• Connection pool
• Stabilization for BI
• Python
• Matplot library
• Support on python user
& Enterprise Interpreters
& Enterprise Future
0.7.0
• Enterprise Support
• Multi users environment
• Impersonation on Spark/JDBC interpreter
• Job management
• Interpreter
• Improvement on JDBC/Python interpreter
• Frontend performance improvement
• Pluggable visualization
• Frontend
• Fine-grained broadcast of WebSocket
• Betterment of rendering DOM
• Pluggable visualization
• lium
& Enterprise Frontend
He
2
Zeppelin
Homepage
http://zeppelin.apache.org/

Mailing list
users@zeppelin.apache.org
dev@zeppelin.apache.org
Issue tracker
https://issues.apache.org/jira/browse/ZEPPELIN
Github repository
http://github.com/apache/zeppelin
Join the community
Thank you
Moon soo Lee
moon@nflabs.com
https://twitter.com/issuefreaks
Jongyoul Lee
jongyoul@nflabs.com
https://twitter.com/madeng

Data science lifecycle with Apache Zeppelin

  • 1.
    Data Science lifecyclewith Apache Zeppelin http://zeppelin.apache.org
  • 2.
    Moon Creator of ApacheZeppelin Co-founder NFLabs
  • 3.
    Zeppelin 2012. 12 Dataanalytics solution based on AMP Lab Spark/Shark
  • 4.
    Zeppelin 2012. 12 Dataanalytics solution based on AMP Lab Spark/Shark 2013. 10 Opensource interactive analytics feature as ‘Zeppelin’ 2013. 10 2014. 08
  • 5.
    Zeppelin 2012. 12 Dataanalytics solution based on AMP Lab Spark/Shark 2013. 10 Opensource interactive analytics feature as ‘Zeppelin’ 2014. 12 ASF incubation Incubation Status http://incubator.apache.org/projects/zeppelin.html
  • 6.
    Zeppelin 2012. 12 Dataanalytics solution based on AMP Lab Spark/Shark 2013. 10 Opensource interactive analytics feature as ‘Zeppelin’ 2014. 12 ASF incubation 2016. 10 157 Contributors world wide 2071 Stars on github repo 6 Releases One of the most popular project in ASF
  • 7.
    Collect ETL / Process Analysis Report Data Product Life cycleof big data Data Engineer Data Scientist Business user Customer
  • 8.
    Zeppelin A web-based notebookthat enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.
  • 9.
    Zeppelin JDBC Markdown > _Shell Interpreter : pluggable layer for language / processing backend integration 20+ interpreters are supported officially 2016. 03. Interpreters in Zeppelin source tree. Does not include 3rd party interpreters
  • 10.
    Zeppelin Interpreter : pluggablelayer for language / processing backend integration
  • 11.
    Zeppelin Interpreter : Easyto extend public abstract class Interpreter { public void open(); public void close(); public InterpreterResult interpret(String st, InterpreterContext context); public void cancel(InterpreterContext context); public int getProgress(InterpreterContext context); public List<String> completion(String buf, int cursor); public FormType getFormType(); public Scheduler getScheduler(); } {Must have {Good to have Advanced {
  • 12.
    Zeppelin Notebook Repo :pluggable layer for notebook persistence 5+ Notebook repos are supported officially 2016. 03. Notebook repos in Zeppelin source tree. Does not include 3rd party interpreters ZeppelinHub
  • 13.
    Zeppelin Notebook Repo :Easy to extend public interface NotebookRepo { public List<NoteInfo> list() throws IOException; public Note get(String noteId) throws IOException; public void save(Note note) throws IOException; public void remove(String noteId) throws IOException; public void checkpoint(String noteId, String checkPointName) throws IOException; public void close(); }
  • 14.
    Zeppelin Visualizations : 6Built-in visualizations comes with pivot Table Bar Pie Area Line Scatter Free to draw any customized visualizations inside of notebook …
  • 15.
    He liumHe 2 Platform fordata analytics application that makes visualization pluggable and more. http://issues.apache.org/jira/browse/ZEPPELIN-533 https://cwiki.apache.org/confluence/display/ZEPPELIN/Helium+proposal Proposal Umbrella issue Makes Zeppelin fly!
  • 16.
    He liumHe 2 RESTful APIWebsocket Interpreter Notebook Storage Spark Flink Geode JDBC … FileSystem AmazonS3 Git … ZeppelinServer Interpreters and Notebook storage are pluggable
  • 17.
    He liumHe 2 Interpreter NotebookStorage Spark Flink Geode JDBC … FileSystem AmazonS3 Git … ZeppelinServer Visualizations Map WordCloud … We want visualization be pluggable
  • 18.
    He liumHe 2 Interpreter NotebookStorage Spark Flink Geode JDBC … FileSystem AmazonS3 Git … Application Visualizations Map WordCloud … Resource Pool SparkContext Flink Environment JDBC connection … Analytics … … User object Extend pluggable visualization to pluggable analytics application
  • 19.
    Helium Application: Easyto extend public abstract class Application { public Application(ApplicationContext context); public abstract void run(ResourceSet args); public abstract void unload(); } He liumHe 2
  • 20.
    Launcher: Suggest applicationaccording to data type in ResourcePool He liumHe 2
  • 21.
  • 22.
    Jongyoul Lee PMC ofApache Zeppelin Software Development Engineer at NFLabs
  • 23.
    & Enterprise More than1000 employers use Apache Zeppelin Supports Apache Zeppelin as an internal service Recommendation team uses Apache Zeppelin Monitors their infrastructures via Apache Zeppelin
  • 24.
  • 25.
    & Enterprise History ~0.6 • NOTHING!!! 0.6.x • Authentication & Authorization • Note level permission • Note level isolation • Partially supported by Livy
  • 26.
    & Enterprise Future 0.7.0 •Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  • 27.
    & Enterprise Future 0.7.0 •Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  • 28.
    & Enterprise RESTful APIWebsocket Interpreter Notebook Storage Spark Flink Geode JDBC … FileSystem AmazonS3 Git … ZeppelinServer Multi-tenancy
  • 29.
    & Enterprise RESTful APIWebsocket Interpreter Notebook Storage Spark Flink Geode JDBC … FileSystem AmazonS3 Git … ZeppelinServer NO USER Multi-tenancy
  • 30.
  • 31.
    & Enterprise ZeppelinServer SparkInterpreter Run P1on NoteA Run SparkInterpreter for P1 User1 Multi-tenancy
  • 32.
    & Enterprise ZeppelinServer SparkInterpreter Run P1on NoteA Run SparkInterpreter for P1 User1 User2 Run P2 on NoteB Run SparkInterpreter for P2 Multi-tenancy
  • 33.
    & Enterprise • Originallyimplemented • Pros • Simple structure • Predictable behavior • Cons • All resources shared • Interference among users Multi-tenancy Shared
  • 34.
    & Enterprise ZeppelinServer SparkInterpreter Run P1on NoteA Run SparkInterpreter for P1 User1 User2 Run P2 on NoteB Run SparkInterpreter for P2 SparkInterpreter Multi-tenancy
  • 35.
    & Enterprise • Pros •No pending • No resources shared • Cons • Lots of memory • Inefficiency of using memory • Limited by resources Multi-tenancy Isolated
  • 36.
    & Enterprise ZeppelinServer SparkInterpreter Run P1on NoteA Run SparkInterpreter for P1 User1 User2 Run P2 on NoteB Run SparkInterpreter for P2 SparkInterpreter Multi-tenancy
  • 37.
    & Enterprise ZeppelinServer JDBCInterpreter Run P2on NoteA Run SparkInterpreter for P2 User1 User2 Run P3 on NoteB Run SparkInterpreter for P3 JDBCInterpreter Multi-tenancy
  • 38.
    & Enterprise ZeppelinServer JDBCInterpreter Run P2on NoteA Run SparkInterpreter for P2 User1 User2 Run P3 on NoteB Run SparkInterpreter for P3 Multi-tenancy JDBCInstance User1 JDBCInstance User2
  • 39.
    & Enterprise • Pros •Less memory • Some resources Isolated • Cons • Some resources shared • Big single process Multi-tenancy Scoped
  • 40.
    & Enterprise Future 0.7.0 •Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  • 41.
  • 42.
    What if allusers use different credentials?
  • 43.
  • 44.
    & Enterprise Impersonation Credentials •Already merged by Twitter at Mar. 2016 • Never used in any interpreter
  • 45.
  • 46.
    • JDBC • Setuser and password in properties • https://issues.apache.org/jira/browse/ZEPPELIN-1567 • Spark • Adopt ugi.doAs() • https://issues.apache.org/jira/browse/ZEPPELIN-1572 & Enterprise Impersonation
  • 47.
    & Enterprise Future 0.7.0 •Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  • 48.
  • 49.
    & Enterprise Future 0.7.0 •Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  • 50.
    • JDBC • Connectionpool • Stabilization for BI • Python • Matplot library • Support on python user & Enterprise Interpreters
  • 51.
    & Enterprise Future 0.7.0 •Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  • 52.
    • Frontend • Fine-grainedbroadcast of WebSocket • Betterment of rendering DOM • Pluggable visualization • lium & Enterprise Frontend He 2
  • 53.
  • 54.
    Thank you Moon sooLee moon@nflabs.com https://twitter.com/issuefreaks Jongyoul Lee jongyoul@nflabs.com https://twitter.com/madeng