Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Data science lifecycle with Apache Zeppelin

2,326 views

Published on

Data science lifecycle with Apache Zeppelin

Published in: Technology

Data science lifecycle with Apache Zeppelin

  1. 1. Data Science lifecycle with Apache Zeppelin http://zeppelin.apache.org
  2. 2. Moon Creator of Apache Zeppelin Co-founder NFLabs
  3. 3. Zeppelin 2012. 12 Data analytics solution based on AMP Lab Spark/Shark
  4. 4. Zeppelin 2012. 12 Data analytics solution based on AMP Lab Spark/Shark 2013. 10 Opensource interactive analytics feature as ‘Zeppelin’ 2013. 10 2014. 08
  5. 5. Zeppelin 2012. 12 Data analytics solution based on AMP Lab Spark/Shark 2013. 10 Opensource interactive analytics feature as ‘Zeppelin’ 2014. 12 ASF incubation Incubation Status http://incubator.apache.org/projects/zeppelin.html
  6. 6. Zeppelin 2012. 12 Data analytics solution based on AMP Lab Spark/Shark 2013. 10 Opensource interactive analytics feature as ‘Zeppelin’ 2014. 12 ASF incubation 2016. 10 157 Contributors world wide 2071 Stars on github repo 6 Releases One of the most popular project in ASF
  7. 7. Collect ETL / Process Analysis Report Data Product Life cycle of big data Data Engineer Data Scientist Business user Customer
  8. 8. Zeppelin A web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.
  9. 9. Zeppelin JDBC Markdown > _ Shell Interpreter : pluggable layer for language / processing backend integration 20+ interpreters are supported officially 2016. 03. Interpreters in Zeppelin source tree. Does not include 3rd party interpreters
  10. 10. Zeppelin Interpreter : pluggable layer for language / processing backend integration
  11. 11. Zeppelin Interpreter : Easy to extend public abstract class Interpreter { public void open(); public void close(); public InterpreterResult interpret(String st, InterpreterContext context); public void cancel(InterpreterContext context); public int getProgress(InterpreterContext context); public List<String> completion(String buf, int cursor); public FormType getFormType(); public Scheduler getScheduler(); } {Must have {Good to have Advanced {
  12. 12. Zeppelin Notebook Repo : pluggable layer for notebook persistence 5+ Notebook repos are supported officially 2016. 03. Notebook repos in Zeppelin source tree. Does not include 3rd party interpreters ZeppelinHub
  13. 13. Zeppelin Notebook Repo : Easy to extend public interface NotebookRepo { public List<NoteInfo> list() throws IOException; public Note get(String noteId) throws IOException; public void save(Note note) throws IOException; public void remove(String noteId) throws IOException; public void checkpoint(String noteId, String checkPointName) throws IOException; public void close(); }
  14. 14. Zeppelin Visualizations : 6 Built-in visualizations comes with pivot Table Bar Pie Area Line Scatter Free to draw any customized visualizations inside of notebook …
  15. 15. He liumHe 2 Platform for data analytics application that makes visualization pluggable and more. http://issues.apache.org/jira/browse/ZEPPELIN-533 https://cwiki.apache.org/confluence/display/ZEPPELIN/Helium+proposal Proposal Umbrella issue Makes Zeppelin fly!
  16. 16. He liumHe 2 RESTful API Websocket Interpreter Notebook Storage Spark Flink Geode JDBC … FileSystem AmazonS3 Git … ZeppelinServer Interpreters and Notebook storage are pluggable
  17. 17. He liumHe 2 Interpreter Notebook Storage Spark Flink Geode JDBC … FileSystem AmazonS3 Git … ZeppelinServer Visualizations Map WordCloud … We want visualization be pluggable
  18. 18. He liumHe 2 Interpreter Notebook Storage Spark Flink Geode JDBC … FileSystem AmazonS3 Git … Application Visualizations Map WordCloud … Resource Pool SparkContext Flink Environment JDBC connection … Analytics … … User object Extend pluggable visualization to pluggable analytics application
  19. 19. Helium Application: Easy to extend public abstract class Application { public Application(ApplicationContext context); public abstract void run(ResourceSet args); public abstract void unload(); } He liumHe 2
  20. 20. Launcher: Suggest application according to data type in ResourcePool He liumHe 2
  21. 21. & Enterprise
  22. 22. Jongyoul Lee PMC of Apache Zeppelin Software Development Engineer at NFLabs
  23. 23. & Enterprise More than 1000 employers use Apache Zeppelin Supports Apache Zeppelin as an internal service Recommendation team uses Apache Zeppelin Monitors their infrastructures via Apache Zeppelin
  24. 24. & Enterprise
  25. 25. & Enterprise History ~ 0.6 • NOTHING!!! 0.6.x • Authentication & Authorization • Note level permission • Note level isolation • Partially supported by Livy
  26. 26. & Enterprise Future 0.7.0 • Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  27. 27. & Enterprise Future 0.7.0 • Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  28. 28. & Enterprise RESTful API Websocket Interpreter Notebook Storage Spark Flink Geode JDBC … FileSystem AmazonS3 Git … ZeppelinServer Multi-tenancy
  29. 29. & Enterprise RESTful API Websocket Interpreter Notebook Storage Spark Flink Geode JDBC … FileSystem AmazonS3 Git … ZeppelinServer NO USER Multi-tenancy
  30. 30. Shared, Isolated, Scoped
  31. 31. & Enterprise ZeppelinServer SparkInterpreter Run P1 on NoteA Run SparkInterpreter for P1 User1 Multi-tenancy
  32. 32. & Enterprise ZeppelinServer SparkInterpreter Run P1 on NoteA Run SparkInterpreter for P1 User1 User2 Run P2 on NoteB Run SparkInterpreter for P2 Multi-tenancy
  33. 33. & Enterprise • Originally implemented • Pros • Simple structure • Predictable behavior • Cons • All resources shared • Interference among users Multi-tenancy Shared
  34. 34. & Enterprise ZeppelinServer SparkInterpreter Run P1 on NoteA Run SparkInterpreter for P1 User1 User2 Run P2 on NoteB Run SparkInterpreter for P2 SparkInterpreter Multi-tenancy
  35. 35. & Enterprise • Pros • No pending • No resources shared • Cons • Lots of memory • Inefficiency of using memory • Limited by resources Multi-tenancy Isolated
  36. 36. & Enterprise ZeppelinServer SparkInterpreter Run P1 on NoteA Run SparkInterpreter for P1 User1 User2 Run P2 on NoteB Run SparkInterpreter for P2 SparkInterpreter Multi-tenancy
  37. 37. & Enterprise ZeppelinServer JDBCInterpreter Run P2 on NoteA Run SparkInterpreter for P2 User1 User2 Run P3 on NoteB Run SparkInterpreter for P3 JDBCInterpreter Multi-tenancy
  38. 38. & Enterprise ZeppelinServer JDBCInterpreter Run P2 on NoteA Run SparkInterpreter for P2 User1 User2 Run P3 on NoteB Run SparkInterpreter for P3 Multi-tenancy JDBCInstance User1 JDBCInstance User2
  39. 39. & Enterprise • Pros • Less memory • Some resources Isolated • Cons • Some resources shared • Big single process Multi-tenancy Scoped
  40. 40. & Enterprise Future 0.7.0 • Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  41. 41. & Enterprise Impersonation
  42. 42. What if all users use different credentials?
  43. 43. & Enterprise Impersonation
  44. 44. & Enterprise Impersonation Credentials • Already merged by Twitter at Mar. 2016 • Never used in any interpreter
  45. 45. & Enterprise Impersonation
  46. 46. • JDBC • Set user and password in properties • https://issues.apache.org/jira/browse/ZEPPELIN-1567 • Spark • Adopt ugi.doAs() • https://issues.apache.org/jira/browse/ZEPPELIN-1572 & Enterprise Impersonation
  47. 47. & Enterprise Future 0.7.0 • Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  48. 48. & Enterprise Job mgmt
  49. 49. & Enterprise Future 0.7.0 • Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  50. 50. • JDBC • Connection pool • Stabilization for BI • Python • Matplot library • Support on python user & Enterprise Interpreters
  51. 51. & Enterprise Future 0.7.0 • Enterprise Support • Multi users environment • Impersonation on Spark/JDBC interpreter • Job management • Interpreter • Improvement on JDBC/Python interpreter • Frontend performance improvement • Pluggable visualization
  52. 52. • Frontend • Fine-grained broadcast of WebSocket • Betterment of rendering DOM • Pluggable visualization • lium & Enterprise Frontend He 2
  53. 53. Zeppelin Homepage http://zeppelin.apache.org/
 Mailing list users@zeppelin.apache.org dev@zeppelin.apache.org Issue tracker https://issues.apache.org/jira/browse/ZEPPELIN Github repository http://github.com/apache/zeppelin Join the community
  54. 54. Thank you Moon soo Lee moon@nflabs.com https://twitter.com/issuefreaks Jongyoul Lee jongyoul@nflabs.com https://twitter.com/madeng

×