Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin

Flink Forward 2015

  1. 1. Data science lifecycle with Apache Flink and Apache Zeppelin (incubating) Flink Forward Moon moon@nflabs.com NFLabs www.nflabs.com
  2. 2. Content 1. Data science lifecycle 2. Zeppelin for data science 3. Zeppelin and Flink 4. Project Roadmap
  3. 3. Data science lifecycle
  4. 4. Data Science: process https://en.wikipedia.org/wiki/Data_analysis
  5. 5. Data Science: tools MLlib
  6. 6. Data Science: people Engineer Data Scientist DevOps Business http://aarondavis.design/
  7. 7. Content 1. Data science lifecycle 2. Zeppelin for data science 3. Zeppelin and Flink 4. Project Roadmap
  8. 8. Zeppelin for data scientist
  9. 9. Project Timeline: 12.2012 Commercial product for data analysis; 10.2013 Open sourced a single feature; 08.2014 Started getting adoption; 12.2014 ASF Incubation. http://zeppelin.incubator.apache.org
  10. 10. Hadoop Landscape Cloudera-ML ML-base MRQL Shark ?
  11. 11. Commercial Product 12.2012
  12. 12. Zeppelin 10.2013
  13. 13. Zeppelin 10.2013
  14. 14. Zeppelin 08.2014
  15. 15. Zeppelin 08.2014
  16. 16. Third-party Products 10.2014
  17. 17. Apache Incubation Proposal 11.2014
  18. 18. Acceptance by Incubator 23.12.2014
  19. 19. Current Status: 1 release; 68 contributors worldwide; 722 stars on GitHub; 300/900 emails at users/dev @i.a.o
  20. 20. Interactive Notebooks
  21. 21. Interactive Visualization
  22. 22. Multiple Backends
  23. 23. Zeppelin & Friends Z-Manager ZeppelinHub … Collaboration/Sharing Packaging & Deployment Zeppelin + Full stack on a cloud Packages Backend Integration
  24. 24. Online Viewer
  25. 25. Deployment https://github.com/hortonworks-gallery/ambari-zeppelin-service
  26. 26. Deployment
  27. 27. As a Service
  28. 28. Before Cloudera-ML ML-base MRQL Shark ?
  29. 29. After Cloudera-ML ML-base MRQL Shark
  30. 30. Content 1. Data science lifecycle 2. Zeppelin for data science 3. Zeppelin and Flink 4. Project Roadmap
  31. 31. Flink integration: Integrated through Interpreter (data processing system abstraction in Zeppelin)
  32. 32. Interpreter http://zeppelin.incubator.apache.org/docs/development/writingzeppelininterpreter.html
  33. 33. Writing an Interpreter public abstract void open(); public abstract void close(); public abstract InterpreterResult interpret(String st, InterpreterContext context); public abstract void cancel(InterpreterContext context); public abstract int getProgress(InterpreterContext context); public abstract List<String> completion(String buf, int cursor); public abstract FormType getFormType(); public Scheduler getScheduler(); Must have Good to have Advanced
  34. 34. Flink Interpreter https://github.com/apache/incubator-zeppelin/blob/master/flink/src/main/java/org/apache/zeppelin/flink/FlinkInterpreter.java Zeppelin Server Thrift Flink Interpreter Interpreter JVM process FlinkILoop ExecutionEnvironment
  35. 35. Using interpreter: Configure, Bind, Use
  36. 36. Using interpreter Use different interpreters in the same notebook
  37. 37. Display System Zeppelin Server Flink Interpreter Other Interpreter Zeppelin webapp Websocket, REST Text Html Table Angular
  38. 38. Display System Select display system through output
  39. 39. Built-in scheduler: runs your notebook on a cron expression.
  40. 40. Flexible layout
  41. 41. DEMO
  42. 42. Content 1. Data science lifecycle 2. Zeppelin for data science 3. Zeppelin and Flink 4. Project Roadmap
  43. 43. Flink Integration • ZeppelinContext: access to Zeppelin-provided features (dynamic form, Angular display system) • Dependency loading • Auto completion • Cancel • Get progress information
  44. 44. Thank you Q & A Moon moon@nflabs.com NFLabs www.nflabs.com http://zeppelin.incubator.apache.org/
  45. 45. Project roadmap
  46. 46. Multi-tenancy: two approaches. 1. Implement authentication and ACL inside of Zeppelin (https://github.com/apache/incubator-zeppelin/pull/53). 2. Run Zeppelin on top of Docker (http://github.com/NFLabs/z-manager)
  47. 47. Zeppelin for organizations
  48. 48. An Engineer engineer by http://aarondavis.design/
  49. 49. A Team engineer by http://aarondavis.design/
  50. 50. An Organization engineer by http://aarondavis.design/
  51. 51. That’s too many! engineer by http://aarondavis.design/
  52. 52. What is the problem? Too much: Install Configure Cluster resources
  53. 53. Solution? We have containers + reverse proxy
  54. 54. Z Manager PoC httpd + mod_php nginx Linux box engineer by http://aarondavis.design/ 2 days, bash + php :(
  55. 55. Z Manager PoC
  56. 56. Z Manager (http://github.com/NFLabs/z-manager): Apache 2.0 Licence; containerized deployment per user; reverse proxy; single binary; simple web application. * Z Manager SGA to ASF coming
  57. 57. Z Manager Auto-update engineer by http://aarondavis.design/ Linux box go + react :) Z Manager process
  58. 58. Z Manager
  59. 59. Helium
  60. 60. People do similar work with different data: new visualization, model & algorithm, data process pipeline. engineer by http://aarondavis.design/
  61. 61. Package and distribute work New visualization Model & Algorithm Data process pipeline Pkg Repo engineer by http://aarondavis.design/
  62. 62. Helium (https://s.apache.org/helium): Platform for Data Analytics Application on top of Apache Zeppelin
  63. 63. Helium Application = View + Algorithm; Zeppelin provided Resources
  64. 64. Resources: Data, Computing, Any Java object
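
As a rough illustration of slides 33 and 38, the following self-contained Java sketch mirrors a subset of the interpreter lifecycle (open, interpret, cancel, getProgress, close) and returns output that begins with `%table`, the output prefix that selects Zeppelin's table display (tab-separated columns, newline-separated rows). `SketchInterpreter` and `WordCountInterpreter` are hypothetical names made up for this example. They are simplified stand-ins, not the real `org.apache.zeppelin.interpreter.Interpreter` class, whose `interpret` takes an `InterpreterContext` and returns an `InterpreterResult`; completion, form type and scheduler are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a subset of the abstract methods on slide 33.
// This is NOT the real Zeppelin Interpreter class: the InterpreterContext
// argument and the InterpreterResult return type are simplified away.
abstract class SketchInterpreter {
    public abstract void open();
    public abstract void close();
    public abstract String interpret(String st);
    public abstract void cancel();
    public abstract int getProgress();
}

// Toy interpreter (assumption for illustration): counts the words in the
// paragraph text and emits the counts through the "%table" display prefix.
class WordCountInterpreter extends SketchInterpreter {
    private boolean opened = false;
    private volatile boolean cancelled = false;
    private volatile int progress = 0;

    @Override public void open() { opened = true; }
    @Override public void close() { opened = false; }
    @Override public void cancel() { cancelled = true; }
    @Override public int getProgress() { return progress; }

    @Override
    public String interpret(String st) {
        if (!opened) {
            throw new IllegalStateException("open() must be called before interpret()");
        }
        cancelled = false;
        progress = 0;

        // LinkedHashMap keeps the words in first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        String[] tokens = st.trim().split("\\s+");
        for (int i = 0; i < tokens.length && !cancelled; i++) {
            if (!tokens[i].isEmpty()) {
                counts.merge(tokens[i], 1, Integer::sum);
            }
            // Progress is reported as a percentage of tokens processed.
            progress = (i + 1) * 100 / tokens.length;
        }

        // First row is the table header; each following row is one word.
        StringBuilder out = new StringBuilder("%table word\tcount\n");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return out.toString();
    }
}
```

A caller would run `open()`, pass the paragraph text to `interpret(...)`, and hand the returned string to the front end, which would pick the display from the leading `%table`. Starting the output with `%html` instead would, under the same convention, select the HTML display.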
