Building a Data Product using
Apache Zeppelin (incubating)
NFLabs for ApacheCon ’15 EU
Alexander Bezzubov
Data engineer @NFLabs
based in Seoul, South Korea
bzz@apache.org
So you want to build a data
product…
What do you need?
To build a data product we need:
Idea
Data
Software (to process the data)
Hardware (to run the software)
…and a brain
Software
Zeppelin
Open-source analytical environment, with pluggable backend data-processing systems and a notebook-style GUI for visualisations.
Zeppelin History
12.2012: Commercial app using AMP Lab Shark 0.5
08.2013: NFLabs internal project, Hive/Shark
10.2013: Prototype, Hive/Shark
12.2014: ASF Incubation
http://zeppelin.incubator.apache.org
Zeppelin Current Status
1 Release
63 Contributors worldwide
689 Stars on GitHub
~300/900 Emails at users/dev @incubator.apache.org
http://zeppelin.incubator.apache.org
Zeppelin Architecture
Hardware
Idea
An Idea
Wouldn't it be cool if you could have your own Google Analytics?
Sorry, we already saw it in the eCG talk... OK, let's pick something else.
Wouldn't it be cool if you could be the first to know when there is a new interesting* open-source project?
Data
Data: GitHub Archive
https://www.githubarchive.org
• GitHub logs, hosted in the cloud
• Collaboration between GitHub and Google engineers
• 20+ event types, 250+ GB since 2012
• Proprietary software
• Available on BigQuery
But what if you could analyse this data independently, without asking permission or paying anybody?
Well, with ASF you can!
Let’s build a data product for ourselves!
Building a product
We are going to build a Notebook that:
• Downloads the latest data from GitHub Archive
• Reads & explores the dataset
• Imports and filters the PublicEvents
• Joins the logs with more data from GitHub API calls
• Shows a simple HTML template to visualise the list
• Sends email notifications
In short: a notebook that sends you digest emails.
Start Zeppelin
Download it from http://zeppelin.incubator.apache.org/download.html, then:
./bin/zeppelin-daemon.sh start
and create a new notebook.
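A minimal sketch of getting a local instance up, assuming a binary release from the download page above (archive and directory names vary by release):

tar -xzf zeppelin-*-bin-all.tgz          # unpack the binary release
cd zeppelin-*-bin-all
./bin/zeppelin-daemon.sh start           # web UI is served on http://localhost:8080 by default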
Zeppelin Architecture
Load Dependency
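The dependency-loading paragraph isn't reproduced in this text. As an illustration, Zeppelin's %dep interpreter can pull extra artifacts into the Spark interpreter before it starts; the commons-email coordinates below are an assumption, chosen because the notebook sends mail later on:

%dep
z.reset()                                        // clear any previously loaded artifacts
z.load("org.apache.commons:commons-email:1.4")   // mail library used in the Send Email step
// note: %dep must run before the Spark interpreter is first used in the notebook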
Download Data
In serial, sample, using shell interpreter
Download Data
In serial, whole day, using shell interpreter (we don't need this step here, as the data is already prepared)
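As a sketch of what these shell-interpreter paragraphs might look like: GitHub Archive publishes one gzipped JSON file per hour, so a one-hour sample or a whole day can be fetched with wget (date and target directory are arbitrary choices):

%sh
mkdir -p /tmp/gharchive
# a single hour as a sample
wget -q -P /tmp/gharchive http://data.githubarchive.org/2015-01-01-15.json.gz
# or every hour of the day, one file at a time
for h in $(seq 0 23); do
  wget -q -P /tmp/gharchive http://data.githubarchive.org/2015-01-01-$h.json.gz
done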
Download Data
In parallel, using Spark interpreter
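A rough equivalent with the Spark interpreter, spreading one download task per hour across the cluster (paths and date are assumptions; each file ends up on the local disk of whichever worker ran the task):

import sys.process._
val day = "2015-01-01"
// 24 tasks, one per hourly file; each runs a shell download on its worker
sc.parallelize(0 to 23, numSlices = 24).foreach { hour =>
  s"wget -q -P /tmp/gharchive http://data.githubarchive.org/$day-$hour.json.gz".!
}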
Read Data
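A sketch of loading the hourly logs into a DataFrame and exposing them to SQL (the path matches the shell download above; on a real cluster the files would sit on HDFS or another shared filesystem, and older Spark versions used sqlContext.jsonFile instead of read.json):

// Spark reads gzipped, line-delimited JSON transparently
val events = sqlContext.read.json("/tmp/gharchive/*.json.gz")
events.printSchema()                  // inspect the inferred GitHub Archive schema
events.registerTempTable("events")    // make it queryable from %sql paragraphs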
Explore Data #1
Pie Chart of Event types
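For instance, a %sql paragraph like the one below counts events per type; switching the result to the pie-chart view gives the chart on the slide (the type field is part of the GitHub Archive schema):

%sql
SELECT type, count(1) AS cnt
FROM events
GROUP BY type
ORDER BY cnt DESC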
Explore Data #2
Top 10 organisations by event type
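And a variation ranking organisations, relying on the nested org.login field (the exact field name is an assumption about the archive schema):

%sql
SELECT org.login AS organisation, type, count(1) AS cnt
FROM events
WHERE org.login IS NOT NULL
GROUP BY org.login, type
ORDER BY cnt DESC
LIMIT 10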
Filter: cleanup
Only organisations open-sourcing repositories
Filter: interesting companies
Only organisations open-sourcing repositories
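A sketch of both filters in the Spark interpreter: keep only PublicEvent records (a repository switching from private to public, i.e. being open-sourced) that belong to an organisation, then narrow down to a hand-picked list of companies; field names and the list itself are assumptions:

// cleanup: organisations open-sourcing repositories
val published = events
  .filter("type = 'PublicEvent' AND org.login IS NOT NULL")
  .select("org.login", "repo.name", "created_at")

// interesting companies only
val picked = published.filter("login IN ('apache', 'google', 'twitter', 'facebook')")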
Call GitHub API
Getting more information about each repository
Use a GitHub personal access token to raise the rate limit:
github.com/<username> => Edit Profile => Personal access tokens => Generate new token
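The API call itself isn't shown in the text; one way it could look is a small helper that fetches repository metadata over HTTPS with the token created above (scala.io.Source and the access_token query parameter are assumptions, not necessarily what the original notebook used):

import scala.io.Source
val token = "<personal access token>"   // generated as described above

// raw JSON description of a repository, e.g. repoJson("apache/zeppelin")
def repoJson(fullName: String): String =
  Source.fromURL(s"https://api.github.com/repos/$fullName?access_token=$token").mkString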
Join orgs and repos
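One possible shape of this step, assuming the API responses have been parsed into a DataFrame repoDetails with columns full_name, description and language (both that DataFrame and its column names are hypothetical):

// attach the extra metadata to each newly open-sourced repository
val enriched = picked.join(repoDetails, picked("name") === repoDetails("full_name"))
enriched.registerTempTable("digest")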
HTML Preview
To generate a template
HTML Preview
To output the results
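Zeppelin renders any paragraph output that starts with %html, which gives a simple way to preview the digest template right in the notebook (column names follow the hypothetical enriched DataFrame above):

val items = enriched.collect().map { r =>
  val repo = r.getAs[String]("full_name")
  s"""<li><a href="https://github.com/$repo">$repo</a>: ${r.getAs[String]("description")}</li>"""
}.mkString("\n")

// output starting with "%html" is rendered as HTML instead of plain text
println(s"%html <h4>Newly open-sourced repositories</h4>\n<ul>\n$items\n</ul>")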
Send Email
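The mailing code is not in the text either; a minimal sketch with Apache Commons Email (the artifact loaded in the %dep paragraph) could look roughly like this, with the SMTP settings and addresses as placeholders:

import org.apache.commons.mail.HtmlEmail

val mail = new HtmlEmail()
mail.setHostName("smtp.example.com")            // placeholder SMTP relay
mail.setAuthentication("user", "password")
mail.setSSLOnConnect(true)
mail.setFrom("digest@example.com")
mail.addTo("you@example.com")
mail.setSubject("GitHub open-source digest")
mail.setHtmlMsg(s"<ul>$items</ul>")             // reuse the HTML built for the preview
mail.send()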
Schedule
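Scheduling itself needs no code: the clock icon in the notebook toolbar accepts a cron expression (for example 0 0 8 * * ? would re-run the whole notebook, and so send a fresh digest, every morning at 8).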
Kylin: interpreter setup
Kylin: consume data
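These last two steps point at Apache Kylin: the Kylin interpreter is configured on Zeppelin's Interpreter page (Kylin REST URL, credentials and project), after which cubes can be queried from a %kylin paragraph. A hedged sketch against Kylin's bundled sample cube:

%kylin
select part_dt, sum(price) as total_sold
from kylin_sales
group by part_dt
order by part_dt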
Alexander Bezzubov
bzz@apache.org
http://s.apache.org/zeppelin-workshop