
HadoopCon 2016 - Keeping a Production Spark Machine Learning Project Under Control with Jupyter Notebook: A Hands-on Case


Sharing some experience from developing machine learning projects on Jupyter Notebook

Published in: Data & Analytics

  1. 1. Jupyter Notebook Hold Spark Machine Learning - Wayne
  2. 2. • Machine Learning • • https:// www.facebook.com/wjmuse About Me
  3. 3. What We Do? • A lot of Data • 6 40 • A lot of Visitor Log • 2800 PV 770 UV •
  4. 4. • Google Facebook • PIXNET PIXNET
  5. 5. Machine Learning • label data • Gender Age Slackbot Rank
  6. 6. We Build Almost Everything on Notebook • Python • Dashboard • Spark • script daemon
  7. 7. Play Jupyter with PySpark Config PySpark driver Execute PySpark on standalone mode
  8. 8. Run as Script or Daemon
  9. 9. Pandas Dataframe on Notebook is Wonderful • From File: pandas.read_csv(DICT, header=None, sep=" ", names=['word','weight','type']); pandas.read_json(TOP_ARTICLE) • From Redshift: sql = "select keyword, sum(clicks) AS cc from search_console WHERE … GROUP BY …"; df = read_sql(sql, con=con) • From Google Spreadsheet: sheet = gc.open_by_url(link); spreadata = pandas.DataFrame(sheet.get_all_records()) • Operations: concat, drop_duplicates, dropna, groupby, …
  10. 10. ipywidgets • sliders, progress bars, checkboxes, buttons, … qgrid • Uses SlickGrid to render pandas DataFrames within a Jupyter notebook. IPython.display • SVG, Math, Javascript, IFrame, HTML nbviewer • A simple way to share Jupyter Notebooks plotly • Make charts and dashboards online
  11. 11. Components Analyst Frontend Business Reporting
  12. 12. word-library Jieba word2vec data-utility BigQueryApi RedshiftApi url2content url2keyword RESTAPI Scheduling SlackBot Api Dashboard Build Model ... Notebook control ML data pipeline Core-Algorithm
  13. 13. Spark + Jupyter
  14. 14. or • training & prediction • training • cookie • • • bottleneck •
  15. 15. • • 4 (about 4 billion records) • 3 • Run 1 worker with 4 executor instances (each with 2 cores and 4 GB RAM) • Bottleneck • Querying ordered data while doing mapPartitions • Merging 20 million cookies from 4 billion rows • reduceByKey causes a lot of shuffling • Feature selection (sklearn.feature_selection.chi2)
  16. 16. spark-defaults.conf spark-env.sh
  17. 17. 1 - server
  18. 18. 2 - stay up 2
  19. 19. Doing Spark with PHP? Model Idea 500 training data ( ) 3 PHP script .....
  20. 20. Doing Spark with PHP? • 32 Core • spark.master local[*] • spark.executor.instances 32 • sc.textFile("url.csv").repartition(128) • (diagram: Executors and PHP processes)
  21. 21. MySQL ...
  22. 22. Build word2vec Model • • 120 • • We choose cppjieba [github] • thread_number=16 • spark.executor.instances 32
  23. 23. • Jupyter Notebook (data pipeline) • After a Jupyter Notebook is reopened, it is hard to track status • Slack Channel • Jupyter Notebook
  24. 24. • Spark • Jupyter Notebook production
  25. 25. Data Scientist Tool Set -> ->
  26. 26. Use Notebook to define machine learning workflow • Jupyter Lab: The next generation of the Jupyter Notebook (Jupyter team + Bloomberg + Continuum Analytics) • Google Datalab: Cloud Datalab is built on Jupyter and enables analysis of data on BigQuery, GCE, and Cloud Storage using Python, SQL, and JavaScript. • Domino: A platform to accelerate data science that makes data scientists more productive and facilitates collaborative, reproducible, reusable analysis. • Zeppelin: Inspired by the IPython notebook, focusing on providing an analytical environment on top of the Hadoop ecosystem. • Databricks Cloud Notebook: Notebook Workflows are APIs that allow users to chain notebooks together using the standard control structures of the source programming language.
  27. 27. Thank you for listening
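
Slide 7 mentions configuring the PySpark driver and running PySpark on standalone mode, but the transcript does not preserve the actual settings. A minimal sketch of one common setup follows, assuming the `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` environment variables are used so that the `pyspark` launcher starts Jupyter as the driver. The helper name `jupyter_pyspark_launch` and the master URL are hypothetical and not taken from the talk.

```python
import os


def jupyter_pyspark_launch(master_url, base_env=None):
    """Build the command and environment for starting PySpark inside Jupyter.

    master_url: a standalone master URL of the form spark://HOST:PORT
                (the value passed in below is a placeholder, not the speaker's).
    base_env:   environment to extend; defaults to a copy of os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Ask the pyspark launcher to use Jupyter as the driver's Python front end.
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    # Point the driver at the standalone cluster master.
    command = ["pyspark", "--master", master_url]
    return command, env


if __name__ == "__main__":
    command, env = jupyter_pyspark_launch("spark://example-host:7077", base_env={})
    print(" ".join(command))
    print(env["PYSPARK_DRIVER_PYTHON"], env["PYSPARK_DRIVER_PYTHON_OPTS"])
```

Passing the returned pair to `subprocess.run(command, env=env)` on a machine with Spark installed would open a notebook server whose kernels already have a Spark context available. The sketch only builds the values, so it can be inspected without a cluster.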
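
Slide 15 lists merging 20 million cookies from 4 billion rows, the shuffle cost of reduceByKey, and mapPartitions over ordered data, without showing how the merge was written. The sketch below illustrates one possible reading, not the speaker's implementation: when rows already arrive ordered by cookie, a function passed to `mapPartitions` can merge each cookie's rows in a single pass over the partition. The function name `merge_sorted_partition` and the `(cookie_id, value)` row shape are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def merge_sorted_partition(rows):
    """Merge consecutive rows that share a cookie id.

    rows: an iterator of (cookie_id, value) pairs, assumed to be ordered by
          cookie_id within the partition (hypothetical row shape).
    Yields (cookie_id, [values]) once per run of identical cookie ids.
    """
    for cookie_id, group in groupby(rows, key=itemgetter(0)):
        yield cookie_id, [value for _, value in group]


if __name__ == "__main__":
    partition = iter([("c1", "/a"), ("c1", "/b"), ("c2", "/c")])
    for cookie_id, values in merge_sorted_partition(partition):
        print(cookie_id, values)
```

In PySpark this would be applied as `rdd.mapPartitions(merge_sorted_partition)`, since `mapPartitions` takes a function from an iterator to an iterator. A cookie whose rows straddle a partition boundary would be emitted once per partition, so this only replaces a shuffle-based merge when the partitioning keeps each cookie's rows together.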
