HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰

Jupyter Notebook Hold Spark
Machine Learning
-
Wayne

• Machine Learning
•
• https://
www.facebook.com/wjmuse
About Me

What We Do?
• A lot of Data
• 6 40
• A lot of Visitor Log
• 2800 PV 770 UV
•

• Google Facebook
• PIXNET
PIXNET

Machine Learning
• label data
•
Gender
Age
Slackbot
Rank

We Almost Build Everything
on Notebook
• Python
• Dashboard
• Spark
• script daemon

Play Jupyter with PySpark
Conﬁg PySpark driver
Execute PySpark on standalone mode

Pandas Dataframe on
Notebook is Wonderful
From File
From Redshift
From Google Spreadsheet
concat, drop_duplicates, dropna, groupby, …
pandas.read_csv(DICT, header=None, sep=" ", names=[‘word’,'weight','type'])
pandas.read_json(TOP_ARTICLE)
sql = “select keyword, sum(clicks) AS cc from search_console WHERE … GROUP BY …”
df = read_sql(sql, con=con)
sheet = gc.open_by_url(link)
spreadata = pandas.DataFrame(sheet.get_all_records())

ipywidgets
• sliders, progress bars, checkboxes, buttons, …
qgrid
• Uses SlickGrid to render pandas DataFrames within a Jupyter
notebook.
IPython.Display
• SVG, Math, Javascript, IFrame, HTML
nbviewer
• A simple way to share Jupyter Notebooks
plotly
• Make charts and dashboards online

Components
Analyst
Frontend
Business
Reporting

word-library
Jieba
word2vec
data-utility
BigQueryApi
RedshiftApi
url2content
url2keyword RESTAPI
Scheduling
SlackBot Api
Dashboard
Build Model
...
Notebook control ML data pipeline
Core-Algorithm

or
• training & prediction
• training
• cookie
•
•
• bottleneck
•

•
• 4 (about 4 billions record)
• 3
• Run 1 worker with 4 executor instances (per 2 cores, 4 GB RAM)
• Bottleneck
• Query ordered data with doing mapPartitions
• Merge 20 millions cookies from 4 billions rows
• ReduceByKey will do lots of shufﬂe
• Feature selection (sklearn.feature_selection.chi2)

spark-defaults.conf
spark-env.sh

Doing Spark with PHP?
Model Idea
500
training data ( )
 
3 PHP script
.....

Doing Spark with PHP?
32 Core
Executor
spark.master local[*]
spark.executor.instances 32
sc.textFile(“url.csv”).repartition(128)
Executor
Executor
Executor
Executor
Executor
PHP
PHP
PHP
PHP
PHP
PHP

Build word2vec Model
•
• 120
•
• We choose cppjieba [github]
• thread_number=16
• spark.executor.instances 32

• Jupyter Notebook
(data pipeline)
• Jupyter Notebook reopen hard to track status
• Slack Channel
• Jupyter Notebook

• Spark
• Jupyter Notebook production

Use Notebook to define machine
learning workflow
Jupyter Lab
• The next generation of the Jupyter Notebook
• Jupyter team + Bloomberg + Continuum Analytics
Google Datalab
• Cloud Datalab is built on Jupyter, enables analysis of data on BigQuery,
GCE, and Cloud Storage using Python, SQL, and JavaScript.
Domino
• A Platform to Accelerate Data Science, makes data scientists more
productive and facilitates collaborative, reproducible, reusable analysis.
Zeppelin
• Inspired by iPython notebook focusing on providing analytical environment
on top of Hadoop eco-system.
Databricks Cloud Notebook
• Notebook Workflows as APIs that allow users to chain notebooks together
using the standard control structures of the source programming language.

HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰

Similar to HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰 (20)

Recently uploaded

Recently uploaded (20)

HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰