
Managing Machine Learning workflows on Treasure Data


Presented at the PLAZMA TD Tech Talk held at TECH PLAY SHIBUYA on October 17, 2018.


  1. Managing Machine Learning Workflows on Treasure Data. Aki Ariga | Software Engineer. Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved.
  2. Who am I?
     ● Aki Ariga (a.k.a. @chezou)
     ● Software Engineer on the Machine Learning team
     ● Co-author of 「仕事ではじめる機械学習」 (*Machine Learning at Work*)
     ● Founder of kawasaki.rb & MLCT
     ● Interests: MLOps, ML deployment/management
  3. Machine Learning on Treasure Data
  4. Machine Learning capability on Treasure Data
     ● Treasure CDP UI: GUI based, handy
     ● SQL + workflow: scalable
     ● Integration with third-party ML toolkits
  5. Apache Hivemall (incubating)
     ● Scalable ML library implemented as Hive UDFs
     ● OSS project under the Apache Software Foundation
     ● TD bundles Hivemall and has 3 developers on it (the creator + 2 core committers)
     Easy to use: ML in SQL. Scalable: runs in parallel on the Hadoop ecosystem. Versatile: efficient, generic functions.
  6. Example SQL for training with supervised learning
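The slide's SQL appears only as an image in the original deck. As a hedged sketch of what Hivemall supervised training typically looks like (the `train` table, its `features`/`target` columns, and the option string are illustrative assumptions, not the slide's exact query):

    -- Illustrative sketch, not the slide's actual query.
    -- Assumes a table `train` with an array<string> `features` column
    -- and a double `target` column.
    CREATE TABLE regressor AS
    SELECT
      feature,
      avg(weight) AS weight   -- merge weights learned by parallel mappers
    FROM (
      SELECT
        train_regressor(features, target,
                        '-loss squaredloss -opt SGD -reg l2 -eta0 0.1')
          AS (feature, weight)
      FROM train
    ) t
    GROUP BY feature;

The `-reg` and `-eta0` options correspond to the hyperparameters swept over in the workflow example later in the deck.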
  7. Treasure Workflow (a.k.a. digdag)
  8. Treasure Workflow for ML
     ● Easy to productionize your ML workflow for training/prediction
     ● Easy to get the benefit of parallelization

     +parameter_tuning:
       for_each>:
         eta0: [5.0, 1.0, 0.5, 0.1, 0.05, 0.01, 0.001]
         reg: ['no', 'rda', 'l1', 'l2', 'elasticnet']
       _parallel: true
       _do:
         +train:
           td>: queries/train_regressor.sql
           suffix: _${reg}_${eta0.toString().replace('.', '_')}
           create_table: regressor${suffix}
         +evaluate:
           td>: queries/evaluate_params.sql
           insert_into: accuracy_test
           suffix: _${reg}_${eta0.toString().replace('.', '_')}
  9. No secrets with ML workflows on TD... but I can share some tips
     ● Do EDA with a Jupyter notebook to investigate data distribution, trends, outliers, etc.
     ● Reuse training-phase queries for prediction as much as possible
     ● Write commit messages with accuracies for queries and workflows
       ○ A digdag workflow is code; versioning workflows is easy
     ● Version your models: a Hivemall model is just a table!
       ○ Logistic regression/linear regression model weights help you understand feature importance
     ● Visualize table/query dependencies with the existing workflow
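Because a Hivemall linear model is stored as an ordinary table, inspecting feature importance is a single query. A minimal sketch, assuming a model table named `regressor` with `feature` and `weight` columns (names are illustrative):

    -- Sort a linear model's weights by magnitude for a rough
    -- view of feature importance.
    SELECT feature, weight
    FROM regressor
    ORDER BY abs(weight) DESC
    LIMIT 20;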
  10. Visualize table/query dependency: extract dependencies from CREATE TABLE / INSERT INTO statements.
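A hedged sketch of the kind of extraction the slide describes: find the target table of a `CREATE TABLE` / `INSERT INTO` statement and the source tables in `FROM`/`JOIN` clauses. The regexes and function name are illustrative assumptions; real SQL would need a proper parser.

```python
import re

# Illustrative only: naive regexes for dependency extraction from SQL text.
TARGET_RE = re.compile(r"(?:CREATE\s+TABLE|INSERT\s+INTO)\s+(\w+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)

def extract_dependencies(sql):
    """Return (target, source) edges found in one SQL statement."""
    targets = TARGET_RE.findall(sql)
    sources = SOURCE_RE.findall(sql)
    return [(t, s) for t in targets for s in sources if t != s]

edges = extract_dependencies(
    "CREATE TABLE features AS SELECT * FROM raw_logs l JOIN users u ON l.uid = u.uid"
)
print(edges)  # [('features', 'raw_logs'), ('features', 'users')]
```

Feeding each query file of a workflow through this yields an edge list that can be rendered as a dependency graph (e.g. with Graphviz).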
  11. Machine Learning capability on Treasure Data (revisited)
     ● Treasure CDP UI: GUI based, handy
     ● SQL + Workflow: scalable
     ● Integration with third-party ML toolkits
  12. Pros/Cons
     Treasure CDP UI
       Pros: easy to use; fully integrated with the web-based GUI
       Cons: limited to specific purposes (predictive scoring, customer tagging)
     Hivemall + Treasure Workflow
       Pros: customizable; scalable with big data; recommendation; scheduled train/predict
       Cons: a different paradigm from Python scripting
     Integrating third-party tools
       Pros: flexibility with familiar frameworks; model portability
       Cons: data transfer time; need to prepare your own machine
  13. Pros/Cons with the py> operator on TD
     Treasure CDP UI
       Pros: easy to use; fully integrated with the web-based GUI
       Cons: limited to specific purposes (predictive scoring, customer tagging)
     Hivemall + Treasure Workflow
       Pros: customizable; scalable with big data; recommendation; scheduled train/predict
       Cons: a different paradigm from Python scripting
     Third-party tools with the py> operator on TD (private alpha)
       Pros: flexibility with familiar frameworks; model portability; scheduled train/predict
       Cons: data transfer time; not scalable; need to prepare your own machine
  14. What can we do with the py> operator on TD?
  15. Current architecture (diagram): Treasure Data holds the heavy data and runs SQL on it, with scheduled execution via Treasure Workflow plus ad-hoc queries; the aggregated data is transferred to the customer environment.
  16. Architecture with the py> operator (diagram): the heavy data stays in Treasure Data; a container on TD works on the aggregated data, with scheduled execution via Treasure Workflow (ad-hoc query based modeling is available as well), model import/export, and prediction results written back to TD.
  17. Demo: Time series prediction with Prophet
  18. The py> operator is good for...
     ● Experimenting on your machine, productionizing on TD
       ○ Scheduled prediction with a customer-trained model using Python ML libraries, writing prediction results back to TD
       ○ E.g. build a TF/sklearn model on a customer's machine and predict on TD
     ● Exporting models for customers' own prediction APIs
       ○ E.g. build a model with PyTorch, export ONNX, and build your own API server
     ● Updating your model continuously on TD
       ○ E.g. train a TF model with a GPU and predict on TD (not planned yet)
     ● Data preparation/enrichment with Python scripts
       ○ E.g. complex text analysis
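In open-source digdag, a `py>` task references a Python function as `module.function`. A minimal hedged sketch of a scheduled predict workflow in that style (the schedule, query file, table name, and `tasks.predict.run` module/function are assumptions; the `py>` operator on TD itself was in private alpha at the time of this talk):

    timezone: UTC

    schedule:
      daily>: 01:00:00

    +prepare:
      td>: queries/aggregate_features.sql   # hypothetical query file
      create_table: features

    +predict:
      py>: tasks.predict.run                # hypothetical module tasks/predict.py, function run()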
  19. Batch prediction vs. on-the-fly prediction
     ● TD has a strong capability for batch prediction with Hivemall
       ○ In batch mode, storing prediction results is the easiest way
     ● We don't have any option for on-the-fly prediction yet
       ○ Option 1) Export models to S3 and customers build their own API servers
       ○ Option 2) Build APIs for each customer's model on TD (not planned yet)
  20. Example workflows with the py> operator
     ● Time series prediction for sales with Prophet: https://github.com/treasure-data/workflow-examples/pull/117
     ● Sentiment classification with TensorFlow: https://github.com/treasure-data/workflow-examples/pull/118
     ● Feature selection with scikit-learn: https://github.com/treasure-data/workflow-examples/pull/116
  21. Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos! (Confidential © Arm 2017)
