Building High Available & Scalable
Machine Learning Products
Yalçın Yenigün
25/05/2017
Building High Available and Scalable Machine Learning Applications

These slides give a high-level overview of several machine learning algorithms, cross validation, and feature extraction techniques. They also cover high-level techniques for building highly available and scalable ML products.

Published in: Technology

  1. Building High Available & Scalable Machine Learning Products
      Yalçın Yenigün, 25/05/2017
  2. Agenda
  3. Agenda
      1. What is Data-Driven Product?
         a) Introduction
         b) Examples
      2. Machine Learning
         a) Term Definitions
         b) A Visual Example
         c) Supervised Learning
         d) Unsupervised Learning
         e) Cross Validation
         f) Feature Extraction
      3. Machine Learning in iyzico
  4. What is Data Driven Product?
  5. Data Driven Product
      • Data driven is the future!
      • It’s the ‘right’ way of doing things, etc.
      • What is “data-driven”?
      • Is Facebook a data-driven product?
      • Is Uber a data-driven product?
      • We can say that “all” of these are data-driven products.
      • All of them work with data.
      • But are they really data-driven products?
  6. Data Driven Product
      • Experimentation:
      • Data-driven: making design decisions based on behavioral evidence from users.
      • Example: picking a green button for your website because conversion metrics are significantly improved over the purple button.
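The button example above hinges on the word "significantly". A minimal sketch of one common way to check that, a two-proportion z-test, follows. The slides do not say which test was used, and the function name and visitor counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants convert equally."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical counts: variant A is the purple button, variant B the green one.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
green_wins = z > 0 and p_value < 0.05
```

With these made-up counts the green button would be picked; with fewer visitors the same conversion rates might not reach significance.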
  7. Data Driven Product
      • Machine Learning: building systems that learn from behavioral data generated by users.
      • Examples:
        • Recommendation
        • Personalized Ranking
        • People-you-may-know
        • Products-you-may-like
  8. Data Driven Product
      • Databases or APIs
      • They just use the data.
      • To them, their system is also data-driven.
      • But they are NOT data-driven.
      • They don’t use behavioral data generated by users.
  9. Examples
      • A mobile app that gives information about public transport around you.
      • It pulls data from transport operators or APIs, merges it, and shows it to you.
      • Nothing really data-driven.
      • Data-driven version of this app:
        • Learn which parts of the transport network are relevant to you.
        • Predict when cycling is better and when walking is better.
        • Predict waiting times.
        • Predict transport delays.
  10. Examples
      • A website that provides blogging services to users.
      • Write posts, subscribe to other posts, etc.
      • Data-driven version of this blog:
        • Recommend who to follow based on your previous likes.
        • Auto-tag your content so that people can find it quickly.
        • Create a relevance-sorted feed of posts.
  11. Machine Learning
  12. Term Definitions
      • Machine Learning: “Field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel).
      • Arthur Samuel: a pioneer in the field of computer gaming and artificial intelligence. He coined the term "machine learning" in 1959.
      • Feature: in machine learning and pattern recognition, a feature is an individual measurable property of a phenomenon being observed.
  13. Term Definitions
      • Data Sampling: a statistical analysis technique used to select, manipulate and analyze a representative subset of data points in order to identify patterns in the larger data set being examined.
  14. Term Definitions
      • Training Set: a set of data used to discover potentially predictive relationships.
      • ML Model: you can use the ML model to get predictions on new data for which you do not know the target.
      • Cross Validation: a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
  15. Term Definitions
  16. Confusion Matrix
  17. Confusion Matrix
      • Accuracy: ratio of correctly predicted observations to all observations. (TP + TN) / (TP + TN + FP + FN)
      • Precision: ratio of correctly predicted positives to all predicted positives. TP / (TP + FP)
      • Recall: ratio of correctly predicted positives to all actual positives. TP / (TP + FN)
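The three formulas on the Confusion Matrix slide translate directly into code. The sketch below computes them from raw counts; the function name and the example counts are illustrative, and returning 0.0 when a denominator is zero is an assumption, not something the slides specify.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Assumption: report 0.0 when there are no predicted/actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts for 100 scored transactions.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Here accuracy is 85/100, precision is 40/45 and recall is 40/50, which shows how a model can look accurate overall while still missing a fifth of the positive cases.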
  18. Visual Example
  19. Visual Example
  20. Supervised Learning
  21. Supervised Learning
      • Input data is called training data and has a known label or result, such as spam/not-spam or a stock price at a time.
      • Example problems are classification and regression.
      • Example algorithms include Logistic Regression and the Back Propagation Neural Network.
  22. Supervised Learning Example
  23. Supervised Learning Example
  24. Supervised Learning
      • Supervised Learning: right answers given
      • Regression: predict continuous-valued output
      • Classification: discrete-valued output
  25. Supervised Learning – Classification Example
  26. Supervised Learning – Classification Example
  27. Linear Regression with One Variable
  28. Linear Regression with One Variable
  29. Supervised Learning – Classification Example
      http://localhost:8888/notebooks/dev/workspaces/iyzico/scipy_2015_sklearn_tutorial/notebooks/02.1%20Supervised%20Learning%20-%20Classification.ipynb
  30. Linear Regression with One Variable
  31. Linear Regression with One Variable
  32. Cost Function
  33. Cost Function
  34. Cost Function
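The Linear Regression and Cost Function slides are images, so their formulas are not in this transcript. The sketch below assumes the usual squared-error cost for a one-variable hypothesis h(x) = theta0 + theta1 * x, minimized with batch gradient descent, as taught in the Coursera course listed in the references. The learning rate, step count, and data are illustrative.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Fit theta0 and theta1 by batch gradient descent on the cost above."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x, so the fit should recover (1, 2).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On noisy real data the cost would not reach zero, and the learning rate would usually need tuning per data set.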
  35. Supervised Learning – Regression Example
      http://localhost:8888/notebooks/dev/workspaces/iyzico/scipy_2015_sklearn_tutorial/notebooks/02.2%20Supervised%20Learning%20-%20Regression.ipynb
  36. Unsupervised Learning
  37. Unsupervised Learning
      • Input data is not labeled and does not have a known result.
      • Example problems are clustering, dimensionality reduction and association rule learning.
      • Example algorithms include the Apriori algorithm and k-Means.
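To make the k-Means mention concrete, here is a small pure-Python sketch of the algorithm on unlabeled 2-D points. Using the first k points as starting centroids is a simplification chosen to keep the example deterministic; real implementations typically use random or k-means++ initialization. The data points are made up.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, k, iterations=20):
    # Simplification: take the first k points as initial centroids.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Unlabeled toy data with two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

No labels are supplied anywhere; the grouping comes only from distances between points, which is the difference from the supervised examples earlier in the deck.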
  38. Supervised vs Unsupervised Learning
  39. Unsupervised Learning Examples
  40. Unsupervised Learning – Transformation Example
      http://localhost:8888/notebooks/dev/workspaces/iyzico/scipy_2015_sklearn_tutorial/notebooks/02.3%20Unsupervised%20Learning%20-%20Transformations%20and%20Dimensionality%20Reduction.ipynb
  41. Unsupervised Learning – Clustering Example
      http://localhost:8888/notebooks/dev/workspaces/iyzico/scipy_2015_sklearn_tutorial/notebooks/02.4%20Unsupervised%20Learning%20-%20Clustering.ipynb
  42. Cross Validation
  43. Cross Validation
      • A model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
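One common form of cross validation is k-fold: split the data into k parts, train on k-1 of them, score on the held-out part, and average. The slides do not say which variant is used, so the sketch below is an assumption. The `fit` and `score` callables and the mean-predictor demo are hypothetical placeholders for a real model.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_val_score(xs, ys, fit, score, k=5):
    """Average the held-out score over k folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)


def fit_mean(xs, ys):
    """Placeholder 'model': always predict the training mean."""
    return sum(ys) / len(ys)


def mean_squared_error(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)


folds = list(k_fold_indices(10, 3))
cv_error = cross_val_score(list(range(10)), [2.0] * 10, fit_mean, mean_squared_error, k=5)
```

Every sample lands in exactly one test fold, so each score is computed on data the model did not see during fitting.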
  44. Cross Validation Example
      http://localhost:8888/notebooks/dev/workspaces/iyzico/scipy_2015_sklearn_tutorial/notebooks/04.1%20Cross%20Validation.ipynb
  45. Feature Extraction
  46. Feature Extraction
      • Feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant.
      • Feature extraction involves reducing the amount of resources required to describe a large set of data.
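As an illustration of building derived values from measured data, the sketch below turns one raw payment record into a few features. Every field name and every feature here is hypothetical; the slides do not list the features actually used at iyzico.

```python
from datetime import datetime


def extract_features(tx):
    """Derive model inputs from a raw transaction dict (hypothetical schema)."""
    ts = datetime.fromisoformat(tx["timestamp"])
    return {
        "amount": tx["amount"],
        "hour_of_day": ts.hour,
        # weekday() is 0 for Monday, so 5 and 6 are Saturday and Sunday.
        "is_weekend": int(ts.weekday() >= 5),
        "email_domain": tx["email"].rsplit("@", 1)[-1].lower(),
        "billing_matches_shipping": int(tx["billing_country"] == tx["shipping_country"]),
    }


raw = {
    "timestamp": "2017-05-27T23:15:00",
    "amount": 149.90,
    "email": "buyer@Example.com",
    "billing_country": "TR",
    "shipping_country": "DE",
}
features = extract_features(raw)
```

The raw timestamp and addresses are replaced by a handful of compact values, which is the "fewer resources to describe the data" point from the slide.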
  47. Feature Extraction
  48. PL & Tools & Frameworks
  49. Machine Learning In iyzico
  50. Architecture
  51. Roadmap
  52. Challenge 1: Model Needs To Be Tested With Real Data Before Production
  53. Machine Learning Model Release Pipeline
      Stages: Model 1.0.2 (local), Model 1.0.1 (listen), Model 1.0.0 (production)
      • New model developed and tested on local environment.
        • Tech stack: Anaconda, Jupyter, Python, R, Scala
      • New model tested on Listen Mode Server with real transaction data.
        • Tech stack: Spark, Scala, Java 8
        • Cost Matrix reported with real data
        • Response Time reported with real data
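The slides do not describe how the Listen Mode Server works internally. One plausible reading is a shadow deployment: the production model decides, while the candidate model scores the same real transactions and only has its output and response time logged. The sketch below illustrates that reading; all names are hypothetical.

```python
import time


def handle_transaction(tx, production_model, listen_model, listen_log):
    """Return the production decision; record the listen model's output only."""
    decision = production_model(tx)
    try:
        start = time.perf_counter()
        shadow = listen_model(tx)
        elapsed = time.perf_counter() - start
        listen_log.append(
            {"tx_id": tx["id"], "production": decision, "listen": shadow, "seconds": elapsed}
        )
    except Exception as exc:
        # A failing candidate must never affect the real decision.
        listen_log.append({"tx_id": tx["id"], "error": repr(exc)})
    return decision


def model_v1_0_0(tx):  # stand-in for the production model
    return "approve" if tx["amount"] < 1000 else "review"


def model_v1_0_1(tx):  # stand-in for the candidate in listen mode
    return "approve" if tx["amount"] < 500 else "review"


log = []
result = handle_transaction({"id": 1, "amount": 700}, model_v1_0_0, model_v1_0_1, log)
```

Comparing the `production` and `listen` entries against known outcomes is what would feed the cost matrix and response-time reports mentioned on the slide.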
  54. Challenge 2: Response Time Should Be Less Than 0.1 Seconds
  55. Optimize Spark Cluster
      • Use Spark Cluster for Training
      • Use Standalone Spark for Predictions
      • Load Balancer for High Availability
      • Increase Spark Total Executor Core Size
      • Decrease Spark Max Memory In Mb
  56. Challenge 3: Dynamic Data
  57. Schemaless Database with MySQL
      • Multiple features developed each week
      • All features stored and reported
      • Data is really dynamic
      • Schema management is really difficult
      • e.g. Uber, Friendfeed, etc.
  58. Challenge 4: High Availability and Fail Fast
  59. Never Stop Payment Transaction
      • If the API is down, fail fast
      • Use fallback methods so that payment transactions are not affected
      • Netflix Circuit Breaker used
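The fail-fast and fallback bullets describe the circuit breaker pattern. The deck uses Netflix Hystrix, a Java library; the sketch below is not Hystrix's API but a simplified Python illustration of the same idea. The thresholds, the injectable clock, and the function names are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Simplified circuit breaker: closed, open, then half-open after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast without calling func
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def score_transaction():  # hypothetical call to the ML scoring API
    raise ConnectionError("scoring API is down")


def allow_payment():  # hypothetical fallback: do not block the payment
    return "allow"


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
outcomes = [breaker.call(score_transaction, allow_payment) for _ in range(3)]
```

After two failures the third call no longer reaches the scoring API at all, so a payment never waits on a service that is known to be down.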
  60. Netflix Hystrix Circuit Breaker
  61. Challenge 5: Continuous Delivery and Machine Learning
  62. Continuous Delivery and Machine Learning
      • DevOps scripts for training jobs implemented and automated for the Continuous Integration environment
      • Cross Validation jobs automated on Spark with millions of transactions
      • Probability Calibration is implemented
      • Data Sampling is automated (clustering-based sampling)
  63. Challenge 6: Aggregated Feature Simulation with Batch Data
  64. Aggregated Features with Batch Data
      • Time-based aggregated features need to be simulated before production
      • Example: a buyer’s payment behavior over the last hour
      • Redis used for time series data (ZRANGE functions)
      • ZRANGE and ZREVRANGE offer the ability to retrieve elements from a Sorted Set based on their sorted position
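The sketch below shows one way such a "last hour" feature could be replayed over batch data. It uses an in-memory stand-in for a single Redis Sorted Set scored by Unix timestamp, so it runs without a Redis server; in Redis itself, selecting a time window by score corresponds to the `ZRANGEBYSCORE` command. The class, the `payment_id:amount` member format, and the batch rows are hypothetical, and unlike Redis the stand-in does not enforce unique members.

```python
import bisect


class SortedSetStandIn:
    """In-memory stand-in for one Redis Sorted Set, scored by timestamp."""

    def __init__(self):
        self._scores = []
        self._members = []

    def zadd(self, score, member):
        i = bisect.bisect_right(self._scores, score)
        self._scores.insert(i, score)
        self._members.insert(i, member)

    def zrangebyscore(self, low, high):
        """Members whose score lies in the inclusive range [low, high]."""
        lo = bisect.bisect_left(self._scores, low)
        hi = bisect.bisect_right(self._scores, high)
        return self._members[lo:hi]


def payments_last_hour(history, now):
    members = history.zrangebyscore(now - 3600, now)
    amounts = [float(m.split(":")[1]) for m in members]
    return {"count": len(amounts), "total": sum(amounts)}


def simulate(batch):
    """Replay payments in time order; compute each feature before storing the event."""
    history = SortedSetStandIn()
    rows = []
    for ts, payment_id, amount in sorted(batch):
        rows.append((payment_id, payments_last_hour(history, ts)))
        history.zadd(ts, f"{payment_id}:{amount}")
    return rows


# Hypothetical batch for one buyer: (unix_seconds, payment_id, amount).
batch = [(0, "p1", 10.0), (1800, "p2", 20.0), (3000, "p3", 5.0), (4000, "p4", 7.0)]
rows = simulate(batch)
```

Computing the feature before adding the current payment keeps the simulation honest: each row only sees what would have been known at that moment in production.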
  65. References
      • https://medium.com/@neal_lathia/what-do-we-mean-when-we-talk-about-data-driven-products-127ceb3e6cf
      • https://www.slideshare.net/HadoopSummit/h20-a-platform-for-big-math
      • https://www.wikipedia.org/
      • https://www.coursera.org/learn/machine-learning
      • http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
      • https://github.com/amueller/scipy_2015_sklearn_tutorial
      • https://redis.io/commands/
      • https://github.com/Netflix/Hystrix
      • https://eng.uber.com/schemaless-part-one/
      • https://backchannel.org/blog/friendfeed-schemaless-mysql
      • https://www.continuum.io/anaconda-overview
  66. thanks 25/05/2017
