Published in:
Technology

- 1. Building High Available & Scalable Machine Learning Products Yalçın Yenigün 25/05/2017
- 2. Agenda
- 3. Agenda 1. What is a Data-Driven Product? a) Introduction b) Examples 2. Machine Learning a) Term Definitions b) A Visual Example c) Supervised Learning d) Unsupervised Learning e) Cross Validation f) Feature Extraction 3. Machine Learning in iyzico
- 4. What is a Data-Driven Product?
- 5. Data Driven Product • Data-driven is the future! • It’s the ‘right’ way of doing things, etc. • What is “data-driven”? • Is Facebook a data-driven product? • Is Uber a data-driven product? • We can say that “all” of these are data-driven products • All of them work with data. • But are they really data-driven products?
- 6. Data Driven Product • Experimentation: • Data-Driven: Making design decisions based on behavioral evidence from users. • Example: Picking a green button for your website because conversion metrics are significantly improved over the purple button
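
The button example above depends on the improvement being statistically significant. A minimal sketch of one common way to check that, a two-proportion z-test, is shown here. The function name and the conversion counts are illustrative assumptions and do not come from the slides.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both buttons convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical data: purple button 200/5000 conversions, green button 260/5000.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value (for example below 0.05) would support picking the green button; with few visitors the same difference in rates may not be significant.
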
- 7. Data Driven Product • Machine Learning : Building systems that learn from behavioral data generated by users • Examples: • Recommendation • Personalized Ranking • People-you-may-know • Products-you-may-like
- 8. Data Driven Product • Databases or APIs • They just use the data • To them their system is also data-driven. • But they are NOT data-driven. • They don’t use behavioral data generated by users.
- 9. Examples • A mobile app that gives information about public transport around you. • Pulls data from transport operators or APIs, merges it and gives it to you. • Nothing really data-driven. • Data-driven version of this app: • Learn which part of the transport network is relevant to you. • Predict when cycling is better and when walking is better. • Predict waiting times. • Predict transport delays.
- 10. Examples • A website that provides blogging services to users • Write posts, subscribe to other posts, etc. • Data-driven version of this blog: • Recommend who to follow based on your previous likes • Auto-tag your content so people can find it quickly • Create a relevance-sorted feed of posts.
- 11. Machine Learning
- 12. Term Definitions • Machine Learning: “Field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel) • Arthur Samuel: A pioneer in the field of computer gaming and artificial intelligence. He coined the term "machine learning" in 1959. • Feature: In machine learning and pattern recognition, a feature is an individual measurable property of a phenomenon being observed.
- 13. Term Definitions • Data Sampling: Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points in order to identify patterns in the larger data set being examined.
- 14. Term Definitions • Training Set: A training set is a set of data used to discover potentially predictive relationships. • ML Model: You can use the ML model to get predictions on new data for which you do not know the target. • Cross Validation: A model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
- 15. Term Definitions
- 16. Confusion Matrix
- 17. Confusion Matrix • Accuracy: Ratio of correctly predicted observations to all observations. (TP + TN) / (TP + TN + FP + FN) • Precision: Ratio of correctly predicted positives to all predicted positives. TP / (TP + FP) • Recall: Ratio of correctly predicted positives to all actual positives. TP / (TP + FN)
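
The three formulas above can be computed directly from the four confusion-matrix counts. The sketch below is illustrative: the function name and the example counts are assumptions, and the zero-denominator handling is a choice made here, not something the slides specify.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when nothing was predicted positive
        # or when there are no actual positives.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for an imbalanced problem (few positives).
metrics = classification_metrics(tp=40, tn=900, fp=10, fn=50)
print(metrics)
```

With imbalanced data such as fraud detection, accuracy can look high while recall stays low, which is why the three numbers are usually read together.
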
- 18. Visual Example
- 19. Visual Example
- 20. Supervised Learning
- 21. Supervised Learning • Input data is called training data and has a known label or result, such as spam/not-spam or a stock price at a given time. • Example problems are classification and regression. • Example algorithms include Logistic Regression and the Back Propagation Neural Network.
- 22. Supervised Learning Example
- 23. Supervised Learning Example
- 24. Supervised Learning • Supervised Learning: Right answers given • Regression: Predict continuous valued output • Classification: Discrete valued output
- 25. Supervised Learning – Classification Example
- 26. Supervised Learning – Classification Example
- 27. Linear Regression with One Variable
- 28. Linear Regression with One Variable
- 29. Supervised Learning – Classification Example http://localhost:8888/notebooks/dev/workspaces/iyzico/scipy_2015_sklearn_tutorial/notebooks/02.1%20Supervised%20Learning%20-%20Classification.ipynb
- 30. Linear Regression with One Variable
- 31. Linear Regression with One Variable
- 32. Cost Function
- 33. Cost Function
- 34. Cost Function
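
The linear regression and cost function slides survive only as titles in this transcript. Assuming they follow the Coursera course listed in the references, the hypothesis is h(x) = theta0 + theta1 * x and the cost is the halved mean squared error. The sketch below implements that assumed formulation with batch gradient descent; the data, learning rate and step count are made up for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Halved mean squared error of the line theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Minimise the cost by repeatedly stepping against its gradient."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1, cost(theta0, theta1, xs, ys))
```
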
- 35. Supervised Learning – Regression Example http://localhost:8888/notebooks/dev/workspaces/iyzico/scipy_2015_sklearn_tutorial/notebooks/02.2%20Supervised%20Learning%20-%20Regression.ipynb
- 36. Unsupervised Learning
- 37. Unsupervised Learning • Input data is not labeled and does not have a known result. • Example problems are clustering, dimensionality reduction and association rule learning. • Example algorithms include the Apriori algorithm and k-Means.
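
To make the k-Means mention concrete, here is a minimal sketch of the algorithm on one-dimensional data in plain Python. The function name, the fixed starting centroids and the data points are illustrative assumptions; a real project would more likely use a library implementation such as the one in the scikit-learn tutorial referenced by the deck.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points by alternating assignment and centroid update."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data with two obvious groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No labels are used anywhere: the grouping comes only from the distances between the points.
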
- 38. Supervised vs Unsupervised Learning
- 39. Unsupervised Learning Examples
- 40. Unsupervised Learning – Transformation Example http://localhost:8888/notebooks/dev/workspaces/iyzico/scipy_2015_sklearn_tutorial/notebooks/02.3%20Unsupervised%20Learning%20-%20Transformations%20and%20Dimensionality%20Reduction.ipynb
- 41. Unsupervised Learning – Clustering Example http://localhost:8888/notebooks/dev/workspaces/iyzico/scipy_2015_sklearn_tutorial/notebooks/02.4%20Unsupervised%20Learning%20-%20Clustering.ipynb
- 42. Cross Validation
- 43. Cross Validation • A model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
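
One widely used form of cross validation is k-fold: the data is split into k parts, and each part serves once as the held-out test set while the rest is used for training. The sketch below only produces the index splits. The function name is an illustrative assumption, and the slides do not say which cross validation variant was used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(n_samples=10, k=3):
    print("train:", train, "test:", test)
```

Averaging a model's score over the k test folds gives an estimate of how it would generalize to independent data. In practice the samples are usually shuffled before splitting.
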
- 44. Cross Validation Example http://localhost:8888/notebooks/dev/workspaces/iyzico/scipy_2015_sklearn_tutorial/notebooks/04.1%20Cross%20Validation.ipynb
- 45. Feature Extraction
- 46. Feature Extraction • Feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant. • Feature extraction involves reducing the amount of resources required to describe a large set of data.
- 47. Feature Extraction
- 48. PL & Tools & Frameworks
- 49. Machine Learning In iyzico
- 50. Architecture
- 51. Roadmap
- 52. Challenge 1: Model Needs To Be Tested With Real Data Before Production
- 53. Machine Learning Model Release Pipeline: Model 1.0.2 (local), Model 1.0.1 (listen), Model 1.0.0 (production) • New model developed and tested in the local environment. • Tech stack: Anaconda, Jupyter, Python, R, Scala • New model tested on the Listen Mode Server with real transaction data. • Tech stack: Spark, Scala, Java 8 • Cost Matrix reported with real data • Response Time reported with real data
- 54. Challenge 2: Response Time Should Be Less Than 0.1 Seconds
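
The slides do not show how listen mode is implemented. One plausible reading is that the candidate model scores the same real transactions as the production model, but only the production result is acted on. The sketch below illustrates that reading; every name in it is hypothetical.

```python
import time


def score_with_listen_mode(features, production_model, candidate_model, listen_log):
    """Return the production decision; record the candidate's output on the side."""
    decision = production_model(features)
    started = time.perf_counter()
    try:
        candidate_decision = candidate_model(features)
        error = None
    except Exception as exc:
        # A broken candidate must never affect the live transaction.
        candidate_decision = None
        error = repr(exc)
    listen_log.append({
        "production": decision,
        "candidate": candidate_decision,
        "candidate_seconds": time.perf_counter() - started,
        "error": error,
    })
    return decision


# Hypothetical models: plain functions from a feature dict to a decision.
log = []
result = score_with_listen_mode(
    {"amount": 120.0},
    production_model=lambda f: "approve",
    candidate_model=lambda f: "review" if f["amount"] > 100 else "approve",
    listen_log=log,
)
print(result, log)
```

The logged pairs of decisions and timings are the kind of data from which a cost matrix and response times could later be reported.
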
- 55. Optimize Spark Cluster • Use Spark Cluster for Training • Use Standalone Spark for Predictions • Load Balancer for High Availability • Increase Spark Total Executor Core Size • Decrease Spark Max Memory In Mb
- 56. Challenge 3: Dynamic Data
- 57. Schemaless Database with MySQL • Multiple features developed each week • All features stored and reported • Data is really dynamic • Schema management is really difficult • e.g. Uber, FriendFeed, etc.
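
The general idea behind a schemaless store on a relational database is to keep each record as a serialized body in a generic table, so new features need no schema migration. The sketch below shows that idea only. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the table and function names are assumptions rather than iyzico's actual design.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory table holding one JSON body per entity."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    return conn


def save_features(conn, entity_id, features):
    """Store any dict of features without touching the table schema."""
    conn.execute(
        "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
        (entity_id, json.dumps(features)),
    )


def load_features(conn, entity_id):
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_store()
# Two records with different feature sets share the same table.
save_features(conn, "tx-1", {"amount": 120.0, "country": "TR"})
save_features(conn, "tx-2", {"amount": 15.5, "payments_last_hour": 3})
print(load_features(conn, "tx-2"))
```

The trade-off is that the database can no longer query individual features directly, so lookups by anything other than the id need separate index tables or application-side filtering.
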
- 58. Challenge 4: High Availability and Fail Fast
- 59. Never Stop Payment Transaction • If the API is down, fail fast • Use fallback methods so that payment transactions are not affected • Netflix Hystrix circuit breaker used
- 60. Netflix Hystrix Circuit Breaker
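
Hystrix is a Java library, so the following is not its API. It is a minimal Python sketch of the circuit breaker pattern itself: after repeated failures the breaker opens and calls go straight to the fallback, and after a timeout one trial call is allowed through. The class name, thresholds and the injectable clock are illustrative assumptions.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open circuit: fail fast without touching the failing service.
                return fallback()
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def unavailable_scoring_api():
    raise ConnectionError("scoring API is down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
for _ in range(3):
    # The payment continues with a default answer instead of waiting on the API.
    print(breaker.call(unavailable_scoring_api, fallback=lambda: "no-score"))
```
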
- 61. Challenge 5: Continuous Delivery and Machine Learning
- 62. Continuous Delivery and Machine Learning • DevOps scripts for training jobs implemented and automated for the Continuous Integration environment • Cross Validation jobs automated on Spark with millions of transactions • Probability Calibration is implemented. • Data Sampling is automated (clustering-based sampling)
- 63. Challenge 6: Aggregated Feature Simulation with Batch Data
- 64. Aggregated Features with Batch Data • Time-based aggregated features need to be simulated before production • Ex: A buyer's payment behavior over the last hour • Redis used for time series data (ZRANGE functions) • ZRANGE and ZREVRANGE offer the ability to retrieve elements from a Sorted Set based on their sorted position
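
A last-hour feature amounts to a range lookup over payments ordered by timestamp, which in a Redis Sorted Set would be a query by score (for example with ZRANGEBYSCORE). The sketch below imitates that lookup in memory with a sorted list so it runs without a Redis server. The class, function and feature names are illustrative assumptions.

```python
import bisect


class PaymentTimeline:
    """In-memory stand-in for a sorted set scored by payment timestamp."""

    def __init__(self):
        self._timestamps = []
        self._amounts = []

    def add(self, timestamp, amount):
        # Keep both lists ordered by timestamp.
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._amounts.insert(i, amount)

    def window(self, start, end):
        """Return amounts with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_right(self._timestamps, end)
        return self._amounts[lo:hi]


def last_hour_features(timeline, now):
    amounts = timeline.window(now - 3600, now)
    return {"count_1h": len(amounts), "sum_1h": sum(amounts)}


# Replaying hypothetical batch data: timestamps in seconds, amounts in any currency.
timeline = PaymentTimeline()
for timestamp, amount in [(0, 10.0), (1000, 20.0), (5000, 5.0)]:
    timeline.add(timestamp, amount)
print(last_hour_features(timeline, now=3600))
```

When simulating with batch data, computing each transaction's features before adding that transaction to the timeline keeps the simulation from leaking information the production system would not yet have.
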
- 65. References • https://medium.com/@neal_lathia/what-do-we-mean-when-we-talk-about-data-driven-products-127ceb3e6cf • https://www.slideshare.net/HadoopSummit/h20-a-platform-for-big-math • https://www.wikipedia.org/ • https://www.coursera.org/learn/machine-learning • http://www.r2d3.us/visual-intro-to-machine-learning-part-1/ • https://github.com/amueller/scipy_2015_sklearn_tutorial • https://redis.io/commands/ • https://github.com/Netflix/Hystrix • https://eng.uber.com/schemaless-part-one/ • https://backchannel.org/blog/friendfeed-schemaless-mysql • https://www.continuum.io/anaconda-overview
- 66. thanks 25/05/2017
