Machine Learning at Scale with MLflow and Apache Spark


Société Générale is one of the major banks in France and has many data science teams across the globe. After years of exploration and prototyping, it is time for the company to deploy machine learning projects to production at scale.

To achieve that goal, we have been working hard to define a standard process of collaboration between data engineers and data scientists. We have also designed and deployed an infrastructure for productionizing machine learning.

This presentation covers the following points of our adventure:
1. The difficulties we had putting ML applications into production, such as the lack of a model registry, the difficulty of deploying ML libraries to our Hadoop cluster, and collaboration between data scientists and data engineers.
2. How we deployed MLflow as a key technical component in our production Hadoop environment, given various security constraints.
3. How we built a CI/CD pipeline to deploy ML applications automatically. MLflow plays an important role in this pipeline.
4. A first concrete production project developed on top of this infrastructure with MLflow, Spark Streaming, Sklearn and CI/CD.

The key takeaways of this presentation are:
1. To productionize machine learning in a large organization like Société Générale, the collaboration process must be clearly defined.
2. An ML model registry is key to ML productionization. MLflow is the best solution we found.
3. A CI/CD pipeline is essential to the success of a machine learning application.
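
To make the model registry takeaway more concrete, here is a minimal sketch of what such a registry provides: versioned model artifacts stored together with their metrics, and a way to look up the latest or best version. It is a toy in-memory stand-in written for illustration, not MLflow's API. The class and method names, the model name `news_scorer`, the artifact file names and the metric values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ModelVersion:
    name: str
    version: int
    artifact: Any  # stand-in for a serialized model file
    metrics: Dict[str, float] = field(default_factory=dict)


class ToyModelRegistry:
    """Hypothetical in-memory registry; a real one persists artifacts and metadata."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, artifact: Any, metrics: Dict[str, float]) -> ModelVersion:
        # Versions are numbered from 1 in registration order.
        versions = self._versions.setdefault(name, [])
        model_version = ModelVersion(name, len(versions) + 1, artifact, dict(metrics))
        versions.append(model_version)
        return model_version

    def latest(self, name: str) -> ModelVersion:
        return self._versions[name][-1]

    def best(self, name: str, metric: str) -> ModelVersion:
        # Pick the version with the highest value of the given metric.
        return max(self._versions[name], key=lambda mv: mv.metrics[metric])


# Made-up example values, for illustration only.
registry = ToyModelRegistry()
registry.register("news_scorer", artifact="model-v1.pkl", metrics={"auc": 0.81})
registry.register("news_scorer", artifact="model-v2.pkl", metrics={"auc": 0.79})
```

With a registry like this, a deployment job can ask for `registry.best("news_scorer", "auc")` instead of relying on someone copying a model file by hand, which is the kind of manual step the talk identifies as a pain point.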



  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Chongguang LIU, Société Générale. Machine learning at scale with MLflow and Spark. #UnifiedDataAnalytics #SparkAISummit
  3. About me
     • Studied computer science and engineering
     • Data engineer at SocGen
     • Using Spark and MLflow at work
     • Skiing and diving during vacations
  4. Data is strategic at SocGen
     • SocGen is a French multinational bank.
     • We have 80+ data pipelines in production in our data lake.
     • More than 200 data scientists work across the globe.
     • Data allows us to create new products, improve customer experience and be more efficient.
     • Relevant use cases include anti-money laundering, fraud detection and automatic document analysis.
  5. But also a lot of pain points ...
     Roles shown: Business, Data scientist, Data engineer
     • Manually copy training data
     • Code rewrite in another programming language
     • No automated data flow
     • Manually deploy models
     • Difficult to use ML models
     • Suboptimal predictions
     • Models rarely updated
     • Limited value for business!
  6. Finally we realised that ...
     Figure label: ML Code. Source: Hidden technical debt in machine learning systems, 2015, Google
  7. Challenge 1: data locality
     • A central Hadoop cluster
     • Client data, transaction data, accounting data etc.
     • Automated data pipelines
     • Banking industry is highly regulated, sensitive data is kept in the data lake for security reasons.
     Callout: training and prediction inside the data lake
  8. Challenge 2: application reliability
     Diagram labels: prototyping phase
  9. Challenge 2: application reliability
     Diagram labels: production phase, Yarn, Data node (×3)
  10. Challenge 3: variant Python packages
     Diagram labels: python code, python code + conda env (×2)
  11. Challenge 4: model management
  12. Challenge 4: model management
     Diagram labels: MLflow tracking server, HDFS, Data node (×3), ML models, ML meta data
  13. Challenge 5: tracking server reliability
     Diagram labels: Tracking server (×2), HA proxy, PostgreSQL, Data node
  14. Challenge 6: model serving
     Diagram labels: MLflow, HDFS, Knox, API server, Kafka, Spark streaming
  15. A concrete example
     Diagram labels: Web app, DB, ML model server, news, HDFS, feedback, Kafka, score, MLflow, Spark + Sklearn, new model, periodic model retraining, Spark streaming, model
  16. Moving forward
     • Model drift monitoring
     • A/B testing
     • pandas_udf
     • koalas
  17. Thank you!
  18. Don't forget to rate and review the sessions. Search Spark + AI Summit.
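
The flow on slide 15 (news items arrive on a stream, the current model scores them, and a periodically retrained model replaces it) can be sketched in plain Python. This is a toy stand-in under stated assumptions, not the speaker's implementation: there is no Kafka, Spark or MLflow here, a list stands in for the stream, a keyword-weight function stands in for the Sklearn model, the feedback list is static, and every function name is hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Model = Callable[[str], float]


def keyword_model(keywords: Dict[str, float]) -> Model:
    """Build a toy scorer: the sum of the weights of known words in the text."""
    def score(text: str) -> float:
        return sum(keywords.get(word, 0.0) for word in text.lower().split())
    return score


def retrain(feedback: List[Tuple[str, float]]) -> Model:
    """Toy 'retraining': a word's weight is the mean label of the texts containing it."""
    labels_by_word: Dict[str, List[float]] = {}
    for text, label in feedback:
        for word in set(text.lower().split()):
            labels_by_word.setdefault(word, []).append(label)
    return keyword_model({w: sum(v) / len(v) for w, v in labels_by_word.items()})


def score_stream(
    events: Iterable[str],
    model: Model,
    feedback: List[Tuple[str, float]],
    retrain_every: int,
) -> List[float]:
    """Score each event with the current model, swapping in a retrained one periodically."""
    scores: List[float] = []
    for i, text in enumerate(events, start=1):
        scores.append(model(text))
        if i % retrain_every == 0 and feedback:
            model = retrain(feedback)  # later events are scored by the new model
    return scores


# Made-up example data, for illustration only.
initial = keyword_model({"merger": 1.0})
feedback = [("bank merger announced", 1.0), ("weather report", 0.0)]
scores = score_stream(
    ["merger talks", "bank results", "bank merger"], initial, feedback, retrain_every=2
)
```

The point of the sketch is the hot swap: the scoring loop never stops, and the third event is scored by the retrained model rather than the initial one. In the deck, the corresponding roles appear to be played by Spark streaming for scoring and MLflow for handing over the new model.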
