SlideShare a Scribd company logo
ML IN DATA PLATFORM
A Case Study with NLP Application
US Office
2150 Ringwood Ave, San Jose,
CA 95131
UK Office
3 Beeston Place, Belgravia,
London SW1W 0JJ, UK
Vietnam Office
Floor #1-4, 302 Le Van Sy,
Ward 1, Tan Binh District, HCMC,
Vietnam
SG Office
6A Shenton Way #04-08 OUE
Downtown Gallery Singapore 068815
2
Table of content
No Content
1 Introduction
2 Data Platform – ETL Process
3 Data Platform – Analytics Workflow
4 Afterthoughts
3
INTRODUCTION
01
1. Introduction to Case Study
2. Introduction to Data Platform
1.1.1. Potential Values of ML/NLP Application
4
- ML applications can bring new-found values
- Case study: Online Review Analytics
- Opinions from others increasingly guide customer's purchases
=> Growth, Improvement, Investment implications
Refs
- https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/five-star-growth-using-online-ratings-to-design-better-products
- https://www.thinkwithgoogle.com/consumer-insights/consumer-trends/customer-review-preference-statistics/
1.1.2. Dealing with text data
5
- An insight-mining platform for review text is highly valuable. It is difficult though
- Engineering challenges
- Getting the reviews => web-scraping, data collection
- Storing reviews => moving, maintaining, deduplicating large amount of texts
- Processing reviews => text cleaning, processing, and analytics at scale
- Analytics challenges
- Natural Language Processing – NLP
- Insight communication: dashboards and visualization
1.2.1. Data Platform overall architecture
6
1.2.2. Example: output from ETL Process
7
1.3. Example: output from Analytics Workflow
8
1.3. Example: insight communication – Web Application
9
10
ETL PROCESS
02
1. Extract, Transform, Load
2. Data Collection
3. Data Storage
2.1. Extract, Transform, Load
11
- Extract:
- Data Collector: collect data from websites
- Extract and Map from raw data collected
- Transform: clean up data (trim, special characters,…), deduplications, etc.
- Load: to databases for storage and analysis: MongoDB, BigQuery
- Batching: split large amount of data into batches for parallel processing
- Worker: a container that moves/processes data -> Mini-ETL
2.1. Data Collection: web-scraping
12
Web Scraper
2.1. Data Collection: Benefit & Challenge
13
Benefit Challenge
It’s Free
It’s Big Data
Fake Data
- Captcha
- IP Blocking
Hard to collect
- Javascript Rendering
2.1. Data Collection: How to deal with challenges?
14
WEB BROWSER
SELENIUM
PROXY
To avoid IPs blocking & Captcha
To overcome Javascript rendering
Control Browser by Code
Control Browser by Code
2.2. Data Storage
15
- PostgreSQL: store process metadata (used by orchestrator)
- Google Cloud Storage: store intermediary CSV files
- MongoDB: flexible, persistent storage for text documents. Allow easy and frequent
edits
- Google BigQuery: analytics data storage and distributed processing engine using
SQL – familiar language for Data Analysts
16
ANALYTICS WORKFLOW
03
1. First Implementation
2. Inference Services
3.1.1 Analytics Workflow
17
- After ETL process, data is available for further processing and analysis
- Analytics Workflow:
- A part of Data Platform
- Extract information from data for insights
- Machine Learning models are integral part of text analytics
- Information is extracted, and pushed to BigQuery for queries
3.1.2 First implementation
18
- Implement each model as a worker
- Advantages:
- Easy to implement
- Suitable for early stages: fast
implementation and acceptable
performance
- Several drawbacks: technical debts
- Mixing of concerns
- Low flexibility
- Limited scalability
3.1.3 First implementation: mixing of concerns
19
- Data Platform’s intended purpose: moving data, processing, and interacting with
various API on the way => mostly I/O operations
- Computationally-heavy tasks are usually delegated: e.g. to BigQuery
- Mixing I/O and computations
3.1.4 First implementation: scalability
20
- Everything seems ok, until
we must process many
reviews (100,000s -
1,000,000s, various
lengths, can be very long)
- Manual scaling: replicate
workers -> VM
resource/cost constraint
- GPU acceleration? -> ETL
workers don’t need GPU
3.1.5 First implementation: monitoring and maintenance
21
- No real monitoring components for performance degradation
- Data drift, concept drift?
- If needed, model is inspected manually
- Collect, process, re-train models manually
- Upload trained model to GCS, re-deploy workers
3.2.1 Inference Services: separation of concerns
22
- Income Inference Services
- No direct I/O for data, only accept
HTTP requests with input and
response with computed results
=> Easier to maintain and optimize both
ends
3.2.2 Inference Services: overall architecture
23
3.2.3 Inference Services: solving redundancy and reusability
24
- Each ML model is treated as a microservice
- Several ML models can be connected as an inference pipeline for complex tasks
- Promote reusability and flexibility => save resources
3.2.4 Inference Services: solving scalability
25
- Services are containerized, run, and deployed independently
- Can be migrate to any environment with relative ease
- For maximum scalability => K8s cluster (GKE) with autoscaling
- Thanks to K8s, deployment is easier.
- Rollout deployments: no/minimal downtime
3.2.5. Inference Services: monitoring
26
- Metrics are logged to a central data-lake and visualized in a
dashboard.
Image from https://www.datarobot.com/wiki/machine-learning-operations-mlops/
3.2.6. Inference Services: results and drawbacks
27
- Results
- A more flexible and effective solution
- More resilient ETL process: less complex
- Reduced ETL resource consumption and processing time
- New system of services can be developed and maintained separately
- Drawbacks
- Appearance of more infrastructures and tools -> management overhead
- Complex inter-dependency of inference services as it expands
- Requires more expertise in managing K8s clusters and deployment
28
WHAT WE LEARNED
04
4.1. What We Learned?
29
- ML Application can be tricky to be done right
- Not much resources and best practices
- Solved by: thorough analysis of use-cases
- Solved by: proper scoping and sizing
- Separating I/O Intensive from Computationally-intensive tasks
- ETL components
- ML components
- Good architecture design from the beginning can save time and cost later
- Over-engineered vs under-engineered
- Easy in hindsight, difficult in practice
Hope these ideas help you in designing your next ML Application
THANK YOU – Q&A

More Related Content

What's hot

Building Bizweb Microservices with Docker
Building Bizweb Microservices with DockerBuilding Bizweb Microservices with Docker
Building Bizweb Microservices with Docker
Khôi Nguyễn Minh
 
Seamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflowSeamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflow
Databricks
 
MLOps in action
MLOps in actionMLOps in action
MLOps in action
Pieter de Bruin
 
Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics
Araf Karsh Hamid
 
Seldon: Deploying Models at Scale
Seldon: Deploying Models at ScaleSeldon: Deploying Models at Scale
Seldon: Deploying Models at Scale
Seldon
 
Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
 Grokking Techtalk #39: How to build an event driven architecture with Kafka ... Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
Grokking VN
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
Databricks
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
Altinity Ltd
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
SOLID & Design Patterns
SOLID & Design PatternsSOLID & Design Patterns
SOLID & Design Patterns
Grokking VN
 
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Brian Brazil
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)
Siglos
 
Oracle http server installation on linux
Oracle http server installation on linuxOracle http server installation on linux
Oracle http server installation on linux
Ravi Kumar Lanke
 
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGateContinuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
Michael Rainey
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
Saurabh Kaushik
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking VN
 
Auto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADBAuto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADB
Databricks
 
Ml ops intro session
Ml ops   intro sessionMl ops   intro session
Ml ops intro session
Avinash Patil
 
Learn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML LifecycleLearn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML Lifecycle
Databricks
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Edureka!
 

What's hot (20)

Building Bizweb Microservices with Docker
Building Bizweb Microservices with DockerBuilding Bizweb Microservices with Docker
Building Bizweb Microservices with Docker
 
Seamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflowSeamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflow
 
MLOps in action
MLOps in actionMLOps in action
MLOps in action
 
Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics Apache Flink, AWS Kinesis, Analytics
Apache Flink, AWS Kinesis, Analytics
 
Seldon: Deploying Models at Scale
Seldon: Deploying Models at ScaleSeldon: Deploying Models at Scale
Seldon: Deploying Models at Scale
 
Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
 Grokking Techtalk #39: How to build an event driven architecture with Kafka ... Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
SOLID & Design Patterns
SOLID & Design PatternsSOLID & Design Patterns
SOLID & Design Patterns
 
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)
 
Oracle http server installation on linux
Oracle http server installation on linuxOracle http server installation on linux
Oracle http server installation on linux
 
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGateContinuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
 
Auto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADBAuto-Train a Time-Series Forecast Model With AML + ADB
Auto-Train a Time-Series Forecast Model With AML + ADB
 
Ml ops intro session
Ml ops   intro sessionMl ops   intro session
Ml ops intro session
 
Learn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML LifecycleLearn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML Lifecycle
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
 

Similar to Grokking Techtalk #42: Engineering challenges on building data platform for ML application

Five ways database modernization simplifies your data life
Five ways database modernization simplifies your data lifeFive ways database modernization simplifies your data life
Five ways database modernization simplifies your data life
SingleStore
 
MuleSoft Manchester Meetup #4 slides 11th February 2021
MuleSoft Manchester Meetup #4 slides 11th February 2021MuleSoft Manchester Meetup #4 slides 11th February 2021
MuleSoft Manchester Meetup #4 slides 11th February 2021
Ieva Navickaite
 
BigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLBigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQL
Márton Kodok
 
ESP POC Findings
ESP POC FindingsESP POC Findings
ESP POC Findings
kevin_donovan
 
MODERN DATA PIPELINE
MODERN DATA PIPELINEMODERN DATA PIPELINE
MODERN DATA PIPELINE
IRJET Journal
 
127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation
Nitesh Kumar
 
Accelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature EngineeringAccelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature Engineering
Cognizant
 
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
Chris Hoyean Song
 
Distributed Systems in Data Engineering
Distributed Systems in Data EngineeringDistributed Systems in Data Engineering
Distributed Systems in Data Engineering
Adetimehin Oluwasegun Matthew
 
Print report
Print reportPrint report
Print report
Ved Prakash
 
How to overcome challenges in it system evolution
How to overcome challenges in it system evolutionHow to overcome challenges in it system evolution
How to overcome challenges in it system evolution
Grupa Unity
 
Datawarehouse and reporting in service manager
Datawarehouse and reporting in service manager Datawarehouse and reporting in service manager
Datawarehouse and reporting in service manager
Eduardo Castro
 
Workshop: Delivering chnages for applications and databases
Workshop: Delivering chnages for applications and databasesWorkshop: Delivering chnages for applications and databases
Workshop: Delivering chnages for applications and databases
Eduardo Piairo
 
Internet of Things Microservices
Internet of Things MicroservicesInternet of Things Microservices
Internet of Things Microservices
Capgemini
 
Dataweave Libraries and ObjectStore
Dataweave Libraries and ObjectStoreDataweave Libraries and ObjectStore
Dataweave Libraries and ObjectStore
Vikalp Bhalia
 
Book store Black Book - Dinesh48
Book store Black Book - Dinesh48Book store Black Book - Dinesh48
Book store Black Book - Dinesh48
Dinesh Jogdand
 
Zakir_Hussain_cv
Zakir_Hussain_cvZakir_Hussain_cv
Zakir_Hussain_cv
zakir hussain
 
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
CARLOS III UNIVERSITY OF MADRID
 
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
IRJET Journal
 
Bank Management System.docx
Bank Management System.docxBank Management System.docx
Bank Management System.docx
Nikhil Patil
 

Similar to Grokking Techtalk #42: Engineering challenges on building data platform for ML application (20)

Five ways database modernization simplifies your data life
Five ways database modernization simplifies your data lifeFive ways database modernization simplifies your data life
Five ways database modernization simplifies your data life
 
MuleSoft Manchester Meetup #4 slides 11th February 2021
MuleSoft Manchester Meetup #4 slides 11th February 2021MuleSoft Manchester Meetup #4 slides 11th February 2021
MuleSoft Manchester Meetup #4 slides 11th February 2021
 
BigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLBigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQL
 
ESP POC Findings
ESP POC FindingsESP POC Findings
ESP POC Findings
 
MODERN DATA PIPELINE
MODERN DATA PIPELINEMODERN DATA PIPELINE
MODERN DATA PIPELINE
 
127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation
 
Accelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature EngineeringAccelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature Engineering
 
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
 
Distributed Systems in Data Engineering
Distributed Systems in Data EngineeringDistributed Systems in Data Engineering
Distributed Systems in Data Engineering
 
Print report
Print reportPrint report
Print report
 
How to overcome challenges in it system evolution
How to overcome challenges in it system evolutionHow to overcome challenges in it system evolution
How to overcome challenges in it system evolution
 
Datawarehouse and reporting in service manager
Datawarehouse and reporting in service manager Datawarehouse and reporting in service manager
Datawarehouse and reporting in service manager
 
Workshop: Delivering chnages for applications and databases
Workshop: Delivering chnages for applications and databasesWorkshop: Delivering chnages for applications and databases
Workshop: Delivering chnages for applications and databases
 
Internet of Things Microservices
Internet of Things MicroservicesInternet of Things Microservices
Internet of Things Microservices
 
Dataweave Libraries and ObjectStore
Dataweave Libraries and ObjectStoreDataweave Libraries and ObjectStore
Dataweave Libraries and ObjectStore
 
Book store Black Book - Dinesh48
Book store Black Book - Dinesh48Book store Black Book - Dinesh48
Book store Black Book - Dinesh48
 
Zakir_Hussain_cv
Zakir_Hussain_cvZakir_Hussain_cv
Zakir_Hussain_cv
 
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
 
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
 
Bank Management System.docx
Bank Management System.docxBank Management System.docx
Bank Management System.docx
 

More from Grokking VN

Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banksGrokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking VN
 
Grokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles ThinkingGrokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles Thinking
Grokking VN
 
Grokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystifiedGrokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystified
Grokking VN
 
Grokking Techtalk #39: Gossip protocol and applications
Grokking Techtalk #39: Gossip protocol and applicationsGrokking Techtalk #39: Gossip protocol and applications
Grokking Techtalk #39: Gossip protocol and applications
Grokking VN
 
Grokking Techtalk #38: Escape Analysis in Go compiler
 Grokking Techtalk #38: Escape Analysis in Go compiler Grokking Techtalk #38: Escape Analysis in Go compiler
Grokking Techtalk #38: Escape Analysis in Go compiler
Grokking VN
 
Grokking Techtalk #37: Data intensive problem
 Grokking Techtalk #37: Data intensive problem Grokking Techtalk #37: Data intensive problem
Grokking Techtalk #37: Data intensive problem
Grokking VN
 
Grokking Techtalk #37: Software design and refactoring
 Grokking Techtalk #37: Software design and refactoring Grokking Techtalk #37: Software design and refactoring
Grokking Techtalk #37: Software design and refactoring
Grokking VN
 
Grokking TechTalk #35: Efficient spellchecking
Grokking TechTalk #35: Efficient spellcheckingGrokking TechTalk #35: Efficient spellchecking
Grokking TechTalk #35: Efficient spellchecking
Grokking VN
 
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer... Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
Grokking VN
 
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking VN
 
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at ScaleGrokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking VN
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking VN
 
Grokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search TreeGrokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search Tree
Grokking VN
 
Grokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the MagicGrokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the Magic
Grokking VN
 
Grokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platformGrokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platform
Grokking VN
 
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking VN
 
Grokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocols
Grokking VN
 
Grokking TechTalk #21: Deep Learning in Computer Vision
Grokking TechTalk #21: Deep Learning in Computer VisionGrokking TechTalk #21: Deep Learning in Computer Vision
Grokking TechTalk #21: Deep Learning in Computer Vision
Grokking VN
 
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking VN
 
Grokking TechTalk #19: Software Development Cycle In The International Moneta...
Grokking TechTalk #19: Software Development Cycle In The International Moneta...Grokking TechTalk #19: Software Development Cycle In The International Moneta...
Grokking TechTalk #19: Software Development Cycle In The International Moneta...
Grokking VN
 

More from Grokking VN (20)

Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banksGrokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
Grokking Techtalk #46: Lessons from years hacking and defending Vietnamese banks
 
Grokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles ThinkingGrokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles Thinking
 
Grokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystifiedGrokking Techtalk #43: Payment gateway demystified
Grokking Techtalk #43: Payment gateway demystified
 
Grokking Techtalk #39: Gossip protocol and applications
Grokking Techtalk #39: Gossip protocol and applicationsGrokking Techtalk #39: Gossip protocol and applications
Grokking Techtalk #39: Gossip protocol and applications
 
Grokking Techtalk #38: Escape Analysis in Go compiler
 Grokking Techtalk #38: Escape Analysis in Go compiler Grokking Techtalk #38: Escape Analysis in Go compiler
Grokking Techtalk #38: Escape Analysis in Go compiler
 
Grokking Techtalk #37: Data intensive problem
 Grokking Techtalk #37: Data intensive problem Grokking Techtalk #37: Data intensive problem
Grokking Techtalk #37: Data intensive problem
 
Grokking Techtalk #37: Software design and refactoring
 Grokking Techtalk #37: Software design and refactoring Grokking Techtalk #37: Software design and refactoring
Grokking Techtalk #37: Software design and refactoring
 
Grokking TechTalk #35: Efficient spellchecking
Grokking TechTalk #35: Efficient spellcheckingGrokking TechTalk #35: Efficient spellchecking
Grokking TechTalk #35: Efficient spellchecking
 
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer... Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
 
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at ScaleGrokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
 
Grokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search TreeGrokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search Tree
 
Grokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the MagicGrokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the Magic
 
Grokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platformGrokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platform
 
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
 
Grokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocols
 
Grokking TechTalk #21: Deep Learning in Computer Vision
Grokking TechTalk #21: Deep Learning in Computer VisionGrokking TechTalk #21: Deep Learning in Computer Vision
Grokking TechTalk #21: Deep Learning in Computer Vision
 
Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101Grokking TechTalk #20: PostgreSQL Internals 101
Grokking TechTalk #20: PostgreSQL Internals 101
 
Grokking TechTalk #19: Software Development Cycle In The International Moneta...
Grokking TechTalk #19: Software Development Cycle In The International Moneta...Grokking TechTalk #19: Software Development Cycle In The International Moneta...
Grokking TechTalk #19: Software Development Cycle In The International Moneta...
 

Recently uploaded

Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 

Recently uploaded (20)

Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 

Grokking Techtalk #42: Engineering challenges on building data platform for ML application

  • 1. ML IN DATA PLATFORM A Case Study with NLP Application US Office 2150 Ringwood Ave, San Jose, CA 95131 UK Office 3 Beeston Place, Belgravia, London SW1W 0JJ, UK Vietnam Office Floor #1-4, 302 Le Van Sy, Ward 1, Tan Binh District, HCMC, Vietnam SG Office 6A Shenton Way #04-08 OUE Downtown Gallery Singapore 068815
  • 2. 2 Table of content No Content 1 Introduction 2 Data Platform – ETL Process 3 Data Platform – Analytics Workflow 4 Afterthoughts
  • 3. 3 INTRODUCTION 01 1. Introduction to Case Study 2. Introduction to Data Platform
  • 4. 1.1.1. Potential Values of ML/NLP Application 4 - ML applications can bring new-found values - Case study: Online Review Analytics - Opinions from others increasingly guide customer's purchases => Growth, Improvement, Investment implications Refs - https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/five-star-growth-using-online-ratings-to-design-better-products - https://www.thinkwithgoogle.com/consumer-insights/consumer-trends/customer-review-preference-statistics/
  • 5. 1.1.2. Dealing with text data 5 - An insight-mining platform for review text is highly valuable. It is difficult though - Engineering challenges - Getting the reviews => web-scraping, data collection - Storing reviews => moving, maintaining, deduplicating large amount of texts - Processing reviews => text cleaning, processing, and analytics at scale - Analytics challenges - Natural Language Processing – NLP - Insight communication: dashboards and visualization
  • 6. 1.2.1. Data Platform overall architecture 6
  • 7. 1.2.2. Example: output from ETL Process 7
  • 8. 1.3. Example: output from Analytics Workflow 8
  • 9. 1.3. Example: insight communication – Web Application 9
  • 10. 10 ETL PROCESS 02 1. Extract, Transform, Load 2. Data Collection 3. Data Storage
  • 11. 2.1. Extract, Transform, Load 11 - Extract: - Data Collector: collect data from websites - Extract and Map from raw data collected - Transform: clean up data (trim, special characters,…), deduplications, etc. - Load: to databases for storage and analysis: MongoDB, BigQuery - Batching: split large amount of data into batches for parallel processing - Worker: a container that moves/processes data -> Mini-ETL
  • 12. 2.1. Data Collection: web-scraping 12 Web Scraper
  • 13. 2.1. Data Collection: Benefit & Challenge 13 Benefit Challenge It’s Free It’s Big Data Fake Data - Captcha - IP Blocking Hard to collect - Javascript Rendering
  • 14. 2.1. Data Collection: How to deal with challenges? 14 WEB BROWSER SELENIUM PROXY To avoid IPs blocking & Captcha To overcome Javascript rendering Control Browser by Code Control Browser by Code
  • 15. 2.2. Data Storage 15 - PostgreSQL: store process metadata (used by orchestrator) - Google Cloud Storage: store intermediary CSV files - MongoDB: flexible, persistent storage for text documents. Allow easy and frequent edits - Google BigQuery: analytics data storage and distributed processing engine using SQL – familiar language for Data Analysts
  • 16. 16 ANALYTICS WORKFLOW 03 1. First Implementation 2. Inference Services
  • 17. 3.1.1 Analytics Workflow 17 - After ETL process, data is available for further processing and analysis - Analytics Workflow: - A part of Data Platform - Extract information from data for insights - Machine Learning models are integral part of text analytics - Information is extracted, and pushed to BigQuery for queries
  • 18. 3.1.2 First implementation 18 - Implement each model as a worker - Advantages: - Easy to implement - Suitable for early stages: fast implementation and acceptable performance - Several drawbacks: technical debts - Mixing of concerns - Low flexibility - Limited scalability
  • 19. 3.1.3 First implementation: mixing of concerns 19 - Data Platform’s intended purpose: moving data, processing, and interacting with various API on the way => mostly I/O operations - Computationally-heavy tasks are usually delegated: e.g. to BigQuery - Mixing I/O and computations
  • 20. 3.1.4 First implementation: scalability 20 - Everything seems ok, until we must process many reviews (100,000s - 1,000,000s, various lengths, can be very long) - Manual scaling: replicate workers -> VM resource/cost constraint - GPU acceleration? -> ETL workers don’t need GPU
  • 21. 3.1.5 First implementation: monitoring and maintenance 21 - No real monitoring components for performance degradation - Data drift, concept drift? - If needed, model is inspected manually - Collect, process, re-train models manually - Upload trained model to GCS, re-deploy workers
  • 22. 3.2.1 Inference Services: separation of concerns 22 - Income Inference Services - No direct I/O for data, only accept HTTP requests with input and response with computed results => Easier to maintain and optimize both ends
  • 23. 3.2.2 Inference Services: overall architecture 23
  • 24. 3.2.3 Inference Services: solving redundancy and reusability 24 - Each ML model is treated as a microservice - Several ML models can be connected as an inference pipeline for complex tasks - Promote reusability and flexibility => save resources
  • 25. 3.2.4 Inference Services: solving scalability 25 - Services are containerized, run, and deployed independently - Can be migrate to any environment with relative ease - For maximum scalability => K8s cluster (GKE) with autoscaling - Thanks to K8s, deployment is easier. - Rollout deployments: no/minimal downtime
  • 26. 3.2.5. Inference Services: monitoring 26 - Metrics are logged to a central data-lake and visualized in a dashboard. Image from https://www.datarobot.com/wiki/machine-learning-operations-mlops/
  • 27. 3.2.6. Inference Services: results and drawbacks 27 - Results - A more flexible and effective solution - More resilient ETL process: less complex - Reduced ETL resource consumption and processing time - New system of services can be developed and maintained separately - Drawbacks - Appearance of more infrastructures and tools -> management overhead - Complex inter-dependency of inference services as it expands - Requires more expertise in managing K8s clusters and deployment
  • 29. 4.1. What We Learned? 29 - ML Application can be tricky to be done right - Not much resources and best practices - Solved by: thorough analysis of use-cases - Solved by: proper scoping and sizing - Separating I/O Intensive from Computationally-intensive tasks - ETL components - ML components - Good architecture design from the beginning can save time and cost later - Over-engineered vs under-engineered - Easy in hindsight, difficult in practice Hope these ideas help you in designing your next ML Application