ML IN DATA PLATFORM
A Case Study with NLP Application
US Office
2150 Ringwood Ave, San Jose,
CA 95131
UK Office
3 Beeston Place, Belgravia,
London SW1W 0JJ, UK
Vietnam Office
Floor #1-4, 302 Le Van Sy,
Ward 1, Tan Binh District, HCMC,
Vietnam
SG Office
6A Shenton Way #04-08 OUE
Downtown Gallery Singapore 068815
Table of contents
No. Content
1 Introduction
2 Data Platform – ETL Process
3 Data Platform – Analytics Workflow
4 Afterthoughts
INTRODUCTION
01
1. Introduction to Case Study
2. Introduction to Data Platform
1.1.1. Potential Values of ML/NLP Application
- ML applications can unlock new value
- Case study: Online Review Analytics
- Opinions from others increasingly guide customers' purchases
=> Growth, Improvement, Investment implications
Refs
- https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/five-star-growth-using-online-ratings-to-design-better-products
- https://www.thinkwithgoogle.com/consumer-insights/consumer-trends/customer-review-preference-statistics/
1.1.2. Dealing with text data
- An insight-mining platform for review text is highly valuable, but difficult to build
- Engineering challenges
  - Getting the reviews => web-scraping, data collection
  - Storing reviews => moving, maintaining, deduplicating large amounts of text
  - Processing reviews => text cleaning, processing, and analytics at scale
- Analytics challenges
  - Natural Language Processing – NLP
  - Insight communication: dashboards and visualization
1.2.1. Data Platform overall architecture
1.2.2. Example: output from ETL Process
1.3. Example: output from Analytics Workflow
1.3. Example: insight communication – Web Application
ETL PROCESS
02
1. Extract, Transform, Load
2. Data Collection
3. Data Storage
2.1. Extract, Transform, Load
- Extract:
  - Data Collector: collect data from websites
  - Extract and map fields from the raw data collected
- Transform: clean up data (trimming, special characters, …), deduplication, etc.
- Load: into databases for storage and analysis (MongoDB, BigQuery)
- Batching: split large amounts of data into batches for parallel processing
- Worker: a container that moves/processes data -> Mini-ETL
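The batching and worker ideas can be sketched in a few lines of Python. This is only an illustration: the record fields (`id`, `text`), the cleaning rules, and the in-memory list standing in for the database are assumptions, not the platform's actual code.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(records, size):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def transform(record):
    """Clean one raw review: drop control characters, collapse whitespace, trim."""
    text = re.sub(r"[\x00-\x1f\x7f]", " ", record["text"])
    text = re.sub(r"\s+", " ", text).strip()
    return {"id": record["id"], "text": text}


def mini_etl(batch):
    """One worker run: transform a batch and deduplicate on the cleaned text."""
    seen, out = set(), []
    for rec in map(transform, batch):
        key = rec["text"].lower()
        if key and key not in seen:
            seen.add(key)
            out.append(rec)
    return out


def run(records, batch_size=2, workers=4):
    """Process batches in parallel; `loaded` stands in for the database."""
    loaded = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(mini_etl, batches(records, batch_size)):
            loaded.extend(result)  # "Load" step (hypothetical in-memory target)
    return loaded
```

Note that this sketch only deduplicates within a batch; deduplicating across batches would need a shared key, for example a unique index in the target store.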
2.1. Data Collection: web-scraping
Web Scraper
2.1. Data Collection: Benefit & Challenge
Benefits:
- It’s free
- It’s big data
Challenges:
- Fake data
- Hard to collect
  - Captcha
  - IP blocking
  - JavaScript rendering
2.1. Data Collection: How to deal with challenges?
- Proxy: to avoid IP blocking and Captcha
- Web browser: to overcome JavaScript rendering
- Selenium: to control the browser by code
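A rough sketch of how a proxy and a Selenium-driven browser might be combined, in Python. The proxy addresses and the helper names (`chrome_arguments`, `fetch_rendered_html`, `scrape`) are hypothetical, Chrome is assumed as the browser, and the `--headless=new` flag assumes a recent Chrome version.

```python
import itertools


def chrome_arguments(proxy, headless=True):
    """Build Chrome command-line flags that route traffic through `proxy`."""
    args = [f"--proxy-server={proxy}"]
    if headless:
        args.append("--headless=new")
    return args


def fetch_rendered_html(url, proxy):
    """Load `url` in a real browser so JavaScript runs, then return the DOM."""
    from selenium import webdriver  # third-party package, imported lazily

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(proxy):
        options.add_argument(arg)
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


# Hypothetical proxy pool; cycling through it spreads requests over several IPs.
PROXIES = itertools.cycle(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])


def scrape(urls):
    """Fetch each URL through the next proxy in the pool."""
    return [fetch_rendered_html(url, next(PROXIES)) for url in urls]
```

Starting a fresh browser per page is slow; a real collector would probably reuse a driver per proxy and wait explicitly for the elements it needs.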
2.2. Data Storage
- PostgreSQL: stores process metadata (used by the orchestrator)
- Google Cloud Storage: stores intermediary CSV files
- MongoDB: flexible, persistent storage for text documents; allows easy and frequent edits
- Google BigQuery: analytics data storage and distributed processing engine using SQL, a language familiar to Data Analysts
ANALYTICS WORKFLOW
03
1. First Implementation
2. Inference Services
3.1.1 Analytics Workflow
- After the ETL process, data is available for further processing and analysis
- Analytics Workflow:
  - A part of the Data Platform
  - Extracts information from data for insights
- Machine Learning models are an integral part of text analytics
- Extracted information is pushed to BigQuery for querying
3.1.2 First implementation
- Implement each model as a worker
- Advantages:
  - Easy to implement
  - Suitable for early stages: fast implementation and acceptable performance
- Several drawbacks (technical debt):
  - Mixing of concerns
  - Low flexibility
  - Limited scalability
3.1.3 First implementation: mixing of concerns
- Data Platform’s intended purpose: moving data, processing it, and interacting with various APIs along the way => mostly I/O operations
- Computationally heavy tasks are usually delegated, e.g. to BigQuery
- Running models inside workers mixes I/O with heavy computation
3.1.4 First implementation: scalability
- Everything seems OK until we must process many reviews (100,000s to 1,000,000s, of various lengths, some very long)
- Manual scaling: replicate workers -> VM resource/cost constraints
- GPU acceleration? -> ETL workers don’t need GPUs
3.1.5 First implementation: monitoring and maintenance
- No real monitoring components for performance degradation
- Data drift, concept drift?
- If needed, the model is inspected manually
- Data collection, processing, and model re-training are done manually
- The trained model is uploaded to GCS and the workers are re-deployed
3.2.1 Inference Services: separation of concerns
- Enter Inference Services
- No direct data I/O: they only accept HTTP requests carrying the input and respond with the computed results
=> Both ends are easier to maintain and optimize
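To make the request/response contract concrete, here is a minimal sketch of such a service using only the Python standard library. The JSON shape (`{"text": ...}` in, `{"label": ..., "score": ...}` out), the port, and the word-counting `predict` function are assumptions for illustration; a real service would wrap a trained model and likely use a production web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def predict(text):
    """Toy stand-in for a model: count positive vs. negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"label": label, "score": score}


def handle_payload(body):
    """Pure request logic: JSON bytes in, (status, JSON bytes) out."""
    try:
        text = json.loads(body)["text"]
        if not isinstance(text, str):
            raise TypeError("text must be a string")
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected JSON with a string 'text' field"}
        return 400, json.dumps(error).encode()
    return 200, json.dumps(predict(text)).encode()


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_payload(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Block and serve predictions; callers only ever see HTTP."""
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

The service touches no database or bucket, so the ETL worker keeps all data I/O and the model can be scaled or swapped on its own.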
3.2.2 Inference Services: overall architecture
3.2.3 Inference Services: solving redundancy and reusability
- Each ML model is treated as a microservice
- Several ML models can be connected into an inference pipeline for complex tasks
- Promotes reusability and flexibility => saves resources
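The composition idea can be sketched as follows, with plain Python functions standing in for calls to separate model services. The three stages and their logic are invented for illustration; in the architecture described here each stage would be an HTTP call to its own microservice.

```python
from typing import Callable, Dict

Stage = Callable[[Dict], Dict]


def detect_language(doc):
    """Stand-in for a language-detection service."""
    is_ascii = all(ord(c) < 128 for c in doc["text"])
    return {**doc, "lang": "en" if is_ascii else "other"}


def sentiment(doc):
    """Stand-in for a sentiment service."""
    label = "positive" if "good" in doc["text"].lower() else "neutral"
    return {**doc, "sentiment": label}


def keywords(doc):
    """Stand-in for a keyword-extraction service."""
    words = [w.strip(".,!?").lower() for w in doc["text"].split()]
    return {**doc, "keywords": sorted({w for w in words if len(w) > 4})}


def pipeline(*stages: Stage) -> Stage:
    """Chain stages so each one enriches the document from the previous one."""
    def run(doc):
        for stage in stages:
            doc = stage(doc)
        return doc
    return run


# The same stage is reused in two different pipelines.
review_insights = pipeline(detect_language, sentiment, keywords)
language_only = pipeline(detect_language)
```

Because every stage has the same dict-in, dict-out shape, a new task is a new ordering of existing services rather than a new worker.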
3.2.4 Inference Services: solving scalability
- Services are containerized, and are run and deployed independently
- Can be migrated to any environment with relative ease
- For maximum scalability => K8s cluster (GKE) with autoscaling
- Thanks to K8s, deployment is easier
- Rollout deployments: no/minimal downtime
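The slides do not say which autoscaler is used. Assuming the standard Kubernetes Horizontal Pod Autoscaler, its documented scaling rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). A small Python sketch of that arithmetic, where the replica bounds are illustrative defaults:

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count an HPA-style rule would ask for, clamped to the bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 4 replicas averaging 90% CPU against a 60% target would scale to 6, while the same 4 replicas at 30% would scale down to 2.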
3.2.5. Inference Services: monitoring
- Metrics are logged to a central data lake and visualized in a dashboard.
Image from https://www.datarobot.com/wiki/machine-learning-operations-mlops/
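As a sketch of what per-request metric logging could look like, the Python below wraps a prediction call and emits one JSON record per request. The field names, the list used as a stand-in for the data lake, and the `summarize` aggregation are all assumptions; the slides do not describe the actual metrics schema.

```python
import json
import time


def timed_prediction(model_name, predict, text, sink):
    """Run `predict(text)` and append one JSON metrics record to `sink`."""
    start = time.perf_counter()
    result = predict(text)
    record = {
        "model": model_name,
        "latency_ms": round((time.perf_counter() - start) * 1000, 3),
        "input_chars": len(text),
        "prediction": result,
        "ts": time.time(),
    }
    sink.append(json.dumps(record))  # stand-in for shipping to the data lake
    return result


def summarize(sink):
    """Aggregate the records into numbers a dashboard panel might plot."""
    rows = [json.loads(line) for line in sink]
    if not rows:
        return {"requests": 0, "mean_input_chars": 0.0, "prediction_counts": {}}
    counts = {}
    for row in rows:
        counts[row["prediction"]] = counts.get(row["prediction"], 0) + 1
    return {
        "requests": len(rows),
        "mean_input_chars": sum(r["input_chars"] for r in rows) / len(rows),
        "prediction_counts": counts,
    }
```

A sustained shift in input length or in the mix of predicted labels is one simple signal that the incoming data may be drifting, which the first implementation had no way to notice.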
3.2.6. Inference Services: results and drawbacks
- Results
  - A more flexible and effective solution
  - More resilient ETL process: less complex
  - Reduced ETL resource consumption and processing time
  - The new system of services can be developed and maintained separately
- Drawbacks
  - More infrastructure and tools -> management overhead
  - Complex inter-dependencies among inference services as the system expands
  - Requires more expertise in managing K8s clusters and deployments
WHAT WE LEARNED
04
4.1. What We Learned?
- ML applications can be tricky to get right
  - Few resources and established best practices
  - Solved by: thorough analysis of use cases
  - Solved by: proper scoping and sizing
- Separate I/O-intensive tasks from computationally intensive tasks
  - ETL components
  - ML components
- Good architecture design from the beginning can save time and cost later
  - Over-engineered vs under-engineered
  - Easy in hindsight, difficult in practice
Hope these ideas help you in designing your next ML Application
THANK YOU – Q&A

Grokking Techtalk #42: Engineering challenges on building data platform for ML application
