Grokking Techtalk #42: Engineering challenges on building data platform for ML application

ML IN DATA PLATFORM
A Case Study with NLP Application
US Office
2150 Ringwood Ave, San Jose,
CA 95131
UK Office
3 Beeston Place, Belgravia,
London SW1W 0JJ, UK
Vietnam Office
Floor #1-4, 302 Le Van Sy,
Ward 1, Tan Binh District, HCMC,
Vietnam
SG Office
6A Shenton Way #04-08 OUE
Downtown Gallery Singapore 068815

Table of contents
No. Content
1 Introduction
2 Data Platform – ETL Process
3 Data Platform – Analytics Workflow
4 Afterthoughts

INTRODUCTION
01
1. Introduction to Case Study
2. Introduction to Data Platform

1.1.1. Potential Values of ML/NLP Application
- ML applications can bring new-found value
- Case study: Online Review Analytics
- Opinions from others increasingly guide customers' purchases
=> Growth, Improvement, Investment implications
Refs
- https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/five-star-growth-using-online-ratings-to-design-better-products
- https://www.thinkwithgoogle.com/consumer-insights/consumer-trends/customer-review-preference-statistics/

1.1.2. Dealing with text data
- An insight-mining platform for review text is highly valuable, but difficult to build
- Engineering challenges
- Getting the reviews => web-scraping, data collection
- Storing reviews => moving, maintaining, and deduplicating large amounts of text
- Processing reviews => text cleaning, processing, and analytics at scale
- Analytics challenges
- Natural Language Processing – NLP
- Insight communication: dashboards and visualization

1.2.1. Data Platform overall architecture

1.2.2. Example: output from ETL Process

1.3. Example: output from Analytics Workflow

1.3. Example: insight communication – Web Application

ETL PROCESS
02
1. Extract, Transform, Load
2. Data Collection
3. Data Storage

2.1. Extract, Transform, Load
- Extract:
- Data Collector: collect data from websites
- Extract and Map from raw data collected
- Transform: clean up data (trimming, special characters, …), deduplication, etc.
- Load: to databases for storage and analysis: MongoDB, BigQuery
- Batching: split large amounts of data into batches for parallel processing
- Worker: a container that moves/processes data -> Mini-ETL
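The Transform and Batching steps above can be sketched in a few lines. This is only an illustration under assumptions: the function names, the choice of a SHA-256 key over the lowercased text for deduplication, and the batch size are hypothetical and not taken from the talk.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator, List


def clean_text(raw: str) -> str:
    """Normalize unicode, drop control characters, collapse whitespace, trim."""
    text = unicodedata.normalize("NFKC", raw)
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(reviews: Iterable[str]) -> Iterator[str]:
    """Yield each cleaned review once, keyed by a hash of its lowercased text."""
    seen = set()
    for raw in reviews:
        text = clean_text(raw)
        if not text:
            continue  # skip reviews that are empty after cleaning
        key = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield text


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split a stream into lists of at most `size` items, one list per worker run."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch
```

Each batch could then be handed to one worker, so cleaning and loading run in parallel. Exact-match hashing only catches identical texts; near-duplicates would need a different technique.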

2.1. Data Collection: web-scraping
Web Scraper

2.1. Data Collection: Benefit & Challenge
Benefit
- It’s free
- It’s big data
Challenge
- Fake data
- Hard to collect: Captcha, IP blocking, JavaScript rendering

2.1. Data Collection: How to deal with challenges?
- Proxy: to avoid IP blocking & Captcha
- Web browser: to overcome JavaScript rendering
- Selenium: control the browser by code
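A minimal sketch of how a proxy and a code-controlled browser might be combined. The proxy addresses and helper names are placeholders, not the speakers' setup. `make_driver` needs the `selenium` package and a local Chrome, so it is imported lazily and the pure helpers run without it.

```python
from itertools import cycle
from typing import Iterator, List

# Hypothetical proxy pool; real endpoints would come from a proxy provider.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


def chrome_args(proxy: str, headless: bool = True) -> List[str]:
    """Build Chrome command-line flags that route traffic through `proxy`."""
    args = [f"--proxy-server={proxy}"]
    if headless:
        args.append("--headless=new")
    return args


def make_driver(proxy: str):
    """Start a Selenium-controlled Chrome (requires selenium and Chrome)."""
    from selenium import webdriver  # lazy import: helpers above work without it

    options = webdriver.ChromeOptions()
    for arg in chrome_args(proxy):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url: str, proxy: str) -> str:
    """Load a page in a real browser so its JavaScript runs, then return the DOM."""
    driver = make_driver(proxy)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


# Rotate proxies across requests so no single IP carries all the traffic.
proxy_pool: Iterator[str] = cycle(PROXIES)
```

A call such as `fetch_rendered_html(url, next(proxy_pool))` would use a different proxy on each request. Pages that load content after the initial render may also need explicit waits before reading `page_source`.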

2.2. Data Storage
- PostgreSQL: store process metadata (used by orchestrator)
- Google Cloud Storage: store intermediary CSV files
- MongoDB: flexible, persistent storage for text documents. Allows easy and frequent edits
- Google BigQuery: analytics data storage and distributed processing engine using SQL, a language familiar to Data Analysts

ANALYTICS WORKFLOW
03
1. First Implementation
2. Inference Services

3.1.1 Analytics Workflow
- After the ETL process, data is available for further processing and analysis
- Analytics Workflow:
- A part of the Data Platform
- Extracts information from data for insights
- Machine Learning models are an integral part of text analytics
- Extracted information is pushed to BigQuery for queries

3.1.2 First implementation
- Implement each model as a worker
- Advantages:
- Easy to implement
- Suitable for early stages: fast implementation and acceptable performance
- Several drawbacks (technical debt):
- Mixing of concerns
- Low flexibility
- Limited scalability

3.1.3 First implementation: mixing of concerns
- Data Platform’s intended purpose: moving data, processing it, and interacting with various APIs on the way => mostly I/O operations
- Computationally heavy tasks are usually delegated, e.g. to BigQuery
- Running ML models inside workers mixes I/O with computation

3.1.4 First implementation: scalability
- Everything seems OK until we must process many reviews (100,000s to 1,000,000s, of various lengths, some very long)
- Manual scaling: replicate workers -> VM resource/cost constraints
- GPU acceleration? -> ETL workers don’t need GPUs

3.1.5 First implementation: monitoring and maintenance
- No real monitoring components for performance degradation
- Data drift, concept drift?
- If needed, model is inspected manually
- Collect, process, re-train models manually
- Upload trained model to GCS, re-deploy workers

3.2.1 Inference Services: separation of concerns
- Introduce Inference Services
- No direct I/O for data: services only accept HTTP requests with input and respond with computed results
=> Easier to maintain and optimize both ends
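The "HTTP request in, computed result out" contract can be illustrated with the standard library alone. This is a sketch under assumptions: the `/predict` path, the `{"texts": [...]}` payload shape, and the keyword-based `predict_sentiment` stub are all hypothetical stand-ins for a real model and API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_sentiment(text: str) -> dict:
    """Placeholder model: a keyword rule standing in for a real NLP model."""
    positive = {"good", "great", "love", "excellent"}
    negative = {"bad", "poor", "hate", "terrible"}
    words = set(text.lower().split())
    score = len(words & positive) - len(words & negative)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"label": label, "score": score}


class InferenceHandler(BaseHTTPRequestHandler):
    """Accepts POST /predict with texts and returns results; no data-store I/O."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            texts = payload["texts"]
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected JSON body with a 'texts' list")
            return
        results = [predict_sentiment(t) for t in texts]
        body = json.dumps({"results": results}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), InferenceHandler)
```

An ETL worker would then only POST a batch of texts and store the response. The model code, its dependencies, and any GPU requirement stay on the service side.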

3.2.2 Inference Services: overall architecture

3.2.3 Inference Services: solving redundancy and reusability
- Each ML model is treated as a microservice
- Several ML models can be connected as an inference pipeline for complex tasks
- Promote reusability and flexibility => save resources
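One way to picture connecting models into an inference pipeline is shown below. Each step is a plain function standing in for a call to one model service. The three step names and their toy logic are hypothetical and only show how steps can be reused across pipelines.

```python
from typing import Callable, Dict, List

# Each step takes and returns a JSON-like dict. In production each step would
# wrap an HTTP call to one model service; the bodies below are stand-in stubs.
Step = Callable[[Dict], Dict]


def detect_language(doc: Dict) -> Dict:
    is_ascii = all(ord(ch) < 128 for ch in doc["text"])
    return {**doc, "lang": "en" if is_ascii else "other"}


def split_sentences(doc: Dict) -> Dict:
    parts = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return {**doc, "sentences": parts}


def tag_sentiment(doc: Dict) -> Dict:
    labels = ["positive" if "good" in s.lower() else "neutral"
              for s in doc["sentences"]]
    return {**doc, "sentiment": labels}


def make_pipeline(steps: List[Step]) -> Step:
    """Chain steps so the output of one becomes the input of the next."""
    def run(doc: Dict) -> Dict:
        for step in steps:
            doc = step(doc)
        return doc
    return run


# Two pipelines share the same first two steps instead of duplicating them.
review_pipeline = make_pipeline([detect_language, split_sentences, tag_sentiment])
summary_pipeline = make_pipeline([detect_language, split_sentences])
```

Because both pipelines reuse `detect_language` and `split_sentences`, only one deployment of each would be needed, which is where the resource saving comes from.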

3.2.4 Inference Services: solving scalability
- Services are containerized, run, and deployed independently
- Can be migrated to any environment with relative ease
- For maximum scalability => K8s cluster (GKE) with autoscaling
- Thanks to K8s, deployment is easier.
- Rollout deployments: no/minimal downtime
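The talk does not say which autoscaler or metric was used. As an illustration only, the replica rule documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × metric / target), can be written out as follows. The default bounds are hypothetical, and the 10% tolerance is assumed to match the documented HPA default.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Replica count from the HPA-style ratio rule, clamped to [min, max]."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do not scale
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, two replicas running at 90% CPU against a 60% target would scale to three. The same ratio scales the service back down when the review backlog clears.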

3.2.5. Inference Services: monitoring
- Metrics are logged to a central data lake and visualized in a dashboard.
Image from https://www.datarobot.com/wiki/machine-learning-operations-mlops/
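A sketch of what per-request metric logging might look like. The record fields, the JSON-lines format, and the label-share summary are assumptions chosen for illustration. The talk does not list which metrics were collected.

```python
import json
import time
from typing import Callable, Dict, List, TextIO


def timed_predict(model: Callable[[str], str], text: str, sink: TextIO,
                  service: str = "sentiment") -> str:
    """Run the model and append one JSON metric record per request to `sink`."""
    start = time.perf_counter()
    label = model(text)
    record = {
        "service": service,
        "ts": time.time(),
        "latency_ms": round((time.perf_counter() - start) * 1000, 3),
        "input_chars": len(text),
        "label": label,
    }
    sink.write(json.dumps(record) + "\n")
    return label


def label_shares(records: List[Dict]) -> Dict[str, float]:
    """Share of each predicted label; a shift over time can hint at drift."""
    counts: Dict[str, int] = {}
    for rec in records:
        counts[rec["label"]] = counts.get(rec["label"], 0) + 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Here `sink` could be a file that a log shipper forwards to the central store. Comparing `label_shares` between time windows is one simple signal a dashboard might plot.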

3.2.6. Inference Services: results and drawbacks
- Results
- A more flexible and effective solution
- More resilient ETL process: less complex
- Reduced ETL resource consumption and processing time
- New system of services can be developed and maintained separately
- Drawbacks
- More infrastructure and tools -> management overhead
- Inference services develop complex inter-dependencies as the system expands
- Requires more expertise in managing K8s clusters and deployment

4.1. What We Learned
- ML applications can be tricky to get right
- Few resources and established best practices are available
- Solved by: thorough analysis of use-cases
- Solved by: proper scoping and sizing
- Separating I/O Intensive from Computationally-intensive tasks
- ETL components
- ML components
- Good architecture design from the beginning can save time and cost later
- Over-engineered vs under-engineered
- Easy in hindsight, difficult in practice
Hope these ideas help you in designing your next ML Application
