ML IN DATA PLATFORM
A Case Study with NLP Application
US Office
2150 Ringwood Ave, San Jose,
CA 95131
UK Office
3 Beeston Place, Belgravia,
London SW1W 0JJ, UK
Vietnam Office
Floor #1-4, 302 Le Van Sy,
Ward 1, Tan Binh District, HCMC,
Vietnam
SG Office
6A Shenton Way #04-08 OUE
Downtown Gallery Singapore 068815
Table of contents
No. Content
1 Introduction
2 Data Platform – ETL Process
3 Data Platform – Analytics Workflow
4 Afterthoughts
INTRODUCTION
01
1. Introduction to Case Study
2. Introduction to Data Platform
1.1.1. Potential Values of ML/NLP Application
- ML applications can unlock new value
- Case study: Online Review Analytics
- Opinions from others increasingly guide customers' purchases
=> Growth, Improvement, Investment implications
Refs
- https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/five-star-growth-using-online-ratings-to-design-better-products
- https://www.thinkwithgoogle.com/consumer-insights/consumer-trends/customer-review-preference-statistics/
1.1.2. Dealing with text data
- An insight-mining platform for review text is highly valuable, but difficult to build
- Engineering challenges
- Getting the reviews => web-scraping, data collection
- Storing reviews => moving, maintaining, deduplicating large amounts of text
- Processing reviews => text cleaning, processing, and analytics at scale
- Analytics challenges
- Natural Language Processing – NLP
- Insight communication: dashboards and visualization
1.2.1. Data Platform overall architecture
1.2.2. Example: output from ETL Process
1.3. Example: output from Analytics Workflow
1.3. Example: insight communication – Web Application
ETL PROCESS
02
1. Extract, Transform, Load
2. Data Collection
3. Data Storage
2.1. Extract, Transform, Load
- Extract:
- Data Collector: collect data from websites
- Extract and Map from raw data collected
- Transform: clean up data (trimming, special characters, …), deduplication, etc.
- Load: to databases for storage and analysis: MongoDB, BigQuery
- Batching: split large amounts of data into batches for parallel processing
- Worker: a container that moves/processes data -> Mini-ETL
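The Transform and Batching steps above can be sketched as follows. This is a minimal illustration, not the platform's actual code: the record shape (`id`, `text`), the function names, and the choice to deduplicate on lower-cased cleaned text are all assumptions made for the example.

```python
import re
from typing import Dict, Iterable, Iterator, List


def clean_text(text: str) -> str:
    """Replace control characters with spaces, collapse whitespace, and trim."""
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def transform(reviews: Iterable[Dict]) -> List[Dict]:
    """Clean each review and drop empty or duplicate texts (first one wins)."""
    seen = set()
    result = []
    for review in reviews:
        text = clean_text(review["text"])
        key = text.lower()  # assumption: duplicates are compared case-insensitively
        if not text or key in seen:
            continue
        seen.add(key)
        result.append({**review, "text": text})
    return result


def batches(items: List[Dict], size: int) -> Iterator[List[Dict]]:
    """Split a list into consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical raw records, as a collector might hand them over.
raw = [
    {"id": 1, "text": "  Great   phone!\n"},
    {"id": 2, "text": "great phone!"},
    {"id": 3, "text": "Battery dies fast."},
    {"id": 4, "text": "   "},
    {"id": 5, "text": "Fast delivery"},
]
cleaned = transform(raw)
batch_list = list(batches(cleaned, 2))
```

Each batch could then be handed to a separate worker, which is what makes the parallel processing mentioned above possible.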
2.1. Data Collection: web-scraping
Web Scraper
2.1. Data Collection: Benefit & Challenge
Benefits
- It’s free
- It’s big data
Challenges
- Fake data
- Hard to collect
- Captcha
- IP blocking
- JavaScript rendering
2.1. Data Collection: How to deal with challenges?
- Proxy: to avoid IP blocking and Captcha
- Web browser: to overcome JavaScript rendering
- Selenium: to control the browser by code
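As a rough illustration of the proxy idea only, the sketch below rotates requests across a pool of proxies using the Python standard library. The proxy addresses and the `ProxyPool` class are hypothetical, and the example does not cover JavaScript rendering, which needs a real browser driven by a tool such as Selenium.

```python
import itertools
import urllib.request
from typing import List

# Hypothetical proxy endpoints, used purely for illustration.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


class ProxyPool:
    """Hands out proxies in round-robin order so requests spread across IPs."""

    def __init__(self, proxies: List[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._cycle)

    def opener(self) -> urllib.request.OpenerDirector:
        """Build an opener that routes HTTP and HTTPS through the next proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)


pool = ProxyPool(PROXIES)
picked = [pool.next_proxy() for _ in range(4)]  # wraps around after the third
```

A scraper would call `pool.opener().open(url)` for each page. No request is made in the sketch itself, so it runs without network access.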
2.2. Data Storage
- PostgreSQL: store process metadata (used by orchestrator)
- Google Cloud Storage: store intermediary CSV files
- MongoDB: flexible, persistent storage for text documents. Allows easy and frequent edits
- Google BigQuery: analytics data storage and distributed processing engine using SQL, a familiar language for Data Analysts
ANALYTICS WORKFLOW
03
1. First Implementation
2. Inference Services
3.1.1 Analytics Workflow
- After ETL process, data is available for further processing and analysis
- Analytics Workflow:
- A part of Data Platform
- Extract information from data for insights
- Machine Learning models are an integral part of text analytics
- Information is extracted, and pushed to BigQuery for queries
3.1.2 First implementation
- Implement each model as a worker
- Advantages:
- Easy to implement
- Suitable for early stages: fast implementation and acceptable performance
- Several drawbacks (technical debt):
- Mixing of concerns
- Low flexibility
- Limited scalability
3.1.3 First implementation: mixing of concerns
- Data Platform’s intended purpose: moving data, processing, and interacting with various APIs on the way => mostly I/O operations
- Computationally-heavy tasks are usually delegated: e.g. to BigQuery
- Mixing I/O and computations
3.1.4 First implementation: scalability
- Everything seems OK until we must process many reviews (100,000s to 1,000,000s, of varying lengths, some very long)
- Manual scaling: replicate workers -> VM resource/cost constraints
- GPU acceleration? -> ETL workers don’t need GPUs
3.1.5 First implementation: monitoring and maintenance
- No real monitoring components for performance degradation
- Data drift, concept drift?
- If needed, model is inspected manually
- Collect, process, re-train models manually
- Upload trained model to GCS, re-deploy workers
3.2.1 Inference Services: separation of concerns
- Income Inference Services
- No direct data I/O: services only accept HTTP requests carrying the input and respond with the computed results
=> Easier to maintain and optimize both ends
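A minimal sketch of such a service, using only the Python standard library, might look like the following. The word-counting `predict` function is a placeholder standing in for a trained model, and the JSON shape (`text` in, `label` and `score` out) and port number are assumptions made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple

POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def predict(text: str) -> Dict:
    """Placeholder model: counts sentiment words. A real service would run a trained model."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"label": label, "score": score}


def handle_payload(body: bytes) -> Tuple[int, bytes]:
    """Pure request logic: JSON bytes in, (HTTP status, JSON bytes) out."""
    try:
        text = json.loads(body)["text"]
        if not isinstance(text, str):
            raise TypeError("text must be a string")
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a string 'text' field"}
        return 400, json.dumps(error).encode()
    return 200, json.dumps(predict(text)).encode()


class InferenceHandler(BaseHTTPRequestHandler):
    """The only I/O the service performs: read the request, write the response."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_payload(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port: int = 8080) -> None:
    """Start the service; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

Keeping `handle_payload` free of network code means the model logic can be tested without starting a server, while the ETL worker on the other end only needs to make an HTTP call.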
3.2.2 Inference Services: overall architecture
23
3.2.3 Inference Services: solving redundancy and reusability
- Each ML model is treated as a microservice
- Several ML models can be connected as an inference pipeline for complex tasks
- Promote reusability and flexibility => save resources
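The idea of connecting models into a pipeline and reusing them can be sketched as below. Local functions stand in for the individual services here; in a deployed system each step would presumably be an HTTP call to a separate microservice. All step names and the keyword-based sentiment rule are hypothetical.

```python
from typing import Callable, Dict

Step = Callable[[Dict], Dict]


def split_sentences(doc: Dict) -> Dict:
    """Stand-in for a sentence-splitting service (naive split on periods)."""
    sentences = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return {**doc, "sentences": sentences}


def tag_sentiment(doc: Dict) -> Dict:
    """Stand-in for a sentiment service (toy keyword rule, not a real model)."""
    labels = []
    for sentence in doc["sentences"]:
        lowered = sentence.lower()
        if "bad" in lowered:
            labels.append("negative")
        elif "good" in lowered:
            labels.append("positive")
        else:
            labels.append("neutral")
    return {**doc, "sentiment": labels}


def count_sentences(doc: Dict) -> Dict:
    """Stand-in for a lightweight statistics service."""
    return {**doc, "n_sentences": len(doc["sentences"])}


def make_pipeline(*steps: Step) -> Step:
    """Chain steps so that each one receives the previous step's output."""
    def run(doc: Dict) -> Dict:
        for step in steps:
            doc = step(doc)
        return doc
    return run


# The same splitter is reused by two different pipelines.
sentiment_pipeline = make_pipeline(split_sentences, tag_sentiment)
stats_pipeline = make_pipeline(split_sentences, count_sentences)

review = {"text": "Good screen. Bad battery. Arrived on Monday."}
sentiment_result = sentiment_pipeline(review)
stats_result = stats_pipeline(review)
```

Because every step has the same dictionary-in, dictionary-out shape, a new task can be assembled from existing steps instead of deploying another copy of the same model.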
3.2.4 Inference Services: solving scalability
- Services are containerized, run, and deployed independently
- Can be migrated to any environment with relative ease
- For maximum scalability => K8s cluster (GKE) with autoscaling
- Thanks to K8s, deployment is easier.
- Rollout deployments: no/minimal downtime
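To give a feel for what autoscaling does, the sketch below applies the core rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), clamped to a minimum and maximum. It is a simplification that leaves out the tolerance band and stabilization behaviour, and the slides do not say which metric or limits the platform actually uses, so the numbers here are illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Simplified HPA rule: scale replicas in proportion to metric / target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


# Example: average CPU utilisation (percent) against a 60% target.
scale_up = desired_replicas(2, 90, 60)
scale_down = desired_replicas(4, 20, 60)
capped = desired_replicas(5, 300, 60, max_replicas=10)
```

With a 60% target, two replicas at 90% utilisation grow to three, four replicas at 20% shrink to two, and a large spike is capped at the configured maximum.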
3.2.5. Inference Services: monitoring
- Metrics are logged to a central data-lake and visualized in a dashboard.
Image from https://www.datarobot.com/wiki/machine-learning-operations-mlops/
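One way to picture the metric logging is the sketch below, which writes one JSON record per prediction and aggregates them for a dashboard. The field names, the model name `sentiment-v1`, and the in-memory `StringIO` sink standing in for the data lake are all assumptions for illustration.

```python
import io
import json
import time
from collections import Counter
from typing import Dict, Iterable, Optional, TextIO


def log_prediction(
    sink: TextIO,
    model: str,
    label: str,
    latency_ms: float,
    input_chars: int,
    timestamp: Optional[float] = None,
) -> Dict:
    """Append one prediction record to the sink as a JSON line."""
    record = {
        "ts": time.time() if timestamp is None else timestamp,
        "model": model,
        "label": label,
        "latency_ms": latency_ms,
        "input_chars": input_chars,
    }
    sink.write(json.dumps(record) + "\n")
    return record


def summarize(lines: Iterable[str]) -> Dict:
    """Aggregate logged records into the figures a dashboard might plot."""
    records = [json.loads(line) for line in lines if line.strip()]
    total = len(records)
    labels = Counter(r["label"] for r in records)
    return {
        "count": total,
        "mean_latency_ms": sum(r["latency_ms"] for r in records) / total if total else 0.0,
        "label_share": {label: n / total for label, n in labels.items()},
    }


sink = io.StringIO()  # stand-in for the central data lake
log_prediction(sink, "sentiment-v1", "positive", 12.0, 80, timestamp=0)
log_prediction(sink, "sentiment-v1", "negative", 18.0, 240, timestamp=1)
log_prediction(sink, "sentiment-v1", "positive", 15.0, 40, timestamp=2)
summary = summarize(sink.getvalue().splitlines())
```

Tracking the share of each predicted label and the input length over time gives a rough signal that the incoming data may have drifted, which can prompt a closer look at the model.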
3.2.6. Inference Services: results and drawbacks
- Results
- A more flexible and effective solution
- More resilient ETL process: less complex
- Reduced ETL resource consumption and processing time
- New system of services can be developed and maintained separately
- Drawbacks
- More infrastructure and tools to run -> management overhead
- Complex inter-dependencies between inference services as the system expands
- Requires more expertise in managing K8s clusters and deployment
WHAT WE LEARNED
04
4.1. What We Learned
- ML applications can be tricky to get right
- Few resources and established best practices
- Solved by: thorough analysis of use-cases
- Solved by: proper scoping and sizing
- Separate I/O-intensive tasks from computationally intensive tasks
- ETL components
- ML components
- Good architecture design from the beginning can save time and cost later
- Over-engineered vs under-engineered
- Easy in hindsight, difficult in practice
Hope these ideas help you in designing your next ML Application
THANK YOU – Q&A

More Related Content

What's hot

Hệ mật mã elgamal
Hệ mật mã elgamalHệ mật mã elgamal
Hệ mật mã elgamal
Thành phố Đà Lạt
 
Tìm hiểu hệ mã hoá RSA và cách triển khai vào hệ thống
Tìm hiểu hệ mã hoá RSA và cách triển khai vào hệ thốngTìm hiểu hệ mã hoá RSA và cách triển khai vào hệ thống
Tìm hiểu hệ mã hoá RSA và cách triển khai vào hệ thống
tNguynMinh11
 
Cơ bản về blockchain, bitcoin và ethereum
Cơ bản về blockchain, bitcoin và ethereumCơ bản về blockchain, bitcoin và ethereum
Cơ bản về blockchain, bitcoin và ethereum
Long Le
 
Mã hóa đồng cấu
Mã hóa đồng cấuMã hóa đồng cấu
Mã hóa đồng cấu
LE Ngoc Luyen
 
Hệ mật mã Elgamal
Hệ mật mã ElgamalHệ mật mã Elgamal
Hệ mật mã Elgamal
Thành phố Đà Lạt
 
Thuật toán mã hóa rsa
Thuật toán mã hóa rsaThuật toán mã hóa rsa
Thuật toán mã hóa rsa
Bảo Điệp
 
itlchn 20 - Kien truc he thong chung khoan - Phan 2
itlchn 20 - Kien truc he thong chung khoan - Phan 2itlchn 20 - Kien truc he thong chung khoan - Phan 2
itlchn 20 - Kien truc he thong chung khoan - Phan 2
IT Expert Club
 
Ml ppt
Ml pptMl ppt
Ml ppt
Alpna Patel
 
KHO DỮ LIỆU VÀ KHAI PHÁ DỮ LIỆU PTIT
KHO DỮ LIỆU VÀ KHAI PHÁ DỮ LIỆU PTITKHO DỮ LIỆU VÀ KHAI PHÁ DỮ LIỆU PTIT
KHO DỮ LIỆU VÀ KHAI PHÁ DỮ LIỆU PTIT
Popping Khiem - Funky Dance Crew PTIT
 
Hệ Cơ Sở Dữ Liệu Đa Phương Tiện PTIT
Hệ Cơ Sở Dữ Liệu Đa Phương Tiện PTITHệ Cơ Sở Dữ Liệu Đa Phương Tiện PTIT
Hệ Cơ Sở Dữ Liệu Đa Phương Tiện PTIT
Popping Khiem - Funky Dance Crew PTIT
 
Hệ thống thông tin quản lý-website tin tức nhà đất
Hệ thống thông tin quản lý-website tin tức nhà đấtHệ thống thông tin quản lý-website tin tức nhà đất
Hệ thống thông tin quản lý-website tin tức nhà đất
Kali Back Tracker
 
Lưu trữ và xử lý dữ liệu trong điện toán đám mây
Lưu trữ và xử lý dữ liệu trong điện toán đám mâyLưu trữ và xử lý dữ liệu trong điện toán đám mây
Lưu trữ và xử lý dữ liệu trong điện toán đám mây
PhamTuanKhiem
 
docx.vn - Xay dung website ban quan ao online
docx.vn - Xay dung website ban quan ao onlinedocx.vn - Xay dung website ban quan ao online
docx.vn - Xay dung website ban quan ao online
Vi Thái
 
Thuat toan pca full 24-5-2017
Thuat toan pca full   24-5-2017 Thuat toan pca full   24-5-2017
Thuat toan pca full 24-5-2017
Tuan Remy
 
Quy trình bảo mật an toàn thông tin doanh nghiệp
Quy trình bảo mật an toàn thông tin doanh nghiệpQuy trình bảo mật an toàn thông tin doanh nghiệp
Quy trình bảo mật an toàn thông tin doanh nghiệp
laonap166
 
BÁO CÁO ĐỒ ÁN MÔN HỌC ĐIỆN TOÁN ĐÁM MÂY ĐỀ TÀI: TÌM HIỂU VÀ SỬ DỤNG AMAZON WE...
BÁO CÁO ĐỒ ÁN MÔN HỌC ĐIỆN TOÁN ĐÁM MÂY ĐỀ TÀI: TÌM HIỂU VÀ SỬ DỤNG AMAZON WE...BÁO CÁO ĐỒ ÁN MÔN HỌC ĐIỆN TOÁN ĐÁM MÂY ĐỀ TÀI: TÌM HIỂU VÀ SỬ DỤNG AMAZON WE...
BÁO CÁO ĐỒ ÁN MÔN HỌC ĐIỆN TOÁN ĐÁM MÂY ĐỀ TÀI: TÌM HIỂU VÀ SỬ DỤNG AMAZON WE...
nataliej4
 
9 CÂU NÓI NỔI TIẾNG VỀ BIG DATA-DỮ LIỆU LỚN
9 CÂU NÓI NỔI TIẾNG VỀ BIG DATA-DỮ LIỆU LỚN9 CÂU NÓI NỔI TIẾNG VỀ BIG DATA-DỮ LIỆU LỚN
9 CÂU NÓI NỔI TIẾNG VỀ BIG DATA-DỮ LIỆU LỚN
Huynh Huu Tai
 
An Ninh Mạng Và Kỹ Thuật Session Hijacking.pdf
An Ninh Mạng Và Kỹ Thuật Session Hijacking.pdfAn Ninh Mạng Và Kỹ Thuật Session Hijacking.pdf
An Ninh Mạng Và Kỹ Thuật Session Hijacking.pdf
NuioKila
 
Thuật toán K mean
Thuật toán K meanThuật toán K mean
Thuật toán K mean
Haokillboom Aăâ
 

What's hot (20)

Hệ mật mã elgamal
Hệ mật mã elgamalHệ mật mã elgamal
Hệ mật mã elgamal
 
Tìm hiểu hệ mã hoá RSA và cách triển khai vào hệ thống
Tìm hiểu hệ mã hoá RSA và cách triển khai vào hệ thốngTìm hiểu hệ mã hoá RSA và cách triển khai vào hệ thống
Tìm hiểu hệ mã hoá RSA và cách triển khai vào hệ thống
 
Cơ bản về blockchain, bitcoin và ethereum
Cơ bản về blockchain, bitcoin và ethereumCơ bản về blockchain, bitcoin và ethereum
Cơ bản về blockchain, bitcoin và ethereum
 
Mã hóa đồng cấu
Mã hóa đồng cấuMã hóa đồng cấu
Mã hóa đồng cấu
 
Hệ mật mã Elgamal
Hệ mật mã ElgamalHệ mật mã Elgamal
Hệ mật mã Elgamal
 
Thuật toán mã hóa rsa
Thuật toán mã hóa rsaThuật toán mã hóa rsa
Thuật toán mã hóa rsa
 
itlchn 20 - Kien truc he thong chung khoan - Phan 2
itlchn 20 - Kien truc he thong chung khoan - Phan 2itlchn 20 - Kien truc he thong chung khoan - Phan 2
itlchn 20 - Kien truc he thong chung khoan - Phan 2
 
Ml ppt
Ml pptMl ppt
Ml ppt
 
KHO DỮ LIỆU VÀ KHAI PHÁ DỮ LIỆU PTIT
KHO DỮ LIỆU VÀ KHAI PHÁ DỮ LIỆU PTITKHO DỮ LIỆU VÀ KHAI PHÁ DỮ LIỆU PTIT
KHO DỮ LIỆU VÀ KHAI PHÁ DỮ LIỆU PTIT
 
Hệ Cơ Sở Dữ Liệu Đa Phương Tiện PTIT
Hệ Cơ Sở Dữ Liệu Đa Phương Tiện PTITHệ Cơ Sở Dữ Liệu Đa Phương Tiện PTIT
Hệ Cơ Sở Dữ Liệu Đa Phương Tiện PTIT
 
Hệ thống thông tin quản lý-website tin tức nhà đất
Hệ thống thông tin quản lý-website tin tức nhà đấtHệ thống thông tin quản lý-website tin tức nhà đất
Hệ thống thông tin quản lý-website tin tức nhà đất
 
Lưu trữ và xử lý dữ liệu trong điện toán đám mây
Lưu trữ và xử lý dữ liệu trong điện toán đám mâyLưu trữ và xử lý dữ liệu trong điện toán đám mây
Lưu trữ và xử lý dữ liệu trong điện toán đám mây
 
Heap Sort
Heap SortHeap Sort
Heap Sort
 
docx.vn - Xay dung website ban quan ao online
docx.vn - Xay dung website ban quan ao onlinedocx.vn - Xay dung website ban quan ao online
docx.vn - Xay dung website ban quan ao online
 
Thuat toan pca full 24-5-2017
Thuat toan pca full   24-5-2017 Thuat toan pca full   24-5-2017
Thuat toan pca full 24-5-2017
 
Quy trình bảo mật an toàn thông tin doanh nghiệp
Quy trình bảo mật an toàn thông tin doanh nghiệpQuy trình bảo mật an toàn thông tin doanh nghiệp
Quy trình bảo mật an toàn thông tin doanh nghiệp
 
BÁO CÁO ĐỒ ÁN MÔN HỌC ĐIỆN TOÁN ĐÁM MÂY ĐỀ TÀI: TÌM HIỂU VÀ SỬ DỤNG AMAZON WE...
BÁO CÁO ĐỒ ÁN MÔN HỌC ĐIỆN TOÁN ĐÁM MÂY ĐỀ TÀI: TÌM HIỂU VÀ SỬ DỤNG AMAZON WE...BÁO CÁO ĐỒ ÁN MÔN HỌC ĐIỆN TOÁN ĐÁM MÂY ĐỀ TÀI: TÌM HIỂU VÀ SỬ DỤNG AMAZON WE...
BÁO CÁO ĐỒ ÁN MÔN HỌC ĐIỆN TOÁN ĐÁM MÂY ĐỀ TÀI: TÌM HIỂU VÀ SỬ DỤNG AMAZON WE...
 
9 CÂU NÓI NỔI TIẾNG VỀ BIG DATA-DỮ LIỆU LỚN
9 CÂU NÓI NỔI TIẾNG VỀ BIG DATA-DỮ LIỆU LỚN9 CÂU NÓI NỔI TIẾNG VỀ BIG DATA-DỮ LIỆU LỚN
9 CÂU NÓI NỔI TIẾNG VỀ BIG DATA-DỮ LIỆU LỚN
 
An Ninh Mạng Và Kỹ Thuật Session Hijacking.pdf
An Ninh Mạng Và Kỹ Thuật Session Hijacking.pdfAn Ninh Mạng Và Kỹ Thuật Session Hijacking.pdf
An Ninh Mạng Và Kỹ Thuật Session Hijacking.pdf
 
Thuật toán K mean
Thuật toán K meanThuật toán K mean
Thuật toán K mean
 

Similar to Grokking Techtalk #42: Engineering challenges on building data platform for ML application

Five ways database modernization simplifies your data life
Five ways database modernization simplifies your data lifeFive ways database modernization simplifies your data life
Five ways database modernization simplifies your data life
SingleStore
 
MuleSoft Manchester Meetup #4 slides 11th February 2021
MuleSoft Manchester Meetup #4 slides 11th February 2021MuleSoft Manchester Meetup #4 slides 11th February 2021
MuleSoft Manchester Meetup #4 slides 11th February 2021
Ieva Navickaite
 
BigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLBigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQL
Márton Kodok
 
ESP POC Findings
ESP POC FindingsESP POC Findings
ESP POC Findings
kevin_donovan
 
MODERN DATA PIPELINE
MODERN DATA PIPELINEMODERN DATA PIPELINE
MODERN DATA PIPELINE
IRJET Journal
 
127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation
Nitesh Kumar
 
Accelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature EngineeringAccelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature Engineering
Cognizant
 
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
Chris Hoyean Song
 
Distributed Systems in Data Engineering
Distributed Systems in Data EngineeringDistributed Systems in Data Engineering
Distributed Systems in Data Engineering
Oluwasegun Matthew
 
Print report
Print reportPrint report
Print report
Ved Prakash
 
How to overcome challenges in it system evolution
How to overcome challenges in it system evolutionHow to overcome challenges in it system evolution
How to overcome challenges in it system evolution
Grupa Unity
 
Datawarehouse and reporting in service manager
Datawarehouse and reporting in service manager Datawarehouse and reporting in service manager
Datawarehouse and reporting in service manager
Eduardo Castro
 
Workshop: Delivering chnages for applications and databases
Workshop: Delivering chnages for applications and databasesWorkshop: Delivering chnages for applications and databases
Workshop: Delivering chnages for applications and databases
Eduardo Piairo
 
Internet of Things Microservices
Internet of Things MicroservicesInternet of Things Microservices
Internet of Things Microservices
Capgemini
 
Dataweave Libraries and ObjectStore
Dataweave Libraries and ObjectStoreDataweave Libraries and ObjectStore
Dataweave Libraries and ObjectStore
Vikalp Bhalia
 
Book store Black Book - Dinesh48
Book store Black Book - Dinesh48Book store Black Book - Dinesh48
Book store Black Book - Dinesh48
Dinesh Jogdand
 
Zakir_Hussain_cv
Zakir_Hussain_cvZakir_Hussain_cv
Zakir_Hussain_cv
zakir hussain
 
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
CARLOS III UNIVERSITY OF MADRID
 
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
IRJET Journal
 
Bank Management System.docx
Bank Management System.docxBank Management System.docx
Bank Management System.docx
Nikhil Patil
 

Similar to Grokking Techtalk #42: Engineering challenges on building data platform for ML application (20)

Five ways database modernization simplifies your data life
Five ways database modernization simplifies your data lifeFive ways database modernization simplifies your data life
Five ways database modernization simplifies your data life
 
MuleSoft Manchester Meetup #4 slides 11th February 2021
MuleSoft Manchester Meetup #4 slides 11th February 2021MuleSoft Manchester Meetup #4 slides 11th February 2021
MuleSoft Manchester Meetup #4 slides 11th February 2021
 
BigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLBigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQL
 
ESP POC Findings
ESP POC FindingsESP POC Findings
ESP POC Findings
 
MODERN DATA PIPELINE
MODERN DATA PIPELINEMODERN DATA PIPELINE
MODERN DATA PIPELINE
 
127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation
 
Accelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature EngineeringAccelerating Machine Learning as a Service with Automated Feature Engineering
Accelerating Machine Learning as a Service with Automated Feature Engineering
 
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
 
Distributed Systems in Data Engineering
Distributed Systems in Data EngineeringDistributed Systems in Data Engineering
Distributed Systems in Data Engineering
 
Print report
Print reportPrint report
Print report
 
How to overcome challenges in it system evolution
How to overcome challenges in it system evolutionHow to overcome challenges in it system evolution
How to overcome challenges in it system evolution
 
Datawarehouse and reporting in service manager
Datawarehouse and reporting in service manager Datawarehouse and reporting in service manager
Datawarehouse and reporting in service manager
 
Workshop: Delivering chnages for applications and databases
Workshop: Delivering chnages for applications and databasesWorkshop: Delivering chnages for applications and databases
Workshop: Delivering chnages for applications and databases
 
Internet of Things Microservices
Internet of Things MicroservicesInternet of Things Microservices
Internet of Things Microservices
 
Dataweave Libraries and ObjectStore
Dataweave Libraries and ObjectStoreDataweave Libraries and ObjectStore
Dataweave Libraries and ObjectStore
 
Book store Black Book - Dinesh48
Book store Black Book - Dinesh48Book store Black Book - Dinesh48
Book store Black Book - Dinesh48
 
Zakir_Hussain_cv
Zakir_Hussain_cvZakir_Hussain_cv
Zakir_Hussain_cv
 
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
 
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
Limited Budget but Effective End to End MLOps Practices (Machine Learning Mod...
 
Bank Management System.docx
Bank Management System.docxBank Management System.docx
Bank Management System.docx
 

More from Grokking VN

Grokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles ThinkingGrokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles Thinking
Grokking VN
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking VN
 
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking VN
 
Grokking Techtalk #39: Gossip protocol and applications
Grokking Techtalk #39: Gossip protocol and applicationsGrokking Techtalk #39: Gossip protocol and applications
Grokking Techtalk #39: Gossip protocol and applications
Grokking VN
 
Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
 Grokking Techtalk #39: How to build an event driven architecture with Kafka ... Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
Grokking VN
 
Grokking Techtalk #38: Escape Analysis in Go compiler
 Grokking Techtalk #38: Escape Analysis in Go compiler Grokking Techtalk #38: Escape Analysis in Go compiler
Grokking Techtalk #38: Escape Analysis in Go compiler
Grokking VN
 
Grokking Techtalk #37: Data intensive problem
 Grokking Techtalk #37: Data intensive problem Grokking Techtalk #37: Data intensive problem
Grokking Techtalk #37: Data intensive problem
Grokking VN
 
Grokking Techtalk #37: Software design and refactoring
 Grokking Techtalk #37: Software design and refactoring Grokking Techtalk #37: Software design and refactoring
Grokking Techtalk #37: Software design and refactoring
Grokking VN
 
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer... Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
Grokking VN
 
Grokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking VN
 
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking VN
 
SOLID & Design Patterns
SOLID & Design PatternsSOLID & Design Patterns
SOLID & Design Patterns
Grokking VN
 
Grokking TechTalk #31: Asynchronous Communications
Grokking TechTalk #31: Asynchronous CommunicationsGrokking TechTalk #31: Asynchronous Communications
Grokking TechTalk #31: Asynchronous Communications
Grokking VN
 
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at ScaleGrokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking VN
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking VN
 
Grokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search TreeGrokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search Tree
Grokking VN
 
Grokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the MagicGrokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the Magic
Grokking VN
 
Grokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platformGrokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platform
Grokking VN
 
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking VN
 
Grokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocols
Grokking VN
 

More from Grokking VN (20)

Grokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles ThinkingGrokking Techtalk #45: First Principles Thinking
Grokking Techtalk #45: First Principles Thinking
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
 
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
 
Grokking Techtalk #39: Gossip protocol and applications
Grokking Techtalk #39: Gossip protocol and applicationsGrokking Techtalk #39: Gossip protocol and applications
Grokking Techtalk #39: Gossip protocol and applications
 
Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
 Grokking Techtalk #39: How to build an event driven architecture with Kafka ... Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
Grokking Techtalk #39: How to build an event driven architecture with Kafka ...
 
Grokking Techtalk #38: Escape Analysis in Go compiler
 Grokking Techtalk #38: Escape Analysis in Go compiler Grokking Techtalk #38: Escape Analysis in Go compiler
Grokking Techtalk #38: Escape Analysis in Go compiler
 
Grokking Techtalk #37: Data intensive problem
 Grokking Techtalk #37: Data intensive problem Grokking Techtalk #37: Data intensive problem
Grokking Techtalk #37: Data intensive problem
 
Grokking Techtalk #37: Software design and refactoring
 Grokking Techtalk #37: Software design and refactoring Grokking Techtalk #37: Software design and refactoring
Grokking Techtalk #37: Software design and refactoring
 
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer... Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
Grokking Techtalk #34: K8S On-premise: Incident & Lesson Learned ZaloPay Mer...
 
Grokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKI
 
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big...
 
SOLID & Design Patterns
SOLID & Design PatternsSOLID & Design Patterns
SOLID & Design Patterns
 
Grokking TechTalk #31: Asynchronous Communications
Grokking TechTalk #31: Asynchronous CommunicationsGrokking TechTalk #31: Asynchronous Communications
Grokking TechTalk #31: Asynchronous Communications
 
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at ScaleGrokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
Grokking TechTalk #30: From App to Ecosystem: Lessons Learned at Scale
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
 
Grokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search TreeGrokking TechTalk #27: Optimal Binary Search Tree
Grokking TechTalk #27: Optimal Binary Search Tree
 
Grokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the MagicGrokking TechTalk #26: Kotlin, Understand the Magic
Grokking TechTalk #26: Kotlin, Understand the Magic
 
Grokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platformGrokking TechTalk #26: Compare ios and android platform
Grokking TechTalk #26: Compare ios and android platform
 
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
Grokking TechTalk #24: Thiết kế hệ thống Background Job Queue bằng Ruby & Pos...
 
Grokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocolsGrokking TechTalk #24: Kafka's principles and protocols
Grokking TechTalk #24: Kafka's principles and protocols
 


Grokking Techtalk #42: Engineering challenges on building data platform for ML application

  • 1. ML IN DATA PLATFORM: A Case Study with NLP Application
      - US Office: 2150 Ringwood Ave, San Jose, CA 95131
      - UK Office: 3 Beeston Place, Belgravia, London SW1W 0JJ, UK
      - Vietnam Office: Floor #1-4, 302 Le Van Sy, Ward 1, Tan Binh District, HCMC, Vietnam
      - SG Office: 6A Shenton Way #04-08 OUE Downtown Gallery Singapore 068815
  • 2. Table of contents
      1. Introduction
      2. Data Platform – ETL Process
      3. Data Platform – Analytics Workflow
      4. Afterthoughts
  • 3. 01 INTRODUCTION
      1. Introduction to Case Study
      2. Introduction to Data Platform
  • 4. 1.1.1. Potential Values of ML/NLP Application
      - ML applications can unlock new value
      - Case study: Online Review Analytics
      - Opinions from others increasingly guide customers' purchases => implications for growth, improvement, and investment
      - Refs:
          - https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/five-star-growth-using-online-ratings-to-design-better-products
          - https://www.thinkwithgoogle.com/consumer-insights/consumer-trends/customer-review-preference-statistics/
  • 5. 1.1.2. Dealing with text data
      - An insight-mining platform for review text is highly valuable, but it is difficult to build
      - Engineering challenges
          - Getting the reviews => web scraping, data collection
          - Storing reviews => moving, maintaining, and deduplicating large amounts of text
          - Processing reviews => text cleaning, processing, and analytics at scale
      - Analytics challenges
          - Natural Language Processing (NLP)
          - Insight communication: dashboards and visualization
  • 6. 1.2.1. Data Platform overall architecture
  • 7. 1.2.2. Example: output from ETL Process
  • 8. 1.3. Example: output from Analytics Workflow
  • 9. 1.3. Example: insight communication – Web Application
  • 10. 02 ETL PROCESS
      1. Extract, Transform, Load
      2. Data Collection
      3. Data Storage
  • 11. 2.1. Extract, Transform, Load
      - Extract:
          - Data Collector: collects data from websites
          - Extract and map fields from the raw data collected
      - Transform: clean up data (trimming, special characters, …), deduplication, etc.
      - Load: into databases for storage and analysis: MongoDB, BigQuery
      - Batching: split large amounts of data into batches for parallel processing
      - Worker: a container that moves/processes data -> Mini-ETL
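The Transform and Batching steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual code: the function names (`clean_text`, `deduplicate`, `batches`), the case-insensitive duplicate rule, and the batch size are all assumptions made for the example.

```python
import re
import unicodedata
from typing import Iterable, Iterator


def clean_text(raw: str) -> str:
    """Normalize unicode, drop control characters, collapse whitespace, trim."""
    text = unicodedata.normalize("NFKC", raw)
    # Keep whitespace (collapsed below); drop other control/format characters.
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(reviews: Iterable[str]) -> list[str]:
    """Clean each review and keep the first occurrence of each text.

    Assumption: two reviews are duplicates if their cleaned text matches
    case-insensitively. A real pipeline might also key on source or author.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for review in reviews:
        cleaned = clean_text(review)
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


def batches(items: list[str], size: int) -> Iterator[list[str]]:
    """Split items into consecutive batches that workers could process in parallel."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch produced by `batches` could then be handed to one worker, which matches the "Mini-ETL" idea of a container processing a slice of the data.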
  • 12. 2.1. Data Collection: web-scraping
      - [Diagram: Web Scraper]
  • 13. 2.1. Data Collection: Benefit & Challenge
      - Benefit: it’s free; it’s big data
      - Challenge: fake data; hard to collect (Captcha, IP blocking, JavaScript rendering)
  • 14. 2.1. Data Collection: How to deal with challenges?
      - Proxy: to avoid IP blocking and Captcha
      - Web browser: to overcome JavaScript rendering
      - Selenium: control the browser by code
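One way to combine the proxy and browser ideas from this slide is to rotate through a pool of proxies and start each browser session with the next one. The sketch below only builds the Chrome command-line arguments, so it runs without a browser installed. The proxy addresses are placeholders, and the helper names (`chrome_args`, `rotating_args`) are hypothetical.

```python
from itertools import cycle
from typing import Iterator

# Placeholder proxy endpoints (assumption); replace with real ones.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


def chrome_args(proxy: str, headless: bool = True) -> list[str]:
    """Build Chrome command-line flags that route traffic through one proxy."""
    args = [f"--proxy-server={proxy}"]
    if headless:
        args.append("--headless=new")
    return args


def rotating_args(proxies: list[str]) -> Iterator[list[str]]:
    """Yield Chrome flags forever, cycling through the proxy pool."""
    for proxy in cycle(proxies):
        yield chrome_args(proxy)
```

With Selenium installed, each argument list could be applied to a `selenium.webdriver.ChromeOptions` object through `add_argument` before creating `webdriver.Chrome(options=options)`, so that every new browser session goes out through a different proxy.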
  • 15. 2.2. Data Storage
      - PostgreSQL: stores process metadata (used by the orchestrator)
      - Google Cloud Storage: stores intermediary CSV files
      - MongoDB: flexible, persistent storage for text documents; allows easy and frequent edits
      - Google BigQuery: analytics data storage and distributed processing engine using SQL, a familiar language for data analysts
  • 16. 03 ANALYTICS WORKFLOW
      1. First Implementation
      2. Inference Services
  • 17. 3.1.1 Analytics Workflow
      - After the ETL process, data is available for further processing and analysis
      - Analytics Workflow:
          - A part of the Data Platform
          - Extracts information from data for insights
          - Machine Learning models are an integral part of text analytics
          - Extracted information is pushed to BigQuery for querying
  • 18. 3.1.2 First implementation
      - Implement each model as a worker
      - Advantages:
          - Easy to implement
          - Suitable for early stages: fast implementation and acceptable performance
      - Several drawbacks (technical debt):
          - Mixing of concerns
          - Low flexibility
          - Limited scalability
  • 19. 3.1.3 First implementation: mixing of concerns
      - Data Platform’s intended purpose: moving data, processing it, and interacting with various APIs along the way => mostly I/O operations
      - Computationally heavy tasks are usually delegated, e.g. to BigQuery
      - Running ML models inside workers mixes I/O and computation
  • 20. 3.1.4 First implementation: scalability
      - Everything seems OK until we must process many reviews (hundreds of thousands to millions, of various lengths, some very long)
      - Manual scaling: replicate workers -> limited by VM resources and cost
      - GPU acceleration? -> ETL workers don’t need GPUs
  • 21. 3.1.5 First implementation: monitoring and maintenance
      - No real monitoring components for performance degradation
          - Data drift, concept drift?
      - If needed, the model is inspected manually
      - Data is collected and processed, and models are re-trained, manually
      - Trained models are uploaded to GCS and workers are re-deployed
  • 22. 3.2.1 Inference Services: separation of concerns
      - In come Inference Services
      - No direct data I/O: they only accept HTTP requests with input and respond with computed results => easier to maintain and optimize both ends
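The request/response contract described on this slide can be illustrated with Python's standard library alone. The sketch below is an assumption-laden stand-in, not the system from the talk: the `/predict` path, the JSON shape (`{"text": ...}`), and the keyword-counting `predict` function are invented for the example, and a real service would load a trained model instead.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}


def predict(text: str) -> dict:
    """Toy stand-in for a real model: count sentiment keywords."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"label": label, "score": score}


class InferenceHandler(BaseHTTPRequestHandler):
    """Accepts POST /predict with a JSON body and returns the prediction as JSON."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body = json.dumps(predict(payload["text"])).encode()
        except (ValueError, KeyError, TypeError, AttributeError):
            self.send_error(400, "expected a JSON body with a 'text' string")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def make_server(port: int = 0) -> ThreadingHTTPServer:
    """Create the service on localhost; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), InferenceHandler)


def serve_in_background(server: ThreadingHTTPServer) -> threading.Thread:
    """Run the server in a daemon thread so the caller can keep working."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread


def call_service(url: str, text: str) -> dict:
    """What an ETL worker would do: send input over HTTP, read back the result."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Assumption: the service is internal, so any system proxy is bypassed.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())
```

The worker side (`call_service`) contains no model code and the service side contains no data I/O, which is the separation the slide argues for. To run the service on its own, call `make_server(8080).serve_forever()`.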
  • 23. 3.2.2 Inference Services: overall architecture
  • 24. 3.2.3 Inference Services: solving redundancy and reusability
      - Each ML model is treated as a microservice
      - Several ML models can be connected into an inference pipeline for complex tasks
      - Promotes reusability and flexibility => saves resources
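The idea of chaining models into an inference pipeline can be sketched with plain functions standing in for the microservices. In a deployed system each step would be an HTTP call to a separate service. The three toy steps, their names, and the word lists below are assumptions made for the example.

```python
from typing import Callable

Step = Callable[[dict], dict]

STOPWORDS = {"the", "is", "a", "an", "and", "it"}
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "poor", "hate"}


def detect_language(doc: dict) -> dict:
    """Toy language check (assumption): ASCII-only text is treated as English."""
    return {**doc, "lang": "en" if doc["text"].isascii() else "other"}


def extract_keywords(doc: dict) -> dict:
    """Lowercase the words, strip punctuation, and drop stopwords."""
    words = (w.strip(".,!?").lower() for w in doc["text"].split())
    return {**doc, "keywords": [w for w in words if w and w not in STOPWORDS]}


def score_sentiment(doc: dict) -> dict:
    """Label the document from its keywords; expects extract_keywords to have run."""
    keywords = doc.get("keywords", [])
    score = sum(k in POSITIVE for k in keywords) - sum(k in NEGATIVE for k in keywords)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {**doc, "sentiment": label}


def pipeline(*steps: Step) -> Step:
    """Compose steps so the output of one becomes the input of the next."""
    def run(doc: dict) -> dict:
        for step in steps:
            doc = step(doc)
        return doc
    return run


# Two pipelines reuse the same first two steps instead of duplicating them.
review_insights = pipeline(detect_language, extract_keywords, score_sentiment)
keyword_index = pipeline(detect_language, extract_keywords)
```

Because every step takes and returns the same document shape, a new task is a new combination of existing steps rather than a new worker with its own copy of each model.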
  • 25. 3.2.4 Inference Services: solving scalability
      - Services are containerized, and are run and deployed independently
      - Can be migrated to any environment with relative ease
      - For maximum scalability => K8s cluster (GKE) with autoscaling
      - Thanks to K8s, deployment is easier
      - Rollout deployments: no/minimal downtime
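To make the autoscaling bullet concrete, the sketch below reproduces the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1. The slides do not say which autoscaler settings were used, so the replica bounds and the numbers in the usage note are illustrative assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Replica count following the HPA rule:
    ceil(current_replicas * current_metric / target_metric),
    skipped when the ratio is within `tolerance` of 1.0,
    then clamped to [min_replicas, max_replicas].
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, with 2 replicas averaging 90% CPU against a 60% target, the rule asks for 3 replicas, and with 4 replicas at 20% it scales down to 2.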
  • 26. 3.2.5. Inference Services: monitoring
      - Metrics are logged to a central data lake and visualized in a dashboard
      - Image from https://www.datarobot.com/wiki/machine-learning-operations-mlops/
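A minimal sketch of the metric-logging idea: wrap each prediction, write one JSON record per call to a sink, and aggregate the records the way a dashboard query might. The record fields, the function names, and the file-like `sink` are assumptions for illustration. The slides do not describe the actual metrics or the data lake interface.

```python
import json
import time
from collections import Counter
from typing import Callable, Iterable, TextIO


def logged_predict(model: Callable[[str], str], text: str, sink: TextIO) -> str:
    """Run the model and append one JSON line of metrics to the sink."""
    start = time.perf_counter()
    label = model(text)
    latency_ms = (time.perf_counter() - start) * 1000
    record = {
        "ts": time.time(),
        "latency_ms": round(latency_ms, 3),
        "input_chars": len(text),
        "label": label,
    }
    sink.write(json.dumps(record) + "\n")
    return label


def summarize(lines: Iterable[str]) -> dict:
    """Aggregate logged records: call count, mean latency, label distribution."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return {"count": 0, "mean_latency_ms": 0.0, "labels": {}}
    mean_latency = sum(r["latency_ms"] for r in records) / len(records)
    labels = dict(Counter(r["label"] for r in records))
    return {"count": len(records), "mean_latency_ms": mean_latency, "labels": labels}
```

Tracking the label distribution and input length over time is one simple way to notice that incoming data has shifted, which relates to the data drift question raised for the first implementation.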
  • 27. 3.2.6. Inference Services: results and drawbacks
      - Results
          - A more flexible and effective solution
          - More resilient ETL process: less complex
          - Reduced ETL resource consumption and processing time
          - The new system of services can be developed and maintained separately
      - Drawbacks
          - More infrastructure and tools -> management overhead
          - Complex inter-dependencies among inference services as the system expands
          - Requires more expertise in managing K8s clusters and deployments
  • 29. 4.1. What We Learned?
      - ML applications can be tricky to get right
          - Few resources and best practices are available
          - Solved by: thorough analysis of use cases
          - Solved by: proper scoping and sizing
      - Separate I/O-intensive tasks from computationally intensive tasks
          - ETL components
          - ML components
      - Good architecture design from the beginning can save time and cost later
          - Over-engineered vs. under-engineered
          - Easy in hindsight, difficult in practice
      - Hope these ideas help you in designing your next ML application