26. Our A.I. solution consists of
heuristic algorithms,
machine learning models,
and data processing pipelines.
Which one should we improve first? 🧐
27. Collect More Data Data Data
and tweak the machine learning models in the meantime.
28. Your data processing pipeline should support data collection tasks:
customer feedback
sampling algorithms
false-negative miner
…
How and where to collect data?
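The slide does not say which sampling algorithms are used. As one possible illustration, reservoir sampling keeps a fixed-size uniform sample from a stream of unknown length. The name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Assumption: this is only one candidate for the "sampling algorithms"
    bullet; the slides do not name a specific technique.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1000), 10, seed=0))
```

The sample needs only O(k) memory, which may suit a pipeline that cannot store every frame or event it sees.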
33. • Stream Router/Manager was written in Python 2 ❤ ❤
• Python 2 => Python 3
• byte string / decode
• isinstance
• format
• API changes
• Results
• Python 3 ❤ ❤ ❤
• Enables a lot of cool features
• f-strings, typing, asyncio, tracemalloc, etc.
Python 2 to 3
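A minimal sketch of the byte string, isinstance, and format points from the migration list. The helpers `to_text` and `describe_stream` are hypothetical and are not taken from the Stream Router/Manager code.

```python
def to_text(payload, encoding="utf-8"):
    """Normalize bytes or str input to str (hypothetical helper).

    In Python 2, str was a byte string; in Python 3, bytes and str are
    distinct types, so network payloads need an explicit decode.
    """
    if isinstance(payload, bytes):
        return payload.decode(encoding)
    if isinstance(payload, str):
        return payload
    raise TypeError(f"expected bytes or str, got {type(payload).__name__}")


def describe_stream(stream_id, raw_name):
    """Build a log line with an f-string instead of %-formatting."""
    name = to_text(raw_name)
    return f"stream={stream_id} name={name}"


if __name__ == "__main__":
    print(describe_stream(7, b"lobby"))
```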
34. • Refactor Stream Manager
• Happy to use async/await
• Use pipeline pattern
• Still use run_in_executor/ThreadPoolExecutor if needed
• Results
• Cleaner architecture
• Performance boost
Adopt Python asyncio
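One way the pipeline pattern with async/await could look, assuming stages connected by `asyncio.Queue` and a blocking step offloaded through `run_in_executor`. All stage names and the `blocking_decode` function are hypothetical stand-ins, not the actual Stream Manager code.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel that marks the end of the stream


def blocking_decode(frame: bytes) -> str:
    """Stand-in for a blocking call (for example, a C library binding)."""
    return frame.decode("utf-8").upper()


async def source(frames, out_q):
    """First stage: push raw frames into the pipeline."""
    for frame in frames:
        await out_q.put(frame)
    await out_q.put(_DONE)


async def decode_stage(in_q, out_q, pool):
    """Middle stage: run the blocking step in a thread pool."""
    loop = asyncio.get_running_loop()
    while True:
        frame = await in_q.get()
        if frame is _DONE:
            await out_q.put(_DONE)
            return
        decoded = await loop.run_in_executor(pool, blocking_decode, frame)
        await out_q.put(decoded)


async def sink(in_q):
    """Last stage: collect results."""
    results = []
    while True:
        item = await in_q.get()
        if item is _DONE:
            return results
        results.append(item)


async def run_pipeline(frames):
    # Bounded queues give simple backpressure between stages.
    raw_q = asyncio.Queue(maxsize=8)
    decoded_q = asyncio.Queue(maxsize=8)
    with ThreadPoolExecutor(max_workers=2) as pool:
        _, _, results = await asyncio.gather(
            source(frames, raw_q),
            decode_stage(raw_q, decoded_q, pool),
            sink(decoded_q),
        )
    return results


if __name__ == "__main__":
    print(asyncio.run(run_pipeline([b"a", b"b", b"c"])))
```

Each stage only knows about its input and output queues, so stages can be added or replaced without touching the others.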
35. • Refactor CV Worker
• Lua => Python
• Torch => PyTorch
• Use aiohttp server
• Results
• Easier to maintain/upgrade
• High performance
• GPU resources are the bottleneck
• Python package ecosystem 👍
Torch to PyTorch
36. • Debug
• Use pyflame to profile the program
• Use tracemalloc to find memory usage
• Use Valgrind to check your C++ code
• Results
• Understand the program better and catch the bugs!
Fighting Bugs
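A small sketch of the tracemalloc step, assuming the goal is to see which source lines allocated the most memory while a function ran. The helper `top_allocations` and the sample workload `build_buffers` are hypothetical.

```python
import tracemalloc


def top_allocations(func, limit=3):
    """Run func and return (result, largest allocation diffs by line)."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    result = func()  # keep the result alive so its memory is still counted
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to sorts the differences from largest to smallest.
    stats = after.compare_to(before, "lineno")
    return result, stats[:limit]


def build_buffers():
    """Sample workload: allocate about 1 MB in small chunks."""
    return [bytes(1024) for _ in range(1000)]


if __name__ == "__main__":
    _, stats = top_allocations(build_buffers)
    for stat in stats:
        print(stat)
```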
41. Please make sure you have clear goals,
practical user stories, and enough resources
Before you start building the pipeline
42. • Goals
• Timeline
• Outcome
• User stories
• Blueprint
• Concerns
• Resource
• Infra eng
• Data eng
• Researcher
Behind the Machine Learning Pipelines
43. • ✅ Data collection
• ✅ Model training
• ✅ Model evaluation
• 🧐 Model deployment
Our Progress
46. • SLO
• Monitor first
• Set up alerts
• On-call process
• Maintainability
• DevOps
• Adopt engineering best practices
Service Level Objectives and Maintainability
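As a rough illustration of how an SLO can drive alerts, the sketch below computes the remaining error budget for a request-based availability target. The function names, the 99% target, and the 25% alert threshold are assumptions for illustration, not values from the slides.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    slo_target is the required success ratio, for example 0.99.
    A negative result means the budget is already exhausted.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    if total_requests == 0:
        return 1.0
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def should_alert(slo_target, total_requests, failed_requests, threshold=0.25):
    """Alert when less than `threshold` of the budget remains (assumed policy)."""
    remaining = error_budget_remaining(slo_target, total_requests, failed_requests)
    return remaining < threshold


if __name__ == "__main__":
    print(error_budget_remaining(0.99, 10_000, 50))
    print(should_alert(0.99, 10_000, 90))
```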