9. 1. FETCHING & PREPARING DATA
Around 90% of project time is spent on getting & cleaning data
10. TYPICAL PROBLEMS
• Data is stored in multiple DBs
• Data access is behind multiple systems
• Data is missing
• Data is in the incorrect format
• Data is only available in aggregated form
• Running queries takes a long time
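To make two of these problems concrete (missing data, data in the incorrect format), here is a minimal cleaning sketch. The rows, the field names (user_id, signup, spend) and the two date formats are invented for illustration; real data will need its own rules.

```python
from datetime import datetime

# Hypothetical raw rows, as they might arrive from two different systems.
RAW_ROWS = [
    {"user_id": "17", "signup": "2017-03-01", "spend": "12.50"},
    {"user_id": "18", "signup": "01/04/2017", "spend": ""},   # other date format, missing spend
    {"user_id": "", "signup": "2017-03-09", "spend": "7"},    # missing id
]

# Assumed date formats; extend this tuple for formats found in your own data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Try each known format in turn; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean(rows):
    """Drop rows without an id, normalise dates, and mark missing spend as None."""
    cleaned = []
    for row in rows:
        if not row["user_id"]:
            continue  # a row without an id cannot be joined to anything
        cleaned.append({
            "user_id": int(row["user_id"]),
            "signup": parse_date(row["signup"]),
            "spend": float(row["spend"]) if row["spend"] else None,
        })
    return cleaned
```

Calling clean(RAW_ROWS) keeps the first two rows and drops the third. Whether to drop, flag or impute a missing value is a per-project decision; this sketch only shows one option.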
11. SOLUTION: CLOUD DATA WAREHOUSING
• Data is stored in one logical, petabyte-scale DB
• Centralised user access management
• Usually much cheaper to run than in-house solutions
• Can save (and query) raw data
• Querying is typically much faster
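Why saving raw data matters: with raw rows you can pick the aggregation at query time, which pre-aggregated data does not allow. The sketch below uses an in-memory SQLite database purely as a local stand-in for a warehouse. The events table and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite as a local stand-in; a real cloud warehouse has its own client.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "2017-03-01", 10.0),
        (1, "2017-03-02", 5.0),
        (2, "2017-03-01", 20.0),
    ],
)


def total_by(column):
    """Sum the raw amounts, grouped by a column chosen at query time."""
    # Column names cannot be bound as SQL parameters, so check against a whitelist.
    if column not in ("user_id", "day"):
        raise ValueError(f"unsupported column: {column}")
    query = (
        f"SELECT {column}, SUM(amount) FROM events "
        f"GROUP BY {column} ORDER BY {column}"
    )
    return conn.execute(query).fetchall()
```

The same raw rows answer both total_by("user_id") and total_by("day"). Had only per-day totals been stored, the per-user question could not be answered.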
15. SOLUTION: CLOUD-BASED COMPUTATION
Example: on AWS EC2, a p2.8xlarge instance has:
• 32 vCPUs
• 488 GiB RAM
• 8 NVIDIA K80 GPUs, each with 2,496 parallel processing cores and 12 GiB of GPU memory
Cost of buying one K80 yourself: $5,000
Cost of buying the equivalent hardware yourself: $50,000
Cost of running the instance in AWS: about $8 per hour
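A rough break-even calculation using the figures above. It deliberately ignores electricity, cooling, maintenance and hardware depreciation, as well as any cloud discounts, so treat it as a back-of-the-envelope sketch only.

```python
# Figures taken from the slide; everything else is simple arithmetic.
HARDWARE_COST_USD = 50_000
CLOUD_RATE_USD_PER_HOUR = 8


def break_even_hours(hardware_cost, hourly_rate):
    """Hours of cloud usage that cost as much as buying the hardware."""
    return hardware_cost / hourly_rate


hours = break_even_hours(HARDWARE_COST_USD, CLOUD_RATE_USD_PER_HOUR)
days = hours / 24  # continuous, round-the-clock usage

print(f"Break-even after {hours:.0f} hours (about {days:.0f} days of continuous use)")
```

Under these assumptions the instance would have to run non-stop for roughly 260 days before buying becomes cheaper. Training jobs that run a few hours at a time stay far below that.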
17. DEPLOYING MODELS
• ML models can take a long time to train, but the models themselves
usually don’t take much (disk/RAM) space
• Getting a prediction/result from an ML model typically doesn’t take
that much time, either (milliseconds)
• Building a REST API on top of your model allows other services to get
predictions on demand
• Use functions-as-a-service as your first choice
SERVERLESS + API GATEWAY = QUICK PREDICTION REST API
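A minimal sketch of such a prediction function. It assumes an AWS Lambda-style handler(event, context) signature with the request JSON arriving as a string in event["body"], as in an API Gateway proxy integration. The model is a hypothetical hard-coded linear model; in practice you would load your trained model once, outside the handler.

```python
import json

# Hypothetical "trained" model: a tiny linear model with made-up weights.
WEIGHTS = [0.5, -1.0, 2.0]
BIAS = 0.25


def predict(features):
    """Return the linear model's output for one feature vector."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


def handler(event, context):
    """Function-as-a-service entry point: JSON in, JSON prediction out."""
    try:
        payload = json.loads(event.get("body") or "{}")
        features = [float(x) for x in payload["features"]]
        result = predict(features)
    except (KeyError, TypeError, ValueError) as exc:
        # Malformed JSON, a missing "features" key, or a wrong feature count.
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    return {"statusCode": 200, "body": json.dumps({"prediction": result})}
```

Local usage: handler({"body": '{"features": [1, 1, 1]}'}, None) returns a 200 response carrying the prediction. Because the model is small and prediction is fast, the function fits comfortably within typical serverless memory and time limits.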