Yap Wei Yih is a senior data scientist at Firemark Labs Singapore who has a master's degree in electrical engineering. He discusses solving three key problems: 1) processing terabytes of geospatial data using AWS EMR and Athena, 2) a scalable model development lifecycle platform using SageMaker, and 3) exposing a scalable API endpoint using Lambda and API Gateway without devops resources. EMR is used to spin up Spark clusters for data cleansing, aggregation, and geospatial functions on large datasets. SageMaker provides full model lifecycle management with notebooks, training, versioning, testing, and deployment. Lambda and API Gateway expose the SageMaker endpoint to external parties through a controlled API.
AWS Education Meetup - Large Scale Data Preprocessing with SageMaker - Weiyih
1. About Me – Yap Wei Yih
• Senior Data Scientist @ Firemark Labs Singapore (IAG)
• NYP Alumni – Specialist Diploma in Business and Big Data Analytics
• Master of Science (Electrical Engineering)
• linkedin.com/in/yapweiyih
• Interest - Applied Data Science, Machine Learning
2. Topic
Large scale data pre-processing, model training and deployment using
AWS EMR, Athena and SageMaker
3. Key Problems
1. Processing terabytes (billions of observations) of geospatial data,
Spark cluster setup
2. A model development lifecycle platform
3. Scalable API endpoint, lack of DevOps resources
4. #1 - EMR & Athena
Elastic MapReduce (EMR)
✓ Spin up Spark cluster with just a few clicks
✓ Multi-user JupyterHub
✓ Data cleansing and aggregation with Scala/PySpark
✓ Easily configure the number of nodes or autoscaling
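
The "few clicks" in the console can also be scripted. The following is a minimal sketch, not the speaker's actual setup: it builds the request for boto3's EMR `run_job_flow` call with Spark and JupyterHub installed and a configurable number of core nodes. The cluster name, release label, instance type and the default role names are assumptions chosen for illustration.

```python
def build_emr_cluster_request(name, core_nodes, instance_type="m5.xlarge"):
    """Return keyword arguments for boto3's EMR run_job_flow call.

    All concrete values (release label, instance type, role names) are
    illustrative assumptions, not taken from the talk.
    """
    if core_nodes < 1:
        raise ValueError("core_nodes must be at least 1")
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.29.0",  # assumed release that ships Spark and JupyterHub
        "Applications": [{"Name": "Spark"}, {"Name": "JupyterHub"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_nodes,  # the "number of nodes" knob
                },
            ],
            # Keep the cluster running so notebooks can attach to it.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default EMR roles
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request):
    """Submit the request to EMR; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder can be used without boto3

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]
```

Calling `launch_cluster(build_emr_cluster_request("geo-preprocessing", core_nodes=4))` would start a five-node cluster under these assumptions; changing `core_nodes` is the scripted equivalent of resizing in the console.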
9. #2 - SageMaker
• Provides full Model Development Lifecycle Management capability
• Key requirements that are important for data science work:
✓ Jupyter Notebook, Exploratory Data Analysis (EDA)
✓ Support for custom algorithm containers
✓ Model Training/Versioning
✓ A/B Testing
✓ Model Endpoint Deployment
• Install both R and Python to support two main user groups
• Minimize the work of DevOps for model deployment
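
The slides do not say how A/B testing was configured. One mechanism SageMaker offers is an endpoint configuration with two production variants that split traffic by weight. The sketch below builds such a configuration for boto3's `create_endpoint_config`; the variant names, model names, instance type and the 90/10 split are hypothetical.

```python
def build_ab_endpoint_config(config_name, model_a, model_b,
                             weight_a=0.9, instance_type="ml.m5.large"):
    """Return keyword arguments for SageMaker create_endpoint_config.

    Traffic is split between two already-registered models according to
    weight_a. Names and instance type are illustrative assumptions.
    """
    if not 0.0 < weight_a < 1.0:
        raise ValueError("weight_a must be strictly between 0 and 1")

    def variant(variant_name, model_name, weight):
        return {
            "VariantName": variant_name,
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
            "InitialVariantWeight": weight,  # relative share of requests
        }

    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            variant("VariantA", model_a, weight_a),
            variant("VariantB", model_b, round(1.0 - weight_a, 6)),
        ],
    }


def create_config(request):
    """Register the configuration; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder can be used without boto3

    return boto3.client("sagemaker").create_endpoint_config(**request)
```

With `build_ab_endpoint_config("geo-ab", "model-v1", "model-v2")`, roughly 90% of requests would go to `model-v1` and 10% to `model-v2`, which lets a new model version be compared against the current one before full rollout.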
10. #3 - Lambda & API Gateway
• The SageMaker endpoint is only available within AWS services
• Lambda and API Gateway are used to expose the model to external parties
• Access by partners is controlled via API key
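
A Lambda function behind API Gateway might look like the sketch below. It assumes a proxy integration (the request body arrives in `event["body"]`), a CSV payload, and a hypothetical endpoint name read from an `ENDPOINT_NAME` environment variable; none of these details come from the talk. The optional `client` argument exists only so the handler can be exercised without AWS access.

```python
import json
import os

# Hypothetical endpoint name; in practice set via the Lambda environment.
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "geo-model-endpoint")


def _runtime_client():
    import boto3  # imported lazily so the module loads without boto3

    return boto3.client("sagemaker-runtime")


def lambda_handler(event, context, client=None):
    """Forward the API Gateway request body to the SageMaker endpoint."""
    payload = event.get("body") or ""
    if not payload:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "empty request body"}),
        }

    client = client or _runtime_client()
    response = client.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",  # assumed input format of the model
        Body=payload,
    )
    # The response body is a stream; read and decode it.
    prediction = response["Body"].read().decode("utf-8")

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"prediction": prediction}),
    }
```

The API key check does not appear in the handler because API Gateway can enforce it before Lambda is invoked: with a usage plan attached and the key required on the method, requests without a valid `x-api-key` header are rejected at the gateway.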
11. High Level Solution Architecture
[Architecture diagram; stage labels: Geospatial Data, Data Ingestion, Data Pre-processing, EDA and Modeling, Deployment, Serving, Frontend]