About Me – Yap Wei Yih
• Senior Data Scientist @ Firemark Labs Singapore (IAG)
• NYP Alumni – Specialist Diploma in Business and Big Data Analytics
• Master of Science (Electrical Engineering)
• linkedin.com/in/yapweiyih
• Interests: Applied Data Science, Machine Learning
Topic
Large scale data pre-processing, model training and deployment using
AWS EMR, Athena and SageMaker
Key Problems
1. Processing terabytes (billions of observations) of geospatial data,
and setting up a Spark cluster to do it
2. A model development lifecycle platform
3. Scalable API endpoint, lack of DevOps resources
#1 - EMR & Athena
Elastic MapReduce (EMR)
✓ Spin up a Spark cluster with just a few clicks
✓ Multi-user JupyterHub
✓ Data cleansing and aggregation with Scala/PySpark
✓ Easily configure the number of nodes or autoscaling
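The "few clicks" in the console can also be scripted. Below is a minimal sketch, assuming boto3's EMR `run_job_flow` call and the default EMR roles; the cluster name, instance type, node count and release label are illustrative assumptions, not values from the talk. The function only builds the request, so it runs without AWS credentials.

```python
def build_emr_cluster_request(name, core_nodes=2, instance_type="m5.xlarge",
                              release_label="emr-5.29.0"):
    """Return keyword arguments for boto3's EMR run_job_flow call.

    All default values are illustrative assumptions.
    """
    if core_nodes < 1:
        raise ValueError("core_nodes must be at least 1")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Spark for processing, JupyterHub for multi-user notebooks.
        "Applications": [{"Name": "Spark"}, {"Name": "JupyterHub"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # Keep the cluster running for interactive notebook work.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default role names created by EMR; an account may use others.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_cluster_request("geo-preprocessing", core_nodes=4)

# To actually launch (requires AWS credentials and the boto3 package):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Changing `core_nodes` is the scripted equivalent of configuring the number of nodes; autoscaling would need additional configuration not shown here.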
Athena
✓ Supports geospatial types: Point, Polygon
✓ Geospatial functions: ST_Intersects, ST_Contains,
ST_Distance
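As an illustration of how these functions are used, here is a minimal sketch that builds an Athena query counting the points that fall inside a polygon. The table and column names are hypothetical; `ST_POLYGON` (from WKT text), `ST_POINT` (longitude, latitude) and `ST_CONTAINS` are Athena geospatial functions. The helper only produces the SQL string and does not submit it.

```python
def points_in_polygon_query(table, lon_col, lat_col, polygon_wkt):
    """Build an Athena SQL query counting rows whose point lies in a polygon."""
    # Identifiers are interpolated into the SQL, so accept only simple names.
    for ident in (table, lon_col, lat_col):
        if not ident.replace("_", "").isalnum():
            raise ValueError(f"unsafe identifier: {ident!r}")
    # Escape single quotes so the WKT text is a valid SQL string literal.
    wkt = polygon_wkt.replace("'", "''")
    return (
        f"SELECT COUNT(*) AS n_points\n"
        f"FROM {table}\n"
        f"WHERE ST_CONTAINS(ST_POLYGON('{wkt}'), "
        f"ST_POINT({lon_col}, {lat_col}))"
    )


# Hypothetical table "trips" with longitude/latitude columns.
query = points_in_polygon_query(
    "trips", "lon", "lat",
    "polygon ((103.6 1.2, 104.0 1.2, 104.0 1.5, 103.6 1.5, 103.6 1.2))",
)
print(query)
```

The resulting string could be pasted into the Athena console or submitted through an Athena client.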
Why are geospatial functions important?
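One way to read this question: matching a coordinate to a region is a geometric test, not a simple equality join, so it needs dedicated functions. The sketch below shows the kind of test a function such as ST_Contains answers, using the standard ray-casting method in plain Python. It is an illustration of the concept only, not how Athena implements it, and the square used here is made-up data.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside the polygon (ray casting).

    polygon is a list of (x, y) vertices in order; the last vertex is
    implicitly connected back to the first. Points exactly on an edge
    are not handled specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the right of the point.
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 2, square))  # False
```

Running a test like this over billions of points and many polygons is where a distributed or managed engine becomes useful.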
#2 - SageMaker
• Provides full Model Development Lifecycle Management capability
• Key requirements that are important for data science work:
✓ Jupyter Notebook, Exploratory Data Analysis (EDA)
✓ Support for custom algorithm containers
✓ Model Training/Versioning
✓ A/B Testing
✓ Model Endpoint Deployment
• Install both R and Python to support two main user groups
• Minimize the work of DevOps for model deployment
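For the A/B testing item, SageMaker lets one endpoint split traffic between model variants. Below is a minimal sketch of the request for boto3's SageMaker `create_endpoint_config` call, assuming two models that already exist; the model names, instance type and 90/10 split are illustrative assumptions. The function only builds the request and does not call AWS.

```python
def build_ab_endpoint_config(config_name, model_a, model_b, weight_a=0.9,
                             instance_type="ml.m5.large"):
    """Return keyword arguments for SageMaker's create_endpoint_config.

    Traffic is split between two production variants according to
    weight_a and (1 - weight_a). Defaults are illustrative assumptions.
    """
    if not 0.0 < weight_a < 1.0:
        raise ValueError("weight_a must be strictly between 0 and 1")

    def variant(name, model_name, weight):
        return {
            "VariantName": name,
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": weight,
        }

    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            variant("VariantA", model_a, weight_a),
            variant("VariantB", model_b, 1.0 - weight_a),
        ],
    }


# Hypothetical model names; both models must already exist in SageMaker.
config = build_ab_endpoint_config("geo-model-ab", "geo-model-v1", "geo-model-v2")

# To apply (requires AWS credentials and the boto3 package):
#   import boto3
#   boto3.client("sagemaker").create_endpoint_config(**config)
```

An endpoint created from this configuration would route roughly 90% of requests to the first model and 10% to the second.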
#3 - Lambda & API Gateway
Frontend
• A SageMaker endpoint is only accessible from within AWS services
• Lambda and API Gateway are used to expose the model to external parties
• Partner access is controlled via API keys
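The Lambda function in this setup can be very small. Below is a minimal sketch, assuming an API Gateway proxy integration (the request body arrives as a string in `event["body"]`) and a JSON-serving model; the endpoint name is hypothetical. The SageMaker runtime client is passed in so the handler can be exercised without AWS. API key checking is done by API Gateway itself, so it does not appear in the code.

```python
ENDPOINT_NAME = "geo-model-endpoint"  # hypothetical endpoint name


def make_handler(runtime_client, endpoint_name=ENDPOINT_NAME):
    """Create a Lambda handler that forwards requests to a SageMaker endpoint.

    runtime_client is expected to behave like boto3's "sagemaker-runtime"
    client, i.e. provide invoke_endpoint(EndpointName, ContentType, Body).
    """
    def lambda_handler(event, context):
        # With proxy integration, the HTTP body is a string (or None).
        payload = event.get("body") or "{}"
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=payload,
        )
        # The response body is a stream of bytes returned by the model.
        prediction = response["Body"].read().decode("utf-8")
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": prediction,
        }

    return lambda_handler


# In a deployed Lambda (requires the boto3 package, present in the runtime):
#   import boto3
#   lambda_handler = make_handler(boto3.client("sagemaker-runtime"))
```

Injecting the client keeps the handler easy to test locally with a stand-in object before it is wired to API Gateway.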
High Level Solution Architecture
[Architecture diagram; labelled components: Geospatial Data, Data Ingestion, Data Pre-processing, EDA and Modeling, Deployment, Serving, Frontend]
Questions

AWS Education Meetup - Large scale data preprocessing with SageMaker - Weiyih
