© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Harmonize, Search, and Analyze Scientific Datasets
on AWS
Case Study: American Heart Association, Precision Medicine Platform
Bob Strahan, Sr. Consultant, AWS Professional Services
Dr. Taha Kass-Hout, Strategic Advisor to AHA
June 13, 2017
Harmonize, Search and Analyze Scientific
Datasets on AWS
Part 1: Theory: Concepts, Challenges, and a Reference Architecture
Part 2: Practice: AHA’s Precision Medicine Platform
Part 1:
Concepts, Challenges, and a
Reference Architecture
Tell them what you’re going to tell them
Harmonize - enable search and analysis across datasets
Search – find the data you care about
Analyze – prove your hypotheses - create and share insights –
advance science!
Do it all on AWS!
Scenario
To prove or refute your hypothesis you need to find existing
datasets, combine them, and analyze their data.
Discordant Datasets
Created at different times by different people
Use different names to mean the same thing, and the same names to mean
different things.
Use different units of measurement, different scales, different categories
Instruments weren’t calibrated to a common standard.
Data quality issues
Harmonization
Harmonization Challenges
Sometimes harmonization is easier said than done.
From easier to harder:
Standardize variable names
Standardize units of measurement
Align continuous with categorized values
Align readings from different instruments / calibrations
Align measurements when procedures vary
Align survey responses when questions vary
Information missing from dataset
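As a rough illustration of the two easier items on this list, the sketch below renames dataset-specific columns to shared names and converts a measurement to a common unit. The dataset names, column names, and sample records are hypothetical and do not come from the talk. Plain Python is used so the example runs anywhere; the same logic could be expressed in Spark.

```python
# Hypothetical mapping from each dataset's local column names to shared names.
RENAMES = {
    "study_a": {"wt_lb": "weight", "sbp": "systolic_bp"},
    "study_b": {"weight_kg": "weight", "sys_bp_mmhg": "systolic_bp"},
}

# Factor that converts each dataset's weight unit to kilograms.
WEIGHT_TO_KG = {"study_a": 0.45359237, "study_b": 1.0}


def harmonize(dataset, record):
    """Return a copy of record with shared column names and weight in kg."""
    out = {RENAMES[dataset].get(name, name): value for name, value in record.items()}
    out["weight"] = round(out["weight"] * WEIGHT_TO_KG[dataset], 1)
    out["source"] = dataset  # keep provenance so results can be traced back
    return out


a = harmonize("study_a", {"wt_lb": 176.0, "sbp": 128})
b = harmonize("study_b", {"weight_kg": 80.0, "sys_bp_mmhg": 131})
print(a)
print(b)
```

Keeping the mappings in data rather than in code makes the per-dataset, per-variable decisions easy to review.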
Harmonize on AWS
Amazon S3
Python or R
Jupyter Notebooks
Apache Spark
Amazon EMR (compute)
Search and Discovery on AWS
• Quickly find relevant data
• Preliminary analysis of filtered data
• Link to harmonization notebook
• Amazon Elasticsearch Service
• Index created from Spark harmonization
• Data (search)
• Metadata (UI filters)
• Filter accordion panel & Kibana dashboard
• UI Containers hosted on Amazon ECS
Analyze Datasets - Data Science
- Python or R
- Jupyter Notebooks
- Spark
- EMR
- Create clear, beautiful, executable, reproducible science
Same platform as harmonization
From: Notebook Gallery
Analyze Datasets – BI
- Amazon Athena
- Serverless SQL
- Data stays on Amazon S3
- Tables created by Spark harmonization
- Amazon QuickSight
- Serverless, easy, fast, beautiful, shareable
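To make "data stays on Amazon S3" concrete, here is a minimal sketch of the kind of table definition Athena accepts for files that already sit in S3. The slide says the tables are created by the Spark harmonization step; this example only builds the DDL text such a step might issue. The database, table, column names, bucket, and the choice of Parquet are all assumptions made for illustration.

```python
def create_table_ddl(database, table, columns, s3_location):
    """Return an Athena CREATE EXTERNAL TABLE statement for Parquet files on S3."""
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical names throughout; none of these appear in the talk.
ddl = create_table_ddl(
    "harmonized",
    "study_a",
    [("source", "string"), ("weight", "double"), ("systolic_bp", "int")],
    "s3://example-harmonized-datasets/study_a/data/",
)
print(ddl)
```

Because the table is external, dropping it removes only the definition and leaves the harmonized files in S3 untouched.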
Reference Architecture
Put it all together
[Diagram, built up over three slides. Workflow steps: 1 Ingest, 2 Explore, 3 Harmonize, 4 Store.]
[Harmonize: raw datasets, Amazon EMR (Spark), Amazon S3 (Raw and Harmonized).]
[Search & Discover: adds Amazon Elasticsearch Service and Amazon ECS behind an ALB, running NGINX, search filters, Kibana, and an ES proxy.]
[Analyze: adds Amazon Athena.]
More?
Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS
(AWS Big Data Blog)
• Companion Sample App – deploy reference architecture and
samples with one click launch button
Wrap up – tell them what you told them
Harmonize - enable search and analysis across datasets
Search – find the data you care about
Analyze – prove your hypotheses - create and share insights –
advance science!
Do it all on AWS!
Part 2:
American Heart Association’s
Precision Medicine Platform
WHAT REALLY IMPACTS THE HEART
Cardiovascular diseases are the number 1 cause of death in the world.
Every 2 seconds someone around the world dies from cardiovascular disease.
Over 75% of cardiovascular disease deaths take place in low- and middle-income countries.
The global cost of cardiovascular disease is approximately $900 billion and will exceed $1 trillion by 2035.
$3.7 billion in research since 1949
INTEGRATING
DISPARATE DATA SOURCES
JOINING FORCES
TO IMPROVE CARDIOVASCULAR HEALTH
precision.heart.org
~400
Registered Users
#AHAPrecision
Thank you!

Harmonize, Search, Analyze, and Share Scientific Datasets on AWS | AWS Public Sector Summit 2017


Editor's Notes

  1. Hello everybody, thank you all for coming! I’m Bob Strahan. I’m a senior consultant in the public sector professional services team here at AWS. It’s my honor to share this session with Dr. Taha Kass-Hout. Taha is a practicing cardiologist. He’s also a technologist - the first Chief Health Informatics Officer at the FDA and creator of the award-winning openFDA platform. He is a strategic advisor to the American Heart Association, and it’s in that capacity that he’ll talk with us today about the AHA Precision Medicine Platform initiative. <next>
  2. The title of our talk is “Harmonize, Search, and Analyze Scientific Datasets on AWS”. It’s all about creating opportunities for your researchers and data scientists to uncover and share new insights. We’ve split the session into two parts: In the first part - labelled ‘Theory’ - I’ll introduce some of the concepts and challenges around dataset harmonization, search, and analysis, and briefly show you a reference architecture that helps address these challenges. I’ll try to keep my part of the talk to under 20 minutes, so we can quickly get to part 2 where Dr. Kass-Hout will tell us about the recently launched AHA Precision Medicine Platform. This is an initiative that aims to accelerate scientific discovery related to the causes and possible cures of stroke and cardiovascular disease. He’ll tell us why the platform is important, and how it applies the concepts of harmonization, search, and analysis across multiple datasets, with the ultimate goal of saving lives. <next>
  3. Part 1
  4. So, here’s what I plan to talk about in Part 1. We have 3 concepts that apply to the AHA Precision Medicine Platform, and that can also apply to any platform that aims to enable research across multiple datasets: First, Harmonization. You might not be familiar with this concept, so I’ll try to explain what it means. Why might you need to do it? And how might you accomplish it cost-effectively, at any scale? Then Search. Before you can know what datasets will be useful, you’ll need to have a way to see what’s inside them. We’ll discuss how you can build a search index, and show you an example search & discovery web page that you can use to search for data across many datasets. And then Analysis. Once you’ve searched, and found datasets with data that you care about, you’ll want to dig deeper into them to find patterns, correlations, possible causes and effects. You’ll want to access the latest and greatest in data science and business intelligence tools, with a way to capture and share your insights. And of course you’ll want to do all this on AWS, and take advantage of the speed, scalability, cost effectiveness, and easy access to the latest in technology. I should also mention that at the end I’ll point you to a blog post with a sample application, so if you do (hopefully) find yourselves curious to know more, you’ll have the opportunity to dig deeper and explore these things for yourself. <next>
  5. Let’s set up a hypothetical scenario. Imagine you are a researcher. Doesn’t matter what kind of researcher. What does matter is that you’ve come up with a hypothesis. It’s an exciting hypothesis, one you think might make a big impact if it can be proven. What you need now is data – lots of it – so that you can see if your hypothesis holds. The data, of course, must contain the information fields you need, and should ideally span multiple jurisdictions, timeframes, and demographics, so you can see if your hypothesis is widely applicable. <next>
  6. You suspect that there are lots of potentially relevant datasets out there. And sure enough, you do find many datasets. But the problem you quickly encounter is that they don’t conform to any common standard. Because they were created by different people at different times, often for different purposes, they can use different names to mean the same thing, and the same name to mean different things. They can use different units of measurement, and different scales. Some values are represented as continuous variables while others use discrete categories. And even when categories are used their definitions may not align across datasets. Instrument or sensor readings can be skewed from dataset to dataset because different instruments were used, which may be calibrated differently. These issues make it very hard for you to filter and compare data across the datasets, or to do any sort of meaningful analysis across the superset of data. Each dataset might be fine on its own, but they don’t work well together. They are discordant. <next>
  7. You need to make your datasets play nice together. Minimize conflicting standards. Harmonize them. Focus on the information that the datasets have in common, and that you care about, and convert this information to a common standard (if possible!). <next>
  8. Harmonization can be a lot easier said than done. Sometimes it’s easy – like giving a variable a standard name. Sometimes it’s impossible, like when key information is simply missing and can’t be imputed. Sometimes it’s possible, but you lose fidelity of data, like when you try to align continuous and categorized variables. Sometimes it’s possible, but complicated, like when you try to align readings from differently calibrated instruments. Often harmonization is an iterative process. Look for the 80/20 rule – find a handful of variables that when aligned will give you the most value for the least cost. Then iterate. Decisions need to be made dataset by dataset, variable by variable. Make tradeoffs with the researchers’ goals in mind. It’s not always easy, which is why we love our statisticians & data scientists. Our goal is to provide a great platform to help these good folks do their job. <next>
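The loss of fidelity when aligning continuous and categorized variables can be seen in a small sketch. Assume one dataset records exact ages while another records only age bands. To compare them, the exact ages have to be mapped down to the coarser bands. The band boundaries and labels below are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical age bands used by the categorized dataset.
BAND_STARTS = [18, 40, 65]  # lower bound of each band after the first, in years
BAND_LABELS = ["under 18", "18-39", "40-64", "65+"]


def age_to_band(age_years):
    """Map an exact age to the coarser band used by the other dataset."""
    return BAND_LABELS[bisect_right(BAND_STARTS, age_years)]


ages = [17, 18, 39.5, 40, 64, 65, 90]
bands = [age_to_band(a) for a in ages]
print(bands)
```

The mapping only works in one direction: ages 18 and 39.5 both become "18-39", and the original values cannot be recovered from the band.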
  9. Here’s our recommended approach. Start by storing your raw source datasets in S3. S3 lets you secure your data with encryption and access policies, and once your data is in S3 you can do all kinds of cool things with it. We’re also going to store the harmonized versions of each dataset in S3, once we’ve completed the harmonization. Code your harmonization logic using Python or R. These languages are de facto standards for data exploration and manipulation, and they’re very familiar to data scientists. They have huge communities where you can get help for many common problems. Jupyter Notebooks is a very nice open-source web application for integrating your documentation, exploratory analysis, and harmonization code into one self-contained notebook artifact. You can share these notebooks with the researchers who will consume your harmonized data. In fact, you can publish the notebook itself in S3 along with the harmonized dataset it generates. This will unambiguously define how the data was harmonized, in a way that anyone can review and reproduce for themselves. Apache Spark is a fast in-memory distributed compute engine, which lets you easily scale your data and compute workload across clusters of multiple nodes. Spark can be programmed easily with Python, and with R, and gives you access to a rich set of data manipulation and machine learning libraries. It’s a great platform for addressing the full gamut of harmonization challenges. Amazon EMR provides clusters in the cloud to run Apache Spark. The clusters can be sized to handle the compute workload for whatever size and number of datasets you need to harmonize. Spark running on EMR clusters can easily access the source datasets in the S3 buckets where you stored them. Your harmonization code should also save its output - the harmonized version of the dataset - to S3, for researchers to access and analyze later. <next>
  10. Now that your datasets have variables that are called the same thing and that mean the same thing, you want to make them easily searchable. You, and other researchers, need to quickly find data that are relevant to your hypothesis. Spin up an Amazon Elasticsearch Service cluster. With a few more lines of code in your harmonization notebooks, you can easily save the harmonized data you want to search on to an index in the Elasticsearch cluster. You can build a web search UI (like the one shown from our sample app) to let you filter on the harmonized variables. In our sample we included an embedded dashboard to display aggregate information on the records that match your search criteria, to let you do preliminary analyses, and assess which datasets have relevant data. A nice touch is to provide a link to an HTML version of the harmonization notebook for each dataset directly in the search UI. When users have narrowed down their search to some likely datasets, they can easily click the link to examine the notebook to see how the harmonization was done, and verify that it suits their needs. <next>
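
As a rough illustration of what a search UI might send to Elasticsearch when a user applies filters on harmonized variables, the sketch below builds a request body using the standard query DSL (`bool`/`filter` with `term` and `range` clauses, plus a `terms` aggregation to count matches per dataset). The field names, including `source_dataset`, are hypothetical, and this is not the code from the sample app.

```python
import json


def build_search(filters, ranges=None, size=20):
    """Build an Elasticsearch request body from UI filter selections.

    filters: {field: exact value}, ranges: {field: (low, high)}.
    All field names passed in are assumed to exist in the index.
    """
    clauses = [{"term": {field: value}} for field, value in sorted(filters.items())]
    for field, (low, high) in sorted((ranges or {}).items()):
        clauses.append({"range": {field: {"gte": low, "lte": high}}})
    return {
        "size": size,
        # Filter clauses narrow the result set without affecting relevance scores.
        "query": {"bool": {"filter": clauses}},
        # Count matching records per dataset, to show which datasets are relevant.
        "aggs": {"by_dataset": {"terms": {"field": "source_dataset"}}},
    }


body = build_search({"systolic_category": "high"}, {"weight_kg": (60, 90)})
payload = json.dumps(body)  # what would be sent as the body of a _search request
```

The per-dataset aggregation is one way to support the preliminary assessment described above: it tells a researcher which datasets contribute records to the current filter before they open any of them.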
  11. OK, so using the search capability you have now located several harmonized datasets that you want to analyze to prove (or refute) your exciting hypothesis. What tools can you use? Well, one great option is to reuse the same technology we recommended for doing the harmonization, for all the same reasons. Python or R with Apache Spark on EMR clusters gives you access to many of the latest open source data analytics, stats, and machine learning tools. You can spin up a data science workspace in AWS, with all the tools pre-installed. You can make it as small or as big as you need it to be, depending on the combined size of your datasets and the kinds of analysis you want to run, pay only for what you use, and even take advantage of low-cost Spot Instances. All your datasets stay safely stored in S3. This separation of storage and compute gives you the flexibility to size and resize your compute cluster as needed, and it also means you can shut down your EMR clusters to save money when your analysis jobs finish, and you won’t lose your data. As we discussed when we talked about harmonization, you can use notebooks to encapsulate your analysis code with its output and with documentation to create beautiful live executable and reproducible scientific artifacts that you can easily share with others. <next>
  12. Here’s an alternative way to analyze your datasets – attractive if you are more comfortable with relational databases and BI tools than with data science or programming. The Amazon Athena service lets you define relational database tables to describe your data, which stays stored in S3. Your harmonization notebook can be made to automatically define Athena tables for you, so that when harmonization is done, you can immediately start running SQL queries in the Athena console, or from SQL clients or BI tools such as Tableau. Athena is one of our ‘serverless’ services, meaning that it’s just always available - you don’t need to worry about sizing clusters or anything like that. You pay only for the queries you run. And speaking of serverless services, you’ll also want to try out Amazon QuickSight. QuickSight is our business analytics service. It is compatible with a wide variety of data sources, including Athena. This means that within minutes you can connect QuickSight to your harmonized datasets in S3, via Athena, and create beautiful charts and tables to reveal patterns and relationships in the data. QuickSight includes an in-memory calculation engine called SPICE – SPICE lets you build really fast interactive visualizations and dashboards. And with QuickSight you can create ‘stories’ from your data to reveal insights in a logical progression. Interactive story boards and dashboards can be easily saved and shared with others in your organization on both web browsers and the QuickSight mobile app. <next>
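
One way a harmonization notebook could define an Athena table automatically is to generate the DDL statement from the harmonized schema. The sketch below only builds the statement text; it assumes the harmonized output was written as Parquet (the talk does not specify a format), and the database, table, column, and bucket names are hypothetical. Submitting the statement to Athena is left out.

```python
def athena_ddl(table, columns, location, database="harmonized"):
    """Return a CREATE EXTERNAL TABLE statement for data already stored in S3.

    columns: list of (name, type) pairs, e.g. ("weight_kg", "double").
    Assumes Parquet files under `location`; adjust STORED AS for other formats.
    """
    column_list = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_list}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


ddl = athena_ddl(
    "incidents",
    [("source_dataset", "string"), ("weight_kg", "double")],
    "s3://example-bucket/harmonized/incidents/",  # hypothetical bucket and prefix
)
print(ddl)
```

Because the table is external, the statement only records where the files live and how to read them; the data itself stays in S3, which matches the storage model described above.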
  13. So we’ve talked about why and how you can harmonize, search, and analyze your datasets on AWS. Let’s take a quick look at how we can pull these steps together into a complete architecture. <next>
  14. Starting with harmonization. The harmonization process is executed on an Amazon EMR cluster running in a VPC in your account. The cluster has Apache Spark and Jupyter installed, so that once it’s running you can securely connect to the Jupyter notebook from your web browser to create and run the harmonization notebooks. Your harmonization code reads the raw datasets from S3, transforms the variables and values as needed, and saves the output harmonized dataset back to S3, possibly to a different bucket. When you are done harmonizing, you can download and save your notebooks in your source code repository, and then you can terminate the EMR cluster. No sense paying for it when you’re not using it. You can always fire up a new cluster if/when you want to run harmonization again. <next>
  15. To add a search & discovery portal for your datasets, create an Amazon Elasticsearch Service domain. Use the harmonization process to index the data that you want to make searchable. Elasticsearch has a connector for Apache Spark that makes it really easy to efficiently bulk save the data we want to index, directly from the harmonization notebook. To create a web UI search page, we’ll host the required components as Docker containers on the Amazon ECS service. The UI is deployed across two Availability Zones in your VPC to ensure fault tolerance. An Application Load Balancer provides a single HTTP URL as the entry point. You can run the search UI as a standalone site, or you can incorporate it into an existing website. <next>
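
The Spark connector takes care of batching for you, but it can help to see the shape of what a bulk save looks like. The sketch below does not use the connector; it builds the newline-delimited JSON body that the Elasticsearch `_bulk` endpoint accepts, with one action line followed by one document line per record. The index name and record fields are hypothetical, and older Elasticsearch versions also expect a `_type` in the action line, which is omitted here.

```python
import json


def bulk_body(index, records, id_field=None):
    """Build the newline-delimited body for an Elasticsearch _bulk request.

    Each record becomes two lines: an "index" action, then the document itself.
    If id_field is given, its value is used as the document id.
    """
    lines = []
    for record in records:
        action = {"index": {"_index": index}}
        if id_field is not None:
            action["index"]["_id"] = str(record[id_field])
        lines.append(json.dumps(action))
        lines.append(json.dumps(record))
    # The bulk format requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)


records = [
    {"id": 1, "source_dataset": "study_a", "weight_kg": 68.0},
    {"id": 2, "source_dataset": "study_b", "weight_kg": 70.0},
]
body = bulk_body("harmonized", records, id_field="id")
```

Using a stable document id means that re-running harmonization overwrites existing documents instead of duplicating them, which suits the iterative process described earlier.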
  16. Add Athena and QuickSight to the architecture to provide the ability to create and share analyses of your datasets. There are no servers to deploy – just log into the console and start querying! Or, as I mentioned earlier, you can also reuse the EMR cluster with Spark and Jupyter notebooks to do your research and analysis, leveraging the latest and greatest open source data science tools and the power of programming in Python or R. A couple of final points on this architecture: It is S3 centric. We can securely store multiple versions of raw and processed datasets in S3. EMR can easily access datasets stored in S3, as can Athena, QuickSight and many other services. S3 is our source of truth – our data lake. Data lake centric architectures can be very powerful – they let you tap into your data from a wide variety of specialized tools and services that you can deploy as and when needed. I definitely recommend that you look into data lake architectures on AWS if you haven’t already. Is this the only way to align and combine discordant datasets? Probably not. I’m sure there are other approaches. Maybe you’ll come up with new tools, new techniques, and opportunities to leverage new AWS services like Glue, Redshift Spectrum, and more to improve on this reference architecture. That would be great! Hopefully this reference will help you get started and give you a framework for thinking about your approach. <next>
  17. We skimmed a lot of topics just now. As I mentioned at the beginning, if you are interested in digging deeper, you should check out our AWS Big Data Blog post on the topic. The blog post comes with a companion sample application. Clicking the ‘launch stack’ button fires off CloudFormation templates that will create the entire reference architecture that we just discussed, in your account, ready for you to try out. The blog post uses publicly available police incident datasets from several US cities to allow you to try out all the concepts we discussed. All the code is available on GitHub, so dig as deep as you like, and reuse any or all of it as you see fit. You can find the blog post by searching the keywords ‘Harmonize Search Analyze AWS’. <next>
  18. So, here are the main points to remember: To find and analyze data that spans multiple datasets, you first have to harmonize them – create variables that are called the same thing, that mean the same thing, and whose values can be meaningfully combined and compared. Provide a search page where researchers can apply their filters to find which datasets have the records they want to use for their analysis. The search UI should also include a dashboard to enable some fast preliminary analysis of the combined data. Analyze the datasets using the latest data science and/or business intelligence tools, and create rich documents to share your insights with the world. Prove your exciting hypothesis, and advance the frontiers of science! And, of course, do it all on AWS! Securely store your datasets on S3 and build out a data lake strategy. Take advantage of the fact that you can spin up an entire environment in a matter of minutes to experiment, quickly try new techniques, and get to your discoveries faster. <next>
  19. OK. I’m relieved to tell you that’s the end of part 1. We’ll hopefully have time for a few questions at the end, but for now it’s my turn to relax and be inspired as Taha makes all these concepts relevant by telling us about the AHA Precision Medicine Platform. Thank you all for listening! Now, over to Dr. Taha Kass-Hout! Taha…
  20. Here’s the problem…
  21. Systems coming together