Scenario: You have come up with an exciting hypothesis, and now you are keen to find and analyze as much data as possible to prove (or refute) it. There are many datasets that might be applicable, but they have been created at different times by different people and don’t conform to any common standard. They use different names for variables that mean the same thing and the same names for variables that mean different things. They use different units of measurement and different categories. Some have more variables than others. And they all have data quality issues. In this session, we explore some of the challenges involved in doing research across multiple datasets. We offer an architecture to support dataset harmonization, search, analysis, and sharing of results or insights, using a combination of managed and serverless services such as Jupyter Notebooks and Apache Spark on Amazon EMR, Amazon Elasticsearch Service, Amazon S3, Amazon Athena, and Amazon QuickSight. Dr. Taha Kass-Hout, representing the American Heart Association (AHA), will take the podium to describe how AHA and AWS have worked together to implement these techniques in the recently launched AHA Precision Medicine Platform (https://precision.heart.org), an initiative that brings together researchers and practitioners from around the globe to access, analyze, and share volumes of cardiovascular and stroke data to accelerate clinical and population health research and generate evidence around the care of patients at risk of cardiovascular disease – the number one killer in the United States and a leading global health threat. (http://newsroom.heart.org/news/precision-medicine-platform-now-open-for-collaborative-discovery-about-cardiovascular-disease) Learn More: https://aws.amazon.com/government-education/
2. Harmonize, Search and Analyze Scientific
Datasets on AWS
Part 1: Theory: Concepts, Challenges, and a Reference Architecture
Part 2: Practice: AHA’s Precision Medicine Platform
4. Tell them what you’re going to tell them
Harmonize - enable search and analysis across datasets
Search – find the data you care about
Analyze – prove your hypotheses, create and share insights – advance science!
Do it all on AWS!
5. Scenario
To prove or refute your hypothesis you need to find existing
datasets, combine them, and analyze their data.
6. Discordant Datasets
Created at different times by different people
Use different names to mean the same thing, and the same names to mean
different things.
Use different units of measurement, different scales, different categories
Instruments weren’t calibrated to a common standard.
Data quality issues
8. Harmonization Challenges
Sometimes harmonization is easier said than done.
Easier → Harder (in roughly increasing order of difficulty):
- Standardize variable names
- Standardize units of measurement
- Align continuous with categorized values
- Align readings from different instruments / calibrations
- Align measurements when procedures vary
- Align survey responses when questions vary
- Information missing from dataset
10. Search and Discovery on AWS
• Quickly find relevant data
• Preliminary analysis of filtered data
• Link to harmonization notebook
• Amazon Elasticsearch Service
• Index created from Spark harmonization
• Data (search)
• Metadata (UI filters)
• Filter accordion panel & Kibana dashboard
• UI Containers hosted on Amazon ECS
11. Analyze Datasets - Data Science
- Python or R
- Jupyter Notebooks
- Spark
- EMR
- Create clear, beautiful, executable, reproducible science
Same platform as harmonization
From: Notebook Gallery
12. Analyze Datasets – BI
- Amazon Athena
- Serverless SQL
- Data stays on Amazon S3
- Tables created by Spark harmonization
- Amazon QuickSight
- Serverless, easy, fast, beautiful, shareable
15. Reference Architecture: Harmonize, Search & Discover
[Architecture diagram: (1) Ingest raw datasets into Amazon S3; (2) Explore and (3) Harmonize them with Amazon EMR (Spark); (4) Store the harmonized output back to Amazon S3 (Raw and Harmonized buckets) and index it into Amazon Elasticsearch Service. The search UI (search filters, Kibana, NGINX, and Elasticsearch proxies) runs as containers on Amazon ECS behind an ALB.]
16. Reference Architecture: Harmonize, Search & Discover, Analyze
[Same architecture diagram as slide 15, with Amazon Athena added alongside Amazon S3 to query the harmonized datasets for the Analyze phase.]
17. More?
Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS
(AWS Big Data Blog)
• Companion Sample App – deploy the reference architecture and samples with a one-click launch button
18. Wrap up – tell them what you told them
Harmonize - enable search and analysis across datasets
Search – find the data you care about
Analyze – prove your hypotheses, create and share insights – advance science!
Do it all on AWS!
20. WHAT REALLY IMPACTS THE HEART
[Infographic:]
- Over 75% of cardiovascular disease deaths take place in low- and middle-income countries.
- Every 2 seconds, someone around the world dies from cardiovascular disease.
- Cardiovascular diseases are the number one cause of death in the world.
- The global cost of cardiovascular disease is approximately $900 billion and will exceed $1 trillion by 2035.
Hello everybody, thank you all for coming!
I’m Bob Strahan. I’m a senior consultant in the public sector professional services team here at AWS.
It’s my honor to share this session with Dr. Taha Kass-Hout.
Taha is a practicing cardiologist. He’s also a technologist - the first Chief Health Informatics Officer at the FDA and creator of the award-winning openFDA platform.
He is a strategic advisor to the American Heart Association, and it’s in that capacity that he’ll talk with us today about the AHA Precision Medicine Platform initiative.
<next>
The title of our talk is “Harmonize, Search, and Analyze Scientific Datasets on AWS”.
It’s all about creating opportunities for your researchers and data scientists to uncover and share new insights
We’ve split the session into two parts:
In the first part - labelled ‘Theory’ - I’ll introduce some of the concepts and challenges around dataset harmonization, search, and analysis, and briefly show you a reference architecture that helps address these challenges.
I’ll try to keep my part of the talk to under 20 minutes, so we can quickly get to part 2 where Dr. Kass-Hout will tell us about the recently launched AHA Precision Medicine Platform. This is an initiative that aims to accelerate scientific discovery related to the causes and possible cures of stroke and cardiovascular disease. He’ll tell us why the platform is important, and how it applies the concepts of harmonization, search, and analysis across multiple datasets, with the ultimate goal of saving lives.
<next>
Part 1
So, here’s what I plan to talk about in Part 1.
We have 3 concepts that we applied to the AHA Precision Medicine Platform, and that can also apply to any platform that aims to enable research across multiple datasets:
First, Harmonization.
You might not be familiar with this concept, so I’ll try to explain what it means, why you might need to do it, and how you might accomplish it cost-effectively, at any scale.
Then Search.
Before you can know what datasets will be useful, you’ll need to have a way to see what’s inside them. We’ll discuss how you can build a search index, and show you an example search & discovery web page that you can use to search for data across many datasets.
And then Analysis.
Once you’ve searched, and found datasets with data that you care about, you’ll want to dig deeper into them to find patterns, correlations, possible causes and effects. You’ll want to access the latest and greatest in data science and business intelligence tools, with a way to capture and share your insights.
And of course you’ll want to do all this on AWS, and take advantage of the speed, scalability, cost effectiveness, and easy access to the latest in technology.
I should also mention that at the end I’ll point you to a blog post with a sample application, so if you do (hopefully) find yourselves curious to know more, you’ll have the opportunity to dig deeper and explore these things for yourself. <next>
Let’s set up a hypothetical scenario.
Imagine you are a researcher.
Doesn’t matter what kind of researcher.
What does matter is that you’ve come up with a hypothesis. It’s an exciting hypothesis, one you think might make a big impact if it can be proven.
What you need now is data – lots of it – so that you can see if your hypothesis holds.
The data, of course, must contain the information fields you need, and should ideally span multiple jurisdictions, timeframes, and demographics, so you can see if your hypothesis is widely applicable.
<next>
You suspect that there are lots of potentially relevant datasets out there.
And sure enough, you do find many datasets. But the problem you quickly encounter is that they don’t conform to any common standard.
Because they were created by different people at different times, often for different purposes, they can use different names to mean the same thing, and the same name to mean different things.
They can use different units of measurement, and different scales.
Some values are represented as continuous variables while others use discrete categories.
And even when categories are used their definitions may not align across datasets.
Instrument or sensor readings can be skewed from dataset to dataset because different instruments were used, which may be calibrated differently.
These issues make it very hard for you to filter and compare data across the data sets, or to do any sort of meaningful analysis across the superset of data.
Each dataset might be fine on its own, but they don’t work well together. They are discordant.
<next>
You need to make your datasets play nice together. Minimize conflicting standards. Harmonize them.
Focus on the information that the datasets have in common, and that you care about, and convert this information to a common standard (if possible!).
<next>
Harmonization can be a lot easier said than done.
Sometimes it’s easy – like giving a variable a standard name.
Sometimes it’s impossible, like when key information is simply missing and can’t be imputed.
Sometimes it’s possible, but you lose data fidelity, like when you try to align continuous and categorized variables.
Sometimes it’s possible, but complicated, like when you try to align readings from differently calibrated instruments.
Often harmonization is an iterative process. Look for the 80/20 rule – find a handful of variables that when aligned will give you most value for least cost. Then iterate.
Decisions need to be made dataset by dataset, variable by variable. Make tradeoffs with the researchers’ goals in mind.
It’s not always easy, which is why we love our statisticians & data scientists. Our goal is to provide a great platform to help these good folks do their job.
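The fidelity loss from aligning continuous with categorized values is easy to demonstrate. Here is a minimal sketch in plain Python, assuming (hypothetically) that one dataset records age in years while another only records coarse age bands, so harmonizing means collapsing the continuous values onto the coarser scale:

```python
import bisect

# Hypothetical age bands used by the "categorized" dataset.
AGE_BAND_EDGES = [18, 40, 65]                      # band boundaries
AGE_BAND_LABELS = ["<18", "18-39", "40-64", "65+"]  # one label per band

def to_age_band(age_years):
    """Map a continuous age onto the coarser categorical scale.

    Harmonizing in this direction is lossy: 41 and 63 both become
    "40-64", and the original values can't be recovered afterwards.
    """
    return AGE_BAND_LABELS[bisect.bisect_right(AGE_BAND_EDGES, age_years)]

ages = [12, 18, 39, 41, 63, 70]
print([to_age_band(a) for a in ages])
```

This is why the direction of alignment is one of those dataset-by-dataset tradeoffs: categorizing the continuous dataset keeps the datasets comparable, but at the cost of fidelity.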
<next>
Here’s our recommended approach.
Start by storing your raw source datasets in S3. S3 lets you secure your data with encryption and access policies, and once your data is in S3 you can do all kinds of cool things with it. We’re also going to store the harmonized versions of each dataset in S3, once we’ve completed the harmonization.
Code your harmonization logic using Python or R. These languages are de-facto standards for data exploration and manipulation, and they’re very familiar to data scientists. They have huge communities where you can get help for many common problems.
Jupyter Notebooks is a very nice open-source web application for integrating your documentation, exploratory analysis, and harmonization code into one self-contained notebook artifact.
You can share these notebooks with the researchers who will consume your harmonized data. In fact, you can publish the notebook itself in S3 along with the harmonized dataset it generates. This will unambiguously define how the data was harmonized, in a way that anyone can review and reproduce for themselves.
Apache Spark is a fast in-memory distributed compute engine, which lets you easily scale your data and compute workload across clusters of multiple nodes. Spark can be programmed easily with Python, and with R, and gives you access to a rich set of data manipulation and machine learning libraries. It’s a great platform for addressing the full gamut of harmonization challenges.
Amazon EMR provides clusters in the cloud to run Apache Spark. The clusters can be sized to handle the compute workload for whatever size and number of datasets you need to harmonize. Spark running on EMR clusters can easily access the source datasets in the S3 buckets where you stored them. Your harmonization code should also save its output - the harmonized version of the dataset - to S3 as well, for researchers to access and analyze later.
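As a deliberately simplified illustration of the harmonization step itself, here is the core transform in plain Python. The column names, unit conversions, and mapping table are all hypothetical; in a real notebook you would express the same logic as Spark DataFrame operations so it scales across the EMR cluster:

```python
# Hypothetical per-dataset mapping: source column -> (harmonized name, converter).
# Each source dataset gets its own mapping table like this one.
DATASET_A_MAPPING = {
    "wt_lbs":  ("weight_kg", lambda v: round(v * 0.453592, 1)),  # pounds -> kilograms
    "pat_sex": ("sex",       lambda v: {"M": "male", "F": "female"}.get(v, "unknown")),
    "sbp":     ("systolic_bp_mmhg", lambda v: v),                # already in target units
}

def harmonize_record(record, mapping):
    """Rename variables and convert values to the common standard.

    Columns not present in the mapping are dropped: harmonization
    focuses on the variables the datasets have in common.
    """
    out = {}
    for src_col, (dst_col, convert) in mapping.items():
        if src_col in record:
            out[dst_col] = convert(record[src_col])
    return out

raw = {"wt_lbs": 180, "pat_sex": "F", "sbp": 128, "local_id": "A-17"}
print(harmonize_record(raw, DATASET_A_MAPPING))
```

Keeping the mapping as data (rather than scattered code) also makes the harmonization decisions easy to review in the published notebook.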
<next>
Now that your datasets have variables that are called the same thing and that mean the same thing, you want to make them easily searchable. You, and other researchers, need to quickly find data that are relevant to your hypothesis.
Spin up an Amazon Elasticsearch Service domain. With a few more lines of code in your harmonization notebooks, you can easily save the harmonized data you want to search on to an index in the Elasticsearch cluster.
You can build a web search UI (like the one shown from our sample app) to let you filter on the harmonized variables. In our sample we included an embedded dashboard to display aggregate information on the records that match your search criteria, to let you do preliminary analyses, and assess which datasets have relevant data.
A nice touch is to provide a link to an HTML version of the harmonization notebook for each dataset directly in the search UI.
When users have narrowed down their search to some likely datasets, they can easily click the link to examine the notebook to see how the harmonization was done, and verify that it suits their needs.
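Under the hood, a search UI like this typically translates each filter selection into a clause of an Elasticsearch bool query. A sketch of that translation in Python (the field names are hypothetical; the query DSL shapes are standard Elasticsearch):

```python
def build_es_query(filters):
    """Translate UI filter selections into an Elasticsearch bool query.

    `filters` maps a harmonized field name to either a list of allowed
    category values (multi-select) or a {"gte"/"lte": ...} numeric range.
    Using filter context means clauses are cached and not scored.
    """
    clauses = []
    for field, selection in filters.items():
        if isinstance(selection, dict):           # numeric range filter
            clauses.append({"range": {field: selection}})
        else:                                     # categorical multi-select
            clauses.append({"terms": {field: selection}})
    return {"query": {"bool": {"filter": clauses}}}

query = build_es_query({
    "sex": ["female"],
    "systolic_bp_mmhg": {"gte": 140},
})
print(query)
```

Because the fields are harmonized, the same query runs unchanged across every indexed dataset.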
<next>
OK, so using the search capability you have now located several harmonized datasets that you want to analyze to prove (or refute) your exciting hypothesis.
What tools can you use?
Well, one great option is to reuse the same technology we recommended for doing the harmonization, for all the same reasons. Python or R with Apache Spark on EMR clusters gives you access to many of the latest open source data analytics, stats, and machine learning tools.
You can spin up a data science workspace in AWS, with all the tools pre-installed. You can make it as small or as big as you need it to be, depending on the combined size of your datasets and the kinds of analysis you want to run, and pay only for what you use, and even take advantage of low cost spot instances.
All your datasets stay safely stored in S3. This separation of storage and compute gives you the flexibility to size and resize your compute cluster as needed, and it also means you can shut down your EMR clusters to save money when your analyses jobs finish, and you won’t lose your data.
As we discussed when we talked about harmonization, you can use notebooks to encapsulate your analysis code with its output and with documentation to create beautiful live executable and reproducible scientific artifacts that you can easily share with others.
<next>
Here’s an alternative way to analyze your datasets – attractive if you are more comfortable with relational databases and BI tools than with data science or programming.
The Amazon Athena service lets you define relational database tables to describe your data which stays stored in S3. Your harmonization notebook can be made to automatically define Athena tables for you, so that when harmonization is done, you can immediately start running SQL queries in the Athena console, or from SQL clients or BI tools such as Tableau.
Athena is one of our ‘serverless’ services, meaning that it’s just always available - you don’t need to worry about sizing clusters or anything like that. You pay only for the queries you run.
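To make that concrete, here is the kind of DDL a harmonization notebook could emit for Athena. The table name, columns, and S3 location below are hypothetical; you would run the generated statement from the Athena console or programmatically (for example via boto3):

```python
def athena_ddl(table, columns, s3_location):
    """Build a CREATE EXTERNAL TABLE statement for a harmonized
    dataset stored as Parquet in S3. Athena queries the data in
    place; nothing is loaded or copied into a database."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )

ddl = athena_ddl(
    "harmonized.incidents",
    [("sex", "string"), ("age_band", "string"), ("systolic_bp_mmhg", "int")],
    "s3://my-harmonized-bucket/incidents/",
)
print(ddl)
```

Once the table exists, any SQL client or BI tool with an Athena connection can query the harmonized data directly.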
And speaking of serverless services, you’ll also want to try out Amazon Quicksight.
Quicksight is our business analytics service. It is compatible with a wide variety of data sources, including Athena. This means that within minutes you can connect Quicksight to your harmonized datasets in S3, via Athena, and create beautiful charts and tables to reveal patterns and relationships in the data. Quicksight includes an in-memory calculation engine called SPICE – SPICE lets you build really fast interactive visualizations and dashboards.
And with Quicksight you can create ‘stories’ from your data to reveal insights in a logical progression. Interactive story boards and dashboards can be easily saved and shared with others in your organization on both web browser and the Quicksight mobile app.
<next>
So we’ve talked about why and how you can harmonize, search, and analyze your datasets on AWS. Let’s take a quick look at how we can pull these steps together into a complete architecture.
<next>
Starting with harmonization..
The harmonization process is executed on an Amazon EMR cluster running in a VPC in your account. The cluster has Apache Spark and Jupyter installed, so that once it’s running you can securely connect to the Jupyter notebook from your web browser to create and run the harmonization notebooks. Your harmonization code reads the raw datasets from S3, transforms the variables and values as needed, and saves the output harmonized dataset back to S3, possibly a different bucket.
When you are done harmonizing, you can download and save your notebooks in your source code repository, and then you can terminate the EMR cluster. No sense paying for it when you’re not using it. You can always fire up a new cluster if/when you want to run harmonization again.
<next>
To add a search & discovery portal for your datasets, create an Amazon Elasticsearch Service domain.
Use the harmonization process to index the data that you want to make searchable.
Elasticsearch has a connector for Apache Spark that makes it really easy to efficiently bulk save the data we want to index, directly from the harmonization notebook.
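If you are ever indexing without the Spark connector, the same bulk load can be done directly against Elasticsearch’s _bulk API, which accepts newline-delimited JSON of alternating action and document lines. A sketch of building that payload in Python (the index name and documents are hypothetical):

```python
import json

def bulk_payload(index, docs):
    """Build an Elasticsearch _bulk request body: one action line
    followed by one source line per document, newline-delimited,
    with the trailing newline the API requires."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

docs = [
    {"sex": "female", "systolic_bp_mmhg": 128},
    {"sex": "male", "systolic_bp_mmhg": 141},
]
payload = bulk_payload("harmonized-incidents", docs)
print(payload)
```

The payload would then be POSTed to the domain’s `/_bulk` endpoint; the Spark connector does the equivalent for you, in parallel, from the cluster.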
To create a web UI search page, we’ll host the required components as docker containers on the Amazon ECS service.
The UI is deployed across 2 availability zones in your VPC to ensure fault tolerance. An Application Load Balancer provides a single http URL as the entry point.
You can run the search UI as a standalone site, or you can incorporate it into an existing website.
<next>
Add Athena and Quicksight to the architecture to provide the ability to create and share analyses of your data sets. There are no servers to deploy – just log into the console and start querying!
Or, as I mentioned earlier, you can also reuse the EMR cluster with Spark and Jupyter notebooks to do your research and analysis, leveraging the latest and greatest open source data science tools and the power of programming in Python or R.
A couple of final points on this architecture:
It is S3 centric. We can securely store multiple versions of raw and processed datasets in S3.
EMR can easily access datasets stored in S3, as can Athena, Quicksight and many other services. S3 is our source of truth – our data lake.
Data lake centric architectures can be very powerful – they let you tap into your data from a wide variety of specialized tools and services that you can deploy as and when needed. I definitely recommend that you look into Data Lake architectures on AWS if you haven’t already.
Is this the only way to align and combine discordant datasets? Probably not. I’m sure there are other approaches.
Maybe you’ll come up with new tools, new techniques, and opportunities to leverage new AWS services like Glue, Redshift Spectrum, and more to improve on this reference architecture. That would be great!
Hopefully this reference will help you get started and give you a framework for thinking about your approach.
<next>
We skimmed a lot of topics just now. As I mentioned at the beginning, if you are interested in digging deeper, you should check out our AWS big data blog post on the topic.
The blog comes with a companion sample application. Clicking the ‘launch stack’ button fires off AWS CloudFormation templates that will create the entire reference architecture that we just discussed, in your account, ready for you to try out. The blog uses publicly available police incident datasets from several US cities to allow you to try out all the concepts we discussed.
All the code is available on GitHub, so dig as deep as you like, and reuse any or all of it as you see fit.
You can find the blog by searching the keywords ‘Harmonize Search Analyze AWS’
<next>
So, here are the main points to remember:
To find and analyze data that spans multiple datasets, you first have to harmonize them – create variables that are called the same thing, that mean the same thing, and whose values can be meaningfully combined and compared.
Provide a search page where researchers can apply their filters to find which datasets have the records they want to use for their analysis. The search UI should also include a dashboard to enable some fast preliminary analysis of the combined data.
Analyze the datasets using the latest data science and/or business intelligence tools, and create rich documents to share your insights with the world. Prove your exciting hypothesis, and advance the frontiers of science!
And, of course, do it all on AWS! Securely store your datasets on S3 and build out a data lake strategy. Take advantage of the fact that you can spin up an entire environment in a matter of minutes to experiment, quickly try new techniques, and get to your discoveries faster.
<next>
OK.. I’m relieved to tell you that’s the end of part 1.
We’ll hopefully have time for a few questions at the end, but for now it’s my turn to relax and be inspired as Taha makes all these concepts relevant by telling us about the AHA Precision Medicine Platform.
Thank you all for listening!
Now, over to Dr. Taha Kass-Hout! Taha…