Uses of Data Lakes

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
Uses of Data Lakes (Data Lakes in the Wild)
Ryan Jancaitis
Sr. Product Manager, Envision Engineering
rjancait@amazon.com
Al Belsky
Sr. Solutions Engineer, Envision Engineering
albelsk@amazon.com

Envision Engineering – About Us
The Solution
Bring the “Art of the Possible” to our customers
Collaborative: Iterate and deliver based on constant feedback from customer stakeholders
Business Solutions: Focus on solving business challenges, not technology challenges
Specialized Team: End-to-End development approach, services, and skills
Touchable, tangible results are more impactful than an architecture diagram.
Analysis paralysis and uncertainties are barriers to cloud adoption
Which Technology? Can it Work? Where to Begin?

Envision Engineering
Image Recognition
IoT
Machine Learning
AI/Bots
Art of the Possible

What is a Data Lake?
Centralized Storage
Security Controls
Application Integration
Lineage and Auditing
Data Lake

Customer Example – United States Census
Core Business Challenge
Core Data & Copies
Data Security
Auditing
Usage Monitoring
Reproducibility
Storage Constraints
Compute Constraints

United States Census
Core Use cases
Column-level Access Control (with cell-level capability)
Data Lineage (macro level)
On-Demand Infrastructure for analytics jobs
Cost Tracking per analytics job
Hadoop platform choice: Amazon EMR and Hortonworks HDP
Ability to run legacy scripts (SAS 9.4)
Centralized Storage
LDAP-based user security/permissions

Deep Dive into Implementation – Column Security
• Custom Accumulo Loader
  • Loads datasets from S3 into Accumulo table(s)
  • Assigns column names as security labels
• Custom Accumulo Authorization handler
  • Checks which labels the user has access to (in LDAP)
• Installation
  • On Amazon EMR (Elastic MapReduce): via bootstrap script
  • On the Hortonworks cluster: via Apache Ambari Blueprints
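
The label-based column security described above can be illustrated with a small, self-contained sketch. It does not use Accumulo or LDAP; the `LDAP_LABELS` dictionary and all function names are hypothetical stand-ins that only mimic the idea of tagging each cell with its column name and filtering on the labels granted to a user.

```python
from typing import Dict, Set, Tuple

# Hypothetical stand-in for the LDAP lookup: user -> column labels they may read.
LDAP_LABELS: Dict[str, Set[str]] = {
    "analyst1": {"state", "age"},
    "analyst2": {"state", "age", "income"},
}


def load_row(row: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Mimic the loader: tag each cell with its column name as the security label."""
    return {column: (column, value) for column, value in row.items()}


def authorized_labels(user: str) -> Set[str]:
    """Mimic the authorization handler: look up the labels granted to a user."""
    return LDAP_LABELS.get(user, set())


def scan_row(labeled_row: Dict[str, Tuple[str, str]], user: str) -> Dict[str, str]:
    """Return only the cells whose label the user is authorized to see."""
    allowed = authorized_labels(user)
    return {
        column: value
        for column, (label, value) in labeled_row.items()
        if label in allowed
    }


# Illustrative data only.
row = load_row({"state": "VA", "age": "42", "income": "55000"})
visible_to_analyst1 = scan_row(row, "analyst1")
visible_to_unknown = scan_row(row, "nobody")
```

Because the label travels with each cell, the same mechanism could in principle carry a finer-grained label per cell, which is one way to read the "cell-level capability" mentioned earlier.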

Deep Dive into Implementation – Hortonworks
Hortonworks Cluster on EC2
• Create recipes:
  • Accumulo setup
    1. Install Loader
    2. Install Custom auth
    3. Stop Accumulo
    4. Start Accumulo
  • Import Data & Run SAS
• Create stack Blueprint

Deep Dive into Implementation – SAS Script Execution
• SAS instance is launched per analytics task (on-demand)
• AWS Systems Manager “Run Command” triggers remote shell script
• Shell script downloads SAS script from S3, runs it via SAS
• SAS accesses the data via Hive endpoint on Hadoop, reads from External Table linked to Accumulo table
• SAS persists results locally
• Shell script copies the results to Amazon S3
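
The Run Command flow above can be sketched as a request builder. This is a minimal sketch under stated assumptions: the local paths, the bucket names, the `sas -sysin ... -log ...` invocation, and the instance ID are illustrative and not taken from the original solution. The dictionary is shaped for `boto3.client("ssm").send_command(**request)` using the `AWS-RunShellScript` document, but nothing is sent here.

```python
from typing import Any, Dict, List


def build_sas_commands(script_uri: str, results_uri: str) -> List[str]:
    """Shell steps: fetch the SAS script from S3, run it, copy results back."""
    return [
        "mkdir -p /tmp/job/results",
        f"aws s3 cp {script_uri} /tmp/job/job.sas",
        # Assumes a `sas` executable on the PATH of the SAS 9.4 instance and
        # that the script writes its output under /tmp/job/results.
        "cd /tmp/job && sas -sysin job.sas -log results/job.log",
        f"aws s3 cp /tmp/job/results {results_uri} --recursive",
    ]


def build_run_command_request(
    instance_id: str, script_uri: str, results_uri: str
) -> Dict[str, Any]:
    """Keyword arguments for boto3.client("ssm").send_command(**request)."""
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        "Comment": "Run SAS analytics task",
        "Parameters": {"commands": build_sas_commands(script_uri, results_uri)},
    }


# Hypothetical identifiers for illustration.
request = build_run_command_request(
    "i-0123456789abcdef0",
    "s3://example-bucket/scripts/job.sas",
    "s3://example-bucket/results/job-1/",
)
```

Keeping the request construction separate from the AWS call makes the shell steps easy to review and test before anything runs on an instance.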
[Diagram: Amazon EC2 SAS Instance (Amazon AMI, SAS 9.4), Amazon EMR, Amazon S3 bucket, Amazon EC2 Systems Manager]
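
The Hive external table that SAS reads from, linked to an Accumulo table, could be declared along the following lines. This sketch only generates the DDL text. The storage handler class and property names follow the Hive Accumulo integration as commonly documented and should be verified against the Hive version in use; the table names, the `data` column family, and treating every column as `STRING` are assumptions made for illustration.

```python
from typing import List


def hive_external_table_ddl(
    hive_table: str,
    accumulo_table: str,
    columns: List[str],
    family: str = "data",  # hypothetical Accumulo column family
) -> str:
    """Build DDL for a Hive external table over the columns a user may see."""
    column_defs = ", ".join(["rowid STRING"] + [f"`{c}` STRING" for c in columns])
    # Map the Hive row key to the Accumulo row ID, then each column to family:qualifier.
    mapping = ",".join([":rowid"] + [f"{family}:{c}" for c in columns])
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('accumulo.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('accumulo.table.name' = '{accumulo_table}')"
    )


# Only the columns visible to the user are exposed through Hive.
ddl = hive_external_table_ddl("census_view", "census_data", ["state", "age"])
```

Passing only the user-visible columns keeps the Hive view consistent with what the column security layer would return.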

Deep Dive into Implementation – Pulling it all together
1. User initiates Analysis Routine based on selected data
   Inputs: 1) Analysis Routine 2) Data File 3) AD Group
2. A NodeJS Lambda function launches EMR/HDP via SDK/API
3. A Hadoop cluster is launched and bootstrapped to install Accumulo and Hive
4. Custom Java routine creates Accumulo rights and data tables and loads the data
5. Hive tables are created based on data visible to user in Accumulo
6. A SAS AMI is launched with Hive connection details
7. A SAS Program is run and results are stored in S3. The AWS instances and services are terminated
   Outputs: 1) Location of Results 2) Location of Logs
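
The cluster launch in this flow can be sketched as a request for the EMR `RunJobFlow` API. The slides describe a NodeJS Lambda function; the sketch below uses Python instead and only builds the keyword arguments for `boto3.client("emr").run_job_flow(**request)`. The release label, instance types, instance count, tag keys, and S3 locations are assumptions, and using tags for per-job cost tracking is one possible approach rather than the confirmed design.

```python
from typing import Any, Dict


def build_emr_request(
    job_id: str, user: str, bootstrap_script_uri: str, log_uri: str
) -> Dict[str, Any]:
    """Keyword arguments for boto3.client("emr").run_job_flow(**request)."""
    return {
        "Name": f"analytics-{job_id}",
        "ReleaseLabel": "emr-5.13.0",  # assumed release
        "LogUri": log_uri,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m4.xlarge",  # assumed sizing
            "SlaveInstanceType": "m4.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # The bootstrap action is where Accumulo and the custom components
        # would be installed; the script itself is not shown here.
        "BootstrapActions": [
            {
                "Name": "Install Accumulo and custom security components",
                "ScriptBootstrapAction": {"Path": bootstrap_script_uri, "Args": []},
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        # Tags let costs be grouped per analytics job and per requester.
        "Tags": [
            {"Key": "analytics-job-id", "Value": job_id},
            {"Key": "requested-by", "Value": user},
        ],
    }


emr_request = build_emr_request(
    "job-1",
    "analyst1",
    "s3://example-bucket/bootstrap/install-accumulo.sh",
    "s3://example-bucket/logs/job-1/",
)
```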

Deep Dive into Implementation – Serverless UI
Client Side: Single Page App
Server Side: Serverless, API Gateway
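
On the server side, a function behind API Gateway might accept the analysis request from the single page app. The following is a sketch only, assuming a Lambda proxy-style event with a JSON `body`; the field names `analysisRoutine`, `dataFile`, and `adGroup` are hypothetical, and the step that would start the analytics infrastructure is left as a comment.

```python
import json
import uuid
from typing import Any, Dict

REQUIRED_FIELDS = ("analysisRoutine", "dataFile", "adGroup")  # hypothetical names


def _response(status: int, body: Dict[str, Any]) -> Dict[str, Any]:
    """Shape a response the way a Lambda proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Validate an analysis request and acknowledge it with a job ID."""
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return _response(400, {"error": "body must be valid JSON"})
    if not isinstance(payload, dict):
        return _response(400, {"error": "body must be a JSON object"})
    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    if missing:
        return _response(400, {"error": "missing fields", "fields": missing})
    job_id = str(uuid.uuid4())
    # A real handler would start the analytics infrastructure for this job here.
    return _response(202, {"jobId": job_id})
```

A call such as `handler({"body": json.dumps({"analysisRoutine": "a.sas", "dataFile": "d.csv", "adGroup": "g"})}, None)` returns a 202 response carrying a generated job ID.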

Solution Architecture
Serverless UI
Data and Scripts
Analytics Infrastructure: Hadoop, Spark, R, Other Analytics
Amazon CloudWatch Logs

US Census Summary and Next Steps
- Data Lake provides:
  - Centralized, secured storage
  - On-demand analytics environment
  - Data and Program Lineage
  - Reuse of existing data and SAS Programs
- What’s Next:
  - Authority to Operate in a FedRAMP High environment
  - Spin-up of interactive environments
  - Control of AWS images and cost by user and group
  - Deeper integration with Apache Ranger and Atlas

Customer Example – USC Alzheimer's Therapeutic Research Institute
The USC ATRI mission is to create a leading hub
of basic, translational and clinical research in
neuroscience and neurological diseases by
collaborating with sites and investigators
around the world

Customer Example – ATRI
Silo A, Silo B, Silo C: each with its own Core Data & Routines

ATRI
Core Use cases
Collect Data from Multiple Sources
Data Lineage for analytics jobs
HIPAA Eligible Environment
LDAP-based user security/permissions
Data Discovery

Outcome
• Web-accessible data lake that demonstrates:
  • User authentication and authorization
  • Text-based search and discovery based on:
    • project name
    • files within the project
    • columns within tables/CSV files
  • Control of access roles and rights on a data set
  • Execution of analytics task scripts against selected data sets (R, Python and Java)
  • Audit information for data: storage, sharing, and usage
  • REST-like API(s) for uploading and updating data to the data lake
  • Store data sets in data lake via scripting/automation
  • Store data in a HIPAA eligible environment
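
The text-based search over project names, file names, and column names listed above can be illustrated with a small inverted index. This is a sketch, not the search technology used in the ATRI solution; the project records and all names are made-up sample data.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(projects: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each token to the names of the projects that mention it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for project in projects:
        texts = [project["name"]]
        for data_file in project["files"]:
            texts.append(data_file["name"])
            texts.extend(data_file.get("columns", []))
        for text in texts:
            for token in tokenize(text):
                index[token].add(project["name"])
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the projects that match every token in the query, sorted by name."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        matches &= index.get(token, set())
    return sorted(matches)


# Made-up sample catalog.
projects = [
    {"name": "Memory Study",
     "files": [{"name": "visits.csv", "columns": ["patient_id", "visit_date"]}]},
    {"name": "Imaging Pilot",
     "files": [{"name": "scans.csv", "columns": ["scan_id", "modality"]}]},
]
index = build_index(projects)
```

A production catalog would also filter results by the caller's access rights so that discovery does not reveal data sets the user cannot open.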

Customer Example – National Heart, Lung, and Blood Institute
The National Heart, Lung, and Blood Institute’s
(NHLBI) mission is to provide global leadership for
a research, training, and education program to
promote the prevention and treatment of heart,
lung, and blood disease. To this end, Institutions,
Scientists, and Researchers rely on data provided
by the NHLBI to drive basic discoveries about the
causes of disease and translate those discoveries
into clinical practice.

Customer Example – NHLBI
Massive amounts of genetic data
Consent Management held by outside group
Auditing
Compute Constraints

NHLBI
Core Use cases
Data Lineage for genomics tasks
Cost Tracking
Centralized Storage
Consent Group based access controls
Data Discovery

Outcome
• Web-accessible data lake that demonstrates:
  • User authentication and authorization based on internal Identity Provider
  • Text-based search and discovery based on dbGaP controlled studies
  • Control of access roles and rights on a study by Consent Group
  • On-Demand Genomics Tooling based on selected data files (samtools, bcftools, htsget, PLINK, etc.)
  • Audit information for data: storage, sharing, and usage
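
Access control by Consent Group together with audit logging can be sketched as follows. The attribute name `consentGroups`, the `study.consent` string format, and the record fields are hypothetical; they only illustrate checking a file request against the grants a user carries and recording every decision.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Set, Tuple

Grant = Tuple[str, str]  # (study accession, consent group)


def grants_from_assertion(attributes: Dict[str, List[str]]) -> Set[Grant]:
    """Parse hypothetical 'study.consent' strings from an identity assertion."""
    grants: Set[Grant] = set()
    for value in attributes.get("consentGroups", []):
        study, _, consent = value.partition(".")
        if study and consent:
            grants.add((study, consent))
    return grants


def request_file(
    user: str,
    grants: Set[Grant],
    file_record: Dict[str, str],
    audit_log: List[Dict[str, Any]],
) -> bool:
    """Allow access only if the user holds the file's study and consent group."""
    allowed = (file_record["study"], file_record["consent_group"]) in grants
    # Every decision is recorded, whether access was granted or denied.
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "file": file_record["key"],
        "allowed": allowed,
    })
    return allowed


# Illustrative values only.
audit_log: List[Dict[str, Any]] = []
grants = grants_from_assertion({"consentGroups": ["phs000001.c1", "phs000002.c2"]})
granted = request_file(
    "researcher1", grants,
    {"key": "phs000001/c1/sample.bam", "study": "phs000001", "consent_group": "c1"},
    audit_log,
)
denied = request_file(
    "researcher1", grants,
    {"key": "phs000001/c2/sample.bam", "study": "phs000001", "consent_group": "c2"},
    audit_log,
)
```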

NHLBI Solution Architecture
SAML Authentication: SAML Assertion with Consent Group permissions
NIH CIT
dbGaP/SRA: Study details, metadata, run lists, and permission details
File Access Request: Secured access via IAM Roles
UI
NHLBI Data Lake
NHLBI Data Storage: Study 1, Study 2, Study N
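
One way to turn Consent Group permissions into secured access through IAM is to scope an S3 read policy to the prefixes a user is entitled to. The sketch below assumes objects are stored under `<bucket>/<study>/<consent group>/`, which is an illustrative layout and not the documented NHLBI design; the bucket name is hypothetical. The resulting document uses the standard IAM policy JSON structure and could, for example, be supplied as a session policy when temporary credentials are requested.

```python
import json
from typing import Any, Dict, Set, Tuple


def build_study_policy(bucket: str, grants: Set[Tuple[str, str]]) -> Dict[str, Any]:
    """Build an IAM policy allowing reads only under the granted study prefixes."""
    if not grants:
        raise ValueError("at least one (study, consent group) grant is required")
    resources = [
        f"arn:aws:s3:::{bucket}/{study}/{consent}/*"
        for study, consent in sorted(grants)
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": resources,
            }
        ],
    }


# Hypothetical bucket and grants.
policy = build_study_policy(
    "example-study-data", {("phs000002", "c2"), ("phs000001", "c1")}
)
policy_json = json.dumps(policy)
```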

Uses of Data Lake – In Summary
Common Needs Across Verticals
Common Services to Meet Data Lake Needs
Centralized Storage
Security Controls
Application Integration
Lineage and Auditing

Pop-up Loft
aws.amazon.com/activate
Everything and Anything Startups
Need to Get Started on AWS
