© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
Uses of Data Lakes (Data Lakes in the Wild)
Ryan Jancaitis
Sr. Product Manager, Envision Engineering
rjancait@amazon.com
Al Belsky
Sr. Solutions Engineer, Envision Engineering
albelsk@amazon.com
Envision Engineering – About Us
The Solution
Bring the “Art of the Possible” to our customers
Collaborative
Iterate and deliver based
on constant feedback
from customer
stakeholders
Business Solutions
Focus on solving business
challenges, not
technology challenges
Specialized Team
End-to-End development
approach, services, and
skills
Touchable, tangible results are more impactful than an architecture diagram.
Analysis paralysis and uncertainties are barriers to cloud adoption
Which Technology? Can it Work? Where to Begin?
Envision Engineering
Image
Recognition
IoT
Machine
Learning
AI/Bots
Art of the Possible
Envision Engineering
What is a Data Lake?
Centralized
Storage
Security
Controls
Application
Integration
Lineage and
Auditing
Data Lake
Customer Example – United States Census
Core Business Challenge
Core Data
& Copies
Data Security
Auditing
Usage
Monitoring
Reproducibility
Storage
Constraints
Compute
Constraints
United States Census
Core Use cases
Column-level Access Control
(with cell-level capability)
Data Lineage
(macro level)
On-Demand Infrastructure
for analytics jobs
Cost Tracking
per analytics job
Hadoop platform choice:
Amazon EMR and
Hortonworks HDP
Ability to run
legacy scripts
SAS 9.4
Centralized
Storage
LDAP based user
security/permissions
Deep Dive into Implementation – Column Security
• Custom Accumulo Loader
  - loads datasets from S3 into Accumulo table(s)
  - assigns column names as security labels
• Custom Accumulo Authorization handler
  - checks which labels user has access to (in LDAP)
Installed via Bootstrap script
on EMR (Elastic Map Reduce)
Installed on Hortonworks cluster
via Apache Ambari Blueprints
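The column-security idea on this slide can be sketched in a few lines. The following is a toy model in Python, not Accumulo's API: each cell carries its column name as a security label, and a read returns only the cells whose label has been granted to the user. The `LDAP_GRANTS` dictionary stands in for the LDAP lookup, and every name and value here is hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for the LDAP lookup: user -> column labels granted.
LDAP_GRANTS: Dict[str, Set[str]] = {
    "analyst": {"state", "population"},
    "auditor": {"state"},
}


def load_rows(rows: List[Dict[str, str]]) -> List[Dict[str, Dict[str, str]]]:
    """Mimic the loader: tag every cell with its column name as the label."""
    return [
        {col: {"value": val, "label": col} for col, val in row.items()}
        for row in rows
    ]


def scan(table: List[Dict[str, Dict[str, str]]], user: str) -> List[Dict[str, str]]:
    """Mimic the authorization handler: drop cells the user has no label for."""
    granted = LDAP_GRANTS.get(user, set())
    return [
        {col: cell["value"] for col, cell in row.items() if cell["label"] in granted}
        for row in table
    ]


table = load_rows([{"state": "VA", "population": "8500000", "ssn_count": "12"}])
print(scan(table, "analyst"))
print(scan(table, "auditor"))
```

Because the label is attached per cell rather than per table, the same mechanism could in principle be narrowed to cell-level rules, which is the capability the use-case slide mentions.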
Deep Dive into Implementation – Hortonworks
Hortonworks
Cluster on EC2
• Create recipes:
  - Accumulo setup
    1. Install Loader
    2. Install Custom auth
    3. Stop Accumulo
    4. Start Accumulo
  - Import Data & Run SAS
• Create stack
Blueprint
Deep Dive into Implementation – SAS Script Execution
• SAS instance is launched per analytics task (on-demand)
• AWS Systems Manager “Run Command” triggers remote
shell script
• Shell script downloads SAS script from S3, runs it via SAS
• SAS accesses the data via Hive endpoint on Hadoop,
reads from External Table linked to Accumulo table
• SAS persists results locally
• Shell script copies the results to Amazon S3
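The remote shell step described above could look roughly like the following. This Python sketch only builds the request that would be handed to Systems Manager Run Command (for example through boto3's `ssm.send_command` with the `AWS-RunShellScript` document); it does not call AWS. The bucket, object keys, instance ID, scratch directory, and the `sas -sysin ... -log ...` batch invocation are assumptions for illustration, not details taken from the slides.

```python
from typing import Any, Dict, List


def build_run_command(instance_id: str, bucket: str, script_key: str,
                      results_prefix: str) -> Dict[str, Any]:
    """Build keyword arguments for an SSM Run Command shell invocation."""
    script_name = script_key.rsplit("/", 1)[-1]
    workdir = "/tmp/sas_job"  # hypothetical scratch directory on the SAS instance
    commands: List[str] = [
        f"mkdir -p {workdir}/out",
        # 1. Download the SAS script from S3.
        f"aws s3 cp s3://{bucket}/{script_key} {workdir}/{script_name}",
        # 2. Run it in batch mode; SAS writes its results locally.
        f"cd {workdir} && sas -sysin {script_name} -log out/{script_name}.log",
        # 3. Copy the local results (and log) back to S3.
        f"aws s3 cp {workdir}/out s3://{bucket}/{results_prefix}/ --recursive",
    ]
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {"commands": commands},
    }


request = build_run_command("i-0123456789abcdef0", "example-data-lake",
                            "scripts/analysis.sas", "results/job-42")
for line in request["Parameters"]["commands"]:
    print(line)
```

Keeping the command list in one place makes it easy to log exactly what ran for a given analytics task. A production script would also need to quote paths and check exit codes.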
Amazon EC2
SAS Instance
Amazon AMI
SAS 9.4 Amazon
EMR
Amazon S3
bucket
Amazon EC2
Systems Manager
Deep Dive into Implementation – Pulling it all together
1. User initiates Analysis Routine based on selected data
2. A NodeJS Lambda function launches EMR/HDP via SDK/API
3. A Hadoop cluster is launched and bootstrapped to install Accumulo and Hive
4. Custom Java routine creates Accumulo rights and data tables and loads the data
5. Hive tables are created based on data visible to user in Accumulo
6. A SAS AMI is launched with Hive connection details
7. A SAS Program is run and results are stored in S3. The AWS instances and services are terminated
1) Analysis Routine
2) Data File
3) AD Group
1) Location of Results
2) Location of Logs
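The numbered flow on this slide can be read as an ordered pipeline with a guaranteed teardown. The sketch below is a hypothetical Python outline: every step is a stub that only records its name, the S3 locations are placeholders, and the fields on `JobRequest` and `JobResult` mirror the inputs and outputs listed on the slide rather than any real API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobRequest:
    analysis_routine: str  # location of the SAS program (assumed to be an S3 URI)
    data_file: str         # location of the selected data (assumed to be an S3 URI)
    ad_group: str          # directory group used to decide which data is visible


@dataclass
class JobResult:
    results_location: str
    logs_location: str
    steps_run: List[str] = field(default_factory=list)
    error: str = ""


def run_job(req: JobRequest, fail_at: str = "") -> JobResult:
    """Run the steps in slide order; always terminate resources at the end."""
    name = req.analysis_routine.rsplit("/", 1)[-1]
    result = JobResult(f"s3://example-bucket/results/{name}/",
                       f"s3://example-bucket/logs/{name}/")
    steps = [
        "launch_hadoop_cluster",  # steps 2-3: Lambda launches and bootstraps Hadoop
        "load_accumulo_tables",   # step 4: custom routine creates rights and tables
        "create_hive_tables",     # step 5: only data visible to the user
        "launch_sas_instance",    # step 6: SAS AMI with Hive connection details
        "run_sas_program",        # step 7: results stored in S3
    ]
    try:
        for step in steps:
            if step == fail_at:  # simulate a failure for demonstration
                raise RuntimeError(f"{step} failed")
            result.steps_run.append(step)
    except RuntimeError as exc:
        result.error = str(exc)
    finally:
        # Teardown runs whether or not the analysis succeeded.
        result.steps_run.append("terminate_resources")
    return result


job = run_job(JobRequest("s3://example-bucket/scripts/a.sas",
                         "s3://example-bucket/data/d.csv", "analysts"))
print(job.steps_run)
```

Putting the teardown in a `finally` block reflects the on-demand model: a failed analysis should not leave a cluster or SAS instance running and accruing cost.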
Single Page App Serverless
API Gateway
Deep Dive into Implementation – Serverless UI
Client Side Server Side
Solution Architecture
Hadoop
Amazon
CloudWatch Logs
Data and Scripts Serverless UI
Analytics Infrastructure
Spark
R
Other
Analytics
Demo
US Census Summary and Next Steps
- Data Lake provides:
- Centralized, secured storage
- On demand analytics environment
- Data and Program Lineage
- Re-use of existing data and SAS Programs
- What’s Next:
- Authority to Operate in FedRAMP High environment
- Spin up of interactive environments
- Control of AWS images and cost by user and group
- Deeper integration with Apache Ranger and Atlas
Customer Example – USC Alzheimer's Therapeutic Research Institute
The USC ATRI mission is to create a leading hub
of basic, translational and clinical research in
neuroscience and neurological diseases by
collaborating with sites and investigators
around the world
Core Data
& Routines
Core Data
& Routines
Silo A Silo B Silo C
Customer Example – ATRI
Core Business Challenge
Core Data
& Routines
ATRI
Core Use cases
Collect Data from Multiple
Sources
Data Lineage
On-Demand Infrastructure
for analytics jobs
HIPAA Eligible
Environment
LDAP based user
security/permissions
Data Discovery
Outcome
• Web-accessible data lake that demonstrates:
• User authentication and authorization
• Text-based search and discovery based on:
• project name
• files within the project
• columns within tables/csv-files.
• Control of access roles and rights on a data set
• Analytics task execution scripts against selected data sets
• (R, Python and Java)
• Audit information for data:
• Storage, sharing, and usage
• REST-like API(s) for uploading and updating data to the
data lake
• Store data sets in data lake via scripting/automation.
• Store data in a HIPAA eligible environment
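Text-based discovery across project names, files, and columns can be illustrated with a small in-memory search. This Python sketch is only a toy: the `CATALOG` entries and their field names are invented, and the slides do not say which search technology or metadata schema the actual solution used.

```python
from typing import Any, Dict, List

# Hypothetical catalogue entries; the real metadata schema is not in the slides.
CATALOG: List[Dict[str, Any]] = [
    {"project": "Cognition Study", "files": ["visits.csv"],
     "columns": ["subject_id", "mmse_score"]},
    {"project": "Imaging Pilot", "files": ["scans.csv"],
     "columns": ["subject_id", "scan_date"]},
]


def search(term: str) -> List[Dict[str, str]]:
    """Case-insensitive substring match over project, file, and column names."""
    needle = term.lower()
    hits: List[Dict[str, str]] = []
    for entry in CATALOG:
        if needle in entry["project"].lower():
            hits.append({"project": entry["project"], "matched": "project name"})
        for fname in entry["files"]:
            if needle in fname.lower():
                hits.append({"project": entry["project"], "matched": f"file {fname}"})
        for col in entry["columns"]:
            if needle in col.lower():
                hits.append({"project": entry["project"], "matched": f"column {col}"})
    return hits


print(search("mmse"))
print(search("subject_id"))
```

Reporting which level matched (project, file, or column) lets a user see why a data set surfaced before requesting access to it.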
Customer Example – National Heart, Lung, and Blood Institute
The National Heart, Lung, and Blood Institute’s
(NHLBI) mission is to provide global leadership for
a research, training, and education program to
promote the prevention and treatment of heart,
lung, and blood disease. To this end, Institutions,
Scientists, and Researchers rely on data provided
by the NHLBI to drive basic discoveries about the
causes of disease and translate those discoveries
into clinical practice.
Customer Example – NHLBI
Core Business Challenge
Massive amounts of
genetic data
Consent Management
held by outside group
Auditing
Compute
Constraints
NHLBI
Core Use cases
Data Lineage
On-Demand Infrastructure
for genomics tasks
Cost Tracking
Centralized
Storage
Consent Group based
access controls
Data Discovery
Outcome
• Web-accessible data lake that demonstrates:
• User authentication and authorization based on internal
Identity Provider
• Text-based search and discovery based on
dbGaP controlled studies
• Control of access roles and rights on a study by Consent
Group
• On Demand Genomics Tooling based on selected data
files
• (samtools, bcftools, HTSGet, Plink, etc…)
• Audit information for data:
• Storage, sharing, and usage
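Access control by Consent Group can be illustrated with a small filter. In this Python sketch the `assertion` dictionary stands in for attributes carried in a SAML assertion, and the attribute name, study identifiers, consent group codes, and file names are all invented for illustration; the real attribute format and storage layout are not given in the slides.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical catalogue: each file belongs to one (study, consent group) pair.
FILES: List[Dict[str, str]] = [
    {"name": "sample1.bam", "study": "study-1", "consent_group": "c1"},
    {"name": "sample2.bam", "study": "study-1", "consent_group": "c2"},
    {"name": "sample3.vcf", "study": "study-2", "consent_group": "c1"},
]


def granted_pairs(assertion: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Parse 'study.consent_group' strings from the (assumed) assertion attribute."""
    pairs: Set[Tuple[str, str]] = set()
    for entry in assertion.get("consent_groups", []):
        study, _, group = entry.partition(".")
        if study and group:  # skip malformed entries instead of granting access
            pairs.add((study, group))
    return pairs


def visible_files(assertion: Dict[str, List[str]]) -> List[str]:
    """Return only the files whose study and consent group were both granted."""
    allowed = granted_pairs(assertion)
    return [f["name"] for f in FILES if (f["study"], f["consent_group"]) in allowed]


assertion = {"consent_groups": ["study-1.c1", "study-2.c1"]}
print(visible_files(assertion))
```

Checking the study and consent group together, rather than the study alone, matches the slide's point that rights are granted on a study by Consent Group.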
NHLBI Solution Architecture
SAML Authentication
SAML Assertion with
Consent Group
permissions
NIH CIT
dbGaP/SRA Study details
Meta-data, run lists, and
permission details
File Access Request
Secured access
IAM
Roles
UI
NHLBI Data Lake
NHLBI Data Storage
Study 1 Study 2 Study N
Uses of Data Lake – In Summary
Common Needs Across Verticals
Common Services to Meet Data Lake Needs
Centralized
Storage
Security
Controls
Application
Integration
Lineage and
Auditing
Pop-up Loft
aws.amazon.com/activate
Everything and Anything Startups
Need to Get Started on AWS

More Related Content

What's hot

Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 
Big Data – A New Testing Challenge
Big Data – A New Testing ChallengeBig Data – A New Testing Challenge
Big Data – A New Testing Challenge
TEST Huddle
 
Open Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache AtlasOpen Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache Atlas
DataWorks Summit
 
Compute-based sizing and system dashboard
Compute-based sizing and system dashboardCompute-based sizing and system dashboard
Compute-based sizing and system dashboard
DataWorks Summit
 
Insights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesInsights into Real World Data Management Challenges
Insights into Real World Data Management Challenges
DataWorks Summit
 
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Databricks
 
Breaking the Silos: Storage for Analytics & AI
Breaking the Silos: Storage for Analytics & AIBreaking the Silos: Storage for Analytics & AI
Breaking the Silos: Storage for Analytics & AI
DataWorks Summit
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
DataWorks Summit
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Felicia Haggarty
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
DataWorks Summit
 
Security, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationSecurity, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software Integration
DataWorks Summit
 
Apache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop componentsApache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop components
DataWorks Summit/Hadoop Summit
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
DataWorks Summit
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Spark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat Patterson
Spark Summit
 
Oracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer IntroductionOracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer Introduction
Jeffrey T. Pollock
 
The EDW Ecosystem
The EDW EcosystemThe EDW Ecosystem
How big data and AI saved the day: critical IP almost walked out the door
How big data and AI saved the day: critical IP almost walked out the doorHow big data and AI saved the day: critical IP almost walked out the door
How big data and AI saved the day: critical IP almost walked out the door
DataWorks Summit
 
Tag based policies using Apache Atlas and Ranger
Tag based policies using Apache Atlas and RangerTag based policies using Apache Atlas and Ranger
Tag based policies using Apache Atlas and Ranger
Vimal Sharma
 
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcaderoIasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Codecamp Romania
 

What's hot (20)

Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 
Big Data – A New Testing Challenge
Big Data – A New Testing ChallengeBig Data – A New Testing Challenge
Big Data – A New Testing Challenge
 
Open Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache AtlasOpen Metadata and Governance with Apache Atlas
Open Metadata and Governance with Apache Atlas
 
Compute-based sizing and system dashboard
Compute-based sizing and system dashboardCompute-based sizing and system dashboard
Compute-based sizing and system dashboard
 
Insights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesInsights into Real World Data Management Challenges
Insights into Real World Data Management Challenges
 
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
 
Breaking the Silos: Storage for Analytics & AI
Breaking the Silos: Storage for Analytics & AIBreaking the Silos: Storage for Analytics & AI
Breaking the Silos: Storage for Analytics & AI
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
 
Security, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationSecurity, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software Integration
 
Apache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop componentsApache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop components
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Spark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat Patterson
 
Oracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer IntroductionOracle Stream Analytics - Developer Introduction
Oracle Stream Analytics - Developer Introduction
 
The EDW Ecosystem
The EDW EcosystemThe EDW Ecosystem
The EDW Ecosystem
 
How big data and AI saved the day: critical IP almost walked out the door
How big data and AI saved the day: critical IP almost walked out the doorHow big data and AI saved the day: critical IP almost walked out the door
How big data and AI saved the day: critical IP almost walked out the door
 
Tag based policies using Apache Atlas and Ranger
Tag based policies using Apache Atlas and RangerTag based policies using Apache Atlas and Ranger
Tag based policies using Apache Atlas and Ranger
 
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcaderoIasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
 

Similar to Uses of Data Lakes

Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
StampedeCon
 
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
Amazon Web Services
 
Transforming Your IT with AWS
Transforming Your IT with AWSTransforming Your IT with AWS
Transforming Your IT with AWS
Amazon Web Services
 
Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/MLPreparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML
Amazon Web Services
 
Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...
Sri Ambati
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
Eric Kavanagh
 
(ENT211) Migrating the US Government to the Cloud | AWS re:Invent 2014
(ENT211) Migrating the US Government to the Cloud | AWS re:Invent 2014(ENT211) Migrating the US Government to the Cloud | AWS re:Invent 2014
(ENT211) Migrating the US Government to the Cloud | AWS re:Invent 2014
Amazon Web Services
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
Amazon Web Services
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
ElsonPaul2
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Amazon Web Services LATAM
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
Amazon Web Services
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
Amazon Web Services
 
AWS Big Data Solution Days
AWS Big Data Solution DaysAWS Big Data Solution Days
AWS Big Data Solution Days
Amazon Web Services
 
Migrate and Manage Workloads with Apps Associates
Migrate and Manage Workloads with Apps AssociatesMigrate and Manage Workloads with Apps Associates
Migrate and Manage Workloads with Apps Associates
Amazon Web Services
 
AWS Storage State of the Union
AWS Storage State of the UnionAWS Storage State of the Union
AWS Storage State of the Union
Amazon Web Services
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
DATAVERSITY
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Amazon Web Services
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
SoftServe
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
datastack
 
Building your Datalake on AWS
Building your Datalake on AWSBuilding your Datalake on AWS
Building your Datalake on AWS
Amazon Web Services
 

Similar to Uses of Data Lakes (20)

Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
 
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
 
Transforming Your IT with AWS
Transforming Your IT with AWSTransforming Your IT with AWS
Transforming Your IT with AWS
 
Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/MLPreparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML
 
Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
 
(ENT211) Migrating the US Government to the Cloud | AWS re:Invent 2014
(ENT211) Migrating the US Government to the Cloud | AWS re:Invent 2014(ENT211) Migrating the US Government to the Cloud | AWS re:Invent 2014
(ENT211) Migrating the US Government to the Cloud | AWS re:Invent 2014
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
AWS Big Data Solution Days
AWS Big Data Solution DaysAWS Big Data Solution Days
AWS Big Data Solution Days
 
Migrate and Manage Workloads with Apps Associates
Migrate and Manage Workloads with Apps AssociatesMigrate and Manage Workloads with Apps Associates
Migrate and Manage Workloads with Apps Associates
 
AWS Storage State of the Union
AWS Storage State of the UnionAWS Storage State of the Union
AWS Storage State of the Union
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Building your Datalake on AWS
Building your Datalake on AWSBuilding your Datalake on AWS
Building your Datalake on AWS
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Amazon Web Services
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
Amazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate

Uses of Data Lakes

  • 1. Pop-up Loft: Uses of Data Lakes (Data Lakes in the Wild). Ryan Jancaitis, Sr. Product Manager, Envision Engineering (rjancait@amazon.com); Al Belsky, Sr. Solutions Engineer, Envision Engineering (albelsk@amazon.com). © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 2. Envision Engineering – About Us. The Solution: bring the “Art of the Possible” to our customers.
      • Collaborative: iterate and deliver based on constant feedback from customer stakeholders
      • Business Solutions: focus on solving business challenges, not technology challenges
      • Specialized Team: end-to-end development approach, services, and skills
      Touchable, tangible results are more impactful than an architecture diagram. Analysis paralysis and uncertainties are barriers to cloud adoption: Which technology? Can it work? Where to begin?
  • 4. Envision Engineering: What is a Data Lake? Centralized Storage; Security Controls; Application Integration; Lineage and Auditing.
  • 5. Customer Example – United States Census: Core Business Challenge. Core Data & Copies; Data Security; Auditing; Usage Monitoring; Reproducibility; Storage Constraints; Compute Constraints.
  • 6. United States Census: Core Use Cases. Column-level Access Control (with cell-level capability); Data Lineage (macro level); On-Demand Infrastructure for analytics jobs; Cost Tracking per analytics job; Hadoop platform choice: Amazon EMR and Hortonworks HDP; Ability to run legacy scripts (SAS 9.4); Centralized Storage; LDAP-based user security/permissions.
  • 7. Deep Dive into Implementation – Column Security
      • Custom Accumulo Loader
          § loads datasets from S3 into Accumulo table(s)
          § assigns column names as security labels
      • Custom Accumulo Authorization handler
          § checks which labels the user has access to (in LDAP)
      Installed via Bootstrap script on EMR (Elastic MapReduce); installed on the Hortonworks cluster via Apache Ambari Blueprints.
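The column-security idea on this slide can be illustrated with a small in-memory sketch. It is not the custom Accumulo loader or authorization handler from the deck: `Cell`, `load_rows`, `scan`, and the `USER_LABELS` table (a stand-in for the LDAP lookup) are hypothetical names, and no Accumulo API is called. It only shows the pattern of labeling each cell with its column name and filtering reads by the labels a user holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One stored value, tagged with a security label."""
    row_id: str
    column: str
    value: str
    label: str


def load_rows(rows):
    """Mimic the loader step: each cell gets its column name as its label."""
    return [
        Cell(row_id=row_id, column=col, value=val, label=col)
        for row_id, record in rows.items()
        for col, val in record.items()
    ]


def scan(cells, authorizations):
    """Mimic the authorization step: keep only cells whose label the user holds."""
    visible = {}
    for cell in cells:
        if cell.label in authorizations:
            visible.setdefault(cell.row_id, {})[cell.column] = cell.value
    return visible


# Hypothetical stand-in for the LDAP lookup of a user's permitted labels.
USER_LABELS = {
    "analyst": {"age", "state"},
    "auditor": {"age", "state", "income"},
}

cells = load_rows({
    "r1": {"age": "34", "state": "MD", "income": "52000"},
    "r2": {"age": "61", "state": "VA", "income": "78000"},
})
analyst_view = scan(cells, USER_LABELS["analyst"])
auditor_view = scan(cells, USER_LABELS["auditor"])
```

In this sketch the analyst never sees the `income` column, while the auditor sees all three. Because the label sits on each cell rather than on the table, the same pattern could extend to the cell-level capability mentioned on slide 6.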
  • 8. Deep Dive into Implementation – Hortonworks: Hortonworks Cluster on EC2
      • Create recipes:
          § Accumulo setup: 1. Install Loader; 2. Install Custom auth; 3. Stop Accumulo; 4. Start Accumulo
          § Import Data & Run SAS
      • Create stack Blueprint
  • 9. Deep Dive into Implementation – SAS Script Execution
      • SAS instance is launched per analytics task (on-demand)
      • AWS Systems Manager “Run Command” triggers a remote shell script
      • Shell script downloads the SAS script from S3 and runs it via SAS
      • SAS accesses the data via the Hive endpoint on Hadoop and reads from an External Table linked to the Accumulo table
      • SAS persists results locally
      • Shell script copies the results to Amazon S3
      Diagram labels: Amazon EC2 SAS Instance (Amazon AMI, SAS 9.4); Amazon EMR; Amazon S3 bucket; Amazon EC2 Systems Manager.
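The Run Command step above might look like the following sketch. It only builds the request parameters. The result could then be passed to the boto3 Systems Manager client as `send_command(**request)`, which accepts `InstanceIds`, `DocumentName`, and `Parameters`. The function name, the `/tmp` paths, the bucket names, and the exact `sas` command line are assumptions for illustration and do not come from the deck.

```python
import shlex


def build_run_command_request(instance_id, script_s3_uri, results_s3_uri):
    """Build parameters for an SSM Run Command call using AWS-RunShellScript."""
    script_name = script_s3_uri.rsplit("/", 1)[-1]
    local_script = shlex.quote(f"/tmp/{script_name}")
    results_dir = "/tmp/results"  # hypothetical local results directory
    commands = [
        f"mkdir -p {results_dir}",
        # Download the SAS script from S3.
        f"aws s3 cp {shlex.quote(script_s3_uri)} {local_script}",
        # Run it in batch mode (assumed invocation; SAS persists results locally).
        f"sas -sysin {local_script} -log {results_dir}/run.log -print {results_dir}/output.lst",
        # Copy the results back to S3.
        f"aws s3 cp {results_dir} {shlex.quote(results_s3_uri)} --recursive",
    ]
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {"commands": commands},
    }


request = build_run_command_request(
    "i-0123456789abcdef0",                    # placeholder instance ID
    "s3://example-bucket/jobs/census.sas",    # placeholder script location
    "s3://example-bucket/results/job-42/",    # placeholder results location
)
```

Keeping the request builder separate from the AWS call makes the command list easy to inspect before anything runs on the SAS instance.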
  • 10. Deep Dive into Implementation – Pulling it all together
      1. User initiates an Analysis Routine based on selected data
      2. A NodeJS Lambda function launches EMR/HDX via SDK/API
      3. A Hadoop cluster is launched and bootstrapped to install Accumulo and Hive
      4. A custom Java routine creates Accumulo rights and data tables and loads the data
      5. Hive tables are created based on data visible to the user in Accumulo
      6. A SAS AMI is launched with Hive connection details
      7. A SAS program is run and results are stored in S3. The AWS instances and services are terminated
      Diagram labels: 1) Analysis Routine, 2) Data File, 3) AD Group; 1) Location of Results, 2) Location of Logs.
  • 11. Deep Dive into Implementation – Serverless UI. Client Side: Single Page App. Server Side: Serverless, API Gateway.
  • 12. Solution Architecture. Diagram labels: Serverless UI; Data and Scripts; Analytics Infrastructure; Hadoop; Spark; R; Other Analytics; Amazon CloudWatch Logs.
  • 13. Demo
  • 14. US Census Summary and Next Steps
      - Data Lake provides:
          - Centralized, secured storage
          - On-demand analytics environment
          - Data and Program Lineage
          - Re-use of existing data and SAS Programs
      - What’s Next:
          - Authority to Operate in FedRAMP High environment
          - Spin-up of interactive environments
          - Control of AWS images and cost by user and group
          - Deeper integration with Apache Ranger and Atlas
  • 15. Customer Example – USC Alzheimer's Therapeutic Research Institute. The USC ATRI mission is to create a leading hub of basic, translational, and clinical research in neuroscience and neurological diseases by collaborating with sites and investigators around the world.
  • 16. Customer Example – ATRI: Core Business Challenge. Diagram: Silo A, Silo B, and Silo C, each labeled Core Data & Routines.
  • 17. ATRI: Core Use Cases. Collect Data from Multiple Sources; Data Lineage; On-Demand Infrastructure for analytics jobs; HIPAA Eligible Environment; LDAP-based user security/permissions; Data Discovery.
  • 18. Outcome
      • Web-accessible data lake that demonstrates:
          • User authentication and authorization
          • Text-based search and discovery based on project name, files within the project, and columns within tables/CSV files
          • Control of access roles and rights on a data set
          • Analytics task execution scripts against selected data sets (R, Python, and Java)
          • Audit information for data: storage, sharing, and usage
          • REST-like API(s) for uploading and updating data in the data lake
          • Storing data sets in the data lake via scripting/automation
          • Storing data in a HIPAA-eligible environment
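The text-based discovery described on this slide (matching on project name, file names, and column names) can be sketched with a small inverted index. This is an illustration only, not the ATRI implementation: `build_index`, `search`, and the sample projects are hypothetical, and a production system would more likely rely on a dedicated search service.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace, underscores, and dots."""
    return text.lower().replace("_", " ").replace(".", " ").split()


def build_index(projects):
    """Map each token from project, file, and column names to project names."""
    index = defaultdict(set)
    for project in projects:
        terms = [project["name"]]
        for data_file in project["files"]:
            terms.append(data_file["name"])
            terms.extend(data_file["columns"])
        for term in terms:
            for token in tokenize(term):
                index[token].add(project["name"])
    return index


def search(index, query):
    """Return the projects that match every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical sample metadata.
projects = [
    {"name": "Memory Study",
     "files": [{"name": "visits.csv", "columns": ["patient_id", "visit_date"]}]},
    {"name": "Imaging Trial",
     "files": [{"name": "scans.csv", "columns": ["patient_id", "scan_type"]}]},
]
index = build_index(projects)
shared_column_hits = search(index, "patient")
column_hits = search(index, "scan type")
```

Here a query on a shared column token returns both projects, while a query on tokens from a column unique to one project returns only that project.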
  • 19. Customer Example – National Heart, Lung, and Blood Institute. The National Heart, Lung, and Blood Institute’s (NHLBI) mission is to provide global leadership for a research, training, and education program to promote the prevention and treatment of heart, lung, and blood disease. To this end, institutions, scientists, and researchers rely on data provided by the NHLBI to drive basic discoveries about the causes of disease and translate those discoveries into clinical practice.
  • 20. Customer Example – NHLBI: Core Business Challenge. Massive amounts of genetic data; Consent Management held by outside group; Auditing; Compute Constraints.
  • 21. NHLBI: Core Use Cases. Data Lineage; On-Demand Infrastructure for genomics tasks; Cost Tracking; Centralized Storage; Consent Group based access controls; Data Discovery.
  • 22. Outcome
      • Web-accessible data lake that demonstrates:
          • User authentication and authorization based on internal Identity Provider
          • Text-based search and discovery based on dbGaP controlled studies
          • Control of access roles and rights on a study by Consent Group
          • On-demand genomics tooling based on selected data files (samtools, bcftools, HTSGet, Plink, etc.)
          • Audit information for data: storage, sharing, and usage
  • 23. NHLBI Solution Architecture. Diagram labels: SAML Authentication; SAML Assertion with Consent Group permissions; NIH CIT; dbGaP/SRA; Study details; Meta-data, run lists, and permission details; File Access Request; Secured access; IAM Roles; UI; NHLBI Data Lake; NHLBI Data Storage (Study 1, Study 2, Study N).
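The consent-group check implied by this architecture can be sketched as follows: a file access request is allowed only when the (study, consent group) pair required by the file appears among the permissions carried in the user's SAML assertion, and every decision is recorded for audit. The catalog, the function name, and the sample consent codes are assumptions for illustration. The real solution enforces access through IAM roles, which this sketch does not model.

```python
from datetime import datetime, timezone

# Hypothetical catalog: file key -> (study, consent group) required to read it.
FILE_CATALOG = {
    "study-1/sample-a.bam": ("study-1", "GRU"),
    "study-1/sample-b.bam": ("study-1", "HMB"),
    "study-2/sample-c.bam": ("study-2", "GRU"),
}

audit_log = []  # in-memory stand-in for an audit store


def authorize_file_request(user, granted, file_key, catalog=FILE_CATALOG):
    """Allow access only if the file's (study, consent group) was granted."""
    required = catalog.get(file_key)
    allowed = required is not None and required in granted
    # Record every decision, allowed or not.
    audit_log.append({
        "user": user,
        "file": file_key,
        "allowed": allowed,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return allowed


# Permissions as they might be parsed from a SAML assertion (assumed shape).
granted = {("study-1", "GRU")}
decisions = [
    authorize_file_request("researcher1", granted, "study-1/sample-a.bam"),
    authorize_file_request("researcher1", granted, "study-1/sample-b.bam"),
    authorize_file_request("researcher1", granted, "study-9/missing.bam"),
]
```

Denying unknown files by default and logging denials as well as grants keeps the audit trail complete, in line with the auditing need listed on slide 20.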
  • 24. Uses of Data Lake – In Summary. Common Needs Across Verticals; Common Services to Meet Data Lake Needs: Centralized Storage; Security Controls; Application Integration; Lineage and Auditing.
  • 25. Pop-up Loft. aws.amazon.com/activate: Everything and Anything Startups Need to Get Started on AWS.