SlideShare a Scribd company logo
1 of 39
Download to read offline
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Dickson Yue
Solution Architect
21 June 2017
Building Your First Data Lake
Modern Data Architectures on AWS
Today's conversation
Business drivers for a Data Lake
Designing and building
Production use cases
Outcome 1 : Modernize and consolidate
• Insights to enhance business applications and create new digital services
Outcome 2 : Innovate for new revenues
• Personalization, demand forecasting, risk analysis
Outcome 3 : Real-time engagement
• Interactive customer experience, event-driven automation, fraud detection
Outcome 4 : Automate for expansive reach
• Automation of business processes and physical infrastructure
Business Outcomes on a Modern Data Architecture
Expanding access requirements
Data
scientists
Automation /
events
Business
users
Data
analysts
Engagement
platforms
1. More personas need access to data, through appropriate tools
2. More systems need to link to data for decision and process automation
3. Users need to be able to find information, and access it securely
Exponential growth of business data
1. Data must be captured from diverse sources at speed and scale
2. Data needs to be pulled together, breaking down traditional silos
3. Benefits need to far outweigh the costs of collection and analysis
Transactions ERP Connected
devices
Social mediaWeb logs /
cookies
Speed (Real-time)
Ingest ServingData
sources
Scale (Batch)
Modern data architecture
Insights to enhance business applications, new digital services
Data analysts
Data scientists
Business users
Engagement platforms
Automation / events
Speed (Real-time)
Ingest ServingData
sources
Scale (Batch)
Modern data architecture
Insights to enhance business applications, new digital services
Transactions
Web logs /
cookies
ERP
Data analysts
Data scientists
Business users
Engagement platformsConnected
devices
Social media Automation / events
Speed (Real-time)
Ingest ServingData
sources
Scale (Batch)
Modern data architecture
Insights to enhance business applications, new digital services
Data Warehouse
Amazon Redshift
Legacy Apps
Amazon RDS
Data analysts
Data scientists
Business users
Engagement platforms
Schemaless
Amazon ElasticSearch
Direct Query
Amazon Athena
Near-Zero Latency
Amazon DynamoDB
Automation / events
Semi/Unstructured
Amazon EMR
Transactions
Web logs /
cookies
ERP
Connected
devices
Social media
Characteristics of a Data Lake
Future
Proof
Flexible
Access
Dive in
Anywhere
Collect
Anything
Speed (Real-time)
Ingest ServingData
sources
Scale (Batch)
Modern data architecture
Insights to enhance business applications, new digital services
Data Warehouse
Amazon Redshift
Legacy Apps
Amazon RDS
Data analysts
Data scientists
Business users
Engagement platforms
Schemaless
Amazon ElasticSearch
Direct Query
Amazon Athena
Near-Zero Latency
Amazon DynamoDB
Automation / events
Semi/Unstructured
Amazon EMR
Transactions
Web logs /
cookies
ERP
Connected
devices
Social media
Designed for 11 9s
of durability
Designed for
99.99% availability
Durable
Available
Store as much as you need
Scale storage and compute
independently
Scalable
Amazon S3
Amazon Redshift / Spectrum
Amazon EMR
Amazon Athena
Amazon DynamoDB
Integrated
Speed (Real-time)
Ingest ServingData
sources
Scale (Batch)
Modern data architecture
Insights to enhance business applications, new digital services
Data Warehouse
Amazon Redshift
Legacy Apps
Amazon RDS
Data analysts
Data scientists
Business users
Engagement platforms
Schemaless
Amazon ElasticSearch
Direct Query
Amazon Athena
Near-Zero Latency
Amazon DynamoDB
Automation / events
Amazon S3
Staged Data
(Data Lake)
Semi/Unstructured
Amazon EMR
Transactions
Web logs /
cookies
ERP
Connected
devices
Social media
Today's conversation
Business drivers for a Data Lake
Designing and building
Production use cases
Important Components of a Data Lake
Catalogue
& Search
Protect
& Secure
Access &
User Interface Ingest & Store
Data Ingestion into S3
AWS Direct Connect
AWS SnowballISV Connectors
Amazon Kinesis
Firehose
AWS Storage
Gateway
S3 Transfer
Acceleration
Speed (Real-time)
Ingest ServingData
sources
Scale (Batch)
Modern data architecture
Insights to enhance business applications, new digital services
Data Warehouse
Amazon Redshift
Legacy Apps
Amazon RDS
Data analysts
Data scientists
Business users
Engagement platforms
Schemaless
Amazon ElasticSearch
Direct Query
Amazon Athena
Near-Zero Latency
Amazon DynamoDB
Automation / events
Amazon S3
Staged Data
(Data Lake)
Semi/Unstructured
Amazon EMR
Transactions
Web logs /
cookies
ERP
AWS Database
Migration
AWS Direct
Connect
Internet
Interfaces
Amazon
Kinesis
Connected
devices
Social media
Building a Data Catalogue
• Aggregated information about your storage & streaming
layer
• Storage service for metadata
Ownership, data lineage
• Data abstraction layer
Customer data = collection of prefixes
• Enabling data discovery
• API for use by entitlements service
AWS
Lambda
Amazon
DynamoDB
+
Streams
Amazon
Elasticsearch
AWS
LambdaS3
Bucket
PUT
OBJECT
CREATE
OBJECT
PUT
ITEM
UPDATE
STREAM
UPDATE
INDEX
Populating Metadata and Search
AWS
Glue
Managed Transform Engine
Job Scheduler
Data Catalog
Built on Apache Spark
Integrated with S3, RDS, Redshift & any
JDBC-compliant data store
Security
 Identity and Access
Management (IAM) policies
 Bucket policies
 Access Control Lists (ACLs)
 Private VPC endpoints to
Amazon S3
 Pre-signed S3 URLs
Encryption
 SSL endpoints
 Server Side Encryption
(SSE-S3)
 S3 Server Side
Encryption with
provided keys (SSE-C,
SSE-KMS)
 Client-side Encryption
Audit & Compliance
 Buckets access logs
 Lifecycle Management
Policies
 Versioning & MFA
deletes
 Certifications – HIPAA,
PCI, SOC 1/2/3 etc.
Implement the right cloud security controls
Speed (Real-time)
Ingest ServingData
sources
Scale (Batch)
Modern data architecture
Insights to enhance business applications, new digital services
Data Warehouse
Amazon Redshift
Legacy Apps
Amazon RDS
Data analysts
Data scientists
Business users
Engagement platforms
Schemaless
Amazon ElasticSearch
Direct Query
Amazon Athena
Near-Zero Latency
Amazon DynamoDB
Automation / events
Amazon S3
Staged Data
(Data Lake)
Semi/Unstructured
Amazon EMR
Transactions
Web logs /
cookies
ERP
AWS Database
Migration
AWS Direct
Connect
Internet
Interfaces
Amazon
Kinesis
Connected
devices
Social media
AWS
Cloud Trail
AWS
IAM
Amazon
CloudWatch
AWS
KMS
Speed (Real-time)
Ingest ServingData
sources
Scale (Batch)
Modern data architecture
Insights to enhance business applications, new digital services
Data Warehouse
Amazon Redshift
Legacy Apps
Amazon RDS
Data analysts
Data scientists
Business users
Engagement platforms
Schemaless
Amazon ElasticSearch
Direct Query
Amazon Athena
Near-Zero Latency
Amazon DynamoDB
Automation / events
Amazon S3
Staged Data
(Data Lake)
Semi/Unstructured
Amazon EMR
Transactions
Web logs /
cookies
ERP
AWS Database
Migration
AWS Direct
Connect
Internet
Interfaces
Amazon
Kinesis
Connected
devices
Social media
Amazon S3
Raw Data
Amazon EMR
ETL
Advanced
Analytics
MLlib
Event Capture
Amazon Kinesis
Stream Analysis
Amazon EMR
AWS
Cloud Trail
AWS
IAM
Amazon
CloudWatch
AWS
KMS
Today's conversation
Business drivers for a Data Lake
Designing and building
Production use cases
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Angus Tse
Director of Engineer
21 June 2017
Clickstream Analytics Pipeline
HK01
Growth
DATA BEATS
EMOTIONS
Sean Rad
Founder & CEO Tinder
Clickstream Analytics
• Free and easy
• Excellent for initial
• Good learning materials
Google Analytics (GA)
• Free and easy
• Excellent for initial
• Good learning materials
Google Analytics (GA)
• free version latency &
accuracy issue
• GA 360 (Premium) +
BigQuery are expensive
• Not flexible enough
Our needs
• Large data volume
• Raw data for Machine Learning
• Flexible for further processing
• Low latency
Building a scalable pipeline on
AWS
Piwik
• Open-source analytics
platform
• Realtime dashboard
• Web & mobile SDK
• PageView
• Content / Media
• A/B Test
Piwik
• Open-source analytics
platform
• Realtime dashboard
• Web & mobile SDK
• PageView
• Content / Media
• A/B Test
Phase 1
AWS
Lambda
API
Gateway
Kinesis
Firehose
Redshift Quicksight
Experience on AWS
• Complete and Integrated
• Quick. 2 man weeks for first version
• Easy to scale
• Minimal maintenance cost
Future
• More server-less in future
• S3 as datalake
• click event, system log, etc
• raw, processed data (like ML result)
• Hot on disk, cold on s3
• Explore AWS Machine Learning
AWS
Lambda
API
Gateway
Kinesis
Firehose
Redshift
Quicksight
S3
Machine
learning
EMR
SparkML
ML
Hot data
Raw data
Processed data
Cold data
Direct Query
Athena
Redshift
Spectrum
P2
Deep
learning AMI
Visualization
Serverless
Download HK01 We’re Hiring
Thanks
Join Us : hk01.com/job
Summary
1. S3 as data lake
2. Pick the right tool to match the
persona requirements
3. Go serverless

More Related Content

What's hot

Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture
Rajesh Kumar
 
Cloud migration strategies
Cloud migration strategiesCloud migration strategies
Cloud migration strategies
SogetiLabs
 

What's hot (20)

Introduction to AWS Cost Management
Introduction to AWS Cost ManagementIntroduction to AWS Cost Management
Introduction to AWS Cost Management
 
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
 
AWS Marketplace
AWS MarketplaceAWS Marketplace
AWS Marketplace
 
Microservices for Application Modernisation
Microservices for Application ModernisationMicroservices for Application Modernisation
Microservices for Application Modernisation
 
Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation Criteria
 
Cloud migration strategies
Cloud migration strategiesCloud migration strategies
Cloud migration strategies
 
Azure migration
Azure migrationAzure migration
Azure migration
 
AWS Cloud Adoption Framework
AWS Cloud Adoption Framework AWS Cloud Adoption Framework
AWS Cloud Adoption Framework
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
Modern Data Platform on AWS
Modern Data Platform on AWSModern Data Platform on AWS
Modern Data Platform on AWS
 
Cloud Migration Workshop
Cloud Migration WorkshopCloud Migration Workshop
Cloud Migration Workshop
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
Capgemini Cloud Assessment - A Pathway to Enterprise Cloud Migration
Capgemini Cloud Assessment - A Pathway to Enterprise Cloud MigrationCapgemini Cloud Assessment - A Pathway to Enterprise Cloud Migration
Capgemini Cloud Assessment - A Pathway to Enterprise Cloud Migration
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
The Microsoft Well Architected Framework For Data Analytics
The Microsoft Well Architected Framework For Data AnalyticsThe Microsoft Well Architected Framework For Data Analytics
The Microsoft Well Architected Framework For Data Analytics
 
AWS Cloud Migration Insights Forum
AWS Cloud Migration Insights ForumAWS Cloud Migration Insights Forum
AWS Cloud Migration Insights Forum
 

Similar to Building your Datalake on AWS

Similar to Building your Datalake on AWS (20)

Driving Business Outcomes with a Modern Data Architecture - Level 100
Driving Business Outcomes with a Modern Data Architecture - Level 100Driving Business Outcomes with a Modern Data Architecture - Level 100
Driving Business Outcomes with a Modern Data Architecture - Level 100
 
Driving Business Insights with a Modern Data Architecture AWS Summit SG 2017
Driving Business Insights with a Modern Data Architecture  AWS Summit SG 2017Driving Business Insights with a Modern Data Architecture  AWS Summit SG 2017
Driving Business Insights with a Modern Data Architecture AWS Summit SG 2017
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS AnalyticsFinding Meaning in the Noise: Understanding Big Data with AWS Analytics
Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
 
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – Overview
 
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K..."Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...
 
Building your First Big Data Application on AWS
Building your First Big Data Application on AWSBuilding your First Big Data Application on AWS
Building your First Big Data Application on AWS
 
AWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions ShowcaseAWS Webcast - Informatica - Big Data Solutions Showcase
AWS Webcast - Informatica - Big Data Solutions Showcase
 
Getting Started with Amazon QuickSight
Getting Started with Amazon QuickSightGetting Started with Amazon QuickSight
Getting Started with Amazon QuickSight
 
Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
 Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
 
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
 
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
ABD202_Best Practices for Building Serverless Big Data Applications
ABD202_Best Practices for Building Serverless Big Data ApplicationsABD202_Best Practices for Building Serverless Big Data Applications
ABD202_Best Practices for Building Serverless Big Data Applications
 
How TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPTHow TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPT
 
Modern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at Scale Modern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at Scale
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
 
AWS Big Data Solution Days
AWS Big Data Solution DaysAWS Big Data Solution Days
AWS Big Data Solution Days
 

More from Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Building your Datalake on AWS

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dickson Yue Solution Architect 21 June 2017 Building Your First Data Lake Modern Data Architectures on AWS
  • 2. Today's conversation Business drivers for a Data Lake Designing and building Production use cases
  • 3. Outcome 1 : Modernize and consolidate • Insights to enhance business applications and create new digital services Outcome 2 : Innovate for new revenues • Personalization, demand forecasting, risk analysis Outcome 3 : Real-time engagement • Interactive customer experience, event-driven automation, fraud detection Outcome 4 : Automate for expansive reach • Automation of business processes and physical infrastructure Business Outcomes on a Modern Data Architecture
  • 4. Expanding access requirements Data scientists Automation / events Business users Data analysts Engagement platforms 1. More personas need access to data, through appropriate tools 2. More systems need to link to data for decision and process automation 3. Users need to be able to find information, and access it securely
  • 5. Exponential growth of business data 1. Data must be captured from diverse sources at speed and scale 2. Data needs to be pulled together, breaking down traditional silos 3. Benefits need to far outweigh the costs of collection and analysis Transactions ERP Connected devices Social mediaWeb logs / cookies
  • 6. Speed (Real-time) Ingest ServingData sources Scale (Batch) Modern data architecture Insights to enhance business applications, new digital services Data analysts Data scientists Business users Engagement platforms Automation / events
  • 7. Speed (Real-time) Ingest ServingData sources Scale (Batch) Modern data architecture Insights to enhance business applications, new digital services Transactions Web logs / cookies ERP Data analysts Data scientists Business users Engagement platformsConnected devices Social media Automation / events
  • 8. Speed (Real-time) Ingest ServingData sources Scale (Batch) Modern data architecture Insights to enhance business applications, new digital services Data Warehouse Amazon Redshift Legacy Apps Amazon RDS Data analysts Data scientists Business users Engagement platforms Schemaless Amazon ElasticSearch Direct Query Amazon Athena Near-Zero Latency Amazon DynamoDB Automation / events Semi/Unstructured Amazon EMR Transactions Web logs / cookies ERP Connected devices Social media
  • 9. Characteristics of a Data Lake Future Proof Flexible Access Dive in Anywhere Collect Anything
  • 10. Speed (Real-time) Ingest ServingData sources Scale (Batch) Modern data architecture Insights to enhance business applications, new digital services Data Warehouse Amazon Redshift Legacy Apps Amazon RDS Data analysts Data scientists Business users Engagement platforms Schemaless Amazon ElasticSearch Direct Query Amazon Athena Near-Zero Latency Amazon DynamoDB Automation / events Semi/Unstructured Amazon EMR Transactions Web logs / cookies ERP Connected devices Social media Designed for 11 9s of durability Designed for 99.99% availability Durable Available Store as much as you need Scale storage and compute independently Scalable Amazon S3 Amazon Redshift / Spectrum Amazon EMR Amazon Athena Amazon DynamoDB Integrated
  • 11. Speed (Real-time) Ingest ServingData sources Scale (Batch) Modern data architecture Insights to enhance business applications, new digital services Data Warehouse Amazon Redshift Legacy Apps Amazon RDS Data analysts Data scientists Business users Engagement platforms Schemaless Amazon ElasticSearch Direct Query Amazon Athena Near-Zero Latency Amazon DynamoDB Automation / events Amazon S3 Staged Data (Data Lake) Semi/Unstructured Amazon EMR Transactions Web logs / cookies ERP Connected devices Social media
  • 12. Today's conversation Business drivers for a Data Lake Designing and building Production use cases
  • 13. Important Components of a Data Lake Catalogue & Search Protect & Secure Access & User Interface Ingest & Store
  • 14. Data Ingestion into S3 AWS Direct Connect AWS SnowballISV Connectors Amazon Kinesis Firehose AWS Storage Gateway S3 Transfer Acceleration
  • 15. Speed (Real-time) Ingest ServingData sources Scale (Batch) Modern data architecture Insights to enhance business applications, new digital services Data Warehouse Amazon Redshift Legacy Apps Amazon RDS Data analysts Data scientists Business users Engagement platforms Schemaless Amazon ElasticSearch Direct Query Amazon Athena Near-Zero Latency Amazon DynamoDB Automation / events Amazon S3 Staged Data (Data Lake) Semi/Unstructured Amazon EMR Transactions Web logs / cookies ERP AWS Database Migration AWS Direct Connect Internet Interfaces Amazon Kinesis Connected devices Social media
  • 16. Building a Data Catalogue • Aggregated information about your storage & streaming layer • Storage service for metadata Ownership, data lineage • Data abstraction layer Customer data = collection of prefixes • Enabling data discovery • API for use by entitlements service
  • 18. AWS Glue Managed Transform Engine Job Scheduler Data Catalog Built on Apache Spark Integrated with S3, RDS, Redshift & any JDBC-compliant data store
  • 19. Security  Identity and Access Management (IAM) policies  Bucket policies  Access Control Lists (ACLs)  Private VPC endpoints to Amazon S3  Pre-signed S3 URLs Encryption  SSL endpoints  Server Side Encryption (SSE-S3)  S3 Server Side Encryption with provided keys (SSE-C, SSE-KMS)  Client-side Encryption Audit & Compliance  Buckets access logs  Lifecycle Management Policies  Versioning & MFA deletes  Certifications – HIPAA, PCI, SOC 1/2/3 etc. Implement the right cloud security controls
  • 20. Speed (Real-time) Ingest ServingData sources Scale (Batch) Modern data architecture Insights to enhance business applications, new digital services Data Warehouse Amazon Redshift Legacy Apps Amazon RDS Data analysts Data scientists Business users Engagement platforms Schemaless Amazon ElasticSearch Direct Query Amazon Athena Near-Zero Latency Amazon DynamoDB Automation / events Amazon S3 Staged Data (Data Lake) Semi/Unstructured Amazon EMR Transactions Web logs / cookies ERP AWS Database Migration AWS Direct Connect Internet Interfaces Amazon Kinesis Connected devices Social media AWS Cloud Trail AWS IAM Amazon CloudWatch AWS KMS
  • 21. Speed (Real-time) Ingest ServingData sources Scale (Batch) Modern data architecture Insights to enhance business applications, new digital services Data Warehouse Amazon Redshift Legacy Apps Amazon RDS Data analysts Data scientists Business users Engagement platforms Schemaless Amazon ElasticSearch Direct Query Amazon Athena Near-Zero Latency Amazon DynamoDB Automation / events Amazon S3 Staged Data (Data Lake) Semi/Unstructured Amazon EMR Transactions Web logs / cookies ERP AWS Database Migration AWS Direct Connect Internet Interfaces Amazon Kinesis Connected devices Social media Amazon S3 Raw Data Amazon EMR ETL Advanced Analytics MLlib Event Capture Amazon Kinesis Stream Analysis Amazon EMR AWS Cloud Trail AWS IAM Amazon CloudWatch AWS KMS
  • 22. Today's conversation Business drivers for a Data Lake Designing and building Production use cases
  • 23. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Angus Tse Director of Engineer 21 June 2017 Clickstream Analytics Pipeline HK01
  • 24.
  • 28. • Free and easy • Excellent for initial • Good learning materials Google Analytics (GA)
  • 29. • Free and easy • Excellent for initial • Good learning materials Google Analytics (GA) • free version latency & accuracy issue • GA 360 (Premium) + BigQuery are expensive • Not flexible enough
  • 30. Our needs • Large data volume • Raw data for Machine Learning • Flexible for further processing • Low latency
  • 31. Building a scalable pipeline on AWS
  • 32. Piwik • Open-source analytics platform • Realtime dashboard • Web & mobile SDK • PageView • Content / Media • A/B Test
  • 33. Piwik • Open-source analytics platform • Realtime dashboard • Web & mobile SDK • PageView • Content / Media • A/B Test
  • 35. Experience on AWS • Complete and Integrated • Quick. 2 man weeks for first version • Easy to scale • Minimal maintenance cost
  • 36. Future • More server-less in future • S3 as datalake • click event, system log, etc • raw, processed data (like ML result) • Hot on disk, cold on s3 • Explore AWS Machine Learning AWS Lambda API Gateway Kinesis Firehose Redshift Quicksight S3 Machine learning EMR SparkML ML Hot data Raw data Processed data Cold data Direct Query Athena Redshift Spectrum P2 Deep learning AMI Visualization Serverless
  • 38. Thanks Join Us : hk01.com/job
  • 39. Summary 1. S3 as data lake 2. Pick the right tool to match the persona requirements 3. Go serverless