© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
James Juniper
Stephen Moon
Senior Solutions Architect, Amazon Web Services
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Solution Architect, Geo-Community Cloud, Natural Resources Canada
Big Data: Different forms of challenges
Visualization Variability
Volume Velocity Variety Veracity Value
Challenges are often driven by:
https://www.promptcloud.com
https://john-popelaars.blogspot.com
https://ww.signiant.com
https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/
Data growth
faster than ever
Data variety is
increasing
AWS Data Lake helps address this
Quickly ingest and store
any type of data
Insights and security,
together …
Run the right tool for the
right job without manually
copying data around
Data Lakes from AWS
Analytics
Machine Learning
Real-time data movement
Traditional data movement
Data Lake on AWS
Ingestion
Intelligence
Storage
Catalog
Variety of
ingestion tools
Decoupled
analytics from
storage/catalog
What data do I have?
Gartner:
“Through 2018, 80% of data lakes will not include effective
metadata management capabilities, making them
inefficient.”
“Metadata Is the Fish Finder in Data Lake”
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
AWS Glue components
Data Catalog (Discover)
• Apache Hive Metastore compatible
• Integrated with AWS services
• Automatic crawl and discover data
Job Authoring (Develop)
• Auto-generates ETL code
• Python and Apache Spark
• Edit, debug, and share
Job Execution (Deploy)
• Serverless execution
• Flexible scheduling
• Monitoring and alerting
What can crawlers discover?
IAM Role
AWS Glue Crawler Databases
Amazon
Redshift
Amazon S3
JDBC Connection
Object Connection
Built-in classifiers
MySQL
MariaDB
PostgreSQL
Aurora
Oracle
Amazon Redshift
Avro
Parquet
ORC
XML
JSON & JSONPaths
AWS CloudTrail
BSON
Logs
(Apache (Grok), Linux(Grok), MS(Grok), Ruby, Redis,
and many others)
Delimited
(comma, pipe, tab, semicolon)
< ALWAYS GROWING…>
Create additional custom
classifiers
Amazon
DynamoDB
NoSQL Connection
But I have my own data formats …?
− There is a custom classifier for that …
Row-Based
GROK Classifier
A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time.
XML
XML Classifier
XML tag that defines a table row in the XML document.
JSON
JSON Classifier
JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using AWS Glue supported operators.
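To make the row-based idea concrete, here is a minimal Python sketch of grok-style matching: named pattern fragments are expanded into one regular expression, which is then applied one line at a time. The pattern library, the `%{NAME:field}` expansion helper, and the sample log format are illustrative assumptions; this is not the AWS Glue classifier implementation, and real grok libraries ship far richer patterns.

```python
import re

# Hypothetical pattern library (name -> regex fragment), in the spirit of grok.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "INT": r"\d+",
    "PATH": r"/\S*",
}


def compile_grok(expression):
    """Expand %{NAME:field} tokens into named regex groups and compile."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{PATTERNS[name]})"

    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, expression))


def classify_lines(lines, pattern):
    """Return one dict (table row) per matching line; skip lines that do not match."""
    rows = []
    for line in lines:
        match = pattern.fullmatch(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


# Hypothetical log layout: client IP, HTTP method, request path, status code.
LOG_FORMAT = compile_grok("%{IP:client} %{WORD:method} %{PATH:path} %{INT:status}")

sample = [
    "10.0.0.7 GET /index.html 200",
    "not a log line",
    "10.0.0.9 POST /api/items 201",
]
rows = classify_lines(sample, LOG_FORMAT)
```

Each named group becomes a column, which is roughly how a row-based classifier turns free-form log lines into a table definition.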
Other ways of populating the catalog
• Call the AWS Glue CreateTable API
• Create a table manually with a DDL statement (in Amazon Athena or Amazon EMR)
• Import from Apache Hive Metastore (Apache Hive Metastore → AWS Glue ETL → AWS Glue Data Catalog)
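As a rough illustration of the CreateTable route, the sketch below builds the `TableInput` structure that the Glue `create_table` call expects for a CSV dataset in Amazon S3. The database, table, column names, and bucket path are hypothetical, and the boto3 call itself is left as a comment because it needs credentials and an existing Glue database.

```python
def build_table_input(name, s3_location, columns, partition_keys=()):
    """Build a Glue TableInput dict for an external, comma-delimited text table."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


# Hypothetical table describing raw order files.
table_input = build_table_input(
    name="orders_raw",
    s3_location="s3://example-data-lake/raw/orders/",
    columns=[("order_id", "string"), ("amount", "double")],
    partition_keys=[("dt", "string")],
)

# With boto3 installed and credentials configured, registration would look like:
#   import boto3
#   boto3.client("glue").create_table(DatabaseName="sales_db", TableInput=table_input)
```

Keeping the table definition as plain data makes it easy to review or version before anything is written to the catalog.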
How do I hydrate my data lake?
How do I drive value?
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
Analytics
Machine learning
Real-time data movement
Traditional data movement
Ingest data based on the type of data
Open and comprehensive
• Data movement from on-premises datacenters
• Dedicated network connection
• Secure appliances
• Ruggedized shipping container
• Database migration
• Gateway that lets applications write to the cloud
• Data movement from real-time sources
• Connect devices to AWS
• Real-time data streams
• Real-time video streams
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS Storage Gateway
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data movement from
real-time sources
Data movement from
your datacenters
Amazon S3
Amazon Glacier
AWS Glue
Real-time data movement and data lakes on AWS
Amazon Kinesis Data Firehose
AWS Glue Data Catalog
Amazon S3 Data
Data Lake on AWS
Amazon Kinesis Data Streams
Data definition
Kinesis Agent
Apache Kafka
AWS SDK
LOG4J
Flume
Fluentd
AWS Mobile SDK
Kinesis Producer Library
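For producers that use the AWS SDK directly, a small sketch of how an event might be shaped before it is handed to Kinesis Data Firehose or Kinesis Data Streams. The event fields and stream names are hypothetical; the trailing newline is a common convention so that objects delivered to Amazon S3 stay line-delimited, and the boto3 calls are shown only as comments.

```python
import json


def to_firehose_record(event):
    """Serialize one event as newline-terminated JSON for Firehose PutRecord."""
    line = json.dumps(event, sort_keys=True) + "\n"
    return {"Data": line.encode("utf-8")}


def to_stream_record(event, key_field):
    """Build the Data and PartitionKey arguments for Kinesis Data Streams PutRecord."""
    return {
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        "PartitionKey": str(event[key_field]),
    }


# Hypothetical sensor reading.
event = {"device_id": "sensor-42", "temperature_c": 21.5}
firehose_record = to_firehose_record(event)
stream_record = to_stream_record(event, "device_id")

# With boto3 installed and credentials configured, the calls would look like:
#   import boto3
#   boto3.client("firehose").put_record(
#       DeliveryStreamName="raw-events", Record=firehose_record)
#   boto3.client("kinesis").put_record(StreamName="raw-events", **stream_record)
```

The partition key decides which shard a record lands on in Kinesis Data Streams, so a high-cardinality field such as a device ID is usually a reasonable choice.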
IMPORTANT: Ingest data in its raw form …
Open and comprehensive
Amazon S3
Amazon Glacier
AWS Glue
• Store the data in its raw form:
• BEFORE
• Transforming
• Analyzing
• Manipulating
• Doing … anything … to it
CSV
ORC
Grok
Avro
Parquet
JSON
• This becomes your source of record you can
always go back to …
• Lifecycle policies allow you to shift it to warm
and cold storage.
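A lifecycle policy of this kind can be expressed as a small configuration document. The sketch below assumes a hypothetical `raw/` prefix and example thresholds of 30 and 90 days, and builds the structure accepted by the S3 `put_bucket_lifecycle_configuration` call; the call itself is commented out because it needs a real bucket.

```python
def raw_zone_lifecycle(prefix="raw/", warm_after_days=30, cold_after_days=90):
    """Build an S3 lifecycle configuration that ages raw data to warm, then cold storage."""
    if not 0 < warm_after_days < cold_after_days:
        raise ValueError("expected 0 < warm_after_days < cold_after_days")
    return {
        "Rules": [
            {
                "ID": "age-out-raw-data",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Warm tier: S3 Standard-Infrequent Access.
                    {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},
                    # Cold tier: Amazon Glacier.
                    {"Days": cold_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


lifecycle = raw_zone_lifecycle()

# With boto3 installed and credentials configured, applying it would look like:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake", LifecycleConfiguration=lifecycle)
```

Because the rule only changes the storage class, the raw objects remain the source of record and keep their keys.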
Tiered storage to optimize price / performance
Lowest cost
• Amazon S3 Standard
• Amazon S3 Standard-Infrequent Access
• Amazon S3 One Zone-Infrequent Access
• Amazon Glacier
• Migrate between tiers based on lifecycle policies
• Store data at $0.023*/GB/month with Amazon S3
• Store data at $0.004*/GB/month with Amazon Glacier
Active: Amazon S3 Standard
Infrequent: Amazon S3 Standard-Infrequent Access, Amazon S3 One Zone-IA
Archive: Amazon Glacier
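A quick back-of-the-envelope calculation shows why tiering matters. The sketch uses only the two per-GB prices quoted on this slide; the 100,000 GB dataset and the 80% archived share are made-up inputs, and request, retrieval, and transfer charges are ignored.

```python
# Prices quoted on the slide, in USD per GB per month.
S3_STANDARD_PER_GB_MONTH = 0.023
GLACIER_PER_GB_MONTH = 0.004


def tiered_monthly_cost(total_gb, archived_fraction):
    """Monthly storage cost when a fraction of the data sits in Amazon Glacier."""
    if not 0.0 <= archived_fraction <= 1.0:
        raise ValueError("archived_fraction must be between 0 and 1")
    archived_gb = total_gb * archived_fraction
    active_gb = total_gb - archived_gb
    cost = active_gb * S3_STANDARD_PER_GB_MONTH + archived_gb * GLACIER_PER_GB_MONTH
    return round(cost, 2)


# Hypothetical 100,000 GB data lake.
all_standard = tiered_monthly_cost(100_000, 0.0)
mostly_archived = tiered_monthly_cost(100_000, 0.8)
```

Under these assumptions the bill drops from about $2,300 to about $780 per month once 80% of the data has moved to the archive tier.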
Datasets in the Lake?
Raw datasets – immutable datasets that you can always go back to.
• Abstract out the complexities of how the data is stored through the catalog and SerDes
Optimizing Analytics and Machine Learning:
Curated datasets – query-optimized for consumption across a wide range of tools
Preparing raw data for consumption
Raw data stored in Data Lake:
Preparation:
Normalized
Partitioned
Compressed
Storage Optimized
Extract – Load – Transform
Data Lake
on AWS
Raw
Ingestion
Curated
DataSets
Data Catalog
ELT
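The preparation steps can be sketched locally with the standard library: normalize the column names and values, partition rows into Hive-style `field=value` directories, and compress each partition. The sample CSV, the file naming, and the choice of gzip-compressed CSV are assumptions made to keep the example self-contained; a real curated zone would more likely use a columnar format such as Parquet or ORC written by an ETL engine.

```python
import csv
import gzip
import io
import os
import tempfile
from collections import defaultdict


def curate(raw_csv_text, out_dir, partition_field):
    """Normalize, partition, and compress raw CSV text; return the files written."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(raw_csv_text)):
        # Normalize: trim whitespace and lower-case the column names.
        clean = {key.strip().lower(): value.strip() for key, value in row.items()}
        groups[clean[partition_field]].append(clean)

    written = []
    for value, rows in sorted(groups.items()):
        # Partition: one Hive-style directory per distinct value.
        part_dir = os.path.join(out_dir, f"{partition_field}={value}")
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, "part-0000.csv.gz")
        fields = [name for name in rows[0] if name != partition_field]
        # Compress: the partition value lives in the path, not in the file.
        with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written


# Hypothetical raw extract with inconsistent spacing and header case.
RAW = "Date, Region ,Amount\n2018-05-01, east ,10\n2018-05-01,west,7\n2018-05-02,east,3\n"
curated_dir = tempfile.mkdtemp()
curated_files = curate(RAW, curated_dir, "date")
```

The raw text is never modified; the curated files are a derived copy that can be rebuilt from the raw zone at any time.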
Which tool should I use to analyze my data?
How do I drive value?
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
Analytics
Machine Learning
Real-time data movement
Traditional data movement
Different tools for different users…
Business Reporting
Data Catalog
Central Storage
SageMaker
Machine Learning/Deep Learning
Data Scientists
Data Engineer
Amazon Athena – interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
Query instantly: Zero setup cost; just point to Amazon S3 and start querying
Pay per query: Pay only for queries run; save 30%–90% on per-query costs through compression
Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Easy: Serverless: zero infrastructure, zero administration; integrated with Amazon QuickSight
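Submitting a query programmatically amounts to sending SQL plus a database name and a results location. The sketch below builds the arguments for the Athena `start_query_execution` call; the database, table, partition columns, and bucket are hypothetical, and the boto3 call is left as a comment since it needs credentials.

```python
def athena_query_request(database, sql, output_location):
    """Build keyword arguments for the Athena StartQueryExecution call."""
    if not output_location.startswith("s3://"):
        raise ValueError("Athena writes query results to an s3:// location")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical table partitioned by year and month; filtering on partition
# columns limits how much data the query has to scan.
SQL = (
    "SELECT status, COUNT(*) AS hits "
    "FROM access_logs "
    "WHERE year = '2018' AND month = '05' "
    "GROUP BY status"
)
request = athena_query_request("weblogs", SQL, "s3://example-athena-results/")

# With boto3 installed and credentials configured, submission would look like:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```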
Amazon EMR – big data processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
Latest versions: Updated with the latest open source frameworks within 30 days of release
Low cost: Flexible billing with per-second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%-80%
Use Amazon S3 storage: Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector
Easy: Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning
Hadoop / Spark Analytics on AWS
YARN (Hadoop Resource Manager)
Batch | Script | Interactive | Real-time | Machine learning | NoSQL
Data Lake on AWS
Amazon S3
Amazon EMR
Managed Hadoop / Spark
Object storage
Fitting this into the Common Data Catalog
Amazon S3
Interactive Spark cluster
Amazon EMR
Amazon EMR
EMRFS
HDFS
Transient ETL job
Source of Truth
EMRFS
HDFS
Describes the data
MySQL DB instance
Unified data view
AWS Glue
Data Catalog
Stores the data
…
Amazon Redshift – data warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at scale: Columnar storage technology to improve I/O efficiency and scale query performance
Inexpensive: As low as $1,000 per terabyte per year, 1/10 the cost of traditional data warehouse solutions; start at $0.25 per hour
Open file formats: Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance
Data warehouse …
Amazon Redshift Data Warehouse
• Relational data
• Gigabytes to petabytes scale
• Reporting and analysis
• Schema defined prior to data load
AWS Glue ETL
On Prem
Amazon QuickSight
Existing or new BI tool
Redshift COPY
A Data Lake is not an Enterprise Data Warehouse
Data Lake | EDW
Complementary to EDW (not replacement) | EDW can be sourced from Data Lake
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured/semi-structured/unstructured data | Structured data only
Fast ingestion of new data/content | Time consuming to introduce new content
Data Science + Prediction/Advanced Analytics + BI use cases | BI use cases
Data at low level of detail/granularity | Data at summary/aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)
Elastic storage and compute capacity – decoupled | Explicitly sized environments, compute and storage scaled in linearly
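The schema-on-read versus schema-on-write distinction can be illustrated in a few lines of Python. In this sketch the "lake" is just a list of raw JSON lines and the "warehouse" is a list guarded by a fixed schema; the field names and both helper functions are hypothetical and stand in for what a catalog plus query engine, or a warehouse loader, would do.

```python
import json

# Hypothetical warehouse schema, fixed before any data is loaded.
WAREHOUSE_SCHEMA = {"user_id": int, "amount": float}


def write_with_schema(table, record):
    """Schema on write: reject records that do not fit before storing them."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    table.append({name: cast(record[name]) for name, cast in WAREHOUSE_SCHEMA.items()})


def read_with_schema(raw_lines, fields):
    """Schema on read: leave raw lines untouched and project fields at query time."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        rows.append({name: cast(doc[name]) if name in doc else None
                     for name, cast in fields.items()})
    return rows


# The lake accepts whatever arrives, in its raw form.
lake = [
    '{"user_id": "7", "amount": "19.5", "coupon": "SPRING"}',
    '{"user_id": 8, "page": "/home"}',
]
purchases = read_with_schema(lake, {"user_id": int, "amount": float})
coupons = read_with_schema(lake, {"coupon": str})

# The warehouse only accepts records that match its predefined schema.
warehouse = []
write_with_schema(warehouse, {"user_id": "7", "amount": "19.5"})
try:
    write_with_schema(warehouse, {"user_id": 8, "page": "/home"})
    rejected = False
except ValueError:
    rejected = True
```

The same raw lines serve two different questions without reloading anything, while the warehouse trades that flexibility for guaranteed structure.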
Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in Amazon S3 data lake
Amazon S3 Data Lake
Amazon Redshift data
Amazon Redshift Spectrum query engine
Exabyte Redshift SQL queries against Amazon S3
Join data across Redshift and Amazon S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
Pay only for the amount of data scanned
Amazon Redshift Spectrum
Query your Data Lake
Amazon Redshift
JDBC/ODBC
Amazon Redshift Spectrum
Scale-out serverless compute (1 2 3 4 ... N)
AWS Glue Data Catalog
COPY commands
Hot data
Query directly on Data Lake
Data Lakes extend the traditional data warehouse
Data warehouse
Business intelligence
OLTP ERP CRM LOB
• Relational and nonrelational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Devices Web Sensors Social
Data lake
Big data processing,
real-time, machine learning
Machine Learning & Big Data
Big Data driving Machine Learning
Better Decisions
Object Storage
Databases
Data warehouse
Streaming analytics
BI
Hadoop
Spark/Presto
Elasticsearch
Better Products
Machine Learning
Deep Learning/AI
More Users
More Data
Click stream
User activity
Generated content
Purchases
Clicks
Likes
Sensor data
Agility in Machine Learning
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
Analytics
Machine Learning
Real-time data movement
On-premises data movement
In Summary…
Core Tenets
• Data lakes and data warehouses complement each other
• Loose coupling, but highly performant
• Storage, analytics, metadata management, etc.
• Future-proof your analytics
• Choosing the best tool for the job
• Elasticity and multiple clusters for dedicated purposes
• Replace capacity planning with a consumption model
• Don’t forget metadata management
Use the right storage tier and data format
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format you will access it
Data characteristics → Hot, warm, cold
Cost → Right cost
Federal Geospatial Platform
Geo-Community Cloud on AWS
Federal Geospatial Platform
The Leader in Geospatial for the Government of Canada
• Easy access to GC “AAA” Geospatial Data
• Standards-based formats
• RESTful web services
• ISO Metadata
• OGC
• Simple workflow to assess, visualize and publish
• Re-usable viewer on GitHub
• Collaborative Mapping Environment on Esri’s ArcGIS Online
• FGP Geo-Community Cloud Platform as a Service (PaaS) on AWS
• A GC standards compliant Geospatial Platform On-Demand
…OBJECTIVES 2018-20
• Make Government of Canada Earth Observation information more easily available to Canadians
• Access, Visualization and Analysis functionality for EO and Spatial Information using the Federal Geospatial Platform (GC Tool)
• Enhanced imagery visualization options (past/present time-series)
• On-the-fly imagery processing (projection, class renderings, dynamic mosaics)
• Geoanalytics against near real time GC imagery on-demand
SpaceX Falcon 9 Launch Vehicle
Radarsat Constellation Mission (RCM) 2018
FGP Geo-Community Cloud
• 2017-18 Proof of Concept on AWS (complete)
• 2018-19 Foundation Laid – SSC Brokered Cloud
• FGP “Core Solution Stack”
• 2019-20
• On Demand Processing Capabilities via API Gateway
• Geospatial Managed Storage - host your own geospatial data
• Support multiple “portals” from a common GC ecosystem
• Concurrently…
• Innovation Zone
• Sandbox environments for broad-based Geospatial R&D P/T/A
• AI and Machine Learning against Geospatial + EO integrated with FGP Platform as a Service
Geo-Community Cloud – AWS Services
ca-central-1a
Public Subnet
Private Subnet
Private Subnet
App Tier
Web Tier
DB Tier
Amazon Route 53
WAF Web Application
Firewall
Internet Gateway
Classic Load Balancer
EC2 Instances
Application Load Balancer
Elastic Block Store
NAT Gateway
Database
S3 Bucket
Glacier Storage
NAT Gateway
NAT Gateway
Auto Scaling
Thank You!
Please rate my session.
https://amzn.to/ottawa-sessions
Track: Technical
Session: 2:10 PM - Build Data Lakes and Analytics on AWS
How did we do?
https://amzn.to/ottawa-summit

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Amazon Web Services
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
Amazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Build Data Lakes and Analytics on AWS: Patterns & Best Practices

  • 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Build Data Lakes and Analytics on AWS: Patterns & Best Practices. Stephen Moon, Senior Solutions Architect, Amazon Web Services; James Juniper, Solution Architect, Geo-Community Cloud, Natural Resources Canada
  • 2. Big Data: Different forms of challenges. Volume, Velocity, Variety, Veracity, Value, Visualization, Variability
  • 3. Challenges are often driven by: data growth faster than ever; data variety is increasing. Sources: https://www.promptcloud.com, https://john-popelaars.blogspot.com, https://ww.signiant.com, https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/
  • 4. AWS Data Lake helps address this. Quickly ingest and store any type of data. Insights and security, together … Run the right tool for the right job without manually copying data around.
  • 5. Data Lakes from AWS. Data Lake on AWS: Ingestion, Storage, Catalog, Intelligence. Traditional data movement; Real-time data movement; Analytics; Machine Learning. Variety of ingestion tools. Decoupled analytics from storage/catalog.
  • 6. What data do I have?
  • 7. What data do I have? Gartner: “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.” (“Metadata Is the Fish Finder in Data Lake”) Data Lake on AWS: Storage | Archival Storage | Data Catalog
  • 8. AWS Glue components. Data Catalog (Discover): Apache Hive Metastore compatible; integrated with AWS services; automatic crawl and discover data. Job Authoring (Develop): auto-generates ETL code; Python and Apache Spark; edit, debug, and share. Job Execution (Deploy): serverless execution; flexible scheduling; monitoring and alerting.
  • 9. What can crawlers discover? An AWS Glue Crawler with an IAM Role reads from: databases and Amazon Redshift (JDBC connection), Amazon S3 (object connection), Amazon DynamoDB (NoSQL connection). Built-in classifiers: MySQL, MariaDB, PostgreSQL, Aurora, Oracle, Amazon Redshift, Avro, Parquet, ORC, XML, JSON & JSONPaths, AWS CloudTrail, BSON, Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others), Delimited (comma, pipe, tab, semicolon) <ALWAYS GROWING…>. Create additional custom classifiers.
  • 10. But I have my own data formats …? There is a custom classifier for that … Row-based: Grok classifier. A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. XML: XML classifier. The XML tag that defines a table row in the XML document. JSON: JSON classifier. A JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using AWS Glue supported operators.
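The row-based idea on slide 10 (named patterns matched one line at a time) can be illustrated with a small sketch. This is not how AWS Glue implements grok classifiers; the pattern library `GROK_PATTERNS`, the helper names, and the sample log format are all hypothetical and heavily simplified, and only the `%{NAME:field}` token shape follows common grok usage.

```python
import re

# Hypothetical, heavily simplified pattern library. Real grok libraries
# ship many more named patterns with more careful regular expressions.
GROK_PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+",
    "URIPATH": r"/\S*",
}

# Matches grok-style tokens of the form %{PATTERN_NAME:field_name}.
_TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def grok_to_regex(pattern):
    """Expand %{NAME:field} tokens into named regex groups."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{GROK_PATTERNS[name]})"
    return re.compile("^" + _TOKEN.sub(expand, pattern) + "$")


def classify_lines(pattern, lines):
    """Apply the pattern one line at a time; keep only lines that match."""
    regex = grok_to_regex(pattern)
    rows = []
    for line in lines:
        match = regex.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


# Assumed sample format: client IP, HTTP method, path, status code.
LOG_PATTERN = "%{IP:client} %{WORD:method} %{URIPATH:path} %{NUMBER:status}"
sample_lines = [
    "10.0.0.1 GET /index.html 200",
    "not a log line",
    "10.0.0.2 POST /api/items 201",
]
rows = classify_lines(LOG_PATTERN, sample_lines)
for row in rows:
    print(row)
```

Each matching line becomes one row whose columns are the named fields, which is roughly what a row-based classifier needs in order to infer a table schema. Lines that do not match are skipped here; a real classifier would more likely report that the pattern does not fit the data.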
  • 11. Other ways of populating the catalog: Call the AWS Glue CreateTable API. Create table manually. DDL statement (in Amazon Athena or Amazon EMR). Import from Apache Hive Metastore into the AWS Glue Data Catalog (AWS Glue ETL).
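As a rough sketch of the first option on slide 11, the snippet below builds the `TableInput` structure that the Glue CreateTable API accepts. The database name, table name, bucket path, and columns are hypothetical, and the CSV SerDe settings are one common choice rather than anything the slides prescribe. The actual API call is left commented out because it needs the boto3 package and AWS credentials.

```python
def build_table_input(name, location, columns, partition_keys=()):
    """Return a TableInput dict describing an external CSV table.

    columns and partition_keys are sequences of (name, hive_type) pairs.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": (
                "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
            ),
            "SerdeInfo": {
                "SerializationLibrary": (
                    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
                ),
                "Parameters": {"field.delim": ","},
            },
        },
    }


# Hypothetical table over raw CSV files in an example bucket.
table_input = build_table_input(
    name="events_raw",
    location="s3://example-bucket/raw/events/",
    columns=[("event_id", "string"), ("amount", "double")],
    partition_keys=[("year", "string"), ("month", "string")],
)

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("glue").create_table(
#       DatabaseName="raw_zone",  # hypothetical database name
#       TableInput=table_input,
#   )
print(table_input["StorageDescriptor"]["Location"])
```

Keeping the table definition as plain data makes it easy to review or version before anything is registered in the catalog.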
  • 12. How do I hydrate my data lake?
  • 13. How do I drive value? Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. Traditional data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Storage | Archival Storage | Data Catalog
  • 14. Ingest data based on the type of data. Open and comprehensive. Data movement from your datacenters: AWS Direct Connect (dedicated network connection), AWS Snowball (secure appliances), AWS Snowmobile (ruggedized shipping container), AWS Database Migration Service (database migration), AWS Storage Gateway (gateway that lets applications write to the cloud). Data movement from real-time sources: AWS IoT Core (connect devices to AWS), Amazon Kinesis Data Firehose and Amazon Kinesis Data Streams (real-time data streams), Amazon Kinesis Video Streams (real-time video streams). Amazon S3, Amazon Glacier, AWS Glue.
  • 15. Real-time data movement and data lakes on AWS. Producers: Kinesis Agent, Apache Kafka, AWS SDK, LOG4J, Flume, Fluentd, AWS Mobile SDK, Kinesis Producer Library. Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose. Data Lake on AWS: Amazon S3 (data), AWS Glue Data Catalog (data definition).
  • 16. IMPORTANT: Ingest data in its raw form … Open and comprehensive. Amazon S3, Amazon Glacier, AWS Glue. • Store the data in its raw form BEFORE transforming, analyzing, manipulating, or doing … anything … to it. Formats: CSV, ORC, Grok, Avro, Parquet, JSON. • This becomes your source of record you can always go back to … • Lifecycle policies allow you to shift it to warm and cold storage.
  • 17. Tiered storage to optimize price/performance. Tiers, from active through infrequent to archive (lowest cost): Amazon S3 Standard, Amazon S3 Standard-Infrequent Access, Amazon S3 One Zone-Infrequent Access, Amazon Glacier. • Migrate between tiers based on lifecycle policies • Store data at $0.023*/GB/month with Amazon S3 • Store data at $0.004*/GB/month with Amazon Glacier
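To make the tiering numbers on slide 17 concrete, here is a small sketch that compares monthly storage cost using the two per-GB prices quoted on the slide, and shows what a lifecycle rule could look like. The data volumes, the 30-day and 365-day thresholds, the rule ID, the prefix, and the bucket name are assumptions for illustration; request, retrieval, and minimum-duration charges are ignored, and the Infrequent Access tiers are left out of the cost calculation because the slide gives no price for them.

```python
# Prices quoted on the slide (USD per GB per month).
S3_STANDARD_USD_PER_GB_MONTH = 0.023
GLACIER_USD_PER_GB_MONTH = 0.004


def monthly_storage_cost_usd(gb_in_standard, gb_in_glacier):
    """Storage-only cost; ignores requests, retrievals, and minimums."""
    cost = (
        gb_in_standard * S3_STANDARD_USD_PER_GB_MONTH
        + gb_in_glacier * GLACIER_USD_PER_GB_MONTH
    )
    return round(cost, 2)


# Hypothetical lake of 10,000 GB, of which 8,000 GB is rarely read.
all_standard = monthly_storage_cost_usd(10_000, 0)
tiered = monthly_storage_cost_usd(2_000, 8_000)
print(all_standard, tiered)

# Illustrative lifecycle rule: move raw objects to Standard-IA after
# 30 days and to Glacier after 365 days (both thresholds are assumptions).
LIFECYCLE_CONFIGURATION = {
    "Rules": [
        {
            "ID": "raw-zone-tiering",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# With boto3 installed and credentials configured, the rule could be
# applied with:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=LIFECYCLE_CONFIGURATION,
#   )
```

Under these assumptions, moving the rarely read 80% of the data to Glacier takes the storage bill from 230 to 78 USD per month.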
  • 18. Datasets in the Lake? Raw datasets: immutable datasets that you can always go back to. • Abstract out the complexities of how the data is stored through the catalog and SerDes. Optimizing Analytics and Machine Learning: curated datasets, query-optimized for consumption across a wide number of tools.
  • 19. Preparing raw data for consumption. Raw data stored in Data Lake. Preparation: normalized, partitioned, compressed, storage optimized. Extract – Load – Transform (ELT). Data Lake on AWS: Raw Ingestion, ELT, Curated Datasets, Data Catalog.
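The "partitioned" and "compressed" preparation steps on slide 19 can be sketched with the Python standard library alone. The example writes rows into Hive-style `key=value` directories as gzip-compressed CSV. The column names, the choice of `year` and `month` as partition keys, and the file name `part-0000.csv.gz` are assumptions; curated datasets are often written in a columnar format such as Parquet instead, which is not done here in order to keep the sketch dependency-free.

```python
import csv
import gzip
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, root, partition_cols=("year", "month")):
    """Group rows by partition columns and write one gzip CSV per group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in partition_cols)
        groups[key].append(row)

    written = []
    for key, members in sorted(groups.items()):
        # Hive-style layout: root/year=2018/month=05/part-0000.csv.gz
        part_dir = Path(root).joinpath(
            *(f"{col}={val}" for col, val in zip(partition_cols, key))
        )
        part_dir.mkdir(parents=True, exist_ok=True)
        out_path = part_dir / "part-0000.csv.gz"
        # Partition values live in the path, so drop them from the file.
        data_cols = [c for c in members[0] if c not in partition_cols]
        with gzip.open(out_path, "wt", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(
                fh, fieldnames=data_cols, extrasaction="ignore"
            )
            writer.writeheader()
            writer.writerows(members)
        written.append(out_path)
    return written


# Hypothetical raw records.
raw_rows = [
    {"year": "2018", "month": "05", "event_id": "a1", "amount": "9.50"},
    {"year": "2018", "month": "06", "event_id": "b2", "amount": "3.25"},
    {"year": "2018", "month": "05", "event_id": "c3", "amount": "1.00"},
]
curated_root = tempfile.mkdtemp()
paths = write_partitioned(raw_rows, curated_root)
relative = [p.relative_to(curated_root).as_posix() for p in paths]
print(relative)
```

Laying the data out this way lets a query engine that understands partitioned paths skip whole directories when a query filters on `year` or `month`, while the raw input stays untouched as the source of record.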
  • 20. Which tool should I use to analyze my data?
  • 21. How do I drive value? Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. Traditional data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Storage | Archival Storage | Data Catalog
  • 22. Different tools for different users… Business Reporting; Data Scientists; Data Engineer; Machine Learning/Deep Learning (SageMaker); Data Catalog; Central Storage.
  • 23. Amazon Athena – interactive analysis. Interactive query service to analyze data in Amazon S3 using standard SQL. No infrastructure to set up or manage and no data to load. Ability to run SQL queries on data archived in Amazon Glacier (coming soon). Query instantly: zero setup cost; just point to Amazon S3 and start querying. Pay per query: pay only for queries run; save 30%–90% on per-query costs through compression. Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types. Easy: serverless, zero infrastructure, zero administration; integrated with Amazon QuickSight.
  • 24. Amazon EMR – big data processing. Analytics and ML at scale. 19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more. Enterprise-grade security. Latest versions: updated with the latest open source frameworks within 30 days of release. Low cost: flexible billing with per-second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%–80%. Use Amazon S3 storage: process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector. Easy: launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning.
  • 25. Hadoop / Spark Analytics on AWS. Workloads: Batch, Script, Interactive, Real-time, Machine learning, NoSQL. YARN (Hadoop Resource Manager). Amazon EMR: managed Hadoop / Spark. Amazon S3: object storage. Data Lake on AWS.
  • 26. Fitting this into the Common Data Catalog. Amazon S3 stores the data (source of truth). AWS Glue Data Catalog describes the data (unified data view). Amazon EMR: interactive Spark cluster (EMRFS, HDFS); transient ETL job (EMRFS, HDFS). MySQL DB instance.
  • 27. Amazon Redshift – data warehousing. Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost. Massively parallel, scale from gigabytes to petabytes. Fast at scale: columnar storage technology to improve I/O efficiency and scale query performance. Inexpensive: as low as $1,000 per terabyte per year, 1/10 the cost of traditional data warehouse solutions; start at $0.25 per hour. Open file formats: analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3. Secure: audit everything; encrypt data end-to-end; extensive certification and compliance.
  • 28. Data warehouse … Amazon Redshift data warehouse: relational data; gigabytes to petabytes scale; reporting and analysis; schema defined prior to data load. AWS Glue ETL; on-premises sources; Redshift COPY. Amazon QuickSight; existing or new BI tool.
  • 29. A Data Lake is not an Enterprise Data Warehouse
      Data Lake | EDW
      Complementary to EDW (not replacement) | EDW can be sourced from Data Lake
      Schema on read (no predefined schemas) | Schema on write (predefined schemas)
      Structured/semi-structured/unstructured data | Structured data only
      Fast ingestion of new data/content | Time consuming to introduce new content
      Data Science + Prediction/Advanced Analytics + BI use cases | BI use cases
      Data at low level of detail/granularity | Data at summary/aggregated level of detail
      Loosely defined SLAs | Tight SLAs (production schedules)
      Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)
      Elastic storage and compute capacity, decoupled | Explicitly sized environments, compute and storage scaled linearly
  • 30. Amazon Redshift Spectrum. Extend the data warehouse to exabytes of data in Amazon S3 data lake. Amazon Redshift Spectrum query engine; Amazon Redshift data; Amazon S3 Data Lake. Exabyte Redshift SQL queries against Amazon S3. Join data across Redshift and Amazon S3. Scale compute and storage separately. Stable query performance and unlimited concurrency. CSV, ORC, Grok, Avro, & Parquet data formats. Pay only for the amount of data scanned.
  • 31. Amazon Redshift Spectrum: Query your Data Lake. Amazon Redshift (JDBC/ODBC); hot data loaded with COPY commands. Amazon Redshift Spectrum: scale-out serverless compute (1, 2, 3, 4 ... N). AWS Glue Data Catalog. Query directly on Data Lake.
  • 32. Data Lakes extend the traditional data warehouse. Data warehouse: business intelligence; sources: OLTP, ERP, CRM, LOB. Data lake: big data processing, real-time, machine learning; sources: devices, web, sensors, social. • Relational and nonrelational data • TBs–EBs scale • Diverse analytical engines • Low-cost storage & analytics
  • 33. Machine Learning & Big Data
  • 34. Big Data driving Machine Learning. More Users. More Data: click stream, user activity, generated content, purchases, clicks, likes, sensor data. Better Decisions: object storage, databases, data warehouse, streaming analytics, BI, Hadoop, Spark/Presto, Elasticsearch. Better Products: machine learning, deep learning/AI.
  • 35. Agility in Machine Learning. Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Storage | Archival Storage | Data Catalog
  • 36. In Summary…
  • 37. Core Tenets • Data lakes and data warehouses complement each other • Loose coupling, but highly performant • Storage, analytics, metadata management, etc. • Future-proof your analytics • Choosing the best tool for the job • Elasticity and multiple clusters for dedicated purposes • Replace capacity planning with a consumption model • Don’t forget metadata management
  • 38. Use the right storage tier and data format. Data structure → Fixed schema, JSON, key-value. Access patterns → Store data in the format you will access it. Data characteristics → Hot, warm, cold. Cost → Right cost.
  • 40. Federal Geospatial Platform: The Leader in Geospatial for the Government of Canada • Easy access to GC “AAA” Geospatial Data • Standards-based formats: RESTful web services, ISO Metadata, OGC • Simple workflow to assess, visualize and publish • Re-usable viewer on GitHub • Collaborative Mapping Environment on Esri’s ArcGIS Online • FGP Geo-Community Cloud Platform as a Service (PaaS) on AWS • A GC standards compliant Geospatial Platform On-Demand
  • 41. …OBJECTIVES 2018-20 • Make Government of Canada Earth Observation information more easily available to Canadians • Access, Visualization and Analysis functionality for EO and Spatial Information using the Federal Geospatial Platform (GC Tool) • Enhanced imagery visualization options (past/present time-series) • On-the-fly imagery processing (projection, class renderings, dynamic mosaics) • Geoanalytics against near real time GC imagery on-demand
  • 42.
  • 43. SpaceX Falcon 9 Launch Vehicle
  • 45.
  • 46. FGP Geo-Community Cloud • 2017-18: Proof of Concept on AWS (complete) • 2018-19: Foundation Laid, SSC Brokered Cloud; FGP “Core Solution Stack” • 2019-20: On Demand Processing Capabilities via API Gateway; Geospatial Managed Storage (host your own geospatial data); Support multiple “portals” from a common GC ecosystem • Concurrently…: Innovation Zone; Sandbox environments for broad-based Geospatial R&D P/T/A; AI and Machine Learning against Geospatial + EO integrated with FGP Platform as a Service
  • 47. Geo-Community Cloud – AWS Services. ca-canada-1a: Public Subnet (Web Tier), Private Subnet (App Tier), Private Subnet (DB Tier). Amazon Route 53, WAF (Web Application Firewall), Internet Gateway, Classic Load Balancer, Application Load Balancer, EC2 Instances, Auto Scaling, Elastic Block Storage, NAT Gateways, Database, S3 Bucket, Glacier Storage.
  • 48.
  • 49. Thank You! Please rate my session: https://amzn.to/ottawa-sessions Track: Technical. Session: 2:10 PM - Build Data Lakes and Analytics on AWS. How did we do? https://amzn.to/ottawa-summit

Editor's Notes

  1. Accessible, Authoritative. 800 datasets and growing. Think enterprise, not silos. Build once, use many times, for the benefit of all, including common approaches and solutions. Think horizontal, not vertical. Cloud First. Use existing GC standards and tools.
  2. Context: Climate Change, Cumulative Effects.
  3. Current launch window is Feb 18-24, 2019. Three satellites, all on the same SpaceX Falcon 9 launch vehicle, from Vandenberg Air Force Base in California.
  4. The 3 satellites will be released 3 minutes apart and later, once ‘fully woken up’, they will be moved into their final, evenly spaced positions (120 degrees apart). 1.4 TBytes/day, just under 1 PByte/year. RCM data policy is TBD, but even free data for the public requires individual accounts due to the Remote Sensing Space Systems Act (RSSSA). There will most likely be value-added products that could be open (FGP/OpenMaps), but the quantity is unknown. The question is this: how will it be made useful? We decided to tackle the technical challenge of processing EO data our way: to create web services that make it possible for everyone to get data in their favorite GIS applications using standard WMS and WCS services. Also, processing on demand with Esri’s Image Server and GeoAnalytics Server.
  5. Shared responsibility model.