Building Data Lakes & Analytics on AWS
1. Public Sector Summit Bogota
2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Public Sector Summit
Building Data Lakes & Analytics on AWS
Mauro Assis
Researcher
Earth System Science Center
INPE
Angelo Carvalho
Specialist Solutions Architect - Analytics
Public Sector for Latin America, Canada and Caribbean
AWS
3.
Organizations that successfully generate
business value from their data will outperform
their peers. An Aberdeen survey showed that
organizations that implemented a data lake
outperform similar companies by 9% in
organic revenue growth.*
Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
Most Important: Driving Value from Data
4. Traditionally, Analytics Used to Look Like This
OLTP ERP CRM LOB
Data warehouse
Business intelligence
• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc
• Large initial CAPEX + $10K–$50K/TB/year
5. Data Lakes Extend the Traditional Approach
Data warehouse
Business intelligence
OLTP ERP CRM LOB
• Relational and non-relational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Devices Web Sensors Social
Data lake
Big data processing,
real-time, machine learning
6. Data Lakes from AWS
Analytics
• Unmatched durability and availability at EB scale
• Best security, compliance, and audit capabilities
• Object-level controls for fine-grained access
• Fastest performance by retrieving subsets of data
• The greatest variety of ways to bring data in
• 2x as many integrations with partners
• Analyze with the broadest set of analytics & machine learning (ML) services
Machine learning
On-premises data movement
Real-time data movement
Data Lake on AWS
7.
Managed ML Service
Deep Learning AMIs
Video and Image Recognition
Conversational Interfaces
Deep-Learning Video Camera
Natural Language Processing
Language Translation
Speech Recognition
Text-to-Speech
Interactive Analysis
Hadoop & Spark
Data Warehousing
Full-text search
Real-time analytics
Dashboards & Visualizations
Dedicated Network connection
Secure appliances
Ruggedized Shipping Container
Database migration
Connect Devices to AWS
Real-time Data Streams
Real-time Video Streams
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
Analytics | Machine learning | On-premises data movement | Real-time data movement
Data Lakes, Analytics, and IoT Portfolio from AWS
Broadest, deepest set of analytic services
8. Data Lakes, Analytics, and IoT Portfolio from AWS
Broadest, deepest set of analytic services
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
Analytics | Machine learning | On-premises data movement | Real-time data movement
9. AWS Glue
Data Catalog (Discover)
• Apache Hive Metastore compatible
• Integrated with AWS services
• Automatic crawling
Job Authoring (Develop)
• Auto-generates ETL code
• Python and Apache Spark
• Edit, debug, and share
Job Execution (Deploy)
• Serverless execution
• Flexible scheduling
• Monitoring and alerting
10. Data Lake on Amazon S3 with AWS Glue
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
Amazon QuickSight
11. Other Ways of Populating the Catalog
Call the AWS Glue CreateTable API
Create table manually
Run Hive DDL statement
Apache Hive Metastore
AWS Glue ETL
AWS Glue Data Catalog
Import from Apache Hive Metastore
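The CreateTable route listed on this slide can be sketched with boto3. This is a minimal sketch and not the presenters' code: the database name, table name, columns, and S3 location are hypothetical, the CSV layout is an assumption, and the API call sits inside a function so nothing touches AWS unless it is invoked with credentials configured.

```python
def build_table_input(name, columns, s3_location):
    """Build the TableInput structure passed to the Glue CreateTable API."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": (
                "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
            ),
            "SerdeInfo": {
                "SerializationLibrary": (
                    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
                ),
                "Parameters": {"field.delim": ","},
            },
        },
    }


def register_table(database, table_input):
    """Call Glue CreateTable; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above has no dependencies

    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database, TableInput=table_input)


# Hypothetical names, for illustration only.
table_input = build_table_input(
    "flights",
    [("flight_id", "string"), ("height_m", "double")],
    "s3://example-bucket/flights/",
)
```

Calling `register_table("example_db", table_input)` would then make the table visible to services that read the Glue Data Catalog.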
12. How Do I Drive Value?
(Same AWS data lake, analytics, and IoT portfolio diagram as slide 8.)
13. Amazon Athena
Amazon Athena is an interactive query service
that makes it easy to analyze data in Amazon
S3 using standard SQL.
14. Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
(e.g., SELECT * FROM tableName)
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
(e.g., CREATE TABLE, ALTER TABLE, MSCK REPAIR TABLE)
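The DDL and query sides described on this slide can be sketched as follows. This is an illustrative sketch under stated assumptions: the `weblogs` table, its columns, the `dt` partition key, and the bucket are all hypothetical, and the boto3 call is wrapped in a function so it only runs when invoked with AWS credentials.

```python
# Hive-style DDL: an external, partitioned table over files in Amazon S3.
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
  request_time string,
  url string,
  status int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/weblogs/'
"""

# Loads partitions that already exist under the table location.
REPAIR = "MSCK REPAIR TABLE weblogs"


def count_by_status(day):
    """Build a query that filters on the partition column."""
    # A real application should validate `day` before interpolating it.
    return (
        "SELECT status, COUNT(*) AS hits FROM weblogs "
        f"WHERE dt = '{day}' GROUP BY status"
    )


def run(sql, database, output_location):
    """Submit one statement to Athena; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the strings above have no dependencies

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Filtering on the partition column lets the engine skip the other partitions, which is the practical benefit of the partitioning support mentioned above.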
15. Exploring data with Amazon Athena
On-premises Data
Web app data
Amazon RDS
Other Databases
Streaming data
Amazon QuickSight
Amazon SageMaker
16. Hadoop/Spark Analytics
• Distributed processing
• Diverse analytics
• Batch/Script (Hive/Pig)
• Interactive (Spark, Presto)
• Real-time (Spark)
• Machine Learning (Spark)
• NoSQL (HBase)
• For many use cases
• Log and clickstream analysis
• Machine learning
• Real-time analytics
• Large-scale analytics
• Genomics
• ETL
YARN (Hadoop Resource Manager)
Batch | Script | Interactive | Real-time | Machine learning | NoSQL
Data Lake
on AWS
17. Hadoop/Spark Analytics on AWS
YARN (Hadoop Resource Manager)
Batch | Script | Interactive | Real-time | Machine learning | NoSQL
Data Lake on AWS
Amazon EMR: Managed Hadoop/Spark
Amazon S3: Object Storage
18. EMR – Enterprise-grade Hadoop & Spark
Deploy latest releases in Hadoop and Spark ecosystems
• Nineteen open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
• Updated with the latest open-source frameworks within 30 days of release
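EMR releases
Components: Hadoop, Ganglia, HBase, Hive & Catalog, Hue, Mahout, Oozie, Phoenix, Pig, Presto, Spark, Tez, Zeppelin, ZooKeeper, Flink, Livy, MXNet, Sqoop
• emr-4.0.0 (July 2015): Hadoop 2.6.0, Hive 1.0.0, Mahout 0.10.0, Pig 0.14.0, Spark 1.4.1
• emr-4.7.0 (June 2016): Hadoop 2.7.2, Ganglia 3.7.2, HBase 1.2.1, Hive 1.0.0, Hue 3.7.1, Mahout 0.12.0, Oozie 4.2.0, Phoenix 4.7.0, Pig 0.14.0, Presto 0.147, Spark 1.6.1, Sqoop 1.4.6, Tez 0.8.3, Zeppelin 0.5.6, ZooKeeper 3.4.8
• emr-5.3.0 (January 2017): Hadoop 2.7.3, Ganglia 3.7.2, HBase 1.2.3 + S3, Hive 2.1.1, Hue 3.11.0, Mahout 0.12.2, Oozie 4.3.0, Phoenix 4.7.0, Pig 0.16.0, Presto 0.157.1, Spark 2.1.0, Sqoop 1.4.6, Tez 0.8.4, Zeppelin 0.6.2, ZooKeeper 3.4.9, Flink 1.1.4
• emr-5.11.0 (December 2017): Hadoop 2.7.3, Ganglia 3.7.2, HBase 1.3.1 + S3, Hive 2.3.2, Hue 4.0.1, Mahout 0.13.0, Oozie 4.3.0, Phoenix 4.11.0, Pig 0.17.0, Presto 0.187, Spark 2.2.1, Sqoop 1.4.6, Tez 0.8.4, Zeppelin 0.7.3, ZooKeeper 3.4.10, Flink 1.3.2, Livy 0.4.0, MXNet 0.12.0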
19. Amazon S3 – Source of Truth, Multiple Clusters
Amazon S3: Source of Truth
Amazon EMR: Interactive Spark Cluster (EC2 instance memory, local HDFS)
Amazon EMR: Transient ETL Job (EC2 instance memory, local HDFS)
Intermediates stored on local disk or HDFS
20. External Metadata Management
Amazon S3
Interactive Spark Cluster
Amazon EMR
Amazon EMR
HDFS
Transient ETL Job
Source of Truth
HDFS
Describes Data in S3
Customers have options: MySQL DB instance or Glue Data Catalog
21. Reprocess data with Amazon EMR (Spark)
On-premises data
Web app data
Amazon RDS
Other Databases
Streaming data
Amazon QuickSight
Amazon SageMaker
22. Machine Learning on Your Data Lake
(Same AWS data lake, analytics, and IoT portfolio diagram as slide 8.)
23. ML in the Hands of Every Developer
Application Services
• Vision: Amazon Rekognition Image, Amazon Rekognition Video
• Speech: Amazon Polly, Amazon Transcribe
• Language: Amazon Lex, Amazon Translate, Amazon Comprehend
Platform Services
• Amazon SageMaker
• AWS DeepLens
• Amazon Machine Learning
• Mechanical Turk
• Spark & EMR
Frameworks & Infrastructure
• AWS Deep Learning AMI
• Frameworks: Apache MXNet, PyTorch, Cognitive Toolkit, Keras, Caffe2 & Caffe, TensorFlow, Gluon
• Infrastructure: GPU (P3 Instances), CPU, Mobile, IoT (Greengrass)
24. Amazon SageMaker
1 Notebook Instances | 2 Algorithms | 3 ML Training Service | 4 ML Hosting Service
25. Machine Learning with Amazon SageMaker
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Amazon QuickSight
Amazon SageMaker
26.
27. Mapping Amazon Biomass using Amazon Analytics
www.ccst.inpe.br
Mauro Assis
assismauro@hotmail.com
28. Earth System Science Center: Strategic Goals
Development and improvement of Earth system models, monitoring networks, and socio-political analyses, aimed at constructing and analyzing scenarios of environmental change and climate projections.
29.
The question:
How much does the Amazon forest weigh?
30.
The previous question:
Why map Amazon forest biomass?
31. Biomass map process
• 68 million pixels (250 × 250 m)
• 4 million km² area
• Data from ~1,000 LiDAR flights
• Each flight: 6.5 billion data records
• 10 bands of satellite data for each pixel
• 4 to 6 hours per map generation
• 16 CPUs / 32 GB RAM / 21 TB disk
• Random Forest algorithm
• Python H2O
32. LiDAR
33. LiDAR
34.
35.
36.
Uncertainty map
• Propagate error from field to random forest extrapolation
• 1000 biomass values normally distributed for each pixel
• A thousand maps to generate…
• … How???
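The idea of drawing 1000 normally distributed biomass values per pixel can be sketched in a few lines. This is a minimal illustration, not the INPE script: the three pixels, their predicted values, and their error estimates are hypothetical, and only the Python standard library is used.

```python
import random
import statistics


def simulate_pixel(mean, sd, n_draws=1000, rng=random):
    """Draw n_draws biomass values for one pixel from Normal(mean, sd)."""
    return [rng.gauss(mean, sd) for _ in range(n_draws)]


def summarize(draws):
    """Collapse one pixel's draws into a mean and a standard deviation."""
    return statistics.mean(draws), statistics.stdev(draws)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical per-pixel inputs: (predicted biomass, prediction error).
pixels = [(250.0, 30.0), (180.0, 45.0), (320.0, 20.0)]

# One (mean, standard deviation) pair per pixel: the uncertainty summary.
uncertainty = [summarize(simulate_pixel(m, s, rng=rng)) for m, s in pixels]
```

Taking the i-th draw of every pixel yields the i-th of the thousand maps; with 68 million pixels that is tens of billions of values, which is why the question of how to generate them arises.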
37. The answer: Analytics processing on AWS
• AWS engaged a partner (DataRain)
• Two PoCs
• Four Linux EC2 instances with 64 cores and 256 GB of RAM each
• Anaconda/H2O Python environment
• Script with lots of parallel processing
• Divided the Amazon area into 16 segments
• Two operators ran everything in 40 hours
• We downloaded the 1,000 maps and summarized them at INPE
• It took about 2 days to generate the final map
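Splitting the work into 16 segments and processing them in parallel can be sketched as below. This is a sketch under assumptions, not the script used in the project: segments are modeled as contiguous pixel-index ranges rather than geographic tiles, `process_segment` is a hypothetical placeholder, and threads stand in for the separate processes or instances a CPU-bound job would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(n_pixels, n_segments=16):
    """Return (start, stop) index ranges that cover n_pixels without gaps."""
    base, extra = divmod(n_pixels, n_segments)
    ranges, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments take one additional pixel each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def process_segment(bounds):
    """Placeholder for per-segment map generation; here it only counts pixels."""
    start, stop = bounds
    return stop - start


segments = split_into_segments(68_000_000)

# Four workers, mirroring the four EC2 instances mentioned on the slide.
with ThreadPoolExecutor(max_workers=4) as pool:
    processed = list(pool.map(process_segment, segments))
```

Because the segments do not overlap, each worker can write its output independently and the results can be merged afterwards.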
38. Architecture
INPE: Researcher desktops
Internet
AWS Cloud: Amazon S3; four Amazon EC2 instances, each with Amazon EBS
39.
40. Main benefits
• Uncertainty map itself
• DataRain (AWS partner) support
• First time we used cloud services at INPE
• Map obtained before the end of the project
• ROI
41. Return on investment
• LiDAR flights cost $2,000 each
• 1,000 flights => $2M
• To update the model: 100 to 150 flights
• 150 flights => $300k
• Cost of map generation: $10,000
• Money saved in the next map update: $2M – $300k – $10k = $1.69M
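The savings figure can be checked with the numbers given on the slide. The variable names are illustrative, and the upper end of the 100 to 150 flight range is used, as in the slide's own calculation.

```python
COST_PER_FLIGHT = 2_000        # USD per LiDAR flight, from the slide
ORIGINAL_FLIGHTS = 1_000       # flights behind the original map
UPDATE_FLIGHTS = 150           # upper end of the 100 to 150 range
MAP_GENERATION_COST = 10_000   # USD, cost of map generation

original_campaign = ORIGINAL_FLIGHTS * COST_PER_FLIGHT   # 2,000,000
update_campaign = UPDATE_FLIGHTS * COST_PER_FLIGHT       # 300,000
savings = original_campaign - update_campaign - MAP_GENERATION_COST

print(f"${savings:,}")  # prints $1,690,000
```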
42.
43. Agility and Innovation Are Key
(Same AWS data lake, analytics, and IoT portfolio diagram as slide 8.)
44. Thank you!
Mauro Assis
assismauro@hotmail.com
Angelo Carvalho
carvaa@amazon.com
45.