Using Data Lakes

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
Mamoon Chowdry
chowdry@amazon.com
Solutions Architect
Ben Willett
benwille@amazon.com
Solutions Architect

Exponential growth in data
Reasons for building a data lake
Transactions
ERP
Sensor Data
Billing
Web logs
Social
Infrastructure logs

Diversified consumers
Data Scientists
Business Analysts
External Consumers
Applications

Multiple access mechanisms
API Access
BI Tools
Notebooks

Characteristics of a data lake
Future Proof
Flexible Access
Dive in Anywhere
Collect Anything

Server rack 1 (20 nodes)
Server rack 2 (20 nodes)
Server rack N (20 nodes)
Core
On-premises Hadoop clusters
• A cluster of 1U machines
• Typically 12 cores, 32/64 GB RAM, and 6-8 TB of HDD ($3-4K)
• Networking switches and racks
• Open-source distribution of Hadoop or a fixed licensing term by commercial distributions
• Different node roles
• HDFS uses local disk and is sized for 3x data replication
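
The 3x replication point can be turned into a rough sizing estimate. The sketch below is illustrative only: the 25% headroom reserved for temporary and shuffle data is an assumption, not a figure from the slides, and the function name is hypothetical.

```python
import math

def nodes_for_dataset(dataset_tb, disk_per_node_tb, replication=3, headroom=0.25):
    """Estimate how many data nodes a dataset needs on HDFS.

    headroom is the assumed fraction of each node's disk kept free
    for temporary and shuffle data (an illustrative value).
    """
    raw_tb = dataset_tb * replication                      # each block is stored `replication` times
    usable_per_node_tb = disk_per_node_tb * (1 - headroom)
    return math.ceil(raw_tb / usable_per_node_tb)

# 100 TB of data on nodes with 8 TB of disk each
print(nodes_for_dataset(100, 8))
```

Under these assumptions a 100 TB dataset already needs more than two full racks of 20 nodes, before any compute requirement is considered.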

Workload types running on the same cluster
• Large Scale ETL: Apache Spark, Apache Hive with Apache Tez, or Apache Hadoop MapReduce
• Interactive Queries: Apache Impala, Spark SQL, Presto, Apache Phoenix
• Machine Learning and Data Science: Spark ML, Apache Mahout
• NoSQL: Apache HBase
• Stream Processing: Apache Kafka, Spark Streaming, Apache Flink, Apache NiFi, Apache Storm
• Search: Elasticsearch, Apache Solr
• Job Submission: Client Edge Node, Apache Oozie
• Data warehouses like Pivotal Greenplum or Teradata

Security
• Authentication: Kerberos with local KDC or Active Directory, LDAP integration, local user management, Apache Knox
• Authorization: Open-source native authZ (e.g., HiveServer2 authZ or HDFS ACLs), Apache Ranger, Apache Sentry
• Encryption: local disk encryption with LUKS, HDFS transparent data encryption, in-flight encryption for each framework (e.g., Hadoop MapReduce encrypted shuffle)
• Configuration: Different tools for management based on vendor

Swim lane of jobs
Over-utilized Under-utilized

Role of a Hadoop administrator
• Management of the cluster (failures, hardware replacement, restarting services, expanding cluster)
• Configuration management
• Tuning of specific jobs or hardware
• Managing development and test environments
• Backing up data and disaster recovery

On-prem: Over-utilization and idle capacity
• Tightly coupled compute and storage requires buying excess capacity
• Can be over-utilized during peak hours and under-utilized at other times
• Results in high costs and low efficiency

On-prem: System management difficulties
• Managing distributed applications and availability
• Durable storage and disaster recovery
• Adding new frameworks and doing upgrades
• Multiple environments
• Need team to manage cluster and procure hardware

Why Amazon EMR?
Low Cost
Pay an hourly rate
Open-Source Variety
Latest versions of software
Managed
Spend less time monitoring
Secure
Easy-to-manage options
Flexible
Customize the cluster
Easy to Use
Launch a cluster in minutes

Translate use cases to the right tools
- Low-latency SQL -> Athena or Presto or Amazon Redshift
- Data warehouse/Reporting -> Spark or Hive or Glue or Amazon Redshift
- Management and monitoring -> EMR console or Ganglia metrics
- HDFS -> Amazon S3
- Notebooks -> Zeppelin Notebook or Jupyter (via bootstrap action)
- Query console -> Athena or Hue
- Security -> Ranger (CF template) or HiveServer2 or IAM roles
Storage: S3 (EMRFS), HDFS
Cluster Resource Management: YARN
Batch: MapReduce
Interactive: Tez
In Memory: Spark
Streaming: Flink
Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, Presto
Athena
Glue
Amazon Redshift

Many storage layers to choose from
Amazon DynamoDB
Amazon RDS
Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR
Amazon Elasticsearch Service

Decouple compute and storage by using Amazon S3 as your data layer
S3 is designed for 11 9's of durability and is massively scalable
Intermediates stored on local disk or HDFS
Diagram labels: Amazon EMR, EC2 Instance Memory, Local, HDFS, Amazon S3

HBase on Amazon S3 for scalable NoSQL

Options to submit jobs
Amazon EMR Step API: Submit a Spark application
AWS Data Pipeline; Airflow, Luigi, or other schedulers on EC2: Create a pipeline to schedule job submission or create complex workflows
AWS Lambda: Use AWS Lambda to submit applications to EMR Step API or directly to Spark on your cluster
Oozie: Use Oozie on your cluster to build DAGs of jobs
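
As a minimal sketch of the Step API option, the function below builds the step definition that the EMR AddJobFlowSteps API expects for a Spark application. The step name, S3 path, arguments, and cluster ID are hypothetical placeholders, and the actual submission is left as a comment because it needs boto3 and AWS credentials.

```python
def spark_step(name, app_s3_path, app_args=()):
    """Build one entry for the Steps list of the EMR AddJobFlowSteps API."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",          # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",        # EMR helper that runs a command on the cluster
            "Args": ["spark-submit", "--deploy-mode", "cluster", app_s3_path, *app_args],
        },
    }

# Hypothetical job name, bucket, and arguments, for illustration only
step = spark_step("nightly-etl", "s3://my-bucket/jobs/etl.py", ["--date", "2018-03-01"])

# With boto3 installed and credentials configured, the step could be submitted with:
#   import boto3
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

The same dictionary could be built and submitted from inside an AWS Lambda handler, which is one way to realize the Lambda option listed above.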

Performance and hardware
Considerations
• Transient or long running
• Instance types
• Cluster size
• Application settings
• File formats and Amazon S3 tuning
Master Node: r4.2xlarge
Slave Group - Core: c5.2xlarge
Slave Group - Task: m5.2xlarge (EC2 Spot)

On-cluster UIs to quickly tune workloads
Manage applications
SQL editor, Workflow designer,
Metastore browser
Notebooks
Design and execute
queries and workloads

Use Spot and Reserved Instances to lower costs
On-Demand for core nodes: Standard Amazon EC2 pricing for On-Demand capacity. Meet SLA at predictable cost.
Spot for task nodes: Up to 80% off Amazon EC2 On-Demand pricing. Exceed SLA at lower cost.
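
The effect of putting task nodes on Spot can be estimated with simple arithmetic. This is a rough sketch: the $0.40 hourly On-Demand price is a made-up figure, both node groups are assumed to use the same instance type, the 80% discount is the upper bound quoted above rather than a guaranteed rate, and the per-instance EMR service charge is ignored.

```python
def hourly_cost(core_nodes, task_nodes, on_demand_price, spot_discount):
    """Hourly EC2 cost with On-Demand core nodes and Spot task nodes."""
    spot_price = on_demand_price * (1 - spot_discount)
    return core_nodes * on_demand_price + task_nodes * spot_price

ON_DEMAND_PRICE = 0.40   # hypothetical USD per instance-hour

all_on_demand = hourly_cost(10, 20, ON_DEMAND_PRICE, spot_discount=0.0)
with_spot = hourly_cost(10, 20, ON_DEMAND_PRICE, spot_discount=0.8)

print(f"All On-Demand: ${all_on_demand:.2f}/hour")
print(f"Spot task nodes: ${with_spot:.2f}/hour")
```

In practice the Spot discount varies by instance type, Availability Zone, and time, so the real saving may be smaller than this best case.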

Instance fleets for advanced Spot provisioning
Master Node, Core Instance Fleet, Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the optimal Availability Zone based on capacity/price
• Spot Block support
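
A task instance fleet might be described as follows. This is a sketch of the `InstanceFleets` entry accepted by the EMR RunJobFlow API. The fleet name, instance types, capacity target, and timeout are illustrative assumptions, not recommendations from the slides.

```python
def task_fleet(instance_types, spot_units, timeout_minutes=20):
    """Build a TASK instance fleet that draws Spot capacity from several instance types."""
    return {
        "Name": "task-fleet",                       # hypothetical fleet name
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": spot_units,
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": 1,              # each instance counts as one capacity unit
                "BidPriceAsPercentageOfOnDemandPrice": 100,
            }
            for instance_type in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": timeout_minutes,
                # Fall back to On-Demand if Spot capacity is not fulfilled in time
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }

fleet = task_fleet(["m5.2xlarge", "c5.2xlarge", "r4.2xlarge"], spot_units=20)
```

Listing several instance types gives EMR more Spot pools to choose from, which is the point of the first bullet above.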

Security – Authentication and authorization
Tag: user = MyUser
IAM user: MyUser
EMR role
EC2 role
SSH key

Security – Authentication and authorization
• Plug-ins for Hive, HBase,
YARN, and HDFS
• Row-level authorization for Hive
(with data-masking)
• Full auditing capabilities with
embedded search
• Run Ranger on an edge node –
visit the AWS Big Data Blog
Apache Ranger

Security – Governance and auditing
• AWS CloudTrail for EMR APIs
• Custom AMIs
• S3 access logs for cluster S3 access
• YARN and application logs
• Ranger UI for application-level auditing

FINRA: Migrating from on-prem to AWS
Petabytes of data generated on-premises, brought to AWS, and stored in Amazon S3.
Thousands of analytical queries performed on EMR and Amazon Redshift.
Stringent security requirements met by leveraging VPC, VPN, encryption at-rest and in-transit, CloudTrail, and database auditing.
Flexible Interactive Queries
Predefined Queries
Surveillance Analytics
Data Management
Data Movement
Data Registration
Version Management
Amazon S3
Web Applications
Analysts; Regulators

Lower cost and higher scale than on-premises

FINRA saved 60% by moving to HBase on EMR

Learn
Models
Models
Impressions
Clicks
Activities
Calibrate
Evaluate
Real Time Bidding
Amazon S3
ETL Attribution
Machine Learning
Amazon S3
Amazon Kinesis
• 2 petabytes processed daily
• 2 million bid decisions per second
• Runs 24x7 on 5 continents
• Thousands of ML models trained per day

Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL

Why use Athena?
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Encrypted
• Standards-compliant and open storage formats
• Built on powerful community-supported OSS solutions

Simple Pricing
• DDL operations – FREE
• SQL operations – FREE
• Query concurrency – FREE
• Data scanned - $5 / TB
• Standard S3 rates for storage, requests, and data transfer apply
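
Because only scanned data is billed, the cost of a query follows directly from the number of bytes it reads. The sketch below uses the $5 / TB rate from the list above and treats a terabyte as 1024^4 bytes, which is an assumption. It ignores any rounding or per-query minimums and the separate S3 charges.

```python
PRICE_PER_TB_USD = 5.00      # "Data scanned - $5 / TB"
BYTES_PER_TB = 1024 ** 4     # assumption: binary terabyte

def query_cost_usd(bytes_scanned):
    """Approximate Athena charge for a query that scans bytes_scanned bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD

full_scan = query_cost_usd(1024 ** 4)           # query reads a whole 1 TB table
pruned_scan = query_cost_usd(100 * 1024 ** 3)   # query reads only 100 GB of it

print(f"Full scan: ${full_scan:.2f}")
print(f"Pruned scan: ${pruned_scan:.2f}")
```

Anything that reduces the bytes read, such as partitioning or columnar formats like Parquet and ORC, lowers the bill in the same proportion.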

Customers Drive Product Decisions

Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning

Hive Metadata Definition
• Hive Data Definition Language
• Data Manipulation Language (INSERT, UPDATE)
• Create Table As
• User Defined Functions
• Hive compatible SerDe (serializer/deserializer)
• CSV, JSON, RegEx, Parquet, Avro, ORC, CloudTrail

Presto SQL
• ANSI SQL compliant
• Complex joins, nested queries & window functions
• Complex data types (arrays, structs, maps)
• Partitioning of data by any key (date, time, custom keys)
• Presto built-in functions
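
Partitioning by date or custom keys is commonly expressed in S3 through Hive-style `key=value` prefixes, which Hive-compatible engines can map to partition columns. The helper below is a small illustration. The bucket name, the year/month/day layout, and the `region` key are hypothetical choices.

```python
from datetime import date

def partition_prefix(base, day, **custom_keys):
    """Build a Hive-style partition prefix such as .../year=2018/month=03/day=01/."""
    parts = [f"year={day.year}", f"month={day.month:02d}", f"day={day.day:02d}"]
    parts += [f"{key}={value}" for key, value in custom_keys.items()]
    return base.rstrip("/") + "/" + "/".join(parts) + "/"

prefix = partition_prefix("s3://my-bucket/logs", date(2018, 3, 1), region="eu-west-1")
print(prefix)
# s3://my-bucket/logs/year=2018/month=03/day=01/region=eu-west-1/
```

A query that filters on these keys only needs to read objects under the matching prefixes, which reduces the amount of data scanned.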

Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale
Elastic & highly available
On-demand, pay-per-query
High concurrency: Multiple clusters access same data
No ETL: Query data in-place using open file formats
Full Amazon Redshift SQL support
S3
SQL

Redshift Architecture with Spectrum
Query: SELECT COUNT(*) FROM s3.ext_table
JDBC/ODBC
Amazon Redshift
1 2 3 4 ... N
Amazon S3: Exabyte-scale object storage
Data Catalog: Apache Hive Metastore
