©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Deep Dive: Amazon EMR
Rahul Pathak—Sr. Mgr. Amazon EMR (@rahulpathak)
Jason Timmes—AVP Software Development, Nasdaq
Why Amazon EMR?
Easy to Use
Launch a cluster in minutes
Low Cost
Pay an hourly rate
Elastic
Easily add or remove capacity
Reliable
Spend less time monitoring
Secure
Managed firewalls
Flexible
You control the cluster
Easy to deploy
AWS Management Console, AWS CLI,
or the EMR API with your favorite SDK
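Launching via the API can be scripted with the Python SDK (boto3). A minimal sketch of a RunJobFlow request follows; the cluster name, instance types, counts, and log bucket are illustrative assumptions, not recommendations.

```python
# Parameters for an EMR RunJobFlow request (names and sizes are
# hypothetical). With boto3 installed and AWS credentials configured,
# boto3.client("emr").run_job_flow(**params) would submit this.
params = {
    "Name": "deep-dive-demo",              # hypothetical cluster name
    "AmiVersion": "3.8.0",                 # pre-EMR-4.0 release
    "LogUri": "s3://my-bucket/emr-logs/",  # hypothetical log bucket
    "Instances": {
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
    },
}
```

The same request shape applies from any SDK; only the client call differs.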
Easy to monitor and debug
• Integrated with Amazon CloudWatch
• Monitor cluster, node, and I/O
Try different configurations to find your optimal architecture
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge
• Memory: m2 family, r3 family
• Disk/IO: d2 family, i2 family
• General: m1 family, m3 family
Choose your instance types for the workload:
batch processing, machine learning, Spark and interactive analysis, or large HDFS
Resizable clusters
Easily add and remove compute capacity on your cluster;
match compute demands with cluster sizing.
Easy to use Spot Instances
• Spot for task nodes: up to 90% off EC2 on-demand pricing
• On-demand for core nodes: standard Amazon EC2 on-demand pricing
Meet your SLA at predictable cost, or exceed your SLA at lower cost
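The cost impact of mixing on-demand core nodes with spot task nodes is simple arithmetic. A sketch (the hourly rate and discount are illustrative assumptions, not published prices):

```python
# Blended hourly cost: core nodes billed on-demand, task nodes on spot.
# Prices and the 90% discount are illustrative assumptions.
def cluster_hourly_cost(core_nodes, task_nodes, on_demand_price,
                        spot_discount=0.90):
    spot_price = on_demand_price * (1 - spot_discount)
    return core_nodes * on_demand_price + task_nodes * spot_price

# 10 core + 40 task nodes at a hypothetical $0.50/hr on-demand rate:
mixed = cluster_hourly_cost(10, 40, 0.50)        # 10*0.50 + 40*0.05 = 7.0
all_on_demand = cluster_hourly_cost(50, 0, 0.50) # 25.0
```

At these assumed rates, the 50-node cluster costs 7.0 instead of 25.0 per hour, while core nodes (and HDFS) stay on stable on-demand capacity.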
Amazon EMR integration with Amazon Kinesis
• Read data directly into Hive, Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams
• No intermediate data persistence required
• Simple way to introduce real-time sources into batch-oriented systems
• Multi-application support and automatic checkpointing
The Hadoop ecosystem can run in Amazon EMR
Hue on Amazon EMR: query editor and job browser over Amazon S3 and HDFS
Leverage Amazon S3 with EMRFS
Amazon S3 as your persistent data store
• Separate compute and storage
• Resize and shut down Amazon
EMR clusters with no data loss
• Point multiple Amazon EMR
clusters at the same data in
Amazon S3
EMRFS makes it easier to use Amazon S3
• Read-after-write consistency
• Very fast list operations
• Error handling options
• Support for Amazon S3 encryption
• Transparent to applications: s3://
EMRFS client-side encryption
• EMRFS enabled for Amazon S3 client-side encryption
• Amazon S3 encryption clients
• Key vendor: AWS KMS or your custom key vendor
• Amazon S3 stores only client-side-encrypted objects
HDFS is still there if you need it
• Iterative workloads – if you’re processing the same dataset more than once (consider Spark and RDDs for this too)
• Disk I/O-intensive workloads
• Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing
Amazon EMR—Design Patterns
EMR example #1: Batch Processing
GBs of logs pushed to Amazon S3 hourly
Daily EMR cluster uses Hive to process the data
Input and output stored in Amazon S3
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
EMR example #2: Long-running cluster
• TBs of logs sent daily; logs stored in Amazon S3
• Data pushed to S3; daily EMR cluster ETLs the data into a database
• 24/7 EMR cluster running HBase holds the last 2 years of data
• Front-end service uses the HBase cluster to power a dashboard with high concurrency
• Hive Metastore on Amazon EMR
EMR example #3: Interactive query
Interactive query using Presto on multi-petabyte warehouse
http://nflx.it/1dO7Pnt
EMR example #4: Streaming data processing
TBs of logs sent daily to Amazon Kinesis; processed with the
Amazon Kinesis Client Library, AWS Lambda, and Amazon EMR on Amazon EC2
Optimizations for storage
File formats
• Row-oriented
– Text files
– Sequence files (writable objects)
• Avro data files – described by a schema
• Columnar
– Optimized Row Columnar (ORC)
– Parquet
(Figure: the same logical table laid out row-oriented vs. column-oriented)
Choosing the right file format
• Processing and query tools – Hive, Impala, and Presto
• Schema evolution – Avro handles schemas that evolve over time
• File format “splittability” – avoid splitting JSON/XML files; treat each file as a single record
• Compression – block or file
File sizes
• Avoid small files
– Anything smaller than 100 MB
• Each mapper is a single JVM
– CPU time is required to spawn JVMs/mappers
• Fewer files, sized close to the block size
– Fewer calls to S3
– Fewer network/HDFS requests
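The one-JVM-per-mapper point can be made concrete with a rough task-count estimate: every file contributes at least one map task, and larger files split at the block size. A sketch (the estimate is simplified; real split calculation also considers min/max split sizes):

```python
import math

# Rough mapper-count estimate: each file yields at least one map task;
# files larger than the block size are split. Simplified illustration.
def estimate_mappers(file_sizes_mb, block_size_mb=128):
    return sum(max(1, math.ceil(size / block_size_mb))
               for size in file_sizes_mb)

many_small = estimate_mappers([1] * 1000)  # 1 GB as 1000 x 1 MB files
few_large = estimate_mappers([512, 512])   # the same 1 GB as 2 x 512 MB files
```

The same gigabyte of input costs 1000 JVM launches in the first case and only 8 in the second.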
Dealing with small files
• Reduce the HDFS block size, e.g. to 1 MB (default is 128 MB):
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
• Better: use S3DistCP to combine smaller files together
  – S3DistCP takes a pattern and target path to combine smaller input files into larger ones
  – Supply a target size and compression codec
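The combine pattern itself is just concatenation at scale. A local stand-in sketch using temp files (S3DistCp adds patterns, target sizes, compression, and S3 transfer on top of this idea):

```python
import os
import tempfile

# Local sketch of the S3DistCp "combine" pattern: merge many small
# files into one larger file. No S3 involved; purely illustrative.
def combine(paths, out_path):
    with open(out_path, "wb") as out:
        for p in sorted(paths):
            with open(p, "rb") as f:
                out.write(f.read())

tmp = tempfile.mkdtemp()
small = []
for i in range(10):
    p = os.path.join(tmp, "part-%05d" % i)
    with open(p, "wb") as f:
        f.write(b"x" * 100)   # ten 100-byte "small files"
    small.append(p)

combined = os.path.join(tmp, "combined")
combine(small, combined)      # one 1000-byte file instead of ten
```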
Compression
• Always compress data files on Amazon S3
– Reduces network traffic between Amazon S3 and
Amazon EMR
– Speeds up your job
• Compress mapper and reducer output
Amazon EMR compresses inter-node traffic with LZO on Hadoop 1 and Snappy on Hadoop 2
Choosing the right compression
• Time-sensitive jobs: faster codecs are a better choice
• Large amounts of data: use space-efficient codecs
• Combined workloads: use gzip
Algorithm        Splittable?   Compression ratio   Compress + decompress speed
Gzip (DEFLATE)   No            High                Medium
bzip2            Yes           Very high           Slow
LZO              Yes           Low                 Fast
Snappy           No            Low                 Very fast
Cost saving tips for Amazon EMR
• Use S3 as your persistent data store; query it using Presto, Hive,
Spark, etc.
• Only pay for compute when you need it
• Use Amazon EC2 Spot Instances to save > 80%
• Use Amazon EC2 Reserved Instances for steady workloads
• Use Amazon CloudWatch alerts to notify you if a cluster is
underutilized, then shut it down. E.g. 0 mappers running for
> N hours
OPTIMIZING DATA
WAREHOUSING COSTS
WITH S3 AND EMR
July 9, 2015
Jason Timmes, AVP of Software Development
Nate Sammons, Principal Architect
We make the world’s capital markets
move faster, more efficient, and more transparent
Public company
in S&P 500
Develop and run
markets globally in
all asset classes
We provide technology, trading,
intelligence and listing services
Intense Operational Focus
on Efficiency
and Competitiveness
We provide the infrastructure, tools and strategic
insight to help our customers navigate the
complexity of global capital markets
and realize their capital ambitions.
Get to know us
We have uniquely transformed our business from predominantly a U.S. equities exchange to a global
provider of corporate, trading, technology, and information solutions.
• Leading index provider with 41,000+ indexes across asset classes and geographies
• Over 10,000 corporate clients in 60 countries
• Our technology powers over 70 marketplaces, regulators, CSDs, and clearinghouses in over 50 countries
• 100+ data product offerings supporting 2.5+ million investment professionals and users in 98 countries
• 26 markets, 3 clearing houses, 5 central securities depositories
• Lists more than 3,500 companies in 35 countries, representing more than $8.8 trillion in total market value
IGNITE YOUR AMBITION
CURRENT STATE
• Replaced on-premises data warehouse with Amazon Redshift in 2014
• On-premises warehouse held 1 year of data
• Amazon Redshift migration yielded 57% cost savings over legacy
solution
• Business now wants more than 1 year of data in warehouse
• Currently have Jan. 2014 to present in Amazon Redshift (18 dw1.8xl
nodes)
THE PROBLEM
• Historical years of data are queried much less frequently
• Amazon Redshift is priced very competitively, but for rarely needed data it’s still expensive
• Need a solution that stores historical data once (cheaply), and can
use elastic compute directly on that data only as needed
REQUIREMENTS
• Decouple storage and compute resources
• All data is encrypted in flight and at rest
• SQL interface to all archived data
• Parallel ingest to Amazon Redshift & new (historical years)
warehouse
• Transparent usage-based billing to internal departments (clients)
• Isolate workloads between clients (no resource contention)
• Ideally deliver a platform that can enable different kinds of
compute paradigms over the same data (SQL, stream processing,
etc.)
ARCHITECTURE DIAGRAM
SEPARATE STORAGE & COMPUTE RESOURCES
• Single source of truth; highly durable encrypted files in S3
• Avoids read hotspots; very high concurrent read capability
• Multiple compute clusters isolate workloads for users
• Each client runs their own EMR clusters in their own Amazon VPCs
• No contention on compute resources, or even IP address range
• Scale compute up/down as needed (and manage costs)
• Cost allocation using multiple AWS accounts (1 per internal budget)
• S3 and the Hive Metastore are the only shared infrastructure costs; extremely cheap
• Consolidated billing makes cost allocations easy and fully transparent
• Run multiple query layers & experiment with new projects
• Spark, Drill, etc…
DATA SECURITY & ENCRYPTION
Current state:
• Nasdaq KMS for encryption keys
• Key hierarchy uses MySQL, rooted in an HSM cluster
• Using the S3 “Encryption Materials Provider” interface
Future State:
• Working with InfoSec to evaluate AWS KMS
• Nasdaq KMS keys rooted in AWS KMS
• Move MySQL DB to Amazon Aurora (encrypted with AWS KMS)
• AWS KMS support in AWS services is growing
ENCRYPTED DATA ACCESS MONITORING/CONTROL
S3 FILE FORMAT: PARQUET
• Parquet file format: http://parquet.apache.org
• Self-describing columnar file format
• Supports nested structures (Dremel “record shredding” algorithm)
• Emerging standard data format for Hadoop
• Supported by: Presto, Spark, Drill, Hive, Impala, etc.
PARQUET VS ORC
• Evaluated Parquet and ORC (competing open columnar formats)
• ORC encrypted performance is currently a problem
• 15x slower vs. unencrypted (94% slower)
• 8 CPUs on 2 nodes: ~900 MB/sec vs. ~60 MB/sec encrypted
• Encrypted Parquet is ~27% slower vs. unencrypted
• Parquet: ~100MB/sec from S3 per CPU core (encrypted)
• DATE field support in Parquet is not ready yet
• DATE + Parquet not supported in Presto at this time
• Currently writing DATEs as INTs
– 2015-06-22 => 20150622
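The DATE-as-INT workaround described above is a simple encoding. A sketch of both directions (function names are our own):

```python
from datetime import date

# Encode a DATE as the INT 20150622 for 2015-06-22, as described above,
# and decode it back. Used until native DATE support lands in Parquet/Presto.
def date_to_int(d: date) -> int:
    return d.year * 10000 + d.month * 100 + d.day

def int_to_date(n: int) -> date:
    return date(n // 10000, (n // 100) % 100, n % 100)
```

One advantage of this encoding is that integer ordering matches date ordering, so range predicates on the INT column behave like date ranges.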
PARQUET DATA CONVERSION & MANAGEMENT
• Generic schema management system
• Supports multiple versions of a named table schema
• Data conversion API
• Simple Java API to read/write records in Parquet
• Encodes schema version, data transformations in metadata
• Automatically converts DATE to INT for now
– Migrate to new native DATE schema version when available!
• Parquet doesn’t really handle “ALTER TABLE…” migrations
• Schema changes require building new Parquet files (via EMR Map-Reduce jobs)
• CSV to Parquet conversion tools
• Given a JSON schema definition and CSV, produce Parquet files
• Writes files into a Hive-compliant directory structure
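The Hive-compliant directory structure mentioned above is the standard `key=value` partition layout. A sketch of the path convention (bucket, table, and partition names here are hypothetical):

```python
from datetime import date

# Build a Hive-style partitioned S3 path for a converted Parquet file.
# Bucket/table names are hypothetical; the dt=YYYY-MM-DD layout is the
# standard Hive partition convention.
def partition_path(bucket: str, table: str, d: date, part: int) -> str:
    return "s3://%s/%s/dt=%s/part-%05d.parquet" % (
        bucket, table, d.isoformat(), part)

p = partition_path("my-warehouse", "trades", date(2015, 6, 22), 0)
# s3://my-warehouse/trades/dt=2015-06-22/part-00000.parquet
```

Because the partition key is encoded in the path, Hive and Presto can prune whole date partitions without reading any file contents.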
SQL QUERY LAYER: PRESTO
• A distributed SQL engine developed by Facebook
• https://prestodb.io
• Used by Facebook, Netflix, Airbnb, and others
• Reads schema definition from Hive, data files from S3 or HDFS
• Recent enterprise support and contributions from Teradata
• http://www.teradata.com/Presto
• Data encryption support added by Nasdaq
• Working on contributing this back to Presto through github
• Uses Hive Metastore to impose table definitions on S3 files
STREAM PROCESSING LAYER: SPARK
• One of our clients wants to test stream processing order data
• Spark supports Parquet
• Transparent file decryption prior to Spark data processing layer
• Needs to be tested, but should “just work”
• …we’re hoping to support Hadoop projects today and in the future
DATA INGEST
• Daily ingest into Amazon Redshift will also write data to S3
• Averaging ~6 B rows/day currently (1.9 TB/day uncompressed)
• Build Parquet files directly from CSV files loaded into Amazon Redshift
• Limit Amazon Redshift to a 1-year rolling window of data
• Effectively caps Amazon Redshift costs
• Future enhancement:
• Use Presto as a unified SQL access layer to Amazon Redshift and EMR/S3
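The 1-year rolling window above comes down to a retention cutoff date: rows older than the cutoff live only in S3/EMR, capping Redshift size. A sketch of the date math (window length and any table names would be deployment-specific):

```python
from datetime import date, timedelta

# Compute the retention cutoff for a rolling window: data older than
# this date is dropped from Amazon Redshift and served from S3/EMR.
# The 365-day window matches the 1-year policy described above.
def retention_cutoff(today: date, days: int = 365) -> date:
    return today - timedelta(days=days)

cutoff = retention_cutoff(date(2015, 7, 9))
# A nightly job would then delete Redshift rows with event_date < cutoff
# (column name hypothetical) after confirming they exist in S3 as Parquet.
```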
New Features
Amazon EMR is now HIPAA-eligible
Planned for next month: EMR 4.0
• Directly configure Hadoop application settings
• Hadoop 2.6, Spark 1.4.0, Hive 1.0, Pig 0.14
• Standard ports and paths (Apache Bigtop standards)
• Quickly configure and launch clusters for standard use cases
• Use the configuration parameter to directly change default settings for Hadoop ecosystem applications, instead of the “configure-hadoop” bootstrap action
NEW YORK
Deep Dive: Amazon Elastic MapReduce
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
 
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
 
Choosing the right data storage in the Cloud.
Choosing the right data storage in the Cloud. Choosing the right data storage in the Cloud.
Choosing the right data storage in the Cloud.
 
Deep Dive in Big Data
Deep Dive in Big DataDeep Dive in Big Data
Deep Dive in Big Data
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
AWS Summit Tel Aviv - Enterprise Track - Cost Optimization & TCO
AWS Summit Tel Aviv - Enterprise Track - Cost Optimization & TCOAWS Summit Tel Aviv - Enterprise Track - Cost Optimization & TCO
AWS Summit Tel Aviv - Enterprise Track - Cost Optimization & TCO
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Amazon Web Services
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
Amazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
UiPathCommunity
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
DianaGray10
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
Fwdays
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Pitangent Analytics & Technology Solutions Pvt. Ltd
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 

Recently uploaded (20)

inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 

Deep Dive: Amazon Elastic MapReduce

  • 1. ©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Deep Dive: Amazon EMR Rahul Pathak—Sr. Mgr. Amazon EMR (@rahulpathak) Jason Timmes—AVP Software Development, Nasdaq
  • 2. Why Amazon EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Elastic Easily add or remove capacity Reliable Spend less time monitoring Secure Managed firewalls Flexible You control the cluster
  • 3. Easy to deploy AWS Management Console Command Line or use the EMR API with your favorite SDK
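The console and CLI paths above map to a single API call. A minimal boto3 sketch is below; the release label, instance types, counts, and role names are illustrative placeholders, not values from the talk:

```python
def build_cluster_request(name, master_type="m3.xlarge",
                          core_type="m3.xlarge", core_count=2):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Everything here (release label, applications, roles) is an example
    configuration; adjust for your own account and workload.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.0.0",  # assumed release label
        "Applications": [{"Name": "Hive"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": master_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": core_type, "InstanceCount": core_count},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# With AWS credentials configured, the request could be submitted via:
#   import boto3
#   boto3.client("emr").run_job_flow(**build_cluster_request("deep-dive-demo"))
```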
  • 4. Easy to monitor and debug Monitor Debug Integrated with Amazon CloudWatch Monitor cluster, node, and I/O
  • 5. Try different configurations to find your optimal architecture. Choose your instance types: CPU (c3 family, cc1.4xlarge, cc2.8xlarge), Memory (m2 family, r3 family), Disk/IO (d2 family, i2 family), General (m1 family, m3 family). Workloads to match against: batch process, machine learning, Spark and interactive, large HDFS.
  • 6. Easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing. Resizable clusters
  • 7. Spot for task nodes Up to 90% off EC2 on-demand pricing On-demand for core nodes Standard Amazon EC2 pricing for on-demand capacity Easy to use Spot Instances Meet SLA at predictable cost Exceed SLA at lower cost
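The split the slide describes (on-demand core nodes, Spot task nodes) comes down to two instance-group definitions in the cluster request. A hedged sketch follows; the instance types and bid price are placeholders:

```python
def core_group_on_demand(instance_type, count):
    """On-demand core instance group: core nodes hold HDFS, so keep them
    on predictable on-demand capacity."""
    return {
        "Name": "Core (On-Demand)",
        "InstanceRole": "CORE",
        "Market": "ON_DEMAND",
        "InstanceType": instance_type,
        "InstanceCount": count,
    }

def task_group_spot(instance_type, count, bid_price):
    """Spot task instance group: task nodes carry no HDFS, so losing them
    to the Spot market only slows the job rather than losing data."""
    return {
        "Name": "Task (Spot)",
        "InstanceRole": "TASK",
        "Market": "SPOT",
        "BidPrice": str(bid_price),  # maximum hourly bid in USD
        "InstanceType": instance_type,
        "InstanceCount": count,
    }
```

These dicts would slot into the `InstanceGroups` list of an EMR cluster request.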
  • 8. Read data directly into Hive, Pig, Streaming, and Cascading from Amazon Kinesis streams. No intermediate data persistence required. A simple way to introduce real-time sources into batch-oriented systems. Multi-application support and automatic checkpointing. Amazon EMR integration with Amazon Kinesis
  • 9. The Hadoop ecosystem can run in Amazon EMR
  • 13. Leverage Amazon S3 with EMRFS
  • 14. Amazon S3 as your persistent data store • Separate compute and storage • Resize and shut down Amazon EMR clusters with no data loss • Point multiple Amazon EMR clusters at the same data in Amazon S3 EMR EMR Amazon S3
  • 15. EMRFS makes it easier to use Amazon S3 • Read-after-write consistency • Very fast list operations • Error handling options • Support for Amazon S3 encryption • Transparent to applications: s3://
  • 16. EMRFS client-side encryption: Amazon S3 encryption clients; EMRFS enabled for Amazon S3 client-side encryption; key vendor (AWS KMS or your custom key vendor); client-side encrypted objects in Amazon S3
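On the configuration side, EMRFS client-side encryption is switched on through `emrfs-site` properties. A sketch of the configuration object is below; the property names are as I recall them from EMR documentation of this era and should be treated as assumptions to verify:

```python
def emrfs_cse_config(provider_class):
    """EMR configuration classification enabling EMRFS client-side
    encryption with a custom EncryptionMaterialsProvider.

    provider_class: fully qualified class name of your materials provider
    (e.g. a KMS-backed or HSM-backed implementation); property names are
    assumptions, not verified against a specific EMR release.
    """
    return {
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.cse.enabled": "true",
            "fs.s3.cse.encryptionMaterialsProvider": provider_class,
        },
    }
```

The resulting dict would be passed in the `Configurations` list when creating the cluster.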
  • 17. HDFS is still there if you need it • Iterative workloads – If you’re processing the same dataset more than once – Consider using Spark & RDDs for this too • Disk I/O intensive workloads • Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing
  • 19. EMR example #1: Batch Processing GB of logs pushed to S3 hourly Daily EMR cluster using Hive to process data Input and output stored in S3 250 Amazon EMR jobs per day, processing 30 TB of data http://aws.amazon.com/solutions/case-studies/yelp/
  • 20. EMR example #2: Long-running cluster Data pushed to S3 Daily EMR cluster ETL data into database 24/7 EMR cluster running HBase holds last 2 years of data Front-end service uses HBase cluster to power dashboard with high concurrency
  • 21. TBs of logs sent daily Logs stored in Amazon S3 Hive Metastore on Amazon EMR EMR example #3: Interactive query Interactive query using Presto on multi-petabyte warehouse http://nflx.it/1dO7Pnt
  • 22. EMR example #4: Streaming data processing TBs of logs sent daily Logs stored in Amazon Kinesis Amazon Kinesis Client Library AWS Lambda Amazon EMR Amazon EC2
  • 24. File formats • Row oriented – Text files – Sequence files • Writable object – Avro data files • Described by schema • Columnar format – Optimized Row Columnar (ORC) – Parquet (diagram: a logical table stored row-oriented vs. column-oriented)
  • 25. Choosing the right file format • Processing and query tools – Hive, Impala, and Presto • Evolution of schema – Avro for schema and Presto for storage • File format “splittability” – Avoid JSON/XML files. Use them as records. • Compression—block or file
  • 26. File sizes • Avoid small files – Anything smaller than 100 MB • Each mapper is a single JVM – CPU time is required to spawn JVMs/mappers • Fewer files, matching closely to block size – Fewer calls to S3 – Fewer network/HDFS requests
  • 27. Dealing with small files • Reduce HDFS block size, e.g. 1 MB (default is 128 MB) – --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576" • Better: Use S3DistCp to combine smaller files together – S3DistCp takes a pattern and target path to combine smaller input files into larger ones – Supply a target size and compression codec
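S3DistCp does this grouping for you (via its target-size option), but the idea is easy to sketch: greedily pack small files into groups whose combined size lands near the HDFS block size. A toy plan, for illustration only:

```python
def plan_merges(file_sizes, target=128 * 1024 * 1024):
    """Greedy plan for combining small files into groups near a target
    size (default: the 128 MB HDFS block size mentioned above).

    file_sizes: list of file sizes in bytes.
    Returns a list of groups, each a list of sizes to merge into one file.
    """
    groups, current, total = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        # Start a new group once adding this file would overshoot the target.
        if total + size > target and current:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups
```

Fewer, larger files means fewer S3 list/get calls and fewer mapper JVMs, which is exactly the point of the slide.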
  • 28. Compression • Always compress data files on Amazon S3 – Reduces network traffic between Amazon S3 and Amazon EMR – Speeds up your job • Compress mapper and reducer output. Amazon EMR compresses inter-node traffic with LZO with Hadoop 1, and Snappy with Hadoop 2
  • 29. Choosing the right compression • Time-sensitive? Faster compression is a better choice • Large amounts of data? Use space-efficient compression • Combined workload? Use gzip. Comparison (algorithm: splittable, compression ratio, compress/decompress speed) – Gzip (DEFLATE): no, high, medium; bzip2: yes, very high, slow; LZO: yes, low, fast; Snappy: no, low, very fast
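The table above reduces to a small decision rule. A toy selector that encodes it is below; real choices also depend on file sizes and the processing engine, so treat this as a mnemonic, not a recommendation engine:

```python
# Codec properties from the comparison above: (splittable, ratio, speed).
CODECS = {
    "gzip":   (False, "high",      "medium"),
    "bzip2":  (True,  "very high", "slow"),
    "lzo":    (True,  "low",       "fast"),
    "snappy": (False, "low",       "very fast"),
}

def pick_codec(need_splittable, prioritize_speed):
    """Pick a codec consistent with the slide's guidance.

    need_splittable: True if a single large file must be split across mappers.
    prioritize_speed: True for time-sensitive jobs, False to favor ratio.
    """
    candidates = [name for name, (splittable, _, _) in CODECS.items()
                  if splittable or not need_splittable]
    order = (["snappy", "lzo", "gzip", "bzip2"] if prioritize_speed
             else ["bzip2", "gzip", "lzo", "snappy"])
    return next(name for name in order if name in candidates)
```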
  • 30. Cost saving tips for Amazon EMR • Use S3 as your persistent data store; query it using Presto, Hive, Spark, etc. • Only pay for compute when you need it • Use Amazon EC2 Spot Instances to save > 80% • Use Amazon EC2 Reserved Instances for steady workloads • Use Amazon CloudWatch alerts to notify you if a cluster is underutilized, then shut it down. E.g. 0 mappers running for > N hours
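The last tip (alert on an underutilized cluster) can use EMR's `IsIdle` CloudWatch metric. A sketch of the `put_metric_alarm` request follows; the SNS topic ARN is a placeholder, and the exact metric semantics should be checked against the CloudWatch documentation:

```python
def idle_cluster_alarm(cluster_id, hours=2):
    """Keyword arguments for CloudWatch's put_metric_alarm that fire when
    an EMR cluster reports itself idle for `hours` consecutive hours.

    The AlarmActions ARN below is a made-up placeholder.
    """
    return {
        "AlarmName": "emr-idle-%s" % cluster_id,
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 3600,              # evaluate in one-hour windows
        "EvaluationPeriods": hours,  # idle for N consecutive windows
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:notify-me"],
    }

# With credentials configured, this could be sent via:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**idle_cluster_alarm("j-EXAMPLE"))
```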
  • 31. OPTIMIZING DATA WAREHOUSING COSTS WITH S3 AND EMR July 9, 2015 Jason Timmes, AVP of Software Development Nate Sammons, Principal Architect
  • 32. Get to know us: We make the world’s capital markets move faster, more efficient, more transparent. Public company in S&P 500. Develop and run markets globally in all asset classes. We provide technology, trading, intelligence, and listing services. Intense operational focus on efficiency and competitiveness. We provide the infrastructure, tools, and strategic insight to help our customers navigate the complexity of global capital markets and realize their capital ambitions. We have uniquely transformed our business from predominantly a U.S. equities exchange to a global provider of corporate, trading, technology, and information solutions.
  • 33. Leading index provider with 41,000+ indexes across asset classes and geographies. Over 10,000 corporate clients in 60 countries. Our technology powers over 70 marketplaces, regulators, CSDs, and clearinghouses in over 50 countries. 100+ data product offerings supporting 2.5+ million investment professionals and users in 98 countries. 26 markets, 3 clearing houses, 5 central securities depositories. Lists more than 3,500 companies in 35 countries, representing more than $8.8 trillion in total market value.
  • 34. CURRENT STATE • Replaced on-premises data warehouse with Amazon Redshift in 2014 • On-premises warehouse held 1 year of data • Amazon Redshift migration yielded 57% cost savings over legacy solution • Business now wants more than 1 year of data in warehouse • Currently have Jan. 2014 to present in Amazon Redshift (18 dw1.8xl nodes) Optimizing data warehousing costs with S3 and EMR
  • 35. THE PROBLEM • Historical year archives are queried much less frequently • Amazon Redshift is at a very competitive price point, but for rarely needed data, it’s still expensive • Need a solution that stores historical data once (cheaply), and can use elastic compute directly on that data only as needed
  • 36. REQUIREMENTS • Decouple storage and compute resources • All data is encrypted in flight and at rest • SQL interface to all archived data • Parallel ingest to Amazon Redshift & new (historical years) warehouse • Transparent usage-based billing to internal departments (clients) • Isolate workloads between clients (no resource contention) • Ideally deliver a platform that can enable different kinds of compute paradigms over the same data (SQL, stream processing, etc.)
• 37. ARCHITECTURE DIAGRAM
• 38. SEPARATE STORAGE & COMPUTE RESOURCES
  • Single source of truth: highly durable encrypted files in S3
    • Avoids read hotspots; very high concurrent read capability
  • Multiple compute clusters isolate workloads for users
    • Each client runs their own EMR clusters in their own Amazon VPCs
    • No contention on compute resources, or even IP address ranges
    • Scale compute up/down as needed (and manage costs)
  • Cost allocation using multiple AWS accounts (1 per internal budget)
    • S3/Hive metastore costs are the only shared infrastructure; extremely cheap
    • Consolidated billing makes cost allocations easy and fully transparent
  • Run multiple query layers & experiment with new projects
    • Spark, Drill, etc.
• 39. DATA SECURITY & ENCRYPTION
  • Current state:
    • Nasdaq KMS for encryption keys
    • Key hierarchy uses MySQL, rooted in an HSM cluster
    • Using the S3 EncryptionMaterialsProvider interface
  • Future state:
    • Working with InfoSec to evaluate AWS KMS
    • Nasdaq KMS keys rooted in AWS KMS
    • Move the MySQL DB to Amazon Aurora (encrypted with AWS KMS)
    • AWS KMS support in AWS services is growing
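The key hierarchy above can be sketched in miniature. This is an illustrative toy, not Nasdaq's implementation and not real cryptography: it only mimics the *structure* of envelope encryption — a fresh per-object data key encrypts the data, and a master key (which in production stays inside the HSM or AWS KMS) "wraps" the data key so only the wrapped form is stored alongside the object. The XOR stand-in for AES and all function names are assumptions for illustration.

```python
# Toy envelope-encryption sketch. XOR stands in for AES and the master
# key lives in memory here; in a real system the wrap/unwrap happens
# inside an HSM or AWS KMS and a real cipher encrypts the payload.
import os


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Toy 'cipher' -- structurally in AES's place, cryptographically useless."""
    return bytes(x ^ y for x, y in zip(a, b))


def wrap_data_key(data_key: bytes, master_key: bytes) -> bytes:
    # Production equivalent: key-wrap performed by the HSM / KMS.
    return xor_bytes(data_key, master_key)


def unwrap_data_key(wrapped: bytes, master_key: bytes) -> bytes:
    return xor_bytes(wrapped, master_key)


def put_object(plaintext: bytes, master_key: bytes) -> dict:
    """Encrypt with a fresh random data key; store ciphertext + wrapped key."""
    data_key = os.urandom(16)
    pad = (data_key * (len(plaintext) // 16 + 1))[: len(plaintext)]
    return {
        "ciphertext": xor_bytes(plaintext, pad),
        "wrapped_key": wrap_data_key(data_key, master_key),
    }


def get_object(obj: dict, master_key: bytes) -> bytes:
    """Unwrap the data key, then decrypt the stored ciphertext."""
    data_key = unwrap_data_key(obj["wrapped_key"], master_key)
    pad = (data_key * (len(obj["ciphertext"]) // 16 + 1))[: len(obj["ciphertext"])]
    return xor_bytes(obj["ciphertext"], pad)
```

The point of the structure: rotating or revoking access means re-wrapping small data keys, never re-encrypting the bulk data in S3.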
• 40. ENCRYPTED DATA ACCESS MONITORING/CONTROL
• 41. S3 FILE FORMAT: PARQUET
  • Parquet file format: http://parquet.apache.org
  • Self-describing columnar file format
  • Supports nested structures (Dremel "record shredding" algorithm)
  • Emerging standard data format for Hadoop
  • Supported by: Presto, Spark, Drill, Hive, Impala, etc.
• 42. PARQUET VS. ORC
  • Evaluated Parquet and ORC (competing open columnar formats)
  • ORC encrypted performance is currently a problem
    • ~15x slower than unencrypted (a ~94% throughput drop)
    • 8 CPUs on 2 nodes: ~900 MB/sec unencrypted vs. ~60 MB/sec encrypted
  • Encrypted Parquet is only ~27% slower than unencrypted
    • Parquet: ~100 MB/sec from S3 per CPU core (encrypted)
  • DATE field support in Parquet is not ready yet
    • DATE + Parquet not supported in Presto at this time
    • Currently writing DATEs as INTs: 2015-06-22 => 20150622
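The DATE-as-INT workaround above can be sketched as a pair of helpers (function names are hypothetical, not from Nasdaq's actual API): since Presto could not yet read Parquet DATE columns at the time, dates were written as `yyyymmdd` integers.

```python
# Sketch of the DATE -> INT encoding described on the slide:
# 2015-06-22 is stored as the integer 20150622.
import datetime


def date_to_int(d: datetime.date) -> int:
    """Encode a date as yyyymmdd, e.g. 2015-06-22 -> 20150622."""
    return d.year * 10000 + d.month * 100 + d.day


def int_to_date(n: int) -> datetime.date:
    """Inverse mapping, for decoding rows read back out of Parquet."""
    return datetime.date(n // 10000, (n // 100) % 100, n % 100)
```

A useful property of this encoding is that it preserves ordering, so SQL range predicates such as `WHERE trade_date BETWEEN 20150101 AND 20150630` still behave like date comparisons.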
• 43. PARQUET DATA CONVERSION & MANAGEMENT
  • Generic schema management system
    • Supports multiple versions of a named table schema
  • Data conversion API
    • Simple Java API to read/write records in Parquet
    • Encodes schema version and data transformations in metadata
    • Automatically converts DATE to INT for now; migrate to a new native DATE schema version when available!
  • Parquet doesn't really handle "ALTER TABLE…" migrations
    • Schema changes require building new Parquet files (via EMR MapReduce jobs)
  • CSV-to-Parquet conversion tools
    • Given a JSON schema definition and CSV, produce Parquet files
    • Writes files into a Hive-compliant directory structure
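A "Hive-compliant directory structure" means encoding partition columns as `key=value` path segments, so the Hive metastore (and Presto on top of it) can discover partitions from the layout alone. A minimal sketch, with hypothetical bucket, table, and column names:

```python
# Sketch of the Hive-style partition layout the CSV-to-Parquet tools
# write into. The key=value directory segment is what the Hive
# metastore uses to map files to table partitions.
import datetime


def partition_path(bucket: str, table: str, trade_date: datetime.date,
                   part: int) -> str:
    """Build an S3 key like
    s3://<bucket>/warehouse/<table>/trade_date=20150622/part-00000.parquet
    """
    return (f"s3://{bucket}/warehouse/{table}/"
            f"trade_date={trade_date:%Y%m%d}/part-{part:05d}.parquet")
```

Usage: writing each day's converted files under its own `trade_date=...` prefix means a new day of data becomes queryable by simply registering (or discovering) one new partition, with no rewrite of existing files.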
• 44. SQL QUERY LAYER: PRESTO
  • A distributed SQL engine developed by Facebook: https://prestodb.io
    • Used by Facebook, Netflix, Airbnb, and others
  • Reads schema definitions from Hive and data files from S3 or HDFS
    • Uses the Hive metastore to impose table definitions on S3 files
  • Recent enterprise support and contributions from Teradata: http://www.teradata.com/Presto
  • Data encryption support added by Nasdaq
    • Working on contributing this back to Presto through GitHub
• 45. STREAM PROCESSING LAYER: SPARK
  • One of our clients wants to test stream processing of order data
  • Spark supports Parquet
  • Transparent file decryption happens before the Spark data-processing layer
    • Needs to be tested, but should "just work"
  • We're aiming to support Hadoop projects today and in the future
• 46. DATA INGEST
  • Daily ingest into Amazon Redshift will also write data to S3
    • Averaging ~6B rows/day currently (1.9 TB/day uncompressed)
    • Build Parquet files directly from the CSV files loaded into Amazon Redshift
  • Limit Amazon Redshift to a 1-year rolling window of data
    • Effectively caps Amazon Redshift costs
  • Future enhancement: use Presto as a unified SQL access layer to Amazon Redshift and EMR/S3
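The 1-year rolling window can be sketched as a simple age-out policy (function and variable names are illustrative, not from the actual pipeline): each day, daily partitions older than the window are identified and dropped from Amazon Redshift, while the same data stays in S3 for EMR/Presto, which is what caps the Redshift footprint.

```python
# Sketch of the rolling-window policy: find which daily partitions have
# aged out of the Redshift retention window and should be dropped there
# (they remain in S3 as Parquet for historical queries).
import datetime


def partitions_to_unload(partition_dates, today, window_days=365):
    """Return partition dates older than the retention window, oldest first."""
    cutoff = today - datetime.timedelta(days=window_days)
    return sorted(d for d in partition_dates if d < cutoff)
```

Running this in the daily ingest job keeps the expensive store at a fixed size regardless of how much history accumulates in S3.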
  • 48. Amazon EMR is now HIPAA-eligible
• 49. Planned for next month: EMR 4.0
  • Directly configure Hadoop application settings
  • Hadoop 2.6, Spark 1.4.0, Hive 1.0, Pig 0.14
  • Standard ports and paths (Apache Bigtop standards)
  • Quickly configure and launch clusters for standard use cases
• 50. Planned for next month: EMR 4.0
  • Use the configuration parameter to directly change default settings for Hadoop ecosystem applications, instead of using the "configure-hadoop" bootstrap action
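As a sketch of what that looks like, EMR 4.x configurations are expressed as a JSON list of classification objects rather than bootstrap-action arguments. The classification names below (`hadoop-env`, `core-site`) follow the EMR 4.x configuration scheme; the specific property values are illustrative only.

```json
[
  {
    "Classification": "hadoop-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": { "HADOOP_HEAPSIZE": "2048" }
      }
    ]
  },
  {
    "Classification": "core-site",
    "Properties": { "io.file.buffer.size": "65536" }
  }
]
```

A file like this is passed at cluster creation (e.g. via the CLI's `--configurations` option on `aws emr create-cluster`), replacing the older "configure-hadoop" bootstrap-action mechanism.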
  • 51. NEW YORK Rahul Pathak—Sr. Mgr. Amazon EMR (@rahulpathak) Jason Timmes—AVP Software Development, Nasdaq