Hadoop in the Cloud: Unlocking Big Data Potential with MapR and AWS

Hadoop in the Cloud: Unlocking the Potential of Big Data on AWS

Introducing
Maya Cabassi
Partner Marketing Manager
Amazon Web Services

Webinar Overview
• Submit your questions using the Q&A tool.
• A copy of today’s presentation will be made available on:
– AWS SlideShare Channel: http://www.slideshare.net/AmazonWebServices/
– AWS Webinar Channel on YouTube: http://www.youtube.com/channel/UCT-nPlVzJI-ccQXlxjSvJmw

Introducing
Jonathan Fritz
Sr. Product Manager
Amazon Web Services
Steve Wooledge
VP, Product Marketing
MapR Technologies
Bruce Penn
Principal Sales Engineer
MapR Technologies

What We’ll Cover
• Elastic MapReduce (EMR): Hadoop in the cloud
– Elastic clusters tailored for your workflows
– Best container to run Hadoop in the AWS Ecosystem
• Introduction to MapR’s Hadoop Platform
– Defining features
– Increased performance
• Case Studies: MapR + Elastic MapReduce
• Q&A

Hadoop in the Cloud
Using MapR and Amazon Elastic MapReduce to unlock Big Data
Jonathan Fritz, Sr. Product Manager, Amazon Web Services
Steve Wooledge, VP, Product Marketing, MapR Technologies

The Three V’s: the drivers behind Big Data
Variety, Velocity, Volume
• YouTube users upload 48 hours of new video every minute
• Twitter sees roughly 175 million tweets every day
• Facebook analyzes 30+ petabytes of user-generated data
• More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide
• 2.7 zettabytes of data exist in the digital universe today
• Data production will be 44 times greater in 2020 than in 2009

Hadoop is the right system for Big Data
• Scalable and fault tolerant
• Flexibility for multiple languages and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
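To make the batch side of this concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps that Hadoop spreads across a cluster. It is plain Python, not Hadoop API code, and it leaves out everything that makes Hadoop scalable and fault tolerant. The function names and sample records are hypothetical and do not come from the presentation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(records: Iterable[str]) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every input record."""
    pairs = []
    for record in records:
        for word in record.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, the step a framework runs between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped counts for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["big data on AWS", "Hadoop in the cloud", "big clusters in the cloud"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; the sketch only shows the shape of the computation.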

Challenges with managing Hadoop
On-Premises
• Manage HDFS, upgrades, and system administration
• Pay for expensive support contracts
• Select hardware in advance and stick with predictions
Cloud
• Hard to tightly integrate with AWS storage services
• Independently manage and monitor clusters

Amazon Elastic MapReduce (EMR) is the easiest way to run Hadoop in the cloud.

Why Amazon Elastic MapReduce?
• Managed services
• Easy to tune clusters and trim costs
• Support for multiple AWS datastores
• Unique features and ecosystem support

[Diagram, built up over several slides: Input data (S3, DynamoDB, Redshift) and Code → Elastic MapReduce → Name node → Elastic cluster (S3/HDFS) → Queries + BI (via JDBC, Pig, Hive) → Output]

Elastic clusters.
Customize size and type to reduce costs.

Choose your instance types
Try out different configurations to find your optimal architecture.
• CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge
• Memory: m1.large, m2.2xlarge, m2.4xlarge
• Disk: hs1.8xlarge

Long running or transient clusters
Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need.

Resizable clusters
Easy to add and remove compute capacity on your cluster.
Matched compute demands with cluster sizing.
[Figure labels across the animation frames: 10 hours, 6 hours, Peak capacity]
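A back-of-the-envelope sketch of why adding capacity can shorten a job, assuming perfectly linear scaling (real Hadoop jobs rarely scale this cleanly). Only the 10-hour and 6-hour durations appear on the slides; the node count, the function names, and the reading that the two durations describe the same job are assumptions made for illustration.

```python
import math


def runtime_hours(work_node_hours: float, nodes: int) -> float:
    """Runtime of a job under the (idealized) assumption of linear scaling."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return work_node_hours / nodes


def nodes_for_deadline(work_node_hours: float, deadline_hours: float) -> int:
    """Smallest whole number of nodes that finishes within the deadline."""
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    return math.ceil(work_node_hours / deadline_hours)


if __name__ == "__main__":
    base_nodes = 10                      # hypothetical starting cluster size
    work = base_nodes * 10               # job takes 10 hours on the base cluster
    needed = nodes_for_deadline(work, 6)  # nodes required to finish in 6 hours
    print(f"work: {work} node-hours")
    print(f"nodes needed for a 6-hour run: {needed}")
    print(f"runtime on {needed} nodes: {runtime_hours(work, needed):.2f} hours")
```

Under this assumption the total node-hours stay roughly constant, so a larger cluster for a shorter time costs about the same as a smaller cluster for a longer time.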

Use Spot and Reserved Instances.
Minimize costs by supplementing on-demand pricing.

Easy to use Spot Instances
Name-your-price supercomputing to minimize costs.
• Spot for task nodes: up to 90% off EC2 on-demand pricing
• On-demand for core nodes: standard EC2 pricing for on-demand capacity

24/7 clusters on Reserved Instances
Minimize cost for consistent capacity.
• Reserved Instances for long running clusters: up to 65% off on-demand pricing
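The discounts quoted on the Spot and Reserved slides can be turned into a rough best-case cost comparison. The sketch below is illustrative only: the hourly rate and node counts are hypothetical, and the 90% and 65% figures are "up to" ceilings from the slides, so actual savings may be lower.

```python
def hourly_cost(nodes: int, on_demand_rate: float, discount: float = 0.0) -> float:
    """Hourly cost of `nodes` instances at a fractional discount off on-demand."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return nodes * on_demand_rate * (1.0 - discount)


def mixed_cluster_cost(core_nodes: int, task_nodes: int,
                       on_demand_rate: float, spot_discount: float) -> float:
    """On-demand core nodes plus Spot task nodes, as described on the slide."""
    return (hourly_cost(core_nodes, on_demand_rate)
            + hourly_cost(task_nodes, on_demand_rate, spot_discount))


if __name__ == "__main__":
    rate = 0.50               # hypothetical on-demand price, USD per instance-hour
    core, task = 4, 10        # hypothetical cluster shape
    spot_ceiling = 0.90       # "up to 90% off" from the Spot slide
    reserved_ceiling = 0.65   # "up to 65% off" from the Reserved slide

    all_on_demand = hourly_cost(core + task, rate)
    mixed = mixed_cluster_cost(core, task, rate, spot_ceiling)
    reserved = hourly_cost(core + task, rate, reserved_ceiling)

    print(f"all on-demand:            ${all_on_demand:.2f}/hour")
    print(f"core on-demand + spot:    ${mixed:.2f}/hour (best case)")
    print(f"all reserved (24/7 use):  ${reserved:.2f}/hour (best case)")
```

Keeping core nodes on on-demand capacity reflects the split shown on the slide: Spot capacity is used only for task nodes.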

Your data, your choice.
Easy to integrate Elastic MapReduce with your datastores.

Using Amazon S3 and On-Cluster Storage
• Data sources
• Transient EMR cluster for batch map/reduce jobs for daily reports
• Long running EMR cluster holding data on the cluster in a NoSQL database
• Weekly report
• Ad-hoc query
• Data aggregated and stored in Amazon S3

Use Amazon EMR with Redshift and S3
• Data sources
• Daily data aggregated in Amazon S3
• Amazon EMR cluster used to process data
• Processed data loaded into Amazon Redshift data warehouse

© 2014 MapR Technologies
Introduction to MapR

MapR: Worldwide Hadoop Technology Leader
• Uniquely addresses both analytic and operational use cases
• 500+ paying customers

Hadoop Distributions
[Figure: comparison of distribution stacks. Labels: Distribution A, Distribution B, Open Source, Management, Architectural Innovations]

MapR Distribution for Hadoop
Management
MapR Data Platform
Apache Hadoop & OSS ecosystem: Impala, Shark, Hive, Pig, Hue, Oozie, ZooKeeper, Mahout, MLlib, Juju, Solr, Cascading, HttpFS, Flume, Storm, Spark Streaming, YARN, MapReduce, HBase, Whirr, Sqoop, Drill, Tez, Knox, Sentry, Spark, Falcon
Enterprise-grade
• High availability
• Data protection
• Disaster recovery
Interoperability
• Standard file access
• Standard database access
• Pluggable services
• Broad developer support
Security
• Enterprise security authorization
• Wire-level authentication
• Data governance
Operational
• Ability to support predictive analytics, real-time database operations, and support high arrival rate data
Multi-tenancy
• Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators
Performance
• 2X to 7X higher performance
• Consistent, low latency

A winning combination:
MapR with Amazon Elastic MapReduce

Launching a Cluster
MapR Option Integrated within EMR

MapR: Designed for Both Transient and Long-Term Clusters
• High Availability
• Easy Development
• Multi-Tenancy
• World-Record Performance
• Breadth of Applications
Fastest On-Ramp to Develop Hadoop Applications
Best Platform for Long-Term Hadoop Production Success

High Availability (HA) For Hadoop
Resource Manager HA, Application Master HA
• YARN jobs are not impacted by failures
• Continue to meet SLAs with MapReduce v2
JobTracker HA for MRv1
• MapReduce v1 jobs are not impacted by failures
• Meet your data processing SLAs
NFS HA
• High throughput and resilience for NFS-based data ingestion, import/export and multi-client access
Instant recovery
• Files and tables are accessible within seconds of a node failure or cluster restart
No-NameNode architecture
• Distributed metadata can self-heal
• No practical limit on # of files

Direct Integration with Existing Applications
• 100% POSIX compliant
• Industry standard APIs
– NFS, ODBC, LDAP, REST
• More 3rd-party solutions
• No proprietary connectors required
• Language neutral

Multi-Tenancy Support for Parallelized App Development
Isolation
• Tasks sandboxed so they don’t impact other tasks or system daemons
• System resources protected from runaway jobs
• Volume-based data segregation based on users and groups
• Volume-based data placement to control
• Label-based job scheduling to control
Quotas
• Storage quotas by volume/user/group
• CPU and memory quotas by queue/user/group
Security and delegation
• Fine-grained administration permissions including volume-level delegation
• Authenticate users to AD, LDAP and Kerberos via Linux PAM
Reporting
• Detailed reporting on resource usage (75+ different metrics)
• All reports are available via UI, CLI and REST API

MapR M7: The Best In-Hadoop NoSQL Database
Benefit: Features
• High Performance: Over 1 million ops/sec with 10 node cluster
• Continuous Low Latency: No I/O storms, no compactions
• 24x7 Applications: Instant recovery, online schema modification, snapshots, mirroring
• Zero Administration: No processes to manage, automated splits, self-tuning
• High Scalability: 1 trillion tables, billions of rows, millions of columns
• Low TCO: Files and tables on one platform, more work with fewer nodes
Categories: Performance, Reliability, Easy Administration

Flux7: Comparative Study of Hadoop Distributions
Web Search and Data Analytics Benchmarks
[Bar charts: Page Rank and Hive JOIN Query, time in seconds, lower is better. Legend entries as extracted: IDH 2.4.1, CDH 4.3. Bar values as extracted, mapping to distributions not recoverable: 425, 925, 333, 563, 367, 532, 163, 331]
Hardware Specs: EC2 on AWS
1 Master: m1.xlarge; 64-bit; 4 vCPU, 8 ECU; 15 GiB RAM; 4x420 GB Storage; 4x Intel® Xeon® CPU E5-2650 0 @ 2.00 GHz
4 Slaves: m1.large; 64-bit; 2 vCPU, 4 ECU; 7.5 GiB RAM; 2x420 GB Storage; 2x Intel® Xeon® CPU E5430 @ 2.66 GHz
Source: Flux7 Labs Study, October 2013

Comparative Study of Hadoop Distributions
Read and Write Throughput Benchmarks
[Bar charts: DFSIO Read Throughput and DFSIO Write Throughput, MB per second, higher is better. Legend: IDH, CDH, HDP, MapR. Bar values as extracted, mapping to distributions not recoverable: 212, 59, 262, 69, 276, 64, 475, 465]
Hardware Specs: EC2 on AWS
1 Master: m1.xlarge; 64-bit; 4 vCPU, 8 ECU; 15 GiB RAM; 4x420 GB Storage; 4x Intel® Xeon® CPU E5-2650 0 @ 2.00 GHz
4 Slaves: m1.large; 64-bit; 2 vCPU, 4 ECU; 7.5 GiB RAM; 2x420 GB Storage; 2x Intel® Xeon® CPU E5430 @ 2.66 GHz
Source: Flux7 Labs Study, October 2013
http://flux7.com/blogs/case-studies/hadoop-distributions-a-detailed-comparative-study-whitepaper/

MapR M7: The Best In-Hadoop Database
• NoSQL Columnar Store
• Apache HBase API
• Integrated with Hadoop
Other Distros, stack from top to bottom: HBase, JVM, HDFS, JVM, ext3/ext4, Disks
MapR M7, stack from top to bottom: Tables/Files, Disks
The most scalable, enterprise-grade NoSQL database that supports online applications and analytics

Customer Case Studies
MapR with Amazon Elastic MapReduce in Action

Use cases for MapR with Amazon EMR
• Targeted advertising / clickstream analysis
• Security: anti-virus, fraud detection, image recognition
• Pattern matching / recommendations
• Reporting / BI
• Bio-informatics (genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resize jpegs, video encoding)
• Web indexing

Case Study
Outcomes from MapR Deployment w/ EMR
• Increased flexibility to scale at lower costs
• Faster turnaround for customer requests
• Ease of experimentation
Challenges
• RDBMS on AWS too slow
• Solution must be compatible with AWS & Java 7
• High performance

Case Study
Outcomes from MapR Deployment w/ EMR
• Faster machine learning performance enables more/faster simulations
• MapR M7 provides geospatial database backed by Amazon S3
Challenges
• Large volumes of sensor data
• Project weather for 2.5 years at every 20x20 plot across the US
• Climatology simulations need to quickly experiment at small scale and then scale reliably

Demo

MapR/EMR Demonstration
• Create MapR cluster using EMR
• Review MapR Control System (MCS)
• Show S3 and MapR integration
• Demonstrate MapR’s real-time capability
• Connect Mac to MapR via NFS
• Run queries with HiveServer2 and Impala
• Visualize data with Tableau

Questions and Contact
MapR:
http://aws.amazon.com/elasticmapreduce/mapr/
swooledge@mapr.com
@mapr
Maprtech
AWS Contact:
aws.amazon.com/contact-us
jonfritz@amazon.com
@awscloud
Amazon Web Services

We’d like your feedback.
Please complete a short survey:
https://aws.asia.qualtrics.com/SE/?SID=SV_brzWlylHrqM29tr
