MapR's Hadoop Distribution on Google Compute Engine
Who am I?

•  Keys Botzum
•  kbotzum@maprtech.com
•  Senior Principal Technologist, MapR Technologies
•  MapR Federal and Eastern Region

http://www.mapr.com/company/events/speaking/devfest-dc-9-28-12
MapR’s Experience with Google Compute Engine

•  Fast
   –  Virtualized public cloud
   –  Rivals on-premise physical hardware
•  Easy
   –  Provision 1,000s of servers in minutes
•  Cost effective
   –  Pay only for what you use
gcutil is your friend

•  Command-line tool that runs on your client machines to manage the
   instances in your cloud
•  Remarkably easy to use
   –  New server/instance: gcutil addinstance
   –  Connect to a server/instance: gcutil ssh
•  Can create your own custom images using Google’s tools
   –  Using custom images is as easy as addinstance --image <image name>
      (see the sketch below)
   –  MapR is creating custom images for MapR clusters
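As a concrete illustration, a minimal gcutil session might look like the
following. This is a sketch: the project, zone, instance, and image names
are made up, and the --machine_type/--zone/--image flag spellings are
assumptions based on gcutil's conventions of the time.

    # Create an instance from a custom image, then connect to it.
    # (Hypothetical names; flag spellings are assumptions.)
    gcutil --project=my-project addinstance mapr-node-1 \
        --machine_type=n1-standard-4-d \
        --zone=us-central1-a \
        --image=my-custom-mapr-image
    gcutil --project=my-project ssh mapr-node-1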
MapReduce: A Paradigm Shift
•    Distributed computing platform
      •  Large clusters
      •  Commodity hardware
•    Pioneered at Google
      •  BigTable, MapReduce and Google File System
•    Commercially available as Hadoop
MapR Technologies

•  Open, enterprise-grade distribution for Hadoop
   –  Easy, dependable and fast
   –  Open source with standards-based extensions

•  Hadoop
   –  Big data analytics
   –  Inspired by the MapReduce paper published by Google scientists
      Jeffrey Dean and Sanjay Ghemawat in 2004

•  MapR is recognized as a technology leader

•  MapR Hadoop Cloud Service now available on Google Compute Engine
MapR Partners
MapR’s Complete Distribution for Apache Hadoop

•  Integrated, tested, hardened and supported
•  Integrated with Accumulo
•  Runs on commodity hardware
•  Open source with standards-based extensions for:
   –  Security
   –  File-based access
   –  Most SQL-based access
   –  Easiest integration
•  High availability
•  Best performance

[Architecture diagram: the MapR Control System (Heatmap™, LDAP/NIS
integration, quotas, CLI and REST API, alerts and alarms) sits above
the Hadoop ecosystem components (Hive, Pig, Oozie, Sqoop, HBase, Whirr,
Accumulo, Mahout, Cascading, Nagios and Ganglia integration, Flume,
ZooKeeper), all running on MapR’s Storage Services™: Direct Access NFS,
real-time streaming, volumes, mirrors, snapshots, data placement, a
no-NameNode architecture, high-performance direct shuffle, and stateful
failover and self healing.]
Overview of Starting a Cluster

•  Google’s gcutil is your friend
   –  Very easy tool for spinning up instances
•  MapR is creating a tool and infrastructure to spin up a fully functional
   MapR cluster composed of many nodes (a representative session is
   sketched below):
   –  ./mapr-start-cluster.sh --machine-type <…> -masters <#> -slaves <#>
   –  …wait a few minutes
   –  gcutil ssh <node running admin server> and set the admin password
   –  gcutil listinstances (to find your cluster’s IP addresses)
   –  …use the cluster, it’s fully functional
   –  ./mapr-stop-cluster.sh
   –  …billing for the cluster stops

* Note that this is not the final interface, but rather is representative
of what will be released. Some details omitted for clarity.
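Putting those steps together, a representative end-to-end session might
look like this (again, not the final interface; the admin-node name is
hypothetical):

    # Spin up a MapR cluster, use it, then shut it down.
    ./mapr-start-cluster.sh --machine-type n1-standard-4-d -masters 3 -slaves 12
    gcutil listinstances             # find the cluster's IP addresses
    gcutil ssh mapr-admin-node       # hypothetical name; set the admin password
    # ...run jobs against the fully functional cluster...
    ./mapr-stop-cluster.sh           # billing for the cluster stops here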
Demo

Let’s run a large sort: TeraSort on a 1250-node MapR Hadoop cluster on
Google Compute Engine (10 billion records, 1 TB of data).
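For reference, a TeraSort run of this shape is normally driven with the
stock Hadoop examples jar. A sketch, assuming the standard teragen and
terasort entry points (the jar name and the paths vary by distribution
and are illustrative here):

    # 10 billion 100-byte records = 1 TB of input.
    hadoop jar hadoop-examples.jar teragen 10000000000 /benchmarks/tera-in
    hadoop jar hadoop-examples.jar terasort /benchmarks/tera-in /benchmarks/tera-out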
How does this Compare to TeraSort Records?

              MapR on Google      Record on physical
              Compute Engine      hardware
  Hardware    Virtual/Cloud       Physical
  Cores       5,024               11,680
  Disks       1,256               5,840
  Servers     1,256               1,460
  Time        1:20 (min:sec)      1:02 (min:sec)
Deployment Comparison

Current record: 1460 physical servers
  –  Prepare datacenter
  –  Rack and stack servers
  –  Maintain hardware
  Time to deploy: months

MapR on Google Compute Engine: 1256 instances
  –  Invoke gcutil command
  Time to deploy: minutes
Cost Comparison

Current record: 1460 1U servers x $4K/server = $5,840,000

MapR on Google Compute Engine: 1256 n1-standard-4-d instances x
$0.58/instance-hour x 80 seconds = $16 ($728/hour)
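Working the numbers: 1256 instances x $0.58/instance-hour is about $728
per hour, and an 80-second run is 80/3600 ≈ 0.022 hours, so the sort
costs roughly 0.022 x $728 ≈ $16.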
Easy Management at Scale

•  Health monitoring
•  Cluster administration
•  Application resource provisioning
Direct Access NFS™

[Diagram: file browsers access cluster files directly via “drag & drop”;
standard Linux commands and tools (grep, sed, sort, tar) operate on them
directly; and applications log directly to the cluster, with random read
and random write support throughout.]
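To make “access directly” concrete: a client mounts the cluster like any
NFS export and points ordinary tools at it. A sketch, with the host name,
cluster name, and file paths made up (MapR exposes cluster files under
/mapr/<cluster name> on the export):

    # Mount the cluster's NFS export and use standard Linux tools on it.
    sudo mount -t nfs mapr-node-1:/mapr /mapr
    grep ERROR /mapr/my.cluster.com/logs/app.log | sort | uniq -c
    tar czf /mapr/my.cluster.com/backups/logs.tgz /var/log/myapp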
Multi-tenancy

§  Consider a large cluster with lots of storage and numerous jobs
   supporting multiple organizations
§  Volumes (see the CLI sketch below)
   §  Control storage usage
      §  quotas on volumes
      §  quotas on cluster storage by user or group
   §  Control data placement
      §  ensure that data is stored in the locations you want
   §  Control mirroring and snapshotting
§  Job management
   §  Control where jobs run
      §  ensure that jobs run where you want
   §  Historical view of metrics collected from jobs
      §  ease troubleshooting of job issues
§  Security/Protection
   §  Fine-grained permissions on volume and cluster management,
      including delegation
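A hedged sketch of the volume controls above, using MapR’s maprcli
(option names follow the MapR documentation of this era; the volume name,
path, user, and quotas are illustrative):

    # Create a volume for one organization with its own storage quota.
    maprcli volume create -name project-a -path /projects/a -quota 500G
    # Cap one user's total storage across the cluster (type 0 = user).
    maprcli entity modify -name alice -type 0 -quota 100G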
MapR: Lights Out Data Center Ready

Reliable Compute
•  Automated stateful failover
•  Automated re-replication
•  Self-healing from HW and SW failures
•  Load balancing
•  Rolling upgrades
•  No lost jobs or data
•  99999’s of uptime

Dependable Storage
§  Business continuity with snapshots and mirrors
§  Recover to a point in time
§  End-to-end checksumming
§  Strong consistency
§  Built-in compression
§  Mirror across sites to meet Recovery Time Objectives
MapR Mirroring/COOP Requirements

[Diagram: a Production datacenter mirrors over the WAN to a Research
datacenter, and a Production datacenter mirrors over the WAN to the
cloud (Google Compute Engine).]

Business Continuity and Efficiency

Efficient design
§  Differential deltas are updated
§  Compressed and check-summed

Easy to manage
§  Scheduled or on-demand
§  WAN, remote seeding
§  Consistent point-in-time
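A hedged sketch of driving a mirror with maprcli (per MapR documentation
of the era, mirror volumes are created with a -source volume and a mirror
type; all names here are illustrative):

    # Create a mirror of a production volume, then sync it on demand.
    maprcli volume create -name production-mirror -path /mirrors/production \
        -source production@datacenter1 -type mirror
    maprcli volume mirror start -name production-mirror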
MapR Drives Hardware Performance

[Chart: % performance vs. Apache/CDH (0–450%) rising across hardware
tiers, from typical Hadoop commodity hardware up: 400 MB/s (<6 drives,
1 NIC, 6 cores, 24 GB DRAM); 1200 MB/s (12 x 5400 RPM drives, >1 NIC or
10GbE, 8 cores, 32 GB DRAM); 1800 MB/s (12 x 7200 RPM drives, >1 NIC or
10GbE, 12 cores, 48 GB DRAM); SSD (2 x 10GbE, 12+ cores, 64 GB DRAM).]

Why is MapR faster and more efficient?
§  No redundant layers (not a file system over a file system)
§  C/C++ vs. Java (higher performance and no garbage collection freezes)
§  Distributed metadata
§  Native compression
§  Optimized shuffle
§  Advanced cache manager
§  Port scaling (multi-NIC support) and high-speed RPC
Designed for Performance and Scale

                       MapR                 Apache/CDH
  TeraSort w/ 1x replication (no compression)
  Total                24 min 34 sec        49 min 33 sec
  Map                  9 min 54 sec         28 min 12 sec
  Shuffle              9 min 8 sec          27 min 0 sec
  TeraSort w/ 3x replication (no compression)
  Total                47 min 4 sec         73 min 42 sec
  Map                  11 min 2 sec         30 min 8 sec
  Shuffle              9 min 17 sec         28 min 40 sec
  DFSIO/local write
  Throughput/node      870 MB/s             240 MB/s
  YCSB (HBase benchmark, 50% read, 50% update)
  Throughput           33,102 ops/sec       7,904 ops/sec
  Latency (r/u)        2.9-4 ms / 0.4 ms    7-30 ms / 0-5 ms
  YCSB (HBase benchmark, 95% read, 5% update)
  Throughput           18,000 ops/sec       8,500 ops/sec
  Latency (r/u)        5.5-5.7 ms / 0.6 ms  12-30 ms / 1 ms

  HW: 10 servers, 2 x 4 cores (2.4 GHz), 11 x 2 TB disks, 32 GB RAM
Customer Support

•  24x7x365 “Follow-The-Sun” coverage
   –  Critical customer issues are worked on around the clock
•  Dedicated team of Hadoop engineering experts
•  Contacting MapR support
   –  Email: support@mapr.com (automatically opens a case)
   –  Phone: 1.855.669.6277
   –  Self-service options:
      §  http://answers.mapr.com/
      §  Web portal: http://mapr.com/support
Two MapR Editions – M3 and M5

M3
§  Control System
§  NFS Access
§  Performance
§  Unlimited Nodes
§  Free

M5
§  Control System
§  NFS Access
§  Performance
§  High Availability
§  Snapshots & Mirroring
§  24 x 7 Support
§  Annual Subscription

Also available through Google Compute Engine.
Try MapR on Google Compute Engine
www.mapr.com/google
Apache Drill
 Interactive Analysis of Large-Scale Datasets
Latency Matters

•    Ad-hoc analysis with interactive tools

•    Real-time dashboards

•    Event/trend detection and analysis
      •  Network intrusion analysis on the fly
      •  Fraud
      •  Failure detection and analysis
Big Data Processing

                        Batch processing    Interactive analysis      Stream processing
  Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
  Data volume           TBs to PBs          GBs to PBs                Continuous stream
  Programming model     MapReduce           Queries                   DAG
  Users                 Developers          Analysts and developers   Developers
  Google project        MapReduce           Dremel                    –
  Open source project   Hadoop MapReduce    –                         Storm and S4




          Introducing Apache Drill…
Innovations
•  MapReduce
    •    Scalable IO and compute trump per-node efficiency on today's commodity hardware
    •    With large datasets, schemas and indexes are too limiting
    •    Flexibility is more important than efficiency
    •    An easy-to-use, scalable, fault-tolerant execution framework is key for
         large clusters
•  Dremel
    •    Columnar storage provides significant performance benefits at scale
    •    Columnar storage with nesting preserves structure and can be very efficient
    •    Avoiding final record assembly as long as possible improves efficiency
    •    Optimizing for the query use case can avoid the full generality of MR and thus
         significantly reduce latency. No need to start JVMs, just push compact queries to
         running agents.
•  Apache Drill
    •  Open source project based upon Dremel’s ideas
    •  More flexibility and openness
More Reading on Apache Drill
•    MapR and Apache Drill
      •  http://www.mapr.com/drill
•    Apache Drill project page
      •  http://incubator.apache.org/projects/drill.html
•    Google’s Dremel
      •  http://research.google.com/pubs/pub36632.html
•    Google’s BigQuery
      •  https://developers.google.com/bigquery/docs/query-reference
•    MIT’s C-Store – a columnar database
      •  http://db.csail.mit.edu/projects/cstore/
•    Microsoft’s Dryad
      •  Distributed execution engine
      •  http://research.microsoft.com/en-us/projects/dryad/
•    Google’s Protobufs
      •  https://developers.google.com/protocol-buffers/docs/proto
