The Zoo Expands	

Labrador 💛 Elephant, Thanks to Hamster
Milind Bhandarkar	

Chief Scientist, Pivotal Software, Inc.
About Me
• http://www.linkedin.com/in/milindb
• Founding member of the Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led the Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center
for Supercomputing Applications (NCSA), Center for Simulation of Advanced
Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by
QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
Hamster
• Hadoop and MPI on the same cluster
• Runtime for OpenMPI applications on YARN
• Available on Pivotal HD
Why MPI?
• Hadoop dataflow paradigms (MapReduce, Tez, etc.) are not well suited to iterative applications
• Message Passing Interface (MPI)
  • Mature standard
  • Used extensively in HPC
  • Huge ecosystem
MPI in Science & Engineering
[Figure: image panels labeled Earth, Atmosphere, Chemistry, Biology, Math, Nuclear]
MPI in Industry
[Figure: image panels labeled Mechanical ar, Finance/bank, Oil Exploration, Cryptography, Spacecraft]
OpenMPI
• Mature open source implementation of the MPI 3.0 standard (mpi-forum.org)
• New BSD license
• 30+ contributing organizations from academia, research and industry
• http://open-mpi.org
OpenMPI Architecture
[Figure: OpenMPI architecture diagram; the only label that survives is "Pluggable"]
Hamster Design
• YARN as Resource Manager
• Hamster Application Manager
  • Manages MPI jobs
  • (Tries to) implement gang scheduling
• Leverages OMPI/ORTE strengths
  • Wire-up, task monitoring, fast interconnect
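
Gang scheduling means that an MPI job starts only when all of its processes can start together. YARN typically hands containers to an application master incrementally, which is presumably why the slide says "tries to". The sketch below illustrates the all-or-nothing idea only. The class `GangAllocator`, its methods, and the container ids are hypothetical names chosen for this example; they are not part of Hamster, YARN, or OpenMPI.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GangAllocator:
    """Hypothetical helper: holds containers until the whole gang can start."""

    needed: int                                   # MPI processes in the job
    held: List[str] = field(default_factory=list)  # containers granted so far

    def on_containers_granted(self, containers: List[str]) -> Optional[List[str]]:
        """Record newly granted containers.

        Returns the full gang once enough containers are held, otherwise None.
        """
        self.held.extend(containers)
        if len(self.held) < self.needed:
            return None  # keep waiting; launch nothing yet
        gang, self.held = self.held[:self.needed], self.held[self.needed:]
        return gang

    def give_up(self) -> List[str]:
        """Hand back every held container, e.g. after a timeout."""
        released, self.held = self.held, []
        return released


alloc = GangAllocator(needed=4)
print(alloc.on_containers_granted(["c1", "c2"]))        # not enough yet
print(alloc.on_containers_granted(["c3", "c4", "c5"]))  # gang is complete
print(alloc.give_up())                                  # surplus to release
```

A partial grant launches nothing, so no MPI rank ever waits on a peer that was never started. Holding containers while waiting can waste cluster capacity, which is why a real implementation would need a timeout or release policy along the lines of `give_up`.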
Hamster Architecture
[Figure: Hamster architecture on YARN. Labels: Client; Resource Manager (Scheduler, AMService); Node Managers, each with Aux Srvcs and Proc/Containers; Framework Daemon NS; MPI AM (MPI Scheduler, HNP). Protocol links: Client-RM, RM-AM, AM-NM, RM-NodeManager.]
Hamster AppMaster
• Master daemon for MPI (similar to JobTracker in MapReduce)
• Implements and participates in the YARN-RM application lifecycle protocol
• Maintains a heartbeat with the RM to ensure liveness
• MPI Scheduler: negotiates resource allocation with the YARN-RM
• Head Node Process (HNP): manages job execution
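
The heartbeat bullet above can be pictured with a small liveness tracker: each application master reports in periodically, and any master that stays silent longer than an expiry interval is treated as dead. This is a minimal sketch of the general technique, not YARN's implementation. `LivenessMonitor`, the job ids, and the 10-second expiry are all assumptions made for illustration.

```python
from typing import Dict, List


class LivenessMonitor:
    """Hypothetical tracker of the last heartbeat seen from each app master."""

    def __init__(self, expiry_seconds: float) -> None:
        self.expiry_seconds = expiry_seconds
        self.last_seen: Dict[str, float] = {}  # app id -> time of last heartbeat

    def heartbeat(self, app_id: str, now: float) -> None:
        """Record that app_id reported in at time `now` (seconds)."""
        self.last_seen[app_id] = now

    def expired(self, now: float) -> List[str]:
        """Return the ids whose last heartbeat is older than the expiry interval."""
        return sorted(
            app_id
            for app_id, seen in self.last_seen.items()
            if now - seen > self.expiry_seconds
        )


monitor = LivenessMonitor(expiry_seconds=10.0)  # made-up expiry value
monitor.heartbeat("mpi-job-1", now=0.0)
monitor.heartbeat("mpi-job-2", now=0.0)
monitor.heartbeat("mpi-job-1", now=8.0)         # job 1 keeps reporting
print(monitor.expired(now=12.0))                # job 2 has been silent for 12 s
```

Time is passed in explicitly rather than read from a clock, which keeps the example deterministic and easy to test.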
Hamster Node Service
• User-level daemon per MPI job
• Manages task execution
• Coarse-grained container management
• Bootstrapped by YARN-NM
• Implemented as a YARN Auxiliary Service
Why GraphLab on Hadoop?
• Graph analytics and machine learning are only one stage in an end-to-end (E2E) data pipeline; other stages include:
  • ETL/preprocessing
  • Building graphs from fact and dimension tables
  • Publishing analytics results, post-processing
GraphLab 2.2
• Communication patterns based on data
• Several toolkits (graph analytics + ML algorithms) available
• Graph-programming API
• Uses MPI for communication
Pivotal HD
[Figure: Pivotal HD stack diagram with an Apache/Pivotal legend. Labels that survive: HDFS; HBase; Pig, Hive, Mahout; MapReduce; Sqoop; Flume; Resource Management & Workflow; Yarn; Zookeeper; Oozie; Command Center (Configure, Deploy, Monitor, Manage); Spring XD; Pivotal HD Enterprise; Spring; Xtension Framework; Catalog Services; Query Optimizer; Dynamic Pipelining; ANSI SQL + Analytics; HAWQ – Advanced Database Services; Distributed In-memory Store; Query; Transactions; Ingestion; Processing; Hadoop Driver – Parallel with Compaction; ANSI SQL + In-Memory; GemFire XD – Real-Time Database Services; MADlib Algorithms; Virtual Extensions; GraphLab; Open MPI.]
Performance
Test Environment
• Pivotal Analytics Workbench cluster
• Pivotal HD 1.1 (Apache Hadoop 2.0.5)
• Hamster 1.0, OpenMPI 1.7.2
• 515 nodes
• Each node: 2x 6-core Westmere, 48 GB RAM, 12x 2 TB SATA, Mellanox FDR InfiniBand
Null Job
• Measures the overhead of launching MPI jobs
• Tests scalability of resource allocation, launching and wire-up
• Sub-linear scalability (slightly worse than O(log N))
• Overhead of launching 15,000 processes: 1 minute
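
To put the launch figure in perspective, the short calculation below spreads the reported one minute over 15,000 processes. It also shows how much an ideal O(log N) cost would grow between two process counts. The second number is an illustration of the scaling model only, not a measurement from the talk; the 1,000-process starting point and both function names are assumptions chosen for the example.

```python
import math


def amortized_launch_ms(total_seconds: float, processes: int) -> float:
    """Average launch overhead per process, in milliseconds."""
    return total_seconds * 1000.0 / processes


def log_growth_factor(n_small: int, n_large: int) -> float:
    """Factor by which a pure O(log N) cost grows from n_small to n_large."""
    return math.log(n_large) / math.log(n_small)


# Figure from the slide: 15,000 processes launched in 1 minute.
print(f"{amortized_launch_ms(60.0, 15000):.1f} ms per process")

# Illustration only (not a measurement): growth of an ideal O(log N) cost
# when the process count rises from 1,000 to 15,000.
print(f"{log_growth_factor(1000, 15000):.2f}x")
```

The amortized overhead works out to 4 ms per process. Under a pure O(log N) model, a 15-fold increase in processes would raise the cost by a factor of only about 1.39, which is the sense in which the slide calls the measured behavior "slightly worse" than logarithmic.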
[Figure: Total Runtime. E2E time in seconds (axis 5 to 60) versus number of processes (axis 0 to 16000).]
[Figure: Allocation Time. Time in seconds (axis 1 to 6) versus number of processes (axis 0 to 16000).]
[Figure: Launch Time. Time in seconds (axis 0 to 30) versus number of processes (axis 0 to 16000).]
Comparison with OpenMPI
• HPL (High-Performance Linpack, the Top-500 benchmark)
• Number of processes: 50 to 1000
• Hamster is 1% slower than OpenMPI
HPL - Hamster vs OpenMPI
[Figure: HPL run time in seconds (axis 0 to 120) for 1000, 500, 200 and 50 processes, Hamster versus OpenMPI.]
GraphLab ALS
• Wikipedia dataset
• 4.3M terms, 3.3M documents, 513M occurrences
• 17 processes
• 5 iterations
GraphLab ALS
[Figure: GraphLab ALS run time in seconds (axis 0 to 1340), Hamster versus OpenMPI.]
GraphLab PageRank
• Twitter dataset
• 4.1M nodes, 1.4B edges
• Data size: 26 GB
• 17 processes (NP = 17)
• 50 iterations: 297 seconds
• 100 iterations: 339 seconds
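
The two PageRank timings above allow a rough split between fixed cost and per-iteration cost. The sketch assumes run time is linear in the iteration count, T(n) = fixed + per_iteration * n, which the slide does not state; with only two data points this is a back-of-the-envelope estimate rather than a fit. The function name `fit_line` is made up for the example.

```python
from typing import Tuple


def fit_line(n1: int, t1: float, n2: int, t2: float) -> Tuple[float, float]:
    """Solve T(n) = fixed + per_iteration * n through two (n, T) points."""
    per_iteration = (t2 - t1) / (n2 - n1)
    fixed = t1 - per_iteration * n1
    return fixed, per_iteration


# Timings from the slide: 50 iterations in 297 s, 100 iterations in 339 s.
fixed, per_iter = fit_line(50, 297.0, 100, 339.0)
print(f"fixed cost: {fixed:.0f} s, per iteration: {per_iter:.2f} s")
```

Under that assumption each iteration costs about 0.84 s, and about 255 s of the run does not depend on the iteration count. One plausible reading is that this fixed part covers work such as loading and partitioning the 26 GB graph, but the slides do not break it down.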
Questions?
