Roman Shaposhnik of Cloudera and the Apache Software Foundation speaks on "Deploying Hadoop-Based Bigdata Environments: [Tall] Tales from the Frontier" at Puppet Camp Silicon Valley 2012.
1. Deploying Hadoop-Based Bigdata Environments
"[Tall] Tales From The Frontier"
Roman Shaposhnik
rvs@apache.org, Cloudera Inc.
2. $ whoami
• An open source software developer
  • Linux kernel, C/C++ compilers, FFmpeg, Plan9
• A Hadoop and all-around UNIX guy
• root@cloudera
  • Member of the "Kitchen" team
• Apache Software Foundation Incubator PMC
  • [Bigtop], Hadoop Development Tools, Celix, Helix
• VP of Apache Bigtop
7. One way of using Apache software
$ wget http://apache.org/httpd.tar.gz
$ tar xzvf httpd.tar.gz
$ cd httpd
$ ./configure ; make
$ make install
ERROR: can't write to /usr/local/bin
$ sudo make install
8. A different way
$ sudo apt-get install httpd
Would you like to also upgrade your conf?
9. Is there apt-get install hadoop ?
• Hadoop is still under very active development
• Hadoop is Java-based
• Hadoop is a distributed application
• Hadoop is way more than HDFS + MR
10. Project-by-project approach
• "Passively" maintained code
  • Packaging, OS-level (init.d)
• Developer-centric view
  • Edit-compile-debug cycle vs. deployment
  • Lack of integration testing
• Differences in distributions/packaging
  • Where is this valid: /usr/libexec ?
  • Combinatoric explosion of dependencies
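The /usr/libexec question is easy to make concrete: whether that directory even exists differs by distribution (Fedora-style layouts use it; Debian's layout historically does not), so portable packaging has to probe rather than assume. A minimal sketch, with an illustrative (not exhaustive) candidate list:

```shell
# Probe candidate helper/library directories instead of hard-coding one.
for d in /usr/libexec /usr/lib /usr/lib64; do
  if [ -d "$d" ]; then
    echo "present: $d"
  else
    echo "absent:  $d"
  fi
done
```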
11. Dependencies Inferno: Hive 0.8.1
[diagram: Hive 0.8.1 pulling in HBase (0.92, 0.90) and Hadoop (1.0, 0.22, 0.23)]
A million dollar question:
$ tar xzvf hive-0.8.1.tar.gz
$ ls hive-0.8.1/lib
12. Dependencies Inferno: Hive 0.8.1
[diagram: Hive 0.8.1 pulling in HBase (0.92, 0.90) and Hadoop (1.0, 0.22, 0.23)]
A million dollar question:
$ tar xzvf hive-0.8.1.tar.gz
$ ls hive-0.8.1/lib
hbase-0.89.jar log4j-1.2.15.jar log4j-1.2.16.jar
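The jar clash on the slide can be caught mechanically. A small sketch; the lib/ contents are faked with `touch` so it is self-contained, but in practice you would point it at a real unpacked tarball:

```shell
# Recreate the lib/ listing from the slide with empty stand-in files.
mkdir -p hive-0.8.1/lib
touch hive-0.8.1/lib/hbase-0.89.jar \
      hive-0.8.1/lib/log4j-1.2.15.jar \
      hive-0.8.1/lib/log4j-1.2.16.jar

# Strip the "-<version>.jar" suffix and report artifacts shipped twice:
ls hive-0.8.1/lib | sed 's/-[0-9][0-9.]*\.jar$//' | sort | uniq -d
# prints: log4j
```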
17. System software deployment
• Packages vs. Puppet code
  • package/file/service
• What is packaging?
  • dependency tracking
  • build encapsulation
  • java packaging
  • file layout
  • user creation
  • service registration
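In package terms, "user creation" and "service registration" live in a maintainer (post-install) script. A sketch of the shape such a script takes; the user and service names are illustrative, and the privileged commands are echoed rather than executed so the sketch runs unprivileged:

```shell
svc_user=hdfs
svc_name=hadoop-hdfs-datanode

# User creation: only if the system user is not already there.
if getent passwd "$svc_user" >/dev/null 2>&1; then
  echo "user $svc_user already exists"
else
  echo "would run: useradd --system $svc_user"
fi

# Service registration (init.d era, as on the slide):
echo "would run: update-rc.d $svc_name defaults"
```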
18. Does it really work?
• Java packaging
  • maven/ivy integration
• file layout
  • side-by-side installations of the same package
• user creation
  • LDAP/AD provisioning
• service registration
  • start on install vs. start on reboot
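The "start on install vs. start on reboot" question is controllable on Debian-family systems via policy-rc.d: maintainer scripts consult it, and exit code 101 means "do not start the service now". A sketch that exercises the mechanism in a scratch directory instead of touching /usr/sbin:

```shell
fake_root=$(mktemp -d)
mkdir -p "$fake_root/usr/sbin"

# 101 = "action forbidden": package installs must not start services.
cat > "$fake_root/usr/sbin/policy-rc.d" <<'EOF'
#!/bin/sh
exit 101
EOF
chmod +x "$fake_root/usr/sbin/policy-rc.d"

if "$fake_root/usr/sbin/policy-rc.d"; then
  echo "services may start on install"
else
  echo "service start on install blocked (exit $?)"
fi
# prints: service start on install blocked (exit 101)
```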
19. Petascale distributed systems
• Scale
  • Yahoo! ~5000 nodes
• Deployment orchestration
  • Kerberos::Host_keytab <| title == "hdfs" |> -> Service["hadoop-hdfs-datanode"]
• Highly coordinated distributed system
  • It ain't HTTPD/loadbalancer
  • Rolling upgrades/asynchronous rollbacks
20. Back to tarballs and shell?
• What's better for Puppet: fpm or rpm?
• What is the role of Puppet?
  • coordinating the entire system: lack of DSL
  • converging an isolated node: will it ever work?
  • a building block for an agent-based system
• One agent to rule them all?
  • there's no spoon^H^H^H^H^H agent: Whirr
  • MCollective
  • Cloudera Manager, Ambari
21. Evolution, not perfection!
• Minimalistic, highly consistent packages
  • /usr/lib/hadoop, /etc/hadoop/conf (alternative)
  • fail gracefully: .... || : )
  • Java packaging is not solved [yet]: symlinks
• Minimalistic Puppet code
  • package/file/service
  • masterless (most of the time)
• Integration with Whirr
• BoxGrinder
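The "/etc/hadoop/conf (alternative)" bullet refers to the Linux alternatives mechanism: the config directory is a managed symlink, so pointing a node at a different configuration is a single atomic flip. A sketch of the underlying symlink dance in a scratch directory; on a real system the packages do this bookkeeping with `update-alternatives`, which needs root:

```shell
scratch=$(mktemp -d)
mkdir -p "$scratch/etc/hadoop/conf.empty" "$scratch/etc/hadoop/conf.my_cluster"

# Package default: conf points at the stock (empty) configuration.
ln -s conf.empty "$scratch/etc/hadoop/conf"

# Switching the node to a cluster-specific config is one symlink flip:
ln -sfn conf.my_cluster "$scratch/etc/hadoop/conf"
readlink "$scratch/etc/hadoop/conf"
# prints: conf.my_cluster
```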
22. The road ahead
• New kind of configuration management
  • /etc/hadoop vs. Zookeeper
• New kinds of system packaging
  • Parcels (tarballs + metadata)
  • HPS (Hadoop Packaging System)
• Orchestration: to puppet or not to puppet?
  • Cloudera Manager
  • Apache Ambari (incubating)
  • Reactor 8: http://reactor8.com
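A parcel in the "tarballs + metadata" sense is an archive that carries its own descriptor alongside the bits. The layout below is an illustrative sketch of that idea, not the exact parcel spec:

```shell
work=$(mktemp -d)
cd "$work"

# A relocatable tree plus a metadata file describing it.
mkdir -p HADOOP-2.0.2/meta HADOOP-2.0.2/lib
cat > HADOOP-2.0.2/meta/parcel.json <<'EOF'
{ "name": "HADOOP", "version": "2.0.2" }
EOF

tar czf HADOOP-2.0.2.parcel HADOOP-2.0.2
# The metadata travels inside the same artifact as the bits:
tar tzf HADOOP-2.0.2.parcel | grep meta/parcel.json
# prints: HADOOP-2.0.2/meta/parcel.json
```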
23. Java Packaging
• Fate of Java
  • OpenJDK
  • OSGi
• Hadoop's view: MAPREDUCE-1700
  • https://issues.apache.org/jira/browse/MAPREDUCE-1700
• Project Jigsaw
  • Language tie-ins? Really?
• Linux vendors getting their act together
24. Integration testing
• Clean-room provisioning
  • Those ain't unit tests: they trash the system
• Cluster topology and cluster state discovery
  • How can Puppet help us?
• Cluster state manipulation
  • Test-driven orchestration
  • Chaos Monkey
• How to be successful in OS co-opetition
  • Make everything pluggable (and subvert ;-))
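"They trash the system" is a checkable property: fingerprint the state a test is allowed to touch before and after the run, and fail if the fingerprints differ. A self-contained sketch with a fake config tree; the paths and the "test" in the middle are stand-ins:

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/etc/hadoop"
echo "fs.default.name=hdfs://nn:8020" > "$sandbox/etc/hadoop/core-site.xml"

# Hash every file, then hash the sorted list: one fingerprint per tree.
fingerprint() {
  (cd "$sandbox" && find . -type f -exec md5sum {} + | sort | md5sum)
}

before=$(fingerprint)
echo "dfs.replication=1" >> "$sandbox/etc/hadoop/core-site.xml"  # the "test" trashes state
sed -i '$d' "$sandbox/etc/hadoop/core-site.xml"                  # ...and restores it
after=$(fingerprint)

if [ "$before" = "$after" ]; then
  echo "clean room preserved"
else
  echo "test leaked state"
fi
# prints: clean room preserved
```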
25. Anatomy of iTest
• Versioned, JVM-based test/data artifacts
• Dependency between test artifacts
• Matching stack of integration tests
• Implementation
  • Maven artifacts, pom files
  • JUnit test-execution entry point
  • Groovy for scripting
26. Who's the target audience?
• End users
  • YOU!
• ASF projects / Bigdata developers
  • from Avro to Zookeeper
• Bigdata solutions vendors
  • Cloudera, EMC, Hortonworks, Karmasphere
• DevOps
  • eBay, Yahoo, Facebook, LinkedIn
27. Who's on board?
• Cloudera
  • CDH4 is 100% based on Bigtop (Hadoop v2)
  • Available @cloudera.com
• Canonical
  • Ubuntu Server: Hadoop and Bigdata blueprint
  • https://blueprints.launchpad.net/ubuntu/+spec/servercloud-p-hdp-hadoop
• TrendMicro
• Hortonworks (partially)
• EMC, eBay (early stages of prototyping)
28. What's happening?
• A special release: Bigtop 0.3.0-incubating
  • Hadoop 1.0.1
• Last stable release: Bigtop 0.5.0
  • Hadoop 2.0.2-alpha
• Next stable release: Bigtop 0.6.0
  • end of March 2013
  • Hadoop 2.0.3-beta
  • major focus on developers
29. What does Bigtop need from you?
• More of you!
  • Meetup: "Silicon Valley Hands-on Programming"
  • http://www.meetup.com/HandsOnProgrammingEvents/
• More infrastructure for build/test
  • EC2, Supercell, EMC magic cluster, CloudStack
• More integration tests
• Convince your bosses to commit to Bigtop
• Validate upstream releases using Bigtop
30. Contact
• Bigtop home @Apache:
  • http://incubator.apache.org/bigtop/
• Hangout places:
  • {dev,user}@bigtop.apache.org
  • #bigtop on Freenode
• Roman Shaposhnik
  • rvs@apache.org, rvs@cloudera.com