1. Deploying Hadoop-Based Bigdata Environments
“[Tall] Tales From The Frontier”
Roman Shaposhnik
rvs@apache.org, Cloudera Inc.
2. $ whoami
An open source software developer
Linux kernel, C/C++ compilers, FFmpeg, Plan9
A Hadoop and all-around UNIX guy
root@cloudera
Member of the “Kitchen” team
Apache Software Foundation Incubator PMC
[Bigtop], Hadoop Development Tools, Celix, Helix
VP of Apache Bigtop
7. One way of using Apache software
$ wget http://apache.org/httpd.tar.gz
$ tar xzvf httpd.tar.gz
$ cd httpd
$ ./configure ; make
$ make install
ERROR: can't write to /usr/local/bin
$ sudo make install
8. A different way
$ sudo apt-get install httpd
Would you like to also upgrade your conf?
9. Is there apt-get install hadoop ?
Hadoop is still under very active development
Hadoop is Java-based
Hadoop is a distributed application
Hadoop is way more than HDFS + MR
10. Project-by-project approach
“Passively” maintained code
Packaging, OS-level (init.d)
Developer-centric view
Edit-compile-debug cycle vs. deployment
Lack of integration testing
Differences in distributions/packaging:
Where is this valid: /usr/libexec ?
Combinatorial explosion of dependencies
12. Dependencies Inferno:
Hive 0.8.1
HBase (0.92, 0.90)
Hadoop (1.0, 0.22, 0.23)
A million-dollar question:
$ tar xzvf hive-0.8.1.tar.gz
$ ls hive-0.8.1/lib
hbase-0.89.jar log4j-1.2.15.jar log4j-1.2.16.jar
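The listing above carries two log4j versions, plus an HBase jar that matches neither of the HBase versions named on the slide. A minimal sketch of how such duplicates could be spotted, assuming jars are named <artifact>-<version>.jar; the dup_jars helper and the scratch directory are hypothetical and not part of Hive or Bigtop:

```shell
# Hypothetical helper: print artifact names that appear with more than one
# version in a directory of jars named <artifact>-<version>.jar.
dup_jars() {
  ls "$1" \
    | sed -n 's/^\(.*\)-[0-9][0-9.]*\.jar$/\1/p' \
    | sort \
    | uniq -d
}

# Demo on a scratch directory that mimics the listing above (empty files only).
demo_lib=$(mktemp -d)
touch "$demo_lib/hbase-0.89.jar" \
      "$demo_lib/log4j-1.2.15.jar" \
      "$demo_lib/log4j-1.2.16.jar"

dup_jars "$demo_lib"
```

On the three file names from the slide this prints log4j. It only catches same-name duplicates; a version mismatch such as hbase-0.89 against a newer cluster still needs a check against the installed stack.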
17. System software deployment
Packages vs. Puppet code
package/file/service
What is packaging?
dependency tracking
build encapsulation
java packaging
file layout
user creation
service registration
18. Does it really work?
Java packaging
maven/ivy integration
file layout
side-by-side installations of the same package
user creation
LDAP/AD provisioning
service registration
start on install vs. start on reboot
19. Petascale distributed systems
Scale
Yahoo! ~5000 nodes
Deployment orchestration
Kerberos::Host_keytab <| title == "hdfs" |> ->
Service["hadoop-hdfs-datanode"]
Highly coordinated distributed system
It ain't HTTPD/loadbalancer
Rolling upgrades/asynchronous rollbacks
20. Back to tarballs and shell?
What's better for Puppet: fpm or rpm?
What is the role of Puppet?
coordinating the entire system: lack of DSL
converging an isolated node: will it ever work?
a building block for an agent-based system
One agent to rule them all?
there's no spoon^H^H^H^H^H agent: Whirr
MCollective
Cloudera Manager, Ambari
21. Evolution, not perfection!
Minimalistic, highly consistent packages
/usr/lib/hadoop, /etc/hadoop/conf (alternative)
fail gracefully: .... || : )
Java packaging is not solved [yet]: symlinks
Minimalistic Puppet code
package/file/service
masterless (most of the time)
integration with Whirr
BoxGrinder
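The "|| :" fragment above reads like the common shell idiom for swallowing a non-critical failure: ":" is a no-op builtin that always succeeds, so "cmd || :" cannot fail. A minimal sketch, assuming a package scriptlet that runs under "set -e"; run_scriptlet and the pid-file path are made up for illustration:

```shell
# Hypothetical scriptlet body, run in a subshell so 'set -e' stays local to it.
run_scriptlet() (
  set -e                                            # any unguarded failure aborts here
  rm /nonexistent/hadoop-demo.pid 2>/dev/null || :  # rm fails, ':' turns that into success
  echo "scriptlet completed"                        # still reached
)

run_scriptlet
```

Without the "|| :" guard the failing rm would end the subshell before the echo, which is the kind of hard stop a package install or upgrade usually should not hit for an optional cleanup step.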
22. The road ahead
New kind of configuration management
/etc/hadoop vs Zookeeper
New kinds of system packaging
Parcels (tarballs + metadata)
HPS (Hadoop Packaging System)
Orchestration: to puppet or not to puppet?
Cloudera Manager
Apache Ambari (incubating)
Reactor 8: http://reactor8.com
23. Java Packaging
Fate of Java
OpenJDK
OSGi
Hadoop's view: MAPREDUCE-1700
https://issues.apache.org/jira/browse/MAPREDUCE-1700
Project Jigsaw
Language tie-ins? Really?
Linux vendors getting their act together
24. Integration testing
Clean room provisioning
Those ain't unit tests – they trash the system
Cluster topology and cluster state discovery
How can puppet help us?
Cluster state manipulation
Test-driven orchestration
Chaos Monkey
How to be successful in OS co-opetition
Make everything pluggable (and subvert ;-))
25. Anatomy of iTest
Versioned, JVM-based test/data artifacts
Dependency between test artifacts
Matching stack of integration tests
Implementation
Maven artifacts, pom files
JUnit test-execution entry point
Groovy for scripting
26. Who's the target audience?
End users
YOU!
ASF Projects/Bigdata developers
from Avro to Zookeeper
Bigdata solutions vendors
Cloudera, EMC, Hortonworks, Karmasphere
DevOps
eBay, Yahoo, Facebook, LinkedIn
27. Who's on-board?
Cloudera
CDH4 is 100% based on Bigtop (Hadoop v2)
Available @cloudera.com
Canonical
Ubuntu Server: Hadoop and Bigdata blueprint
https://blueprints.launchpad.net/ubuntu/+spec/servercloud-p-hdp-hadoop
TrendMicro
Hortonworks (partially)
EMC, eBay (early stages of prototyping)
28. What's happening?
A special release: Bigtop 0.3.0-incubating
Hadoop 1.0.1
Last stable release: Bigtop 0.5.0
Hadoop 2.0.2-alpha
Next stable release: Bigtop 0.6.0
End of Mar 2013 release
Hadoop 2.0.3-beta
Major focus on developers
29. What does Bigtop need from you?
More of you!
Meetup: “Silicon Valley Hands-on Programming”
http://www.meetup.com/HandsOnProgrammingEvents/
More infrastructure for build/test
EC2, Supercell, EMC magic cluster, CloudStack
More integration tests
Convince your bosses to commit to Bigtop
Validate upstream release using Bigtop
30. Contact
Bigtop home @Apache:
http://incubator.apache.org/bigtop/
Hangout places:
{dev,user}@bigtop.apache.org
#bigtop on Freenode
Roman Shaposhnik
rvs@apache.org, rvs@cloudera.com