TensorFlow on Spark: A Deep Dive into Distributed Deep Learning - Evans Ye
Deep learning has become the de facto standard for data scientists building data products, especially for text- and image-specific problems. With GPUs, deep learning can achieve a 10-100x performance improvement over traditional CPU processing. That makes a huge difference and can sometimes turn a business project from infeasible to feasible.
In this talk, we'll dive deep into how Verizon Media (Yahoo) tackles the problem of distributed deep learning. First, we'll give an overview of Verizon Media's open-sourced solution, TensorFlowOnSpark. We'll also walk through several distributed GPU training solutions and the differences between their system architectures. Second, we'll present a more lightweight DL-on-Spark solution built by the team I lead, focused on usability, productivity, and flexibility. The solution uses several advanced PySpark features and is built around PySpark's developer-friendly characteristics to make distributed DL as easy as ever for data scientists.
Getting involved in world class software engineering tips and tricks to join ... - Evans Ye
Trend Micro has been involved in Hadoop-related Apache open source projects for a long time. So far, we have contributed to projects such as Hadoop, HBase, Pig, and Bigtop. In this talk, I'll share some features we developed and our experience joining the Apache community. Specifically, the talk is composed of the following sections:
• My development in Apache Bigtop
• Tips and tricks for joining the community
• Apache Bigtop status quo
• Feature preview of recent development: Docker-based Hadoop provisioning
Let's make some contributions to open source projects and build up your personal influence in the digital world!
Edge to AI: Analytics from Edge to Cloud with Efficient Movement of Machine Data - Timothy Spann
This is my talk from DataWorks Summit Barcelona at 2pm on Thursday March 21, 2019.
https://dataworkssummit.com/barcelona-2019/session/edge-to-ai-analytics-from-edge-to-cloud-with-efficient-movement-of-machine-data/
Timothy Spann
Senior Solutions Engineer
Cloudera; formerly Hortonworks and Pivotal.
It shows how to run AI on edge devices, in NiFi flows and in CDSW.
With Dask and Numba, you can write NumPy-like and Pandas-like code and have it run very fast on multi-core systems as well as at scale on many-node clusters.
Python in the Hadoop Ecosystem (Rock Health presentation) - Uri Laserson
A presentation covering the use of Python frameworks on the Hadoop ecosystem. Covers, in particular, Hadoop Streaming, mrjob, luigi, PySpark, and using Numba with Impala.
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning - DataWorks Summit
Big data and AI are joined at the hip: AI applications require massive amounts of training data to build state-of-the-art models. The problem is, big data frameworks like Apache Spark and distributed deep learning frameworks like TensorFlow don’t play well together due to the disparity between how big data jobs are executed and how deep learning jobs are executed.
Apache Spark 2.4 introduced a new scheduling primitive: barrier scheduling. Users can tell Spark whether each stage of the pipeline should run in MapReduce mode or barrier mode, which makes it easy to embed distributed deep learning training as a Spark stage and simplify the training workflow. In this talk, I will demonstrate, step by step, how to build a real-world pipeline that combines data processing with Spark and deep learning training with TensorFlow. I will also share best practices and hands-on experience to show the power of this new feature, and open up further discussion on the topic.
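The key property of barrier mode is that all tasks in a stage start together and can synchronize at a rendezvous point, the way distributed training steps must. Outside Spark, the same semantics can be sketched with Python's standard-library `threading.Barrier` (worker logic here is a hypothetical illustration, not Spark API):

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
results = []
lock = threading.Lock()

def worker(rank):
    # Phase 1: independent work (analogous to per-task data loading).
    partial = rank * rank
    # Every worker blocks here until all peers arrive, mirroring the
    # all-or-nothing synchronization Spark's barrier mode guarantees.
    barrier.wait()
    # Phase 2: coordinated work (analogous to synchronized training steps).
    with lock:
        results.append(partial)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 4, 9]
```

In real PySpark code the entry point is `rdd.barrier().mapPartitions(...)`, with `BarrierTaskContext` providing the `barrier()` rendezvous inside each task.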
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it... - DataWorks Summit
DeepLearning4J (DL4J) is a powerful open-source distributed framework that brings deep learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure, and Kotlin programmers). It can be used on distributed GPUs and CPUs, and it is integrated with Hadoop and Apache Spark. ND4J is an open-source, distributed, GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models with DL4J, ND4J, and Spark is a powerful combination, but the overall cluster configuration can present some unexpected issues that compromise performance and nullify the benefits of well-written code and good model design. In this talk I will walk through some of those problems and present best practices to prevent them. The presented use cases will cover DL4J and ND4J on different Spark deployment modes (standalone, YARN, Kubernetes). The reference programming language for the code examples is Scala, but no prior Scala knowledge is required to understand the presented topics.
Apache Bigtop has created the de-facto standard in how Hadoop-based stacks are developed, delivered, and managed. We are at it again! The track will present the composition of the next generation of in-memory computing stack that is completely built out of open-source components. The next generation of the Apache data processing stack will focus on in-memory and transactional processing of large amounts of data. We will also be talking about performance benefits that legacy data-processing software based on MapReduce, Hive, and similar, can derive from in-memory computing. This session will discuss and analyze the benefits of practicing Fast Data in the open.
Scott Callaghan from the Southern California Earthquake Center presented this deck in a recent Blue Waters Webinar.
"I will present an overview of scientific workflows. I'll discuss what the community means by "workflows" and what elements make up a workflow. We'll talk about common problems that users might be facing, such as automation, job management, data staging, resource provisioning, and provenance tracking, and explain how workflow tools can help address these challenges. I'll present a brief example from my own work with a series of seismic codes showing how using workflow tools can improve scientific applications. I'll finish with an overview of high-level workflow concepts, with an aim to preparing users to get the most out of discussions of specific workflow tools and identify which tools would be best for them."
Watch the video: http://wp.me/p3RLHQ-gtH
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
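At their core, the workflow tools discussed above execute jobs in dependency order: a job runs only after its prerequisites finish. A minimal sketch in Python (the job names, echoing the seismic example, are hypothetical):

```python
# Minimal workflow engine: run jobs in dependency order via depth-first
# resolution of prerequisites. Assumes the dependency graph is acyclic.
def run_workflow(jobs, deps):
    """jobs: {name: callable}; deps: {name: [prerequisite names]}."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for prereq in deps.get(name, []):
            run(prereq)          # ensure prerequisites finish first
        jobs[name]()             # execute the job itself
        done.add(name)
        order.append(name)

    for name in jobs:
        run(name)
    return order

log = []
jobs = {
    "stage_data":   lambda: log.append("staged"),
    "run_seismic":  lambda: log.append("simulated"),
    "post_process": lambda: log.append("processed"),
}
deps = {"run_seismic": ["stage_data"], "post_process": ["run_seismic"]}
print(run_workflow(jobs, deps))  # ['stage_data', 'run_seismic', 'post_process']
```

Production workflow systems add exactly the concerns the talk lists on top of this core: retries and job management, data staging between steps, resource provisioning, and provenance tracking.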
Date: 2018-02-10, Taiwan Data Engineering Association 2018 Q1 Technical Workshop
Topic: Building Full Stack Monitor and Notification with Prometheus
As a hybrid cloud operator, are you tired of collecting monitoring metrics from different monitoring services? As a developer, do you need historical application and infrastructure metrics to debug or improve application performance? In this talk, I'll first explain, starting from the motivation, why we should build a full-stack monitoring and alerting platform with Prometheus and Grafana. I'll share my experience over the past quarter monitoring network devices, physical machines, virtual machines, Docker containers, middleware (e.g. Apache Cassandra, Apache Kafka, CNCF Fluentd), and application metrics for data pipelines. Since a real company environment cannot be shown, I'll demonstrate an end-to-end data pipeline dashboard built with Docker Compose, and introduce the different kinds of Prometheus exporters used for the different monitoring targets.
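Whatever the target, every exporter ultimately serves plain text in the Prometheus exposition format on a `/metrics` endpoint. A minimal rendering sketch (the metric names and labels are illustrative, not from any specific exporter):

```python
def render_metrics(metrics):
    """Render (name, labels, value) samples in the Prometheus text format:
    one 'name{label="value",...} value' line per sample."""
    lines = []
    for name, labels, value in metrics:
        if labels:
            # Labels are sorted so output is deterministic.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

sample = [
    ("node_cpu_seconds_total", {"mode": "idle", "cpu": "0"}, 12345.6),
    ("kafka_consumer_lag", {"topic": "events"}, 42),
]
print(render_metrics(sample))
```

In practice you would use an official client library (e.g. `prometheus_client` for Python) rather than formatting by hand; the sketch only shows why Prometheus can scrape network gear, VMs, containers, and middleware uniformly: they all speak this one text format.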
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis... - Spark Summit
Today there are several compliance use cases — archiving, e-discovery, supervision + surveillance, to name a few — that appear naturally suited as Hadoop workloads but haven’t seen wide adoption. In this talk, we’ll discuss common limitations, how Apache Spark helps, and propose some new blueprints as to how to modernize this architecture and disrupt existing solutions. Additionally, we’ll discuss the rising role of Apache Spark in this ecosystem; leveraging machine learning and advanced analytics in a space that has traditionally been restricted to fairly rote reporting.
Using Anaconda to light up dark data. My talk given to the Berkeley Institute of Data Science describing Anaconda and the Blaze ecosystem for bringing a virtual analytical database to your data.
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
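A chaos experiment, in miniature: inject failures into a dependency and verify the system's steady state still holds. A toy sketch (the flaky service and retry policy are hypothetical illustrations):

```python
import random

def flaky_service(fail_rate, rng):
    # Simulated dependency that fails randomly, standing in for the
    # turbulent conditions a chaos experiment deliberately injects.
    if rng.random() < fail_rate:
        raise ConnectionError("injected failure")
    return "ok"

def call_with_retry(fail_rate, rng, retries=5):
    # The resilience mechanism under test: bounded retries, then degrade.
    for _ in range(retries):
        try:
            return flaky_service(fail_rate, rng)
        except ConnectionError:
            continue
    return "degraded"

rng = random.Random(42)  # seeded so the experiment is reproducible
outcomes = [call_with_retry(0.3, rng) for _ in range(100)]
# Steady-state hypothesis: despite a 30% per-call failure rate,
# the vast majority of calls still succeed.
print(outcomes.count("ok"))
```

The real discipline does this against production-like systems (killing instances, adding latency, partitioning networks) and treats the steady-state check as the experiment's pass/fail criterion.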
Spark is a powerful, scalable, real-time data analytics engine that is fast becoming the de facto hub for data science and big data. In parallel, however, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data-savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This talk will discuss and show in action:
* Leveraging Spark and TensorFlow for hyperparameter tuning
* Leveraging Spark and TensorFlow for deploying trained models
* An examination of DeepLearning4J, CaffeOnSpark, IBM's SystemML, and Intel's BigDL
* Sidecar GPU cluster architecture and Spark-GPU data reading patterns
* Pros, cons, and performance characteristics of various approaches
Attendees will leave this session informed on:
* The available architectures for Spark and deep learning, with and without GPUs
* Several deep learning software frameworks, their pros and cons in the Spark context and for various use cases, and their performance characteristics
* A practical, applied methodology and technical examples for tackling big data deep learning
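The hyperparameter-tuning pattern in the first bullet has a simple shape: the driver fans independent training trials out to workers and collects the results. It can be sketched framework-free with `concurrent.futures`; `train_model` here is a hypothetical stand-in for a real TensorFlow training run:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_model(params):
    # Hypothetical stand-in for a TensorFlow training run: it returns a
    # fake "validation loss" computed from the hyperparameters alone.
    lr, batch = params
    return (lr - 0.01) ** 2 + abs(batch - 64) / 1000.0

grid = list(product([0.001, 0.01, 0.1], [32, 64, 128]))
# Each grid point trains independently, so the search parallelizes
# trivially; threads here play the role Spark executors play at scale.
with ThreadPoolExecutor(max_workers=4) as pool:
    losses = list(pool.map(train_model, grid))
best = grid[losses.index(min(losses))]
print(best)  # (0.01, 64)
```

With Spark, the same loop becomes a `parallelize(grid).map(train_model)` job, which is why tuning is usually the easiest place to combine the two platforms: no communication between trials is needed.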
The only way to get where we need to be in security analysis is to use Security Intelligence. This means working harder and understanding the big picture of your data.
Enterprise Approach towards Cost Savings and Enterprise Agility - NUS-ISS
Presented by Mr Poon See Hong, Deputy Director (Planning), Police Logistics Department, Singapore Police Force, at our 14th Architecture Community of Practice Forum on 21 Jul 2016.
Big Data Analytics (BDA) is rapidly turning out to be a significant global enterprise need. It aims to facilitate the storage, querying and analysis of enterprise big data, which is getting more complicated and time-consuming with traditional database technologies. Apache Hadoop is a well-known Open-source BDA enterprise solution which is seeing an annual application growth rate of 60% globally.
With the rise of Apache Hadoop, a next-generation enterprise data architecture is emerging that allows organizations to efficiently rein in their big data business transactions. Hadoop is uniquely capable of storing, aggregating, querying and analyzing big data sources into formats that fuel new business insights. Organizations that embrace solution architectures focused on maximizing data-driven insights will put themselves in a position to drive more business, enhance productivity, maintain competitive edge or discover new and lucrative business opportunities. Over the coming years, Hadoop could be in a position to process more than half the world’s data.
To educate organizations about how best to leverage Apache Hadoop as a key component of their enterprise big data architecture, Innovative Management Services is pleased to host the 1st annual Open-BDA Hadoop Summit 2014 which is scheduled to be held on 18th & 19th November, 2014 at Marriott Hotel, Karachi.
Demystifying Big Data and Data Science
• An overview of the shift to Data Science platforms
• The 3 critical components of a Data Science platform
• Industries that are most likely to get disrupted and shift to Data Science
• Characteristics of firms that get left behind the Data Science wave
• Factors that push an industry towards Data Science
• A brief overview of aspects of platform architecture beyond technology
Big Data, Big Content, and Aligning Your Storage Strategy - Hitachi Vantara
Fred Oh's presentation for SNW Spring, Monday 4/2/12, 1:00–1:45PM
Unstructured data growth is in an explosive state and shows no signs of slowing down. Costs continue to rise along with new regulations mandating longer data retention. Moreover, disparate silos, multivendor storage assets, and less than optimal use of existing assets have all contributed to "accidental architectures." And while these can be key drivers for organizations to explore incremental, innovative solutions to their data challenges, they may provide only short-term gain. Join us for this session as we outline the business benefits of a truly unified, integrated platform that manages all block, file, and object data and lets enterprises make the most of their storage resources. We explore the benefits of an integrated approach to multiprotocol file sharing, intelligent file tiering, federated search, and active archiving; how to simplify and reduce the need for backup without the risk of losing availability; and the economic benefits of an integrated architecture approach that lowers TCSO by 35% or more.
Are you excited to learn Big Data technologies? Do you feel that the internet, though loaded with free material, is too complicated for a newbie?
Many things can go wrong when learning a new technology. Free internet material can sometimes be a can of worms for a beginner, so training is advised for a jump start.
Open-BDA Big Data Hadoop Developer Training, which will be held on 11th & 12th May 2015 at the Marriott Hotel Karachi, will cover everything you need to know to start a career in Hadoop technology and achieve expertise to a level where you can take the certification exams from MapR, Cloudera, and Hortonworks with confidence. You can start as a beginner, and this course will help you become a certified professional.
Building Hadoop Data Applications with Kite by Tom White - The Hive
With a such a large number of components in the Hadoop ecosystem, writing Hadoop applications can be a big challenge for newcomers. In this talk Tom looks at best practices for building data applications that run on Hadoop, and introduces the Kite SDK, an open source project created at Cloudera with the goal of simplifying Hadoop application development by codifying many of these best practices.
Meet with Tom White:
Tom White is one of the foremost experts on Hadoop. He has been an Apache Hadoop committer since February 2007, and is a Member of the Apache Software Foundation. Tom is a software engineer at Cloudera, where he has worked, since its foundation, on the core distributions from Apache and Cloudera. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O’Reilly, java.net and IBM’s developerWorks, and has spoken at many conferences, including ApacheCon and OSCON. Tom has a B.A. in mathematics from the University of Cambridge and an M.A. in philosophy of science from the University of Leeds, UK. He currently lives in Wales with his family.
To Serve and Protect: Making Sense of Hadoop Security - Inside Analysis
The Briefing Room with Dr. Robin Bloor and HP Security Voltage
Live Webcast September 22, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=45ece7082b1d7c2cc8179bc7a1a69ea5
Hadoop is rapidly becoming a development platform and dominant server environment, and organizations are keen to take advantage of its massively scalable – and relatively inexpensive – resources. It is not, however, without its limitations, and it often requires a contingent of complementary components in order to behave within an information architecture. One area often overlooked is security, a factor that, if not considered from the onset, can insert great risk when putting sensitive data in Hadoop.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he discusses how security was never a design point for Hadoop and what organizations can do about it. He’ll be briefed by Sudeep Venkatesh of HP Security Voltage, who will explain the intricacies surrounding a secure Hadoop implementation. He will show how techniques like format-preserving and partial-field encryption can allow for analytics over protected data, with zero performance impact.
Visit InsideAnalysis.com for more information.
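Format-preserving encryption, mentioned above, keeps the shape of the data (digits stay digits, separators and length are unchanged) so validation and analytics keep working over protected values. Real systems use standardized, vetted schemes such as NIST's FF1; the toy below is NOT secure and only illustrates the format-preserving property with a keyed digit shift:

```python
def fpe_toy_encrypt(value, key):
    # Toy illustration only, not cryptographically secure: shift each digit
    # by the corresponding key digit (mod 10); non-digits (dashes, spaces)
    # pass through unchanged, preserving the original format.
    out, k = [], 0
    for ch in value:
        if ch.isdigit():
            out.append(str((int(ch) + key[k % len(key)]) % 10))
            k += 1
        else:
            out.append(ch)
    return "".join(out)

def fpe_toy_decrypt(value, key):
    # Decrypting is encrypting with the additive inverse of each key digit.
    return fpe_toy_encrypt(value, [(-d) % 10 for d in key])

card = "4111-1111-1111-1111"
key = [3, 1, 4, 1, 5, 9, 2, 6]
enc = fpe_toy_encrypt(card, key)
print(enc)                                # same length, dashes intact
assert fpe_toy_decrypt(enc, key) == card  # round-trips losslessly
```

Because the ciphertext still looks like a card number, downstream systems (length checks, partial-field display, joins) work on encrypted data, which is the property the talk's analytics-over-protected-data claim rests on.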
It introduces and illustrates use cases, benefits, and problems for Kerberos deployment on Hadoop, and shows how token support and TokenPreauth can help solve those problems. It also briefly introduces the Haox project, a Java client library for Kerberos.
As Hadoop becomes a critical part of Enterprise data infrastructure, securing Hadoop has become critically important. Enterprises want assurance that all their data is protected and that only authorized users have access to the relevant bits of information. In this session we will cover all aspects of Hadoop security including authentication, authorization, audit and data protection. We will also provide demonstration and detailed instructions for implementing comprehensive Hadoop security.
The fundamentals and best practices of securing your Hadoop cluster are top of mind today. In this session, we will examine and explain the components, tools, and frameworks used in Hadoop for authentication, authorization, audit, and encryption of data and processes. See how the latest innovations can let you securely connect more data to more users within your organization.
Apache Spark is one of the most exciting, active, and talked about ASF projects today, but how should Spring developers and enterprise architects view it? Is it the second coming of the Bean spec, or just another shiny distraction? This talk will introduce Spark and its core concepts, the ecosystem of services on top of it, types of problems it can solve, similarities and differences from Hadoop, integration with Spring XD, deployment topologies, and an exploration of uses in enterprise. Concepts will be illustrated with several demos covering: the programming model with Spring/Java8, development experience, “realistic” infrastructure simulation with local virtual deployments, and Spark cluster monitoring tools.
Hot Technologies of 2013 with Robin Bloor, Rick Sherman and IBM
Live Webcast June 19, 2013
http://www.insideanalysis.com
The promise of Hadoop can be seen in all kinds of ways -- the proliferation of open source projects; the virtually limitless applications of Big Data; the sheer number of vendors getting involved. But the real value only comes from a mature environment, and that's Hadoop 2.0. What are the component parts of a robust solution? How are today's cutting-edge organizations leveraging the power of Big Data?
Register for this episode of Hot Technologies to hear veteran analysts Dr. Robin Bloor of The Bloor Group and Rick Sherman of Athena IT Solutions offer perspective on how the Hadoop movement is shaping up. Larry Weber of IBM will then offer his take on the tools and architecture necessary to tackle the new challenges posed by Big Data. He'll discuss IBM's latest big data offerings, including IBM InfoSphere BigInsights, IBM InfoSphere Streams, and IBM InfoSphere Data Explorer, as well as IBM's vision for simplifying an organization's big data journey.
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.
Help! I inherited a Drupal Site! - DrupalCamp Atlanta 2016 - Paul McKibben
You have found yourself newly-responsible for administering and updating a Drupal site created by somebody else, and you’re struggling. Maybe you’re new to Drupal and you’ve been thrown into the fire. Or maybe you’re experienced with Drupal but the site creator used an unfamiliar approach. Or even worse, perhaps the site was not built according to best practices, and you need to dig deep to figure out how it works and keep it updated. Whatever your situation, this presentation has something for you.
Aiming for automatic updates - Drupal Dev Days Lisbon 2018 - hernanibf
Drupal's recent security updates resulted in many hours of work for the different professionals involved in maintaining Drupal websites, from developers to operations teams.
The new Drupal 8 release cycle also requires organisations to spend more time keeping their websites on the latest minor core release, so that their sites stay updated and ready to receive new features and security updates.
Nevertheless, even with this increasing effort, we still don't have an easy way to support automatic updates in Drupal core, although options are starting to appear.
In this session I will talk about the possible alternatives that can minimize the effort of automatically updating Drupal while still maintaining best practices in all the required phases.
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives - Cloudera, Inc.
This session will provide an executive overview of the Apache Hadoop ecosystem, its basic concepts, and its real-world applications. Attendees will learn how organizations worldwide are using the latest tools and strategies to harness their enterprise information to solve business problems and the types of data analysis commonly powered by Hadoop. Learn how various projects make up the Apache Hadoop ecosystem and the role each plays to improve data storage, management, interaction, and analysis. This is a valuable opportunity to gain insights into Hadoop functionality and how it can be applied to address compelling business challenges in your agency.
A FUTURE-FOCUSED DIGITAL PLATFORM WITH DRUPAL 8Phase2
https://www.youtube.com/watch?v=NCx0fx-FWSc
Breaking News: Al Jazeera Builds Future-focused Digital Platform with Drupal 8
Sep 28, 2016 at DrupalCon Dublin
This just in: Al Jazeera Media Network, a leading provider in news and media broadcasting, is investing in its future by building a global, multi-lingual, unified CMS platform to streamline the creation and personalized delivery of news on the newly released Drupal 8 platform. This story is still unfolding!
For a global media network like Al Jazeera, Drupal 8 provides the perfect base for internationalization, future growth, and flexibility. Al Jazeera required a platform that could unify several different content streams and support a complicated editorial workflow, allowing network wide collaboration and search.
In this talk, leaders from the Al Jazeera digital project will go “behind-the-scenes” of the network’s next generation publishing platform. Hear from the Al Jazeera Product Managers and Platform Experts about how the content needs driving the media business can map to the underpinnings of a unified publishing platform. We will explore the technical advantages of Drupal 8, as well as the digital strategy that informed the endeavor. You’ll learn:
● Why Al Jazeera Media Network decided to invest in Drupal 8 as an early adopter
● How to use Deploy, Multi-version, and Replication modules to support an enterprise content repository
● The implications of starting with Lightning as a base distribution
● How Al Jazeera Media Network transformed its editorial workflow with Drupal 8 tools
For anyone working in the digital publishing industry or considering using Drupal 8 for a platform, this session is a must-see!
This presentation answers many of your questions about PostgreSQL and the Red Hat Cluster Suite.
It reviews how you can create failover/standby capabilities with the following activities:
General PostgreSQL clustering options
Overview of Red Hat Cluster Service
Identification of candidate databases for clustering
Identification of hardware for clustering
Analysis of uptime requirements and data latency
Implementation of clustering
Testing of clustering
PostgreSQL installation tips for RHCS
Gradle is an open-source build automation tool focused on flexibility, build reproducibility and performance. Over the years, this tool has evolved and introduced new concepts and features around dependency management, publication and other aspects on build and release of artifacts for the Java platform.
Keeping up to date with all these features across several projects can be challenging. How do you make sure that all your projects can be upgraded to the latest version of Gradle? What if you have thousands of projects and hundreds of engineers? How can you abstract common tasks for them and make sure that new releases work as expected?
At Netflix, we built Nebula, a collection of Gradle plugins that helps engineers remove boilerplate in Gradle build files, and makes building software the Netflix way easy. This reduces the cognitive load on developers, allowing them to focus on writing code.
In this talk, I’ll share with you our philosophy on how to build JVM artifacts and the pieces that help us boost the productivity of engineers at Netflix. I’ll talk about:
- What is Nebula
- What are the common problems we face and try to solve
- How we distribute it to every JVM engineer
- How we ensure that Nebula/Gradle changes do not break builds so we can ship new features with confidence at Netflix
Similar to Building hadoop based big data environment (20)
ONE FOR ALL! Using Apache Calcite to make SQL smartEvans Ye
In the past, when Hadoop was born, the big data world was focused on how to build systems that scale. Now the world has evolved. HBase hits 2.0, Cassandra hits 3.0, Hive hits 3.0, etc. When scalability is conquered, what's next? That’s right, usability comes into play. If we look back into the history, NoSQL really just uses a divide-and-conquer mechanism to tackle big data problems by trading off SQL capabilities. But once the big data problem is solved, we see more and more NoSQL and data processing engines start to build up SQL or SQL-like interfaces. Therefore, a generic SQL engine that provides core SQL capabilities such as query parsing, relational algebra, and query optimization starts to shine.
In this talk, I'll walk you through the architecture, functionality, and design concept of Apache Calcite. Notice that Calcite itself is not a database, but many well-known systems already incorporate Calcite as a library. For instance, Hive, Drill, Druid, Phoenix, Apex, Flink, Storm, Samza, and more. To better illustrate how Calcite works, I'll choose some of the systems and describe how they adopt Calcite and which part is enhanced by Calcite. Furthermore, I'll talk about several features that Calcite provides such as query optimization, heterogeneous data sources, materialized views, and Stream SQL. From the user's perspective, knowing better how these systems work behind the scenes equips you with more knowledge to choose a system that ultimately suits your needs.
The Apache Way: A Proven Way Toward SuccessEvans Ye
With innumerable successful Apache projects dominating the big data world, the working model of Apache communities clearly deserves a study. In this talk, I'll walk you through how Apache communities and the Apache Software Foundation generally work. The philosophy behind it all is the so-called "The Apache Way".
For audience members who are engineers, I'll share with you why you should be part of the Apache family, how to do it, and what you can get from it. Moreover, I'll cover this with some actionable tips, closing with some career advice. For those who are managers or at the CXO level, I'll talk about some aspects of building an engineering culture, which can ultimately pace your team and business toward success.
Using the SDACK Architecture to Build a Big Data ProductEvans Ye
You definitely have heard about the SMACK architecture, which stands for Spark, Mesos, Akka, Cassandra, and Kafka. It’s especially suitable for building a lambda architecture system. But what is SDACK? It’s very similar to SMACK, except the “D" stands for Docker. While SMACK is an enterprise-scale, multi-tenant solution, the SDACK architecture is particularly suitable for building a data product. In this talk, I’ll talk about the advantages of the SDACK architecture, and how TrendMicro uses the SDACK architecture to build an anomaly detection data product. The talk will cover:
1) The architecture we designed based on SDACK to support both batch and streaming workload.
2) The data pipeline built based on Akka Stream which is flexible, scalable, and able to do self-healing.
3) The Cassandra data model designed to support time series data writes and reads.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We also had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
2. Who am I
• Evans Ye @
• Dumbo Team
• http://dumbointaiwan.blogspot.tw/
12/14/2013
Copyright 2013 Trend Micro Inc.
3. Agenda
• Building your own Hadoop version
• Hadoop Deployment
• Hadoop release engineering
• The development environment
• Bigtop puppet
4. Why build our own version
• Add your own patch at any time
– From the community's perspective, they need to take care of backward compatibility, which takes much more time and effort.
• Fetch official patches into the currently adopted version
– You may not upgrade your Hadoop version frequently, but there may be a specific need for that patch.
• Flexibility, business-needed features
8. Brute force
• git clone
• Make some changes
• Build binary tarball
How to do version control?
core-site.xml
hdfs-site.xml
mapred-site.xml
…
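One lightweight way to answer the version-control question raised above is to keep the cluster configs in a dedicated git repository. A minimal sketch; the directory name and the placeholder identity are assumptions, and on a real node you would copy the files from your Hadoop conf directory instead of creating empty ones:

```shell
#!/bin/sh
# Sketch: track Hadoop config files in their own git repo so every change
# to core-site.xml etc. is recorded. Paths and identity are hypothetical.
set -e
mkdir -p hadoop-conf
git init -q hadoop-conf
git -C hadoop-conf config user.email "ops@example.com"   # placeholder identity
git -C hadoop-conf config user.name "ops"
# seed with the config files the slide lists (empty here, for the sketch)
touch hadoop-conf/core-site.xml hadoop-conf/hdfs-site.xml hadoop-conf/mapred-site.xml
git -C hadoop-conf add .
git -C hadoop-conf commit -qm "snapshot cluster configs"
```

From there, every configuration change becomes a commit you can diff and roll back.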
10. How bigtop helps you
• Apache Hadoop App developers:
– Run pseudo-distributed Hadoop cluster to test your code on.
• Vendors:
– Build your own Apache Hadoop distribution, customized from
Apache Bigtop bits.
• Packaging, Deployment, Integration Testing
12. Build
• Build hadoop-common (see BUILDING.txt)
– hadoop-common$ mvn package -Pdist,docs,src,native -Dtar
• Prepare your src tar in bigtop
• bigtop$ make hadoop-rpm
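The build steps above can be strung together as one pipeline. A dry-run sketch that echoes each step instead of executing it, so the sequence can be reviewed first; the copy destination inside the Bigtop checkout is an assumption, so check Bigtop's docs for where it expects source tarballs:

```shell
#!/bin/sh
# Dry-run sketch of the custom Hadoop build flow: print each step rather
# than execute it. Swap echo for eval in run() to really execute.
run() { echo "+ $*"; }

run "cd hadoop-common"
run "mvn package -Pdist,docs,src,native -Dtar"             # build the tarballs
run "cp hadoop-dist/target/hadoop-*.tar.gz ../bigtop/dl/"  # hand the src tar to Bigtop (assumed path)
run "cd ../bigtop && make hadoop-rpm"                      # let Bigtop produce the RPMs
```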
17. Problems to solve
• Lots of nodes need to be configured
• Less human involvement, fewer mistakes made
• Configuration changes quite often
– adjust fair scheduler
– enable/disable short circuit
– try more performance-improvement configurations
19. What is puppet?
• An IT automation tool that helps system administrators automate many repetitive tasks
• You only need to define the desired state
20. What is Hadooppet?
• A general Hadoop cluster deployment tool based on puppet
• Kerberos / LDAP auto-configured
• A set of Hadoop / Kerberos management tools
• A set of sanity-check scripts for Trend Micro's Hadoop-related services
• Manage configuration on the puppetmaster
21. Design
• Abstract environment-specific configurations in a single configuration file
• setup.sh
– namenode_fqdns=("dev1.example.com" "dev2.example.com")
– namenode_dirs=("/name/1" "/name/2")
– namenode_heap=32g
– map_slots=5
– reduce_slots=3
– …
22. Benefits
• Can be used to set up any kind of Hadoop cluster
• When doing a major version upgrade, minimize the downtime
– hadoop1 (Namenode, Secondarynamenode) → hadoop2 (Active/Standby Namenode, Journalnodes, ZKFC)
28. give-me-vm
• PyCon 2012
– Small Python Tools for Software Release Engineering
• An automation tool to manage the VM lifecycle
• Uses the Python XenAPI
• Create temporary VMs for testing via self-service
• Destroy them when the testing is finished
29. Build auto deployment on Hadooppet
• ./give_me_vm.py
• Set up passphraseless SSH between each VM
• Set hostnames
• Install Hadooppet on the master
• Run deployment
• Run sanity checks
• ./destroy_vm.py
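The steps above can be sketched as a single dry-run script. The give_me_vm.py / destroy_vm.py names come from the slides, but the hostnames and the Hadooppet install/deploy commands are made up for illustration:

```shell
#!/bin/sh
# Dry-run sketch of the auto-deployment pipeline: print each step rather
# than execute it. Swap echo for eval in run() to really execute.
# Hostnames and Hadooppet command names are hypothetical.
run() { echo "+ $*"; }

run "./give_me_vm.py"                                     # provision temporary VMs
run "ssh-copy-id root@vm1.example.com"                    # passphraseless SSH between VMs
run "ssh root@vm1.example.com hostname vm1"               # set hostname
run "ssh root@vm1.example.com yum install -y hadooppet"   # install Hadooppet on the master
run "ssh root@vm1.example.com hadooppet-deploy"           # run deployment (assumed command)
run "ssh root@vm1.example.com hadooppet-sanity-check"     # run sanity checks (assumed command)
run "./destroy_vm.py"                                     # tear everything down when finished
```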
32. For hadoop service developers…
• Not enough Hadoop clients for every developer
• Developers cannot reach the server side while developing Hadoop-related services
• Cannot experiment with new technologies like Impala, Spark, Flume
• CI on Hadoop-related services
33. give-me-vm + Hadoop all-in-one VM
• Use Hadooppet to set up a pseudo-distributed Hadoop VM as a XenServer template
• Get a Hadoop all-in-one VM via give-me-vm
• Services integrate their CI tests with the Hadoop all-in-one VM
35. Bigtop puppet
• Bigtop also has a set of puppet scripts to deploy the Hadoop ecosystem
36. Bigtop puppet
• Preparation:
– A VM with JDK and puppet installed
– mkdir -p /data/{1,2}
– git clone https://github.com/apache/bigtop.git
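After the preparation above, the Bigtop puppet recipes can be applied on the VM. A dry-run sketch; the module and manifest paths follow the Bigtop source layout of that era and may differ in your checkout, so treat them as assumptions and check the repo's README:

```shell
#!/bin/sh
# Dry-run sketch: apply Bigtop's puppet recipes on the prepared VM.
# Print each step rather than execute it; swap echo for eval to really run.
run() { echo "+ $*"; }

run "cd bigtop"
run "puppet apply -d --modulepath=bigtop-deploy/puppet/modules bigtop-deploy/puppet/manifests/site.pp"
```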
37. Conclusion
• There are many great deployment tools out there
– Ambari, CM, ETU appliance
– Choose a suitable distribution based on your business needs
• If you want to do it yourself
– Bigtop can do the packaging for you easily
– Leverage the Bigtop puppet modules for your deployment