Leveraging Docker for Hadoop Build Automation and Big Data Stack Provisioning
Evans Ye, Sr. Software Engineer
DataWorks Summit San Jose 2017

Who am I
• Tech Lead @ APAC Data Team, Y! Taiwan
• Building data products for E-Commerce business
• PMC chair of Apache Bigtop, ASF member

Outline
• Quick Intro to Apache Bigtop
• Docker for Bigtop Packaging
• Docker for Bigtop Provisioner
• Docker for Bigtop Sandbox
• Releases

QUICK INTRO TO APACHE BIGTOP

But there are some other great Hadoop ecosystem components...

From source code to packages
Bigtop Packaging

Bigtop feature set
Packaging, Testing, Deployment, Virtualization
for you to easily build your own Big Data Stack

Community stats
• 94 total contributors
• Spark: 1093, Hadoop: 99, HBase: 126, Hive: 115
• 5 years since 2012
• 30 Hadoop ecosystem components packaged
• 5 Linux distributions, 2 architectures supported

DOCKER FOR BIGTOP PACKAGING

Preparing build environment
…
Seriously?

Bigtop Toolchain
• Puppet recipes to install required libraries, build tools
• Prerequisite:
▪ Java
• To prepare a build environment:
git clone https://github.com/apache/bigtop.git
cd bigtop
./bigtop_toolchain/bin/puppetize.sh
./gradlew toolchain

CI infrastructure
Slaves: CentOS, Fedora, Ubuntu, Debian, openSUSE
Bigtop Toolchain is applied on every slave

Dockerized CI infrastructure
Slaves: CentOS, Fedora, Ubuntu, Debian, openSUSE
• Immutable environment
• Fault tolerance

How to build packages
• Execute shell
• Bigtop CI Setup Guide
# Example values
OS=debian-8
COMPONENT=hadoop
docker run -u jenkins --rm \
  -v "$(pwd)":/bigtop --workdir /bigtop \
  bigtop/slaves:trunk-$OS \
  bash -l -c "./gradlew allclean $COMPONENT-pkg"
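
The same command can be repeated over several OS images and components. The following is a minimal dry-run sketch, not the project's CI job: `build_cmd` is a hypothetical helper, and the `centos-7` tag and `spark` component are assumptions added for illustration (only `debian-8` and `hadoop` appear in the slide). It prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Bigtop): print the docker command that
# builds one component on one OS image. Printing keeps this a dry run.
build_cmd() {
  local os="$1" component="$2"
  printf 'docker run -u jenkins --rm -v "%s":/bigtop --workdir /bigtop ' "$PWD"
  printf 'bigtop/slaves:trunk-%s ' "$os"
  printf 'bash -l -c "./gradlew allclean %s-pkg"\n' "$component"
}

# Assumed matrix: only debian-8 and hadoop are named in the slide.
for os in debian-8 centos-7; do
  for component in hadoop spark; do
    build_cmd "$os" "$component"
  done
done
```

Once the printed commands look right, each line can be pasted into a shell from the root of a Bigtop checkout.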

Bigtop packages on master
https://ci.bigtop.apache.org/view/Packages/job/Bigtop-trunk-packages/

Extremely friendly for porting
• Example: How to port the Bigtop distribution to PPC64LE?
• Prepare a PPC64LE Docker base image
• Apply Bigtop Toolchain on the PPC64LE Docker image
• Build Bigtop packages on the PPC64LE slaves image
• 2016: Ported 22 out of 24 Bigtop components in 2 weeks, with only 5 patches
• Credit: Amir Sanjar, IBM

Bigtop early mission accomplished
Leveraged by app providers…

Get out of the Apache dome

New focus and target user
• Data engineers vs Distro. builders
• Solution diversity:
▪ Streaming: Flink, Apex
▪ In-memory cache: Alluxio, Ignite
• User/developer tools:
▪ Bigtop Provisioner
▪ Bigtop Sandbox
• Big data stack references
• Machine learning, deep learning components

DOCKER FOR BIGTOP PROVISIONER

Bigtop Provisioner
• A tool to demonstrate the full life cycle of Bigtop
Bigtop features: Packaging, Testing, Deployment, Virtualization
Provisioner steps: Create resources → Run Bigtop Puppet → Run Bigtop Tests

• We use Vagrant as an abstraction layer to support different kinds of resource providers
[Figure: Vagrant and its providers]

One click Hadoop provisioning (Bigtop 1.0.0)
./docker-hadoop.sh -c 3
• bigtop/deploy image on Docker Hub
• puppet apply, once per node (3 nodes)
https://cwiki.apache.org/confluence/display/BIGTOP/Bigtop+Provisioner+User+Guide

Problems with Vagrant’s Docker Provider
• Need to add the Vagrant public key into Docker images
• Too many issues with the auto-created boot2docker VM
• A provisioning bug in the Docker provider has stayed open for 2 years
▪ 'Waiting for machine to boot' hangs indefinitely
• Cannot share the same code across different providers anyway
• Not all Docker options are supported in the Vagrantfile
• ^#?& slow

Replaced by docker-compose (Bigtop 1.2.0)
./docker-hadoop.sh -c 3
• bigtop/deploy image on Docker Hub
• puppet apply, once per node (3 nodes)

Advantages
• No need to create a customized image beforehand
• Better compatibility with Docker’s native solutions
• Clear, simple YAML file for orchestration settings
• Supports new features such as overlay network
• Leverage Swarm for multi-node cluster deployment
• Fast → better user experience

How to run Docker Provisioner
• Execute shell
• Bigtop CI Setup Guide
# See bigtop/provisioner/docker/*.yaml
CONFIG=YOUR_CUSTOM_CONF.yaml
# provision
./gradlew -Pconfig=${CONFIG} -Pnum_instances=1 \
  docker-provisioner
# destroy provisioned cluster
./gradlew docker-provisioner-destroy
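
When the cluster size varies between runs, it can help to validate the arguments before handing them to Gradle. A minimal sketch, assuming only the `-Pconfig` and `-Pnum_instances` properties shown above; `provision_cmd` is a hypothetical helper that prints the command rather than executing it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Bigtop): build the provisioning command
# for a config file and a node count, rejecting non-numeric or zero counts.
provision_cmd() {
  local config="$1" num="${2:-1}"
  case "$num" in
    *[!0-9]*|0)
      echo "num_instances must be a positive integer" >&2
      return 1
      ;;
  esac
  echo "./gradlew -Pconfig=${config} -Pnum_instances=${num} docker-provisioner"
}

# Dry run: print the command for a 3-node cluster.
provision_cmd YOUR_CUSTOM_CONF.yaml 3
```

Run the printed command from the root of a Bigtop checkout, and tear the cluster down afterwards with `./gradlew docker-provisioner-destroy`.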

YOUR_CUSTOM_CONF.yaml example
docker:
  memory_limit: "4g"
  image: "bigtop/puppet:centos-7"
repo: "http://bigtop-repos.s3.amazonaws.com/releases/1.2.0/centos/7/x86_64"
distro: centos
components: [hdfs, yarn, mapreduce]
enable_local_repo: false
smoke_test_components: [hdfs, yarn, mapreduce]

Visibility for deployments

Use cases
• For application developers, cluster admins, users
▪ Run a Hadoop cluster to test your code on
▪ Try & test configurations before applying to Production
▪ Play around with Bigtop Big Data Stacks
• For contributors
▪ Easy to test your packaging, deployment, testing code
• For Distro. builders
▪ CI matrix → patching upstream code made easier

Introducing Bigtop Sandbox
• Easy way to get started
• Docker images that have Bigtop stacks installed and configured
• Pseudo cluster up & running w/o installation
• Command-line tool for you to build your own stack

Docker image layer
Interface
Customized big data stack
Deployment & management tool
Base image (OS)

Docker image layer
Concrete implementation
HDFS + YARN + Spark
Bigtop Puppet
bigtop/puppet:ubuntu-16.04

Building images
Ubuntu 16.04
Bigtop Puppet
HDFS + YARN + Spark
+ site.yaml
$ puppet apply

site.yaml example
bigtop::hadoop_head_node: bigtop.example.com
bigtop::bigtop_repo_uri: http://bigtop-repos.s3.amazonaws.com/releases/1.2.0/debian/8/x86_64
hadoop::hadoop_storage_dirs: [/data/1, /data/2]
hadoop_cluster_node::cluster_components: [hdfs, yarn, spark]

How to build
git clone https://github.com/apache/bigtop.git
cd bigtop/docker/sandbox
./build.sh -a bigtop -o ubuntu-16.04 \
  -c "hdfs, yarn, spark"
• Or specify your custom conf:
./build.sh -a bigtop -o ubuntu-16.04 \
  -f custom_site.yaml -t dws2017
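
A custom conf file for the `-f` option can be generated rather than written by hand. This is a minimal sketch, assuming the four keys from the site.yaml example are sufficient; `site_yaml` is a hypothetical helper, and the repository URI is copied from that example, so point it at the repository that matches your base OS.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Bigtop): print a site.yaml for the given
# head node, package repository URI, and comma-separated component list.
site_yaml() {
  local head_node="$1" repo_uri="$2" components="$3"
  printf 'bigtop::hadoop_head_node: %s\n' "$head_node"
  printf 'bigtop::bigtop_repo_uri: %s\n' "$repo_uri"
  printf 'hadoop::hadoop_storage_dirs: [/data/1, /data/2]\n'
  printf 'hadoop_cluster_node::cluster_components: [%s]\n' "$components"
}

# Print the same settings as the site.yaml example.
site_yaml bigtop.example.com \
  http://bigtop-repos.s3.amazonaws.com/releases/1.2.0/debian/8/x86_64 \
  "hdfs, yarn, spark"
```

Redirect the output to `custom_site.yaml` before calling `./build.sh` with `-f custom_site.yaml`.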

Running images
HDFS + YARN + Spark
$ puppet apply

How to run
docker run --name sandbox -d \
  -p 50070:50070 -p 8088:8088 \
  evansye/sandbox:dws2017
docker logs -f sandbox
docker exec sandbox spark-example SparkPi
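
In a script, `docker logs -f` cannot be watched by eye, so the job has to wait until the services inside the container answer. A minimal sketch of a generic retry loop; `wait_for` is a hypothetical helper, and probing the NameNode web UI on port 50070 with `curl` is an assumption about what "ready" means here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command repeatedly until it succeeds.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Self-contained demonstration with a command that always succeeds.
wait_for 3 0 true && echo "ready"

# Intended use (assumed readiness probe, not executed here):
# wait_for 30 2 curl -sf http://localhost:50070/ \
#   && docker exec sandbox spark-example SparkPi
```

The attempt count and delay are arbitrary; tune them to how long the sandbox takes to start on your machine.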

                   Bigtop Provisioner   Bigtop Sandbox
Scalable           Yes                  No
Portable           No                   Yes
Flexibility        High                 Medium
Speed              > 2 mins             > 15 secs
Requires network   Yes                  No
Port forwarding    No                   Yes

                   Bigtop Provisioner                          Bigtop Sandbox
Data engineers     Multi-node cluster testing                  Build/use sandboxes for dev & test
Ops                Multi-node cluster testing                  Single node testing
Contributors       Test packages, puppet recipes, test cases   Test packages, puppet recipes, test cases
Distro. builders   Test packages, puppet recipes, test cases   Provide Sandboxes

Integration test in CI/CD pipeline
[Figure: CD pipeline with Bigtop Sandbox. Stages shown: source code, compile, unit test, build image, integration test with Sandbox (Sandbox service, data), push image to Docker Registry, deploy, finished.]

Future
• Production deployment using Sandbox images
▪ --net host or overlay network (SDN)?
▪ External volumes for edit logs, fsimages, etc.
▪ Cluster orchestration: Swarm, Kubernetes?

Bigtop 1.2.0 Released April, 2017
• New components:
▪ Ambari 2.5.0
▪ GPDB 5.0.0-alpha.0 (Greenplum)
• Featured upgrades:
▪ Hadoop 2.7.3
▪ Spark 2.1.0
▪ Kafka 0.10.1.1
▪ HBase 1.1.3
▪ and more

New features in Bigtop 1.2.0
• New features:
▪ Juju bigtop charms
▪ Bigtop Sandbox (alpha, recommended to try master)
• Improvement:
▪ Bigtop Docker Provisioner made faster

Juju Cloud Weather Report
http://bigtop.charm.qa/

Bigtop 1.2.1 upcoming
• Expected to be out late June
• Hadoop 2.7.4
(Interested in having Docker container support backported, but I'm not sure yet)
• Mainly bug fixes:
▪ Packages
▪ Deployments
▪ Sandbox

Road ahead towards 1.3.0
• Machine Learning and Deep Learning integration
• Support AArch64
• Enhance the supported component set in Bigtop Puppet (not all components covered)
• Extend the CI matrix coverage to Bigtop Tests
• Ambari Bigtop stack integration
• Provide Big data stack references

ODPi Apache Bigtop Test Drive Program
• Submit your proposal, contribute to Bigtop with funding!
• Improvements, new features, build, test, CI, etc.
• CFP opened June 13, 2017; CFP closed July 14, 2017
• https://www.odpi.org/community/bigtopgrantfund

Reference
• Join mailing list, ask questions, suggest features, etc.
• Contribute (components, tutorials, docs)
• Report bugs
▪ Home page: http://bigtop.apache.org/
▪ Mailing list: http://bigtop.apache.org/mail-lists.html
▪ Document: https://cwiki.apache.org/confluence/display/BIGTOP/Index
▪ Source code: https://github.com/apache/bigtop
▪ Packages: https://www.apache.org/dist/bigtop/bigtop-1.2.0/repos/
▪ JIRA: https://issues.apache.org/jira/browse/BIGTOP
