Three examples of containerized Big Data analytics:
1. Installation with Docker and Weave, for small and medium deployments
2. Hadoop on Mesos with Apache Myriad
3. Spark on Mesos
Lessons Learned Running Hadoop and Spark in Docker Containers (BlueData, Inc.)
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed big data applications like Apache Hadoop and Apache Spark. This session at Strata + Hadoop World in New York City (September 2016) explores various solutions and tips to address the challenges encountered while deploying multi-node Hadoop and Spark production workloads using Docker containers.
Some of these challenges include container life-cycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. BlueData is “all in” on Docker containers, with a specific focus on big data applications. BlueData has learned firsthand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy big data workloads using Docker.
This session by Thomas Phelan, co-founder and chief architect at BlueData, discusses how to securely network Docker containers across multiple hosts and discusses ways to achieve high availability across distributed big data applications and hosts in your data center. Since we’re talking about very large volumes of data, performance is a key factor, so Thomas shares some of the storage options implemented at BlueData to achieve near bare-metal I/O performance for Hadoop and Spark using Docker as well as lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52042
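Multi-host container networking of the kind discussed in this session is commonly expressed today with a Docker overlay network. A minimal hedged sketch follows; the image, service, and network names are illustrative, and this is not BlueData's actual implementation:

```yaml
# docker-compose.yml (Swarm mode): workers on different hosts share one overlay network
version: "3.8"
services:
  spark-worker:
    image: example/spark-worker:latest   # hypothetical image name
    deploy:
      replicas: 4                        # spread across the Swarm's hosts
    networks:
      - bigdata
networks:
  bigdata:
    driver: overlay                      # spans multiple Docker hosts
```

Deployed with `docker stack deploy`, containers on different hosts can then reach each other by service name over the overlay network.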
This presentation describes how Hortonworks is delivering Hadoop on Docker for a cloud-agnostic deployment approach, as presented at Cisco Live 2015.
This session will examine the many options a data scientist has for running Spark clusters in public and private clouds. We will discuss various environments employing AWS, Mesos, Docker containers, and BlueData EPIC technologies, and the benefits and challenges of each.
Speakers:
Tom Phelan, Co-founder and Chief Architect - BlueData Inc. Tom has spent the last 25 years as a senior architect, developer, and team lead in the computer software industry in Silicon Valley. Prior to co-founding BlueData, Tom spent 10 years at VMware as a senior architect and team lead in the core R&D Storage and Availability group. Most recently, Tom led one of the key projects – vFlash, focusing on integration of server-based Flash into the vSphere core hypervisor. Prior to VMware, Tom was part of the early team at Silicon Graphics that developed XFS, one of the most successful open source file systems. Earlier in his career, he was a key member of the Stratus team that ported the Unix operating system to their highly available computing platform. Tom received his Computer Science degree from the University of California, Berkeley.
Structor - Automated Building of Virtual Hadoop Clusters (Owen O'Malley)
Discusses Vagrant scripts to set up and deploy a working multi-node Hadoop cluster, with or without security. All source code is available at https://github.com/hortonworks/structor .
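For flavor, a multi-machine Vagrantfile of the kind structor automates might look like the sketch below; the box name, hostnames, and IP addresses are illustrative assumptions, not structor's actual profiles:

```ruby
# Hypothetical minimal multi-node Vagrantfile, in the spirit of structor's profiles
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"             # assumed base box
  (1..3).each do |i|
    config.vm.define "node#{i}" do |node|
      node.vm.hostname = "node#{i}.example.com"
      node.vm.network "private_network", ip: "192.168.56.#{100 + i}"
      node.vm.provider "virtualbox" do |vb|
        vb.memory = 2048                 # Hadoop daemons need a few GB per node
      end
    end
  end
end
```

`vagrant up` then brings up the three VMs, onto which provisioning scripts install and configure the Hadoop services.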
CBlocks - POSIX-Compliant File Systems for HDFS (DataWorks Summit)
With YARN running Docker containers, it is possible to run applications that are not HDFS-aware inside these containers. It is hard to customize these applications, since most of them assume a POSIX file system with rewrite capabilities. In this talk, we will dive into how we created a block storage layer, how it is being tested internally, and the storage containers that make it all possible.
The storage container framework was developed as part of Ozone (HDFS-7240). This talk will also explore the current state of Ozone along with CBlocks, covering the architecture of storage containers, how replication is handled, scaling to millions of volumes, and I/O performance optimizations.
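The core idea of a block device layered over fixed-size storage containers can be sketched as an address-mapping function. The block and container sizes below are illustrative assumptions, not CBlocks' actual layout:

```python
BLOCK_SIZE = 4 * 1024           # 4 KiB logical blocks (assumed)
CONTAINER_SIZE = 5 * 1024**3    # 5 GiB storage containers (assumed)
BLOCKS_PER_CONTAINER = CONTAINER_SIZE // BLOCK_SIZE

def locate(block_number: int) -> tuple:
    """Map a volume's logical block number to (container_id, byte offset in container)."""
    container_id = block_number // BLOCKS_PER_CONTAINER
    offset = (block_number % BLOCKS_PER_CONTAINER) * BLOCK_SIZE
    return container_id, offset

# Block 0 lives at the start of container 0; the first block past a
# container boundary lands at offset 0 of the next container.
print(locate(0))                      # (0, 0)
print(locate(BLOCKS_PER_CONTAINER))   # (1, 0)
```

Replication then operates on whole containers rather than individual blocks, which is what lets the scheme scale to millions of volumes.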
Lessons Learned from Dockerizing Spark Workloads (BlueData, Inc.)
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed Big Data applications like Apache Spark.
Some of these challenges include container lifecycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. BlueData is “all in” on Docker containers – with a specific focus on Spark applications. They’ve learned first-hand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy Big Data workloads using Docker.
This session at Spark Summit in February 2017 (by Thomas Phelan, co-founder and chief architect at BlueData) described lessons learned as well as some tips and tricks on how to Dockerize your Big Data applications in a reliable, scalable, and high-performance environment.
In this session, Tom described how to network Docker containers across multiple hosts securely. He discussed ways to achieve high availability across distributed Big Data applications and hosts in your data center. And since we’re talking about very large volumes of data, performance is a key factor. So Tom discussed some of the storage options that BlueData explored and implemented to achieve near bare-metal I/O performance for Spark using Docker.
https://spark-summit.org/east-2017/events/lessons-learned-from-dockerizing-spark-workloads
April 2016 HUG: The Latest of Apache Hadoop YARN and Running Your Docker Apps (Yahoo Developer Network)
Apache Hadoop YARN is a modern resource-management platform that handles resource scheduling, isolation and multi-tenancy for a variety of data processing engines that can co-exist and share a single data-center in a cost-effective manner.
In the first half of the talk, we are going to give a brief look into some of the big efforts cooking in the Apache Hadoop YARN community.
We will then dig deeper into one of the efforts - supporting Docker runtime in YARN. Docker is an application container engine that enables developers and sysadmins to build, deploy and run containerized applications. In this half, we'll discuss container runtimes in YARN, with a focus on using the DockerContainerRuntime to run various docker applications under YARN. Support for container runtimes (including the docker container runtime) was recently added to the Linux Container Executor (YARN-3611 and its sub-tasks). We’ll walk through various aspects of running docker containers under YARN - resource isolation, some security aspects (for example container capabilities, privileged containers, user namespaces) and other work in progress features like image localization and support for different networking modes.
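A hedged sketch of the NodeManager configuration that this Docker runtime support enables follows; the property names track later Apache Hadoop releases and may differ from the YARN-3611-era patches:

```xml
<!-- yarn-site.xml: allow the Docker runtime alongside the default one -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
```

A job then opts in per container, for example by setting `YARN_CONTAINER_RUNTIME_TYPE=docker` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=<image>` in the container's launch environment.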
Speakers:
Vinod Kumar Vavilapalli is the Hadoop YARN and MapReduce guy at Hortonworks. He is a long-term Apache Hadoop contributor, a Hadoop committer, and a member of the Apache Hadoop PMC. He has a Bachelor's degree in Computer Science and Engineering from the Indian Institute of Technology Roorkee. He has been working on Hadoop for nearly 9 years and still has fun doing it. Straight out of college, he joined the Hadoop team at Yahoo! Bangalore, before Hortonworks happened. He is passionate about using computers to change the world for the better, bit by bit.
Sidharta Seethana is a software engineer at Hortonworks. He works on the YARN team, focusing on bringing new kinds of workloads to YARN. Prior to joining Hortonworks, Sidharta spent 10 years at Yahoo! Inc., working on a variety of large-scale distributed systems for core platforms/web services, search and marketplace properties, developer network, and personalization.
Tuning Apache Ambari Performance for Big Data at Scale with 3000 Agents (DataWorks Summit)
Apache Ambari manages Hadoop at large scale, and it becomes increasingly difficult for cluster admins to keep the machinery running smoothly as data grows and clusters scale from 30 to 3000 agents. To test at scale, Ambari has a Performance Stack that allows a VM to host as many as 50 Ambari Agents. The simulated stack, with 50 Agents per VM, can stress-test Ambari Server with the same load as a 3000-node cluster. This talk will cover how to tune the performance of Ambari and MySQL, and share performance benchmarks for operations like deploys, bulk operations, installation of bits, and Rolling & Express Upgrades. Moreover, the speaker will show how to use the Ambari Metrics System and Grafana to plot performance, detect anomalies, and pinpoint how to improve performance for a more responsive experience. Lastly, the talk will discuss roadmap features in Ambari 3.0 for improving performance and scale.
A brief introduction to YARN: how and why it came into existence and how it fits together with this thing called Hadoop.
Focus given to architecture, availability, resource management and scheduling, migration from MR1 to MR2, job history and logging, interfaces, and applications.
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirement of HBase's write-ahead log (WAL), a guarantee that HDFS provides correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase and provides the level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library implementation of the Raft consensus protocol in Java, and it is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
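The durability guarantee at the heart of a Raft-based log service comes from majority replication: a log entry counts as committed once a quorum of replicas has persisted it. A minimal sketch of that rule (illustrative only, not Ratis code):

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of acknowledgements that forms a majority."""
    return replicas // 2 + 1

def is_committed(acks: int, replicas: int) -> bool:
    """An entry is durable once a majority of replicas has persisted it."""
    return acks >= quorum_size(replicas)

# With 3 replicas, 2 acks commit an entry; losing any single node
# still leaves at least one copy of every committed entry.
print(quorum_size(3))        # 2
print(is_committed(1, 3))    # False
print(is_committed(2, 3))    # True
```

This is why a Raft-backed WAL can tolerate the loss of a minority of log servers without losing acknowledged writes.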
Using Ansible to deploy a 6-node Hortonworks Data Platform (Hadoop) cluster on AWS with the ObjectRocket ansible-hadoop playbook.
Presented at the Ansible NOVA MeetUp on February 23, 2017: https://www.meetup.com/Ansible-NOVA/events/236853616/
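A deployment like this is driven by an Ansible inventory. The sketch below is a hypothetical minimal inventory; the group names, hostnames, and playbook filename are illustrative and may differ from what the ansible-hadoop playbook actually expects:

```ini
; inventory: one master and five slaves for a 6-node HDP cluster (names assumed)
[master-nodes]
master01 ansible_host=10.0.0.10

[slave-nodes]
slave01 ansible_host=10.0.0.11
slave02 ansible_host=10.0.0.12
slave03 ansible_host=10.0.0.13
slave04 ansible_host=10.0.0.14
slave05 ansible_host=10.0.0.15

[hadoop-cluster:children]
master-nodes
slave-nodes
```

The cluster would then be deployed with something like `ansible-playbook -i inventory <playbook>.yml`, with the playbook name taken from the repository's documentation.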
As a company starts dealing with large amounts of data, operations engineers are challenged with managing the influx of information while ensuring the resilience of that data. Hadoop HDFS, Mesos, and Spark help reduce these issues with a scheduler that allows data cluster resources to be shared. This provides common ground where data scientists and engineers can meet, develop high-performance data processing applications, and deploy their own tools.
DevOps for Big Data: Cluster Management Tools (Ran Silberman)
What tools can we find today to manage a Hadoop cluster and its ecosystem?
There are two tools ready today:
Cloudera Manager, and Ambari from Hortonworks.
In this presentation I explain what they do and why to use them, as well as their pros and cons.
A step-by-step deep dive into the Kafka security world. This presentation covers some of the most sought-after questions in streaming / Kafka, such as what happens internally when SASL / Kerberos / SSL security is configured, and how the various Kafka components interact with each other. It could be a valuable resource for administrators, users, and application developers alike. Having this internal Kafka knowledge helps them configure, manage, and use Kafka systems in a more optimal way, with the fewest possible errors and mistakes.
Agenda:
- The various Kafka security models available: PLAINTEXT, SASL_PLAINTEXT, SASL_SSL, PLAINTEXT_SSL, and when to use which model
- Anatomy of each security model: an in-depth examination of these models and what happens internally when they are used, with real-life examples
- Do's and don'ts of Kafka security
- Common errors & troubleshooting
This talk is all about looking under the hood of Kafka security. Suitable for all levels, from beginner to expert.
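As a concrete illustration of the SASL_SSL model above, a broker-side `server.properties` fragment might look like the following; the hostname, file paths, and passwords are placeholders, and a matching JAAS configuration with the broker's Kerberos keytab is also required:

```properties
# Broker accepts Kerberos-authenticated clients over TLS
listeners=SASL_SSL://broker1.example.com:9093
security.inter.broker.protocol=SASL_SSL
sasl.enabled.mechanisms=GSSAPI
sasl.kerberos.service.name=kafka

# TLS key material (placeholder paths and passwords)
ssl.keystore.location=/etc/kafka/ssl/broker.keystore.jks
ssl.keystore.password=changeit
ssl.truststore.location=/etc/kafka/ssl/broker.truststore.jks
ssl.truststore.password=changeit
```

Clients connect with `security.protocol=SASL_SSL` and their own Kerberos credentials; misalignment between the listener protocol and the client's `security.protocol` is one of the most common errors the talk covers.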
Speaker
Vipin Rathor, Sr. Product Specialist (security), Hortonworks
Mesosphere and Contentteam: A New Way to Run Cassandra (DataStax Academy)
We, Ben Whitehead and Robert Stupp, will show you how to run Cassandra on Mesos. We will go through all the technical steps to plan, set up, and operate even large-scale Cassandra clusters on Mesos. Further, we illustrate how the Cassandra-on-Mesos framework helps you set up Cassandra on Mesos, schedule regular maintenance tasks, and manage hardware failures in the heart of your data center.
Lessons in Moving from Physical Hosts to Mesos (Raj Shekhar)
(speaker notes here : https://docs.google.com/document/d/12mXLYEFkEEd0pwOwD8bC1JQ8CPpx_PiRPXikHZ6MMYQ/pub )
t.co is the URL shortening service created by Twitter. As part of scaling up, t.co moved to using Mesos. We saw significant gains in deployment speed and scalability, and a reduction in operational headaches.
This talk will provide an introduction to Mesos + Aurora and cover how the t.co service migrated from running on physical hardware to Mesos. It will also cover the challenges t.co faced during the migration, the "gotchas", and debugging techniques for uncovering performance issues.
Agenda:
- Introduction to Mesos + Aurora
- Benefits of moving to Mesos
- Migration steps for moving t.co to Mesos
- Challenges faced and how t.co overcame them
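An Aurora job is described in a `.aurora` file, which is plain Python using Aurora's configuration DSL. The sketch below is illustrative only: the cluster, role, command line, and resource figures are invented, and this is not t.co's actual configuration:

```python
# service.aurora - hypothetical Aurora job definition (Aurora's Python DSL)
run_service = Process(
    name="url_shortener",
    cmdline="java -jar url-shortener.jar --port {{thermos.ports[http]}}",
)

task = Task(
    processes=[run_service],
    resources=Resources(cpu=1.0, ram=512 * MB, disk=1 * GB),
)

jobs = [
    Service(
        cluster="example",     # assumed cluster name
        role="www-data",
        environment="prod",
        name="url_shortener",
        task=task,
        instances=4,           # Aurora keeps 4 copies running across the Mesos cluster
    )
]
```

Aurora's scheduler then places the instances on Mesos agents and restarts them on failure, which is where much of the operational-headache reduction comes from.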
ABSTRACT: The ongoing big data revolution has transformed the way technology is used to empower new business segments like social networking and to transform old segments like traditional retail. However, the DNA used to build data processing platforms is evolving rapidly. There is a plethora of competing tools, technologies, and "religions" for how to build state-of-the-art data analysis frameworks. In this talk, I will go over five ways to build scalable, high-performance, long-lasting data analysis frameworks the wrong way. Surprisingly, the industry is full of examples of organizations building frameworks in this "wrong" way. Since the "right" way to build a technology framework depends on the key business drivers, it is my hope that this talk will spur a discussion on what the "right" way is for Pinterest. The talk will focus on technologies including "data plumbing" (e.g. tools in the Hadoop ecosystem) and statistical modeling methods (e.g. R and Python). In this talk, I'll try to connect with platform builders, data scientists, and business decision makers.
BIO: Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison, where he also earned his Ph.D. He has worked in the area of databases (now fashionably called “big data”) for over two decades. He has won several best paper awards, and industry research awards. He is the recipient of the Wisconsin COW teaching award, and the U. Michigan College of Engineering Education Excellence Award. He has a strong interest in seeing research ideas transition to actual products. His Ph.D. thesis work was acquired by NCR/Teradata in 1997, and he also co-founded Locomatix -- a startup that built a platform to power real-time data-driven mobile services. Locomatix became part of Twitter in 2013. He is an ACM Distinguished Scientist and an IEEE Senior Member. He also serves on the board of Lands’ End, and advises a number of startups.
What is the future of Hadoop?
What is the new future of Hadoop?
How is that different from the old one?
Here is how Ted Dunning answered these questions at the winter Hadoop Conference of Japan 2013.
Infrastructure Considerations for Analytical Workloads (Cognizant)
Using Apache Hadoop clusters and Mahout for analyzing big data workloads yields extraordinary performance; we offer a detailed comparison of running Hadoop in a physical vs. virtual infrastructure environment.
Slides for the talk given at MesosCon (https://mesoscon2015.sched.org/event/76ed472dbfb388b5f939dde31c7a3302#.Vd3-JyxViko)
Video is available at https://www.youtube.com/watch?v=lU2VE08fOD4
[KubeCon NA 2020] containerd: Rootless Containers 2020 (Akihiro Suda)
Rootless containers means running the container runtimes (e.g. runc, containerd, and kubelet), as well as the containers themselves, without host root privileges. The most significant advantage of rootless containers is that they mitigate potential container-breakout vulnerabilities in the runtimes, but they are also useful for isolating multi-user environments on HPC hosts. This talk will contain an introduction to rootless containers and deep-dive topics on recent updates such as Seccomp User Notification. The main focus will be on containerd (a CNCF Graduated Project) and its consumer projects, including Kubernetes and Docker/Moby, but other runtimes will be discussed as well.
https://sched.co/fGWc
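Rootless runtimes rely on user namespaces, which are typically configured through the subordinate ID files. A hedged example; the user name and ID ranges below are illustrative:

```
# /etc/subuid and /etc/subgid: grant user "alice" 65536 subordinate IDs
alice:100000:65536
```

Inside a rootless container, UID 0 then maps to the unprivileged host UID 100000, so a process that "breaks out" still holds no real root privileges on the host.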
With Hadoop 3.0.0-alpha2 being released in January 2017, it's time to take a closer look at the features and fixes of Hadoop 3.0.
We will look at Core Hadoop, HDFS, and YARN, and answer the emerging question: will Hadoop 3.0 be an architectural revolution like Hadoop 2 was with YARN & Co., or will it be more of an evolution, adapting to new use cases like IoT, machine learning, and deep learning (TensorFlow)?
Imagine this: You wake up one day in a world without technology – all the computers on the planet just disappeared. First of all, you’d wake up late because your smartphone alarm no longer exists.
My compilation of the changes and differences in the upcoming 3.0 version of Hadoop. Presented during a Meetup of the group https://www.meetup.com/Big-Data-Hadoop-Spark-NRW/
With the advent of Hadoop comes the need for professionals skilled in Hadoop administration, making it worthwhile to train as a Hadoop admin for better career, salary, and job opportunities.
HDFS Tiered Storage: Mounting Object Stores in HDFS (DataWorks Summit)
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Azure HDInsight and Amazon EMR. In these settings, but also in more traditional on-premises deployments, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between file systems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year’s Summit in San Jose. Lots of progress has been made since then and the feature is in active development at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and on Data2Day Meetup on 28.09.2017 in Heidelberg.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated by institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage keeps expanding alongside global internet usage, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as advancing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, often operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Opendatabay - Open Data Marketplace (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. The marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Big Data in Container; Hadoop Spark in Docker and Mesos
1. 1
Big Data in Container
Heiko Loewe @loeweh
Meetup Big Data Hadoop & Spark NRW 08/24/2016
2. 2
Why
• Fast Deployment
• Test/Dev Cluster
• Better Utilize Hardware
• Learn to manage Hadoop
• Test new Versions
• An appliance for continuous
integration/API testing
3. 3
Design
Master Container
- Name Node
- Secondary Name Node
- Yarn
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
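The design above can be sketched with plain Docker commands. This is a minimal sketch under stated assumptions: the image name `hadoop-cluster-img`, its `master`/`slave` entrypoint arguments, and the `MASTER_HOST` variable are hypothetical placeholders, not from the deck.

```shell
# Master container: Name Node, Secondary Name Node, YARN
# (image name and entrypoint arguments are hypothetical)
docker run -d --name hadoop-master -h master hadoop-cluster-img master

# Four slave containers: Node Manager + Data Node each,
# pointed at the master via an environment variable
for i in 1 2 3 4; do
  docker run -d --name hadoop-slave$i -h slave$i \
    -e MASTER_HOST=master hadoop-cluster-img slave
done
```

On a single host the containers can resolve each other via Docker's embedded DNS on a user-defined network; across hosts this is exactly where the overlay network of the next slides comes in.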
4. 4
More than 1 Host Needs an Overlay Net
The docker0 interface is not routed between hosts.
Overlay Network
A 1-host config is (almost) no problem; for 2 hosts and more we need an overlay network.
5. 5
Choice of the Overlay Network Impl.
Docker Multi-Host Network
• Backend: VXLAN.
• Fallback: none.
• Control plane: built-in, uses Zookeeper, Consul or Etcd for shared state.
CoreOS Flanneld
• Backend: VXLAN, AWS, GCE.
• Fallback: custom UDP-based tunneling.
• Control plane: built-in, uses Etcd for shared state.
Weave Net
• Backend: VXLAN via OVS.
• Fallback: custom UDP-based tunneling called “sleeve”.
• Control plane: built-in.
6. 6
Normal mode of operation is called FDP (fast data path), which works via OVS's data path kernel module (mainline since kernel 3.12). It's just another VXLAN implementation.
It has a sleeve fallback mode that works in userspace via pcap. Sleeve supports full encryption.
Weaveworks also has Weave DNS, Weave Scope and Weave Flux, providing introspection, service discovery & routing capabilities on top of Weave Net.
WEAVE NET
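As a command-level sketch of bringing up Weave Net across two hosts: `weave launch` and `weave env` are Weave's standard CLI entry points, while the peer IP and the image name are hypothetical examples.

```shell
# Host A: start the Weave router
weave launch

# Host B: start the router and peer with host A (IP is an example)
weave launch 10.0.0.1

# On each host: point the Docker client at the Weave proxy so that
# newly started containers are attached to the overlay network
eval $(weave env)

# Containers launched now get an address on the Weave network and
# can reach containers on the other host (image name hypothetical)
docker run -d --name hadoop-master -h master.weave.local hadoop-cluster-img
```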
7. 7
# /etc/sudoers, at the end:
vuser ALL=(ALL) NOPASSWD: ALL
# secure_path: append /usr/local/bin for weave
Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin
sudo groupadd docker
sudo gpasswd -a ${USER} docker
sudo chgrp docker /var/run/docker.sock
alias docker="sudo /usr/bin/docker"
Docker Adaptation (Fedora/CentOS/RHEL)
14. 14
• Containers (like Docker) are the foundation for agile software development
• The initial container design was stateless (12-factor app)
• Use cases have grown in the last few months (NoSQL, stateful apps)
• Persistence for containers is not easy
The Problem
15. 15
• Enables Persistence of Docker Volumes
• Enables the Implementation of
– Fast Bytes (Performance)
– Data Services (Protection / Snapshots)
– Data Mobility
– Availability
• Operations:
– Create, Remove, Mount, Path, Unmount
– Additional options can be passed to the volume driver
DOCKER Volume Manager API
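The operations listed above surface in the Docker CLI roughly as follows; the driver name `somedriver`, the `size` option, and the image name are hypothetical stand-ins for whichever volume plugin and image are installed.

```shell
# Create: a named volume via a volume plugin;
# -o passes additional options through to the driver
docker volume create --driver somedriver -o size=10G hdfs-data

# Mount / Path / Unmount: invoked by the Docker daemon
# when a container uses the volume
docker run -d -v hdfs-data:/hadoop/dfs/data datanode-img

# Remove: deletes the volume through the driver
docker volume rm hdfs-data
```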
16. 16
Persistent Volumes for CONTAINER
Container OS
Storage
/mnt/PersistentData
Container Container
-v /mnt/PersistentData:/mnt/ContainerData
Container Container
Docker Host
20. 20
Overlay Network
Stretch Hadoop w/ persistent Volumes
Host A
Host B
Easily stretch and shrink a cluster without losing the data
21. 21
Other similar Projects
• Bigtop Provisioner / Apache Foundation
https://github.com/apache/bigtop/tree/master/provisioner/docker
• Building Hortonworks HDP on Docker
http://henning.kropponline.de/2015/07/19/building-hdp-on-docker/
https://hub.docker.com/r/hortonworks/ambari-server/
https://hub.docker.com/r/hortonworks/ambari-agent/
• Building Cloudera CDH on Docker
http://blog.cloudera.com/blog/2015/12/docker-is-the-new-quickstart-option-for-apache-hadoop-and-cloudera/
https://hub.docker.com/r/cloudera/quickstart/
Watch out for overlay network topics
31. 31
What about the Data
Myriad only cares for the Compute
Master Container
- Name Node
- Secondary Name Node
- Yarn
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Myriad/
Mesos
Cares about
Has to be provided
outside of
Myriad/Mesos
32. 32
What about the Data
• Myriad only cares for Compute / Map Reduce
• HDFS has to be provided in other ways
Big Data New Realities
Big Data Traditional Assumptions:
• Bare-metal
• Data locality
• Data on local disks
Big Data New Realities:
• Containers and VMs
• Compute and storage separation
• In-place access on remote data stores
New Benefits and Value:
• Big-Data-as-a-Service
• Agility and cost savings
• Faster time-to-insights
33. 33
Options for HDFS Data Layer
• Pure HDFS Cluster (only Data Node running)
– Bare Metal
– Containerized
– Mesos based
• Enterprise HDFS Array
– EMC Isilon
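A sketch of the "pure HDFS cluster, containerized" option above: Data-Node-only containers with persistent host volumes, registering with a Name Node that is provided elsewhere. The image name, paths, and Name Node address are hypothetical.

```shell
# Each container runs only a Data Node; its block storage is a
# persistent volume on the host, so data survives the container
for i in 1 2 3; do
  docker run -d --name datanode$i -h datanode$i \
    -v /mnt/PersistentData/dn$i:/hadoop/dfs/data \
    -e NAMENODE_ADDR=namenode.example.com:8020 \
    hdfs-datanode-img
done
```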
35. 35
• Multi Tenancy
• Multiple HDFS Environments
sharing the same storage
• Quota possible on HDFS
Environments
• Snapshots of HDFS environments possible
• Remote Replication
• WORM option for HDFS
• Highly available HDFS infrastructure (distributed Name and Data Nodes)
• Storage efficient (usable/raw 0.8
compared to 0.33 with Hadoop)
• Shared Access HDFS / CIFS /
NFS/SFTP possible
• Maintenance equals Enterprise
Array Standard
• All major Distributions supported
EMC Isilon Advantages over classic
Hadoop HDFS
Ok, so there really is a way to do this, but this means tons of work. These Dev guys want everything instant and I am just one person. How should I be able to deliver this?