1 © Hortonworks Inc. 2011–2018. All rights reserved.
Containers and Big Data
Billie Rinaldi and Shane Kumpf
Software Engineering – Hortonworks YARN R&D

At Hortonworks, we run many, many tests...
Dozens of product releases a year
...over 30 open source projects
...across a dozen supported Linux operating systems
...and multiple backend databases
Result: Tens of thousands of tests per release

...on a container cloud powered by Apache Hadoop YARN
Testing HDP and HDF releases in container clusters
[Diagram labels: YARN; Jenkins; Worker (Docker) ×3; HDP (Docker) ×4; HDFS]
That’s right! HDP running in Docker containers on YARN!
2 years and 7 million containers later...
Many real world lessons learned

Let’s Talk About Containers

Containerization is Gaining Momentum
• Industry adoption continues
  • “Number of containerized applications will rise by 80% in the next two years” [1]
• Patterns emerging
  • Multi-cloud and hybrid strategies
  • Adoption of Microservices
• Exponential ecosystem growth
  • Dozens of container orchestrators
  • Thousands of plugins
• Market moves
1. http://i.dell.com/sites/doccontent/business/solutions/whitepapers/en/Documents/Containers_Real_Adoption_2017_Dell_EMC_Forrester_Paper.pdf

The Road to Containers - The Pursuit of Faster, Better, Cheaper
Physical Machines → VMs → Containers
VMs:
● Simplified IT
● Business agility
● Consolidated hardware
● Improved utilization
Containers:
● More efficient than VMs
● Lower cost for IT
● Business agility
● Hybrid deployment value
*** Older & newer systems coexist, with newer tech increasingly taking a larger share

Why Are Containers Gaining Popularity?
• Improved hardware utilization through increased density
  • No virtual machine operating system overhead
  • Image layer reuse limits data duplication on disk
• Strong resource isolation
  • Namespaces and cgroups
• Better software packaging
  • Package applications and dependencies together
  • Improved reuse vs VM images
  • Distribution mechanism
• Improved developer self service
  • More control over the execution environment
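
The "Namespaces and cgroups" point can be made concrete by looking at how a process reports its cgroup membership. The sketch below parses text in the `hierarchy-ID:controller-list:cgroup-path` format used by `/proc/<pid>/cgroup` on Linux. The sample contents and the `/docker/abc123` path are made up for illustration and are not taken from the talk.

```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse lines of the form 'hierarchy-ID:controller-list:cgroup-path'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The path may itself contain ':', so split at most twice.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = [c for c in controllers.split(",") if c]
        entries.append(CgroupEntry(int(hierarchy_id), names, path))
    return entries


# Hypothetical sample; a real container would show its own IDs and paths.
SAMPLE = """\
4:memory:/docker/abc123
3:cpu,cpuacct:/docker/abc123
0::/docker/abc123
"""

if __name__ == "__main__":
    for entry in parse_proc_cgroup(SAMPLE):
        print(entry)
```

On a Linux host the same function can be pointed at the contents of `/proc/self/cgroup` to see which cgroup a containerized process was placed in.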

Container Architecture Patterns
• Mix of services
  • Long lived services and ephemeral jobs
• Decoupled compute and storage
  • Scale independently
• Hybrid deployments
  • Desire for consistency between cloud and on-prem
Let’s Talk Big Data

The Road to Big Data - The Pursuit of Faster, Better, Cheaper
Siloed Data Systems (ERP, CRM, DBs, SAN, NAS) → Apache Hadoop Ecosystem
Apache Hadoop Ecosystem:
● Massive scale
● More efficient than siloed systems
● Lower cost for IT
● Business agility
● Hybrid deployment value prop
*** Older & newer systems coexist, with newer tech increasingly taking a larger share

What really made Big Data and Hadoop popular?
• Store and process data cost effectively at unprecedented scale
• Batch and interactive workloads on shared infrastructure
• Multi-tenant resource allocation capabilities
• Scale out
• Much more evolved today ...
  • Security & Governance systems
  • BI tooling
  • Operational tooling
  • Serving systems
  • ML
  • AI
  • Streaming
  • SQL
  • ...

At its core, Big Data is about a platform & the workloads
• Big Data = Platform + Workloads
• Workloads
  • User sizes the container to perform the work
  • Horizontally scale work by adding or removing containers
  • Container platform schedules containers
• Platform
  • Manages resources to run workloads on top
  • Includes advanced scheduling capabilities
  • Multi-tenant support (Queues, Capacity guarantees, …)
  • Fine grained scheduling
[Diagram: three workloads on top of a platform]
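
To illustrate the "Queues, Capacity guarantees" and "user sizes the container" points, here is a toy calculation of how a percentage-based capacity guarantee translates into a number of containers. It is only a sketch of the arithmetic, not how any real scheduler is implemented, and the queue names and sizes are invented for the example.

```python
from typing import Dict


def guaranteed_memory_mb(cluster_memory_mb: int,
                         queue_capacity_pct: Dict[str, float]) -> Dict[str, int]:
    """Split cluster memory across queues by their guaranteed percentage."""
    total = sum(queue_capacity_pct.values())
    if abs(total - 100.0) > 1e-6:
        raise ValueError(f"queue capacities must sum to 100, got {total}")
    return {queue: int(cluster_memory_mb * pct / 100)
            for queue, pct in queue_capacity_pct.items()}


def containers_that_fit(guaranteed_mb: int, container_mb: int) -> int:
    """How many user-sized containers fit inside a queue's guarantee."""
    if container_mb <= 0:
        raise ValueError("container size must be positive")
    return guaranteed_mb // container_mb


if __name__ == "__main__":
    # Hypothetical tenants and a hypothetical 100 GiB cluster.
    shares = guaranteed_memory_mb(102400, {"etl": 50, "adhoc": 30, "ml": 20})
    for queue, mb in shares.items():
        print(queue, mb, containers_that_fit(mb, 4096))
```

Horizontal scaling in this model is simply asking for more or fewer containers of the chosen size, bounded by what the queue is guaranteed or can borrow.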

Multiple Classes of Big Data Application Types
• Jobs
  • Batch or interactive, short lived, ephemeral
• Services
  • Long running, persistent
• Platforms
  • Schedulers, orchestrators, resource management
  • The plumbing for Jobs and Services
  • Supports a mix of Jobs and Services
  • Security beyond client-server (tokens …)
  • Understands locality and can move work closer to data
  • Networks getting faster, but speed of light ...
** The lines may be blurred in some cases **
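
"Understands locality and can move work closer to data" can be sketched as a placement preference: try a node that holds the data, then a node on the same rack, then anywhere. The function and node names below are hypothetical, and the three-level ordering is a simplified illustration rather than the behavior of a specific scheduler.

```python
from typing import Dict, List, Optional, Set, Tuple


def pick_node(replica_nodes: Set[str],
              free_nodes: List[str],
              rack_of: Dict[str, str]) -> Tuple[Optional[str], str]:
    """Choose a node for a task, preferring the nodes closest to its data."""
    # 1. Node-local: a free node that already stores the data.
    for node in free_nodes:
        if node in replica_nodes:
            return node, "node-local"
    # 2. Rack-local: a free node on the same rack as some replica.
    replica_racks = {rack_of[n] for n in replica_nodes if n in rack_of}
    for node in free_nodes:
        if rack_of.get(node) in replica_racks:
            return node, "rack-local"
    # 3. Anywhere: data must cross racks over the network.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"


if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2"}
    print(pick_node({"n1"}, ["n3", "n2"], racks))
```

When a workload runs in a container with its own network identity, the platform needs to map the container back to a host and rack for this kind of decision to keep working.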

Example Systems
Jobs         | Long Running Services | Platform
MapReduce    | HBase                 | YARN
Hive + Tez   | Spark Streaming       | K8s
Spark        | Storm                 | Cloud
             | Hive LLAP             |

Containers & Big Data Together

It’s Complicated
Many nuances depending on the workload and systems
• Workloads can be platforms in disguise
• Workloads have varying requirements on collocation
• Platforms and containers don’t always mix
• Different levels, fat containers to microservices
• Stepping back, it’s similar to Big Data and VMs - what is gained?

Full decomposition requires modification
[Diagram labels: Fat Containers; Microservice]

Considerations for Containerization
General Considerations

General Considerations for Containers in Production
• Operating System stability
  • Kernel, Docker, storage drivers
• Fat containers and microservices
  • Lift and shift existing applications or decompose
• Stateless and Stateful
  • Is persistent storage required? Performance needs?
  • In-memory state?
• Networking
  • Many networking options

System Stability
• Containers are tightly coupled to the OS kernel
  • Use of advanced features
  • Poor support in many kernels
  • One container can cripple the host
  • Run the newest kernel possible
• Docker storage driver selection
  • Heavy writes lead to panics
  • SSDs may be needed
  • Use overlay2 if workload allows it
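
As a small illustration of the kernel and storage driver points, the sketch below parses a kernel release string and generates a Docker `daemon.json` fragment that selects `overlay2` through the `storage-driver` key. The `(4, 0)` minimum used in the example is an assumption to check against Docker's documentation for your distribution, and the helper names are hypothetical.

```python
import json
import re
from typing import Tuple


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '4.15.0-112-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release: str, minimum: Tuple[int, int]) -> bool:
    """True if the running kernel is at or above the given (major, minor)."""
    return kernel_version(release) >= minimum


def docker_daemon_config(storage_driver: str = "overlay2") -> str:
    """Render daemon.json content that selects the storage driver."""
    return json.dumps({"storage-driver": storage_driver}, indent=2)


if __name__ == "__main__":
    import platform

    release = platform.release()
    # Assumed threshold; verify for your OS before relying on it.
    print(release, kernel_at_least(release, (4, 0)) if release[:1].isdigit() else "n/a")
    print(docker_daemon_config())
```

The generated JSON would typically be written to `/etc/docker/daemon.json`, followed by a Docker daemon restart.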

Fat Containers and Microservices
• Lift and shift
  • “Containers as VMs”
  • Need an init system
  • systemd in containers comes with gotchas
• Full decomposition is a journey
  • Steps in between: fat, hollow, thin, skinny
  • Lift and shift is typically the first phase

Stateless and Stateful
• What kind of state?
  • Data persistence? In-memory store?
• How does the application recover state?
  • “Checkpoint” the container?
• Performance impact
  • Impact of remote data access?

Networking
• The great thing about networking is all the options /s
  • Burden is on operations
• IP per container
  • no NAT
• Platforms offload network functions
  • Leverage plugins
  • Heavy use of software defined networking
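
The "IP per container, no NAT" model means every container gets its own routable address from a subnet rather than sharing the host's address through port mapping. A minimal sketch of that bookkeeping, using Python's standard `ipaddress` module, is shown here. The subnet and container names are hypothetical, and real network plugins usually also reserve addresses (for a gateway, for example), which this toy allocator ignores.

```python
import ipaddress
from typing import Dict, List


def assign_container_ips(subnet: str, containers: List[str]) -> Dict[str, str]:
    """Give each container its own address from the subnet, in order."""
    # iter() is needed because hosts() may return a list for tiny networks.
    hosts = iter(ipaddress.ip_network(subnet).hosts())
    assigned = {}
    for name in containers:
        try:
            assigned[name] = str(next(hosts))
        except StopIteration:
            raise ValueError(f"subnet {subnet} has no free address for {name}") from None
    return assigned


if __name__ == "__main__":
    print(assign_container_ips("10.0.5.0/29", ["hbase-master", "regionserver-0"]))
```

Because each container is addressed directly, peers can connect to it on its native ports, which matters for services that advertise their own host and port to clients.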
Considerations for Jobs

Workloads: Considerations for Batch/Ephemeral/Interactive
• Summary: The systems that power these workloads typically run on other platforms/orchestrators. Most commonly these are analytic workloads.
• Examples: MapReduce, Apache Hive + Tez, Apache Spark
• Benefits
  • Packaging dependencies
• Challenges
  • Data locality and networking considerations
  • User identity propagation and security

Demo - Running pyspark in Containers
• What - Running a Spark pandas UDF application using containerization (based on example [1])
• Why - Great example of packaging complex dependencies with an application
• Details
– Launch pyspark shell with env vars specifying docker runtime and image name
– Run a simple ordinary least squares (OLS) regression example
1. https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
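
The "env vars specifying docker runtime and image name" step can be sketched as follows. The code builds a `pyspark` command line that passes the YARN container runtime environment variables `YARN_CONTAINER_RUNTIME_TYPE` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE` to both the application master and the executors through Spark's `spark.yarn.appMasterEnv.*` and `spark.executorEnv.*` settings. The image name is hypothetical, and the exact options used in the demo are not shown in the slides, so treat this as one plausible way to do it rather than the presenters' command.

```python
from typing import Dict, List


def docker_runtime_conf(image: str) -> Dict[str, str]:
    """Spark settings that ask YARN to run containers with the Docker runtime."""
    env = {
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
    }
    conf = {}
    # Apply the same environment to the application master and the executors.
    for prefix in ("spark.yarn.appMasterEnv.", "spark.executorEnv."):
        for key, value in env.items():
            conf[prefix + key] = value
    return conf


def pyspark_command(image: str) -> List[str]:
    """Assemble the shell command as an argument list."""
    cmd = ["pyspark", "--master", "yarn"]
    for key, value in sorted(docker_runtime_conf(image).items()):
        cmd += ["--conf", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    # Hypothetical image that bundles Python, pandas, and the other dependencies.
    print(" ".join(pyspark_command("example/pyspark-pandas:latest")))
```

The benefit highlighted on the slide is that the Python dependencies live in the image, so nothing has to be installed on the cluster nodes themselves.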
Considerations for Services

Considerations for Long Running Services
• Summary: These are typically serving systems with varying requirements. In many cases they are low-latency online serving use cases with specific resource requirements.
• Examples: Apache HBase, Apache Spark SQL/Streaming, Apache Storm, Apache Hive LLAP
• Benefits
  • Ease of deployment
  • Vertical scaling
• Challenges
  • Data locality considerations
  • Static networks expected
  • Token/key expiration
  • Short circuit reads lose value

Demo - Running HBase in Containers
• What - Running Containerized HBase on YARN
• Why - Demonstrate long running application and ease of scaling
• Details
– Launch HBase application using YARN service framework
– Verify HBase status through master UI
– Flex application to add regionserver(s)
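
The launch and flex steps above can be pictured with a small service definition. The sketch builds a JSON document in the general shape accepted by the YARN service framework (a named service with components, each having a container count, a Docker artifact, a launch command, and a resource request) and models "flex" as changing a component's container count. The image name, launch commands, and resource sizes are assumptions for illustration; the actual spec used in the demo is not shown in the slides.

```python
import copy
import json
from typing import Any, Dict


def hbase_service_spec(image: str, regionservers: int) -> Dict[str, Any]:
    """Build an illustrative service definition with a master and regionservers."""
    def component(name: str, count: int, command: str) -> Dict[str, Any]:
        return {
            "name": name,
            "number_of_containers": count,
            "artifact": {"id": image, "type": "DOCKER"},
            "launch_command": command,  # assumed command, adjust for the image
            "resource": {"cpus": 1, "memory": "2048"},
        }

    return {
        "name": "hbase-demo",  # hypothetical service name
        "version": "1.0.0",
        "components": [
            component("master", 1, "hbase master start"),
            component("regionserver", regionservers, "hbase regionserver start"),
        ],
    }


def flex(spec: Dict[str, Any], component_name: str, count: int) -> Dict[str, Any]:
    """Return a copy of the spec with one component scaled to `count` containers."""
    updated = copy.deepcopy(spec)
    for comp in updated["components"]:
        if comp["name"] == component_name:
            comp["number_of_containers"] = count
            return updated
    raise KeyError(f"no component named {component_name!r}")


if __name__ == "__main__":
    spec = hbase_service_spec("example/hbase:latest", regionservers=1)
    print(json.dumps(flex(spec, "regionserver", 3), indent=2))
```

In practice the file would be handed to the YARN service CLI or REST API, and flexing would be requested for the running application rather than by editing the file.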
Considerations for Platforms

Considerations for Platforms
• Summary: Workloads run on these systems. Many platforms expect that they “own” the hardware/VM, which differs from workloads.
• Examples: YARN, K8s
• Benefits
  • Hardware utilization
  • Leverage existing investment for more apps
  • Developer clusters
• Challenges
  • Resource sharing
  • Networking considerations
  • User propagation

Running platforms on container platforms
• Resource management challenges – running two schedulers
  • Running YARN workloads on top of K8s
  • Running K8s as a long running service on YARN
  • Side-by-side, with some elastic CPU that is moved back and forth
  • Full scheduler integration?
• How are resources shared?
  • CPU is elastic and shareable, but memory is not
  • How do the schedulers cooperate to communicate resource consumption?
  • Containers added and removed on demand / possibly resized
  • Cgroups for resource tracking
• What about storage?
  • DataNodes are not moveable, storage is not elastic
  • IO bandwidth is shareable, but isolation is critical to HDFS/Big Data apps
  • Leads to use of fat containers or bare metal for DataNodes and YARN workloads
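
The "side-by-side with elastic CPU" option can be modeled as two schedulers with fixed memory shares that lend idle vcores to each other. This is a toy model of the bookkeeping only. The `Share` type, the numbers, and the lending rule are invented for illustration, and it does not describe an existing YARN or Kubernetes feature.

```python
from dataclasses import dataclass


@dataclass
class Share:
    """Resources currently owned by one scheduler on a node."""
    vcores: int
    memory_mb: int


def lend_cpu(lender: Share, borrower: Share, vcores: int, lender_in_use: int) -> None:
    """Move idle vcores from one scheduler to the other; memory never moves."""
    if vcores <= 0:
        raise ValueError("vcores to lend must be positive")
    idle = lender.vcores - lender_in_use
    if vcores > idle:
        raise ValueError(f"only {idle} idle vcores available to lend")
    lender.vcores -= vcores
    borrower.vcores += vcores


if __name__ == "__main__":
    yarn = Share(vcores=32, memory_mb=131072)
    k8s = Share(vcores=16, memory_mb=65536)
    lend_cpu(yarn, k8s, vcores=8, lender_in_use=20)
    print(yarn, k8s)
```

Memory is deliberately left out of the transfer, mirroring the slide's point that CPU can be throttled and handed back while memory in use cannot be reclaimed without killing or resizing containers.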

Demo - Running Kubernetes on YARN
• What - K8s on YARN demo
• Why - Proof of concept
• Details
– Launch K8s application through YARN service framework
– View K8s dashboard
– Launch simple HTTP app on K8s through dashboard
Summary
Each application requires its own considerations when running in
containers. There are many advantages to running big data workloads
in containers, but it is important to understand the trade-offs that
must be made.
Thank you to the Apache Software Foundation!
Apache Hadoop, Apache Spark, Apache HBase, Apache Hive, and
Apache Storm are trademarks of the Apache Software Foundation.
http://www.apache.org/
Thank you
