
Improve Big Data Workloads with Apache Hadoop 3 and Containerization



Join experts from Ventana Research and Hortonworks to see how we enable success to improve your big data workload processing.

https://hortonworks.com/webinar/improve-big-data-workloads-apache-hadoop-3-containerization/



  1. Improve Big Data Workloads with Apache Hadoop 3 and Containerization
  2. David Menninger, SVP & Research Director
     Responsible for the overall research direction of data, information and analytics technologies at Ventana Research. Covers Analytics, Big Data, and Information Management, along with additional specific research categories including IT Performance Management and IoT. Has over twenty-five years of experience bringing leading-edge data and analytics technologies to market.
     Previous roles:
     • Head of Business Development & Strategy at Pivotal (Dell/EMC)
     • VP of Marketing and Product Management at Vertica, Oracle, Applix, InforSense and IRI Software
     © 2018 Ventana Research | @dmenningerVR
  3. Billie Rinaldi, Principal Software Engineer
     • Currently working on the Apache Hadoop YARN service framework and containerization capabilities
     • Apache Software Foundation (ASF) Member
     • Apache Hadoop committer
     • 12 years of big data experience
     © Hortonworks Inc. 2011–2018. All rights reserved
  4. Improve Big Data Workloads with Apache Hadoop 3 and Containerization
     Educational Thought Leadership Webinar
     David Menninger, SVP & Research Director
     @ventanaresearch | In/ventanaresearch | blog.ventanaresearch.com | @dmenningerVR
  5. Agenda
     • Introduction / background
     • Apache Hadoop: background and history
     • New features in Hadoop 3
     • Value of these new features
     • Recommendations
  7. Apache Hadoop
     • 10+ years of history
     • Ignited the world of big data
     • Components of Hadoop: Hadoop Common, HDFS, YARN, MapReduce, plus more than 30 other projects such as Apache HBase, Apache Hive and Apache Spark
     • Vendors package & distribute Hadoop
     • Workloads have evolved from long-running batch jobs to a mix of shorter interactive queries and batch
  8. Prevalence of Hadoop for Big Data
     • Almost half of organizations are using Hadoop as their primary data lake platform
     • Hadoop usage has been consistent, while use of data warehouses/marts has declined
     • The data lake is a representative use case for Hadoop
     • Often involves cloud
  9. Storage Enhancement: Erasure Coding
     • Prior versions stored every block as three full replicas
     • Hadoop 3 adds an option for erasure coding, similar to RAID 5 or 6
     • Can reduce storage requirements by 50%
     • Adds some compute overhead, so best suited for colder storage
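The 50% figure is simple arithmetic. A minimal sketch (the RS-6-3 split of 6 data cells plus 3 parity cells matches one of the erasure coding policies shipped with Hadoop 3; the function itself is purely illustrative):

```python
# Illustrative only: raw storage written per byte of user data under
# 3x replication versus Reed-Solomon erasure coding.
def raw_storage_factor(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data for RS(data, parity)."""
    return (data_units + parity_units) / data_units

replication = 3.0                    # classic HDFS: three full replicas
rs_6_3 = raw_storage_factor(6, 3)    # RS-6-3: 6 data cells + 3 parity cells -> 1.5x
savings = 1 - rs_6_3 / replication   # 1 - 1.5/3.0 = 0.5, i.e. 50% less raw storage
print(rs_6_3, savings)
```

The overhead mentioned on the slide comes from the parity computation on write and from reconstruction reads, which is why colder data is the better fit.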
  10. Cold Storage Scenario: Algorithmic Trading Company
     • Billions of dollars under management
     • Back-tests algorithms against historical data; wants to test quickly, but is not extremely time sensitive
     • Executes validated algorithms against streaming data
  11. Other Storage System Enhancements
     • Additional cloud storage options: Microsoft Azure Data Lake, Aliyun Object Storage
     • Amazon S3 enhancements: S3Guard
     • Continues to expand Hadoop Compatible File System (HCFS) options
  12. Workload and Resource Management Improvements
     • YARN resource types: manage mixed resources with pooling & isolation
     • A great way to leverage GPUs & FPGAs (with 3.1)
     • Inter-queue AND intra-queue management
     • Rebalancing across disks within a node
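As a sketch of how resource types are declared, the fragment below follows the property names in the Hadoop 3.1 GPU documentation; verify the exact names against your Hadoop version, since resource-plugin support was still evolving at 3.1:

```xml
<!-- resource-types.xml on the ResourceManager: declare GPU as a
     schedulable resource type (Hadoop 3.1+). -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>

<!-- yarn-site.xml on each NodeManager: enable the GPU resource plugin
     so the node discovers and isolates its GPUs. -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource-plugins</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>
```

Applications then request GPUs the same way they request memory and vCores, and YARN handles the pooling and isolation described above.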
  14. Opportunistic Containers / Distributed Scheduling
     • Schedule tasks even if no resources are currently available
     • Tasks are queued for execution when resources become available
     • Run at lower priority than guaranteed containers
     • Distributed scheduler
     • Better cluster utilization
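As an assumption-laden sketch, the two properties below are taken from the Hadoop 3 opportunistic-containers documentation (the queue length value is just an example; confirm names and defaults for your version):

```xml
<!-- yarn-site.xml: enable opportunistic container allocation on the
     ResourceManager, and let each NodeManager queue a few such
     containers for execution when resources free up. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
    <value>true</value>
  </property>
  <property>
    <!-- Max opportunistic containers a NodeManager may queue (example value) -->
    <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
    <value>10</value>
  </property>
</configuration>
```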
  15. Improved NameNode Resiliency
     • Support for multiple standby NameNodes (previously only one standby was allowed)
     • Supported by configuring more JournalNodes
     • With n NameNodes, the cluster can tolerate the failure of up to n-1 of them
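A minimal sketch of a three-NameNode HA configuration; the nameservice ID, NameNode IDs and host names are placeholders, and the property names follow the HDFS HA documentation:

```xml
<!-- hdfs-site.xml: one active + two standby NameNodes for a single
     nameservice (Hadoop 3 lifts the single-standby limit). -->
<configuration>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2,nn3</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <!-- repeat the rpc-address (and http-address) entries for nn2 and nn3 -->
</configuration>
```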
  16. Containerized Workloads
     • Containers package a complete, self-contained environment, including the OS plus all executables & libraries
     • Run in a Hadoop cluster alongside Hadoop workloads
     • Isolated for reliability
     • Start quickly and are easily modified
     • More flexibility & scalability
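To make this concrete, a hypothetical YARN service specification in the style of the YARN Services API is sketched below; the service name, Docker image and resource sizes are placeholders, not values from this webinar:

```json
{
  "name": "web-demo",
  "version": "1.0",
  "components": [
    {
      "name": "web",
      "number_of_containers": 2,
      "artifact": { "id": "library/httpd:2.4", "type": "DOCKER" },
      "launch_command": "httpd-foreground",
      "resource": { "cpus": 1, "memory": "512" }
    }
  ]
}
```

A spec like this is submitted with `yarn app -launch web-demo /path/to/spec.json`, after which YARN schedules the Docker containers next to other Hadoop workloads.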
  17. Recommendations
     • Configure additional NameNodes for better resilience
     • Acquire and share GPUs across AI/ML workloads
     • Use Hadoop 3 features to better manage mixed workloads
     • Learn more about and leverage containerized workloads
  18. Questions?
     Twitter: @dmenningervr & @ventanaresearch
     LinkedIn: http://www.linkedin.com/company/ventana-research | https://www.linkedin.com/in/davidmenninger/
     Analyst Perspectives: http://davidmenninger.ventanaresearch.com
     Email: david.menninger@ventanaresearch.com
  19. Improve Big Data Workloads with Apache Hadoop® 3 and Containerization
     Educational Thought Leadership Webinar
     David Menninger, SVP & Research Director
     @ventanaresearch | In/ventanaresearch | blog.ventanaresearch.com | @dmenningerVR
  20. Containerized Workloads
  21. At its core, Big Data is about a platform & the workloads
     • Big Data = Platform + Workloads
     • Workloads: a container is sized by the user to perform a task; the amount of work performed is scaled horizontally by adding or removing containers; containers are scheduled by a platform
     • Platform: manages resources to run workloads; includes advanced scheduling capabilities, multi-tenant support (queues, capacity guarantees, …) and fine-grained scheduling
  22. Multiple Classes of Big Data Workloads
     • Jobs (MapReduce, Apache Hive + Apache Tez, Apache Spark): batch or interactive, short-lived, ephemeral
     • Services (Apache HBase, Apache Spark Streaming, Apache Storm, Apache Hive LLAP): long-running, persistent serving systems
     • Platforms (Apache Hadoop YARN, K8s, cloud): schedulers, orchestrators and resource management; the plumbing that supports a mix of jobs and services; security beyond client-server (tokens, …)
     The lines may be blurred in some cases.
  23. Container cloud for Hortonworks release testing
     [Architecture diagram: testing HDP and HDF releases in container clusters. Jenkins submits tests that launch Docker-based workers and HDP nodes on shared services: resource management (YARN), storage (HDFS), service discovery and REST API (YARN Services), security and governance (Ranger and Atlas), management and monitoring (Ambari).]
  24. Why Containers?
     • Improved hardware utilization through increased density: switching to containers improved density 2.5x for HDP testing; no virtual machine operating system overhead
     • Strong resource isolation via namespaces and cgroups
     • Better software packaging: applications and dependencies packaged together; improved reuse vs. VM images reduces data duplication; built-in distribution mechanism
     • Improved developer self-service: more control over the execution environment
  25. Many real-world lessons learned
     More than two years and 7.5 million containers later ...
  26. General Considerations for Containers in Production
     • Operating system stability: kernel, Docker, storage drivers; containers are tightly coupled to the OS kernel; advanced features may have poor support in many kernels; heavy writes can lead to panics
     • Fat containers vs. microservices: lift and shift existing applications, or decompose them
     • Stateless vs. stateful: is persistent storage required? What are the performance needs? In-memory state? State recovery?
     • Networking: many networking options / plugins
  27. Containerizing Batch/Ephemeral/Interactive Jobs
     • Jobs: the systems that power these workloads typically run on other platforms/orchestrators; most commonly these are analytic workloads
     • Examples: MapReduce, Apache Hive + Tez, Apache Spark
     • Benefits: packaging dependencies
     • Challenges: data locality and networking considerations; user identity propagation and security
  28. Containerizing Long Running Services
     • Services: often low-latency online serving use cases with specific resource requirements
     • Examples: Apache HBase, Apache Spark SQL/Streaming, Apache Storm, Apache Hive LLAP
     • Benefits: ease of deployment; horizontal scaling
     • Challenges: data locality considerations; client setup and discovery of services; token/key expiration
  29. Containerizing Platforms
     • Platforms: workloads run on these systems; platforms typically expect that they "own" the hardware/VM
     • Examples: YARN, K8s
     • Benefits: hardware utilization; leverage existing investment for more apps; developer clusters
     • Challenges: resource sharing / tracking (cgroups help); Docker in Docker? Resizing?
  30. Recommendations for Containerized Workloads
     • Evaluate the benefits and challenges of containerizing specific workloads to understand the tradeoffs being made.
     • Learn about the YARN Service Framework, which makes it easier to run containerized workloads on YARN.
     • Run the newest kernel possible. Test with real workloads to verify compatibility with the selected versions of the kernel, Docker, storage drivers, etc. SSDs may be needed for workloads with heavy writes.
  31. More Information
  32. HDP 3.0: Find Out More
     Check out our web page: https://hortonworks.com/products/data-platforms/hdp/
     Read our white paper: https://hortonworks.com/info/ventana-white-paper-hadoop-3/
     • How Apache Hadoop 3 brings high-performance computing for machine learning and deep learning
     • What drives flexibility and agility, and how it adds value for developers and architects
     • How growing volumes of intensive applications can benefit from the scalability and availability enhancements
  33. Questions?
