Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed big data applications like Apache Hadoop and Apache Spark. This session at Strata + Hadoop World in New York City (September 2016) explores various solutions and tips to address the challenges encountered while deploying multi-node Hadoop and Spark production workloads using Docker containers.
Some of these challenges include container life-cycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. BlueData is “all in” on Docker containers—with a specific focus on big data applications. BlueData has learned firsthand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy big data workloads using Docker.
This session by Thomas Phelan, co-founder and chief architect at BlueData, discusses how to securely network Docker containers across multiple hosts and how to achieve high availability across distributed big data applications and hosts in your data center. Since we’re talking about very large volumes of data, performance is a key factor; Thomas shares some of the storage options implemented at BlueData to achieve near bare-metal I/O performance for Hadoop and Spark using Docker, as well as lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52042
This is an introduction to the InterPlanetary File System (IPFS).
I had hoped to write it entirely in Korean, but due to time constraints I was not able to translate the English sources from the midpoint onward.
(I will revise it later if time allows.)
The sources for all figures and charts are cited as links.
Please use this material to understand the overall flow, and I recommend also reading each of the original linked articles.
How Spark is Making an Impact at Goldman Sachs, by Vincent Saulys (Spark Summit)
In this keynote, Vincent Saulys discusses how Spark was brought into Goldman, the impact it's making today (and lessons learned along the way), and where they expect to apply Spark in the future.
This presentation covers how the application deployment model evolved from bare-metal servers to the Kubernetes world.
In addition to the theoretical material, you will find URLs for free Katacoda workshops with hands-on exercises to help you understand the details of each topic.
Hello, Kafka! (An Introduction to Apache Kafka), Timothy Spann
Hello Apache Kafka
An introduction to Apache Kafka with Timothy Spann and Carolyn Duby, Cloudera principal engineers.
We also demo Flink SQL, SMM, SSB, Schema Registry, Apache Kafka, Apache NiFi, and the public cloud (AWS).
OpenShift has a mechanism for building and deploying applications, and Jenkins is a tool used for continuous integration/delivery/deployment. If we combine these, we can create a CI/CD pipeline that will allow us to promote application builds and make them available in our OSE instance.
Video - https://youtu.be/IreIK-jICgY
Kafka High Availability in multi data center setup with floating Observers wi..., HostedbyConfluent
Enabling high availability in a cluster setup that spans different data centers is challenging, and even more so when using just two data centers, which is not ideal for Kafka HA at all. But this is the reality for most organizations, as they reuse the same data centers previously used for database HA.
In this presentation we will see how to use the Kafka Observer feature to address this challenge, with an additional tweak to distribute load evenly among Observers and ordinary Brokers and to let them float between data centers. The whole demo is supported by Infrastructure-as-Code automation through Ansible.
Application Timeline Server - Past, Present and Future, Varun Saxena
How the YARN Application Timeline Server evolved from the Application History Server to Application Timeline Server v1 to ATSv2 (ATS next-gen), which is currently under development.
This slide deck was presented at the Hadoop Big Data Meetup at eBay, Bangalore, India.
A Comprehensive Introduction to Kubernetes. This slide deck serves as the lecture portion of a full-day Workshop covering the architecture, concepts and components of Kubernetes. For the interactive portion, please see the tutorials here:
https://github.com/mrbobbytables/k8s-intro-tutorials
A brief introduction to Apache Kafka, describing its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
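To give a feel for Kafka Streams as mentioned above, here is a minimal, illustrative Java sketch (not from the presentation); the broker address and topic names are placeholder assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StreamsSketch {
    public static void main(String[] args) {
        // Basic configuration; broker address and topic names are placeholders.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "intro-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from an input topic, keep non-empty records, write to an output topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");
        source.filter((key, value) -> value != null && !value.isEmpty())
              .to("filtered-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

A Kafka Connect source and sink could sit on either end of this pipeline; Streams handles only the transformation step here.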
Implementing Exactly-once Delivery and Escaping Kafka Rebalance Storms with Y..., HostedbyConfluent
"Even though stream processing has come a long way in the last few years, ensuring exactly-once delivery remains a difficult problem to solve.
This becomes an even bigger challenge when your consumers are distributed applications, and their Kubernetes pods can be scaled-out, scaled-in or simply restarted at any given moment, causing Apache Kafka to go into a “rebalance storm”.
In this talk, we’ll walk you through how we implemented exactly-once delivery with Kafka by managing Kafka transactions the right way, and how we escaped endless rebalance storms when running hundreds of consumers on the same Kafka topic.
We will discuss the issues we faced building Akamai’s data ingestion infrastructure on Azure, processing malicious traffic at internet scale.
The session covers (see the sketch after this list for a flavor of the transactional API):
- Kafka delivery semantics
- Kafka transactional API
- Kafka “anti-rebalance” tips and tricks
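To make the transactional API above concrete, here is a minimal, illustrative Java producer sketch (not the speakers' code); the broker address, topic name, and transactional id are placeholder assumptions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // placeholder broker
        props.put("transactional.id", "ingest-pipeline-tx-1"); // must be stable per producer instance
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions(); // fences off zombie producers sharing the same transactional.id

        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("events", "key", "value"));
            // In a consume-transform-produce loop you would also call
            // producer.sendOffsetsToTransaction(...) here, so consumed offsets
            // are committed atomically with the produced records.
            producer.commitTransaction();
        } catch (ProducerFencedException e) {
            producer.close();            // another instance took over; cannot recover
        } catch (KafkaException e) {
            producer.abortTransaction(); // recoverable error: roll back and retry
        }
    }
}
```

Setting transactional.id implicitly enables the idempotent producer, which is what allows aborted transactions to be retried without duplicates.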
Kafka for Real-Time Replication between Edge and Hybrid CloudKai Wähner
Not all workloads allow cloud computing. Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration.
This session explores architectures and design patterns for software and hardware considerations to deploy hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell and Confluent Cloud.
Kubernetes currently has two load balancing modes: userspace and iptables. Both have limitations in scalability and performance. We introduced IPVS as a third kube-proxy mode, which scales the Kubernetes load balancer to support 50,000 services. Beyond that, the control plane needs to be optimized in order to deploy 50,000 services. We will introduce alternative solutions and our prototypes, with detailed performance data.
CI/CD with an Idempotent Kafka Producer & Consumer | Kafka Summit London 2022, HostedbyConfluent
"Idempotence is a mathematical requirement of particular operations where the operation can be applied multiple times without changing the result beyond the initial application.
The main driver behind the idempotency requirement is often to handle duplicated messages. As developers and architects, we need to pay close attention to how we deal with our production data during new deployments to ensure we are not losing any data, duplicating messages, or introducing malformed data into our system. Furthermore, we need to figure out how to automate the process and add testing guarantees to prevent any potential human error.
In this session, you will learn about the idempotent Kafka Producer & Consumer architecture and how to automate the CI/CD process with open-source tools."
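For context (not from the session itself), enabling idempotence on a plain Java producer is a single configuration switch; the broker address below is a placeholder.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class IdempotentProducerSketch {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotence: the broker de-duplicates retried sends using the producer id
        // and per-partition sequence numbers; this implies acks=all.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        return new KafkaProducer<>(props);
    }
}
```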
High-speed Database Throughput Using Apache Arrow Flight SQL, ScyllaDB
Flight SQL is a revolutionary new open database protocol designed for modern architectures. Key features in Flight SQL include a columnar-oriented design and native support for parallel processing of data partitions. This talk will go over how these new features can push SQL query throughput beyond existing standards such as ODBC.
Cluster-as-code. The Many Ways towards Kubernetes, QAware GmbH
iSAQB Software Architecture Gathering – Digital 2022, November 2022, Mario-Leander Reimer (@LeanderReimer, Principal Software Architect at QAware).
== Please download the slides if blurred! ==
Kubernetes is the de-facto standard when it comes to container orchestration. But why is there no established, standard, and uniform way to spin up and manage a single Kubernetes cluster, let alone a whole farm of them? Instead, a whole bunch of different and mostly incompatible ways towards Kubernetes exist today, each with its own pros and cons with regard to ease of use, flexibility, and many other requirements. In this session we will take a closer look at the different options available to create, manage, and operate Kubernetes clusters at scale.
Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable. Apache Kafka 0.10 (actually, since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself.
So a new Spark Streaming integration comes to the playground, with a design similar to the 0.8 Direct DStream approach. However, there are notable differences in usage, and many exciting new features. In this talk, we will cover the main differences between this new integration and the previous one (for Kafka 0.8), and why Direct DStreams have replaced Receivers for good. We will also see how to achieve different semantics (at-least-once, at-most-once, exactly-once) with code examples.
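As a rough illustration of the Kafka 0.10 direct stream API the talk covers, here is a minimal Java sketch (not the talk's own code); the broker address, group id, and topic name are placeholder assumptions. Committing offsets only after processing, as below, yields at-least-once semantics; exactly-once additionally requires storing results and offsets atomically.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;

import java.util.*;

public class DirectStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("kafka-0-10-direct");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "direct-stream-sketch");
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", false); // we commit offsets ourselves

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("events"), kafkaParams));

        stream.foreachRDD(rdd -> {
            OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            rdd.foreach(record -> { /* process record.value() */ });
            // Committing after processing gives at-least-once semantics.
            ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsets);
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```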
Finally, we will briefly introduce the usage of this integration in Billy Mobile to ingest and process the continuous stream of events from our AdNetwork.
This presentation describes how Hortonworks is delivering Hadoop on Docker for a cloud-agnostic deployment approach, as presented at Cisco Live 2015.
Managing Docker Containers In A Cluster - Introducing Kubernetes, Marc Sluiter
Containerizing your applications with Docker is attracting more and more attention. While managing your Docker containers on your developer machine or on a single server is not much hassle, it can become uncomfortable very quickly when you want to deploy your containers in a cluster, whether in the cloud or on premises. How do you provide high availability, scaling, and monitoring? Fortunately there is a rapidly growing ecosystem around Docker, and there are tools available to support you with this. In this session I want to introduce you to Kubernetes, the Docker orchestration tool started and open-sourced by Google. Based on the experience with its data centers, Google uses some interesting declarative concepts in Kubernetes, like pods, replication controllers, and services, which I will explain. While Kubernetes is still quite a young project, it reached its first stable version this summer, thanks to many contributions by Red Hat, Microsoft, IBM, and more.
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily....., Jeffrey Breen
Part 3 of 3 in a series focusing on the infrastructure aspect of getting started with Big Data. This presentation demonstrates how to use Apache Whirr to launch a Hadoop cluster on Amazon EC2, easily.
Presented at the Boston Predictive Analytics Big Data Workshop, March 10, 2012. Sample code and configuration files are available on GitHub.
Hortonworks Technical Workshop: What's New in HDP 2.3, Hortonworks
The recently launched HDP 2.3 is a major advancement of Open Enterprise Hadoop. It represents the best of community-led development, with innovations spanning Apache Hadoop, Apache Ambari, Ranger, HBase, Spark, and Storm. In this session we will provide an in-depth overview of the new functionality and discuss its impact on new and ongoing big data initiatives.
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T..., Spark Summit
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed Big Data applications like Apache Spark.
Some of these challenges include container lifecycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. At BlueData, we’re “all in” on Docker containers – with a specific focus on Spark applications. We’ve learned first-hand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy Big Data workloads using Docker.
In this session, you’ll learn about networking Docker containers across multiple hosts securely. We’ll discuss ways to achieve high availability across distributed Big Data applications and hosts in your data center. And since we’re talking about very large volumes of data, performance is a key factor. So we’ll discuss some of the storage options we explored and implemented at BlueData to achieve near bare-metal I/O performance for Spark using Docker. We’ll share our lessons learned as well as some tips and tricks on how to Dockerize your Big Data applications in a reliable, scalable, and high-performance environment.
Lessons Learned from Dockerizing Spark Workloads, BlueData, Inc.
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed Big Data applications like Apache Spark.
Some of these challenges include container lifecycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. BlueData is “all in” on Docker containers – with a specific focus on Spark applications. They’ve learned first-hand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy Big Data workloads using Docker.
This session at Spark Summit in February 2017 (by Thomas Phelan, co-founder and chief architect at BlueData) described lessons learned as well as some tips and tricks on how to Dockerize your Big Data applications in a reliable, scalable, and high-performance environment.
In this session, Tom described how to network Docker containers across multiple hosts securely. He discussed ways to achieve high availability across distributed Big Data applications and hosts in your data center. And since we’re talking about very large volumes of data, performance is a key factor. So Tom discussed some of the storage options that BlueData explored and implemented to achieve near bare-metal I/O performance for Spark using Docker.
https://spark-summit.org/east-2017/events/lessons-learned-from-dockerizing-spark-workloads
Today, most any application can be “Dockerized.” However, there are special challenges when deploying a distributed application such as Spark on containers. This session will describe how to overcome these challenges in deploying Spark on Docker containers, with many practical tips and techniques for running Spark in a container environment.
Containers are typically used to run stateless applications on a single host. There are significant real-world enterprise requirements that need to be addressed when running a stateful, distributed application in a secure multi-host container environment.
There are decisions to be made concerning which tools and infrastructure to use. There are many choices with respect to container managers, orchestration frameworks, and resource schedulers that are readily available today, and some that may be available tomorrow, including:
• Mesos
• Kubernetes
• Docker Swarm
Each has its own strengths and weaknesses; each has unique characteristics that may make it suitable, or unsuitable, for Spark. Understanding these differences is critical to the successful deployment of Spark on Docker containers.
This session will describe the work done by the BlueData engineering team to run Spark inside containers, on a distributed platform, including the evaluation of various orchestration frameworks and lessons learned. You will learn how to apply practical networking and storage techniques to achieve high performance and agility in a distributed, container environment.
Speaker
Thomas Phelan, Chief Architect, BlueData, Inc.
Big-Data-as-a-Service (BDaaS) in an enterprise environment requires meeting the often contradictory goals of (1) providing your data scientists, analysts, and data engineers with a self-service consumption model; (2) delivering agile and scalable on-demand infrastructure for the rapidly evolving ecosystem of big data frameworks and application software; while (3) ensuring enterprise-grade capabilities for isolation, security, monitoring, etc.
In this presentation at our BDaaS meetup in Santa Clara, Tom Phelan (chief architect and co-founder of BlueData) reviewed these goals and how to resolve the potential contradictions. He also discussed the infrastructure, application, user experience, security, and maintainability considerations required before selecting (or designing and building) a Big-Data-as-a-Service platform for an enterprise big data deployment.
More info on this BDaaS meetup can be found at: http://www.meetup.com/Big-Data-as-a-Service/events/233999817
Adopting Docker for production applications and services used to be hard. You had to hand-roll a lot of the underlying infrastructure and write lots of custom code for service discovery, load balancing, orchestration, desired state, etc. Today, with the rise of open source container orchestration platforms and cloud-native offerings, it's a lot easier to get up and running.
Github repo for demo: https://github.com/elabor8/dockertalk
Meetup presentation on Continuous Integration with Docker on Amazon Web Services (AWS). The presentation covers the benefits of Docker on AWS, along with advanced Docker patterns and lessons learned.
Introduction to Docker and Kubernetes. Learn how this helps you build scalable and portable applications in the cloud. It introduces the basic concepts of Docker and its differences from virtualization, then explains the need for orchestration and walks through some hands-on experiments with Docker.
Docker right now provides great value in the enterprise, but the value proposition is more about developer productivity than scale-out.
Docker benefits include resource management, environment management, continuous delivery, developer and operations collaboration, and hybrid workloads.
Take care in its introduction. Consider Docker as just part of an overall toolkit; you don't need to go "full stack" to gain value.
SQL Server is container-ready. This deck covers some of the common ideas, misconceptions, myths, and realities of databases like SQL Server in a DevOps model.
Best Practices for Running Kafka on Docker Containers, BlueData, Inc.
Docker containers provide an ideal foundation for running Kafka-as-a-Service on-premises or in the public cloud. However, using Docker containers in production environments for Big Data workloads using Kafka poses some challenges – including container management, scheduling, network configuration and security, and performance.
In this session at Kafka Summit in August 2017, Nanda Vijaydev of BlueData shared lessons learned from implementing Kafka-as-a-Service with Docker containers.
https://kafka-summit.org/sessions/kafka-service-docker-containers
Demystifying Containerization Principles for Data Scientists, Dr Ganesh Iyer
Demystifying Containerization Principles for Data Scientists: an introductory tutorial on how Docker can be used as a development environment for data science projects.
Similar to Lessons Learned Running Hadoop and Spark in Docker Containers
Introduction to KubeDirector - SF Kubernetes Meetup, BlueData, Inc.
Presentation from San Francisco Kubernetes Meetup on October 30, 2018
https://www.meetup.com/San-Francisco-Kubernetes-Meetup/events/255431002
What is KubeDirector? - Tom Phelan & Joel Baxter, BlueData
Kubernetes is clearly the container orchestrator of choice for cloud-native stateless applications. And with the introduction of StatefulSets and Persistent Volumes it is becoming possible to run stateful applications on Kubernetes.
Now the new KubeDirector project allows users to manage complex stateful clusters for AI, machine learning, and big data analytics on Kubernetes without writing a single line of Go code.
KubeDirector is an open source Apache project that uses the standard Kubernetes custom resource functionality and API extensions to deploy and manage complex stateful scale-out application clusters.
This session will provide an overview of the KubeDirector architecture, show how to author the metadata and artifacts required for an example stateful application (e.g. with Spark, Jupyter, and Cassandra), and demonstrate the deployment and management of the cluster on Kubernetes using KubeDirector.
https://github.com/bluek8s/kubedirector
Dell EMC Ready Solutions for Big Data are powered by the BlueData EPIC software platform - for on-demand provisioning and automation. These integrated solutions enable a cloud-like experience for Big-Data-as-a-Service (BDaaS) while ensuring the enterprise-grade security and performance of on-premises infrastructure.
With Dell EMC Ready Solutions for Big Data, customers can rapidly deploy their analytics and machine learning workloads in a secure multi-tenant architecture, for multiple different user groups running on shared infrastructure. Their users can quickly and easily provision distributed environments for Cloudera, Hortonworks, Kafka, MapR, Spark, TensorFlow, as well as other tools.
The new Ready Solutions include everything that customers need to enable BDaaS on-premises – including BlueData EPIC software as well as Dell EMC hardware, consulting, deployment, and support services.
To learn more, visit www.dellemc.com/bdaas
With BlueData, you can spin up instant containerized environments for the Hortonworks Data Platform (HDP) and other Big Data analytics and machine learning workloads — providing your data science teams with on-demand environments for greater agility. You can decouple compute from storage resources, to improve efficiency and reduce costs. And you can ensure the enterprise-grade security and governance that your IT teams require.
BlueData has completed certification through the rigorous Hortonworks QATS (Quality Assured Testing Suite) program for deploying HDP in a containerized environment. This certification enables Hortonworks and BlueData to provide best-in-class support and high performance for their customers’ existing and future investments in HDP.
“We’ve seen rapidly growing interest in running HDP on containers, therefore it was key that we work closely with BlueData to benefit those users,” said Scott Andress, vice president of global channels & alliances at Hortonworks. “They passed our most rigorous QATS certification tests, validating that BlueData provides complete interoperability and high performance for customers running HDP in containerized environments.”
How to Protect Big Data in a Containerized Environment, BlueData, Inc.
Every enterprise spends significant resources to protect its data. This is especially true in the case of big data, since some of this data may include sensitive or confidential customer and financial information. Common methods for protecting data include permissions and access controls as well as the encryption of data at rest and in flight.
The Hadoop community has recently rolled out Transparent Data Encryption (TDE) support in HDFS. Transparent Data Encryption refers to the process whereby data is transparently encrypted by the big data application writing the data; it is not decrypted again until it is accessed by another application. The data is encrypted during its entire lifespan—in transit and at rest—except when it is being specifically accessed by a processing application.
TDE is an excellent approach for protecting data stored in data lakes built on the latest versions of HDFS. However, it does have its challenges and limitations. Systems that want to use TDE require tight integration with enterprise-wide Kerberos Key Distribution Center (KDC) services and Key Management Systems (KMS). This integration isn’t easy to set up or maintain. These issues can be even more challenging in a virtualized or containerized environment where one Kerberos realm may be used to secure the big data compute cluster and a different Kerberos realm may be used to secure the HDFS filesystem accessed by this cluster.
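To make the "transparent" part concrete, here is an illustrative Java sketch (not BlueData code; the path is a placeholder): an application reading from an HDFS encryption zone uses the ordinary client API, and key retrieval and decryption happen under the hood.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TdeReadSketch {
    public static void main(String[] args) throws Exception {
        // No encryption-specific code: reading a file inside an HDFS
        // encryption zone looks identical to reading any other HDFS file.
        // The client transparently obtains the decrypted data encryption
        // key from the KMS (subject to Kerberos and KMS ACLs) and
        // decrypts the stream on the fly.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(new Path("/secure_zone/data.csv"))) { // placeholder path
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) > 0) {
                System.out.write(buffer, 0, n);
            }
        }
    }
}
```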
BlueData has developed significant expertise in configuring, managing, and optimizing access to TDE-protected HDFS. This session at the Strata Data Conference in March 2018 (by Thomas Phelan, co-founder and chief architect at BlueData) offers a detailed overview of how transparent data encryption works with HDFS, with a particular focus on containerized environments.
You’ll learn how HDFS TDE is configured and maintained in an environment where many big data frameworks run simultaneously (e.g., in a hybrid cloud architecture using Docker containers). Moreover, you’ll learn how KDC credentials can be managed in a Kerberos cross-realm environment to provide data scientists and analysts with the greatest flexibility in accessing data while maintaining complete enterprise-grade data security.
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63763
The BlueData EPIC™ software platform makes it simpler, faster, and more cost-effective to deploy Big Data infrastructure and applications such as Hadoop, Spark, Kafka, Cassandra, and more, whether on-premises or in the public cloud.
Bare-metal performance for Big Data workloads on Docker containers, BlueData, Inc.
In a benchmark study, Intel® compared the performance of Big Data workloads running on a bare-metal deployment versus running in Docker* containers with the BlueData® EPIC™ software platform.
This in-depth study shows that performance ratios for container-based Hadoop workloads on BlueData EPIC are equal to — and in some cases, better than — bare-metal Hadoop. For example, benchmark tests showed that the BlueData EPIC platform demonstrated an average 2.33% performance gain over bare metal, for a configuration with 50 Hadoop compute nodes and 10 terabytes (TB) of data. These performance results were achieved without any modifications to the Hadoop software.
This is a revolutionary milestone, and the result of an ongoing collaboration between Intel and BlueData software engineering teams.
This white paper describes the software and hardware configurations for the benchmark tests, as well as details of the performance benchmark process and results.
The BlueData EPIC software platform makes deployment of Big Data infrastructure and applications easier, faster, and more cost-effective – whether on-premises or on the public cloud.
With BlueData EPIC on AWS, you can quickly and easily deploy your preferred Big Data applications, distributions and tools; leverage enterprise-class security and cost controls for multi-tenant deployments on the Amazon cloud; and tap into both Amazon S3 and on-premises storage for your Big Data analytics.
Sign up for a free two-week trial at www.bluedata.com/aws
Enterprises have been using both Big Data and Cloud Computing technologies for years. Until recently, the two have not been combined. Now the agility and efficiency benefits of self-service elastic infrastructure are being extended to Big Data initiatives – whether on-premises or in the public cloud.
This session at Hadoop Summit in San Jose, California (June 2016) discusses the emerging category of Big-Data-as-a-Service (BDaaS) - representing the intersection of Big Data and Cloud Computing.
In this session, Kris Applegate (Cloud and Big Data Solution Architect at Dell) and Thomas Phelan (Co-Founder and Chief Architect at BlueData) outlined the following:
- Innovations that paved the way for Big-Data-as-a-Service
- Definition and categories of Big-Data-as-a-Service
- Key considerations for Big-Data-as-a-Service in the enterprise, including public cloud or on-premises deployment options
A video replay can also be found here: https://youtu.be/_ucPoTKuj8Q
Solution Brief: Real-Time Pipeline Accelerator, BlueData, Inc.
Get started with Spark Streaming, Kafka, and Cassandra for real-time data analytics.
BlueData makes it easy to deploy Spark infrastructure and applications on-premises. The BlueData EPIC software platform is purpose-built to simplify and accelerate the deployment of Spark, Hadoop, and other tools for Big Data analytics—leveraging Docker containers and virtualized infrastructure.
Our new Real-Time Pipeline Accelerator solution provides the software and professional services you need for building data pipelines in a multi-tenant environment for Spark Streaming, Kafka, and Cassandra. With help from the BlueData team, you’ll also have two end-to-end real-time data pipelines as a starting point.
Learn more about BlueData at www.bluedata.com
This white paper describes how BlueData enables virtualization of Hadoop and Spark workloads running on Intel architecture.
Even as virtualization has spread throughout the data center, Apache Hadoop continues to be deployed almost exclusively on bare-metal physical servers. Processing overhead and I/O latency typically associated with virtualization have prevented big data architects from virtualizing Hadoop implementations.
As a result, most Hadoop initiatives have been limited in terms of agility, with infrastructure changes such as provisioning a new server for Hadoop often taking weeks or even months. This infrastructure complexity continues to slow down adoption in enterprise deployments. Apache Spark is a relatively new big data technology, but interest is growing rapidly; many of these same deployment challenges apply to on-premises Spark implementations.
The BlueData EPIC software platform addresses these limitations, enabling data center operators to accelerate Hadoop and Spark implementations on Intel architecture-based servers.
For more information, visit intel.com/bigdata and bluedata.com
Accelerate Hadoop and Spark deployment in a multi-tenant lab environment for dev/test/QA, evaluation of multiple tools for Big Data analytics, and other use cases. BlueData provides a turnkey on-premises solution with software and services to get up and running in two weeks.
The new Big Data Lab Accelerator solution provides a full enterprise license of BlueData EPIC software along with the professional services needed to deploy an on-premises multi-tenant Big Data lab. Within two weeks, customers will have a lab environment to evaluate Big Data tools and spin up multiple Hadoop or Spark clusters for development, testing and quality assurance. As part of this deployment, BlueData will also work with customers to implement initial use cases for Big Data analytics.
Learn more about BlueData at www.bluedata.com
How to deploy Apache Spark in a multi-tenant, on-premises environment, BlueData, Inc.
Adoption of Apache Spark in the enterprise is increasing rapidly - it's become one of the fastest growing and most popular technologies in the Big Data ecosystem.
However, implementing an enterprise-ready, on-premises Spark deployment can be very complex and it requires expertise that is generally not available to all.
BlueData makes it easier to deploy Apache Spark on-premises. With BlueData, you can spin up virtual Spark clusters within minutes – providing secure, self-service, on-demand access to Big Data analytics and infrastructure. You can deploy Spark in standalone mode or with Hadoop / YARN. You can also build analytical pipelines and create Spark clusters using our RESTful APIs, and use web-based Zeppelin notebooks for interactive data analytics.
BlueData’s software platform leverages virtualization and Docker containers – combined with our own patent-pending innovations – to make it faster, and more cost-effective for enterprises to get up and running with a multi-tenant Spark deployment on-premises.
Learn more at www.bluedata.com
This presentation provides an overview of what’s new in the 2.0 release of the BlueData EPIC software platform.
BlueData’s EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall on-premises Big Data deployments. With BlueData, you can spin up Hadoop or Spark clusters in minutes rather than months – with the data and analytical tools that your data scientists need.
The BlueData EPIC 2.0 release leverages Docker containers to simplify Big Data clusters, supports Apache Zeppelin notebooks and other new functionality for Apache Spark, and includes an enhanced App Store that provides one-click access to Big Data distributions and analytics tools.
Learn more about BlueData at http://www.bluedata.com
This Big Data case study outlines the Hadoop infrastructure deployment for a Fortune 100 media and telecommunications company.
Hadoop adoption in this company had grown organically across multiple different teams, starting with “science projects” and lab initiatives that quickly grew and expanded. Going forward, some of the options they considered for their Big Data deployment included expanding their on-premises infrastructure and using a Hadoop-as-a-Service cloud offering.
Fortunately, they realized that there is a third option: providing the benefits of Hadoop-as-a-Service with on-premises infrastructure. They selected the BlueData EPIC software platform to virtualize their Hadoop infrastructure and provide on-demand access to virtual Hadoop clusters in a secure, multi-tenant model.
Learn more about this case study in the blog post at: http://www.bluedata.com/blog/2015/05/big-data-case-study-hadoop-infrastructure
BlueData Hunk Integration: Splunk Analytics for Hadoop, BlueData, Inc.
BlueData is working in partnership with Splunk to streamline and accelerate the deployment and adoption of Hunk: Splunk Analytics for Hadoop. The BlueData EPIC software platform now integrates Hunk with Hadoop clusters running on virtualized on-premises infrastructure.
Using Hunk with the BlueData EPIC platform, our joint customers can quickly provision virtual Hadoop clusters together with Hunk in a matter of minutes – providing their data scientists and analysts with the ability to rapidly detect patterns and find anomalies across petabytes of raw data in Hadoop.
Learn more at http://www.bluedata.com
BlueData makes on-premises Spark infrastructure easy.
With BlueData, you can spin up virtual Spark clusters within minutes – providing secure, on-demand access to Big Data analytics and infrastructure. You can use Spark with or without the Hadoop ecosystem of tools – using HDFS, Tachyon, or any shared storage system.
You can also build analytical pipelines and create Spark clusters using our RESTful APIs. BlueData’s software platform leverages virtualization and patent-pending innovations to make it simpler, faster, and more cost-effective to deploy Hadoop or Spark infrastructure on-premises.
Learn more at http://www.bluedata.com
This presentation provides an overview of the BlueData integration with Cloudera Manager. With this integration, customers of our BlueData EPIC software platform can leverage the power of Cloudera Manager for end-to-end Hadoop systems management and administration.
When the BlueData EPIC platform provisions a virtual CDH cluster, Cloudera Manager can be provisioned as well – so you can easily deploy, manage, monitor and perform diagnostics on your Hadoop cluster. Our customers can take advantage of the Cloudera Manager GUI to monitor their cluster, troubleshoot issues, and administer their Hadoop deployment.
Learn more about BlueData at http://www.bluedata.com
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Understanding Globus Data Transfers with NetSage, Globus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks worldwide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several example use cases that NetSage can answer, including:
- Who is using Globus to share data with my institution, and what kind of performance are they able to achieve?
- How many transfers has Globus supported for us?
- Which sites are we sharing the most data with, and how is that changing over time?
- How is my site using Globus to move data internally, and what kind of performance do we see for those transfers?
- What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Troubleshooting 9 Types of OutOfMemoryError, Tier1 app
Even though at the surface level 'java.lang.OutOfMemoryError' appears to be one single error, underneath there are 9 types of OutOfMemoryError. Each type has different causes, diagnosis approaches, and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
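For illustration (a deliberately broken sketch, not from the session), the most familiar of these types, "Java heap space", can be reproduced in a few lines:

```java
import java.util.ArrayList;
import java.util.List;

public class HeapSpaceSketch {
    public static void main(String[] args) {
        // Allocate 1 MB blocks and keep them reachable so the garbage
        // collector cannot reclaim them; the heap eventually fills up.
        List<byte[]> hog = new ArrayList<>();
        while (true) {
            hog.add(new byte[1024 * 1024]);
        }
    }
}
```

Run it with a small heap (e.g. -Xmx64m) and the JVM throws "java.lang.OutOfMemoryError: Java heap space" once allocation outpaces collection; the other eight types (GC overhead limit, Metaspace, unable to create new native thread, and so on) each need their own diagnosis.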
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Globus Compute with IRI Workflows - GlobusWorld 2024, Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of this work, the team is investigating ways to speed up the time to solution for many different parts of the DIII-D workflow, including how jobs are run on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks, and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Designing for Privacy in Amazon Web Services, KrzysztofKkol1
Data privacy is one of the most critical issues that businesses face. This presentation shares insights on the principles and best practices for ensuring the resilience and security of your workload.
Drawing on a real-life project from the HR industry, various challenges will be demonstrated: data protection, self-healing, business continuity, security, and transparency of data processing. This systematized approach allowed us to create a secure AWS cloud infrastructure that not only met strict compliance rules but also exceeded the client's expectations.
Globus Connect Server Deep Dive - GlobusWorld 2024, Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G..., Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Prosigns: Transforming Business with Tailored Technology Solutions, Prosigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Multiply Your Crypto Portfolio with the Innovative Features of Advanced Crypt..., Hivelance Technology
Cryptocurrency trading bots are computer programs designed to automate buying, selling, and managing cryptocurrency transactions. These bots utilize advanced algorithms and machine learning techniques to analyze market data, identify trading opportunities, and execute trades on behalf of their users. By automating the decision-making process, crypto trading bots can react to market changes faster than human traders.
Hivelance, a leading provider of cryptocurrency trading bot development services, stands out as a premier choice for crypto traders and developers. Hivelance boasts a team of seasoned cryptocurrency experts and software engineers who deeply understand the crypto market and the latest trends in automated trading. Hivelance leverages the latest technologies and tools in the industry, including advanced AI and machine learning algorithms, to create highly efficient and adaptable crypto trading bots.
A Comprehensive Look at Generative AI in Retail App Testing.pdf, kalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Modern design is crucial in today's digital environment, and this is especially true for SharePoint intranets. The design of these digital hubs is critical to user engagement and productivity enhancement. They are the cornerstone of internal collaboration and interaction within enterprises.
Accelerate Enterprise Software Engineering with Platformless, WSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Why React Native as a Strategic Advantage for Startup Innovation.pdf, ayushiqss
Do you know that React Native is being increasingly adopted by startups as well as big companies in the mobile app development industry? Big names like Facebook, Instagram, and Pinterest have already integrated this robust open-source framework.
In fact, according to a report by Statista, the number of React Native developers has been steadily increasing over the years, reaching an estimated 1.9 million by the end of 2024. This means that demand for this framework in the job market has been growing, making it a valuable skill.
But what makes React Native so popular for mobile application development? It offers excellent cross-platform capabilities, among other benefits. With React Native, developers can write code once and run it on both iOS and Android devices, saving time and resources and leading to shorter development cycles and faster time-to-market for your app.
Let’s take the example of a startup, which wanted to release their app on both iOS and Android at once. Through the use of React Native they managed to create an app and bring it into the market within a very short period. This helped them gain an advantage over their competitors because they had access to a large user base who were able to generate revenue quickly for them.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Advanced Flow Concepts Every Developer Should KnowPeter Caitens
Tim Combridge from Sensible Giraffe and Salesforce Ben presents some important tips that all developers should know when dealing with Flows in Salesforce.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Lessons Learned Running Hadoop and Spark in Docker Containers
1. Lessons Learned Running Hadoop and Spark in Docker
Thomas Phelan
Chief Architect, BlueData
@tapbluedata
September 29, 2016
2. Outline
• Docker Containers and Big Data
• Hadoop and Spark on Docker: Challenges
• How We Did It: Lessons Learned
• Key Takeaways
• Q & A
3. A Better Way to Deploy Big Data
Traditional Approach (IT serving Manufacturing, Sales, R&D, and Services):
• < 30% utilization
• Weeks to build each cluster
• Duplication of data
• Management complexity
• Painful, complex upgrades
A New Approach: Hadoop and Spark on Docker (Manufacturing, Sales, R&D, and Services with BI/Analytics tools):
• > 90% utilization
• No duplication of data
• Simplified management
• Multi-tenant
• Self-service, on-demand clusters
• Simple, instant upgrades
4. Deploying Multiple Big Data Clusters
Data scientists want flexibility:
• Different versions of Hadoop, Spark, et al.
• Different sets of tools
IT wants control:
• Multi-tenancy
- Data security
- Network isolation
5. Containers = the Future of Big Data
Infrastructure:
• Agility and elasticity
• Standardized environments (dev, test, prod)
• Portability (on-premises and cloud)
• Higher resource utilization
Applications:
• Fool-proof packaging (configs, libraries, driver versions, etc.)
• Repeatable builds and orchestration
• Faster app dev cycles
6. The Journey to Big Data on Docker
Start with a clear goal in sight. Begin with your Docker toolbox of a single container and basic networking and storage.
So you want to run Hadoop and Spark on Docker in a multi-tenant enterprise deployment?
Beware … there is trouble ahead.
7. Big Data on Docker: Pitfalls
Navigate the river of container managers:
• Swarm?
• Kubernetes?
• AWS ECS?
• Mesos?
Traverse the tightrope of network configurations:
• Docker Networking? Calico
• Kubernetes Networking? Flannel, Weave Net
Cross the desert of storage configurations:
• Overlay files?
• Flocker?
• Convoy?
8. Big Data on Docker: Challenges
Pass through the jungle of software compatibility. Tame the lion of performance. Finally you get to the top! Then comes the trip down the staircase of deployment mistakes.
9. Big Data on Docker: Next Steps?
But for deployment in the enterprise, you are not even close to being done …
You still have to climb past: high availability, backup/recovery, security, multi-host, multi-container, upgrades and patches.
10. Big Data on Docker: Quagmire
You realize it’s time to get some help!
11. How We Did It: Design Decisions
• Run Hadoop/Spark distros and applications unmodified
- Deploy all services that typically run on a single bare-metal host in a single container
• Multi-tenancy support is key
- Network and storage security
12. How We Did It: Design Decisions
• Images built to “auto-configure” themselves at time of instantiation
- Not all instances of a single image run the same set of services when instantiated
• Master vs. worker cluster nodes (see the sketch below)
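As a minimal sketch of this auto-configuration (not BlueData's actual implementation), the image can ship a startup script that branches on a role variable injected by the orchestrator at instantiation; NODE_ROLE and SPARK_MASTER_URL are hypothetical names:

#!/bin/bash
# auto_configure.sh: hypothetical auto-configure entrypoint.
# NODE_ROLE and SPARK_MASTER_URL are assumed to be injected by the
# orchestrator when the container is instantiated.
set -e
case "${NODE_ROLE:-worker}" in
  master)
    # Master nodes start the cluster-coordination service.
    /usr/lib/spark/spark-1.5.2-bin-hadoop2.4/sbin/start-master.sh
    ;;
  worker)
    # Worker nodes join the advertised master (Spark 1.x script name;
    # later Spark releases renamed this start-worker.sh).
    /usr/lib/spark/spark-1.5.2-bin-hadoop2.4/sbin/start-slave.sh "${SPARK_MASTER_URL}"
    ;;
  *)
    echo "unknown NODE_ROLE: ${NODE_ROLE}" >&2; exit 1 ;;
esac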
13. How We Did It: Design Decisions
• Maintain the promise of containers
- Keep them as stateless as possible
- Container storage is always ephemeral
- Persistent storage is external to the container (illustrated below)
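The distinction can be seen with stock Docker alone; the host path /mnt/tenant1/data below is illustrative:

# Writes to the container filesystem vanish with the container:
docker run --rm centos:centos6 sh -c 'echo scratch > /tmp/work.txt'
# Data that must outlive the container goes to storage mounted in from outside:
docker run --rm -v /mnt/tenant1/data:/data centos:centos6 sh -c 'echo kept >> /data/results.txt'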
14. How We Did It: Implementation
Resource Utilization
• CPU cores vs. CPU shares (example below)
• Over-provisioning of CPU recommended
• No over-provisioning of memory
Network
• Connect containers across hosts
• Persistence of IP addresses across container restarts
• Deploy VLANs and VxLAN tunnels for tenant-level traffic isolation
• Watch out for noisy neighbors
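A sketch of the cores-versus-shares trade-off using standard Docker flags (the image name and values are illustrative): pinned cores give hard isolation, relative shares allow safe CPU over-provisioning, and a hard memory cap enforces the no-memory-over-provisioning rule:

# Hard-pin a container to two physical cores (no CPU over-provisioning):
docker run --cpuset-cpus="0,1" --memory=8g spark-worker-image
# Or assign relative CPU shares so idle cycles can be over-provisioned
# across tenants while memory stays capped (default weight is 1024):
docker run --cpu-shares=2048 --memory=8g spark-worker-image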
15. How We Did It: Network Architecture
[Architecture diagram: each host runs an Open vSwitch (OVS) bridge attached to its NIC, with VxLAN tunnels between hosts carrying the tenant networks; a container orchestrator provides DHCP/DNS. The containers on these hosts run the Hadoop/Spark services: Resource Manager, Node Managers, Spark Master, Spark Worker, and Zeppelin.]
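For illustration only (not BlueData's exact configuration), the host-side plumbing for one tenant network can be sketched with stock Open vSwitch commands; the bridge name, VNI key, peer address, and container name are hypothetical:

# On each host: create a tenant bridge and a VxLAN tunnel port to the peer host.
ovs-vsctl add-br br-tenant1
ovs-vsctl add-port br-tenant1 vx-tenant1 -- set interface vx-tenant1 type=vxlan options:remote_ip=192.168.10.12 options:key=101
# Attach a container to the tenant bridge; containers on either host now share
# an isolated layer-2 segment, with the orchestrator's DHCP/DNS keeping IPs stable.
ovs-docker add-port br-tenant1 eth1 spark-worker-1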
16. How We Did It: Implementation
Storage
• Tweak the default size of a container’s /root
- Resizing storage inside an existing container is tricky
• Mount a logical volume on /data (see the sketch below)
- No use of overlay file systems
• DataTap (version-independent, HDFS-compliant) connectivity to external storage
Image Management
• Utilize Docker’s image repository
TIP: Mounting block devices into a container does not support symbolic links (in other words, /dev/sdb will not work, and the /dm/… PCI device can change across host reboots).
TIP: Docker images can get large. Use “docker squash” to save on size.
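A sketch of the /data approach, assuming LVM on the host with a volume group named vg_bigdata (names and sizes are illustrative):

# On the host: carve a dedicated logical volume for this container's /data.
lvcreate --size 200G --name spark_node1 vg_bigdata
mkfs.xfs /dev/vg_bigdata/spark_node1
mkdir -p /mnt/spark_node1 && mount /dev/vg_bigdata/spark_node1 /mnt/spark_node1
# Bind-mount it into the container, bypassing any overlay filesystem:
docker run -v /mnt/spark_node1:/data spark-worker-image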
17. How We Did It: Security Considerations
• Security is essential since containers and the host share one kernel
- Non-privileged containers
• Achieved through a layered set of capabilities
• Different capabilities provide different levels of isolation and protection
• Add “capabilities” to a container based on what operations are permitted (example below)
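A sketch of the layered-capability model with standard Docker flags; the specific capabilities shown are illustrative and would be chosen per the operations each container is permitted:

# Start from zero privileges, then grant only what the workload needs,
# e.g. changing file ownership and binding privileged ports at startup:
docker run --cap-drop=ALL --cap-add=CHOWN --cap-add=NET_BIND_SERVICE spark-worker-image
# Avoid --privileged: it bypasses this layered protection entirely.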
18. How We Did It: Sample Dockerfile
# Spark-1.5.2 docker image for RHEL/CentOS 6.x
FROM centos:centos6
# Download and extract spark
RUN mkdir /usr/lib/spark; curl -s http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.4.tgz | tar -xz -C /usr/lib/spark/
# Download and extract scala
RUN mkdir /usr/lib/scala; curl -s http://www.scala-lang.org/files/archive/scala-2.10.3.tgz | tar xz -C /usr/lib/scala/
# Install zeppelin
RUN mkdir /usr/lib/zeppelin; curl -s http://10.10.10.10:8080/build/thirdparty/zeppelin/zeppelin-0.6.0-incubating-SNAPSHOT-v2.tar.gz | tar xz -C /usr/lib/zeppelin
# Keep the image small: clean package caches and temp files
RUN yum clean all && rm -rf /tmp/* /var/tmp/* /var/cache/yum/*
ADD configure_spark_services.sh /root/configure_spark_services.sh
# Mark the configuration script executable, then run it
RUN chmod +x /root/configure_spark_services.sh && /root/configure_spark_services.sh
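Building and publishing the image then follows the standard Docker workflow; the tag and registry host below are illustrative:

docker build -t registry.example.com/bluedata/spark:1.5.2 .
docker push registry.example.com/bluedata/spark:1.5.2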
19. BlueData Application Image (.bin file)
[Diagram: an application .bin file packages the Docker image together with the runtime software bits. Its sources include a CentOS or RHEL Dockerfile, an appconfig bundle (conf, init.d, startscript), an <app> logo file (.PNG), and an <app>.wb file; the bdwb command assembles these via directives such as clusterconfig, image, role, appconfig, catalog, service, etc. Alternatively, for development, an existing .bin can be extracted and modified to create a new one.]
28. “Dockerized” Spark Standalone
Spark with Zeppelin Notebook: 5 fully managed Docker containers with persistent IP addresses
29. Big Data on Docker: Key Takeaways
• All apps can be “Dockerized”, including Hadoop & Spark
- The traditional bare-metal approach to Big Data is rigid and inflexible
- Containers (e.g. Docker) provide a more flexible & agile model
- Faster app dev cycles for Big Data app developers, data scientists, & engineers
30. Big Data on Docker: Key Takeaways
• There are unique Big Data pitfalls & challenges with Docker
- For enterprise deployments, you will need to overcome these and more:
- Docker base images that include Big Data libraries and jar files
- Container orchestration, including networking and storage
- A resource-aware runtime environment, including CPU and RAM
31. Big Data on Docker: Key Takeaways
• There are unique Big Data pitfalls & challenges with Docker
- More:
- Access to containers secured with an ssh keypair or PAM module (LDAP/AD)
- Fast access to external storage
- Management agents in Docker images
- Runtime injection of resource and configuration information (see the sketch below)
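As a sketch of such runtime injection, the same generic image can be resized and re-pointed per cluster purely through environment variables at container start; SPARK_WORKER_CORES and SPARK_WORKER_MEMORY are standard Spark standalone settings, while SPARK_MASTER_URL, the values, and the image name are illustrative:

# Inject resource and configuration values at startup so the image stays generic:
docker run -e SPARK_WORKER_CORES=4 -e SPARK_WORKER_MEMORY=8g -e SPARK_MASTER_URL=spark://10.1.0.5:7077 spark-worker-image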
32. Big Data on Docker: Key Takeaways
• “Do It Yourself” will be costly and time-consuming
- Be prepared to tackle the infrastructure & plumbing challenges
- In the enterprise, the business value is in the applications / data science
• There are other options …
- Public Cloud – AWS
- BlueData - turnkey solution
Overview of Results
Testing revealed that, across the HiBench micro-workloads investigated, the BlueData EPIC software platform enables performance that is comparable or superior to bare metal.
Results are summarized in the above chart. The elapsed (execution) times for physical Hadoop were used as the baseline (i.e. 100%), and the corresponding elapsed time for the same test on BlueData EPIC was compared against that bare-metal time.
The elapsed time results show that BlueData EPIC is twice as fast for write-dominant I/O workloads, as shown with DFSIO Write as well as TeraGen.
The elapsed time results show that BlueData EPIC is comparable to bare metal for read-dominant I/O workloads. Even with lower throughput on DFSIO Read, real-world workloads, which consist of a series of read and write operations over a period of time, would see BlueData performance equivalent to or better than bare metal.
This balanced read/write performance is demonstrated by the TeraSort read/write results (shown above). Even with just a single virtual node per host, the performance of the virtualized EPIC platform is within 2% of bare-metal performance.
The elapsed time results also show that BlueData EPIC is 10-15% faster for compute-intensive workloads, as shown with WordCount.
The superior performance of the BlueData EPIC software platform for write-dominant workloads is due to the application-aware caching enabled by its IOBoost technology. The non-persistent memory cache improves the efficiency of access to physical storage devices by providing a write-behind cache that optimizes the performance of sequential writes. In general, any write operation, including those outside the scope of these benchmarks, benefits from the write-behind cache, since Hadoop uses a separate thread for write operations. BlueData IOBoost acknowledges each write request immediately, enabling the application to continue data processing without added latency.
BlueData performance on read-dominant workloads is comparable to bare metal in part due to the read-ahead cache implemented in IOBoost. Unlike write operations, where a separate thread is used, Hadoop applications have rigid semantics for read operations: they wait for the read to complete before processing continues. As a result, the read-ahead cache contributes less to I/O throughput for typical Hadoop applications.
In summary, the use of a single virtual node per physical host shows that there is minimal to no overhead in using the virtualized EPIC platform compared to bare-metal (physical) Hadoop.
Jason to moderate and introduce questions, then field questions for Anant & Tom to answer.
Jason: Thank you to everyone for attending this webinar – and a special thanks to our presenters ….
All attendees of this webcast can click on the “Attachments” tab and download a copy of the slides – you’ll also have access to the on-demand replay.
And if you’d like more information about BlueData software, you can visit our website at bluedata.com, contact us directly, or try the free version of our software at bluedata.com/free
Again, thanks and have a great day.