Roman Shaposhnik: Director of Open Source, Pivotal; Committer, Apache Hadoop; Founder, Apache Bigtop
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex.
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform - rhatr
A long time ago in a galaxy far, far away, only the chosen few could deploy and operate a fully functional Hadoop cluster. Vendors took pride in rationalizing this experience for their customers by creating various distributions that include Apache Hadoop. It all changed when Cloudera decided to support Apache Bigtop as the first 100% community-driven bigdata management distribution based on Apache Hadoop. Today, most major commercial distributions of Apache Hadoop are based on Bigtop. Bigtop has won the Hadoop distributions war and offers a superset of packaged components. In this talk we will focus on practical advice on how to deploy and start operating a Hadoop cluster using Bigtop’s packages and deployment code. We will dive into the details of using the packages of the Hadoop ecosystem provided by Bigtop and how to build data management pipelines in support of your enterprise applications.
Apache Bigtop is a project that packages, tests, and deploys the Hadoop ecosystem. It uses Vagrant and Docker provisioners to automatically set up Hadoop clusters on virtual machines or Linux containers for testing Bigtop packaging and puppet recipes. Users can run a round trip test from source code to a testing cluster locally with one click deployment.
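For orientation, the one-click flow from a Bigtop checkout looks roughly like this (script locations and flags vary between Bigtop releases, so treat these lines as illustrative rather than exact):

$ git clone https://github.com/apache/bigtop.git && cd bigtop
$ cd provisioner/docker              # the Docker-based provisioner
$ ./docker-hadoop.sh --create 3      # bring up a 3-node cluster from the puppet recipes
$ ./docker-hadoop.sh --smoke-tests   # run smoke tests against the running cluster
$ ./docker-hadoop.sh --destroy       # tear the cluster down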
OpenShift, Docker, Kubernetes: The next generation of PaaS - Graham Dumpleton
The document discusses how platforms like OpenShift, Docker, and Kubernetes have evolved from earlier PaaS technologies to provide next-generation platforms that enable automated builds, deployments, orchestration, and security across containers. It notes how these platforms allow applications to be deployed using custom strategies rather than being constrained to a single way of working, and how they integrate with existing CI/CD tools. The document encourages gradually adopting new tooling as it makes sense and provides various resources for trying OpenShift.
Slides from my presentation at #ChefConf 2013
Big Data meets Configuration Management. Edmunds.com's first foray into Hadoop is a tale of challenges, discovery, and ultimately triumph. This is the story of how Edmunds.com leveraged Chef - and its community - to build a fully automated Hadoop cluster in the face of looming project deadlines.
Dennis Matotek, Technical Lead, Platforms at Experian Hitwise Australia, gave an excellent presentation on setting up Puppet with Vagrant and on Puppet testing, including a full demo of rspec-puppet and Jenkins.
Open Source Recipes for Chef Deployments of Hadoop - DataWorks Summit
The document discusses open source Chef recipes provided by Bloomberg for deploying Hadoop clusters using configuration management. The recipes automate the installation and configuration of Hadoop and related projects through Chef code. They also allow for customization, high availability, and integration with cloud platforms like OpenStack.
OpenStack in Action 4! Thierry Carrez - From Havana to Icehouse - eNovance
This document provides an overview of OpenStack development from the Havana release to the upcoming Icehouse release from the perspective of the OpenStack Technical Committee chair. It discusses key accomplishments in the Havana cycle, governance changes, the introduction of programs, infrastructure improvements, new integrated projects like Trove, planned features for Icehouse, and new projects entering incubation like Ironic, Marconi, and Savanna.
High Availability from the DevOps side - OpenStack Summit Portland - eNovance
This document summarizes Emilien Macchi and Sébastien Han's work on improving high availability in OpenStack. It discusses their contributions of Pacemaker resource agents and documentation updates. It also describes their experiences implementing OpenStack in a medium-sized public cloud, noting scalability challenges and split-brain risks. Lastly, it outlines work to improve networking high availability and testing of the cell architecture for horizontal scaling.
Cloud Foundry Deployment Tools: BOSH vs Juju Charms - Altoros
Did you know that BOSH is not the only tool for deploying the Cloud Foundry PaaS? Initially presented at the 2014 DevOps Summit by Andrei Yurkevich, CTO @ Altoros, this slide deck demonstrates how to deploy CF with Juju Charms and compares this orchestration solution to BOSH. It also covers overlapping features and explains when to use BOSH, Juju Charms, or both.
For more Cloud Foundry research, visit: http://www.altoros.com/research-papers
Learn how Spotify uses Puppet to manage the large and growing number of servers used to stream music to millions of users. The presenter will also give an introduction to other technologies used to power Spotify.
Erik Dalén
System Engineer, Spotify
Erik is a system engineer on the site reliability engineering team at Spotify with a focus on Puppet and automation. He is also a community contributor to Puppet and the author of the puppetdbquery tool. He can be found on IRC and GitHub as dalen.
CAPS: What's best for deploying and managing OpenStack? Chef vs. Ansible vs. ... - Daniel Krook
Presentation at the OpenStack Summit in Tokyo, Japan on October 29, 2015.
http://sched.co/49vI
This talk will cover the pros and cons of four different OpenStack deployment mechanisms. Puppet, Chef, Ansible, and Salt for OpenStack all claim to make it much easier to configure and maintain hundreds of OpenStack deployment resources. With the advent of large-scale, highly available OpenStack deployments spread across multiple global regions, the choice of which deployment methodology to use has become more and more relevant.
Beyond the initial day-one deployment, when it comes to the day-two-and-beyond questions of updating and upgrading existing OpenStack deployments, it becomes all the more important to choose the right tool.
Come join the Blue Box and IBM team to discuss the pros and cons of these approaches. We look at each of these four tools in depth, explore their design and function, and determine which scores higher than others to address your particular deployment needs.
Daniel Krook - Senior Software Engineer, Cloud and Open Source Technologies, IBM
Paul Czarkowski - Cloud Engineer at Blue Box, an IBM company
From Zero to Cloud: Revolutionize your Application Life Cycle with OpenShift ... - OpenShift Origin
From Zero to Cloud: Revolutionize your Application Life Cycle with OpenShift PaaS
Talk given by Diane Mueller, OpenShift Origin Community Manager at FISL 15 on May 9th, 2014
One of the impediments to becoming an active technical contributor in the OpenStack community is setting up an efficient R&D environment which includes deploying a simple cloud. Using RDO-manager, get a basic cloud up and running with the fewest steps and minimal hardware so you can focus on the fun stuff - development
This document discusses the use of Helm for deploying applications on Kubernetes. It begins by introducing Helm and its benefits over manually deploying Kubernetes applications. It then covers how Helm can be used to deploy test clusters, proof of concepts, and facilitate production rollouts. The document also discusses how Helm is currently being used to install curated applications, provide lifecycle management, configuration management, inheritance, and composition capabilities. It concludes by mentioning upcoming demos of Helm.
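For context, the Helm workflow the document contrasts with hand-written manifests looks roughly like this (the repository, chart, and value names below are illustrative, not taken from the document):

$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm install my-db bitnami/mysql --set auth.rootPassword=secret  # templated install
$ helm upgrade my-db bitnami/mysql --set image.tag=8.0             # lifecycle management
$ helm rollback my-db 1                                            # revert a bad release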
Chef for OpenStack: OpenStack Spring Summit 2013 - Matt Ray
This document provides an overview of using Chef to deploy and manage OpenStack. It discusses why Chef is useful for infrastructure as code and its declarative interface. The document outlines the current state of the Chef for OpenStack project, including contributors, available cookbooks, and roadmap. It promotes the project as a way to collaboratively deploy OpenStack in a standardized, automated way and reduce fragmentation.
OpenStack in Action! 5 - Dell - OpenStack powered solutions - Patrick Hamon - eNovance
This document discusses Dell/Intel OpenStack-powered solutions and provides the following key points:
1) OpenStack is an open-source cloud operating system that is growing rapidly in adoption with over 10,000 individual members and contributors from over 70 countries.
2) Dell offers OpenStack reference architectures, hardware, software, services, and support to help customers accelerate their adoption of private and hybrid cloud solutions based on OpenStack.
3) Case studies show how Dell OpenStack solutions have helped customers like a research university and web hosting provider build scalable, cost-effective private clouds to meet their infrastructure and data storage needs.
SaltConf14 - Craig Sebenik, LinkedIn - SaltStack at Web Scale - SaltStack
This talk will focus on the unique challenges of managing Web scale and an application stack that lives on tens of thousands of servers spread across multiple data centers. Learn more about LinkedIn's unique topology and the development of an efficient build environment, and hear more about LinkedIn's plans for a deployment system based on Salt. Also, all of the software that runs LinkedIn sends a LOT of data. In order to stay ahead of this tidal wave of data, the team must address scale challenges seen in very few environments through efficient use of monitoring and metrics systems. This talk will highlight best practices and the user training necessary for the use of SaltStack in large environments.
Austin OpenStack Meetup December 2012 presentation. The first part of the session was Chef for OpenStack, the second was Q&A about AT&T's OpenStack private cloud deployments to multiple data centers.
OSDC 2018 | Three years running containers with Kubernetes in Production by T... - NETWAYS
The talk gives a state-of-the-art update on experiences with deploying applications in Kubernetes at scale. Whether in clouds or on premises, Kubernetes has taken over the leading role as a container operating system. The central paradigm of stateless containers connected to storage and services is the core of Kubernetes. However, it can be extended to distributed databases, machine learning, and Windows VMs in Kubernetes. All of these applications were considered edge cases a few years ago but are going more and more mainstream today.
Sebastien Goasguen. With VMs seemingly taking a back seat and containers coming back into fame, what is the role of CloudStack or OpenStack? In this talk Sebastien will briefly review the state of the art and bring some context around container orchestrators and how they relate to CloudStack. He will then discuss how container orchestration can be easily integrated in CloudStack.
Kolla allows running OpenStack in containers, using Docker and Ansible for simplified and repeatable deployments. It builds container images for OpenStack components that can be customized and then deployed through Ansible playbooks. Key features include opinionated out-of-the-box configurations, customizability, and integration with tools like Docker, Kubernetes, and ELK for logging. However, caution is advised, as Docker, Kolla, and Kubernetes are new technologies under active development.
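A minimal kolla-ansible session, for orientation (the inventory file name and flow are illustrative; consult the Kolla documentation for your release):

$ pip install kolla-ansible
$ kolla-genpwd                                    # generate passwords for all services
$ kolla-ansible -i ./multinode bootstrap-servers  # prepare the target hosts
$ kolla-ansible -i ./multinode prechecks          # validate the environment
$ kolla-ansible -i ./multinode deploy             # roll out containerized OpenStack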
Timothy Spann provides an overview of Apache NiFi, an open source dataflow software. Some key points about NiFi include:
- It provides guaranteed data delivery, buffering, prioritized queuing, and data provenance.
- It supports over 60 source connectors and has hundreds of processors for handling different data formats.
- The architecture includes repositories for storing metadata and provenance data, and supports clustering.
- Spann discusses best practices for using NiFi such as avoiding spaghetti flows, leveraging parameters and templates, and upgrading to the latest version. He also demonstrates how to consume data from sources like MQTT and FTP.
5 ways to install @OpenShift in 5 minutes (Lightning Talk given at #DevConfC... - OpenShift Origin
This document discusses 5 ways to install OpenShift including: 1) using the step-by-step instructions on install.openshift.com, 2) deploying highly available OpenShift Origin clusters using Ansible, 3) leveraging OpenStack's Heat to provision OpenShift on OpenStack infrastructure, 4) downloading an OpenShift VM, and 5) getting started with Origin Release 3 on Fedora 19, RHEL 6, or CentOS 6.5 using install.openshift.com, Puppet, Ansible, Heat, or by downloading from openshift.github.io. The document is presented by Diane Mueller, OpenShift Origin Community Manager at Red Hat.
Habitat is a tool for building and running distributed applications. It aims to standardize packaging and running applications across different environments. With Habitat, applications are packaged into "harts" which contain all their dependencies and can be run on any system. Habitat handles configuration, service discovery, and updates to provide a uniform way to deploy applications. Plans are used to define how to build harts in a reproducible way. The Habitat runtime then manages running applications as services.
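To make the plan/hart/supervisor vocabulary concrete, here is a minimal, hypothetical plan.sh sketch (the origin, package, and dependency names are invented for the example); building it with hab pkg build produces a .hart artifact that hab svc load then runs under the supervisor:

pkg_origin=example          # hypothetical origin
pkg_name=myapp              # hypothetical package name
pkg_version="0.1.0"
pkg_deps=(core/node)        # runtime dependencies resolved by Habitat
do_build() {
  npm install               # build the application
}
do_install() {
  cp -r . "${pkg_prefix}"   # lay the files into the .hart package tree
}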
Habitat Workshop at Velocity London 2017 - Mandi Walls
Mandi Walls is the Technical Community Manager for EMEA at Chef and the Habitat Community lead is Ian Henry. The document discusses how modern applications are trending toward immutability, platform agnosticism, complexity reduction, and scalability. It provides an overview of ways to work with Habitat, including using artifacts that run themselves via the supervisor, exporting to Docker, and building plans from scratch or using scaffolding.
Leveraging Docker for Hadoop build automation and Big Data stack provisioning - DataWorks Summit
Apache Bigtop, as an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build up their own customized big data platform as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and the many Hadoop components must be verified to work together as well. In this presentation, we'll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. The core innovations here are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and the Docker Sandbox, which lets developers quickly start a big data stack. The content of this talk includes the containerized CI framework, technical details of the Docker Provisioner and Docker Sandbox, the hierarchy of docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation.
As Hadoop becomes the de facto big data platform, enterprises deploy HDP across a wide range of physical and virtual environments spanning private and public clouds. This session will cover key considerations for cloud deployment and showcase Cloudbreak for simple and consistent deployment across the cloud providers of choice.
This document discusses deploying Hadoop clusters using Docker and Cloudbreak. It begins with an overview of Hadoop everywhere and the challenges of deploying Hadoop across different infrastructures. It then discusses using Docker for deployment due to its portability and how Cloudbreak uses Docker and Ambari blueprints to deploy Hadoop clusters on different clouds. The remainder discusses running a workshop to deploy your own Hadoop cluster using Cloudbreak on a Docker host.
State of Big Data on ARM64 / AArch64 - Apache Bigtop - Ganesh Raju
This document discusses Apache Bigtop, an open source project that provides tools for building, deploying, and testing big data software stacks across multiple platforms. It summarizes Bigtop's components and goals, contributions from ARM to support AArch64, challenges faced in the porting process, and the roadmap for further improving Bigtop including adding Kubernetes support and predefined sample stacks. It also demonstrates Bigtop's sandbox feature for building and running pseudo clusters using Docker.
Today, there are many companies that are open to the idea of sharing and actively promote Open Source projects.
We, at Neev, not only promote Open Source, but actively utilize Open Source wherever possible in order to increase ROI for customers and decrease time-to-market.
It is the best way to give something back to the community. Neev has, from time-to-time, given back to the Open Source community through contributions that aim to solve critical issues faced by the IT community.
Here are 18 of our innovative Open Source tools.
This document provides an introduction and overview of Docker, including its rapid growth and adoption, key benefits for developers and operations teams, technical underpinnings, ecosystem support, use cases, and future plans. Docker provides a way to package applications into lightweight containers that are portable and can run on any infrastructure. It solves issues around dependency management and consistency across environments.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses limitations in traditional RDBMS for big data by allowing scaling to large clusters of commodity servers, high fault tolerance, and distributed processing. The core components of Hadoop are HDFS for distributed storage and MapReduce for distributed processing. Hadoop has an ecosystem of additional tools like Pig, Hive, HBase and more. Major companies use Hadoop to process and gain insights from massive amounts of structured and unstructured data.
Roman Shaposhnik of Cloudera and the Apache Software Foundation talks on "Deploying Hadoop-Based Bigdata Environments: [Tall] Tales from the Frontier" at Puppet Camp Silicon Valley 2012.
This document discusses deploying Hadoop-based big data environments. It describes the many components involved in Hadoop ecosystems, challenges with dependencies and packaging, and how tools like Puppet and the Bigtop project aim to help address these challenges through standardized packaging, configuration management, and integration testing. The document encourages users to get involved with Bigtop to help it grow and better support deploying Hadoop clusters.
Transforming Application Delivery with PaaS and Linux ContainersGiovanni Galloro
This document discusses Red Hat OpenShift Enterprise and how it helps with application delivery using Platform as a Service (PaaS) and Linux containers. It covers OpenShift's architecture using Linux containers, Docker, Kubernetes, and RHEL Atomic Host. It also discusses OpenShift's application deployment flow, adoption trends, and challenges with container adoption as well as Red Hat's strategy to address these challenges through container certification and simplifying adoption for partners.
This document discusses Hadoop for Windows, a distribution of Apache Hadoop and related projects that runs natively on the Windows operating system. It provides an overview of what is included in the distribution, such as Hadoop, Pig, Hive, and HCatalog, along with the versions and patches for each. It also describes what has changed from the Apache versions, such as new command line scripts, permissions mapping, and task controller. Users can install Hadoop for Windows on-premise or use HDInsight on Azure. The full distribution will be generally available in the second quarter along with more alignment with other Hortonworks distributions.
ODPi aims to standardize and support open source Apache Hadoop and related big data technologies. It currently defines a runtime stack of 3 components and the Ambari management stack. ODPi consists of industry members that provide engineering support and help test, integrate, and define specifications for the supported big data projects. The goal is to reduce costs for distributors by having a standardized core set of open source projects that are compatible and interoperable.
Apache Bigtop and ARM64 / AArch64 - Empowering Big Data Everywhere - Ganesh Raju
Apache Bigtop packages the Hadoop ecosystem into RPM and DEB packages. It provides a foundation for commercial Hadoop distributions and services. Bigtop features include a build toolchain, a package framework, Puppet deployment scripts, and an integration test framework. The next release, Bigtop 1.4, is due in early April 2019, adding AArch64 support, improved testing, and package version updates. Future work includes focusing on core big data components like Spark and Flink, adding Kubernetes and cloud support, and expanding integrations.
This document provides an introduction and overview of Docker. It discusses why Docker was created to address issues with managing applications across different environments, and how Docker uses lightweight containers to package and run applications. It also summarizes the growth and adoption of Docker in its first 7 months, and outlines some of its core features and the Docker ecosystem including integration with DevOps tools and public clouds.
Build Your Own PaaS, Just like Red Hat's OpenShift from LinuxCon 2013 New Orl... - OpenShift Origin
Learn how to build your platform as a service just like Red Hat's OpenShift PaaS. This talk covers the architecture and internals of the OpenShift Origin open source project, how to deploy and configure it for bare metal, AWS, OpenStack, CloudStack, or any IaaS, and the community that's collaborating on the project to deliver the next generation of secure, scalable PaaS. Visit openshift.com for more information.
presented at LinuxCon by Diane Mueller in the CloudOpen track
This document provides an overview of Habitat, a tool for building, deploying, and managing applications. It discusses how Habitat aims to reduce complexity by providing immutable, platform-agnostic packages and managing dependencies and configurations. A demo of building and running a sample Ruby application in Habitat is also shown. Key features highlighted include Habitat plans for defining builds, hooks for controlling application startup, and configuration management at runtime. The document encourages attendees to try out Habitat and get involved in the community.
Similar to Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex (20)
Low Latency Polyglot Model Scoring using Apache Apex - Apache Apex
This document discusses challenges in building low-latency machine learning applications and how Apache Apex can help address them. It introduces Apache Apex as a distributed streaming engine and describes how it allows embedding models from frameworks like R, Python, H2O through custom operators. It provides various data and model scoring patterns in Apex like dynamic resource allocation, checkpointing, exactly-once processing to meet SLAs. The document also demonstrates techniques like canary deployment, dormant models, model ensembles through logical overlays on the Apex DAG.
From Batch to Streaming with Apache Apex Dataworks Summit 2017 - Apache Apex
This document discusses transitioning from batch to streaming data processing using Apache Apex. It provides an overview of Apex and how it can be used to build real-time streaming applications. Examples are given of how to build an application that processes Twitter data streams and visualizes results. The document also outlines Apex's capabilities for scalable stream processing, queryable state, and its growing library of connectors and transformations.
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare - Apache Apex
The presentation covers how Apache Apex is used to deliver actionable insights in real-time for Ad-tech. It includes a reference architecture to provide dimensional aggregates on TB scale for billions of events per day. The reference architecture covers concepts around Apache Apex, with Kafka as source and dimensional compute. Slides from Devendra Tagare at Apache Big Data North America in Miami 2017.
David Yan offers an overview of Apache Apex, a stream processing engine used in production by several large companies for real-time data analytics.
Apache Apex uses a programming paradigm based on a directed acyclic graph (DAG). Each node in the DAG represents an operator, which can be data input, data output, or data transformation. Each directed edge in the DAG represents a stream, which is the flow of data from one operator to another.
As part of Apex, the Malhar library provides a suite of connector operators so that Apex applications can read from or write to various data sources. It also includes utility operators that are commonly used in streaming applications, such as parsers, deduplicators and join, and generic building blocks that facilitate scalable state management and checkpointing.
In addition to processing based on ingestion time and processing time, Apex supports event-time windows and session windows. It also supports the windowing, watermarks, allowed lateness, accumulation modes, triggering, and retraction semantics detailed by Apache Beam, as well as feedback loops in the DAG for iterative processing and at-least-once and “end-to-end” exactly-once processing guarantees. Apex provides various ways to fine-tune applications, such as operator partitioning, locality, and affinity.
Apex is integrated with several open source projects, including Apache Beam, Apache Samoa (distributed machine learning), and Apache Calcite (SQL-based application specification). Users can choose Apex as the backend engine when running their application model based on these projects.
David explains how to develop fault-tolerant streaming applications with low latency and high throughput using Apex, presenting the programming model with examples and demonstrating how custom business logic can be integrated using both the declarative high-level API and the compositional DAG-level API.
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex - Apache Apex
Stream data processing is becoming increasingly important to support business needs for faster time to insight and action with growing volume of information from more sources. Apache Apex (http://apex.apache.org/) is a unified big data in motion processing platform for the Apache Hadoop ecosystem. Apex supports demanding use cases with:
* Architecture for high throughput, low latency and exactly-once processing semantics.
* Comprehensive library of building blocks including connectors for Kafka, Files, Cassandra, HBase and many more
* Java based with unobtrusive API to build real-time and batch applications and implement custom business logic.
* Advanced engine features for auto-scaling, dynamic changes, compute locality.
Apex has been in development since 2012 and is used in production in various industries like online advertising, the Internet of Things (IoT), and financial services.
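As a small taste of the operational side, an application package built with the Java API is typically submitted to YARN through the Apex CLI, roughly like this (the .apa file name is an example):

$ apex                              # start the Apache Apex CLI
apex> launch target/myapp-1.0.apa   # submit the application package to YARN
apex> list-apps                     # check on the running application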
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex - Apache Apex
Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.
YARN was introduced as part of Hadoop 2.0 to address limitations in the original MapReduce (MR1) architecture like scalability bottlenecks and underutilization of resources. YARN introduces a global ResourceManager and per-node NodeManagers to allocate cluster resources to distributed applications. It allows various distributed processing frameworks beyond MapReduce to share common cluster resources. Applications request containers for ApplicationMasters that then negotiate resources from YARN to run application components in containers across nodes. Existing MapReduce jobs can also run unchanged on YARN.
Here is how you can solve this problem using MapReduce and Unix commands:
Map step:
grep -o 'Blue\|Green' input.txt | wc -l > output
This uses grep to search the input file for the strings "Blue" or "Green" and print only the matches. The matches are piped to wc which counts the lines (matches).
Reduce step:
cat output
This isn't really needed as there is only one mapper. Cat prints the contents of the output file which has the count of Blue and Green.
So MapReduce has been simulated using grep for the map and cat for the reduce functionality. The key aspects are that grep extracts the relevant data (the map) and cat collects the result (the reduce).
HDFS stores files as blocks that are by default 64 MB in size to minimize disk seek times. The namenode manages the file system namespace and metadata, tracking which datanodes store each block. When writing a file, HDFS breaks it into blocks and replicates each block across multiple datanodes. The secondary namenode periodically merges namespace and edit log changes to prevent the log from growing too large. Small files are inefficient in HDFS due to each file requiring namespace metadata regardless of size.
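One way to see this block-and-replica layout in practice is hdfs fsck (the path is illustrative):

$ hdfs fsck /user/alice/big.log -files -blocks -locations
# lists each (by default 64 MB) block of the file together with the
# datanodes that hold its replicas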
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations - Apache Apex
Presenter:
Chaitanya Chebolu, Committer for Apache Apex and Software Engineer at DataTorrent.
In this session we will cover the use-case of ingesting data from Kafka and writing to HDFS with a couple of processing operators - Parser, Dedup, Transform.
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application - Apache Apex
This document provides an overview of building a first Apache Apex application. It describes the main concepts of an Apex application including operators that implement interfaces to process streaming data within windows. The document outlines a "Sorted Word Count" application that uses various operators like LineReader, WordReader, WindowWordCount, and FileWordCount. It also demonstrates wiring these operators together in a directed acyclic graph and running the application to process streaming data.
Intro to Apache Apex - Next Gen Platform for Ingest and Transform - Apache Apex
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platform and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data) - Apache Apex
Presenter:
Priyanka Gugale, Committer for Apache Apex and Software Engineer at DataTorrent.
In this session we will cover an introduction to YARN, understanding the YARN architecture, and a look into the YARN application lifecycle. We will also learn how Apache Apex runs as a YARN application in Hadoop.
Ingesting Data from Kafka to JDBC with Transformation and Enrichment - Apache Apex
Presenter - Dr Sandeep Deshmukh, Committer Apache Apex, DataTorrent engineer
Abstract:
Ingesting and extracting data from Hadoop can be a frustrating, time consuming activity for many enterprises. Apache Apex Data Ingestion is a standalone big data application that simplifies the collection, aggregation and movement of large amounts of data to and from Hadoop for a more efficient data processing pipeline. Apache Apex Data Ingestion makes configuring and running Hadoop data ingestion and data extraction a point and click process enabling a smooth, easy path to your Hadoop-based big data project.
In this series of talks, we cover how Hadoop ingestion is made easy using Apache Apex. The third talk in this series focuses on ingesting unbounded data from Kafka to JDBC with a couple of processing operators - Transform and Enrichment.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect personal devices and information.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! - SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
20 Comprehensive Checklist of Designing and Developing a Website - Pixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, which can then be measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security as an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... - SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Generative AI Deep Dive: Advancing from Proof of Concept to Production - Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
1. Making sense of Apache Bigtop, ODPi and
why it all matters to Apache Apex
Roman Shaposhnik, rvs@apache.org,
@rhatr
Director of Open Source Strategy,
Pivotal Inc.
2. A slide deck built via the “Apache Way”
• Bigtop community contributors
• Roman Shaposhnik
• Konstantin Boudnik
• Nate D'Amico
• Evans Ye & Darren Chen (Trend Micro)
3. What is Apache Bigtop?
• Apache Bigtop is to Hadoop what Debian is to Linux
• A 100% open, community driven distribution of bigdata
management platform based on Apache Hadoop
• A place where all communities around big data come
together
• The thing everybody (Pivotal, Cloudera, Hortonworks,
WANDisco, IBM, Amazon, TrendMicro) is building off of
• A cutting edge, quickly evolving distribution and a set
of tools
6. ODPi is a nonprofit organization committed to simplification &
standardization of the big data ecosystem with a common reference
specification called ODPi Core.
As a shared industry effort, ODPi is focused on promoting and advancing the state of Apache Hadoop®
and Big Data Technologies for the Enterprise.
9. What has ODPi done so far (1.0.1)?
• Runtime specification
• https://github.com/odpi/specs/blob/master/ODPi-Runtime.md
• Validation testsuite
• http://repo.odpi.org/ODPi/1.0/acceptance-tests/
• Reference implementation binaries
• http://repo.odpi.org/ODPi/1.0/{centos6, ubuntu-14.04}
10. What are we working on?
• Operations specification
• https://github.com/odpi/specs/blob/master/ODPi-Operations.md
• ISV “ODPi compatible” policy
• Expanding ODPi core beyond Apache Hadoop & Ambari
• Hive
• ????
• How can you help?
• Share use cases
• Test against reference implementation
• Contribute to upstream ASF projects
11. What’s in Bigtop?
• A set of binary packages
• just like CDH/PHD/HDP/ODPi/etc.
• Integration code
• Packaging code
• Deployment code
• Orchestration code
• Validation code
• Continuous Integration infrastructure
12. Integration/packaging
• Linux packages
• RPM, DEB
• RHEL/CentOS(Fedora), SLES(OpenSUSE), Debian, Ubuntu
• VirtualBox, VMWare, etc. VM images
• Challenge: Linux packaging is node-centric
• “smart” tarballs
• Docker or BOSH images
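For example, consuming the Bigtop repo on an apt-based system looks roughly like this (the repo URL is elided; pick the list file matching your Bigtop release and distro):

$ sudo wget -O /etc/apt/sources.list.d/bigtop.list http://.../bigtop.list
$ sudo apt-get update
$ sudo apt-get install hadoop-hdfs-namenode hadoop-yarn-resourcemanager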
13. Integration testing based on iTest
• Clean-room provisioning
• these ain’t your gramp’s unit tests
• Versioned test artifacts
• JVM-based test artifacts
• Matching stacks of components and integration tests
• Plug’n’play architecture: Gradle/Groovy, JARs/artifacts
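Running a slice of those tests from a Bigtop checkout looks roughly like this (Gradle task names differ across Bigtop versions):

$ cd bigtop
$ ./gradlew bigtop-tests:smoke-tests:hdfs:test -Psmoke.tests --info
# resolves the versioned JVM test artifacts and runs them against the cluster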
14. Puppet 3.x deployment
• Master-less puppet
• $ puppet apply bigtop-deploy/puppet/manifests/site.pp # on each node
• Cluster topology is kept in Hiera
bigtop::hadoop_head_node: "hadoopmaster.example.com"
hadoop::hadoop_storage_dirs:
- "/mnt"
hadoop_cluster_node::cluster_components:
- yarn
- zookeeper
bigtop::bigtop_repo_uri:
"http://bigtop-
16. Who is this for?
• For Hadoop app developers, cluster admins, users
• Run a Hadoop cluster to test your code on
• Try & test configurations before applying to Production
• Play around with Bigtop Big Data Stack
• For contributors
• Easy to test your packaging, deployment, testing code
• For vendors
• CI out of the box -> patching upstream code made easier
17. Works great, but…
• Need to add the Vagrant public key into Docker images
• Too many issues with the auto-created boot2docker hosting VM
• A bug in the Docker provider has stayed open for almost 2 years
• 'Waiting for machine to boot' hangs infinitely
• Cannot share the same code across different providers anyway
• Not all Docker options are supported in the Vagrantfile
• Does not support Docker Swarm
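These limitations are what pushed Bigtop toward driving Docker directly, via docker-compose, in its own provisioner rather than going through the Vagrant Docker provider; schematically (an illustration, not the exact implementation):

$ docker-compose -p bigtop up -d           # start the base node containers
$ docker-compose -p bigtop scale bigtop=3  # grow the cluster without Vagrant in the loop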
27. Blueprints for data engineering
• BigPetStore
• Data Generator
• Examples using tools in the Hadoop ecosystem to process data
• Build system and tests for integrating tools and multiple JVM languages
• Started by Dr. Jay Vyas, principal software engineer at Red Hat, Inc.
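An illustrative BigPetStore run, generating synthetic transactions with Spark (the class, jar, and argument names vary by Bigtop release and are shown only to convey the shape of the pipeline):

$ spark-submit --master local[2] \
    --class org.apache.bigtop.bigpetstore.spark.generator.SparkDriver \
    bigpetstore-spark.jar /tmp/generated 10 1000 365.0
# output directory, number of stores, number of customers, days to simulate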
31. New focus and target end users
• Data engineers vs distro builders
• Enhance Operations/Deployment
• Reference implementations & tutorials
32. Data data data…
Smarter/Realistic test data
- bigpetstore
- bigtop-bazaar
- weather data gen
Tutorial/Learning Data sets
- githubarchive.org
- more tbd…