Three examples of containerized Big Data analytics:
1. Installation with Docker and Weave, for small and medium deployments
2. Hadoop on Mesos with Apache Myriad
3. Spark on Mesos
Lessons Learned Running Hadoop and Spark in Docker Containers (BlueData, Inc.)
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed big data applications like Apache Hadoop and Apache Spark. This session at Strata + Hadoop World in New York City (September 2016) explores various solutions and tips to address the challenges encountered while deploying multi-node Hadoop and Spark production workloads using Docker containers.
Some of these challenges include container life-cycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. BlueData is “all in” on Docker containers, with a specific focus on big data applications. BlueData has learned firsthand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy big data workloads using Docker.
This session by Thomas Phelan, co-founder and chief architect at BlueData, discusses how to securely network Docker containers across multiple hosts and discusses ways to achieve high availability across distributed big data applications and hosts in your data center. Since we’re talking about very large volumes of data, performance is a key factor, so Thomas shares some of the storage options implemented at BlueData to achieve near bare-metal I/O performance for Hadoop and Spark using Docker as well as lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52042
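Multi-host container networking of the kind discussed in this session is commonly expressed today with a Docker overlay network. A minimal hedged sketch follows; the image, service, and network names are illustrative, and this is not BlueData's actual implementation:

```yaml
# docker-compose.yml (Swarm mode): workers on different hosts share one overlay network
version: "3.8"
services:
  spark-worker:
    image: example/spark-worker:latest   # hypothetical image name
    deploy:
      replicas: 4                        # spread across the Swarm's hosts
    networks:
      - bigdata
networks:
  bigdata:
    driver: overlay                      # spans multiple Docker hosts
```

Deployed with `docker stack deploy`, containers on different hosts can then reach each other by service name over the overlay network.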
This presentation describes how Hortonworks is delivering Hadoop on Docker for a cloud-agnostic deployment approach, as presented at Cisco Live 2015.
This session will examine the many options a data scientist has for running Spark clusters in public and private clouds. We will discuss various environments employing AWS, Mesos, Docker containers, and BlueData EPIC technologies, and the benefits and challenges of each.
Speakers:
Tom Phelan, Co-founder and Chief Architect - BlueData Inc. Tom has spent the last 25 years as a senior architect, developer, and team lead in the computer software industry in Silicon Valley. Prior to co-founding BlueData, Tom spent 10 years at VMware as a senior architect and team lead in the core R&D Storage and Availability group. Most recently, Tom led one of the key projects – vFlash, focusing on integration of server-based Flash into the vSphere core hypervisor. Prior to VMware, Tom was part of the early team at Silicon Graphics that developed XFS, one of the most successful open source file systems. Earlier in his career, he was a key member of the Stratus team that ported the Unix operating system to their highly available computing platform. Tom received his Computer Science degree from the University of California, Berkeley.
Structor - Automated Building of Virtual Hadoop Clusters (Owen O'Malley)
Discusses Vagrant scripts to set up and deploy a working multi-node Hadoop cluster, with or without security. All source code is available at https://github.com/hortonworks/structor .
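For flavor, a multi-machine Vagrantfile of the kind structor automates might look like the sketch below; the box name, hostnames, and IP addresses are illustrative assumptions, not structor's actual profiles:

```ruby
# Hypothetical minimal multi-node Vagrantfile, in the spirit of structor's profiles
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"             # assumed base box
  (1..3).each do |i|
    config.vm.define "node#{i}" do |node|
      node.vm.hostname = "node#{i}.example.com"
      node.vm.network "private_network", ip: "192.168.56.#{100 + i}"
      node.vm.provider "virtualbox" do |vb|
        vb.memory = 2048                 # Hadoop daemons need a few GB per node
      end
    end
  end
end
```

`vagrant up` then brings up the three VMs, onto which provisioning scripts install and configure the Hadoop services.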
CBlocks - POSIX-Compliant File Systems for HDFS (DataWorks Summit)
With YARN running Docker containers, it is possible to run applications that are not HDFS-aware inside these containers. It is hard to customize these applications, since most of them assume a POSIX file system with rewrite capabilities. In this talk, we will dive into how we created a block storage layer, how it is being tested internally, and the storage containers that make it all possible.
The storage container framework was developed as part of Ozone (HDFS-7240). This talk will also explore the current state of Ozone along with CBlocks, covering the architecture of storage containers, how replication is handled, scaling to millions of volumes, and I/O performance optimizations.
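The core idea of a block device layered over fixed-size storage containers can be sketched as an address-mapping function. The block and container sizes below are illustrative assumptions, not CBlocks' actual layout:

```python
BLOCK_SIZE = 4 * 1024           # 4 KiB logical blocks (assumed)
CONTAINER_SIZE = 5 * 1024**3    # 5 GiB storage containers (assumed)
BLOCKS_PER_CONTAINER = CONTAINER_SIZE // BLOCK_SIZE

def locate(block_number: int) -> tuple:
    """Map a volume's logical block number to (container_id, byte offset in container)."""
    container_id = block_number // BLOCKS_PER_CONTAINER
    offset = (block_number % BLOCKS_PER_CONTAINER) * BLOCK_SIZE
    return container_id, offset

# Block 0 lives at the start of container 0; the first block past a
# container boundary lands at offset 0 of the next container.
print(locate(0))                      # (0, 0)
print(locate(BLOCKS_PER_CONTAINER))   # (1, 0)
```

Replication then operates on whole containers rather than individual blocks, which is what lets the scheme scale to millions of volumes.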
Lessons Learned from Dockerizing Spark Workloads (BlueData, Inc.)
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed Big Data applications like Apache Spark.
Some of these challenges include container lifecycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. BlueData is “all in” on Docker containers – with a specific focus on Spark applications. They’ve learned first-hand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy Big Data workloads using Docker.
This session at Spark Summit in February 2017 (by Thomas Phelan, co-founder and chief architect at BlueData) described lessons learned as well as some tips and tricks on how to Dockerize your Big Data applications in a reliable, scalable, and high-performance environment.
In this session, Tom described how to network Docker containers across multiple hosts securely. He discussed ways to achieve high availability across distributed Big Data applications and hosts in your data center. And since we’re talking about very large volumes of data, performance is a key factor. So Tom discussed some of the storage options that BlueData explored and implemented to achieve near bare-metal I/O performance for Spark using Docker.
https://spark-summit.org/east-2017/events/lessons-learned-from-dockerizing-spark-workloads
April 2016 HUG: The Latest of Apache Hadoop YARN and Running Your Docker Apps (Yahoo Developer Network)
Apache Hadoop YARN is a modern resource-management platform that handles resource scheduling, isolation and multi-tenancy for a variety of data processing engines that can co-exist and share a single data-center in a cost-effective manner.
In the first half of the talk, we are going to give a brief look into some of the big efforts cooking in the Apache Hadoop YARN community.
We will then dig deeper into one of the efforts - supporting Docker runtime in YARN. Docker is an application container engine that enables developers and sysadmins to build, deploy and run containerized applications. In this half, we'll discuss container runtimes in YARN, with a focus on using the DockerContainerRuntime to run various docker applications under YARN. Support for container runtimes (including the docker container runtime) was recently added to the Linux Container Executor (YARN-3611 and its sub-tasks). We’ll walk through various aspects of running docker containers under YARN - resource isolation, some security aspects (for example container capabilities, privileged containers, user namespaces) and other work in progress features like image localization and support for different networking modes.
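A hedged sketch of the NodeManager configuration that this Docker runtime support enables follows; the property names track later Apache Hadoop releases and may differ from the YARN-3611-era patches:

```xml
<!-- yarn-site.xml: allow the Docker runtime alongside the default one -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
```

A job then opts in per container, for example by setting `YARN_CONTAINER_RUNTIME_TYPE=docker` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=<image>` in the container's launch environment.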
Speakers:
Vinod Kumar Vavilapalli is the Hadoop YARN and MapReduce guy at Hortonworks. He is a long-term Apache Hadoop contributor, a Hadoop committer, and a member of the Apache Hadoop PMC. He has a Bachelor's degree in Computer Science and Engineering from the Indian Institute of Technology Roorkee. He has been working on Hadoop for nearly 9 years and still has fun doing it. Straight out of college, he joined the Hadoop team at Yahoo! Bangalore, before Hortonworks happened. He is passionate about using computers to change the world for the better, bit by bit.
Sidharta Seethana is a software engineer at Hortonworks. He works on the YARN team, focusing on bringing new kinds of workloads to YARN. Prior to joining Hortonworks, Sidharta spent 10 years at Yahoo! Inc., working on a variety of large-scale distributed systems for core platforms/web services, search and marketplace properties, developer network, and personalization.
Tuning Apache Ambari Performance for Big Data at Scale with 3000 Agents (DataWorks Summit)
Apache Ambari manages Hadoop at large scale, and it becomes increasingly difficult for cluster admins to keep the machinery running smoothly as data grows and clusters scale from 30 to 3000 agents. To test at scale, Ambari has a Performance Stack that allows a VM to host as many as 50 Ambari Agents. The simulated stack, with 50 Agents per VM, can stress-test Ambari Server with the same load as a 3000-node cluster. This talk will cover how to tune the performance of Ambari and MySQL, and share performance benchmarks for operations like deploys, bulk operations, installation of bits, and Rolling & Express Upgrades. Moreover, the speaker will show how to use the Ambari Metrics System and Grafana to plot performance, detect anomalies, and pinpoint how to improve performance for a more responsive experience. Lastly, the talk will discuss roadmap features in Ambari 3.0 for improving performance and scale.
A brief introduction to YARN: how and why it came into existence and how it fits together with this thing called Hadoop.
Focus given to architecture, availability, resource management and scheduling, migration from MR1 to MR2, job history and logging, interfaces, and applications.
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirement of HBase's write-ahead log (WAL), a guarantee that HDFS provides correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase and provides the level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library implementation of the Raft consensus protocol in Java, and it is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
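The durability guarantee at the heart of a Raft-based log service comes from majority replication: a log entry counts as committed once a quorum of replicas has persisted it. A minimal sketch of that rule (illustrative only, not Ratis code):

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of acknowledgements that forms a majority."""
    return replicas // 2 + 1

def is_committed(acks: int, replicas: int) -> bool:
    """An entry is durable once a majority of replicas has persisted it."""
    return acks >= quorum_size(replicas)

# With 3 replicas, 2 acks commit an entry; losing any single node
# still leaves at least one copy of every committed entry.
print(quorum_size(3))        # 2
print(is_committed(1, 3))    # False
print(is_committed(2, 3))    # True
```

This is why a Raft-backed WAL can tolerate the loss of a minority of log servers without losing acknowledged writes.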
Using Ansible to deploy a 6-node Hortonworks Data Platform (Hadoop) cluster on AWS with the ObjectRocket ansible-hadoop playbook.
Presented at the Ansible NOVA MeetUp on February 23, 2017: https://www.meetup.com/Ansible-NOVA/events/236853616/
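A deployment like this is driven by an Ansible inventory. The sketch below is a hypothetical minimal inventory; the group names, hostnames, and playbook filename are illustrative and may differ from what the ansible-hadoop playbook actually expects:

```ini
; inventory: one master and five slaves for a 6-node HDP cluster (names assumed)
[master-nodes]
master01 ansible_host=10.0.0.10

[slave-nodes]
slave01 ansible_host=10.0.0.11
slave02 ansible_host=10.0.0.12
slave03 ansible_host=10.0.0.13
slave04 ansible_host=10.0.0.14
slave05 ansible_host=10.0.0.15

[hadoop-cluster:children]
master-nodes
slave-nodes
```

The cluster would then be deployed with something like `ansible-playbook -i inventory <playbook>.yml`, with the playbook name taken from the repository's documentation.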
As a company starts dealing with large amounts of data, operations engineers are challenged with managing the influx of information while ensuring the resilience of that data. Hadoop HDFS, Mesos, and Spark help reduce these issues with a scheduler that allows data cluster resources to be shared. This provides common ground where data scientists and engineers can meet, develop high-performance data processing applications, and deploy their own tools.
DevOps for Big Data: Cluster Management Tools (Ran Silberman)
What tools can we find today to manage a Hadoop cluster and its ecosystem?
There are two tools ready today:
Cloudera Manager, and Ambari from Hortonworks.
In this presentation I explain what they do and why to use them, as well as their pros and cons.
A step-by-step deep dive into the Kafka security world. This presentation covers some of the most sought-after questions in streaming / Kafka, such as what happens internally when SASL / Kerberos / SSL security is configured, and how the various Kafka components interact with each other. It could be a valuable resource for administrators, users, and application developers alike. Having this internal Kafka knowledge helps them configure, manage, and use Kafka systems in a more optimal way, with the fewest possible errors and mistakes.
Agenda:
- The various Kafka security models available: PLAINTEXT, SASL_PLAINTEXT, SASL_SSL, PLAINTEXT_SSL, and when to use which model
- Anatomy of each security model: an in-depth examination of these models and what happens internally when they are used, with real-life examples
- Do's and don'ts of Kafka security
- Common errors & troubleshooting
This talk is all about looking under the hood of Kafka security. Suitable for all levels, from beginner to expert.
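As a concrete illustration of the SASL_SSL model above, a broker-side `server.properties` fragment might look like the following; the hostname, file paths, and passwords are placeholders, and a matching JAAS configuration with the broker's Kerberos keytab is also required:

```properties
# Broker accepts Kerberos-authenticated clients over TLS
listeners=SASL_SSL://broker1.example.com:9093
security.inter.broker.protocol=SASL_SSL
sasl.enabled.mechanisms=GSSAPI
sasl.kerberos.service.name=kafka

# TLS key material (placeholder paths and passwords)
ssl.keystore.location=/etc/kafka/ssl/broker.keystore.jks
ssl.keystore.password=changeit
ssl.truststore.location=/etc/kafka/ssl/broker.truststore.jks
ssl.truststore.password=changeit
```

Clients connect with `security.protocol=SASL_SSL` and their own Kerberos credentials; misalignment between the listener protocol and the client's `security.protocol` is one of the most common errors the talk covers.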
Speaker
Vipin Rathor, Sr. Product Specialist (security), Hortonworks
Mesosphere and Contentteam: A New Way to Run Cassandra (DataStax Academy)
We, Ben Whitehead and Robert Stupp, will show you how to run Cassandra on Mesos. We will go through all the technical steps to plan, set up, and operate even large-scale Cassandra clusters on Mesos. Further, we illustrate how the Cassandra-on-Mesos framework helps you set up Cassandra on Mesos, schedule regular maintenance tasks, and manage hardware failures in the heart of your data center.
Lessons in Moving from Physical Hosts to Mesos (Raj Shekhar)
(speaker notes here : https://docs.google.com/document/d/12mXLYEFkEEd0pwOwD8bC1JQ8CPpx_PiRPXikHZ6MMYQ/pub )
t.co is the URL shortening service created by Twitter. As part of scaling up, t.co moved to using Mesos. We saw significant gains in deployment speed and scalability, and a reduction in operational headaches.
This talk will provide an introduction to Mesos + Aurora and cover how the t.co service migrated from running on physical hardware to Mesos. It will also cover the challenges t.co faced during the migration, the "gotchas", and debugging techniques for uncovering performance issues.
Agenda:
- Introduction to Mesos + Aurora
- Benefits of moving to Mesos
- Migration steps for moving t.co to Mesos
- Challenges faced and how t.co overcame them
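An Aurora job is described in a `.aurora` file, which is plain Python using Aurora's configuration DSL. The sketch below is illustrative only: the cluster, role, command line, and resource figures are invented, and this is not t.co's actual configuration:

```python
# service.aurora - hypothetical Aurora job definition (Aurora's Python DSL)
run_service = Process(
    name="url_shortener",
    cmdline="java -jar url-shortener.jar --port {{thermos.ports[http]}}",
)

task = Task(
    processes=[run_service],
    resources=Resources(cpu=1.0, ram=512 * MB, disk=1 * GB),
)

jobs = [
    Service(
        cluster="example",     # assumed cluster name
        role="www-data",
        environment="prod",
        name="url_shortener",
        task=task,
        instances=4,           # Aurora keeps 4 copies running across the Mesos cluster
    )
]
```

Aurora's scheduler then places the instances on Mesos agents and restarts them on failure, which is where much of the operational-headache reduction comes from.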
ABSTRACT: The ongoing big data revolution has transformed the way technology is used to empower new business segments like social networking and to transform old segments like traditional retail. However, the DNA used to build data processing platforms is evolving rapidly. There is a plethora of competing tools, technologies, and "religions" for how to build state-of-the-art data analysis frameworks. In this talk, I will go over five ways to build scalable, high-performance, long-lasting data analysis frameworks the wrong way. Surprisingly, the industry is full of examples of organizations building frameworks in this "wrong" way. Since the "right" way to build a technology framework depends on the key business drivers, it is my hope that this talk will spur a discussion on what the "right" way is for Pinterest. The talk will focus on technologies including "data plumbing" (e.g. tools in the Hadoop ecosystem) and statistical modeling methods (e.g. R and Python). In this talk, I'll try to connect with platform builders, data scientists, and business decision makers.
BIO: Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison, where he also earned his Ph.D. He has worked in the area of databases (now fashionably called “big data”) for over two decades. He has won several best paper awards, and industry research awards. He is the recipient of the Wisconsin COW teaching award, and the U. Michigan College of Engineering Education Excellence Award. He has a strong interest in seeing research ideas transition to actual products. His Ph.D. thesis work was acquired by NCR/Teradata in 1997, and he also co-founded Locomatix -- a startup that built a platform to power real-time data-driven mobile services. Locomatix became part of Twitter in 2013. He is an ACM Distinguished Scientist and an IEEE Senior Member. He also serves on the board of Lands’ End, and advises a number of startups.
What is the future of Hadoop?
What is the new future of Hadoop?
How is that different from the old one?
Here is how Ted Dunning answered these questions at the winter Hadoop Conference of Japan 2013.
Infrastructure Considerations for Analytical Workloads (Cognizant)
Using Apache Hadoop clusters and Mahout for analyzing big data workloads yields extraordinary performance; we offer a detailed comparison of running Hadoop in a physical vs. virtual infrastructure environment.
Slides for the talk given at MesosCon (https://mesoscon2015.sched.org/event/76ed472dbfb388b5f939dde31c7a3302#.Vd3-JyxViko)
Video is available at https://www.youtube.com/watch?v=lU2VE08fOD4
[KubeCon NA 2020] containerd: Rootless Containers 2020 (Akihiro Suda)
Rootless containers means running the container runtimes (e.g. runc, containerd, and kubelet), as well as the containers themselves, without host root privileges. The most significant advantage of rootless containers is that they mitigate potential container-breakout vulnerabilities in the runtimes, but they are also useful for isolating multi-user environments on HPC hosts. This talk will contain an introduction to rootless containers and deep-dive topics on recent updates such as Seccomp User Notification. The main focus will be on containerd (a CNCF Graduated Project) and its consumer projects, including Kubernetes and Docker/Moby, but other runtimes will be discussed as well.
https://sched.co/fGWc
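Rootless runtimes rely on user namespaces, which are typically configured through the subordinate ID files. A hedged example; the user name and ID ranges below are illustrative:

```
# /etc/subuid and /etc/subgid: grant user "alice" 65536 subordinate IDs
alice:100000:65536
```

Inside a rootless container, UID 0 then maps to the unprivileged host UID 100000, so a process that "breaks out" still holds no real root privileges on the host.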
With Hadoop 3.0.0-alpha2 being released in January 2017, it's time to take a closer look at the features and fixes of Hadoop 3.0.
We will look at Core Hadoop, HDFS, and YARN, and answer the emerging question: will Hadoop 3.0 be an architectural revolution like Hadoop 2 was with YARN & Co., or will it be more of an evolution, adapting to new use cases like IoT, machine learning, and deep learning (TensorFlow)?
Imagine this: You wake up one day in a world without technology – all the computers on the planet just disappeared. First of all, you’d wake up late because your smartphone alarm no longer exists.
My compilation of the changes and differences in the upcoming 3.0 version of Hadoop. Presented during a Meetup of the group https://www.meetup.com/Big-Data-Hadoop-Spark-NRW/
With the advent of Hadoop comes the need for professionals skilled in Hadoop administration, making it worthwhile to train as a Hadoop admin for better career, salary, and job opportunities.
HDFS Tiered Storage: Mounting Object Stores in HDFS (DataWorks Summit)
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Azure HDInsight and Amazon EMR. In these settings, but also in more traditional on-premises deployments, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between file systems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year’s Summit in San Jose. Lots of progress has been made since then and the feature is in active development at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and on Data2Day Meetup on 28.09.2017 in Heidelberg.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated by institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage keeps expanding alongside global internet usage, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as advancing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, often operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Opendatabay - Open Data Marketplace (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. The marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Big Data in Container; Hadoop Spark in Docker and Mesos
1. 1
Big Data in Container
Heiko Loewe @loeweh
Meetup Big Data Hadoop & Spark NRW 08/24/2016
2. 2
Why
• Fast Deployment
• Test/Dev Cluster
• Better Utilize Hardware
• Learn to manage Hadoop
• Test new Versions
• An appliance for continuous
integration/API testing
3. 3
Design
Master Container
- Name Node
- Secondary Name Node
- Yarn
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
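The design above can be sketched with plain Docker commands. This is a minimal sketch under stated assumptions: the image name `hadoop-cluster-img`, its `master`/`slave` entrypoint arguments, and the `MASTER_HOST` variable are hypothetical placeholders, not from the deck.

```shell
# Master container: Name Node, Secondary Name Node, YARN
# (image name and entrypoint arguments are hypothetical)
docker run -d --name hadoop-master -h master hadoop-cluster-img master

# Four slave containers: Node Manager + Data Node each,
# pointed at the master via an environment variable
for i in 1 2 3 4; do
  docker run -d --name hadoop-slave$i -h slave$i \
    -e MASTER_HOST=master hadoop-cluster-img slave
done
```

On a single host the containers can resolve each other via Docker's embedded DNS on a user-defined network; across hosts this is exactly where the overlay network of the next slides comes in.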
4. 4
More than 1 Host Needs an Overlay Net
The docker0 interface is not routed between hosts.
Overlay Network
A 1-host config is (almost) no problem; for 2 hosts and more we need an overlay network.
5. 5
Choice of the Overlay Network Impl.
Docker Multi-Host Network
• Backend: VXLAN.
• Fallback: none.
• Control plane: built-in, uses Zookeeper, Consul or Etcd for shared state.
CoreOS Flanneld
• Backend: VXLAN, AWS, GCE.
• Fallback: custom UDP-based tunneling.
• Control plane: built-in, uses Etcd for shared state.
Weave Net
• Backend: VXLAN via OVS.
• Fallback: custom UDP-based tunneling called “sleeve”.
• Control plane: built-in.
6. 6
Normal mode of operation is called FDP (fast data path), which works via OVS's data path kernel module (mainline since kernel 3.12). It's just another VXLAN implementation.
It has a sleeve fallback mode that works in userspace via pcap. Sleeve supports full encryption.
Weaveworks also has Weave DNS, Weave Scope and Weave Flux, providing introspection, service discovery & routing capabilities on top of Weave Net.
WEAVE NET
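As a command-level sketch of bringing up Weave Net across two hosts: `weave launch` and `weave env` are Weave's standard CLI entry points, while the peer IP and the image name are hypothetical examples.

```shell
# Host A: start the Weave router
weave launch

# Host B: start the router and peer with host A (IP is an example)
weave launch 10.0.0.1

# On each host: point the Docker client at the Weave proxy so that
# newly started containers are attached to the overlay network
eval $(weave env)

# Containers launched now get an address on the Weave network and
# can reach containers on the other host (image name hypothetical)
docker run -d --name hadoop-master -h master.weave.local hadoop-cluster-img
```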
7. 7
# /etc/sudoers, at the end:
vuser ALL=(ALL) NOPASSWD: ALL
# secure_path: append /usr/local/bin for weave
Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin
sudo groupadd docker
sudo gpasswd -a ${USER} docker
sudo chgrp docker /var/run/docker.sock
alias docker="sudo /usr/bin/docker"
Docker Adaptation (Fedora/CentOS/RHEL)
14. 14
• Containers (like Docker) are the foundation for agile software development
• The initial container design was stateless (12-factor app)
• Use cases have grown in the last few months (NoSQL, stateful apps)
• Persistence for containers is not easy
The Problem
15. 15
• Enables Persistence of Docker Volumes
• Enables the Implementation of
– Fast Bytes (Performance)
– Data Services (Protection / Snapshots)
– Data Mobility
– Availability
• Operations:
– Create, Remove, Mount, Path, Unmount
– Additional options can be passed to the volume driver
DOCKER Volume Manager API
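The operations listed above surface in the Docker CLI roughly as follows; the driver name `somedriver`, the `size` option, and the image name are hypothetical stand-ins for whichever volume plugin and image are installed.

```shell
# Create: a named volume via a volume plugin;
# -o passes additional options through to the driver
docker volume create --driver somedriver -o size=10G hdfs-data

# Mount / Path / Unmount: invoked by the Docker daemon
# when a container uses the volume
docker run -d -v hdfs-data:/hadoop/dfs/data datanode-img

# Remove: deletes the volume through the driver
docker volume rm hdfs-data
```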
16. 16
Persistent Volumes for CONTAINER
Container OS
Storage
/mnt/PersistentData
Container Container
-v /mnt/PersistentData:/mnt/ContainerData
Container Container
Docker Host
20. 20
Overlay Network
Stretch Hadoop w/ persistent Volumes
Host A
Host B
Easily stretch and shrink a cluster without losing the data
21. 21
Other similar Projects
• Bigtop Provisioner / Apache Foundation
https://github.com/apache/bigtop/tree/master/provisioner/docker
• Building Hortonworks HDP on Docker
http://henning.kropponline.de/2015/07/19/building-hdp-on-docker/
https://hub.docker.com/r/hortonworks/ambari-server/
https://hub.docker.com/r/hortonworks/ambari-agent/
• Building Cloudera CDH on Docker
http://blog.cloudera.com/blog/2015/12/docker-is-the-new-quickstart-option-for-apache-hadoop-and-cloudera/
https://hub.docker.com/r/cloudera/quickstart/
Watch out for overlay network topics
31. 31
What about the Data
Myriad only cares for the Compute
Master Container
- Name Node
- Secondary Name Node
- Yarn
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Slave Container
- Node Manager
- Data Node
Myriad/
Mesos
Cares about
Has to be provided
outside of
Myriad/Mesos
32. 32
What about the Data
• Myriad only cares for Compute / Map Reduce
• HDFS has to be provided in other ways
Big Data New Realities
Big Data Traditional Assumptions:
• Bare-metal
• Data locality
• Data on local disks
Big Data New Realities:
• Containers and VMs
• Compute and storage separation
• In-place access on remote data stores
New Benefits and Value:
• Big-Data-as-a-Service
• Agility and cost savings
• Faster time-to-insights
33. 33
Options for HDFS Data Layer
• Pure HDFS Cluster (only Data Node running)
– Bare Metal
– Containerized
– Mesos based
• Enterprise HDFS Array
– EMC Isilon
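A sketch of the "pure HDFS cluster, containerized" option above: Data-Node-only containers with persistent host volumes, registering with a Name Node that is provided elsewhere. The image name, paths, and Name Node address are hypothetical.

```shell
# Each container runs only a Data Node; its block storage is a
# persistent volume on the host, so data survives the container
for i in 1 2 3; do
  docker run -d --name datanode$i -h datanode$i \
    -v /mnt/PersistentData/dn$i:/hadoop/dfs/data \
    -e NAMENODE_ADDR=namenode.example.com:8020 \
    hdfs-datanode-img
done
```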
35. 35
• Multi Tenancy
• Multiple HDFS Environments
sharing the same storage
• Quota possible on HDFS
Environments
• Snapshots of HDFS environments possible
• Remote Replication
• WORM option for HDFS
• Highly available HDFS infrastructure (distributed Name and Data Nodes)
• Storage efficient (usable/raw 0.8
compared to 0.33 with Hadoop)
• Shared Access HDFS / CIFS /
NFS/SFTP possible
• Maintenance equals Enterprise
Array Standard
• All major Distributions supported
EMC Isilon Advantages over classic
Hadoop HDFS
Ok, so there really is a way to do this, but this means tons of work. These Dev guys want everything instant and I am just one person. How should I be able to deliver this?