The Elephant in the Cloud: A Quest for the Next Generation
In this talk, I will go through the evolution of Hadoop and its ecosystem projects and will try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various ways of crunching data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools complement each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general-purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available, now is truly the most exciting time to be a developer in the Hadoop ecosystem. It is also the time when you don't have to be employed by Yahoo!, Facebook or eBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away for anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first-class citizens in cloud environments, based on the work Pivotal engineers have done integrating Hadoop into the PivotalONE PaaS.
Apache Spark: killer or savior of Apache Hadoop? — rhatr
The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?
Fast, Scalable Graph Processing: Apache Giraph on YARN — DataWorks Summit
Apache Giraph performs offline, batch processing of very large graph datasets on top of a Hadoop cluster. Giraph replaces iterative MapReduce-style solutions with Bulk Synchronous Parallel graph processing using in-memory or disk-based data sets, loosely following the model of Google's Pregel. Many recent advances have left Giraph more robust, efficient, and fast, and able to accept a variety of I/O formats typical for graph data in and out of the Hadoop ecosystem. Giraph's recent port to a pure YARN platform offers increased performance, fine-grained resource control, and scalability that Giraph atop Hadoop MRv1 cannot match, while paving the way for ports to other platforms like Apache Mesos. Come see what's on the roadmap for Giraph, what Giraph on YARN means, and how Giraph is leveraging the power of YARN to become a more robust, usable, and useful platform for processing Big Graph datasets.
Search in the Apache Hadoop Ecosystem: Thoughts from the Field — Alex Moundalexis
This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined and used to solve specific and sometimes very complex problems. Drawing upon case studies from the field, Mr. Moundalexis demonstrates that one-size, rigid traditional systems don’t fit all, but that combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster — Milind Bhandarkar
The document summarizes Milind Bhandarkar's work developing Hamster, a system for running MPI applications on Hadoop YARN. Some key points:
- Hamster allows MPI applications to run alongside Hadoop dataflow jobs on the same cluster managed by YARN. It implements an MPI runtime on top of YARN.
- Hamster's design leverages OpenMPI's strengths while allowing it to integrate with YARN. It includes an application master, node service, and scheduler component.
- Performance tests show Hamster has low overhead and scales well for large MPI jobs. It introduces only a small performance penalty compared to running MPI natively with OpenMPI.
- Example results are shown
- Hadoop was created to allow processing of large datasets in a distributed, fault-tolerant manner. It was originally developed by Doug Cutting and Mike Cafarella as part of the Apache Nutch project, inspired by the papers Google published on its distributed file system and MapReduce.
- The core of Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for distributed processing. It also includes utilities like Hadoop Common for file system access and other basic functionality.
- Hadoop's goals were to process multi-petabyte datasets across commodity hardware in a reliable, flexible and open source way. It assumes failures are expected and handles them to provide fault tolerance.
This document provides an overview of Apache Giraph, an open source system for processing large graphs distributed across clusters. It discusses how Giraph implements Google's Pregel model using Hadoop and allows processing billion-edge graphs through its bulk synchronous parallel programming model. Key points covered include Giraph's architecture, programming model based on vertices sending messages to other vertices, example applications like ranking and community detection, and improvements to performance through use of Netty for messaging.
10 concepts the enterprise decision maker needs to understand about Hadoop — Donald Miner
Way too many enterprise decision makers have clouded and uninformed views of how Hadoop works and what it does. Donald Miner offers high-level observations about Hadoop technologies and explains how Hadoop can shift the paradigms inside of an organization, based on his report Hadoop: What You Need To Know—Hadoop Basics for the Enterprise Decision Maker, forthcoming from O’Reilly Media.
After a basic introduction to Hadoop and the Hadoop ecosystem, Donald outlines 10 basic concepts you need to understand to master Hadoop:
Hadoop masks being a distributed system: what it means for Hadoop to abstract away the details of distributed systems and why that’s a good thing
Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
Hadoop runs on commodity hardware: an honest definition of commodity hardware and why this is a good thing for enterprises
Hadoop handles unstructured data: why Hadoop is better for unstructured data than other data systems from a storage and computation perspective
In Hadoop, you load data first and ask questions later: the differences between schema-on-read and schema-on-write and the drawbacks this represents
Hadoop is open source: what it really means for Hadoop to be open source from a practical perspective, not just a “feel good” perspective
HDFS stores the data but has some major limitations: an overview of HDFS (replication, not being able to edit files, and the NameNode)
YARN controls everything going on and is mostly behind the scenes: an overview of YARN and the pitfalls of sharing resources in a distributed environment and the capacity scheduler
MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
The Hadoop ecosystem is constantly growing and evolving: an overview of current tools such as Spark and Kafka and a glimpse of some things on the horizon
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition) — Uwe Printz
Talk held at the IT-Stammtisch Darmstadt on November 8, 2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
The Evolution of Hadoop at Spotify - Through Failures and Pain — Rafał Wojdyła
The quickest way to learn and evolve infrastructure is by encountering obstacles and being forced to overcome limitations that keep you inches away from project goals. At Spotify, we’ve encountered many of these obstacles and frustrations as we grew our Hadoop cluster from a few machines in an office closet aggregating played song events for financial reports, to our current 900 node cluster that plays a large role in many features that you see in our application today.
Two members of Spotify’s Hadoop ‘squad’ will weave in war stories, failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and talk about how that architecture has evolved.
We’ll talk about how and why we use a number of tools, including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive w/ Tez to build features and provide analytics; and Snakebite and Luigi, two in-house tools created to overcome common frustrations.
Hivemall is a scalable machine learning library built as a collection of Hive UDFs. It allows users to perform machine learning tasks like classification, regression, recommendation, and anomaly detection using SQL queries. This provides an easy and scalable way to do machine learning without needing to code in other languages or move data outside of Hive. Hivemall implements many common algorithms as UDFs and UDTFs so that machine learning can be performed interactively on large datasets stored in Hive.
The document provides an overview of the Hadoop ecosystem. It introduces Hadoop and its core components, including MapReduce and HDFS. It describes other related projects like HBase, Pig, Hive, Mahout, Sqoop, Flume and Nutch that provide data access, algorithms, and data import capabilities to Hadoop. The document also discusses hosted Hadoop frameworks and the major Hadoop providers.
Hadoop Summit 2016 presentation.
As Yahoo continues to grow and diversify its mobile products, its platform team faces an ever-expanding, sophisticated group of analysts and product managers, who demand faster answers: to measure key engagement metrics like Daily Active Users, app installs and user retention; to run longitudinal analysis; and to re-calculate metrics over rolling time windows. All of this, quickly enough not to need a coffee break while waiting for results. The optimal solution for this use-case would have to take into account raw performance, cost, security implications and ease of data management. Could the benefits of Hive, ORC and Tez, coupled with a good data design provide the performance our customers crave? Or would it make sense to use more nascent, off-grid querying systems? This talk will examine the efficacy of using Hive for large-scale mobile analytics. We will quantify Hive performance on a traditional, shared, multi-tenant Hadoop cluster, and compare it with more specialized analytics tools on a single-tenant cluster. We will also highlight which tuning parameters yield maximum benefits, and analyze the surprisingly ineffectual ones. Finally, we will detail several enhancements made by Yahoo's Hive team (in split calculation, stripe elimination and the metadata system) to successfully boost performance.
The document discusses functional programming concepts and their application to big data problems. It provides an overview of functional programming foundations and languages. Key functional programming concepts discussed include first-class functions, pure functions, recursion, and immutability. These concepts are well-suited for data-centric applications like Hadoop MapReduce. The document also presents a case study comparing an imperative approach to a transaction processing problem to a functional approach, showing that the functional version was faster and avoided side effects.
How Apache Drives Music Recommendations At Spotify — Josh Baer
The slides go through the high-level process of generating personalized playlists for all Spotify's users, using Apache big data products extensively.
Presentation given at Apache: Big Data Europe conference on September 29th, 2015 in Budapest.
This document provides an overview of Hadoop, MapReduce, and HDFS. It discusses how Hadoop uses a cluster of commodity hardware and HDFS to reliably store and process large amounts of data in a distributed manner. MapReduce is the programming model used by Hadoop to process data in parallel across nodes. The document describes the core Hadoop modules and architecture, how HDFS stores and retrieves data blocks, and how MapReduce distributes work and aggregates results. Examples of using MapReduce for word counting and inverted indexes are also presented.
This document provides an agenda and summaries for a meetup on introducing DataFrames and R on Apache Spark. The agenda includes overviews of Apache Spark 1.3, DataFrames, R on Spark, and large scale machine learning on Spark. There will also be discussions on news items, contributions so far, what's new in Spark 1.3, more data source APIs, what DataFrames are, writing DataFrames, and DataFrames with RDDs and Parquet. Presentations will cover Spark components, an introduction to SparkR, and Spark machine learning experiences.
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.
This document discusses big data analytics platforms and techniques. It describes various open-source projects like Hadoop, Spark, and Mahout that can perform analytics on large datasets. It also discusses commercial analytics platforms from vendors like SAS, Alpine, and Revolution Analytics. Spark is highlighted as gaining rapid adoption for its speed and expanding machine learning capabilities. Key questions are raised about which open-source projects and commercial offerings will emerge as leaders in their categories.
The Apache Hadoop project and the Hadoop ecosystem have been designed to be extremely flexible and extensible. HDFS, YARN, and MapReduce combined have more than 1,000 configuration parameters that allow users to tune the performance of Hadoop applications and, more importantly, extend Hadoop with application-specific functionality without having to modify any of the core Hadoop code.
In this talk, I will start with simple extensions, such as writing a new InputFormat to efficiently process video files. I will then present some extensions that boost application performance, such as optimized compression codecs and pluggable shuffle implementations. With the refactoring of the MapReduce framework and the emergence of YARN as a generic resource manager for Hadoop, one can extend Hadoop further by implementing new computation paradigms.
I will discuss one such computation framework, which allows message-passing applications to run in the Hadoop cluster alongside MapReduce. I will conclude by outlining some of our ongoing work that extends HDFS by removing the namespace limitations of the current Namenode implementation.
Hivemall: Scalable Machine Learning Library for Apache Hive — DataWorks Summit
This document discusses Hivemall, a scalable machine learning library for Apache Hive. It begins with an overview of what Hivemall is and why it was created. Hivemall implements machine learning algorithms like classification, regression, and recommendation as Hive UDFs and UDTFs. It aims to make machine learning more accessible to users of Hive by eliminating the need for programming. The document then covers how Hivemall works, how it handles iterations without multiple MapReduce jobs, and experimental results comparing it to other frameworks.
This is a talk I gave at Data Science MD meetup. It was based on the talk I gave about a month before at Data Science NYC (http://www.slideshare.net/DonaldMiner/data-scienceandhadoop). I talk about data exploration, NLP, Classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable, and distributed processing of petabytes of data. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Many large companies use Hadoop for applications such as log analysis, web indexing, and data mining of large datasets.
Hadoop - Looking to the Future By Arun Murthy — huguk
Hadoop - Looking to the Future
By Arun Murthy (Founder of Hortonworks, Creator of YARN)
The Apache Hadoop ecosystem began as just HDFS & MapReduce nearly 10 years ago in 2006.
Very much like the Ship of Theseus (http://en.wikipedia.org/wiki/Ship_of_Theseus), Hadoop has undergone an incredible amount of transformation: from multi-purpose YARN, to interactive SQL with Hive/Tez, to machine learning with Spark.
Much more lies ahead: whether you want sub-second SQL with Hive or use SSDs/Memory effectively in HDFS or manage Metadata-driven security policies in Ranger, the Hadoop ecosystem in the Apache Software Foundation continues to evolve to meet new challenges and use-cases.
Arun C Murthy has been involved with Apache Hadoop since the beginning of the project - nearly 10 years now. In the beginning he led MapReduce, went on to create YARN and then drove Tez & the Stinger effort to get to interactive & sub-second Hive. Recently he has been very involved in the Metadata and Governance efforts. In between he founded Hortonworks, the first public Hadoop distribution company.
Apache Spark is rapidly emerging as the prime platform for advanced analytics in Hadoop. This briefing is updated to reflect news and announcements as of July 2014.
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes — rhatr
Apache Giraph allows users to start analyzing graph relationships in big data within 45 minutes. It is an Apache Hadoop-based framework for graph processing that uses the Bulk Synchronous Parallel (BSP) model. Giraph allows for extracting graph relationships from unstructured data and iterative, exploratory analytics on large graphs distributed across a cluster. It provides a programming model and API for graph processing that leverages Hadoop and HDFS for storage and parallelism.
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata — rhatr
You’ve got your Hadoop cluster, you’ve got your petabytes of unstructured data, you run MapReduce jobs and SQL-on-Hadoop queries. Something is still missing, though. After all, we are not expected to enter SQL queries while looking for information on the web. AltaVista and Google solved it for us ages ago. Why are we still requiring SQL or Java certification from our enterprise bigdata users? In this talk, we will look into how the integration of SolrCloud into Apache Bigtop now enables building bigdata indexing solutions and ingest pipelines. We will dive into the details of integrating full-text search into the lifecycle of your bigdata management applications and exposing the power of Google-in-a-box to all enterprise users, not just a chosen few data scientists.
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and an ecosystem of related projects like Hive, HBase, Pig and Zookeeper that provide additional functions. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and additional tools like Hive, Pig, HBase, Zookeeper, Flume, Sqoop and Oozie that make up its ecosystem. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
This document discusses building big data solutions using Microsoft's HDInsight platform. It provides an overview of big data and Hadoop concepts like MapReduce, HDFS, Hive and Pig. It also describes HDInsight and how it can be used to run Hadoop clusters on Azure. The document concludes by discussing some challenges with Hadoop and the broader ecosystem of technologies for big data beyond just Hadoop.
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard) — npinto
This document provides an introduction and overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It outlines what Hadoop is, how its core components MapReduce and HDFS work, advantages like scalability and fault tolerance, disadvantages like complexity, and resources for getting started with Hadoop installations and programming.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable, and distributed processing of large data sets across clusters of commodity hardware. The core of Hadoop includes a storage part called HDFS for reliable data storage, and a processing part called MapReduce that processes data in parallel on a large cluster. Hadoop also includes additional projects like Hive, Pig, HBase, Zookeeper, Oozie, and Sqoop that together form a powerful data processing ecosystem.
This presentation provides an overview of Hadoop, including what it is, how it works, its architecture and components, and examples of its use. Hadoop is an open-source software platform for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable and distributed processing of large datasets through its core components - the Hadoop Distributed File System (HDFS) for storage, and MapReduce for processing.
Apache Giraph is a large-scale graph processing system built on Hadoop. It provides an iterative processing model and vertex-centric programming model for graphs that can be too large for a single machine. Giraph scales to graphs with trillions of edges by distributing computation across a Hadoop cluster. It is faster than traditional MapReduce approaches for graph algorithms and allows graphs to be processed in memory across iterations while only writing intermediate data to disk.
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2 — tcloudcomputing-tw
The presentation is designed for those interested in Hadoop technology and covers community history, current development status, service features, the distributed computing framework, and scenarios for big data development in the enterprise.
Scaling up with Hadoop and Banyan at ITRIX-2015, College of Engineering, Guindy — Rohit Kulkarni
The document discusses LatentView Analytics and provides an overview of data processing frameworks and MapReduce. It introduces LatentView Analytics, describing its services, partners, and experience. It then discusses distributed and parallel processing frameworks, providing examples like Hadoop, Spark, and Storm. It also provides a brief history of Hadoop, describing its key developments from 1999 to present day in addressing challenges of indexing, crawling, distributed processing etc. Finally, it explains the MapReduce process and provides a simple example to illustrate mapping and reducing functions.
Apache Hadoop is a popular open-source framework for storing and processing large datasets across clusters of computers. It includes Apache HDFS for distributed storage, YARN for job scheduling and resource management, and MapReduce for parallel processing. The Hortonworks Data Platform is an enterprise-grade distribution of Apache Hadoop that is fully open source.
Hortonworks' mission is to enable modern data architectures by delivering an enterprise-ready Apache Hadoop platform. They contribute the majority of code to Apache Hadoop and its related projects. Hortonworks develops the Hortonworks Data Platform (HDP), which provides core Hadoop services along with operational and data services to make Hadoop an enterprise data platform. Hortonworks aims to power data architectures by enabling Hadoop as a multi-purpose platform for batch, interactive, streaming and other workloads through projects like YARN, Tez, and improvements to Hive.
1. Elephant in the Cloud:
a quest for the next generation
Hadoop architecture
Roman Shaposhnik
Sr. Manager, Open Source Hadoop Platform @Pivotal
(Twitter: @rhatr)
2. Who’s this guy?
• Sr. Manager @Pivotal building a team of OS contributors
• Apache Software Foundation guy (VP of Apache Incubator, VP of
Apache Bigtop, committer on Hadoop, Giraph, Sqoop, etc)
• Used to be root@Cloudera
• Used to be PHB@Yahoo! (original Hadoop team)
• Used to be a hacker at Sun Microsystems (Sun Studio compilers
and tools)
11. Big Data Utility Gap (average enterprises)
• 70% of data generated by customers
• 80% of data being stored
• 3% being prepared for analysis
• 0.5% being analyzed
• <0.5% being operationalized
(3 exabytes generated per day now; 40 trillion total gigabytes in 2020, or 162 iPhones of storage for every human.)
15. HDFS: not a POSIX fs
• Huge blocks: 64MB (128MB)
• Mostly immutable files (append, truncate)
• Streaming data access
• Block replication
16. How do I use it?
$ hadoop fs -lsr /
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
$ ls /mnt
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
$ ls /mnt
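Beyond the shell, FUSE, and NFS gateways, applications typically talk to HDFS through the Java FileSystem API. A minimal sketch of listing the root directory (the class name and path here are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsRoot {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Print the path and length of every entry under /
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }
  }
}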
18. Pivotal's Focus on Data Lakes
[Architecture diagram: a Data Lake on HDFS holding raw "untouched" data, processed data, and analytical data marts/sandboxes. In-memory parallel ingest feeds it from traditional data sources (ERP, HR, SFDC), existing EDW/datamarts, and new data sources/formats (machine data). ELT processing runs on Hadoop (MapReduce/SQL/Pig/Hive); in-memory services, data management (search engine), and BI/analytical tools sit on top, with security and control throughout. Business users: "Finally! I now have full transparency on the data with amazing speed!", "All data is now accessible!", "I can now afford Big Data!"]
21. MapReduce
• Batch oriented (long jobs; final results)
• Brings the computation to the data
• Very constrained programming model
• Embarrassingly parallel programming model
• Used to be the only game in town for compute
28. How do I use it?
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
29. How do I use it?
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
30. How do I run it?
$ hadoop jar hadoop-examples.jar wordcount input output
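For completeness, a minimal driver that wires TokenizerMapper and IntSumReducer into a job might look like this (a sketch; the WordCount enclosing class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);        // illustrative enclosing class
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the reducer doubles as a combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}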
32. Hadoop’s childhood
• Compact (pretty much a single jar)
• Challenged in scalability and SPOFs
• Extremely batch oriented
• Hard for non-Java programmers
36. Hadoop 2.0
• HDFS 2.0
• Yet Another Resource Negotiator (YARN)
• MapReduce is just an “application” now
• Tez is another “application”
• Pivotal's Hamster (OpenMPI) is yet another one
42. Hamster
• Hadoop and MPI on the same cluster
• OpenMPI Runtime on Hadoop YARN
• Hadoop Provides: Resource Scheduling,
Process monitoring, Distributed File System
• Open MPI Provides: Process launching,
Communication, I/O forwarding
52. Apache HBase
• Small mutable records vs. HDFS files
• HFiles kept in HDFS
• Memcached for HDFS
• Built on HDFS and Zookeeper
• Modeled after Google's Bigtable
53. HBase data model
• Driven by the original Webtable use case:
  row key: com.cnn.www
    content:      → "html..."
    anchor:a.com  → "CNN"
    anchor:b.com  → "CNN.co"
54. How do I use it?
HTable table = new HTable(config, "table");
Put p = new Put(Bytes.toBytes("row"));
p.add(Bytes.toBytes("family"),
      Bytes.toBytes("qualifier"),
      Bytes.toBytes("data"));
table.put(p);
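Reading the cell back is symmetric. A minimal sketch against the same (illustrative) table, family, and qualifier names:

Get g = new Get(Bytes.toBytes("row"));
Result r = table.get(g);
// Fetch the single cell stored by the Put above
byte[] data = r.getValue(Bytes.toBytes("family"), Bytes.toBytes("qualifier"));
System.out.println(Bytes.toString(data));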
60. GemFire XD: a better HBase?
• Closed source but extremely mature
• SQL/Objects/JSON data model
• High concurrency, high update load
• Mostly selective point queries (no scans)
• Tiered storage architecture
61. YCSB Benchmark: Throughput is 2-12X
[Bar charts: throughput (ops/sec, 0-800,000) for HBase vs. GemFire XD across YCSB workloads (AU, BU, CU, D, FU, LOAD), with series for 4, 8, 12, and 16 nodes.]
64. Querying data
• MapReduce: “an assembly language”
• Apache Pig: a data manipulation DSL (now
Turing complete!)
• Apache Hive: a batch-oriented SQL on top
of Hadoop
65. How do I use Pig?
grunt> A = load './input.txt';
grunt> B = foreach A generate
           flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B), group;
grunt> dump D;
66. How do I use Hive?
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
67. Can we short Oracle now?
• No indexing
• Batch oriented scheduling
• Optimization for long running queries
• Metadata management is still in flux
72. Getting data in: Flume
• Designed for collecting log data
• Flexible deployment topology
73. Sqoop: RDBMS connection
• Sqoop 1
  • A MapReduce tool
  • Must use Oozie for workflows
• Sqoop 2
  • Well, 0.99.x really
  • A standalone service
74. Spring XD
• Unified, distributed, extensible system for data ingestion, real-time analytics and data export
• Apache licensed, but not an ASF project
• A runtime service, not a library
• AKA "Oozie + Flume + Sqoop + Morphlines"
75. How do I use it?
# deployment: ./xd-singlenode
$ ./xd-shell
xd:> hadoop config fs --namenode hdfs://nn:8020
xd:> stream create --definition "time | hdfs" --name ticktock
xd:> stream destroy --name ticktock
76. Feeding the Elephant
[Ecosystem diagram built around HDFS and YARN: getting data in via Sqoop and Flume; processing with MapReduce, Tez, Pig, Hive, Giraph, Crunch, and Mahout; serving with HBase, Phoenix, and SolrCloud; coordination and workflow management with Zookeeper and Oozie. Legend: ASF projects (the above), FLOSS projects (Hadoop UI: Hue), Pivotal products (Command Center, GemFire XD, SpringXD, Hamster).]
78. What’s wrong with MR?
Source: UC Berkeley Spark project (just the image)
79. Spark innovations
• Resilient Distributed Datasets (RDDs)
• Distributed on a cluster
• Manipulated via parallel operators (map, etc.)
• Automatically rebuilt on failure
• A parallel ecosystem
• A solution to iterative and multi-stage apps
82. An alternative backend
• Shark: a Hive on Spark
• Spork: a Pig on Spark
• MLlib: machine learning on Spark
• GraphX: Graph processing on Spark
• Also featuring its own streaming engine
83. How do I use it?
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
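For comparison, a minimal sketch of the same pipeline in Spark's Java API (assuming the Spark 1.x API of this era, where flatMap takes a function returning an Iterable, and Java 8 lambdas):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class JavaWordCount {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("wordcount"));
    JavaRDD<String> file = sc.textFile("hdfs://...");
    JavaPairRDD<String, Integer> counts = file
        .flatMap(line -> Arrays.asList(line.split(" "))) // lines -> words
        .mapToPair(word -> new Tuple2<>(word, 1))        // word -> (word, 1)
        .reduceByKey((a, b) -> a + b);                   // sum per word
    counts.saveAsTextFile("hdfs://...");
    sc.stop();
  }
}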
87. Hadoop Maturity
• ETL Offload: accommodate massive data growth with existing EDW investments
• Data Lakes: unify unstructured and structured data access
• Big Data Apps: build analytic-led applications impacting top-line revenue
• Data-Driven Enterprise: app dev and operational management on HDFS
(Stages along the data architecture maturity curve.)
88. Pivotal HD on Pivotal CF
• Enterprise PaaS management system
• Flexible multi-language 'buildpack' architecture
• Deployed applications enjoy built-in services
• On-premise Hadoop as a Service
• Single-cluster deployment of Pivotal HD
• Developers instantly bind to shared Hadoop clusters
• Speeds up time-to-value
89. Pivotal Data Fabric Evolution
[Diagram: the Pivotal Data Platform on a software-defined datacenter, combining SQL services (analytic data marts), an in-memory database (operational intelligence, run-time applications), data management services (data staging platform), and streaming services (stream ingestion), with new data fabrics (in-memory grid, etc.) emerging.]