Big data is generated from a variety of sources at a massive scale and high velocity. Hadoop is an open source framework that allows processing and analyzing large datasets across clusters of commodity hardware. It uses a distributed file system called HDFS that stores multiple replicas of data blocks across nodes for reliability. Hadoop also uses a MapReduce processing model where mappers process data in parallel across nodes before reducers consolidate the outputs into final results. An example demonstrates how Hadoop would count word frequencies in a large text file by mapping word counts across nodes before reducing the results.
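To make the word-count example concrete, here is a minimal sketch of that pattern using the standard Hadoop MapReduce Java API; the class names and paths are illustrative rather than taken from the document.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs in parallel on each input split and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives all counts for a given word and sums them into a final total.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Assuming the class is packaged into a jar, it could be launched with something like `hadoop jar wordcount.jar WordCount /input /output`, where both paths refer to HDFS directories.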
The document is a slide deck for a training on Hadoop fundamentals. It includes an agenda that covers what big data is, an introduction to Hadoop, the Hadoop architecture, MapReduce, Pig, Hive, Jaql, and certification. It provides overviews and explanations of these topics through multiple slides with images and text. The slides also describe hands-on labs for attendees to complete exercises using these big data technologies.
Supporting Financial Services with a More Flexible Approach to Big Data - WANdisco Plc
In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
The document provides an overview of big data and Hadoop, discussing what big data is, current trends and challenges, approaches to solving big data problems including distributed computing, NoSQL, and Hadoop, and introduces HDFS and the MapReduce framework in Hadoop for distributed storage and processing of large datasets.
Hadoop and WANdisco: The Future of Big Data - WANdisco Plc
View the webinar recording here... http://youtu.be/O1pgMMyoJg0
Who: WANdisco CEO, David Richards, and core creators of Apache Hadoop, Dr. Konstantin Shvachko and Jagane Sundare.
What: WANdisco recently acquired AltoStor, a pioneering firm with deep expertise in the multi-billion dollar Big Data market.
New to the WANdisco team are the Hadoop core creators, Dr. Konstantin Shvachko and Jagane Sundare. They will cover the acquisition and reveal how WANdisco's active-active replication technology will change the game of Big Data for the enterprise in 2013.
Hadoop, a proven open source Big Data technology, is the backbone of Yahoo, Facebook, Netflix, Amazon, eBay and many of the world's largest databases.
When: Tuesday, December 11th at 10am PST (1pm EST).
Why: In this 30-minute webinar you'll learn:
The staggering, cross-industry growth of Hadoop in the enterprise
How Hadoop's limitations, including HDFS's single point of failure, are impacting the productivity of the enterprise
How WANdisco's active-active replication technology will alleviate these issues by adding high availability to Hadoop, taking a fundamentally different approach to Big Data
View the webinar Q&A on the WANdisco blog here...http://blogs.wandisco.com/2012/12/14/answers-to-questions-from-the-webinar-of-dec-11-2012/
Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. The core of Hadoop consists of HDFS for storage and MapReduce for processing. Hadoop has been expanded with additional projects including YARN for job scheduling and resource management, Pig for data-flow scripting, Hive for SQL-like queries, HBase for column-oriented storage, ZooKeeper for coordination, and Ambari for provisioning and managing Hadoop clusters. Hadoop provides scalable and cost-effective solutions for storing and analyzing massive amounts of data.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014 - Chris Almond
Hadoop has quickly evolved into the system of choice for storing and processing Big Data, and is now widely used to support mission-critical applications that operate within a 'data lake' style infrastructure. A critical requirement of such applications is the need for continuous operation even in the event of various system failures. This requirement has driven adoption of multi-data center Hadoop architectures, a.k.a. geo-distributed or global Hadoop. In this session we will provide a brief introduction to WANdisco, then dig into how our Non-Stop Hadoop solution addresses real-world use cases, and also show a live demonstration of Non-Stop NameNode operation across two WAN-connected Hadoop clusters.
Hadoop is an open source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It is based on Google's MapReduce programming model and the design of the Google File System. The Hadoop architecture includes a distributed file system (HDFS) that stores data across the cluster and a job scheduling and resource management framework (YARN) that allows distributed processing of large datasets in parallel. Key components include the NameNode, DataNodes, ResourceManager and NodeManagers. Hadoop provides reliability through replication of data blocks and automatic recovery from failures.
This document provides an overview of Hadoop, an open source framework for distributed storage and processing of large datasets across clusters of computers. It discusses that Hadoop was created to address the challenges of "Big Data" characterized by high volume, variety and velocity of data. The key components of Hadoop are HDFS for storage and MapReduce as an execution engine for distributed computation. HDFS uses a master-slave architecture with a NameNode master and DataNode slaves, and provides fault tolerance through data replication. MapReduce allows processing of large datasets in parallel through mapping and reducing functions.
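As a rough sketch of how a client interacts with this master-slave architecture, the following Java snippet uses the standard Hadoop FileSystem API: the client asks the NameNode for metadata, while the actual block bytes flow to and from DataNodes behind the scenes. The NameNode address and file path below are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // hypothetical NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/hello.txt"); // hypothetical path

      // Write a small file; HDFS splits it into blocks and replicates each block on DataNodes.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("hello from hdfs\n".getBytes(StandardCharsets.UTF_8));
      }

      // Read it back through the same API.
      try (BufferedReader reader =
               new BufferedReader(new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
        System.out.println(reader.readLine());
      }
    }
  }
}
```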
Introduction to Hadoop - The Essentials - Fadi Yousuf
This document provides an introduction to Hadoop, including:
- A brief history of Hadoop and how it was created to address limitations of relational databases for big data.
- An overview of core Hadoop concepts like its shared-nothing architecture and using computation near storage.
- Descriptions of HDFS for distributed storage and MapReduce as the original programming framework.
- How the Hadoop ecosystem has grown to include additional frameworks like Hive, Pig, HBase and tools like Sqoop and Zookeeper.
- A discussion of YARN which separates resource management from job scheduling in Hadoop.
This document provides an overview of Apache Hadoop, including its history, architecture, and key components. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. It allows for the distributed processing of large data sets across clusters of computers using a simple programming model. The document outlines Hadoop's origins from Google's paper on MapReduce and the GFS file system. It describes Hadoop's core components - the Hadoop Distributed File System (HDFS) for storage and MapReduce for distributed processing. Use cases for Hadoop including log analysis, search, and analytics are also mentioned.
Selective Data Replication with Geographically Distributed Hadoop - DataWorks Summit
This document discusses selective data replication with geographically distributed Hadoop. It describes running Hadoop across multiple data centers as a single cluster. A coordination engine ensures consistent metadata replication and a global sequence of updates. Data is replicated asynchronously over the WAN for fast ingestion. Selective data replication allows restricting replication of some data to specific locations for regulations, temporary data, or ingest-only use cases. Heterogeneous storage zones with different performance profiles can also be used for selective placement. This architecture aims to provide a single unified file system view, strict consistency, continuous availability, and geographic scalability across data centers.
This document provides summaries of various distributed file systems and distributed programming frameworks that are part of the Hadoop ecosystem. It summarizes Apache HDFS, GlusterFS, QFS, Ceph, Lustre, Alluxio, GridGain, XtreemFS, Apache Ignite, Apache MapReduce, and Apache Pig. For each one it provides 1-3 links to additional resources about the project.
The document discusses backup and disaster recovery strategies for Hadoop. It focuses on protecting data sets stored in HDFS. HDFS uses data replication and checksums to protect against disk and node failures. Snapshots can protect against data corruption and accidental deletes. The document recommends copying data from the primary to secondary site for disaster recovery rather than teeing, and discusses considerations for large data movement like bandwidth needs and security. It also notes the importance of backing up metadata like Hive configurations along with core data.
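A minimal sketch of the snapshot idea mentioned above, assuming the directory has already been made snapshottable by an administrator; the directory and snapshot names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      // The directory must first be made snapshottable by an administrator,
      // e.g. with: hdfs dfsadmin -allowSnapshot /data/warehouse
      Path dir = new Path("/data/warehouse"); // hypothetical directory

      // Take a point-in-time, read-only snapshot; it protects against accidental
      // deletes and gives a stable source for a copy job to the secondary site.
      Path snapshot = fs.createSnapshot(dir, "before-nightly-copy");
      System.out.println("Created snapshot at " + snapshot);
    }
  }
}
```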
Big Data raises challenges about how to process such a vast pool of raw data and how to turn it into value for our lives. To address these demands, an ecosystem of tools named Hadoop was conceived.
Big data relates to large and complex datasets that are difficult to analyze using traditional methods. It refers to datasets that are too large to be handled by typical database management tools. Big data can help organizations make better evidence-based decisions by analyzing structured and unstructured data from a variety of sources using specialized analytical techniques.
Big data refers to the massive amounts of unstructured data that are growing exponentially. Hadoop is an open-source framework that allows processing and storing large data sets across clusters of commodity hardware. It provides reliability and scalability through its distributed file system HDFS and MapReduce programming model. The Hadoop ecosystem includes components like Hive, Pig, HBase, Flume, Oozie, and Mahout that provide SQL-like queries, data flows, NoSQL capabilities, data ingestion, workflows, and machine learning. Microsoft integrates Hadoop with its BI and analytics tools to enable insights from diverse data sources.
This presentation helps you understand the basics of Hadoop.
What is Big Data? How does Google search so fast, and what is the MapReduce algorithm? All these questions will be answered in the presentation.
Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and the execution of Map/Reduce routines to run on the data in that cluster. Hadoop has its own filesystem which replicates data to multiple nodes to ensure that if one node holding data goes down, there are at least 2 other nodes from which to retrieve that piece of information. This protects data availability from node failure, something which is critical when there are many nodes in a cluster (aka RAID at a server level).
What is Hadoop? The data are stored in a relational database in your desktop computer, and this desktop computer has no problem handling this load. Then your company starts growing very quickly, and that data grows to 10GB, and then 100GB, and you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. When your data grows to 10TB, and then 100TB, you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!
Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Hadoop is not suitable for OnLine Transaction Processing workloads, where data are randomly accessed on structured data like a relational database. Nor is it suitable for OnLine Analytical Processing or Decision Support System workloads, where data are sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop is used for Big Data: it complements OnLine Transaction Processing and OnLine Analytical Processing.
Hadoop is being used across organizations for a variety of purposes like data staging, analytics, security monitoring, and manufacturing quality assurance. However, most organizations still have separate systems optimized for specific workloads. Hadoop has the potential to relieve pressure on these systems by handling data staging, archives, transformations, and exploration. Going forward, Hadoop will need to provide enterprise-grade capabilities like high performance, security, data protection, and support for both analytical and operational workloads to fully replace specialized systems and become the main enterprise data platform.
This document provides an overview of NoSQL databases. It defines NoSQL, discusses the motivations for NoSQL including scalability challenges with SQL databases. It covers key NoSQL concepts like the CAP theorem and taxonomy of NoSQL databases. Implementation concepts like consistent hashing, Bloom filters, and quorums are explained. User-facing patterns like MapReduce and inverted indexes are also overviewed. Popular existing NoSQL systems and real-world examples of NoSQL usage are briefly mentioned. The conclusion states that NoSQL is not a general purpose replacement for SQL and that both have complementary uses.
Presentation on Big Data Hadoop (Summer Training Demo) - Ashok Royal
This document summarizes a practical training presentation on Big Data Hadoop. It was presented by Ashutosh Tiwari and Ashok Rayal from Poornima Institute of Engineering & Technology, Jaipur, under the guidance of Dr. E.S. Pilli from MNIT Jaipur. The training took place from May 28th to July 9th 2014 at MNIT Jaipur and consisted of studying Hadoop and related papers, building a Hadoop cluster, and implementing a near-duplicate detection project using Hadoop MapReduce. The near-duplicate detection project aimed to comparatively analyze documents to find similar ones based on a predefined threshold. Snapshots of HDFS, the MapReduce processing, and the output of the project are included.
The document discusses fault tolerance in Apache Hadoop. It describes how Hadoop handles failures at different layers through replication and rapid recovery mechanisms. In HDFS, data nodes regularly heartbeat to the name node, and blocks are replicated across racks. The name node tracks block locations and initiates replication if a data node fails. HDFS also supports name node high availability. In MapReduce v1, task and task tracker failures cause re-execution of tasks. YARN improved fault tolerance by removing the job tracker single point of failure.
Apache Spark Introduction @ University College London - Vitthal Gogate
Spark is a fast and general engine for large-scale data processing. It uses resilient distributed datasets (RDDs) that can be operated on in parallel. Transformations on RDDs are lazy, while actions trigger their execution. Spark supports operations like map, filter, reduce, and join and can run on Hadoop clusters, standalone, or in cloud services like AWS.
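The following minimal Java sketch (application name and local master setting are placeholders) illustrates the lazy-transformation / eager-action behaviour described above.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkRddExample {
  public static void main(String[] args) {
    // "local[*]" runs Spark in-process; on a cluster this would point at YARN or a standalone master.
    SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // An RDD is a partitioned collection that can be operated on in parallel.
    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

    // Transformations (filter, map) are lazy: they only record the lineage.
    JavaRDD<Integer> evensSquared = numbers
        .filter(n -> n % 2 == 0)
        .map(n -> n * n);

    // An action (reduce) triggers actual execution of the whole pipeline.
    int sum = evensSquared.reduce((a, b) -> a + b);
    System.out.println("Sum of squared evens: " + sum);

    sc.stop();
  }
}
```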
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera, Inc.
Attend this session and walk away armed with solutions to the most common customer problems. Learn proactive configuration tweaks and best practices to keep your cluster free of fetch failures, job tracker hangs, and the like.
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData... - Mahantesh Angadi
This document provides an introduction to big data and the installation of a single-node Apache Hadoop cluster. It defines key terms like big data, Hadoop, and MapReduce. It discusses traditional approaches to handling big data like storage area networks and their limitations. It then introduces Hadoop as an open-source framework for storing and processing vast amounts of data in a distributed fashion using the Hadoop Distributed File System (HDFS) and MapReduce programming model. The document outlines Hadoop's architecture and components, provides an example of how MapReduce works, and discusses advantages and limitations of the Hadoop framework.
The document provides an overview of MongoDB administration including its data model, replication for high availability, sharding for scalability, deployment architectures, operations, security features, and resources for operations teams. The key topics covered are the flexible document data model, replication using replica sets for high availability, scaling out through sharding of data across multiple servers, and different deployment architectures including single/multi data center configurations.
Key Considerations for Putting Hadoop in Production SlideShare - MapR Technologies
This document discusses planning for production success with Hadoop. It covers key questions around business continuity, high availability, data protection and disaster recovery. It also discusses considerations for multi-tenancy, interoperability and high performance. Additionally, it provides an overview of MapR's enterprise-grade data platform and highlights how it addresses production requirements through features like its NFS interface, strong data protection, and high availability.
Proof of Concept for Hadoop: storage and analytics of electrical time-series - DataWorks Summit
1. EDF conducted a proof of concept to store and analyze massive time-series data from smart meters using Hadoop.
2. The proof of concept involved storing over 1 billion records per day from 35 million smart meters and running analytics queries.
3. Results showed Hadoop could handle tactical queries with low latency and complex analytical queries within acceptable timeframes. Hadoop provides a low-cost solution for massive time-series storage and analysis.
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin... - DataStax
Many companies use both Elasticsearch and Cassandra, typically for logs or time series, but managing multiple pieces of software at a large scale can be quite challenging. Elassandra tightly integrates Elasticsearch within Cassandra as a secondary index, allowing near-realtime search with all existing Elasticsearch APIs, plugins and tools like Kibana. We will present the core concepts of Elassandra and explain how it draws benefit from internal Cassandra features to make Elasticsearch masterless, scalable with automatic resharding, and more reliable and efficient than deploying both systems separately. We will also explore the bidirectional mapping: the way Elasticsearch automatically creates the corresponding Cassandra schema and the way Elasticsearch indexes an existing Cassandra table. Furthermore, we will share some use cases and benchmark results demonstrating practical use of Elassandra to scale out, re-index with zero downtime, and search and visualize data with various tools.
About the Speakers
Remi Trouville - Consultant, Independent
Remi is an IT engineer who has worked for the last 8 years in the financial industry as a team manager responsible for all the call-center software managing the customer experience. At the end of this period, his team was dealing with 10,000+ agents across 100+ sites and some highly critical business processes, such as the storage of oral proofs of sale for transactions. He holds a Master's Degree in Telecommunication Engineering and is now pursuing an executive MBA at a French business school.
Infinit: Modern Storage Platform for Container Environments - Docker, Inc.
Providing state to applications in Docker requires a backend storage component that is both scalable and resilient in order to cope with a variety of use cases and failure scenarios. The Infinit Storage Platform has been designed to provide Docker applications with a set of interfaces (block, file and object) allowing for different tradeoffs. This talk will go through the design principles behind Infinit and demonstrate how the platform can be used to deploy a storage infrastructure through Docker containers in a few command lines.
The document discusses a presentation about practical problem solving with Hadoop and Pig. It provides an agenda that covers introductions to Hadoop and Pig, including the Hadoop distributed file system, MapReduce, performance tuning, and examples. It discusses how Hadoop is used at Yahoo, including statistics on usage. It also provides examples of how Hadoop has been used for applications like log processing, search indexing, and machine learning.
The Big Data and Hadoop training course is designed to provide the knowledge and skills to become a successful Hadoop Developer. In-depth knowledge of concepts such as the Hadoop Distributed File System, setting up the Hadoop cluster, MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, etc. will be covered in the course.
This document provides an agenda and overview for a presentation on Hadoop 2.x configuration and MapReduce performance tuning. The presentation covers hardware selection and capacity planning for Hadoop clusters, key configuration parameters for operating systems, HDFS, and YARN, and performance tuning techniques for MapReduce applications. It also demonstrates the Hadoop Vaidya performance diagnostic tool.
This document provides an overview of big data and Hadoop. It discusses why Hadoop is useful for extremely large datasets that are difficult to manage in relational databases. It then summarizes what Hadoop is, including its core components like HDFS, MapReduce, HBase, Pig, Hive, Chukwa, and ZooKeeper. The document also outlines Hadoop's design principles and provides examples of how some of its components like MapReduce and Hive work.
This document provides an overview of big data. It defines big data as large volumes of diverse data that are growing rapidly and require new techniques to capture, store, distribute, manage, and analyze. The key characteristics of big data are volume, velocity, and variety. Common sources of big data include sensors, mobile devices, social media, and business transactions. Tools like Hadoop and MapReduce are used to store and process big data across distributed systems. Applications of big data include smarter healthcare, traffic control, and personalized marketing. The future of big data is promising with the market expected to grow substantially in the coming years.
This document provides an introduction to big data and Hadoop. It discusses how the volume of data being generated is growing rapidly and exceeding the capabilities of traditional databases. Hadoop is presented as a solution for distributed storage and processing of large datasets across clusters of commodity hardware. Key aspects of Hadoop covered include MapReduce for parallel processing, the Hadoop Distributed File System (HDFS) for reliable storage, and how data is replicated across nodes for fault tolerance.
- Hadoop Distributed File System (HDFS) is a distributed file system that stores large datasets across clusters of computers. It divides files into blocks and stores the blocks across nodes, replicating them for fault tolerance.
- HDFS is designed for distributed storage and processing of very large datasets. It allows applications to work with data in parallel on large clusters of commodity hardware.
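As a small client-side sketch of the block and replication model just described, the snippet below asks HDFS for a per-file replication factor and lists which hosts hold each block; the file path is hypothetical and the cluster configuration is assumed to come from the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayoutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/large-input.csv"); // hypothetical existing file

      // Request three replicas of every block of this file (a per-file setting).
      fs.setReplication(file, (short) 3);

      // Show which DataNodes hold each block of the file.
      FileStatus status = fs.getFileStatus(file);
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        System.out.printf("offset=%d length=%d hosts=%s%n",
            block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
      }
    }
  }
}
```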
Data Lake and the rise of the microservices - Bigstep
By simply looking at structured and unstructured data, Data Lakes enable companies to understand correlations between existing and new external data - such as social media - in ways traditional Business Intelligence tools cannot.
For this you need to find out the most efficient way to store and access structured or unstructured petabyte-sized data across your entire infrastructure.
In this meetup we'll answer the following questions:
1. Why would someone use a Data Lake?
2. Is it hard to build a Data Lake?
3. What are the main features that a Data Lake should bring in?
4. What's the role of microservices in the big data world?
- Data is a precious resource that can last longer than the systems themselves (Tim Berners-Lee)
- Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It provides reliability, scalability and flexibility.
- Hadoop consists of HDFS for storage and MapReduce for processing. The main nodes include NameNode, DataNodes, JobTracker and TaskTrackers. Tools like Hive, Pig, HBase extend its capabilities for SQL-like queries, data flows and NoSQL access.
Big data is characterized by 3 V's - volume, velocity, and variety. It refers to large and complex datasets that are difficult to process using traditional database management tools. Key technologies to handle big data include distributed file systems, Apache Hadoop, data-intensive computing, and tools like MapReduce. Common tools used are infrastructure management tools like Chef and Puppet, monitoring tools like Nagios and Ganglia, and analytics platforms like Netezza and Greenplum.
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
CouchBase: The Complete NoSQL Solution for Big Data - Debajani Mohanty
Couchbase is a complete NoSQL database solution for big data. It provides a distributed database that can scale horizontally. Couchbase uses a document-oriented data model; in CAP-theorem terms, it sacrifices consistency to achieve high availability and partition tolerance. Couchbase is used by many large companies for applications that involve large, complex datasets with high user volumes and real-time requirements.
Infinitely Scalable Clusters - Grid Computing on Public Cloud - London - Hentsū
Slides from our recent workshop for hedge funds and a review of the cloud grid computing options. Included some live demos tackling 2TB of full depth market data using MATLAB on AWS, and Google BigQuery with Datalab.
Introduction to Cloud computing and Big Data-Hadoop - Nagarjuna D.N
Cloud Computing Evolution
Why is Cloud Computing needed?
Cloud Computing Models
Cloud Solutions
Cloud Jobs opportunities
Criteria for Big Data
Big Data challenges
Technologies to process Big Data- Hadoop
Hadoop History and Architecture
Hadoop Eco-System
Hadoop Real-time Use cases
Hadoop Job opportunities
Hadoop and SAP HANA integration
Summary
This document provides an overview of the Hadoop ecosystem. It begins with introducing big data challenges around volume, variety, and velocity of data. It then introduces Hadoop as an open-source framework for distributed storage and processing of large datasets across clusters of computers. The key components of Hadoop are HDFS (Hadoop Distributed File System) for distributed storage and high throughput access to application data, and MapReduce as a programming model for distributed computing on large datasets. HDFS stores data reliably using data replication across nodes and is optimized for throughput over large files and datasets.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of commodity hardware. It includes HDFS, a distributed file system, and MapReduce, a programming model for large-scale data processing. HDFS stores data reliably across clusters and allows computations to be processed in parallel near the data. The key components are the NameNode, DataNodes, JobTracker and TaskTrackers. HDFS provides high throughput access to application data and is suitable for applications handling large datasets.
Ankus, bigdata deployment and orchestration framework - Ashrith Mekala
Cloudwick developed Ankus, an open source deployment and orchestration framework for big data technologies. Ankus uses configuration files and a directed acyclic graph (DAG) approach to automate the deployment of Hadoop, HBase, Cassandra, Kafka and other big data frameworks across on-premises and cloud infrastructures. It leverages tools like Puppet, Nagios and Logstash to provision, manage and monitor clusters in an integrated manner. Ankus aims to simplify and accelerate the adoption of big data across organizations.
This document discusses data-intensive computing and provides examples of technologies used for processing large datasets. It defines data-intensive computing as concerned with manipulating and analyzing large datasets ranging from hundreds of megabytes to petabytes. It then characterizes challenges including scalable algorithms, metadata management, and high-performance computing platforms and file systems. Specific technologies discussed include distributed file systems like Lustre, MapReduce frameworks like Hadoop, and NoSQL databases like MongoDB.
The document discusses data-intensive computing and provides details about related technologies. It defines data-intensive computing as concerned with large-scale data in the hundreds of megabytes to petabytes range. Key challenges include scalable algorithms, metadata management, high-performance computing platforms, and distributed file systems. Technologies discussed include MapReduce frameworks like Hadoop, Pig, and Hive; NoSQL databases like MongoDB, Cassandra, and HBase; and distributed file systems like Lustre, GPFS, and HDFS. The document also covers programming models, scheduling, and an example application to parse Aneka logs using MapReduce.
This document provides an introduction to a course on data science. It outlines the course objectives, which are to recognize key concepts in extraction, transformation and loading of data, and to complete a sample project in Hadoop. It also lists the expected course outcome, which is for students to recognize technologies for handling big data. The document then provides a chapter index and overview of topics to be covered, including distributed and parallel computing for big data, big data technologies, cloud computing, in-memory technologies, and big data techniques.
Big data and Hadoop are frameworks for processing and storing large datasets. Hadoop uses HDFS for distributed storage and MapReduce for distributed processing. HDFS stores large files across multiple machines for redundancy and parallel access. MapReduce divides jobs into map and reduce tasks that run in parallel across a cluster. Hadoop provides scalable and fault-tolerant solutions to problems like processing terabytes of data from jet engines or scaling to Google's data processing needs.
This document discusses large scale computing with MapReduce. It provides background on the growth of digital data, noting that by 2020 there will be over 5,200 GB of data for every person on Earth. It introduces MapReduce as a programming model for processing large datasets in a distributed manner, describing the key aspects of Map and Reduce functions. Examples of MapReduce jobs are also provided, such as counting URL access frequencies and generating a reverse web link graph.
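As a hedged sketch of the reverse web-link-graph example mentioned above (not code from the document itself), the mapper below inverts each edge and the reducer gathers every page that links to a given target. Input is assumed to be tab-separated source/target pairs, and the job driver is omitted because it follows the same pattern as a word-count driver.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Input records are assumed to be "sourceUrl<TAB>targetUrl" lines.
public class ReverseLinkGraph {

  // Map: invert each edge, emitting (target, source).
  public static class InvertMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      if (parts.length == 2) {
        context.write(new Text(parts[1]), new Text(parts[0])); // (target, source)
      }
    }
  }

  // Reduce: collect every source page that links to a given target.
  public static class ConcatReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text target, Iterable<Text> sources, Context context)
        throws IOException, InterruptedException {
      StringBuilder list = new StringBuilder();
      for (Text source : sources) {
        if (list.length() > 0) {
          list.append(',');
        }
        list.append(source.toString());
      }
      context.write(target, new Text(list.toString()));
    }
  }
}
```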
The document provides an introduction to Apache Hadoop, including:
1) It describes Hadoop's architecture which uses HDFS for distributed storage and MapReduce for distributed processing of large datasets across commodity clusters.
2) It explains that Hadoop solves issues of hardware failure and combining data through replication of data blocks and a simple MapReduce programming model.
3) It gives a brief history of Hadoop originating from Doug Cutting's Nutch project and the influence of Google's papers on distributed file systems and MapReduce.
Hadoop Master Class: A concise overview - Abhishek Roy
Abhishek Roy will teach a master class on Big Data and Hadoop. The class will cover what Big Data is, the history and background of Hadoop, how to set up and use Hadoop, and tools like HDFS, MapReduce, Pig, Hive, Mahout, Sqoop, Flume, Hue, Zookeeper and Impala. The class will also discuss real world use cases and the growing market for Big Data tools and skills.
4. Big Data Is Everywhere
• The Large Hadron Collider (LHC), a particle accelerator that will revolutionize our understanding of the workings of the Universe, will generate 60 terabytes of data per day - 15 petabytes (15 million gigabytes) annually.[1]
• Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
• 12 terabytes of Tweets created each day[2]
• 100 terabytes of data uploaded daily to Facebook.[3]
• Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes of data.[3]
• Convert 350 billion annual meter readings to better predict power consumption[2]
5. What Is Big Data?
It's LARGE. It's COMPLEX. It's UNSTRUCTURED.
According to David Kellogg, "Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."[4]
O'Reilly defines big data the following way: "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures."[5]
6. An Obvious Question: How BIG is BIG DATA?
A common misconception is that big data is solely about VOLUME.
While volume, or size, is part of the equation...
What about the SPEED at which data is generated?
And what about the VARIETY of data that a variety of sources are generating?
8. Why the Sudden Explosion of Big Data?
•An increased number and variety of data sources that generate large quantities of data
 •Sensors (location, GPS, ...)
 •Scientific computing (CERN, biological research, ...)
 •Web 2.0 (Twitter, wikis, ...)
•Realization that data is too valuable to delete
 •Data analytics and data warehousing
 •Business intelligence
•Dramatic decline in the cost of hardware, especially storage
 •Decline in the price of SSDs
9. BIG DATA is fuelled by the CLOUD
•The properties of the cloud help us deal with big data.
•And the challenges of big data drive the future design, enhancement, and expansion of the cloud.
•The two are in a never-ending cycle.
10. The Value of Big Data: Why It's So Important
[6]
12. TRADITIONAL ENTERPRISE ARCHITECTURE
Consists of
•Servers
•SAN (Storage Area Network)
•Storage arrays
•Servers: a server is a physical computer dedicated to running one or more services that serve the needs of the users of other computers on the network.
•Storage arrays: a disk array is a disk storage system which contains multiple disk drives (SATA, SSD).
•Storage Area Network: a storage area network (SAN) is a dedicated network that provides access to consolidated data storage. SANs are primarily used to make storage devices, such as disk arrays, accessible to servers so that the devices appear like locally attached devices to the operating system.
13. SOME ADVANTAGES AND DISADVANTAGES OF ENTERPRISE ARCHITECTURE
ADVANTAGES
•Loose coupling between servers and storage/disk arrays, which can be expanded, upgraded, or retired independently of each other.
•The SAN enables services on any server to access any of the storage arrays, as long as they have access permission.
•ROBUST, with a MINIMAL FAILURE rate.
•Mainly designed for compute-intensive applications which operate on a subset of the data.
DISADVANTAGES
•Becomes more costly as it expands.
•But what about BIG DATA? It cannot handle data-intensive operations like sorting.
14. What we want is an architecture that will give us...
15. CLUSTER ARCHITECTURE
Consists of
•Nodes, each having its own cores, memory, and disks
•Interconnection via a high-speed network (LAN)
•A cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system.
•The nodes are usually connected to each other through fast local area networks, each node (a computer used as a server) running its own instance of an operating system.
•The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as, by and large, one cohesive computing unit.
16. Benefits of Using a Cluster Architecture
•Modular and scalable: easier to expand the system without bringing down the application that runs on top of the cluster.
•Data locality: data can be processed by the cores collocated in the same node or rack, minimizing any transfer over the network.
•Parallelization: a higher degree of parallelism via the simultaneous execution of separate portions of a program on different processors.
•All this at lower cost.
17. But Every Coin Has Two Sides!
•Complexity: the cost of administering a cluster of N machines.
•More storage: data is replicated to protect against failures.
•Data distribution: how do you distribute data evenly across the cluster?
•Requires careful management and massively parallel processing designs.
18. Riding the Elephant - Hadoop
SOLUTION
•Open source Apache project initiated and led by Yahoo.
•Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware.[8][9]
•Runs on
 o Linux, Mac OS X, Windows, and Solaris
 o Commodity hardware
•Targets clusters of commodity PCs
 o Cost-effective bulk computing
•Invented by Doug Cutting, funded by Yahoo in 2006, and reached "web scale" capacity in 2008.[7]
Doug Cutting
19. Where Does It All Come From?
•The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users.
•Based on Google's MapReduce and the Google File System.
20. What Hadoop Is
Hadoop consists of two core components:[9]
1. Hadoop Distributed File System (HDFS)
2. Hadoop distributed processing framework, using the Map/Reduce metaphor
21. Hadoop Distributed File System (HDFS)
Based on simple design principles:
•Split
•Scatter
•Replicate
•Manage data across the cluster
•Files are broken into large file blocks, each usually a multiple of the underlying storage block size.
 •Typically 64 MB or higher
22. Hadoop Distributed File System (HDFS) contd.
•File blocks are replicated to several DataNodes for reliability.
 •The default is 3 replicas, but this is settable.
•Blocks are placed (writes are pipelined):
 •on the same node,
 •on the same rack,
 •on the other rack.
•Clients read from the closest replica.
•If the replication for a block drops below the target, it is automatically re-replicated.
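To make the replication and placement story concrete, here is a minimal sketch that uses the standard org.apache.hadoop.fs.FileSystem Java API to inspect a file's replication factor and the DataNodes holding each block. The path /user/hadoop/BigData.txt is a hypothetical example, and the snippet assumes a reachable cluster configured via the usual core-site.xml/hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and the HDFS settings from the cluster configuration.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path, used only for illustration.
    Path file = new Path("/user/hadoop/BigData.txt");
    FileStatus status = fs.getFileStatus(file);

    System.out.println("Block size:  " + status.getBlockSize());
    System.out.println("Replication: " + status.getReplication());

    // One BlockLocation per block; getHosts() lists the DataNodes holding a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("Block at offset " + b.getOffset()
          + " replicated on " + String.join(", ", b.getHosts()));
    }
    fs.close();
  }
}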
23. Hadoop Distributed File System (HDFS) contd.
•Single namespace for the entire cluster, managed by a single NameNode.[7]
•NameNode: a master server that manages the file system namespace and regulates access to files by clients.
•DataNodes: serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.
•When a DataNode fails, the NameNode
 •identifies the file blocks that have been affected,
 •retrieves copies from other healthy nodes,
 •finds new nodes to store another copy of them,
 •and updates the information in its tables.
24. Hadoop Distributed File System (HDFS) contd.
•The client talks to both the NameNode and the DataNodes.
•Data is not sent through the NameNode.
•The NameNode is contacted first, and then the client can connect directly to the DataNodes.
HDFS Architecture [10]
25. Hadoop Distributed File System (HDFS) contd.
ADVANTAGES
•Highly fault-tolerant
•High throughput
•Suitable for applications with large data sets
•Streaming access to file system data
•Can be built out of commodity hardware
2 POINTS OF FAILURE
•The NameNode can become a single point of failure.
•Cluster rebalancing.
SOLUTIONS
•Enterprise editions maintain a backup of the NameNode.
•The architecture is compatible with data-rebalancing schemes, but this is still an area of research.
26. Hadoop Map/Reduce
•Map/Reduce is a programming model for efficient distributed computing.
•The user submits a MapReduce job.
•The system:
 •partitions the job into lots of tasks,
 •schedules tasks on nodes close to the data,
 •monitors tasks,
 •and kills and restarts them if they fail/hang/disappear.[11]
Consists of two phases:
1. Mapper Phase
2. Reduce Phase
27. Hadoop Map/Reduce contd.
1. Mapper Phase
•The data are fed into the map function as key/value pairs to produce intermediate key/value pairs.
 •Input: (key1, value1) pairs
 •Output: (key2, value2) pairs
•All nodes perform the same computation.
•Uses data locality to increase performance.
•Because all data blocks stored in HDFS are of equal size, the mapper computation can be divided equally.
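As a hedged illustration of this mapper contract, here is a minimal word-count Mapper written against the org.apache.hadoop.mapreduce API (Hadoop 0.20 and later); the class and field names are illustrative and not taken from the slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (byte offset, line of text)  ->  Output: (word, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // emit intermediate (key2, value2) pairs
    }
  }
}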
28. Hadoop Map/Reduce contd.
2. Reduce Phase
•Once the mapping is done, all the intermediate results from the various nodes are reduced to create the final output.
•Has 3 phases:
 •shuffle,
 •sort, and
 •reduce.[12]
•Shuffle: the input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers.
•Sort: the framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged.
•Reduce: in this phase the reduce method is called for each <key, (list of values)> pair in the grouped inputs and produces the final outputs.
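Continuing the same hedged word-count sketch, the reducer below receives each key together with the list of values delivered by the shuffle and sort phases and sums them; again, the class name is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (word, [1, 1, ...])  ->  Output: (word, total count)
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();           // sum the intermediate counts for this key
    }
    total.set(sum);
    context.write(key, total);  // emit the final (key, count) pair
  }
}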
29. Understood or not? Let's understand it with an example
•Suppose you want to analyze blog entries stored in BigData.txt and count the number of times the words "Hadoop", "Big Data", and "Green Plum" appear in it.
•Suppose 3 nodes participate in the task. In the Mapper Phase, each node will receive the address of a file block and a pointer to the mapper function.
•The mapper function will calculate the word counts.
[13]
30. Let's understand it with an example
•The output of the mapper function will be a set of <key, value> pairs.
FINAL OUTPUT OF THE MAPPER PHASE
31. Let's understand it with an example
•The Reduce Phase sums and reduces the output.
•A node is selected to perform the reduce function, and the other nodes send their output to that node.
•After the shuffling of the Reduce Phase
32. Let's understand it with an example
•After the sorting phase of the Reduce Phase
And FINALLY
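To tie the worked example together, here is a hedged sketch of a driver that wires a mapper and reducer like the ones above into a job over BigData.txt. The class names and HDFS paths are assumptions rather than content from the slides, and the phrase-specific counting from the example ("Hadoop", "Big Data", "Green Plum") would live in the mapper, matching those phrases against each line instead of tokenizing every word.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");   // Job.getInstance is available in Hadoop 2.x and later
    job.setJarByClass(WordCountJob.class);

    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);     // a summing reducer could also serve as a combiner
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Hypothetical HDFS paths for the blog data and the result directory.
    FileInputFormat.addInputPath(job, new Path("/user/hadoop/BigData.txt"));
    FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/wordcount-out"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}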
33. A Bit More on Map/Reduce
•The JobTracker keeps track of all the MapReduce jobs that are running on various nodes.
•It schedules the jobs and keeps track of all the map and reduce tasks running across the nodes.
•If any one of those tasks fails, it reallocates the task to another node, etc.
•The TaskTracker performs the map and reduce tasks that are assigned by the JobTracker.
•The TaskTracker also constantly sends a heartbeat message to the JobTracker, which helps the JobTracker decide whether to delegate a new task to this particular node or not.
34. Accessibility and Implementation
•HDFS
 •HDFS provides a Java API for applications to use (a Java API example follows this slide).
 •Python access is also used in many applications.
 •It provides a command-line interface called the FS shell that lets the user interact with data in HDFS.
 •The syntax of the commands is similar to bash.
 Example: to create a directory
 Usage: hadoop dfs -mkdir <paths>
 hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
•Map/Reduce
 •Java API with prebuilt classes and interfaces.
 •Python and C++ can also be used.
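As a hedged companion to the FS shell example above, the sketch below performs similar operations through the HDFS Java API: creating a directory and writing a small file. The paths and file contents are illustrative only, and the snippet assumes the cluster configuration is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMkdirExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Java equivalent of: hadoop dfs -mkdir /user/hadoop/dir1
    fs.mkdirs(new Path("/user/hadoop/dir1"));

    // Write a small file into the new directory (illustrative path and content).
    try (FSDataOutputStream out = fs.create(new Path("/user/hadoop/dir1/hello.txt"))) {
      out.writeUTF("hello hdfs");
    }
    fs.close();
  }
}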
38. References
[1] Randal E. Bryant, Randy H. Katz, Edward D. Lazowska, "Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society", Version 8: December 22, 2008. Available: http://www.cra.org/ccc/docs/init/Big_Data.pdf [Accessed Sept. 9, 2012]
[2] What is Big Data? [Online]. Available: http://www-01.ibm.com/software/data/bigdata/ [Accessed Sept. 9, 2012]
[3] A Comprehensive List of Big Data Statistics [Online]. Available: http://wikibon.org/blog/big-data-statistics/ [Accessed Sept. 9, 2012]
[4] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers, "Big Data: The next frontier for innovation, competition, and productivity", McKinsey Global Institute, May 2011. Available: http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation [Accessed Sept. 10, 2012]
[5] What Is Big Data?, O'Reilly Radar, January 11, 2012 [Online]. Available: http://radar.oreilly.com/2012/01/what-is-big-data.html [Accessed Sept. 10, 2012]
[6] Big Data, Wipro [Online]. Available: http://www.slideshare.net/wiprotechnologies/wipro-infographicbig-data [Accessed Sept. 11, 2012]
39. References
[7] Owen O'Malley, "Introduction to Hadoop" [Online]. Available: http://wiki.apache.org/hadoop/HadoopPresentations [Accessed Sept. 17, 2012]
[8] Hadoop at Yahoo!, Yahoo! Developer Network [Online]. Available: http://developer.yahoo.com/hadoop/ [Accessed Sept. 17, 2012]
[9] Elif Dede, Madhusudhan Govindaraju, Dan Gunter, Lavanya Ramakrishnan, "Riding the elephant: managing ensembles with Hadoop", in MTAGS '11: Proceedings of the 2011 ACM international workshop on Many Task Computing on Grids and Supercomputers, pages 49-58 [Online]. Available: ACM Digital Library, http://dl.acm.org/citation.cfm?id=2132876.2132888 [Accessed Sept. 17, 2012]
[10] HDFS Architecture, Hadoop 0.20 Documentation [Online]. Available: http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html [Accessed Sept. 20, 2012]
40. References
[11] Doug Cutting, "Hadoop Overview" [Online]. Available: http://wiki.apache.org/hadoop/HadoopPresentations [Accessed Sept. 17, 2012]
[12] Map/Reduce Tutorial, Hadoop 0.20 Documentation [Online]. Available: http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html#Reducer [Accessed Sept. 17, 2012]
[13] Patricia Florissi, Big Ideas: Demystifying Hadoop [Video]. Available: http://www.youtube.com/watch?v=XtLXPLb6EXs&feature=relmfu
[14] C/C++ MapReduce Code & build, Hadoop Wiki, C++ Word Count [Online]. Available: http://wiki.apache.org/hadoop/C%2B%2BWordCount [Accessed October 1, 2012]