The document discusses distributed computing and provides an overview of key distributed technologies including distributed file systems, MapReduce, and Hadoop. It explains that distributed computing refers to using multiple computers that communicate over a network to solve computational problems. Technologies covered include Google's distributed file system GFS, the MapReduce framework, and the open source Hadoop implementation of MapReduce.
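The MapReduce model mentioned above can be illustrated with a minimal, single-process sketch (plain Python, not the actual Hadoop API): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. In a real cluster, each phase would run across many machines.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the list of values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # word count for "the" across all documents
```

The three functions are deliberately independent: because the map and reduce steps share no state, the framework can run them on different nodes and only the shuffle requires moving data across the network.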
IEEE research paper representation.
An IoT system can be managed effectively with distributed computing. This topic is explained here using the example of a "System-on-a-Chip for Smart Cameras".
This document summarizes an Internet of Things (IoT) meetup that covered various topics:
- Introduction to IoT and how objects can transfer data over networks.
- Introduction to cloud computing and how resources are shared over the internet.
- IoT architecture including things, gateways, and networks/cloud.
- IoT gateways like Raspberry Pi that interface devices and cloud.
- Sensor interfaces like XBee and RS-485 that connect to gateways.
- Network interfaces like WiFi and GPRS to connect gateways to cloud.
- Cloud architecture models from various sources.
- Data acquisition from devices using open-source Ponte software.
- Data storage
The document discusses the basics of the Python programming language. It introduces Python as a general purpose, object oriented language and discusses its key features like garbage collection and support for both procedural and object oriented programming. It also covers Python versions, how to start an interactive session, language basics like indentation, numbers, strings and escape sequences. The document is intended to provide an introduction to the Python language for beginners.
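The language basics listed above (indentation, numbers, strings, escape sequences) can be tried directly in an interactive session; a short illustrative snippet:

```python
# Indentation delimits blocks; no braces or keywords are needed.
def classify(n):
    if n % 2 == 0:
        return "even"
    return "odd"

# Numbers: integers and floats; unused objects are garbage collected.
total = 3 + 4 * 2   # 11, with the usual operator precedence
ratio = 7 / 2       # 3.5, true division in Python 3

# Strings with escape sequences: newline, tab, and escaped quotes.
greeting = "Hello,\n\t\"Python\""

print(classify(10), total, ratio)
print(greeting)
```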
The Computer Science Behind a Modern Distributed Database (ArangoDB Database)
What we see in the modern data store world is a race between different approaches to achieve a distributed and resilient storage of data. Every application needs a stateful layer which holds the data. There are several different necessary components which are anything but trivial to combine, and, of course, even more challenging when attempting to optimize for performance. Over the past years there has been significant progress in both the science and practical implementations of such data stores. In this talk Dan Larkin-York will introduce the audience to some of the challenges, address the difficulties of their interplay, and cover key approaches taken by some of the industry’s leaders (ArangoDB, Cassandra, CockroachDB, MarkLogic, and more).
The document discusses scaling web data at low cost. It begins by presenting Javier D. Fernández and providing context about his work in semantic web, open data, big data management, and databases. It then discusses techniques for compressing and querying large RDF datasets at low cost using binary RDF formats like HDT. Examples of applications using these techniques include compressing and sharing datasets, fast SPARQL querying, and embedding systems. It also discusses efforts to enable web-scale querying through projects like LOD-a-lot that integrate billions of triples for federated querying.
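The core operation behind RDF querying, matching triple patterns against a set of (subject, predicate, object) statements, can be sketched without any RDF library (plain Python for illustration; HDT itself uses far more compact binary indexes to make this scale to billions of triples):

```python
# A tiny in-memory triple store. None in a pattern acts as a wildcard,
# playing the role of a variable in a SPARQL basic graph pattern.
triples = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "age", "30"),
}

def match(pattern, store):
    # Return every triple consistent with the (s, p, o) pattern.
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Whom does alice know?"  ~  SELECT ?x WHERE { :alice :knows ?x }
print(match(("alice", "knows", None), triples))
```

Formats like HDT replace this linear scan with dictionary-encoded identifiers and sorted indexes, which is what makes low-cost querying over very large datasets feasible.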
CNI Fall 2009: Enhanced Publications (John Doove, SURFfoundation)
- SURF is an organization in the Netherlands that works to improve ICT infrastructure for higher education and research.
- SURF is working on projects to develop "enhanced publications" which combine traditional publications like text with additional materials like data, maps, images and annotations.
- Several projects have been funded to create enhanced publications in fields like archaeology and psychology. Challenges include presentation, identification, long-term preservation and developing tools and infrastructure to support enhanced publications.
- Moving forward, SURF will work on developing repository infrastructure to store and share enhanced publications, creating guidelines and incentivizing their creation through things like legal reports and reward systems.
The RMOD research group at INRIA Lille focuses on software evolution. They have 4 full-time researchers, 2 engineers, 1 postdoc, and 3 PhD students working on topics related to evolving applications and language support. They collaborate with other universities and have implemented a platform called Moose for software and data analysis in Pharo Smalltalk. Their work includes developing tools to support code history, program understanding, and software visualization to help developers evolve software over time.
The document is an introduction to LaTeX presented by Kartik Mandaville of LUG Manipal on March 20, 2010. It outlines the topics to be covered which include getting started with LaTeX, typesetting basics, math, lists and tables. It also discusses compiling LaTeX files and differences from word processors. The presentation aims to explain the basics of using LaTeX for technical documents in a casual discussion format.
OSDC 2018 | The Computer Science Behind a Modern Distributed Data Store by Ma... (NETWAYS)
What we see in the modern data store world is a race between different approaches to achieve distributed and resilient storage of data. Most applications need a stateful layer which holds the data. There are at least three necessary ingredients that are anything but trivial to combine, and combining them becomes even more challenging when aiming for acceptable performance. Over the past years there has been significant progress in both the science and the practical implementation of such data stores. In his talk Max Neunhoeffer will introduce the audience to some of the needed ingredients, address the difficulties of their interplay, and show four modern approaches of distributed open-source data stores.
Topics are:
– Challenges in developing a distributed, resilient data store
– Consensus, distributed transactions, distributed query optimization and execution
– The inner workings of ArangoDB, Cassandra, Cockroach and RethinkDB
The talk touches on complex and difficult computer science, but at the same time remains accessible to, and enjoyable for, a wide range of developers.
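Majority quorums, a building block behind the consensus protocols listed in the topics above, can be sketched in a few lines (a toy model for illustration, not the actual implementation of any of the databases named): a write commits only if a strict majority of replicas acknowledges it, which guarantees that any two majorities overlap in at least one replica.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.version = 0

def quorum_write(replicas, value, version, acked):
    # 'acked' simulates which replicas are reachable; the write commits
    # only when a strict majority acknowledges it.
    majority = len(replicas) // 2 + 1
    reachable = [r for i, r in enumerate(replicas) if i in acked]
    if len(reachable) < majority:
        return False
    for r in reachable:
        r.value, r.version = value, version
    return True

def quorum_read(replicas):
    # Any majority intersects the last committed write's majority, so the
    # highest version seen in a majority sample is the latest committed value.
    majority = len(replicas) // 2 + 1
    sample = replicas[:majority]
    return max(sample, key=lambda r: r.version).value

nodes = [Replica() for _ in range(5)]
assert quorum_write(nodes, "x=1", 1, acked={0, 1, 2})   # 3 of 5: commits
assert not quorum_write(nodes, "x=2", 2, acked={0, 1})  # 2 of 5: rejected
print(quorum_read(nodes))
```

Real consensus protocols such as Raft or Paxos add leader election and log replication on top of this overlap property, which is where most of the difficulty lies.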
The Computer Science Behind a Modern Distributed Data Store (J On The Beach)
What we see in the modern data store world is a race between different approaches to achieve distributed and resilient storage of data. Every application needs a stateful layer which holds the data. There are at least three necessary components that are anything but trivial to combine, and, of course, combining them becomes even more challenging when aiming for acceptable performance.
Over the past years there has been significant progress in both the science and practical implementations of such data stores. In his talk Max Neunhoeffer will introduce the audience to some of the needed ingredients, address the difficulties of their interplay and show four modern approaches of distributed open-source data stores (ArangoDB, Cassandra, Cockroach and RethinkDB).
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex... (Paolo Nesi)
Abstract—The recent rapid growth of the World Wide Web and the number of online resources populating the Internet represent a massive source of knowledge for various research and business interests. Such knowledge is, for the most part, embedded in the textual content of web pages and documents, which is largely represented in unstructured natural language formats. To automatically ingest and process such huge amounts of data, single-machine, non-distributed architectures are proving inefficient for tasks like Big Data mining and intensive text processing and analysis. Current Natural Language Processing (NLP) systems are growing in complexity, and their computational power needs have increased significantly, requiring solutions such as distributed frameworks and parallel programming paradigms. This paper presents a distributed framework for executing NLP-related tasks in a parallel environment. This has been achieved by integrating the APIs of the widespread GATE open source NLP platform into a multi-node cluster built upon the open source Apache Hadoop file system. The proposed framework has been evaluated against a real corpus of web pages and documents.
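The parallelisation strategy described in the abstract, splitting a corpus into chunks and processing each independently before merging partial results, can be sketched as a map/merge pair (a toy frequency-based keyword extractor in plain Python, not the GATE/Hadoop pipeline the paper describes):

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "when", "into"}

def extract_keywords(chunk):
    # Map step: per-chunk term frequencies, with stopwords removed.
    words = [w.strip(".,").lower() for w in chunk.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)

def merge(counters):
    # Reduce step: merge the partial counts from all chunks.
    total = Counter()
    for c in counters:
        total.update(c)
    return total

corpus = [
    "Hadoop distributes the processing of large text corpora.",
    "Keyword extraction scales when the corpus is split into chunks.",
]
# Sequential here; on Hadoop each extract_keywords call would be a map task
# running on a different node, and merge would be the reduce task.
keywords = merge(map(extract_keywords, corpus))
print(keywords.most_common(3))
```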
This document provides an overview of the SHEBANQ project, which provides tools for querying annotated Hebrew text data. It describes the data sources and contributors that have built up the underlying text corpus over many years. It also outlines the steps taken to make this data and related tools more accessible, including developing a website, depositing data in archives, running demonstration projects, and integrating the data and tools into broader research environments through additional projects and publications. The goal has been to facilitate wider use of this linguistic resource and foster more digital humanities and data science work based on its contents.
This document provides a summary of a presentation on Python and its role in big data analytics. It discusses Python's origins and growth, key packages like NumPy and SciPy, and new tools being developed by Continuum Analytics like Numba, Blaze, and Anaconda to make Python more performant for large-scale data processing and scientific computing. The presentation outlines Continuum's vision of an integrated platform for data analysis and scientific work in Python.
Stuff we do with OSS in libraries (Bergen, 2009) (Nicolas Morin)
The document discusses open source software (OSS) solutions used by libraries, including Koha, Evergreen, and Drupal. It summarizes Bibliotheque's involvement with Koha in France, including their growth as a vendor providing Koha support and services. It also briefly introduces Evergreen integrated library system and Drupal projects like SOPAC that integrate a library catalog with Drupal. The document advocates an approach using Drupal and Lucene/Solr to build a flexible catalog system.
This document is a curriculum seminar report on Hadoop submitted by a computer science student to their professor. It includes sections on the need for new technologies to handle large and diverse datasets, the history and origin of Hadoop, descriptions of the key Hadoop components like HDFS and MapReduce, and comparisons of Hadoop to RDBMS systems and discussions of its disadvantages. The report provides an overview of Hadoop for educational purposes.
The document discusses parallel computing and provides an overview of parallel platforms and programming models. It describes how parallel computing can solve problems faster by using multiple processors concurrently. Different parallel platforms are covered, including pipelines, vector processors, multi-core processors, clusters, and GPUs. Shared memory programming allows processors that share physical memory to work in parallel, while distributed memory programming requires explicit communication between processors that do not share memory. The document concludes that parallel computing is necessary to continue increasing computational power given limitations of single processors.
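The shared-memory versus distributed-memory distinction drawn above can be illustrated in one process (threads standing in for processors, and a queue standing in for the network; a conceptual sketch, not a performance demonstration):

```python
import threading
import queue

# Shared-memory style: workers update a structure all of them can see,
# guarded by a lock precisely because the memory is shared.
counter = {"value": 0}
lock = threading.Lock()

def shared_worker(n):
    for _ in range(n):
        with lock:
            counter["value"] += 1

# Distributed-memory style: each worker owns its data and communicates
# only by explicit messages (the queue plays the role of the network).
def message_worker(n, outbox):
    outbox.put(n)  # send a partial result instead of touching shared state

threads = [threading.Thread(target=shared_worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

outbox = queue.Queue()
senders = [threading.Thread(target=message_worker, args=(1000, outbox)) for _ in range(4)]
for t in senders:
    t.start()
for t in senders:
    t.join()
total = sum(outbox.get() for _ in range(4))

print(counter["value"], total)
```

Both styles arrive at the same result; the difference is where the coordination cost lands: locking contention in the shared-memory case, explicit communication in the distributed-memory case.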
Andhra Pradesh workshop user manual, October 2016 (OERindia)
Subject Teacher Forum workshop for Andhra Pradesh Maths and Science teachers.
This is the handout for the workshop, created by the IT for Change Resource Centre, Bengaluru.
RMLL 2013: Build Your Personal Search Engine Using Crawlzilla (Jazz Yao-Tsung Wang)
This document discusses building a personal search engine using Crawlzilla. It begins with introductions from Jazz Wang, a speaker from NCHC Taiwan who is a co-founder of Hadoop.TW. It then provides an overview of Crawlzilla, a cluster-based web crawler and search engine that supports Chinese word segmentation and multiple users/indexes. The document demonstrates how to use Crawlzilla through a multi-step process of registering for an account, receiving an acceptance notification, and then logging in to access the search functionality.
10 concepts the enterprise decision maker needs to understand about Hadoop (Donald Miner)
Way too many enterprise decision makers have clouded and uninformed views of how Hadoop works and what it does. Donald Miner offers high-level observations about Hadoop technologies and explains how Hadoop can shift the paradigms inside of an organization, based on his report Hadoop: What You Need To Know—Hadoop Basics for the Enterprise Decision Maker, forthcoming from O’Reilly Media.
After a basic introduction to Hadoop and the Hadoop ecosystem, Donald outlines 10 basic concepts you need to understand to master Hadoop:
Hadoop masks being a distributed system: what it means for Hadoop to abstract away the details of distributed systems and why that’s a good thing
Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
Hadoop runs on commodity hardware: an honest definition of commodity hardware and why this is a good thing for enterprises
Hadoop handles unstructured data: why Hadoop is better for unstructured data than other data systems from a storage and computation perspective
In Hadoop, you load data first and ask questions later: the differences between schema-on-read and schema-on-write and the drawbacks this represents
Hadoop is open source: what it really means for Hadoop to be open source from a practical perspective, not just a “feel good” perspective
HDFS stores the data but has some major limitations: an overview of HDFS (replication, not being able to edit files, and the NameNode)
YARN controls everything going on and is mostly behind the scenes: an overview of YARN and the pitfalls of sharing resources in a distributed environment and the capacity scheduler
MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
The Hadoop ecosystem is constantly growing and evolving: an overview of current tools such as Spark and Kafka and a glimpse of some things on the horizon
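The schema-on-read idea from the list above can be shown in miniature (plain Python over JSON lines; in Hadoop the raw files would live in HDFS and the schema would be applied by the query tool, such as Hive):

```python
import json

# Schema-on-write would validate or transform records at load time and
# reject the bad ones. Schema-on-read stores the raw lines as-is and
# imposes an interpretation only when a query runs.
raw_lines = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob"}',                 # missing field: still stored
    '{"user": "cat", "clicks": "7"}',  # wrong type: still stored
]

def query_total_clicks(lines):
    # The schema (clicks is an int, defaulting to 0) is applied at read time.
    total = 0
    for line in lines:
        record = json.loads(line)
        total += int(record.get("clicks", 0))
    return total

print(query_total_clicks(raw_lines))  # 10
```

The drawback the list alludes to is visible here too: every query pays the parsing and coercion cost, and malformed data surfaces as a query-time failure rather than a load-time one.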
This document discusses large-scale data processing using Apache Hadoop at SARA and BiG Grid. It provides an introduction to Hadoop and MapReduce, noting that data is easier to collect, store, and analyze in large quantities. Examples are given of projects using Hadoop at SARA, including analyzing Wikipedia data and structural health monitoring. The talk outlines the Hadoop ecosystem and timeline of its adoption at SARA. It discusses how scientists are using Hadoop for tasks like information retrieval, machine learning, and bioinformatics.
Yahoo is the largest corporate contributor, tester, and user of Hadoop. They have 4000+ node clusters and contribute all their Hadoop development work back to Apache as open source. They use Hadoop for large-scale data processing and analytics across petabytes of data to power services like search and ads optimization. Some challenges of using Hadoop at Yahoo's scale include unpredictable user behavior, distributed systems issues, and the difficulties of collaboration in open source projects.
The document discusses Apache Tika, an open source content analysis and detection toolkit. It provides an overview of Tika's history and capabilities, including MIME type detection, language identification, and metadata extraction. It also describes how NASA uses Tika within its Earth science data systems to process large volumes of scientific data files in formats like HDF and netCDF.
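Extension-based MIME detection, one of the strategies a toolkit like Tika employs, is available in Python's standard library `mimetypes` module (Tika itself goes further, also inspecting content magic bytes and parsing container formats):

```python
import mimetypes

# Guess a MIME type from the filename extension alone. The second element
# of the returned tuple is the content encoding (e.g. gzip), if any.
for name in ["report.pdf", "data.csv", "image.png"]:
    mime, _encoding = mimetypes.guess_type(name)
    print(name, "->", mime)
```

Extension-based detection is fast but easily fooled by a renamed file, which is exactly why content-sniffing toolkits like Tika exist.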
Brain Imaging Data Structure and Center for Reproducible Neuroscience (Krzysztof Gorgolewski)
This document introduces the Brain Imaging Data Structure (BIDS), a new standardized format for organizing and describing neuroimaging data. BIDS aims to address the heterogeneity in how researchers currently organize their brain imaging data, which causes problems in data sharing and combining data from multiple studies. The key principles of BIDS include adopting existing file formats like NIfTI and JSON, capturing the majority of experimental designs while allowing for extensions, and making the format simple to implement through file naming conventions and folder structures. Tools are being developed to help with validation and conversion of data to the BIDS format to promote its adoption.
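The "simple to implement through file naming conventions" principle can be seen in a sketch of a filename check (a deliberately simplified pattern for illustration; the full BIDS specification defines many more entities, suffixes, and rules):

```python
import re

# Simplified BIDS-style scan filename:
#   sub-<label>[_ses-<label>]_<suffix>.nii[.gz]
BIDS_SCAN = re.compile(
    r"^sub-[a-zA-Z0-9]+"      # required subject entity
    r"(_ses-[a-zA-Z0-9]+)?"   # optional session entity
    r"_(T1w|T2w|bold)"        # scan-type suffix (subset, for illustration)
    r"\.nii(\.gz)?$"          # NIfTI, optionally compressed
)

def looks_bids(filename):
    # True if the filename follows the simplified convention above.
    return bool(BIDS_SCAN.match(filename))

print(looks_bids("sub-01_ses-02_T1w.nii.gz"))  # follows the convention
print(looks_bids("subject01_T1w.nii.gz"))      # wrong entity name
```

Because the structure lives in the names themselves, a validator needs no sidecar database: it can check a whole dataset by walking the folder tree, which is how the BIDS validation tools mentioned above operate.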
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It allows for massive data storage, enormous processing power, and the ability to handle large numbers of concurrent tasks across clusters of commodity hardware. The framework includes Hadoop Distributed File System (HDFS) for reliable data storage and MapReduce for parallel processing of large datasets. An ecosystem of related projects like Pig, Hive, HBase, Sqoop and Flume extend the functionality of Hadoop.
Open source software provides many options for library services. Some key software packages discussed include Drupal for content management, DSpace for digital libraries, Koha for library automation, and Moodle for e-learning. Open source allows frequent updates and community support at no cost, but also poses challenges like technological obsolescence and copyright issues.
The document provides an outline for analyzing the U-Boot developer community. It describes the introduction, methodology, results, and conclusions sections. The methodology section discusses the tools used for the analysis including cvsanaly, mlstats, and scripts. It also covers the data sources of the U-Boot code repository, mailing list, and wiki. The results section will analyze the repository, mailing list, and perform mixed analyses.
The document provides an outline for analyzing the U-Boot developer community. It describes the introduction, methodology, results, and conclusions sections. The methodology section discusses the tools used for the analysis including cvsanaly, mlstats, and scripts. It also covers the data sources of the U-Boot code repository, mailing list, and wiki. The results section will analyze the repository, mailing list, and perform mixed analyses.
1. Distributed Computing
Varun Thacker
Linux User’s Group Manipal
April 8, 2010
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 1 / 42
2. Outline
1 Introduction
LUG Manipal
Points To Remember
2 Distributed Computing
Technologies to be covered
Idea
Data
Why Distributed Computing is Hard
Why Distributed Computing is Important
Three Common Distributed Architectures
3 Distributed File System
GFS
What a Distributed File System Does
Google File System Architecture
GFS Architecture: Chunks
GFS Architecture: Master
GFS: Life of a Read
GFS: Life of a Write
GFS: Master Failure
4 MapReduce
Do We Need It?
Bad News!
MapReduce Paradigm
Working
Under the Hood: Scheduling
Robustness
5 Hadoop
What is Hadoop
Who uses Hadoop?
Mapper
Combiners
Reducer
Some Terminology
Job Distribution
6 Contact Information
7 Attribution
8 Copying
5. Who are we?
Linux User’s Group Manipal
Life, the Universe and FOSS!
Believers in knowledge sharing
The most technologically focused group in the University
LUG Manipal is a non-profit group, alive only on voluntary work!
http://lugmanipal.org
11. Points To Remember
If you have problems, don’t hesitate to ask.
The slides are based on documentation, so the discussions are really important; the slides are for later reference.
Please don’t treat the sessions as classes (classes are boring!).
The speaker is just like any person sitting next to you.
Documentation is really important.
Google is your friend.
If you have questions after this workshop, mail me or come to LUG Manipal’s forums: http://forums.lugmanipal.org
19. Technologies to be covered
Distributed computing refers to the use of distributed systems to solve computational problems.
A distributed system consists of multiple computers that communicate through a network.
MapReduce is a framework that implements the idea of distributed computing.
GFS is the distributed file system on which distributed programs store and process data at Google. Its free implementation is HDFS.
Hadoop is an open source framework, written in Java, which implements the MapReduce technology.
24. Idea
While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from a drive) have not kept up.
One-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
The obvious way to reduce the time is to read from multiple disks at once. Imagine we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read all the data in under two minutes.
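The arithmetic behind this slide can be checked with a quick back-of-the-envelope calculation (a sketch using the slide’s own figures of 1 TB and 100 MB/s):

```python
# Back-of-the-envelope: time to read 1 TB at 100 MB/s,
# serially versus spread across 100 drives in parallel.
TB = 10**12          # 1 terabyte in bytes (decimal units)
MB = 10**6           # 1 megabyte in bytes

capacity = 1 * TB    # total data on the drive
speed = 100 * MB     # sustained transfer rate, bytes/second

serial_seconds = capacity / speed        # one drive reads everything
parallel_seconds = serial_seconds / 100  # 100 drives, 1/100th of the data each

print(f"one drive:  {serial_seconds / 3600:.2f} hours")    # 2.78 hours
print(f"100 drives: {parallel_seconds / 60:.2f} minutes")  # 1.67 minutes
```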
27. Data
We live in the data age. An IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006.
By 2011 there will be a tenfold growth to 1.8 zettabytes.
One zettabyte is one million petabytes, or one billion terabytes.
The New York Stock Exchange generates about one terabyte of new trade data per day.
Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
The Large Hadron Collider near Geneva produces about 15 petabytes of data per year.
33. Why Distributed Computing is Hard
Computers crash.
Network links crash.
Talking is slow (even Ethernet has about 300 microseconds of latency, during which time your 2 GHz PC can execute 600,000 cycles).
Bandwidth is finite.
Internet scale: the computers and network are heterogeneous, untrustworthy, and subject to change at any time.
38. Why Distributed Computing is Important
Can be more reliable.
Can be faster.
Can be cheaper (a $30 million Cray versus 100 $1000 PCs).
41. Three Common Distributed Architectures
Hope: have N computers do separate pieces of work. Speed-up < N. Probability of failure = 1 − (1 − p)^N ≈ Np, where p is the probability of an individual crash.
Replication: have N computers do the same thing. Speed-up < 1. Probability of failure = p^N.
Master-servant: have 1 computer hand out pieces of work to N − 1 servants, and re-hand out pieces of work if servants fail. Speed-up < N − 1. Probability of failure ≈ p.
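The three failure models can be compared numerically. The sketch below simply evaluates the formulas above for illustrative values of N and p (the specific numbers are assumptions, not from the slides):

```python
# The slide's failure probabilities, evaluated for illustrative values:
# N = 100 machines, p = 0.001 (chance an individual machine crashes).
N, p = 100, 0.001

# Hope: N machines each do a separate piece; one crash loses the job.
hope_exact = 1 - (1 - p) ** N   # 1 - (1-p)^N
hope_approx = N * p             # ~ Np when Np is small

# Replication: N machines all do the same thing; the job is lost only
# if every machine crashes.
replication = p ** N            # p^N

# Master-servant: failed work is re-handed out, so the job is lost
# (roughly) only if the master itself crashes.
master_servant = p              # ~ p

print(f"hope:           {hope_exact:.4f} (Np approximation: {hope_approx:.3f})")
print(f"replication:    {replication:.1e}")
print(f"master-servant: {master_servant}")
```

Note how dramatic the trade-off is: replication makes failure astronomically unlikely but gives no speed-up, while "hope" gets nearly N-fold speed-up at the cost of a failure probability that grows with N.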
45. What a Distributed File System Does
1. Usual file system stuff: create, read, move, and find files.
2. Allow distributed access to files.
3. Store the files themselves distributedly.
If you just do #1 and #2, you are a network file system.
To do #3, it’s a good idea to also provide fault tolerance.
51. GFS Architecture: Chunks
Files are divided into 64 MB chunks (the last chunk of a file may be smaller).
Each chunk is identified by a unique 64-bit id.
Chunks are stored as regular files on local disks.
By default, each chunk is stored three times, preferably on more than one rack.
To protect data integrity, each 64 KB block gets a 32-bit checksum that is checked on all reads.
When idle, a chunkserver scans inactive chunks for corruption.
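GFS’s actual checksum format isn’t shown here; as an illustration of the per-block scheme, here is a minimal sketch that checksums 64 KB blocks with CRC32 (the choice of CRC32 is an assumption for the example) and re-verifies them on every read:

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity: 64 KB blocks, as on the slide

def block_checksums(data):
    """Compute a 32-bit CRC for each 64 KB block of a chunk."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data, checksums):
    """Re-check every block on read; any mismatch signals corruption."""
    return block_checksums(data) == checksums

chunk = b"x" * (3 * BLOCK + 123)   # a chunk a bit over three blocks long
sums = block_checksums(chunk)

assert verify(chunk, sums)         # a clean read passes
corrupted = chunk[:70000] + b"?" + chunk[70001:]  # flip one byte in block 1
assert not verify(corrupted, sums) # ...and is caught on the next read
```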
59. GFS Architecture: Master
Stores all metadata (namespace, access control).
Stores the (file → chunks) and (chunk → locations) mappings.
Clients get the chunk locations for a file from the master, and then talk directly to the chunkservers for the data.
Advantage of a single master: simplicity.
Disadvantages of a single master:
Metadata operations are bottlenecked.
The maximum number of files is limited by the master’s memory.
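The two mappings can be pictured as a pair of in-memory dictionaries. The sketch below uses invented names and toy data purely to illustrate why the master’s memory bounds the number of files, and why clients only need the master for lookups:

```python
# A toy sketch of the master's two in-memory maps (illustrative names
# and data, not Google's actual structures).

# file name -> ordered list of chunk ids
file_to_chunks = {
    "/logs/A": [101, 102, 103],
}

# chunk id -> locations (chunkservers holding a replica; 3 by default)
chunk_to_locations = {
    101: ["cs1", "cs4", "cs9"],
    102: ["cs2", "cs4", "cs7"],
    103: ["cs1", "cs3", "cs8"],
}

def lookup(path, index):
    """What a client asks the master: where is chunk `index` of `path`?
    The client then fetches the data directly from a chunkserver."""
    chunk_id = file_to_chunks[path][index]
    return chunk_to_locations[chunk_id]

print(lookup("/logs/A", 1))  # ['cs2', 'cs4', 'cs7']
```

Because every file needs an entry in these maps, the file count is capped by the master’s RAM, and every metadata operation funnels through this one process: exactly the two disadvantages listed above.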
64. GFS: Life of a Read
The client program asks for 1 GB of file “A” starting at the 200 millionth byte.
The client GFS library asks the master for chunks 3, ..., 16387 of file “A”.
The master responds with all of the locations of chunks 2, ..., 20000 of file “A”.
The client caches all of these locations (with their cache time-outs).
The client reads chunk 2 from the closest location.
The client reads chunk 3 from the closest location.
...
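Before asking the master, the client library has to turn a byte range into chunk indices. A sketch of that arithmetic with 64 MB chunks (the chunk numbers on the slide itself are schematic):

```python
CHUNK = 64 * 1024 * 1024  # 64 MB GFS chunk size

def chunks_for_range(offset, length):
    """Which chunk indices of a file cover bytes [offset, offset+length)?
    This is the translation the client GFS library performs before
    asking the master for chunk locations."""
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    return range(first, last + 1)

# The slide's read: 1 GB of file "A" starting at the 200 millionth byte.
r = chunks_for_range(200_000_000, 10**9)
print(list(r))  # chunks 2 through 17
```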
71. GFS: Life of a Write
The client gets the locations of the chunk replicas as before.
For each chunk, the client sends the write data to the nearest replica.
That replica sends the data on to the replica nearest to it that has not yet received the data.
When all of the replicas have received the data, it is safe for them to actually write it.
Tricky details:
The master hands out a short-term (about 1 minute) lease for a particular replica to be the primary one.
The primary replica assigns a serial number to each mutation so that every replica performs the mutations in the same order.
78. GFS: Master Failure
The master stores its state via periodic checkpoints and a mutation log.
Both are replicated.
Master election and notification are implemented using an external lock server.
The new master restores its state from the checkpoint and the log.
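The recovery scheme above (checkpoint plus replayed mutation log) can be sketched in a few lines; the operation names and state shape below are invented for illustration:

```python
# Minimal sketch of checkpoint-plus-log recovery: the master's state can
# always be rebuilt as "last checkpoint, then replay every mutation
# logged after it". (Illustrative operations, not the real GFS log format.)

def apply(state, mutation):
    """Apply one logged mutation to the in-memory state."""
    op, key, value = mutation
    if op == "set":
        state[key] = value
    elif op == "delete":
        state.pop(key, None)
    return state

def recover(checkpoint, log):
    """What a new master does after election: start from the replicated
    checkpoint and replay the replicated mutation log, in order."""
    state = dict(checkpoint)
    for mutation in log:
        apply(state, mutation)
    return state

checkpoint = {"/f1": [101], "/f2": [102]}
log = [("set", "/f3", [103]),
       ("delete", "/f2", None),
       ("set", "/f1", [101, 104])]

print(recover(checkpoint, log))  # {'/f1': [101, 104], '/f3': [103]}
```

Because both the checkpoint and the log are replicated, any newly elected master can run `recover` and pick up exactly where the failed one left off.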
83. Do We Need It?
Yes: otherwise some problems are too big.
Example: 20+ billion web pages × 20 KB = 400+ terabytes.
One computer can read 30–35 MB/s from disk, so it would take about four months to read the web.
Spread the same problem over 1000 machines: less than 3 hours.
88. Bad News!
Distributed programs must deal with:
communication and coordination
recovering from machine failure (all the time!)
debugging
optimization
locality
Bad news II: repeat all of this for every problem you want to solve.
Good news I and II: MapReduce and Hadoop!
97. MapReduce
A simple programming model that applies to many large-scale computing problems.
Hide the messy details in the MapReduce runtime library:
automatic parallelization
load balancing
network and disk transfer optimization
handling of machine failures
robustness
Therefore we can write application-level programs and let MapReduce insulate us from many concerns.
106. Map Reduce Paradigm
Read a lot of data.
Map: extract something you care about from each record.
Shuffle and Sort.
Reduce: aggregate, summarize, filter, or transform.
Write the results.
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 26 / 42
111. MapReduce Paradigm
Basic data type: the key-value pair (k, v).
For example, key = URL, value = HTML of the web page.
Programmer specifies two primary methods:
Map: (k, v) → <(k1, v1), (k2, v2), (k3, v3), ..., (kn, vn)>
Reduce: (k', <v'1, v'2, ..., v'n'>) → <(k', v''1), (k', v''2), ..., (k', v''n'')>
All v' with the same k' are reduced together.
(Remember the invisible “Shuffle and Sort” step.)
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 27 / 42
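The two methods can be sketched as plain Python functions. Below is a minimal, single-process illustration (names such as `run_mapreduce` are made up for this sketch, not Hadoop or Google API) that wires Map, the invisible Shuffle-and-Sort step, and Reduce together for word count:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(map_fn, reduce_fn, records):
    # Map: (k, v) -> [(k1, v1), (k2, v2), ...]
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))
    # Shuffle and Sort: bring all v' with the same k' together
    intermediate.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in kvs])
               for k, kvs in groupby(intermediate, key=itemgetter(0)))
    # Reduce: (k', [v'1, v'2, ...]) -> [(k', v''), ...]
    output = []
    for k, vs in grouped:
        output.extend(reduce_fn(k, vs))
    return output

# Word count: key = document name, value = its text
def wc_map(doc, text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

docs = [("d1", "the quick fox"), ("d2", "the lazy dog")]
print(run_mapreduce(wc_map, wc_reduce, docs))
```

A real runtime does the same three stages, but with the map and reduce calls spread across many machines.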
121. Under the hood: Scheduling
One master, many workers.
Input data is split into M map tasks (typically 64 MB each).
Reduce phase is partitioned into R reduce tasks (= # of output files).
Tasks are assigned to workers dynamically:
Master assigns each map task to a free worker.
Considers locality of data to worker when assigning a task.
Worker reads task input (often from local disk!).
Worker produces R local files containing intermediate (k, v) pairs.
Master assigns each reduce task to a free worker.
Worker reads intermediate (k, v) pairs from the map workers.
Worker sorts & applies user’s Reduce op to produce the output.
User may specify Partition: which intermediate keys go to which Reducer.
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 30 / 42
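The Partition step above can be illustrated with the common hash-partitioning scheme (a sketch; `default_partition` and the choice `R = 4` are illustrative, not Hadoop's actual API). The key property: every occurrence of a key, from every map worker, lands in the same one of the R reduce partitions.

```python
R = 4  # number of reduce tasks, chosen for illustration

def default_partition(key, num_reducers=R):
    # Hash partitioning: deterministic within a job, so all map
    # workers route a given key to the same reducer.
    return hash(key) % num_reducers

# Each map worker writes R local files; pair (k, v) goes to file
# default_partition(k). Reducer r later fetches partition r from
# every map worker.
buckets = {r: [] for r in range(R)}
for k, v in [("the", 1), ("fox", 1), ("the", 1)]:
    buckets[default_partition(k)].append((k, v))
```

A user-supplied Partition function replaces `default_partition` when, say, all URLs from one host should reach the same reducer.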
133. Robustness
One master, many workers.
Detect failure via periodic heartbeats.
Re-execute completed and in-progress map tasks (their output lives on the failed worker’s local disk).
Re-execute in-progress reduce tasks.
Master assigns each map task to a free worker.
Master failure:
State is checkpointed to a replicated file system.
A new master recovers & continues.
Very robust: Google once lost 1600 of 1800 machines during a job, but it finished fine.
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 31 / 42
143. What is Hadoop?
Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license.
Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.
A Map/Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
The map output is then made input to the reduce tasks.
The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks.
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 33 / 42
148. Who uses Hadoop?
Adobe
AOL
Baidu - the leading Chinese-language search engine
Cloudera, Inc. - provides commercial support and professional training for Hadoop
Facebook
Google
IBM
Twitter
Yahoo!
The New York Times, Last.fm, Hulu, LinkedIn
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 34 / 42
159. Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
The Hadoop Map/Reduce framework spawns one map task for each InputSplit generated by the InputFormat.
Output pairs do not need to be of the same types as input pairs.
Mapper implementations are passed the JobConf for the job.
The framework then calls the map method for each key/value pair.
Applications can use the Reporter to report progress.
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format.
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
Users can optionally specify a combiner to perform local aggregation of the intermediate outputs.
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 35 / 42
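Hadoop's native Mapper API is Java, but Hadoop Streaming lets any executable act as the Mapper: it reads input records as lines on stdin and emits intermediate pairs as tab-separated `key<TAB>value` lines on stdout. A minimal word-count mapper sketch in that style (the function name `map_lines` is illustrative):

```python
def map_lines(lines):
    # Streaming delivers one input value (a line of text) per stdin
    # line; we emit one intermediate (word, 1) pair per output line,
    # formatted as "key<TAB>value".
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# In a real Streaming job this loop would read sys.stdin instead.
for pair in map_lines(["the quick brown fox\n"]):
    print(pair)
```

Note the output pairs (string word, count) need not match the input types (offset, line), exactly as the slide says.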
170. Combiners
When the map operation outputs its pairs, they are already available in memory.
If a combiner is used, the map key-value pairs are not immediately written to the output.
Instead they are collected in lists, one list per key.
When a certain number of key-value pairs have been buffered, the buffer is flushed: all the values of each key are passed to the combiner’s reduce method, and the resulting key-value pairs are output as if they had been created by the original map operation.
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 36 / 42
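The buffer-and-flush behaviour described above can be sketched in Python (a simplified single-process illustration; the names and the flush threshold are made up for the sketch, not Hadoop internals):

```python
from collections import defaultdict

def map_with_combiner(pairs, combine_fn, flush_every=4):
    # Collect map output in per-key lists instead of emitting directly.
    buffer, out, n = defaultdict(list), [], 0
    for k, v in pairs:
        buffer[k].append(v)
        n += 1
        if n >= flush_every:          # flush: run the combiner's reduce
            for key, vals in buffer.items():
                out.extend(combine_fn(key, vals))
            buffer.clear()
            n = 0
    for key, vals in buffer.items():  # final flush
        out.extend(combine_fn(key, vals))
    return out

def wc_combine(word, counts):         # same signature as a reducer
    return [(word, sum(counts))]

pairs = [("the", 1), ("fox", 1), ("the", 1), ("the", 1), ("fox", 1)]
print(map_with_combiner(pairs, wc_combine))
```

The payoff is that far fewer intermediate pairs cross the network: here five map pairs shrink to three combined pairs before the shuffle.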
174. Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
Reducer implementations are passed the JobConf for the job.
The framework then calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped inputs.
The reducer has 3 primary phases:
Shuffle: Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Sort: The framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage.
Reduce: In this phase the reduce method is called for each <key, (list of values)> pair in the grouped inputs.
The generated output is a new, smaller set of values.
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 37 / 42
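A Streaming-style reducer sketch in Python shows how little work is left after Shuffle and Sort: the reducer receives sorted `key<TAB>value` lines, so one pass with `groupby` recreates the <key, (list of values)> groups (the function name `reduce_lines` is illustrative):

```python
from itertools import groupby

def reduce_lines(lines):
    # The framework hands the reducer the sorted map output; groupby
    # rebuilds each <key, (list of values)> group from adjacent lines.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# In a real Streaming job the input would be sys.stdin.
sorted_map_output = ["fox\t1\n", "the\t1\n", "the\t1\n"]
for line in reduce_lines(sorted_map_output):
    print(line)
```

This only works because the input is sorted; `groupby` merges adjacent equal keys, which is exactly what the Sort phase guarantees.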
182. Some Terminology
Job – a “full program”: an execution of a Mapper and Reducer across a data set.
Task – an execution of a Mapper or a Reducer on a slice of data.
Task Attempt – a particular instance of an attempt to execute a task on a machine.
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 38 / 42
185. Job Distribution
MapReduce programs are contained in a Java “jar” file plus an XML file containing serialized program configuration options.
Running a MapReduce job places these files into HDFS and notifies the TaskTrackers where to retrieve the relevant program code.
Data distribution is implicit in the design of MapReduce!
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 39 / 42
188. Contact Information
Varun Thacker
Linux User’s Group Manipal
varunthacker1989@gmail.com
http://lugmanipal.org
http://forums.lugmanipal.org
http://varunthacker.wordpress.com
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 40 / 42
189. Attribution
Google
Under the Creative Commons Attribution-Share Alike 2.5 Generic.
Varun Thacker (LUG Manipal) Distributed Computing April 8, 2010 41 / 42