The document discusses Hadoop MapReduce. It describes Hadoop as a framework for the distributed processing of large datasets across clusters of computers, built on two main components: HDFS for storage and MapReduce for processing. MapReduce is the programming model Hadoop uses to process and generate large datasets in parallel. A job runs in two main phases: a map phase, where input records are transformed into intermediate key-value pairs, and a reduce phase, where those pairs are aggregated into the final results.
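As a concrete illustration of the two phases, here is a minimal sketch of the canonical Hadoop word-count job in Java, using the standard org.apache.hadoop.mapreduce API. It is not taken from the document itself: the mapper emits an intermediate (word, 1) pair for every token, and the reducer sums the counts emitted for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: turn each input line into intermediate (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: aggregate all counts emitted for the same word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```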
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about Gemfire XD at: http://www.gopivotal.com/big-data/pivotal-hd
The document summarizes the TeraSort algorithm used in Hadoop. It describes:
1) TeraSort uses MapReduce to sample and sort very large datasets, on the order of 100 TB in under 3 hours, by spreading the work across thousands of nodes.
2) The input data is first generated with TeraGen. TeraSort then samples that input to choose split points that partition the full key range among the reducers (a simplified sketch of this partitioning follows the list).
3) Each reducer sorts its own partition locally, so the entire dataset is in sorted order when the reducers' outputs are concatenated in reducer order.
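To make step 2 concrete, here is a hypothetical, stripped-down range partitioner written against the same Hadoop Java API. Real TeraSort uses Hadoop's TotalOrderPartitioner, which reads its split points from a file of keys sampled from the input; the hard-coded boundaries and the SampledRangePartitioner name below are illustrative assumptions only.

```java
import java.util.Arrays;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical simplified range partitioner illustrating the TeraSort idea:
// split points come from sampling the input, and each reducer owns one key range.
public class SampledRangePartitioner extends Partitioner<Text, Text> {

  // In real TeraSort these boundaries are read from a file of sampled keys
  // (see TotalOrderPartitioner); they are hard-coded here for illustration.
  private static final Text[] SPLIT_POINTS = {
      new Text("g"), new Text("n"), new Text("u")
  };

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Binary-search the sampled boundaries: keys below the first split point
    // go to reducer 0, keys between split i-1 and split i go to reducer i.
    int pos = Arrays.binarySearch(SPLIT_POINTS, key);
    if (pos < 0) {
      pos = -pos - 1; // insertion point when the key is not an exact boundary
    }
    return Math.min(pos, numPartitions - 1);
  }
}
```

Because every key routed to partition i sorts before every key routed to partition i+1, the per-reducer local sorts in step 3 produce a globally sorted result with no final merge step.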