
Hadoop and Mapreduce for .NET User Group

  1. Introduction to Hadoop and MapReduce
     Csaba Toth
     Central California .NET User Group Meeting
     Date: March 19th, 2014
     Location: Bitwise Industries, Fresno
  2. Agenda
     • Big Data
     • Hadoop
     • MapReduce
     • Demos:
       – Exercises with on-premise Hadoop emulator
       – Azure HDInsight
  3. Big Data
     • Wikipedia: “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”
     • Examples (Wikibon, A Comprehensive List of Big Data Statistics):
       – 100 terabytes of data are uploaded to Facebook every day
       – Facebook stores, processes, and analyzes more than 30 petabytes of user-generated data
       – Twitter generates 12 terabytes of data every day
       – LinkedIn processes and mines petabytes of user data to power the "People You May Know" feature
       – YouTube users upload 48 hours of new video content every minute of the day
       – Decoding the human genome used to take 10 years; now it can be done in 7 days
  4. Big Data characteristics
     • Three Vs: Volume, Velocity, Variety
     • Sources:
       – Science, sensors, social networks, log files
       – Public data stores, data warehouse appliances
       – Network and in-stream monitoring technologies
       – Legacy documents
     • Main problems:
       – Storage problem
       – Money problem
       – Consuming and processing the data
  5. A Little History
     Two seminal papers:
     • “The Google File System”, October 2003
       http://labs.google.com/papers/gfs.html
       – Describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications; it runs on inexpensive commodity hardware and delivers high aggregate performance
     • “MapReduce: Simplified Data Processing on Large Clusters”, December 2004
       http://queue.acm.org/detail.cfm?id=988408
       – Describes a programming model and an implementation for processing large data sets:
         1. A map function that processes a key/value pair to generate a set of intermediate key/value pairs
         2. A reduce function that merges all intermediate values associated with the same intermediate key
  6. Hadoop vs RDBMS
                             Hadoop / MapReduce                      RDBMS
     Size of data            Petabytes                               Gigabytes
     Integrity of data       Low                                     High (referential, typed)
     Data schema             Dynamic                                 Static
     Access method           Batch                                   Interactive and batch
     Scaling                 Linear                                  Nonlinear (worse than linear)
     Data structure          Unstructured                            Structured
     Normalization of data   Not required                            Required
     Query response time     Has latency (due to batch processing)   Can be near immediate
  7. Hadoop
     • Hadoop is an open-source software framework that supports data-intensive distributed applications.
     • It has two main pieces:
       – Storing large amounts of data: HDFS, the Hadoop Distributed File System
       – Processing large amounts of data: an implementation of the MapReduce programming model
  8. Hadoop
     • All of this is done in a cost-effective way: Hadoop manages a cluster of commodity computers.
     • The cluster is composed of a single master node and multiple worker nodes.
     • Hadoop is written in Java and runs on JVMs.
  9. Scalable Storage – Hadoop Distributed File System
     • Stores large amounts of data economically
     • Scales out instead of scaling up:
       – Each file is split into large blocks (64 MB by default in Hadoop 1.x; 128 MB since Hadoop 2.x)
       – Each block is replicated to multiple machines (nodes)
       – A centralized metadata store has information on the blocks
       – Any node can fail
     • Allows storage of large files in a fault-tolerant, distributed manner
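The block splitting and replication described on this slide can be sketched in a few lines of Python. This is an illustration only, not HDFS code: `split_into_blocks` and `place_replicas` are hypothetical helper names, the replication factor of 3 is the usual HDFS default rather than something the slide states, and the round-robin placement is a simplifying assumption (real HDFS placement is rack-aware).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size named on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes.

    Round-robin placement is an assumption made for this sketch;
    it is not the policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(150 * 1024 * 1024)  # a 150 MB file
    placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
    for block, holders in placement.items():
        print(block, blocks[block], holders)
```

Because every block lives on several nodes, losing any single node still leaves at least two copies of each block in this sketch, which is the property the "any node can fail" bullet relies on.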
  10. HDFS visually
      • Name node: metadata store
      • Data nodes:
        – Node 1: Block A, Block B
        – Node 2: Block A, Block B
        – Node 3: Block A, Block B
  11. Scalable Processing
      • Task: word counting program
      • Pseudo-code:
        1. Open the text file for reading and read each line
        2. Parse each line into words
        3. Increment the count of each word in a dictionary
        4. Close the file and output the summary
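A single-machine version of the pseudo-code might look like the Python sketch below. The function name `count_words` and the word pattern (letters, digits, and apostrophes) are assumptions made for the example; the slides do not define how a line is tokenized.

```python
import re
from collections import Counter


def count_words(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Step 2: parse the line into words.
        words = re.findall(r"[A-Za-z0-9']+", line)
        # Step 3: increment the count of each word in the dictionary.
        counts.update(words)
    return counts


if __name__ == "__main__":
    # Steps 1 and 4 would wrap this in `with open(path) as f:`,
    # since a file object is itself an iterable of lines.
    for word, n in count_words(["Twinkle, Twinkle Little Star"]).items():
        print(word, n)
```

Everything here runs in one process with one dictionary in memory, which is exactly what stops scaling once the input no longer fits on one machine.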
  12. Scalable Processing
      • Issues:
        – Data storage
        – Data distribution
        – Parallelizable algorithm
        – Fault tolerance
        – Aggregation
        – Storage of results
  13. Scalable Processing
      Issue considered, followed by the Hadoop solution:
      • Data storage
        – HDFS
      • Data distribution (the system should be able to distribute data to each of the processing nodes)
        – Hadoop keeps data distribution between nodes to a minimum. It instead moves processing code to each node and processes the data where it is already available on disk.
      • Parallelizable algorithm
        – Hadoop uses the MapReduce programming model to enable a scalable processing model.
      • Fault tolerance
        – Hadoop monitors data storage nodes and will add replicas as nodes become unavailable. Hadoop also monitors tasks assigned to nodes and will reassign them if a node becomes unavailable.
      • Aggregation
        – This is accounted for in a distributed manner through the Reduce stage.
      • Storage of results
        – HDFS
  14. MapReduce
      • Hadoop leverages the functional programming model of map/reduce.
      • The model moves away from shared resources and the related synchronization and contention issues.
      • It is thus inherently scalable and suitable for processing large data sets with distributed computing on clusters of computers (nodes).
      • The goal of MapReduce is to break huge data sets into smaller pieces, distribute those pieces to various worker nodes, and process the data in parallel.
      • Hadoop leverages a distributed file system to store the data on various nodes.
  15. Job / task management
      • Name node: runs the Jobtracker
      • Heartbeat signals and communication between the name node and the data nodes
      • Data nodes, each running a Tasktracker:
        – Tasktracker 1: Map 1, Reduce 1
        – Tasktracker 2: Map 2, Reduce 2
        – Tasktracker 3: Map 3, Reduce 3
  16. MapReduce
      • It is about two functions, map and reduce, with a shuffle step in between:
        1. Map step:
           – Processes a key/value pair and generates a set of intermediate key/value pairs from it
        2. Shuffle step:
           – Groups all intermediate values associated with the same intermediate key into one set
        3. Reduce step:
           – Processes the intermediate values associated with the same intermediate key and produces a set of values based on the groups (usually some kind of aggregate)
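The three steps can be strung together in a tiny in-memory driver. This is a sequential sketch for illustration, not how Hadoop executes a job; `map_reduce`, `length_mapper`, and `count_reducer` are hypothetical names, and the word-length histogram is just a small sample problem.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, shuffle, and reduce sequentially over (key, value) records."""
    # Map step: each input pair yields zero or more intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Shuffle step: group intermediate values by their intermediate key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce step: aggregate each group into one output pair.
    return [reducer(key, values) for key, values in groups.items()]


def length_mapper(offset, line):
    """Emit (word length, 1) for every whitespace-separated word."""
    return [(len(word), 1) for word in line.split()]


def count_reducer(length, ones):
    """Sum the 1s emitted for a given word length."""
    return (length, sum(ones))


if __name__ == "__main__":
    records = [(0, "Twinkle Twinkle Little Star")]
    print(sorted(map_reduce(records, length_mapper, count_reducer)))
```

The driver never looks inside the mapper or reducer, which is the point of the model: the user supplies two pure functions and the framework owns distribution and grouping.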
  17. MapReduce – Map step, word count
      • Input
        – Key: the starting index of the text chunk within the block being processed (ignored)
        – Value: a single line of text, the unit we want to process
      • Output (implemented by the user)
        – Key: a word (of a line)
        – Value: multiplicity of the word, which is 1 for each encountered word (we want to SUM() these 1s together for each word to get the total count later)
        – For each input (line) we output a whole set of words, each with a multiplicity of 1.
  18. Map step, word count example
      • Input to mapper
        Key                          Value
        Some ignored offset number   “Twinkle, Twinkle Little Star”
      • Output by mapper
        Key       Value
        Twinkle   1
        Twinkle   1
        Little    1
        Star      1
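A mapper that produces this output could be sketched as follows. The name `map_word_count` and the tokenizing pattern are assumptions for the example; the demo's actual mapper source is not part of this transcript.

```python
import re


def map_word_count(offset, line):
    """Yield (word, 1) for each word in the line; the offset key is ignored."""
    for word in re.findall(r"[A-Za-z0-9']+", line):
        yield (word, 1)


if __name__ == "__main__":
    for pair in map_word_count(0, "Twinkle, Twinkle Little Star"):
        print(pair)
```

Note that the mapper emits "Twinkle" twice rather than counting it locally; summing is left to the reduce step.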
  19. MapReduce – Shuffle step, word count
      • Once the Map stage is over, the data is collected from the mappers.
      • During the Shuffle stage, all values that have the same key are collected and stored as a conceptual list tied to the key under which they were registered.
      • The shuffle guarantees that data under a specific key will be sent to exactly one reducer.
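One common way to get the "exactly one reducer per key" guarantee is to derive the reducer index from a hash of the key. The sketch below assumes that approach; `partition` is a hypothetical helper, and `zlib.crc32` is a stand-in hash chosen because it is stable across runs, not the hash Hadoop itself uses.

```python
import zlib


def partition(key, num_reducers):
    """Map a string key to a reducer index in the range [0, num_reducers)."""
    # crc32 gives the same value in every process, unlike Python's
    # built-in hash() for strings, which is randomized per run.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    for word in ["Twinkle", "Twinkle", "Little", "Star"]:
        print(word, "-> reducer", partition(word, 3))
```

Since the index depends only on the key, every mapper sends all pairs for a given word to the same reducer without any coordination between mappers.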
  20. Shuffle step, word count example
      • Output by Shuffle
        Key       Value
        Twinkle   1, 1
        Little    1
        Star      1
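The grouping in this example can be reproduced with a dictionary of lists. This is an in-memory sketch with a hypothetical function name `shuffle`; a real cluster performs the same grouping across the network and on disk.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group the values of (key, value) pairs into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


if __name__ == "__main__":
    mapped = [("Twinkle", 1), ("Twinkle", 1), ("Little", 1), ("Star", 1)]
    print(shuffle(mapped))
```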
  21. MapReduce – Reduce step, word count
      • Input (inferred from map output)
        – Key: a word
        – Value: list of multiplicities
      • Output (implemented by the user)
        – Key: a word
        – Value: SUM() of the multiplicities, thus the aggregate count
  22. Reduce step, word count example
      • Output by Reduce
        Key       Value
        Twinkle   2
        Little    1
        Star      1
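A reducer for word count only has to sum the list it receives for each key. The sketch below uses a hypothetical name, `reduce_word_count`, and feeds it the shuffle output from the example by hand.

```python
def reduce_word_count(word, multiplicities):
    """Return (word, total) where total is the sum of the multiplicities."""
    return (word, sum(multiplicities))


if __name__ == "__main__":
    shuffled = {"Twinkle": [1, 1], "Little": [1], "Star": [1]}
    for word, ones in shuffled.items():
        print(reduce_word_count(word, ones))
```

Each call depends only on its own key and list, so different keys can be reduced on different nodes at the same time.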
  23. MapReduce – Reduce step
      • Aggregates the answers and creates the needed output, which is the answer to the problem the job was originally trying to solve.
      • That is how the system can process petabytes in a matter of hours:
        – Mappers can perform the map phase in parallel
        – Reducers can perform the reduction phase in parallel
        – Hadoop optimizes data and processing locality
  24. Word count
      http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
  25. Map, Shuffle, and Reduce
      https://mm-tom.s3.amazonaws.com/blog/MapReduce.png
  26. Demo
      • Simple implementation of word count
      • Hortonworks HDInsight on-premise (single node, pseudo cluster)
      • Hadoop SDK, building from source or from NuGet
      • Mapper source
      • Reducer source
  27. Demo
      • HDInsight on Azure
  28. Hadoop architecture
      • Inputs: Log Data, RDBMS
      • Data Integration Layer: Flume, Sqoop
      • Storage Layer (HDFS)
      • Computing Layer (MapReduce)
      • Advanced Query Engine (Hive, Pig); Data Mining (Pegasus, Mahout); Index, Searches (Lucene)
      • DB drivers (Hive driver)
      • Presentation Layer: Web Browser (JS)
  29. References
      • Daniel Jebaraj: Ignore HDInsight at Your Own Peril: Everything You Need to Know
      • Tom White: Hadoop: The Definitive Guide, 3rd Edition, Yahoo Press
      • Bruno Terkaly’s presentations (for example, Hadoop on Azure: Introduction)
      • Lynn Langit’s various presentations and YouTube videos
  30. Thanks for your attention!
      • Next time:
        – Pig and Hive
        – Recommendation engine
        – Data mining with Hadoop
