The Map Reduce
Programming Model
Boris Farber
boris.farber@gmail.com
10/5/2013
Map reduce

  1. The Map Reduce Programming Model
     Boris Farber, boris.farber@gmail.com, 10/5/2013
  2. About Me
     • Senior Software Engineer at Varonis
       – Data governance, leakage, and synchronization
       – Android, Windows
     • Author of Profiterole, a Map Reduce library for Android
     • Reviewer of the “Android System Programming” book
  3. Credits
     • Avner Ben
       – Chief Architect at Elisra Systems
       – Skill Tree ® designer
     • Nathan Marz
       – Author of the “Big Data” book at Manning
       – Software Engineer at Twitter
       – Creator of Storm and Cascalog
     • Matei Zaharia
       – http://www.cs.berkeley.edu/~matei/
  4. Plan
     • Introduction to Big Data: the problem
     • Map Reduce: the solution
       – Map/Reduce functions
       – Sample problems
       – Analysis
       – Cascalog
     • Lambda Architecture: getting the big picture
     • Summary
  5. What is Big Data
     • Big Data is a programming paradigm that supports data-flow processing where traditional RDBMS and SQL-based systems fail, both to scale up and to provide the desired functionality.
     • For example, it is the backbone of Facebook, Twitter, LinkedIn, and Google.
     • The common data pattern at these companies is a huge amount of unstructured data, such as tweets, likes, and network updates.
  6. Big Data
     • The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures.
     • To gain value from this data, you must choose an alternative way to process it.
     • Process the data ahead of time and keep pre-computed query results (we don’t want to search petabytes for every query).
  7. Handling Huge Data Amounts
     • Cost-efficiency:
       – Commodity machines (cheap, but unreliable)
       – Commodity network
       – Automatic fault tolerance (fewer administrators)
       – Easy to use (fewer programmers)
     • Large-scale data processing with commodity hardware
     • We want to use 1000s of CPUs/processes/threads
     • But we don’t want the hassle of managing them
  8. www.recessframework.org/page/map-reduce-anonymous-functions-lambdas-php
  9. Map Reduce
     • A programming model (and its associated implementation) for processing large data sets
     • Exploits a large set of commodity computers
     • Executes the processing in a distributed manner
     • Offers a high degree of transparency
     • Based on a functional programming approach: side-effect-free functions over independent pieces of data are easier to spread across machines than objects that share mutable state
  10. Challenge
     • MapReduce is such a powerful paradigm because programs written in terms of MapReduce are inherently scalable.
     • The same program that runs on ten megabytes of data can run on ten petabytes of data. Why?
  11. Usage
     • Original paper by Google: http://research.google.com/archive/mapreduce.html
     • Backbone of Hadoop, the open source Apache Big Data framework
  12. Map (diagram): Original list → Function → New list
  13. Map
     • A higher-order function that applies a given function to each element of a list and returns a list of the results. It is often called apply-to-all when considered in functional form.
     • (defn bubble [x] (* x x))
     • (map #(bubble %1) [1 3 5 7]) → (1 9 25 49)
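
The same apply-to-all idea, sketched in Python rather than the slide's Clojure. The name `square` is an illustrative stand-in for the slide's `bubble` function.

```python
def square(x):
    """Return x multiplied by itself (the slide's `bubble`)."""
    return x * x


# map applies `square` to every element; list() materializes the lazy result.
squares = list(map(square, [1, 3, 5, 7]))
print(squares)
```

Each element is processed independently of the others, which is what makes a map step easy to split across machines.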
  14. Reduce (diagram): Original list → Function → Result (1000)
  15. Reduce
     • Reduce and accumulate are a family of higher-order functions that analyze a data structure and, using a given combining operation, recombine the results of recursively processing its constituent parts, building up a return value.
     • (reduce * [1 2 3 4 5 6 6]) → 4320
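
A Python counterpart of the Clojure `reduce` call above, as a small sketch. It uses `functools.reduce` with multiplication as the combining operation; the second call is an extra illustration (not from the slides) showing an explicit initial value.

```python
from functools import reduce
from operator import mul

# Fold the list into one value by multiplying left to right.
product = reduce(mul, [1, 2, 3, 4, 5, 6, 6])

# With an initial value, reducing an empty list is well defined.
empty_sum = reduce(lambda acc, x: acc + x, [], 0)

print(product, empty_sum)
```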
  16. Map-Reduce
     1. "Map" step: The master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
     2. "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve.
  17. Word Count Naïve Approach
     • You can do this sequentially by getting your fastest machine (you've got plenty lying around) and running over the text from start to finish.
     • Maintain a hash map keyed by every word you find, and increment the frequency (the value) every time you parse that word.
     • Simple, straightforward, and slow.
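
The naïve approach fits in a few lines. This is a minimal single-machine sketch; `count_words` is a name chosen here for illustration.

```python
def count_words(text):
    """Count word frequencies sequentially with one hash map."""
    counts = {}
    for word in text.split():
        # The word is the key; its running frequency is the value.
        counts[word] = counts.get(word, 0) + 1
    return counts


naive_counts = count_words(
    "the quick brown fox the fox ate the mouse how now brown cow"
)
print(naive_counts)
```

It is correct, but a single machine has to read all of the text, so the running time grows with the input and cannot be reduced by adding hardware.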
  18. Word Count Execution (diagram: Input → Map → Reduce → Output)
     Input splits and Map output:
       the quick brown fox    → the, 1; brown, 1; fox, 1; quick, 1
       the fox ate the mouse  → the, 1; fox, 1; the, 1; ate, 1; mouse, 1
       how now brown cow      → how, 1; now, 1; brown, 1; cow, 1
     Reduce output:
       Reducer 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
       Reducer 2: ate, 1; cow, 1; mouse, 1; quick, 1
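
The execution in this slide can be simulated in one process. The sketch below is an assumption-laden toy, not Hadoop code: `map_words`, `shuffle`, and `reduce_counts` are hypothetical names, and the shuffle that a real framework performs over the network is a plain dictionary here.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, ones):
    """Reduce: sum the 1s emitted for a single word."""
    return word, sum(ones)


splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
mapped = [pair for line in splits for pair in map_words(line)]
word_counts = dict(reduce_counts(w, ones) for w, ones in shuffle(mapped).items())
print(word_counts)
```

Each call to `map_words` touches only its own split, and each call to `reduce_counts` touches only one key, so both steps could run on different nodes.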
  19. http://code.google.com/p/lokad-cloud/wiki/MapReduceSample
  20. Search
     • Input: (lineNumber, line) records
     • Output: lines matching a given pattern
     • Map:
         if (line matches pattern)
             output(line)
     • Reduce: identity function
       – Alternative: no reducer (map-only job)
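
A map-only search can be sketched as follows. The function name `search_map`, the sample lines, and the pattern are illustrative assumptions; a regular expression stands in for "matches pattern".

```python
import re


def search_map(line_number, line, pattern):
    """Map: emit the line only if it matches the pattern."""
    if re.search(pattern, line):
        yield line


records = list(enumerate(
    ["to be or not to be", "be not afraid of greatness", "how now brown cow"],
    start=1,
))

# No reducer is needed: the map output is already the final answer.
matches = [out for number, line in records
           for out in search_map(number, line, r"\bbe\b")]
print(matches)
```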
  21. Inverted Index
     • Input: (filename, text) records
     • Output: list of files containing each word
     • Map:
         foreach word in text.split()
             output(word, filename)
     • Combine: uniquify filenames for each word
     • Reduce:
         def reduce(word, filenames)
             output(word, sort(filenames))
  22. Inverted Index Example
     Input:
       hamlet.txt: to be or not to be
       12th.txt:   be not afraid of greatness
     Map output:
       to, hamlet.txt; be, hamlet.txt; or, hamlet.txt; not, hamlet.txt
       be, 12th.txt; not, 12th.txt; afraid, 12th.txt; of, 12th.txt; greatness, 12th.txt
     Reduce output:
       afraid, (12th.txt)
       be, (12th.txt, hamlet.txt)
       greatness, (12th.txt)
       not, (12th.txt, hamlet.txt)
       of, (12th.txt)
       or, (hamlet.txt)
       to, (hamlet.txt)
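
The inverted index pseudocode and the example above can be turned into a runnable sketch. This is a single-process approximation: `index_map` and `index_reduce` are hypothetical names, and the combine step (removing duplicate filenames) is folded into the reducer with `set()`.

```python
from collections import defaultdict


def index_map(filename, text):
    """Map: emit (word, filename) for every word in the file."""
    for word in text.split():
        yield word, filename


def index_reduce(word, filenames):
    """Reduce: drop duplicate filenames and sort them."""
    return word, sorted(set(filenames))


files = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}

# Group the intermediate pairs by word, as the framework's shuffle would.
groups = defaultdict(list)
for filename, text in files.items():
    for word, name in index_map(filename, text):
        groups[word].append(name)

index = dict(index_reduce(word, names) for word, names in groups.items())
print(index["be"], index["to"])
```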
  23. Map Reduce
     • Map:
       – Accepts an input key/value pair
       – Emits intermediate key/value pairs
     • Reduce:
       – Accepts an intermediate key together with the list of values for that key
       – Emits output key/value pairs
     (diagram): Very big data → MAP → Partitioning function → REDUCE → Result
  24. Programming Model
     • Data type: key-value records
     • Map function: (K_in, V_in) → list(K_inter, V_inter)
     • Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)
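
The two signatures are enough to write a tiny generic driver. The sketch below is a toy under stated assumptions: `run_mapreduce` is a hypothetical helper, everything runs in one process, and the order records (city, amount) are made-up sample data.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each (K_in, V_in), group by K_inter, then reduce."""
    intermediate = defaultdict(list)
    for k_in, v_in in records:
        for k_inter, v_inter in map_fn(k_in, v_in):
            intermediate[k_inter].append(v_inter)
    output = []
    for k_inter, values in intermediate.items():
        output.extend(reduce_fn(k_inter, values))
    return output


def sales_map(order_id, order):
    """(order_id, (city, amount)) -> [(city, amount)]"""
    city, amount = order
    return [(city, amount)]


def sales_reduce(city, amounts):
    """(city, [amount, ...]) -> [(city, total)]"""
    return [(city, sum(amounts))]


orders = [(1, ("Haifa", 10)), (2, ("Tel Aviv", 5)), (3, ("Haifa", 7))]
totals_by_city = dict(run_mapreduce(orders, sales_map, sales_reduce))
print(totals_by_city)
```

Only `sales_map` and `sales_reduce` are specific to the problem; the driver never changes, which is the part a framework such as Hadoop takes over.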
  25. Throwing Hardware: Linear Scalability
     • Cheap nodes fail, especially if you have many
     • Mean time between failures for 1 node = 3 years
     • Mean time between failures for 1000 nodes = 1 day
     • Programming distributed systems is hard, so users write only the “map” and “reduce” functions, and the system distributes the work and handles faults
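
The "1 day" figure follows from simple arithmetic, assuming nodes fail independently and at a constant rate (an assumption made here, not stated on the slide): the cluster's failure rate is the sum of the node rates, so its mean time between failures is the node value divided by the node count.

```python
node_mtbf_days = 3 * 365   # one node fails about once every 3 years
nodes = 1000

# With independent failures, failure rates add up across nodes.
cluster_mtbf_days = node_mtbf_days / nodes
print(round(cluster_mtbf_days, 3))  # roughly one failure per day
```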
  26. Fault Tolerance: Task Crash
     • If a task crashes:
       – Retry on another node
         • OK for a map because it has no dependencies
         • OK for a reduce because map outputs are on disk
       – If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)
     • For these fault tolerance features to work, your map and reduce tasks must be side-effect-free
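
Retrying a crashed task can be sketched like this. Everything here is hypothetical: `run_with_retries` stands in for the scheduler, and `flaky_map` simulates two crashes with a counter that exists only to drive the demo.

```python
def run_with_retries(task, block, max_attempts=3):
    """Re-run a side-effect-free task until it succeeds or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(block)
        except Exception as error:  # a real scheduler would pick another node
            last_error = error
    raise RuntimeError(f"task failed {max_attempts} times") from last_error


attempts = {"count": 0}


def flaky_map(block):
    """Simulated map task that crashes on its first two runs."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("simulated crash")
    return [word.upper() for word in block]


retried = run_with_retries(flaky_map, ["a", "b"])
print(retried, attempts["count"])
```

Re-running is safe only because the task's result depends on nothing but its input block; a task that wrote to a shared database on each attempt would leave duplicates behind.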
  27. Fault Tolerance: Node Crash
     • If a node crashes:
       – Re-launch its current tasks on other nodes
       – Re-run any maps the node previously ran; this is necessary because their output files were lost along with the crashed node
  28. Fault Tolerance: Slow Task
     • If a task is going slowly (straggler):
       – Launch a second copy of the task on another node (“speculative execution”)
       – Take the output of whichever copy finishes first, and kill the other
     • Surprisingly important in large clusters
       – Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
       – A single straggler may noticeably slow down a job
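
Speculative execution can be imitated with threads. This is only a rough sketch: `run_speculatively` and `count_task` are hypothetical, the `delay` argument stands in for a slow node, and Python cannot kill a running thread, so the losing copy is merely ignored.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def count_task(block, delay):
    """Toy task: sleep to simulate node speed, then count the records."""
    time.sleep(delay)
    return len(block)


def run_speculatively(task, block, delays):
    """Start one copy of the task per delay and return the first result."""
    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(task, block, delay) for delay in delays]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for future in pending:
        future.cancel()  # best effort; an already running copy keeps going
    pool.shutdown(wait=False)
    return next(iter(done)).result()


# The first copy is the straggler; the second one finishes much sooner.
speculative_result = run_speculatively(count_task, ["a", "b", "c"], [0.3, 0.01])
print(speculative_result)
```

Both copies compute the same answer, so taking whichever finishes first is safe as long as the task is side-effect-free.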
  29. Key Code (async thread pool)

         import java.util.*;
         import java.util.concurrent.*;

         MapCallback<TMapInput> mapper = new MapCallback<TMapInput>();
         List<HashMap<String, Integer>> maps = new LinkedList<HashMap<String, Integer>>();

         int numThreads = 25;
         ExecutorService pool = Executors.newFixedThreadPool(numThreads);
         CompletionService<OutputUnit> futurePool =
                 new ExecutorCompletionService<MapCallback.OutputUnit>(pool);
         Set<Future<OutputUnit>> futureSet = new HashSet<Future<OutputUnit>>();

         // linear addition of jobs, parallel execution
         for (TMapInput m : input) {
             futureSet.add(futurePool.submit(mapper.makeWorker(m)));
         }

         // stop accepting new tasks; the submitted tasks keep running
         pool.shutdown();
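
For comparison, the same submit-then-collect pattern in Python. This is a sketch of the general thread-pool technique, not of Profiterole's API; `map_worker`, `merge_counts`, and the pool size of 4 are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def map_worker(chunk):
    """Count the words of one chunk; runs on a pool thread."""
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def merge_counts(partials):
    """Reduce step: merge the per-chunk dictionaries into one."""
    totals = {}
    for partial in partials:
        for word, count in partial.items():
            totals[word] = totals.get(word, 0) + count
    return totals


chunks = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

# Jobs are added one by one but execute in parallel on the pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(map_worker, chunk) for chunk in chunks]
    partials = [future.result() for future in as_completed(futures)]

pool_totals = merge_counts(partials)
print(pool_totals)
```

`as_completed` yields results in completion order, which is fine here because merging counts does not depend on the order of the partial results.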
  30. Clojure/Cascalog
     • Key problem in Java: very low-level processing with a lack of high-level combinations (it took me a while to add another reducer …) → static typing and OO (functions are not first-class citizens) → it felt like writing an interpreter
     • Cascalog is a fully featured data processing and querying library for Clojure or Java.
     • The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer.
     • Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.
  31. Benefits of Map Reduce
     • Many problems can be phrased this way
     • Elegant and powerful
  32. Takeaways
     • By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
       – Automatic division of a job into tasks
       – Automatic load balancing
       – Recovery from failures and stragglers
     • The user focuses on the application, not on the complexities of distributed computing
  33. Conclusions
     • The MapReduce programming model hides the complexity of work distribution and fault tolerance
     • Principal design philosophies:
       – Make it scalable, so you can throw hardware at problems
       – Make it cheap, lowering hardware, programming, and admin costs
     • MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time
  34. Suitable for your task if
     • You have a cluster
     • You are working with a large dataset
     • You are working with independent data (or data assumed to be independent)
     • The task can be cast into map and reduce
  35. Lambda Architecture
  36. Lambda Architecture
     • Query = Function(All Data)
     • A general-purpose approach to implementing an arbitrary function on an arbitrarily large dataset and having the function return its results with low latency
     • The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.
     • Key insight: the alternative approach is to pre-compute the query function. Why?
  37. Chart
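  38. Lambda Architecture
     • Batch layer – pre-computes and saves query results using Map Reduce (saves time at query time)
     • Serving layer – stores the pre-computed results and serves them to queries
     • Speed layer – handles the updates not yet included in the pre-computed results
     • A query returns the combined result: data already in the database plus updates not yet inserted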
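
One way to read the three layers is that a query merges a pre-computed batch result with a small set of recent updates. The sketch below illustrates that reading only; the page-view counts, the URLs, and the names `batch_view`, `recent_updates`, and `query_page_views` are all invented for the example.

```python
from collections import Counter

# Batch layer output: counts pre-computed over all data seen so far
# (in practice produced by a Map Reduce job and stored for serving).
batch_view = Counter({"/home": 120, "/about": 30})

# Updates that arrived after the last batch run and are not in the view yet.
recent_updates = ["/home", "/home", "/pricing"]
realtime_view = Counter(recent_updates)


def query_page_views(url):
    """Combine the pre-computed result with the not-yet-absorbed updates."""
    return batch_view[url] + realtime_view[url]


print(query_page_views("/home"), query_page_views("/pricing"))
```

The query never scans the full dataset: it reads two small lookups, which is why pre-computing the query function keeps latency low.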