Distributed batch processing with Hadoop

These are the slides I used to introduce Hadoop at a meetup of the Barcelona JUG (Java Users Group).



  1. Distributed batch processing with Hadoop · Ferran Galí i Reniu · @ferrangali · 09/01/2014
  2. Ferran Galí i Reniu ● UPC - FIB ● Trovit
  3. Problem ● Too much data ○ 90% of all the data in the world has been generated in the last two years ○ Large Hadron Collider: 25 petabytes per year ○ Walmart: 1M transactions per hour ● Hard disks ○ Cheap! ○ Still slow access time ○ Write even slower
  4. Solutions ● Multiple hard disks ○ Work in parallel ○ We can reduce access time! ● How to deal with hardware failure? ● What if we need to combine data?
  5. Hadoop ● Doug Cutting & Mike Cafarella
  6. Hadoop ● "The Google File System", Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, October 2003
  7. Hadoop ● "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, December 2004
  8. Hadoop ● Doug Cutting & Mike Cafarella ● Yahoo!
  9. Hadoop ● HDFS ○ Storage ● MapReduce ○ Processing ● Ecosystem
  10. HDFS ● Distributed storage ○ Managed across a network of commodity machines ● Blocks ○ About 128 MB ○ Large data sets ● Tolerance to node failure ○ Data replication ● Streaming data access ○ Read many times ○ Write once (batch)
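The block and replication numbers on this slide are easy to turn into arithmetic. A minimal sketch, assuming the defaults mentioned above (128 MB blocks, replication factor 3); the class and method names are mine, not part of any Hadoop API:

```java
public class HdfsBlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, the default block size
    static final int REPLICATION = 3;                  // default replication factor

    // Number of HDFS blocks needed to store a file of the given size.
    static long blocks(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Raw bytes consumed across the cluster once every block is replicated.
    static long rawBytes(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blocks(oneGb));   // a 1 GB file occupies 8 blocks
        System.out.println(rawBytes(oneGb)); // and 3 GB of raw cluster storage
    }
}
```

Note that a small file still occupies a whole block entry in the NameNode's metadata, which is why HDFS favors large files over many small ones.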
  11. HDFS ● DataNodes (Workers) ○ Store blocks ● NameNode (Master) ○ Maintains metadata ○ Knows where the blocks are located ○ Makes DataNodes fault tolerant ○ Single point of failure ○ Secondary NameNode
  12. HDFS http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
  13. HDFS ● Interfaces ○ Java ○ Command line interface ● Load: hadoop fs -put file.csv /user/hadoop/file.csv ● Extract: hadoop fs -get /user/hadoop/file.csv file.csv
  14. MapReduce ● Distributed processing paradigm ○ Moving computation is cheaper than moving data ● Map ○ map(k1, v1) -> list(k2, v2) ● Reduce ○ reduce(k2, list(v2)) -> list(v3)
  15. Word Counter · map(Long key, String value): for each (String word in value) emit(word, 1) · reduce(String word, List values): emit(word, sum(values))
  16. Word Counter · input: "Java is great" / "Hadoop is also great"
  17. Word Counter - Map · input table: Key 1 → "Java is great", Key 2 → "Hadoop is also great" (the map/reduce pseudocode from slide 15 is repeated on each of these slides)
  18. Word Counter - Map · map(1, "Java is great") is called
  19. Word Counter - Map · map(1, "Java is great") emits (Java, 1)
  20. Word Counter - Map · output so far: (Java, 1), (is, 1)
  21. Word Counter - Map · output so far: (Java, 1), (is, 1), (great, 1)
  22. Word Counter - Map · map(2, "Hadoop is also great") is called
  23. Word Counter - Map · final map output: (Java, 1), (is, 1), (great, 1), (Hadoop, 1), (is, 1), (also, 1), (great, 1)
  24. Word Count - Group & Sort · map output: (Java, 1), (is, 1), (great, 1), (Hadoop, 1), (is, 1), (also, 1), (great, 1)
  25. Word Count - Group & Sort · after grouping: Java → [1], is → [1, 1], great → [1, 1], Hadoop → [1], also → [1]
  26. Word Count - Group & Sort · after sorting: also → [1], great → [1, 1], Hadoop → [1], is → [1, 1], Java → [1]
  27. Word Count - Reduce · reducer input: also → [1], great → [1, 1], Hadoop → [1], is → [1, 1], Java → [1]
  28. Word Count - Reduce · reduce("also", [1]) is called
  29. Word Count - Reduce · reduce("also", [1]) emits (also, 1)
  30. Word Count - Reduce · reduce("great", [1, 1]) is called
  31. Word Count - Reduce · reduce("great", [1, 1]) emits (great, 2)
  32. Word Count - Reduce · reduce("Hadoop", [1]) emits (Hadoop, 1)
  33. Word Count - Reduce · reduce("is", [1, 1]) emits (is, 2)
  34. Word Count - Reduce · reduce("Java", [1]) emits (Java, 1); final output: also 1, great 2, Hadoop 1, is 2, Java 1
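The whole map → group & sort → reduce flow walked through above can be sketched in plain Java, without the Hadoop runtime. This is a simulation of the logic only, not Hadoop's actual Mapper/Reducer API; a TreeMap stands in for the framework's group & sort step (note that its byte-order sorting puts uppercase words first, slightly differently from the slides):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSim {

    // Map phase: one call per input line, emitting (word, 1) pairs.
    // The TreeMap accumulates and groups values per key, keeping keys
    // sorted, which mimics Hadoop's group & sort between map and reduce.
    static void map(String line, Map<String, List<Integer>> grouped) {
        for (String word : line.split(" ")) {
            grouped.computeIfAbsent(word, w -> new ArrayList<>()).add(1);
        }
    }

    // Reduce phase: sum the list of values grouped under each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        map("Java is great", grouped);
        map("Hadoop is also great", grouped);
        System.out.println(reduce(grouped)); // {Hadoop=1, Java=1, also=1, great=2, is=2}
    }
}
```

In real Hadoop the same logic lives in a Mapper and a Reducer class, and the grouping/sorting happens in the shuffle between the two, distributed across machines.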
  35. Distributed? ● Map tasks ○ Each input block is processed by a map task ● Reduce tasks ○ Partitioning when grouping
  36. Word Count - Partition · num partitions = 1 · all keys go to a single reduce task: also → [1], great → [1, 1], Hadoop → [1], is → [1, 1], Java → [1]
  37. Word Count - Partition · num partitions = 2 · keys are split across two partitions, each grouped and sorted independently: one partition with is → [1, 1], Java → [1], the other with also → [1], great → [1, 1], Hadoop → [1]
  38. Distributed? ● Map tasks ○ Each input block is processed by a map task ● Reduce tasks ○ Partitioning when grouping ○ Each partition is processed by a reduce task
  39. MapReduce ● JobTracker ○ Dispatches Map & Reduce tasks ● TaskTracker ○ Executes Map & Reduce tasks
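How does a key end up in a particular partition? Hadoop's default HashPartitioner uses the key's hash, with the sign bit cleared, modulo the number of partitions. A plain-Java sketch of that formula (the real partitioner operates on Hadoop Writable keys; the class name here is mine):

```java
public class PartitionDemo {

    // Same formula as Hadoop's default HashPartitioner:
    // clear the sign bit so the result is non-negative, then take
    // the remainder modulo the number of reduce tasks.
    static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Every occurrence of a word maps to the same partition,
        // so a single reducer sees all of that word's values.
        for (String word : new String[] {"Java", "is", "great", "Hadoop", "also"}) {
            System.out.println(word + " -> partition " + partition(word, 2));
        }
    }
}
```

The key property is determinism: the same key always lands in the same partition, which is what guarantees that one reduce task receives the complete value list for each key.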
  40. MapReduce Example 1: ● Map ● Reduce ● Group & Partition · $> hadoop jar jug-hadoop.jar example1 /user/hadoop/input.txt /user/hadoop/output 2 · $> hadoop fs -text /user/hadoop/output/part-r-* · http://github.com/ferrangali/jug-hadoop
  41. MapReduce Example 2: ● Sorting ● n-Job workflow · $> hadoop jar jug-hadoop.jar example2 /user/hadoop/input.txt /user/hadoop/output 2 · $> hadoop fs -text /user/hadoop/output/part-r-* · http://github.com/ferrangali/jug-hadoop
  42. Big Data
  43. Big Data ● Too much data ○ Not a problem any more ● It's just a matter of which tools to use ● A new opportunity for businesses
  44. Big Data Platform (diagram): Consumption (logs) → Processing → Serving (indexes, DB, NoSQL)
  45. Hadoop Ecosystem
  46. Hive ● Data Warehouse ● SQL-like analysis system: SELECT SPLIT(line, " ") AS word, COUNT(*) FROM table GROUP BY word ORDER BY word ASC; ● Executes MapReduce underneath!
  47. HBase ● Based on BigTable ● Column-oriented database ● Random realtime read/write access ● Easy to bulk load from Hadoop
  48. Hadoop Ecosystem ● ZooKeeper ○ Centralized coordination system ● Pig ○ Data-flow language to analyze large data sets ● Kafka ○ Distributed messaging system ● Sqoop ○ Transfer between RDBMS and HDFS ● ...
  49. Hadoop - Who's using it?
  50. Trovit ● What is it? ○ Vertical search engine ○ Real estate, cars, jobs, products, vacations ● Challenges ○ Millions of documents to index ○ Traffic generates a huge amount of log files
  51. Trovit ● Legacy ○ Used MySQL as a support for document indexing ○ Didn't scale! ● Batch processing ○ Hadoop with a pipeline workflow ○ Problem solved! ● Real time processing ○ Storm to improve freshness ● More challenges ○ Content analysis ○ Traffic analysis
  52. Questions? · Distributed batch processing with Hadoop · @ferrangali
