Hadoop - How It Works
Data processing parallelization approaches. Apache Hadoop, HDFS and MapReduce programming model.

Transcript

  • 1. Ing. Vladimír Hanušniak, University of Žilina, March 2014
  • 2.  Brief review  Parallel processing  Hadoop ◦ HDFS (Hadoop Distributed File System) ◦ MapReduce  Example 2
  • 3.  Brief review  Parallel processing  Hadoop ◦ HDFS (Hadoop Distributed File System) ◦ MapReduce  Example 3
  • 4. 4 With no signs of slowing, Big Data keeps growing.
  • 5. 5  X-ray – 30 MB  3D CT scan – 1 GB  3D MRI – 150 MB  Mammogram – 120 MB  Growing by 20–40 %/year  Preemies' health ◦ University of Ontario & IBM ◦ 16 different data streams ◦ 1260 data points per second  Early treatment
  • 6.  Data structure and storage  Analytical methods & Processing power  Needed parallelization 6
  • 7.  Brief review  Parallel processing  Hadoop ◦ HDFS (Hadoop Distributed File System) ◦ MapReduce  Example 7
  • 8.  Task decomposition (HPC Uniza) ◦ Computationally expensive task ◦ Move the data to processing ◦ Execution order ◦ Shared data storage 8
  • 9.  Slow HDD read speed!  HDD read speed ~100 MB/s ◦ Reading 1000 GB => 10,000 s (~166.7 min)  100 machines reading in parallel => ~1.7 min 9
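The slide's speedup arithmetic can be checked directly (round assumed figures, not benchmarks):

```python
# Back-of-the-envelope check of the slide's numbers: reading 1000 GB
# at ~100 MB/s, serially vs. split across 100 machines in parallel.

DISK_MB_PER_S = 100        # assumed sequential HDD read speed
TOTAL_MB = 1000 * 1000     # 1000 GB expressed in MB

serial_s = TOTAL_MB / DISK_MB_PER_S   # one disk reads everything
parallel_s = serial_s / 100           # 100 disks read disjoint parts

print(f"serial:   {serial_s:.0f} s = {serial_s / 60:.1f} min")
print(f"parallel: {parallel_s:.0f} s = {parallel_s / 60:.1f} min")
```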
  • 10.  Data decomposition (Hadoop) ◦ Data has regular structure (type, size) ◦ Move processing to data 10
  • 11.  Brief review  Parallel processing  Hadoop ◦ HDFS (Hadoop Distributed File System) ◦ MapReduce  Example 11
  • 12.  Hadoop – framework for processing BigData  Two main components: ◦ HDFS ◦ MapReduce  Thousands of nodes in cluster 12
  • 13.  Distributed, fault-tolerant file system designed to run on commodity hardware  Main characteristics ◦ Scalability ◦ High availability ◦ Large files ◦ Commodity hardware ◦ Streaming data access – write once, read many times 13
  • 14.  NameNode ◦ Master ◦ Control storage ◦ Store metadata about files  Name, path, size, block size, block IDs, ...  DataNode ◦ Slave ◦ Store data in blocks 14
  • 15. 15
  • 16.  Files are stored in blocks ◦ Large files are split  Block size: 64, 128, 256 MB, …  Metadata is stored in NameNode memory ◦ Limiting factor  ~150 bytes per file, directory, or block object ◦ 3 GB of memory ≈ 10 million single-block files. 16
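The 10-million-file figure follows from the ~150-bytes-per-object rule of thumb; a quick sketch, assuming a one-block file costs a file object plus a block object:

```python
# Rough NameNode capacity estimate from the slide's rule of thumb
# (~150 bytes of heap per file, directory, or block object).
BYTES_PER_OBJECT = 150
HEAP_BYTES = 3 * 10**9            # "3 GB" of NameNode heap

# a single-block file costs two objects: the file entry and its block
max_files = HEAP_BYTES // (2 * BYTES_PER_OBJECT)
print(f"{max_files:,} single-block files")   # 10,000,000
```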
  • 17.  Block size is chosen so that seek time is ~1% of transfer time ◦ Seek time ~10 ms ◦ Transfer rate ~100 MB/s ◦ => block size ~100 MB  The number of map tasks depends on the block size 17
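The rule of thumb behind these numbers can be checked directly; the seek time and transfer rate are the slide's assumed round figures, not measurements:

```python
# Why ~100 MB blocks: keep seek overhead near 1% of transfer time.
SEEK_S = 0.010        # ~10 ms average seek (assumed)
RATE_MB_S = 100       # ~100 MB/s sequential transfer (assumed)

block_mb = 100
transfer_s = block_mb / RATE_MB_S     # 1.0 s to stream one block
overhead = SEEK_S / transfer_s        # seek cost relative to transfer
print(f"seek overhead: {overhead:.0%}")   # 1%
```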
  • 18. 18
  • 19. 19
  • 20.  First replica – same node as the client  Second – off-rack  Third – same rack as the second, different node  Further replicas – random nodes (avoiding too many replicas on the same rack) 20
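The placement order above can be sketched as a small function. This is a simplification with made-up rack and node names; real HDFS also weighs load, free space, and node health:

```python
import random

def place_replicas(client_node, racks):
    """Sketch of the default 3-replica placement described above."""
    client_rack = next(r for r, nodes in racks.items() if client_node in nodes)
    first = client_node                                  # 1st: writer's own node
    remote_rack = random.choice([r for r in racks if r != client_rack])
    second = random.choice(racks[remote_rack])           # 2nd: a node off-rack
    third = random.choice(                               # 3rd: same rack as 2nd,
        [n for n in racks[remote_rack] if n != second])  #      different node
    return [first, second, third]

# hypothetical two-rack cluster
racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas("n1", racks))
```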
  • 21.  Brief review  Parallel processing  Hadoop ◦ HDFS (Hadoop Distributed File System) ◦ MapReduce  Example 21
  • 22.  Programming model for data processing ◦ Functional style – a directed acyclic graph of operations  Hadoop supports: Java, Ruby, Python, C++  Associative arrays ◦ <key, value> pairs 22
  • 23.  Job – a unit of work ◦ Input data ◦ Map & reduce program ◦ Configuration information  A job is divided into tasks ◦ Map tasks ◦ Reduce tasks 23
  • 24.  JobTracker ◦ Coordinates all jobs by scheduling tasks to run on TaskTrackers ◦ Keeps records of job progress ◦ Reschedules tasks that fail  TaskTrackers ◦ Run tasks ◦ Send progress reports to the JobTracker 24
  • 25. 25
  • 26.  Hadoop divides the input to a MapReduce job into fixed-size pieces of work – input splits  One map task is created per split ◦ It runs the user-defined map function  Split size tends to be the size of an HDFS block 26
  • 27.  Data locality optimization ◦ Run the map task on a node where the input data resides in HDFS. ◦ Data-local (a), rack-local (b), and off-rack (c) map tasks. 27
  • 28.  Output – <key, value> pairs  Written to local disk – NOT to HDFS! ◦ Map output is intermediate: reduce tasks process it to produce the final output ◦ No replicas needed  <key, value> pairs are sorted  If a node fails before the reduce phase, its map tasks are re-run 28
  • 29.  TaskTrackers read the map-output region files remotely (via RPC)  Invoke the reduce function (aggregation)  Output is stored in HDFS  Reduce tasks do not have the advantage of data locality ◦ The input to a reduce task is output from all mappers 29
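The dataflow described on the last few slides (map emits <key, value> pairs, the framework groups and sorts them by key, reduce aggregates each group) can be simulated in a few lines; this is an in-memory sketch of the programming model, not the Hadoop API:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map -> sort/group by key (the 'shuffle') -> reduce."""
    groups = defaultdict(list)
    for rec in records:                 # map phase: emit <key, value> pairs
        for k, v in map_fn(rec):
            groups[k].append(v)
    out = {}
    for k in sorted(groups):            # shuffle: values grouped, keys sorted
        out[k] = reduce_fn(k, groups[k])
    return out

# word count, the canonical example
counts = run_mapreduce(
    ["big data", "big cluster"],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda k, vs: sum(vs),
)
print(counts)   # {'big': 2, 'cluster': 1, 'data': 1}
```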
  • 30. 30
  • 31.  Combiner – minimizes the data transferred between map and reduce tasks  Runs on the map output  "Reduce on the map side"  Valid for max: max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25  Not valid for mean: mean(0, 20, 10, 25, 15) = 14, but mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15 31
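The slide's max/mean example can be run directly: a combiner is safe only when partial results compose, which holds for max but not for mean:

```python
# Checking the slide's combiner example with its own numbers.
part1, part2 = [0, 20, 10], [25, 15]   # outputs of two map tasks
data = part1 + part2

# max composes: combining per-map maxima gives the true maximum
assert max(data) == max(max(part1), max(part2)) == 25

# mean does not: a mean-of-means weights the parts incorrectly
mean = lambda xs: sum(xs) / len(xs)
print(mean(data))                          # 14.0  (correct)
print(mean([mean(part1), mean(part2)]))    # 15.0  (wrong)
```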
  • 32.  Java (Ruby, Python, C++) ◦ Good for programmers  Pig ◦ Scripting language with a focus on dataflows ◦ Uses the Pig Latin language ◦ Allows merging, filtering, and applying functions  Hive ◦ Uses HiveQL – similar to SQL (used by Facebook) ◦ Provides a database query interface  HBase 32
  • 33. 33
  • 34.  Brief review  Parallel processing  Hadoop ◦ HDFS (Hadoop Distributed File System) ◦ MapReduce  Example 34
  • 35. (0, 0067011990999991950051507004...9999999N9+00001+99999999999...) (106, 0043011990999991950051512004...9999999N9+00221+99999999999...) (212, 0043011990999991950051518004...9999999N9-00111+99999999999...) (318, 0043012650999991949032412004...0500001N9+01111+99999999999...) (424, 0043012650999991949032418004...0500001N9+00781+99999999999...)  Data Set  Find the maximum temperature by year 1901 - 317 1902 - 244 1903 - 289 1904 - 256 1905 - 283 ... 35
  • 36. #!/usr/bin/env bash
    for year in all/*
    do
      echo -ne `basename $year .gz`"\t"
      gunzip -c $year | awk '{
        temp = substr($0, 88, 5) + 0;
        q = substr($0, 93, 1);
        if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp
      } END { print max }'
    done

    % ./max_temperature.sh
    1901 317
    1902 244
    1903 289
    1904 256
    1905 283
    ... 36
  • 37. 37  Run parts of the program in parallel ◦ Process different years in different processes  Problems ◦ Unequal-size pieces ◦ Combining partial results needs processing time ◦ Single-machine processing limit ◦ Long processing time
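For contrast with the shell script, the same job expressed in the MapReduce model looks roughly like this. This is a Python sketch, not the real Hadoop API; the field offsets follow the NCDC record format shown on slide 35 (year at columns 16–19, temperature at 88–92, quality flag at 93), and the input record below is synthetic:

```python
def max_temp_map(line):
    """Emit (year, temp) for valid readings, like the awk filter above."""
    year = line[15:19]
    temp = int(line[87:92])          # signed value, tenths of a degree
    quality = line[92]
    if temp != 9999 and quality in "01459":
        yield year, temp

def max_temp_reduce(year, temps):
    """All values for one year arrive together; keep the maximum."""
    return year, max(temps)

# synthetic record: padding stands in for fields the job ignores
record = "0067011990999991950" + "0" * 68 + "+0022" + "1"
print(list(max_temp_map(record)))    # [('1950', 22)]
```

Hadoop then runs the map function on every split in parallel and feeds the grouped values to the reduce function, which removes the single-machine and result-combining problems listed above.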
  • 38. 38
  • 39. 39
  • 40. 40
  • 41. 41
  • 42. 42 Source: Infoware 1-2/2014
  • 43. 43
  • 44.  Hadoop: The Definitive Guide, 3rd Edition ◦ http://it-ebooks.info/book/635/  Big Data: A Revolution That Will Transform How We Live, Work, and Think  http://hadoop.apache.org/  http://architects.dzone.com/articles/how-hadoop-mapreduce-works 44
