Apache Hadoop - Big Data Engineering

2. Apache Hadoop Big Data Engineering Prepared by: ● Islam Elbanna ● Mahmoud Hanafy Presented by: ● Ahmed Mahran

3. Outline 1. Introduction 2. History 3. Assumptions 4. Architecture a. Case Study b. MapReduce Design c. Code Example d. Main Modules e. Access Procedure 5. Hadoop Modes 6. MapReduce 1 vs. MapReduce 2 (YARN) 7. Questions

5. Introduction What is Hadoop? "Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each providing computation and storage." Open-source software + commodity hardware = IT cost reduction

6. Introduction - Cont. Why Hadoop? ● Performance ● Storage ● Scalability ● Fault tolerance ● Cost efficiency (commodity machines)

7. Introduction - Cont. What is Hadoop used for? ● Searching ● Log processing ● Recommendation systems ● Analytics ● Video and image analysis

8. Introduction - Cont. Who uses Hadoop? ● Amazon ● Facebook ● Google ● IBM ● New York Times ● Yahoo ● Twitter ● LinkedIn ● …

9. Introduction - Cont. Hadoop vs. RDBMS
Hadoop | RDBMS
Non-structured/structured data | Structured data
Scale out | Scale up
Procedural/functional programming | Declarative queries
Offline batch processing | Online/batch transactions
Petabytes | Gigabytes
Key-value pairs | Predefined fields

10. Introduction - Cont. Problem: 20+ billion web pages x 20 KB = 400+ terabytes. One computer can read 30-35 MB/sec from disk, so: ● ~Four months to read the web (time) ● ~1,000 hard drives just to store the web (storage)
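The arithmetic behind this estimate can be checked with a short sketch. It assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and a single disk read sequentially at a constant rate; the function names are illustrative only.

```python
# Back-of-the-envelope check of the slide's figures.
# Assumption: decimal units and a constant sequential read rate.
PAGES = 20e9              # 20 billion web pages
PAGE_SIZE_BYTES = 20e3    # 20 KB per page
SECONDS_PER_DAY = 86_400


def total_bytes(pages: float, page_size_bytes: float) -> float:
    """Total size of the corpus in bytes."""
    return pages * page_size_bytes


def read_time_days(size_bytes: float, rate_bytes_per_sec: float) -> float:
    """Days one machine needs to read size_bytes at the given rate."""
    return size_bytes / rate_bytes_per_sec / SECONDS_PER_DAY


web_size = total_bytes(PAGES, PAGE_SIZE_BYTES)   # 4e14 bytes = 400 TB
fastest = read_time_days(web_size, 35e6)         # at 35 MB/sec
slowest = read_time_days(web_size, 30e6)         # at 30 MB/sec
print(f"{web_size / 1e12:.0f} TB, {fastest:.0f} to {slowest:.0f} days on one machine")
```

At 30-35 MB/sec this comes to roughly 130 to 155 days, which is consistent with the slide's "four months" order of magnitude.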

11. Introduction - Cont. Solution: the same problem spread across 1,000 machines reading in parallel takes about 3 hours. But we need: ● Communication and coordination ● Recovering from machine failure ● Status reporting ● Debugging ● Optimization. These requirements call for a distributed system.

12. Introduction - Cont. Distributed systems ● Cluster of machines ● Distributed Storage ● Distributed Computing

17. History ● 2002-2004: Started as a sub-project of Apache Nutch. ● 2003-2004: Google published the Google File System (GFS) and MapReduce papers. ● 2004: Doug Cutting and Mike Cafarella implemented Google's frameworks in Nutch. ● 2006: Yahoo hired Doug Cutting to work on Hadoop with a dedicated team. ● 2008: Hadoop became an Apache top-level project.

19. Assumptions ● Hardware Failure ● Streaming Data Access ● Large Data Sets ● Simple Coherency Model ● Moving Computation is Cheaper than Moving Data ● Software Platform Portability

21. Architecture Hadoop is designed and built on two independent frameworks: Hadoop = HDFS + MapReduce. HDFS: a reliable distributed file system that provides high-throughput access to data. ● Files are divided into 64 MB blocks (default) ● Each block is replicated 3 times (default) MapReduce: is a framework for performing high
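To make the block and replication defaults concrete, here is a minimal sketch that estimates how many blocks and replicas a file produces. The 64 MB block size and replication factor of 3 are the defaults quoted on the slide; the function name `hdfs_footprint` and the 1,000 MB example file are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies the bytes it actually holds,
    # so raw storage is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

For a 1,000 MB file this gives 16 blocks (15 full blocks plus one partial block), 48 block replicas spread over the cluster, and 3,000 MB of raw disk usage.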

23. Case Study: Word Count Problem: We need to calculate word frequencies across billions of web pages ● Input: Files with one document per record ● Output: List of words and their frequencies across all documents

24. Case Study: Solution

26. Architecture - Cont. MapReduce Design ● Map ● Reduce ● Shuffle & Sort

27. Case Study: Map Phase ● Specify a map function that takes a key/value pair: key = document URL, value = document contents ● The output of the map function is a set of key/value pairs. In our case, it outputs (word, "1") once per word in the document

28. Case Study: Reduce Phase ● The MapReduce library gathers together all pairs with the same key (shuffle/sort) ● The reduce function combines the values for a key. In our case, it computes the sum ● The output of reduce is one (word, total count) pair per word
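The two phases of the case study can be sketched in plain Python, with no Hadoop involved. This is an in-memory illustration under stated assumptions: words are split on whitespace and lowercased, the count is emitted as the integer 1 rather than the string "1", and the names `map_fn`, `reduce_fn` and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_fn(url, contents):
    """Map: emit (word, 1) once per word in the document."""
    for word in contents.split():
        yield word.lower(), 1   # lowercasing is an assumption of this sketch


def reduce_fn(word, counts):
    """Reduce: sum all the counts gathered for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, group by key (stand-in for shuffle/sort), then reduce."""
    grouped = defaultdict(list)
    for url, contents in documents.items():
        for word, one in map_fn(url, contents):
            grouped[word].append(one)
    return dict(reduce_fn(word, counts) for word, counts in sorted(grouped.items()))


docs = {
    "http://example.com/a": "the quick fox",        # hypothetical documents
    "http://example.com/b": "The lazy dog the end",
}
print(word_count(docs))
```

The dictionary `grouped` plays the role of the framework's shuffle/sort step: it is the only place where values for the same word from different documents meet.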

29. Architecture - Cont. MapReduce Design ● Map: extract something you care about from each record.

30. Architecture - Cont. MapReduce Design ● Reduce: aggregate, summarize, filter, or transform mapper output

31. Architecture - Cont. MapReduce Design Overall View:

32. Architecture - Cont. MapReduce Design ● Shuffle & Sort: redirect the mapper output to the right reducer
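A minimal sketch of how mapper output could be routed to "the right reducer": every key is hashed to a reducer index, so all values for one key land on the same reducer, and each reducer then sees its keys in sorted order. Hadoop's default partitioner similarly takes the key's hash modulo the number of reduce tasks; the use of `zlib.crc32` here is a stand-in chosen because Python's built-in `hash()` for strings varies between runs, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key (crc32 is an assumed stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_outputs, num_reducers):
    """Group (key, value) pairs from all mappers by reducer, then by key.

    mapper_outputs: one list of (key, value) pairs per mapper.
    Returns one sorted list of (key, [values]) per reducer.
    """
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            reducers[partition(key, num_reducers)][key].append(value)
    return [sorted(r.items()) for r in reducers]


mapper_outputs = [
    [("apple", 1), ("banana", 1)],   # output of mapper 0
    [("apple", 1), ("cherry", 1)],   # output of mapper 1
]
for index, keys in enumerate(shuffle_and_sort(mapper_outputs, 2)):
    print(index, keys)
```

Whichever reducer "apple" is assigned to, both of its values arrive there together, which is what lets the reduce function compute a complete result per key.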

33. Case Study: Overall View

35. Architecture - Cont. MapReduce: the programmer specifies two primary methods: ● map(k1, v1) → <k2, v2> ● reduce(k2, list<v2>) → <k3, v3>
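The two signatures can be expressed as a small generic driver. This is a single-process sketch, not Hadoop's API: `run_mapreduce` is a hypothetical name, each call to the map or reduce function is allowed to emit zero or more pairs, and the log-line format in the example is made up for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")


def run_mapreduce(
    records: Iterable[Tuple[K1, V1]],
    map_fn: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reduce_fn: Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    """Apply map to every record, group values by k2, then reduce each group."""
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    results: List[Tuple[K3, V3]] = []
    for k2 in sorted(groups):          # keys reach reduce in sorted order
        results.extend(reduce_fn(k2, groups[k2]))
    return results


def level_mapper(line_number, line):
    """k1 = line number, v1 = log line; emit (log level, 1)."""
    yield line.split()[0], 1


def count_reducer(level, ones):
    """k2 = log level, list<v2> = ones; emit (log level, total)."""
    yield level, sum(ones)


log = [(0, "INFO start"), (1, "ERROR disk"), (2, "INFO done")]
print(run_mapreduce(log, level_mapper, count_reducer))
```

Only `level_mapper` and `count_reducer` are specific to the problem; the grouping in between is what the framework supplies.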

36. Case Study : Code Example Map Function

37. Case Study : Code Example Reduce Function

38. Hadoop is not only Java (Hadoop Streaming)
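With Hadoop Streaming, the mapper and reducer can be any executables that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below shows that contract for word count as two Python functions operating on iterables of lines; the function names are hypothetical, and `sorted()` stands in for the framework's shuffle/sort between the two steps.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


mapped = sorted(mapper(["b a", "a"]))   # sorted() imitates shuffle/sort
for line in reducer(mapped):
    print(line)
```

In an actual job, each function would live in its own script that iterates over `sys.stdin` and prints its output, and the two scripts would be passed to the streaming jar through its `-mapper` and `-reducer` options. The reducer has to detect key boundaries itself because streaming delivers a sorted stream of lines, not a key with a list of values.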

40. Architecture - Cont. Main Modules ● File System (HDFS) ⚪ Name Node ⚪ Secondary Name Node ⚪ Data Node ● MapReduce Framework ⚪ Job Tracker ⚪ Task Tracker

41. Architecture - Cont. Main Modules ● File System (HDFS) ⚪ Name Node ⚪ Secondary Name Node ⚪ Data Node

42. Architecture - Cont. Main Modules ● MapReduce Framework ⚪ Job Tracker ⚪ Task Tracker

44. Architecture - Cont. Access Procedure ● Read From HDFS ● Write to HDFS
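The division of labor during reads and writes can be illustrated with a toy in-memory model: the Name Node keeps only metadata (which blocks make up a file and which Data Nodes hold them), while the Data Nodes hold the bytes. Everything here is an assumption made for illustration: the class and function names, the round-robin replica placement (real HDFS placement is rack-aware), and the client writing every replica itself (real HDFS pipelines replicas from one Data Node to the next).

```python
class ToyNameNode:
    """Metadata only: file path -> list of (block id, Data Node ids)."""

    def __init__(self, datanode_ids, replication=3):
        self.datanode_ids = list(datanode_ids)
        self.replication = min(replication, len(self.datanode_ids))
        self.files = {}
        self._next_block = 0

    def allocate_block(self, path):
        """Register a new block for path and choose its replica holders."""
        start = self._next_block % len(self.datanode_ids)
        targets = [
            self.datanode_ids[(start + i) % len(self.datanode_ids)]
            for i in range(self.replication)   # assumed round-robin placement
        ]
        block_id = self._next_block
        self._next_block += 1
        self.files.setdefault(path, []).append((block_id, targets))
        return block_id, targets

    def block_locations(self, path):
        return self.files[path]


def write_file(namenode, datanodes, path, data, block_size):
    """Split data into blocks; ask the Name Node where each block goes."""
    for offset in range(0, len(data), block_size):
        block_id, targets = namenode.allocate_block(path)
        for node in targets:
            datanodes[node][block_id] = data[offset:offset + block_size]


def read_file(namenode, datanodes, path):
    """Ask the Name Node for locations, then fetch bytes from Data Nodes."""
    return b"".join(
        datanodes[targets[0]][block_id]
        for block_id, targets in namenode.block_locations(path)
    )


datanodes = {"dn1": {}, "dn2": {}, "dn3": {}, "dn4": {}}
namenode = ToyNameNode(datanodes)
write_file(namenode, datanodes, "/logs/a.txt", b"hello hadoop world", block_size=8)
print(read_file(namenode, datanodes, "/logs/a.txt"))
```

Note that file contents never pass through `ToyNameNode`; it only answers "where" questions, which mirrors why the Name Node is not a bandwidth bottleneck for data transfer.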

47. Architecture - Cont. Task distribution procedure: the JobTracker chooses the nodes on which to execute tasks so as to honor the data locality principle
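The data locality principle can be sketched as a simple placement rule: prefer a node that already stores a replica of the task's input block, and fall back to any node with a free slot otherwise. This is a simplified illustration, not the JobTracker's actual scheduling code; `pick_node` and its arguments are hypothetical, and the rack-local tier that a real scheduler also considers is left out.

```python
def pick_node(block_replicas, free_slots):
    """Choose where to run a map task.

    block_replicas: names of the nodes holding the task's input block.
    free_slots: dict mapping node name -> number of free task slots.
    Returns (node, is_data_local); node is None when the cluster is full.
    """
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            return node, True            # computation moves to the data
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False           # block must be read over the network
    return None, False


slots = {"node1": 0, "node2": 1, "node3": 4}
print(pick_node(["node1", "node2"], slots))
```

In the example, `node1` holds a replica but is busy, so the task goes to `node2`, which also holds a replica, even though `node3` has more free slots.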

49. Hadoop Modes ● Standalone ● Pseudo-Distributed ● Fully-Distributed

51. MapReduce 1 vs. MapReduce 2 (YARN)

53. Questions

54. References
● Book "Hadoop in Action" by Chuck Lam
● Book "Hadoop: The Definitive Guide" by Tom White
● http://hadoop.apache.org/
● http://en.wikipedia.org/wiki/Apache_Hadoop
● https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/
● http://www.slideshare.net/emcacademics/milind-hadoop-trainingbrazil
● http://www.slideshare.net/PhilippeJulio/hadoop-architecture
● http://www.slideshare.net/rantav/introduction-to-map-reduce
● http://www.slideshare.net/sudhakara_st/hadoop-intruduction?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=2
● http://www.slideshare.net/ZhijieShen/hadoop-summit-san-jose-2014?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=12
● http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
● http://www.slideshare.net/phobeo/introduction-to-data-processing-using-hadoop-and-pig
● http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam?qid=a14580f7-23be-45b8-bd1e-b3417b8a0ec1&v=qf1&b=&from_search=1

55. Thanks
