Understanding Hadoop



Understanding Hadoop, a session held at ADEF on 1 March


  1. 1. Understanding Hadoop by Ahmed Ossama
  2. 2. Agenda ● ● ● ● ● ● ● ● ● Introduction to Big Data Hadoop HDFS MapReduce and YARN Hadoop Ecosystem Planning and Installing Hadoop Clusters Writing Simple Streaming Jobs Demo Q&A
  3. 3. Commodore Amiga 500 (1990), Memory: 512K ● Atari ST (1985), Memory: 512K ● Macintosh (1984), Memory: 128K ● Wait a sec… Are we in the 80’s?!
  4. 4. What are the volumes of data that we are seeing today? ● 30 billion pieces of content were added to Facebook this past month by more than 600 million users ● 2.7 billion likes made daily on and off of the Facebook site ● More than 2.5 billion videos were watched on YouTube… yesterday! ● 1.2 million deliveries per second ● 35 billion searches were performed last month on Twitter
  5. 5. What does the future look like? ● Worldwide IP traffic will quadruple by 2015. ● Nearly 3 billion people will be online, pushing the data created and shared to nearly 8 zettabytes. ○ 1 Zettabyte = 1,024 Exabytes = 1,024^2 Petabytes = 1,024^3 Terabytes = 1,024^4 Gigabytes = 1,024^5 Megabytes = 1,024^6 Kilobytes, so 8 ZB = 9,223,372,036,854,775,808 KB ● Two thirds of surveyed businesses in North America said big data will become a concern for them within the next five years.
  6. 6. Houston, We Have a Problem! A new IDC study says the market for big data technology and services will grow from $3.2 billion in 2010 to $16.9 billion in 2015. That’s an annual growth rate of roughly 40%.
  7. 7. What is Big Data? “When your data sets become so large that you have to start innovating to collect, store, organize, analyze and share”
  8. 8. From WWW to VVV ● Volume ○ data volumes are becoming unmanageable ● Variety ○ data complexity is growing ○ more types of data are captured than previously ● Velocity ○ some data is arriving so rapidly that it must either be processed instantly, or lost ○ this is a whole subfield called “stream processing”
  9. 9. Sources of Data ● Computer Generated ○ Application server logs (websites, games) ○ Sensor data (weather, water, smart grids) ○ Images/Videos (traffic surveillance, security cameras) ● Human Generated ○ Twitter/Facebook ○ Blogs/Reviews/Emails ○ Images/Videos ○ Social Graphs: Facebook, LinkedIn
  10. 10. Types of Data ● Relational Data (Tables/Transaction/Legacy Data) ● Text Data (Web) ● Semi-structured Data (XML) ● Graph Data: Social Network, Semantic Web (RDF), … ● Streaming Data
  11. 11. What to do with these data? ● Aggregation and Statistics ○ Data warehouse and OLAP ● Indexing, Searching, and Querying ○ Keyword based search ○ Pattern matching (XML/RDF) ● Knowledge discovery ○ Machine Learning ○ Data Mining ○ Statistical Modeling
  12. 12. If RDBMS are not enough, what is?
  13. 13. Hadoop!
  14. 14. Hadoop - inspired by Google ● Apache Hadoop project ○ inspired by Google MapReduce implementation and Google File System papers ● Open sourced, flexible and available architecture for large scale computation and data processing on a network of commodity hardware ● Open Source Software + Commodity Hardware ○ IT Cost Reduction
  15. 15. Hadoop Concepts ● Distribute the data as it is initially stored in the system ● Bring the processing to the data ● Users can focus on developing applications
  16. 16. Hadoop Versions ● Hadoop version 1 (HDFS + MapReduce) ○ hadoop-1.2.X ● Hadoop Version 2 (HDFS + MR2 + YARN) ○ hadoop-2.2.X ○ hadoop-0.23.X ■ same as 2.2.X but missing NN HA
  17. 17. Enterprise Hadoop ● Cloudera ○ Oldest company providing enterprise Hadoop ○ CDH ○ Cloudera Manager ● Hortonworks ○ Spun off from the Yahoo! Hadoop team ○ Biggest contributor to Hadoop ○ HDP (Hortonworks Data Platform) ● MapR
  18. 18. Hadoop Components ● Two core components ○ Hadoop Distributed Filesystem ○ MapReduce Software Framework ● Components around Hadoop ○ Often referred to as the ‘Hadoop Ecosystem’ ○ Pig, Hive, HBase, Flume, Oozie, Sqoop
  19. 19. Hadoop Components: HDFS ● HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster ● Two roles: ○ NameNode (NN): records metadata ○ DataNode (DN): stores data
  20. 20. HDFS Features ● Highly fault tolerant ● Commodity hardware = node failure is expected ● Rack awareness ● Large datasets
  21. 21. HDFS Structure HDFS has a master/slave architecture with two main layers: ● Namespace, which consists of directories, files and blocks, and supports the file system operations ● Block storage service, which offers block management and storage: ○ The block management service, provided by the NN, supports block-related operations, maintains block locations and manages block replicas ○ The storage service, provided by the DN, allows read/write access to blocks on the local storage of the node
  22. 22. HDFS: How are files stored?
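     To make the storage model concrete, here is a minimal sketch using the Hadoop 2.x command-line tools (the paths and file name are examples, not from the slides): a file is copied into HDFS, and the NameNode is then asked how the file was split into blocks and where the replicas live.
     $ hadoop fs -mkdir -p /user/ahmed/logs          # create a directory in HDFS
     $ hadoop fs -put access.log /user/ahmed/logs/   # the client writes the file block by block
     $ hdfs fsck /user/ahmed/logs/access.log -files -blocks -locations   # ask the NameNode for the block map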
  23. 23. File System Read Operations 1. Client contacts the NameNode indicating the file it wants to read 2. Client identity is validated and checked against the owner and permissions of the file 3. The NameNode responds with the list of DataNodes that host replicas of the blocks of the file 4. The client contacts the DataNodes based on the topology that was provided by the NameNode and requests the transfer of the desired block
  24. 24. File System Write Operations 1. Client asks the NameNode to choose DataNodes to host replicas of the first block of the file 2. The NameNode grants permission to the client and responds with a list of DataNodes for the block 3. The client organizes a pipeline from node to node and sends the data 4. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block 5. The NameNode responds with a new list of DataNodes, which is likely to be different 6. The client organizes a new pipeline and sends the further blocks of the file
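     A hedged, client-side view of the read and write paths above (example paths; assumes a running cluster). The replication factor is simply a parameter the client passes along, telling the NameNode how many DataNodes to put in each write pipeline.
     $ hadoop fs -D dfs.replication=3 -put big-dataset.csv /data/   # write with 3 replicas per block
     $ hadoop fs -cat /data/big-dataset.csv | head                  # read: fetch the blocks back from the DataNodes
     $ hadoop fs -setrep -w 2 /data/big-dataset.csv                 # lower the replication afterwards and wait for it to apply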
  25. 25. Hadoop Components: MapReduce ● Programming model for processing and generating large data sets ● Computation takes some input data, then it gets mapped using some code written by the user ● Then the mapped data gets reduced using another code written by the user ● It works like a pipeline: $ cat file | grep something | sort | uniq -c
  26. 26. MapReduce Features ● Automatic parallelization and distribution ● Automatic re-execution on failure ● Locality optimizations ● Abstracts the “housekeeping” away from the developer ○ Developers concentrate on writing MapReduce functions
  27. 27. MapReduce Features ● TaskTracker is a node in the cluster that accepts tasks (Map, Shuffle and Reduce operations) from a JobTracker ● JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster ● The History Server allows the user to get status on finished applications. Currently it only supports MapReduce and provides information on finished jobs
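     For orientation, a few MRv1 client commands that talk to these daemons, as a rough sketch (the job id and output directory are hypothetical):
     $ hadoop job -list                                    # jobs the JobTracker currently tracks
     $ hadoop job -status job_201403011200_0001            # status of one (hypothetical) job
     $ hadoop job -history all /user/ahmed/wordcount-out   # history summary of a finished job, by its output directory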
  28. 28. Hadoop Components: YARN (MR2) The new architecture introduced in hadoop-0.23.x and hadoop-2.x divides the two major functions of the JobTracker, resource management and job life-cycle management, into separate components.
  29. 29. YARN Architecture
  30. 30. YARN Components ● ResourceManager (RM) is the ultimate authority that arbitrates resources among all the applications in the system ● NodeManager (NM) is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler ● ApplicationsManager (ASM) is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure
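     The equivalent orientation for YARN, again as a sketch (the application id is hypothetical, and yarn logs assumes log aggregation is enabled):
     $ yarn node -list                                          # NodeManagers registered with the RM
     $ yarn application -list                                   # applications the RM is tracking
     $ yarn application -status application_1393632000000_0001  # one (hypothetical) application
     $ yarn logs -applicationId application_1393632000000_0001  # aggregated container logs for it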
  31. 31. Hadoop Ecosystem ● Hadoop has become the kernel of the distributed operating system for Big Data ● No one uses the kernel alone ● A collection of projects at Apache
  32. 32. Hadoop Components: HBase ● Low-latency, distributed, non-relational database built on top of HDFS ● Inspired by Google’s Bigtable ● Data is stored in a semi-columnar format, partitioned by rows into regions ● Typically a single table accommodates hundreds of terabytes
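     A minimal sketch of what “semi-columnar, partitioned by rows” feels like from the HBase shell (the table, column family and row keys are made up for illustration; the prompt is abbreviated):
     $ hbase shell
     hbase> create 'pagehits', 'stats'                # table with one column family
     hbase> put 'pagehits', '/', 'stats:count', '5'   # row key is the page, the cell holds the hit count
     hbase> put 'pagehits', '/about', 'stats:count', '2'
     hbase> get 'pagehits', '/about'
     hbase> scan 'pagehits'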
  33. 33. Hadoop Components: Sqoop ● Exchanges data with relational databases ● Short for “SQL to Hadoop” ● Performs bidirectional transfer between Hadoop and almost any database with a JDBC driver ● Includes native connectors for MySQL and PostgreSQL ● Free connectors exist for Teradata, Netezza, SQL Server and Oracle
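     A hedged example of the import direction (the JDBC connection string, credentials, table and target directory are placeholders):
     $ sqoop import \
         --connect jdbc:mysql://dbhost/webshop \
         --username reporter -P \
         --table orders \
         --target-dir /data/orders \
         --num-mappers 4        # Sqoop runs the copy as parallel map tasks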
  34. 34. Hadoop Components: Flume ● Streaming data collection and aggregation system designed to transport massive volumes of data into systems such as Hadoop ● Simplifies reliable streaming data delivery from a variety of sources including RPC services, log4j appenders and syslog ● Data can be routed, load-balanced, replicated to multiple destinations and aggregated from thousands of hosts
  35. 35. Hadoop Components: Pig ● Created to simplify the authoring of MapReduce jobs, so there is no need to write Java code ● Users write data processing jobs in a high-level scripting language, from which Pig builds an execution plan and executes a series of MapReduce jobs ● Developers can extend its set of built-in operations by writing user-defined functions in Java
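     A small sketch of the scripting style (the input file, schema and field names are assumptions; it mirrors the page-hit counting demo later in the deck). Saved as hits.pig:
     -- hits.tsv holds tab-separated page/count pairs
     hits    = LOAD 'hits.tsv' AS (page:chararray, count:int);
     grouped = GROUP hits BY page;
     totals  = FOREACH grouped GENERATE group AS page, SUM(hits.count) AS total;
     DUMP totals;
     Run it locally (or drop -x local to run on the cluster):
     $ pig -x local hits.pig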
  36. 36. Hadoop Components: Hive ● Creates a relational database-style abstraction that allows the developer to write a dialect of SQL ● Hive’s dialect of SQL is called HiveQL and implements only a subset of the common standards ● Hive works by defining a table-like schema over an existing set of files in HDFS
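     A hedged sketch of that table-over-files idea (the table name, columns and HDFS path are examples): an external table is laid over dash-delimited files already sitting in HDFS, and then queried with HiveQL.
     $ hive -e "
         CREATE EXTERNAL TABLE IF NOT EXISTS weblog (ip STRING, ts STRING, action STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '-'
         LOCATION '/data/weblog';
         SELECT action, COUNT(*) AS hits FROM weblog GROUP BY action;"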
  37. 37. Hadoop Components: Oozie ● Workflow engine and scheduler built specifically for large-scale job orchestration on Hadoop ● Workflows can be triggered by time or by events such as data arriving in a directory ● Major flexibility (start, stop, suspend and re-run jobs)
  38. 38. Hadoop Components: Hue ● Hadoop User Experience ● Apache open source project ● Hue is a web UI for Hadoop ● Platform for building custom applications with a nice UI library
  39. 39. Hadoop Components: Mahout ● Distributed and scalable machine learning algorithms on the Hadoop platform ● Makes building intelligent applications easier and faster
  40. 40. Hadoop Components: ZooKeeper ● Centralized service for maintaining configuration information and providing distributed synchronization ● Designed to store coordination data: ○ Status information ○ Configuration ○ Location information ● Used to implement reliable messaging and redundant services
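     A tiny sketch of storing and reading coordination data with the stock CLI (znode paths, values and the server address are examples; the prompt is abbreviated):
     $ zkCli.sh -server zk1:2181
     [zk] create /myapp ""                 # parent znode for the application
     [zk] create /myapp/config "version=1" # small piece of shared configuration
     [zk] get /myapp/config
     [zk] ls /myapp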
  41. 41. Planning and Installing Hadoop Clusters
  42. 42. Picking a Distribution and Version ● Apache Hadoop Version ○ 1.2.X ○ 2.2.X ● Choosing a distribution ○ HDP ○ Cloudera ● What should I Use?
  43. 43. Hardware Selection ● Master Hardware Selection ○ NameNode considerations ○ Secondary NameNode considerations ● Worker Hardware Selection ○ CPU, RAM and Storage ● Cluster Sizing ○ Small clusters < 20 nodes ○ Midline configuration (2x6 core, 64 GB, 12x3 TB) ○ High end configuration (2x6 core, 96 GB, 24x1 TB)
  44. 44. OS Selection and Preparation ● Deployment layout ○ Hadoop home ○ DataNode data directories ○ NameNode directories ● Software ○ Java, cron, ntp, ssh, rsync, postfix/sendmail ● Hostnames, DNS and Identification ● Users, Groups, and Privileges
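     A hedged preparation sketch for a worker node (assumes a RHEL/CentOS-style system; package names, users and the directory layout are examples, not a prescription):
     $ sudo yum install -y java-1.7.0-openjdk ntp rsync openssh-server postfix
     $ sudo groupadd hadoop && sudo useradd -g hadoop hdfs
     $ sudo mkdir -p /data/1/dfs/dn /data/2/dfs/dn           # DataNode data directories, one per disk
     $ sudo chown -R hdfs:hadoop /data/1/dfs /data/2/dfs
     $ sudo chkconfig ntpd on && sudo service ntpd start     # keep cluster clocks in sync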
  45. 45. Network Design ● 2-tier tree network: core switches → top-of-rack (TOR) switches → hosts ● 3-tier tree network: core switches → aggregation switches → top-of-rack (TOR) switches → hosts
  46. 46. Simple Streaming Jobs
  47. 47. How Streaming Works The mapper and the reducer read their input from stdin (line by line) and emit their output to stdout. ● Each mapper task launches the executable as a separate process ● It converts its inputs into lines and feeds the lines to the stdin of the process ● The mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair ● Each reducer task launches the executable as a separate process ● It converts its input key/value pairs into lines and feeds the lines to the stdin of the process ● The reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair
  48. 48. The Input
     - 05/Nov/2013:00:15:46 - "Get /"
     - 05/Nov/2013:00:17:46 - "Get /"
     - 05/Nov/2013:00:18:00 - "Get /about"
     - 05/Nov/2013:00:18:00 - "Get /feedback"
     - 05/Nov/2013:00:19:23 - "Get /"
     - 05/Nov/2013:00:20:00 - "Get /about"
     - 05/Nov/2013:00:20:31 - "Get /"
     - 05/Nov/2013:00:21:46 - "Get /"
  49. 49. What do we want to do? We want to extract how many hits each page received. So filtering the above log should yield:
     '/': 5
     '/about': 2
     '/feedback': 1
  50. 50. The Mapper
     #!/usr/bin/perl
     use strict;
     use warnings;

     while (<>) {
         chomp;
         my ($ip, $date, $action) = split('-', $_);
         $action =~ s/^ "Get (.*)"$/$1/;
         print "$action\t1\n";
     }
  51. 51. The Reducer
     #!/usr/bin/perl
     use strict;
     use warnings;
     use Data::Dumper;

     my %actions;

     while (<>) {
         chomp;
         my ($action, $count) = split("\t", $_);
         if (exists $actions{$action}) {
             $actions{$action} = $actions{$action} + $count;
         } else {
             $actions{$action} = $count;
         }
     }

     foreach my $c (sort { $a cmp $b } keys %actions) {
         print "'$c': $actions{$c}\n";
     }
  52. 52. The Output Now redirecting ‘log’ into Mapper.pl and piping the output to Reducer.pl yields:
     $ perl Mapper.pl < log | perl Reducer.pl
     '/': 5
     '/about': 2
     '/feedback': 1
  53. 53. Running over Hadoop
     $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
         -input myInputDirs \
         -output myOutputDir \
         -mapper /home/ahmed/Mapper.pl \
         -reducer /home/ahmed/Reducer.pl
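     When the job finishes, the results land in myOutputDir as part files. A hedged follow-up (note that the streaming jar's -file option can ship the two scripts with the job if they are not already present on every node):
     $ hadoop fs -cat myOutputDir/part-*   # read the aggregated counts back out of HDFS; same totals as the local run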
  54. 54. Demo
  55. 55. Thank You Q&A