Hadoop and Big Data Overview

Provides basic concepts behind Hadoop and Big Data

Published in: Technology

  1. Prabhu Thukkaram, Director, Product Development, Oracle Complex Processing & SOA Suite. Feb 28, 2014. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.
  2. What is Big Data?
     (Diagram of the Hadoop stack. Only the labels survive.)
     • Components: HDFS, Map Reduce, HBase (columnar DB), PIG, Hive, SQOOP, Zoo Keeper, ETL tools, BI reporting
     • Layer labels: self-healing clustered storage system; distributed data processing; higher-level abstraction; top-level interfaces
     • Structured data: Excel, etc.
     • Unstructured and semi-structured data: web logs, images, etc.
  3. HBase Quick Overview
     Relational Database: Product Table

       Product Id   Price   SKU       Inventory Count
       0001         1300    SKU0001   10
       0002         2800    SKU0002   25
       0003         5600    SKU0003   8

     • Ideal for OLTP transactions
     • Faster writes and record updates
     • But slow for OLAP
     • E.g. Select sum(InventoryCount) from Products;
     • Reason: data for the column "InventoryCount" is not contiguous
  4. HBase Quick Overview
     HBase: Product Table (each column stored contiguously)

       Product Id        0001      0002      0003
       Price             1300      2800      5600
       SKU               SKU0001   SKU0002   SKU0003
       Inventory Count   10        25        8

     • Extremely fast for OLAP
     • E.g. Select sum(InventoryCount) from Products; (result: 43)
     • Supports big data analytics: a single table with millions of columns and billions of rows
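The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration using the Product table from the slides; the in-memory lists and the function names are assumptions made for the example, not how HBase or a relational database stores data on disk.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"product_id": "0001", "price": 1300, "sku": "SKU0001", "inventory_count": 10},
    {"product_id": "0002", "price": 2800, "sku": "SKU0002", "inventory_count": 25},
    {"product_id": "0003", "price": 5600, "sku": "SKU0003", "inventory_count": 8},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_row_store(table, column):
    # Every full record has to be visited to pick out one field.
    return sum(record[column] for record in table)


def sum_column_store(table, column):
    # Only the one contiguous column is read.
    return sum(table[column])


print(sum_row_store(rows, "inventory_count"))        # 43
print(sum_column_store(columns, "inventory_count"))  # 43
```

Both functions return the same total; the column layout simply reads far less data to get there, which is the point the slide makes about OLAP queries.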
  5. Hadoop Cluster
     (Diagram: a Master running the Job Tracker and Name Node; Slave 1 and Slave 2, each running a Task Tracker and Data Node; clients such as Hadoop Shell, CLI, etc.)
     Name Node
     • Does not store file data; maintains the directory tree of all files in the cluster
     • Tracks the data blocks of each file across the cluster
     • Client apps such as the Hadoop Shell/CLI talk to the Name Node to locate, create, move, rename, and delete files
     • Returns a list of Data Node servers where the data lives
     • Single point of failure in V1; addressed in Hadoop V2 (Name Node high availability)
  6. Hadoop/HDFS cluster
     (Cluster diagram as on slide 5.)
     Data Node
     • Stores and replicates data on the file system
     • Connects to the Name Node on startup and responds to file system operations
     • Hadoop Shell/CLI clients can talk directly to a Data Node if they know the location of the data
     • Data Nodes talk to each other when replicating data
  7. Writing a file to HDFS
     (Diagram: the Hadoop Client tells the NameNode it wants to write blocks A, B, C of File.txt; the NameNode replies "Ok, write to data nodes 1, 5, 6"; the blocks go to DataNode 1, DataNode 5, and DataNode 6, with the replication of Blk A shown.)
     • Client consults the Name Node
     • Client writes a block directly to one Data Node
     • That Data Node replicates the block
     • Client writes the next block and the cycle repeats
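The write cycle above can be mimicked with a small in-memory simulation. This is a sketch under stated assumptions, not the HDFS client API: the `NameNode` class, `write_file`, the round-robin placement, and the tiny block size are all hypothetical and chosen only to make the four steps visible.

```python
def split_into_blocks(data, block_size):
    # Cut the file into fixed-size blocks; the last one may be shorter.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class NameNode:
    """Holds metadata only: which Data Nodes store each block."""

    def __init__(self, data_node_ids, replication=3):
        self.data_node_ids = list(data_node_ids)
        self.replication = replication
        self.block_map = {}  # (filename, block index) -> list of Data Node ids

    def allocate(self, filename, block_index):
        # Round-robin placement: a stand-in for a real placement policy.
        n = len(self.data_node_ids)
        targets = [self.data_node_ids[(block_index + k) % n]
                   for k in range(self.replication)]
        self.block_map[(filename, block_index)] = targets
        return targets


def write_file(name_node, data_nodes, filename, data, block_size):
    for index, block in enumerate(split_into_blocks(data, block_size)):
        targets = name_node.allocate(filename, index)   # client consults the Name Node
        first, others = targets[0], targets[1:]
        data_nodes[first][(filename, index)] = block    # client writes to one Data Node
        for other in others:                            # that Data Node replicates the block
            data_nodes[other][(filename, index)] = data_nodes[first][(filename, index)]


data_nodes = {1: {}, 5: {}, 6: {}}  # Data Node id -> stored blocks
name_node = NameNode(data_nodes.keys())
write_file(name_node, data_nodes, "File.txt", b"AAAABBBBCC", block_size=4)
print(name_node.block_map)
```

Note that the Name Node object never sees the block contents; it only records where each block lives, which matches the description of its role on slide 5.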
  8. Hadoop/HDFS cluster
     (Cluster diagram as on slide 5.)
     Job Tracker
     • Accepts MR jobs from clients
     • Submits MR tasks to Task Trackers on the Data Nodes where the input data is located
     • Monitors Task Tracker nodes for heartbeats/failures and resubmits tasks to a different Task Tracker as needed
     • Updates the status of a job when it completes
     • Single point of failure, addressed in MRV2 or YARN
     • Note: jobs run as batch, and clients can retrieve their status by querying the Job Tracker
  9. Hadoop/HDFS cluster
     (Cluster diagram as on slide 5.)
     Task Tracker
     • Accepts map and reduce tasks from the Job Tracker
     • A predefined set of slots determines the number of tasks it can accept
     • Spawns a separate JVM process for each task
     • Notifies the Job Tracker when the process finishes
  10. Map Reduce Example
      • The user submits an MR job/jar for the input file URL hdfs://File.txt to the Job Submitter on the client JVM. The Job Submitter contacts the Job Tracker to obtain a Job Id.
      • The Job Tracker creates map tasks based on the number of input splits. The number of reduce tasks is defined by the job itself, either in configuration or through the API call setNumReduceTasks().
      • The client contacts the Name Node to compute the input splits. It copies the job jar, the computed input splits, etc. to the Job Tracker's file system, in a directory named after the Job Id, and submits the job. The Job Tracker adds the job to its queue.
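Since one map task is created per input split, the number of map tasks can be estimated with simple arithmetic. The sketch below assumes fixed-size splits and uses a 64 MB split size and a 200 MB file purely as illustrative values; the function name is hypothetical, and real input formats apply further rules that are ignored here.

```python
def count_map_tasks(file_size_bytes, split_size_bytes):
    """One map task per input split; the last split may be partial."""
    if split_size_bytes <= 0:
        raise ValueError("split size must be positive")
    # Integer ceiling division.
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes


MB = 1024 * 1024
print(count_map_tasks(200 * MB, 64 * MB))  # 4
```

The reducer count, by contrast, does not follow from the input size; it is whatever the job sets.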
  11. Map Reduce Example
      • A map task splits its input into records and passes each record to the user's map logic.
      • In the example above, the user's map logic tokenizes each record to generate one or more key-value pairs.
      • The output of a map task is partitioned according to the defined number of reducers, shuffled, and sorted. Each partition is then routed to its reducer.
      • A reducer merges its partition with the matching partitions from other map tasks in the cluster and calls the user's reduce logic.
      • M/R guarantees that the input to every reducer is sorted by key.
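The flow described above (map, partition, shuffle, sort, reduce) can be imitated in a single Python process. This is a toy word count written for illustration, not the Hadoop API: `map_logic`, `reduce_logic`, `partition`, and `run_job` are hypothetical names, and the byte-sum partitioner is a simple stand-in for a real hash partitioner.

```python
from collections import defaultdict


def map_logic(record):
    # Tokenize one record into (key, value) pairs.
    for word in record.split():
        yield word.lower(), 1


def partition(key, num_reducers):
    # Deterministic partitioner: the same key always lands on the same reducer.
    return sum(key.encode("utf-8")) % num_reducers


def reduce_logic(key, values):
    return key, sum(values)


def run_job(records, num_reducers):
    # Map phase: emit pairs and route each one to its reducer's partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for key, value in map_logic(record):
            partitions[partition(key, num_reducers)][key].append(value)
    # Reduce phase: each reducer processes its keys in sorted order.
    return [[reduce_logic(key, part[key]) for key in sorted(part)]
            for part in partitions]


records = ["the quick brown fox", "the lazy dog", "The fox"]
for reducer_id, output in enumerate(run_job(records, num_reducers=2)):
    print(reducer_id, output)
```

Because the partitioner depends only on the key, every value for a given word reaches the same reducer, and each reducer sees its keys in sorted order.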
  12. Map Reduce – Under the Hood
      (Diagram, source: Hadoop, The Definitive Guide. Within a single map task, a data block is split into records that pass through the user's map logic, and the output is sorted and partitioned, with partitions also going to other reducers. Within a single reduce task, partitions, including those from other map tasks, are merged and passed to the user's reduce logic.)
  13. Map Reduce – Summary
      • Map: the process of organizing entity data as key-value pairs. The key could be a Customer Id, a Purchase Order Id, etc.
      • M/R framework: ensures that all data relevant to an entity/key is delivered to a single reducer.
      • Reduce: the process of aggregating the data related to an entity and deriving information from it.
  14. Map Reduce - Risk Analysis
      Risk analysis: a large number of credit card (CC) transactions and no direct deposit into checking over the last two months implies the customer is unemployed and at high risk of defaulting.
      (Diagram: HDFS holds a global view of the customer, made up of credit card txns, chat sessions, and checking account deposits and withdrawals. These feed a Map phase and a Reduce phase that produce a risk score.)
      • The job gathers all data (CC txns, chats, withdrawals, etc.) pertaining to a single customer.
      • Data in HDFS changes frequently, hence the need to reevaluate risk using batch jobs.
      • Reevaluated results are written to a DB to enable business decisions, e.g. whether to approve a new credit card or loan.
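A minimal sketch of how the risk rule could be phrased as map and reduce steps follows. Everything specific in it is an assumption made for illustration: the event record format, the `LARGE_TXN_COUNT` threshold, the "high"/"normal" labels, and the premise that the input has already been filtered to the last two months.

```python
from collections import defaultdict

LARGE_TXN_COUNT = 3  # hypothetical threshold for "a large number" of CC txns


def map_event(event):
    # Key every event by customer so one reducer sees the whole customer.
    return event["customer_id"], event


def reduce_customer(customer_id, events):
    cc_txns = sum(1 for e in events if e["type"] == "cc_txn")
    deposits = sum(1 for e in events if e["type"] == "direct_deposit")
    high_risk = cc_txns >= LARGE_TXN_COUNT and deposits == 0
    return customer_id, "high" if high_risk else "normal"


def score_customers(events):
    grouped = defaultdict(list)
    for event in events:
        key, value = map_event(event)
        grouped[key].append(value)
    return dict(reduce_customer(key, grouped[key]) for key in sorted(grouped))


events = (
    [{"customer_id": "c1", "type": "cc_txn"}] * 3
    + [{"customer_id": "c2", "type": "cc_txn"}] * 3
    + [{"customer_id": "c2", "type": "direct_deposit"}]
)
print(score_customers(events))
```

In a batch setting the resulting scores would then be written to a database, as the slide describes; that step is left out of the sketch.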
  15. Hadoop V1 - Limitations
      • Cluster resource management is tightly coupled to the Map Reduce Job Tracker
      • Job Tracker functionality in V1:
        • Cluster resource management
        • Application life-cycle management (job scheduling/re-scheduling/monitoring)
      • Can only run Map Reduce applications, leading to poor utilization of the cluster
      • Need to run other kinds of applications: real-time, graph, messaging, etc.
      • Scalability limits and single points of failure (Name Node and Job Tracker)
      • Lack of wire-compatible protocols
  16. Hadoop MRV2 or YARN
  17. Hadoop MRV2 or YARN
      • The Job Tracker in V1 is split into:
        • Global Resource Manager
        • Application Master per job request
        • Node Manager
      • Application: a classic MR job or a DAG of jobs
      (Diagram: Clients, a Resource Manager, and Node Managers hosting Containers and App Masters, connected by arrows labeled Job Submission, Job Status, Node Status, and Resource Request.)
  18. Hadoop MRV2 or YARN: Resource Manager
      • Ultimate authority for managing and scheduling resources in the cluster
      • Works with the Node Managers to track and utilize available containers
      • A container is the unit of resource in YARN, e.g. 2 cores and 2 GB of memory, disk, etc.
      • Accepts jobs from clients and delegates each one to an Application Master
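The idea of a container as a unit of resource can be illustrated with a toy allocator. This is a sketch under assumptions, not YARN's scheduler: the `Node` class, the first-fit policy, and the node capacities are hypothetical, and only cores and memory are modeled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cores: int
    free_memory_gb: int


def allocate_container(nodes, cores, memory_gb):
    """First-fit: carve a container out of the first node with enough room."""
    for node in nodes:
        if node.free_cores >= cores and node.free_memory_gb >= memory_gb:
            node.free_cores -= cores
            node.free_memory_gb -= memory_gb
            return node.name
    return None  # no node can currently host the container


nodes = [Node("node-1", free_cores=4, free_memory_gb=4),
         Node("node-2", free_cores=2, free_memory_gb=8)]

# Request four containers of 2 cores and 2 GB each, as in the slide's example.
placements = [allocate_container(nodes, cores=2, memory_gb=2) for _ in range(4)]
print(placements)  # ['node-1', 'node-1', 'node-2', None]
```

The fourth request fails because no node has two free cores left, even though node-2 still has spare memory; a container needs every one of its resource dimensions satisfied on a single node.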
  19. Hadoop MRV2 or YARN: App Master
      • Negotiates the resources required for the job with the Resource Manager (RM)
      • Tracks job status and monitors progress
      • By shifting job control to App Masters running on the slave nodes, YARN provides better scale-out and fault tolerance
  20. Big Data Overview - Next
