Big Data & Hadoop Tutorial



The Big Data and Hadoop training course is designed to provide the knowledge and skills needed to become a successful Hadoop developer. The course covers in-depth concepts such as the Hadoop Distributed File System (HDFS), setting up a Hadoop cluster, MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, etc.

Published in: Education, Technology
  • Web and e-tailing: eBay is using Hadoop technology and the HBase database, which supports real-time analysis of Hadoop data, to build a new search engine for its auction site. eBay has more than 97 million active buyers and sellers and over 200 million items for sale in 50,000 categories. The site handles close to 2 billion page views, 250 million search queries, and tens of billions of database calls daily. The company has 9 petabytes of data stored on Hadoop and Teradata clusters, and the amount is growing quickly.
  • Telecommunications: China Mobile runs a data-mining platform for the telecom industry that ingests 5-8 TB of CDR and network signaling data per day. Its current solutions, such as Oracle DB, SAS (data mining), Unix servers, and SAN, aren't sufficient to store and process such a vast amount of data; faster data processing is needed for precision marketing, network optimization, service optimization, and log processing.
  • Government: AADHAAR by the Govt. of India stores about 5 MB of data per resident, which maps to roughly 10-15 PB of raw data. The Hadoop stack: HDFS (Hadoop Distributed File System) is used to provide high data read/write throughput on the order of many terabytes per day, and its distributed architecture enables scale-out as needed. Hive is used for building the UIDAI data warehouse, HBase for indexed lookup of records across millions of rows, ZooKeeper as a distributed coordination service for server instances, and Pig as an ETL (extract, transform, and load) solution for loading data into Hive.
  • Healthcare and Life Sciences: Life-sciences research firm NextBio uses Hadoop and HBase to help big pharmaceutical companies conduct genomic research. The company embraced Hadoop in 2009 to make the sheer scale of genomic data analysis more affordable. Its core 100-node Hadoop cluster, which has processed as much as 100 terabytes of data, is used to compare data from drug studies to publicly available genomics data. Given that there are tens of thousands of such studies and 3.2 billion base pairs behind each of the hundreds of genomes that NextBio studies, it's clearly a big-data challenge.
  • Banks and Financial Services: 3 of the top 5 banks run Cloudera Hadoop. JPMorgan Chase uses Hadoop technology for a growing number of purposes, including fraud detection, IT risk management, and self-service. With over 150 petabytes of data stored online, 30,000 databases, and 3.5 billion log-ins to user accounts, data is the lifeblood of JPMorgan Chase.
  • Retail: Sears is an American multinational mid-range department store chain headquartered in Hoffman Estates, Illinois. It moved from Teradata and SAS to Hadoop to avoid archiving and deleting its meaningful sales and other customer-activity data. A 300-node Hadoop cluster lets Sears keep 100% of its data (~2 PB) available to BI, rather than the meager 10% available with its non-Hadoop solutions. Walmart migrated data from its existing Oracle, Netezza, and Greenplum gear to a 250-node Hadoop cluster.
  • Why Oracle, HP, IBM, and other enterprise technology giants are in the red on growth. Sears: Sears wanted to personalize marketing campaigns, coupons, and offers down to the individual customer, but its existing legacy systems were incapable of supporting that. Sears' process for analyzing marketing campaigns for loyalty-club members used to take six weeks on mainframe, Teradata, and SAS servers; the new process running on Hadoop can be completed weekly. For certain online and mobile commerce scenarios, Sears can now perform daily analyses, and targeting is more granular, in some cases down to the individual customer. Moving up the stack, Sears is consolidating its databases to MySQL, InfoBright, and Teradata; EMC Greenplum, Microsoft SQL Server, and Oracle (including four Exadata boxes) are on their way out. Sears routinely replaces legacy Unix systems with Linux rather than upgrade them, and it has retired most of its Sun and HP-UX servers. Microsoft server and development technologies are also on the way out.
  • ~2 PB of data, mostly structured and unstructured data such as customer transactions, point of sale, and supply chain. Because of archiving needs, 90% of the ~2 PB of data is not available for BI.
  • A 300-node Hadoop cluster lets Sears keep 100% of its data (~2 PB) available to BI, rather than the meager 10% available with its non-Hadoop solutions. Sears now keeps all of its data down to individual transactions (rather than aggregates) and years of history (rather than imposing quarterly windows on certain data, as it did previously). That's raw data, which Sears can refactor and combine as needed quickly and efficiently within Hadoop. To give a sense of how early Sears was to Hadoop development, Wal-Mart divulged early this year that it was scaling out an experimental 10-node Hadoop cluster for e-commerce analysis; Sears passed that size in 2010. Sears also has its own Hadoop-solutions subsidiary, MetaScale, to provide Hadoop services to other retail companies along the lines of AWS.
  • Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon EC2. Robust: since Hadoop runs on commodity clusters, it is designed with the assumption of frequent hardware failure; it handles such failures gracefully, and computation doesn't stop because of a few failed devices or systems. Scalable: Hadoop scales linearly to handle larger data by adding more slave nodes to the cluster. Simple: it is easy to write efficient parallel programs with Hadoop.
  • We will cover other Hadoop Components in detail in future sessions of this course.
  • 1. Data is transferred from the DataNode to the MapTask process. DBlk is the file data block; CBlk is the file checksum block. File data is transferred to the client through Java NIO transferTo (aka the UNIX sendfile syscall). Checksum data is first fetched into the DataNode JVM buffer, and then pushed to the client (details not shown). Both file data and checksum data are bundled into an HDFS packet (typically 64 KB) in the format {packet header | checksum bytes | data bytes}.
    2. Data received from the socket is buffered in a BufferedInputStream, presumably to reduce the number of syscalls to the kernel. This actually involves two buffer copies: first, data is copied from kernel buffers into a temporary direct buffer in JDK code; second, data is copied from the temporary direct buffer to the byte[] buffer owned by the BufferedInputStream. The size of the byte[] in BufferedInputStream is controlled by the configuration property "io.file.buffer.size" and defaults to 4 KB; in our production environment, this parameter is customized to 128 KB.
    3. Through the BufferedInputStream, the checksum bytes are saved into an internal ByteBuffer (whose size is roughly (PacketSize / 512 * 4), or 512 B), and file bytes (compressed data) are deposited into the byte[] buffer supplied by the decompression layer. Since the checksum calculation requires a full 512-byte chunk while a user's request may not be aligned with a chunk boundary, a 512 B byte[] buffer is used to align the input before copying partial chunks into the user-supplied byte[] buffer. Also note that data is copied to the buffer in 512-byte pieces (as required by the FSInputChecker API). Finally, all checksum bytes are copied to a 4-byte array for FSInputChecker to perform checksum verification. Overall, this step involves an extra buffer copy.
    4. The decompression layer uses a byte[] buffer to receive data from the DFSClient layer. The DecompressorStream copies the data from the byte[] buffer to a 64 KB direct buffer, calls the native library code to decompress the data, and stores the uncompressed bytes in another 64 KB direct buffer. This step involves two buffer copies.
    5. LineReader maintains an internal buffer to absorb data from the downstream. From this buffer, line separators are discovered and line bytes are copied to form Text objects. This step requires two buffer copies.
    The client creates the file by calling create() on DistributedFileSystem (step 1). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file.
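The checksum mechanics described above can be modeled in a few lines. This is an illustrative sketch, not Hadoop's actual code: HDFS computes a CRC-32 per 512-byte chunk (the io.bytes.per.checksum default) on write and re-verifies it on read, which is why a read that isn't aligned to a chunk boundary must first be aligned before partial chunks are copied out.

```python
import zlib

CHUNK = 512  # bytes per checksum chunk, matching HDFS's io.bytes.per.checksum default

def checksum_chunks(data: bytes):
    """Compute one CRC-32 per 512-byte chunk, as the DataNode does on write."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, sums):
    """Recompute the checksums on read and compare, as FSInputChecker does."""
    return checksum_chunks(data) == sums

payload = bytes(range(256)) * 5           # 1280 bytes -> 3 chunks (512, 512, 256)
sums = checksum_chunks(payload)
assert len(sums) == 3
assert verify(payload, sums)
corrupted = payload[:-1] + b"\x00"        # flip the last byte
assert not verify(corrupted, sums)        # the last chunk's checksum no longer matches
```

Note how a single corrupted byte only invalidates the chunk that contains it; this is what lets HDFS re-fetch just the affected block replica rather than the whole file.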
  • The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block.

    1. Big Data & Hadoop Tutorial
    2. How It Works…  LIVE on-line classes  Class recordings in the Learning Management System (LMS)  Module-wise quizzes, coding assignments  24x7 on-demand technical support  Project work on large datasets  Online certification exam  Lifetime access to the LMS  Complimentary Java classes
    3. Course Topics  Module 1: Understanding Big Data, Hadoop Architecture  Module 2: Introduction to Hadoop 2.x, Data Loading Techniques, Hadoop Project Environment  Module 3: Hadoop MapReduce Framework, Programming in MapReduce  Module 4: Advance MapReduce, YARN (MRv2) Architecture, Programming in YARN  Module 5: Analytics using Pig, Understanding Pig Latin  Module 6: Analytics using Hive, Understanding Hive QL  Module 7: NoSQL Databases, Understanding HBase, ZooKeeper  Module 8: Real-world Datasets and Analysis, Project Discussion
    4. Topics for Today  What is Big Data?  Limitations of the existing solutions  Solving the problem with Hadoop  Introduction to Hadoop  Hadoop Eco-System  Hadoop Core Components  HDFS Architecture  MapReduce job execution  Anatomy of a File Write and Read  Hadoop 2.0 (YARN or MRv2) Architecture
    5. What Is Big Data?  Lots of data (terabytes or petabytes)  Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.  Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information. The NYSE, for example, generates about one terabyte of new trade data per day to perform stock-trading analytics and determine trends for optimal trades.
    6. Un-Structured Data is Exploding
    7. IBM’s Definition – Characteristics of Big Data: Volume, Velocity, Variety
    8. Annie’s Introduction – Hello there! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.
    9. Annie’s Question – Map the following to the corresponding data type:  XML files  Word docs, PDF files, text files  E-mail body  Data from enterprise systems (ERP, CRM, etc.)
    10. Annie’s Answer  XML files -> Semi-structured data  Word docs, PDF files, text files -> Unstructured data  E-mail body -> Unstructured data  Data from enterprise systems (ERP, CRM, etc.) -> Structured data
    11. Further Reading  More on Big Data  Why Hadoop  Opportunities in Hadoop  Big Data  IBM’s definition – Big Data characteristics
    12. Common Big Data Customer Scenarios  Web and e-tailing: Recommendation engines, Ad targeting, Search quality, Abuse and click-fraud detection  Telecommunications: Customer churn prevention, Network performance optimization, Calling Data Record (CDR) analysis, Analyzing the network to predict failure
    13. Common Big Data Customer Scenarios (Contd.)  Government: Fraud detection and cyber security, Welfare schemes, Justice  Healthcare & Life Sciences: Health information exchange, Gene sequencing, Serialization, Healthcare service quality improvements, Drug safety
    14. Common Big Data Customer Scenarios (Contd.)  Banks and Financial Services: Modeling true risk, Threat analysis, Fraud detection, Trade surveillance, Credit scoring and analysis  Retail: Point-of-sale transaction analysis, Customer churn analysis, Sentiment analysis
    15. Hidden Treasure – Case Study: Sears Holding Corporation  Insight into data can provide business advantage.  Some key early indicators can mean fortunes to a business.  More precise analysis with more data. *Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process its customer activity and sales data.
    16. Limitations of Existing Data Analytics Architecture  Instrumentation and collection (mostly append) feed an ETL compute grid, which loads aggregated data into an RDBMS for BI reports and interactive apps; the original raw data goes to a storage-only grid. A meagre 10% of the ~2 PB of data is available for BI; 90% of the ~2 PB is archived. 1. Can’t explore the original high-fidelity raw data 2. Moving data to compute doesn’t scale 3. Premature data death
    17. Solution: A Combined Storage + Compute Layer  Hadoop provides both storage and processing in one grid (storage + compute), with no data archiving: the entire ~2 PB of data is available for processing, alongside the RDBMS (aggregated data) feeding BI reports and interactive apps. 1. Data exploration & advanced analytics 2. Scalable throughput for ETL & aggregation 3. Keep data alive forever *Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with its existing non-Hadoop solutions.
    18. Hadoop Differentiating Factors: Accessible, Simple, Robust, Scalable
    19. Hadoop – It’s about Scale and Structure (RDBMS / EDW / MPP vs. Hadoop / NoSQL)  Data types: structured vs. multi- and unstructured  Processing: limited, no data processing vs. processing coupled with data  Governance: standards & structured vs. loosely structured  Schema: required on write vs. required on read  Speed: reads are fast vs. writes are fast  Cost: software license vs. support only  Resources: known entity vs. growing, complexities, wide  Best-fit use: interactive OLAP analytics, complex ACID transactions, operational data store vs. data discovery, processing unstructured data, massive storage/processing
    20. Why DFS? Read 1 TB of data: 1 machine (4 I/O channels, each channel 100 MB/s) vs. 10 machines (4 I/O channels, each channel 100 MB/s)
    21. Why DFS? Read 1 TB of data: 1 machine (4 I/O channels, each channel 100 MB/s) – 45 minutes; 10 machines (4 I/O channels, each channel 100 MB/s)
    22. Why DFS? Read 1 TB of data: 1 machine (4 I/O channels, each channel 100 MB/s) – 45 minutes; 10 machines (4 I/O channels, each channel 100 MB/s) – 4.5 minutes
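The arithmetic behind these three slides is worth making explicit. A quick back-of-envelope sketch (the 45 and 4.5-minute figures on the slides are rounded):

```python
def read_time_minutes(data_tb, machines, channels=4, mb_per_s=100):
    """Minutes to scan the data when every machine reads on all channels in parallel."""
    total_mb = data_tb * 1024 * 1024             # 1 TB = 1,048,576 MB
    throughput = machines * channels * mb_per_s  # aggregate MB/s across the cluster
    return total_mb / throughput / 60

one = read_time_minutes(1, machines=1)    # ~44 minutes (the slide rounds to 45)
ten = read_time_minutes(1, machines=10)   # ~4.4 minutes (the slide rounds to 4.5)
assert one / ten == 10                    # 10x the machines -> exactly 10x faster
```

This is the whole case for a distributed file system: aggregate I/O bandwidth scales linearly with the number of machines, so scan time drops in direct proportion.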
    23. What Is Hadoop?  Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.  It is an open-source data management framework with scale-out storage and distributed processing.
    24. Hadoop Key Characteristics: Reliable, Flexible, Economical, Scalable
    25. Annie’s Question – Hadoop is a framework that allows for the distributed processing of:  Small data sets  Large data sets
    26. Annie’s Answer – Large data sets. Hadoop is also capable of processing small data sets; however, to experience its true power one needs data in terabytes, because this is where an RDBMS takes hours and fails, whereas Hadoop does the same in a couple of minutes.
    27. Hadoop Eco-System  Apache Oozie (workflow)  Hive (DW system)  Pig Latin (data analysis)  Mahout (machine learning)  MapReduce framework  HBase  HDFS (Hadoop Distributed File System)  Flume (import of unstructured or semi-structured data)  Sqoop (import or export of structured data)
    28. Machine Learning with Mahout – Write intelligent applications using Apache Mahout; LinkedIn recommendations are Hadoop and MapReduce magic in action
    29. Hadoop Core Components – Hadoop is a system for large-scale data processing. It has two main components:  HDFS – Hadoop Distributed File System (storage): distributed across “nodes”, natively redundant, clustered storage; the NameNode tracks block locations  MapReduce (processing): splits a task across processors “near” the data and assembles results; self-healing, high bandwidth; the JobTracker manages the TaskTrackers
    30. Hadoop Core Components (Contd.)  MapReduce engine: a JobTracker on the admin node managing TaskTrackers  HDFS cluster: a NameNode managing DataNodes
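To make the processing component concrete, here is a minimal in-process model of the MapReduce flow — pure Python for illustration, not the actual Hadoop API: the mapper emits (word, 1) pairs, a sort stands in for the shuffle, and the reducer sums the counts per key, just as the classic WordCount job does.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word, as a WordCount mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle/sort by key, then reduce: sum the counts for each word."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["Hadoop is a framework", "Hadoop scales linearly"]
counts = dict(reduce_phase(map_phase(docs)))
assert counts["hadoop"] == 2
assert counts["framework"] == 1
```

In real Hadoop the input lines come from HDFS splits, the map and reduce functions run as tasks on TaskTrackers, and the sort/shuffle happens across the network — but the data flow is exactly this.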
    31. HDFS Architecture  Clients issue metadata ops to the NameNode, which holds the metadata (name, replicas, …: e.g. /home/foo/data, 3, …), and read/write blocks directly to the DataNodes; block ops and replication flow between DataNodes across racks (Rack 1, Rack 2)
    32. Main Components of HDFS  NameNode: the master of the system; maintains and manages the blocks which are present on the DataNodes  DataNodes: slaves deployed on each machine that provide the actual storage; responsible for serving read and write requests from clients
    33. NameNode Metadata  Metadata in memory: the entire metadata is in main memory; no demand paging of FS metadata  Types of metadata: list of files; list of blocks for each file; list of DataNodes for each block; file attributes, e.g. access time, replication factor  A transaction log records file creations, file deletions, etc.  The NameNode stores metadata only: it keeps track of the overall file directory structure and the placement of data blocks, e.g. /user/doug/hinfo -> 1 3 5, /user/doug/pdetail -> 4 2
    34. Secondary NameNode  Not a hot standby for the NameNode (the NameNode remains a single point of failure)  Connects to the NameNode every hour* (“You give me metadata every hour, I will make it secure”)  Housekeeping, backup of NameNode metadata  The saved metadata can be used to rebuild a failed NameNode
    35. Annie’s Question – NameNode: a) is the “Single Point of Failure” in a cluster b) runs on ‘enterprise-class’ hardware c) stores meta-data d) All of the above
    36. Annie’s Answer – All of the above. The NameNode stores meta-data and runs on reliable, high-quality hardware because it’s a single point of failure in a Hadoop cluster.
    37. Annie’s Question – When the NameNode fails, the Secondary NameNode takes over instantly and prevents cluster failure: a) TRUE b) FALSE
    38. Annie’s Answer – False. The Secondary NameNode is used for creating NameNode checkpoints. A NameNode can be manually recovered using the ‘edits’ log and ‘FSImage’ stored in the Secondary NameNode.
    39. JobTracker  1. The client copies the input files to DFS  2. The user submits the job  3. The client gets the input files’ info  4. Creates splits  5. Uploads the job information (Job.xml, Job.jar)  6. Submits the job to the JobTracker
    40. JobTracker (Contd.)  7. The JobTracker initializes the job in the job queue  8. Reads the job files from DFS (Job.xml, Job.jar, input splits)  9. Creates the maps and reduces – as many maps as there are splits
    41. JobTracker (Contd.)  10. TaskTrackers (H1–H4) send heartbeats to the JobTracker  11. The JobTracker picks tasks from the job queue (data-local if possible)  12. Assigns tasks to the TaskTrackers
    42. Annie’s Question – The Hadoop framework picks which of the following daemons for scheduling a task? a) NameNode b) DataNode c) TaskTracker d) JobTracker
    43. Annie’s Answer – The JobTracker takes care of all the job scheduling and assigns tasks to TaskTrackers.
    44. Anatomy of a File Write  1. The HDFS client calls create() on DistributedFileSystem  2. DistributedFileSystem creates the file on the NameNode  3. The client writes data  4. Write packets flow down a pipeline of DataNodes  5. Ack packets flow back up the pipeline  6. The client closes the stream  7. DistributedFileSystem signals complete to the NameNode
    45. Anatomy of a File Read  1. The HDFS client calls open() on DistributedFileSystem  2. DistributedFileSystem gets the block locations from the NameNode  3. The client calls read()  4–5. Data is read from the DataNodes holding each block in turn
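The read steps above can be simulated with a couple of toy classes. This is an illustrative sketch of the flow, not the real Hadoop client; the file path, block IDs, and DataNode contents are invented for the example:

```python
class NameNode:
    """Toy NameNode: answers block-location queries (step 2)."""
    def __init__(self, namespace):
        self.namespace = namespace          # path -> list of (block_id, [DataNode, ...])
    def get_block_locations(self, path):
        return self.namespace[path]

class DataNode:
    """Toy DataNode: serves block contents (steps 4-5)."""
    def __init__(self, blocks):
        self.blocks = blocks                # block_id -> bytes
    def read(self, block_id):
        return self.blocks[block_id]

# Hypothetical two-block file spread across two DataNodes.
dn1 = DataNode({1: b"hello "})
dn2 = DataNode({2: b"world"})
nn = NameNode({"/user/demo/f": [(1, [dn1]), (2, [dn2])]})

def open_and_read(namenode, path):
    """Open -> get block locations -> read block by block from a replica holder."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):  # step 2
        data += replicas[0].read(block_id)                         # steps 4-5
    return data

assert open_and_read(nn, "/user/demo/f") == b"hello world"
```

The key point the slide makes is visible here: the NameNode never touches the data itself — it only hands out locations, and the client streams the bytes directly from the DataNodes.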
    46. Replication and Rack Awareness
    47. Annie’s Question – In HDFS, blocks of a file are written in parallel; however, the replication of the blocks is done sequentially: a) TRUE b) FALSE
    48. Annie’s Answer – True. A file is divided into blocks; these blocks are written in parallel, but the block replication happens in sequence.
    49. Annie’s Question – A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file? a) Can read up to the block that’s successfully written b) Can read up to the last bit successfully written c) Will throw an exception d) Cannot see that file until it has finished copying
    50. Annie’s Answer – The client can read up to the successfully written data block, so the answer is (a).
    51. Hadoop 2.x (YARN or MRv2)  HDFS: all namespace edits are logged to shared NFS storage, with a single writer (fencing); the Active NameNode writes the shared edit logs, and the Standby NameNode reads those edit logs and applies them to its own namespace; DataNodes report to both  YARN: a Resource Manager accepts work from clients; Node Managers on the Data Nodes host Containers and App Masters
    52. Further Reading  Apache Hadoop and HDFS  Apache Hadoop HDFS Architecture  Hadoop 2.0 and YARN
    53. Module-2 Pre-work  Set up the Hadoop development environment using the documents present in the LMS:  Hadoop Installation – Setup Cloudera CDH3 Demo VM  Hadoop Installation – Setup Cloudera CDH4 QuickStart VM  Execute Linux basic commands  Execute HDFS hands-on commands  Attempt the Module-1 assignments present in the LMS
    54. Thank You – See You in Class Next Week