Learn Big Data & Hadoop
www.edureka.in/hadoop
How It Works…
 LIVE classes
 Class recordings
 Module-wise Quizzes, Coding Assignments
 24x7 on-demand technical support
 Project work on large Datasets
 Online certification exam
 Lifetime access to the Learning Management System
Complimentary Java Classes
Course Topics
 Week 1
– Understanding Big Data
– Introduction to HDFS
 Week 2
– Playing around with Cluster
– Data loading Techniques
 Week 3
– Map-Reduce Basics, types and formats
– Use-cases for Map-Reduce
 Week 4
– Analytics using Pig
– Understanding Pig Latin
 Week 5
– Analytics using Hive
– Understanding HIVE QL
 Week 6
– NoSQL Databases
– Understanding HBase
 Week 7
– Data loading Techniques in HBase
– ZooKeeper
 Week 8
– Real world Datasets and Analysis
– Hadoop Project Environment
What Is Big Data?
 Lots of data (Terabytes or Petabytes)
 Systems / Enterprises generate huge amounts of data, from Terabytes to Petabytes of information.
An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
NYSE generates about one terabyte of new trade data per day, which is used to perform stock-trading analytics and determine trends for optimal trades.
Facebook Example
 Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network.
 An average of 3.2 billion likes and comments are posted on Facebook every day.
Twitter Example
 Twitter has over 500 million registered users.
 The USA's 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK and Indonesia.
 79% of US Twitter users are more likely to recommend brands they follow.
 67% of US Twitter users are more likely to buy from brands they follow.
 57% of all companies that use social media for business use Twitter.
IBM’s Definition
 IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
 Estimated Global Data Volume:
 2011: 1.8 ZB
 2015: 7.9 ZB
 The world's information doubles every two years
 Over the next 10 years:
 The number of servers worldwide will grow by 10x
 Amount of information managed by enterprise data centers will grow by 50x
 Number of “files” enterprise data centers handle will grow by 75x
Source: http://www.emc.com/leadership/programs/digital-universe.htm,
which was based on the 2011 IDC Digital Universe Study
Data Volume Is Growing Exponentially
Unstructured Data is Exploding
Common Big Data Customer Scenarios
Industry / Vertical – Scenarios
Financial Services
  Modeling True Risk
  Threat Analysis
  Fraud Detection
  Trade Surveillance
  Credit Scoring and Analysis
Web & E-Tailing
  Recommendation Engines
  Ad Targeting
  Search Quality
  Abuse and Click-Fraud Detection
Retail
  Point-of-Sale Transaction Analysis
  Customer Churn Analysis
  Sentiment Analysis
Common Big Data Customer Scenarios (Contd.)
Industry / Vertical – Scenarios
Telecommunications
  Customer Churn Prevention
  Network Performance Optimization
  Call Detail Record (CDR) Analysis
  Analyzing Network to Predict Failure
Government
  Fraud Detection and Cyber Security
General (Cross-Vertical)
  ETL & Processing Engine
Hidden Treasure
 Insight into data can provide a business advantage.
 Some key early indicators can mean fortunes to a business.
 More data allows more precise analysis.
What Big Companies Have To Say…
“Analyzing Big Data sets will become a key basis for competition.”
“Leaders in every sector will have to grapple with the implications of Big Data.”
“Big Data analytics are rapidly emerging as the preferred solution to business and technology trends that are disrupting.”
“Enterprises should not delay implementation of Big Data Analytics.”
“Use Hadoop to gain a competitive advantage over more risk-averse enterprises.”
“Prioritize Big Data projects that might benefit from Hadoop.”
– McKinsey, Gartner, Forrester Research
Limitations of Existing Data Analytics Architecture
Solution: A Combined Storage & Compute Layer
Differentiating Factors
Some of the Hadoop Users
Hadoop Users – In Detail
http://wiki.apache.org/hadoop/PoweredBy
Why DFS?
Read 1 TB of data:
 1 Machine: 4 I/O channels, each channel 100 MB/s → 45 minutes
 10 Machines: 4 I/O channels per machine, each channel 100 MB/s → 4.5 minutes
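A quick back-of-the-envelope check of those numbers (a sketch, not from the slides; it assumes the 4-channel, 100 MB/s figures above and treats 1 TB as 10^6 MB):

```java
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double totalMb = 1_000_000;           // 1 TB expressed as 10^6 MB
        double channelMbPerSec = 100;         // per-channel throughput from the slide
        int channelsPerMachine = 4;

        // One machine: the whole 1 TB flows through its 4 channels.
        double oneMachineSec = totalMb / (channelsPerMachine * channelMbPerSec);
        System.out.printf("1 machine  : %.1f minutes%n", oneMachineSec / 60);   // ~41.7 min (slide rounds to ~45)

        // Ten machines: each machine reads only 1/10 of the data, in parallel.
        double tenMachinesSec = oneMachineSec / 10;
        System.out.printf("10 machines: %.1f minutes%n", tenMachinesSec / 60);  // ~4.2 min (slide rounds to ~4.5)
    }
}
```

The point of the slide: it is the parallel I/O across machines, not faster disks, that cuts the read time by 10x, and that is exactly what a distributed file system makes possible.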
What Is Hadoop?
 Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
 It is an open-source data management framework with scale-out storage and distributed processing.
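As a concrete illustration of that “simple programming model”, here is a minimal word-count mapper and reducer written against the standard org.apache.hadoop.mapreduce API (a sketch added for clarity, not part of the original deck):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: runs on the nodes holding the input splits, emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives all counts for one word and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```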
Hadoop Key Characteristics
Hadoop History
 Doug Cutting & Mike Cafarella start working on Nutch
 Google publishes the GFS & MapReduce papers
 Doug Cutting adds DFS & MapReduce support to Nutch
 Yahoo! hires Cutting; Hadoop spins out of Nutch
 NY Times converts 4 TB of image archives over 100 EC2 instances
 Fastest sort of a TB: 3.5 minutes over 910 nodes
 Cloudera founded
 Facebook launches Hive: SQL support for Hadoop
 Doug Cutting joins Cloudera
 Fastest sort of a TB: 62 seconds over 1,460 nodes; a PB sorted in 16.25 hours over 3,658 nodes
 Hadoop Summit 2009: 750 attendees
Hadoop Eco-System
Hadoop Core Components
 HDFS – Hadoop Distributed File System (Storage)
  Distributed across “nodes”
  Natively redundant
  NameNode tracks block locations
  Self-healing, high-bandwidth clustered storage
 MapReduce (Processing)
  Splits a task across processors, “near” the data, and assembles results
Hadoop Core Components (Contd.)
HDFS Architecture
Main Components of HDFS
 NameNode:
  Master of the system
  Maintains and manages the blocks that are present on the DataNodes
 DataNodes:
  Slaves, deployed on each machine, that provide the actual storage
  Responsible for serving read and write requests from clients
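To see the NameNode's block bookkeeping from a client's point of view, the standard FileSystem API can be asked which DataNodes hold the blocks of a file. A sketch for illustration (the path is hypothetical; cluster settings are assumed to come from core-site.xml / hdfs-site.xml on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);         // DistributedFileSystem when pointed at HDFS

        Path file = new Path("/user/edureka/sample.txt");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers this query: which DataNodes hold each block of the file?
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```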
Secondary NameNode
 Secondary NameNode:
  Not a hot standby for the NameNode
  Connects to the NameNode every hour*
  Housekeeping, backup of NameNode metadata
  Saved metadata can be used to rebuild a failed NameNode
[Diagram: the NameNode, a single point of failure, hands its metadata to the Secondary NameNode every hour, and the Secondary NameNode keeps a safe copy.]
NameNode Metadata
 Metadata in Memory
  The entire metadata is in main memory
  No demand paging of FS metadata
 Types of Metadata
  List of files
  List of blocks for each file
  List of DataNodes for each block
  File attributes, e.g. access time, replication factor
 A Transaction Log
  Records file creations, file deletions, etc.
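These per-file attributes are what a client gets back when it lists a directory: each FileStatus is served from the NameNode's in-memory metadata. A small sketch (the directory name is a hypothetical example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Each FileStatus comes from the NameNode's in-memory namespace.
        for (FileStatus st : fs.listStatus(new Path("/user/edureka"))) {   // hypothetical directory
            System.out.println(st.getPath()
                    + " len=" + st.getLen()
                    + " replication=" + st.getReplication()
                    + " accessTime=" + st.getAccessTime());
        }
        fs.close();
    }
}
```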
JobTracker
JobTracker (Contd.)
JobTracker (Contd.)
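The JobTracker figures are not reproduced in this text, but the client side of the interaction looks like this: a driver program configures a Job and submits it, and the JobTracker (in Hadoop 1.x / MRv1) then schedules the map and reduce tasks onto TaskTrackers. A sketch of such a driver, reusing the WordCount classes shown earlier; the input/output paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");        // Job.getInstance(conf, ...) in newer APIs

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/edureka/input"));    // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/user/edureka/output"));

        // Submits the job to the cluster (the JobTracker in MRv1) and waits for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```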
Anatomy of A File Write
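The write path shown in this slide's figure starts from a plain client call: create() asks the NameNode to add the file to the namespace, and the returned stream then pipelines data packets to a chain of DataNodes. A minimal client-side sketch (the path is a hypothetical example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() -> RPC to the NameNode to register the new file (no blocks allocated yet).
        FSDataOutputStream out = fs.create(new Path("/user/edureka/demo.txt"));   // hypothetical path

        // Bytes are split into packets and streamed through a pipeline of DataNodes.
        out.write("hello hdfs".getBytes("UTF-8"));

        // close() flushes the remaining packets; the NameNode then marks the file complete.
        out.close();
        fs.close();
    }
}
```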
Anatomy of A File Read
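The read path is symmetric: open() asks the NameNode for the block locations, and the stream then reads each block directly from a nearby DataNode. A minimal sketch, again with a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() -> RPC to the NameNode for the locations of the first blocks.
        FSDataInputStream in = fs.open(new Path("/user/edureka/demo.txt"));   // hypothetical path

        // Bytes are then streamed directly from the DataNodes that hold the blocks.
        IOUtils.copyBytes(in, System.out, 4096, false);

        in.close();
        fs.close();
    }
}
```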
Replication and Rack Awareness
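The figure for this slide is not reproduced, but the idea is that each block is stored on several DataNodes (3 by default) and the NameNode places the replicas with the rack layout in mind, so the loss of one rack does not lose a block. The replication factor can be changed per file from the client API; a small sketch with a placeholder path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Ask the NameNode to keep 3 copies of each block of this file (the usual default).
        boolean accepted = fs.setReplication(new Path("/user/edureka/demo.txt"), (short) 3);  // placeholder path
        System.out.println("replication change accepted: " + accepted);

        fs.close();
    }
}
```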
Big Data – It’s about Scale And Structure

                 RDBMS / EDW / MPP                    NoSQL / HADOOP
Data Types       Structured                           Multi and unstructured
Processing       Limited, no data processing          Processing coupled with data
Governance       Standards & structured               Loosely structured
Schema           Required on write                    Required on read
Speed            Reads are fast                       Writes are fast
Cost             Software license                     Support only
Resources        Known entity                         Growing, complexities, wide
Best Fit Use     Interactive OLAP analytics,          Data discovery,
                 complex ACID transactions,           processing unstructured data,
                 operational data store               massive storage/processing
Assignments
 Attempt the following assignments using the documents present in the LMS:
  Hadoop Installation – Cloudera CDH3
  Execute Linux Basic Commands
  Execute HDFS Hands-On
Thank You
See You in Class Next Week


Editor's Notes

  • #17 Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon EC2. Robust: Since Hadoop can run on commodity clusters, it is designed with the assumption of frequent hardware failure; it gracefully handles such failures, and computation does not stop because of a few failed devices/systems. Scalable: Hadoop scales linearly to handle larger data by adding more slave nodes to the cluster. Simple: It is easy to write efficient parallel programs with Hadoop.
  • #37 Data is transferred from the DataNode to the MapTask process. DBlk is the file data block; CBlk is the file checksum block. File data is transferred to the client through Java NIO transferTo (aka the UNIX sendfile syscall). Checksum data is first fetched into the DataNode JVM buffer and then pushed to the client (details are not shown). Both file data and checksum data are bundled into an HDFS packet (typically 64 KB) in the format {packet header | checksum bytes | data bytes}.
    2. Data received from the socket is buffered in a BufferedInputStream, presumably to reduce the number of syscalls to the kernel. This actually involves two buffer copies: first, data is copied from kernel buffers into a temporary direct buffer in JDK code; second, data is copied from the temporary direct buffer to the byte[] buffer owned by the BufferedInputStream. The size of the byte[] in BufferedInputStream is controlled by the configuration property "io.file.buffer.size" and defaults to 4 KB. In our production environment, this parameter is customized to 128 KB.
    3. Through the BufferedInputStream, the checksum bytes are saved into an internal ByteBuffer (whose size is roughly (PacketSize / 512 * 4), or 512 B), and file bytes (compressed data) are deposited into the byte[] buffer supplied by the decompression layer. Since the checksum calculation requires a full 512-byte chunk while a user's request may not be aligned with a chunk boundary, a 512 B byte[] buffer is used to align the input before copying partial chunks into the user-supplied byte[] buffer. Also note that data is copied into the buffer in 512-byte pieces (as required by the FSInputChecker API). Finally, all checksum bytes are copied to a 4-byte array for FSInputChecker to perform checksum verification. Overall, this step involves an extra buffer copy.
    4. The decompression layer uses a byte[] buffer to receive data from the DFSClient layer. The DecompressorStream copies the data from the byte[] buffer to a 64 KB direct buffer, calls the native library code to decompress the data, and stores the uncompressed bytes in another 64 KB direct buffer. This step involves two buffer copies.
    5. LineReader maintains an internal buffer to absorb data from downstream. From the buffer, line separators are discovered and line bytes are copied to form Text objects. This step requires two buffer copies.
    On the write side: the client creates the file by calling create() on DistributedFileSystem (step 1). DistributedFileSystem makes an RPC call to the NameNode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The NameNode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file.
  • #38 The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1). DistributedFileSystem calls the NameNode, using RPC, to determine the locations of the first few blocks in the file (step 2). For each block, the NameNode returns the addresses of the DataNodes that have a copy of that block.
  • #41 Pig is procedural and SQL is declarative. While fields within a SQL record must be atomic (contain one single value), fields within a Pig tuple can be multi-valued, e.g. a collection of other Pig tuples, or a map whose key is an atomic value and whose value can be anything. Unlike a SQL query, where the input data must be physically loaded into DB tables, Pig extracts the data from its original data sources directly during execution. Pig is lazily executed: it uses a backtracking mechanism from its "store" statement to determine which statements need to be executed.