The Big Data and Hadoop training course is designed to provide the knowledge and skills needed to become a successful Hadoop developer. The course covers in-depth concepts such as the Hadoop Distributed File System, setting up a Hadoop cluster, MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, etc.
How It Works…
LIVE classes
Class recordings
Module-wise quizzes, coding assignments
24x7 on-demand technical support
Project work on large Datasets
Online certification exam
Lifetime access to the Learning Management System
Complimentary Java Classes
Course Topics
Week 1
– Understanding Big Data
– Introduction to HDFS
Week 2
– Playing around with Cluster
– Data loading Techniques
Week 3
– MapReduce Basics, Types and Formats
– Use Cases for MapReduce
Week 4
– Analytics using Pig
– Understanding Pig Latin
Week 5
– Analytics using Hive
– Understanding HiveQL
Week 6
– NoSQL Databases
– Understanding HBase
Week 7
– Data Loading Techniques in HBase
– Zookeeper
Week 8
– Real world Datasets and Analysis
– Hadoop Project Environment
What Is Big Data?
Lots of data (terabytes or petabytes)
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
The NYSE generates about one terabyte of new trade data per day, which is analyzed to determine trends for optimal trades.
Facebook Example
Facebook users spend 10.5 billion minutes
(almost 20,000 years) online on the social network.
An average of 3.2 billion likes and comments are posted on Facebook every day.
Twitter Example
Twitter has over 500 million registered users.
The USA's 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK and Indonesia.
79% of US Twitter users are more likely to recommend brands they follow.
67% of US Twitter users are more likely to buy from brands they follow.
57% of all companies that use social media for business use
Twitter.
Data Volume Is Growing Exponentially
Estimated global data volume:
2011: 1.8 ZB
2015: 7.9 ZB
The world's information doubles every two years.
Over the next 10 years:
The number of servers worldwide will grow by 10x.
The amount of information managed by enterprise data centers will grow by 50x.
The number of “files” enterprise data centers handle will grow by 75x.
Source: http://www.emc.com/leadership/programs/digital-universe.htm, based on the 2011 IDC Digital Universe Study.
Common Big Data Customer Scenarios
Industry/Vertical: Scenarios
Financial Services: Modeling True Risk, Threat Analysis, Fraud Detection, Trade Surveillance, Credit Scoring and Analysis
Web & E-Tailing: Recommendation Engines, Ad Targeting, Search Quality, Abuse and Click-Fraud Detection
Retail: Point-of-Sale Transaction Analysis, Customer Churn Analysis, Sentiment Analysis
Common Big Data Customer Scenarios (Contd.)
Telecommunications: Customer Churn Prevention, Network Performance Optimization, Call Detail Record (CDR) Analysis, Analyzing the Network to Predict Failure
Government: Fraud Detection and Cyber Security
General (Cross-Vertical): ETL and Processing Engine
Hidden Treasure
Insight into data can provide a business advantage.
Some key early indicators can mean fortunes for a business.
More data enables more precise analysis.
What Big Companies Have To Say…
“Analyzing Big Data sets will become a key basis for competition.”
“Leaders in every sector will have to grapple with the implications of Big Data.”
– McKinsey
“Big Data analytics is rapidly emerging as the preferred solution to disruptive business and technology trends.”
“Enterprises should not delay implementation of Big Data analytics.”
– Gartner
“Use Hadoop to gain a competitive advantage over more risk-averse enterprises.”
“Prioritize Big Data projects that might benefit from Hadoop.”
– Forrester Research
What Is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It offers open-source data management with scale-out storage and distributed processing.
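To make the "simple programming model" concrete, here is a minimal sketch of the classic word-count job (the standard introductory MapReduce example, not part of the course material itself); it assumes the org.apache.hadoop.mapreduce API, and the class name and input/output paths are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional: combiner reduces shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}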
Hadoop History
Doug Cutting & Mike Cafarella start working on Nutch
Google publishes the GFS & MapReduce papers
Doug Cutting adds DFS & MapReduce support to Nutch
Yahoo! hires Cutting; Hadoop spins out of Nutch
NY Times converts 4 TB of image archives over 100 EC2 instances
Fastest sort of a TB: 3.5 minutes over 910 nodes
Cloudera founded
Facebook launches Hive: SQL support for Hadoop
Fastest sort of a TB: 62 seconds over 1,460 nodes
Sorted a PB in 16.25 hours over 3,658 nodes
Doug Cutting joins Cloudera
Hadoop Summit 2009: 750 attendees
Main Components Of HDFS
NameNode:
master of the system
maintains and manages the blocks which are present on the DataNodes
DataNodes:
slaves which are deployed on each machine and provide the actual storage
responsible for serving read and write requests from the clients
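As a small illustration of this split (a sketch, not part of the course material), the snippet below asks the NameNode, through the FileSystem client API, where the blocks of a file live; the bytes themselves would then be read from the listed DataNodes. The file path is hypothetical and the client is assumed to be configured against an HDFS cluster:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // DistributedFileSystem when the default filesystem is HDFS
    Path file = new Path(args[0]);              // e.g. a file already loaded into HDFS (hypothetical)

    // The NameNode answers this metadata query: block offsets, lengths and replica hosts.
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      // Each block is stored on several DataNodes (one per replica).
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
    fs.close();
  }
}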
Secondary NameNode
Not a hot standby for the NameNode
Connects to the NameNode every hour*
Housekeeping, backup of NameNode metadata
Saved metadata can be used to rebuild a failed NameNode
[Diagram: the NameNode (a single point of failure) hands its metadata to the Secondary NameNode every hour for safekeeping.]
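The "every hour" above comes from a configurable checkpoint interval. As a tiny illustrative sketch (assuming Hadoop 1.x / CDH3-era property names, where the setting is fs.checkpoint.period with a default of 3600 seconds; later releases renamed it dfs.namenode.checkpoint.period), a client can read the effective value like this:

import org.apache.hadoop.conf.Configuration;

public class ShowCheckpointPeriod {
  public static void main(String[] args) {
    Configuration conf = new Configuration();  // loads core-site.xml / hdfs-site.xml from the classpath
    // Assumption: Hadoop 1.x property name; defaults to 3600 seconds (one hour).
    long periodSeconds = conf.getLong("fs.checkpoint.period", 3600);
    System.out.println("Secondary NameNode checkpoints every " + periodSeconds + " seconds");
  }
}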
NameNode Metadata
Meta-data in Memory
The entire metadata is in main memory
No demand paging of FS meta-data
Types of Metadata
List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g. access time, replication factor
A Transaction Log
Records file creations, file deletions, etc.
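As an illustrative sketch (not from the course material), the snippet below surfaces some of that per-file metadata through the client API: length, block size, replication factor and access time. The directory argument is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowFileMetadata {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // List the metadata the NameNode keeps for each file in the given directory.
    for (FileStatus st : fs.listStatus(new Path(args[0]))) {
      System.out.println(st.getPath()
          + "  len=" + st.getLen()
          + "  blockSize=" + st.getBlockSize()
          + "  replication=" + st.getReplication()
          + "  accessTime=" + st.getAccessTime());
    }
    fs.close();
  }
}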
Big Data – It’s about Scale And Structure
(RDBMS / EDW / MPP vs. NoSQL / Hadoop)
Data Types: Structured vs. multi- and unstructured
Processing: Limited, no data processing vs. processing coupled with data
Governance: Standards and structured vs. loosely structured
Schema: Required on write vs. required on read
Speed: Reads are fast vs. writes are fast
Cost: Software license vs. support only
Resources: Known entity vs. growing, complex, wide
Best-Fit Use: Interactive OLAP analytics, complex ACID transactions, operational data store vs. data discovery, processing unstructured data, massive storage/processing
Assignments
Attempt the following assignments using the documents present in the LMS:
Hadoop Installation - Cloudera CDH3
Execute Linux Basic Commands
Execute HDFS Hands On
Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon EC2.
Robust: Since Hadoop runs on commodity clusters, it is designed with the assumption of frequent hardware failure; it can gracefully handle such failures, and computation does not stop because of a few failed devices or systems.
Scalable: Hadoop scales linearly to handle larger data by adding more slave nodes to the cluster.
Simple: It is easy to write efficient parallel programs with Hadoop.
Data is transferred from the DataNode to the MapTask process as follows (DBlk is the file data block; CBlk is the file checksum block):
1. File data is transferred to the client through Java NIO transferTo (aka the UNIX sendfile syscall). Checksum data is first fetched into the DataNode JVM buffer and then pushed to the client (details not shown). Both file data and checksum data are bundled into an HDFS packet (typically 64 KB) in the format: {packet header | checksum bytes | data bytes}.
2. Data received from the socket is buffered in a BufferedInputStream, presumably to reduce the number of syscalls to the kernel. This actually involves two buffer copies: first, data is copied from kernel buffers into a temporary direct buffer in JDK code; second, data is copied from the temporary direct buffer into the byte[] buffer owned by the BufferedInputStream. The size of the byte[] in BufferedInputStream is controlled by the configuration property "io.file.buffer.size" and defaults to 4 KB. In our production environment, this parameter is customized to 128 KB.
3. Through the BufferedInputStream, the checksum bytes are saved into an internal ByteBuffer (whose size is roughly (PacketSize / 512 * 4), or 512 B), and the file bytes (compressed data) are deposited into the byte[] buffer supplied by the decompression layer. Since the checksum calculation requires a full 512-byte chunk while a user's request may not be aligned with a chunk boundary, a 512 B byte[] buffer is used to align the input before copying partial chunks into the user-supplied byte[] buffer. Also note that data is copied to the buffer in 512-byte pieces (as required by the FSInputChecker API). Finally, all checksum bytes are copied to a 4-byte array for FSInputChecker to perform checksum verification. Overall, this step involves an extra buffer copy.
4. The decompression layer uses a byte[] buffer to receive data from the DFSClient layer. The DecompressorStream copies the data from the byte[] buffer into a 64 KB direct buffer, calls the native library code to decompress the data, and stores the uncompressed bytes in another 64 KB direct buffer. This step involves two buffer copies.
5. LineReader maintains an internal buffer to absorb data from downstream. From this buffer, line separators are discovered and line bytes are copied to form Text objects. This step requires two buffer copies.
Write path: the client creates the file by calling create() on DistributedFileSystem (step 1). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file.
Read path: the client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks of the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
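A condensed, illustrative sketch of those client-side calls is shown below; the file path and the 128 KB io.file.buffer.size override are assumptions mirroring the notes above, not values mandated by the course:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("io.file.buffer.size", 128 * 1024);    // default is 4 KB; a larger buffer means fewer syscalls

    FileSystem fs = FileSystem.get(conf);               // DistributedFileSystem for an HDFS default filesystem
    Path path = new Path("/user/training/example.txt"); // hypothetical path

    // Write path: create() triggers an RPC to the namenode to add the namespace entry
    // (no blocks yet); the data then streams out to a pipeline of datanodes.
    FSDataOutputStream out = fs.create(path);
    out.writeUTF("hello hdfs");
    out.close();

    // Read path: open() triggers an RPC to the namenode for the locations of the first blocks;
    // the client then reads the bytes directly from the datanodes holding each block.
    FSDataInputStream in = fs.open(path);
    System.out.println(in.readUTF());
    in.close();

    fs.close();
  }
}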
Pig vs. SQL:
Pig is procedural and SQL is declarative.
While fields within a SQL record must be atomic (contain one single value), fields within a Pig tuple can be multi-valued, e.g. a collection of other Pig tuples, or a map whose key is an atomic value and whose value can be anything.
Unlike a SQL query, where the input data must be physically loaded into DB tables, Pig extracts the data from its original data sources directly during execution.
Pig is lazily executed: it uses a backtracking mechanism from its "store" statement to determine which statements need to be executed.