Hadoop-Quick introduction

Overview of Big Data and Hadoop.

Published in: Data & Analytics

Transcript of "Hadoop-Quick introduction"

  1. 1. “Data is a precious thing and will last longer than the systems themselves” – Tim Berners-Lee
  2. 2. Sandeep Kumar
  3. 3. What is Data? • What is data? • And why should we care about it?
  4. 4. What is Big Data? • Big data is a collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
  5. 5. A Few Examples • Web logs • RFID • Social data: Facebook, LinkedIn, Twitter • Call detail records • Large-scale e-commerce • Medical records • Video archives • Atmospheric science • Astronomy • Feeds • Media & advertising
  6. 6. What is Big Data? • Ancestry.com stores around 2.5 petabytes of data. • The New York Stock Exchange generates about one terabyte of new trade data per day. • The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month. (http://archive.org/web/web.php)
  7. 7. How to Process Big Data? • Need to process large datasets (>100 TB) • Just reading 100 TB of data can be overwhelming • Takes ~11 days to read on a standard computer • Takes a day across a 10 Gb/s link (very high-end storage solution) • On a single node (@ 50 MB/s): 23 days • On a 1000-node cluster: 33 min
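The read-time figures on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope model: it assumes decimal units (1 TB = 10^12 bytes), perfectly even splitting of the data across nodes, and no coordination overhead. Reading the link speed as 10 gigabits per second is also an assumption, made because that is what gives roughly one day. The names `read_seconds`, `DATASET_BYTES`, and so on are illustrative.

```python
# Back-of-the-envelope read times for a 100 TB dataset (decimal units assumed).
DATASET_BYTES = 100 * 10**12      # 100 TB
NODE_THROUGHPUT = 50 * 10**6      # 50 MB/s per node, the figure quoted on the slide
LINK_THROUGHPUT = 10 * 10**9 / 8  # assumed 10 Gb/s link, expressed in bytes per second


def read_seconds(total_bytes: float, bytes_per_second: float, nodes: int = 1) -> float:
    """Seconds needed to scan the data when it is split evenly over `nodes` readers."""
    return total_bytes / (bytes_per_second * nodes)


single_node_days = read_seconds(DATASET_BYTES, NODE_THROUGHPUT) / 86_400
cluster_minutes = read_seconds(DATASET_BYTES, NODE_THROUGHPUT, nodes=1000) / 60
link_hours = read_seconds(DATASET_BYTES, LINK_THROUGHPUT) / 3_600

print(f"single node at 50 MB/s: {single_node_days:.1f} days")
print(f"1000-node cluster:      {cluster_minutes:.1f} minutes")
print(f"one 10 Gb/s link:       {link_hours:.1f} hours")
```

Under these assumptions the single-node and 1000-node results land on the slide's 23 days and 33 minutes, which is the whole argument for spreading the scan over many machines.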
  8. 8. Not so easy……….. • The challenges are in search, sharing, transfer, visualization, etc. • Moving data from a storage cluster to a computation cluster is not feasible. • In a large cluster, failure is expected: computers fail every day. • It is very expensive to build reliability into each application. • Massively parallel software runs on tens, hundreds, or even thousands of servers. • A programmer has to worry about errors, data motion, and communication.
  9. 9. What We are looking for.
  10. 10. What we are looking for. • A common infrastructure and standard set of tools to handle this complexity. • An efficient, reliable, fault-tolerant, and usable framework.
  11. 11. What is Hadoop? • It's a framework that allows distributed processing of large data sets across clusters of computers. • It is designed to scale up from single servers to thousands of machines. • It's also designed to run on commodity hardware.
  12. 12. Hadoop is…. ➢ Scalable: stores and processes petabytes; scales by adding hardware, without needing to change data formats. ➢ Economical: 1000s of commodity machines. ➢ Efficient: runs tasks where the data is located. ➢ Flexible: Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. ➢ Fault tolerant: when you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
  13. 13. Hadoop is useful for……. • Batch data processing. • Log processing. • Document analysis & indexing. • Text mining. • Crawl data processing. • Highly parallel, data-intensive distributed applications.
  14. 14. Use the Right Tool for the Right Job (Hadoop vs. RDBMS) • When to use Hadoop? Write once, read many times; structured or not (agility); batch processing. • When to use an RDBMS? Interactive reporting (<1 sec); multistep transactions; lots of inserts/updates/deletes.
  15-21. Hadoop Terminology……. [Diagram built up over seven slides: individual nodes (Node 1, Node 2, Node 3) are grouped into racks (Rack 1, Rack 2, Rack 3), and the racks together form a Hadoop Cluster.]
  22. 22. Hadoop Framework…….
  23. 23. Hadoop Nodes……. • HDFS Nodes ➢ NameNode (Master) ➢ DataNode (Slaves) ➢ Checkpoint Node ➢ Secondary NameNode (deprecated) ➢ Backup Node
  24. 24. Hadoop Nodes……. • MapReduce nodes ➢ JobTracker (Master) ➢ TaskTracker (Slaves)
  25. 25. Hadoop Nodes-Overview
  26. 26. Hadoop Nodes-NameNode • Manages the filesystem namespace and metadata • Replicates missing blocks • No data goes through the NameNode • NameNode metadata mainly consists of: ➢ fsimage: contains a checkpoint copy of the metadata on disk ➢ edit logs: record all write operations and are synchronized with the metadata in RAM after each write ➢ In case of a power failure on the NameNode, it can recover using fsimage + edit logs
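The "fsimage + edit logs" recovery idea can be illustrated with a toy model: start from the last checkpoint and replay every logged write in order. This is only a sketch of the principle, not HDFS code. The dictionary layout, the operation names (`create`, `add_block`, `delete`), and the functions `apply_edit` and `recover` are all hypothetical; the real fsimage and edit-log formats are binary and far richer.

```python
# Toy model only: illustrates "checkpoint + replayed log = current metadata".
def apply_edit(namespace: dict, edit: tuple) -> None:
    """Apply one logged write operation to the in-memory namespace."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = {"blocks": []}
    elif op == "add_block":
        namespace[path]["blocks"].append(args[0])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def recover(fsimage: dict, edit_log: list) -> dict:
    """Rebuild the metadata: copy the checkpoint, then replay the log in order."""
    namespace = {path: {"blocks": list(meta["blocks"])} for path, meta in fsimage.items()}
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace


# Hypothetical checkpoint and the writes that happened after it was taken.
fsimage = {"/logs/a.txt": {"blocks": ["blk_1"]}}
edit_log = [
    ("create", "/logs/b.txt"),
    ("add_block", "/logs/b.txt", "blk_2"),
    ("delete", "/logs/a.txt"),
]

recovered = recover(fsimage, edit_log)
print(recovered)
```

The longer the edit log grows, the longer this replay takes, which is one way to see why periodically folding the log into a fresh checkpoint is useful.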
  27. 27. Hadoop Nodes-CheckPoint Node • Periodically creates checkpoints of NameNode filesystem • The Checkpoint node should run on a different machine than the NameNode • Should have same storage requirements as NameNode • There can be many Checkpoint nodes per cluster
  28. 28. Hadoop Nodes-BackUp Node • The difference from the Checkpoint node is that it keeps an up-to-date copy of the metadata in RAM • Same RAM requirements as NameNode • Can only have one Backup node per cluster
  29. 29. Hadoop Nodes-Data Node • Can be many per Hadoop cluster • Manages blocks with data and serves them to clients • Periodically reports to the NameNode the list of blocks it stores • Uses inexpensive commodity hardware
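How the periodic block reports let the NameNode notice missing replicas can be sketched as a simple count. This is an illustrative model, not the HDFS implementation: the report format, the node and block names, the function `under_replicated`, and the target of three copies per block are all assumptions made for the example.

```python
from collections import defaultdict


def under_replicated(block_reports: dict, target: int) -> dict:
    """Return {block_id: missing_copies} for blocks reported by fewer than `target` nodes."""
    replica_count = defaultdict(int)
    for blocks in block_reports.values():
        for block_id in set(blocks):  # count each node at most once per block
            replica_count[block_id] += 1
    return {b: target - n for b, n in replica_count.items() if n < target}


# Hypothetical reports: each DataNode lists the blocks it currently stores.
reports = {
    "datanode-1": ["blk_1", "blk_2"],
    "datanode-2": ["blk_1", "blk_3"],
    "datanode-3": ["blk_1", "blk_2"],
}

missing = under_replicated(reports, target=3)
print(missing)
```

A block that no node reports at all never shows up in this count; catching that case would need the file-to-block mapping held in the NameNode's metadata, which the sketch leaves out.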
  30. 30. Hadoop Nodes-Job Tracker • One per Hadoop cluster (multiple NameNodes can be configured in Hadoop 2.2 or later versions) • Receives job requests submitted by clients • Schedules and monitors MapReduce jobs on TaskTrackers
  31. 31. Hadoop Nodes-Task Tracker • Can be many per Hadoop cluster • Executes MapReduce operations • Reads blocks from DataNodes
  32. 32. Map Reduce It offers: • Operates on key and value pairs. • Two major functions: Map() and Reduce() • Input formats and splits • Number of tasks. • Provides status about jobs to users • Monitors task progress
  33. 33. Map Reduce Diagram
  34. 34. Map Reduce Architecture.
  35. 35. Map Reduce Job. [Diagram showing a client, the JobTracker, the NameNode, and the TaskTrackers & DataNodes; surviving step labels: "3. Namespace info" and "4. tasks".]
  36. 36. Input Output. The MapReduce framework operates on <key, value> pairs. It views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job.
  37. 37. Input Output.. Input and Output types of a MapReduce job: (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output) Reference: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
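The `<k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3>` pipeline can be imitated in a single process to make the key/value types concrete. The sketch below is plain Python, not the Hadoop API: word counting is chosen as the example job, and `map_fn`, `reduce_fn`, `group_by_key`, `run_job`, and the sample splits are all illustrative names and data. Reusing the reducer as the combiner works here only because summing is associative and commutative.

```python
from collections import defaultdict
from itertools import chain


def map_fn(offset, line):
    """map: <k1, v1> = (line offset, line text) -> <k2, v2> = (word, 1)."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """reduce: <k2, [v2, ...]> -> <k3, v3> = (word, total count)."""
    yield word, sum(counts)


def group_by_key(pairs):
    """Collect all values that share a key (the job of the shuffle step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_job(splits):
    combined = []
    for split in splits:  # one simulated map task per input split
        mapped = chain.from_iterable(map_fn(k1, v1) for k1, v1 in split)
        # combine: pre-aggregate this map task's output locally
        for k2, values in group_by_key(mapped).items():
            combined.extend(reduce_fn(k2, values))
    # shuffle, then reduce over the combined output of all map tasks
    output = {}
    for k2, values in group_by_key(combined).items():
        for k3, v3 in reduce_fn(k2, values):
            output[k3] = v3
    return output


splits = [
    [(0, "big data big cluster")],
    [(0, "data node"), (10, "big node")],
]
result = run_job(splits)
print(result)
```

In this run the first split's combiner already collapses the two occurrences of "big" into a single pair before anything is shuffled, which is the point of the combine step: less data moves between the map and reduce phases.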
  38. 38. HDFS Architecture
  39. 39. Hadoop Tools……. • Hive ➢ It's a data warehouse system for Hadoop. ➢ Provides data summarization, query, and analysis.
  40. 40. Hadoop Tools……. • Pig ➢ It's a high-level platform for creating MapReduce programs used with Hadoop. ➢ Developed by Yahoo.
  41. 41. Hadoop Tools……. • HBase ➢ Used when you need random, real-time read/write access to your Big Data. ➢ Also used for storing historical data.
  42. 42. Hadoop Tools……. • Hue ➢ It's a web application for interacting with Apache Hadoop. It supports a file browser, a job tracker interface, Hive, Pig, and more.
  43. 43. Hadoop Tools……. • Sqoop ➢ It's a command-line interface application for transferring data between relational databases and Hadoop. ➢ Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server databases to Hadoop.
  44. 44. Hadoop Tools……. • Flume ➢ It's used for efficiently collecting, aggregating, and moving large amounts of distributed data or log data.
  45. 45. Hadoop Tools……. • Flume Model
  46. 46. Hadoop in the Enterprise…….
  47. 47. Many tools have been developed on top of Hadoop; they are available on the market and widely used in industry. More information is available from Cloudera, Hortonworks, and Google.com.
  48. 48. Thanks for your time today.