Big Data and Hadoop
2. What is the need for big data technology
when we already have robust, high-performing
relational database management systems?
3. Traditionally, data was stored in a structured format: rows,
columns, and tuples with primary keys (PK) and foreign keys (FK).
It was used mainly for transactional data analysis.
Later, data warehouses were used for offline data
(analysis done within the enterprise).
With the massive use of the Internet and social networking (Facebook,
LinkedIn), data became less structured.
Data was stored on a central server.
5. ‘Big Data’ is similar to ‘small data’, but
bigger
…but because the data is bigger, it requires
different approaches:
› techniques, tools, and architecture
…with an aim to solve new problems
› …or old problems in a better way
8. Open-source data storage and processing API
Massively scalable, automatically parallelizable
Based on work from Google:
GFS + MapReduce + BigTable
Current distributions are based on open-source and vendor work:
Apache Hadoop
Cloudera – CDH4 w/ Impala
Hortonworks
MapR
AWS
Windows Azure HDInsight
10. HDFS is a file system written in Java.
It sits on top of a native file system
and provides redundant storage for massive
amounts of data
using cheap, unreliable commodity hardware.
11. Data is split into blocks and stored on multiple
nodes in the cluster.
› Each block is usually 64 MB or 128 MB (configurable).
Each block is replicated multiple times (configurable).
› Replicas are stored on different DataNodes.
HDFS is optimized for large files (100 MB+).
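The block size and replication factor above are ordinary Hadoop settings. As a sketch, in Hadoop 2.x they can be set cluster-wide in hdfs-site.xml (the values shown are common choices, not requirements):

```xml
<!-- hdfs-site.xml: example block size and replication settings -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB per block -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>         <!-- each block stored on 3 DataNodes -->
  </property>
</configuration>
```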
13. NameNode
› only 1 per cluster
› metadata server and database
› the SecondaryNameNode helps with some housekeeping
JobTracker
› only 1 per cluster
› job scheduler
14. DataNodes
› 1–4000 per cluster
› block data storage
TaskTrackers
› 1–4000 per cluster
› task execution
15. A single NameNode stores all metadata:
filenames, the locations of each block on the DataNodes,
owner, group, etc.
All information is maintained in RAM for fast lookup.
File system metadata size is therefore limited to the amount
of available RAM on the NameNode.
16. DataNodes store file contents,
kept as opaque ‘blocks’ on the underlying
filesystem.
Different blocks of the same file are stored
on different DataNodes.
The same block is stored on three (or more)
DataNodes for redundancy.
17. DataNodes send heartbeats to the NameNode.
› After a period without any heartbeats, a DataNode is
assumed to be lost.
› The NameNode determines which blocks were on the lost
node.
› The NameNode finds other DataNodes with copies of
these blocks.
› These DataNodes are instructed to copy the blocks
to other nodes.
› Replication is actively maintained.
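The liveness check above can be sketched in plain Java. This is a hypothetical illustration, not Hadoop's actual NameNode code: the class name `HeartbeatMonitor` and the 10-second timeout are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the NameNode's liveness check: if a DataNode has
// not sent a heartbeat within the timeout, it is considered lost and its
// blocks must be re-replicated from the surviving copies.
public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a DataNode reports in.
    public void recordHeartbeat(String dataNodeId, long nowMillis) {
        lastHeartbeat.put(dataNodeId, nowMillis);
    }

    // A node is lost once the timeout elapses without a heartbeat.
    public boolean isLost(String dataNodeId, long nowMillis) {
        Long last = lastHeartbeat.get(dataNodeId);
        return last == null || nowMillis - last > timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10_000); // 10 s timeout
        monitor.recordHeartbeat("dn1", 0);
        monitor.recordHeartbeat("dn2", 0);
        monitor.recordHeartbeat("dn1", 9_000);             // dn1 keeps reporting
        System.out.println(monitor.isLost("dn1", 15_000)); // false
        System.out.println(monitor.isLost("dn2", 15_000)); // true: re-replicate its blocks
    }
}
```

In the real system the NameNode then schedules block copies from the surviving replicas, which is why replication is "actively maintained" rather than fixed at write time.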
18. The Secondary NameNode is not a failover
NameNode.
It performs memory-intensive administrative
functions for the NameNode
and should run on a separate machine.
19. [Cluster diagram: several slave nodes, each running a DataNode daemon
and a TaskTracker on top of the local Linux file system; a master node
running the NameNode daemon; and a job submission node running the
JobTracker.]
22. Preparing for MapReduce
Loading files: files are loaded into the file system (native file
system, HDFS, or cloud storage) in 64 MB or 128 MB blocks.
Output is immutable.
You define the Input, Map, Reduce, and Output steps.
Use Java or another programming language.
Work with key-value pairs.
23. Input: a set of key/value pairs
The user supplies two functions:
› map(k, v) → list(k1, v1)
› reduce(k1, list(v1)) → v2
(k1, v1) is an intermediate key/value pair
The output is the set of (k1, v2) pairs
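To make the two signatures concrete, here is a minimal pure-Java sketch of word count with no Hadoop dependency: `map` emits (word, 1) pairs, a stand-in for the framework groups the intermediate pairs by key, and `reduce` sums each group. The class and method names are illustrative only.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Pure-Java sketch of the map/reduce contract for word count:
// map(k, v) -> list(k1, v1), then group by k1, then reduce(k1, list(v1)) -> v2.
public class WordCount {
    // map: (lineNumber, lineText) -> [(word, 1), ...]
    static List<SimpleEntry<String, Integer>> map(long lineNo, String line) {
        List<SimpleEntry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    // reduce: (word, [1, 1, ...]) -> total count
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Simulates the framework: run map, group intermediate pairs by key, run reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long lineNo = 0;
        for (String line : lines) {
            for (SimpleEntry<String, Integer> kv : map(lineNo++, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick brown fox", "the lazy dog")));
        // {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In real Hadoop the grouping step in `run` is performed by the shuffle/sort phase across the cluster; only `map` and `reduce` are written by the user.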
29. Probably the most complex aspect of MapReduce!
Map side
› Map outputs are buffered in memory in a circular buffer.
› When the buffer reaches a threshold, its contents are “spilled” to disk.
› Spills are merged into a single, partitioned file (sorted within each
partition); the combiner runs here.
Reduce side
› First, map outputs are copied over to the reducer machine.
› “Sort” is a multi-pass merge of map outputs (happens in
memory and on disk); the combiner runs here.
› The final merge pass goes directly into the reducer.
31. Writable: defines a de/serialization protocol.
Every data type in Hadoop is a Writable.
WritableComparable: defines a sort order. All keys must be
of this type (but not values).
IntWritable
LongWritable
Text
…
Concrete classes for different data types.
SequenceFile: a binary encoding of a sequence of
key/value pairs.
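The Writable protocol boils down to two methods: write your fields to a DataOutput, and read them back from a DataInput. The JDK-only sketch below mimics that contract for a custom pair type; it deliberately does not implement Hadoop's actual `org.apache.hadoop.io.Writable` interface, so it stays dependency-free, and the class name `IntPair` is invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// JDK-only sketch of the Writable idea: a type that serializes itself to a
// DataOutput and re-populates itself from a DataInput.
public class IntPair {
    private int first;
    private int second;

    public IntPair() {}  // Writable types need a no-arg constructor
    public IntPair(int first, int second) { this.first = first; this.second = second; }

    // Serialize fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Deserialize fields in the same order, overwriting this instance.
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    public int getFirst() { return first; }
    public int getSecond() { return second; }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new IntPair(3, 7).write(new DataOutputStream(bytes));

        IntPair copy = new IntPair();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getFirst() + "," + copy.getSecond()); // 3,7
    }
}
```

A key type would additionally implement `Comparable` (Hadoop's `WritableComparable`) so the shuffle can sort it.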