This document summarizes a workshop on data analytics using big data tools held at Bharathiar University. It discusses the growth of data, limitations of conventional approaches to data analysis, and how the Hadoop framework addresses these issues. The key components of Hadoop including HDFS and MapReduce are explained. HDFS architecture and operations are described in detail.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
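The split/sort/reduce flow described above can be sketched in a few lines of plain Python. This is a toy word count, not the actual Hadoop API; the chunk contents and function names are illustrative.

```python
# Toy word count illustrating the MapReduce flow: map over independent
# input chunks, sort/group the map outputs by key, then reduce each group.
from itertools import groupby

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(word, counts):
    # Aggregate all counts emitted for one key.
    return (word, sum(counts))

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
# Map: each chunk is processed independently (in parallel, in Hadoop).
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
# Shuffle/sort: the framework sorts map outputs by key.
mapped.sort(key=lambda kv: kv[0])
# Reduce: each key's values are aggregated.
result = dict(reduce_phase(k, [v for _, v in g])
              for k, g in groupby(mapped, key=lambda kv: kv[0]))
print(result)  # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In Hadoop the map tasks would run on the nodes holding each chunk, and the sorted intermediate pairs would be shuffled over the network to the reduce tasks; here everything runs in one process.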
1. Workshop on Data Analytics Using Big Data Tools 2016 – Bharathiar University
K. Santhiya, Ph.D Research Scholar, and Dr. V. Bhuvaneswari, Asst. Professor, Dept. of Computer Applications, Bharathiar University – WDABT 2016
2. Introduction to Hadoop
Presented by K. Santhiya, Ph.D Research Scholar, Department of Computer Applications, Bharathiar University
Under the guidance of Dr. V. Bhuvaneswari, Assistant Professor, Department of Computer Applications, Bharathiar University
3. Agenda
• World of Data: a few instances
• Conventional Approaches: limitations
• Hadoop Framework: terminology review
• Hadoop Components: HDFS and MapReduce
• HDFS in Detail
• Hadoop Ecosystem
4. Data Explosion
2.5 quintillion bytes of data are created each day.
5. Worldwide Data
[Chart comparing the data created since the beginning of time with the data created in the last two years.]
6. 2.9 375 20 24 50 700 1.3 72
Million MB Hrs PB Million Billion Exabytes items
thE World of data
3
K.Santhiya , Ph.d Research Scholar , Dr.V.Bhuvaneswari, Asst.Professor, Dept. of Comp. Appll., Bharathiar University,-
WDABT 2016
7. The minimum size that a big data file starts with is at least 1 terabyte.
10. Conventional Approaches
• RDBMS
• OS file system
• SQL queries
• Custom frameworks: C/C++, Perl, Python
11. Issues in Legacy Systems
• Limited storage capacity
• Limited processing capacity
• No scalability
• Single point of failure
• Sequential processing
• RDBMSs can handle only structured data
• Preprocessing of data is required
• Information is collected according to current business needs
12. How do we mine (and mind) all this data? How do we resolve all these issues?
13. Mr. Hadoop says he has a solution to our BIG problem!
16. Companies Using Hadoop
[Logo collage of companies using Hadoop.]
17. What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large datasets across clusters of commodity computers using a simple programming model.
Concept: moving computation is more efficient than moving large data.
24. Hadoop Core Services
i. NameNode
ii. DataNode
iii. ResourceManager
iv. ApplicationMaster
v. NodeManager
vi. Secondary NameNode
25. HDFS – Real-Life Connect
• A college library was gifted a massive collection of very popular books by a patron. The librarian decided to arrange the books in a small rack and to distribute multiple copies of each book across other racks, so that students could find the books easily. Similarly, HDFS creates multiple copies of a data block and keeps them in separate systems for easy access.
26. What is HDFS?
• Hadoop Distributed File System
• A highly fault-tolerant, distributed, reliable, scalable file system for data storage
• Stores multiple copies of data on different nodes
• A file is split into blocks that are stored on multiple machines
• A Hadoop cluster typically has a single NameNode and a number of DataNodes
27. HDFS Blocks
• Files are broken into large blocks, typically 128 MB in size
• Blocks are replicated for reliability
• One replica on the local node, another replica on a remote rack, a third replica on the local rack; additional replicas are placed randomly
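The replica-placement rule above can be sketched as a small function. This is a minimal illustration of the policy as stated on the slide (first replica on the writer's node, second on a remote rack, third on the writer's local rack), not the actual HDFS placement code; node and rack names are hypothetical.

```python
# Sketch of the slide's replica-placement policy. `nodes` maps each
# node name to the rack it sits on; names here are made up.
def place_replicas(writer, nodes):
    local_rack = nodes[writer]
    # First replica: the node writing the block.
    replicas = [writer]
    # Second replica: any node on a different (remote) rack.
    replicas.append(next(n for n, r in nodes.items() if r != local_rack))
    # Third replica: another node on the writer's local rack.
    replicas.append(next(n for n, r in nodes.items()
                         if r == local_rack and n != writer))
    return replicas

cluster = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
print(place_replicas("n1", cluster))  # ['n1', 'n3', 'n2']
```

Spreading replicas across racks means a block survives the loss of a whole rack, while keeping one copy on the local node keeps the write fast.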
28. HDFS Blocks (contd.)
Advantages of HDFS blocks:
• Fixed size
• A chunk of a file smaller than the block size uses only the space it needs, e.g., a 420 MB file is split into three 128 MB blocks and one 36 MB block
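The 420 MB example can be worked through with a few lines of Python, assuming the 128 MB default block size mentioned on the previous slide (the function name is illustrative):

```python
# Split a file of the given size (in MB) into fixed-size HDFS-style blocks.
BLOCK_SIZE_MB = 128

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # The last block only occupies as much space as is left over.
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(420))  # [128, 128, 128, 36]
```

The final 36 MB block illustrates the advantage stated above: a chunk smaller than the block size consumes only the space it needs.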
35. NameNode in HA Mode
[Diagram.]
36. NameNode HA Architecture
[Diagram.]
37. Business Scenario
Olivia Tyler is the EVP of IT operations at Nutri Worldwide, Inc., and she has decided to use HDFS for storing big data. She will use the HDFS shell to store the data in a Hadoop file system, and she will execute various commands on it.
41. Data Transfer Components
[Diagram.]
42. Data Store Components
• The following are the data store components of the Hadoop ecosystem:
• HBase: a distributed, scalable big data store
• Cassandra: a scalable, consistent, distributed, structured key-value store
• Accumulo: a sorted, distributed key-value data storage and retrieval system
43. Serialization Components
• The serialization components are Avro, Trevni, and Thrift.
• Avro is a data serialization system.
• Trevni is a column file format intended to permit compatible, independent implementations that read and/or write files in this format.
• Thrift is a framework for scalable, cross-language services development.
44. Job Execution Components
• [Diagram of the job execution components.]
46. Conclusion
47. References
• J. Gantz and D. Reinsel, "The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east," in Proc. IDC iView, IDC Anal. Future, 2012.
• (2015) [Online]. Available: http://expandedramblings.com/index.php/by-the-numbers-a-gigantic-list-of-google-stats-and-facts/
• D. Evans and R. Hutley, "The explosion of data," white paper, 2010.
• Seema Acharya and Subhashini Chelleppan, "Big Data and Analytics," Wiley India Pvt Ltd, 2015.
• Dhruba Borthakur, "HDFS Architecture Guide," 2013.
• [Online]. Available: http://hortonworks.com/hadoop/flume/#section_2
• Marko Grobelnik, "Big-Data Tutorial," white paper, 2012.