11/3/2014 
Hadoop 
Technology 
Solution to BIG DATA processing 
Shital Katkar 
VJTI, MUMBAI
1 | H A D O O P T e c h n o l o g y 
REPORT TITLED 
“ HADOOP” 
Submitted 
in partial fulfillment of 
the requirements for the Degree of 
Master of Computer Applications 
(MCA) 
By 
Shital Krishna Katkar 
Roll No. 132011005 
Under the guidance of 
Prof. Aakriti Soni 
Department Of Computer Technology 
Veermata Jijabai Technological Institute 
(Autonomous Institute, Affiliated To University of Mumbai) 
Mumbai – 400019 
Year 2014-2015 
SHITAL KATKAR
VEERMATA JIJABAI TECHNOLOGICAL INSTITUTE 
MATUNGA, MUMBAI – 400019 
CERTIFICATE 
This is to certify that the seminar report titled 
“HADOOP” 
BY 
SHITAL KRISHNA KATKAR 
Roll No. 132011005 
Class: MCA-(SEM-III) 
in Academic year 2014-2015 
Evaluator: 
Date: 
Table of Contents 
INTRODUCTION ............................................................................................................................ 4 
LITERATURE SURVEY .................................................................................................................... 4 
PROBLEM STATEMENT .................................................................................................................. 5 
1 Big Data Problem ....................................................................................................................... 5 
1.1 What is Big Data?........................................................................................................... 5 
1.2 Characteristics of Big Data .............................................................................................. 5 
1.3 Big Data Challenges........................................................................................................ 6 
PROBLEM SOLUTION.................................................................................................................... 7 
2 Introduction to Hadoop ......................................................................................................... 7 
2.1 Origin of name Hadoop .................................................................................................. 7 
2.2 Components of Hadoop ................................................................................................. 7 
3 HDFS .................................................................................................................................... 8 
3.1 HDFS architecture .......................................................................................................... 8 
3.2 Types of Nodes .............................................................................................................. 9 
3.2.1 NameNode............................................................................................................. 9 
3.2.2 DataNode............................................................................................................... 9 
3.3 Features ...................................................................................................................... 10 
4 Map Reduce........................................................................................................................ 11 
Map Function...................................................................................................................... 11 
Reduce Function ................................................................................................................. 11 
4.1 Architecture ................................................................................................................ 11 
4.2 Types of Nodes ............................................................................................................ 12 
4.2.1 Job Tracker........................................................................................................... 12 
4.2.2 Task Tracker.......................................................................................... 12 
5. Examples ............................................................................................................................ 13 
5.1 Common friend list ............................................................................................................ 13 
5.2 Word Count................................................................................................................. 15 
6 Hadoop Projects ...................................................................................................................... 16 
7 Who uses Hadoop? .................................................................................................................. 17 
CONCLUSION.............................................................................................................................. 19 
REFERENCES ............................................................................................................................... 20
INTRODUCTION 
The size of the databases used in today’s enterprises has been growing at an exponential rate. 
Simultaneously, the need to process and analyze these large volumes of data for business 
decision making has also increased. 
In several business and scientific applications, there is a need to process terabytes of data 
efficiently on a daily basis. This has contributed to the big data problem faced by the industry: 
conventional database systems and software tools cannot manage or process such large data 
sets within tolerable time limits. Processing of data can include various operations depending on 
usage, such as culling, tagging, highlighting, indexing, searching and faceting. It is not possible 
for a single machine, or even a few machines, to store or process this huge amount of data in a 
finite time period. This report covers experimental work on the big data problem and its solution 
using a Hadoop cluster, the Hadoop Distributed File System (HDFS) for storage, and parallel 
processing of large data sets with the MapReduce programming framework. We have done a 
prototype implementation of a Hadoop cluster, HDFS storage and the MapReduce framework, 
evaluated on prototype big data application scenarios. 
LITERATURE SURVEY 
I chose this paper (Aditya B. Patel, Manashvi Birla, Ushma Nair, “Addressing Big Data 
Problem Using Hadoop and Map Reduce”) because it is based on Hadoop technology, which 
solves the problem of big data processing. For this main paper we used two reference 
papers (Kamalpreet Singh, Raviner Kaur, “Hadoop: Addressing Challenges of Big Data”, IEEE 2014, 
and Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, “The Hadoop Distributed 
File System”, Yahoo!, IEEE 2010). 
PROBLEM STATEMENT 
1 BIG DATA PROBLEM 
1.1 WHAT IS BIG DATA? 
Big Data refers to data sets whose size is beyond the ability of commonly used software tools 
to capture, store, search, share, analyze and visualize. The size of big data is continuously 
increasing. 
Typical examples of big data in the current scenario include web logs, sensor networks, 
satellite and geo-spatial data, social data from social networks, internet text and documents, 
internet search indexing, call detail records, photographic archives, etc. 
1.2 CHARACTERISTICS OF BIG DATA 
Big data can be described by the following characteristics: 
• Volume 
The quantity of data generated is very important in this context. It is the size of the 
data that determines its value and potential, and whether it can actually be considered 
Big Data at all. The name ‘Big Data’ itself contains a term related to size, hence this 
characteristic. 
• Variety 
The next aspect of Big Data is its variety: the category to which the data belongs is an 
essential fact that data analysts need to know. This knowledge helps the people who 
closely analyse the data to use it effectively, and thus upholds the importance of Big Data. 
• Velocity 
The term ‘velocity’ in this context refers to the speed of data generation, i.e. how fast the 
data is generated and processed to meet the demands and challenges that lie on the path 
of growth and development. 
• Variability 
This is a factor that can be a problem for those who analyse the data. It refers to the 
inconsistency the data can show at times, which hampers the process of handling and 
managing the data effectively. 
• Complexity 
Data management can become a very complex process, especially when large volumes of 
data come from multiple sources. These data need to be linked, connected and correlated 
in order to grasp the information they are supposed to convey. This situation is therefore 
termed the ‘complexity’ of Big Data. 
1.3 BIG DATA CHALLENGES 
Following are the various challenges faced in large-scale data management: 
• Scalability 
• Unstructured data 
• Accessibility 
• Real-time analytics 
• Fault tolerance 
PROBLEM SOLUTION 
2 INTRODUCTION TO HADOOP 
Hadoop provides a solution to manage big data, and it overcomes many of the above-stated 
challenges. It is used for analysing and processing big data, and it supports a multiprocessing 
environment. The Hadoop framework is mainly written in Java, with some hardware-related parts 
in C. It is open source, under the Apache license, and was developed by Doug Cutting. 
Hadoop provides a parallel processing model: queries are split, distributed across parallel 
nodes and processed in parallel. The results are then gathered and delivered. 
2.1 ORIGIN OF NAME HADOOP 
Hadoop is not an acronym; it is a made-up name. The project’s creator, Doug Cutting, explains 
how the name came about: 
“My kid gave this name to his toy, a yellow elephant. The name is short, relatively 
easy to spell and pronounce. It is meaningless and not used elsewhere.” 
- Doug Cutting, creator of Hadoop 
Hence the logo is a yellow elephant. 
It is a fact that an elephant cannot jump, but it can move heavy weights from one place to another. 
Similarly, Hadoop is not suited to small queries (say, day-to-day transactions), but it can handle 
very large amounts of data. 
2.2 COMPONENTS OF HADOOP 
Hadoop consists of two main components. 
Fig. Main Components of Hadoop
MapReduce is the processing part of Hadoop and manages jobs. HDFS stands for Hadoop Distributed 
File System; it redundantly stores all the data required for computation. 
Hadoop’s origins lie in the Google File System (GFS) and Google’s MapReduce, which became Apache 
HDFS and Apache MapReduce. 
3 HDFS 
HDFS stores all data required for computation. It provides fault tolerance, i.e. if a component fails, a 
backup component or procedure can immediately take its place with no loss of service. It is designed 
to run on commodity hardware, which avoids the cost of buying special expensive hardware and RAID 
systems. It provides high-throughput access to application data and is therefore suitable for applications 
that have large data sets. It has a master-slave architecture, i.e. large data is automatically split into 
chunks, each of which is managed by a different node in the Hadoop cluster. 
3.1 HDFS ARCHITECTURE 
Fig – Hadoop distributed file system architecture 
A large amount of data is split across nodes. HDFS does not use data protection mechanisms such as 
RAID to make data durable; instead it replicates data on multiple nodes for reliability. The master node 
is known as the NameNode and the slaves are known as DataNodes. 
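The split-and-replicate idea can be sketched in a few lines of Python. This is an illustration only, not HDFS code: the round-robin placement, the block size and the node names are simplifying assumptions (real HDFS uses a rack-aware placement policy, described later).

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Split a file into fixed-size blocks and assign each block
    to `replication` distinct DataNodes, round-robin style."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 300 MB file with a 128 MB block size is split into 3 blocks;
# each block is replicated on 3 of the 4 (hypothetical) DataNodes.
plan = place_blocks(300 * 2**20, 128 * 2**20, ["dn1", "dn2", "dn3", "dn4"])
```

Because no block keeps all of its replicas on one node, the file survives the loss of any single DataNode.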
3.2 TYPES OF NODES 
3.2.1 NameNode 
• Only 1 per Hadoop Cluster 
There is only one NameNode in the cluster. 
• Manages the file system namespace and metadata 
Data does not go through, and is not stored on, the NameNode; rather, the NameNode 
manages the other nodes and the metadata. Files and directories are represented on the 
NameNode by inodes, which record attributes such as permissions, modification and access 
times, namespace and disk space quotas. 
• Single point of failure 
To compensate for the fact that there is only one NameNode, one should configure it to 
write a copy of its state information to multiple locations, such as a local disk and an NFS 
mount. Unlike the DataNodes, the NameNode does not have to run on inexpensive 
commodity hardware. 
• Large memory requirement 
The NameNode should also have as much RAM as possible because it keeps the entire file 
system metadata in memory. 
3.2.2 DataNode 
• Many per Hadoop Cluster 
A typical HDFS cluster has many DataNodes. 
• Manages blocks of data and serves them to clients 
Blocks from different files can be stored on the same DataNode. When a client requests a 
file, it finds out from the NameNode which DataNodes store the blocks that make up that 
file, and then reads the blocks directly from the individual DataNodes. 
• Periodically reports to the NameNode the list of blocks it stores 
During startup each DataNode connects to the NameNode and performs a handshake. The 
purpose of the handshake is to verify the namespace ID and the software version of the 
DataNode. If either does not match that of the NameNode, the DataNode automatically 
shuts down. The DataNode also periodically reports to the NameNode to provide and 
update information about the list of blocks it stores. 
• Suitable for inexpensive, commodity hardware 
DataNodes do not require expensive enterprise hardware or replication at the hardware 
layer. They are designed to run on commodity hardware, and replication is provided at 
the software layer. 
3.3 FEATURES 
• HDFS is rack aware 
HDFS is rack aware in the sense that the NameNode and the JobTracker obtain a list of rack 
IDs corresponding to each of the slave nodes (DataNodes) and create a mapping between 
IP addresses and rack IDs. HDFS uses this knowledge to replicate data across different 
racks, so that data is not lost in the event of a complete rack power outage or switch failure. 
• Job performance 
Hadoop performs speculative execution: if a machine in the cluster is slow and the 
map/reduce tasks running on it are holding up the entire map/reduce phase, Hadoop runs 
redundant copies of the same task on other machines; whichever copy completes first 
reports back to the JobTracker, and its results are carried forward into the next phase. 
• Fault tolerance 
The TaskTracker spawns a separate JVM process for each task to ensure that task failures 
do not bring down the TaskTracker itself. 
The TaskTracker keeps sending heartbeat messages to the JobTracker to say that it is alive 
and to keep it updated with the number of empty slots available for running more tasks. 
The JobTracker checkpoints some of its work in the file system. Whenever it starts up, it 
checks what it was doing up to the last checkpoint and resumes any incomplete jobs. In 
earlier versions, if the JobTracker went down, all the active job information was lost. 
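The rack-aware replication described above can be sketched as follows. This is a simplified illustration, not the actual HDFS placement code: real HDFS places the first replica on the writer’s node, the second on a node in a different rack, and the third on another node in that same remote rack. The sketch below keeps only the “one local replica, remaining replicas off-rack” shape, and all node and rack names are hypothetical.

```python
import random

def choose_replica_nodes(rack_of, local_node, replication=3):
    """Pick nodes for one block: first replica on the local node,
    remaining replicas on nodes in other racks."""
    local_rack = rack_of[local_node]
    remote_nodes = [n for n in rack_of if rack_of[n] != local_rack]
    return [local_node] + random.sample(remote_nodes, replication - 1)

# Hypothetical 5-node, 2-rack cluster.
racks = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2", "dn5": "r2"}
nodes = choose_replica_nodes(racks, "dn1")
# nodes[0] is "dn1"; the other replicas live outside rack r1,
# so the block survives a complete power outage of rack r1.
```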
4 MAP REDUCE 
In 2004, Google published a paper on a process called MapReduce. With MapReduce, queries are split 
and distributed across parallel nodes and processed in parallel (the map step); the results are then 
gathered and delivered (the reduce step). The framework was very successful, and others wanted to 
replicate the algorithm, so an implementation of the MapReduce framework was adopted by the 
Apache open source project Hadoop. 
It provides a parallel processing model and an associated implementation for processing huge 
amounts of data. It consists of two steps: 
Map Function 
A MapReduce job divides the input data set into independent chunks which are processed 
by the map function in a completely parallel manner. The map function generates a set of 
intermediate <key, value> pairs. 
Map (k1, v1) → list (k2, v2) 
Reduce Function 
The output of the map functions serves as input to the reduce function, which 
reduces/merges all <key, value> pairs that share a key. 
Reduce (k2, list(v2)) → list (v3) 
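Between the map and reduce steps, the framework sorts and groups the intermediate pairs by key, turning list(k2, v2) into (k2, list(v2)) for the reducers. A minimal Python sketch of that shuffle step (the function name `shuffle` is our own, not a Hadoop API):

```python
from collections import defaultdict

def shuffle(mapped_pairs):
    """Group the intermediate (k2, v2) pairs emitted by the mappers
    so that each reducer receives one (k2, list(v2)) group."""
    groups = defaultdict(list)
    for k2, v2 in mapped_pairs:
        groups[k2].append(v2)
    return dict(groups)

# shuffle([("a", 1), ("b", 2), ("a", 3)]) -> {"a": [1, 3], "b": [2]}
```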
4.1 ARCHITECTURE
Fig – MapReduce Architecture 
4.2 TYPES OF NODES 
4.2.1 Job Tracker 
• Only 1 per Hadoop Cluster 
There is only one JobTracker in the cluster. 
• Manages and schedules MapReduce jobs 
Clients submit MapReduce jobs to the JobTracker, which splits each job into map and 
reduce tasks and schedules them on TaskTrackers, preferably on nodes that already hold 
the input data blocks. 
• Single point of failure 
If the JobTracker goes down, all running jobs halt, so it should run on a reliable machine. 
• Monitors task progress 
The JobTracker receives heartbeat messages from the TaskTrackers, reassigns tasks from 
failed or slow nodes, and reports overall job status back to the client. 
4.2.2 Task Tracker 
• Many per Hadoop Cluster 
A typical cluster runs one TaskTracker on each slave node, alongside the DataNode, so that 
map tasks can be executed on the node that stores the data. 
• Executes map and reduce tasks 
Each TaskTracker has a fixed number of slots for map and reduce tasks and spawns a 
separate JVM process for each task, so that a failing task does not bring down the 
TaskTracker itself. 
• Periodically reports to the JobTracker 
The TaskTracker sends heartbeat messages to the JobTracker to say that it is alive and to 
keep it updated with the number of empty slots available for running more tasks. 
5. EXAMPLES 
5.1 COMMON FRIEND LIST 
Consider a social networking site (say, Facebook). 
(NOTE: Facebook may or may not actually use this implementation; it is just an example.) 
Facebook stores a list of friends for each user. Note that friendship on Facebook is bidirectional (if I’m 
your friend, then you’re mine). When you visit someone’s profile, you see a list of friends that you have in 
common. Do you think that Facebook calculates it on the spot? No! It serves millions of requests 
every second, so it pre-computes results every day wherever that can reduce the processing time 
of a request. 
Assume that there are 5 people A, B, C, D, E on Facebook. A friend list is denoted by Person → (List of 
Friends). So here our friend lists are: 
• A → B C D 
• B → A C D E 
• C → A B D E 
• D → A B C E 
• E → B C D 
i.e. A is a friend of B, C and D; B is a friend of A, C, D and E; and so on. Each line will be an argument to the 
mapper step. For every friend in the list of friends, the mapper will output a key-value pair. The key will 
be the pair of the person and that friend, and the value will be the whole list of friends. 
For Map(A → B C D): 
( A B ) → B C D 
( A C ) → B C D 
( A D ) → B C D 
Similarly this is done for every person on Facebook. Each key is sorted so that the two friends in it are in 
order, causing all pairs of friends to go to the same reducer (e.g. (B, A) is sorted to (A, B)). 
Map(A → B C D) 
( A B ) → B C D 
( A C ) → B C D 
( A D ) → B C D 
Map(B → A C D E) 
( A B ) → A C D E 
( B C ) → A C D E 
( B D ) → A C D E 
( B E ) → A C D E 
Map(C → A B D E) 
( A C ) → A B D E 
( B C ) → A B D E 
( C D ) → A B D E 
( C E ) → A B D E 
Map(D → A B C E) 
( A D ) → A B C E 
( B D ) → A B C E 
( C D ) → A B C E 
( D E ) → A B C E 
Map(E → B C D) 
( B E ) → B C D 
( C E ) → B C D 
( D E ) → B C D 
Before we send these key-value pairs to the reduce step, we group them by their keys and get: 
( A B ) → (B C D) (A C D E) 
( A C ) → (B C D) (A B D E) 
( A D ) → (B C D) (A B C E) 
( B C ) → (A C D E) (A B D E) 
( B D ) → (A C D E) (A B C E) 
( B E ) → (A C D E) (B C D) 
( C D ) → (A B D E) (A B C E) 
( C E ) → (A B D E) (B C D) 
( D E ) → (A B C E) (B C D) 
Now each line is passed as an argument to a reducer. The reduce function simply takes the intersection 
of the two lists. 
The result after reduction is: 
( A B ) → (C D) 
( A C ) → (B D) 
( A D ) → (B C) 
( B C ) → (A D E) 
( B D ) → (A C E) 
( B E ) → (C D) 
( C D ) → (A B E) 
( C E ) → (B D) 
( D E ) → (B C) 
Now when D visits B’s profile, we can quickly look up (B, D) and see that they have 3 friends in common 
(A, C and E). 
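The whole pipeline above can be simulated in plain Python. This is a local sketch of the algorithm, not Hadoop API code; the `mapper`/`reducer` names and the in-memory shuffle are our own stand-ins for the framework:

```python
from collections import defaultdict

# The friend lists from the example.
friends = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
}

def mapper(person, friend_list):
    # Emit one ((person, friend), friend_list) pair per friend, with the
    # key sorted so that (B, A) and (A, B) meet at the same reducer.
    for friend in friend_list:
        yield tuple(sorted((person, friend))), friend_list

def reducer(key, value_lists):
    # The common friends of a pair are the intersection of its lists.
    common = set(value_lists[0])
    for lst in value_lists[1:]:
        common &= set(lst)
    return common

# Shuffle: group the intermediate pairs by key.
groups = defaultdict(list)
for person, friend_list in friends.items():
    for key, value in mapper(person, friend_list):
        groups[key].append(value)

common_friends = {key: reducer(key, lists) for key, lists in groups.items()}
# common_friends[("B", "D")] == {"A", "C", "E"}
```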
5.2 WORD COUNT 
Hadoop also provides a very good solution for counting the words in a given set of files: the mapper 
emits a <word, 1> pair for every word in its input split, and the reducer sums the values for each word 
to produce the final counts. 
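A local Python sketch of the word-count job (again an illustration of the algorithm, not Hadoop API code; on a real cluster the mapper and reducer run as distributed tasks over HDFS files):

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Sum all the 1s emitted for this word.
    return word, sum(counts)

def word_count(lines):
    groups = defaultdict(list)          # the shuffle step
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    return dict(reducer(word, counts) for word, counts in groups.items())

# word_count(["the quick fox", "the lazy dog"])
# -> {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```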
6 HADOOP PROJECTS 
• Eclipse is a popular IDE donated by IBM to the open source community. 
• Lucene is a text search engine library written in Java. 
• HBase is Hadoop’s database, initiated by Powerset and used for table 
storage of semi-structured data. 
• Hive provides data warehousing tools to extract, transform and load 
data, and then query the data stored in Hadoop files. It was initiated by 
Facebook and uses an SQL-like query language and a metastore. 
• Pig is a high-level language that generates MapReduce code to analyze 
large data sets. It was initiated by Yahoo!. 
• Jaql is a query language for JavaScript Object Notation (JSON). 
• ZooKeeper is a centralized configuration service and naming registry for large 
distributed systems, initiated by Yahoo!. 
• Avro is a data serialization system. 
• UIMA is an architecture for the development, discovery, 
composition and deployment of components for the analysis of 
unstructured data. 
7 WHO USES HADOOP? 
Companies that offer services on or based around Hadoop are listed at 
http://wiki.apache.org/hadoop/PoweredBy. Following are some of these companies and, in 
their own words, how they use Hadoop. 
 A9.com - Amazon* 
o We build Amazon's product search indices using the streaming API and pre-existing 
C++, Perl, and Python tools. 
o We process millions of sessions daily for analytics, using both the Java and streaming 
APIs. 
o Our clusters vary from 1 to 100 nodes 
 AOL 
o We use Apache Hadoop for variety of things ranging from ETL style processing and 
statistics generation to running advanced algorithms for doing behavioural analysis 
and targeting. 
o The cluster that we use for mainly behavioural analysis and targeting has 150 
machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB 
hard-disk. 
 EBay 
o 532 nodes cluster (8 * 532 cores, 5.3PB). 
o Heavy usage of Java MapReduce, Apache Pig, Apache Hive, Apache HBase 
o Using it for Search optimization and Research. 
 Facebook 
o We use Apache Hadoop to store copies of internal log and dimension data sources and 
use it as a source for reporting/analytics and machine learning. 
o Currently we have 2 major clusters: 
 A 1100-machine cluster with 8800 cores and about 12 PB raw storage. 
 A 300-machine cluster with 2400 cores and about 3 PB raw storage. 
 Each (commodity) node has 8 cores and 12 TB of storage. 
 We are heavy users of both streaming as well as the Java APIs. We have built 
a higher level data warehousing framework using these features called Hive 
We have also developed a FUSE implementation over HDFS.
 LinkedIn 
o We have multiple grids divided up based upon purpose. 
 Twitter 
o We use Apache Hadoop to store and process tweets, log files, and many other types 
of data generated across Twitter. We store all data as compressed LZO files. 
o We use both Scala and Java to access Hadoop's MapReduce APIs 
o We use Apache Pig heavily for both scheduled and ad-hoc jobs, due to its ability to 
accomplish a lot with few statements. 
o We employ committers on Apache Pig, Apache Avro, Apache Hive, and Apache 
Cassandra, and contribute much of our internal Hadoop work to open source. 
 Yahoo! 
o More than 100,000 CPUs in >40,000 computers running Hadoop 
o Our biggest cluster: 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) 
 Used to support research for Ad Systems and Web Search 
 Also used to do scaling tests to support development of Apache Hadoop on 
larger clusters 
o >60% of Hadoop Jobs within Yahoo are Apache Pig jobs.
CONCLUSION 
In this work, we have explored the solution to the big data problem using a Hadoop cluster, HDFS 
and the MapReduce programming framework on prototype big data application scenarios. The 
results obtained from various experiments indicate that this approach addresses the big data 
problem favourably. 
As big data continues down its path of growth, there is no doubt that these innovative approaches, 
utilizing Hadoop software, will be central to allowing companies to reach their full potential with 
data. Additionally, this rapid advancement of data technology has sparked a rising demand for the 
next generation of engineers who can build up this powerful infrastructure. The cost of the 
technology and the talent may not be cheap, but for all of the value that big data is capable of 
bringing to the table, companies are finding that it is a very worthy investment. 
REFERENCES 
1. Aditya B. Patel, Manashvi Birla, Ushma Nair, “Addressing Big Data Problem Using Hadoop and 
Map Reduce” 
2. Kamalpreet Singh, Raviner Kaur, “Hadoop: Addressing Challenges of Big Data”, IEEE 2014 
3. Big data - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Big_data 
4. Steve Krenzel, “MapReduce: Finding Friends”, http://stevekrensel.com/finding-friends-with-mapreduce 
5. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, “The Hadoop Distributed 
File System”, Yahoo!, IEEE 2010 
6. www.bigdatauniversity.com 
7. Hadoop, HDFS, MapReduce and Hive - Some salient understandings: Hadoop - NameNode, 
DataNode, Job Tracker and TaskTracker, 
http://hadoop-gyan.blogspot.in/2012/11/hadoop-namenode-datanode-job-tracker.html

 
Big data is a broad term for data sets so large or complex that tr.docx
Big data is a broad term for data sets so large or complex that tr.docxBig data is a broad term for data sets so large or complex that tr.docx
Big data is a broad term for data sets so large or complex that tr.docxhartrobert670
 
Sameer Kumar Das International Conference Paper 53
Sameer Kumar Das International Conference Paper 53Sameer Kumar Das International Conference Paper 53
Sameer Kumar Das International Conference Paper 53Mr.Sameer Kumar Das
 
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATAA REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATAIJMIT JOURNAL
 
A Review on Classification of Data Imbalance using BigData
A Review on Classification of Data Imbalance using BigDataA Review on Classification of Data Imbalance using BigData
A Review on Classification of Data Imbalance using BigDataIJMIT JOURNAL
 
IRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and ToolsIRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and ToolsIRJET Journal
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and howbobosenthil
 
IRJET- Big Data: A Study
IRJET-  	  Big Data: A StudyIRJET-  	  Big Data: A Study
IRJET- Big Data: A StudyIRJET Journal
 
A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Scienceijtsrd
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviationranjit banshpal
 
Big Data - Insights & Challenges
Big Data - Insights & ChallengesBig Data - Insights & Challenges
Big Data - Insights & ChallengesRupen Momaya
 

Similar to Big data processing using - Hadoop Technology (20)

Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A Review
 
How do data analysts work with big data and distributed computing frameworks.pdf
How do data analysts work with big data and distributed computing frameworks.pdfHow do data analysts work with big data and distributed computing frameworks.pdf
How do data analysts work with big data and distributed computing frameworks.pdf
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONSBIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
 
Big Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible SolutionsBig Data Summarization : Framework, Challenges and Possible Solutions
Big Data Summarization : Framework, Challenges and Possible Solutions
 
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
 
Big data is a broad term for data sets so large or complex that tr.docx
Big data is a broad term for data sets so large or complex that tr.docxBig data is a broad term for data sets so large or complex that tr.docx
Big data is a broad term for data sets so large or complex that tr.docx
 
Complete-SRS.doc
Complete-SRS.docComplete-SRS.doc
Complete-SRS.doc
 
Sameer Kumar Das International Conference Paper 53
Sameer Kumar Das International Conference Paper 53Sameer Kumar Das International Conference Paper 53
Sameer Kumar Das International Conference Paper 53
 
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATAA REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
 
A Review on Classification of Data Imbalance using BigData
A Review on Classification of Data Imbalance using BigDataA Review on Classification of Data Imbalance using BigData
A Review on Classification of Data Imbalance using BigData
 
IRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and ToolsIRJET- A Comparative Study on Big Data Analytics Approaches and Tools
IRJET- A Comparative Study on Big Data Analytics Approaches and Tools
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and how
 
IRJET- Big Data: A Study
IRJET-  	  Big Data: A StudyIRJET-  	  Big Data: A Study
IRJET- Big Data: A Study
 
A Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data ScienceA Review Paper on Big Data and Hadoop for Data Science
A Review Paper on Big Data and Hadoop for Data Science
 
using big-data methods analyse the Cross platform aviation
 using big-data methods analyse the Cross platform aviation using big-data methods analyse the Cross platform aviation
using big-data methods analyse the Cross platform aviation
 
Big Data - Insights & Challenges
Big Data - Insights & ChallengesBig Data - Insights & Challenges
Big Data - Insights & Challenges
 
Big data
Big dataBig data
Big data
 

More from Shital Kat

Opinion Mining
Opinion MiningOpinion Mining
Opinion MiningShital Kat
 
Query By humming - Music retrieval technology
Query By humming - Music retrieval technologyQuery By humming - Music retrieval technology
Query By humming - Music retrieval technologyShital Kat
 
Query By Humming - Music Retrieval Technique
Query By Humming - Music Retrieval TechniqueQuery By Humming - Music Retrieval Technique
Query By Humming - Music Retrieval TechniqueShital Kat
 
School admission process management system (Documention)
School admission process management system (Documention)School admission process management system (Documention)
School admission process management system (Documention)Shital Kat
 
WiFi technology Writeup
WiFi technology WriteupWiFi technology Writeup
WiFi technology WriteupShital Kat
 
WIFI Introduction (PART I)
WIFI Introduction (PART I)WIFI Introduction (PART I)
WIFI Introduction (PART I)Shital Kat
 

More from Shital Kat (8)

Opinion Mining
Opinion MiningOpinion Mining
Opinion Mining
 
Query By humming - Music retrieval technology
Query By humming - Music retrieval technologyQuery By humming - Music retrieval technology
Query By humming - Music retrieval technology
 
Query By Humming - Music Retrieval Technique
Query By Humming - Music Retrieval TechniqueQuery By Humming - Music Retrieval Technique
Query By Humming - Music Retrieval Technique
 
School admission process management system (Documention)
School admission process management system (Documention)School admission process management system (Documention)
School admission process management system (Documention)
 
WiFi technology Writeup
WiFi technology WriteupWiFi technology Writeup
WiFi technology Writeup
 
Wifi Security
Wifi SecurityWifi Security
Wifi Security
 
WiFi part II
WiFi part IIWiFi part II
WiFi part II
 
WIFI Introduction (PART I)
WIFI Introduction (PART I)WIFI Introduction (PART I)
WIFI Introduction (PART I)
 

Recently uploaded

ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsLeah Henrickson
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfFIDO Alliance
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Hiroshi SHIBATA
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...FIDO Alliance
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftshyamraj55
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptxFIDO Alliance
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentationyogeshlabana357357
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Your enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jYour enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jNeo4j
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxFIDO Alliance
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfFIDO Alliance
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?Paolo Missier
 

Recently uploaded (20)

ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Your enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jYour enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4j
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
 

Big data processing using Hadoop Technology

  • 4. Table of Contents

    INTRODUCTION ............................................ 4
    LITERATURE SURVEY ....................................... 4
    PROBLEM STATEMENT ....................................... 5
    1 Big Data Problem ...................................... 5
      1.1 What is Big Data? ................................. 5
      1.2 Characteristics of Big Data ....................... 5
      1.3 Big Data Challenges ............................... 6
    PROBLEM SOLUTION ........................................ 7
    2 Introduction to Hadoop ................................ 7
      2.1 Origin of name Hadoop ............................. 7
      2.2 Components of Hadoop .............................. 7
    3 HDFS .................................................. 8
      3.1 HDFS architecture ................................. 8
      3.2 Types of Nodes .................................... 9
        3.2.1 NameNode ...................................... 9
        3.2.2 DataNode ...................................... 9
      3.3 Features .......................................... 10
    4 Map Reduce ............................................ 11
      Map Function .......................................... 11
      Reduce Function ....................................... 11
      4.1 Architecture ...................................... 11
      4.2 Types of Nodes .................................... 12
        4.2.1 Job Tracker ................................... 12
        4.2.2 DataNode ...................................... 12
    5 Examples .............................................. 13
      5.1 Common friend list ................................ 13
      5.2 Word Count ........................................ 15
    6 Hadoop Projects ....................................... 16
    7 Who uses Hadoop? ...................................... 17
    CONCLUSION .............................................. 19
    REFERENCES .............................................. 20
  • 5. INTRODUCTION

    The size of the databases used in today's enterprises has been growing at an exponential rate, and with it the need to process and analyse large volumes of data for business decision making. Several business and scientific applications need to process terabytes of data in an efficient manner on a daily basis. This has led to the big data problem faced by the industry: conventional database systems and software tools are unable to manage or process such data sets within tolerable time limits. Depending on usage, processing can involve operations such as culling, tagging, highlighting, indexing, searching, and faceting. It is not possible for a single machine, or even a few machines, to store or process this huge amount of data in a finite time period.

    This report describes the big data problem and its solution using a Hadoop cluster: the Hadoop Distributed File System (HDFS) for storage, and the MapReduce programming framework for parallel processing of large data sets. A prototype implementation of a Hadoop cluster, HDFS storage, and the MapReduce framework was carried out for representative big data application scenarios.

    LITERATURE SURVEY

    The main paper chosen is Aditya B. Patel, Manashvi Birla, and Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce", because it is based on Hadoop technology, which solves the problem of big data processing. Two reference papers support it: Kamalpreet Singh and Ravinder Kaur, "Hadoop: Addressing Challenges of Big Data", IEEE, 2014; and Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, "The Hadoop Distributed File System", Yahoo!, IEEE, 2010.
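The MapReduce programming framework named above can be illustrated with a minimal pure-Python simulation (no Hadoop required; all function names here are illustrative, not Hadoop's API): the map phase emits (key, value) pairs from each input record, a shuffle groups values by key, and the reduce phase aggregates each group. Word count, the canonical MapReduce example, fits in a few lines under this sketch:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: each word maps to (word, 1); the reducer sums the ones.
lines = ["big data needs big tools", "hadoop handles big data"]
mapper = lambda line: [(word, 1) for word in line.split()]
reducer = lambda word, counts: sum(counts)

counts = reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In real Hadoop the map and reduce tasks run on many machines and the shuffle moves data over the network; this single-process sketch only shows the data flow of the model.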
  • 6. PROBLEM STATEMENT

    1 BIG DATA PROBLEM

    1.1 WHAT IS BIG DATA?

    Big data comprises data sets with sizes beyond the ability of commonly used software tools to capture, store, search, share, analyse, and visualise, and that size is continuously increasing. Typical examples of big data in the current scenario include web logs, sensor networks, satellite and geospatial data, social data from social networks, internet text and documents, internet search indexing, call detail records, and photography archives.

    1.2 CHARACTERISTICS OF BIG DATA

    Big data can be described by the following characteristics:

    • Volume – The quantity of data generated is very important in this context. It is the size of the data that determines its value and potential, and whether it can actually be considered big data at all; the name "Big Data" itself refers to size.

    • Variety – The category to which the data belongs is also an essential fact for data analysts to know. It helps the people who closely analyse the data to use it effectively to their advantage, and thus upholds its importance.

    • Velocity – "Velocity" in this context refers to the speed at which data is generated and processed to meet the demands and challenges that lie ahead in the path of growth and development.
  • 7. • Variability – The inconsistency that data can show at times is a problem for those who analyse it, hampering the process of handling and managing the data effectively.

    • Complexity – Data management can become very complex, especially when large volumes of data come from multiple sources. These data need to be linked, connected, and correlated in order to grasp the information they are supposed to convey; this is termed the "complexity" of big data.

    1.3 BIG DATA CHALLENGES

    The main challenges faced in large-scale data management are:

    • Scalability
    • Unstructured data
    • Accessibility
    • Real-time analytics
    • Fault tolerance
  • 8. PROBLEM SOLUTION

    2 INTRODUCTION TO HADOOP

    Hadoop provides a solution for managing big data and overcomes many of the challenges stated above. It is used for analysing and processing big data and supports a multiprocessing environment. The Hadoop framework is written mainly in Java, with some hardware-related parts in C. It is open source under the Apache license and was created by Doug Cutting. Hadoop provides a parallel processing model: queries are split and distributed across parallel nodes and processed in parallel; the results are then gathered and delivered.

    2.1 ORIGIN OF NAME HADOOP

    Hadoop is not an acronym; it is a made-up name. The project's creator, Doug Cutting, explains where the name came from:

    "My kid gave this name to his toy, a yellow elephant. The name is short, relatively easy to spell and pronounce. It is meaningless and not used elsewhere."
    – Doug Cutting, project creator of Hadoop

    Hence the logo is a yellow elephant. An elephant cannot jump, but it can move heavy weights from one place to another. Similarly, Hadoop is not meant for small queries (say, day-to-day transactions), but it can handle very large amounts of data.

    2.2 COMPONENTS OF HADOOP

    Hadoop consists of two main components.

    Fig. Main components of Hadoop
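The split/distribute/gather flow described above can be sketched in plain Python (a simulation of the idea, not Hadoop's actual scheduler; partition sizes and the worker function are illustrative): the data set is partitioned, each partition is handed to a separate worker, and the partial results are combined into the final answer.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_parts):
    """Partition the input into roughly equal chunks, one per node."""
    size = max(1, len(data) // n_parts)
    return [data[i:i + size] for i in range(0, len(data), size)]

def process(partition):
    """The per-node work: here, count records containing an error marker."""
    return sum(1 for record in partition if "ERROR" in record)

def gather(partials):
    """Combine the partial results from all nodes into the final answer."""
    return sum(partials)

log = ["ok", "ERROR disk", "ok", "ERROR net", "ok", "ok", "ERROR cpu", "ok"]
partitions = split(log, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, partitions))
print(gather(partials))  # 3
```

The key design point this mirrors is that the combine step (`gather`) must be able to merge independently computed partial results; counting, summing, and similar associative operations parallelise this way naturally.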
  • 9. MapReduce is the processing part of Hadoop and manages the jobs. HDFS, the Hadoop Distributed File System, redundantly stores all the data required for computation. Hadoop's origins lie in the Google File System (GFS) and Google's MapReduce, which became Apache HDFS and Apache MapReduce.

    3 HDFS

    HDFS stores all the data required for computation. It provides fault tolerance, i.e., if a component fails, a backup component or procedure can immediately take its place with no loss of service. It is designed to run on commodity hardware, which cuts the cost of buying special expensive hardware and RAID systems. It provides high-throughput access to application data and is therefore suitable for applications with large data sets. It has a master–slave architecture: large data is automatically split into chunks, each managed by a different node in the Hadoop cluster.

    3.1 HDFS ARCHITECTURE

    Fig. Hadoop Distributed File System architecture

    Large amounts of data are split across nodes. HDFS does not use data-protection mechanisms such as RAID to make data durable; instead, it replicates data on multiple nodes for reliability. The master node is known as the NameNode and the slaves are known as DataNodes.
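The chunking and replication just described can be sketched as follows (a toy model, assuming a fixed block size and round-robin replica placement; real HDFS uses large blocks, e.g. 64 MB in early versions, and a rack-aware placement policy):

```python
import itertools

def split_into_blocks(data: bytes, block_size: int):
    """Split a file into fixed-size blocks, as HDFS does (toy block size here)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin style.
    This only illustrates spreading replicas; it is not HDFS's actual policy."""
    placement = {}
    nodes = itertools.cycle(datanodes)
    for block_id, _ in enumerate(blocks):
        placement[block_id] = [next(nodes) for _ in range(replication)]
    return placement

data = b"x" * 100
blocks = split_into_blocks(data, block_size=32)  # 4 blocks: 32+32+32+4 bytes
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"], replication=3)
print(len(blocks))   # 4
print(placement[0])  # ['dn1', 'dn2', 'dn3']
```

Because every block lives on three nodes, the loss of any single DataNode leaves at least two copies of each of its blocks available, which is the reliability property replication buys instead of RAID.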
3.2 TYPES OF NODES

3.2.1 NameNode

 Only one per Hadoop cluster
There is only one NameNode in the cluster.

 Manages the file system namespace and metadata
Data is not stored on, and does not pass through, the NameNode; rather, the NameNode manages the other nodes and the metadata. Files and directories are represented on the NameNode by inodes, which record attributes such as permissions, modification and access times, and namespace and disk-space quotas.

 Single point of failure
To compensate for the fact that there is only one NameNode, one should configure it to write a copy of its state information to multiple locations, such as a local disk and an NFS mount. Unlike the DataNodes, the NameNode typically warrants more reliable hardware.

 Large memory requirement
The NameNode should have as much RAM as possible, because it keeps the entire file system metadata in memory.

3.2.2 DataNode

 Many per Hadoop cluster
A typical HDFS cluster has many DataNodes.

 Manages blocks of data and serves them to clients
Blocks from different files can be stored on the same DataNode. When a client requests a file, it finds out from the NameNode which DataNodes store the blocks that make up that file, and then reads the blocks directly from the individual DataNodes.

 Periodically reports to the NameNode the list of blocks it stores
During start-up, each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode; if either does not match that of the NameNode, the DataNode automatically shuts down. The DataNode also periodically reports to the NameNode to update the list of blocks it stores.
 Suitable for inexpensive, commodity hardware
DataNodes do not require expensive enterprise hardware or replication at the hardware layer. They are designed to run on commodity hardware, with replication provided at the software layer.

3.3 FEATURES

 HDFS is rack aware
The NameNode and the JobTracker obtain a list of rack IDs corresponding to each of the slave nodes (DataNodes) and create a mapping between IP addresses and rack IDs. HDFS uses this knowledge to replicate data across different racks, so that data is not lost in the event of a complete rack power outage or switch failure.

 Job performance
Hadoop performs speculative execution: if a slow machine in the cluster is holding up the entire map/reduce phase with the tasks it is running, Hadoop runs redundant copies of the same task on other machines; whichever copy completes first reports back to the JobTracker, and its results are carried forward into the next phase.

 Fault tolerance
The TaskTracker spawns separate JVM processes so that process failures do not bring down the TaskTracker itself. The TaskTracker keeps sending heartbeat messages to the JobTracker to signal that it is alive and to report the number of empty slots available for running more tasks. The JobTracker checkpoints its work in the file system; whenever it starts up, it checks what it was doing up to the last checkpoint and resumes any incomplete jobs. (In earlier versions, all active job information was lost whenever the JobTracker went down.)
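The rack-aware replication above can be illustrated with a deliberately simplified sketch. All rack and node names here are invented, and the real placement policy lives inside HDFS; the sketch only captures the key idea that replicas must span at least two racks (by default, HDFS keeps one replica on the writer's node and places the remaining replicas on a different rack).

```python
racks = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(writer_node, racks, replication=3):
    # First replica on the writer's own node; remaining replicas on one
    # other rack, so a whole-rack failure cannot lose every copy.
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    remote_rack = next(r for r in racks if r != local_rack)
    return [writer_node] + racks[remote_rack][:replication - 1]

print(place_replicas("node2", racks))  # ['node2', 'node4', 'node5']
```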
4 MAPREDUCE

In 2004, Google published a paper on a process called MapReduce. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the map step); the results are then gathered and delivered. The framework was very successful, and others wanted to replicate the algorithm, so an implementation of the MapReduce framework was adopted by the Apache open-source project Hadoop. It provides a parallel processing model and an associated implementation for processing huge amounts of data. It consists of two steps:

Map function
A MapReduce job divides the input data set into independent chunks, which are processed in a completely parallel manner by the map function. It generates a set of intermediate <key, value> pairs.

Map (k1, v1) -> list(k2, v2)

Reduce function
The output of the map functions serves as input to the reduce function, which merges all <key, value> pairs that share a key.

Reduce (k2, list(v2)) -> list(v3)

4.1 ARCHITECTURE
Fig. MapReduce architecture

4.2 TYPES OF NODES

4.2.1 JobTracker

 Only one per Hadoop cluster
There is only one JobTracker in the cluster.

 Manages and schedules jobs
The JobTracker accepts MapReduce jobs from clients, splits them into map and reduce tasks, and assigns those tasks to TaskTrackers, preferring nodes that already hold the relevant data blocks.

 Single point of failure
If the JobTracker goes down, no new work can be scheduled; as noted earlier, it checkpoints its state in the file system so that it can resume incomplete jobs on restart.

4.2.2 TaskTracker

 Many per Hadoop cluster
A typical cluster runs one TaskTracker on each slave node, alongside the DataNode.

 Executes map and reduce tasks
Each TaskTracker runs the tasks assigned to it in separate JVM processes, so that a failing task does not bring down the TaskTracker itself.

 Periodically reports to the JobTracker
TaskTrackers send heartbeat messages to the JobTracker to signal that they are alive and to report the number of free task slots. A TaskTracker that stops sending heartbeats is assumed to have failed, and its tasks are rescheduled elsewhere.

5 EXAMPLES

5.1 COMMON FRIEND LIST

Consider a social networking site, say Facebook. (Note: Facebook may or may not actually use this implementation; it is just an example.) Facebook maintains friend lists, and friendship is bidirectional (if I am your friend, then you are mine). When you visit someone's profile, you see a list of the friends you have in common. Does Facebook calculate this on the spot? No! It serves millions of requests every second, so such results are pre-computed daily to reduce per-request processing time.

Assume there are five people, A, B, C, D and E, on Facebook. A friend list is denoted by Person -> (list of friends). Our friend lists are:

• A -> B C D
• B -> A C D E
• C -> A B D E
• D -> A B C E
• E -> B C D

i.e. A is a friend of B, C and D; B is a friend of A, C, D and E; and so on.

Each line is passed as an argument to the map step. For every friend in the list of friends, the mapper outputs a key-value pair: the key is a (person, friend) pair and the value is the person's list of friends. For Map(A -> B C D):

(A B) -> B C D
(A C) -> B C D
(A D) -> B C D
The same is done for every person on Facebook. Each key is sorted so that its two members appear in order (e.g. (B, A) becomes (A, B)), which causes both records for a pair of friends to go to the same reducer.

Map(A -> B C D)
(A B) -> B C D
(A C) -> B C D
(A D) -> B C D

Map(B -> A C D E)
(A B) -> A C D E
(B C) -> A C D E
(B D) -> A C D E
(B E) -> A C D E

Map(C -> A B D E)
(A C) -> A B D E
(B C) -> A B D E
(C D) -> A B D E
(C E) -> A B D E

Map(D -> A B C E)
(A D) -> A B C E
(B D) -> A B C E
(C D) -> A B C E
(D E) -> A B C E

Map(E -> B C D)
(B E) -> B C D
(C E) -> B C D
(D E) -> B C D

Before we send these key-value pairs to the reduce step, we group them by key and get:

(A B) -> (B C D) (A C D E)
(A C) -> (B C D) (A B D E)
(A D) -> (B C D) (A B C E)
(B C) -> (A C D E) (A B D E)
(B D) -> (A C D E) (A B C E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B D E) (A B C E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)

Each line is now passed as an argument to a reducer. The reduce function simply takes the intersection of the two lists.
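The map, shuffle and reduce steps above can be sketched as a single-process Python simulation. This is a toy illustration of the algorithm, not Hadoop's actual Java API; the function names (`map_friends`, `reduce_friends`) are invented.

```python
from collections import defaultdict

friends = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
}

def map_friends(person, friend_list):
    # Emit one (sorted pair, friend list) record per friend, so both
    # records for a pair of friends share the same key.
    for friend in friend_list:
        key = tuple(sorted((person, friend)))
        yield key, set(friend_list)

def reduce_friends(key, value_lists):
    # Intersect the two friend lists that share this pair key.
    return key, sorted(set.intersection(*value_lists))

# Shuffle/sort phase: group intermediate values by key.
grouped = defaultdict(list)
for person, friend_list in friends.items():
    for key, value in map_friends(person, friend_list):
        grouped[key].append(value)

result = dict(reduce_friends(k, v) for k, v in grouped.items())
print(result[("B", "D")])  # ['A', 'C', 'E']
```

In real Hadoop, the mapper calls and the reducer calls would run on different machines, and the grouping step is performed by the framework's shuffle phase rather than by an in-memory dictionary.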
The result after reduction is:

(A B) -> (C D)
(A C) -> (B D)
(A D) -> (B C)
(B C) -> (A D E)
(B D) -> (A C E)
(B E) -> (C D)
(C D) -> (A B E)
(C E) -> (B D)
(D E) -> (B C)

Now when D visits B's profile, we can quickly look up (B, D) and see that they have three friends in common: A, C and E.

5.2 WORD COUNT

Counting the words in a given set of files is the canonical MapReduce example: the map step emits (word, 1) for every word in its input split, and the reduce step sums the counts for each word.
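A minimal sketch of that job, again as a plain-Python simulation rather than real Hadoop code (the input documents and the names `map_words` and `reduce_counts` are made up for illustration):

```python
from collections import defaultdict

documents = [
    "hadoop stores big data",
    "hadoop processes big data in parallel",
]

def map_words(text):
    # Map step: emit (word, 1) for every word in the input split.
    for word in text.split():
        yield word, 1

def reduce_counts(word, counts):
    # Reduce step: sum the counts collected for one word.
    return word, sum(counts)

# Shuffle/sort phase: group the emitted 1s by word.
grouped = defaultdict(list)
for doc in documents:
    for word, one in map_words(doc):
        grouped[word].append(one)

word_counts = dict(reduce_counts(w, c) for w, c in grouped.items())
print(word_counts["hadoop"])  # 2
```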
6 HADOOP PROJECTS

 Eclipse is a popular IDE donated by IBM to the open-source community.
 Lucene is a text search engine library written in Java.
 HBase is Hadoop's database, initiated by Powerset and used for table storage of semi-structured data.
 Hive provides data warehousing tools to extract, transform and load data stored in Hadoop files, and then query it. It was initiated by Facebook and uses an SQL-like query language and a metastore.
 Pig is a high-level language, initiated by Yahoo!, that generates MapReduce code to analyse large data sets.
 Jaql is a query language for JavaScript Object Notation (JSON).
 ZooKeeper is a centralized configuration service and naming registry for large distributed systems, initiated by Yahoo!.
 Avro is a data serialization system.
 UIMA is an architecture for the development, discovery, composition and deployment of tools for the analysis of unstructured data.
7 WHO USES HADOOP?

Companies that offer services on or based around Hadoop are listed at http://wiki.apache.org/hadoop/PoweredBy. Below are some of these companies and, in their own words, how they use Hadoop.

 A9.com - Amazon
o We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
o We process millions of sessions daily for analytics, using both the Java and streaming APIs.
o Our clusters vary from 1 to 100 nodes.

 AOL
o We use Apache Hadoop for a variety of things, ranging from ETL-style processing and statistics generation to running advanced algorithms for behavioural analysis and targeting.
o The cluster we use mainly for behavioural analysis and targeting has 150 machines: Intel Xeon, dual processor, dual core, each with 16 GB RAM and an 800 GB hard disk.

 eBay
o A 532-node cluster (8 * 532 cores, 5.3 PB).
o Heavy usage of Java MapReduce, Apache Pig, Apache Hive, and Apache HBase.
o Used for search optimization and research.

 Facebook
o We use Apache Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
o Currently we have two major clusters:
 An 1100-machine cluster with 8800 cores and about 12 PB of raw storage.
 A 300-machine cluster with 2400 cores and about 3 PB of raw storage.
 Each (commodity) node has 8 cores and 12 TB of storage.
 We are heavy users of both the streaming and the Java APIs. We have built a higher-level data warehousing framework using these features, called Hive. We have also developed a FUSE implementation over HDFS.
 LinkedIn
o We have multiple grids, divided up based upon purpose.

 Twitter
o We use Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. We store all data as compressed LZO files.
o We use both Scala and Java to access Hadoop's MapReduce APIs.
o We use Apache Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements.
o We employ committers on Apache Pig, Apache Avro, Apache Hive, and Apache Cassandra, and contribute much of our internal Hadoop work to open source.

 Yahoo!
o More than 100,000 CPUs in over 40,000 computers running Hadoop.
o Our biggest cluster: 4500 nodes (2*4-CPU boxes with 4*1 TB disks and 16 GB RAM).
 Used to support research for Ad Systems and Web Search.
 Also used for scaling tests to support development of Apache Hadoop on larger clusters.
o More than 60% of Hadoop jobs within Yahoo are Apache Pig jobs.
CONCLUSION

In this work, we have explored the solution to the big data problem using a Hadoop data cluster, HDFS, and the MapReduce programming framework, applied to prototype big data application scenarios. The results obtained from various experiments indicate favourable outcomes for this approach to the big data problem.

As big data continues down its path of growth, there is no doubt that these innovative approaches, built on Hadoop software, will be central to allowing companies to reach their full potential with data. Additionally, this rapid advancement of data technology has sparked a rising demand for the next generation of engineers who can build this powerful infrastructure. The cost of the technology and the talent may not be cheap, but for all the value that big data is capable of bringing to the table, companies are finding that it is a very worthy investment.
REFERENCES

1. Aditya B. Patel, Manashvi Birla, Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce".
2. Kamalpreet Singh, Raviner Kaur, "Hadoop: Addressing Challenges of Big Data", IEEE 2014.
3. Big data - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Big_data
4. Steve Krenzel, "MapReduce: Finding Friends", http://stevekrensel.com/finding-friends-with-mapreduce
5. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System", Yahoo!, IEEE 2010.
6. www.bigdatauniversity.com
7. Hadoop, HDFS, MapReduce and Hive - Some salient understandings: Hadoop - NameNode, DataNode, Job Tracker and TaskTracker, http://hadoop-gyan.blogspot.in/2012/11/hadoop-namenode-datanode-job-tracker.html