Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detection(summer training main)

©Ashok Royal
POORNIMA INSTITUTE OF ENGINEERING &
TECHNOLOGY, JAIPUR
(DEPARTMENT OF COMPUTER ENGINEERING)
Big Data Hadoop
Presented By: Ashok Royal
Guided By: Dr. E.S. Pilli

Topics
1. Organization Profile
2. Schedule
3. Training Description
4. Project Description
5. Learning
6. Snapshots
7. Future Scope
8. Conclusion
9. References
©Ashok Royal 9 September 2014

Organization Profile
 Name - Malaviya National Institute of Technology (MNIT),
Jaipur.
 MNIT, Jaipur is one of 30 national institutes of
technology in India.
 MNIT was established in 1963, inspired by Pt. Madan
Mohan Malviya.
 The institute's director is I. K. Bhat and the chairman
of the board of Governors is Dr. K. K. Aggarwal.

Organization Contacts
Organization’s contacts:
Address: Jawaharlal Nehru Marg,
Jhalana, Malviya Nagar, Jaipur,
Rajasthan 302017
Phone: 0141 271 3201
Email : espilli.cse@mnit.ac.in
Website : www.mnit.ac.in

Schedule
Our training at MNIT was broadly divided into three
phases:
 Case study of Hadoop and related papers (first 30
days).
 Hadoop cluster making (first 30 days).
 Single node
 Multi node
 Implementation of Near Duplicate Detection Using
Hadoop MapReduce (last 15 days).

Training Coordinator
 Name - Dr. E. S. Pilli
 Assistant Professor at MNIT, Jaipur. He holds a
Ph.D. (CSE) from IIT Roorkee. He is very
supportive and is currently guiding many
M.Tech and Ph.D. students in Cloud Computing,
Big Data and Botnets.
 Email: espilli.cse@mnit.ac.in

What is Big Data?
 Lots of data (terabytes or petabytes)
 Big data is a term for a collection of data sets
so large and complex that it becomes difficult to
process with traditional data processing
applications.
 Big data refers to large data sets that are
challenging to store, search, share, visualize and
analyze.

Various forms of Data
 Data comes mainly in three forms:
 Structured data
 Unstructured data
 Semi-structured data

Why is Data so BIG?
 20 terabytes of photos are uploaded to Facebook each
month.
 330 terabytes of data will be produced by the Large
Hadron Collider each week.
 530 terabytes: all the videos on YouTube.
 1 petabyte of data is processed by Google's servers
every 72 minutes.

Data growth

What is Hadoop?
 It is an open-source software library
written in Java.
 The Hadoop software library is a
framework that allows for the
distributed processing of large data sets (Big
Data) across clusters of computers using
simple programming models.

Modules of hadoop
 Hadoop Common
 Hadoop Distributed File System (HDFS)
 Hadoop MapReduce

Hadoop Common
 It provides access to the file systems supported by
Hadoop.
 The Hadoop Common package contains the
necessary JAR files and scripts needed to start
Hadoop.
 The package also provides source code,
documentation, and a contribution section which
includes projects from the Hadoop community.

HDFS
 Hadoop uses HDFS, a distributed file system
based on GFS (Google File System) as its
shared filesystem.
 HDFS architecture divides files into large
chunks distributed across data servers.
 It has a NameNode and DataNodes.
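
The chunking idea above can be illustrated with a toy model. This is only a sketch under stated assumptions: the 128 MB block size, the replication factor of 3, the DataNode names and the round-robin placement are illustrative choices, and real HDFS uses its own placement policy rather than this one.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return the sizes (in bytes) of the blocks a file is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])

def place_blocks(num_blocks, datanodes, replication=3):
    """Toy placement (assumption, not the HDFS policy):
    each block is stored on `replication` consecutive DataNodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }

# A hypothetical 300 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(300 * MB, 128 * MB)
block_map = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Here `block_map` plays the role of the block-to-DataNode mapping that the NameNode keeps, while the DataNodes would hold the block contents themselves.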

Main components of HDFS
 Namenode:
 Master of the system
 Maintains and manages the blocks which are present on the
datanodes.
 Datanodes:
 Slaves which are deployed on each machine and provide
the actual storage.
 Responsible for serving read and write requests for the
clients.

Main components of HDFS
 Secondary NameNode
 Used as a savepoint
 Connects to the NameNode every hour*
 Keeps a backup of the NameNode metadata
 The saved metadata can be used to rebuild a failed NameNode.

Map Reduce
 The Hadoop MapReduce framework
harnesses a cluster of machines and executes
user-defined MapReduce jobs across the nodes
in the cluster.
 A MapReduce computation has two phases:
 A map phase and
 A reduce phase

HDFS and MapReduce Layers

Hadoop Server Roles
[Diagram: a client computer submits a MapReduce job to the JobTracker on the master node; each of the three slave nodes runs a TaskTracker with a task instance.]

Hadoop Architecture

Hadoop Streaming
 Hadoop Streaming lets users create and run map/reduce
jobs with any executable or script as the mapper and/or
reducer.
 HDFS is designed to process large files
on a commodity cluster at high speed.
 Write-once, read-many approach: once a large data set
has been stored, we tend to read it, not modify
it.
 The time to read the whole data set is more important
than the latency of reading the first record.

Hadoop Workflow
1. Load data into the cluster (HDFS writes)
2. Analyze the data (Map Reduce)
3. Store results in the cluster (HDFS writes)
4. Read the results from the cluster (HDFS
reads)

Word count Example
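
The word count job can be sketched in plain Python. This is a local simulation under stated assumptions, not the code shown on the original slide: the function names `mapper`, `reducer` and `word_count` are hypothetical, and sorting the pairs stands in for the shuffle step that Hadoop performs between the two phases.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit (word, 1) for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(sorted_pairs):
    """Reduce phase: sum the counts of each word.
    Expects the pairs sorted by word, as they would be after the shuffle."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

def word_count(lines):
    shuffled = sorted(mapper(lines))  # stand-in for Hadoop's shuffle and sort
    return dict(reducer(shuffled))

counts = word_count(["Hello Hadoop", "hello MapReduce hello"])
```

The same two functions could be adapted to Hadoop Streaming by reading lines from standard input and printing tab-separated key/value pairs.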

Prominent users of Hadoop
 Amazon – 100 nodes
 Facebook – two clusters of 8000 and 3000 nodes
 Adobe – 80 nodes
 eBay – 532 nodes
 Yahoo – cluster of about 4500 nodes
 IIIT Hyderabad – 30 nodes
 IBM, Microsoft and many more companies also use
Hadoop.

 Near duplicates = pairs of objects with high similarity.
 Similarity is measured quantitatively by a similarity function.
 Given a collection of records, the similarity join problem is to find all pairs of records <x,y> such that
sim(x,y) >= t, where t is a predefined threshold.
 Tokenize:
 Each record is a set of tokens from a finite universe.
 Suppose each record is a single text document
 x = “yes as soon as possible”
 y = “as soon as possible please”
 x = {A, B, C, D, E}
 y = {B, C, D, E, F}
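
The tokenization step above might look like the following sketch. One assumption is made explicit here: to obtain five distinct tokens from x even though "as" occurs twice, each word is paired with its occurrence number, so the second "as" becomes a different token from the first. The function name `tokenize` is hypothetical.

```python
from collections import Counter

def tokenize(text):
    """Turn a text record into a set of tokens.
    Repeated words are kept distinct by pairing each word with its
    occurrence number (an assumption made for this sketch)."""
    seen = Counter()
    tokens = set()
    for word in text.lower().split():
        seen[word] += 1
        tokens.add((word, seen[word]))
    return tokens

x = tokenize("yes as soon as possible")
y = tokenize("as soon as possible please")
common = x & y  # tokens shared by both records
```

With this scheme x and y each contain five tokens and share four of them, which matches the sets {A, B, C, D, E} and {B, C, D, E, F} above.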

Project Description
 Project Name - Near Duplicate Detection
 Comparative analysis of millions of documents that exist
on the network, to find similar documents based on a
predefined threshold value.
 Near duplicate detection is mainly used in web
crawls and many other data mining tasks.
 Near duplicates are not "exact
duplicates", but files with minute differences.

Application in Search Engine

Application in Search Engine
cont.
 Web documents with a similarity score greater
than a predefined threshold are considered
near duplicates.
 These near-duplicate pages are not added to the
repository of the search engine.
 This reduces the storage cost of search engines.
 It improves the quality of the search index.

Similarity Function
To calculate the similarity between two documents we have
used the Jaccard function.
 Jaccard Similarity Function:
$J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t$
 Example:
x = {A,B,C,D,E}
y = {B,C,D,E,F}
$J(x, y) = 4/6 \approx 0.67$
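
The Jaccard computation and the threshold test can be sketched as follows. This is a naive all-pairs version for illustration only, not the MapReduce implementation used in the project; the record `z`, the threshold 0.6 and the function names are assumptions.

```python
from itertools import combinations

def jaccard(x, y):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not x and not y:
        return 1.0  # convention assumed here for two empty sets
    return len(x & y) / len(x | y)

def similarity_join(records, t):
    """Naive similarity join: compare every pair of records and
    keep the pairs whose Jaccard similarity is at least t."""
    return [
        (name_a, name_b)
        for (name_a, a), (name_b, b) in combinations(records.items(), 2)
        if jaccard(a, b) >= t
    ]

x = set("ABCDE")
y = set("BCDEF")
records = {"x": x, "y": y, "z": {"A", "G"}}  # z is a made-up third record
pairs = similarity_join(records, t=0.6)
```

Comparing every pair grows quadratically with the number of records, which is the motivation for distributing the work over a Hadoop cluster.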

Steps to Detect near duplicate

Snapshots - HDFS

Snapshot- mapreduce
Processing

Conclusion
 Training in big data helped us understand the
current trends in the IT industry and how technology is
contributing to human development.
 Big Data is the future. A lot of research is currently
going on in this field. As data is growing at an ever faster
rate, there is a huge need for tools and
technology that can handle it.

Conclusion
 Hadoop is an emerging framework used by
many big firms such as Facebook, Microsoft, IBM,
Yahoo, Amazon and many others.
 Our experience at MNIT was absolutely awesome,
as it gave us the platform and support for our
tasks and case study.

 Big data and big data solutions are among the
burning issues in the present IT industry, so
working on them will surely make us more useful
to it.

 The proposed method addresses the difficulties of
information retrieval from the web.
 The approach detects near duplicate web
pages efficiently, based on the keywords extracted
from the web pages.
 It reduces the memory space needed for web repositories.
 Near duplicate detection improves the quality of
search engines.

References
 J. G. Conrad, X. S. Guo, and C. P. Schriber.
Online duplicate document detection: signature
reliability in a dynamic retrieval environment. In
CIKM, 2003.
 Qinsheng Du, Wei Liu, Guolin Li, and Yonglin Tang.
Near Duplicate Detection Using Map-Reduce. In
2nd International Conference on Computer
Science and Network Technology, 2012.

References
 Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey
Xu Yu, Guoren Wang. Efficient Similarity Joins
for Near-Duplicate Detection
 Apache Hadoop. http://hadoop.apache.org.
 Hadoop-beginners.blogspot.com.

Namenode and Datanode
 The NameNode executes file system namespace
operations like opening, closing, and renaming
files and directories. It also determines the
mapping of blocks to DataNodes.
 The DataNodes are responsible for serving read
and write requests from the file system’s clients.
The DataNodes also perform block creation,
deletion, and replication upon instruction from the
NameNode.

MapReduce paradigm
 Map phase: Once the input is divided, the data sets are assigned to the
TaskTrackers to perform the Map phase. The map function
is applied to the data and emits
key-value pairs as the output of the Map phase
(i.e., data processing).
 Reduce phase: The master node then collects the answers to
all the subproblems and combines them in some way to form
the output, the answer to the problem it was originally trying
to solve (i.e., data collection and digesting).
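
The two phases can be simulated on a single machine with a small driver. This is a sketch under stated assumptions: `run_mapreduce` is a hypothetical helper, not a Hadoop API, and the inverted-index job and the three sample documents are made up for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-machine simulation of a MapReduce job."""
    intermediate = defaultdict(list)
    # Map phase: every record yields zero or more (key, value) pairs,
    # which are grouped by key (the job of the shuffle step in Hadoop).
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Reduce phase: combine all values of one key into a single result.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

def index_map(record):
    """Emit (word, document id) for each distinct word of a document."""
    doc_id, text = record
    for word in set(text.split()):
        yield word, doc_id

def index_reduce(word, doc_ids):
    """Collect the sorted list of documents that contain the word."""
    return sorted(doc_ids)

docs = [("d1", "big data hadoop"), ("d2", "hadoop mapreduce"), ("d3", "big cluster")]
index = run_mapreduce(docs, index_map, index_reduce)
```

In a real cluster the map calls and the reduce calls run in parallel on different nodes; only the user-defined `map_fn` and `reduce_fn` parts correspond to what a Hadoop programmer writes.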

Any Query?
