©Ashok Royal 
POORNIMA INSTITUTE OF ENGINEERING & 
TECHNOLOGY, JAIPUR 
(DEPARTMENT OF COMPUTER ENGINEERING) 
Big Data Hadoop 
Presented By: Ashok Royal
Guided By: Dr. E.S. Pilli
Topics 
1. Organization Profile 
2. Schedule 
3. Training Description 
4. Project Description 
5. Learning 
6. Snapshots 
7. Future Scope 
8. Conclusion 
9. References 
©Ashok Royal 9 September 2014
Organization Profile 
 Name - Malaviya National Institute of Technology, Jaipur.
 MNIT, Jaipur is one of 30 National Institutes of Technology in India.
 MNIT was established in 1963, inspired by Pt. Madan Mohan Malaviya.
 The institute's director is I. K. Bhat and the chairman of the Board of Governors is Dr. K. K. Aggarwal.
Organization Contacts 
Organization’s contacts: 
Address: Jawaharlal Nehru Marg, 
Jhalana, Malviya Nagar, Jaipur, 
Rajasthan 302017 
Phone: 0141 271 3201 
Email: espilli.cse@mnit.ac.in
Website: www.mnit.ac.in
Schedule 
Our training at MNIT was broadly divided into three phases:
 Case study of Hadoop and related papers (first 30 days).
 Hadoop cluster setup (first 30 days).
   Single node
   Multi node
 Implementation of Near Duplicate Detection using Hadoop MapReduce (last 15 days).
Training Coordinator 
 Name - Dr. E. S. Pilli
 Assistant Professor at MNIT, Jaipur. He holds a Ph.D. (CSE) from IIT Roorkee. He is very supportive and currently guides many M.Tech and Ph.D. students in Cloud Computing, Big Data and Botnets.
 Email: espilli.cse@mnit.ac.in
What is Big Data?
 Lots of data (terabytes or petabytes).
 Big Data is a term used for a collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
 Big Data refers to large data sets that are challenging to store, search, share, visualize and analyze.
Various forms of Data 
 Data comes mainly in three forms:
   Structured data
   Unstructured data
   Semi-structured data
Why is Data so Big?
 20 terabytes: photos uploaded to Facebook each month.
 330 terabytes: data that the Large Hadron Collider will produce each week.
 530 terabytes: all the videos on YouTube.
 1 petabyte: data processed by Google's servers every 72 minutes.
Data growth 
What is Hadoop?
 It is an open-source software library written in Java.
 The Hadoop software library is a framework that allows for the distributed processing of large data sets (Big Data) across clusters of computers using simple programming models.
Modules of Hadoop
 Hadoop Common
 Hadoop Distributed File System (HDFS)
 Hadoop MapReduce
Hadoop Common
 It provides access to the file systems supported by Hadoop.
 The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop.
 The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop community.
HDFS
 Hadoop uses HDFS, a distributed file system based on GFS (Google File System), as its shared file system.
 The HDFS architecture divides files into large chunks (blocks) distributed across data servers.
 It has a NameNode and DataNodes.
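The idea of dividing a file into large chunks spread over data servers can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the 128 MB block size, the replication factor of 3, the host names `dn1`..`dn4` and the round-robin placement are all assumptions made for the example (real HDFS uses a rack-aware placement policy decided by the NameNode).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    index: int            # position of the block within the file
    size: int             # block length in bytes
    datanodes: List[str]  # hypothetical hosts holding a replica


def place_blocks(file_size, datanodes, block_size=128 * 1024 * 1024, replication=3):
    """Split a file of file_size bytes into blocks and assign replicas round-robin.

    block_size and replication are assumed values for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of datanodes")
    blocks = []
    offset = 0
    index = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        size = min(block_size, file_size - offset)
        replicas = [datanodes[(index + r) % len(datanodes)] for r in range(replication)]
        blocks.append(Block(index, size, replicas))
        offset += size
        index += 1
    return blocks


MB = 1024 * 1024
layout = place_blocks(300 * MB, ["dn1", "dn2", "dn3", "dn4"])
for block in layout:
    print(block.index, block.size // MB, block.datanodes)
```

With these assumed values, a 300 MB file becomes two full blocks and one shorter final block, each stored on three of the four nodes.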
Main components of HDFS
 NameNode:
   Master of the system.
   Maintains and manages the blocks which are present on the DataNodes.
 DataNodes:
   Slaves which are deployed on each machine and provide the actual storage.
   Responsible for serving read and write requests from the clients.
Main components of HDFS 
 Secondary NameNode
   Used as a save point (checkpoint).
   Connects to the NameNode every hour*.
   Keeps a backup of the NameNode metadata.
   The saved metadata can be used to rebuild a failed NameNode.
MapReduce
 The Hadoop MapReduce framework harnesses a cluster of machines and executes user-defined MapReduce jobs across the nodes in the cluster.
 A MapReduce computation has two phases:
   A map phase and
   A reduce phase
HDFS and MapReduce Layers 
Hadoop Server Roles 
[Diagram: a client computer submits a MapReduce job to the JobTracker on the master node; each of three slave nodes runs a TaskTracker with a task instance.]
Hadoop Architecture 
Hadoop Streaming 
 It allows you to create and run MapReduce jobs with any executable or script as the mapper and/or reducer.
 HDFS is designed to handle large files on commodity clusters at high speed.
 Write-once, read-many approach: after a large amount of data has been placed, we tend to use the data, not modify it.
 The time to read the whole data set is more important than the latency of reading the first record.
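As a rough illustration of the streaming model, the sketch below writes a word-count mapper and reducer as plain Python functions that consume and produce text lines in the `key<TAB>value` form used by Hadoop Streaming. The function names and the sample text are assumptions made for this example; the `sorted()` call only stands in for the shuffle-and-sort step that the framework performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would write to stdout."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation: sorted() stands in for Hadoop's shuffle-and-sort step.
text = ["yes as soon as possible", "as soon as possible please"]
result = list(reducer(sorted(mapper(text))))
print(result)
```

In an actual streaming job, each function would live in its own script that reads `sys.stdin` and prints to standard output, and the two scripts would be passed to the Hadoop Streaming JAR through its `-mapper` and `-reducer` options.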
Hadoop Workflow 
1. Load data into the cluster (HDFS writes) 
2. Analyze the data (Map Reduce) 
3. Store results in the cluster (HDFS writes) 
4. Read the results from the cluster (HDFS 
reads) 
Word count Example 
Prominent users of Hadoop 
 Amazon – 100 nodes
 Facebook – two clusters of 8000 and 3000 nodes
 Adobe – 80 nodes
 eBay – 532 nodes
 Yahoo – cluster of about 4500 nodes
 IIIT Hyderabad – 30 nodes
 IBM, Microsoft and many more companies are also using Hadoop.
 Near duplicates are pairs of objects with high similarity.
 Similarity is measured quantitatively by a similarity function.
 Given a collection of records and a threshold t, the similarity join problem is to find all pairs of records <x, y> such that sim(x, y) >= t.
 Tokenize:
   Each record is a set of tokens from a finite universe.
   Suppose each record is a single text document:
   x = "yes as soon as possible"
   y = "as soon as possible please"
   x = {A, B, C, D, E}
   y = {B, C, D, E, F}
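A minimal sketch of the tokenization step, under one assumption that the slide does not spell out: repeated words (the two occurrences of "as") are numbered so that each occurrence counts as a distinct token. This choice reproduces the five tokens per record shown above; the function name `tokenize` is hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Turn a document into a set of (word, occurrence) tokens.

    Numbering repeated words is an assumption: it keeps the record a
    proper set while still telling the first "as" from the second.
    """
    seen = Counter()
    tokens = set()
    for word in text.lower().split():
        seen[word] += 1
        tokens.add((word, seen[word]))
    return tokens


x = tokenize("yes as soon as possible")
y = tokenize("as soon as possible please")
print(sorted(x & y))   # tokens shared by both records
print(len(x | y))      # size of the combined token universe
```

Under this scheme the two records share four tokens and together use six, which matches the letter sets x = {A, B, C, D, E} and y = {B, C, D, E, F}.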
Project Description
 Project Name - Near Duplicate Detection
 Comparative analysis of the millions of documents that exist on the network, to find similar documents based on a predefined threshold value.
 Near duplicate detection is essentially used in web crawls and many other data mining tasks.
 Near duplicates are not "exact duplicates", but files with minute differences.
Application in Search Engine 
Application in Search Engine (cont.)
 Web documents with a similarity score greater than a predefined threshold are considered near duplicates.
 These near-duplicate pages are not added to the repository of the search engine.
 This reduces the storage cost of search engines.
 It improves the quality of the search index.
Similarity Function 
For calculating the similarity between two documents we have used the Jaccard function.
 Jaccard Similarity Function:
$$J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t$$
 Example:
x = {A,B,C,D,E}
y = {B,C,D,E,F}
$$J(x, y) = \frac{4}{6} \approx 0.67$$
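The example above can be checked with a short Python sketch. The function names and the threshold values 0.6 and 0.8 are hypothetical choices for illustration; the slides do not state the threshold used in the project.

```python
def jaccard(x, y):
    """Jaccard similarity of two token sets: |x ∩ y| / |x ∪ y|."""
    if not x and not y:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(x & y) / len(x | y)


def is_near_duplicate(x, y, t):
    """True when the similarity reaches the threshold t."""
    return jaccard(x, y) >= t


x = {"A", "B", "C", "D", "E"}
y = {"B", "C", "D", "E", "F"}
score = jaccard(x, y)
print(round(score, 2))
print(is_near_duplicate(x, y, 0.6), is_near_duplicate(x, y, 0.8))
```

The pair counts as a near duplicate at the assumed threshold of 0.6 but not at 0.8, which shows how strongly the result depends on the chosen value of t.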
Steps to Detect Near Duplicates
Snapshots - HDFS
Snapshots - MapReduce Processing
Conclusion
 Training in Big Data helped us understand a major current trend in the IT industry and how technology is becoming more useful for human development.
 Big Data is the future. A lot of research is currently going on in this field. As data is growing at an ever faster rate, there is a huge need for tools and technologies that can handle it.
Conclusion 
 Hadoop is an emerging framework used by many big firms such as Facebook, Microsoft, IBM, Yahoo, Amazon and many others.
 Our experience at MNIT was excellent, as it gave us the platform and support for our tasks and case study.
 Big Data and Big Data solutions are among the burning issues in the present IT industry, so working on them will surely make us more useful to it.
 The proposed method solves the difficulties of information retrieval from the web.
 The approach detects near-duplicate web pages efficiently, based on the keywords extracted from the web pages.
 It reduces the memory space needed for web repositories.
 Near duplicate detection improves the quality of search engines.
References 
 J. G. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In CIKM, 2003.
 Qinsheng Du, Wei Liu, Guolin Li, and Yonglin Tang. Near Duplicate Detection Using Map-Reduce. In 2nd International Conference on Computer Science and Network Technology, 2012.
References 
 Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, and Guoren Wang. Efficient Similarity Joins for Near-Duplicate Detection.
 Apache Hadoop. http://hadoop.apache.org
 Hadoop-beginners.blogspot.com
Namenode and Datanode 
 The NameNode executes file system namespace 
operations like opening, closing, and renaming 
files and directories. It also determines the 
mapping of blocks to DataNodes. 
 The DataNodes are responsible for serving read 
and write requests from the file system’s clients. 
The DataNodes also perform block creation, 
deletion, and replication upon instruction from the 
NameNode.
MapReduce paradigm 
 Map phase: Once the input is divided, the data sets are assigned to the TaskTrackers to perform the Map phase. The map function is applied to the data, emitting key-value pairs as the output of the Map phase (i.e., data processing).
 Reduce phase: The master node then collects the answers to all the subproblems and combines them in some way to form the output, the answer to the problem it was originally trying to solve (i.e., data collection and digesting).
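The two phases can be imitated in memory with a few lines of Python. This is a sketch under stated assumptions, not the project's actual job: `run_mapreduce`, `map_tokens` and `reduce_doc_ids` are hypothetical names, the document ids `d1` and `d2` are made up, and the example job (listing which documents contain each word) was chosen only because it relates to the near duplicate project.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run a map phase, a shuffle, and a reduce phase on a single machine."""
    # Map phase: every record is turned into zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: combine the values of each key into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def map_tokens(record):
    """Emit (word, doc_id) once for every distinct word in the document."""
    doc_id, text = record
    return [(word, doc_id) for word in set(text.lower().split())]


def reduce_doc_ids(word, doc_ids):
    """Collect the ids of all documents that contain the word."""
    return sorted(doc_ids)


docs = [("d1", "yes as soon as possible"), ("d2", "as soon as possible please")]
index = run_mapreduce(docs, map_tokens, reduce_doc_ids)
print(index["soon"], index["yes"], index["please"])
```

On a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; the single loop here only shows the data flow.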
Any Query? 
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detection(summer training main)

  • 1. ©Ashok Royal POORNIMA INSTITUTE OF ENGINEERING & TECHNOLOGY, JAIPUR (DEPARTMENT OF COMPUTER ENGINEERING) Big Data Hadoop Presented By: Guided By: Ashok Royal Dr. E.S. Pilli
  • 2. Topics 1. Organization Profile 2. Schedule 3. Training Description 4. Project Description 5. Learning 6. Snapshots 7. Future Scope 8. Conclusion 9. References ©Ashok Royal 9 September 2014
  • 3. Organization Profile  Name - Malviya National Institute of Techonology, Jaipur.  MNIT, Jaipur is one of 30 national institutes of technology in India.  MNIT, established in 1963 inspired by Pt. Madan Mohan Malviya.  The institute's director is I. K. Bhat and the chairman of the board of Governors is Dr. K. K. Aggarwal. ©Ashok Royal 9 September 2014
  • 4. Organization Contacts Organization’s contacts: Address: Jawaharlal Nehru Marg, Jhalana, Malviya Nagar, Jaipur, Rajasthan 302017 Phone: 0141 271 3201 Email : espilli.cse@mnit.ac.in Website : www.mnit.ac.in ©Ashok Royal 9 September 2014
  • 5. Schedule Our training at MNIT were broadly divided into three phases:  Case study of Hadoop and related papers (first 30 days).  Hadoop cluster making (first 30 days).  Single node  Multi node  Implementation of Near Duplicate Detection Using Hadoop MapReduce (last 15 days). ©Ashok Royal 9 September 2014
  • 6. Training Coordinator  Name - Dr. E. S. Pilli  Assistant Professor at MNIT, Jaipur. He has done Ph.D.(CSE) from IIT Roorkee. He is a very supportive person and currently guiding many M.Tech and Ph. D students in Cloud Computing, Big Data & Botnets.  Email: espilli.cse@mnit.ac.in ©Ashok Royal 9 September 2014
  • 7. What is Big Data?
      - Lots of data (terabytes or petabytes).
      - Big data is a term for a collection of data sets so large and complex that it becomes difficult to process with traditional data processing applications.
      - Big data refers to large data sets that are challenging to store, search, share, visualize and analyze.
  • 8. Various forms of Data
    Data comes mainly in three forms:
      - Structured data
      - Unstructured data
      - Semi-structured data
  • 9. Why is Data so BIG?
      - 20 terabytes: photos uploaded to Facebook each month.
      - 330 terabytes: data that the Large Hadron Collider will produce each week.
      - 530 terabytes: all the videos on YouTube.
      - 1 petabyte: data processed by Google's servers every 72 minutes.
  • 10. Data growth
  • 11. What is Hadoop?
      - It is an open-source software library written in Java.
      - The Hadoop software library is a framework that allows for the distributed processing of large data sets (big data) across clusters of computers using simple programming models.
  • 12. Modules of Hadoop
      - Hadoop Common
      - Hadoop Distributed File System (HDFS)
      - Hadoop MapReduce
  • 13. Hadoop Common
      - It provides access to the file systems supported by Hadoop.
      - The Hadoop Common package contains the JAR files and scripts needed to start Hadoop.
      - The package also provides source code, documentation, and a contribution section that includes projects from the Hadoop community.
  • 14. HDFS
      - Hadoop uses HDFS, a distributed file system based on GFS (Google File System), as its shared file system.
      - The HDFS architecture divides files into large chunks that are distributed across data servers.
      - It has a NameNode and DataNodes.
  • 15. Main components of HDFS
      - NameNode: the master of the system. It maintains and manages the blocks that are present on the DataNodes.
      - DataNodes: slaves that are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests from clients.
  • 16. Main components of HDFS
      - Secondary NameNode:
          - Used as a savepoint.
          - Connects to the NameNode every hour*.
          - Keeps a backup of the NameNode metadata.
          - The saved metadata can be used to rebuild a failed NameNode.
  • 17. MapReduce
      - The Hadoop MapReduce framework harnesses a cluster of machines and executes user-defined MapReduce jobs across the nodes in the cluster.
      - A MapReduce computation has two phases: a map phase and a reduce phase.
  • 18. HDFS and MapReduce Layers
  • 19. Hadoop Server Roles (diagram): a client computer submits a MapReduce job to the JobTracker on the master node; each of the three slave nodes runs a TaskTracker with a task instance.
  • 20. Hadoop Architecture
  • 21. Hadoop Streaming
      - It lets you create and run map/reduce jobs with any executable or script as the mapper and/or reducer.
      - HDFS is designed to process large files on a commodity cluster at high speed.
      - It follows a write-once, read-many approach: once a large data set has been stored, we tend to use the data, not modify it.
      - The time to read the whole data set is more important.
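The streaming contract described on this slide can be imitated locally. The sketch below is not from the presentation: it assumes the usual streaming convention of line-oriented text with the key and value separated by a tab, and it uses a hypothetical job (building a token-to-documents index) purely for illustration. In a real streaming job the mapper and reducer would be separate scripts reading `sys.stdin` and writing `sys.stdout`; here they take any iterable of lines so the pipeline can be run in one process.

```python
import io
from itertools import groupby


def mapper(lines, out):
    """Read 'doc_id<TAB>text' records and emit one 'token<TAB>doc_id' line per token."""
    for line in lines:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        for token in text.lower().split():
            out.write(f"{token}\t{doc_id}\n")


def reducer(sorted_lines, out):
    """Read mapper output sorted by key and emit 'token<TAB>doc_id,doc_id,...'."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for token, group in groupby(pairs, key=lambda kv: kv[0]):
        doc_ids = sorted({doc_id for _, doc_id in group})
        out.write(f"{token}\t{','.join(doc_ids)}\n")


def run_locally(records):
    """Imitate the streaming pipeline in one process: mapper | sort | reducer."""
    mapped = io.StringIO()
    mapper(records, mapped)
    shuffled = sorted(mapped.getvalue().splitlines())  # stands in for the framework's sort
    reduced = io.StringIO()
    reducer(shuffled, reduced)
    return reduced.getvalue().splitlines()


if __name__ == "__main__":
    for line in run_locally(["d1\tas soon as possible", "d2\tpossible please"]):
        print(line)
```

The reducer relies on its input being sorted by key, which is why `run_locally` sorts the mapper output before passing it on.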
  • 22. Hadoop Workflow
      1. Load data into the cluster (HDFS writes)
      2. Analyze the data (MapReduce)
      3. Store results in the cluster (HDFS writes)
      4. Read the results from the cluster (HDFS reads)
  • 23. Word count Example
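The slide's figure is not part of this transcript, so as a stand-in here is a minimal in-memory sketch of word count. The function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative and are not Hadoop APIs; the grouping step that Hadoop performs between the two phases is written out explicitly.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce over a small list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data Hadoop"]))
```

On a cluster the map calls run in parallel on different input splits and the reduce calls run in parallel on different keys; this sketch only shows the data flow.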
  • 24. Prominent users of Hadoop
      - Amazon – 100 nodes
      - Facebook – two clusters of 8000 and 3000 nodes
      - Adobe – 80 nodes
      - eBay – 532 nodes
      - Yahoo – cluster of about 4500 nodes
      - IIIT Hyderabad – 30 nodes
      - IBM, Microsoft and many more companies are also using Hadoop.
  • 25.
      - Near duplicates are pairs of objects with high similarity.
      - Similarity is measured quantitatively by a similarity function.
      - Given a collection of records, the similarity join problem is to find all pairs of records <x, y> such that sim(x, y) >= t, where t is the similarity threshold.
      - Tokenize: each record is a set of tokens from a finite universe. Suppose each record is a single text document:
          x = "yes as soon as possible"
          y = "as soon as possible please"
          x = {A, B, C, D, E}
          y = {B, C, D, E, F}
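The tokenization step above can be sketched as follows. The slide does not say how repeated words are handled; this sketch assumes each repeat counts as a separate token (the second "as" differs from the first), since that is what makes the five-word record x map to the five tokens {A, B, C, D, E}. The `tokenize` helper is illustrative, not the project's code.

```python
from collections import Counter


def tokenize(text):
    """Turn a text record into a set of tokens.

    Repeated words are kept distinct by pairing each word with its
    occurrence number. This is an assumption made so that the slide's
    example, where "as" appears twice, yields five tokens per record.
    """
    seen = Counter()
    tokens = set()
    for word in text.lower().split():
        seen[word] += 1
        tokens.add((word, seen[word]))
    return tokens


if __name__ == "__main__":
    x = tokenize("yes as soon as possible")
    y = tokenize("as soon as possible please")
    print(len(x), len(y), len(x & y), len(x | y))
```

With this convention x and y share four tokens and have six distinct tokens between them, matching the overlap between {A, B, C, D, E} and {B, C, D, E, F}.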
  • 26. Project Description
      - Project name: Near Duplicate Detection.
      - Comparative analysis of millions of documents on a network to find similar documents based on a predefined threshold value.
      - Near duplicate detection is used in web crawling and many other data mining tasks.
      - Near duplicates are not "exact duplicates" but files with minute differences.
  • 27. Application in Search Engine
  • 28. Application in Search Engine (cont.)
      - Web documents with a similarity score greater than a predefined threshold are considered near duplicates.
      - These near-duplicate pages are not added to the search engine's repository.
      - This reduces the storage cost of search engines.
      - It improves the quality of the search index.
  • 29. Similarity Function
      - To calculate the similarity between two documents we have used the Jaccard function.
      - Jaccard similarity function: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t$
      - Example: x = {A, B, C, D, E}, y = {B, C, D, E, F}, so $J(x, y) = 4/6 \approx 0.67$
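A minimal sketch of the Jaccard function and of a threshold-based similarity join built on it. The brute-force comparison of every pair is an assumption chosen for clarity and is not the MapReduce implementation used in the project; `similarity_join` and the record ids are illustrative names.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity |x & y| / |x | y| of two token sets."""
    if not x and not y:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(x & y) / len(x | y)


def similarity_join(records, t):
    """Return all pairs of record ids whose Jaccard similarity is >= t.

    Brute force over all pairs, so only suitable for a handful of records.
    """
    return [
        (a, b)
        for (a, xa), (b, xb) in combinations(records.items(), 2)
        if jaccard(xa, xb) >= t
    ]


if __name__ == "__main__":
    x = set("ABCDE")
    y = set("BCDEF")
    print(round(jaccard(x, y), 2))
    print(similarity_join({"x": x, "y": y, "z": set("XYZ")}, t=0.6))
```

Comparing all pairs grows quadratically with the number of records, which is the cost a distributed approach is meant to spread across the cluster.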
  • 30. Steps to Detect near duplicate
  • 31. Snapshots - HDFS
  • 32. Snapshot - MapReduce Processing
  • 33. Conclusion
      - Training in big data helped us understand the current trend in the IT industry and how technology is becoming more useful to human development.
      - Big data is the future. A lot of research is currently going on in this field. Because data is growing at an ever faster rate, there is a huge need for tools and technologies that can handle it.
  • 34. Conclusion
      - Hadoop is an emerging framework used by many big firms such as Facebook, Microsoft, IBM, Yahoo, Amazon and many others.
      - Our experience at MNIT was excellent, as it gave us the platform and support for our tasks and case study.
  • 35.
      - Big data and big data solutions are among the burning issues in the present IT industry, so working on them will surely make us more useful to it.
  • 36.
      - The proposed method solves the difficulties of information retrieval from the web.
      - The approach detects near-duplicate web pages efficiently, based on the keywords extracted from the web pages.
      - It reduces the memory space needed for web repositories.
      - Near duplicate detection increases the search engine's quality.
  • 37. References
      - J. G. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In CIKM, 2003.
      - Qinsheng Du, Wei Liu, Guolin Li, and Yonglin Tang. Near Duplicate Detection Using Map-Reduce. In 2nd International Conference on Computer Science and Network Technology, 2012.
  • 38. References
      - Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, and Guoren Wang. Efficient Similarity Joins for Near-Duplicate Detection.
      - Apache Hadoop. http://hadoop.apache.org.
      - Hadoop-beginners.blogspot.com.
  • 39. NameNode and DataNode
      - The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
      - The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
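The division of labour described above (the NameNode holds the block-to-DataNode mapping, the DataNodes hold the data) can be illustrated with a toy model. Everything here is an assumption made for illustration: `ToyNameNode` is not an HDFS API, the block size and replication factor are arbitrary small numbers, and replicas are placed round-robin, whereas real HDFS uses its own placement policy.

```python
from itertools import cycle


class ToyNameNode:
    """Keeps only metadata: which blocks make up a file and where the replicas live."""

    def __init__(self, datanodes, block_size, replication):
        # Assumes replication <= number of DataNodes so replicas land on distinct nodes.
        self.datanodes = list(datanodes)
        self.block_size = block_size
        self.replication = replication
        self.block_map = {}  # file name -> list of (block index, [datanode, ...])
        self._next = cycle(range(len(self.datanodes)))

    def add_file(self, name, size):
        """Split a file of `size` bytes into blocks and place replicas round-robin."""
        n_blocks = max(1, -(-size // self.block_size))  # ceiling division
        blocks = []
        for index in range(n_blocks):
            start = next(self._next)
            replicas = [
                self.datanodes[(start + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            blocks.append((index, replicas))
        self.block_map[name] = blocks
        return blocks

    def locate(self, name):
        """Answer a client's question: which DataNodes hold each block of this file?"""
        return self.block_map[name]


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3"], block_size=64, replication=2)
    for block in namenode.add_file("log.txt", size=150):
        print(block)
```

A client would ask the NameNode for the block locations and then read the block contents directly from the listed DataNodes; the toy model stores no file data at all, only the mapping.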
  • 40. MapReduce paradigm
      - Map phase: once the data set has been divided, the pieces are assigned to the TaskTrackers to perform the map phase. The map function is applied to the data and emits key/value pairs as the output of the map phase (i.e. data processing).
      - Reduce phase: the master node then collects the answers to all the subproblems and combines them in some way to form the output, the answer to the problem it was originally trying to solve (i.e. data collection and digesting).
  • 41. Any queries?