SlideShare a Scribd company logo
1 of 4
Download to read offline
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. I (Nov – Dec. 2015), PP 91-94
www.iosrjournals.org
DOI: 10.9790/0661-17619194 www.iosrjournals.org 91 | Page
Hadoop add-on API for Advanced Content Based Search &
Retrieval
1
Kshama Jain, 2
Aditya Kamble, 3
Siddhesh Palande, 4
Rahul Rao,
Prof. ShaileshHule
Department of Computer EngineeringPimpriChinchwad College of Engineering, Pune
Guided by: Prof. ShaileshHule
Abstract:Unstructured data like doc, pdf is lengthy to search and filter for desired information. We need to go
through every file manually for finding information. It is very time consuming and frustrating. We can use
features of big data management system like Hadoop to organize unstructured data dynamically and return
desired information. Hadoop provides features like Map Reduce, HDFS, HBase to filter data as per user input.
Finally we can develop Hadoop Addon for content search and filtering on unstructured data.
Index Terms:Hadoop, Map Reduce, HBase, Content BasedRetrieval.
I. Introduction
A. Hadoop
Hadoop is an open-source software framework written in Java for distributed storage and distributed
processing of very large data sets on computer clusters built from commodity hardware. Commodity computing,
or commodity cluster computing, is the use of large numbers of already-available computing components for
parallel computing, to get the greatest amount of useful computation at low cost. All the modules in Hadoop are
designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines)
are commonplace and thus should be automatically handled in software by the framework. The core of Apache
Hadoop consists of a storage part (Hadoop Distributed File System (HDFS)) and a processing part (Map
Reduce). Hadoop splits files into large blocks and distributesthem amongst the nodes in the cluster.
B. HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity
hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides
high throughput access to application data and is suitable for applications that have large data sets.
C. MapReduce
A MapReduce program is composed of a Map() procedure(method) that performs filtering and sorting
and a Reduce() method that performs a summary operation. The MapReduce System run the various tasks in
parallel, managing all communications and data transfers between the various parts of the system.
D. HBase
HBase is a data model that is similar to Googles big table designed to provide quick random access to
huge amounts of structured data. This tutorial provides an introduction to HBase, the procedures to set up HBase
on Hadoop File Systems, and ways to interact with HBase shell. It also describes how to connect to HBase using
java, and how to perform basic operations on HBase using java.
E. Current System
Tasks like assignments, taking notes from text books and reference books on particular topic, topics for
presentation need deep reading and need to go through every document manually just to find relevant content on
given topic. Currently present systems are only searching based on document title, author, size, and time but not
on content. So to do content based search on big data documents and large text data Haddop framework can be
used
F. Content Based Approach
Manually filtering from any kind of unstructured data like PDF is tedious and time consuming, we are
developing API for Hadoop finding relevant information from large sets and retrieving the same is main
concern. So using Hadoop Big Data management framework consist of HDFS, MapReduce, and HBase, we are
developing content based search on PDF documents to solve real life problem. So this is basic motivation for the
Hadoop add-on API for Advanced Content Based Search & Retrieval
DOI: 10.9790/0661-17619194 www.iosrjournals.org 92 | Page
project.
II. Proposed System
This system shall retrieve the required contents of files which are in an unstructured format containing
huge amount of Data like e-books in a digital library where the number of books are present in thousands. The
scope here is initially limited to PDF files which may be expanded to other unstructured formats like ePUB,
mobi. The input is provided to the system API in the form of search query which will be firstly filtered to find
the important expression as a query to the API which will return the important content to the user in the form of
paragraph or text highlighted using the power of distributed computing. The particular page or the Entire book
itself can be downloaded by the Library users if the content is satisfied else the search continues for finding
relevant content
Fig. 1. Data Flow Diagram
Algorithm 1: Content Based Search Algorithm
1) Take Input Query from User
2) Use the Suffix Stripping Algorithm to remove English Suffixes like ’ed’,’ing’,’ ’ ’s ’ to extract keywords
3) Match keywords with Documents’ metadata present in Hbase and filter the documents to be searched.
4) Apply Searching Algorithm based on map-reduce operations like keyword count,positions.
 Open the PDF
 Distribute the text using map operation.
 Combine the results with reduce operations.
5) Sort the documents which are returned by Searching w.r.t keyword count and relevance.
6) Depending on API methods return the result with
 Paragraph
 Whole Page
 Current and Previous Page
A. Document Metadata
Searching and retrieving information from data inside HDFS. Hadoop Map Reduce operations will be
used to perform key value pair generation and depend upon result information is searched.
a) Here, data is documents in PDF (Portable Document Format) format. These documents are stored on HDFS
Hadoop Distributed File System. These documents are considered as raw data. These documents can have
properties like
 Name Size Date
 Author
b) Document meta data is also maintain in HBase distributed database. It has attributes like
 Content Keywords
Hadoop add-on API for Advanced Content Based Search & Retrieval
DOI: 10.9790/0661-17619194 www.iosrjournals.org 93 | Page
 Index Keyword
 Document Keywords
This information will be useful to filter documents before performing content based searching operations.
Fig. 2. Activity Diagram
Algorithm 2: Meta-data Extraction Service Algorithm
Meta-data Extraction Service
1) Take the input pdf files from user
2) Check if corrupt
3) Upload the file into the Hadoop Distributed Filesystem
4) Run the Metadata Extracting Algorithm using the Map Reduce Engine.
5) Extract the Bibliography and Index Contents of each and every document (if present) and related keywords.
6) Store it in the HBase as metadata.
7) This metadata will be used by map reduce engine to search and retrieve data from relevant document inside
HDFS using keyword relative custom partitioning.
8) Return the result to the search API
9) Stop
Hadoop add-on API for Advanced Content Based Search & Retrieval
DOI: 10.9790/0661-17619194 www.iosrjournals.org 94 | Page
III. Applications
Content based search and retrieval on
1) Digital library books
2) Conference Papers
3) Private Authorities
4) Government Authorities
5) Unstructured text files
This API can be used to develop above functionality on platforms like
1) Web Application
2) Android Application
3) Standalone Application
4) Command Line Interface Application
IV. Future Scope
Extending our API further to read scanned text copies and retrieve the data from them would solve
further advanced issues. This API again can be extended further to work with Image processing so that it may
find a relevant information from the digital images for the end- users.
V. Conclusion
This API would be favorable for many large organizations like Large-scale Industries and Educational
institutes, Power-Grids, Air-lines, Government Offices and many others working with and producing Big-Data
to retrieve data in an efficient and a very cost effective way. Thus we can use this API in Hadoop to reduce
manual efforts and bring advance content based search and retrieval
References
[1] Lars George, ”HBase: The Definitive Guide”, 1st edition, O’Reilly Media, September 2011, ISBN 9781449396107
[2] Tom White, ”Hadoop: The Definitive Guide”, 1st edition, O’Reilly Media, June 2009, ISBN 9780596521974
[3] Apache Hadoop HDFS homepage http://hadoop.apache.org/hdfs/
[4] MehulNalinVora ,”Hadoop-HBase for Large-Scale Data”,InnovationLabs,PERC,ICCNT Conference
[5] YijunBei,ZhenLin,ChenZhao,Xiaojun Zhu ,”HBase System-based Distributed Framework for Searching Large Graph
Databases”,ICCNT Conference
[6] SeemaMaitrey,C.K.”Handling Big Data Efficiently by using Map Reduce Technique”,ICCICT
[7] Maitrey S, Jha. An Integrated Approach for CURE Clustering using Map-Reduce Technique. In Proceedings of Elsevier,ISBN 978-
81- 910691-6-3,2 nd August 2013.

More Related Content

What's hot

Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...IJECEIAES
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Cognizant
 
Design architecture based on web
Design architecture based on webDesign architecture based on web
Design architecture based on webcsandit
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
 
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...iosrjce
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsCognizant
 
A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...ijcses
 
Intro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseIntro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseJonathan Bloom
 
Sparkr sigmod
Sparkr sigmodSparkr sigmod
Sparkr sigmodwaqasm86
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?IJCSIS Research Publications
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkLaxmi8
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemNilaNila16
 

What's hot (17)

Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
Performance evaluation of Map-reduce jar pig hive and spark with machine lear...
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
B1803040412
B1803040412B1803040412
B1803040412
 
Design architecture based on web
Design architecture based on webDesign architecture based on web
Design architecture based on web
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap Vikram Andem Big Data Strategy @ IATA Technology Roadmap
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
 
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ...
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Big data
Big dataBig data
Big data
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...A comparative survey based on processing network traffic data using hadoop pi...
A comparative survey based on processing network traffic data using hadoop pi...
 
Intro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseIntro to Hybrid Data Warehouse
Intro to Hybrid Data Warehouse
 
Big data
Big dataBig data
Big data
 
Sparkr sigmod
Sparkr sigmodSparkr sigmod
Sparkr sigmod
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 

Viewers also liked

RSA Algorithm as a Data Security Control Mechanism in RFID
RSA Algorithm as a Data Security Control Mechanism in RFIDRSA Algorithm as a Data Security Control Mechanism in RFID
RSA Algorithm as a Data Security Control Mechanism in RFIDIOSR Journals
 
OFDM PAPR Reduction by Shifting Two Null Subcarriers among Two Data Subcarriers
OFDM PAPR Reduction by Shifting Two Null Subcarriers among Two Data SubcarriersOFDM PAPR Reduction by Shifting Two Null Subcarriers among Two Data Subcarriers
OFDM PAPR Reduction by Shifting Two Null Subcarriers among Two Data SubcarriersIOSR Journals
 
Effectiveness of Information Communication Technologies for Education System
Effectiveness of Information Communication Technologies for Education SystemEffectiveness of Information Communication Technologies for Education System
Effectiveness of Information Communication Technologies for Education SystemIOSR Journals
 
Video Encryption and Decryption with Authentication using Artificial Neural N...
Video Encryption and Decryption with Authentication using Artificial Neural N...Video Encryption and Decryption with Authentication using Artificial Neural N...
Video Encryption and Decryption with Authentication using Artificial Neural N...IOSR Journals
 
Analysis of Ketoconazole and Piribedil Using Ion Selective Electrodes
Analysis of Ketoconazole and Piribedil Using Ion Selective ElectrodesAnalysis of Ketoconazole and Piribedil Using Ion Selective Electrodes
Analysis of Ketoconazole and Piribedil Using Ion Selective ElectrodesIOSR Journals
 
Study on Coupling Model of Methanol Steam Reforming and Simultaneous Hydrogen...
Study on Coupling Model of Methanol Steam Reforming and Simultaneous Hydrogen...Study on Coupling Model of Methanol Steam Reforming and Simultaneous Hydrogen...
Study on Coupling Model of Methanol Steam Reforming and Simultaneous Hydrogen...IOSR Journals
 
Leather Quality Estimation Using an Automated Machine Vision System
Leather Quality Estimation Using an Automated Machine Vision SystemLeather Quality Estimation Using an Automated Machine Vision System
Leather Quality Estimation Using an Automated Machine Vision SystemIOSR Journals
 
Cloud Computing for E-Commerce
Cloud Computing for E-CommerceCloud Computing for E-Commerce
Cloud Computing for E-CommerceIOSR Journals
 
Toxicological Effect of Effluents from Indomie Plc on Some Biochemical Parame...
Toxicological Effect of Effluents from Indomie Plc on Some Biochemical Parame...Toxicological Effect of Effluents from Indomie Plc on Some Biochemical Parame...
Toxicological Effect of Effluents from Indomie Plc on Some Biochemical Parame...IOSR Journals
 
Experimental Evaluation of the Effect of Thread Angle on the Fatigue Life of ...
Experimental Evaluation of the Effect of Thread Angle on the Fatigue Life of ...Experimental Evaluation of the Effect of Thread Angle on the Fatigue Life of ...
Experimental Evaluation of the Effect of Thread Angle on the Fatigue Life of ...IOSR Journals
 
Study On Traffic Conlict At Unsignalized Intersection In Malaysia
Study On Traffic Conlict At Unsignalized Intersection In Malaysia Study On Traffic Conlict At Unsignalized Intersection In Malaysia
Study On Traffic Conlict At Unsignalized Intersection In Malaysia IOSR Journals
 
Parametric Optimization of Single Cylinder Diesel Engine for Pyrolysis Oil an...
Parametric Optimization of Single Cylinder Diesel Engine for Pyrolysis Oil an...Parametric Optimization of Single Cylinder Diesel Engine for Pyrolysis Oil an...
Parametric Optimization of Single Cylinder Diesel Engine for Pyrolysis Oil an...IOSR Journals
 
Computation of Theoretical Heat of Formation in a Kiln Using Fortran Language
Computation of Theoretical Heat of Formation in a Kiln Using Fortran LanguageComputation of Theoretical Heat of Formation in a Kiln Using Fortran Language
Computation of Theoretical Heat of Formation in a Kiln Using Fortran LanguageIOSR Journals
 
A Novel Approach of Area-Efficient FIR Filter Design Using Distributed Arithm...
A Novel Approach of Area-Efficient FIR Filter Design Using Distributed Arithm...A Novel Approach of Area-Efficient FIR Filter Design Using Distributed Arithm...
A Novel Approach of Area-Efficient FIR Filter Design Using Distributed Arithm...IOSR Journals
 
Industrial Process Management Using LabVIEW
Industrial Process Management Using LabVIEWIndustrial Process Management Using LabVIEW
Industrial Process Management Using LabVIEWIOSR Journals
 

Viewers also liked (20)

RSA Algorithm as a Data Security Control Mechanism in RFID
RSA Algorithm as a Data Security Control Mechanism in RFIDRSA Algorithm as a Data Security Control Mechanism in RFID
RSA Algorithm as a Data Security Control Mechanism in RFID
 
OFDM PAPR Reduction by Shifting Two Null Subcarriers among Two Data Subcarriers
OFDM PAPR Reduction by Shifting Two Null Subcarriers among Two Data SubcarriersOFDM PAPR Reduction by Shifting Two Null Subcarriers among Two Data Subcarriers
OFDM PAPR Reduction by Shifting Two Null Subcarriers among Two Data Subcarriers
 
Effectiveness of Information Communication Technologies for Education System
Effectiveness of Information Communication Technologies for Education SystemEffectiveness of Information Communication Technologies for Education System
Effectiveness of Information Communication Technologies for Education System
 
Video Encryption and Decryption with Authentication using Artificial Neural N...
Video Encryption and Decryption with Authentication using Artificial Neural N...Video Encryption and Decryption with Authentication using Artificial Neural N...
Video Encryption and Decryption with Authentication using Artificial Neural N...
 
Analysis of Ketoconazole and Piribedil Using Ion Selective Electrodes
Analysis of Ketoconazole and Piribedil Using Ion Selective ElectrodesAnalysis of Ketoconazole and Piribedil Using Ion Selective Electrodes
Analysis of Ketoconazole and Piribedil Using Ion Selective Electrodes
 
Study on Coupling Model of Methanol Steam Reforming and Simultaneous Hydrogen...
Study on Coupling Model of Methanol Steam Reforming and Simultaneous Hydrogen...Study on Coupling Model of Methanol Steam Reforming and Simultaneous Hydrogen...
Study on Coupling Model of Methanol Steam Reforming and Simultaneous Hydrogen...
 
F017213747
F017213747F017213747
F017213747
 
Leather Quality Estimation Using an Automated Machine Vision System
Leather Quality Estimation Using an Automated Machine Vision SystemLeather Quality Estimation Using an Automated Machine Vision System
Leather Quality Estimation Using an Automated Machine Vision System
 
Cloud Computing for E-Commerce
Cloud Computing for E-CommerceCloud Computing for E-Commerce
Cloud Computing for E-Commerce
 
Toxicological Effect of Effluents from Indomie Plc on Some Biochemical Parame...
Toxicological Effect of Effluents from Indomie Plc on Some Biochemical Parame...Toxicological Effect of Effluents from Indomie Plc on Some Biochemical Parame...
Toxicological Effect of Effluents from Indomie Plc on Some Biochemical Parame...
 
J012537580
J012537580J012537580
J012537580
 
Experimental Evaluation of the Effect of Thread Angle on the Fatigue Life of ...
Experimental Evaluation of the Effect of Thread Angle on the Fatigue Life of ...Experimental Evaluation of the Effect of Thread Angle on the Fatigue Life of ...
Experimental Evaluation of the Effect of Thread Angle on the Fatigue Life of ...
 
Study On Traffic Conlict At Unsignalized Intersection In Malaysia
Study On Traffic Conlict At Unsignalized Intersection In Malaysia Study On Traffic Conlict At Unsignalized Intersection In Malaysia
Study On Traffic Conlict At Unsignalized Intersection In Malaysia
 
Parametric Optimization of Single Cylinder Diesel Engine for Pyrolysis Oil an...
Parametric Optimization of Single Cylinder Diesel Engine for Pyrolysis Oil an...Parametric Optimization of Single Cylinder Diesel Engine for Pyrolysis Oil an...
Parametric Optimization of Single Cylinder Diesel Engine for Pyrolysis Oil an...
 
Computation of Theoretical Heat of Formation in a Kiln Using Fortran Language
Computation of Theoretical Heat of Formation in a Kiln Using Fortran LanguageComputation of Theoretical Heat of Formation in a Kiln Using Fortran Language
Computation of Theoretical Heat of Formation in a Kiln Using Fortran Language
 
E012644143
E012644143E012644143
E012644143
 
A Novel Approach of Area-Efficient FIR Filter Design Using Distributed Arithm...
A Novel Approach of Area-Efficient FIR Filter Design Using Distributed Arithm...A Novel Approach of Area-Efficient FIR Filter Design Using Distributed Arithm...
A Novel Approach of Area-Efficient FIR Filter Design Using Distributed Arithm...
 
Industrial Process Management Using LabVIEW
Industrial Process Management Using LabVIEWIndustrial Process Management Using LabVIEW
Industrial Process Management Using LabVIEW
 
F1802052830
F1802052830F1802052830
F1802052830
 
J010337176
J010337176J010337176
J010337176
 

Similar to M017619194

OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...Puneet Kansal
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introductionsaisreealekhya
 
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...cscpconf
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED
 
Hadoop Based Data Discovery
Hadoop Based Data DiscoveryHadoop Based Data Discovery
Hadoop Based Data DiscoveryBenjamin Ashkar
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
 

Similar to M017619194 (20)

HDFS
HDFSHDFS
HDFS
 
G017143640
G017143640G017143640
G017143640
 
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
 
Big data
Big dataBig data
Big data
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
paper
paperpaper
paper
 
Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43
 
Hadoop Based Data Discovery
Hadoop Based Data DiscoveryHadoop Based Data Discovery
Hadoop Based Data Discovery
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
hadoop resume
hadoop resumehadoop resume
hadoop resume
 

More from IOSR Journals (20)

A011140104
A011140104A011140104
A011140104
 
M0111397100
M0111397100M0111397100
M0111397100
 
L011138596
L011138596L011138596
L011138596
 
K011138084
K011138084K011138084
K011138084
 
J011137479
J011137479J011137479
J011137479
 
I011136673
I011136673I011136673
I011136673
 
G011134454
G011134454G011134454
G011134454
 
H011135565
H011135565H011135565
H011135565
 
F011134043
F011134043F011134043
F011134043
 
E011133639
E011133639E011133639
E011133639
 
D011132635
D011132635D011132635
D011132635
 
C011131925
C011131925C011131925
C011131925
 
B011130918
B011130918B011130918
B011130918
 
A011130108
A011130108A011130108
A011130108
 
I011125160
I011125160I011125160
I011125160
 
H011124050
H011124050H011124050
H011124050
 
G011123539
G011123539G011123539
G011123539
 
F011123134
F011123134F011123134
F011123134
 
E011122530
E011122530E011122530
E011122530
 
D011121524
D011121524D011121524
D011121524
 

Recently uploaded

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 

Recently uploaded (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 

M017619194

  • 1. IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. I (Nov – Dec. 2015), PP 91-94 www.iosrjournals.org DOI: 10.9790/0661-17619194 www.iosrjournals.org 91 | Page Hadoop add-on API for Advanced Content Based Search & Retrieval 1 Kshama Jain, 2 Aditya Kamble, 3 Siddhesh Palande, 4 Rahul Rao, Prof. ShaileshHule Department of Computer EngineeringPimpriChinchwad College of Engineering, Pune Guided by: Prof. ShaileshHule Abstract:Unstructured data like doc, pdf is lengthy to search and filter for desired information. We need to go through every file manually for finding information. It is very time consuming and frustrating. We can use features of big data management system like Hadoop to organize unstructured data dynamically and return desired information. Hadoop provides features like Map Reduce, HDFS, HBase to filter data as per user input. Finally we can develop Hadoop Addon for content search and filtering on unstructured data. Index Terms:Hadoop, Map Reduce, HBase, Content BasedRetrieval. I. Introduction A. Hadoop Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Commodity computing, or commodity cluster computing, is the use of large numbers of already-available computing components for parallel computing, to get the greatest amount of useful computation at low cost. All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and thus should be automatically handled in software by the framework. The core of Apache Hadoop consists of a storage part (Hadoop Distributed File System (HDFS)) and a processing part (Map Reduce). Hadoop splits files into large blocks and distributesthem amongst the nodes in the cluster. B. HDFS The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. C. MapReduce A MapReduce program is composed of a Map() procedure(method) that performs filtering and sorting and a Reduce() method that performs a summary operation. The MapReduce System run the various tasks in parallel, managing all communications and data transfers between the various parts of the system. D. HBase HBase is a data model that is similar to Googles big table designed to provide quick random access to huge amounts of structured data. This tutorial provides an introduction to HBase, the procedures to set up HBase on Hadoop File Systems, and ways to interact with HBase shell. It also describes how to connect to HBase using java, and how to perform basic operations on HBase using java. E. Current System Tasks like assignments, taking notes from text books and reference books on particular topic, topics for presentation need deep reading and need to go through every document manually just to find relevant content on given topic. Currently present systems are only searching based on document title, author, size, and time but not on content. So to do content based search on big data documents and large text data Haddop framework can be used F. Content Based Approach Manually filtering from any kind of unstructured data like PDF is tedious and time consuming, we are developing API for Hadoop finding relevant information from large sets and retrieving the same is main concern. So using Hadoop Big Data management framework consist of HDFS, MapReduce, and HBase, we are developing content based search on PDF documents to solve real life problem. So this is basic motivation for the
  • 2. Hadoop add-on API for Advanced Content Based Search & Retrieval DOI: 10.9790/0661-17619194 www.iosrjournals.org 92 | Page project. II. Proposed System This system shall retrieve the required contents of files which are in an unstructured format containing huge amount of Data like e-books in a digital library where the number of books are present in thousands. The scope here is initially limited to PDF files which may be expanded to other unstructured formats like ePUB, mobi. The input is provided to the system API in the form of search query which will be firstly filtered to find the important expression as a query to the API which will return the important content to the user in the form of paragraph or text highlighted using the power of distributed computing. The particular page or the Entire book itself can be downloaded by the Library users if the content is satisfied else the search continues for finding relevant content Fig. 1. Data Flow Diagram Algorithm 1: Content Based Search Algorithm 1) Take Input Query from User 2) Use the Suffix Stripping Algorithm to remove English Suffixes like ’ed’,’ing’,’ ’ ’s ’ to extract keywords 3) Match keywords with Documents’ metadata present in Hbase and filter the documents to be searched. 4) Apply Searching Algorithm based on map-reduce operations like keyword count,positions.  Open the PDF  Distribute the text using map operation.  Combine the results with reduce operations. 5) Sort the documents which are returned by Searching w.r.t keyword count and relevance. 6) Depending on API methods return the result with  Paragraph  Whole Page  Current and Previous Page A. Document Metadata Searching and retrieving information from data inside HDFS. Hadoop Map Reduce operations will be used to perform key value pair generation and depend upon result information is searched. a) Here, data is documents in PDF (Portable Document Format) format. These documents are stored on HDFS Hadoop Distributed File System. These documents are considered as raw data. These documents can have properties like  Name Size Date  Author b) Document meta data is also maintain in HBase distributed database. It has attributes like  Content Keywords
  • 3. Hadoop add-on API for Advanced Content Based Search & Retrieval DOI: 10.9790/0661-17619194 www.iosrjournals.org 93 | Page  Index Keyword  Document Keywords This information will be useful to filter documents before performing content based searching operations. Fig. 2. Activity Diagram Algorithm 2: Meta-data Extraction Service Algorithm Meta-data Extraction Service 1) Take the input pdf files from user 2) Check if corrupt 3) Upload the file into the Hadoop Distributed Filesystem 4) Run the Metadata Extracting Algorithm using the Map Reduce Engine. 5) Extract the Bibliography and Index Contents of each and every document (if present) and related keywords. 6) Store it in the HBase as metadata. 7) This metadata will be used by map reduce engine to search and retrieve data from relevant document inside HDFS using keyword relative custom partitioning. 8) Return the result to the search API 9) Stop
  • 4. Hadoop add-on API for Advanced Content Based Search & Retrieval DOI: 10.9790/0661-17619194 www.iosrjournals.org 94 | Page III. Applications Content based search and retrieval on 1) Digital library books 2) Conference Papers 3) Private Authorities 4) Government Authorities 5) Unstructured text files This API can be used to develop above functionality on platforms like 1) Web Application 2) Android Application 3) Standalone Application 4) Command Line Interface Application IV. Future Scope Extending our API further to read scanned text copies and retrieve the data from them would solve further advanced issues. This API again can be extended further to work with Image processing so that it may find a relevant information from the digital images for the end- users. V. Conclusion This API would be favorable for many large organizations like Large-scale Industries and Educational institutes, Power-Grids, Air-lines, Government Offices and many others working with and producing Big-Data to retrieve data in an efficient and a very cost effective way. Thus we can use this API in Hadoop to reduce manual efforts and bring advance content based search and retrieval References [1] Lars George, ”HBase: The Definitive Guide”, 1st edition, O’Reilly Media, September 2011, ISBN 9781449396107 [2] Tom White, ”Hadoop: The Definitive Guide”, 1st edition, O’Reilly Media, June 2009, ISBN 9780596521974 [3] Apache Hadoop HDFS homepage http://hadoop.apache.org/hdfs/ [4] MehulNalinVora ,”Hadoop-HBase for Large-Scale Data”,InnovationLabs,PERC,ICCNT Conference [5] YijunBei,ZhenLin,ChenZhao,Xiaojun Zhu ,”HBase System-based Distributed Framework for Searching Large Graph Databases”,ICCNT Conference [6] SeemaMaitrey,C.K.”Handling Big Data Efficiently by using Map Reduce Technique”,ICCICT [7] Maitrey S, Jha. An Integrated Approach for CURE Clustering using Map-Reduce Technique. In Proceedings of Elsevier,ISBN 978- 81- 910691-6-3,2 nd August 2013.