International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 06 | June -2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 1035
Multi-Faceted Approach to Automated Classification of Business Web
Pages
Venkatesh R1
1RRD, Chennai, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Data Services Companies routinely aggregate
commercial business content automatically from entity
websites on a large scale. A key product is the base
information on new and upcoming businesses which have a
potential value for business information databases and CRM
products. A legal name, contact information, business activity
and name of principal/leadership team are valuable basic
information that can be aggregated. However, there is a cost
involved in crawling all the pages to identify the right page
where the four key information points are located. A business
website has an average of 5 pages, and pinging and crawling
each page involves IP cost at a granular level and virtual
machine capacity at a high level. The efficiency of the current
process and algorithms in use is not satisfactory and results in
increased cost of operation. However, because the current
algorithms are “Black Box”, companies find it hard to tweak
them to better suit their needs and achieve higher
accuracy figures. This research is aimed at building an effective
model to classify pages, reduce the crawl volume and
control running costs. The objective is to predict the most
likely pages to pass on for extraction and the most unlikely
pages to bypass.
Key Words: Information Retrieval; Information Science;
Data Mining; Supervised Learning; Webpage Classification
1. INTRODUCTION
Data Services companies routinely aggregate commercial
business content automatically from entity websites on a
large scale. A key product is the base information on new
and upcoming businesses which have a potential value for
business information databases and CRM products. Addresses
and names of executives are valuable basic information that
can be aggregated along with a multitude of other
requirement-based data. Companies have proprietary
methods to discover websites of new businesses and aggregate
base information. However, there is a cost involved in
crawling all the pages to identify the right page where the
four key information points are located. A business website
has an average of 5 pages, and pinging and crawling each
page involves IP cost at a granular level and virtual machine
capacity at a high level.
Amongst the multitude of steps orchestrated by these
platforms, a prominent activity is to classify whether a page has
relevant content or not. The platform channels all relevant
pages to a downstream extraction process. Companies face a
challenge with significant False Negatives in classification,
which impacts revenues, as every business data point missed
due to incorrect classification is a potential revenue loss.
Companies are looking for alternative statistical model(s)
built and deployed on top of their text mining and data
collection engines that could deliver better and more
accurate results. The objective of the research is to improve
the current web page classification engine used at data
services companies by building an effective predictive model
to classify the pages, reduce the crawl volume and
control running costs. The model will predict whether a web page
contains executive information and Business Address or not,
thereby improving revenue potential and reducing operational
costs via improved coverage.
1.1 Limitations
The strategies for building the model rely mainly on
information in URL text and were limited by the current
capacity for data collection. Hence the data may not be
exhaustive of all possible keywords that might appear in the URL
texts of webpages containing the target information.
Since the project uses supervised learning algorithms, the
target variables required manual flagging, which limited the
size of the training set available.
The logic and methodology of data collection are
outside the scope of the project and hence have
not been discussed in detail.
1.2 Challenges
There is no prior classification of the pages. The
researcher had to examine each link and manually classify
the training set thereby increasing data preparation efforts
or limiting the size of the training set.
The page links provided were generated six months ago.
Some of these links are likely to be obsolete or inactive.
Data for page characteristics had to be generated using
automated means to manage time. Automation might not
yield complete data, owing to safeguards against scraping or
inherent flaws of the page, which limits the size of the training
set.
1.3 Machine Learning
The following Machine Learning algorithms were
used for the final model, because the dataset contains
a high class imbalance with respect to both the target
variables:
• Lasso
• Logit
• Random Forest
• Naïve Bayes
• C5.0 – Rule Based
1.4 Model Evaluation & Validation
In machine learning, a classifier is commonly evaluated using a
confusion matrix. On imbalanced data,
accuracy places more weight on the common
classes than on the rare classes, so a classifier can score
well on accuracy while performing poorly on the rare classes.
Accuracy therefore becomes a misleading indicator, which is why
additional metrics have been developed by combining sensitivity
and specificity. The Area under the ROC Curve (AUC) of the
model is used as the evaluation metric, as the data is
highly imbalanced and using accuracy as the sole measure of
performance evaluation would paint a false picture.
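To make the point about accuracy concrete, the following minimal Python sketch scores a trivial classifier that assigns every page to the common class. The labels and scores are synthetic, and the helper names `accuracy` and `auc` are illustrative assumptions rather than anything used in the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic data (assumption): 8 rare-class pages (label 1) among 100.
y_true = [1] * 8 + [0] * 92

# A "classifier" that scores every page identically and always predicts 0.
constant_scores = [0.0] * 100
constant_preds = [0] * 100

acc_constant = accuracy(y_true, constant_preds)  # high, despite finding nothing
auc_constant = auc(y_true, constant_scores)      # no better than chance
```

The constant classifier reaches 92% accuracy without identifying a single rare-class page, while its AUC stays at 0.5, which is the behaviour that motivates using AUC here.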
1.5 Data Approach
The methodology follows a 4-stage process – gathering
data from web pages, identifying characteristics and
preparing a data frame to represent the characteristics,
applying various classification techniques to classify the
pages and evaluating the models.
Fig -1: The Figure depicts the 4-Stage Process mentioned
above
1.6 Literature Review
In the paper “Naive Bayes Web Page Classification with
HTML Mark-Up Enrichment” [1], the study considers
frequency-based representations that concentrate only on the
plain text of the HTML documents, and two weight functions which combine
several criteria including information from HTML mark-up.
In Nidhi Saxena’s [2] paper “An Improved Technique for Web
Page Classification in Respect of Domain Specific Search”, the
researchers use word sense disambiguation (WSD) and
part-of-speech (POS) tagging, ignore metadata in
webpages, and apply highly intelligent Machine Learning
models in order to reach high classification accuracy. It
focuses only on HTML content for web page classification. In
the article “Automated Classification of Web Sites using
Naïve Bayes Algorithm”, A. Patil [3] uniquely suggests that
the home page of an HTML-based website can be a source of
conclusive information for classification of web pages into
very broad categories. Therefore, the modelling strategies in this research
have been designed to concentrate on extracting specific
features to help with the same.
What distinguishes this research from the works described
above is that the researcher
has approached the problem with the goal of reducing the
cost incurred as a result of web crawling and extensive text
mining. Hence, the model greatly reduces the time and
complexity involved in rigorous text mining and instead
relies largely upon the information present in URL link texts
to devise a strategy for predicting the presence of the
target information in the pages. However, the researcher has
also included one variable with features derived directly
through text mining of the HTML of the
webpages, for certain complementary information based on
domain experience, and examined its performance and
significance. In addition, what also separates this
work from the existing body of work is the end goal of
classifying webpages on the basis of a highly business-specific
goal rather than making broad classifications.
2. MODELLING
There is a high class bias, with only 914 webpages out of
11430 containing address information, i.e. about 8.0% of the
webpages, and 286 webpages out of the 11430 webpages
containing executive information, i.e. about 2.5% of the web
pages.
As there were two different variables to predict, the problem
was approached as a Multi Label Imbalance Problem. As the
chances of success in fitting a single model onto a multilabel
dataset were very slim, the researcher decided to build
separate models for the two variables and predict the
chances of a page containing address information or
executive information separately.
2.1 Resampling
As the dataset contains a high class imbalance with
respect to both the target variables, various resampling
techniques, such as Ovun oversampling and SMOTE, were
considered. However, resampling was not done, as
resampled data would not be representative of the
population and would give a probability distribution that is
not related to the actual scenario. The data was split into
randomly sampled training and test datasets in the
ratio 85:15, containing 9715 and 1715 records respectively,
for validation on held-out data.
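The split sizes above can be reproduced with a short sketch. This is a minimal illustration using only the Python standard library; the function name `train_test_split_indices` and the seed value are assumptions, and the study's actual sampling procedure is not described in the source.

```python
import random


def train_test_split_indices(n_records, train_fraction, seed):
    """Shuffle record indices and cut them into train and test parts."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(n_records * train_fraction)  # fractional records are dropped
    return indices[:n_train], indices[n_train:]


# 11430 records split 85:15, as reported in the text.
train_idx, test_idx = train_test_split_indices(11430, 0.85, seed=42)
n_train, n_test = len(train_idx), len(test_idx)
```

With 11430 records, truncating 85% gives 9715 training records and leaves 1715 for testing, matching the reported counts. A purely random split does not guarantee the same class rates in both parts, which matters when one class covers only about 2.5% of the pages.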
2.2 Class Predictions
The AUC of each model was computed. Sensitivity and specificity
were plotted against the cut-off, and the cut-off at which the
two curves intersect was taken as the optimal cut-off for the
models below.
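The cut-off selection described above can be sketched as a search for the threshold at which sensitivity and specificity are closest. The code below is a minimal illustration on made-up scores. It assumes label 1 marks a page containing the target information and that a page is predicted positive when its score is at or above the threshold; neither convention is stated in the source.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Sensitivity and specificity when score >= threshold means positive.
    Assumes both classes are present in y_true."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def crossover_threshold(y_true, scores):
    """Return (threshold, sensitivity, specificity) for the observed score
    at which the gap between sensitivity and specificity is smallest."""
    best = None
    for threshold in sorted(set(scores)):
        sens, spec = sensitivity_specificity(y_true, scores, threshold)
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, threshold, sens, spec)
    return best[1], best[2], best[3]


# Hypothetical labels and model scores for ten pages.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]

threshold, sens, spec = crossover_threshold(y_true, scores)
```

On real data the two curves rarely cross exactly at an observed score, so picking the smallest gap is one reasonable approximation of the intersection point.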
2.2.1 Address Flag
Table 1. Performance of Different Models on Train Data for
Address Flag (values in %)
Algorithm          AUC    Accuracy  Specificity  Sensitivity  Negative Prediction Rate  Positive Prediction Rate
Lasso              88.53  82.11     82.17        81.40        28.30                     98.08
Logit              88.72  83.53     80.10        83.83        29.98                     97.99
Random Forest      93.63  87.91     83.85        88.26        38.18                     98.44
Naïve Bayes        84.57  80.88     73.51        81.51        25.58                     97.27
C5.0 – Rule Based  90.97  87.13     79.84        87.76        36.06                     92.04
Comparing the five models that performed well on
the data, we can see that Random Forest with a Gini split and
C5.0 with a rule-based tree perform at similar levels,
with a minimal drop in AUC and a higher Negative Prediction
Rate than the other models. As stated earlier, AUC has been taken as the primary
evaluation metric, as the data is highly imbalanced.
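For readers less familiar with the table columns, the sketch below shows how each figure follows from the four cells of a confusion matrix. Reading "Positive Prediction Rate" and "Negative Prediction Rate" as positive and negative predictive value is an assumption, and the counts are hypothetical, chosen so that the class labelled positive is the common one. The source does not state which class it treats as positive.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Derive the table's metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "positive_prediction_rate": tp / (tp + fp),  # assumed to mean PPV
        "negative_prediction_rate": tn / (tn + fn),  # assumed to mean NPV
    }


# Hypothetical counts: 1000 pages of the common class, 100 of the rare one.
metrics = confusion_metrics(tp=900, fn=100, tn=80, fp=20)
```

Even with sensitivity at 90% and specificity at 80%, the negative prediction rate is only about 44% because the rare class is swamped by errors from the common class. This shows how a rate tied to the rare class can remain low under imbalance.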
With the business requirements and objectives in
consideration, the rule-based C5.0 is recommended for
identifying pages with Address Information, as the
performance of the model has been consistent in the training
and testing phases. The rulesets generated by the algorithm
can also be extracted and the underlying logic
implemented in any existing infrastructure, reducing the
cost of implementation and easing integration of the
model.
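As an illustration of why extracted rulesets are easy to integrate, the sketch below encodes a few rules as plain data and applies them to a URL. The rules are invented for this example and are not the rulesets produced in the study. The first-match evaluation order is also a simplification of how a rule-based model may combine its rules.

```python
from urllib.parse import urlparse

# Hypothetical rules: (keywords that must all appear in the URL path, flag).
ADDRESS_RULES = [
    (("contact",), 1),
    (("location",), 1),
    (("blog",), 0),
]
DEFAULT_FLAG = 0  # assumed fallback when no rule applies


def predict_address_flag(url):
    """Return 1 if the URL is predicted to hold address information."""
    path = urlparse(url).path.lower()
    for keywords, flag in ADDRESS_RULES:
        if all(keyword in path for keyword in keywords):
            return flag  # first matching rule wins
    return DEFAULT_FLAG


flag_contact = predict_address_flag("http://example.com/contact-us")
flag_blog = predict_address_flag("http://example.com/blog/post-1")
flag_home = predict_address_flag("http://example.com/")
```

Because the logic is a short list of conditions, it can be ported to whatever language the existing crawling infrastructure uses without shipping a model runtime.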
Table 2. Performance of Different Models on Test Data for
Address Flag (values in %)
Algorithm          AUC    Accuracy  Specificity  Sensitivity  Negative Prediction Rate  Positive Prediction Rate
Lasso              88.5   80.92     81.43        80.87        27.60                     97.98
Logit              89.91  82.8      81.43        82.92        29.92                     98.03
Random Forest      90.05  85.55     74.29        86.56        33.12                     97.41
Naïve Bayes        85.64  79.33     77.14        79.53        25.23                     97.49
C5.0 – Rule Based  89.03  86.02     77.86        86.76        34.49                     97.76
2.2.2 Executive Flag
Table 3. Performance of Different Models on Train Data for
Executive Flag (values in %)
Algorithm          AUC    Accuracy  Specificity  Sensitivity  Negative Prediction Rate  Positive Prediction Rate
Lasso              95.61  93.42     67.07        94.11        23.03                     99.01
Logit              95.94  87.83     89.96        87.77        16.02                     99.70
Random Forest      99.11  96.93     89.56        97.12        44.94                     99.72
C5.0 – Rule Based  99.94  99.33     98.39        99.36        80.07                     99.96
When the four models built to identify pages with executive
information are compared, it can be seen that all the
algorithms have a very high Accuracy and Positive
Prediction Rate, primarily due to the high class imbalance;
however, C5.0 and Random Forest have again outperformed
the other models in controlling misclassification. Due to the
business reasons previously mentioned, the researcher
would recommend the implementation of the rule-based C5.0
over the Random Forest, as the predictions are easier to
interpret and the logic is easier to integrate into business
applications.
Table 4. Performance of Different Models on Test Data for
Executive Flag (values in %)
Algorithm          AUC    Accuracy  Specificity  Sensitivity  Negative Prediction Rate  Positive Prediction Rate
Lasso              96.03  93.72     75.68        94.12        22.22                     99.43
Logit              96.66  88.02     86.49        88.06        13.85                     99.66
Random Forest      98.07  96.83     81.08        97.18        38.96                     99.57
C5.0 – Rule Based  98.67  98.41     81.08        98.80        60.00                     99.58
3. CONCLUSIONS
A prediction algorithm based on URL text alone is sufficient,
at the initial level, to reduce hit count and the rigor of text
analytics, as accuracy figures above 90% were obtained in
indicative tests.
Keywords are a significant determinant of the presence of
Executive Information in a webpage. The following
keywords were found to be highly significant:
“history”, “company”, “people”, “staff”, “Board”, “site”,
“about”, “team”.
Keywords are a significant determinant of the presence of
Business Address Information in a webpage. The following
keywords were found to be highly significant:
“collections”, “gallery”, “News”, “location”, “blog”, “Services”,
“Index”, “Product”, “contact”, “Home_page”.
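The keyword lists above suggest a simple way to turn a URL into model inputs: one 0/1 indicator per keyword. The sketch below is a minimal illustration of that idea. The source does not describe its feature construction, so the lower-casing, the use of path and query text only, and substring matching are all assumptions.

```python
from urllib.parse import urlparse

# Keywords reported in the text, lower-cased here for matching.
EXECUTIVE_KEYWORDS = ["history", "company", "people", "staff",
                      "board", "site", "about", "team"]
ADDRESS_KEYWORDS = ["collections", "gallery", "news", "location", "blog",
                    "services", "index", "product", "contact", "home_page"]


def keyword_features(url, keywords):
    """Map each keyword to 1 if it occurs in the URL path or query, else 0."""
    parsed = urlparse(url.lower())
    text = parsed.path + " " + parsed.query  # the domain name is ignored
    return {keyword: int(keyword in text) for keyword in keywords}


executive_row = keyword_features(
    "http://example.com/about-us/our-team", EXECUTIVE_KEYWORDS)
address_row = keyword_features(
    "http://example.com/Contact?ref=blog", ADDRESS_KEYWORDS)
```

Each resulting dictionary is one row of a data frame that a classifier such as Lasso or C5.0 could be trained on, without fetching the page itself.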
After the initial stage, an independent model can be built on
top of features derived from HTML bodies of websites to
further increase efficiency, but it is not highly recommended
if cost reduction is the primary criterion.
For predicting the presence of Business Address and Executive
Information in a webpage, Lasso, Logit (Logistic regression),
Random Forest and C5.0 – Rule Based were found to be highly
efficient. Any one of these, or an ensemble of two or
more of them, can be deployed to replace the existing
models.
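One simple way to combine two or more of these models is to average their predicted probabilities, often called soft voting. The source does not say which ensembling scheme it has in mind, so the sketch below is only one possible reading. The model names and probabilities are hypothetical.

```python
def soft_vote(probabilities_by_model):
    """Average per-page probabilities across models (equal weights)."""
    model_outputs = list(probabilities_by_model.values())
    n_models = len(model_outputs)
    n_pages = len(model_outputs[0])
    return [
        sum(output[i] for output in model_outputs) / n_models
        for i in range(n_pages)
    ]


# Hypothetical probabilities that each of three pages holds the target data.
model_probs = {
    "lasso": [0.20, 0.70, 0.90],
    "random_forest": [0.10, 0.50, 0.80],
    "c50_rules": [0.30, 0.60, 1.00],
}

ensemble_probs = soft_vote(model_probs)
```

The averaged scores can then be thresholded in the same way as a single model's output. Averaging gives up the easy interpretability that motivated the C5.0 recommendation.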
Hidden variables may be looked for in order to feed into the
recommended models to increase the accuracy figures in
real time scenarios.
The way forward for data service companies is to automate the
models built and to build Robotic Process Automation, which
would significantly reduce their operational losses.
The classification models discovered sets of keywords
displaying statistically significant bearing on the likelihood
of finding Executive information and Business Address in the
webpages examined. Deployment of such models at the stage
of URL scraping, which precedes actual web text scraping,
was found to be highly potent in increasing the strike
efficiency and hence the revenue of data service
companies.
The performance figures derived from the entire exercise
were satisfactory and the researcher was able to achieve
accuracy figures greater than the target of 80% with
multiple models.
REFERENCES
[1] V. Fernandez, R. Unanue, S. Herranz, A. Rubio, Naive
Bayes Web Page Classification with HTML Mark-Up
Enrichment, ICCGI 2006.
[2] V. Chandra, N. Saxena, An Improved Technique for Web
Page Classification in Respect of Domain Specific Search,
IJCA, 2014, Volume 102-4, 10.5120/17801-8615.
[3] A. Patil, B.V. Pawar, Automated Classification of Web
Sites using Naive Bayesian Algorithm, IMECS 2012,
Volume 1.
BIOGRAPHIES
Venkatesh Radhakrishnan is an
experienced Research Analyst and
Data Scientist with a demonstrated
history of working in the Research
& Analytics industry. With a
Masters in Business Analytics and
a Bachelors in Information Science,
he is passionate about Neural
Networks, Q-Learning Agents,
Automation, Explainer Models, and
Automated Machine Learning.

Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
 

Recently uploaded

Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZTE
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacingjaychoudhary37
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 

Recently uploaded (20)

🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacing
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 

IRJET- Multi-Faceted Approach to Automated Classification of Business Web Pages

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 06 | June -2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 1035

Multi-Faceted Approach to Automated Classification of Business Web Pages
Venkatesh R1
1RRD, Chennai, India

Abstract - Data Services companies routinely aggregate commercial business content automatically from entity websites on a large scale. A key product is the base information on new and upcoming businesses, which has potential value for business information databases and CRM products. A legal name, contact information, business activity and the names of the principal/leadership team are valuable basic information that can be aggregated. However, there is a cost involved in crawling all the pages to identify the right page where the four key information points are located. A business website has an average of 5 pages, and pinging and crawling each page involves IP cost at a granular level and virtual machine capacity at a high level. The efficiency of the current process and algorithms in use is not satisfactory and results in increased cost of operation. However, because the current algorithms are “Black Box”, companies find it hard to tweak them to better suit their needs and achieve higher accuracy figures. The research is aimed at building an effective model to classify pages, reduce the crawl volume and control running costs. The objective is to predict the most likely pages to pass on for extraction and the most unlikely pages for bypassing.

Key Words: Information Retrieval; Information Science; Data Mining; Supervised Learning; Webpage Classification

1. INTRODUCTION
Data Services companies routinely aggregate commercial business content automatically from entity websites on a large scale. A key product is the base information on new and upcoming businesses, which has potential value for business information databases and CRM products. Addresses and names of executives are valuable basic information that can be aggregated along with a multitude of other requirement-based data. Companies have proprietary methods to discover websites of new businesses and aggregate base information. However, there is a cost involved in crawling all the pages to identify the right page where the four key information points are located. A business website has an average of 5 pages, and pinging and crawling each page involves IP cost at a granular level and virtual machine capacity at a high level.

Amongst the multitude of steps orchestrated by this software, a prominent activity is to classify whether a page has relevant content or not. The platform channels all relevant pages to a downstream extraction process. Companies face a challenge with significant False Negatives in classification, which impacts revenues, as every business data point missed due to incorrect classification is a potential revenue loss. Companies are looking for alternative statistical models, built and deployed on top of their text mining and data collection engines, that could deliver better and more accurate results.

The objective of the research is to improve the current web page classification engine used at data services companies by building an effective predictive model to classify the pages, reduce the crawl volume and control running costs. The model will predict whether or not a web page contains executive information and a business address, thereby improving revenue potential and reducing operational costs via improved coverage.

1.1 Limitations
Devising strategies to build the model mainly from information in URL text was limited by the current capacity for data collection.
Hence, the data may not be exhaustive of all possible keywords that might appear in URL texts of webpages containing the target information.

Since the project uses supervised learning algorithms, the target variables required manual flagging, which limited the size of the training set available.

The logic and methodology of data collection is outside the scope of the project and hence has not been discussed in detail.

1.2 Challenges
There is no prior classification of the pages. The researcher had to examine each link and manually classify the training set, thereby increasing data preparation effort or limiting the size of the training set.

The page links provided were generated six months ago. Some of these links are likely to be obsolete or inactive.

Data for page characteristics has to be generated using automated means to manage time. Automation might not yield 100% of the data due to safeguards against scraping or inherent flaws of the page, limiting the size of the training set.

1.3 Machine Learning
The following Machine Learning algorithms were used for the final model, because the dataset contains a high class imbalance with respect to both the target variables:
• Lasso
• Logit
• Random Forest
• Naïve Bayes
• C5.0 – Rule Based

1.4 Model Evaluation & Validation
In machine learning, a classifier is evaluated using a confusion matrix. With imbalanced data, accuracy places more weight on the common classes than on the rare classes, so a classifier can score well while performing poorly on the rare classes. Accuracy therefore becomes a misleading indicator, which is why additional metrics are derived by combining sensitivity and specificity. The Area Under the ROC Curve (AUC) of the model is used as the evaluation metric, as the data is highly imbalanced and using accuracy as the sole measure of performance would paint a false picture.

1.5 Data Approach
The methodology follows a 4-stage process: gathering data from web pages, identifying characteristics and preparing a data frame to represent the characteristics, applying various classification techniques to classify the pages, and evaluating the models.

Fig -1: The 4-stage process described above

1.6 Literature Review
The paper “Naive Bayes Web Page Classification with HTML Mark-Up Enrichment” [1] concentrates only on the plain text of HTML documents, using representations based on frequencies and two weight functions that combine several criteria, including information from HTML mark-up. In Nidhi Saxena’s [2] paper “An Improved Technique for Web Page Classification in Respect of Domain Specific Search”, the researchers use WSD and POS, ignore metadata in webpages, and apply sophisticated Machine Learning models in order to reach high classification accuracy.
It focuses only on HTML content for web page classification. In the article “Automated Classification of Web Sites using Naïve Bayes Algorithm”, A. Patil [3] uniquely suggests that the home page of an HTML-based website can be a source of conclusive information for classifying web pages into very broad categories. Therefore, the modelling strategies have been designed to concentrate on extracting specific features to help with the same.

What distinguishes this research from the works described above is that the researcher has approached the problem with the goal of reducing the cost incurred as a result of web crawling and extensive text mining. Hence, our model greatly reduces the time and complexity involved in rigorous text mining and instead relies mainly on the information present in URL link texts to devise a strategy for predicting the presence of our target information in the pages. However, the researcher has also included one variable with features derived directly by text mining the HTML of the webpages, for certain complementary information based on domain experience, and examined its performance and significance. In addition, what separates our work from the existing body of work is our end goal of classifying webpages on the basis of a highly business-specific objective rather than making broad classifications.

2. Modelling
There is a high class bias, with only 914 webpages out of 11430 containing address information, i.e. about 7.99% of the webpages, and 286 webpages out of the 11430 containing executive information, i.e. about 2.5% of the webpages. As there were two different variables to predict, the problem was approached as a Multi-Label Imbalance Problem. The chances of success in fitting a single model onto a multi-label dataset were very slim.
The researcher therefore decided to build separate models for the two variables, predicting the chances of a page containing address information and executive information separately.

2.1 Resampling
As the dataset contains a high class imbalance with respect to both the target variables, various resampling techniques, such as Ovun over-sampling and SMOTE, were considered. However, resampled data would not be representative of the population and would give a probability distribution that is not related to the actual scenario. Hence, no resampling was done. The data was split into randomly sampled training and test datasets in the ratio 85:15, for the sake of thorough cross validation, containing 9715 and 1715 records respectively.
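As a minimal sketch of the set-up described in Sections 1.4 and 2.1, the Python snippet below performs a random 85:15 split without resampling and contrasts accuracy with AUC on an imbalanced label. It is not the author's code: the function names (`split_85_15`, `auc`), the fixed seed and the synthetic labels are assumptions made for illustration. Only the record count (11430), the number of address pages (914) and the split ratio come from the text.

```python
import random


def split_85_15(records, seed=0):
    """Randomly split records into 85% train / 15% test, with no resampling."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seed is an illustrative assumption
    cut = int(len(shuffled) * 0.85)
    return shuffled[:cut], shuffled[cut:]


def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    receives a higher score than a randomly chosen negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic labels with the address-flag prevalence reported in the text.
labels = [1] * 914 + [0] * (11430 - 914)
train, test = split_85_15(labels)
print(len(train), len(test))  # 9715 1715

# A "model" that always predicts the majority class looks accurate,
# but its constant score carries no ranking information (AUC of 0.5).
majority_accuracy = test.count(0) / len(test)
constant_auc = auc([0.0] * len(test), test)
print(round(majority_accuracy, 3), constant_auc)
```

The pairwise AUC computation is quadratic in the number of records, which is acceptable for a test set of this size; a library routine would normally be used for larger data.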
2.2 Class Predictions
The AUC of each model was computed, and the point at which the sensitivity and specificity curves intersect, plotted across cut-off values, was taken as the optimal cut-off for the models below.

2.2.1 Address Flag

Table 1. Performance of Different Models on Train Data for Address Flag

Algorithm          AUC    Accuracy  Specificity  Sensitivity  Negative Prediction Rate  Positive Prediction Rate
Lasso              88.53  82.11     82.17        81.40        28.30                     98.08
Logit              88.72  83.53     80.10        83.83        29.98                     97.99
Random Forest      93.63  87.91     83.85        88.26        38.18                     98.44
Naïve Bayes        84.57  80.88     73.51        81.51        25.58                     97.27
C5.0 – Rule Based  90.97  87.13     79.84        87.76        36.06                     92.04

Comparing the five models that performed best on the data, Random Forest with a Gini split and C5.0 with a rule-based tree perform at similar levels, with a minimal drop in AUC and a higher Negative Prediction Rate. As stated earlier, AUC has been taken as the primary evaluation metric, as the data is highly imbalanced. With the business requirements and objectives in consideration, the rule-based C5.0 is recommended for identifying pages with address information, as the performance of the model has been consistent in the training and testing phases. The rulesets generated by the algorithm can also be extracted, and the underlying logic can be implemented in any existing infrastructure to reduce the cost of implementation and ease integration of the model.

Table 2. Performance of Different Models on Test Data for Address Flag

Algorithm          AUC    Accuracy  Specificity  Sensitivity  Negative Prediction Rate  Positive Prediction Rate
Lasso              88.5   80.92     81.43        80.87        27.60                     97.98
Logit              89.91  82.8      81.43        82.92        29.92                     98.03
Random Forest      90.05  85.55     74.29        86.56        33.12                     97.41
Naïve Bayes        85.64  79.33     77.14        79.53        25.23                     97.49
C5.0 – Rule Based  89.03  86.02     77.86        86.76        34.49                     97.76

2.2.2 Executive Flag

Table 3. Performance of Different Models on Train Data for Executive Flag

Algorithm          AUC    Accuracy  Specificity  Sensitivity  Negative Prediction Rate  Positive Prediction Rate
Lasso              95.61  93.42     67.07        94.11        23.03                     99.01
Logit              95.94  87.83     89.96        87.77        16.02                     99.70
Random Forest      99.11  96.93     89.56        97.12        44.94                     99.72
C5.0 – Rule Based  99.94  99.33     98.39        99.36        80.07                     99.96

When the four models built to identify pages with executive information are compared, all the algorithms show very high Accuracy and Positive Prediction Rate, primarily due to the high class imbalance; however, C5.0 and Random Forest have again outperformed the other models in controlling misclassification. For the business reasons previously mentioned, the researcher recommends implementing the rule-based C5.0 over the Random Forest, as its predictions are easier to interpret and its logic is easier to integrate into business applications.

Table 4. Performance of Different Models on Test Data for Executive Flag

Algorithm          AUC    Accuracy  Specificity  Sensitivity  Negative Prediction Rate  Positive Prediction Rate
Lasso              96.03  93.72     75.68        94.12        22.22                     99.43
Logit              96.66  88.02     86.49        88.06        13.85                     99.66
Random Forest      98.07  96.83     81.08        97.18        38.96                     99.57
C5.0 – Rule Based  98.67  98.41     81.08        98.80        60.00                     99.58

3. CONCLUSIONS
A prediction algorithm based on URL text alone is sufficient, at the initial level, to reduce hit count and the rigor of text analytics, as accuracy figures above 90% were obtained in indicative tests.
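As a rough illustration of what a URL-text-only feature set might look like, the Python sketch below turns a page URL into binary keyword indicators using the keywords reported in this section. It is an assumption-laden sketch rather than the author's feature pipeline: the function name `url_keyword_features`, substring matching on the URL path and query, and the treatment of “Home_page” as a flag for an empty path are all hypothetical choices, and the paper does not state whether each keyword raises or lowers the likelihood of the target information.

```python
from urllib.parse import urlparse

# Keywords reported as significant in this section (lower-cased for matching).
EXECUTIVE_KEYWORDS = ["history", "company", "people", "staff",
                      "board", "site", "about", "team"]
ADDRESS_KEYWORDS = ["collections", "gallery", "news", "location", "blog",
                    "services", "index", "product", "contact"]


def url_keyword_features(url, keywords):
    """Return a dict of 0/1 indicators, one per keyword, computed from the
    path and query of the URL only (the page itself is never fetched)."""
    parsed = urlparse(url.lower())
    text = parsed.path + "?" + parsed.query
    features = {kw: int(kw in text) for kw in keywords}
    # Assumption: "Home_page" is a derived flag for the site root,
    # not a literal token that appears in the URL.
    features["home_page"] = int(parsed.path in ("", "/"))
    return features


example = url_keyword_features("https://example.com/About/Our-Team",
                               EXECUTIVE_KEYWORDS)
print(example["about"], example["team"], example["history"])  # 1 1 0
```

Indicator vectors of this kind could then be passed to any of the classifiers compared in Section 2.2; because they need only the link text, they can be computed before a page is crawled.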
Key words are a significant determinant of the presence of Executive Information in a webpage. The following keywords were found to be highly significant: “history”, “company”, “people”, “staff”, “Board”, “site”, “about”, “team”.

Key words are a significant determinant of the presence of Business Address Information in a webpage. The following keywords were found to be highly significant: “collections”, “gallery”, “News”, “location”, “blog”, “Services”, “Index”, “Product”, “contact”, “Home_page”.

After the initial stage, an independent model can be built on top of features derived from the HTML bodies of websites to further increase efficiency, but this is not highly recommended if cost reduction is the primary criterion.

For predicting the presence of Business Address and Executive Information in a webpage, Lasso, Logit (logistic regression), Random Forest and C5.0 – Rule Based were found to be highly efficient. Any one of these, or an ensemble of two or more of them, can be deployed to replace the existing models.

Hidden variables may be looked for and fed into the recommended models to increase the accuracy figures in real-time scenarios.

The way forward for data service companies is to automate the models built and build Robotic Process Automation, which would significantly reduce their operational losses.

The classification models discovered sets of keywords with a statistically significant bearing on the likelihood of finding Executive Information and Business Address in the webpages examined. Deployment of such models at the stage of URL scraping, which precedes actual web text scraping, was found to be highly potent in increasing the strike efficiency and hence the revenue of data service companies.

The performance figures derived from the entire exercise were satisfactory, and the researcher was able to achieve accuracy figures greater than the target of 80% with multiple models.

REFERENCES
[1] V. Fernandez; R. Unanue; S. Herranz; A. Rubio. Naive Bayes Web Page Classification with HTML Mark-Up Enrichment. ICCGI 2006.
[2] V. Chandra; N. Saxena. An Improved Technique for Web Page Classification in Respect of Domain Specific Search. IJCA, 2014, Volume 102-4, 10.5120/17801-8615.
[3] A. Patil; B.V. Pawar. Automated Classification of Web Sites using Naive Bayesian Algorithm. IMECS 2012, Volume 1.

BIOGRAPHIES
Venkatesh Radhakrishnan is an experienced Research Analyst and Data Scientist with a demonstrated history of working in the Research & Analytics industry. With a Masters in Business Analytics and a Bachelors in Information Science, he is passionate about Neural Networks, Q-Learning Agents, Automation, Explainer Models, and Automated Machine Learning.