International Journal of Innovative Research in Information Security (IJIRIS)
Issue 2, Volume 4 (April 2015)
ISSN: 2349-7017(O) | ISSN: 2349-7009(P)
www.ijiris.com
© 2014-15, IJIRIS - All Rights Reserved
Agent based Authentication for Deep Web Data Extraction
G.Muneeswari
Associate Professor
Department of Information Technology
SSN College of Engineering
Abstract— Deep web contents are accessed through queries submitted to web databases, and the returned data records are enwrapped in
dynamically generated web pages (hereafter called deep web pages). Extracting structured data from deep web pages is a
challenging problem due to the underlying intricate structures of such pages. Because web pages are a popular two-dimensional medium,
their contents are laid out regularly for users to browse. This motivates us to seek a different way to perform deep web data
extraction that overcomes the limitations of previous works by exploiting some common visual features of deep web
pages. In this paper, an agent-based authentication mechanism is proposed in which an agent program running on the server
authenticates the user. A novel vision-based approach that is web-page-programming-language-independent is also proposed. This
approach primarily utilizes the visual features of deep web pages to implement deep web data extraction, including data record
extraction and data item extraction. Our experiments on a large set of web databases show that the proposed vision-based
approach is highly effective for deep web data extraction and provides security for registered users.
Keywords— Authentication, vision-based approach, web data, agent program, web page.
I. INTRODUCTION
The World Wide Web contains more and more online web databases which can be searched through their web query interfaces.
The query results are enwrapped in dynamically generated web pages in the form of data records. These web pages are called
deep web pages and are hard to index by traditional crawler-based search engines. To ease human users' consumption of the
information retrieved from search engines, deep web pages (both data records and data items) are usually arranged with visual
regularity that matches human reading habits. We explore the visual regularity of data records and data items on web
pages and propose a novel vision-based approach, the Vision-based Data Extractor (ViDE), to extract results from deep web pages
automatically. It is primarily based on visual features (font, position, content, layout, and appearance) and also utilizes some
simple non-visual information such as data types and frequent symbols. ViDE consists of two main components: the Vision-based
Data Record Extractor (ViDRE) and the Vision-based Data Item Extractor (ViDIE).
A. Visual Features
Font
The fonts of text on a web page are a useful piece of visual information, determined by attributes such as size, face, and color.
Position Features (PF’s)
These features indicate the location of a data region on a deep web page.
PF1: Data regions are always centered horizontally.
PF2: The size of the data region is usually large relative to the area of the whole page.
Layout Features (LF’s)
These features indicate how the data records in the data region are typically arranged.
LF1: The data records are usually left-aligned in the data region.
LF2: All data records are adjoining.
LF3: Adjoining data records do not overlap.
Appearance Features (AF’s)
These features capture the visual features within the data records.
AF1: Data records are very similar in appearance, e.g., in the sizes of the images they contain and the fonts they use.
AF2: Data items of the same semantics in different data records have similar presentations.
AF3: Neighboring text data items of different semantics often use distinguishable fonts.
Content Features (CF’s)
These features hint at the regularity of contents in data records.
CF1: The first data item in each data record is always of a mandatory type.
CF2: The presentation of data items in data records follows a fixed order.
CF3: There are often some fixed static texts in data records.
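To make the geometric features concrete, the sketch below (in Python; all field names and tolerance values are illustrative assumptions, not part of ViDE) shows how a candidate data region could be tested against PF1-PF2 and LF1-LF3 using rendered bounding boxes:

```python
from dataclasses import dataclass

@dataclass
class Block:
    """Rendered bounding box of a page element (pixel coordinates)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

def satisfies_position_features(region: Block, page: Block,
                                center_tol: float = 20.0,
                                min_area_ratio: float = 0.4) -> bool:
    """PF1: region is horizontally centered; PF2: region area is large
    relative to the whole page (thresholds are illustrative)."""
    pf1 = abs((region.x + region.width / 2) - (page.x + page.width / 2)) <= center_tol
    pf2 = region.width * region.height >= min_area_ratio * page.width * page.height
    return pf1 and pf2

def satisfies_layout_features(records: list[Block], tol: float = 5.0) -> bool:
    """LF1: records share a left edge; LF2/LF3: consecutive records are
    adjoining without overlapping (each starts where the previous ends)."""
    lf1 = all(abs(r.x - records[0].x) <= tol for r in records)
    ordered = sorted(records, key=lambda r: r.y)
    lf23 = all(abs(a.y + a.height - b.y) <= tol
               for a, b in zip(ordered, ordered[1:]))
    return lf1 and lf23
```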
The problem of web data extraction has received a lot of attention, and most of the solutions proposed are based on
analyzing the HTML source code or the tag trees of web pages. For a web page that is vivid and colorfully designed
using complex JavaScript and CSS, it is difficult to infer the regularity of the page structure by analyzing the
tag structures alone. Hence, a novel vision-based approach becomes necessary. The overall objective of our project is to develop an approach
that is web-page-programming-language-independent for extracting data from dynamically generated web pages. A number of
approaches have been reported for extracting information from web pages.
These are classified, based on the degree of automation in data extraction, into manual, semiautomatic, and
automatic approaches. The earliest are manual approaches, in which languages were designed to assist programmers in constructing
wrappers to identify and extract all the desired data items/fields; examples include Minerva, TSIMMIS, and Web-OQL. Semiautomatic
techniques can be classified into sequence-based and tree-based; examples include SoftMealy and Stalker, W4F and XWrap. These
approaches require manual effort, for example, labeling some sample pages. Automatic approaches improve efficiency and
reduce manual effort; examples include Omini, IEPAD, RoadRunner, MDR, and DEPTA. However, these approaches do not generate wrappers,
i.e., they identify patterns and extract data from web pages directly without using previously derived extraction rules.
B. Information Extraction and Web Intelligence
The problem of information extraction is to transform the contents of input documents into structured data. Unlike
information retrieval, which concerns how to identify relevant documents from a collection, information extraction produces
structured data ready for post-processing, which is crucial to many applications of text mining. The IEPAD architecture, shown in Fig. 1,
attempts to eliminate the need for user-labeled training examples. The idea is based on the fact that data on a semi-structured web page
is often rendered in some particular alignment and order and thus exhibits regular and contiguous patterns. By discovering these
patterns in target web pages, an extractor can be generated. IEPAD has a pattern discovery algorithm that can be applied to any input
web page without training examples. This greatly reduces the cost of extractor construction. A huge number of extractors can now be
generated, and sophisticated web intelligence applications become practical.
Fig.1 IEPAD Architecture
C. The IEPAD Architecture
Web page information extraction has been implemented into a system called IEPAD, which includes three components:
 Pattern discoverer accepts an input web page and discovers potential patterns that contain the target data to be extracted.
 Rule generator contains a graphical user interface, called the pattern viewer, which shows the discovered patterns. Users can then
select the one that extracts the data of interest, and the rule generator will remember the pattern and save it as an extraction
rule for later use. Users can also use the pattern viewer to assign attribute names to the extracted data.
 Extractor extracts desired information from similar web pages based on the designated extraction rule.
II. LITERATURE SURVEY
A. Automatic Information Extraction from Semi-Structured Web Pages By Pattern Discovery
In this paper, an unsupervised approach to semi-structured information extraction is presented. Its key features
are as follows. First, by applying the PAT-tree pattern discovery algorithm, the discovery of record boundaries can be completed
automatically without user-labeled training examples. Second, by applying the multiple alignment algorithm, discovered patterns can
be generalized over unseen pages. Third, the extraction rule is pattern-based instead of delimiter-based. Finally, by allowing
alternative expressions in the extraction rules, IEPAD can handle exceptions such as missing attributes, multiple attribute values, and
variant attribute permutations. Compared to previous work, IEPAD performs equally well in terms of extraction accuracy but
requires much less human intervention to produce an extractor. In terms of tolerance of layout irregularity, the extraction rules
generated by IEPAD allow alternative tokens and hence can tolerate exceptions and variance such as missing attributes in the input.
Multi-level alignment can also be applied to extract finer information by extracting attributes from a discovered data record.
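IEPAD's actual discovery step uses a PAT tree; as a much simplified illustration of the same idea, the hedged sketch below finds repeated token n-grams in a tag-encoded page (the token encoding and thresholds are assumptions for illustration, not IEPAD's algorithm):

```python
from collections import Counter

def encode_tokens(tokens):
    """Abstract concrete text away so only the structural tags remain,
    e.g. ['<b>', 'Title', '</b>'] -> ['<b>', 'TEXT', '</b>']."""
    return [t if t.startswith('<') else 'TEXT' for t in tokens]

def repeated_patterns(tokens, min_len=2, min_count=2):
    """Collect token n-grams that occur at least min_count times; a crude
    stand-in for PAT-tree maximal-repeat discovery."""
    found = {}
    for n in range(min_len, len(tokens) // 2 + 1):
        grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        found.update({g: c for g, c in grams.items() if c >= min_count})
    return found

page = encode_tokens(['<li>', '<b>', 'Item A', '</b>', '</li>',
                      '<li>', '<b>', 'Item B', '</b>', '</li>'])
patterns = repeated_patterns(page)
# The record template ('<li>', '<b>', 'TEXT', '</b>', '</li>') occurs twice,
# along with its shorter sub-patterns.
print(patterns[('<li>', '<b>', 'TEXT', '</b>', '</li>')])  # 2
```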
B. Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages
In this paper, a method to automatically generate wrappers for extracting search result records from all dynamic sections on
result pages returned by search engines is presented. This method has the following novel features: (1) It aims to explicitly identify all
dynamic sections, including those that are not seen on sample result pages used to generate the wrapper. (2) It addresses the issue of
correctly differentiating sections and records. Experimental results indicate that this method is very promising. Automatic search
result record extraction is critical for applications that need to interact with search engines, such as the automatic construction and
maintenance of meta-search engines and deep web crawling. In this paper, an algorithm (MSE) is presented to tackle the problem of
automatically extracting dynamic sections as well as the search result records within those sections. The solution can potentially be very
useful for all web applications that need to interact with web-based search systems, including regular search engines and e-commerce
search engines. By being able to extract search result records from all dynamic sections while maintaining the section-record
relationships, MSE allows an application to select the desired sections for data extraction.
C. VIPS: a Vision-based Page Segmentation Algorithm
A large number of techniques have been proposed to address this problem, but all of them have inherent limitations because
they are web-page-programming-language dependent. In this paper, a new approach for extracting web content structure based on
visual representation was proposed. The produced web content structure is very helpful for applications such as web adaptation,
information retrieval, and information extraction. By identifying the logical relationships of web content based on visual layout
information, the web content structure can effectively represent the semantic structure of the web page. It simulates how a user
understands the layout structure of a web page based on its visual representation. Compared with traditional DOM-based segmentation
methods, this scheme utilizes useful visual cues to obtain a better partition of a page at the semantic level. It is also independent of
physical realization and works well even when the physical structure is far different from the visual presentation.
III. VISION BASED APPROACH
We propose a novel automatic vision-based approach, ViDE, to extract data from deep web pages. It utilizes the visual
features of a web page and is hence web-page-programming-language-independent, or HTML-independent. The information on
web pages consists of both text and images, and our approach provides an efficient mechanism to extract and cluster that
information for the user's convenience.
A. Vision-based Data Extractor
Our proposed system, depicted in Fig. 2, is a vision-based data extractor tool. Given an input URL, the local web browser uses
the crawler mechanism to access the global web and fetch the requested web page. The contents of the web page, such as images,
tables, headers, emails, and phone numbers, are then extracted using the data extractor tool. The clustering process then utilizes the
visual appearance of the web page to group semantically related content together and display it in the form of an HTML file.
Fig 2. Vision-based System Architecture (pipeline: Input URL → Crawler → Data Extractor → HTML File)
1) Crawler
A crawler is a mechanism that methodically scans, or crawls, through internet pages to create an index of the data it is looking for.
Hence, given a URL as input, our tool uses this mechanism to fetch and display the corresponding web page.
The process is as follows. A user enters the URL of the web page he/she wants to extract in the local browser. The local browser then
contacts the World Wide Web to identify and fetch the web page requested by the user. The local browser displays the fetched web
page in the display panel.
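A minimal sketch of this fetch step, using only Python's standard library (the URL and user-agent string are hypothetical):

```python
import urllib.request

def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Fetch the raw HTML of the requested page, as the crawler step
    does before extraction begins."""
    req = urllib.request.Request(url, headers={'User-Agent': 'vide-crawler/0.1'})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or 'utf-8'
        return resp.read().decode(charset, errors='replace')

html = fetch_page('http://example.com/results?q=books')  # hypothetical URL
```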
2) Extractor
First, the fetched web page is given as input to the extractor, which uses regular expressions to retrieve the contents from the
web page and save them in local file storage. The extraction process is as follows. Our web data extractor tool extracts different types
of data from the acquired web page, such as images, tables, headers, content, URLs traversed, phone numbers, emails, and fax
information. When the extraction process starts, regular expressions are constructed to extract the different types of data
mentioned above. These regular expressions are matched against the fetched web page to extract the required information. Once the
extraction process completes, the extracted information is stored as individual files in local file storage.
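The paper does not publish its regular expressions, so the following sketch uses illustrative patterns to show the shape of such an extractor; production patterns would need to be considerably more robust:

```python
import os
import re

# Illustrative patterns only; the paper's actual expressions are not given.
PATTERNS = {
    'emails': re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+'),
    'phones': re.compile(r'\+?\d[\d\s()./-]{7,}\d'),
    'images': re.compile(r'<img[^>]+src=["\']([^"\']+)["\']', re.IGNORECASE),
}

def extract(html: str) -> dict:
    """Match every category's pattern against the fetched page."""
    return {name: pat.findall(html) for name, pat in PATTERNS.items()}

def save_to_files(results: dict, outdir: str = 'extracted') -> None:
    """Store each category as an individual file, mirroring the tool's
    local file storage step."""
    os.makedirs(outdir, exist_ok=True)
    for name, items in results.items():
        with open(os.path.join(outdir, name + '.txt'), 'w', encoding='utf-8') as f:
            f.write('\n'.join(items))
```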
3) Clustering
Clustering involves grouping the content extracted from the web pages in the previous step using the web data extractor tool.
Clustering is performed by utilizing the visual appearance of the web page. The relevant contents are grouped together based on
visual features such as image size, image position, and font size. The clustering process is as follows. Images are first clustered based
on their size and position; as a result, images of identical size are retained and the remaining images are labeled as noise
images. The noise images are then removed. Tables and other contents are grouped by utilizing the font features of the text displayed
on the web page. Hence, at the end of the clustering process, the relevant content extracted from the web pages is stored in the form of an
HTML file.
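A minimal sketch of the image-clustering step as described, grouping images by identical (width, height) and treating singleton sizes as noise (the input record format is an assumption):

```python
from collections import defaultdict

def cluster_images(images):
    """Group images by identical (width, height); sizes that occur only
    once are treated as noise and dropped. Each image is assumed to be a
    dict like {'src': ..., 'width': ..., 'height': ...}."""
    groups = defaultdict(list)
    for img in images:
        groups[(img['width'], img['height'])].append(img)
    clusters = {size: imgs for size, imgs in groups.items() if len(imgs) > 1}
    noise = [img for imgs in groups.values() if len(imgs) == 1 for img in imgs]
    return clusters, noise
```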
4) Display
The display process utilizes a search engine mechanism which displays results according to the submitted search query. The
display process is as follows. The HTML file containing the clustered content is obtained for various web sites. The
individual HTML files are placed in local file storage, and the relevant pages are displayed when a search query is submitted.
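A minimal sketch of this display step as a substring search over the stored HTML files (the directory name and matching rule are assumptions):

```python
import pathlib

def search_local_files(query: str, storage_dir: str = 'clusters') -> list:
    """Return the stored HTML files whose content mentions the query,
    mimicking the display step's search over local file storage."""
    hits = []
    for path in pathlib.Path(storage_dir).glob('*.html'):
        text = path.read_text(encoding='utf-8', errors='ignore')
        if query.lower() in text.lower():
            hits.append(str(path))
    return hits
```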
IV. AGENT BASED AUTHENTICATION
A. Agent Server for Authentication
Whenever a client wants to search for a web page, it first has to send a query to the agent server. Prior to that, every client has
to register with the agent server, which records all client details in its internal database. Whenever a query is
received, the encrypted username and password are verified; only authenticated clients can access the
server or send queries to the respective servers.
Fig 3. Agent Server implementation (Client ↔ Agent Server ↔ Web Server, with authentication at the agent server)
The agent server acts as a proxy for 'n' clients and servers on the web. The agent server implements a linked list
for storing the username and password in encrypted form.
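The sketch below follows that description with an explicit singly linked list, but substitutes salted hashing for the paper's unspecified encryption scheme (a labeled assumption); all names are illustrative:

```python
import hashlib
import os

class Node:
    """Singly linked list node holding one registered client."""
    def __init__(self, username, salt, digest, nxt=None):
        self.username, self.salt, self.digest, self.next = username, salt, digest, nxt

class AgentServer:
    """Registers clients and verifies their credentials before any query
    is forwarded to a web server."""
    def __init__(self):
        self.head = None

    def register(self, username: str, password: str) -> None:
        # Salted hash stands in for the paper's unspecified encryption.
        salt = os.urandom(16)
        digest = hashlib.sha256(salt + password.encode()).hexdigest()
        self.head = Node(username, salt, digest, self.head)  # prepend

    def authenticate(self, username: str, password: str) -> bool:
        node = self.head
        while node is not None:
            if node.username == username:
                attempt = hashlib.sha256(node.salt + password.encode()).hexdigest()
                return attempt == node.digest
            node = node.next
        return False

server = AgentServer()
server.register('alice', 's3cret')
assert server.authenticate('alice', 's3cret')
assert not server.authenticate('alice', 'wrong')
```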
B. Victim/Detection Model for Authentication
The victim model in our general framework consists of multiple back-end servers, which can be web/application servers,
database servers, or distributed file systems. We do not take classic multitier web servers as the model, since our detection scheme
is deployed directly on the victim tier and identifies attacks targeting that same tier; thus, multitier attacks would have to be
separated into several classes to utilize this detection scheme. The victim model along with front-end proxies is shown in Fig. 3. We
assume that all the back-end servers provide multiple types of application services to clients using the HTTP/1.1 protocol over TCP
connections. Each back-end server is assumed to have the same amount of resources.
Moreover, the application services to clients are provided by K virtual private servers (K is an input parameter), which are
embedded in the physical back-end server machine and operate in parallel. Each virtual server is assigned an equal amount of
static service resources, e.g., CPU, storage, memory, and network bandwidth. The operation of any virtual server does not affect the
other virtual servers on the same physical machine. The reasons for utilizing virtual servers are twofold: first, each virtual server can
reboot independently and is thus suited to recovery from possible fatal destruction; second, the state-transfer overhead of moving
clients among different virtual servers is much smaller than that of moving them among physical server machines. As soon as client
requests arrive at the front-end proxy, they are distributed to multiple back-end servers for load balancing, whether session-sticky
or not. Notice that our detection scheme sits behind this front-end tier, so the load-balancing mechanism is orthogonal to our setting. On
being accepted by one physical server, a request is first validated against the list of all identified attacker IDs (the blacklist).
If it passes this authentication, it is distributed to one of the virtual servers within the machine by means of a virtual switch. This
distribution depends on the testing matrix generated by the detection algorithm. By periodically monitoring the average response time
to service requests and comparing it with specific thresholds fetched from a legitimate profile, each virtual server is assigned a
"negative" or "positive" outcome. This victim model is deployed on the agent server for authentication purposes. Once clients are
authenticated, the web pages can be accessed easily with the proposed methodology.
Fig.3. Victim/detection model
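A minimal sketch of the per-virtual-server monitoring rule just described: compare the average response time over a monitoring window against a threshold taken from the legitimate profile (the threshold value and units are assumptions):

```python
def virtual_server_outcome(response_times_ms, legit_threshold_ms):
    """Label a virtual server 'positive' (suspect) when its average
    response time over the monitoring period exceeds the threshold
    fetched from the legitimate profile, else 'negative'."""
    if not response_times_ms:
        return 'negative'
    avg = sum(response_times_ms) / len(response_times_ms)
    return 'positive' if avg > legit_threshold_ms else 'negative'

# Hypothetical monitoring window; 120 ms threshold from the legitimate profile.
print(virtual_server_outcome([95.0, 110.0, 180.0, 240.0], 120.0))  # 'positive'
```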
V. EVALUATION RESULTS
The performance snapshot for all 10 clients at a given instance is shown in Table 1. Any misuse detected in any of the
parameters increases or decreases the performance parameters. For example, in node 'n1' a hacker has been able to successfully
authenticate himself and enter the system; user feedback is poor on the compromised system, and hence its authentication value is set to '0'.
For the first run, as a sample, we have taken 10 clients. The implementation is done with MATLAB, and the FLAME tool is used for agent
simulation on the server. Various performance factors such as confidentiality, security, privacy, non-repudiation, authentication, and
authorization are taken into account, and the following table is formulated for this purpose.
Table 1. Authentication and Other Related Performance Factors

First Run | Confidentiality | Security | Privacy | Non-repudiation | Authentication | Authorization
n0        | 0.625           | 0.625    | 0.5     | 0.375           | 0.75           | 0.375
n1        | 0.75            | 0.5      | 0.625   | 0.5             | 0              | 0.125
n2        | 0.625           | 0.875    | 0.5     | 0.375           | 0.5            | 0.625
n3        | 1               | 0.25     | 0.75    | 0.75            | 0.5            | 0.25
n4        | 0.125           | 0.25     | 0.5     | 0.625           | 0.875          | 0
n5        | 0.75            | 0.125    | 1       | 0.75            | 0.5            | 0.75
n6        | 0.25            | 1        | 0.125   | 1               | 0.375          | 0.25
n7        | 0.125           | 0.625    | 0       | 1               | 0.75           | 0.375
n8        | 0.625           | 0.5      | 0       | 0.125           | 0.75           | 0.75
n9        | 0.5             | 0.625    | 0       | 0.125           | 0.5            | 0.375
Fig.4. Performance Factors Comparison
The performance factors generated by MATLAB are shown in Fig. 4. From this observation, we can ensure that only authenticated
clients access the server and that deep web page data extraction is carried out by our proposed method.
A. Browser Page
The browser page is the first window in the web data extraction process. The web page URL is given as input to the agent
server. All the servers where the original data is available are registered with the agent server. Whenever a client requests
a particular website, the request is immediately forwarded to the agent server. The main responsibility of the agent server is to
map the requested website against its own database. If a match is found, the agent server issues a valid key to the
client, which is the count of the number of accesses to that website. It also sends the closely related authenticated websites to
the client. Figs. 5 and 6 show the browser page and the displayed web page; the browser page uses the crawler mechanism to fetch and
display the requested page.
Fig 5. Browser page Fig 6. Web page display
B. Web Data Extractor
The next main window, shown in Figs. 7 to 10, is the Web Data Extractor tool, which uses regular expressions to validate and
extract the data from the fetched web page.
Fig 7. Web data extractor Fig 8. Extracting images
Fig 9. Saving images Fig 10.Saved images in folder
C. Clustering
The next window is the clustering window, which groups the extracted content of the web pages based on visual cues.
Here, the final HTML table containing the structured view of the web page is created and stored.
Fig 11. Build file
Fig 12. Cluster file
D. Display
The final window (Fig. 11 and Fig. 12) in the extraction process is a search engine which traverses the saved HTML files and
fetches the content when a search query is entered.
VI. CONCLUSION AND FUTURE ENHANCEMENTS
Several approaches have previously been proposed for extracting data from web pages. Those approaches
had drawbacks which are overcome by the vision-based approach. However, extracting data from web pages
based entirely on visual features is beyond the scope of this project. Hence, we have proposed a system which utilizes the visual
appearance to cluster the data records extracted from semi-structured web pages. The variation we have proposed paves
the way toward extracting web page contents purely based on the visual regularity of the web pages. Our web extraction tool enables
non-expert users to view the structured content of web pages in an efficient manner. Once the extraction and clustering
processes are completed, the user has a complete structured view of the contents of the previously semi-structured web page.
Thus our system converts the unstructured information in web pages into a structured format, which eases the user's consumption
of the web data. Our system limits the utilization of visual features to the clustering process; in the future, the same can be
extended to extracting the contents directly from the web page. A separate search query mechanism is implemented to retrieve the
collected information from the web pages; in the future, a search engine may be embedded in the same web extractor tool to improve
performance. We have also provided an adaptive agent-based authentication mechanism for security.
REFERENCES
[1] G.O. Arocena and A.O. Mendelzon, “WebOQL: Restructuring Documents, Databases, and Webs,” Proc. Int’l Conf. Data Eng.
(ICDE), pp. 24-33, 1998.
[2] D. Buttler, L. Liu, and C. Pu, “A Fully Automated Object Extraction System for the World Wide Web,” Proc. Int’l Conf.
Distributed Computing Systems (ICDCS), pp. 361-370, 2001.
[3] D. Cai, X. He, J.-R. Wen, and W.-Y. Ma, “Block-Level Link Analysis,” Proc. SIGIR, pp. 440-447, 2004.
[4] D. Cai, S. Yu, J. Wen, and W. Ma, “Extracting Content Structure for Web Pages Based on Visual Representation,” Proc. Asia
Pacific Web Conf. (APWeb), pp. 406-417, 2003.
[5] C.-H. Chang, M. Kayed, M.R. Girgis, and K.F. Shaalan, “A Survey of Web Information Extraction Systems,” IEEE Trans.
Knowledge and Data Eng., vol. 18, no. 10, pp. 1411-1428, Oct. 2006.
[6] C.-H. Chang, C.-N. Hsu, and S.-C. Lui, “Automatic Information Extraction from Semi-Structured Web Pages by Pattern
Discovery,” Decision Support Systems, vol. 35, no. 1, pp. 129-147, 2003.
[7] V. Crescenzi and G. Mecca, “Grammars Have Exceptions,” Information Systems, vol. 23, no. 8, pp. 539-565, 1998.
[8] V. Crescenzi, G. Mecca, and P. Merialdo, “RoadRunner: Towards Automatic Data Extraction from Large Web Sites,” Proc. Int’l
Conf. Very Large Data Bases (VLDB), pp. 109-118, 2001.
[9] D.W. Embley, Y.S. Jiang, and Y.-K. Ng, “Record-Boundary Discovery in Web Documents,” Proc. ACM SIGMOD, pp. 467-478,
1999.
[10] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krüpl, and B. Pollak, “Towards Domain Independent Information Extraction from
Web Tables,” Proc. Int’l World Wide Web Conf. (WWW), pp. 71-80, 2007.
[11] J. Hammer, J. McHugh, and H. Garcia-Molina, “Semistructured Data: The TSIMMIS Experience,” Proc. East-European
Workshop Advances in Databases and Information Systems (ADBIS), pp. 1-8, 1997.
[12] C.-N. Hsu and M.-T. Dung, “Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web,”
Information Systems, vol. 23, no. 8, pp. 521-538, 1998.
[13] http://daisen.cc.kyushu-u.ac.jp/TBDW/, 2009.
[14] http://www.w3.org/html/wg/html5/, 2009.
[15] N. Kushmerick, “Wrapper Induction: Efficiency and Expressiveness,” Artificial Intelligence, vol. 118, nos. 1/2, pp. 15-68, 2000.
[16] A. Laender, B. Ribeiro-Neto, A. da Silva, and J. Teixeira, “A Brief Survey of Web Data Extraction Tools,” SIGMOD Record,
vol. 31, no. 2, pp. 84-93, 2002.
[17] B. Liu, R.L. Grossman, and Y. Zhai, “Mining Data Records in Web Pages,” Proc. Int’l Conf. Knowledge Discovery and Data
Mining (KDD), pp. 601-606, 2003.
[18] W. Liu, X. Meng, and W. Meng, “Vision-Based Web Data Records Extraction,” Proc. Int’l Workshop Web and Databases
(WebDB ’06), pp. 20-25, June 2006.
[19] L. Liu, C. Pu, and W. Han, “XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources,” Proc.
Int’l Conf. Data Eng. (ICDE), pp. 611-621, 2000.
[20] Y. Lu, H. He, H. Zhao, W. Meng, and C.T. Yu, “Annotating Structured Data of the Deep Web,” Proc. Int’l Conf. Data Eng.
(ICDE), pp. 376-385, 2007.
[21] J. Madhavan, S.R. Jeffery, S. Cohen, X.L. Dong, D. Ko, C. Yu, and A. Halevy, “Web-Scale Data Integration: You Can Only
Afford to Pay As You Go,” Proc. Conf. Innovative Data Systems Research (CIDR), pp. 342-350, 2007.
[22] I. Muslea, S. Minton, and C.A. Knoblock, “Hierarchical Wrapper Induction for Semi-Structured Information Sources,”
Autonomous Agents and Multi-Agent Systems, vol. 4, nos. 1/2, pp. 93-114, 2001.
[23] Z. Nie, J.-R. Wen, and W.-Y. Ma, “Object-Level Vertical Search,” Proc. Conf. Innovative Data Systems Research (CIDR), pp.
235-246, 2007.
[24] A. Sahuguet and F. Azavant, “Building Intelligent Web Applications Using Lightweight Wrappers,” Data and Knowledge Eng.,
vol. 36, no. 3, pp. 283-316, 2001.
[25] K. Simon and G. Lausen, “ViPER: Augmenting Automatic Information Extraction with Visual Perceptions,” Proc. Conf.
Information and Knowledge Management (CIKM), pp. 381-388, 2005.
[26] R. Song, H. Liu, J.-R. Wen, and W.-Y. Ma, “Learning Block Importance Models for Web Pages,” Proc. Int’l World Wide Web
Conf. (WWW), pp. 203-211, 2004.
[27] J. Wang and F.H. Lochovsky, “Data Extraction and Label Assignment for Web Databases,” Proc. Int’l World Wide Web Conf.
(WWW), pp. 187-196, 2003.
[28] X. Xie, G. Miao, R. Song, J.-R. Wen, and W.-Y. Ma, “Efficient Browsing of Web Search Results on Mobile Devices Based on
Block Importance Model,” Proc. IEEE Int’l Conf. Pervasive Computing and Comm. (PerCom), pp. 17-26, 2005.
[29] Y. Zhai and B. Liu, “Web Data Extraction Based on Partial Tree Alignment,” Proc. Int’l World Wide Web Conf. (WWW), pp.
76-85, 2005.
[30] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C.T. Yu, “Fully Automatic Wrapper Generation for Search Engines,” Proc. Int’l
World Wide Web Conf. (WWW), pp. 66-75, 2005.
[31] H. Zhao, W. Meng, and C.T. Yu, “Automatic Extraction of Dynamic Record Sections from Search Engine Result Pages,” Proc.
Int’l Conf. Very Large Data Bases (VLDB), pp. 989-1000, 2006.
[32] J. Zhu, Z. Nie, J. Wen, B. Zhang, and W. Ma, “Simultaneous Record Detection and Attribute Labeling in Web Data Extraction,”
Proc. Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 494-503, 2006.

More Related Content

Similar to Agent based Authentication for Deep Web Data Extraction

Search Engine Scrapper
Search Engine ScrapperSearch Engine Scrapper
Search Engine ScrapperIRJET Journal
 
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLSSTRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLSAM Publications
 
The Data Records Extraction from Web Pages
The Data Records Extraction from Web PagesThe Data Records Extraction from Web Pages
The Data Records Extraction from Web Pagesijtsrd
 
Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...IRJET Journal
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...IOSR Journals
 
Vision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsVision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsIJMER
 
Web Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features ConceptWeb Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features Conceptijceronline
 
IRJET-Computational model for the processing of documents and support to the ...
IRJET-Computational model for the processing of documents and support to the ...IRJET-Computational model for the processing of documents and support to the ...
IRJET-Computational model for the processing of documents and support to the ...IRJET Journal
 
IRJET- Enhancing Prediction of User Behavior on the Basic of Web Logs
IRJET- Enhancing Prediction of User Behavior on the Basic of Web LogsIRJET- Enhancing Prediction of User Behavior on the Basic of Web Logs
IRJET- Enhancing Prediction of User Behavior on the Basic of Web LogsIRJET Journal
 
Graphical Password Authentication using image Segmentation for Web Based Appl...
Graphical Password Authentication using image Segmentation for Web Based Appl...Graphical Password Authentication using image Segmentation for Web Based Appl...
Graphical Password Authentication using image Segmentation for Web Based Appl...ijtsrd
 
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING H...
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING H...DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING H...
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING H...ijnlc
 
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING ...
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM  FOR E-COMMERCE WEBSITES USERS USING ...DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM  FOR E-COMMERCE WEBSITES USERS USING ...
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING ...kevig
 
IRJET- Noisy Content Detection on Web Data using Machine Learning
IRJET- Noisy Content Detection on Web Data using Machine LearningIRJET- Noisy Content Detection on Web Data using Machine Learning
IRJET- Noisy Content Detection on Web Data using Machine LearningIRJET Journal
 
IRJET- Detection and Recognition of Hypertexts in Imagery using Text Reco...
IRJET-  	  Detection and Recognition of Hypertexts in Imagery using Text Reco...IRJET-  	  Detection and Recognition of Hypertexts in Imagery using Text Reco...
IRJET- Detection and Recognition of Hypertexts in Imagery using Text Reco...IRJET Journal
 
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningA Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningIJMER
 
A Web Extraction Using Soft Algorithm for Trinity Structure
A Web Extraction Using Soft Algorithm for Trinity StructureA Web Extraction Using Soft Algorithm for Trinity Structure
A Web Extraction Using Soft Algorithm for Trinity Structureiosrjce
 

Similar to Agent based Authentication for Deep Web Data Extraction (20)

Search Engine Scrapper
Search Engine ScrapperSearch Engine Scrapper
Search Engine Scrapper
 
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLSSTRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
 
The Data Records Extraction from Web Pages
The Data Records Extraction from Web PagesThe Data Records Extraction from Web Pages
The Data Records Extraction from Web Pages
 
Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
 
Vision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsVision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result Records
 
Web Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features ConceptWeb Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features Concept
 
Pf3426712675
Pf3426712675Pf3426712675
Pf3426712675
 
IRJET-Computational model for the processing of documents and support to the ...
IRJET-Computational model for the processing of documents and support to the ...IRJET-Computational model for the processing of documents and support to the ...
IRJET-Computational model for the processing of documents and support to the ...
 
IRJET- Enhancing Prediction of User Behavior on the Basic of Web Logs
IRJET- Enhancing Prediction of User Behavior on the Basic of Web LogsIRJET- Enhancing Prediction of User Behavior on the Basic of Web Logs
IRJET- Enhancing Prediction of User Behavior on the Basic of Web Logs
 
H017554148
H017554148H017554148
H017554148
 
Graphical Password Authentication using image Segmentation for Web Based Appl...
Graphical Password Authentication using image Segmentation for Web Based Appl...Graphical Password Authentication using image Segmentation for Web Based Appl...
Graphical Password Authentication using image Segmentation for Web Based Appl...
 
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING H...
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING H...DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING H...
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING H...
 
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING ...
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM  FOR E-COMMERCE WEBSITES USERS USING ...DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM  FOR E-COMMERCE WEBSITES USERS USING ...
DEVELOPING PRODUCTS UPDATE-ALERT SYSTEM FOR E-COMMERCE WEBSITES USERS USING ...
 
IRJET- Noisy Content Detection on Web Data using Machine Learning
IRJET- Noisy Content Detection on Web Data using Machine LearningIRJET- Noisy Content Detection on Web Data using Machine Learning
IRJET- Noisy Content Detection on Web Data using Machine Learning
 
IRJET- Detection and Recognition of Hypertexts in Imagery using Text Reco...
IRJET-  	  Detection and Recognition of Hypertexts in Imagery using Text Reco...IRJET-  	  Detection and Recognition of Hypertexts in Imagery using Text Reco...
IRJET- Detection and Recognition of Hypertexts in Imagery using Text Reco...
 
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningA Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
 
Implementation ofWeb Application for Disease Prediction Using AI
Implementation ofWeb Application for Disease Prediction Using AIImplementation ofWeb Application for Disease Prediction Using AI
Implementation ofWeb Application for Disease Prediction Using AI
 
G017334248
G017334248G017334248
G017334248
 
A Web Extraction Using Soft Algorithm for Trinity Structure
A Web Extraction Using Soft Algorithm for Trinity StructureA Web Extraction Using Soft Algorithm for Trinity Structure
A Web Extraction Using Soft Algorithm for Trinity Structure
 

Recently uploaded

Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 

Recently uploaded (20)

Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 

Agent based Authentication for Deep Web Data Extraction

  • 1. International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Issue 2, Volume 4 (April 2015) ISSN: 2349-7009(P) www.ijiris.com ____________________________________________________________________________________________________________ © 2014-15, IJIRIS- All Rights Reserved Page -44 Agent based Authentication for Deep Web Data Extraction G.Muneeswari Associate Professor Department of Information Technology SSN College of Engineering Abstract— Deep web contents are accessed by queries submitted to web databases and the returned data records are enwrapped in dynamically generated web pages (they will be called deep Web pages). Extracting structured data from deep web pages is a challenging problem due to the underlying intricate structures of such pages. As the popular two-dimensional media, the contents on web pages are always displayed regularly for users to browse. This motivates us to seek a different way for deep web data extraction to overcome the limitations of previous works by utilizing some interesting common visual features on the deep web pages. In this paper, an agent based authentication mechanism is proposed which uses an agent program runs on the server to authenticate the user. A novel vision-based approach that is web page programming language-independent is also proposed. This approach primarily utilizes the visual features on the deep web pages to implement deep web data extraction, including data record extraction and data item extraction. Our experiments on a large set of web databases show that the proposed vision-based approach is highly effective for deep web data extraction and provides security for the registered user. Keywords— Authentication, vision based approach, web data, agent program, web page. I. INTRODUCTION The World Wide Web has more and more online web databases which can be searched through their web query interfaces. The query results are enwrapped in web pages in the form of data records and are generated dynamically. These web pages are called deep web pages and are hard to index by the traditional crawler-based search engines. To ease the human users consumption of information retrieved from search engines, always the deep web pages [data records and data items] are arranged with the visual regularity to meet the reading habits of human beings. We explore the visual regularity of data records and data items on the web pages and propose a novel vision-based approach, Vision-based Data Extractor (ViDE) to extract results from deep web pages automatically. It is primarily based on visual features (font, position, content, layout and appearance features) and also utilizes some simple non visual information such as data types and frequent symbols. ViDE consists of two main components such as Vision-based Data Record Extractor (ViDRE) and Vision-based Data Item Extractor (ViDIE). A. Visual Features Font The fonts of texts on a web page are a very useful visual information, which are determined by many attributes such as size, face, color etc. Position Features (PF’s) These features indicate the location of a data region on a deep web page. PF1: Data regions are always centered horizontally. PF2: The size of the data region is usually large relative to the area size of the whole page. Layout Features (LF’s) These features indicate how the data records in the data region are typically arranged. LF1: The data records are usually aligned left in the data region. 
LF2: All data records are adjoining.LF3: Adjoining data records do not overlap. Appearance Features (AF’s) These features capture the visual features within the data records. AF1: Data records are very similar in their appearances such as sizes of images they contain and fonts they use. AF2: The data items of same semantic in different data records have similar presentations. AF3: The neighboring text data items of different semantics often use distinguishable fonts. Content Features (CF’s) These features hint the regularity of contents in data records.CF1: First data item in each data record is always of a mandatory type. CF2: The presentation of data items in data records follows a fixed order.CF3: There are often some fixed static texts in data records. The problem of web data extraction has perceived a lot of attraction because most of the solutions proposed are based on analyzing the HTML source code or the tag trees of web pages. Consider a web page which is vivid and more colorfully designed using complex Java script and CSS, it makes more difficult to infer the regularity of the structure of web pages by only analyzing the tag structures. Hence, a novel vision-based approach gains necessity. The overall objective of our project is to develop an approach that is Web-page-programming-language-independent for extracting the data from dynamically generated web pages. A number of approaches have been reported for extracting information from web pages.
  • 2. International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Issue 2, Volume 4 (April 2015) ISSN: 2349-7009(P) www.ijiris.com ____________________________________________________________________________________________________________ © 2014-15, IJIRIS- All Rights Reserved Page -45 Those are classified based on the degree of automation in data extraction as manual approaches, semiautomatic approaches and automatic approaches. The earliest are manual approaches in which languages were designed to assist programmer in constructing wrappers to identify and extract all the desired data items/fields. Example: Minerva, TSIMMIS and Web-OQL.Semi automatic techniques can be classified into sequence based and tree based. Example: Soft Mealy and Stalker, W4F and XWrap. These approaches require manual efforts for example, labelling some sample pages. Automatic approaches improve the efficiency and reduce the manual efforts. Example: Omini, IEPAD, Roadrunner, MDR, DEPTA etc. But these approaches do not generate wrappers (i.e.) they identify patterns and extract data from web pages directly without using the previously derived extraction rules. B. Information Extraction and Web Intelligence The problem of information extraction is to transform the contents of input documents into structured data. Unlike information retrieval, which concerns how to identify relevant documents from a collection, information extraction produces structured data ready for post-processing, which is crucial to many applications of text mining. IEPAD architecture shown in fig.1, attempts to eliminate the need of user-labeled training examples. The idea is based on the fact that data on a semi-structured web page is often rendered in some particular alignment and order and thus exhibit regular and contiguous pattern. By discovering these patterns in target web pages, an extractor can be generated. IEPAD has a pattern discovery algorithm that can be applied to any input webpage without training examples. This greatly reduces the cost of extractor construction. A huge number of extractors can now be generated and sophisticated web intelligence applications become practical. Fig.1 IEPAD Architecture C. The IEPAD Architecture Web page information extraction has been implemented into a system called IEPAD, which includes three components:  Pattern discoverer accepts an input web page and discovers potential patterns that contain the target data to be extracted.  Rule generator contains a graphical user interface, called pattern viewer, which shows patterns discovered. Users can then select the one that extracts interesting data and then the rule generator will remember the pattern and save it as an extraction rule for later applications. Users can also use pattern viewer to assign attribute names of the extracted data.  Extractor extracts desired information from similar web pages based on the designated extraction rule. II. LITERATURE SURVEY A. Automatic Information Extraction from Semi-Structured Web Pages By Pattern Discovery In this paper, an unsupervised approach to semi-structured information extraction is presented. The key features of this approach are as follows. First, by applying the PAT-tree pattern discovery algorithm, the discovery of record boundaries can be completed automatically without user- labeled training examples. Second, by applying the multiple alignment algorithm, discovered patterns can be generalized over unseen pages. 
Third, the extraction rule is pattern-based instead of delimiter-based. Finally, by allowing alternative expressions in the extraction rules, IEPAD can handle exceptions such as missing attributes, multiple attribute values and variant attribute permutations. Compared with previous work, IEPAD performs equally well in terms of extraction accuracy but requires much less human intervention to produce an extractor. In terms of tolerance of layout irregularity, the extraction rules generated by IEPAD allow alternative tokens and hence can tolerate exceptions and variance such as missing attributes in the input. Multi-level alignment can also be applied to extract finer information by extracting attributes from within a discovered data record.
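As an illustration of the pattern-discovery idea (a simplified stand-in, not IEPAD’s actual PAT-tree implementation), the following sketch tokenizes HTML into tag tokens and reports tag subsequences that repeat back to back, which is roughly where candidate record patterns come from. The tokenizer and the minimum-length and minimum-repeat parameters are assumptions.

# Simplified stand-in for IEPAD-style pattern discovery: find tag patterns
# that repeat contiguously in the page's tag-token sequence. A real PAT tree
# finds maximal repeats far more efficiently and deduplicates overlapping
# reports; this brute-force version only illustrates the idea.
import re

def tag_tokens(html: str) -> list:
    """Reduce a page to its tag skeleton, e.g. ['<tr', '<td', '</td', ...]."""
    return re.findall(r'</?\w+', html)

def repeated_patterns(tokens, min_len=2, min_repeats=2):
    """Yield (pattern, count) for tag subsequences repeating back to back."""
    n = len(tokens)
    for length in range(min_len, n // min_repeats + 1):
        for start in range(n - length * min_repeats + 1):
            pat = tokens[start:start + length]
            count = 1
            while tokens[start + count*length : start + (count+1)*length] == pat:
                count += 1
            if count >= min_repeats:
                yield tuple(pat), count

html = "<table><tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr></table>"
for pat, count in repeated_patterns(tag_tokens(html)):
    print(count, 'x', ' '.join(pat))   # the <tr><td>...</td></tr> unit repeats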
B. Automatic Extraction of Dynamic Record Sections from Search Engine Result Pages
In this paper, a method to automatically generate wrappers for extracting search result records from all dynamic sections on result pages returned by search engines is presented. This method has two novel features: (1) it aims to explicitly identify all dynamic sections, including those that are not seen on the sample result pages used to generate the wrapper; (2) it addresses the issue of correctly differentiating sections and records. Experimental results indicate that this method is very promising. Automatic search result record extraction is critical for applications that need to interact with search engines, such as the automatic construction and maintenance of meta-search engines and deep web crawling. The paper presents an algorithm (MSE) that tackles the problem of automatically extracting dynamic sections as well as the search result records within those sections. The solution can be very useful for all web applications that need to interact with web-based search systems, including regular search engines and e-commerce search engines. By extracting search result records from all dynamic sections and maintaining the section-record relationships, MSE allows an application to select the desired sections for data extraction.

C. VIPS: A Vision-Based Page Segmentation Algorithm
A large number of techniques have been proposed to address the segmentation problem, but all of them have inherent limitations because they are web-page-programming-language dependent. This paper proposes a new approach for extracting web content structure based on visual representation. The produced web content structure is very helpful for applications such as web adaptation, information retrieval and information extraction. By identifying the logical relationships of web content based on visual layout information, the web content structure can effectively represent the semantic structure of the web page. The algorithm simulates how a user understands the layout structure of a web page from its visual representation. Compared with the traditional DOM-based segmentation method, this scheme utilizes useful visual cues to obtain a better partition of a page at the semantic level. It is also independent of physical realization and works well even when the physical structure is far different from the visual presentation.

III. VISION-BASED APPROACH
We propose an automatic, novel vision-based approach, ViDE, to extract data from deep web pages. It utilizes the visual features of a web page and hence is Web-page-programming-language-independent, i.e., HTML-independent. The information on web pages consists of both texts and images, and our approach provides an efficient mechanism to extract and cluster that information to ease its use by the user.

A. Vision-Based Data Extractor
Our proposed system, depicted in Fig. 2, is a vision-based data extraction tool. Given an input URL, the local web browser uses the crawler mechanism to access the global web and fetch the requested web page. The contents of the web page, such as images, tables, headers, emails and phone numbers, are then extracted using the data extractor tool. The clustering process then utilizes the visual appearance of the web page to group the semantically related content together and display it in the form of an HTML file.

Fig. 2. Vision-based system architecture
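The following is a minimal end-to-end sketch of this pipeline, assuming Python with the third-party requests library; the function names, the regular expressions and the image-size clustering rule are illustrative assumptions, not the tool’s actual implementation.

# Hedged sketch of the fetch -> extract -> cluster -> save pipeline of Fig. 2.
# All names, patterns and rules here are illustrative assumptions.
import re
import requests   # third-party library; assumed available

def fetch(url: str) -> str:
    """Crawler step: fetch the page requested by the user."""
    return requests.get(url, timeout=10).text

def extract(html: str) -> dict:
    """Extractor step: pull typed content out with regular expressions."""
    return {
        "images": re.findall(r'<img[^>]+src="([^"]+)"', html),
        "emails": re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', html),
        "links":  re.findall(r'href="(https?://[^"]+)"', html),
    }

def cluster_images(html: str) -> list:
    """Clustering step: keep only images sharing the dominant declared
    size; differently sized images are treated as noise (icons, banners)."""
    sizes = re.findall(r'<img[^>]*width="(\d+)"[^>]*height="(\d+)"', html)
    if not sizes:
        return []
    dominant = max(set(sizes), key=sizes.count)
    return [s for s in sizes if s == dominant]

def save_as_html(data: dict, path: str) -> None:
    """Display step: write the grouped content as a simple HTML table."""
    rows = "".join(f"<tr><td>{k}</td><td>{', '.join(v)}</td></tr>"
                   for k, v in data.items())
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"<table border='1'>{rows}</table>")

html = fetch("http://example.com")
data = extract(html)
data["image_sizes"] = ["x".join(s) for s in cluster_images(html)]
save_as_html(data, "structured_view.html")

The four steps of this sketch correspond to the crawler, extractor, clustering and display components described below.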
1) Crawler
A crawler is a mechanism that methodically scans, or crawls, through internet pages to create an index of the data it is looking for. Given a URL as input, our tool uses this mechanism to display the corresponding web page.
The process is as follows. A user enters the URL of the web page he/she wants to extract in the local browser. The local browser then contacts the World Wide Web to identify and fetch the web page requested by the user. Finally, the local browser displays the fetched web page in the display panel.

2) Extractor
The fetched web page is given as input to the extractor, which uses regular expressions to retrieve the contents from the web page and save them in the local file storage. The extraction process is as follows. Our web data extractor tool extracts different types of data from the acquired web page, such as images, tables, headers, content, URLs traversed, phone numbers, emails and fax information. When the extraction process starts, regular expressions are constructed for each of the data types mentioned above. These regular expressions are matched against the fetched web page to extract the required information. Once the extraction process is over, the extracted information is stored as individual files in the local file storage.

3) Clustering
Clustering involves grouping the content extracted from the web pages in the previous step by the web data extractor tool. Clustering is performed by utilizing the visual appearance of the web page: the relevant contents are grouped together based on visual features such as image size, image position and font size. The clustering process is as follows. Images are first clustered based on their size and position; as a result, images with identical sizes are retained, while the other images are labeled as noise images and removed. Tables and other contents are grouped by utilizing the font features of the text displayed in the web page. Hence, at the end of the clustering process, the relevant content extracted from the web pages is stored in the form of an HTML file.

4) Display
The display process utilizes a search-engine mechanism which displays results according to the submitted search query. The process is as follows. The HTML files containing the clustered content are obtained for the various web sites and placed in the local file storage, and the relevant pages are displayed when a search query is submitted.

IV. AGENT BASED AUTHENTICATION
A. Agent Server for Authentication
Whenever a client wants to search for a web page, it first has to send a query to the agent server, and prior to that every client has to register with the agent server. The agent server records all client details in its internal database. Whenever a query is received, the encrypted username and password are verified; only authenticated clients can access the server or send queries to the respective servers.

Fig. 3. Agent server implementation

The agent server acts as a proxy for ‘n’ clients and servers on the web. It implements a linked list for storing the username and password in encrypted form.
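A minimal sketch of the agent server’s registration and authentication steps follows. The paper stores credentials “in encrypted form” in a linked list; in this sketch a salted hash and a simple singly linked list stand in, both as assumptions, and the class and function names are illustrative.

# Hedged sketch of the agent server's credential store and check.
# Salted SHA-256 hashing stands in for the unspecified encryption.
import hashlib, os

class Node:
    """One entry in the agent server's linked list of registered clients."""
    def __init__(self, username, salt, digest, nxt=None):
        self.username, self.salt, self.digest, self.next = username, salt, digest, nxt

class AgentServer:
    def __init__(self):
        self.head = None                     # head of the linked list

    def register(self, username: str, password: str) -> None:
        salt = os.urandom(16)
        digest = hashlib.sha256(salt + password.encode()).hexdigest()
        self.head = Node(username, salt, digest, self.head)

    def authenticate(self, username: str, password: str) -> bool:
        node = self.head
        while node:                          # walk the list to find the user
            if node.username == username:
                d = hashlib.sha256(node.salt + password.encode()).hexdigest()
                return d == node.digest
            node = node.next
        return False

server = AgentServer()
server.register("alice", "s3cret")
assert server.authenticate("alice", "s3cret")       # query is forwarded
assert not server.authenticate("alice", "wrong")    # query is rejected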
B. Victim/Detection Model for Authentication
The victim model in our general framework consists of multiple back-end servers, which can be web/application servers, database servers and distributed file systems. We do not take the classic multi-tier web server as the model, since our detection scheme is deployed directly on the victim tier and identifies the attacks targeting that same tier; thus, multi-tier attacks should be separated into several classes in order to utilize this detection scheme. The victim model, along with the front-end proxies, is shown in Fig. 4. We assume that all the back-end servers provide multiple types of application services to clients using the HTTP/1.1 protocol on TCP connections, and that each back-end server has the same amount of resources.

Moreover, the application services are provided by K virtual private servers (K is an input parameter), which are embedded in the physical back-end server machine and operate in parallel. Each virtual server is assigned an equal amount of static service resources, e.g., CPU, storage, memory and network bandwidth, and the operation of any virtual server does not affect the other virtual servers in the same physical machine. The reasons for utilizing virtual servers are twofold: first, each virtual server can reboot independently and is thus feasible for recovery from possible fatal destruction; second, the state-transfer overhead of moving clients among different virtual servers is much smaller than that of moving them among physical server machines.

As soon as client requests arrive at the front-end proxy, they are distributed to multiple back-end servers for load balancing, whether session-sticky or not. Notice that our detection scheme is behind this front-end tier, so the load-balancing mechanism is orthogonal to our setting. On being accepted by one physical server, a request is first validated against the list of all identified attacker IDs (the black list). If it passes this authentication, it is distributed to one of the virtual servers within the machine by means of a virtual switch. This distribution depends on the testing matrix generated by the detection algorithm. By periodically monitoring the average response time to service requests and comparing it with specific thresholds fetched from a legitimate profile, each virtual server is associated with a “negative” or “positive” outcome. This victim model is deployed on the agent server for authentication purposes. Once the clients are authenticated, the web pages can be accessed with the proposed methodology.

Fig. 4. Victim/detection model
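To make the detection loop concrete, here is a minimal sketch, under assumed names and numbers, of one round: clients are assigned to the K virtual servers by a 0/1 testing matrix, each virtual server’s average response time is compared with a threshold from the legitimate profile, and the server is marked positive (possibly under attack) or negative. The clearing rule in suspects() is one simple way to use the outcomes and is an assumption, not necessarily the paper’s detection algorithm.

# Hedged sketch of one detection round over K virtual servers. The testing
# matrix, the threshold and the response-time samples are illustrative.
from statistics import mean

def one_round(testing_matrix, response_times, threshold):
    """testing_matrix[k][c] == 1 assigns client c to virtual server k;
    response_times[c] is the measured service time for client c's requests.
    Returns a 'positive'/'negative' outcome per virtual server."""
    outcomes = []
    for row in testing_matrix:
        assigned = [response_times[c] for c, bit in enumerate(row) if bit]
        avg = mean(assigned) if assigned else 0.0
        outcomes.append("positive" if avg > threshold else "negative")
    return outcomes

def suspects(testing_matrix, outcomes):
    """A client is cleared if it ever landed on a negative virtual server."""
    cleared = set()
    for row, out in zip(testing_matrix, outcomes):
        if out == "negative":
            cleared |= {c for c, bit in enumerate(row) if bit}
    return [c for c in range(len(testing_matrix[0])) if c not in cleared]

# K = 3 virtual servers, 4 clients; client 3 inflates response times.
matrix = [[1, 1, 0, 1],
          [1, 1, 1, 0],
          [1, 0, 0, 1]]
times = {0: 0.05, 1: 0.04, 2: 0.05, 3: 0.90}
out = one_round(matrix, times, threshold=0.20)
print(out, suspects(matrix, out))  # ['positive', 'negative', 'positive'] [3]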
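As a small illustration of how such a snapshot might be screened, the following sketch hard-codes a subset of Table 1 as a dictionary and flags any node whose authentication score has collapsed to zero, as happened to n1 above. The screening rule is an assumption for illustration, not the evaluation procedure of the paper.

# Hedged sketch: screen the first-run snapshot for compromised nodes.
# Flagging on a zero authentication score is an illustrative rule only.
factors = ["confidentiality", "security", "privacy",
           "non_repudiation", "authentication", "authorization"]
snapshot = {                      # subset of Table 1
    "n0": [0.625, 0.625, 0.5, 0.375, 0.75, 0.375],
    "n1": [0.75, 0.5, 0.625, 0.5, 0.0, 0.125],
    "n4": [0.125, 0.25, 0.5, 0.625, 0.875, 0.0],
}
for node, values in snapshot.items():
    scores = dict(zip(factors, values))
    if scores["authentication"] == 0.0:
        print(f"{node}: possible compromise (authentication score is 0)")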
Fig. 5. Performance factors comparison

The performance factors generated by MATLAB are shown in Fig. 5. From this observation we can ensure that only authenticated clients access the server, and that deep web page data extraction is carried out by the proposed method.

A. Browser Page
The browser page is the first window in the web data extraction process. The web page URL is given as input to the agent server, in which all the servers holding the original data are registered. Whenever there is a request from a client for a particular website, the request is immediately forwarded to the agent server. The main responsibility of the agent server is to map the requested website against its own database. If a match is found, the agent server issues a valid key to the client, which is the count of the number of accesses to that particular website; it also sends the closely related authenticated websites to the client. Figs. 6 and 7 show the browser page, which uses the crawler mechanism to fetch and display the requested page.

Fig. 6. Browser page
Fig. 7. Web page display
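The browser page’s interaction with the agent server can be pictured with a short sketch: the agent server maps the requested URL against its database and, on a match, returns a valid key which, following the description above, is simply the running access count for that site. The class name, the refusal behavior and the crude “related sites” rule are illustrative assumptions.

# Hedged sketch of the agent server's request mapping. Per the description
# above, the "valid key" returned to the client is the running count of
# accesses to the requested website; all names here are illustrative.
class AgentDirectory:
    def __init__(self, known_sites):
        self.counts = {site: 0 for site in known_sites}  # registered servers

    def request(self, url: str):
        """Return (valid_key, related_sites) on a match, else None."""
        if url not in self.counts:
            return None                      # unknown site: request refused
        self.counts[url] += 1
        related = [s for s in self.counts if s != url]   # crude 'related' list
        return self.counts[url], related

directory = AgentDirectory(["http://example.org", "http://example.com"])
print(directory.request("http://example.org"))  # (1, ['http://example.com'])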
B. Web Data Extractor
The next main window, shown in Figs. 8 to 11, is the Web Data Extractor tool, which uses regular expressions to validate and extract the data from the fetched web page.

Fig. 8. Web data extractor
Fig. 9. Extracting images
Fig. 10. Saving images
Fig. 11. Saved images in folder

C. Clustering
The next window is the clustering window, which groups the extracted content of the web pages based on visual cues. Here, the final HTML table containing the structured view of the web page is created and stored.

Fig. 12. Build file
Fig. 13. Cluster file

D. Display
The final window (Figs. 12 and 13) in the extraction process is a search engine which traverses the saved HTML files and fetches the content when a search query is entered.

VI. CONCLUSION AND FUTURE ENHANCEMENTS
Several approaches had been proposed before for extracting data from web pages, and all of them had drawbacks which have been overcome by using the vision-based approach. However, extracting data from web pages based entirely on visual features is beyond the scope of this project. Hence, we have proposed a system which utilizes the visual appearance to cluster the data records extracted from semi-structured web pages. The variation we have proposed paves the way towards extracting web page contents purely based on the visual regularity of the web pages. Our web extraction tool enables non-expert users to view the structured content of web pages in an efficient manner. Once the extraction and clustering processes are complete, the user has a completely structured view of the contents of the previously semi-structured web page.
Thus our system converts the unstructured information in web pages into a structured format, which eases the user’s consumption of the web data. At present the system limits the utilization of visual features to the clustering process; in future, the same features can be applied to extract the contents directly from the web page. A separate search query mechanism is implemented to retrieve the collected information from the web pages; in future, a search engine may be embedded in the same Web Extractor tool to improve performance. We have also provided an adaptive agent-based authentication mechanism for security.

REFERENCES
[1] G.O. Arocena and A.O. Mendelzon, “WebOQL: Restructuring Documents, Databases, and Webs,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 24-33, 1998.
[2] D. Buttler, L. Liu, and C. Pu, “A Fully Automated Object Extraction System for the World Wide Web,” Proc. Int’l Conf. Distributed Computing Systems (ICDCS), pp. 361-370, 2001.
[3] D. Cai, X. He, J.-R. Wen, and W.-Y. Ma, “Block-Level Link Analysis,” Proc. SIGIR, pp. 440-447, 2004.
[4] D. Cai, S. Yu, J. Wen, and W. Ma, “Extracting Content Structure for Web Pages Based on Visual Representation,” Proc. Asia Pacific Web Conf. (APWeb), pp. 406-417, 2003.
[5] C.-H. Chang, M. Kayed, M.R. Girgis, and K.F. Shaalan, “A Survey of Web Information Extraction Systems,” IEEE Trans. Knowledge and Data Eng., vol. 18, no. 10, pp. 1411-1428, Oct. 2006.
[6] C.-H. Chang, C.-N. Hsu, and S.-C. Lui, “Automatic Information Extraction from Semi-Structured Web Pages by Pattern Discovery,” Decision Support Systems, vol. 35, no. 1, pp. 129-147, 2003.
[7] V. Crescenzi and G. Mecca, “Grammars Have Exceptions,” Information Systems, vol. 23, no. 8, pp. 539-565, 1998.
[8] V. Crescenzi, G. Mecca, and P. Merialdo, “RoadRunner: Towards Automatic Data Extraction from Large Web Sites,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 109-118, 2001.
[9] D.W. Embley, Y.S. Jiang, and Y.-K. Ng, “Record-Boundary Discovery in Web Documents,” Proc. ACM SIGMOD, pp. 467-478, 1999.
[10] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krpl, and B. Pollak, “Towards Domain Independent Information Extraction from Web Tables,” Proc. Int’l World Wide Web Conf. (WWW), pp. 71-80, 2007.
[11] J. Hammer, J. McHugh, and H. Garcia-Molina, “Semistructured Data: The TSIMMIS Experience,” Proc. East-European Workshop Advances in Databases and Information Systems (ADBIS), pp. 1-8, 1997.
[12] C.-N. Hsu and M.-T. Dung, “Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web,” Information Systems, vol. 23, no. 8, pp. 521-538, 1998.
[13] http://daisen.cc.kyushu-u.ac.jp/TBDW/, 2009.
[14] http://www.w3.org/html/wg/html5/, 2009.
[15] N. Kushmerick, “Wrapper Induction: Efficiency and Expressiveness,” Artificial Intelligence, vol. 118, nos. 1/2, pp. 15-68, 2000.
[16] A. Laender, B. Ribeiro-Neto, A. da Silva, and J. Teixeira, “A Brief Survey of Web Data Extraction Tools,” SIGMOD Record, vol. 31, no. 2, pp. 84-93, 2002.
[17] B. Liu, R.L. Grossman, and Y. Zhai, “Mining Data Records in Web Pages,” Proc. Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 601-606, 2003.
[18] W. Liu, X. Meng, and W. Meng, “Vision-Based Web Data Records Extraction,” Proc. Int’l Workshop Web and Databases (WebDB ’06), pp. 20-25, June 2006.
[19] L. Liu, C. Pu, and W. Han, “XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 611-621, 2000.
[20] Y. Lu, H. He, H. Zhao, W. Meng, and C.T. Yu, “Annotating Structured Data of the Deep Web,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 376-385, 2007.
[21] J. Madhavan, S.R. Jeffery, S. Cohen, X.L. Dong, D. Ko, C. Yu, and A. Halevy, “Web-Scale Data Integration: You Can Only Afford to Pay As You Go,” Proc. Conf. Innovative Data Systems Research (CIDR), pp. 342-350, 2007.
[22] I. Muslea, S. Minton, and C.A. Knoblock, “Hierarchical Wrapper Induction for Semi-Structured Information Sources,” Autonomous Agents and Multi-Agent Systems, vol. 4, nos. 1/2, pp. 93-114, 2001.
[23] Z. Nie, J.-R. Wen, and W.-Y. Ma, “Object-Level Vertical Search,” Proc. Conf. Innovative Data Systems Research (CIDR), pp. 235-246, 2007.
[24] A. Sahuguet and F. Azavant, “Building Intelligent Web Applications Using Lightweight Wrappers,” Data and Knowledge Eng., vol. 36, no. 3, pp. 283-316, 2001.
[25] K. Simon and G. Lausen, “ViPER: Augmenting Automatic Information Extraction with Visual Perceptions,” Proc. Conf. Information and Knowledge Management (CIKM), pp. 381-388, 2005.
[26] R. Song, H. Liu, J.-R. Wen, and W.-Y. Ma, “Learning Block Importance Models for Web Pages,” Proc. Int’l World Wide Web Conf. (WWW), pp. 203-211, 2004.
[27] J. Wang and F.H. Lochovsky, “Data Extraction and Label Assignment for Web Databases,” Proc. Int’l World Wide Web Conf. (WWW), pp. 187-196, 2003.
[28] X. Xie, G. Miao, R. Song, J.-R. Wen, and W.-Y. Ma, “Efficient Browsing of Web Search Results on Mobile Devices Based on Block Importance Model,” Proc. IEEE Int’l Conf. Pervasive Computing and Comm. (PerCom), pp. 17-26, 2005.
[29] Y. Zhai and B. Liu, “Web Data Extraction Based on Partial Tree Alignment,” Proc. Int’l World Wide Web Conf. (WWW), pp. 76-85, 2005.
[30] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C.T. Yu, “Fully Automatic Wrapper Generation for Search Engines,” Proc. Int’l World Wide Web Conf. (WWW), pp. 66-75, 2005.
[31] H. Zhao, W. Meng, and C.T. Yu, “Automatic Extraction of Dynamic Record Sections from Search Engine Result Pages,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 989-1000, 2006.
[32] J. Zhu, Z. Nie, J. Wen, B. Zhang, and W. Ma, “Simultaneous Record Detection and Attribute Labeling in Web Data Extraction,” Proc. Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 494-503, 2006.