Extracting Structured Records From Wikipedia



One of my interesting projects during my MS was on extracting structured records from Wikipedia. I developed a working prototype of it in Ruby on Rails.



Extracting Structured Records from Wikipedia

Aniruddha Despande                        Shivkumar Chandrashekhar
University of Texas at Arlington          University of Texas at Arlington

ABSTRACT

Wikipedia is a web-based collaborative knowledge-sharing portal comprising articles contributed by authors all over the world, but its search capabilities are limited to title and full-text search only. There is a growing interest in querying over the structure implicit in unstructured documents, and this paper explores fundamental ideas for achieving this objective using Wikipedia as a document source. We suggest that semantic information can be extracted from Wikipedia by identifying associations between prominent textual keywords and their neighboring contextual text blocks. This association discovery can be accomplished by combining primitive pattern or regular expression matching with a token frequency determination algorithm run for every Wikipedia page, to heuristically promote certain structural entities as context headers and hierarchically wrap surrounding elements with such identified segment titles. The extracted results can be maintained in a domain-neutral semantic schema customized for Wikipedia, which could improve its search capabilities through an efficient interface extended over a relational data source realizing this modeled schema. Experimental results and the implemented prototype indicate that this notion is successful in achieving good accuracy in entity associations and a high recall in the extraction of diverse Wikipedia structural types.

Categories and Subject Descriptors
H.2 [Information Systems]: Database Management; H.3.3 [Information Systems]: Information Storage and Retrieval – Information Extraction and Retrieval

General Terms
Algorithms, Experimentation, Standardization

Keywords
Information Extraction, Wikipedia

1. INTRODUCTION

Wikipedia is the largest collaborative knowledge-sharing web encyclopedia. It is one of the most frequently accessed websites on the internet, undergoes frequent revisions, and is available in around 250 languages, with English alone estimated to possess around 2 million pages and around 800,000 registered users. However, to find information in the vast number of articles on Wikipedia, users have to rely on a combination of keyword search and browsing. These mechanisms are effective but incapable of supporting complex aggregate queries over the potentially rich set of structures embedded in Wikipedia text. For example, consider the pages about the states Ohio, Illinois and Texas in Wikipedia. Information about the total area, total population and % water can be explicitly inferred from these pages.

Table 1: Portions from info boxes found on Wikipedia pages

            Total Area         Total Population   % Water
Ohio        44,825 sq miles    11,353,140         8.7
Illinois    57,918 sq miles    12,831,970         4.0
Texas       268,820 sq miles   20,851,820         2.5

However, this presentation restricts us from expressing SQL queries to order the states in increasing sequence of their population densities (total population / total area) or similar operations to process the cumulative knowledge contained in these pages. The ability to query this set of structures is highly desirable. Such data can be practically found in different kinds of Wikipedia structures, namely info boxes, wiki tables, lists, images, etc. In this paper, we present a scheme to extract such structured information from a given Wikipedia page, irrespective of its inherent structural make-up. Wikipedia pages differ from one another in the compositional make-up of the structural types they are built from.

The need to amalgamate the web's structured information and knowledge to enable semantically rich queries is a widely accepted necessity. This is the goal behind most modern-day research into information extraction and integration, with standards like the Semantic Web being advertised as the future vision of the current-day web. We investigate one plausible approach to expressing semantically rich queries over one such web document source, in particular the content-rich, multi-lingual Wikipedia portal (http://www.wikipedia.org). Wikipedia allows authors a wide variety of rich representations to choose from, thus creating a healthy diversity in the content representation formats used across pages. This heterogeneity across the structural make-up of different pages (even from the same domain) presents a challenge to textual extraction and content association.

The rest of the paper is organized as follows. A brief description of the motivation for the project is presented in Section 2. In Section 3 we formalize the problem definition. Section 4 surveys related work. Section 5 describes the data/query model and the project's architectural components. The algorithm design and details are presented in Section 6. Section 7 touches upon the implementation details, and we present our evaluation results in Section 8. We conclude and provide directions for future work in Section 9.
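The aggregate query from the introduction's example can be illustrated with a short script. The following is a minimal sketch in plain Ruby (the paper's implementation language), with the Table 1 values hard-coded rather than read from a database:

```ruby
# Rank states by population density (people per square mile),
# using the info-box values shown in Table 1.
states = [
  { title: "Ohio",     area_sq_mi: 44_825,  population: 11_353_140 },
  { title: "Illinois", area_sq_mi: 57_918,  population: 12_831_970 },
  { title: "Texas",    area_sq_mi: 268_820, population: 20_851_820 },
]

ranked = states
  .map { |s| s.merge(density: s[:population].to_f / s[:area_sq_mi]) }
  .sort_by { |s| s[:density] }  # increasing density

ranked.each { |s| puts format("%-8s %.1f people/sq mile", s[:title], s[:density]) }
```

Once info-box tuples sit in a relational store, the same ranking becomes a one-line SQL ORDER BY, which is precisely the capability keyword search and browsing cannot provide.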
2. MOTIVATION

The incentive for this project can be summarized as an attempt to leverage the rich structures inherent in Wikipedia content for successful information extraction and association, and to augment its traditional search options with a mechanism that supports rich relational operations over the Wikipedia information base. Thus our motivation arises from a need to formulate a mechanism to recognize inherent structures occurring on Wikipedia pages, to design and develop an extraction framework which mines for structural segments in Wikipedia text, and to construct a comprehensive interface supporting effective query formulation over the Wikipedia textual corpus, thereby realizing an integrated model enabling analytical queries over the Wikipedia knowledge base.

3. PROBLEM DEFINITION

The problem being addressed in this project can be described as a research initiative into extraction strategies and the design and implementation of an efficient and capable extraction framework to identify and retrieve structured information from Wikipedia, accurately establish associations among the extracted items, and preserve them in a relational data source. This requirement involves dealing with various Wikipedia content representation types such as text segments, info boxes, wiki tables, images, paragraphs, links, etc., and the creation of a database schema adequate to encompass such diverse extracted tuples. The provision of rich SQL queries is a simple addition to this system and hence will not be fully explored; rather, support for querying is realized through a simplistic querying interface. The problem definition can thus be summed up as:

1. Research into extraction strategies, and design and implementation of a retrieval framework to identify and extract structured information from Wikipedia.
2. Extraction of various Wikipedia content types including text segments, info boxes, free text, references and images.
3. Design of a database schema to accommodate diverse data fields extracted from Wikipedia.
4. Provision of basic querying primitives over the extracted information.

4. RELATED WORK

We performed a wide literature survey to learn about similar extraction initiatives. The idea of bringing semantics into Wikipedia is not new, and several studies on this topic have been carried out in the last few years. Semantic extraction and relationships were discussed in [6]; the authors analyze relevant measures for inferring the semantic relationships between page categories of Wikipedia. DBpedia [5] is a community-based effort that uses manual and automatic data extraction to construct an ontology from Wikipedia templates. In [7], the authors aim to construct knowledge bases focused on the task of organizing and facilitating retrieval within individual Wikipedia document collections. Our approach resembles the idea presented in [8], which introduces the notion of using a relational system as the basis of a workbench for extracting and querying structure from unstructured data in Wikipedia. Their paper focuses on incrementally evolving the understanding of the data in the context of the relational workbench, while our approach relies on an element-term frequency measure to determine each element's relative weight in a given page, and models our associations as gravitating towards the lesser-used tokens. The paper [2] presents an approach to mining information relating people, places, organizations and events from Wikipedia and linking them on a time scale, while the authors of [1] explore the possibility of automatically identifying "common sense" statements in unrestricted natural language text found in Wikipedia and mapping them to RDF. Their system works on the hypothesis that common-sense knowledge is often expressed in subject-predicate form, and their work focuses on the challenge of automatically identifying such generic statements.

5. EXTRACTION FRAMEWORK

This section provides details of the data model used in the system and the architectural composition of the extraction framework devised. The conceptual view paradigm widely adopted in relational databases as an abstract representation or model of the real world does not apply to our case. The idea of identifying entities and tabulating them, or establishing relationships across them, in a Wikipedia source is largely impractical and certainly not scalable.

5.1 Data/Query Model

Our data model for Wikipedia data has been designed to be domain independent. The database schema has been modeled to resemble RDF tuples, and is primarily designed to scale with increasing content size as well as to deal with the vivid heterogeneity, or diversity, of the extracted data.

Table 2: pics table

Id   Title     Image   Tag   pic_url
11   Alabama   Flag    Al    http://wikimedia.org/AlFlg.jpg

Table 3: infoboxes table

Id   Title     Property   PValue
11   Alabama   Governor   Robert R. Riley

Table 4: twikis table

Id   Title     Content        Tag
11   Alabama   It is seen ..  Law_and_government

The data schema reflects the structural types encountered on Wikipedia as tables in the relational source. This data model allows easy incorporation of a new representation type whenever it is encountered in a new Wikipedia page.
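As an illustration of this key-value layout, the sketch below mirrors Tables 2–4 with in-memory rows and reassembles everything known about a title by joining across the 'Title' attribute. This is plain Ruby; the sample rows are taken from the tables above, while the `join_by_title` helper is an illustrative name, not part of the actual prototype:

```ruby
# In-memory stand-ins for the pics, infoboxes and twikis tables (Tables 2-4).
pics      = [{ id: 11, title: "Alabama", image: "Flag", tag: "Al",
               pic_url: "http://wikimedia.org/AlFlg.jpg" }]
infoboxes = [{ id: 11, title: "Alabama", property: "Governor",
               pvalue: "Robert R. Riley" }]
twikis    = [{ id: 11, title: "Alabama", content: "It is seen ..",
               tag: "Law_and_government" }]

# Reconstruct a page's records by joining the per-type tables on Title,
# the logical key of the schema described in Section 5.1.
def join_by_title(title, tables)
  tables.transform_values { |rows| rows.select { |r| r[:title] == title } }
end

page = join_by_title("Alabama",
                     pics: pics, infoboxes: infoboxes, twikis: twikis)
```

In the prototype the same join would be expressed in SQL over the MySQL tables; the hash version just makes the key-value shape of the schema explicit.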
Records are grouped independent of their parent domain or title document. The original page can be reconstructed from the segregated tuples by joining across the 'Title' attribute. The title of a Wikipedia page is a distinguishing attribute and is hence chosen as a key in our schema. The 'Id' field is an auto-incrementing column which behaves as the primary key within the database tables; however, all operations are based on the logically coherent 'Title' field. Since all tuples in a given table correspond to the same or similar Wikipedia content types, the extracted tuples are uniform, allowing for faster indexing options. However, the data model suffers from the traditional weakness of an RDF-oriented schema: requiring too many joins to reconstruct the data. In this paper we emphasize the extraction of the tuples and believe that the extracted data can easily be migrated to a more rigidly normalized data store, and hence we choose to accept the limitations of RDF. We have consciously chosen to record only the location addresses of images rather than their binary content, both to preserve precious database space and to account for updates of images on the actual Wikipedia website. The database tables contain additional attributes, including auto-number identifiers and timestamps, introduced primarily for housekeeping purposes. This data model clearly favors ease of insertion and quick retrieval, but does not support quick updates of the linked tuples. We believe that updates are relatively few in our system, and the response times for updates can be improved by treating an update as a delete operation followed by a fresh insertion, which also helps flush stale tuples. The data model actively adapts to evolving Wikipedia structural types; however, the addition of a new type is a one-time manual affair.
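The update strategy described above (treat an update as a delete followed by a fresh insert, flushing stale tuples as a side effect) can be sketched as follows. This is plain Ruby over an in-memory row list; `upsert_title` and the sample rows are illustrative, not taken from the prototype:

```ruby
# Replace all tuples for a title with freshly extracted ones.
# Stale rows are flushed as a side effect of the delete step.
def upsert_title(rows, title, fresh_rows)
  rows.reject { |r| r[:title] == title } + fresh_rows
end

infoboxes = [
  { id: 11, title: "Alabama", property: "Governor", pvalue: "Robert R. Riley" },
  { id: 12, title: "Alabama", property: "Capital",  pvalue: "Montgomery" },
]

fresh = [{ id: 13, title: "Alabama", property: "Governor", pvalue: "Bob Riley" }]
infoboxes = upsert_title(infoboxes, "Alabama", fresh)
# All old tuples for the title are gone; only the freshly extracted row remains.
```

Against MySQL this would be a DELETE on Title followed by batch INSERTs inside one transaction; the sketch shows why no per-row update logic is needed.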
The anatomy of a few of the tables is presented above; as mentioned earlier, they resemble key-value form.

5.2 Architecture

The project adopts a simple architecture, presented in Figure 1. The 'Wiki Parser' is an extraction engine built by extending a general HTML parser. The HTML parser is a simple interface in our system that includes a small crawler segment to selectively pick out Wikipedia pages of interest, extract the HTML source from these pages, and convert it into an internal object representation. The template set is a collection of Wikipedia HTML/CSS templates or classes, and regular expressions built over them, used to identify frequently occurring content headers in Wikipedia pages. In addition to these templates, a frequency determination module augments the token identification process. The 'Wiki Parser' uses the token frequency identification component in association with template matching to isolate the main tokens in a given Wikipedia page.

Figure 1: System Architecture

The Wikipedia template set is a pattern list which can be incrementally improved to account for different Wikipedia pages, or even web pages on the general web. The templates also act as a noise filter to selectively weed out incomplete tags or structures existing in the Wikipedia page. The user interface is an AJAX-enabled web front end which allows users to express queries over Wikipedia and displays the query outcomes visually. The user interface works with the system database to serve user queries using extracted information. However, the user interface is also equipped to display extracted tuples returned by the 'Wiki Parser' during the online mode of operation.

The 'Wiki Parser' is capable of handling diverse element types including images, lists, tables, sections, free text and text headers. It iteratively associates surrounding context with the identified tokens to determine key-value pairs to be included in a system hash table. The hash tables are mapped onto the system relational database using the structural-type-to-table mapping explained in the data model. The database mapping also generates XML records of the extracted knowledge for direct consumption by specific applications.

6. ALGORITHM DESIGN

The extraction algorithm is a two-pass algorithm over an input web document: it identifies structural tokens and their hierarchies in the first phase, and performs the appropriate token-matching associations in the subsequent phase. We present the extraction challenge in Section 6.1, algorithm details in Section 6.2 and analysis in Section 6.3.
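In outline, the two passes can be sketched as below. This is a deliberately simplified plain-Ruby sketch: it assumes elements arrive as hashes with a CSS class and text, promotes only the single least-frequent class as the header class, and omits the template matching and regular-expression verification the real algorithm adds:

```ruby
# Simplified sketch of the two-pass association algorithm (Section 6):
# pass 1 counts how often each structural class occurs and promotes the
# rarest one as the probable header class; pass 2 attaches every other
# element to the nearest preceding promoted header.
def associate(elements)
  freq = Hash.new(0)
  elements.each { |e| freq[e[:css_class]] += 1 }   # pass 1: frequencies
  header_class = freq.keys.min_by { |c| freq[c] }  # promote the rarest class

  sections = {}
  current = nil
  elements.each do |e|                             # pass 2: association
    if e[:css_class] == header_class
      current = e[:text]
      sections[current] = []
    elsif current
      sections[current] << e[:text]
    end
  end
  sections
end

elements = [
  { css_class: "mw-headline", text: "History" },
  { css_class: "text", text: "Founded in 1788." },
  { css_class: "text", text: "Joined the Union in 1803." },
  { css_class: "mw-headline", text: "Geography" },
  { css_class: "text", text: "Bordered by Lake Erie." },
]
grouped = associate(elements)
# 'mw-headline' occurs twice versus three 'text' elements, so it is
# promoted, and the free text is grouped under the two headers.
```

The real system replaces the single-class promotion with per-element weights and a proximity calculation, but the frequency intuition is the same.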
6.1 Extraction Challenge

The main challenge in extraction is identifying the section titles, or key elements, in a given HTML page. Wikipedia offers a wide choice of elements that can be promoted as section titles. For example, the segment headers on a Wikipedia page about the state 'New York' could be demarcated with the CSS class 'mw-headline' while the surrounding text appears in the CSS class 'reference'. This observation, however, is not consistent across all Wikipedia pages. There could be pages where the section titles are tagged using the CSS class 'reference' and the associated text uses the CSS class 'toctext'. Hence it is not trivial to identify which elements are keys within a web page and which are their corresponding values. The problem is further complicated by the free usage of JavaScript between structural types. Also, the segment text headers may not always be expressed using a Wikipedia-provided type that maps to an equivalent CSS class. Since Wikipedia allows authors to insert their own HTML formatting, some authors tag their headers using standard HTML tags like <h3>, others may choose a different tag like <h4>, while the rest may opt for the Wikipedia-provided format classes. This adds a layer of ambiguity to the problem of accurately selecting the key element fields. To overcome this contextual uncertainty, we use a statistical measure to predict, or promote, certain fields as headers. The details of this approach are presented in the following sub-section.

6.2 Algorithm Details

This section describes the internal workings of the statistical measures used for segment header identification. This algorithmic procedure is implemented in the Token Frequency Identification module described as part of the system architecture. The essence of textual extraction is to identify which text belongs under which header. Unlike XML, HTML exhibits an ambiguity closely resembling the problem of associating context bodies with context headers in traditional natural language processing. Our approach augments simple pattern-determining regular expressions with an element frequency score, computed for every structural element. For example, if the CSS class 'reference' is used three times in an HTML page while the CSS class 'text' appears seven times, we can conclude with high probability that, since the CSS class 'reference' is sparingly used, it corresponds to a higher-weight element.

The algorithmic procedure performs two passes over a given document's element list. It computes a frequency of occurrence for every unique structural element and promotes the less frequent ones as possible segment definers. The second pass uses a combination of the Wikipedia templates and the determined frequencies in a proximity calculation, identifying for each more frequent element the most proximal probable context header found in the first pass. A crude verification is performed using predefined regular expressions to validate the grouping produced by this two-pass token association algorithm. The algorithm displays very good accuracy on text segments and images, but encounters occasional erroneous predictions on wiki tables. The algorithm seeks a statistical score as the determining factor on which to base its associations, overcoming the ambiguity the context presents. This approach could be augmented by a context-based certifier that determines a similarity score between a context heading and its associated value, thereby verifying the statistically computed association.

6.3 Algorithm Analysis

This two-pass algorithm displays a reasonable execution trace for a moderately sized input document set. We provide some formal notation below to aid the analysis of our algorithm and compute its asymptotic time complexity. Let the set D denote the input document set {d1, d2, d3, .. dn}, and let the set K denote the token set {k1, k2, .. kj} for each document. Let P(e) denote the token identification time and P(a) the association time per token. The time complexity is given as:

    Σ_{1 to n} [ d_j * P(e) + Σ_{1 to j} k_j * P(a) ]

The term P(e) is the key determining factor in this equation, and we use heuristic-based regular expression matching to reduce the unit time per token identification, as an attempt to speed up the algorithm's execution.

7. IMPLEMENTATION

The project has been implemented as a web application by making the query interface web-enabled. The application has been built using the Ruby language on the Rails web framework. Ruby is an interpreted scripting language specially designed for quick and easy object-oriented programming. Rails is an open-source web application framework developed to make web programming efficient and simple. Development in Ruby on Rails explicitly follows Model View Controller (MVC), an industry-proven implementation architecture; MVC can be broadly defined as a design pattern describing a recurring problem and its solution, where the solution is never exactly the same for every recurrence. The implementation relies on regular expressions to perform template matching and content identification. The extracted information, or tuples, are maintained in a MySQL database which serves as the relational data source.

Figure 2: Implementation Snapshot 1
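The flat XML representation of extracted tuples mentioned above can be produced with REXML, the XML library bundled with Ruby. A minimal sketch follows; the element and attribute names are illustrative, not the prototype's actual output schema:

```ruby
require "rexml/document"

# Serialize extracted info-box tuples as a flat XML record,
# mirroring the relational rows of Table 3. Element names here
# are assumptions for illustration only.
def tuples_to_xml(title, tuples)
  doc = REXML::Document.new
  page = doc.add_element("page", "title" => title)
  tuples.each do |t|
    prop = page.add_element("property", "name" => t[:property])
    prop.text = t[:pvalue]
  end
  out = +""
  doc.write(out)
  out
end

xml = tuples_to_xml("Alabama",
                    [{ property: "Governor", pvalue: "Robert R. Riley" }])
```

Writing such records alongside the MySQL rows gives downstream applications a consumption path that needs no database connection.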
The implementation is intentionally object-oriented to ease extensibility and scaling. The querying interface follows a query-builder paradigm, has AJAX support and has been built using the Google Web Toolkit (GWT). GWT is an open-source Java software development framework that allows web developers to create AJAX applications in Java. The querying interface supports an online querying mode, in which the user's query is served by real-time extraction of the Wikipedia page. For analytical or complex queries we recommend the offline mode of querying, which works on pre-extracted data and results in faster response times. The implementation supports preserving extracted Wikipedia information not only in the relational MySQL data source but also as flat XML files.

Figure 3: Sample XML Output

8. EVALUATION RESULTS

The implemented system was evaluated by performing extraction of over 100 articles from Wikipedia. The extraction of these articles yielded over 4087 images, 3627 text boxes, and 2866 info-box property values. The extraction covered a number of diverse subject domains including geography, politics, famous personalities and tourist attractions. The results indicate that text and image extraction are domain-neutral and representation-independent. The cleanliness of the text extractions was one of the promising aspects of this project. The extraction results indicate a very high degree of recall for the text and image extraction segments. The association results were verified by validating them against the real-world associations occurring on the source Wikipedia pages. The frequency-based estimation technique achieves an accuracy of over 90% for associating text, images and info boxes with keywords, and an accuracy of around 75% for deeply nested wiki tables.

The system was specifically evaluated to check the reliability of the associations for pages with unseen content definitions. The system yielded acceptable results for around 18 of the 21 different representation types tested. The database schema has been found to be flexible enough for the different content types occurring on Wikipedia pages. The extraction algorithm was also tested under a heavy contiguous load of around 60 web pages, which took 6 minutes on a 1 GHz machine, exhibiting good efficiency. A functional evaluation was also performed to test the integrated working of the system and the inter-connections between its various components, which showed a steady working state of the constructed prototype.

9. CONCLUSION

The work we have performed is still in the early stages of research, but we believe it offers an innovative way of contributing to the construction of ontologies from web pages. Likewise, we believe the prospect of using this methodology to help generate semantic extensions to Wikipedia is both exciting and useful. We summarize our work as including: the design, development and implementation of an extraction engine to retrieve structured information from Wikipedia; a data model to map extracted knowledge into a relational store; XML representations to publish the acquired information; a scheme to handle data diversity during extraction and data preservation; a statistical inference mechanism to decipher contextual ambiguity; and a capable querying interface to present these features.

We envision future work on this topic in terms of incrementally augmenting the statistical score computation using domain analysis or active regression-based learning approaches. A parallel stream of research can involve identifying and managing inter-article relationships in Wikipedia by observing document references obtained from our extraction framework. Implementation-specific effort can be channeled into the design and construction of web services publishing the extracted data, or into interactively engaging with web agents and receiving queries from them. Focus can also be directed at enriching the extracted Wikipedia knowledge base with information from external data sets by developing ontologies to support such objectives.

10. ACKNOWLEDGMENTS

We would like to thank our instructor Dr. Chengkai Li for his guidance and support in helping us develop the necessary skills in information extraction and accomplish this project in a timely manner.

11. REFERENCES

[1] Suh, S., Halpin, H., and Klein, E. Extracting Common Sense Knowledge from Wikipedia. ISWC Workshop, Athens, Georgia, November 2006.
[2] Bhole, A., Fortuna, B., Grobelnik, M., and Mladenic, D. Extracting Named Entities and Relating Them over Time Based on Wikipedia. Informatica, 2007, 463-468.
[3] Cafarella, M., Etzioni, O., and Suciu, D. Structured Queries Over Web Text. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2006.
[4] Cafarella, M., Re, C., Suciu, D., Etzioni, O., and Banko, M. Structured Querying of Web Text: A Technical Challenge. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, January 2007.
[5] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. DBpedia: A Nucleus for a Web of Open Data. 6th International Semantic Web Conference (ISWC 2007), Busan, Korea, November 2007.
[6] Chernov, S., Iofciu, T., Nejdl, W., and Zhou, X. Extracting Semantic Relationships between Wikipedia Categories. SemWiki2006, 2006.
[7] Milne, D., Witten, I., and Nichols, D. Extracting Corpus-Specific Knowledge Bases from Wikipedia. CIKM'07, Lisbon, Portugal, November 2007.
[8] Chu, E., Baid, A., Chen, T., Doan, A., and Naughton, J. A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data. VLDB'07, Vienna, Austria, September 2007.