iDonovan Technology

Automated Information Extraction and Database Fielding Technology

Submitted by Eddie Donovan

Confidential - Copyright © 2001 iDonovan.com, Inc. For more information: eddie@idonovan.com
iDonovan.com, Inc. Confidential Business Plan

Table of Contents
1. Copyright and Disclaimer (page 3)
2. Non-Disclosure Agreement (page 3)
3. Overview (page 3)
4. Applications of the iDonovan Extraction Framework (page 4)
5. Building a Database of Job Openings (page 4)
6. Creating a Comprehensive Database of Companies (page 5)
7. Learning about Continuing Education Courses (page 6)
8. Moving Beyond Building Databases (page 6)
9. How iDonovan Technology Extracts Information (page 7)
10. Machine Learning for Information Extraction and Classification (page 9)
11. Achieving High Accuracy With Low Costs (page 9)
12. Maintaining Up-To-Date Data (page 10)
13. iDonovan Products and Services (page 10)

03/03/07 Confidential - Copyright © 2001 iDonovan.com, Inc. For more information: eddie@idonovan.com Page 2
1. Copyright and Disclaimer

This technology plan is presented here to benefit and promote the services of iDonovan, Inc. The information and ideas herein are the confidential, proprietary, sole, and exclusive property of iDonovan, Inc. For more information please contact eddie@idonovan.com.

2. Non-Disclosure Agreement

You are being furnished with confidential information that has been prepared by iDonovan, Inc. and its representatives or agents in connection with evaluating a possible transaction with the company. You hereby agree that this evaluation material will be used solely for purposes in connection with a possible transaction with the company, that such information will be kept permanently confidential by you and your representatives, and that you will not distribute this evaluation material or any part hereof to others at any time without the prior written consent of the company. You agree to restrain your representatives from prohibited or unauthorized disclosure or use of the evaluation material and shall be responsible for any such breach hereof. This evaluation material is being delivered for informational purposes and upon the express understanding that it will be used only for the purposes set forth above. Your retention of the evaluation material shall constitute acceptance of the terms and conditions hereof. If you do not agree to the terms hereof, please do not read the evaluation material and immediately return it to the company.

3. Overview

iDonovan will be the world's leading provider of information extraction solutions and services. Our technology will automatically construct structured databases by finding and collecting information embedded in millions of Web pages and other text documents available on the Internet, intranets and corporate databases.
People will gain access to factual information that is embedded in text documents, making it possible to query documents as though the facts were in a relational database. The technology will also be used to build highly structured databases about facts that are not typically available, including information about corporations, people, products, jobs and educational resources. Because our technology will be based on machine learning algorithms, it can be trained to extract nearly any type of factual information that is represented in text documents.
4. Applications of the iDonovan Extraction Framework

The iDonovan Extraction Framework has been used to automatically extract a variety of data from text sources. Several examples follow:

5. Building a Database of Job Openings

The iDonovan Extraction Framework will be the backbone of the world's largest database of job openings, iDonovan.com. It will be the home of an estimated 500,000 job announcements from over 50,000 corporate Web sites in 2,400 cities. To build the initial version of this database our software will automatically crawl and analyze ten million web pages each day for a period of several months. The database will be completely rebuilt each week by automatically crawling and capturing original job postings on corporate Web sites, adding new job postings that have recently appeared on the sites and deleting old job postings that have been removed. Detailed information, including the job title, location, employer name, address for applications, general job category and specific job function, will be available with each listing. iDonovan will be the leading provider and developer of information extraction technology that turns unstructured data into relevant information.

Figure 1. Extracting Job Attributes From Text On a Web Site

Figure 1 illustrates the extraction process that will be used to expand the iDonovan.com database. On the left side is a page from the FoodScience.com Web site. The system located this Web page by searching the site, beginning at the home page and automatically following hyperlinks likely to lead to job descriptions. It automatically classified the page shown here as a page containing the target information, in
this case, job descriptions. On the right of Figure 1 is the database entry that was automatically extracted by the system. The entry includes specific facts about the job, such as the title, location and employer.

6. Creating a Comprehensive Database of Companies

A leading provider of corporate data may use the iDonovan Extraction Framework to create a detailed database that describes millions of corporations. Each year Dun & Bradstreet compiles targeted data about more than 60 million businesses.

Figure 2. Extracting Corporate Information From a Web Site

To illustrate this application, the right side of Figure 2 shows a database entry that was automatically extracted from the marketsoft.com Web site. The extracted data shown on the right includes the street, city, state and zip code of the corporate headquarters, contact information and the general category of business indicated by the company's SIC code. It also includes names and titles of officers, directors and other personnel mentioned on the Web site, as well as addresses of offices beyond the corporate headquarters and the names of other companies mentioned on the marketsoft.com site. The left side of this figure shows a page from the marketsoft.com Web site. Color highlights added by the extraction system indicate the specific fields of information that have been extracted from the Web page. In fact, the database description shown here was built by automatically examining the entire marketsoft.com Web site, then distilling information that was originally scattered over many pages into a single, concise, structured data record. The fields of information shown in this case are only a subset of those that can be extracted. Other extractable information includes products, customers, partners, financial information, e-commerce offerings, accepted credit cards and more.
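To give a flavor of what extracting a single structured field from raw page text involves, the toy sketch below pulls a US-style "City, ST 12345" address out of page text with one regular expression. This is only an illustration: the page text and pattern are invented, and the actual iDonovan extractors are learned, not hand-written rules.

```python
import re

# A minimal, hand-written stand-in for one learned extractor: find the
# first "City, ST 12345" pattern in raw page text. Real pages need
# trained models, not a single regex; this only illustrates the output shape.
ADDRESS_RE = re.compile(
    r"(?P<city>[A-Z][a-zA-Z ]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})"
)

def extract_address(text):
    """Return the first city/state/zip found in raw page text, or None."""
    m = ADDRESS_RE.search(text)
    return m.groupdict() if m else None

# Hypothetical page text invented for this sketch.
page = "Contact us at 3 Speen Street, Framingham, MA 01701 for details."
record = extract_address(page)
```

In a real pipeline each such field extractor would feed one column of the structured record shown on the right side of Figure 2.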
Because the information extraction algorithms are based on machine learning methods, the set of targeted information fields can easily be extended to new uses.
7. Learning about Continuing Education Courses

A third, theoretical example involves applying the iDonovan Extraction Framework to create a huge database of continuing education courses for the US Department of Labor. In this case, the software would crawl the Web to identify continuing education courses and seminars from universities, colleges and other providers across the United States. iDonovan would regularly refresh the database to ensure it is updated as new information appears on the Web. Figure 3 illustrates a small portion of the database of extracted courses (right), along with a Web page from which one of the courses was extracted (left). Again, the color-coding on the Web page has been added by the extraction system to indicate the targeted fields of information, which include course title, cost and location.

Figure 3. Extracting Education Course Attributes

8. Moving Beyond Building Databases

To build databases such as these, iDonovan will develop software that automatically crawls the Internet and intranets, automatically classifying documents of all kinds and extracting simple entities and complex data records. The components of this technology have important additional uses that go beyond building large databases, including:

• Change Notification alerts users to significant changes in the content of a document set or collection of Web sites. Changes in the text are first detected, then analyzed to determine the nature of the change (e.g., does the change reflect an officer leaving the company or
the removal of a former strategic partner?). This allows users to receive notification of changes based on specific criteria.

• Document classification organizes a collection of documents or Web sites by automatically classifying each document according to user-defined categories. The same software that is used to determine which Web pages contain specific information can be retrained to classify Web pages, news articles, e-mails, or other text documents. So, for example, you can catalog text entries in a database according to a user-provided topic hierarchy.

• Automatic form filling captures the meanings of fields within HTML forms, so that these forms can be automatically filled out. The same technology we use to extract fields of information can be used to interpret the meaning of individual fields in HTML forms. This capability is useful in supporting automated e-wallet applications.

• Support for manual information collection processes allows companies whose business involves collecting information manually from text databases, intranets, or the Web to improve the quality and timeliness of this information while reducing costs. For example, if human teams are used to locate and manually collect information from text sources, productivity can be improved by using iDonovan classification and alerting software to help locate and detect updates in the information. The extraction software also enhances data-entry processes.

9. How iDonovan Technology Extracts Information

The process of building a database of information from the Web involves four steps:

• Crawling
• Classifying
• Extracting
• Compiling information into the database

These steps are illustrated schematically in Figures 4a through 4d. For example, in order to develop the jobs database mentioned above, the software system searches thousands of corporate Web sites for jobs, then extracts the desired data elements for each.
Beginning at the home page of each new Web site, the system first crawls forward following hyperlinks out of the home page, giving preference to hyperlinks whose descriptions suggest they lead to pages with user-specified information, in this case jobs. This is illustrated in Figure 4a.
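The crawl-ordering idea above can be sketched as a priority queue over outgoing links, scored by their anchor text. The keyword list and scoring here are invented stand-ins: the plan describes learned link preferences, not a fixed vocabulary.

```python
import heapq

# Sketch of focused-crawl link prioritization: crawl links whose anchor
# text looks job-related first. A fixed keyword set is an assumption for
# this sketch; the described system learns these preferences from data.
JOB_WORDS = {"jobs", "careers", "employment", "openings", "join"}

def link_score(anchor_text):
    """Score a link by keyword overlap with its anchor text."""
    return len(set(anchor_text.lower().split()) & JOB_WORDS)

def prioritize(links):
    """Order (anchor_text, url) pairs so job-like links are visited first."""
    heap = [(-link_score(text), i, url) for i, (text, url) in enumerate(links)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

# Hypothetical link frontier from a corporate home page.
frontier = [("About Us", "/about"), ("Careers", "/careers"), ("Press", "/press")]
order = prioritize(frontier)
```

The negative score plus insertion index makes the heap pop the highest-scoring link first while keeping original order among ties.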
Each time it follows a link, the software automatically classifies whether or not the resulting Web page contains the target information (e.g., job descriptions), as shown in Figure 4b. Once the software classifies a page as one that contains the target information, it then extracts the specific data elements (e.g., job title, location, etc. for each job present on the page), as shown in Figure 4c. Finally, the software compiles the extracted data elements into a new database record that describes the job and merges the record into the growing database, as shown in Figure 4d. As part of the compilation process, the system checks whether or not the extracted database record is an elaboration of an existing record and appropriately merges the information into the database. The final result is a highly structured database that reflects factual information that was previously trapped within the text and scattered across hundreds of thousands of Web pages. Each factual entry in the new database points back to the Web page or other text source from which the information was extracted, allowing users to explore the relevant text and the system to routinely check for updates to the text sources that justify the extracted fact. This process of crawling, classifying, extracting and finally compiling the information into a structured database has been successfully applied to construct a variety of databases, including the examples described earlier.
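The compile step's "elaboration of an existing record" check can be sketched as a keyed merge. Treating (employer, title) as the record key is an assumption made for this sketch; the plan does not specify how duplicate records are identified.

```python
# Sketch of the compile step: merge an extracted record into the growing
# database. The (employer, title) key is an assumption for illustration;
# the real system's duplicate-detection criteria are not specified here.
def merge_record(database, record):
    key = (record["employer"], record["title"])
    if key in database:
        # An elaboration of an existing record: add any new fields,
        # keeping values already on file.
        for field, value in record.items():
            database[key].setdefault(field, value)
    else:
        database[key] = dict(record)
    return database

db = {}
merge_record(db, {"employer": "FoodScience", "title": "Chemist", "location": "VT"})
merge_record(db, {"employer": "FoodScience", "title": "Chemist", "source": "/jobs.html"})
```

After both calls, the database holds a single record combining the fields from each extraction pass.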
10. Machine Learning for Information Extraction and Classification

At the heart of this four-step process for information extraction lies a collection of software based on proprietary, patent-pending, statistically based machine-learning algorithms. The approach is called machine learning because you can literally "train" the software to find and extract desired data elements. For example, in the case of building the database of job postings for iDonovan.com, the software must be trained to locate job descriptions and extract data such as job title, description and location. In the case of corporate information, the same software must be retrained to find and extract data elements such as company name, headquarters and address.

To illustrate the use of machine learning, consider the task of automatically classifying Web pages or other documents into a predefined set of categories. Although classification occurs as the second step in the database-building process, document classification can also be useful as a means for automatically organizing a large collection of documents by topic. For example, consider the problem of classifying Web pages into two groups: the first contains job descriptions while the second does not. The software learns how to classify pages by examining numerous positive training examples (i.e., Web pages that do contain jobs) and numerous negative training examples (i.e., Web pages that do not). Once trained, it then classifies new Web pages by determining whether the features of a new Web page are more similar to the positive or the negative examples.
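Since the plan names naive Bayes among the base algorithms, here is a toy naive Bayes page classifier trained on positive and negative word lists. The training pages and vocabulary are invented for this sketch, and class priors are omitted for brevity.

```python
import math
from collections import Counter

# Toy naive Bayes over bag-of-words features: positive examples are
# job-description pages, negative examples are not. The word lists are
# invented for this sketch; priors are omitted since classes are balanced.
def train(docs):
    """docs: list of (word_list, label) pairs. Returns per-class word counts."""
    counts = {"pos": Counter(), "neg": Counter()}
    for words, label in docs:
        counts[label].update(words)
    return counts

def classify(counts, words):
    vocab = len(set(counts["pos"]) | set(counts["neg"]))
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        # Laplace-smoothed log-likelihood of the words under each class.
        scores[label] = sum(math.log((c[w] + 1) / (total + vocab)) for w in words)
    return max(scores, key=scores.get)

training = [
    (["apply", "salary", "position", "resume"], "pos"),
    (["opening", "position", "apply"], "pos"),
    (["about", "history", "mission"], "neg"),
    (["press", "release", "news"], "neg"),
]
counts = train(training)
label = classify(counts, ["position", "apply", "today"])
```

The unseen word "today" simply contributes equal smoothed mass to both classes, so the seen words decide the outcome.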
The "similarities" learned by the software from the training examples are based on thousands of distinct features of each Web page, including individual words that occur on the page, sequences of words that are recognized as certain types of entities (e.g., job titles), HTML formatting and markup, lexicons and other linguistic features, the page title and URL, hyperlinks that point to the Web page from other pages and the text on these hyperlinks, and even the location of the page within the overall Web site. Machine-learning software uses advanced statistical algorithms to determine which combinations of these thousands of features are strongly predictive of each class of Web pages for which it is trained. iDonovan will have pending patents on a variety of advanced machine learning algorithms designed for this purpose. Our methods augment existing algorithms, such as naive Bayes and nearest-neighbor classifiers, with a variety of patent-pending techniques that deliver more performance with greater efficiency. These include using unlabeled training data to augment labeled data, dynamically learning site-specific regularities and taking advantage of a classification taxonomy to boost accuracy. Once the software has learned which combinations of features are predictive of each class of document, it can classify new Web pages by examining each document's features and detecting which pages exhibit the patterns associated with each class.

A similar machine learning approach is used to train the software for the extraction task. Consider training the system to extract personal names from text. In this case, the positive training examples consist of individual names that occur in documents and have been tagged as positive examples by a human trainer. Similarly, other word sequences that are not tagged as personal names are automatically treated as negative examples.
As in the problem of document classification, the machine learning software learns the patterns that are highly predictive of personal names and that differentiate positive from negative training examples. Here, the set of features considered is somewhat different from those for document classification and includes the sequence of words immediately preceding and following the personal name, as well as complex linguistic and formatting features of these word sequences. Once trained with these hand-labeled training examples, the software recognizes similarities between untagged data on a page it has never seen and the tagged data from the pages on which it was trained. It then uses these recognized similarities to tag and extract new personal names, or any field of information it has been trained to extract.
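A heavily simplified sketch of this context-driven extraction: learn which immediately preceding words predict a tagged personal name, then flag capitalized candidates after those cues in new text. The training sentences and cue words are invented, and a real system would combine many more features and tag full name spans rather than single tokens.

```python
from collections import Counter

# Sketch of learning one context feature for name extraction: which word
# immediately precedes a hand-tagged name. Sentences are invented; a real
# extractor uses preceding AND following context plus linguistic features.
def learn_cues(tagged_sentences):
    """tagged_sentences: lists of (word, is_name) pairs."""
    cues = Counter()
    for sent in tagged_sentences:
        for (prev_word, prev_tag), (word, tag) in zip(sent, sent[1:]):
            if tag and not prev_tag:  # first token of a tagged name
                cues[prev_word.lower()] += 1
    return cues

def extract_names(cues, words):
    """Flag a capitalized word as a name when a learned cue precedes it."""
    return [cur for prev, cur in zip(words, words[1:])
            if prev.lower() in cues and cur[0].isupper()]

training = [
    [("CEO", False), ("Jane", True), ("Doe", True), ("said", False)],
    [("president", False), ("John", True), ("Smith", True)],
]
cues = learn_cues(training)
found = extract_names(cues, "Our CEO Maria Lopez will speak".split())
```

Note the sketch only catches the first token of each name; extending the tag across the full span is exactly the kind of refinement the learned system handles.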
11. Achieving High Accuracy With Low Costs

Automated methods for information extraction offer immense efficiency and scale compared to manual methods. They are not perfect, however, and do not "read" text with the same level of understanding as a human. As a result, for difficult applications that require a high level of accuracy, some means of combining automation with selective manual intervention is appropriate. All iDonovan software for document classification and information extraction will automatically emit a statistical confidence level, or probability estimate, each time it performs a classification or extracts a piece of new data. This confidence level is a statistical measure of how confident the system is that the data it has tagged is the data element it was trained to find. If a particular application requires a certain level of accuracy in its extraction of a data element, these confidence levels may be used to efficiently filter out inaccurate data. We call this Confidence Assessment™. As the software extracts data for a particular application, it automatically compares its assessment of accuracy with the accuracy requirement a customer sets for the application. If the results are less accurate than required, the software automatically flags this individual piece of extracted information so that it can be human-verified using special quality assurance software developed by iDonovan. The result is that the desired level of data accuracy is achieved in the most cost-effective fashion, by allowing the software to automatically process the examples for which it has high confidence and flagging lower-confidence information for manual verification.

12. Maintaining Up-To-Date Data

Once a database is developed, it must be regularly updated. In the case of the database for iDonovan.com, for example, job openings change weekly.
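The Confidence Assessment filtering described in section 11 amounts to routing each extraction by its probability estimate against a customer-set threshold. The extractions, confidence values and threshold below are invented for illustration.

```python
# Sketch of Confidence Assessment-style routing: accept high-confidence
# extractions automatically and queue the rest for human verification.
# The threshold and confidence values are invented for this sketch.
def route_extractions(extractions, required_confidence=0.95):
    """extractions: (value, confidence) pairs. Returns (accepted, review)."""
    accepted, review = [], []
    for value, confidence in extractions:
        if confidence >= required_confidence:
            accepted.append(value)
        else:
            review.append(value)
    return accepted, review

results = [
    ("Software Engineer", 0.99),
    ("Sr. Eng., Boston", 0.62),
    ("Chemist", 0.97),
]
accepted, review = route_extractions(results)
```

Raising the required confidence trades more manual verification for higher accepted-data accuracy, which is the cost/accuracy dial the section describes.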
Therefore, the trained software must go back to the 50,000 employer web sites each week to re-extract the entire database. Notice that although the initial build of the database required searching hundreds of millions of web pages, the entire database can now be refreshed with less than one percent of this effort. This efficiency is possible because our software will keep track of the exact web pages on which the job information was found for each of these 50,000 sites, enabling it to avoid re-searching entire sites during the database refresh. Instead, it returns with pinpoint accuracy to the exact locations where job postings have been found in the past. Because all iDonovan software will keep track of the exact text source for each extracted database entry, it can efficiently check the continuing validity of the data as frequently as desired, or even at the moment the data is actually accessed by a user or software application.

13. iDonovan Products and Services

iDonovan will offer best-of-breed software for automated information extraction and for each of the component steps involved. We have combined our advanced statistical learning methods for document classification, entity extraction and record extraction with the ability to crawl a virtually unlimited number of Web sites and pages. The result is a flexible software framework that can be reconfigured to find and extract factual information from the Web, corporate intranets and other document databases, using full automation when the results are sufficiently accurate and automatically identifying and supporting cases where human verification is advisable.

At iDonovan, we develop custom information extraction solutions. Our advanced software automatically finds and extracts factual information from the Web, corporate intranets, extranets and other document databases. The software handles all the serious data crunching, and administration is flexible.
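The targeted refresh described in section 12 can be sketched as revisiting only the pages where facts were previously found and re-extracting when a page has changed. Content hashing and the `fetch` stand-in are assumptions made for this sketch; the plan does not specify the change-detection mechanism.

```python
import hashlib

# Sketch of the targeted weekly refresh: revisit only known source pages
# and flag those whose content changed since the last crawl. Hashing as
# the change check and fetch() as HTTP retrieval are assumptions here.
def refresh(known_pages, fetch):
    """known_pages: {url: last_content_hash}. Returns urls needing re-extraction."""
    changed = []
    for url, old_hash in known_pages.items():
        new_hash = hashlib.sha256(fetch(url).encode()).hexdigest()
        if new_hash != old_hash:
            changed.append(url)
            known_pages[url] = new_hash  # remember the current state
    return changed

pages = {"/jobs.html": hashlib.sha256(b"old listing").hexdigest()}
changed = refresh(pages, lambda url: "new listing")
```

Only changed pages go back through the extract-and-compile steps, which is why the refresh costs a small fraction of the initial site-wide crawl.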
iDonovan personnel will build and train the software to meet your exact needs. This will ensure that you'll get the facts you need to make accurate and informed decisions.