Published on

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Check attribution Lexicon: lookup Classify candidates Sliding window – when candidats not known Boundary model – window+classification in one Finite state machine for complete path Grammers
  • All of these provide a form of API for integration with other code
  • ppt

    1. 1. Towards Web-Scale Information Extraction Eugene Agichtein Mathematics & Computer Science Emory University [email_address] http:// www.mathcs.emory.edu /~eugene/
    2. 2. The Value of Text Data <ul><li>“ Unstructured” text data is the primary form of human-generated information </li></ul><ul><ul><li>Blogs, web pages, news, scientific literature, online reviews, … </li></ul></ul><ul><ul><li>Semi-structured data (database generated): see Prof. Bing Liu’s KDD webinar: http://www.cs.uic.edu/~liub/WCM-Refs.html </li></ul></ul><ul><ul><li>The techniques discussed here are complimentary to structured object extraction methods </li></ul></ul><ul><li>Need to extract structured information to effectively manage, search, and mine the data </li></ul><ul><li>Information Extraction: mature, but active research area </li></ul><ul><ul><li>Intersection of Computational Linguistics, Machine Learning, Data mining, Databases, and Information Retrieval </li></ul></ul><ul><ul><li>Traditional focus on accuracy of extraction </li></ul></ul>
    3. 3. Example: Answering Queries Over Text For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. &quot;We can be open source. We love the concept of shared source,&quot; said Bill Veghte , a Microsoft VP . &quot;That's a super-important shift for us in terms of code access.“ Richard Stallman , founder of the Free Software Foundation , countered saying… Name Title Organization Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman Founder Free Soft.. PEOPLE Select Name From PEOPLE Where Organization = ‘Microsoft’ Bill Gates Bill Veghte (from William Cohen’s IE tutorial, 2003)
    4. 4. Outline <ul><li>Information Extraction Tasks </li></ul><ul><ul><li>Entity tagging </li></ul></ul><ul><ul><li>Relation extraction </li></ul></ul><ul><ul><li>Event extraction </li></ul></ul><ul><li>Scaling up Information Extraction </li></ul><ul><ul><li>Focus on scaling up to large collections (where data mining can be most beneficial) </li></ul></ul><ul><ul><li>Other dimensions of scalability </li></ul></ul>
    5. 5. Information Extraction Tasks <ul><li>Extracting entities and relations: this talk </li></ul><ul><ul><li>Entities : named (e.g., Person) and generic (e.g., disease name) </li></ul></ul><ul><ul><li>Relations : entities related in a predefined way (e.g., Location of a Disease outbreak, or a CEO of a Company) </li></ul></ul><ul><ul><li>Events : can be composed from multiple relation tuples </li></ul></ul><ul><li>Common extraction subtasks: </li></ul><ul><ul><li>Preprocess : sentence chunking, syntactic parsing, morphological analysis </li></ul></ul><ul><ul><li>Create rules or extraction patterns : hand-coded, machine learning, and hybrid </li></ul></ul><ul><ul><li>Apply extraction patterns or rules to extract new information </li></ul></ul><ul><ul><li>Postprocess and integrate information </li></ul></ul><ul><ul><ul><li>Co-reference resolution, deduplication, disambiguation </li></ul></ul></ul>
    6. 6. Entity Tagging <ul><li>Identifying mentions of entities (e.g., person names, locations, companies) in text </li></ul><ul><ul><li>MUC (1997): Person, Location, Organization, Date/Time/Currency </li></ul></ul><ul><ul><li>ACE (2005): more than 100 more specific types </li></ul></ul><ul><li>Hand-coded vs. Machine Learning approaches </li></ul><ul><li>Best approach depends on entity type and domain: </li></ul><ul><ul><li>Closed class (e.g., geographical locations, disease names, gene & protein names): hand coded + dictionaries </li></ul></ul><ul><ul><li>Syntactic (e.g., phone numbers, zip codes): regular expressions </li></ul></ul><ul><ul><li>Semantic (e.g., person and company names): mixture of context, syntactic features, dictionaries, heuristics, etc. </li></ul></ul><ul><ul><li>“ Almost solved” for common/typical entity types </li></ul></ul>
    7. 7. Example: Extracting Entities from Text <ul><ul><li>Useful for data warehousing, data cleaning, web data integration </li></ul></ul>1 4089 Whispering Pines Nobel Drive San Diego CA 92122 Address Citation Ronald Fagin, Combining Fuzzy Information from Multiple Systems , Proc. of ACM SIGMOD , 2002 House number Building Road City Zip State Year 2002 S 4 Conference Proc. of ACM SIGMOD S 3 Title Combining Fuzzy Information from Multiple Systems S 2 Author Ronald Fagin S 1 Label( s i ) Sequence Segment( s i )
    8. 8. Hand-Coded Methods <ul><li>Easy to construct in some cases </li></ul><ul><ul><li>e.g., to recognize prices, phone numbers, zip codes, conference names, etc. </li></ul></ul><ul><li>Intuitive to debug and maintain </li></ul><ul><ul><li>Especially if written in a “high-level” language: </li></ul></ul><ul><ul><li>Can incorporate domain knowledge </li></ul></ul><ul><li>Scalability issues: </li></ul><ul><ul><li>Labor-intensive to create </li></ul></ul><ul><ul><li>Highly domain-specific </li></ul></ul><ul><ul><li>Often corpus-specific </li></ul></ul><ul><ul><li>Rule-matches can be expensive </li></ul></ul>[IBM Avatar] ContactPattern  RegularExpression(Email.body,”can be reached at”)
    9. 9. Machine Learning Methods <ul><li>Can work well when lots of training data easy to construct </li></ul><ul><li>Can capture complex patterns that are hard to encode with hand-crafted rules </li></ul><ul><ul><li>e.g., determine whether a review is positive or negative </li></ul></ul><ul><ul><li>extract long complex gene names </li></ul></ul><ul><ul><li>Non-local dependencies </li></ul></ul>The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300.“ [From AliBaba]
    10. 10. Representation Models [Cohen and McCallum, 2003] Any of these models can be used to capture words, formatting or both. Lexicons Alabama Alaska … Wisconsin Wyoming Sliding Window Classify Pre-segmented Candidates Finite State Machines Context Free Grammars Boundary Models Abraham Lincoln was born in Kentucky. member? Abraham Lincoln was born in Kentucky. Abraham Lincoln was born in Kentucky . Classifier which class? … and beyond Abraham Lincoln was born in Kentucky. Classifier which class? Try alternate window sizes: Classifier which class? BEGIN END BEGIN END BEGIN Abraham Lincoln was born in Kentucky. Most likely state sequence? Abraham Lincoln was born in Kentucky. NNP V P NP V NNP NP PP VP VP S Most likely parse?
    11. 11. Popular Machine Learning Methods <ul><li>Naive Bayes </li></ul><ul><li>SRV [Freitag 1998], Inductive Logic Programming </li></ul><ul><li>Rapier [Califf and Mooney 1997] </li></ul><ul><li>Hidden Markov Models [Leek 1997] </li></ul><ul><li>Maximum Entropy Markov Models [McCallum et al. 2000] </li></ul><ul><li>Conditional Random Fields [Lafferty et al. 2001] </li></ul><ul><li>Scalability </li></ul><ul><ul><li>Can be labor intensive to construct training data </li></ul></ul><ul><ul><li>At run time, complex features can be expensive to construct or process (batch algorithms can help: [ Chandel et al. 2006] ) </li></ul></ul>For details: [Feldman, 2006 and Cohen, 2004]
    12. 12. Some Available Entity Taggers <ul><li>ABNER: </li></ul><ul><ul><li>http://www.cs.wisc.edu/~bsettles/abner/ </li></ul></ul><ul><ul><li>Linear-chain conditional random fields (CRFs) with orthographic and contextual features. </li></ul></ul><ul><li>Alias-I LingPipe </li></ul><ul><ul><li>http://www.alias-i.com/lingpipe/ </li></ul></ul><ul><li>MALLET: </li></ul><ul><ul><li>http://mallet.cs.umass.edu/index.php/Main_Page </li></ul></ul><ul><ul><li>Collection of NLP and ML tools, can be trained for name entity tagging </li></ul></ul><ul><li>MinorThird : </li></ul><ul><ul><li>http://minorthird.sourceforge.net/ </li></ul></ul><ul><ul><li>Tools for learning to extract entities, categorization, and some visualization </li></ul></ul><ul><li>Stanford Named Entity Recognizer: </li></ul><ul><ul><li>http://nlp.stanford.edu/software/CRF-NER.shtml </li></ul></ul><ul><ul><li>CRF-based entity tagger with non-local features </li></ul></ul>
    13. 13. Alias-I LingPipe ( http://www.alias-i.com/lingpipe/ ) <ul><li>Statistical named entity tagger </li></ul><ul><ul><li>Generative statistical model </li></ul></ul><ul><ul><ul><li>Find most likely tags given lexical and linguistic features </li></ul></ul></ul><ul><ul><ul><li>Accuracy at (or near) state of the art on benchmark tasks </li></ul></ul></ul><ul><li>Explicitly targets scalability: </li></ul><ul><ul><li>~100K tokens/second runtime on single PC </li></ul></ul><ul><ul><li>Pipelined extraction of entities </li></ul></ul><ul><ul><li>User-defined mentions, pronouns and stop list </li></ul></ul><ul><ul><ul><li>Specified in a dictionary, left-to-right, longest match </li></ul></ul></ul><ul><ul><li>Can be trained/bootstrapped on annotated corpora </li></ul></ul>
    14. 14. Outline <ul><li>Overview of Information Extraction </li></ul><ul><ul><li>Entity tagging </li></ul></ul><ul><ul><li>Relation extraction </li></ul></ul><ul><ul><li>Event extraction </li></ul></ul><ul><li>Scaling up Information Extraction </li></ul><ul><ul><li>Focus on scaling up to large collections (where data mining and ML techniques shine) </li></ul></ul><ul><ul><li>Other dimensions of scalability </li></ul></ul>
    15. 15. Relation Extraction Examples <ul><li>Extract tuples of entities that are related in predefined way </li></ul>Disease Outbreaks relation Relation Extraction „ We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.“ [From AliBaba] Zaire Ebola May 1995 U.S. Pneumonia Feb. 1995 July 1995 Jan. 1995 Date Location Disease Name U.K. Mad Cow Disease Ethiopia Malaria May 19 1995 , Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis… CBF-A CBF-C CBF-B CBF-A-CBF-C complex interact complex associates
    16. 16. Relation Extraction Approaches <ul><li>Knowledge engineering </li></ul><ul><li>Experts develop rules, patterns: </li></ul><ul><ul><li>Can be defined over lexical items: “<company> located in <location>” </li></ul></ul><ul><ul><li>Over syntactic structures: “((Obj <company>) (Verb located) (*) (Subj <location>))” </li></ul></ul><ul><li>Sophisticated development/debugging environments: </li></ul><ul><ul><li>Proteus, GATE </li></ul></ul><ul><li>Machine learning </li></ul><ul><li>Supervised : Train system over manually labeled data </li></ul><ul><ul><li>Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996, Roth et al 2005, Cardie et al 2006, Mooney et al. 2005, … </li></ul></ul><ul><li>Partially-supervised : train system by bootstrapping from “seed” examples: </li></ul><ul><ul><li>Agichtein & Gravano 2000, Etzioni et al., 2004, Yangarber & Grishman 2001, … </li></ul></ul><ul><ul><li>“ Open” ( no seeds ): Sekine et al. 2006, Cafarella et al. 2007, Banko et al. 2007 </li></ul></ul><ul><li>Hybrid or interactive systems: </li></ul><ul><ul><li>Experts interact with machine learning algorithms (e.g., active learning family) to iteratively refine/extend rules and patterns </li></ul></ul><ul><ul><li>Interactions can involve annotating examples, modifying rules, or any combination </li></ul></ul>
    17. 17. Open Information Extraction [Banko et al., IJCAI 2007] <ul><li>Self-Supervised Learner: </li></ul><ul><ul><li>All triples in a sample corpus ( e 1 , r, e 2 ) are considered potential “tuples” for relation r </li></ul></ul><ul><ul><li>Positive examples: candidate triplets generated by a dependency parser </li></ul></ul><ul><ul><li>Train classifier on lexical features for positive and negative examples </li></ul></ul><ul><li>Single-Pass Extractor: </li></ul><ul><ul><li>Classify all pairs of candidate entities for some (undetermined) relation </li></ul></ul><ul><ul><li>Heuristically generate a relation name from the words between entities </li></ul></ul><ul><li>Redundancy-Based Assessor: </li></ul><ul><ul><li>Estimate probability that entities are related from co-occurrence statistics </li></ul></ul><ul><li>Scalability </li></ul><ul><ul><li>Extraction/Indexing </li></ul></ul><ul><ul><ul><li>No tuning or domain knowledge during extraction, relation inclusion determined at query time </li></ul></ul></ul><ul><ul><ul><li>0.04 CPU seconds pre sentence, 9M web page corpus in 68 CPU hours </li></ul></ul></ul><ul><ul><ul><li>Every document retrieved, processed (parsed, indexed, classified) in a single pass </li></ul></ul></ul><ul><ul><li>Query-time </li></ul></ul><ul><ul><ul><li>Distributed index for tuples by hashing on the relation name text </li></ul></ul></ul><ul><li>Related efforts : [Cucerzan and Agichtein 2005], [Pasca et al. 2006], [Sekine et al. 2006], [Rozenfeld and Feldman 2006], … </li></ul>
    18. 18. Event Extraction <ul><li>Similar to Relation Extraction, but: </li></ul><ul><ul><li>Events can be nested </li></ul></ul><ul><ul><li>Significantly more complex (e.g., more slots) than relations/template elements </li></ul></ul><ul><ul><li>Often requires coreference resolution, disambiguation , deduplication , and inference </li></ul></ul><ul><li>Example: an integrated disease outbreak event [Hatunnen et al. 2002] </li></ul>
    19. 19. Event Extraction: Integration Challenges <ul><li>Information spans multiple documents </li></ul><ul><ul><li>Missing or incorrect values </li></ul></ul><ul><ul><li>Combining simple tuples into complex events </li></ul></ul><ul><ul><li>No single key to order or cluster likely duplicates while separating them from similar but different entities. </li></ul></ul><ul><ul><li>Ambiguity: distinct physical entities with same name (e.g., Kennedy) </li></ul></ul><ul><li>Duplicate entities, relation tuples extracted </li></ul><ul><ul><li>Large lists with multiple noisy mentions of the same entity/tuple </li></ul></ul><ul><ul><li>Need to depend on fuzzy and expensive string similarity functions </li></ul></ul><ul><ul><li>Cannot afford to compare each mention with every other. </li></ul></ul><ul><li>See Part II of KDD 2006 Tutorial “Scalable Information Extraction and Integration” -- scaling up integration: http://www.scalability-tutorial.net / </li></ul>
    20. 20. Other Information Extraction Tutorials <ul><li>See these tutorials for more details: </li></ul><ul><ul><li>R. Feldman, Information Extraction – Theory and Practice, ICML 2006 http://www.cs.biu.ac.il/~feldman/icml_tutorial.html </li></ul></ul><ul><ul><li>W. Cohen, A. McCallum, Information Extraction and Integration: an Overview, KDD 2003 http:// www.cs.cmu.edu/~wcohen/ie-survey.ppt </li></ul></ul>
    21. 21. Summary: Accuracy of Extraction Tasks <ul><li>Errors cascade (errors in entity tag cause errors in relation extraction) </li></ul><ul><li>This estimate is optimistic: </li></ul><ul><ul><li>Primarily for well-established (tuned) tasks </li></ul></ul><ul><ul><li>Many specialized or novel IE tasks (e.g. bio- and medical- domains) exhibit lower accuracy </li></ul></ul><ul><ul><li>Accuracy for all tasks is significantly lower for non-English </li></ul></ul>[Feldman, ICML 2006 tutorial]
    22. 22. Multilingual Information Extraction <ul><li>Active research area, beyond the scope of this talk. Nevertheless, a few (incomplete) pointers are provided. </li></ul><ul><li>Closely tied to machine translation and cross-language information retrieval efforts. </li></ul><ul><li>Language-independent named entity tagging and related tasks at CoNLL: </li></ul><ul><ul><li>2006: multi-lingual dependency parsing ( http://nextens.uvt.nl/~conll/ ) </li></ul></ul><ul><ul><li>2002, 2003 shared tasks: language independent Named Entity Tagging ( http://www.cnts.ua.ac.be/conll2003/ner/ ) </li></ul></ul><ul><li>Global Autonomous Language Exploitation program (GALE): </li></ul><ul><ul><li>http://www.darpa.mil/ipto/Programs/gale/concept.htm </li></ul></ul><ul><li>Interlingual Annotation of Multilingual Text Corpora (IAMTC) </li></ul><ul><ul><li>Tools and data for building MT and IE systems for six languages </li></ul></ul><ul><ul><li>http://aitc.aitcnet.org/nsf/iamtc/index.html </li></ul></ul><ul><li>REFLEX project: NER for 50 languages </li></ul><ul><ul><li>Exploit for training temporal correlations in weekly aligned corpora </li></ul></ul><ul><ul><li>http://l2r.cs.uiuc.edu/~cogcomp/wpt.php?pr_key=REFLEX </li></ul></ul><ul><li>Cross-Language Information Retrieval (CLEF) </li></ul><ul><ul><li>http://www.clef-campaign.org/ </li></ul></ul>
    23. 23. Outline <ul><li>Overview of Information Extraction </li></ul><ul><ul><li>Entity tagging </li></ul></ul><ul><ul><li>Relation extraction </li></ul></ul><ul><ul><li>Event Extraction </li></ul></ul><ul><li>Scaling up Information Extraction </li></ul><ul><ul><li>Focus on scaling up to large collections (where data mining and ML techniques shine) </li></ul></ul><ul><ul><li>Other dimensions of scalability </li></ul></ul>
    24. 24. Scaling Information Extraction to the Web <ul><li>Dimensions of Scalability </li></ul><ul><ul><li>Corpus size : </li></ul></ul><ul><ul><ul><li>Applying rules/patterns is expensive </li></ul></ul></ul><ul><ul><ul><li>Need efficient ways to select/filter relevant documents </li></ul></ul></ul><ul><ul><li>Document accessibility : </li></ul></ul><ul><ul><ul><li>Deep web: documents only accessible via a search interface </li></ul></ul></ul><ul><ul><ul><li>Dynamic sources: documents disappear from top page </li></ul></ul></ul><ul><ul><li>Source heterogeneity : </li></ul></ul><ul><ul><ul><li>Coding/learning patterns for each source is expensive </li></ul></ul></ul><ul><ul><ul><li>Requires many rules (expensive to apply) </li></ul></ul></ul><ul><ul><li>Domain diversity: </li></ul></ul><ul><ul><ul><li>Extracting information for any domain, entities, relationships </li></ul></ul></ul><ul><ul><ul><li>Some recent progress (e.g., see slide 17) </li></ul></ul></ul><ul><ul><ul><li>Not the focus of this talk </li></ul></ul></ul>
    25. 25. Scaling Up Information Extraction <ul><li>Scan-based extraction </li></ul><ul><ul><li>Classification/filtering to avoid processing documents </li></ul></ul><ul><ul><li>Sharing common tags/annotations </li></ul></ul><ul><li>General keyword index -based techniques </li></ul><ul><ul><li>QXtract, KnowItAll </li></ul></ul><ul><li>Specialized indexes </li></ul><ul><ul><li>BE/KnowItNow, Linguist’s Search Engine </li></ul></ul><ul><li>Parallelization/distributed processing </li></ul><ul><ul><li>IBM WebFountain, UIMA, Google’s Map/Reduce </li></ul></ul>
    26. 26. Efficient Scanning for Information Extraction <ul><li>80/20 rule: use few simple rules to capture majority of the instances [Pantel et al. 2004] </li></ul><ul><li>Train a classifier to discard irrelevant documents without processing [Grishman et al. 2002] </li></ul><ul><ul><li>( e.g., the Sports section of NYT is unlikely to describe disease outbreaks) </li></ul></ul><ul><li>Share base annotations (entity tags) for multiple extraction tasks </li></ul>Extraction System Text Database <ul><li>Extract output tuples </li></ul><ul><li>Process documents </li></ul><ul><li>Retrieve docs from database </li></ul>filtered … Output Tuples Classifier <ul><li>Filter documents </li></ul>
    27. 27. Exploiting Keyword and Phrase Indexes <ul><li>Generate queries to retrieve only relevant documents </li></ul><ul><li>Data mining problem! </li></ul><ul><li>Some methods in literature: </li></ul><ul><ul><li>Traversing Query Graphs [Agichtein et al. 2003] </li></ul></ul><ul><ul><li>Iteratively refine queries [Agichtein and Gravano 2003] </li></ul></ul><ul><ul><li>Iteratively partition document space [Etzioni et al., 2004] </li></ul></ul><ul><li>Case studies: QXtract , KnowItAll </li></ul>
    28. 28. Simple Strategy: Iterative Set Expansion <ul><li>Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q </li></ul>Extraction System Text Database <ul><li>Extract tuples from docs </li></ul><ul><li>Process retrieved documents </li></ul><ul><li>Query database with seed tuples </li></ul>Time for retrieving a document Time for answering a query Time for processing a document Query Generation <ul><li>Augment seed tuples with new tuples </li></ul>(e.g., [ Ebola AND Zaire ]) (e.g., <Malaria, Ethiopia>) … Output Tuples
    29. 29. Reachability via Querying t 1 retrieves document d 1 that contains t 2 t 1 t 2 t 3 t 4 t 5 Upper recall limit : determined by the size of the biggest connected component Reachability Graph Tuples Documents t 1 t 2 t 3 t 4 t 5 d 1 d 2 d 3 d 4 d 5 <SARS, China> <Ebola, Zaire> <Malaria, Ethiopia> <Cholera, Sudan> <H5N1, Vietnam> [Agichtein et al. 2003b]
    30. 30. Reachability Graph for DiseaseOutbreaks DiseaseOutbreaks, New York Times 1995
    31. 31. Getting Around Reachability Limits <ul><li>KnowItAll [Etzioni et al., WWW 2004]: </li></ul><ul><ul><li>Add keywords to partition documents into retrievable disjoint sets </li></ul></ul><ul><ul><li>Submit queries with parts of extracted instances </li></ul></ul><ul><li>QXtract [Agichtein and Gravano 2003a] </li></ul><ul><ul><li>General queries with many matching documents </li></ul></ul><ul><ul><li>Assumes many documents retrievable per query </li></ul></ul>
    32. 32. QXtract [Agichtein and Gravano 2003] <ul><li>Get document sample with “likely negative” and “likely positive” examples. </li></ul><ul><li>Label sample documents using information extraction system as “oracle.” </li></ul><ul><li>Train classifiers to “recognize” useful documents. </li></ul><ul><li>Generate queries from classifier model/rules. </li></ul>Query Generation Information Extraction Seed Sampling Classifier Training Queries User-Provided Seed Tuples
    33. 33. Using Generic Indexes: Summary <ul><li>Order of magnitude speed up in runtime </li></ul><ul><li>But: keyword indexes are approximate (so the queries are not precise) </li></ul><ul><li>Require many documents to retrieve </li></ul><ul><li>Can we do better? </li></ul>
    34. 34. Index Structures for Information Extraction <ul><li>Bindings Engine [Cafarella and Etzioni 2005] </li></ul><ul><li>Indexing and querying entities: [K. Chakrabarti et al. 2006 ] </li></ul><ul><li>IBM Avatar project: </li></ul><ul><ul><li>http:// www.almaden.ibm.com/cs/projects/avatar / </li></ul></ul><ul><li>Other indexing schemes: </li></ul><ul><ul><li>Linguist’s search engine (P. Resnik): http://lse.umiacs.umd.edu:8080/ </li></ul></ul><ul><ul><li>FREE: Indexing regular expressions: [Cho and Rajagolapan, ICDE 2002] </li></ul></ul><ul><ul><li>Indexing and querying linguistic information in XML: [Bird et al., 2006] </li></ul></ul>
    35. 35. <ul><li>Variabilized search query language </li></ul><ul><li>Integrates variable/type data with inverted index, minimizing query seeks </li></ul><ul><ul><li>Index <NounPhrase>, <Adj-Term> terms </li></ul></ul><ul><li>Key idea: neighbor index </li></ul><ul><ul><li>At each position in the index, store “neighbor text” – both lexemes and tags </li></ul></ul><ul><li>Query: “ cities such as <NounPhrase> ” </li></ul>Bindings Engine (BE) [Cafarella and Etzioni 2005] neighbor 1 str 1 #docs … pos 0 pos 1 docid #docs-1 pos #docs-1 #posns pos 0 pos 1 … pos #pos-1 docid 0 docid 1 #posns pos 0 neighbor 0 pos 1 neighbor 1 … pos #pos-1 … #neighbors blk_offset 19 12 Result: in document 19: “I love cities such as Philadelphia .” neighbor 0 str 0 words such philadelphia nickels mayors give friendly cities billy as NP right Philadelphia 3 <offset> AdjT left such
    36. 36. Related Approach: [ K. Chakrabarti et al. 2006] <ul><li>Support “relationship” keyword queries over indexed entities </li></ul><ul><li>Top-K support for early processing termination </li></ul>
    37. 37. Workload-Driven Indexing [ S. Chakrabarti et al. 2006] Indexing Thousands of Entity Types
    38. 38. Workload-Driven Indexing (continued)
    39. 39. Parallelization/Adaptive Processing <ul><li>Parallelize processing: </li></ul><ul><ul><li>WebFountain [Gruhl et al. 2004] </li></ul></ul><ul><ul><li>UIMA architecture </li></ul></ul><ul><ul><li>Map/Reduce </li></ul></ul>
    40. 40. IBM WebFountain <ul><li>Dedicated share-nothing 256-node cluster </li></ul><ul><li>Blackboard annotation architecture </li></ul><ul><li>Data pipelined and streamed past each “augmenter” to add annotations </li></ul><ul><li>Merge and index annotations </li></ul><ul><li>Index both tokens and annotations </li></ul><ul><li>Between 25K-75K entities per second </li></ul>[Gruhl et al. 2004]
    41. 41. UIMA (IBM Research) <ul><li>Unstructured Information Management Architecture (UIMA) </li></ul><ul><ul><li>http:// www.research.ibm.com /UIMA/ </li></ul></ul><ul><li>Open component software architecture for development, composition, and deployment of text processing and analysis components. </li></ul><ul><li>Run-time framework allows to plug in components and applications and run them on different platforms. Supports distributed processing, failure recovery, … </li></ul><ul><li>Scales to millions of documents – incorporated into IBM OmniFind, grid computing-ready </li></ul><ul><li>The UIMA SDK ( freely available ) includes a run-time framework, APIs, and tools for composing and deploying UIMA components. </li></ul><ul><li>Framework source code also available on Sourceforge: </li></ul><ul><ul><li>http://uima-framework.sourceforge.net/ </li></ul></ul>
    42. 42. Map/Reduce ( [ Dean & Ghemawat, OSDI 2004 ])
    43. 43. Map/Reduce (continued) <ul><li>General framework </li></ul><ul><ul><li>Scales to 1000s of machines </li></ul></ul><ul><ul><li>Recently implemented in Nutch and other open source efforts </li></ul></ul><ul><li>“ Maps” nicely to information extraction </li></ul><ul><ul><li>Map phase: </li></ul></ul><ul><ul><ul><li>Parse individual documents </li></ul></ul></ul><ul><ul><ul><li>Tag entities </li></ul></ul></ul><ul><ul><ul><li>Propose candidate relation tuples </li></ul></ul></ul><ul><ul><li>Reduce phase </li></ul></ul><ul><ul><ul><li>Merge multiple mentions of same relation tuple </li></ul></ul></ul><ul><ul><ul><li>Resolve co-references, duplicates </li></ul></ul></ul>
    44. 44. Summary <ul><li>Presented a brief overview of information extraction </li></ul><ul><li>Techniques and ideas to scale up information extraction </li></ul><ul><ul><li>Scan-based techniques (limited impact) </li></ul></ul><ul><ul><li>Exploiting general indexes (limited accuracy) </li></ul></ul><ul><ul><li>Building specialized index structures (most promising) </li></ul></ul><ul><li>Scalability is a data mining problem </li></ul><ul><ul><li>Querying graphs  link discovery </li></ul></ul><ul><ul><li>Workload mining for index optimization </li></ul></ul><ul><ul><li>Automatically optimizing for specific text mining application </li></ul></ul><ul><ul><li>Considerations for building integrated data mining systems </li></ul></ul>
    45. 45. References and Supplemental Materials <ul><li>References, slides, and additional information available at: </li></ul><ul><li>http:// www.mathcs.emory.edu/~eugene/kdd -webinar/ </li></ul><ul><li>Will also post answers to questions </li></ul>