  • Firstly, I’ll introduce the peculiarities of SDM. They are particularly interesting because the practice of geo-referencing them has caused a growing demand for powerful exploratory data analysis techniques that go beyond classical statistical and data mining techniques and, among other things, support the analysis of socio-economic phenomena from a spatial point of view. In this talk I’ll focus on a specific task, the discovery of spatial association rules. For this purpose I’ll present ARES, a system to extract association rules from census data, and illustrate an application of ARES to mine spatial association rules on North West England 1998 census data in order to study the mortality risk in Greater Manchester county.
  • What is IE? As a task it is… Starting with some text… and an empty database with a defined ontology of fields and records, use the information in the text to fill the database.
  • ML… although this is an area where ML has not yet trounced the hand-built systems. In some of the latest evaluations, a hand-built system shared 1st place with an ML system. Many companies are now making a business from IE (from the Web): WhizBang, Inxight, Intelliseek, ClearForest.
  • Data sparseness, robustness
  • CV, i.e. the data set is divided into 5 folds (four are used for training and one for testing, in turn).
  • Initial ILP research dealt with concept learning in the form of predicate definition learning.
  • ATRE is a multiple-concept learning system, which solves the following problem:
  • Since the generation of a clause depends on the chosen seed, several seeds have to be chosen such that at least one seed per incomplete predicate definition is kept. Therefore, the search space is actually a forest of as many search trees as the number of chosen seeds. The example shows the parallel exploration of the forest for the predicates odd and even. Specialization hierarchies are traversed top-down. The search proceeds towards deeper and deeper levels of the specialization hierarchies until at least a user-defined number of consistent clauses is found. A supervisor task decides whether the search should carry on or not on the basis of the results returned by the concurrent tasks. When the search is stopped, the supervisor selects the “best” consistent clause according to the user’s preference criterion. This strategy has the advantage that simpler consistent clauses are found first, independently of the predicates to be learned. First learning step: consistent clauses are shown in red.
  • Second learning step
  • If we guarantee the following two conditions: … then after a finite number of steps a theory $T$, which is complete and consistent, is built. If we denote by $LHM(T_i)$ the least Herbrand model of a theory $T_i$, the stepwise construction of theories entails that $LHM(T_i) \subseteq LHM(T_{i+1})$ for each $i \in \{0, 1, \ldots, n-1\}$, since the addition of a clause to a theory can only augment the LHM.
  • In order to guarantee the first of the two conditions it is possible to proceed as follows. First, a positive example $e^+$ of a predicate $p$ to be learned is selected, such that $e^+$ is not in $LHM(T_i)$. The example $e^+$ is called the seed. Then the space of definite clauses more general than $e^+$ is explored, looking for a clause $C$, if any, such that $neg(LHM(T_i \cup \{C\})) = 0$. In this way we guarantee that the second condition above holds as well. When found, $C$ is added to $T_i$, giving $T_{i+1}$. If some positive examples are not included in $LHM(T_{i+1})$, then a new seed is selected and the process is repeated. The second condition is more difficult to guarantee because of the non-monotonicity property. The approach followed in ATRE to remove the inconsistency due to the addition of a clause to the theory consists of simple syntactic changes in the theory, which eventually create new layers. The layering of a theory introduces a first variation of the classical separate-and-conquer strategy sketched above, since the addition of a locally consistent clause generated in the conquer stage is preceded by a global consistency check.
  • Learning multi-relational patterns from multi-relational data and background knowledge. It allows navigating the relational structure of the data.

Presentation Transcript

  • Learning for Biomedical Information Extraction with ILP. Margherita Berardi, Vincenzo Giuliano, Donato Malerba
  • Outline of the talk
    • IE for Biomedicine
    • Looking around
    • IE problem formulation
      • which representation model on data? which features?
      • which framework for reasoning?
    • Mutual Recursion in IE
    • Text processing & domain knowledge
    • Application to studies on mitochondrial genome
    • Conclusions & Future work
  • What is “Information Extraction”
    As a task: filling slots in a database from sub-segments of text.
    October 14, 2002, 4:00 a.m. PT
    For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
    IE output:
    NAME | TITLE | ORGANIZATION
    Bill Gates | CEO | Microsoft
    Bill Veghte | VP | Microsoft
    Richard Stallman | founder | Free Soft..
  • IE from Biomedical Texts: Motivation
    • Complexity of biological systems:
      • Too many specialized biological tasks
      • Several entities interacting in a single phenomenon
      • Many conditions to simultaneously verify
    • Complexity of biomedical languages:
      • Several nomenclatures, dictionaries, lexica
      • tending to quickly become obsolete
    Genome decoding → increasing amount of published literature. Too much to read!
  • IE History
    • Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96]
    • Most early work dominated by hand-built models
      • E.g. SRI’s FASTUS , hand-built FSMs.
      • But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
      • Wrapper Induction: initially hand-built, then ML [Soderland ’96], [Kushmerick ’97], …
    • Most learning attempts based on statistical approaches
      • Learning of production rules constrained by probability measures (e.g., HMMs, Probabilistic Context-free Grammars)
    • Some recent logic-based approaches
      • Rapier (Califf ’98)
      • SRV (Freitag ’98)
      • INTHELEX (Ferilli et al. ’01)
      • FOIL-based (Aitken ’02)
      • Aleph-based (Goadrich et al. ’04)
  • Learning Language in biomedicine
    • BioCreAtIvE - Critical Assessment for Information Extraction in Biology ( http://biocreative.sourceforge.net/ )
    • BioNLP, Natural language processing of biology text ( http://www.bionlp.org )
    • ACL/COLING Workshops on Natural Language Processing in Biomedicine
    • SIGIR Workshops on Text Analysis for Bioinformatics
    • Special Interest Group in Text Mining since ISMB ’03 (Intelligent Systems for Molecular Biology): BioLINK (Biology Literature, Information and Knowledge)
    • PSB (Pacific Symposium on Biocomputing) tracks
    • Genomic tracks in TREC (Text Retrieval Conference)
    • PASCAL challenges on information extraction http://nlp.shef.ac.uk/pascal/
    • Workshops: IJCAI, ECAI, ECML/PKDD, ICML (Learning Language in Logic since ’99, challenge task on Extracting Relations from Biomedical Texts)
  • Is there “Logic” in language learning?
    • IE systems limitations, in general:
      • Portability (domain-dependent, task-dependent)
      • Scalability (work well on “relevant” data)
    • Statistics-based approaches
      • wide coverage,
      • scalability,
      • no semantics,
      • no domain knowledge
    • Logic-based approaches:
      • natural encoding of natural language statements and queries in first-order logic,
      • human-comprehensible models,
      • domain knowledge
      • refinement of models
    • [R. J. Mooney, Learning for Semantic Interpretation: Scaling Up Without Dumbing Down , ICML Workshop on Language Learning in Logic, 1999]
  • IE problem formulation for HmtDB
    • HmtDB: a resource of variability data associated with clinical phenotypes concerning the human mitochondrial genome
    ( http://www.hmdb.uniba.it/ )
  • Textual Entity Extraction
    • Ex: “Cytoplasts from two unrelated patients with MELAS (mitochondrial myopathy, encephalopathy, lactic acidosis, and strokelike episodes) harboring an A→G transition at nucleotide position 3243 in the tRNALeu(UUR) gene of the mitochondrial genome were fused with human cells lacking endogenous mitochondrial DNA (mtDNA)”
    • pathology associated to the mutation under study,
    • substitution that causes the mutation,
    • type of the mutation,
    • position in the DNA where the mutation occurs,
    • gene correlated to the mutation.
    By modelling the sentence structure: substitution(X) ← follows(Y,X), type(Y)
    Extractors cannot be learned independently!
  • Textual Entity Extraction
    • Each entity is characterized by some slots defining a template
    • The task is to learn rules to fill slots (template filling)
    • Relations in data may allow:
      • intra-template dependencies to be learned
      • context-sensitive application of “extractors”
    Diagram labels: Mutation, Sampled population, DNA sample tissue, DNA screening method, …; Title, Abstract, Introduction, Methods
  • The learning task
    • Classification
      • Each class (slot) is a concept ( target predicate ), each model (template filler) induced for the class is a logical theory explaining the concept ( set of predicate definitions )
      • Predefined models of classification should be provided
      • Importance of domain knowledge and first-order representations
      • Usefulness of mutual recursion (concept dependencies )
    • ILP = Inductive Learning ∩ Logic Programming
    • From IL: inductive reasoning from observations and background knowledge
    • From LP: first-order logic as representation formalism
  • ATRE (Apprendimento di Teorie Ricorsive da Esempi) http://www.di.uniba.it/~malerba/software/atre/
    • Given
    • a set of concepts $C_1, C_2, \ldots, C_r$
    • a set of objects $O$ described in a language $L_O$
    • a background knowledge $BK$ described in a language $L_{BK}$
    • a language of hypotheses $L_H$ that defines the space of hypotheses $S_H$
    • a user’s preference criterion $PC$
    Find a (possibly recursive) logical theory $T$ for the concepts $C_1, C_2, \ldots, C_r$, such that $T$ is complete and consistent with respect to the set of observations and satisfies the preference criterion $PC$.
  • ATRE
    • Main Characteristics
    • Learning problem : induce recursive theories from examples
    • ILP setting : learning from interpretations
    • Observation language : ground multiple-head clauses
    • Hypothesis language : non-ground definite clauses
    • Constraints : linkedness + range-restrictedness
    • Generalization model : generalized implication
    • Search strategy for a recursive theory : separate-and-parallel-conquer
    • Continuous and discrete attributes and relations
    • Background knowledge : intensionally defined
  • … the learning strategy…
    Example: parallel search for the predicates even and odd.
    Seeds: even(0), odd(1).
    Simplest consistent clauses are found first, independently of the predicates to be learned.
  • … the learning strategy…
    Example: parallel search for the predicates even and odd.
    Seeds: even(2), odd(1).
    Clauses in the search trees:
    • even(X) ← succ(Y,X)
    • even(X) ← succ(X,Y)
    • odd(X) ← succ(Y,X)
    • odd(X) ← succ(X,Y)
    • even(X) ← succ(Y,X), succ(Z,Y)
    • odd(X) ← succ(Y,X), even(Y)
    • odd(X) ← succ(Y,X), zero(Y)
    • even(X) ← succ(X,Y), succ(Y,Z)
    A predicate dependency is discovered!
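The dependency between even and odd can be made concrete by computing the least Herbrand model of a small recursive theory by forward chaining. This is only an illustrative sketch: the clause odd(X) ← succ(Y,X), even(Y) appears on the slide, whereas even(X) ← zero(X) and even(X) ← succ(Y,X), odd(Y) are assumed here to complete the theory, and the function name `least_herbrand_model` is hypothetical.

```python
def least_herbrand_model(n_max):
    """Forward-chain a toy even/odd theory over the numbers 0..n_max.

    Assumed theory (only the second clause is taken from the slide):
        even(X) <- zero(X)
        odd(X)  <- succ(Y, X), even(Y)
        even(X) <- succ(Y, X), odd(Y)
    """
    zero = {0}
    succ = {(i, i + 1) for i in range(n_max)}  # succ(Y, X): X = Y + 1
    even, odd = set(), set()
    changed = True
    while changed:
        changed = False
        # Apply every clause once to the facts derived so far.
        new_even = set(zero) | {x for (y, x) in succ if y in odd}
        new_odd = {x for (y, x) in succ if y in even}
        if not new_even <= even or not new_odd <= odd:
            even |= new_even
            odd |= new_odd
            changed = True
    return even, odd


if __name__ == "__main__":
    even, odd = least_herbrand_model(6)
    print(sorted(even))  # [0, 2, 4, 6]
    print(sorted(odd))   # [1, 3, 5]
```

Each pass only adds facts, which mirrors the observation that adding a clause to a theory can only enlarge its least Herbrand model.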
  • Data preparation
    • ATRE’s observation language: multiple-head clauses
    • Enumeration of positive and negative examples (expert users’ manual annotations + unlabelled tokens)
    • Descriptions of examples: which features?
      • Statistical (frequencies)
      • Lexical (alphanumeric, capitalized, …)
      • Syntactical (nouns, verbs, adjectives, …)
      • Domain-specific (dictionaries)
  • Text processing
    • The GATE (A General Architecture for Text Engineering) framework ( http://gate.ac.uk/ )
    • ANNIE is the IE core :
      • Tokeniser
      • Sentence Splitter
      • POS tagger
      • Morphological Analyser
      • Gazetteers
      • Semantic tagger (JAPE transducer)
      • Orthomatcher (orthographic coreference)
    • Some domain specific gazetteers have been added (diseases, enzymes, genes, methods of analysis)
  • Text processing
    • Some regular expressions to capture domain-specific patterns (alphanumeric strings, appositions, etc.)
    • Shallow acronym resolution
    • Screening operations:
    • Some POSs (nouns, verbs, adjectives, numbers, symbols)
    • Punctuation
    • Stopwords (glimpse.cs.arizona.edu)
    • Stemming (Porter)
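A minimal sketch of the screening step, assuming a regular-expression tokenizer and a tiny hand-written stopword list. Both `TOKEN_RE` and `STOPWORDS` are hypothetical stand-ins (the slides use GATE/ANNIE and an external stopword list), and POS filtering, acronym resolution and Porter stemming are deliberately left out.

```python
import re

# Hypothetical, very small stopword list used only for illustration.
STOPWORDS = {"a", "an", "at", "in", "is", "of", "the", "with"}

# Keeps hyphenated and alphanumeric strings such as "A-to-G" or "3243" whole
# and drops punctuation such as parentheses.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def screen(sentence):
    """Tokenize a sentence, then drop punctuation and stopwords."""
    tokens = TOKEN_RE.findall(sentence)
    return [t for t in tokens if t.lower() not in STOPWORDS]


if __name__ == "__main__":
    print(screen("An A-to-G mutation at nucleotide position (np) 3243"))
    # ['A-to-G', 'mutation', 'nucleotide', 'position', 'np', '3243']
```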
  • Text description
    • word_to_string(token)
    • Numerical:
    • lenght(token) , word_frequency(token) , distance_word_category(token1,token2)
    • Structural:
    • s_part_of(token1,token2) , first(token) , last(token) , first_is_char(token) , first_is_numeric(token) , middle_is_char(token) , middle_is_numeric(token) , last_is_char(token) , last_is_numeric(token) , single_char(token) , follows(token1,token2)
    • Lexical:
    • type_of(token) , type_POS(token)
    • Domain dependent:
    • word_category(token)
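Some of the structural descriptors above could be computed along the following lines. This is a sketch under assumptions: the slides do not define "middle", so it is taken here to mean every character strictly between the first and the last; the function name `describe_token` and the sample token "A3243G" are hypothetical.

```python
def describe_token(token):
    """Return a few structural features of a non-empty token string."""
    if not token:
        raise ValueError("token must be non-empty")
    first, middle, last = token[0], token[1:-1], token[-1]
    return {
        "length": len(token),
        "single_char": len(token) == 1,
        "first_is_char": first.isalpha(),
        "first_is_numeric": first.isdigit(),
        # An empty middle (tokens of length 1 or 2) is neither char nor numeric.
        "middle_is_char": middle.isalpha(),
        "middle_is_numeric": middle.isdigit(),
        "last_is_char": last.isalpha(),
        "last_is_numeric": last.isdigit(),
    }


if __name__ == "__main__":
    for name, value in describe_token("A3243G").items():
        print(name, value)
```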
  • Application
    • We considered 71 documents selected by biologists
    • Expert users manually annotated occurrences of entities of interest, namely
      • Mutation : position, type, substitution, type_position, locus
      • Subjects : nationality, method, pathology, category , number
    • The extraction process (both learning and recognition) is performed locally on the text portions of interest, which are automatically classified
  • Textual portions of papers were categorized into five classes: Abstract, Introduction, Materials & Methods, Discussion and Results. The abstract of each paper was processed. [Table: avg. no. of categories correctly classified]
    • An A-to-G mutation at nucleotide position (np) 3243 in the mitochondrial tRNALeu(UUR) gene is closely associated with various clinical phenotypes of diabetes mellitus.
    • [ annotation(3)=substitution , annotation(4)=no_tag, annotation(5)=no_tag, annotation(6)=no_tag, annotation(7)=position , annotation(8)=no_tag, annotation(9)=locus, annotation(10)=no_tag, annotation(11)=no_tag, annotation(12)=no_tag, annotation(13)=no_tag, annotation(14)=no_tag, annotation(15)=no_tag, annotation(16)=pathology],
    • [ part_of(1,2)=true, contain(2,3)=true , …, contain(2,16)=true, word_to_string(3)='A-to-G', word_to_string(4)='mutation', word_to_string(5)='nucleotid', word_to_string(6)='position', word_to_string(7)='3243', word_to_string(8)='mitochondri', word_to_string(9)='trnaleu(uur)', word_to_string(10)='gene', word_to_string(11)='clos', word_to_string(12)='associat', word_to_string(13)='variou', word_to_string(14)='clinic', word_to_string(15)='phenotyp', word_to_string(16)='diabetes_mellitus', type_of(3)=upperinitial , …, type_of(7)=numeric , type_POS(3)=jj , type_POS(4)=nn, …, type_POS(15)=nns, word_frequency(3)=3, word_frequency(4)=6, …, word_frequency(16)=1, word_category(9)=locus, word_category(16)=disease, distance_word_category(9,16)=1, follows(3,4)=true , follows(4,5)=true,…, follows(14,15)=true, follows(15,16)=true]).
    Example description
  • Background knowledge
    • follows(X,Z)=true ← follows(X,Y)=true, follows(Y,Z)=true
    • char_number_char(X)=true ← first_is_char(X)=true, middle_is_numeric(X)=true, last_is_char(X)=true
    • number_char_char(X)=true ← first_is_numeric(X)=true, middle_is_char(X)=true, last_is_char(X)=true
    • char_char_number(X)=true ← first_is_char(X)=true, middle_is_char(X)=true, last_is_numeric(X)=true
    • Domain knowledge:
    • word_to_string(X)=transition ← word_to_string(X)=transversion
    • word_to_string(X)=substitution ← word_to_string(X)=replacement
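To illustrate what the intensional background knowledge adds, the sketch below derives the transitive `follows` relation from direct adjacency and evaluates the `char_number_char` pattern on a feature dictionary. The function names and the dictionary-based representation are assumptions made for this example; ATRE itself evaluates these clauses in its own logical formalism.

```python
def transitive_follows(direct):
    """Close follows under: follows(X,Z) <- follows(X,Y), follows(Y,Z)."""
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure


def char_number_char(features):
    """char_number_char(X) <- first_is_char, middle_is_numeric, last_is_char."""
    return (features["first_is_char"]
            and features["middle_is_numeric"]
            and features["last_is_char"])


if __name__ == "__main__":
    print(sorted(transitive_follows({(3, 4), (4, 5), (5, 6)})))
    # [(3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
```

With the closure available, a learned clause can relate a token to any earlier token in the sentence, not only to its immediate neighbour.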
  • Experiments
    • Mutation template
    • 6-fold cross validation
    • The user manually annotates 355 tokens (8.65 per abstract)
    • About 11% positives
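The k-fold protocol can be sketched as follows. The helper `k_fold_splits` is hypothetical and assigns items to folds round-robin; how the abstracts were actually assigned to folds in the experiments is not stated in the slides.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; every item is tested exactly once."""
    items = list(items)
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


if __name__ == "__main__":
    # 12 hypothetical abstract identifiers, 6 folds.
    for train, test in k_fold_splits(range(12), 6):
        print("test:", test, "train size:", len(train))
```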
  • Experiments
  • Learned theories
    • annotation(X1)=position ← follows(X2,X1)=true, type_of(X1)=numeric, follows(X1,X3)=true, word_category(X3)=gene, word_to_string(X2)=position
    • annotation(X1)=type ← follows(X1,X2)=true, word_frequency(X2) in [8..140], follows(X3,X1)=true, annotation(X3)=substitution
    • annotation(X1)=position ← follows(X2,X1)=true, annotation(X2)=substitution, follows(X3,X1)=true, follows(X1,X4)=true, word_frequency(X4) in [6..6], annotation(X3)=type, follows(X1,X5)=true, annotation(X5)=locus, word_frequency(X1) in [1..2]
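As an illustration of how the first learned clause could be applied, the sketch below matches it against a toy token sequence. Everything here is hypothetical: the token identifiers, the dictionary layout and the function `annotate_position` are invented for the example, and `follows` is given as an explicit set of pairs. The recursive clauses, which depend on other annotations, would need an iterative evaluation that is not shown.

```python
def annotate_position(tokens, follows):
    """Apply: annotation(X1)=position <- follows(X2,X1), type_of(X1)=numeric,
    follows(X1,X3), word_category(X3)=gene, word_to_string(X2)=position."""
    found = set()
    for (x2, x1) in follows:
        if tokens[x2]["string"] != "position":
            continue
        if tokens[x1]["type"] != "numeric":
            continue
        for (y, x3) in follows:
            if y == x1 and tokens[x3].get("category") == "gene":
                found.add(x1)
    return found


if __name__ == "__main__":
    # Toy description of three consecutive tokens (hypothetical data).
    tokens = {
        1: {"string": "position", "type": "word"},
        2: {"string": "3243", "type": "numeric"},
        3: {"string": "nd1", "type": "word", "category": "gene"},
    }
    follows = {(1, 2), (2, 3)}
    print(annotate_position(tokens, follows))  # {2}
```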
  • Wrap-up
    • IE in Biomedicine
    • The ILP approach to IE within a multi-relational framework allows one to implicitly define
      • Domain knowledge
      • Learning from users’ interaction
      • Relational representations
      • Learning relational patterns to allow context-sensitive application of models
    • Recursive Theory Learning in IE: ATRE
    • Efforts on text processing level:
      • Ambiguities
      • Data sparseness
      • Noise
    • Encouraging results on a real-world data set
  • Where from here?
    • Test on available corpora for Bio IE
      • Genia
      • BioCreative
      • NLPBA
      • Genic interaction challenges
    • Investigation of semisupervised approaches: online extension of dictionaries
    • How to encapsulate taxonomical knowledge?
    • Can information extracted by ATRE be really used as background knowledge for genomic database mining?
    • Data sparseness
    • Om + di com = the system does not cope with morphosyntactic variety
    • Locus and position = word_to_string, specific models
  • Textual Pattern Extraction
    • “ immortal cells have lost their growth regulatory mechanisms and, thus, continue to divide indefinitely. ”
    • abstract(11695244).
    • contain_vx(11695244,'lose'). contain_nx(11695244,n1).
    • word(n1,'immort'). word(n1,'cell').
    • close_to(n1,'immort','cell'). contain_nx(11695244,n2).
    • word(n2,'growth').
    • word(n2,'regulatori').
    • word(n2,'mechan').
    • close_to(n2,'growth','regulatori'). close_to(n2,'regulatori','mechan').
    • subject_object(n1,n2).
    • contain_vx(11695244,'divide').
    Goal: to find descriptions of texts belonging to the “abstract” class. Task-relevant objects: nominal chunks, words. Reference object: abstract.
  • Language bias
    • A language bias has been defined in ATRE to allow users to suggest initial models that the learned theory has to satisfy.
    • Example declarations can be used to specify language biases:
    • starting_number_of_literals(p, N)
    • starting_clause(p, [L1,L2,…,LN])
    • starting_literal(p, [L1,L2,…,LN])
  • Efficiency issues in ATRE
    • Caching the structure of already explored search space as much as possible :
      • Clauses generated during the i-th learning step are saved and reused at the (i+1)-th learning step
      • Some pruning and grafting operations are used to adapt previously explored hierarchies of clauses to the current learning step
    • Caching for clause evaluation:
      • saving much of the computational effort spent to find the positive and negative examples covered by each generated clause
      • It can be applied only to independent clauses, since the sets of positive/negative examples they cover can decrease or remain unchanged (but not increase)
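The idea of caching clause evaluation can be sketched as below, under the stated assumption that the set of examples covered by an independent clause never grows between learning steps. The class `CoverageCache` and the `covers` callback are hypothetical and do not reproduce ATRE's actual data structures; the point is only that previously uncovered examples need not be tested again.

```python
class CoverageCache:
    """Remember which examples each clause covered at the previous step."""

    def __init__(self, covers):
        self._covers = covers  # callable(clause, example) -> bool
        self._covered = {}     # clause -> set of examples covered last time
        self.tests = 0         # number of coverage tests actually performed

    def covered(self, clause, examples):
        previous = self._covered.get(clause)
        if previous is None:
            candidates = set(examples)             # first evaluation: test all
        else:
            candidates = previous & set(examples)  # coverage cannot grow
        result = set()
        for example in candidates:
            self.tests += 1
            if self._covers(clause, example):
                result.add(example)
        self._covered[clause] = result
        return result


if __name__ == "__main__":
    # Toy setting: a "clause" is a divisor, an example is an integer.
    cache = CoverageCache(lambda clause, e: e % clause == 0)
    print(sorted(cache.covered(2, range(10))), cache.tests)
    print(sorted(cache.covered(2, range(4, 10))), cache.tests)
```

In the toy run the second call tests only the three examples that were covered before and are still present, instead of all six remaining examples.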
  • The learning strategy…
    • The basic idea
    • Stepwise construction of a recursive theory T
    • $T_0 = \emptyset, T_1, \ldots, T_i, T_{i+1}, \ldots, T_n = T$
    • such that:
    • $T_{i+1} = T_i \cup \{C\}$ for some clause $C$
    • $LHM(T_i) \subseteq LHM(T_{i+1})$, $\forall i \in \{0, 1, \ldots, n-1\}$
    • $pos(LHM(T_{i+1})) > pos(LHM(T_i))$ for each $1 \le i \le n$
    • $neg(LHM(T_i)) = 0$ for each $1 \le i \le n$
  • … the learning strategy…
    • 1) $pos(LHM(T_{i+1})) > pos(LHM(T_i))$ for each $1 \le i \le n$
    • Choose at least one seed for each predicate $p$ to be learned, namely a positive example $e^+$ of $p$ such that $e^+ \notin LHM(T_i)$.
    • Explore the space of clauses more general than $e^+$ looking for $C$ such that $neg(LHM(T_i \cup \{C\})) = 0$
    • 2) $neg(LHM(T_i)) = 0$ for each $1 \le i \le n$
    • Select the best consistent clause and apply the layering technique whenever global inconsistency arises
    • Variation of the classical separate-and-conquer strategy
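For orientation, the classical separate-and-conquer loop that ATRE varies can be sketched on attribute-value data. This is a deliberately simplified stand-in, not ATRE's algorithm: clauses are conjunctions of attribute tests rather than first-order definite clauses, there is no recursion, no parallel search and no layering, and all names are hypothetical. It does show the two conditions: each added clause covers at least the seed (more positives covered) and no negative example.

```python
from itertools import combinations


def covers(clause, example):
    """A clause is a frozenset of (attribute, value) tests; all must hold."""
    return all(example.get(a) == v for a, v in clause)


def learn_clause(seed, negatives):
    """Search general-to-specific for the simplest clause that covers the
    seed and no negative example (conquer stage)."""
    tests = sorted(seed.items())
    for size in range(1, len(tests) + 1):
        for combo in combinations(tests, size):
            clause = frozenset(combo)
            if not any(covers(clause, n) for n in negatives):
                return clause
    return None


def separate_and_conquer(positives, negatives):
    theory = []
    uncovered = list(positives)
    while uncovered:
        seed = uncovered[0]                      # pick a seed
        clause = learn_clause(seed, negatives)
        if clause is None:
            raise ValueError("no consistent clause for seed %r" % (seed,))
        theory.append(clause)
        # Separate stage: drop the positives the new clause explains.
        uncovered = [p for p in uncovered if not covers(clause, p)]
    return theory


if __name__ == "__main__":
    positives = [{"type": "numeric", "prev": "position"},
                 {"type": "numeric", "prev": "np"}]
    negatives = [{"type": "numeric", "prev": "age"},
                 {"type": "word", "prev": "position"}]
    for clause in separate_and_conquer(positives, negatives):
        print(sorted(clause))
```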
  • The ILP approach to Data Mining
    • Relational representations
    • Domain knowledge
    • Learning from users’ interaction
    • Learning relational patterns to allow context-sensitive application of models
    • ILP = Inductive Learning ∩ Logic Programming
    • From IL: inductive reasoning from observations and background knowledge
    • From LP: first-order logic as representation formalism