Finding Semantic Relations

Highlights
  • Semantic relation (e.g., degrades, binds) expressed in the sentence; roles: two choices, protein or none
  • Each question consisted of one of the extracted statements with the devices highlighted in red. The task for the labeler (a.k.a. worker) was to choose 'Yes' if the statement contained a relation between the devices, 'No' if it did not, or 'not applicable' if the text extract was not a sentence or if the query words were not used as distinct devices.
  • We find Mechanical Turk to be an interesting, inexpensive, fairly accurate, and fast way to obtain labeled data

Finding Semantic Relations: Presentation Transcript

  • Discovering Semantic Relations (for Proteins and Digital Devices) Barbara Rosario Intel Research
  • Outline
    • Semantic relations
      • Protein-protein interactions (joint work with Marti Hearst)
      • Digital devices (joint work with Bill Schilit, Google and Oksana Yakhnenko, Iowa State University)
    • Models to do text classification and information extraction
    • Two new proposals for getting labeled data
  • Text mining
    • Text Mining is the discovery by computers of new, previously unknown information, via automatic extraction of information from text
    • Example: a (human) analysis of titles of articles in the biomedical literature suggested a role of magnesium deficiency in migraines [Swanson]
  • Text mining
    • Text:
      • Stress is associated with migraines
      • Stress can lead to loss of magnesium
      • Calcium channel blockers prevent some migraines
      • Magnesium is a natural calcium channel blocker
    1: Extract semantic entities from text
  • Text mining
    • Text:
      • Stress is associated with migraines
      • Stress can lead to loss of magnesium
      • Calcium channel blockers prevent some migraines
      • Magnesium is a natural calcium channel blocker
    1: Extract semantic entities from text (entities: Stress, Migraine, Magnesium, Calcium channel blockers)
  • Text mining (cont.)
    • Text:
      • Stress is associated with migraines
      • Stress can lead to loss of magnesium
      • Calcium channel blockers prevent some migraines
      • Magnesium is a natural calcium channel blocker
    2: Classify relations between entities (Stress 'associated with' Migraine; Stress 'leads to loss of' Magnesium; Calcium channel blockers 'prevent' Migraine; Magnesium 'subtype of (is a)' Calcium channel blockers)
  • Text mining (cont.)
    • Text:
      • Stress is associated with migraines
      • Stress can lead to loss of magnesium
      • Calcium channel blockers prevent some migraines
      • Magnesium is a natural calcium channel blocker
    3: Do reasoning: find new correlations by chaining the relations above (e.g., Magnesium is a calcium channel blocker and calcium channel blockers prevent migraines, suggesting a link between magnesium and migraines; see the sketch below)
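A minimal sketch of this chaining step (hypothetical Python, not from the slides; the entities and relation strings are illustrative):

    # Sketch: store extracted relations in a small graph and chain them to
    # surface indirect links (a crude form of the reasoning step above).
    from collections import defaultdict

    relations = [
        ("stress", "associated with", "migraine"),
        ("stress", "leads to loss of", "magnesium"),
        ("calcium channel blockers", "prevent", "migraine"),
        ("magnesium", "subtype of", "calcium channel blockers"),
    ]

    graph = defaultdict(list)          # entity -> [(relation, entity), ...]
    for subj, rel, obj in relations:
        graph[subj].append((rel, obj))

    def two_step_paths(start):
        """Yield chains start -r1-> mid -r2-> end."""
        for r1, mid in graph[start]:
            for r2, end in graph[mid]:
                yield (start, r1, mid, r2, end)

    # magnesium -subtype of-> calcium channel blockers -prevent-> migraine
    for path in two_step_paths("magnesium"):
        print(" -> ".join(path))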
  • Relations
    • The identification and classification of semantic relations is crucial for the semantic analysis of text
    • Protein-protein interactions
    • Relations for digital devices
  • Protein-protein interactions
    • Applications throughout biology
    • There are several protein-protein interaction databases (BIND, MINT,..), all manually curated
    • Most of the biomedical research and new discoveries are available electronically but only in free text format.
    • Automatic mechanisms are needed to convert text into more structured forms
  • Protein-protein interactions
    • Supervised systems require manually labeled data, while purely unsupervised systems have yet to prove effective for these tasks.
    • We propose the use of resources developed in the biomedical domain to address the problem of gathering labeled data for the task of classifying interactions between proteins
  • HIV-1, Protein interaction database
    • “The goal of this project is to provide scientists a summary of all known interactions of HIV-1 proteins with host cell proteins, other HIV-1 proteins, or proteins from disease organisms associated with HIV/AIDS”
    • There are 2224 interacting protein pairs and 51 types of interaction
    http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/
  • HIV-1 protein interaction database (example rows)
    Protein 1 | Protein 2 | Interaction | Paper ID
    10000 | 155871 | activates | 11156964
    10197 | 155348 | degraded by | 14519844, …
    1017 | 155871 | induces | 9223324
    10015 | 155030 | binds | 10893419
  • Protein-protein interactions
    • Idea: use this to “label data”
    Extract from the paper all the sentences containing Protein 1 and Protein 2, and label them with the interaction given in the database (example row: Protein 1 = 10000, Protein 2 = 155871, interaction = activates, Paper ID = 11156964)
  • Protein-protein interactions
    • Idea: use this to “label data”
    Extract from the paper all the sentences containing Protein 1 and Protein 2, and label each of them with the interaction from the database (here, 'activates'); a labeling sketch follows this slide
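A rough sketch of this automatic labeling idea (hypothetical Python; the protein names, record values, and plain string matching are illustrative, whereas the real system works over full articles with a protein lexicon):

    # Sketch: use a curated interaction record to auto-label sentences that
    # mention both of its proteins (distant-supervision style labeling).
    import re

    # Illustrative record, not an actual database entry.
    record = {"protein1": "Vpu", "protein2": "CD4",
              "interaction": "degrades", "paper_id": "00000000"}

    def label_sentences(paper_text, record):
        """Return (sentence, interaction) pairs for sentences mentioning
        both proteins of the record."""
        sentences = re.split(r"(?<=[.!?])\s+", paper_text)
        p1, p2 = record["protein1"].lower(), record["protein2"].lower()
        return [(s, record["interaction"])
                for s in sentences
                if p1 in s.lower() and p2 in s.lower()]

    text = "Vpu binds CD4 in the ER. Vpu targets CD4 for degradation."
    print(label_sentences(text, record))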
  • Protein-protein interactions
    • Use citations
    • Find all the papers that cite the papers in the database
    (Example: citing papers 9918876 and 9971769 point to database paper 11156964, whose entry is Protein 1 = 10000, Protein 2 = 155871, interaction = activates)
  • Protein-protein interactions
    • From the citing papers, extract the citation sentences; from these, extract the sentences containing Protein 1 and Protein 2
    • Label them with the interaction given in the database (here, 'activates'; database row: Protein 1 = 10000, Protein 2 = 155871, Paper ID = 11156964)
  • Protein-protein interactions
    • Task:
    • Given the sentences extracted from paper ID and/or the citation sentences:
    • Determine the interaction given in the HIV-1 database for paper ID
    • Identify the proteins involved in the interaction (protein name tagging, or role extraction).
    Interaction | Papers | Citances
    Degrades | 60 | 63
    Synergizes with | 86 | 101
    Stimulates | 103 | 64
    Binds | 98 | 324
    Inactivates | 68 | 92
    Interacts with | 62 | 100
    Requires | 96 | 297
    Upregulates | 119 | 98
    Inhibits | 78 | 84
    Suppresses | 51 | 99
  • The models (1) Naïve Bayes (NB) for interaction classification.
  • The models (2) Dynamic graphical model (DM) for protein interaction classification (and role extraction).
  • Dynamic graphical models
    • Graphical model composed of repeated segments
    • HMMs (Hidden Markov Models)
      • POS tagging, speech recognition, IE
    (Diagram: chain of hidden tags t_1 … t_N emitting words w_1 … w_N)
  • HMMs
    • Joint probability distribution
      • P(t_1, …, t_N, w_1, …, w_N) = P(t_1) ∏_{i=2..N} P(t_i | t_{i-1}) P(w_i | t_i)
    • Estimate P(t_1), P(t_i | t_{i-1}), and P(w_i | t_i) from labeled data
  • HMMs
    • Joint probability distribution
      • P(t_1, …, t_N, w_1, …, w_N) = P(t_1) ∏_{i=2..N} P(t_i | t_{i-1}) P(w_i | t_i)
    • Estimate P(t_1), P(t_i | t_{i-1}), and P(w_i | t_i) from labeled data
    • Inference: P(t_1, t_2, …, t_N | w_1, w_2, …, w_N); a toy numeric example follows below
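A toy numeric illustration of the HMM factorization above (hypothetical tag set and made-up probabilities, only to make the formula concrete):

    # Toy HMM: compute P(t_1..t_N, w_1..w_N)
    # = P(t_1) * prod_i P(t_i | t_{i-1}) * prod_i P(w_i | t_i).
    # All probabilities below are invented for illustration.
    start = {"PROT": 0.5, "O": 0.5}                        # P(t_1)
    trans = {("PROT", "O"): 0.7, ("PROT", "PROT"): 0.3,
             ("O", "PROT"): 0.4, ("O", "O"): 0.6}          # P(t_i | t_{i-1})
    emit = {("PROT", "CXCR4"): 0.2, ("PROT", "Tat"): 0.2,
            ("O", "antagonism"): 0.1, ("O", "by"): 0.3}    # P(w_i | t_i)

    def joint(tags, words):
        p = start[tags[0]] * emit[(tags[0], words[0])]
        for i in range(1, len(tags)):
            p *= trans[(tags[i - 1], tags[i])] * emit[(tags[i], words[i])]
        return p

    words = ["CXCR4", "antagonism", "by", "Tat"]
    tags = ["PROT", "O", "O", "PROT"]
    print(joint(tags, words))  # inference would maximize this over all tag sequences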
  • Graphical model for role and relation extraction
      • Markov sequence of states (roles)
      • States generate multiple observations
      • The relation generates the state sequence and the observations
    (Diagram: an Interaction node generates the sequence of Roles, and each role generates its observed Features; a rough factorization follows below)
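As a rough sketch of this structure (my own notation, extending the HMM formula above rather than quoting the slides): the joint distribution factors roughly as P(R, t_1, …, t_N, w_1, …, w_N) = P(R) ∏_i P(t_i | t_{i-1}, R) P(w_i | t_i, R), where R is the interaction, t_i the roles, and w_i the observed features; classification then picks the R that maximizes this probability for the observed sentence.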
  • Analyzing the results
    • Hiding the protein names: “Selective CXCR4 antagonism by Tat” becomes “Selective PROT1 antagonism by PROT2”
      • To check whether the interaction types could be unambiguously determined by the protein names.
    • Compare results with a trigger words approach
  • Results: interaction classification (accuracy %, reported for All / Papers / Citances)
    • Classification accuracies:
      • DM: 60.5 / 57.8 / 53.4
      • NB: 58.1 / 57.8 / 55.7
    • No protein names:
      • DM: 60.5 / 44.4 / 52.3
      • NB: 59.7 / 46.7 / 53.4
    • Trigger words: 25.8 / 40.0 / 26.1
    • Baseline (most frequent interaction): 21.8 / 11.1 / 26.1
  • Results: protein (role) extraction (Recall / Precision / F-measure)
    • All: 0.74 / 0.85 / 0.79
    • Papers: 0.56 / 0.83 / 0.67
    • Citances: 0.75 / 0.84 / 0.79
  • Conclusions of protein-protein interaction project
    • Difficult and important problem: the classification of (ten) different interaction types between proteins in text
    • The dynamic graphical model DM can simultaneously perform protein name tagging and relation identification
    • High accuracy on both problems (well above the baselines)
    • The results obtained removing the protein names indicate that our models learn the linguistic context of the interactions.
    • Found evidence supporting the hypothesis that citation sentences are a good source of training data, most likely because they provide a concise and precise way of summarizing facts in the bioscience literature.
    • Use of a protein-interaction database to automatically gather labeled data for this task.
  • Relations for digital devices
    • Identification of activities/relations between device pairs.
    • What can you do with a given device pair?
      • Digital camera and TV
      • Media server and computer
      • Media server and wireless router
      • Toshiba laptop and wireless audio adapter
      • PC and DVR
      • TV and DVR
  • Looking for relations
    • Can you just search the Web?
      • Google searches: "TV DVR" and "PC DVR"
    • Current search engines find co-occurrence of query terms
    • Often you need to find semantically related entities
    • For text mining, inference and for search (IR)
  • Looking for relations
    • Can you just search the Web?
      • Google searches: "PC DVR" and "TV DVR"
    • You may want to see instead all the sentences in which the two devices are involved in an activity/relation and get a sense of what you can do with these devices
        • Activities_between(PC DVR)
          • From which you learn for example that
            • Can build a Better DVR out of an Old PC
            • Any modern Windows PC can be used for DVR duty
        • Activities_between(TV DVR)
          • From which you learn, for example, that
            • DVR allows you to pause live TV
            • Can watch Google Satellite TV through your "internet ready" Google DVR
  • Looking for relations
    • We can frame this problem as a classification problem:
    • Given a sentence containing two digital devices, is there a relation between them expressed in the sentence or not? (A minimal classifier sketch follows below.)
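A minimal sketch of this binary framing (hypothetical Python using scikit-learn purely for illustration; the system described later in the talk uses a Maximum Margin Naïve Bayes model, not this library, and the toy sentences and labels below are made up in the style of the examples that follow):

    # Sketch: bag-of-words binary classifier for "does this sentence express
    # a relation between the two devices?" (toy data, illustrative only).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    sentences = [
        "The media server application reads the music library stored on your computer",
        "You will use FTP software to transfer files from your computer to the media server",
        "GSIC > Research Computer System > Obtaining Accounts > Media Server",
    ]
    labels = ["YES", "YES", "NO"]  # hand labels, as in the MTurk setup

    vec = CountVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(sentences), labels)

    test = ["The media server streams the music stored on the computer"]
    print(clf.predict(vec.transform(test)))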
  • Looking for relations
    • Media server and computer
      • The Allegro Media Server application reads the iTunes music library file to find the music stored on your computer
        • YES
      • You will use the FTP software to transfer files from your computer to the media server
        • YES
      • The media server has many functions and it needs to be a high-end computer with plenty of hard drive space to store the very large video files that get created
        • YES
      • Sometimes you might want to play faster than your computer, or your Internet connection, or your media server, can handle
        • NO
      • Anderson , George Homsy, A Continuous Media I/O Server and Its Synchronization Mechanism, Computer , v.24 n.10, p.51-57, October 1991
        • NO
      • GSIC > Research Computer System > Obtaining Accouts > Media Server
        • NO
  • Looking for relations
    • Media server and wireless router
      • For example, if you access a local media server in your house that is connected to a wireless router that has a port speed of only 100 Mbps [..]
        • YES
      • Besides serving as a router, a wireless access point, and a four-port switch, the WRTSL54GS includes a storage link and a media server
        • YES
      • It has a built-in video server, media server, home automation, wireless router, internet gateway
        • NO
  • Our system
    • Set of 57 pairs of digital devices
    • Searched the Web (Google) using the device pairs as queries
    • From the Web pages retrieved, we extracted the 3,627 text excerpts containing both devices (an extraction sketch follows this slide)
    • We labeled them (YES or NO)
    • Trained a classification system
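A rough sketch of the excerpt-extraction step (hypothetical Python; the real pipeline ran over Google result pages, which is not shown here, and the sentence splitting is deliberately naive):

    # Sketch: from fetched page text, keep sentences mentioning both devices.
    import re

    def excerpts_with_pair(page_text, device1, device2):
        """Return sentences containing both device terms (case-insensitive)."""
        sentences = re.split(r"(?<=[.!?])\s+", page_text)
        d1, d2 = device1.lower(), device2.lower()
        return [s for s in sentences if d1 in s.lower() and d2 in s.lower()]

    page = ("Any modern Windows PC can be used for DVR duty. "
            "The DVR records shows overnight.")
    print(excerpts_with_pair(page, "PC", "DVR"))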
  • Our FUTURE system
    • Will allow us to identify the Web pages containing relations.
      • Could display only those.
      • Could highlight only sentences with relations
      • For digital devices, this would allow, for example, useful queries for troubleshooting
        • Searching the web is one of the principal methods used to seek out information and to resolve problems involving digital devices for home networks
  • Our FUTURE system
    • Possible extensions of the project to get the activities types
      • We look at the sentences extracted and come up with a set of possible activities; build a (multi-class) classification system to classify the different activities (supervised)
      • Extract the most indicative words for the activities (like the words highlighted here ); cluster them to get “activity clusters” (unsupervised)
  • Our system
    • Set of 50 Device Pairs
    • Search the Web (Google) using the device pairs as query
    • From the Web pages retrieved, we extracted the sentences containing both devices
    • We labeled them (YES or NO)
    • Trained a classification system
  • Labeling with Mechanical Turk
    • To train a classification system, we need labels
      • Time consuming, subjective, different for each domain and task
      • (But unsupervised systems usually work worse)
    • We used a web service, Mechanical Turk (MTurk, http://www.mturk.com), which allows one to create and post a task that requires human intervention, and to offer a reward for the completion of the task.
  • Mechanical Turk HIT for labeling relations
  • Surveys
  • Surveys
  • Mechanical Turk
    • We created a total of 121 surveys, each consisting of 30 questions.
    • Our reward to workers was between 15 and 30 cents per survey (less than 1 cent per text segment)
      • We obtained labels for 3627 text segments for under $70.
    • Each HIT was completed (by all 3 “workers”) within a few minutes to half an hour
      • We had perfect agreement for 49% of all sentences
      • 5% received three different labels (these were discarded)
      • For 46%, two of the three labels agreed (the majority vote was used as the final label; the aggregation step is sketched below)
    • 1865 text segments were labeled YES
    • 1485 text segments were labeled NO
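A small sketch of the label-aggregation step (hypothetical Python, consistent with the counts above: unanimous labels kept, 2-of-3 majorities kept, three-way splits discarded):

    # Sketch: aggregate three MTurk labels per text segment by majority vote.
    from collections import Counter

    def aggregate(worker_labels):
        """worker_labels: three labels from {'YES', 'NO', 'N/A'}.
        Returns the majority label, or None if all three disagree."""
        label, n = Counter(worker_labels).most_common(1)[0]
        return label if n >= 2 else None  # None -> discard the segment

    print(aggregate(["YES", "YES", "YES"]))  # unanimous -> YES
    print(aggregate(["YES", "NO", "YES"]))   # majority  -> YES
    print(aggregate(["YES", "NO", "N/A"]))   # three-way split -> None (discard)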
  • Classification
    • Now we have labeled data
    • Need a (binary) classifier
  • Summary (from lecture 17)
    • Algorithms for Classification
    • Binary classification
      • Perceptron
      • Winnow
      • Support Vector Machines (SVM)
      • Kernel Methods
      • Multilayer Neural Networks
    • Multi-Class classification
      • Decision Trees
      • Naïve Bayes
      • K nearest neighbor
  • Support Vector Machine (SVM)
    • Large Margin Classifier
    • Linearly separable case
    • Goal: find the hyperplane that maximizes the margin
    (From Lecture 17) Separating hyperplane: w^T x + b = 0, with margin M; the support vectors lie on w^T x_a + b = +1 and w^T x_b + b = -1. (A toy fit is sketched below.)
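A tiny illustration of the linearly separable case (hypothetical toy data; scikit-learn's linear-kernel SVC is used here only as an example, not as the classifier from the talk):

    # Sketch: fit a linear SVM and recover the max-margin hyperplane w^T x + b = 0.
    # The geometric margin is 2 / ||w||; support vectors satisfy w^T x + b = +/-1.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
    y = np.array([-1, -1, 1, 1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin
    w, b = clf.coef_[0], clf.intercept_[0]
    print("w =", w, "b =", b)
    print("margin =", 2.0 / np.linalg.norm(w))
    print("support vectors:", clf.support_vectors_)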
  • Graphical models
    • Directed (like Naïve Bayes and HMM)
    • Undirected (Markov Network)
  • Maximum Margin Markov Networks
    • Large Margin Classifier + (undirected) Markov Networks [Taskar 03]
      • To combine the strengths of the two methods:
        • High dimensional feature space, strong theoretical guarantees
        • Problem structure, ability to capture correlation between labels
    Benjamin Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-Margin Markov Networks. In NIPS.
  • Directed Maximum Margin Model
    • Large Margin Classifier + ( directed ) graphical model (Naïve Bayes)
    • MMNB: Maximum Margin Naïve Bayes
      • Essentially, to combine the strengths of graphical models (better at interpreting data, weaker classification performance) with those of discriminative models (better performance, but a less interpretable mechanism)
  • Results
    • Compare with Naïve Bayes and Perceptron (Weka)
    • Classification accuracy:
      • MMNB: 79.98
      • Naïve Bayes: 75.62
      • Perceptron: 63.03
  • Conclusion
    • Semantic relations
    • Two projects: interactions between proteins and relations between digital devices
    • Statistical models (dynamic graphical models, Maximum Margin Naïve Bayes)
    • Creative ways of obtaining labeled data: a protein interaction database and “paying” people (MTurk)
  • Thanks! Barbara Rosario [email_address] Intel Research
  • Additional slides
  • All device pairs
    • desktop wireless router
    • PC stereo
    • digital camera television
    • pc wireless audio adapter
    • digital camera tv set
    • pc wireless router
    • ibm laptop buffalo media player
    • Phillips stereo pc
    • ibm laptop linksys wireless router
    • prismq media player wireless router
    • ibm laptop squeezebox
    • stereo laptop
    • ibm laptop wireless audio adapter
    • stereo toshiba laptop
    • kodak camera television
    • toshiba laptop buffalo media player
    • laptop linksys wireless router
    • toshiba laptop linksys wireless router
    • laptop media server
    • toshiba laptop netgear wireless router
    • laptop squeezebox
    • toshiba laptop squeezebox
    • laptop stereo
    • toshiba laptop wireless audio adapter
    • laptop wireless audio adapter
  • All device pairs (cont.)
    • buffalo media player wireless router
    • laptop wireless router
    • buffalo media server wireless router
    • linkstation home server wireless router
    • camera tv
    • linkstation multimedia server wireless router
    • computer linksys wireless router
    • media player wireless router
    • computer media server
    • media server linksys wireless router
    • computer stereo
    • media server netgear wireless router
    • computer wireless audio adapter
    • media server wireless router
    • computer wireless router
    • network media player wireless router
    • desktop media server
    • nikon camera television
    • desktop stereo
    • pc media server
    • desktop wireless audio adapter
    • pc squeezebox