Transcriptomics and lexico-syntactic analysis

Transcriptomics and Lexico-syntactic Analysis
(Yet another meaning of the TLA homonym)
Lars Juhl Jensen, EMBL Heidelberg

A brief history of TLA
- The joke started at the E-BioSci/ORIEL Annual Workshop
- Barend and I (among a few others) gave somewhat provocative talks
- We afterwards discussed the homonym problem caused by excessive use of acronyms
- I told the TLA joke ...
- I got invited to give a talk here
- Thanks!
- I was asked for the title of my talk and TLA was suggested
- Which meant I had to do some serious research on TLA ...
- http://www.acronymfinder.com: Three Letter Abbreviation, Three Letter Acronym, Telemetry Link Adapter, Telephone Link Adapter, Temporary Lodging Allowance, Temporary Lodging Assistance, Tennessee Library Association, Term Loan A, Terminal Low Altitude, Texas Library Association, Theater of the Living Arts, Thin Layer Activation, Three Letter Agency, ...

The context – what I actually work on (When I’m not telling other people to work on my IE project)
- Construction and analysis of cross-species networks of functional associations
- This network will contain many types of edges
  - Assignment of orthologous/homologous genes
  - In silico links derived from genomic context (protein fusion, phylogenetic profiles, and genomic co-localization)
  - Links supported by similar gene expression profiles
  - Protein interaction data from large-scale screens
  - Literature-derived links extracted from Medline abstracts
- To do this we must be able to resolve the gazillion different names and identifiers for each gene in each species

“Biologists would rather share their toothbrush than share a gene name”
- Lists of synonymous identifiers and names were compiled from
  - SWISS-PROT/TrEMBL
  - SGD, WormBase, and FlyBase
  - BLAST search against UniGene
- Several types of identifiers
  - Various database identifiers and accession numbers
  - Gene symbols and gene names
- Lack of standardization
  - 8+ identifiers per yeast gene
  - Many names refer to unrelated genes in different species
- The synonym and ortholog lists can be downloaded from: http://www.bork.embl.de/synonyms

Number of uniquely resolvable names for each species

Species          Proteins    Names  Ratio  Species specific    SWALL  UniGene
A. thaliana        25,957  138,976    5.3                 -  118,818   20,158
C. elegans         20,348  110,602    5.4            45,835   65,749   18,214
D. melanogaster    16,871  103,208    6.1            22,707   77,757   14,072
H. sapiens         27,936  200,130    7.1                 -  181,186   18,944
M. musculus        20,006  132,577    6.6                 -  116,712   15,865
S. cerevisiae       6,210   48,291    7.7            18,702   40,038        -
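
What "uniquely resolvable" could mean in practice is sketched below. This is a minimal illustration, not the actual pipeline: the gene identifiers, the names, and the choice to compare names case-insensitively are all assumptions made for the example.

```python
from collections import defaultdict

def build_name_index(synonyms):
    """Map each lowercased name to the set of gene identifiers that use it."""
    index = defaultdict(set)
    for gene_id, names in synonyms.items():
        for name in names:
            index[name.lower()].add(gene_id)
    return index

def resolve(name, index):
    """Return the gene identifier if the name is unambiguous, else None."""
    hits = index.get(name.lower(), set())
    return next(iter(hits)) if len(hits) == 1 else None

# Hypothetical toy synonym list for one species (not real database content).
toy_synonyms = {
    "GENE_A": ["abc1", "YXX001W", "xyz2"],
    "GENE_B": ["def1", "YXX002C", "xyz2"],  # "xyz2" is shared, hence ambiguous
}
index = build_name_index(toy_synonyms)

# Names that point to exactly one gene are the uniquely resolvable ones.
unique_names = sorted(n for n, ids in index.items() if len(ids) == 1)
```

Here `resolve("ABC1", index)` gives `"GENE_A"`, while the shared name `"xyz2"` resolves to nothing and is left out of `unique_names`.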

Orthographic variations of gene names
- The list of gene names is automatically expanded to include the most common orthographic variations
  - A hyphen can be replaced by a space
  - “p” can be used as a postfix on gene names to signify proteins
- This orthographic expansion gives rise to ~280,000 gene/protein names in yeast alone
- There is still quite a lot of orthographic variation missed
  - Multi-word gene names often cause trouble as word order can sometimes change
  - Greek letters in names also give some problems
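
The two expansion rules named on this slide can be sketched in a few lines. The function name is made up for illustration, and the real expansion may well handle case, word order, or further rules that are not shown here.

```python
def orthographic_variants(name):
    """Return the set of spelling variants for one gene name."""
    variants = {name}
    # Rule 1: a hyphen can be replaced by a space.
    if "-" in name:
        variants.add(name.replace("-", " "))
    # Rule 2: a trailing "p" signifies the protein; apply it to every variant so far.
    variants |= {v + "p" for v in variants}
    return variants

# "abc-1" is a hypothetical gene name used only to show the hyphen rule.
example = orthographic_variants("abc-1")
```

For the hypothetical name `abc-1` this yields four forms (`abc-1`, `abc 1`, `abc-1p`, `abc 1p`), which shows how quickly the dictionary grows.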

Retraining TreeTagger for Medline abstracts
- The English parameter file distributed with TreeTagger was trained on the UPenn Treebank
- We retrained TreeTagger on the manually annotated GENIA 3.0 corpus (466,179 tokens), adding gene names to the dictionary
- Performance of the two taggers was evaluated on 55,166 tokens not used during training
- Retraining eliminated more than half of all tagging errors

Tagging is really easy ... compared to extracting the information you are after
- Many ways to write the same thing
  - A activates the transcription of B
  - B transcription is induced by A
  - A is a transcriptional activator of B
  - Overexpression of A increases B mRNA levels
  - Transcription is enhanced when A binds to the B promoter
  - The B promoter contains an A UAS
- Multiple pieces of information and negations in a sentence
  - A is a transcriptional activator of B, C, D, E, and F
  - B was not suppressed by A
  - The A transcription factor affects B but not C
  - C phosphorylation of A leads to increased expression of B

A mini-ontology of transcription regulation
- Entities (boxes): generic (gray), regulator (yellow), activator (red), repressor (green), target (blue)
- Relations (arrows): is-a (black), part-of (blue)
- Events (arrows): creates (green), binds (red)

Parsing abstracts to identify relationships between genes/proteins
- Sentence and word boundaries are identified using Tokenizer
- Our retrained TreeTagger is used for part-of-speech tagging
- Abstracts are chunked with a custom CASS grammar to identify noun and verb chunks
- Noun chunks are categorized according to a mini-ontology
- Lexico-syntactic patterns are used to identify event chunks

Example output (token, tag, noun chunk, event chunk; "|" marks a chunk continuing from the line above):
  SRN1      NNPG  NXPGSG  EVSUPVA
  can       MD            |
  suppress  SUPV          |
  rna2      NNPG  NXPGPL  |
  rna3      NNPG  |       |
  rna4      NNPG  |       |
  rna5      NNPG  |       |
  rna6      NNPG  |       |
  and       CC    |       |
  rna8      NNPG  |       |
  singly    RB
  or        CC
  in        IN
  pairs     NNS

TIGERSearch is used for searching and browsing the large processed text corpus

Patterns recognize sentences in both active and passive voice
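
To make the active/passive idea concrete, here is a deliberately simplified sketch. It uses plain regular expressions on raw text, which is a stand-in and not the CASS grammar over tagged chunks described in this talk; the gene-name pattern and the small verb lists are assumptions chosen for the example.

```python
import re

# Assumption: a gene name is a capitalized alphanumeric token such as "LRE1".
GENE = r"[A-Z][A-Za-z0-9]+"
VERBS = {"activates": "activation", "induces": "activation",
         "represses": "repression", "suppresses": "repression"}
PARTICIPLES = {"activated": "activation", "induced": "activation",
               "repressed": "repression", "suppressed": "repression"}

# Active voice: "<regulator> <verb> <target>"
ACTIVE = re.compile(rf"\b({GENE}) ({'|'.join(VERBS)}) ({GENE})\b")
# Passive voice: "<target> is <participle> by <regulator>"
PASSIVE = re.compile(rf"\b({GENE}) is ({'|'.join(PARTICIPLES)}) by ({GENE})\b")

def extract(sentence):
    """Return (regulator, relation, target) triples found in one sentence."""
    triples = []
    for m in ACTIVE.finditer(sentence):
        triples.append((m.group(1), VERBS[m.group(2)], m.group(3)))
    for m in PASSIVE.finditer(sentence):
        # In the passive form the regulator comes last, so the roles are swapped.
        triples.append((m.group(3), PARTICIPLES[m.group(2)], m.group(1)))
    return triples
```

Both `extract("LRE1 represses CTS1")` and `extract("CTS1 is repressed by LRE1")` give the same triple, with LRE1 as the regulator. Negation, coordination ("B, C, and D"), and intervening words are exactly what this toy version cannot handle.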

Typical results are shown
- “The expression of an FLR1 lacZ reporter construct is strongly induced by the overexpression of either CAP1 or YAP1 , indicating that the FLR1 gene is transcriptionally regulated by the CAP1 and YAP1 proteins”
- “In addition , the mot3Delta mutation caused a partial derepression of the Mig1 Tup1 Ssn6 repressed SUC2 gene , but not the alpha2 Mcm1 Tup1 Ssn6 repressed STE2 gene”
- “We demonstrate here that overexpression of LRE1 represses CTS1 whereas deletion of LRE1 induces the expression of CTS1”

We can only wish that all biologists mention their results twice
- “The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1 , which binds to dissimilar DNA sequences in UAS 1 of CYC1 and the UAS of CYC7”
- 2 correct regulatory interactions identified
- The second mention of the same interactions is missed
- “A disruption of the SKO1 gene causes a partial derepression of SUC2 , indicating that SKO1 is a negative regulator of the SUC2 gene”
- We correctly interpret their results and extract an activation
- But we miss the following sentence where the results are explicitly interpreted for us?!

Two out of three is not bad at all
- “Multicopy MEU1 expression suppressed the constitutive ADH2 expression caused by cre2 1 . Disruption of MEU1 reduced endogenous ADH2 expression about twofold but had no effect on cell viability or growth”
- Correctly identifies the suppression twice (despite the tokenizer accidentally joining two sentences!)
- We cannot handle overlapping events that are not fully embedded
- “Sin3p negatively regulates the INO1 , CHO1 , CHO2 and OPI3 genes while Ume6p negatively regulates the INO1 gene and positively regulates the other genes”
- Correctly identifies 4 negative regulations from two events
- Misses 3 positive regulations of “the other genes”

Life is unfair
- “The Uga43p factor negatively regulates GZF3 expression and vice versa”
- Be happy for what you got ...
- “With wild type CDC28 , filament formation induced by CLN1 overexpression was markedly decreased in a SWE1 deletion”
- ... even if you got it by pure luck

Why not extract phosphorylations while we are at it?
- “Loss of Hnt1 enzyme activity also leads to hypersensitivity to mutations in Ccl1 , Tfb3 , and Kin28 , which constitute the TFIIK kinase subcomplex of general transcription factor TFIIH and to mutations in Cak1 , which phosphorylates Kin28”
- “Consistent with the proposed model , Pkc1p selectively phosphorylates Bck1p in vitro and Mpk1p protein kinase activity requires a functional BCK1 gene”

Using text mining of Medline abstracts to support predicted regulatory interactions
- By applying the scheme just described to all Medline abstracts, a set of regulatory interactions in multiple species is obtained
- We will use it to classify protein associations derived from
  - Microarray gene expression
  - Chromatin IP data
  - Physical protein interaction screens (e.g. Y2H and TAP)
  - Cross-species analysis of genomic context (STRING)
- To integrate all of these different data sources, the list of synonymous gene names and identifiers is again needed, as different data sets use different identifiers

Microarrays 101
- The level of expression in two samples can be compared for all genes simultaneously
- Each spot contains either cDNA or short probes specific to one gene
- The amount of labeled mRNA from a sample that hybridizes to each spot is measured as a fluorescence intensity
- Spotted microarrays are quite cheap compared to GeneChips

Non-linear normalization of intensities and correction for spatial effects
[Figure panels: Downloaded SMD data; After intensity normalization; Spatial bias estimate; After spatial normalization]

Combining arrays from multiple experiments into one gene expression matrix
- For each species, all arrays in SMD are merged
  - To ensure comparable data, all arrays were re-normalized
  - A matrix is constructed with each row being a gene and each column an array
- This integration is complicated by the lack of consistency in the choice of gene identifiers even within SMD
  - To deal with this, a very large list of synonymous gene names and identifiers was compiled based on SGD, WormBase, FlyBase, SWISS-PROT, and UniGene
  - Such a list is also very useful for integrating the expression data with protein-protein interaction data, STRING, and text-mining of Medline abstracts
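
A minimal sketch of the merging step, assuming each array arrives as a mapping from whatever name it happens to use to a normalized log-ratio, and that a name-to-gene lookup has already been built from the synonym list. All identifiers and values below are invented for illustration.

```python
def build_expression_matrix(arrays, name_to_gene):
    """Merge arrays into {gene: [value per array]}; None marks a missing value."""
    genes = sorted(set(name_to_gene.values()))
    matrix = {gene: [None] * len(arrays) for gene in genes}
    for col, array in enumerate(arrays):
        for name, log_ratio in array.items():
            gene = name_to_gene.get(name)
            if gene is not None:  # names that cannot be resolved are skipped
                matrix[gene][col] = log_ratio
    return matrix

# Hypothetical lookup derived from a synonym list.
name_to_gene = {"abc1": "GENE_A", "YXX001W": "GENE_A", "def1": "GENE_B"}
arrays = [
    {"abc1": 1.2, "def1": -0.4},        # array 1 reports gene symbols
    {"YXX001W": 0.9, "unknown7": 2.0},  # array 2 reports systematic identifiers
]
matrix = build_expression_matrix(arrays, name_to_gene)
```

Rows are genes and columns are arrays, as on the slide; the two different names for `GENE_A` end up in the same row.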

“And now we cluster correlated expression profiles ... no, wait a second!”
- Traditional clustering of genes with correlated expression profiles is not well suited for inferring functional links
- No appropriate distance measure
  - Not all arrays are of the same quality
  - Not all experiments will be equally useful for inferring function
  - The arrays are not all from mutually independent experiments
- Multi-functional proteins
  - If (A,B) are in the same cluster and (A,C) also cluster together, (B,C) will by definition be in the same cluster
  - Functional relations for the pairs (A,B) and (A,C) do not necessarily imply a functional relation for (B,C) if A has two or more functions
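
The multi-functional protein argument can be spelled out with a toy example. The three gene labels follow the slide; everything else is an illustrative assumption.

```python
# Pairwise functional links: A is linked to B and to C, but B and C are unrelated.
links = {frozenset(("A", "B")), frozenset(("A", "C"))}

def pairs_implied_by_cluster(genes):
    """A hard cluster asserts a relation between every pair of its members."""
    return {frozenset((x, y)) for i, x in enumerate(genes) for y in genes[i + 1:]}

# Putting A with B and A with C forces all three genes into one cluster.
implied = pairs_implied_by_cluster(["A", "B", "C"])

# Pairs the clustering asserts although no functional link supports them.
spurious = implied - links
```

The only spurious pair is (B, C), which is why scoring pairs directly is preferred over assigning genes to clusters.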

Singular value decomposition – letting the data speak for themselves
- Singular value decomposition is run on the gene expression matrix
  - Defines an ordered set of non-correlated basis vectors
  - Each singular vector is a linear combination of arrays
- The first singular vectors effectively average over related arrays
  - Finds replicate arrays including dye-swaps
  - Adjacent arrays in time series and related experiments are combined
- The last vectors mainly contain noise, e.g. replicate differences
[Figure: singular vectors 1 to 8, annotated with Sporulation, Polysomes, Salt treatment, Heat-shock, Starvation, RNA stability]
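
The claim that the first singular vector averages over related arrays can be checked on a toy matrix. The sketch below uses power iteration to approximate only the leading right singular vector, which is a dependency-free swap-in for a full SVD routine; the matrix values are invented, with arrays 1 and 2 acting as replicates.

```python
import math

def leading_right_singular_vector(X, iterations=100):
    """Approximate the first right singular vector of X by power iteration on X^T X."""
    n_rows, n_cols = len(X), len(X[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # start from equal weights on all arrays
    for _ in range(iterations):
        Xv = [sum(X[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        XtXv = [sum(X[i][j] * Xv[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(c * c for c in XtXv))
        v = [c / norm for c in XtXv]
    return v

# Toy expression matrix: rows are genes, columns are arrays.
# Columns 0 and 1 are identical replicates; column 2 is a weak, unrelated array.
X = [
    [2.0, 2.0, 0.1],
    [-1.0, -1.0, 0.0],
    [1.5, 1.5, -0.1],
    [0.0, 0.0, 0.2],
]
v1 = leading_right_singular_vector(X)
```

The two replicate arrays receive equal, large weights in `v1` and the unrelated array a weight close to zero, so projecting genes onto `v1` amounts to averaging the replicates.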

Inferring functional links from projections of genes onto singular vectors
- Analyze each singular vector
  - Do 1D density estimation of expression ratio projections for genes of known function
  - 2D density estimation for pairs of functionally related genes
  - Use Bayes’ law for estimating log-odds of functional link given a pair of projections
- Different types of regulation
  - Up- vs. down-regulation
  - Anti-correlated expression
- The log-odds from the first N singular vectors are summed
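
One way to write this scoring down is given here as a sketch, since the slides do not state the formula. The notation is assumed: x_{ik} is the projection of gene i onto singular vector k, p_k(x) is the 1D density, and p_k(x_i, x_j | link) is the 2D density for functionally related pairs. If unlinked pairs are treated as independent draws from the 1D density, and the singular vectors as independent sources of evidence, Bayes' law gives:

```latex
\log\frac{P(\text{link}\mid x_i, x_j)}{P(\text{no link}\mid x_i, x_j)}
  = \log\frac{P(\text{link})}{P(\text{no link})}
  + \sum_{k=1}^{N}\log\frac{p_k(x_{ik}, x_{jk}\mid\text{link})}{p_k(x_{ik})\,p_k(x_{jk})}
```

The first term is the prior log-odds of a link; each term in the sum is the contribution of one singular vector, which is what makes the per-vector log-odds additive.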

Proteins linked to the human mitotic checkpoint protein BUB1

Identifier  Description                                  Comments
Q8WVP0      Kinesin-like 5                               Mitotic kinesin-like protein 1
CDN3_HUMAN  Cyclin-dependent kinase inhibitor 3
HMG2_HUMAN  High mobility group protein 2
FXM1_HUMAN  Forkhead box protein M1                      Phosphorylated in M-phase
BASP_HUMAN  Brain acid soluble protein 1                 Associated with "growth cones"
MYBB_HUMAN  Myb-related protein B                        Phosphorylated by Cdk2 during S-phase
O14731      Membrane-associated kinase                   Cell cycle regulated kinase, inhibits Cdc2
TP2A_HUMAN  DNA topoisomerase II
MPI1_HUMAN  M-phase inducer phosphatase 1
PMC1_HUMAN  Polymyositis/scleroderma autoantigen 1
Q8N324      Hypothetical protein                         Contains a PRY and a SPRY domain
CGA2_HUMAN  Cyclin A2
Q9NZJ0      L2DTL protein                                Contains six WD40 repeats
Q15003      HCAP-H protein
KF14_HUMAN  Kinesin-like protein KIF14
Q96SE4      Kinesin-like protein 2
KNS2_HUMAN  Kinesin-like protein 2
WEE1_HUMAN  Wee1-like protein kinase                     May act as a negative regulator of entry into mitosis
O14980      CRM1 protein                                 Cell cycle-dependent expression
NEK2_HUMAN  Serine/threonine-protein kinase NEK2         Involved in mitotic regulation
CHK1_HUMAN  Serine/threonine-protein kinase Chk1         Involved in cell cycle arrest
CKS1_HUMAN  Cyclin-dependent kinases regulatory subunit

Tha-tha-tha-that’s all folks!
- I believe that literature mining methods are very useful and much needed in the field of biology
  - It is simply not possible to read through all potentially interesting papers being written on any but the most narrowly defined topics
  - However, they are not very useful if used alone
- Information extraction is particularly important for interpreting high-throughput experiments
  - The data sets do by nature ask global/broad questions
  - Yet they should be interpreted in the context of current knowledge
- We are working on several fronts on making a system that allows a unified overview of both data and literature

The STRING web service
- Relies on genomic context analysis of 110 species with a total of 440,000 genes
  - Most of these are prokaryotes (8 eukaryotic genomes)
- Contains orthologous group assignment for 80% of the genes (50% for eukaryotes)
  - Of those, 70% have links with >75% accuracy
- STRING is accessible at: http://www.bork.embl.de/STRING

Honestly – it’s not my fault!
- The text mining people at EML: Jasmin Saric, Isabel Rojas
- The STRING team: Christian von Mering, Berend Snel, Martijn Huynen, Daniel Jaeggi, Steffen Schmidt, Peer Bork
- Microarray normalization: Chris Workman
- PROPHECIES web service: Julien Lagarde
- Web resources: www.bork.embl.de/STRING (soon moving to string.embl.de), www.bork.embl.de/synonyms
