Termite & Termite Expressions


Termite is a next-generation text-mining and semantic markup engine for life sciences. Designed for researchers and content providers alike, it enhances unstructured text and finds the topics and relationships that matter!

Published in: Technology, Education

  1. SciBite Termite and Termite-Express: Powerful Text-Indexing for Life Sciences. © 2013 SciBite Limited.
  2. What’s the Issue? So much public, private and professional textual content is available. Standard text-search tools don’t help much because they aren’t semantic. Semantic means searching by “thing”, not by synonym (keyword). Semantic means more accurate and complete results!
  3. Users & Applications. Researchers: You are a life science professional whose job involves hunting for key facts in literature, patents, grants and internal documents. We offer the ability to datamine millions of documents to identify critical mentions and relationships. Enterprise Search: You are a company wishing to make its internal search portals more accurate. We offer the ability to enhance your existing search tool to find key biological entities more accurately, making your users happier and more productive! Content Providers: You are anyone who produces or supplies textual content in the life sciences. We offer the opportunity to enrich your content for search and navigation and to significantly increase its value to your consumers.
  4. Selecting a Semantic Recognition Engine. You want something that: is commercially supported; is highly configurable; is accurate; is scalable (millions of documents); is fast (MB/sec processing); is flexible (abstracts to full documents); supports batch and on-demand (web service) processing; is tuned to life sciences data; comes supplied with highly curated thesauri; is comfortable with the ambiguity of life science texts; goes beyond recognition to identify critical phrases in a document. Termite meets all these criteria.
  5. Semantic Entity Recognition: Basics. There are two main approaches. Thesaurus: match text to a list of known synonyms. Algorithmic: try to identify an entity synonym “on the fly”. Termite uses both mechanisms to identify entities with high accuracy. Thesauri are often an afterthought for tool providers, who point to free public sources. While these are good starting points, they will deliver variable results. Our view: commercial-grade text mining requires commercial-grade thesauri.
  6. Our Thesauri Products. Thesauri are at our heart, not an afterthought. They combine crowd-sourced and professional curation with experienced biomedical/pharma researchers. They are built to tackle real-world text. They are integration-ready: we use public identifiers by default. They include mappings to other resources, and many are organised via ontologies.
  7. Some Examples. Human Gene: we have over 4.5 million synonyms and, when combined with our on-the-fly algorithms, we match over 30 million gene name mentions. Indication (Disease): we have extensive coverage of over 5000 of the most important human diseases, along with over 63,000 manually verified synonyms. Protein Type: recognises concepts such as “interleukin”, “cytokine” and “ion channel”, rather than specific genes. Arguably these terms are used more often in biomedical text than gene names, yet such entities are very poorly identified by other tools. Drugs: recognises over 1 million synonyms covering >60,000 launched and research therapeutics, updated on a daily basis from our internet-wide scanning at SciBite.com. We also cover: Adverse Events, Cells, Tissues, Species (and species-specific gene thesauri), Companies, Micro RNAs, Mutations, Hormones & Messengers, Investigative Procedures (e.g. Biopsy), Laboratory Chemicals, Laboratory Procedures, Restriction Enzymes, Plasmids, General Laboratory Products & more!
  8. Synonyms Aren’t Easy. Biomedical terms are very ambiguous: GSK (GlaxoSmithKline or glycogen synthase kinase?); Hedgehog (the animal or the developmental regulator protein?); Android (the FDA-approved drug or the phone OS?); Transgene (the company or the technique?); MCD (macular corneal dystrophy or malformations of cortical development?); Pacific (Pacific Biotechnology or the ocean?); EGFR (the kinase receptor or estimated glomerular filtration rate?).
  9. Ambiguity: Termite’s Strength. Termite’s engine and thesauri understand which synonyms are fairly dependable (e.g. Pfizer), often ambiguous (e.g. MCD), or correct but very dangerous (e.g. Pacific). As a document is analysed, Termite uses all of the following. Synonym range: which synonyms are used, and how ambiguous are they as a whole, not just one by one? Synonym metrics: the frequency and position of synonyms, and the relationship between abbreviations and full terms. Document context: does the document mention key terms (but not synonyms) that increase or decrease the chances that the ambiguous synonym is correct?
  10. Bottom Line: Termite allows you to use ambiguous synonyms in your thesauri to increase recall without returning a lot of rubbish!
  11. Termite handles Greek characters with ease: β-actin; β actin; actin-β; b-actin; b- actin; Beta actin; Actin, beta. Including HTML entity codes! ß-actin: the German sharp S (ß) isn’t beta, but that doesn’t stop people using it.
  12. Termite handles variations with ease. The usual variations: Muscarinic M1 Receptor(s); Muscarinic (M1) Receptor; M1 Muscarinic Receptor; Muscarinic Receptor M1; Muscarinic Receptors M1; Muscarinic Receptor type M1.
  13. Termite handles “broken” phrases with ease: M1/M2 muscarinic receptors; H1 and H2 Histamine Receptors; Kinases ERK1 and 2; ERK1/2.
  15. Accuracy of Termite on a Random Selection of 400 Entries from the BioCreative Gene-Mention Task. [Pie chart; segment values: 88%, 7%, 5%; legend: Correct, Disagreement, Incorrect.] http://biocreative.sourceforge.net/
  16. WebConnect – “Termite Live”: http://scibite.com/site/p3/webconnect.html
  17. TEXPRESS: Going Beyond Recognition
  18. NER, Patterns, NLP. Termite is a Named Entity Recognition (NER) engine: it finds mentions of “things” in text. Natural Language Processing (NLP) is an area of linguistics that seeks to develop a computer-understandable representation of human text. NLP is both powerful and complex. Human language can vary greatly, which leaves many facets to consider in NLP results. Critically, many use cases do not require full NLP; users simply wish to “identify any relationships between entities in the text”. TExpress uses “patterns” to achieve this.
  19. An Example. Use case: scan an input set of documents, identify disease-gene relationships within the text and output these to a file for downstream processing. We supply a simple pattern, Indication{0,3}(Gene|Protein_Class), which means: find an indication, followed by 0-3 other words, and then a gene or protein class. It’s critical to use “(Gene|Protein_Class)” when looking for gene/protein information, as classes are often used instead of specific gene names. For example, the text “Simvastatin induces heme oxygenase-1 expression but fails to reduce inflammation in the capsule surrounding a silicone shell implant in rats” is tagged as: [DRUG:CHEMBL1064]simvastatin [VERB:!INDUCES]induces [GENE:HMOX1]heme _oxygenase _1 [VERB:!EXPRESSION]expression but {NEG}fails to [VERB:!REDUCE]reduce [INDICATION:D007249]inflammation in the capsule surrounding a silicone shell implant in [ORG:RAT]rats
  20. Identifying Causal Relationships. E.g. we want to look for drugs that treat Lymphocytic Choriomeningitis Virus (LCMV). We use the pattern DRUG.{0,1}:treat.{0,1}:INDICATION(D001117), which is translated as: find any drug, in close proximity to the verb “treat”, followed closely by the specific indication (D001117 is the ID for LCMV). From the text “To investigate its therapeutic potential, we used rapamycin to treat Lymphocytic Choriomeningitis Virus (LCMV)-infected perforin-deficient (Prf1(-/-) ) mice according to a well-established model of HLH”, we obtain the computer-readable result: [DRUG:CHEMBL413]rapamycin to [VERB:!TREAT]treat [D001117]lymphocytic _choriomeningitis _virus
  21. Other features. Extension (a pattern will match multiple entities in a list): <Indication><gene> will find all genes in “Cancer due to mutations in p53, SCA1 and BRCA1”. Negativity: TExpress will note where the extracted phrase contains negative keywords or sentiment. Verb extraction: identify causal/action relationships and return the verb used, e.g. <gene> <any_verb> <gene> on “p53 binds mdm2” => binds. Auto-continuation: we’ll match multiple entities of the same type in a list (e.g. matching both drugs in the phrase “cancer can be treated with paclitaxel and bortezomib”) using an “<INDICATION><DRUG>” pattern.
  22. Why TExpress? Built on Termite with all its advantages (quality thesauri, ambiguity processing, coverage). Simple patterns that are easy to create and understand. High performance and scalability (around 10% slower than Termite alone). Supports narrow-focus (e.g. “<Gene1> inhibits <Gene2>”) and wide-focus (e.g. “<Gene1> <any_verb> <Gene2>”) relationships. Simple JSON, TSV or XML output.
  23. Want to know more? Ask us for a demo today! Email: info@scibite.com Twitter: @scibitely Call Us: +44 (0)20 8819 2776
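
To make the pattern idea from slides 19 and 20 more concrete, here is a minimal Python sketch of the matching that a pattern such as Indication{0,3}(Gene|Protein_Class) describes: an entity of one type, then at most three other tokens, then an entity of one of the allowed types. This is not TExpress code and does not parse TExpress pattern syntax. The `Token` type, the `find_pairs` function and the hand-tagged example sentence are all hypothetical, and the sketch assumes entity tagging has already been done by some other step.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Token(NamedTuple):
    """One token of already-tagged text (hypothetical structure)."""
    text: str
    entity_type: Optional[str] = None  # None means an ordinary, untagged word


def find_pairs(
    tokens: List[Token],
    first_type: str,
    second_types: Set[str],
    max_gap: int = 3,
) -> List[Tuple[str, str]]:
    """Pair each entity of `first_type` with the nearest following entity whose
    type is in `second_types`, allowing at most `max_gap` tokens in between."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok.entity_type != first_type:
            continue
        # A candidate at index j has (j - i - 1) tokens between it and token i,
        # so the last index still within the gap limit is i + 1 + max_gap.
        last = min(i + 1 + max_gap, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            if tokens[j].entity_type in second_types:
                pairs.append((tok.text, tokens[j].text))
                break  # keep only the nearest match for this entity
    return pairs


# Hand-tagged, made-up sentence: "inflammation is driven by cytokine signalling"
example_tokens = [
    Token("inflammation", "Indication"),
    Token("is"),
    Token("driven"),
    Token("by"),
    Token("cytokine", "Protein_Class"),
    Token("signalling"),
]

print(find_pairs(example_tokens, "Indication", {"Gene", "Protein_Class"}))
# prints [('inflammation', 'cytokine')]
```

With `max_gap=2` the same call returns an empty list, because three words separate the two entities. Stopping at the nearest match is a simplification made for this sketch; the extension and auto-continuation features on slide 21 would need to keep collecting further entities of the same type.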