ICIC 2013 Conference Proceedings David Milward Linguamatics

Unstructured Text in Big Data: the Elephant in the Room
David Milward (Linguamatics, UK)
A traditional approach has been to extract structured data from the unstructured text. When it comes to big data, manual or semi-automated methods just don't scale. Even with fully automated methods, it is infeasible to extract all the facts and relationships buried in the text to produce a 'knowledge base of everything' that answers every possible end-user question. In recent years, there has therefore been a trend towards using text mining over unstructured data to directly answer questions, effectively creating novel databases on the fly. However, this has widened the gap between information professionals and end users.

In this talk I will explain how multiple approaches are being adopted to exploit the skills of information professionals to improve decision support for end users. These include specialised real-time querying, alerting based on semantic queries, semantic enrichment of documents, and population of semantic stores or linked data.

Published in: Technology, Business

  1. 1. Unstructured Text in Big Data: The Elephant in the Room • David Milward • ICIC, October 2013
  2. 2. Unstructured Big Data • Big Data – Volume, Variety, Velocity • Estimated 80% of all data is unstructured – Need to be able to make decisions based on this data • Ever increasing amount of data makes it harder and harder to filter by hand – Need more automation © 2013 Linguamatics Ltd.
  3. 3. From Unstructured Data to Structured • Identify concepts and relationships • Structured, relational data • Normalized concept identifiers • Traditional text mining works in batch mode • Unstructured or Semi-Structured Content Sources • Requires each extraction to be programmed in, or learnt from a large amount of annotated material
  4. 4. Agile Text Mining: I2E • Querying • Results • Information Consumers
  5. 5. Increasing Range Of Applications • In the beginning... Protein-Protein Interactions • Sentiment Analysis in Social Media • Vocabulary Development • Mining FDA Drug Labels • Conference Abstract Mining • Electronic Medical Records Mining • SAR Extraction • Competitive Intelligence • Target Identification & Prioritization • Safety/Tox • Plant Metabolic Pathways • Patent Analysis • Chemical Search • Extracting Numerical and Experimental Data • Biomarker Discovery • In-licensing Opportunities • Key Opinion Leader Identification • Patent FTO • Systems Biology • Drug Repositioning
  6. 6. The Database Approach • Can we predetermine what is required? • One approach is to try to extract everything … but in practice have to make assumptions, whether by hand or automatically – Do we include negative statements? – Do we include speculation? – What context do we include? • Need a good idea of how the data will be used to know what needs to be extracted • There are cases where this works well e.g. – have useful structured data and want to add to it from free text – have new records structured, but want to bring in information from legacy free text
  7. 7. Extracting for a Relational Database • Extracting all relevant information from pathology reports
  8. 8. Extracting for a Semantic Store • Output as a set of triples • Because all the parts have URIs, the parts are web-addressable (Diagram labels: Treat, Peptide, Psoriasis, Express, Associate, Web, Infectious Disease)
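
As a rough illustration of the "set of triples" output described on the semantic store slide, the following Python sketch serializes extracted facts as N-Triples lines. The base URIs, the helper names, and the specific subject/relation/object pairings are assumptions made up for this example; the slide only shows the labels, not how they connect.

```python
# Hypothetical base URIs; a real semantic store would use its own vocabulary.
CONCEPT = "http://example.org/concept/"
RELATION = "http://example.org/relation/"


def uri(base: str, label: str) -> str:
    """Build a URI from a label by replacing spaces with underscores."""
    return base + label.strip().replace(" ", "_")


def to_ntriples(facts):
    """Serialize (subject, relation, object) label tuples as N-Triples lines."""
    lines = []
    for subj, rel, obj in facts:
        lines.append(
            f"<{uri(CONCEPT, subj)}> <{uri(RELATION, rel)}> <{uri(CONCEPT, obj)}> ."
        )
    return lines


# Illustrative facts only; these pairings are not taken from the talk.
facts = [
    ("Peptide", "treat", "Psoriasis"),
    ("Infectious Disease", "associate", "Psoriasis"),
]
ntriples = to_ntriples(facts)
for line in ntriples:
    print(line)
```

Because every part of each triple is a URI, each concept and relation can be looked up or linked to independently, which is the "web-addressable" property the slide refers to.
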
  9. 9. The Ad Hoc Approach • No need to build a database • Regard text mining queries as ways to create a database on the fly • Keep all the unstructured free text, and query for what you need when you need it • Even index structured data with text mining to link information together • Used very successfully by knowledge professionals: – Just like a search engine, there are thousands of questions people want to ask of the data – You can’t predict them all beforehand
  10. 10. Ad Hoc Querying • Find relationships that you want for any particular information request • Treat the text mining as an “agile” database to answer thousands of different questions e.g. metabolites from fruit
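
The idea of querying free text on demand, rather than building a database first, can be sketched in Python as below. This is a minimal sketch, not how I2E works: it uses sentence-level co-occurrence as a crude stand-in for NLP-based relationship extraction, and the vocabularies, function names, and sample text are all invented for illustration.

```python
import re

# Tiny illustrative vocabularies; real terminologies would be far larger.
FRUITS = {"apple", "grape", "tomato"}
METABOLITES = {"quercetin", "resveratrol", "lycopene"}


def split_sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def adhoc_query(text, class_a, class_b):
    """Return (a, b, sentence) rows where a term from each class shares a sentence."""
    rows = []
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        for a in sorted(tokens & class_a):
            for b in sorted(tokens & class_b):
                rows.append((a, b, sentence))
    return rows


corpus = (
    "Resveratrol was detected in grape skin. "
    "The study did not involve any fruit. "
    "Tomato is a major dietary source of lycopene."
)
# The "table" of metabolites from fruit is created at query time.
rows = adhoc_query(corpus, METABOLITES, FRUITS)
for metabolite, fruit, sentence in rows:
    print(metabolite, "|", fruit, "|", sentence)
```

Swapping in different term classes answers a different question over the same unchanged text, which is the sense in which the text acts as an "agile" database.
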
  11. 11. Do We Need More? • I2E Agile Text Mining has opened up text mining to the end user, but it is still mainly used by knowledge professionals • Given the importance of unstructured big data, can we bring the benefits of text mining to an even wider audience?
  12. 12. Workflow Automation • Successful queries converted into regular workflows using Pipeline Pilot or KNIME • Real-time workflows for up-to-date dashboards, alerts • Further analysis, visualization, integration with other data
  13. 13. Clinical Trials Analysis from I2E Text Mining Results • Using Pipeline Pilot visualization • Colours indicate Phase • Dates
  14. 14. Embedding I2E within Web Apps
  15. 15. Form-Based Querying • Create web interfaces connecting to I2E server • Information professionals develop sophisticated queries • Queries are parameterized to allow end users to customize • JavaScript allows portability to mobile devices such as iPads
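
The split of work on the form-based querying slide, where an information professional writes the query and an end user only fills in parameters, might look roughly like the Python sketch below. The `QueryTemplate` class and the query syntax inside `pattern` are hypothetical and do not reflect the actual I2E query language or API.

```python
from dataclasses import dataclass, field


@dataclass
class QueryTemplate:
    """A query written by an information professional, with named slots for end users."""

    name: str
    pattern: str  # hypothetical query syntax with {slot} placeholders
    defaults: dict = field(default_factory=dict)

    def fill(self, **user_values):
        """Merge user values over defaults, rejecting parameters the template does not expose."""
        unknown = set(user_values) - set(self.defaults)
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        values = {**self.defaults, **user_values}
        return self.pattern.format(**values)


# The professional defines the template once...
template = QueryTemplate(
    name="drug_adverse_events",
    pattern="drug:{drug} causes adverse_event:* in source:{source}",
    defaults={"drug": "*", "source": "MEDLINE"},
)

# ...and the end user customizes only the exposed form fields.
query = template.fill(drug="aspirin")
print(query)
```

Restricting end users to the exposed slots keeps the sophisticated part of the query fixed while still allowing customization through a simple web form.
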
  16. 16. Enhancing Enterprise Search • Use text mining to annotate concepts to feed into a search engine to: – provide concept search – provide concepts for facets – provide intelligent thumbnails for documents, pulling out key information
  17. 17. Semantic annotation of documents (Diagram labels: Value, Relationships, Concepts, Keywords, Search Engine, NLP-based Text Mining)
  18. 18. Provide Concepts as Facets for Enterprise Search • Refine the search based on terminologies • Document information and teaser • Document preview and additional information
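
Using annotated concepts as search facets can be sketched as follows. This is a minimal Python illustration under assumed data: the document structure, the facet names, and the functions `facet_counts` and `refine` are invented here and are not part of any particular search engine.

```python
from collections import Counter


def facet_counts(annotated_docs, facet):
    """Count how many documents carry each concept under the given facet."""
    counts = Counter()
    for doc in annotated_docs:
        # Use a set so a concept repeated in one document is counted once.
        for concept in set(doc["annotations"].get(facet, [])):
            counts[concept] += 1
    return counts


def refine(annotated_docs, facet, concept):
    """Keep only documents annotated with the chosen concept."""
    return [d for d in annotated_docs if concept in d["annotations"].get(facet, [])]


# Hypothetical documents, already annotated by a text mining step.
docs = [
    {"id": "d1", "annotations": {"disease": ["psoriasis"], "company": ["Acme"]}},
    {"id": "d2", "annotations": {"disease": ["psoriasis", "asthma"]}},
    {"id": "d3", "annotations": {"disease": ["asthma"], "company": ["Acme"]}},
]

disease_facet = facet_counts(docs, "disease")
selected = refine(docs, "disease", "psoriasis")
print(dict(disease_facet))
print([d["id"] for d in selected])
```

The counts drive the facet sidebar, and selecting a facet value narrows the result list to documents carrying that concept.
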
  19. 19. Dictionary Matching • Companies
  20. 20. Pattern Matching - Institutions • Not a fixed list: can find previously unknown institutions
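
The contrast between a fixed dictionary and a pattern that can surface previously unknown institutions can be illustrated with a small Python sketch. The regular expression, the head-noun list, and the one-entry dictionary are assumptions chosen for this example; a production pattern would need to handle sentence-initial capitals, abbreviations, and many more name shapes.

```python
import re

# A fixed list only ever finds what is already in it.
KNOWN_INSTITUTIONS = {"University of Cambridge"}

# Capitalized words around a head noun such as "University" or "Institute".
HEAD = r"(?:University|Institute|Hospital|College)"
CAP = r"[A-Z][a-z]+"
PATTERN = re.compile(
    rf"\b(?:{HEAD} of {CAP}(?: {CAP})*|{CAP}(?: {CAP})* {HEAD})\b"
)


def find_institutions(text):
    """Return institution-like names matched by the pattern, in text order."""
    return [m.group(0) for m in PATTERN.finditer(text)]


text = (
    "She moved from the University of Cambridge "
    "to the Karolinska Institute last year."
)
found = find_institutions(text)
# Names matched by the pattern but absent from the dictionary are candidates
# for previously unknown institutions.
novel = [name for name in found if name not in KNOWN_INSTITUTIONS]
print(found)
print(novel)
```
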
  21. 21. Pattern Matching - Mutations
  22. 22. Pattern Matching - Chemicals • Assign structure to novel chemicals • Distinguish exemplified compounds • Distinguish compounds given properties
  23. 23. Whatever the Content... • ... I2E can mine and extract with precision • Scientific literature • Twitter
  24. 24. …. and Wherever • I2E 4.1 Linked Servers • Users can query local information or information on the cloud • Proprietary content can reside within Enterprise • Standard sources e.g. patents can be hosted on the cloud, and shared by all users • Data from content providers can reside on their sites (Diagram labels: Internal Content, External Content, MEDLINE, Patents, Drug Labels, Clinical Trials, Content Provider, Publisher, I2E OnDemand, I2E Enterprise, I2E Platform)
  25. 25. Bringing the Elephant Down to Size • Data warehouses for data that you know you need • Embedding text mining in other applications e.g. Enterprise Search to provide benefits of semantic search • Building specialized new interfaces to provide self-service applications for end-users • Continuing role for ad hoc text mining – text mining can ask almost any question of the data – you can never predict every question
