The Road to Federated Text Mining:
Are we there yet?
II-SDV 2014
Guy Singh
II-SDV 2014 The Road to Federated Text Mining: Are we there yet? (Guy Singh - Linguamatics, UK)

  1. The Road to Federated Text Mining: Are we there yet? II-SDV 2014. Guy Singh
  2. What is federated search?
     “Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources. A user makes a single query request which is distributed to the search engines participating in the federation” - Wikipedia
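
The definition above can be sketched as a fan-out of one query to several engines at once. This is a minimal illustration only: the toy engines, their names, and the sample documents are assumptions made up for the example and are not taken from the slides or from any product API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical stand-in for a search engine: maps a query string to a list of hits.
SearchFn = Callable[[str], List[str]]


def make_engine(name: str, documents: List[str]) -> SearchFn:
    """Build a toy engine that returns the documents containing the query text."""
    def search(query: str) -> List[str]:
        return [f"{name}: {doc}" for doc in documents if query.lower() in doc.lower()]
    return search


def federated_search(query: str, engines: Dict[str, SearchFn]) -> Dict[str, List[str]]:
    """Send one query to every engine concurrently and collect the hits per engine."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        return {name: future.result() for name, future in futures.items()}


# Made-up sample data for two participating engines.
engines = {
    "literature": make_engine(
        "literature", ["BRCA1 mutations in breast cancer", "Statin trial outcomes"]
    ),
    "patents": make_engine("patents", ["Method for treating breast cancer"]),
}
results = federated_search("breast cancer", engines)
```

Here the results stay grouped per engine; combining them into one result set is a separate step.
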
  3. Current Situation
     • Volume of data ever increasing
     • Proprietary content can reside within Enterprise
     • No need for everyone to keep standard sources up-to-date
     • Data from content providers can reside on their sites
     Diagram labels: Internal Content; External Content; MEDLINE; Clinical Trials; Publisher Content; FDA Drug Labels; Patents
     Linguamatics Customer Confidential
  4. Increasing Range of Data Sources
     Data Sources: Scientific Literature; Social Media; News; Web Pages; Internal Documents; Patents; RSS; Clinical Trials
  5. Varying in Structure
  6. How does text mining differ from keyword search?
     Example: What genes affect breast cancer?
  7. Why does document structure matter?
     • Searching across documents using keywords is relatively trivial
       – Do not need to be aware of where the words occur and in what context
     • Text mining documents with varying structure requires a more sophisticated approach. Need to:
       – Know where words matching entities/concepts occur
       – Disambiguate depending on context and location
       – Find terms in particular regions/parts of document for targeted searches
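
To illustrate why location matters, the sketch below contrasts a search over a whole document with a search restricted to one region. The region names, the sample report, and the `find_term` helper are all hypothetical; they only show the idea of knowing where a term occurs.

```python
from typing import Dict, List, Optional, Tuple

# A document is modelled as named regions of text (an assumption for illustration).
Document = Dict[str, str]


def find_term(
    doc: Document, term: str, regions: Optional[List[str]] = None
) -> List[Tuple[str, int]]:
    """Return (region, character offset) for each occurrence of term.

    With regions=None every region is searched, like a plain keyword search;
    otherwise only the named regions are searched.
    """
    hits = []
    needle = term.lower()
    for region, text in doc.items():
        if regions is not None and region not in regions:
            continue
        lowered = text.lower()
        start = lowered.find(needle)
        while start != -1:
            hits.append((region, start))
            start = lowered.find(needle, start + 1)
    return hits


# Made-up report with two regions that mention the same term.
report = {
    "history": "Mother had breast cancer.",
    "diagnosis": "No evidence of breast cancer.",
}
anywhere = find_term(report, "breast cancer")
diagnosis_only = find_term(report, "breast cancer", regions=["diagnosis"])
```

A keyword search reports both mentions alike, while the region-restricted call returns only the one in the diagnosis. Deciding what that mention means (here it is negated) is the further disambiguation step listed on the slide.
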
  8. Approaches to dealing with different data sources
     • Integrate the data together into a data warehouse
       – Extract, Transform and Load each data source into a new database
       – Multiple copies of the data
       – Data normalisation can be difficult and challenging
       – Time consuming and expensive process
       – Most database vendors take this approach
       – Allows users to perform a single search across all the content
     • Leave the data where it is, federated content
       – Data remains in its original form and location
       – Multiple data types
       – Multiple network locations
       – Single search across multiple different data sources
  9. How do we get to Federated Text Mining?
     Diagram labels: Data Normalisation; Link the Content Servers; Merge Results; Federated Text Mining
  10. Data Normalisation – Virtual Indexes
     Diagram labels: Pathology Reports Index; Journal Abstracts Index; Virtual Index
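
The slide gives only the diagram labels, so the following is one possible reading of a virtual index rather than a description of how I2E implements it: a thin layer that presents several existing indexes as one without copying their contents. The `Index` and `VirtualIndex` classes and the sample documents are hypothetical.

```python
from typing import Dict, List, Set


class Index:
    """Toy inverted index: maps each lower-cased word to the ids of documents containing it."""

    def __init__(self, name: str, documents: Dict[str, str]) -> None:
        self.name = name
        self.postings: Dict[str, Set[str]] = {}
        for doc_id, text in documents.items():
            for word in text.lower().split():
                self.postings.setdefault(word, set()).add(doc_id)

    def lookup(self, word: str) -> Set[str]:
        return self.postings.get(word.lower(), set())


class VirtualIndex:
    """Presents several member indexes as one without copying their contents."""

    def __init__(self, members: List[Index]) -> None:
        self.members = members

    def lookup(self, word: str) -> Set[str]:
        # Prefix each id with its source index so ids from different indexes cannot collide.
        return {f"{m.name}/{doc_id}" for m in self.members for doc_id in m.lookup(word)}


# Made-up content for the two indexes named on the slide.
pathology = Index("pathology", {"p1": "invasive ductal carcinoma", "p2": "benign lesion"})
abstracts = Index("abstracts", {"a1": "carcinoma gene expression study"})
virtual = VirtualIndex([pathology, abstracts])
hits = virtual.lookup("carcinoma")
```

Because the virtual index delegates each lookup to its members, the underlying indexes can be rebuilt or replaced independently.
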
  11. Data Normalisation – Document Structure
     Diagram labels: Pathology Reports; Journal Abstracts
  12. Data Normalisation - Entities
     Diagram labels: Journal Abstracts; Pathology Reports; Combined (Normalized)
  13. Linking Content Servers
  14. Linked Servers Development Status
     • I2E 4.1 introduced a new feature – Linked Server
     • One I2E server can be linked to another I2E server
     • Provides access to remote and local indexes and queries through a single I2E interface (Linked Servers)
       – Indexes and queries on remote servers on the network appear the same as local indexes
  15. I2E 4.1 Linked Servers
     Diagram labels: I2E Enterprise on Customer network; I2E OnDemand SaaS Infrastructure; In-house Indexes; I2E OnDemand Standard Indexes; I2E Enterprise Access; Custom Indexes; Access via Linked Servers; Access via single UI
  16. Merging Results (Part I): Single Server, Multiple Queries
  17. I2E 3.0 (2009) – Merging Results (part I) from one server
     Profiling Individuals
     • Example from news reports related to pharmaceutical industry
     • Pick up properties from one document or many
     © Linguamatics 2012 - Customer Confidential
  18. I2E 3.0 – Merging Results (part I) from one server
     Diagram labels: Document Identifier; Patient information; Disease history; Patient data; Medications and dosages; Hit displayed in context
     © Linguamatics 2013 - Confidential
  19. Merging Results (Part II): Multiple Servers, Multiple Queries
  20. Each Server supplying separate set of results
     Diagram labels: Content Server 1; Content Server 2; Content Server 3; Content Server 4; Merge into a single set of results
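
The merge step on this slide can be sketched as combining per-server rows into one result set. The row shape (entity, document id), the server names, and the sample rows are assumptions for illustration; the sketch also assumes entities have already been normalised so that the same entity carries the same key on every server.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each server returns rows of (normalised entity, document id); the row shape is an assumption.
Row = Tuple[str, str]


def merge_results(per_server: Dict[str, List[Row]]) -> Dict[str, List[str]]:
    """Combine per-server rows into one result set keyed by entity.

    Each value lists the supporting documents as "server:doc_id", sorted
    and de-duplicated so that repeated rows collapse to a single entry.
    """
    merged = defaultdict(set)
    for server, rows in per_server.items():
        for entity, doc_id in rows:
            merged[entity].add(f"{server}:{doc_id}")
    return {entity: sorted(docs) for entity, docs in sorted(merged.items())}


# Made-up rows from two servers, including one duplicate row.
per_server = {
    "server1": [("BRCA1", "doc7"), ("TP53", "doc9")],
    "server2": [("BRCA1", "label3"), ("BRCA1", "label3")],
}
single = merge_results(per_server)
```

Keeping the server name in each document reference preserves provenance after the merge, which matters when the sources differ in structure and ownership.
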
  21. The Road to Federated Text Mining
  22. Linking Content Servers
  23. I2E 4.0: Multiple Clients, Multiple Results
     Diagram labels: I2E Server 2; FDA Drug Labels; I2E Server 1; Internal Documents; external network; internal network
  24. I2E 4.1/4.2: Single Client, Multiple Results
     Diagram labels: I2E Server 2; FDA Drug Labels; I2E Server 1; Internal Documents; external network; internal network; Linked server
  25. Merging Results (Part II)
  26. Q4 2014: Single Client, Single Result, Multiple Servers
     Diagram labels: I2E Server 2; FDA Drug Labels; I2E Server 1; Internal Documents; external network; internal network; Linked server
  27. Q4 2014: Federated Text Mining Example
     • Single Query
     • Differently structured data sources on different servers
       – Journal Articles (PubMed Central) on Enterprise Server
       – MEDLINE on I2E OnDemand
     • Single set of results
  28. The Road to Federated – Are we there yet?
     Timeline: I2E 4.0 (Dec 2012); I2E 4.1 (October 2013); Next release: in Development (Q4 2014)
     Stage labels: Merging the Results (part II); Data Normalisation; Linking Content Servers
  29. Demo
  30. Demo
     Diagram labels: Cambridge; VPN; Nice; Linked Server; Journal Abstracts; Pathology Reports
  31. Thank you
