II-SDV 2013 Large scale Application of Text Mining and Visualization in the EU Fusepool Project

Transcript

  • 1. Text Analytics in the EU Fusepool project
      II-SDV ’13, Nice, April 15, 2013
      Anton Heijs, CTO, anton@treparel.com
      Treparel, Delftechpark 26, 2628 XH Delft, The Netherlands, www.treparel.com
  • 2. Agenda
      • Introduction of the EU Fusepool project, Treparel and the consortium partners
      • Background on objectives of the Fusepool project
      • User-adaptive system
      • Data pooling and linking
      • Large-scale machine learning and text analytics
      • Summary
      Treparel KMX – All rights reserved 2013, www.treparel.com
  • 3. The EU Fusepool Consortium
      Treparel (Delft, The Netherlands) is a global software provider in Big Data Text Analytics and Visualization. Global companies, government agencies, software vendors and data publishers use Treparel KMX text analysis software to gain faster, more reliable and more precise insights into large, complex unstructured data sets (such as application notes, blogs, email and patents), allowing them to make better-informed decisions. As part of the Fusepool consortium, Treparel integrates the KMX Text Analytics technology for advanced classification, clustering and visualization of large, complex document collections.
  • 4. The Big Picture
      Each partner provides critical building blocks:
      • BUAS: Dynamic user interfaces
      • ENOLL: End users with specific needs
      • GEOX: NER, UI design
      • SEARCHBOX: text search and similarity semantic matching
      • TREPAREL: text/patent classification engines
      • XEROX: user-adaptive learning know-how
      BUT BREAK-THROUGH BY INTEGRATION
  • 5. Work packages
      • WP1: Reqs & Testing
      • WP2: Platform, sourcing
      • WP3: Extract & enrich
      • WP4: Find & match
      • WP5: GUI & Visuals
      • WP6: User Involvement
      [Diagram: the work packages are connected by "Learning" and "Feedback" arrows]
      The EC Fusepool project is a 2-year EU project that started in July 2012.
  • 6. Fusing and pooling information for product development
      • Vision: User-adaptive system
      • Living Lab: Rapid app development
      • Data processing: Sourcing & interlinking
      • Machine learning: Matching & optimizing
      Integrated Use Cases
  • 7. Background
      • SMEs need technology intelligence for detecting and responding to opportunities and threats
          – This is partly driven by the growth and complexity of patents and lawsuits
      • Consumer intelligence to detect opinions and needs of consumers for product development
      • Open innovation requiring cooperation (links between data, e.g. finding business partners)
      • Focus: Machine Learning algorithms to improve matching
  • 8. User-adaptive system
      • Focus: monitor and learn specific needs and preferences of a user to align features, functionalities, and graphical interfaces
      • Adaptive: machine learning from crowd-sourcing (rather than rule-based)
      • User-aligned prioritization: more usable and customized interfaces, suggestions based on activity & user feedback
  • 9. User-adaptive matching
      • Main goal: automated user-adaptive matching of users to:
          – Patent analysis
          – Finding funding opportunities
          – Partner matching
      • Key asset: information provided by the user
      • User Data Credo: accuracy improves with quantity and quality of user data while variety (breadth) increases with number of users
      • Living lab: Co-creation between creators and consumers of the Fusepool platform
  • 10. Data sourcing
      • Sources: internal & external content from web harvesting and structured data sources
          – using content databases and linked open data
      • Scope: initial data corpus includes all explicitly in- and excluded sources
      • Gained value from Information: recommendations based on machine learning from feedback
  • 11. Data handling
      1. Text analysis and feature extraction: ML & NLP methods for categorizing, named entity extraction, etc.
      2. Shared metadata models: mapping text features to existing/custom ontologies and generation of semantic triplets
          → High-level abstraction & persistence for reuse
          → Lightweight storage: mostly metadata only, text indexing and abstraction uses schema-free key-value (enabling actionable facets)
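The mapping from extracted text features to semantic triplets and key-value facets can be pictured with a minimal sketch. The slides do not describe Fusepool's actual data model, so the `Triple` type, the `ex:` predicate names and the `doc:1122` identifier below are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A subject-predicate-object statement (hypothetical model, not Fusepool's)."""
    subject: str
    predicate: str
    obj: str


def features_to_triples(doc_uri, features):
    """Turn extracted features ({predicate: [values]}) into triples about one document."""
    triples = []
    for predicate, values in sorted(features.items()):
        for value in values:
            triples.append(Triple(doc_uri, predicate, value))
    return triples


def build_facets(triples):
    """Index triples as schema-free key-value pairs: (predicate, value) -> set of documents."""
    facets = {}
    for triple in triples:
        facets.setdefault((triple.predicate, triple.obj), set()).add(triple.subject)
    return facets


# Illustrative usage with made-up feature names.
example_triples = features_to_triples(
    "doc:1122", {"ex:mentions": ["LTE"], "ex:category": ["A61B"]}
)
example_facets = build_facets(example_triples)
```

Only metadata is stored here: the facet index holds document identifiers and feature values, not the document text itself.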
  • 12. Data interlinking
      • Contextualize: terms are interlinked with same and similar terms across sources:
          – Enrich the extracted content with existing information available on the Internet
          – Interlink as much information as possible to increase the value of knowledge extraction
          – Use available public sector resources in Semantic Web and LOD format
      • Metadata: when a user uploads texts to be matched with other content, only the metadata descriptors are transmitted
      • Data privacy: data fusion from diverse sources without endangering user privacy
  • 13. Searching & finding
      • Key search-oriented features:
          – Search through all content in the data pool
          – Faceted search (categories, metadata, entities)
          – Integration of Linked Open Data (LOD) results
          – Cross-lingual indexing and cross-referencing
          – “Did you mean?” functionality in case of typos and auto-completion of search queries
      • User-adaptive: indexing and integration based on users' needs (e.g. user profiling)
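As an illustration of the “Did you mean?” and auto-completion features, here is a minimal sketch using only the Python standard library. It is not the Fusepool or SearchBox implementation; the vocabulary, the function names and the `cutoff` value are assumptions chosen for the example.

```python
import difflib
from bisect import bisect_left


def did_you_mean(query, vocabulary, cutoff=0.75):
    """Suggest the closest known term for a probable typo, or None if no suggestion applies."""
    if query in vocabulary:
        return None  # the query is already a known term
    matches = difflib.get_close_matches(query, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def autocomplete(prefix, sorted_vocabulary, limit=5):
    """Return up to `limit` terms starting with `prefix` from an alphabetically sorted list."""
    start = bisect_left(sorted_vocabulary, prefix)
    completions = []
    for term in sorted_vocabulary[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        completions.append(term)
        if len(completions) == limit:
            break
    return completions


# Hypothetical index vocabulary, kept sorted for the prefix lookup.
VOCABULARY = sorted(["patent", "pesticide", "platform", "solvent"])
```

A production system would typically draw the vocabulary from the search index and rank suggestions by term frequency; this sketch only shows the matching step.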
  • 14. Adaptation & refinement
      • Adaptive search: results are aligned to user preferences based on analysis of user implicit and explicit feedback
      • Multi-task ranking: good trade-off between user-independent search (high coverage but lower precision) and a very customized approach
      • Query intent discovery: analysis of the query structure and interlinking of queries
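The trade-off between user-independent and customized ranking can be pictured as a weighted blend of two scores. This is a deliberately simple stand-in, not the multi-task ranking method used in the project; the `alpha` parameter and the score dictionaries are assumptions.

```python
def blended_ranking(global_scores, user_scores, alpha=0.5):
    """Rank documents by a mix of global and user-specific relevance.

    alpha = 0.0 gives the user-independent ranking,
    alpha = 1.0 gives the fully customized ranking.
    Documents missing from one score table get 0.0 there.
    """
    documents = set(global_scores) | set(user_scores)
    combined = {
        doc: (1.0 - alpha) * global_scores.get(doc, 0.0)
        + alpha * user_scores.get(doc, 0.0)
        for doc in documents
    }
    # Highest combined score first; ties broken by document id for determinism.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))
```

With little feedback from a user, a small `alpha` keeps the broad coverage of the global ranking; as implicit and explicit feedback accumulates, `alpha` can be raised.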
  • 15. Correlating & matching
      • Search guided navigation: semantic matching extracts contextual relationships to list related content
          – suggestions organized by categories
          – exposing facets within related content
      • Distributed rule and event model: defines states, actions, and consequences (e.g. notifications, visualizations) for reasoning based on light-weight ontologies
  • 16. Crowd sourcing & supervised automation
      • Relational learning: related instances are used to reason about the focal instance
          – Rationality of content (links to other content, people, etc.) provides rich information
          – Similarities/dissimilarities to other content are established purely on relational properties
  • 17. Document level analysis using the KMX technology
      [Diagram labels: Visualization; Clustering; Classification; Text Preprocessing and Indexing; Acquire documents; Present Results; Taxonomies, Ontologies; Semantic Analysis; Query & Search Tools]
      KMX unique functions:
      • Extract concepts in context using clustering and classification of documents
      • Use classification to create ranked lists and to tag subsets
      • Support of binary and multi-class classification
      • Integration with other applications through KMX API
  • 18. NER in the landscaping
  • 19. Sentence level analysis using Named Entity Recognition
      • The aim of NER is to identify entities in unstructured text documents.
          o To locate: mark up the entities
          o To classify: into predefined categories/domains
      • Aim of usage
          o To recognise trends (trained and new high frequency)
          o To find all "trained" entities
          o To "discover" new entities
      • NER approaches
          o Statistics based (supervised machine learning)
          o Rule based (regular expressions)
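The rule-based approach can be sketched with regular expressions that both locate entities (character offsets) and classify them (one pattern per category). The categories and patterns below are invented for illustration and are far cruder than a trained statistical model.

```python
import re

# Hypothetical rules: one regular expression per entity category.
RULES = {
    "TELECOM": re.compile(r"\b(?:LTE|GSM)(?: [A-Z]{2,})?\b"),
    "ELEMENT": re.compile(r"\b(?:argon|hydrogen|helium)\b", re.IGNORECASE),
}


def find_entities(text, rules=RULES):
    """Locate and classify entities; returns (start, end, surface_text, label) tuples."""
    entities = []
    for label, pattern in rules.items():
        for match in pattern.finditer(text):
            entities.append((match.start(), match.end(), match.group(), label))
    return sorted(entities)  # ordered by position in the text


example = find_entities("The LTE TDD cell uses argon shielding.")
```

Rules like these only find what they were written for, which is why a statistics-based model is needed to "discover" entities that were not anticipated.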
  • 20. Building a NER model
  • 21. 3 NER model examples
      Training text: 500 patents from EPO
      Training: model building by Stanford NER
      • LTE (long term evolution)
          • F1: 88%
          • New entities: "GSM BSS", "LTE TDD"
          • False pos: "Location Area LA"
      • Elements
          • F1: 98%
          • False pos: "argon/hydrogen"
      • Cancer
          • F1: 87%
          • New entities: "myeloma cell", "tumor tissue"
          • False pos: "cell mortality", "the test compound"
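For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from entity counts; the counts in the usage line are hypothetical and were chosen only to produce a score of the same size as the LTE model's 88%. The slides do not report the underlying counts.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 44 correct entities, 4 false positives, 8 missed entities.
example_f1 = f1_score(44, 4, 8)
```

False positives such as "Location Area LA" lower precision, while trained entities the model misses lower recall; F1 penalizes both.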
  • 22. Using NER in a GUI
  • 23. Combining text analysis approaches
      [Diagram with steps numbered 1 to 11; recoverable labels:]
      • Raw text
      • Extract terms indicating domains using patent classification codes
      • Content / Entity Hub
      • Generate Vector Space Model → Document Vectors
      • Generate Classifiers to estimate domains → Domain labels, Training Vectors
      • Annotated text with domain labels
      • Input example: DocId: 1122; Title: Pesticide device; Classification Code: A61B; Text: Hihaho etc.
      • Table: A61: pesticide; A61B: pesticide solvent; etc.
      • Many Vectors: Doc-Id: 1122, 750 terms + weights
      • Classifier on pesticide solvents: 25 positive vectors (Doc-Ids, label, 750 terms + weights); 25 negative vectors (Doc-Ids, label, 750 terms + weights)
      • Output example: Doc-Id: 1122; Labels + scores; Title: Pesticide device; Classification Code: A61B; Text: Hihaho etc.
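The pipeline on this slide (documents to weighted term vectors, then a classifier trained on positive and negative example vectors) can be sketched in plain Python. The slides do not say which classifier KMX uses, so a nearest-centroid score with TF-IDF weights and cosine similarity is swapped in here as an assumption; all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a vector space model: one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times a smoothed inverse document frequency.
        vectors.append(
            {t: c * (math.log(n_docs / doc_freq[t]) + 1.0) for t, c in counts.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average a list of sparse vectors into one representative vector."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def domain_score(vec, positive_centroid, negative_centroid):
    """Positive score: closer to the positive examples; negative: closer to the negatives."""
    return cosine(vec, positive_centroid) - cosine(vec, negative_centroid)
```

In the slide's terms, the positive and negative centroids would be built from the 25 positive and 25 negative training vectors, and `domain_score` would supply the "labels + scores" attached to each document.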
  • 24. Software architecture
      • Data sourcing: retrieves data from data sources via data integrator
      • Storage: raw (e.g. text) or processed data stored in database or triple store
      • Using ML to enable learning from the crowd
      • Push & Pull:
          – interface to consumers (web and mobile apps)
          – supporting quality and access control from portal
      • Portal: business logic, storage of registered data sources, access control using web frontend
  • 25. Open Call for Users
      • Applications received from Finland, Germany, Spain, Greece, Bulgaria, Hungary, Italy, UK, Belgium, China, Switzerland, France, Denmark, Portugal, Ireland and The Netherlands
      • Business areas covered:
          – Bio-medical
          – Pharma and biotech
          – ICT/Telecommunications
          – Digital media
          – Renewable energies
          – Education
          – Innovation/consultancy services for SMEs
      • Mix of profiles from SMEs, Research, SME intermediaries (incubators, Science parks, Living Labs, etc.), developers and more
      • Multiple areas of research, development and innovation are covered.
  • 26. EC perspective
      • Data as a service: economies of scale in the management of data
      • Data pooling: processes need timely aggregation and redistribution of diverse data, but building their own pool is redundant and prohibitive for SMEs
          – provide services on top of pool with high quality data
          – provide access to services on demand
      • Success criteria:
          – Early provision of scalable basic Fusepool services
          – SME involvement and uptake of Fusepool services
          – Machine learning for data curation & user adaptation
      • Required steps:
          – Stepwise integration of existing & new technologies
          – Early and ongoing feedback from end users
  • 27. Summary
      The EC FP7 program Fusepool is all about:
      • Building a platform with web-enabled services for
          • Data pooling
          • Large-scale text analysis
          • Large-scale machine learning of user input
      • Enabling SMEs with analytics for improving their innovation and competitive strengths using
          • SMEs' involvement and feedback to the Fusepool services
          • Machine learning for data curation & user adaptation
      Welcome to visit Fusepool at: www.fusepool.net