Treparel
Delftechpark 26
2628 XH Delft
The Netherlands
www.treparel.com
Text Analytics
in the
EU Fusepool
project
II-SDV ’...
Agenda	
  
•  Introduc.on	
  of	
  the	
  EU	
  Fusepool	
  project,	
  Treparel	
  and	
  the	
  
consor.um	
  partners	
...
The	
  EU	
  Fusepool	
  Consor.um	
  
Treparel (Delft, The Netherlands) is a global software provider in Big Data Text An...
Each	
  partner	
  provides	
  cri.cal	
  building	
  blocks:	
  
  BUAS:	
  Dynamic	
  user	
  interfaces	
  
  ENOLL:	...
WP4
Find &
match
WP5
GUI &
Visuals
WP2
Platform,
sourcing
WP3
Extract &
enrich
WP1
Reqs &
Testing
WP6 User Involvement
Lea...
Fusing	
  and	
  pooling	
  informa.on	
  	
  
for	
  product	
  development	
  
Vision: User-adaptive system
Living Lab: ...
Background	
  
•  SMEs	
  have	
  a	
  need	
  for	
  technology	
  intelligence	
  for	
  detec.ng	
  
and	
  responding	...
User-­‐adap.ve	
  system	
  
•  Focus:	
  monitor	
  and	
  learn	
  specific	
  needs	
  and	
  preferences	
  
of	
  a	
 ...
User-­‐adap.ve	
  matching	
  
•  Main	
  goal:	
  automated	
  user-­‐adap.ve	
  matching	
  of	
  users	
  
to:	
  
–  P...
Data	
  sourcing	
  
•  Sources:	
  internal	
  &	
  external	
  content	
  from	
  web	
  harves.ng	
  
and	
  structured...
Data	
  handling	
  
1.  Text	
  analysis	
  and	
  feature	
  extracDon:	
  ML	
  &	
  NLP	
  methods	
  
for	
  categori...
Data	
  interlinking	
  
•  Contextualize:	
  terms	
  are	
  interlinked	
  with	
  same	
  and	
  
similar	
  terms	
  a...
Searching	
  &	
  finding	
  
•  Key	
  search-­‐oriented	
  features:	
  
–  Search	
  through	
  all	
  content	
  in	
  ...
Adapta.on	
  &	
  refinement	
  
•  AdapDve	
  search:	
  results	
  are	
  aligned	
  to	
  user	
  preferences	
  
based	...
Correla.ng	
  &	
  matching	
  
•  Search	
  guided	
  navigaDon:	
  seman.c	
  matching	
  extracts	
  
contextual	
  rel...
Crowd	
  sourcing	
  &	
  	
  
supervised	
  automa.on	
  
•  RelaDonal	
  learning:	
  related	
  instances	
  are	
  use...
Visualiza.on	
  
Clustering	
   Classifica.on	
  
Text	
  Preprocessing	
  and	
  Indexing	
  
Acquire	
  documents	
  
Pre...
Treparel KMX – All rights reserved 2013 18www.treparel.com
NER	
  in	
  the	
  landscaping	
  	
  
Sentence	
  level	
  analysis	
  using	
  	
  
Named	
  En.ty	
  Recogni.on	
  
•  The	
  aim	
  of	
  NER	
  is	
  to	
  ...
Building	
  a	
  NER	
  model	
  
Treparel KMX – All rights reserved 2013 20www.treparel.com
3	
  NER	
  model	
  examples	
  
  Training	
  text:	
  500	
  patents	
  from	
  EPO	
  
  Training:	
  model	
  build...
Using	
  NER	
  in	
  a	
  GUI	
  
Treparel KMX – All rights reserved 2013 22www.treparel.com
Raw tex
Extract	
  terms	
  indica.ng	
  
domains	
  using	
  patent	
  
classifica.on	
  codes	
  
Content	
  /	
  En.ty	
...
•  Data	
  sourcing:	
  retrieves	
  data	
  from	
  data	
  sources	
  via	
  data	
  
integrator	
  
•  Storage:	
  raw	...
Open	
  Call	
  for	
  Users	
  
•  ApplicaDons	
  received	
  from	
  Finland,	
  Germany,	
  Spain,	
  Greece,	
  
Bulga...
•  Data	
  as	
  a	
  service:	
  scale	
  economies	
  of	
  scale	
  in	
  management	
  of	
  data	
  	
  
•  Data	
  p...
The	
  EC	
  FP	
  7	
  program	
  Fusepool	
  is	
  all	
  about:	
  	
  
•  Building	
  a	
  plaWorm	
  with	
  web	
  e...
Upcoming SlideShare
Loading in …5
×

II-SDV 2013 Large scale Application of Text Mining and Visualization in the EU Fusepool Project

275 views
220 views

Published on

Published in: Internet
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
275
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
4
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

II-SDV 2013 Large scale Application of Text Mining and Visualization in the EU Fusepool Project

  1. 1. Treparel Delftechpark 26 2628 XH Delft The Netherlands www.treparel.com Text Analytics in the EU Fusepool project II-SDV ’13 - Nice Anton Heijs CTO anton@treparel.com April 15, 2013
  2. 2. Agenda   •  Introduc.on  of  the  EU  Fusepool  project,  Treparel  and  the   consor.um  partners   •  Background  on  objec.ves  of  the  Fusepool  project   •  User  adap.ve  system   •  Data  pooling  and  linking   •  Large  scale  machine  learning  and  text  analy.cs   •  Summary   Treparel KMX – All rights reserved 2013 2www.treparel.com
  3. 3. The  EU  Fusepool  Consor.um   Treparel (Delft, The Netherlands) is a global software provider in Big Data Text Analytics and Visualization. Global companies, government agencies, software vendors or data publishers are using Treparel KMX text analysis software to gain faster, reliable, precise insights in large complex unstructured data sets (like application notes, blogs, email and patents) allowing them to make better informed decisions. As part of the Fusepool consortium Treparel integrates the KMX Text Analytics technology for advanced classification, clustering and visualization of large complex document collections. Treparel KMX – All rights reserved 2013 3www.treparel.com
  4. 4. Each  partner  provides  cri.cal  building  blocks:     BUAS:  Dynamic  user  interfaces     ENOLL:  End  users  with  specific  needs     GEOX:  NER,  UI  design     SEARCHBOX:  text  search  and  similarity  seman.c   matching     TREPAREL:  text/patent  classifica.on  engines     XEROX:  user-­‐adap.ve  learning  know-­‐how              BUT  BREAK-­‐THROUGH  BY  INTEGRATION   The  Big  Picture   Treparel KMX – All rights reserved 2013 4www.treparel.com
  5. 5. WP4 Find & match WP5 GUI & Visuals WP2 Platform, sourcing WP3 Extract & enrich WP1 Reqs & Testing WP6 User Involvement Learning Learning Learning LearningFeedback Work  packages   The EC Fusepool project is a 2 year EU project and started in July 2012 Treparel KMX – All rights reserved 2013 5www.treparel.com
  6. 6. Fusing  and  pooling  informa.on     for  product  development   Vision: User-adaptive system Living Lab: Rapid app development Data processing: Sourcing & interlinking Machine learning: Matching & optimizing Integrated Use Cases Treparel KMX – All rights reserved 2013 6www.treparel.com
  7. 7. Background   •  SMEs  have  a  need  for  technology  intelligence  for  detec.ng   and  responding  to  opportuni.es  and  threats   •  This  a  partly  driven  by  growth  and  complexity  of   patents  and  lawsuits   •  Consumer  intelligence  to  detect  opinions  and  needs  of   consumers  for  product  development   •  Open  innova.on  requiring  coopera.on  (links  between  data,   e.g.  finding  business  partners)   •  Focus:  Machine  Learning  algorithms  to  improve  matching     Treparel KMX – All rights reserved 2013 7www.treparel.com
  8. 8. User-­‐adap.ve  system   •  Focus:  monitor  and  learn  specific  needs  and  preferences   of  a  user  to  align  features,  func.onali.es,  and  graphical   interfaces   •  AdapDve:  machine  learning  from  crowd-­‐sourcing  (rather   than  rule-­‐based)   •  User-­‐aligned  prioriDzaDon:  more  usable  and  customized   interfaces,  sugges.ons  based  on  ac.vity  &  user  feedback   Treparel KMX – All rights reserved 2013 8www.treparel.com
  9. 9. User-­‐adap.ve  matching   •  Main  goal:  automated  user-­‐adap.ve  matching  of  users   to:   –  Patent  analysis   –  Finding  funding  opportuni.es   –  Partner  matching   •  Key  asset:  informa.on  provided  by  the  user     •  User  Data  Credo:    accuracy  improves  with  quan.ty  and   quality  of  user  data  while  variety  (breadth)  increases   with  number  of  users   •  Living  lab:  Co-­‐crea.on  between  creators  and  consumers   of  the  Fusepool  plaWorm   Treparel KMX – All rights reserved 2013 9www.treparel.com
  10. 10. Data  sourcing   •  Sources:  internal  &  external  content  from  web  harves.ng   and  structured  data  sources   –  using  content  databases  and  linked  open  data   •  Scope:  ini.al  data  corpus  includes  all  explicitly  in-­‐  and   excluded  sources   •  Gained  value  from  InformaDon:  recommenda.ons   based  on  machine  learning  from  feedback   Treparel KMX – All rights reserved 2013 10www.treparel.com
  11. 11. Data  handling   1.  Text  analysis  and  feature  extracDon:  ML  &  NLP  methods   for  categorizing,  named  en.ty  extrac.on,  etc.   2.  Shared  metadata  models:  mapping  text  features  to   exis.ng/custom  ontologies  and  genera.on  of  seman.c   triplets   →  High-­‐level  abstrac.on  &  persistence  for  reuse   →  Lightweight  storage:  mostly  metadata  only,  text   indexing  and  abstrac.on  uses  schema-­‐free  key-­‐ value  (enabling  ac.onable  facets)   Treparel KMX – All rights reserved 2013 11www.treparel.com
  12. 12. Data  interlinking   •  Contextualize:  terms  are  interlinked  with  same  and   similar  terms  across  sources:   –  Enrich  the  extracted  content  with  exis.ng  informa.on  available   in  the  Internet   –  Interlink  as  much  informa.on  as  possible  to  increase  the  value  of   knowledge  extrac.on   –  Use  available  public  sector  resources  in  Seman.c  Web  and  LOD   format   •  Metadata:  when  a  user  uploads  texts  to  be  matched  with   other  content,  only  the  metadata  descriptors  are   transmiYed   •  Data  privacy:  data  fusion  from  diverse  sources  without   endangering  user  privacy   Treparel KMX – All rights reserved 2013 12www.treparel.com
  13. 13. Searching  &  finding   •  Key  search-­‐oriented  features:   –  Search  through  all  content  in  the  data  pool   –  Faceted  search  (categories,  metadata,  en..es)   –  Integra.on  of  Linked  Open  Data  (LOD)  results   –  Cross-­‐lingual  indexing  and  cross-­‐referencing   –  “Did  you  mean?”-­‐func.onality  in  case  of  typos  and  auto-­‐ comple.on  of  search  queries   •  User-­‐adapDve:  indexing  and  integra.on  based  on  users   needs  (e.g.  user  profiling)   Treparel KMX – All rights reserved 2013 13www.treparel.com
  14. 14. Adapta.on  &  refinement   •  AdapDve  search:  results  are  aligned  to  user  preferences   based  on  analysis  of  user  implicit  and  explicit  feedback     •  MulD-­‐task  ranking:  good  trade-­‐off  between  user-­‐ independent  search  (high  coverage  but  lower  precision)   and  a  very  customized  approach   •  Query  intent  discovery:  analysis  of  the  query  structure   and  interlinking  of  queries   Treparel KMX – All rights reserved 2013 14www.treparel.com
  15. 15. Correla.ng  &  matching   •  Search  guided  navigaDon:  seman.c  matching  extracts   contextual  rela.onships  to  list  related  content   –  sugges.ons  organized  by  categories   –  exposing  facets  within  related  content   •  Distributed  rule  and  event  model:  defines  states,   ac.ons,  and  consequences  (e.g.  no.fica.ons,   visualiza.ons)  for  reasoning  based  on  light-­‐weight   ontologies   Treparel KMX – All rights reserved 2013 15www.treparel.com
  16. 16. Crowd  sourcing  &     supervised  automa.on   •  RelaDonal  learning:  related  instances  are  used  to  reason   about  the  focal  instance   –  Ra.onality  of  content  (links  to  other  content,  people,  etc.)   provide  rich  informa.on   –  Similari.es/dissimilari.es  to  other  content  is  established  purely   on  rela.onal  proper.es   Treparel KMX – All rights reserved 2013 16www.treparel.com
  17. 17. Visualiza.on   Clustering   Classifica.on   Text  Preprocessing  and  Indexing   Acquire  documents   Present  Results   Taxonomies,   Ontologies   Seman.c  Analysis   Document  level  analysis   using  the  KMX  technology     KMX  unique  func.ons:   •  Extract  concepts  in  context   using  clustering  and   classifica.on  of  documents   •  Use  classifica.on  to  create   ranked  lists  and  to  tag  subsets   •  Support  of  binary  and  mul.-­‐ class  Classifica.on   •  Integra.on  with  other   applica.ons  through  KMX  API   Treparel KMX – All rights reserved 2012 www.treparel.com 17 Query & Search Tools
  18. 18. Treparel KMX – All rights reserved 2013 18www.treparel.com NER  in  the  landscaping    
  19. 19. Sentence  level  analysis  using     Named  En.ty  Recogni.on   •  The  aim  of  NER  is  to  iden.fy  en..es  in  unstructured  text   documents.   o  To  locate  :  mark-­‐up  the  en..es   o  To  classify  :  into  predefined  categories/domains   •  Aim  of  usage   o  To  recognise  trends  (trained  and  new  high  frequency)   o  To  find  all  „trained”  en..es   o  To  „discover”  new  en.tes   •  NER  approaches   o  Sta.s.cs  based  (supervised  machine  learning)   o  Rule  based  (regular  expressions)   Treparel KMX – All rights reserved 2013 19www.treparel.com
  20. 20. Building  a  NER  model   Treparel KMX – All rights reserved 2013 20www.treparel.com
  21. 21. 3  NER  model  examples     Training  text:  500  patents  from  EPO     Training:  model  building  by  Stanford  NER   •  LTE  (long  term  evoluDon)   •  F1:  88%   •  New  en..es:  „GSM  BSS”,  „LTE  TDD”   •  False  pos:  „Loca.on  Area  LA”   •  Elements   •  F1:  98%   •  False  pos:  „argon/hydrogen”   •  Cancer   •  F1:  87%   •  New  en..es:  „myeloma  cell”,  „tumor  .ssue”   •  False  pos:  „cell  mortality”,  „the  test  compound”   Treparel KMX – All rights reserved 2013 21www.treparel.com
  22. 22. Using  NER  in  a  GUI   Treparel KMX – All rights reserved 2013 22www.treparel.com
  23. 23. Raw tex Extract  terms  indica.ng   domains  using  patent   classifica.on  codes   Content  /  En.ty   Hub   Generate  Vector  Space  Model   Document Vectors Generate  Classifiers  to  es.mate  domains   Annotated tex with domain labels Domain labels Training Vectors DocId : 1122 Title : Pesticide device Classification Code : A61B Text: Hihaho Etc etc Table A61 : pesticide A61B : pesticide solvent Etc etc Many Vectors Doc-Id : 1122 750 terms + weights Classifier  on  pes.cide  solvents   25 Positive vectors Doc-Id’s Label 750 terms + weights 25 negative vectors Doc-Id’s Label 750 terms + weights Doc-Id : 1122 Labels + scores Title : Pesticide device Classification Code : A61B Text: Hihaho …Etc etc 1 2 3 4 5 6 7 8 9 10   11   Combining  text  analysis   approaches   Treparel KMX – All rights reserved 2013 23www.treparel.com
  24. 24. •  Data  sourcing:  retrieves  data  from  data  sources  via  data   integrator   •  Storage:  raw  (e.g.  text)  or  processed  data  stored  in   database  or  triple  store   •  Using  ML  to  enable  learning  from  the  crowd   •  Push  &  Pull:       interface  to  consumers  (web  and  mobile  apps)     suppor.ng  quality  and  access  control  from  portal   •  Portal:  business  logic,  storage  of  registered  data  sources,   access  control  using  web  frontend   Somware  architecture   Treparel KMX – All rights reserved 2013 24www.treparel.com
  25. 25. Open  Call  for  Users   •  ApplicaDons  received  from  Finland,  Germany,  Spain,  Greece,   Bulgaria,  Hungary,  Italy,  UK,  Belgium,  China,  Switzerland,  France,   Denmark,  Portugal,  Ireland  and  The  Netherlands     •  Business  areas  covered:     –  Bio-­‐medical,   –  Pharma  and  biotech,   –  ICT/Telecommunica.ons,     –  Digital  media,     –  Renewable  energies,     –  Educa.on,     –  Innova.on/consultancy  services  for  SMEs.     •  Mix  of  profiles  from  SMEs,  Research,  SME  intermediaries   (incubators,  Science  parks,  Living  Labs,  etc),  developers  and  more   •  MulDple  areas  of  research,  development  and  innova.on  areas  are   covered.   25Treparel KMX – All rights reserved 2013 25www.treparel.com
  26. 26. •  Data  as  a  service:  scale  economies  of  scale  in  management  of  data     •  Data  pooling:  processes  need  .mely  aggrega.on  and  redistribu.on   of  diverse  data  but  building  own  is  redundant  and  prohibi.ve  for   SMEs     provide  services  on  top  of  pool  with  high  quality  data     provide  access  to  services  on  demand   •  Success  criteria:     Early  provision  of  scalable  basic  Fusepool  services     SME  involvement  and  uptake  of  Fusepool  services     Machine  learning  for  data  cura.on  &  user  adapta.on   •  Required  steps:     Stepwise  integra.on  of  exis.ng  &  new  technologies     Early  and  ongoing  feedback  from  end  users   EC  perspec.ve   Treparel KMX – All rights reserved 2013 26www.treparel.com
  27. 27. The  EC  FP  7  program  Fusepool  is  all  about:     •  Building  a  plaWorm  with  web  enabled  services  for   •  Data  pooling   •  Large  scale  text  analysis   •  Large  scale  machine  learning  of  user  input   •  Enable  SME’s  with  analy.cs  for  improving  their   innova.on  and  compe..ve  strengths  using   •  SME’s  involvement  and  feedback  to  the  Fusepool  services   •  Machine  learning  for  data  cura.on  &  user  adapta.on   Summary   Treparel KMX – All rights reserved 2013 27www.treparel.com Welcome to visit Fusepool at: www.fusepool.net

×