• Like

2014: Treparel Big Data Text Analytics & Visualization

  • 1,735 views
Uploaded on

Text and content analytics have become a source of competitive advantage, enabling business, government agencies, and researchers to extract unprecedented value from unstructured data. …

Text and content analytics have become a source of competitive advantage, enabling business, government agencies, and researchers to extract unprecedented value from unstructured data.

Treparel (Delft, The Netherlands) is a independent provider of Text analytics and Visualization software. Organizations like Philips, Bayer, Abbott, NXP Semiconductors are using KMX Text Analytics software to gain faster, reliable, precise insights in large complex unstructured data sets.

The KMX API allows software and service companies to enhance their unstructured data analysis capabilities by embedding world class machine learning based clustering, categorization and visualization.

A recent review by IDC states: “KMX visualization capabilities around its auto-categorization and clustering offer immediate insight into unstructured data sets and appear to be adaptable and customizable to customer needs. Its approach to auto-categorization utilizes statistical principles and machine learning that require significantly less training and tuning on the part of customers than other approaches.”

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
1,735
On Slideshare
0
From Embeds
0
Number of Embeds
4

Actions

Shares
Downloads
29
Comments
0
Likes
3

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Introducing Treparel: Big Data Text Analytics & Visualization applications Treparel Delftechpark 26 2628 XH Delft The Netherlands www.treparel.com Jeroen Kleinhoven CEO jeroen@treparel.com February, 2014
  • 2. Industry  Thought  Leaders  about  Treparel   “Treparel  KMX’s  visualiza(on  capabili(es  around  its  auto-­‐categoriza8on   and  clustering  offer  immediate  insight  into  unstructured  data  sets  and   appear  to  be  adaptable  and  customizable  to  customer  needs.  Its  approach  to   auto-­‐categoriza8on  u8lizes  sta8s8cal  principles  and  machine  learning  that   require  significantly  less  training  and  tuning  on  the  part  of  customers  than   other  approaches.”  David  Schubmehl,  IDC   “As  we  acquire  more  and  more  informa8on,  we  need  tools  that  will  guide  us   through  the  data  maze.  Analysts  need  tools  to  help  them  understand   paGerns  and  define  clusters.    Users  need  to  explore  data  to  uncover   rela8onships  from  scaGered  sources.    Treparel’s  KMX  serves  both  these   needs  with  its  ability  to  cluster  and  categorize  collec8ons  of  data  with  a  high   degree  of  accuracy,  and  its  interac8ve  visualiza8on  tools  that  enable   explora8on  of  large  data  sets.”  Sue  Feldman,  Synthexis.com  (author:   The  Answer  Machine.   Treparel KMX – All Rights Reserved 2013 www.treparel.com 2
  • 3. Some  of  our  clients  &  partners   KMX  is  an  integral  part  of  our  IP  analysis  toolbox.  It  contributes  to  our   capability  of  making  added  value  IP  analyses  of  technologies  and   compe8tors  to  support  strategic  decision  making.   “We’ve  speed  up  our  patent  searches  from  2  days  to  2  hours  using   KMX  technology”   www.fusepool.eu Treparel KMX – All rights reserved 2014 3
  • 4. Key  Business  Problems  Treparel  KMX  solves   Applica'on  Area   Business  problem   Value   IP  &  Patent  Search   How  to  improve  the  Bme-­‐ consuming  and  costly  manual   search-­‐process  of  patents.   Reduce  research  Bme,  improve   precision  &  recall  of  relevant   documents.  Improve  legal  posiBon   and  drive  more  revenue  from  IP.   Compe''ve  Analysis     How  to  increase  knowledge  on   compeBtors  by  gaining   clustered  insights  from  (semi-­‐)   public  sources.   Improve  compeBBve  advantage  by   determining  internaBonal  strategy,   product  roadmap,  R&D  planning,   markeBng  campaigns  and  customer   senBment.   Healthcare     How  to  idenBfy  health  risks  and   find  correlaBons  in  deceases  or   medical  defects.   Early  idenBficaBon  on  health  risks  by   cross-­‐discipline  analyses  on  medical   records,  clinical  observaBons  and   medical  images.   Media  &  Publishing   How  to  improve  search  and   content  analyBcs  on  large   volumes  of  publicaBons.   Text  analyBcs  embedded  in  publishing   improves  relevance  and  accuracy  of   search  and  shows  previously  hidden   documents.   Treparel KMX – All Rights Reserved 2013 www.treparel.com 4
  • 5. Key  Business  Problems  Treparel  KMX  solves  -­‐  2   Use  Cases   Business  problem   Value   Sen'ment  Analysis   How  to  manage  current  and   future  customers  and  their   interacBons   Deriving  senBment  from  criBcal   customer-­‐based  text  sources  can   drive  revenue,  saBsfacBon  and   loyalty     Voice  of  Customer   Analyzing  HR-­‐related  informaBon   How  to  manage  communicaBons   (like  CVs  and  projects)  to  match   and  interacBons  with  employees,   demand  to  supply.   managers,  subordinates  and   employment  candidates   eDiscovery   How  to  manage  and  miBgate   general  liBgaBon  risk  and  cost  in   large  sets  of  text  and  emails.   Text  analyBcs  applied  to   legal  trials  or  in  laws  and   jurisprudence  improves  accuracy   in  legal  cases  and  lowers  costs.   Predic've  Analysis   How  to  idenBfy  early  signs  of   required  maintenance  that  affect   customer  saBsfacBon  and   operaBonal  costs   Use  customer  saBsfacBon  surveys   on  food  quality  to  idenBfy  airplane   ovens  requiring  maintenance  tune-­‐ ups   5
  • 6. Part  1:     KMX:  Ready  to  Use  Text  AnalyBcs     Intui8ve  Content  Clustering,   Classifica8on  &  Visualiza8on   Treparel KMX – All rights reserved 2014 www.treparel.com 6
  • 7. KMX  Text  AnalyBcs  ApplicaBon  overview   Query & Search Tools Acquire  documents   Text  Preprocessing  and  Indexing   Clustering   ClassificaBon   VisualizaBon   SemanBc  Analysis   KMX  unique  funcBons:   •  Extract  concepts  in  context   using  clustering  and   classificaBon  of  documents   •  Use  classificaBon  to  create   ranked  lists  and  to  tag  subsets   •  Support  of  binary  and  mulB-­‐ class  ClassificaBon   •  Enterprise  ediBon  (server/ cloud)  &  Professional  ediBon   (desktop)   •  IntegraBon  with  other   applicaBons  through  KMX  API   Taxonomies,   Ontologies   Present  Results   Treparel KMX – All rights reserved 2013 7
  • 8. Clustering:  User  Unsupervised  AnalyBcs   Benefits:  Get  quick  insights  through  automated  visual  clusters   with  annotaBons  to  enhance  the  discovery  process     1.  Analyze  the  clusters  and  the  relaBonships  in  the  data     2.  Explore  outliers  in  the  data   3.  Find  documents  of  interest   What  it  does:  A  visualizaBon  of  clusters  where  the  documents   are  displayed  as  points  and  the  distance  between  them  shows   their  similarity.       What  KMX  delivers:  Use  KMX  to  do:   1.  2.  3.  4.  Perform  text  preprocessing  (stemming/tokenizaBon  etc)   Calculate  between  all  documents  a  similarity  measure   Calculate  visualizaBon  (landscape)  with  automaBc  annotaBon   Create  the  visualizaBon     –  As  a  staBc  image   –  Or  provide  interacBon  where  the  user  can  zoom  in/out  with   support  for  adapBve  annotaBon   Treparel KMX – All rights reserved 2014 www.treparel.com 8
  • 9. ClassificaBon:  User  Supervised  AnalyBcs   Benefits:  Finding  fast,  accurate  and  precise  small  result  sets  and  enabling  trend   reporBng  and  AlerBng  by  reusing  predefined  categorizaBon  models.   1.  Obtain  a  ranked  list  of  the  most  relevant  documents     2.  Separate  the  important  documents  from  the  irrelevant  documents  (noise)   How  it  works:  A  list  of  the  relevant  documents  defined  from  a  users   perspecBve.       What  KMX  delivers:  Use  KMX  to  do:   1.  Tag  (label)  a  small  number  of  relevant  and  irrelevant  documents   –  Use  search  to  idenBfy  documents  that  need  to  be  tagged   –  Perform  manual  tagging   –  Select  documents  interacBve  from  the  visualizaBon  (brushing)   2.  Create  a  Classifier  (categorizer)  using  the  tagged  documents   3.  AutomaBcally  perform  the  classificaBon  on  all  documents     4.  Obtain  the  important  documents  as  ranked  high  and  the  irrelevant   documents  which  are  ranked  low   Treparel KMX – All rights reserved 2014 www.treparel.com 9
  • 10. VisualizaBon:  Discovering  Unexpected  Insights   Benefits:  KMX  VisualisaBons  are  supporBng     the  process  of  construcBng  a  visual  image     in  the  mind  to  understand  the  data  be_er.   How  it  works:  KMX  offers  a  visualizaBon  framework  with  various  methods  for   seeing  the  unseen.  It  enriches  the  process  of  discovery  and  fosters  profound   and  unexpected  insights.     What  KMX  delivers:  Different  visualizaBons  or  visual  pipelines  to:   •  Comprehend  large  datasets,  datasets  that  are  too  large  to  grasp  by  mental   imaginaBon.   •  Discover  previous  unknown  properBes  of  the  data  set  that  may  not  have   been  anBcipated   •  Reveal  inherent  problems  of  the  data,  for  instance  errors  and  artefacts   •  Examine  large-­‐scale  features  of  the  dataset  as  well  as  the  local  features  or   allows  the  user  to  see  local  features  in  a  larger  scale  reference   •  Let  users  form  hypothesis  based  on  the  (newly)  observed  phenomena  or   developed  insights     Treparel KMX – All rights reserved 2014 www.treparel.com 10
  • 11. Add-­‐on  servers:   Auto  ReporBng  &  Batch  ClassificaBon   •  Auto  Repor'ng  Server   –  Support  automated  analysis  for  aggregated   results  for  mulBple  users   –  Pie  &  bar  charts   –  Landscape  visualizaBons  for  overview  of   subjects   –  Enabling  rich  interacBon  via  web  interface   •  Classifica'on  Batch  Server   –  high-­‐performance  stand-­‐alone  text-­‐ classificaBon  server   –  Enables  large  scale  parallel  processing   Treparel KMX – All rights reserved 2014 Page 11 www.treparel.com 11
  • 12. Business  Value  from  Content  with  KMX   þ  Text  Analy'cs  for  Anyone  and  Everyone  –  IntuiBve  to  use  and  learn.  Designed   for  every  user:  business  (info  consumers)  and  scienBfic  (info  creators).   þ  Instant  Business  Insights  –  Explore  all  of  your  unstructured  data  (text,  blogs,   email,  patents)  without  limits.     þ  Rapid  Time  to  Value  -­‐  Adaptable  and  customizable  to  users  needs.  No   implementaBon  or  extensive  and  expensive  modelling  or  development.   Significant  less  training  and  tuning.       þ  Any  size  deployment  –  Meets  every  business  need  from  a  single  user  to  large   mulBlevel  type  user  groups.     þ  Language  independent  –  Search  and  analyze  most  of  the  world’s  languages   using  machine  translaBon.   þ  Any  kind  or  deployment  -­‐  Use  it  from  your  desktop  or  in  a  -­‐  private  -­‐    cloud.  Buy   the  socware-­‐as-­‐a-­‐service  or  get  the  output-­‐as-­‐a-­‐service.       þ  Enterprise-­‐proven,  IP  &  IT  friendly  –  Successfully  delivering  value  to  IP,  business   and  markets  in  mulBnaBonal  companies.   þ  Integra'on  –  Use  the  KMX  API  to  increase  the  value  of  unstructured  data  in  your   IP  discovery  infrastructure   www.treparel.com Treparel KMX – All rights reserved 2012 12
  • 13. Part  2:     KMX  socware:     User  Interface,  key  func8ons  &  value   Treparel KMX – All rights reserved 2014 www.treparel.com 13
  • 14. KMX  :  Model,  Analyse,  Discover  and  Visualize     in  one  view  and  deploy  it  to  large  scale   Search  and   highligh'ng   Brushing   Filtering   Document  text   Landscape  visualiza'on   www.treparel.com rights reserved 2014 Treparel KMX – All Coloring  of  classifica'on  score   14 KMX Example: ‘Ebola, SARS, Bird flue: How do they relate?’
  • 15. KMX  :  OpBmize  Output     using  ClassificaBon  Performance  Tuning   Precision   And     Recall   Document   classifica'on   for  three   classes   Distribu'on  of  classifica'on  scores   www.treparel.com rights reserved 2014 Treparel KMX – All 15
  • 16. Use  Case  1:  Performing  small  to  large  scale  SWOT   analysis  (on  AstraZeneca  patents)   SWOT  analysis  example     Start  with  removing  irrelevant   patents  using  Classifica8on  and   Filtering  to  determine:   •  Who  are  the  important  players   (assignees,  inventors)?   •  Where  are  the  important  patents   filed  (countries)?   •  What  is  the  trend  over  Bme  (growth   of  patents  over  the  years)?   •  NB:  we  used  a  (very)  simple  query  to   find  986  patents  filed  under   Astrazeneca.       Patent   Database   Queries   +10.000 patents Ranking   Filtering   Ranking   Filtering   986 patents 29 patents Ranking   Filtering   Business   User       Treparel KMX – All rights reserved 2014 Output
  • 17. Landscaping  and  Ranking:   From  986  to  the  most  relevant  patents   Fig: Using vlsual selection (brushing) to build a classification model (Classifier) to be able to rank the full data set and to extract the most relevant. 17
  • 18. Landscaping  and  Ranking:   What  are  most  relevant  Respiratory  &  Inflamma8on  patents?   Yellow = most important patents (+80% score) Blue = least relevant patents (for this analysis) NB: crosshair points to 1 specific patent (full text in left pane) Fig: Ranked patents using a Classifier for Respiratory & Inflammation patents (In yellow the selection of 29 18 absolute relevant patents to be further analyzed). We used ‘respiratory’ to demonstrate highlighting capabilities.
  • 19. How  Reliable  &  Accurate  are  the  results?   Review  your  results  with  advanced  performance  tools   The  quality  of  the  automaBc  classificaBon  (categorizaBon)  is  shown  in  the   histogram,  where  a  small  number  of  documents  with  a  high  classificaBon  score   are  separated  from  the  large  number  of  documents.   Fig: Classification performance 1280 patents on ‘biomass’ Non  relevant  documents   Relevant  documents   KMX  calculates  the  Precision  and  Recall  of  the  results  using  cross  validaBon.   • Precision  is  essenBal  for:  First  analysis  &  AlerBng  services   • Recall  is  crucial  for:  Freedom  to  Operate  search,  Validity  search  Patentability  search   • Both  need  to  be  high  for:  Patent  porkolio  landscape  analysis,  Technology  ExploraBon,  Risk  Assessments     19
  • 20. Use  Case  2:   Concept  detecBon  using  document  classificaBon   Extrac8ng  concepts  in  context  from  classifica8on  of  documents   1.  VisualizaBon  à  mulBple  topic   clusters   2.  Select  cluster  à  select  documents   with  similar  topics   3.  Select  training  documents  within   the  sub-­‐cluster   4.  Build  Classifier  and  classify   5.  Rank  documents  à  find  set  of   documents  with  related  concepts   6.  Extract  concepts   KMX Example: ‘Ebola, SARS, Bird flue: How do they relate?’ Treparel KMX – All rights reserved 2014 Page  20  |     20
  • 21. Part  3:     NEW:  Content  Dashboard  (InfoApp)   Integrated  SAAS  based  search,  repor8ng,   visualiza8on  and  analysis   Treparel KMX – All rights reserved 2014 www.treparel.com 21
  • 22.   Role  of  KMX  in  Integrated  InformaBon  ApplicaBons     Client/ Server Reporting Dashboard Informa'on   Consumers   (+  100  users)   Mobile Web Search Alerting Visualization Exploring Domain or Market Specific InfoApps (by Partners) Management, Development and Integration Text Mining Text PreP Creators/   Data  Scien'sts   (1-­‐5  users)   Stem/Token Tweets Documents Treparel KMX – All Rights Reserved 2013 Indexing Patent Data Clustering Classification Research Literature Enterprise Content jeroen@treparel.com Visualize Email Text Websites 22
  • 23. Content  Dashboard:     Content  Driven  AnalyBcal  solu8on   Ease of Use access to Search, Reporting & Analysis of content like Patents, Emails, Legislation, Application Notes, websites Treparel KMX – All rights reserved 2014 www.treparel.com 23
  • 24. Content  Dashboard:     Content  analyBcs  beyond  key-­‐word  search   Interactive taxonomy with multiple coupled views and advanced search in large sets of documents Treparel KMX – All rights reserved 2014 www.treparel.com 24
  • 25. Content  Dashboard:     Built  in  analy8cs  &  interac8ve  visualiza8ons   Ad-hoc or Standard interactive visualizations leading directly to the underlying documents or notes Treparel KMX – All rights reserved 2014 www.treparel.com 25
  • 26. Part  4:     NEW:  KMX  API  for  OEM  partners:   Put  best  in  class  content  analy8cs   in  your  solu8ons   Treparel KMX – All rights reserved 2014 www.treparel.com 26
  • 27. SoluBons  built  on  KMX   KMX Empowers InfoApps (solution partners/OEM/VAR) Partner solutions: •  IP & Patent Analytics •  Media & Publishing •  HR •  eDiscovery (Law & Legislation) •  Fraud Detection •  National Security & Police •  Sentiment analytics •  CRM/Voice of Customer •  Government •  Sharepoint (Enrich & Migrate) •  Content-based Dashboards KMX platform Big Data Text Analytics (cloud based platform / API) Fig 1. McKinsey diagram showing the three technology layers of the Big Data technology stack 27
  • 28. KMX  API  for  OEM:   Embed  Advanced  Text  AnalyBcs  in  your  soluBon   Clustering Provides users unsupervised analytics and automatically identifies inherent themes or information clusters. Classification Supervised analytics to help users automatically categorize large sets of documents. Through a dynamic hierarchical topic view into search results it enables users to quickly focus on annotated subjects rather than scrolling through long results lists. The Classification process can use a small number of documents sets for learn-byexample categorization. KMX API XML-RPC and REST (JSON) Python Pickle protocol Visualization Advanced visual knowledge discovery for displaying, exporting and sharing data results, ranked document lists, labeled and enriched data or interactive visualizations. Server: User / Tenant mgt User objects mgt (datasets, work spaces, classifiers, stop lists,.) Databases: Oracle, PostgreSQL Client Application: Native Windows (for creating Analysis pipelines) Using QT for GUI Using OpenGL for visualizations By sorting the content of documents by topic, relevancy and keywords users can apply their own models or rules for classification. Terms can be extracted to use in building thesauri or taxonomies. Example Applications Areas Advanced Visualizations, Interactive Analytics, Text Disambiguation, Data Enrichment, Clickthrough Optimization, Concept Extraction, Automated Tagging, Semantic Discovery, Named Entity Recognition Document Overlap Display, SWOT analysis, Sentiment Analysis, Predictive Analytics
  • 29. KMX enables information and knowledge professionals to gain faster, reliable, more precise insights in large complex unstructured data sets allowing them to make better informed decisions. Treparel is a leading technology solution provider in Big Data Text Analytics & Visualization