Content Processing Architecture and Applications - Introduction to Text Mining

Introduction to text mining. Presented by Paweł Wróblewski & Marcin Goss at Warsaw University of Technology.


  1. 1. CONTENT PROCESSING ARCHITECTURE AND APPLICATIONS Introduction to text mining – Warsaw University of Technology
  2. 2. Plan
     •  Findwise: who we are, what we do
     •  What is content?
     •  Why content processing is important
     •  Content processing and information retrieval
     •  Technology for content processing
     •  Methods for content processing
     •  Examples of usage
  3. 3. Findwise – Search Driven Solutions
     •  Founded in 2005
     •  Offices in Sweden, Denmark, Norway, Poland and Australia
     •  90 employees
     Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value.
     •  Paweł Wróblewski & Marcin Goss
  4. 4. WHAT IS CONTENT?
  5. 5. Content ≥ Information
     From the business point of view, INFORMATION is the key to success.
     “Information can only be an asset when it enables a task to be completed.”
     “The value is in the outcome of the task, not in the information itself.”
     Martin White
     Employee productivity (The hidden cost… IDC April 2006): “the cost for wasted time on the part of professional searching, but not finding relevant information, amounts to $5.3 million annually for an enterprise with 1000 knowledge workers.”
  6. 6. Information is hidden
     Big Data is commonly described with the 3 Vs:
     1.  Variety: human-generated vs. machine-generated; text & multimedia
     2.  Volume: up to petabytes
     3.  Velocity: stream of data; GBs per day, hour, minute, second
  7. 7. Information lives in the context The right information is hidden in text. Text forms a context: word -> sentence -> paragraph -> chapter -> document. Content processing is about extracting the required information from that context.
  8. 8. WHY IS CONTENT PROCESSING IMPORTANT?
  9. 9. Why content processing is important
     To get the right information in seconds
     •  Usage of faceted search
     To tag large document sets consistently
     •  Usage of an automatic extractor
     To build a semantic database
     •  Extraction of concepts with linkage to a taxonomy/ontology
     To perform document classification
     •  Extraction of entities with grouping / clustering
     Examples from publicly available websites [live show]
  10. 10. Conclusion
      Content processing is a set of techniques enabling text analytics.
      Content processing increases the value of data stored in companies by improving data consumption.
      Content processing used with search engines helps find information in any context.
      •  Enterprise Findability
      •  E-commerce
  11. 11. TECHNOLOGY FOR CONTENT PROCESSING
  12. 12. General architecture of search engines
  13. 13. Content Processing – the idea
      Pipeline stages shown in the diagram:
      •  Format Conversion
      •  Language Detection
      •  Spell Checking
      •  Lemmas (tenses, forms)
      •  Synonyms
      •  Document Vectorizer
      •  Entities (Geography, Companies, People)
      •  Taxonomy Classification
      •  Custom PLUG-IN
      •  Scopifier
      •  Index (destination)
      Sample input:
      PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes.
      The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.
      Input: byte stream
      Output: structured document ready to be indexed
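The idea on this slide, a byte stream passing through a chain of small stages and coming out as a structured document, can be sketched roughly as follows. This is only an illustration: the `Document` class, the stage functions, and the toy heuristics inside them are assumptions made for the example and do not come from the slides or from any Findwise product.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Structured document that accumulates fields as stages run (hypothetical)."""
    raw: bytes
    fields: dict = field(default_factory=dict)


Stage = Callable[[Document], Document]


def format_conversion(doc: Document) -> Document:
    # Decode the byte stream into text (UTF-8 is an assumption here).
    doc.fields["text"] = doc.raw.decode("utf-8", errors="replace")
    return doc


def language_detection(doc: Document) -> Document:
    # Toy heuristic standing in for a real language detector.
    words = doc.fields["text"].lower().split()
    doc.fields["language"] = "en" if "the" in words else "unknown"
    return doc


def entity_extraction(doc: Document) -> Document:
    # Toy dictionary lookup standing in for a real entity extractor.
    known_people = ["Venus Williams", "Bianka Lamade"]
    doc.fields["people"] = [p for p in known_people if p in doc.fields["text"]]
    return doc


def run_pipeline(raw: bytes, stages: List[Stage]) -> Document:
    """Run every stage in order over a document built from the raw bytes."""
    doc = Document(raw=raw)
    for stage in stages:
        doc = stage(doc)
    return doc


PIPELINE = [format_conversion, language_detection, entity_extraction]
sample = b"Venus Williams raced into the second round of the French Open."
result = run_pipeline(sample, PIPELINE)
print(result.fields)
```

Keeping each stage a plain function of one document makes it easy to reorder stages or add a custom one without touching the others.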
  14. 14. Content Processing – the implementation Hydra is used to refine content before it reaches the index. Every document fetched from a source runs through a targeted pipeline consisting of a number of stages. A stage can be thought of as an “app” in the App Store or the Android Market. Findwise has created a large number of such stages, each with one small purpose: to enhance the content of the item. Additional stages can be created to provide customer-specific functionality.
  15. 15. Hydra - example Select stages to use in the pipeline: the left column corresponds to the “market”, and the right column shows the stages in use.
  16. 16. Hydra - example Modify the format of the date to include only the year.
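A stage of this kind might look roughly like the sketch below. It is written in Python for illustration only and is not Hydra's actual stage API; the function name, the field names `date` and `year`, the date format, and the sample document are all assumptions.

```python
from datetime import datetime


def date_to_year(doc: dict, source: str = "date", target: str = "year",
                 fmt: str = "%Y-%m-%d") -> dict:
    """Return a copy of doc with a year field derived from the source date field."""
    out = dict(doc)
    value = out.get(source)
    if value is not None:
        # Parse the full date and keep only the year, stored as a string.
        out[target] = str(datetime.strptime(value, fmt).year)
    return out


# Made-up document, used only to exercise the function.
doc = {"title": "Introduction to text mining", "date": "2012-05-14"}
print(date_to_year(doc))
```

Documents without a date pass through unchanged, so the stage can sit in a pipeline that also handles undated content.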
  17. 17. Hydra - example The new year metadata can be used as a facet.
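To show why a normalised year field is useful as a facet, here is a minimal sketch of facet counting over a handful of made-up documents. A real search engine computes these counts inside the index; the helper `facet_counts` and the sample data are assumptions for illustration.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(d[field] for d in docs if field in d)


docs = [
    {"title": "A", "year": "2011"},
    {"title": "B", "year": "2012"},
    {"title": "C", "year": "2012"},
    {"title": "D"},  # no year, so it is left out of the facet
]
year_facet = facet_counts(docs, "year")
print(year_facet.most_common())
```

Because every document stores the year in the same format, the facet has one bucket per year instead of one bucket per distinct date string.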
  18. 18. Hydra - example Map every author field to a metadata field called author (shown for Pipeline A and Pipeline B).
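The mapping described here could be sketched as below, assuming two sources that name the author field differently. The alias names (`creator`, `written_by`, `dc_author`) and the sample documents are hypothetical, and the code does not use Hydra's API.

```python
# Hypothetical source field names that all mean "author".
AUTHOR_ALIASES = {"creator", "written_by", "dc_author"}


def normalize_author(doc: dict, aliases=AUTHOR_ALIASES, target: str = "author") -> dict:
    """Return a copy of doc in which every alias field is folded into the target field."""
    out = {}
    for key, value in doc.items():
        if key in aliases:
            # Keep the first alias value; never overwrite a value already set.
            out.setdefault(target, value)
        else:
            # Non-alias fields are copied as-is; an explicit target field wins.
            out[key] = value
    return out


doc_a = {"title": "Report", "creator": "A. Author"}      # as it might arrive in Pipeline A
doc_b = {"title": "Memo", "written_by": "B. Writer"}     # as it might arrive in Pipeline B
print(normalize_author(doc_a))
print(normalize_author(doc_b))
```

After this step both sources expose the same `author` field, so one facet or filter covers documents from either pipeline.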
  19. 19. Hydra - example In the search result…
  20. 20. Hydra is Open Source http://findwise.github.com/Hydra/
  21. 21. METHODS FOR CONTENT PROCESSING
  22. 22. Named entity recognition – statistical classifiers
      •  OpenNLP (http://opennlp.apache.org/)
         •  Markov chains
      •  Mallet (http://mallet.cs.umass.edu/)
         •  Conditional random fields
      Input: Mark has been in London since Mary dumped him.
      Output: <person>Mark</person> has been in <place>London</place> since <person>Mary</person> dumped him.
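The input/output pair above can be reproduced with a small rendering step that turns per-token labels into inline tags. In this sketch the labels are written by hand; in practice they would come from a trained classifier such as those listed on the slide. The function name `tag_entities` is an assumption.

```python
def tag_entities(tokens, labels):
    """Wrap each run of tokens sharing a non-"O" label in <label>...</label>."""
    parts = []
    i = 0
    while i < len(tokens):
        label = labels[i]
        if label == "O":
            # Tokens outside any entity are emitted unchanged.
            parts.append(tokens[i])
            i += 1
            continue
        # Extend the span while the label stays the same.
        # Limitation: two adjacent entities of the same type are merged.
        j = i
        while j < len(tokens) and labels[j] == label:
            j += 1
        parts.append(f"<{label}>{' '.join(tokens[i:j])}</{label}>")
        i = j
    return " ".join(parts)


tokens = "Mark has been in London since Mary dumped him.".split()
# Hand-written labels standing in for classifier output.
labels = ["person", "O", "O", "O", "place", "O", "person", "O", "O"]
print(tag_entities(tokens, labels))
```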
  23. 23. Classifiers - training
      •  Training set - language corpora
         •  (http://nkjp.pl/) for Polish
      Set of manually tagged texts in a given language, preferably from various sources and on various topics.
      Tokens   PoS tags    Name tags
      He       Pronoun     O
      went     Verb        O
      to       Prep.       O
      United   Adjective   Place
      States   Noun        Place
      .        Interp      O
  24. 24. Classifiers – supervised training
      •  Training input
         •  Features extracted from each token
            token: text, PoS tag, token class
            prev token: text, PoS tag, token class
            next token: text, PoS tag, token class
            previous tags assigned
         •  Token class examples: lowercase alphabetic, digits, contains number and letter, contains number and a hyphen, all caps, all caps with dots in between ...
      •  Training output
         •  <place> <location> <person>
         •  <B-place> <I-place> <L-place> <U-place>
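The training input and output listed above might be prepared along these lines. This is a sketch under stated assumptions: the class names returned by `token_class`, the feature keys, the two-label history window, and the reading of the B-/I-/L-/U- prefixes as begin, inside, last, and unit (single-token) are my own choices for the example, not something the slides specify.

```python
import re


def token_class(text):
    """Map a token to a coarse shape class like those listed on the slide."""
    if re.fullmatch(r"(?:[A-Z]\.)+", text):
        return "caps_with_dots"
    if text.isalpha() and text.isupper():
        return "all_caps"
    if text.isalpha() and text.islower():
        return "lowercase_alpha"
    if text.isdigit():
        return "digits"
    has_digit = any(c.isdigit() for c in text)
    if has_digit and "-" in text:
        return "number_and_hyphen"
    if has_digit and any(c.isalpha() for c in text):
        return "number_and_letter"
    return "other"


def token_features(tokens, pos_tags, prev_labels, i):
    """Features for token i: text, PoS tag and class of it and both neighbours."""
    feats = {}
    for offset, name in ((-1, "prev"), (0, "cur"), (1, "next")):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"{name}.text"] = tokens[j]
            feats[f"{name}.pos"] = pos_tags[j]
            feats[f"{name}.class"] = token_class(tokens[j])
        else:
            feats[f"{name}.text"] = "<pad>"  # position outside the sentence
    # Labels already assigned to earlier tokens (here: the last two).
    feats["prev_labels"] = tuple(prev_labels[-2:])
    return feats


def bilou(labels):
    """Convert plain per-token labels into B-/I-/L-/U- labels ("O" stays "O")."""
    out = []
    n = len(labels)
    for i, lab in enumerate(labels):
        if lab == "O":
            out.append("O")
            continue
        starts = i == 0 or labels[i - 1] != lab
        ends = i == n - 1 or labels[i + 1] != lab
        if starts and ends:
            out.append(f"U-{lab}")
        elif starts:
            out.append(f"B-{lab}")
        elif ends:
            out.append(f"L-{lab}")
        else:
            out.append(f"I-{lab}")
    return out


tokens = ["He", "went", "to", "United", "States", "."]
pos = ["Pronoun", "Verb", "Prep.", "Adjective", "Noun", "Interp"]
labels = ["O", "O", "O", "place", "place", "O"]
print(token_features(tokens, pos, labels[:3], 3))
print(bilou(labels))
```

Position-aware labels let a classifier learn where a multi-token name starts and ends, which plain `<place>` labels cannot express when two names are adjacent.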
  25. 25. Classifiers – approaches
      „Warszawskie Koło Brydżowe im. Jana Nowaka organizuje turniej w Sheratonie” (English: “The Jan Nowak Warsaw Bridge Club is organising a tournament at the Sheraton”)
      Location? Organisation name? Person name?
      •  One classifier for all name types
         •  faster
         •  automatically resolves conflicts
      •  One classifier per name type
         •  slower, memory consuming
         •  provides more information
  26. 26. EXAMPLES
  27. 27. Naive approach
      Personal names often overlap with location names:
      - Kazimierz
      - Washington
      Company names may come from everyday vocabulary:
      - Oracle
      - Dialog
      Conclusion: dictionaries are not enough without contextual analysis
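A small sketch makes the limitation concrete: a plain dictionary lookup can only report that a token has several possible readings, not which one applies in a given sentence. The gazetteer below reuses the four names from the slide; the `london` entry, the type labels, and the sample sentence are additions for illustration.

```python
# Toy gazetteer: each entry lists every reading the dictionary knows about.
GAZETTEER = {
    "kazimierz": {"person", "location"},
    "washington": {"person", "location"},
    "oracle": {"company", "common word"},
    "dialog": {"company", "common word"},
    "london": {"location"},  # unambiguous entry added for contrast
}


def lookup(token):
    """Return all dictionary readings for a token (empty set if unknown)."""
    return GAZETTEER.get(token.lower(), set())


def ambiguous_tokens(text):
    """Return tokens whose dictionary entry lists more than one reading."""
    result = {}
    for tok in text.split():
        word = tok.strip(".,;:!?")
        types = lookup(word)
        if len(types) > 1:
            result[word] = sorted(types)
    return result


print(ambiguous_tokens("Washington met the Oracle team in London."))
```

Choosing between the readings requires the surrounding words, which is what the contextual features of a statistical classifier provide.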
  28. 28. Findwise implementation
  29. 29. QUESTIONS?
  30. 30. Paweł Wróblewski pawel.wroblewski@findwise.com Marcin Goss marcin.goss@findwise.com
