Content Processing Architecture and Applications - Introduction to Text Mining

Introduction to text mining. Presented by Paweł Wróblewski & Marcin Goss at Warsaw University of Technology.

  1. 1. CONTENT PROCESSING ARCHITECTURE AND APPLICATIONS Introduction to text mining – Warsaw University of Technology
  2. 2. Plan • Findwise – who we are, what we do • What is content? • Why content processing is important • Content processing and information retrieval • Technology for content processing • Methods for content processing • Examples of usage
  3. 3. Findwise – Search Driven Solutions • Founded in 2005 • Offices in Sweden, Denmark, Norway, Poland and Australia • 90 employees Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value. • Paweł Wróblewski & Marcin Goss
  4. 4. WHAT IS CONTENT?
  5. 5. Content ≥ Information From the business point of view, INFORMATION is the key to success. “Information can only be an asset when it enables a task to be completed.” “The value is in the outcome of the task, not in the information itself.” (Martin White) Employee productivity (The hidden cost… IDC April 2006): “the cost for wasted time on the part of professional searching, but not finding relevant information, amounts to $5.3 million annually for an enterprise with 1000 knowledge workers.”
  6. 6. Information is hidden Big Data is commonly described with 3V: 1. Variety: human generated vs. machine generated; text & multimedia 2. Volume: up to petabytes 3. Velocity: stream of data; GBs per day, hour, minute, second
  7. 7. Information lives in the context The right information is hidden in text. Text forms a context: word -> sentence -> paragraph -> chapter -> document. Content processing is about extracting the required information from the context.
  8. 8. WHY IS CONTENT PROCESSING IMPORTANT?
  9. 9. Why content processing is important To get the right information in seconds • Usage of faceted search To tag a large document set consistently • Usage of an automatic extractor To build a semantic database • Extraction of concepts with linkage to a taxonomy/ontology To perform document classification • Extraction of entities with grouping / clustering Examples from publicly available websites [live show]
  10. 10. Conclusion Content processing is a set of techniques enabling text analytics. Content processing leverages the value of data stored in companies by improving data consumption. Content processing used with search engines helps find information in any context. • Enterprise Findability • E-commerce
  11. 11. TECHNOLOGY FOR CONTENT PROCESSING
  12. 12. General architecture of search engines
  13. 13. Content Processing – the idea Pipeline stages: Format Conversion • Language Detection • Spell Checking • Lemmas (tenses, forms) • Synonyms • Document Vectorizer • Entities (Geography, Companies, People) • Taxonomy Classification • Custom PLUG-IN • Scopifier -> index. Example: PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes. The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said. Input: byte stream. Output: structured document ready to be indexed.
  14. 14. Content Processing – the implementation Hydra is used to refine content before it hits the index. Every document fetched from a source runs through a targeted pipeline, which consists of a number of stages. A stage can be thought of as an “app” in the App Store or the Android Market. Findwise has created a large number of such stages, each with a small, single purpose: to enhance the content of the item. It is possible to create additional stages to provide customer-specific functionality.
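The stage-and-pipeline idea above can be sketched in a few lines of Python. This is only an illustration of the concept, not Hydra's actual API: the `Document` type, the two stage functions and `run_pipeline` are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List

# Assumption: a document is modelled as a plain dict of field name -> value.
Document = Dict[str, object]
Stage = Callable[[Document], Document]


def strip_title(doc: Document) -> Document:
    """Hypothetical stage: remove surrounding whitespace from the title."""
    out = dict(doc)
    if isinstance(out.get("title"), str):
        out["title"] = out["title"].strip()
    return out


def count_words(doc: Document) -> Document:
    """Hypothetical stage: add a word_count field computed from the body."""
    out = dict(doc)
    out["word_count"] = len(str(out.get("body", "")).split())
    return out


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Pass the document through every stage in order, before indexing."""
    for stage in stages:
        doc = stage(doc)
    return doc


refined = run_pipeline(
    {"title": "  French Open  ", "body": "Venus Williams raced into the second round"},
    [strip_title, count_words],
)
print(refined)
```

Each stage does one small thing and returns a new document, so stages can be added, removed or reordered per pipeline without touching the others.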
  15. 15. Hydra - example Select stages to use in the pipeline; the left column corresponds to the “market”, and the right column lists the stages used.
  16. 16. Hydra - example Modify the format of the date to include only the year.
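A date-to-year stage of this kind might look like the following minimal sketch. The field names `date` and `year` and the input format `%Y-%m-%d` are assumptions made for the example; the slides do not say which fields or formats the real stage uses.

```python
from datetime import datetime


def extract_year(doc, source_field="date", target_field="year", fmt="%Y-%m-%d"):
    """Hypothetical stage: copy only the year of a date field into a new field.

    Documents whose date is missing or does not match `fmt` are returned unchanged.
    """
    out = dict(doc)
    value = out.get(source_field)
    if isinstance(value, str):
        try:
            out[target_field] = datetime.strptime(value, fmt).year
        except ValueError:
            pass  # leave the document untouched if the date cannot be parsed
    return out


print(extract_year({"title": "Annual report", "date": "2012-11-05"}))
```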
  17. 17. Hydra - example The new year metadata can be used as a facet.
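To show what using a field as a facet means, here is a minimal sketch that counts how many documents fall under each value of a field. A real search engine computes these counts inside the index; the `facet_counts` helper and the sample documents are hypothetical.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count documents per value of `field`, skipping documents without it."""
    return Counter(doc[field] for doc in docs if field in doc)


docs = [
    {"title": "A", "year": 2011},
    {"title": "B", "year": 2012},
    {"title": "C", "year": 2012},
    {"title": "D"},  # no year extracted for this document
]
print(facet_counts(docs, "year"))
```

The resulting counts are what a faceted search interface shows next to each filter value.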
  18. 18. Hydra - example Map every author field to a metadata field called author. Pipeline A Pipeline B
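The field mapping can be sketched as a stage that is configured differently per pipeline. The source field names `creator` and `written_by` are invented for this example; the slides do not name the fields used in Pipeline A and Pipeline B.

```python
def map_to_author(doc, source_fields):
    """Hypothetical stage: move the first matching source field to 'author'."""
    out = dict(doc)
    for name in source_fields:
        if name in out:
            out["author"] = out.pop(name)
            break
    return out


# Assumption: each pipeline knows which field its source uses for the author.
pipeline_a_doc = map_to_author({"creator": "Paweł Wróblewski"}, ["creator"])
pipeline_b_doc = map_to_author({"written_by": "Marcin Goss"}, ["written_by"])
print(pipeline_a_doc, pipeline_b_doc)
```

After this stage, documents from both sources can be searched and faceted on the single `author` field.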
  19. 19. Hydra - example In the search result…
  20. 20. Hydra is Open Source http://findwise.github.com/Hydra/
  21. 21. METHODS FOR CONTENT PROCESSING
  22. 22. Named entity recognition – statistical classifiers • OpenNLP (http://opennlp.apache.org/) • Markov chains • Mallet (http://mallet.cs.umass.edu/) • Conditional random fields Input: Mark has been in London since Mary dumped him. Output: <person>Mark</person> has been in <place>London</place> since <person>Mary</person> dumped him.
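The input/output pair above can be reproduced once a recognizer has found the entity spans. The sketch below covers only the final rendering step; the character offsets are written by hand here and stand in for what a trained classifier would return. `render_entities` is a hypothetical helper, not part of OpenNLP or Mallet, and it assumes the spans do not overlap.

```python
def render_entities(text, spans):
    """Wrap each (start, end, label) character span in <label>...</label>."""
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        parts.append(text[cursor:start])  # text before the entity
        parts.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    parts.append(text[cursor:])  # text after the last entity
    return "".join(parts)


text = "Mark has been in London since Mary dumped him."
# Assumption: these spans stand in for the output of a trained recognizer.
spans = [(0, 4, "person"), (17, 23, "place"), (30, 34, "person")]
print(render_entities(text, spans))
```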
  23. 23. Classifiers - training • Training set - language corpora • (http://nkjp.pl/) for Polish Set of manually tagged texts in a given language. Preferably from various sources, various topics. Tokens | PoS tags | Name tags: He | Pronoun | O; went | Verb | O; to | Prep. | O; United | Adjective | Place; States | Noun | Place; . | Interp | O
  24. 24. Classifiers – supervised training • Training input • Features extracted from each token: token (text, PoS tag, token class); previous token (text, PoS tag, token class); next token (text, PoS tag, token class); previous tags assigned • Token class examples: lowercase alphabetic, digits, contains number and letter, contains number and a hyphen, all caps, all caps with dots in between ... • Training output • <place> <location> <person> • <B-place> <I-place> <L-place> <U-place>
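The training input and output described above can be sketched as follows. All function names are hypothetical, the token classes cover only the examples listed on the slide, the two-tag history window is an assumption, and the B/I/L/U conversion assumes that adjacent tokens with the same name tag belong to one entity. Real toolkits ship their own feature generators.

```python
import re


def token_class(token):
    """Assign one of the token classes listed above (simplified)."""
    if token.isalpha() and token.islower():
        return "lowercase"
    if token.isdigit():
        return "digits"
    if token.isalpha() and token.isupper():
        return "all-caps"
    if "." in token and token.replace(".", "").isalpha() and token.isupper():
        return "all-caps-dots"
    if re.search(r"\d", token) and "-" in token:
        return "number-hyphen"
    if re.search(r"\d", token) and re.search(r"[^\W\d_]", token):
        return "number-letter"
    return "other"


PAD = {"text": "<pad>", "pos": "<pad>", "class": "<pad>"}


def describe(tokens, pos_tags, j):
    """Text, PoS tag and token class of token j, or padding outside the sentence."""
    if 0 <= j < len(tokens):
        return {"text": tokens[j], "pos": pos_tags[j], "class": token_class(tokens[j])}
    return PAD


def extract_features(tokens, pos_tags, i, previous_tags):
    """Features for token i: previous, current and next token plus tag history."""
    features = {}
    for prefix, j in (("prev", i - 1), ("cur", i), ("next", i + 1)):
        for key, value in describe(tokens, pos_tags, j).items():
            features[f"{prefix}_{key}"] = value
    # Assumption: only the last two assigned tags are used as history.
    features["previous_tags"] = tuple(previous_tags[-2:])
    return features


def to_bilou(labels):
    """Convert per-token name tags ('O' or a type) to B-/I-/L-/U- tags."""
    out = []
    for i, label in enumerate(labels):
        if label == "O":
            out.append("O")
            continue
        starts = i == 0 or labels[i - 1] != label
        ends = i == len(labels) - 1 or labels[i + 1] != label
        prefix = "U" if starts and ends else "B" if starts else "L" if ends else "I"
        out.append(f"{prefix}-{label}")
    return out


tokens = ["He", "went", "to", "United", "States", "."]
pos_tags = ["Pronoun", "Verb", "Prep.", "Adjective", "Noun", "Interp"]
print(extract_features(tokens, pos_tags, 3, ["O", "O", "O"]))
print(to_bilou(["O", "O", "O", "place", "place", "O"]))
```

The position-aware tags let a classifier learn where a multi-token name such as "United States" begins and ends, which a plain `<place>` tag per token cannot express.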
  25. 25. Classifiers – approaches „Warszawskie Koło Brydżowe im. Jana Nowaka organizuje turniej w Sheratonie” (“The Jan Nowak Warsaw Bridge Club is organising a tournament at the Sheraton”) Location? Organisation name? Person name? • One classifier for all name-types • faster • automatically resolves conflicts • One classifier per name-type • slower, memory consuming • provides more information
  26. 26. EXAMPLES
  27. 27. Naive approach People's names often overlap with location names: - Kazimierz - Washington Company names may come from common language: - Oracle - Dialog Conclusion: dictionaries are not enough without contextual analysis
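The limitation of the naive approach is easy to demonstrate with a dictionary lookup. The tiny `GAZETTEER` below is made up from the examples on the slide; a lookup can report which lists a token appears in, but it cannot decide which reading is correct in a given sentence.

```python
# Assumption: a toy dictionary built from the examples on the slide.
GAZETTEER = {
    "person": {"Kazimierz", "Washington"},
    "place": {"Kazimierz", "Washington"},
    "company": {"Oracle", "Dialog"},
}


def dictionary_labels(token):
    """Return every name type whose dictionary contains the token."""
    return sorted(label for label, names in GAZETTEER.items() if token in names)


for token in ["Washington", "Oracle", "turniej"]:
    print(token, dictionary_labels(token))
```

"Washington" comes back as both a person and a place, and "Oracle" is labelled a company even when it is used as an ordinary word, so the surrounding context is needed to choose.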
  28. 28. Findwise implementation
  29. 29. QUESTIONS?
  30. 30. Paweł Wróblewski pawel.wroblewski@findwise.com Marcin Goss marcin.goss@findwise.com