Content Processing Architecture and Applications - Introduction to Text Mining




Introduction to text mining. Presented by Paweł Wróblewski & Marcin Goss at Warsaw University of Technology.





Usage Rights

CC Attribution-NonCommercial-ShareAlike License


    Presentation Transcript

    • CONTENT PROCESSING ARCHITECTURE AND APPLICATIONS Introduction to text mining – Warsaw University of Technology
    • Plan Findwise – who we are, what we do. What is content? Why content processing is important Content processing and information retrieval Technology for content processing Methods for content processing Examples of usage
    • Findwise – Search Driven Solutions • Founded in 2005 • Offices in Sweden, Denmark, Norway, Poland and Australia • 90 employees Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value. • Paweł Wróblewski & Marcin Goss
    • Content ≥ Information From the business point of view, INFORMATION is the key to success. "Information can only be an asset when it enables a task to be completed." "The value is in the outcome of the task, not in the information itself." Martin White Employee productivity (The hidden cost… IDC April 2006): "the cost for wasted time on the part of professional searching, but not finding relevant information, amounts to $5.3 million annually for an enterprise with 1000 knowledge workers."
    • Information is hidden Big Data is commonly described with 3V: 1. Variety: human generated vs. machine generated; text & multimedia 2. Volume: up to petabytes 3. Velocity: stream of data; GBs per day, hour, minute, second
    • Information lives in the context The right Information is hidden in text. Text forms a context: word -> sentence -> paragraph -> chapter -> document Content processing is about extracting required information from the context.
    • Why content processing is important To get the right information in seconds • Usage of faceted search To tag a large document set consistently • Usage of an automatic extractor To build a semantic database • Extraction of concepts with linkage to taxonomy/ontology To perform document classification • Extraction of entities with grouping / clustering Examples from publicly available websites [live show]
    • Conclusion Content processing is a set of techniques enabling text analytics. Content processing leverages the value of data stored in companies, improving data consumption. Content processing used with search engines helps find information in any context. • Enterprise Findability • E-commerce
    • General architecture of search engines
    • Content Processing – the idea Pipeline stages: Format Conversion, Language Detection, Spell Checking, Lemmas (tenses, forms), Synonyms, Document Vectorizer, Entities (Geography, Companies, People), Taxonomy Classification, Custom PLUG-IN, Scopifier, then the index. Example input: PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes. The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said. Input: byte stream Output: structured document ready to be indexed
    • Content Processing – the implementation Hydra is used to refine content before it hits the index. Every document fetched from a source runs through a targeted pipeline, which consists of a number of stages. A stage can be thought of as an "app" in the App Store or the Android market. Findwise has created a large number of such stages, each with a small, single purpose: to enhance the content of the item. Additional stages can be created to serve customer-specific functionality.
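The pipeline-of-stages idea described above can be sketched in a few lines. This is a minimal illustration only, not Hydra's actual API: the names `Document`, `Stage`, `strip_whitespace`, `add_word_count` and `run_pipeline` are hypothetical, and a document is assumed to be a plain dictionary of fields.

```python
from typing import Callable, Dict, List

# Assumption: a document is a flat dictionary of field name -> value.
Document = Dict[str, object]
# Assumption: a stage is any function that takes a document and returns a new one.
Stage = Callable[[Document], Document]


def strip_whitespace(doc: Document) -> Document:
    """Trim leading and trailing whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}


def add_word_count(doc: Document) -> Document:
    """Add a word_count field derived from the body field."""
    out = dict(doc)
    out["word_count"] = len(str(doc.get("body", "")).split())
    return out


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


raw = {"title": "  French Open  ",
       "body": "Venus Williams raced into the second round"}
result = run_pipeline(raw, [strip_whitespace, add_word_count])
print(result)
```

Because each stage only sees a document and returns a document, stages can be added, removed or reordered without touching the others, which is the property the "app" analogy points at.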
    • Hydra - example Select the stages to use in the pipeline: the left column corresponds to the "market", and the right column lists the stages used.
    • Hydra - example Modify the format of the date to only include year.
    • Hydra - example The new year meta-data can be used as a facet
    • Hydra - example Map every author field to a metadata field called author. Pipeline A Pipeline B
    • Hydra - example In the search result…
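A rough sketch of the two steps shown on these slides: reducing a date to its year, then counting documents per year as a facet. The field names `date` and `year`, the regular expression, and the sample documents are assumptions made for illustration; the real stage configuration is not shown in the transcript.

```python
import re
from collections import Counter
from typing import Dict, List, Optional

# Assumption: a year is any standalone four-digit number starting with 1 or 2.
YEAR_PATTERN = re.compile(r"\b[12][0-9]{3}\b")


def extract_year(date_text: str) -> Optional[str]:
    """Return the first four-digit year found in the text, or None."""
    match = YEAR_PATTERN.search(date_text)
    return match.group(0) if match else None


def year_stage(doc: Dict[str, str]) -> Dict[str, str]:
    """Add a 'year' field derived from the 'date' field, if one is found."""
    out = dict(doc)
    year = extract_year(doc.get("date", ""))
    if year is not None:
        out["year"] = year
    return out


def facet_counts(docs: List[Dict[str, str]], field: str) -> Dict[str, int]:
    """Count how many documents carry each value of the given field."""
    return dict(Counter(d[field] for d in docs if field in d))


# Hypothetical sample documents.
docs = [
    {"title": "A", "date": "2012-05-28T10:15:00Z"},
    {"title": "B", "date": "28/11/2011"},
    {"title": "C", "date": "2012-01-02"},
    {"title": "D"},  # no date, so no year facet value
]
processed = [year_stage(d) for d in docs]
year_facet = facet_counts(processed, "year")
print(year_facet)
```

Normalising the value before indexing is what makes the facet useful: full timestamps would produce one facet bucket per document.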
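The field mapping on this slide might look roughly like the sketch below, assuming that different sources (Pipeline A, Pipeline B) deliver the author under different field names. The alias names in `AUTHOR_ALIASES` are invented for illustration and do not come from the presentation.

```python
from typing import Dict, Iterable

# Hypothetical source-specific field names that all mean "author".
AUTHOR_ALIASES = ("creator", "dc:creator", "written_by", "Author")


def normalize_author(doc: Dict[str, str],
                     aliases: Iterable[str] = AUTHOR_ALIASES) -> Dict[str, str]:
    """Move the value of any alias field into a single 'author' field.

    An existing 'author' value is kept; alias fields are removed either way.
    """
    out = dict(doc)
    for alias in aliases:
        if alias in out:
            value = out.pop(alias)
            out.setdefault("author", value)
    return out


from_pipeline_a = normalize_author({"title": "X", "creator": "Marcin Goss"})
from_pipeline_b = normalize_author({"title": "Y", "written_by": "Paweł Wróblewski"})
print(from_pipeline_a)
print(from_pipeline_b)
```

After this step both sources expose the same `author` field, so a single facet or filter covers documents from every pipeline.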
    • Hydra is Open Source http://findwise.github.com/Hydra/
    • Named entity recognition – statistical classifiers • OpenNLP (http://opennlp.apache.org/) • Markov chains • Mallet (http://mallet.cs.umass.edu/) • Conditional random fields Input: Mark has been in London since Mary dumped him. Output: <person>Mark</person> has been in <place>London</place> since <person>Mary</person> dumped him.
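To make the input/output format concrete, the sketch below renders the tagged output from a list of character spans. It does not call OpenNLP or Mallet and contains no classifier; the spans are written by hand, standing in for whatever a trained model would return. The helper names `span` and `render_entities` are hypothetical.

```python
from typing import List, Tuple

# (start offset, end offset, label); end is exclusive.
Entity = Tuple[int, int, str]


def span(text: str, phrase: str, label: str) -> Entity:
    """Build an entity span for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def render_entities(text: str, entities: List[Entity]) -> str:
    """Wrap each non-overlapping entity span in <label>...</label> tags."""
    parts = []
    cursor = 0
    for start, end, label in sorted(entities):
        parts.append(text[cursor:start])
        parts.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


sentence = "Mark has been in London since Mary dumped him."
# Hand-written spans standing in for a classifier's predictions.
predicted = [
    span(sentence, "Mark", "person"),
    span(sentence, "London", "place"),
    span(sentence, "Mary", "person"),
]
tagged = render_entities(sentence, predicted)
print(tagged)
```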
    • Classifiers - training • Training set - language corpora • (http://nkjp.pl/) for Polish Set of manually tagged texts in a given language. Preferably from various sources, various topics.
      Tokens    PoS tags    Name tags
      He        Pronoun     O
      went      Verb        O
      to        Prep.       O
      United    Adjective   Place
      States    Noun        Place
      .         Interp      O
    • Classifiers – supervised training • Training input • Features extracted from each token token: text, PoS tag, token class prev token: text, PoS tag, token class next token: text, PoS tag, token class previous tags assigned • Token classes examples lowercase alphabetic, digits, contains number and letter, contains number and a hyphen, all caps, all caps with dots in between ... • Training output • <place> <location> <person> • <B-place> <I-place> <L-place> <U-place>
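The training input and output listed on this slide can be sketched as follows. This is one possible reading, not the presenters' implementation: the class names returned by `token_class`, the feature keys, and the rule order are assumptions, and the "previous tags assigned" feature is left out for brevity. The B/I/L/U prefixes are read here as begin, inside, last and unit (single-token) positions within a name.

```python
import re
from typing import Dict, List, Tuple


def token_class(token: str) -> str:
    """Map a token to a coarse shape class (class names are illustrative)."""
    if token.isdigit():
        return "digits"
    if re.fullmatch(r"(?:[A-Z]\.)+[A-Z]?", token):
        return "all_caps_with_dots"
    if token.isalpha() and token.isupper():
        return "all_caps"
    if token.isalpha() and token.islower():
        return "lowercase_alphabetic"
    has_digit = any(c.isdigit() for c in token)
    if has_digit and "-" in token:
        return "number_and_hyphen"
    if has_digit and any(c.isalpha() for c in token):
        return "number_and_letter"
    return "other"


def token_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, str]:
    """Features for token i: its own text, PoS tag and class, plus its neighbours'."""
    feats = {"text": tokens[i], "pos": pos_tags[i], "class": token_class(tokens[i])}
    for offset, prefix in ((-1, "prev"), (1, "next")):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"{prefix}_text"] = tokens[j]
            feats[f"{prefix}_pos"] = pos_tags[j]
            feats[f"{prefix}_class"] = token_class(tokens[j])
    return feats


def bilou_tags(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start, end_exclusive, label) token spans into B/I/L/U tags, O elsewhere."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for k in range(start + 1, end - 1):
                tags[k] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


tokens = ["He", "went", "to", "United", "States", "."]
pos = ["Pronoun", "Verb", "Prep.", "Adjective", "Noun", "Interp"]
features = token_features(tokens, pos, 3)
tags = bilou_tags(len(tokens), [(3, 5, "place")])
print(features)
print(tags)
```

Position-aware tags let a classifier distinguish two adjacent names from one multi-token name, which plain `<place>` labels on each token cannot express.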
    • Classifiers – approaches „Warszawskie Koło Brydżowe im. Jana Nowaka organizuje turniej w Sheratonie” ("The Jan Nowak Warsaw Bridge Club is organising a tournament at the Sheraton") Location? Organisation name? Person name? • One classifier for all name-types • faster • automatically resolves conflicts • One classifier per name-type • slower, memory consuming • provides more information
    • Naive approach People's names often overlap with location names: - Kazimierz - Washington Company names may come from common language: - Oracle - Dialog Conclusion: dictionaries are not enough without contextual analysis
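The limitation of the naive approach can be shown with a small dictionary lookup. The three dictionaries below are tiny, hypothetical samples built from the names on this slide and the earlier example sentence; a real gazetteer would be far larger, but the ambiguity is the same.

```python
from typing import Dict, Set

# Hypothetical mini-dictionaries (gazetteers), for illustration only.
DICTIONARIES: Dict[str, Set[str]] = {
    "person": {"Kazimierz", "Washington", "Mark", "Mary"},
    "location": {"Kazimierz", "Washington", "London"},
    "company": {"Oracle", "Dialog"},
}


def dictionary_lookup(token: str) -> Set[str]:
    """Return every name type whose dictionary contains the token."""
    return {label for label, names in DICTIONARIES.items() if token in names}


for word in ("London", "Washington", "Oracle", "tournament"):
    print(word, sorted(dictionary_lookup(word)))
```

"Washington" comes back as both a person and a location, and the lookup cannot choose between them. "Oracle" gets a single company label, yet the same capitalised word at the start of a sentence may be the common noun, and only the surrounding context can tell the difference.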
    • Findwise implementation
    • Paweł Wróblewski pawel.wroblewski@findwise.com Marcin Goss marcin.goss@findwise.com