Content Processing Architecture and Applications - Introduction to Text Mining
CONTENT PROCESSING ARCHITECTURE AND APPLICATIONS Introduction to text mining – Warsaw University of Technology
Plan
• Findwise – who we are, what we do
• What is content?
• Why content processing is important
• Content processing and information retrieval
• Technology for content processing
• Methods for content processing
• Examples of usage
Findwise – Search Driven Solutions
• Founded in 2005
• Offices in Sweden, Denmark, Norway, Poland and Australia
• 90 employees
Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value.
• Paweł Wróblewski & Marcin Goss
Content ≥ Information
From the business point of view, INFORMATION is the key to success.
“Information can only be an asset when it enables a task to be completed.”
“The value is in the outcome of the task, not in the information itself.” Martin White
Employee productivity (The hidden cost… IDC April 2006): “the cost for wasted time on the part of professional searching, but not finding relevant information, amounts to $5.3 million annually for an enterprise with 1000 knowledge workers.”
Information is hidden
Big Data is commonly described with 3V:
1. Variety: human generated vs. machine generated, text & multimedia
2. Volume: up to petabytes
3. Velocity: stream of data, GBs per day, hour, minute, second
Information lives in the context
The right information is hidden in text.
Text forms a context: word -> sentence -> paragraph -> chapter -> document
Content processing is about extracting the required information from the context.
Why content processing is important
To get the right information in seconds
• Usage of faceted search
To tag a large document set consistently
• Usage of an automatic extractor
To build a semantic database
• Extraction of concepts with linkage to a taxonomy/ontology
To perform document classification
• Extraction of entities with grouping / clustering
Examples from publicly available websites [live show]
Conclusion
Content processing is a set of techniques enabling text analytics.
Content processing leverages the value of data stored in companies by improving data consumption.
Content processing used with search engines helps find information in any context.
• Enterprise Findability
• E-commerce
Content Processing – the idea
[Diagram: a pipeline of processing stages leading to the index. Stage labels: Format Conversion, Language Detection, Spell Checking, Lemmas (tenses, forms), Synonyms, Document Vectorizer, Geography Entities, Taxonomy Classification, Custom PLUG-IN, Companies, People, Scopifier, index]
Sample input text:
PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes. The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.
Input: byte stream
Output: structured document ready to be indexed
Content Processing – the implementation
Hydra is used to refine content before it reaches the index. Every document fetched from a source runs through a targeted pipeline consisting of a number of stages. A stage can be thought of as an “app” in the App Store or the Android Market. Findwise has created a large number of such stages, each with one small purpose: to enhance the content of the item. Additional stages can be created to serve customer-specific functionality.
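The pipeline idea can be illustrated with a minimal sketch. This is not Hydra's API: the dict-based document model, the stage names `lowercase_title` and `count_words`, and the `run_pipeline` helper are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Assumption: a document is a plain dict of field name -> value.
Document = Dict[str, object]
Stage = Callable[[Document], Document]


def lowercase_title(doc: Document) -> Document:
    """Hypothetical stage: normalise the title field to lower case."""
    out = dict(doc)  # copy, so the input document is left untouched
    if isinstance(out.get("title"), str):
        out["title"] = out["title"].lower()
    return out


def count_words(doc: Document) -> Document:
    """Hypothetical stage: add a word_count field computed from the body."""
    out = dict(doc)
    out["word_count"] = len(str(out.get("body", "")).split())
    return out


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Pass the document through each stage in order and return the result."""
    for stage in stages:
        doc = stage(doc)
    return doc


doc = {"title": "French Open", "body": "Venus Williams raced into the second round"}
processed = run_pipeline(doc, [lowercase_title, count_words])
```

Each stage does one small job and returns a new document, so stages can be reordered or swapped without touching the others.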
Hydra - example
Select stages to use in the pipeline: the left column corresponds to the “market”, and the right column shows the stages used.
Hydra - example
Modify the format of the date to include only the year.
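A stage of this kind might look roughly like the following sketch. The field names `date` and `year`, the input format `%Y-%m-%d`, and the function `date_to_year` are assumptions; the actual Hydra stage and its configuration are not shown in the slides.

```python
from datetime import datetime


def date_to_year(doc, source_field="date", target_field="year", fmt="%Y-%m-%d"):
    """Hypothetical stage: copy only the year of a date field into a new field."""
    out = dict(doc)
    value = out.get(source_field)
    if value is None:
        return out  # nothing to convert
    try:
        out[target_field] = str(datetime.strptime(value, fmt).year)
    except (ValueError, TypeError):
        # Unparseable dates are left alone rather than guessed.
        pass
    return out


converted = date_to_year({"title": "Report", "date": "2013-04-22"})
```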
Hydra - example
The new year metadata can be used as a facet.
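To show what using a field as a facet amounts to, here is a small sketch that counts documents per value of a field. A search engine computes these counts internally; the helper `facet_counts` and the sample documents are made up for illustration.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


# Made-up documents; the last one has no year and is not counted.
docs = [
    {"title": "a", "year": "2011"},
    {"title": "b", "year": "2012"},
    {"title": "c", "year": "2012"},
    {"title": "d"},
]
year_facet = facet_counts(docs, "year")
```

The resulting counts are what a faceted search interface would display next to each year, letting the user narrow the result set with one click.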
Hydra - example
Map every author field to a metadata field called author.
[Diagram labels: Pipeline A, Pipeline B]
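The field mapping can be sketched as below. The source field names `creator` and `written_by` are hypothetical examples of how two sources might name their author field, and `map_fields` is an illustrative helper, not a Hydra stage.

```python
def map_fields(doc, source_fields, target_field):
    """Move the first source field that is present into target_field.

    Any further source fields are dropped, and an existing target_field
    value is kept (a simplifying assumption of this sketch).
    """
    out = dict(doc)
    for name in source_fields:
        if name in out:
            value = out.pop(name)
            out.setdefault(target_field, value)
    return out


# Two sources with different author field names end up with the same schema.
from_pipeline_a = map_fields({"creator": "Paweł Wróblewski"}, ["creator", "written_by"], "author")
from_pipeline_b = map_fields({"written_by": "Marcin Goss"}, ["creator", "written_by"], "author")
```

With a single `author` field in the index, one facet or filter covers documents from every source.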
Named entity recognition – statistical classifiers
• OpenNLP (http://opennlp.apache.org/)
• Markov chains
• Mallet (http://mallet.cs.umass.edu/)
• Conditional random fields
Input: Mark has been in London since Mary dumped him.
Output: <person>Mark</person> has been in <place>London</place> since <person>Mary</person> dumped him.
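The step from classifier decisions to the tagged output can be sketched as follows. The spans are hard-coded stand-ins for what a trained classifier would return; `render_entities` and the `(start, end, label)` span format are assumptions of this sketch, not the OpenNLP or Mallet API.

```python
def render_entities(text, spans):
    """Wrap each (start, end, label) character span in <label>...</label> tags.

    Spans are assumed not to overlap.
    """
    pieces = []
    pos = 0
    for start, end, label in sorted(spans):
        pieces.append(text[pos:start])                        # text before the entity
        pieces.append(f"<{label}>{text[start:end]}</{label}>")
        pos = end
    pieces.append(text[pos:])                                 # trailing text
    return "".join(pieces)


text = "Mark has been in London since Mary dumped him."

# Stand-in for classifier output: offsets are looked up in the text itself.
spans = []
for word, label in [("Mark", "person"), ("London", "place"), ("Mary", "person")]:
    start = text.index(word)
    spans.append((start, start + len(word), label))

tagged = render_entities(text, spans)
```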
Classifiers - training
• Training set - language corpora
• (http://nkjp.pl/) for Polish
Set of manually tagged texts in a given language, preferably from various sources and on various topics.

Tokens   PoS tags    Name tags
He       Pronoun     O
went     Verb        O
to       Prep.       O
United   Adjective   Place
States   Noun        Place
.        Interp      O
Classifiers – supervised training
• Training input
  • Features extracted from each token
    token: text, PoS tag, token class
    prev token: text, PoS tag, token class
    next token: text, PoS tag, token class
    previous tags assigned
  • Token classes examples
    lowercase alphabetic, digits, contains number and letter, contains number and a hyphen, all caps, all caps with dots in between ...
• Training output
  • <place> <location> <person>
  • <B-place> <I-place> <L-place> <U-place>
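The training input and output described above can be sketched as follows. The class names, the feature keys, the two-tag history window, and the reading of B/I/L/U as begin/inside/last/unit are assumptions of this sketch; real toolkits define their own feature templates.

```python
import re


def token_class(token):
    """Assign a coarse shape class, following the examples listed above."""
    if token.isalpha() and token.islower():
        return "lowercase_alpha"
    if token.isdigit():
        return "digits"
    if re.fullmatch(r"(?:[A-Z]\.)+[A-Z]?", token):
        return "all_caps_dots"            # e.g. U.S.
    if token.isalpha() and token.isupper():
        return "all_caps"
    has_digit = any(c.isdigit() for c in token)
    if has_digit and any(c.isalpha() for c in token):
        return "number_and_letter"
    if has_digit and "-" in token:
        return "number_and_hyphen"
    return "other"


def token_features(tokens, pos_tags, i, previous_tags):
    """Features for token i: current, previous and next token plus tag history."""
    features = {}
    for prefix, j in (("cur", i), ("prev", i - 1), ("next", i + 1)):
        inside = 0 <= j < len(tokens)
        features[prefix + ".text"] = tokens[j] if inside else "<pad>"
        features[prefix + ".pos"] = pos_tags[j] if inside else "<pad>"
        features[prefix + ".class"] = token_class(tokens[j]) if inside else "<pad>"
    features["prev_tags"] = tuple(previous_tags[-2:])  # assumed window of two tags
    return features


def to_bilou(tags):
    """Turn plain name tags into B-/I-/L-/U- tags; "O" stays "O".

    Adjacent tokens with the same tag are treated as one entity.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append("O")
            continue
        starts = i == 0 or tags[i - 1] != tag
        ends = i == len(tags) - 1 or tags[i + 1] != tag
        prefix = "U" if starts and ends else "B" if starts else "L" if ends else "I"
        out.append(f"{prefix}-{tag}")
    return out


tokens = ["He", "went", "to", "United", "States", "."]
pos_tags = ["Pronoun", "Verb", "Prep.", "Adjective", "Noun", "Interp"]
name_tags = ["O", "O", "O", "place", "place", "O"]

features = token_features(tokens, pos_tags, 3, name_tags[:3])
bilou_tags = to_bilou(name_tags)
```

The positional prefixes let the classifier learn where a multi-token name such as "United States" begins and ends, which plain type tags cannot express.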
Classifiers – approaches
„Warszawskie Koło Brydżowe im. Jana Nowaka organizuje turniej w Sheratonie” (“The Jan Nowak Warsaw Bridge Club is organising a tournament at the Sheraton”)
Location? Organisation name? Person name?
• One classifier for all name-types
  • faster
  • automatically resolves conflicts
• One classifier per name-type
  • slower, memory consuming
  • provides more information
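With one classifier per name-type, overlapping candidates have to be reconciled in a separate step. The sketch below shows one possible strategy (greedy selection by score); it is not taken from the slides, and the token offsets and scores are invented for illustration.

```python
def resolve_conflicts(candidates):
    """Keep the highest-scoring candidates that do not overlap.

    candidates: list of (start, end, label, score) with token offsets,
    end exclusive. Greedy: best score first, skip anything that overlaps
    an already chosen span.
    """
    chosen = []
    for cand in sorted(candidates, key=lambda c: c[3], reverse=True):
        start, end = cand[0], cand[1]
        if all(end <= s or start >= e for s, e, _, _ in chosen):
            chosen.append(cand)
    return sorted(chosen)


# Tokens: Warszawskie(0) Koło(1) Brydżowe(2) im.(3) Jana(4) Nowaka(5)
#         organizuje(6) turniej(7) w(8) Sheratonie(9)
# Invented outputs of three separate classifiers:
candidates = [
    (0, 6, "organisation", 0.80),   # the whole club name
    (0, 1, "location", 0.55),       # "Warszawskie" on its own
    (4, 6, "person", 0.70),         # "Jana Nowaka"
    (9, 10, "location", 0.65),      # "Sheratonie"
    (9, 10, "organisation", 0.60),
]
resolved = resolve_conflicts(candidates)
```

The discarded candidates still carry information (the organisation is named after a person), which is the extra detail the per-type approach provides.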
Naive approach
Personal names often coincide with location names:
- Kazimierz
- Washington
Company names may come from common language:
- Oracle
- Dialog
Conclusion: dictionaries are not enough without contextual analysis
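The limitation can be seen in a minimal dictionary lookup. The `GAZETTEER` contents and the `dictionary_lookup` helper are illustrative assumptions built from the examples above.

```python
# Tiny illustrative dictionaries, built from the examples above.
GAZETTEER = {
    "person": {"Kazimierz", "Washington"},
    "location": {"Kazimierz", "Washington"},
    "company": {"Oracle", "Dialog"},
}


def dictionary_lookup(token):
    """Return every name type whose dictionary contains the token."""
    return sorted(label for label, names in GAZETTEER.items() if token in names)


ambiguous = dictionary_lookup("Washington")   # matches two dictionaries
single = dictionary_lookup("Oracle")          # one match, but not always right
```

The lookup cannot choose between person and location for "Washington", and it labels "Oracle" as a company even in a sentence about the oracle at Delphi. Only the surrounding tokens can settle either case.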