Large-scale Data Processing for Information Retrieval #nlhug
Modern web search engines are making increasing use of signals other than mere textual statistics. While documents used to be matched to keyword queries based on term counting alone, modern information retrieval systems incorporate and learn from a large number of features pertaining to the query, user, documents, entities, sessions, etc. In particular, a document ranking generated by a web search engine involves combining signals from rich representations of users (including their location, browser, device, profile, history, etc.), semantics (ranging from simple spell-checking to recognizing entities), popularity, social networking, and more. All of these features need to be computed at an increasingly large scale and call for Big Data storage and analytics methods. In this talk I will give some examples of current IR research being done at the University of Amsterdam, leaning heavily on MapReduce and related programming paradigms.

See http://www.nlhug.org/events/56584462/.

    Presentation Transcript

    • Large-scale Data Processing for Information Retrieval
      • Edgar Meij
      • Informatics Institute
    • Joint work with Amit Bronner, Hendrike Peetz, Wouter Weerkamp, Anne Schuth, Maarten de Rijke Large-scale Data Processing for IR 2
    • [Figure: word cloud of research topics, built up over several animation steps: Big data, Information retrieval, Semantic search, Real-time analytics, Social signal analysis, Synchronize content, Multi-lingual information access, Machine translation, Theory and models, Evaluation methodology, Text mining, Intelligent information services, Information retrieval for information access, Political information, Storytelling, Human-computer information retrieval, Knowledge representation & reasoning, Information integration, Exploratory search, Foundations of XML, Multi-modal summaries, Open data]
    • Me¢ Information retrieval (~ search engines)¢ Semantic search/annotations¢ Use knowledge bases (Wikipedia, Freebase, etc.) as £ primary information source for search or £ as complement to traditional retrieval Large-scale Data Processing for IR 4
    • Search engines Large-scale Data Processing for IR 5
    • Search engines – a bird’s eye view¢ Main ingredient: Counting words £ Query ~ distribution over words £ Document ~ distribution over words £ Ranking ~ comparing distributions Large-scale Data Processing for IR 6
    • Example document: “Forecasters are watching tropical storms that could pose hurricane threats to the southern United States. One is a downgraded …”
      • [Figure: words drawn from documents, including forecast, hurricane, tropical, wind, weather, fun, home]
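
The "counting words" view above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: it assumes a query-likelihood ranker with linear (Jelinek-Mercer style) smoothing, and the function names, the `lam` parameter, and the second toy document are made up for the example.

```python
import math
from collections import Counter

def unigram_model(text):
    """Turn text into a distribution over words (relative frequencies)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def query_log_likelihood(query, doc_model, background, lam=0.5):
    """Log-probability of the query under a document model smoothed with a background model."""
    score = 0.0
    for word in query.lower().split():
        p = lam * doc_model.get(word, 0.0) + (1 - lam) * background.get(word, 0.0)
        if p == 0.0:
            return float("-inf")  # word unseen in the whole collection
        score += math.log(p)
    return score

# Toy collection (hypothetical); d1 reuses the example sentence from the slide.
docs = {
    "d1": "forecasters are watching tropical storms that could pose hurricane threats",
    "d2": "the home team won the game in the final minutes",
}
background = unigram_model(" ".join(docs.values()))
models = {doc_id: unigram_model(text) for doc_id, text in docs.items()}

query = "hurricane storms"
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, models[d], background),
    reverse=True,
)
print(ranking)
```

Ranking here amounts to comparing each document's word distribution against the query; smoothing with the collection-wide model keeps a single missing word from zeroing out a document.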
    • Search engines – a bit of history
      • Anno 1995
        • Counting words (only)...
        • Stopwords
        • Linguistic normalization
    • Search engines – a bit of history
      • Anno 2000: 2nd generation
        • Link structure
          • Anchor text
          • PageRank
        • Document structure
          • title, top/bottom, etc.
          • boilerplate
        • Click-through data
    • Search engines – a bit of history
      • Anno now
        • Real-time indexing/search
        • Increasingly personalized
        • Increasingly social
        • Apply “observations” of human behavior to improve, to evaluate
          • Search behavior, click behavior, dwell behavior, reading time, …, other things that are happening in the world
        • Rich signals
    • Signals¢ Users/Personalisation £ group: country, region, language, device, browser, etc. £ individual: profile, history, sessions, etc. Why “learning to rank”?¢ Linguistics (e.g., spell-checking)¢ Semantics (e.g., entities)¢ Popularity (e.g. PageRank)¢ Social (e.g. G+) And more... 1¢ • More and more features are found to be useful for ranking £ readability, relevance assessments, clicks, etc. documents. • How should we combine these? 1 http://www.flickr.com/photos/sameli/540933604/ Large-scale Data Processing for IR 11 KH&MdR (U. Amsterdam) Advanced Information Retrieval MS
    • Applying signals¢ Typically at query time... £ Leaning heavily on machine learning¢ Not the focus here... Why “learning to rank”? 1 • More and more features are found to be useful for ranking documents. • How should we combine these? 1 http://www.flickr.com/photos/sameli/540933604/ Large-scale Data Processing for IR 12 KH&MdR (U. Amsterdam) Advanced Information Retrieval MS
    • What generates (non-monetary) value?
      • What is value?
        • Better/Richer UX
          • Clever term/phrase suggestions
          • Clever, rich snippets
        • Finding what you need faster/better/...
          • Homing in on what you want to find
          • Task/Problem solving
        • and more...
    • For instance... good camera under 300 euro
    • Or...
    • So, where else do you get value from?
      • Improving signals...
        • Richer/Better/More focused signals
          • Richer data/better extraction/...
          • "Google acquires Freebase"
      • ... or the application thereof
        • Algorithmic innovations
        • Training data
          • Logs (queries, clicks, ...) – from toolbars, redirects, etc.
          • Relevance assessments – manual, professionals, Mechanical Turk, etc.
      • "More intelligent systems"
    • Intelligence?
      • Need analysis of (large quantities of) data
        • Typically, "transformations"
          • graphs (PageRank, FriendRank)
          • text => structure
          • aggregations
          • etc.
      • Then, aggregate analyses to obtain "value"
        • count/sum/min/max/avg/etc.
      • Hadoop!
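
The transform-then-aggregate pattern above maps naturally onto MapReduce. The following is a single-process Python sketch of the three phases, not Hadoop code; the helper names (`map_phase`, `shuffle`, `reduce_phase`) and the toy dwell-time log are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

def dwell_mapper(record):
    """Hypothetical log record: (query, dwell time in seconds)."""
    query, dwell_seconds = record
    yield query, dwell_seconds

def stats_reducer(key, values):
    """Compute count/sum/min/max/avg for one key."""
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }

log = [("grammys", 12), ("camera", 40), ("grammys", 8)]
result = reduce_phase(shuffle(map_phase(log, dwell_mapper)), stats_reducer)
print(result["grammys"])
```

On a real cluster the mapper and reducer run on different nodes and the shuffle moves data over the network, but the per-key logic stays the same.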
    • Use-cases
    • Use-case 1: Search and analysis on tweets
      • Even getting them is not quite trivial
      • Example: TREC Microblog track
        • 16M tweets
          • Published as ID
          • Default HTML download option without metadata (geo data, original tweet when retweeted, reply-to, etc.)
          • JSON format has all the beautiful stuff
        • HTML crawling vs getting the JSON objects
          • JSON download limited to 150 tweets per hour per IP address
            • On a single machine: more than 12 years
            • 884 nodes running for close to a week
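
The download times quoted above follow from the rate limit alone. A quick back-of-the-envelope check in Python, assuming every node has its own IP address and downloads at the full rate with no failures or retries (the function name `crawl_days` is made up for this sketch):

```python
TWEETS = 16_000_000            # size of the collection, from the slide
TWEETS_PER_HOUR_PER_IP = 150   # JSON download limit, from the slide

def crawl_days(num_tweets, num_machines, rate_per_hour=TWEETS_PER_HOUR_PER_IP):
    """Days needed when each machine downloads at the per-IP rate limit."""
    hours = num_tweets / (rate_per_hour * num_machines)
    return hours / 24

single_machine_years = crawl_days(TWEETS, 1) / 365
cluster_days = crawl_days(TWEETS, 884)

print(f"single machine: {single_machine_years:.1f} years")
print(f"884 nodes: {cluster_days:.1f} days")
```

Under these assumptions a single machine needs a little over 12 years and 884 nodes need about 5 days, consistent with the slide's "more than 12 years" and "close to a week".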
    • And once you have millions of tweets…
      • Text analytics on Twitter streams
        • Information extraction, sentiment analysis, …
        • Given an entity (company, product, …), what is being said about it? Which aspects? Which attitudes?
        • Extract triples X–R–Y
        • Dependency parsing
      • Example tweet: “Obama almost 15mins late... wonder if hes watching college hoops. Less than 2mins left in Texas Oakland game #NCAA #Marc ...”
    • Some numbers
      • Data
        • ~10% public English tweets in 2010
        • ~250M tweets
      • Performance
        • Single machine (1 dual core, 2.2GHz, 3GB RAM)
          • ~2 years
        • SARA Hadoop cluster (20 nodes x dual core, 2.6GHz, 16GB RAM)
          • ~30 days
        • DAS4 Hadoop cluster (36 nodes x dual quad-core, 2.4GHz, 24GB RAM)
          • ~1 day
    • Intermezzo: The-Web-as-a-corpus
      • Web retrieval
        • TREC Web track – ClueWeb09
          • 1,040,809,705 web pages, in 10 languages
          • 25TB uncompressed
      • Parse TBs of web data
        • SARA Hadoop
        • cloud9/Ivory(/Elasticsearch/SOLR/Lucene)
        • POS, DEP, entities
        • easy peasy
    • Use-case 2: Temporal patterns for IR
      • Temporal relevance?
      • Relevant documents
        • query: ‘grammys’
        • time (in days) along the x-axis
        • nr. of judged relevant documents along the y-axis
      • Value: detect “temporal” queries
      • [Figure from “Using Bursts for Query Modeling”, panel (a) Relevant documents: temporal distribution of relevant documents for the query; caption truncated in the transcript]
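
One simple way to flag a "temporal" query from such a per-day distribution is to look for days whose count lies far above the average. The sketch below is an assumption for illustration, not the method from the talk or the cited paper: the threshold rule (mean plus a multiple of the standard deviation), the `num_std` parameter, and the daily counts are all hypothetical.

```python
from statistics import mean, pstdev

def burst_days(daily_counts, num_std=1.5):
    """Return the indices of days whose count exceeds mean + num_std * stdev."""
    mu = mean(daily_counts)
    sigma = pstdev(daily_counts)
    threshold = mu + num_std * sigma
    return [day for day, count in enumerate(daily_counts) if count > threshold]

def is_temporal(daily_counts, num_std=1.5):
    """Treat a query as temporal if its relevant documents show at least one burst."""
    return len(burst_days(daily_counts, num_std)) > 0

# Hypothetical number of relevant documents per day for one query.
counts = [1, 0, 2, 1, 0, 1, 30, 25, 1, 0, 2, 1]
bursts = burst_days(counts)
print(bursts, is_temporal(counts))
```

A query whose relevant documents are spread evenly over time produces no bursts under this rule, while an event-driven query such as ‘grammys’ would be expected to show a clear peak.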
    • Use-case 2: Temporal patterns for IR
      • “Term lifespan” plot
        • time (in days) along the x-axis
        • terms along the y-axis
        • every dot represents an occurrence on that day
        • terms are ordered by their first occurrence in the webpages on allrecipes.com
      • [Figure: excerpt of a paper page, caption “Figure 5. Term lifespan plots for several pages”]
    • Or from Wikipedia access logs...
      • 1 year = ~555GB of raw Wikipedia logs
        • filter
        • aggregate
        • link
        • visualize
      • Inherently parallelizable
      • [Figure: excerpt of a pagecounts log file with request counts for English Wikipedia pages whose titles start with “Christmas”]
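
The filter and aggregate steps can be sketched as follows. The sketch assumes the space-separated pagecounts layout of project, URL-encoded title, request count, and bytes per line; the sample lines are made up, and the function names and the `prefix` filter are illustrative only.

```python
from collections import Counter
from urllib.parse import unquote

def parse_line(line):
    """Split one 'project title requests bytes' line; return None if malformed."""
    parts = line.split(" ")
    if len(parts) != 4:
        return None
    project, title, requests, _bytes = parts
    if not requests.isdigit():
        return None
    return project, unquote(title), int(requests)

def aggregate(lines, project="en", prefix="Christmas"):
    """Filter on project and title prefix, then sum request counts per title."""
    totals = Counter()
    for line in lines:
        parsed = parse_line(line.strip())
        if parsed is None:
            continue  # skip malformed lines
        proj, title, requests = parsed
        if proj == project and title.startswith(prefix):
            totals[title] += requests
    return totals

# Hypothetical sample lines in the assumed format.
sample = [
    "en Christmas%20Carol 1 713",
    "en Christmas%20Island 2 1200",
    "en Christmas%20Carol 3 2100",
    "nl Kerstmis 5 4000",
    "garbage line",
]
totals = aggregate(sample)
print(totals.most_common())
```

Each log file can be processed independently and the per-title counters simply added up afterwards, which is what makes the job inherently parallelizable.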
    • Use-case 3: Mining user edits on Wikipedia
      • As a social signal …
      • As a language resource …
        • Target: User edits, textual differences between revisions of the same document
        • Objective: Distinguish between factual edits (alter the meaning) and fluency edits (address style or readability)
        • Dataset: Full revision history of the English Wikipedia
    • The data
      • Average of 3.5 to 4 million revisions per month
        • English Wikipedia, August 2006 to August 2011
        • Each revision may contain multiple edits (many are irrelevant)
        • 342GB compressed text (snapshot of 15/01/2011)
    • What to do?
      • A lot of pre-processing
        • Filtering out irrelevant revisions
        • Parsing wiki markup
        • Word tokenization
        • Sentence splitting
        • Computing textual diff between revisions
        • Indexing user edits at sentence level and across sentence boundaries
        • Computing classification features per user edit
      • And then
        • Execution: 15 nodes, each processes a data stream
        • Average of 2-3 days per node
      • Outcome: 6.3 million textual diff segments, 4.3 million user edits
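
The "textual diff between revisions" step can be illustrated with Python's standard `difflib`. This is a sketch under simple assumptions (whitespace tokenization, one sentence per call) and not the pipeline used in the talk; the function name `word_edits` and the two example sentences are hypothetical.

```python
import difflib

def word_edits(old_sentence, new_sentence):
    """Return (operation, old text, new text) for each changed token span."""
    old_tokens = old_sentence.split()
    new_tokens = new_sentence.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace/insert/delete spans
            edits.append((tag, " ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2])))
    return edits

# Two hypothetical revisions of the same sentence.
old = "The city has a population of 700,000 people ."
new = "The city has a population of 820,000 people ."
edits = word_edits(old, new)
print(edits)
```

Each extracted edit would then be turned into features for the factual versus fluency classifier; a changed number, as in this example, is the kind of edit one would expect to be factual.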
    • What’s next?
    • Real-time semantic analysis
      • Example: reputation management
      • Follow Twitter stream
        • Am I being mentioned?
        • What are they saying about me?
        • Is this potentially damaging?
      • Why a challenge
        • Ambiguity
        • Noise
        • “I need to know now!”
      • Big data
    • Extreme personalisation
      • “Zero click”, “zero query”
      • Tell me what I should know
        • Summarize a few million documents
        • Show a semantically meaningful result on my screen
      • Big data
    • Social search
      • Socially improved search
        • General search, personalized search
        • Thousands of users of social networks actively share content and attitudes and opinions and experiences
        • Use this to “push content”
        • Return results that you care about, with a broad “subjective context”
      • Big data
    • Thanks!
      • Edgar Meij
        • http://edgar.meij.pro
        • edgar.meij@uva.nl
        • @edgarmeij