presented 09/23/14 at NYC Search, Discovery & Analytics meetup
Classification of short text into a predefined hierarchy of categories is a challenge. The need to categorize short texts arises in multiple domains: keywords and queries in online advertising, improvement of search engine results, analysis of tweets or messages in social networks, etc. We leverage community-moderated, freely-available data sets (Wikipedia, DBPedia, Freebase) and open-source tools (Hadoop, Solr) to build a flexible and extensible classification model.
Magnetic is an online advertising company specializing in search retargeting and applying data science to online search behavior. We create custom real-time audience segments based on what users have searched for across the web. Targeting individual keywords found in user search history is a great way to build an audience, but manually selecting keywords can present an operational challenge. The ability to classify queries and keywords helps create larger audiences with less effort and better accuracy. Other use cases for keyword classification in online advertising include reporting on the size of available inventory by category and campaign performance optimization.
We will share our experiences building a real-world data science system that scales to production data volumes of more than 20 million keyword classifications per hour. We will also touch on aspects of knowledge discovery such as language detection, n-gram extraction, and entity recognition.
about the speaker: Alex Dorman, CTO at Magnetic.
Alex has used Hadoop technologies since 2007. Before joining Magnetic, Alex built big data platforms and teams at Proclivity Media and ContextWeb/PulsePoint.
The current revolution in the music industry represents great opportunities and challenges for music recommendation systems. Recommendation systems are now central to music streaming platforms, which are rapidly increasing in listenership and becoming the top source of revenue for the music industry. It is increasingly more common for a music listener to simply access music than to purchase and own it in a personal collection. In this scenario, recommendation calls no longer for a one-shot recommendation for the purpose of a track or album purchase, but for a recommendation of a listening experience, comprising a very wide range of challenges, such as sequential recommendation, or conversational and contextual recommendations. Recommendation technologies now impact all actors in the rich and complex music industry ecosystem (listeners, labels, music makers and producers, concert halls, advertisers, etc.).
The influence of intelligent technology on the way we discover and experience..., by Fabien Gouyon
Invited talk at event organized by the HUMAINT project (the Joint Research Centre of the European Commission aiming to understand the impact of machine intelligence on human behaviour)
A simple and easy tool to organize PS Queries into logical business processes. It will be an aid to organisations that maintain hundreds of PS Queries and use Query Manager/Viewer to run them. This tool helps you classify queries into corresponding business processes and later display and run the queries from a custom page.
More details on the framework on my blog - www.peoplesofthrms.blogspot.com
Reflected Intelligence: Lucene/Solr as a self-learning data system, by Trey Grainger
What if your search engine could automatically tune its own domain-specific relevancy model? What if it could learn the important phrases and topics within your domain, automatically identify alternate spellings (synonyms, acronyms, and related phrases) and disambiguate multiple meanings of those phrases, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain?
In this presentation, you’ll learn how to do just that: how to evolve Lucene/Solr implementations into self-learning data systems that accept user queries, deliver relevance-ranked results, and automatically learn from your users’ subsequent interactions to continually deliver a more relevant experience for each keyword, category, and group of users.
Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the relevance signals present in the collective feedback from every prior user interaction with the system. Come learn how to move beyond manual relevancy tuning and toward a closed-loop system leveraging both the embedded meaning within your content and the wisdom of the crowds to automatically generate search relevancy algorithms optimized for your domain.
ADV Slides: Graph Databases on the Edge, by DATAVERSITY
Graph databases may be the unsung heroes of data platforms. They are poised to expand dramatically in the next few years as understanding relationships becomes central to important analytics data. We live and work today in a highly connected world where individuals and their relationships organize perceptions, consumer behaviors, and many other business success factors. Where patterns are involved in relationships, it is imperative to understand them. Graph databases are the technology best suited to determining and understanding data relationships.
This code-lite session is a primer on graph databases and the relationship data stored in them for the analytics architect in the enterprise. It will help you determine why, how, and where to apply graphs, and how to get started.
Design for Findability: metadata, metrics and collaboration on LOC.gov, by UXPA International
UXPA 2013 Annual Conference Friday July 12, 2013 3:00pm - 4:00pm ET by Jill MacNeice
The Library of Congress has 2.2 million digitized searchable items online, including 89,000 web pages as well as catalog records, books, musical scores, films, newspapers, and more than 1 million images.
How does anyone ever find anything?
In Design for Findability, I’ll talk about what the Library of Congress is doing on the interface, in the back end, and at the institutional level to make content and objects on LOC.gov more findable. And I invite you to share your own efforts to enhance findability on your sites. The goal is to create a framework for findability that can be used for many different types of sites.
Data Con LA 2022 - Using Google trends data to build product recommendations, by Data Con LA
Mike Limcaco, Analytics Specialist / Customer Engineer at Google
Measure trends in a particular topic or search term across Google Search in the US, down to the city level. Integrate these data signals into analytic pipelines to drive product, retail, and media (video, audio, digital content) recommendations tailored to your audience segment. We'll discuss how Google's unique datasets can be used with Google Cloud's smart analytics services to process, enrich, and surface the most relevant product or content that matches the ever-changing interests of your local customer segment.
Design for Findability at the Library of Congress, by Jill MacNeice
How do people find things on a site with 17.5 million items in the online catalog? This presentation discusses the findability framework we use on LOC.gov that has transformed our site and made our content more findable. Hint: it relies heavily on metadata, metrics and, above all, collaboration.
This presentation complements Meg Peters' "Design for Findability: Collaboration on Congress.gov":
http://www.slideshare.net/megpeters946/findability-congress-gov
An introductory presentation about the current state of personalization in (Web) search for Bibliotekarforbundet's series of 'gå-hjem-møder'. Presented on May 17, 2016 at Aalborg University Copenhagen.
NTEN Webinar - Data Cleaning and Visualization Tools for Nonprofits, by Azavea
Slides from a webinar we conducted for NTEN that covers tools that nonprofits can use to clean and prepare their datasets and then visualize them via charts, maps, and graphs.
Presentation to SWIB23 in Berlin.
The journey to implement a production Linked Data Management and Discovery System for the National Library Board of Singapore.
How can we mine, analyse and visualise the Social Web?
In this lecture, you will learn about mining social web data for analysis, preparing data, and gathering basic statistics on your data.
Lecture 5: Mining, Analysis and Visualisation, by Marieke van Erp
This is the fourth lecture in the Social Web course at the VU University Amsterdam
Visit the website for more information: Social Web 2012
Document retrieval is characterized as the matching of a stated user query against a set of free-text records. These records could be any kind of essentially unstructured text, such as newspaper articles, real estate records, or paragraphs in a manual.
Whether we like it or not, data-hungry algorithms and AI-powered recommendation engines are now mediating all performing arts engagement online. Oddly, the technologies behind these algorithms were initially not designed for commercial interests but rather for collaboration. So, shall we simply comply with Google and Alexa’s requirements for data? Or shall we rather build a shared data ecosystem that will serve both our needs and those of bots?
This presentation was developed and delivered as part of the linked digital future initiative. For more information, visit: https://linkeddigitalfuture.ca/resources/workshops/
Smart TV Buyer Insights Survey 2024, by 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
The Art of the Pitch: WordPress Relationships and Sales, by Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
More Related Content
Similar to Magnetic - Query Categorization at Scale
UiPath Test Automation using UiPath Test Suite series, part 4, by DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 3, by DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: Optimizing FME Workflows with Parameters, by Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Accelerate your Kubernetes clusters with Varnish Caching, by Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality, by Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Elevating Tactical DDD Patterns Through Object Calisthenics, by Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Securing your Kubernetes cluster: a step-by-step guide to success!, by KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Connector Corner: Automate dynamic content and events by pushing a button, by DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024, by Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Epistemic Interaction - tuning interfaces to provide information for AI support, by Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
Magnetic - Query Categorization at Scale
1. Query Categorization at Scale
NYC Search, Discovery & Analytics meetup
September 23rd, 2014
Alex Dorman, CTO
alex at magnetic dot com
2. About Magnetic
One of the largest aggregators of intent data
2
First company to focus 100% on applying search intent to display
Proprietary media platform and targeting algorithm
Display, mobile, video and social retargeting capabilities
Strong solution for customer acquisition and retention
3. Search Retargeting
3
Search retargeting combines the purchase intent
from search with the scale from display
1) Magnetic collects
search data
2) Magnetic builds
audience segments
3) Magnetic serves
retargeted ads
4) Magnetic optimizes
campaign
4. Slovak Academy of Science
4
Institute of Informatics
- One of top European research institutes
- Significant experience in:
- Informational retrieval
- Semantic WEB
- Natural Language Processing
- Parallel and Distributed processing
- Graph/Networks Analysis
5. Search Data - Natural and Navigational
5
Natural Searches and Navigational Searches
Natural Search:
“iPhone”
Navigational Search:
“iPad Accessories”
6. Search Data – Page Keywords
6
Page keywords from
article metadata:
“Recipes, Cooking,
Holiday Recipes”
7. Search Data – Page Keywords
7
Article Titles:
“Microsoft is said to be in talks
to acquire Minecraft”
8. Search Data – Why Categorize?
8
• Targeting categories instead of keywords = Scale
• Use category name to optimize advertising as an
additional feature in predictive models
• Reporting by category is easier to grasp as compared to
reporting by keyword
10. Query Categorization – Academic Approach
10
• Usual approach (academic publications):
– Get documents from a web search
– Classify based on the retrieved documents
11. Query Categorization
Query Categories
apple
Computers Hardware
Living Food & Cooking
FIFA 2006
Sports Soccer
Sports Schedules & Tickets
Entertainment Games & Toys
cheesecake recipes
Living Food & Cooking
Information Arts & Humanities
friendships poem
Information Arts & Humanities
Living Dating & Relationships
11
• Usual approach:
• Get results for query
• Categorize returned documents
• Best algorithms work with the entire web (search
API)
12. Long Time Ago …
12
• Relying on Bing Search API:
– Get search results using the query we want to categorize
– See if some category-specific “characteristic“ keywords appear
in the results
– Combine scores
– Not too bad....
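As a rough illustration, the characteristic-keyword scoring described above might look like the following sketch. The keyword sets here are made up for the example; the production lists were category-specific and far larger:

```python
from collections import defaultdict

# Hypothetical characteristic-keyword sets (illustrative only).
CATEGORY_KEYWORDS = {
    "Living / Food & Cooking": {"recipe", "baking", "ingredients", "oven"},
    "Computers / Hardware": {"cpu", "laptop", "motherboard", "processor"},
}

def score_categories(result_snippets):
    """Count characteristic-keyword hits per category across result snippets."""
    scores = defaultdict(float)
    for snippet in result_snippets:
        tokens = set(snippet.lower().split())
        for category, keywords in CATEGORY_KEYWORDS.items():
            scores[category] += len(tokens & keywords)
    return dict(scores)

# Snippets as they might come back from a search API for "cheesecake recipes".
snippets = [
    "Easy cheesecake recipe with simple ingredients",
    "Preheat the oven and follow the recipe steps",
]
print(score_categories(snippets))
```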
13. Long Time Ago …
13
• ... But ....
• .... We have ~8Bn queries
per month to categorize ....
• $2,000 * 8,000 = Oh My!
15. Our Query Categorization Approach – Take 2
15
• Assign a category to each
Wikipedia document (with
a score)
• Load all documents and
scores into an index
• Search within the index
• Compute the final score
for the query
17. Measuring Quality
• Precision is the fraction of retrieved documents that are relevant to the query
17
• Recall is the fraction of the documents that are relevant to the query
that are successfully retrieved.
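These two definitions translate directly into code. A minimal sketch over sets of document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant; 3 of the 6 relevant ones were found.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d3", "d5", "d6", "d7"})
print(p, r)  # 0.75 0.5
```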
18. Measuring Quality
18
[Diagram: relevant items are to the left of the straight line; errors shown in red]
• Recall is the fraction of
the documents that are
relevant to the query that
are successfully
retrieved.
• Precision is the fraction of retrieved documents that are relevant to the query
22. Query Categorization - Overview
22
• Assign a category to each
Wikipedia document (with
a score)
• Load all documents and
scores into an index
• Search within the index
• Compute final score for
the query
How?
23. Step-by-Step
Preparation steps:
• Create map: Category: {seed documents}
• Compute n-grams: Category: {n-grams: score}
• Parse Wikipedia: Document: {title, redirects, anchor text, etc}
• Categorize documents: Document: {title, redirects, anchor text, etc} -> {category: score}
• Build index
Real-time query categorization:
• Search within index
• Combine scores from each document in results: Query: {category: score}
23
24. Step By Step – Seed Documents
• Each category is represented by one or more wiki pages (manual mapping)
• Example: Electronics & Computing > Cell Phone
– Mobile phone
– Smartphone
– Camera phone
25. N-grams Generation From Seed Wikipages
• Wikipedia is rich in links and metadata.
• We utilize links between pages to find “similar concepts”.
• The set of similar concepts is saved as a list of n-grams.
26. N-grams Generation From Seed Wikipages and Links
Mobile phone 1.0
Smartphone 1.0
Camera phone 1.0
Mobile operating system 0.3413
Android (operating system) 0.2098
Tablet computer 0.1965
Comparison of smartphones 0.1945
Personal digital assistant 0.1934
IPhone 0.1926
• For each link we compute the similarity of the linked page to the seed page (as cosine similarity)
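A minimal sketch of cosine similarity over term-frequency vectors; the exact term weighting (raw counts vs. TF-IDF) and tokenization used for the link scores above are not specified in the slides:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two documents as term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# identical texts score 1.0; texts with no shared terms score 0.0
cosine_similarity("mobile phone network", "mobile phone standard")
```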
27. Extending Seed Documents with Redirects
• There are many redirects and alternative names in Wikipedia.
• For example, “Cell Phone” redirects to “Mobile Phone”.
• Alternative names are added to the list of n-grams of the category.
Mobil phone 1.0
Mobilephone 1.0
Mobil Phone 1.0
Cellular communication standard 1.0
Mobile communication standard 1.0
Mobile communications 1.0
Environmental impact of mobile phones 1.0
Kosher phone 1.0
How mobilephones work? 1.0
Mobile telecom 1.0
Celluar telephone 1.0
Cellular Radio 1.0
Mobile phones 1.0
Cellular phones 1.0
Mobile telephone 1.0
Mobile cellular 1.0
Cell Phone 1.0
Flip phones 1.0
…..
28. Creating index – What information to use?
• Some information in Wikipedia helped more than others
• We tested combinations of different fields and applied different algorithms to select the approach with the best results
• Data set for the test: KDD Cup 2005 “Internet User Search Query Categorization”; 800 queries annotated by 3 reviewers
29. Creating index – Parsed Fields
• Fields for Categorization of Wikipedia documents:
– title
– abstract
– db_category
– fb_category
– category
30. What else goes into the index – Freebase, DBpedia
• Some Freebase/DBpedia categories are mapped to the Magnetic taxonomy (manual mapping)
• (Freebase and DBpedia have links back to Wikipedia documents)
• Examples:
– Arts & Entertainment > Pop Culture & Celebrity News: Celebrity; music.artist; MusicalArtist; …
– Arts & Entertainment > Movies/Television: TelevisionStation; film.film; film.actor; film.director; …
– Automotive > Manufacturers: automotive.model, automotive.make
31. Wikipedia page categorization: n-gram matching
Ted Nugent
Theodore Anthony "Ted" Nugent (born December 13,
1948) is an American rock musician from Detroit,
Michigan. Nugent initially gained fame as the lead
guitarist of The Amboy Dukes before embarking on a solo
career. His hits, mostly coming in the 1970s, such as
"Stranglehold", "Cat Scratch Fever", "Wango Tango", and
"Great White Buffalo", as well as his 1960s Amboy Dukes
…
Article abstract
rock and roll, 1970s in music, Stranglehold (Ted Nugent
song), Cat Scratch Fever (song), Wango Tango (song),
Conservatism in the United States, Gun politics in the
United States, Republican Party (United States)
Additional text from abstract links
33. Wikipedia page categorization: n-gram matching
Found n-gram keywords with scores for categories
rock musician - Arts & Entertainment > Music: 0.08979
Advocate - Law-Government-Politics > Legal: 0.130744
christians - Lifestyle > Religion and Belief: 0.055088; Lifestyle > Wedding & Engagement: 0.0364
gun rights - Negative > Firearms: 0.07602
rock and roll - Arts & Entertainment > Music: 0.104364
reality television series - Lifestyle > Dating: 0.034913; Arts & Entertainment > Movies/Television: 0.041453
...
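The lookup behind these matches can be sketched as follows, assuming the `Category: {n-grams: score}` map from the preparation steps (the function name, tokenization, and max n-gram length are illustrative):

```python
def match_ngrams(text, category_ngrams, max_n=3):
    """Find category n-grams in text; return {category: [(ngram, score), ...]}.

    category_ngrams maps category -> {ngram: score}, as built in preparation.
    """
    tokens = text.lower().split()
    found = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            for category, ngrams in category_ngrams.items():
                if gram in ngrams:
                    found.setdefault(category, []).append((gram, ngrams[gram]))
    return found

ngrams = {"Arts & Entertainment > Music": {"rock musician": 0.08979,
                                           "rock and roll": 0.104364}}
match_ngrams("an american rock musician from detroit", ngrams)
# {'Arts & Entertainment > Music': [('rock musician', 0.08979)]}
```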
35. Wikipedia page categorization: results
Document: Ted Nugent
Arts & Entertainment > Pop Culture & Celebrity News: 0.956686
Arts & Entertainment > Music: 0.956681
Arts & Entertainment > Movies/Television: 0.954364
Arts & Entertainment > Books and Literature: 0.908852
Sports > Game & Fishing: 0.874056
Result categories for document combined from:
- text n-gram matching
- DBpedia mapping
- Freebase mapping
36. Query Categorization
• Take search fields
• Search using Lucene's standard TF-IDF scoring implementation
• Get results
• Filter results using alternative names
• Combine the remaining documents' pre-computed categories
• Remove low-confidence results
• Return the resulting set of categories with confidence scores
37. Query Categorization: search within index
• Searching within all data stored in Lucene index
• Computing categories for each result normalized by Lucene score
• Example: “Total recall Arnold Schwarzenegger”
• List of documents found (with Lucene score):
1. Arnold Schwarzenegger filmography; score: 9.455296
2. Arnold Schwarzenegger; score: 6.130941
3. Total Recall (2012 film); score: 5.9359055
4. Political career of Arnold Schwarzenegger; score: 5.7361355
5. Total Recall (1990 film); score: 5.197826
6. List of awards and nominations received by Arnold Schwarzenegger; score: 4.9710693
7. California gubernatorial recall election; score: 4.9665976
8. Patrick Schwarzenegger; score: 3.2915113
9. Recall election; score: 3.2077827
10. Gustav Schwarzenegger; score: 3.1247897
38. Prune Results Based on Alternative Names
“Total recall Arnold Schwarzenegger”
Alternative names: total recall (upcoming film), total recall (2012 film), total recall, total recall (2012), total recall 2012
39. Prune Results Based on Alternative Names
• Matched using alternative names
1. Arnold Schwarzenegger filmography; score: 9.455296
2. Arnold Schwarzenegger; score: 6.130941
3. Total Recall (2012 film); score: 5.9359055
4. Political career of Arnold Schwarzenegger; score: 5.7361355
5. Total Recall (1990 film); score: 5.197826
6. List of awards and nominations received by Arnold Schwarzenegger; score: 4.9710693
7. California gubernatorial recall election; score: 4.9665976
8. Patrick Schwarzenegger; score: 3.2915113
9. Recall election; score: 3.2077827
10. Gustav Schwarzenegger; score: 3.1247897
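A minimal sketch of this pruning step, assuming each indexed document carries a set of lowercase alternative names (the function name and data shapes are illustrative):

```python
def prune_by_alternative_names(query, results, alt_names):
    """Keep only hits that have an alternative name appearing in the query.

    results:   list of (doc_title, lucene_score) pairs
    alt_names: doc_title -> set of lowercase alternative names / redirects
    """
    q = query.lower()
    return [(title, score) for title, score in results
            if any(name in q for name in alt_names.get(title, ()))]

results = [("Total Recall (2012 film)", 5.9359055),
           ("Gustav Schwarzenegger", 3.1247897)]
alt = {"Total Recall (2012 film)": {"total recall", "total recall 2012"},
       "Gustav Schwarzenegger": {"gustav schwarzenegger"}}
prune_by_alternative_names("Total recall Arnold Schwarzenegger", results, alt)
# keeps only the Total Recall entry
```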
40. Retrieve Categories for Each Document
2. Arnold Schwarzenegger; score: 6.130941
   Arts & Entertainment > Movies/Television: 0.999924
   Arts & Entertainment > Pop Culture & Celebrity News: 0.999877
   Business: 0.99937
   Law-Government-Politics > Politics: 0.9975
   Games > Video & Computer Games: 0.986331
3. Total Recall (2012 film); score: 5.9359055
   Arts & Entertainment > Movies/Television: 0.999025
   Arts & Entertainment > Humor: 0.657473
5. Total Recall (1990 film); score: 5.197826
   Arts & Entertainment > Movies/Television: 0.999337
   Games > Video & Computer Games: 0.883085
   Arts & Entertainment > Hobbies > Antiques & Collectables: 0.599569
41. Combine Results and Calculate Final Score
“Total recall Arnold Schwarzenegger”
Arts & Entertainment > Movies/Television: 0.996706
Games > Video & Computer Games: 0.960575
Arts & Entertainment > Pop Culture & Celebrity News: 0.85966
Business: 0.859224
42. Combining Scores from Multiple Documents
If P(A(x, c)) is the probability that entity (query or document) x should be assigned to category c, then we can combine scores from multiple documents using the following formula: [formula image not transcribed]
P(R(q, D)) is the probability that query q is related to document D
P(A(D, c)) is the probability that document D should be assigned to category c
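The formula image itself did not survive extraction. One standard combination consistent with the stated definitions is a noisy-or over the retrieved documents; this is an assumed reading, not necessarily the exact formula on the slide:

```python
def combine_scores(doc_results, top_n=20):
    """Noisy-or combination of per-document category scores (an assumed
    reading of the slide's formula):

        P(A(q, c)) = 1 - prod over D of (1 - P(R(q, D)) * P(A(D, c)))

    doc_results: list of (p_rel, {category: p_assign}) per retrieved document,
    ordered by relevance; only the top_n documents are used (see slide 43).
    """
    remaining = {}  # category -> running product of (1 - p_rel * p_assign)
    for p_rel, categories in doc_results[:top_n]:
        for cat, p_assign in categories.items():
            remaining[cat] = remaining.get(cat, 1.0) * (1.0 - p_rel * p_assign)
    return {cat: 1.0 - prod for cat, prod in remaining.items()}

docs = [(0.9, {"Movies/Television": 0.999}),
        (0.6, {"Movies/Television": 0.999})]
combine_scores(docs)
# two strong, relevant documents push the category score close to 1.0
```

A noisy-or has the useful property that independent weak evidence accumulates, while a single highly relevant, confidently categorized document is already enough to dominate the score.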
43. Combining Scores
• Should we limit the number of documents in the result set?
• Based on research we decided to limit to the top 20
45. Categorizing Other Languages
Wikipedias with 1,000,000+ documents:
Deutsch
English
Español
Français
Italiano
Nederlands
Polski
Русский
Sinugboanong Binisaya
Svenska
Tiếng Việt
Winaray
46. Categorizing Other Languages
- In development
- Combining indexes for multiple languages into one common
index
- Focus:
- Spanish
- French
- German
- Portuguese
- Dutch
47. Preprocessing Workflow
• Automated Hadoop and local jobs
• Luigi library and scheduler
• Steps:
– Download
– Uncompress
– Parse Wikipedia/Freebase/DBpedia
– Generate n-grams
– Join together (wiki page) – one JSON per page
– Preprocess wiki page categories
– Produce JSON for Solr or local index
– Load into Solr + check quality
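The dependency ordering that Luigi enforces between these steps can be illustrated with a small topological-sort stand-in (step names paraphrase the list above; in production, Luigi's `requires()` declarations handle this):

```python
# A minimal stand-in for the Luigi dependency graph: each step runs only
# after every step it requires (step bodies are omitted).
STEPS = {
    "download": [],
    "uncompress": ["download"],
    "parse": ["uncompress"],          # Wikipedia / Freebase / DBpedia
    "ngrams": ["parse"],
    "join": ["parse", "ngrams"],      # one JSON per wiki page
    "categorize": ["join"],
    "solr_json": ["categorize"],
    "load_and_check": ["solr_json"],
}

def run_order(steps):
    """Topologically sort the steps so every dependency runs first."""
    order, done = [], set()
    def visit(name):
        if name in done:
            return
        for dep in steps[name]:
            visit(dep)
        done.add(name)
        order.append(name)
    for name in steps:
        visit(name)
    return order

run_order(STEPS)
# ['download', 'uncompress', 'parse', 'ngrams', 'join', ...]
```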
48. Query Categorization: Scale
• Scale is achieved by a combination of multiple categorization boxes, load balancing, and a Varnish (open source) cache layer in front of Solr
• We have 6 servers in production today
• Load balancer – HAProxy
• Capacity – 1,000 QPS/server
• More servers can be added if needed
[Diagram: Wikipedia, DBpedia, and Freebase dumps → Hadoop → Solr index → search engine (Solr), with a Varnish cache and load balancer (LB) in front, plus reporting]
49. Architected for Scale
• Bidders, AdServers developed in Python and use PyPy VM with JIT
• Response time critical - typically under 100ms as measured by exchange
• High volume of auctions – 200,000 QPS at peak
• Hadoop – 25-node cluster
• 3 data centers – US East, US West, and London
• Data centers have multiple load balancers – HAProxy
• Overview of servers in production:
• US East: 6LB, 45 Bidders, 6 AdServers, 4 trackers, 25 Hadoop, 9 Hbase, 8 Kyoto DB
• US West: 3LB, 17 Bidders, 6 AdServers, 4 trackers, 4 Kyoto DB
• London: 8 Bidders, 2 AdServers, 2 trackers, 4 Kyoto DB
50. ERD Challenge
• ERD'14: Entity Recognition and Disambiguation Challenge
• Organized as a workshop at SIGIR 2014 Gold Coast
• Goal: Submit working systems that identify the entities mentioned in text
• We participated in the “Short Text” track
• 19 teams participated in the challenge
• We took 4th place