The document discusses optimizing search user interfaces (SUIs) and interactions within professional social networks. It analyzes search behavior data from Facebook to answer four research questions. Key findings include: (1) users issue named entity queries (NEQs) mostly for friends and structured queries (SQs) mostly for non-friends, so the two query types complement each other; (2) younger users search more for non-friends, while older users search more for friends; (3) females write more queries than males across query types; (4) the number of friends correlates positively with friend NEQs and negatively with non-friend SQs. The study draws implications for personalizing search suggestions across demographics.
Better Search Through Query Understanding
Presented as a Data Talk at Intuit on April 22, 2014
Search is a fundamental problem of our time — we use search engines daily to satisfy a variety of personal and professional information needs. But search engine development still feels stuck in an information retrieval paradigm that focuses on result ranking. In this talk, I’ll advocate an emphasis on query understanding. I’ll talk about how we implement query understanding at LinkedIn, and I’ll present examples from the broader web. Hopefully you’ll come out with a different perspective on search and share my appreciation for how we can improve search through query understanding.
About the Speaker
Daniel Tunkelang leads LinkedIn's efforts around query understanding. Before that, he led LinkedIn's product data science team. He previously led a local search quality team at Google and was a founding employee of Endeca (acquired by Oracle in 2011). He has written a textbook on faceted search, and is a recognized advocate of human-computer interaction and information retrieval (HCIR). He has a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT.
Description of the DaCENA approach to the contextual exploration of knowledge graphs. We use machine learning to learn user preferences using a limited number of user inputs. Through these inputs, we learn a personalized ranking function over semantic associations (semi-paths in a knowledge graph) that best fit users' interests. References for the presentation are:
Bianchi et al.: Actively Learning to Rank Semantic Associations for Personalized Contextual Exploration of Knowledge Graphs. ESWC (1) 2017: 120-135.
Palmonari et al.: DaCENA: Serendipitous News Reading with Data Contexts. ESWC (Satellite Events) 2015: 133-137.
- What are clustering, honeypots, and density-based clustering?
- What is OPTICS clustering, how does it differ from density-based clustering, and how can it be used for outlier detection?
- What is so-called soft clustering, how does it differ from hard clustering, and how can it be used for outlier detection?
Recognizing Names with AI Using Solr Hashes and Stream Processing (Lucidworks)
Michael Harris, Solutions Engineer & Christopher Biow, SVP Global Public Sector & CRO, Basis Technology. Presentation from ACTIVATE 2019, the Search and AI Conference hosted by Lucidworks. http://www.activate-conf.com
Exploring Session Context using Distributed Representations of Queries and Re... (Bhaskar Mitra)
Search logs contain examples of frequently occurring patterns of user reformulations of queries. Intuitively, the reformulation "san francisco" → "san francisco 49ers" is semantically similar to "detroit" →"detroit lions". Likewise, "london"→"things to do in london" and "new york"→"new york tourist attractions" can also be considered similar transitions in intent. The reformulation "movies" → "new movies" and "york" → "new york", however, are clearly different despite the lexical similarities in the two reformulations. In this paper, we study the distributed representation of queries learnt by deep neural network models, such as the Convolutional Latent Semantic Model, and show that they can be used to represent query reformulations as vectors. These reformulation vectors exhibit favourable properties such as mapping semantically and syntactically similar query changes closer in the embedding space. Our work is motivated by the success of continuous space language models in capturing relationships between words and their meanings using offset vectors. We demonstrate a way to extend the same intuition to represent query reformulations.
Furthermore, we show that the distributed representations of queries and reformulations are both useful for modelling session context in query prediction tasks, such as query auto-completion (QAC) ranking. Our empirical study demonstrates that short-term (session) history context features based on these two representations improve the mean reciprocal rank (MRR) for the QAC ranking task by more than 10% over a supervised ranker baseline. Our results also show that using features based on both representations together achieves better performance than either of them individually.
Paper: http://research.microsoft.com/apps/pubs/default.aspx?id=244728
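The mean reciprocal rank metric cited above is standard and easy to compute: for each query, take the reciprocal of the rank at which the first relevant item appears, then average. A minimal sketch follows; the ranked completion lists and submitted queries here are hypothetical, not taken from the paper.

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """MRR: average over queries of 1/rank of the first relevant item."""
    total = 0.0
    for ranking, target in zip(ranked_lists, relevant):
        for rank, candidate in enumerate(ranking, start=1):
            if candidate == target:
                total += 1.0 / rank
                break  # only the first relevant item counts
    return total / len(ranked_lists)

# Hypothetical QAC sessions: the completion the user actually submitted
# appears at rank 2 in the first list and rank 1 in the second.
rankings = [
    ["new york weather", "new york tourist attractions", "new york times"],
    ["things to do in london", "london weather"],
]
submitted = ["new york tourist attractions", "things to do in london"]
print(mean_reciprocal_rank(rankings, submitted))  # (1/2 + 1/1) / 2 = 0.75
```

A session-context feature that lifts the submitted completion from rank 2 to rank 1 in the first session would raise this toy MRR from 0.75 to 1.0, which is the kind of movement the reported 10% improvement describes.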
Semantic search helps business people find answers to pressing questions by wading through oceans of information to surface the nuggets that matter. In this presentation we’ll discuss how semantic search and content analysis technologies are starting to appear in the marketplace today. We’ll provide a recap of what semantic search is and what its key benefits are, then we’ll answer the following questions:
• Is semantic search a feature, an application, or an enterprise system?
• How can I add semantic search to my existing work processes?
• Will I need to replace my existing content technologies?
• What will I need to do to prepare my content for semantic search?
• Is semantic search just for documents or can I search my data too?
• Can I use semantic search to find information on the internet and other public data sources?
• Are there standards to consider?
I will try to explain what question answering (QA) is, how we can obtain answers to questions posed in natural language, and how successful the field has been so far.
My material draws on three selected papers and the related reading around them.
Optimizing Search Interactions within Professional Social Networks (thesis p...) (Nik Spirin)
We must redesign all major elements of the search user interface, such as the input, control, and informational elements, to provide more effective search interactions for users of professional social networks (PSNs). The existing interfaces deliver suboptimal utility because they underutilize the structured nature of professional social network entities.
Beyond Collaborative Filtering: Learning to Rank Research Articles (Maya Hristakeva)
At Elsevier we work on recommender systems to help researchers connect to their research and to collaborators (e.g. Mendeley Suggest, Science Direct, Funding Opportunities and Evise Reviewer recommenders). This talk focused on the recent improvements the team has made to the Science Direct research articles recommender by deploying ranking models in production.
I gave this presentation at the 7th RecSys London Meetup - https://www.meetup.com/RecSys-London/events/255362180/
Wimmics Research Team 2015 Activity Report (Fabien Gandon)
Extract of the activity report of the Wimmics joint research team between Inria Sophia Antipolis - Méditerranée and I3S (CNRS and Université Nice Sophia Antipolis). Wimmics stands for web-instrumented man-machine interactions, communities and semantics. The team focuses on bridging social semantics and formal semantics on the web.
From “Selena Gomez” to “Marlon Brando”: Understanding Explorative Entity Search (Mounia Lalmas-Roelleke)
Slides of our paper. Work with Iris Miliaraki and Roi Blanco. Paper published at 24th International World Wide Web Conference (WWW 2015), Florence, Italy.
Abstract: Consider a user who submits a search query "Shakira" with a specific search goal in mind (such as her age) but at the same time willing to explore information about other entities related to her, such as comparable singers. In previous work, a system called Spark was developed to provide such a search experience. Given a query submitted to the Yahoo search engine, Spark provides related entity suggestions for the query, exploiting, among other sources, public knowledge bases from the Semantic Web. We refer to this search scenario as explorative entity search. The effectiveness and efficiency of the approach were demonstrated in previous work. However, the way users interact with these related entity suggestions, and whether this interaction can be predicted, had not been studied. In this paper, we perform a large-scale analysis of how users interact with the entity results returned by Spark. We characterize the users, queries, and sessions that appear to promote an explorative behavior. Based on this analysis, we develop a set of query and user-based features that reflect the click behavior of users and explore their effectiveness in the context of a prediction task.
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014 (James Powell)
The Internet represents the connections among computers and devices, the world wide web is a network of interconnected documents, and the semantic web is the closest thing we have today to a network of interconnected facts. Noticeably absent from these global networks is any sort of open, formal representation for an online global social network. Each user's online presence, and its immediate social network, are isolated and typically only available within the confines of the social networking site that hosts it. Discovery across explicit online social networks and implicit social networks, such as those that can be inferred from co-authorship relationships and affiliations, is, for all practical purposes, impossible. And yet there are practical and non-nefarious reasons why an organization might be interested in exploring portions of such a network. Outreach is one such interest. Los Alamos National Laboratory (LANL) prototyped EgoSystem to harvest and explore the professional social networks of postdoctoral students. The project's goal is to enlist past students and other Lab alumni as ambassadors and advocates for LANL's ongoing mission. During this talk we will discuss the various technologies that support EgoSystem and demonstrate some of its capabilities.
Scaling Recommendations, Semantic Search, & Data Analytics with Solr (Trey Grainger)
This presentation is from the inaugural Atlanta Solr Meetup held on 2014/10/21 at Atlanta Tech Village.
Description: CareerBuilder uses Solr to power their recommendation engine, semantic search, and data analytics products. They maintain an infrastructure of hundreds of Solr servers, holding over a billion documents and serving over a million queries an hour across thousands of unique search indexes. Come learn how CareerBuilder has integrated Solr into their technology platform (with assistance from Hadoop, Cassandra, and RabbitMQ) and walk through API and code examples to see how you can use Solr to implement your own real-time recommendation engine, semantic search, and data analytics solutions.
Speaker: Trey Grainger is the Director of Engineering for Search & Analytics at CareerBuilder.com and is the co-author of Solr in Action (2014, Manning Publications), the comprehensive example-driven guide to Apache Solr. His search experience includes handling multi-lingual content across dozens of markets/languages, machine learning, semantic search, big data analytics, customized Lucene/Solr scoring models, data mining and recommendation systems. Trey is also the Founder of Celiaccess.com, a gluten-free search engine, and is a frequent speaker at Lucene and Solr-related conferences.
Social Network Analysis based on MOOCs (Massive Open Online Courses) (ShankarPrasaadRajama)
We collected data by conducting a survey about MOOCs among fellow classmates and, using Python, created edge lists linking students to their skills and to the MOOC websites where they take courses.
We visualized the student network in UCINET and measured the densities among clusters in the network.
We performed hypothesis testing to see whether a student's characteristics affect their position (centrality) in the network.
Search Analytics: Conversations with Your Customers (richwig)
Did you know that the search box on your home page handles half or more of all your visitors' requests? What do people search for most often when they visit your Web site? How can you tune your site search, and your site, to perform better?
Rich Wiggins presents a talk that he and co-author Lou Rosenfeld prepared, covering the topics of search analytics, Best Bets, and tuning your Web site to match what your customers seek.
Improving Semantic Search Using Query Log Analysis (Stuart Wrigley)
Despite the attention Semantic Search is continuously gaining, several challenges affecting tool performance and user experience remain unsolved. Among these are: matching user terms with the search space, adopting view-based interfaces in the Open Web, and supporting users while building their queries. This paper proposes an approach to move a step forward towards tackling these challenges by creating models of usage of Linked Data concepts and properties, extracted from semantic query logs as a source of collaborative knowledge. We use two sets of query logs from the USEWOD workshops to create our models and show the potential of using them in the mentioned areas.
Unleashing the Power of Data: Choosing a Trusted Analytics Platform (Enterprise Wired)
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (vertices with the same in-links) avoids duplicate computations and can also reduce iteration time. Road networks often contain chains that can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
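The simplest of the optimizations above, skipping computation on vertices whose ranks have already converged, can be illustrated with a toy power-iteration PageRank. This is only a sketch of that one idea, not the STICD algorithm: it assumes no dangling vertices, and the per-vertex tolerance `eps` is a hypothetical knob.

```python
def pagerank_skip_converged(in_nbrs, n, alpha=0.85, eps=1e-12, tol=1e-10, iters=100):
    """Power-iteration PageRank that stops updating vertices whose rank has
    already converged. in_nbrs maps each vertex to its list of in-neighbours.
    Assumes every vertex has at least one out-link (no dangling nodes)."""
    # Derive out-degrees from the in-neighbour lists.
    out_deg = [0] * n
    for v in range(n):
        for u in in_nbrs.get(v, []):
            out_deg[u] += 1
    r = [1.0 / n] * n
    converged = [False] * n
    for _ in range(iters):
        r_new = list(r)
        for v in range(n):
            if converged[v]:
                continue  # the per-iteration saving: no recomputation here
            s = sum(r[u] / out_deg[u] for u in in_nbrs.get(v, []))
            r_new[v] = (1 - alpha) / n + alpha * s
            if abs(r_new[v] - r[v]) < eps:
                converged[v] = True  # freeze this vertex from now on
        delta = max(abs(a - b) for a, b in zip(r_new, r))
        r = r_new
        if delta < tol:
            break
    return r

# 3-cycle 0 -> 1 -> 2 -> 0: by symmetry every vertex should get rank 1/3.
ranks = pagerank_skip_converged({0: [2], 1: [0], 2: [1]}, 3)
print(ranks)
```

Note the trade-off the paragraph alludes to: freezing a vertex saves work, but its neighbours may still drift, so in practice the per-vertex tolerance must be chosen conservatively relative to the global one.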
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working with unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Adjusting OpenMP PageRank: SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives (i.e., sumAt, multiply) in sequential mode.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Optimizing Search User Interfaces and Interactions within Professional Social Networks
1. Optimizing Search User Interfaces and Interactions within Professional Social Networks
PhD Candidate: Nikita V. Spirin (UIUC)
PhD Committee: Prof. Karrie G. Karahalios (UIUC, co-adviser), Prof. ChengXiang Zhai (UIUC, co-adviser), Prof. Jiawei Han (UIUC), Dr. Daniel Tunkelang (LinkedIn, Google, Endeca)
2. Imagine that you are looking for a Software Engineer job in New York
3. Keyword search for entities (e.g. people, jobs, groups). Faceted search to filter entities based on attributes. To help users cope with the immense scale and influx of new information, professional social networks provide search functionality.
4. Search within PSNs is fundamentally different from web search and traditional IR
• The units of retrieval are structured and typed entities rather than documents.
• The entities aren't independent from each other but form an entity graph. Plus, users form part of this graph.
• Sorting by relevance, typical for web search, is not the only way to order search results. There are many new ways of ordering, e.g. sort by date, sort by salary, etc.
• Rather than providing services to the mass market, PSNs' target audience is knowledge workers.
5. “...it is clearly the case that the new models and associated representation and ranking techniques lead to only incremental (if that) improvement in performance over previous models and techniques, which is generally not statistically significant (e.g. Sparck Jones, 2005); and, that such improvement, as determined in TREC-style evaluation, rarely, if ever, leads to improved performance by human searchers in interactive IR systems...”
Nicholas Belkin, Keynote at ECIR 2008
6. How can we optimize search user interfaces (SUIs) and interactions within professional social networks?
7. How can we optimize SUIs and interactions within professional social networks?
[Annotated screenshot of a search results page: query formulation and suggestions, filters, resorting controls, snippets for jobs/people, and breadcrumbs]
15. • Interactive free-text queries (e.g. “Stephen Robertson”, “SIGIR”, “Chinese Buffet”)
• Interactive structured queries (e.g. “Photos of people who visited Beijing”)
• One-shot free-text queries (e.g. “big data”, “query log mining”, “Shanghai”), limited to users' status updates
17. We explore the way people search for people on Facebook
• RQ1: How does search behavior differ for NEQs and SQs?
• RQ2: How does search behavior depend on the graph search distance (friend vs. non-friend)?
• RQ3: How does search behavior depend on demographic attributes (age, gender, number of friends, celebrity status)?
• RQ4: How are structured querying capabilities used by the users of Graph Search?
18. We use four interconnected data sets provided by Facebook:
• Anonymized Named Entity Query Log: 3M non-novice users; 58.5M queries; Sept 2013 – Oct 2013
• Anonymized Structured Query Log: 3M non-novice users; 10.9M queries; Sept 2013 – Oct 2013
• Anonymized Social Graph: 858M vertices; 270B edges; Oct 2013 snapshot
• Anonymized User Profiles: 858M vertices; age, gender, # of friends; en_US (English + USA)
19. Definitions: graph search distance
• Named Entity Query: use the traditional graph-theoretical definition of the graph distance.
• Structured Query: (1) if the query contains one entity, use the traditional graph-theoretical definition; (2) if it contains 2+ entities, compute the distance to each one and renormalize by computing a bit vector with three components (one for each of the three classes of the graph distance).
RQ1, RQ2
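The definition above can be sketched in code. The helper names are hypothetical, BFS supplies the graph-theoretical distance on a friendship graph, and the three distance classes are read here as self (0), friend (1), and non-friend (2+), which is one plausible interpretation of the three classes in the slide.

```python
from collections import deque

def graph_distance(graph, src, dst):
    """Plain BFS distance in an undirected friendship graph."""
    if src == dst:
        return 0
    seen, frontier, d = {src}, deque([src]), 0
    while frontier:
        d += 1
        for _ in range(len(frontier)):
            for nb in graph.get(frontier.popleft(), []):
                if nb == dst:
                    return d
                if nb not in seen:
                    seen.add(nb)
                    frontier.append(nb)
    return float("inf")  # unreachable

def distance_class(d):
    # Three classes of the graph distance: self (0), friend (1), non-friend (2+).
    return 0 if d == 0 else 1 if d == 1 else 2

def structured_query_distance(graph, searcher, entities):
    """Bit vector with one component per distance class: a component is set
    if any entity mentioned in the structured query falls into that class."""
    bits = [0, 0, 0]
    for e in entities:
        bits[distance_class(graph_distance(graph, searcher, e))] = 1
    return bits

# Toy graph: alice -- bob -- carol. A query mentioning bob (a friend) and
# carol (a friend-of-friend) sets the friend and non-friend components.
friends = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["bob"]}
print(structured_query_distance(friends, "alice", ["bob", "carol"]))  # [0, 1, 1]
```

How multi-entity distances are renormalized in the actual study is not spelled out on the slide; the any-entity-per-class rule here is just one way to fold several distances into a single three-component vector.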
20. NEQs and SQs complement each other, enabling more effective exploration of the network
• Users search for friends using NEQs and search for non-friends using SQs.
• Self-queries account for a small share of the overall query volume.
• Users search for themselves more using SQs.
RQ1, RQ2
22. Graph search distance vs. Age (10-year bins)
Users write NEQs for friends more often than NEQs for non-friends across all age bins.
[Chart: NEQ 1st/user and NEQ 2nd+/user per age bin, ages 10–80]
RQ2, RQ3
23. Graph search distance vs. Age (10-year bins)
The graph for SQs is bi-modal: non-friend SQs prevail for younger users, friend SQs for older users.
[Chart: SQ 1st/user and SQ 2nd+/user per age bin, ages 10–80]
RQ2, RQ3
24. Graph search distance vs. Age (10-year bins)
Younger users search more actively for non-friends, and older users for friends, relative to the average user.
[Chart: NEQ 1st/(1st + 2nd+) ratio and SQ 1st/(1st + 2nd+) ratio per age bin]
RQ2, RQ3
25. Graph search distance vs. Gender
Females write more queries than males, consistently across the query types (both NEQs and SQs).
[Charts: NEQ 1st/user, NEQ 2nd+/user, NEQ/user and SQ 1st/user, SQ 2nd+/user, SQ/user, by gender]
RQ2, RQ3
26. Graph search distance vs. Number of friends (100-friend bins, from 0 to 1500)
The more friends a user has, the more friend NEQs the user writes. The trend for non-friend NEQs slightly declines.
[Chart: NEQ 1st/user and NEQ 2nd+/user per friend-count bin]
RQ2, RQ3
27. Graph search distance vs. Number of friends (100-friend bins, from 0 to 1500)
Users with more friends write fewer non-friend SQs.
[Chart: SQ 1st/user, SQ 2nd+/user, and SQ/user per friend-count bin]
RQ2, RQ3
28. Graph search distance vs. Number of friends (100-friend bins, from 0 to 1500)
The trend for non-friend NEQs is flat, while friend NEQs contribute to the growth of the query volume.
[Chart: NEQ 1st/user, NEQ 2nd+/user, and NEQ/user per friend-count bin]
RQ2, RQ3
29. Graph search distance vs. Number of friends (100-friend bins, from 0 to 1500)
The trend for friend SQs is flat, while the volume of non-friend SQs changes with the number of friends.
[Chart: SQ 1st/user, SQ 2nd+/user, and SQ/user per friend-count bin]
RQ2, RQ3
30. Graph search distance vs. Celebrity status
Celebrity users submit more NEQs and fewer SQs than typical users.
[Charts: NEQ 1st/user, NEQ 2nd+/user, NEQ/user and SQ 1st/user, SQ 2nd+/user, SQ/user, for typical vs. celebrity users]
RQ2, RQ3
31. Graph search distance vs. Celebrity status
Celebrities search more for other celebrities than typical users do. Both user groups are more likely to search for a celebrity when they write a non-friend query than when they write a friend query.
RQ3
36. Structured query popularity vs. Length,
measured as # of functional predicates
RQ4
• Shorter SQs are more popular.
• Users write shorter grammar queries when they search for first-degree connections.
37
39. Grammar usage for name disambiguation
RQ4
Top-5 groups of
disambiguation
predicates used in SQs
1. Location
2. Affiliation (e.g. Company)
3. Interest
4. Gender
5. Relationship
40
40. Key take-aways and design implications
• Both NEQs and SQs are important for navigation and exploration within the social network
– Users search for friends with NEQs
– Users search for non-friends and explore the graph using SQs
• Personalized search query suggestions are very promising
– With limited time or resources, focus on SQs first, since SQ usage has higher variance across demographic groups
– Don't limit query suggestions to friends only; include some interesting distant network vertices
– Take into account the degree preference distribution of each predicate, i.e. rank entities for a predicate using its graph distance distribution
41
44. Search for “product manager” sort by “relevance”
45
45. Search for “product manager” sort by “date desc”
46
46. Search for “table” sort by “relevance”
47
47. Search for “table” sort by “time desc”
48
48. Search for “chocolate” sort by “relevance”
49
49. Search for “chocolate” sort by “price asc”
50
50. Problems with the existing SUIs supporting result re-sorting by an attribute value
• When results are sorted by relevance, the output is good
– Average Precision@10 is 0.86
– Results are personalized for the user
• When sorting by an attribute value, e.g. salary high-to-low, price low-to-high, or date recent-to-old, there are many irrelevant results at the top of the SERP
– Average Precision@1 is 0.44
– Average Precision@5 is 0.45
– 61% of queries have Precision@10 below 0.5
– Personalization is gone
51
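The Precision@k numbers above follow from a one-line definition; a minimal sketch with binary relevance labels (function name is hypothetical, not from the talk):

```python
def precision_at_k(relevances, k):
    """Precision@k: the fraction of the top-k results judged relevant.
    `relevances` is a list of binary labels in ranked order."""
    return sum(relevances[:k]) / k
```

For example, a ranking with six relevant results in the top ten has Precision@10 = 0.6.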
52. We explore how to improve the relevance of search results sorted by an attribute value
• RQ5: Can the quality be improved by incorporating
relevance into the ranking process?
• RQ6: What is the best way to accomplish it?
53
55. Evaluation trace for a toy example problem
{(0, 0); (1, 3); (2, 1); (3, 2); (4, 1); (5, 3)}
Dependencies between problems
in the memoization matrix and
proper evaluation order
Reconstruction of the optimal
path using the intermediate
values in the memoization matrix
56
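The memoization matrix and path reconstruction on this slide can be sketched as a small dynamic program. The sketch below is an assumption about the setup, not the talk's exact algorithm: given relevance labels in attribute-sorted order, it chooses the subsequence (keeping attribute order) that maximizes DCG, with dp[i][k] indexed by (items considered, results kept) and the optimal path recovered from a parallel keep matrix:

```python
from math import log2

def gain(label):
    # exponential gain, standard for DCG
    return 2 ** label - 1

def relevance_aware_filter(labels):
    """dp[i][k] = best DCG using the first i attribute-sorted results
    with exactly k of them shown; a kept item at display position k
    contributes gain(label) / log2(k + 1)."""
    n = len(labels)
    NEG = float("-inf")
    dp = [[NEG] * (n + 1) for _ in range(n + 1)]
    keep = [[False] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = 0.0  # showing nothing scores zero
    for i in range(1, n + 1):
        for k in range(1, i + 1):
            skip = dp[i - 1][k]
            take = dp[i - 1][k - 1] + gain(labels[i - 1]) / log2(k + 1)
            if take >= skip:
                dp[i][k] = take
                keep[i][k] = True
            else:
                dp[i][k] = skip
    best_k = max(range(n + 1), key=lambda k: dp[n][k])
    # reconstruct the kept indices from the memoization matrix
    kept, i, k = [], n, best_k
    while k > 0:
        if keep[i][k]:
            kept.append(i - 1)
            k -= 1
        i -= 1
    return dp[n][best_k], sorted(kept)
```

On the toy labels above read as [0, 3, 1, 2, 1, 3], the DP keeps indices 1, 3, 4, 5 and drops the zero-gain item and the weaker mid-list item.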
56. Experiments with the real L2R data sets (MSR LETOR collections MQ2007 and MSLR-WEB10K)
• Predict relevance labels with Gradient Boosted Regression Trees (5-fold cross-validation partitioning)
• Extend the MQ2007 and MSLR-WEB10K data sets by assigning a random timestamp to each document to model sorting by the attribute value
• Apply filtering as the final step in the query processing pipelines for the following baselines:
– B1: sort by the attribute value and do nothing else (weak)
– B2: predict relevance labels, take all above the threshold, re-sort by the attribute value (somewhat strong)
– B3: sort by relevance, take the top-k results, re-sort by the attribute value (strong)
• Average the results from 1,000 simulation runs
RQ5
57
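The three baselines are simple enough to sketch directly. A minimal illustration (the field names `attr` and `pred_rel` are hypothetical stand-ins for the attribute value and predicted relevance):

```python
def b1_attribute_sort(results):
    # B1 (weak): sort by the attribute value and do nothing else
    return sorted(results, key=lambda r: r["attr"])

def b2_threshold_filter(results, threshold):
    # B2 (somewhat strong): keep results predicted above the relevance
    # threshold, then re-sort the survivors by the attribute value
    kept = [r for r in results if r["pred_rel"] >= threshold]
    return sorted(kept, key=lambda r: r["attr"])

def b3_topk_rerank(results, k):
    # B3 (strong): take the top-k by predicted relevance,
    # then re-sort those k results by the attribute value
    topk = sorted(results, key=lambda r: r["pred_rel"], reverse=True)[:k]
    return sorted(topk, key=lambda r: r["attr"])
```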
57. Our approach outperforms all baselines (including top-k re-ranking) and leads to a ~2-4% lift in NDCG
MQ2007 (1,600 queries; {0,1,2} labels; 40 doc/query; 46 feats.)
MSLR-WEB10K (10,000 queries; {0,1,2,3,4} labels; 120 doc/query; 136 feats.)
RQ5
58
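For reference, the NDCG metric behind the reported lift can be sketched as follows (exponential gain with a log2 discount; the talk may use a different variant):

```python
from math import log2

def dcg(labels):
    # discounted cumulative gain of a ranked list of graded labels
    return sum((2 ** l - 1) / log2(i + 2) for i, l in enumerate(labels))

def ndcg(labels):
    # normalize by the DCG of the ideal (relevance-sorted) ordering
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0
```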
58. The behavior of the algorithm for different
input sizes and relevance label distributions
59
59. The algorithm can process 1,000 results in under 100 ms using our reference C++ implementation on a workstation with 4GB RAM and two 2.5GHz CPU cores
60
60. Key take-aways and design implications
• The quality of search results sorted by an attribute value can be improved using relevance-aware filtering. The proposed algorithm consistently outperforms all known baselines and increases search quality by 2-4%
• Assuming that users scan the results sequentially, the proposed algorithm is theoretically optimal, as it directly optimizes search quality metrics within the dynamic programming framework
• Higher gains occur for relevance label distributions in which relevant results are more probable, and for medium-length result sets (20-100 tuples)
• The algorithm can process 1,000 results in under 100 ms using our reference C++ implementation on a machine with 4GB RAM and two 2.5GHz cores
61
67. The problem is that search snippets are either absent or generated with very naive heuristics
• Titles on the SERP are not informative, since in job search queries are often identical to titles; e.g., for the query “Software Engineer”, relevant results will have “Software Engineer” in their titles.
• Titles on the SERP are not discriminative and barely help users make click decisions. Users play the “lottery”, trying to find a relevant link among 10 similar-looking links.
• A title and a (query-biased) snippet are redundant, which forces users to expend cognitive energy on the SERP without extra gains.
• Often the content of a snippet doesn't provide useful information about the job posting behind the link. For example, snippets contain irrelevant numbers, names, etc.
• For jobs that are not directly related to the query, title-only SERPs don't help with click decisions. For example, a software engineer at a data-driven company might do data science, but since that isn't commonly assumed, users will ignore such a job posting.
68
68. RQ7: What job attributes do people consider
important when deciding whether they want
to apply for a job?
69
69. Method: mixed-method need elicitation study
“Think-aloud” comments while conducting job search
Job posting annotation using the Diigo plugin
Two surveys asking to score attributes (SERP + job page)
RQ7
70
70. Participants have diverse backgrounds (gender and professional interests / job title)
• Required criteria:
– At least 3 months of internship experience
– Over 18 years old
– Has used an online job search engine (e.g. LinkedIn Jobs, Indeed)
• Gender: 13 females, 13 males
• Job title: Software Engineer (8), Data Scientist (3), Healthcare Consultant (2), Research Scientist (2), Personal Trainer (1), Genetics Counselor (1), Product Manager (1), Translator (1), Occupational Therapist (1), Marketing Manager (1), Business Analyst (1), Foreign Policy Representative (1), Consultant (1), Biomedical Product Developer (1), Pharmacist (1).
RQ7
71
71. Participants have diverse backgrounds (age,
education, and years of work experience)
What is your current
student (work) status?
How many years of work
experience do you have?
RQ7
72
72. Findings from the need elicitation user study about relative attribute importance
• Think-aloud comments:
– Company (36); skills (34); job title (29); responsibilities (25); years of work experience (22); degree (15); location (14)
• Job posting annotation:
– Qualifications (83), including required skills (72), degree (59), years of work experience (53); responsibilities (71), location (24), job type (24), job title (16), work authorization (13)
• Surveys:
– Job type (9.39/10); company (8.99/10); job title (8.84/10); required skills (8.65/10); responsibilities (8.41/10); educational requirements including degree and major (8.29/10); location including city and country (8.27/10); years of experience (8.26/10)
RQ7
73
73. “Think-aloud” comments made by participants while searching for and annotating jobs
“I stopped once I saw SQL and other coding technologies [skills]. I am a different kind of analyst.” [P2]
“I try to count the number of required skills I cover. If a lot of the skills don't match my background, I go for another job.” [P14]
“When I search I try to follow the following strategy: if I am sure that I fit, I will open the job posting [from the SERP]; if it is 50/50, I will still open it since I am exploring more options; if I am sure that I don't qualify, I will skip. Basically, I look for must-have criteria and if they aren't satisfied, I skip. For me these are title, skill, major, and degree.” [P22]
RQ7
74
74. Top-ranked job attributes based on the survey responses (scored on a 5-point Likert scale)
RQ7
75
75. The proposal is to standardize job postings using information extraction and show responsibilities and requirements in the snippets on the SERP
Generate extended snippets for job search
Optimize detailed page views
76
76. We explore the feasibility of generating structured snippets and their effectiveness
• RQ8: How can unstructured job postings be converted into a structured representation with minimal supervision?
• RQ9: Do extended structured snippets improve the job search user experience? How do users behave when such structured snippets are used?
77
77. Jobs are quite regular, and one word per section is enough to prepare the training set for the ML model
RQ8
78
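The "one word per section" idea can be sketched as seed-word weak labeling: any block whose header contains a section's seed word gets that section's label, and the labeled blocks become training data for the ML model. The seed words below are hypothetical examples, not the study's actual lexicon:

```python
# hypothetical seed words: one anchor word per target section
SEED_WORDS = {
    "responsibilities": "responsibilit",  # matches "Responsibility/ies"
    "requirements": "requirement",
    "conditions": "benefit",
}

def weak_label(sections):
    """Label each (header, body) block of a job posting whose header
    contains a seed word; blocks without a match are skipped."""
    labeled = []
    for header, body in sections:
        for label, seed in SEED_WORDS.items():
            if seed in header.lower():
                labeled.append((label, body))
                break
    return labeled
```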
80. Weakly-supervised vs. Supervised (English)
Our weakly-supervised approach achieves Precision as good as that of the supervised model trained on a corpus of 1,000 labeled job postings. At the same time, our approach is easily deployable for many languages and has higher Page Coverage.
RQ8
81
81. Extraction quality across job titles using the proposed weakly-supervised approach (English)
Extraction quality is consistently high across a randomly selected sample of job titles, which implies generalizability of the model to the entire job search vertical (for English).
RQ8
82
82. Tuning for a specific language (Russian) leads to a boost in information extraction quality
• An active learning pipeline bootstraps more accurate section detection rules, which minimizes human intervention and effort and increases model precision
• A hybrid two-stage algorithm based on rules, with machine learning as a back-off, yields 97-98% Precision:
– Do high-accuracy classification using manually defined rules
– Classify the remaining sentences with the machine learning model
[Chart: % of pages covered vs. number of rules]
RQ9
83
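The two-stage hybrid can be sketched as rules first, ML as back-off; `rules` and `ml_model` here are hypothetical stand-ins for the manually defined patterns and the trained classifier:

```python
def classify_sentence(sentence, rules, ml_model):
    """Stage 1: high-precision manually defined substring rules.
    Stage 2: fall back to the ML model for uncovered sentences."""
    lowered = sentence.lower()
    for label, pattern in rules:
        if pattern in lowered:
            return label
    return ml_model(sentence)
```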
83. Before (#1 job search engine in Russia)
Hard to differentiate
similar job titles +
no textual snippets
(only company,
location, date posted)
RQ9
84
84. After (tested in production A/B tests with #1 job search
engine in Russia): DEFAULT vs. RESP+REQ+COND
RQ9
85
86. The ratio of SERP clicks per query is lower
[Chart: SERP clicks per query over days since the beginning of the experiment (two series); less is better]
RQ9
87
87. The ratio of job applications over job views is higher
[Chart: job applications per job view over days since the beginning of the experiment (two series); more is better]
RQ9
88
88. Other relevant metrics from the A/B test
• Extraction quality: 97% precision at 100% coverage
• Decreased number of queries per session by 8%
• Decreased number of detailed page views by 1.4X
• Increased number of applications overall by 1.6%
• Increased application rate conditioned on click by 13%
• Decreased number of short clicks by 5.5%
• Decreased number of wasted views by 1.25X
• Decreased click entropy by 1.98X
RQ9
89
89. Key take-aways and design implications
• In addition to the attributes currently shown on the SERP,
users pay attention to responsibilities and requirements
• By leveraging big-data redundancy, we can generate large-scale annotated data sets with minimal supervision and use them to train highly accurate ML models to extract the responsibilities and requirements sections from job postings
• The proposed weakly-supervised approach for information
extraction can be easily adapted to new languages
• Extended structured snippets improve search user experience:
– Minimize irrelevant clicks and click entropy
– Standardize job posting representation
– Eliminate title-snippet redundancy
90
98. In the case of structured search, the longer
the query, the more redundant and less
informative are the query-biased snippets.
Query-snippet duality
99
99. Prior work (DB community): from query-biased to non-redundant snippets for structured search
• M. Das et al., Generating Informative Snippet to Maximize Item Visibility, CIKM '2013
• A. Kashyap and V. Hristidis, Comprehension-Based Result Snippets, CIKM '2012
• Z. Liu et al., Structured Search Result Differentiation, VLDB '2009
• M. Miah et al., Standing Out in a Crowd: Selecting Attributes for Maximum Visibility, ICDE '2008
• G. Das et al., Ordering the Attributes of Query Results, SIGMOD '2006
100
106. • RQ10: What kind of snippets make users more
productive and successful when performing
structured search on mobile devices?
• RQ11: What kind of snippets do users prefer
based on their subjective feelings?
107
107. Method: laboratory interactive user study
Built a new structured search mobile app
Invited 39 (12 + 24 + 3) participants to the lab
Had each participant do four search tasks
108
110. Participants have diverse backgrounds (mostly
UIUC students + a few working professionals)
• Gender: 21 females, 18 males (we recruited 3
extra people to make up for outlier sessions)
• Age: 22-34 years old, mean 26.3 years old
• Degree: BS/BA (24), MS/MFA/MA (10), PhD (5)
• Major / Field of Study: Computer Science (12),
Psychology (3), Biology (3), Mathematics (2),
MBA (2), Nutrition (2), Agriculture (2),
Mechanical Engineering (2), Civil Engineering (1),
Political Science (1), Kinesiology (1), European
Union Studies (1), Chemistry (1), Supply Chain
Management (1), Accounting (1), Linguistics (1),
Marketing (1), Medicine (1), Education (1)
111
111. Participants are active users of various social
networking sites and regularly do people search
How often do you use
LinkedIn?
How often do you search
for people online?
112
112. Tasks are inspired by real people search needs
• Task 1: Find five people to ask for
professional/career advice
• Task 2: Find five potential keynote speakers
for a conference
• Task 3: Help a recruiter find and evaluate five
candidates to interview
• Task 4: Find five potential candidates to
collaborate with you on a project
113
114. Measurements collected from each participant
• Before (5-10 min)
– Pre-study survey (quantitative)
• During (4 x 10 min/task + 4 x 5 min/survey)
– User “think-aloud” comments (qualitative)
– Post-task system satisfaction surveys (Likert scale + semantic differentials) (quantitative)
– Post-task subjective relevance judgments for the top-5 retrieved results (quantitative)
– Search usage behavior logs (quantitative)
– Task completion times (quantitative)
• After (15-20 min)
– Post-study semi-structured interview (qualitative)
115
115. A pilot study with 12 participants and 4 systems. Each participant does one task using one system. The task/system order is randomized following a Graeco-Latin square experiment design.
116
117. Key findings from the pilot user study
• Users want to see more information about each result on the SERP (longer snippets)
• Users don't notice the extra scrolling cost
• Users tend to like non-redundant snippets more than query-biased ones, given a fixed snippet length
• Non-redundant snippets help users find generally more relevant results
System rank (lower is better):
– Query-biased, 2: 3.5 +/- 0.7
– Non-redundant, 2: 2.8 +/- 1.0
– Query-biased, 4: 2.2 +/- 0.8
– Non-redundant, 4: 1.5 +/- 0.8
118
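The contrast between the two snippet types can be sketched directly: a query-biased snippet echoes the attributes the query already constrained, while a non-redundant one shows only attributes the query did not fix. Attribute names and the selection policy are hypothetical simplifications:

```python
def query_biased_snippet(profile, query_filters, max_attrs=4):
    # echo the attributes the user searched for (redundant with the query)
    shown = {k: v for k, v in profile.items() if k in query_filters}
    return dict(list(shown.items())[:max_attrs])

def non_redundant_snippet(profile, query_filters, max_attrs=4):
    # show only attributes NOT constrained by the query (new information)
    shown = {k: v for k, v in profile.items() if k not in query_filters}
    return dict(list(shown.items())[:max_attrs])
```

For a query constraining location and degree, the non-redundant variant surfaces company, skills, etc. instead of repeating "Chicago, PhD" on every result.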
119. A formal study with 24 participants and 2 systems. Each participant does two tasks per system. The task/system order is randomized following a Graeco-Latin square experiment design.
120
120. Participants find the tasks realistic (4+/5 on a Likert scale), suggesting high ecological validity of the study
Q8: I can see myself doing this
task in the real world
121
121. Users feel that the version of the SERP with the
query-biased snippets is easier to use
Q1: The system is easy to use
RQ11
122
122. Users consider non-redundant snippets as
more useful based on the post-task surveys
Q5: The display of each profile
on the SERP is useful
Q7: The summaries/attributes presented for each result on the SERP are useful
RQ11
123
123. Non-redundant snippets help users find more
relevant people based on personal judgments
Q3: The system helps me find relevant candidates
RQ11
124
126. Users seem to be more effective when using the system with non-redundant snippets
Metric | System 1 (query-biased) | System 2 (non-redundant)
Ave. number of queries per search session | 5.91 +/- 5.60 | 4.96 +/- 3.90
Ave. number of SERP clicks (profile views) per search session | 18.51 +/- 5.80 | 16.00 +/- 4.60 (*)
Time between consecutive queries within a session (sec.) | 63.80 +/- 39.59 | 56.77 +/- 34.88
Ave. query length (filters used) | 3.03 +/- 0.97 | 3.14 +/- 0.95
Ave. SERP click position | 6.48 +/- 3.97 | 6.18 +/- 4.77
Ave. max SERP click position | 28.13 +/- 21.14 | 24.39 +/- 19.78
RQ10
127
127. For non-redundant snippets the first three results (one screen) get about the same percentage of clicks
RQ10
128
128. With non-redundant snippets users engage with the SERP more and make more informed decisions
Metric | System 1 (query-biased) | System 2 (non-redundant)
Time to the first SERP click (sec.) | 11.36 +/- 4.71 | 13.17 +/- 6.96
Ave. number of candidates added to favorites (SERP) | 1.00 +/- 2.13 | 1.44 +/- 2.77
Ave. number of candidates removed from favorites (SERP) | 0.43 +/- 1.11 | 0.65 +/- 1.40
Ave. number of candidates added to favorites (profile) | 4.85 +/- 2.11 | 4.48 +/- 2.03
Ave. number of candidates removed from favorites (profile) | 0.45 +/- 0.78 | 0.39 +/- 0.74
RQ10
129
129. Users prefer non-redundant snippets (19/27) over query-biased snippets (8/27). The result is statistically significant based on the exact two-tailed binomial test at alpha=0.05 (p=0.0357).
RQ11
130
130. Why do users prefer non-redundant snippets?
• Shows new non-redundant information (16 users)
• Helps discriminate the results on the SERP (12 users)
• Shows more relevant attributes (6 users)
• Helps accomplish the task faster (3 users)
• Requires less scrolling (3 users)
• Returns more relevant results (1 user)
RQ11
131
131. Why do users prefer non-redundant snippets?
“System 2 reduces repetition of information displayed on the screen. Therefore, it can show more results per screen and I have to do less scrolling.” [P7]
“It [System 2] is better since there is no extra line of information. I know they are all from Chicago and it is good that I don't have to see it here [on the SERP].” [P17]
“System 2 shows less information but is not losing any information since it also shows the search criteria compactly at the top.” [P22]
RQ11
132
132. Why do users prefer query-biased snippets?
• Has a more regular layout (7 users)
• Shows more relevant attributes (6 users)
• More predictable and reassuring (6 users)
• Demands less cognitive load and effort (5 users)
• Good balance of selected and new information (2 users)
• Works faster (2 users)
• Returns more relevant results (1 user)
• Forces users to check individual profiles (1 user)
RQ11
133
133. Why do users prefer query-biased snippets?
“To be honest, I get distracted easily. If you show so
much novel information, I don't know what to focus
on. System 1 [query-biased] is more moderate in
that sense. It either shows all info as in the query or
only a few attributes are new.” [P6]
“If I searched for a PhD, it will obviously have all
results matching this filter. From that point of view,
this information is redundant. But I still feel better
psychologically when I see what I searched for. That’s
why I prefer System 1 more.” [P13]
RQ11
134
134. Query formulation strategies
• Submit a specific search, then generalize (18 users)
– Users pay attention to the number of search results and make their queries
more specific if they see 100+ results.
• Submit a general search with only minimal requirements, then check
several result sets by playing with the optional attributes (6 users)
• Submit a general search, add many results to favorites, finally pick the
top-5 from favorites (4 users)
• Create a search “design space”, then methodically check all possible
combinations (6 users)
• Change the query upon seeing that the quality of results decreases with rank on the SERP (5 users)
• Users want to select several values per attribute (e.g. “Lyft or Uber”).
135
135. For non-redundant snippets we see more new queries, while query-biased snippets lead to more reformulations
136
136. Result examination strategies on the SERP
• Remember the query if it has 1-2 constraints, but find the
“breadcrumbs” useful for cases with 3+ query constraints
• Mostly top-to-bottom (position bias effect), with some exceptions
– “The order of examination is actually random, not top-to-bottom. Sometimes I just scroll and see if some word catches my eye and I click on it.”
• Look at the names
– More distinct than other elements on the SERP since they are in bold
– Sometimes help decide whether the candidate is appropriate
• Look at the first and the last lines of the structured snippet
• Align one result to the next one and see the difference in attribute
values shown
137
137. Result selection / bookmarking strategies
• Users tend to select results that have some familiar attribute values (e.g. big companies, well-known schools)
• Try to assess the social similarity based on the attributes on the SERP
(e.g. distinctive names, alumni from the same university, people from
the same age group, people with the same levels of qualification)
• Add to favorites (mostly) from the detailed page, since the SERP doesn't provide enough info to decide whether the candidate is good
• Some add to favorites from the SERP if:
– want to compare with the other results (can do it for System 2)
– there are too many results
– want to have candidates that have something remarkable to sell themselves
even from the SERP
• Skip “N/A”, “Self-employed”, “Intern” in the job title
• Want to specify attributes shown / CV layout depending on the task
138
138. Key take-aways and design implications
• Eliminate redundancy from the SERP
– Use non-redundant snippets (lead to effective and efficient search)
– Eliminate redundancy via interaction design (swipe instead of star
buttons, hovering navigation bar with the query breadcrumbs)
• Provide more control and transparency
– Show on the SERP more specific information to help assess the
relevance of results, e.g. “87% (Both you and John went to MIT)”
– Let users specify which attributes to show on the SERP
– Let users decide what kind of snippets to show or predict with ML
• Show more information per result on the SERP, yet show at
least several results to help users do result comparison
• Direct users to use better search strategies
– “Steer” users towards writing longer queries
– Explore new ways to encourage users to reformulate their queries
139
139. Scope: Structured Search in Professional Social Networks
User Understanding:
• Large-scale Query Log Analysis Study of People Search
• Mixed-method User Study of Job Attribute Importance
• Interactive Task-oriented User Study of Query-biased vs. Non-redundant Structured Snippets
Technique:
• Weakly-supervised Approach for Information Extraction from Job Postings
• Relevance-aware Structured Search Results Filtering
140.
141. Thank you for your time!
Reach out if you seriously want to collaborate.
Nik Spirin
Email: spirinus@gmail.com
Skype: @spirinus
Twitter: @spirinus
Facebook: @spirinus
Instagram: @_spirinus_