July 27, 2011 Bay Area Search Presentation
Brian Johnson, Engineering Director, Query Services @ eBay
Query expansion is an important part of search recall for all search engines. In this talk I'll discuss some of the general trends driving Hadoop adoption within the Search Query Services team at eBay, and the types of algorithms/techniques we've moved to Hadoop at eBay. Over time we've moved from smaller, editorial data sets to large machine-generated data sets mined from behavior log data, items/listings, catalogs, etc. One common workflow is to mine large candidate rewrite/expansion data sets from multiple data sources, use crowdsourced human judgment to classify a subset of the candidates (true positive, false positive), use machine learning techniques to discard false positives, run automated validation on the final data set, and automatically push to production.
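As a rough illustration of what a query expansion does at query time, here is a minimal Python sketch. The rewrite table, the `expand_query` function, and the OR-style output syntax are all assumptions made for this example; they are not eBay's data or query language.

```python
from typing import Dict, List

# Hypothetical rewrite table. In the workflow described above, entries like
# these would be mined, judged, filtered, and validated offline.
REWRITES: Dict[str, List[str]] = {
    "nib": ["new in box"],
    "tv": ["television"],
}


def expand_query(query: str, rewrites: Dict[str, List[str]] = REWRITES) -> str:
    """OR each token with its known expansions to widen recall."""
    parts = []
    for token in query.lower().split():
        alternatives = [token] + rewrites.get(token, [])
        if len(alternatives) == 1:
            # No known expansion: keep the token unchanged.
            parts.append(token)
        else:
            # Quote multi-word expansions so they are matched as phrases.
            quoted = [f'"{a}"' if " " in a else a for a in alternatives]
            parts.append("(" + " OR ".join(quoted) + ")")
    return " ".join(parts)


print(expand_query("Sony TV NIB"))
# sony (tv OR television) (nib OR "new in box")
```

Each wrong entry in such a table pulls irrelevant items into the results, which is why the false-positive filtering step matters before anything is pushed to production.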
Ravi Jammalamadaka, Senior Applied Researcher, Query Services @ eBay
Ravi is a real engineer, not a pointy-haired manager like the previous speaker. Expect some real engineering :-) He'll be doing a literature review for acronym mining and discussing a real-world implementation.
Title: Mining Acronyms From Raw Text
Abstract: A significant number of eBay products are known by their acronyms. The eBay query expansion service expands user queries with their acronym equivalents to increase recall. The challenge is to mine acronyms from either seller data (e.g., item descriptions, titles) or buyer data (e.g., queries).
Ravi will present state-of-the-art acronym mining algorithms from recent conferences and discuss their limitations. He will then present a new acronym mining algorithm that seeks to address those limitations, along with a machine learning classifier that seeks to remove the false positives the mining algorithm generates.
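To make the task concrete, here is a minimal Python sketch of one simple, widely used baseline: look for a parenthesized uppercase token and check whether the initials of the words just before it spell that token. This is not the algorithm presented in the talk; the function name `mine_acronyms`, the regular expression, and the sample titles are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: an acronym is 2-6 uppercase letters inside parentheses.
PAREN_ACRONYM = re.compile(r"\(([A-Z]{2,6})\)")


def mine_acronyms(texts: Iterable[str]) -> Counter:
    """Count (acronym, long form) pairs whose word initials match the acronym."""
    counts: Counter = Counter()
    for text in texts:
        for match in PAREN_ACRONYM.finditer(text):
            acronym = match.group(1)
            # Take as many preceding words as the acronym has letters.
            words = re.findall(r"[A-Za-z]+", text[:match.start()])
            candidate = words[-len(acronym):]
            if len(candidate) != len(acronym):
                continue  # not enough words before the parenthesis
            initials = "".join(word[0].upper() for word in candidate)
            if initials == acronym:
                counts[(acronym, " ".join(candidate).lower())] += 1
    return counts


titles = [
    "Brand New In Box (NIB) Sony camera",
    "Selling new in box (NIB) lens",
    "Great deal (ABC) here",
]
print(mine_acronyms(titles))
# Counter({('NIB', 'new in box'): 2})
```

A baseline like this only finds acronyms that are explicitly defined next to their long form, and it still produces false positives on noisy text, which is the kind of limitation a downstream classifier is meant to address.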
2011 Search Query Rewrites - Synonyms & Acronyms
2. Agenda
• 6:30 Eat & Greet - Free Food & Beer
• 7:00 Speaker #1 – Brian Johnson
• 7:45 Speaker #2 – Ravi Jammalamadaka
• Plan on 2 fabulous 45 minute presentations by excellent local search experts. Please suggest speakers or topics you would like to hear.
• Great speakers, good food, fine beer, and everyone's favorite search term - Free, Free, Free :-)
• Event will be held at the eBay campus just off 17/880 @ Hamilton in the main Community building. Look for lobby/flagpole.
• 4th Wednesday of every month
• http://www.meetup.com/Bay-Area-Search/
3. How Can I Help?
• Speakers
• Feedback
• Organizers
• Videographers
4. Brian Johnson
• Brian is the Director of Engineering for Query Services at eBay. He has held this role since January of 2011. Prior to that he managed the engineering teams for Query Understanding (metrics and crowdsourced human judgment), classification, data publishing, and browsing. Brian has been at eBay since 2002.
• Prior to eBay, Brian was at (http://www.linkedin.com/in/brianscottjohnson):
– Handspring - Managed the team working on email/IM/web browsing for one of the first smartphones (Treo)
– Excite@Home - Director of Engineering for the Excite homepage
– Synopsys - Engineer for chip design visualization
– AT&T Bell Labs - Data visualization research
• Brian received his PhD in Computer Science from the University of Maryland in 1993. His papers regarding visualizing hierarchical and categorical data with Treemaps have been cited hundreds of times.
• Brian is a pleasure to listen to and I'm sure you'll appreciate his insights from the trenches regarding search query rewrite research and practice at eBay.
5. Ravi Jammalamadaka
• Ravi works in the query services team at eBay looking at ways to rewrite user queries to improve both precision and recall.
• Received his PhD from University of California, Irvine.
– Research on Data Security, Databases
• Ravi published 10 research papers in the areas of databases, data security and data mining.
• Ravi was invited to be a Program committee member for IEEE ISI 2010, 2011 and ICDE 2010 (demo track).
6. Query Rewrites
Brian Johnson
Bay Area Search
July 27, 2011
8. What Is A Query?
• Queries are more than a text box
• Keywords=Red Size 7 Shoes
• Keywords=Red, Category=Shoes
• Keywords=Red, Category=Shoes, Size=7
• Many filter variables affect recall
• Query, category, attributes current context dimension targets
• Format, condition, location/distance, shipping, seller, price
9. Questions About Queries
• Popularity/Rank
• Supply
• Demand
• Click Through Rate (CTR)
• Conversion
• Rewrites/Expansions
• Related Searches with CTR & Conversion
• Category Supply/Demand/CTR/Sales
• Product Supply/Demand/CTR/Sales
• Top Products
• Items (recalled, view, bin, bid, offer, watch, ask, purchase)
• Autocompletes
• Classification (broad, narrow, ambiguous, help, navigational)
• Purchase Site
• Frequency by day, day of week, time of day
• Cross Border
• Sales
• Position distribution in user sessions
• Result set size
• Exit Rate
• Exit Destination
13. Example Query Services/Rewrites
• Related Search
canon sd1300is, canon sd1400 is, canon sd4000, canon sd1400is, canon sd, canon sd1300 is waterproof,
canon sd 1300, canon
• Stemming (ipod or ipods)
• Spelling (cannon or canon)
• Condition (new or condition=new)
• Synonyms (boat carpet or marine carpet)
• Space Synonyms (MarioKart > Mario Kart)
• Item Specifics (blue or color=blue)
• Acronyms (os = one size in CSA | Operating Systems in Electronics)
• Category (shoes or category=63850)
• Cross Border (site=0 and category =123) or (site=3 and category=456)
• Fitment (fits model=X)
• Term Removal (Harry Potter and the Order of the Phoenix (daily deal))
14. Context & Specificity
• Beyond decontextualized single entities
• Examples
– Stemming failures
○ (cowboy v cowboys) and (hat v hats)
○ Doesn’t work for cowboy hats & dallas cowboy caps/hats
– hp printer > (hp v “hewlett packard”) printer
– 15 hp pump > 15 (hp v horsepower) pump
– motor bike > motor (bike v cycle)
– audi b6 > (audi v make=audi) & (b6 v platform=b6) v (product=789)
– the who != who the
– Time
○ Today: latest generation > latest generation v (generation=4)
○ Tomorrow: latest generation > latest generation v (generation=5)
17. Better, Faster, Cheaper
Better
• Better recall
• Awesome related search suggestions
• Mind reading spell corrections
Faster
• <3 milliseconds per query
• 1.2 billion queries per day
• 1,000’s of queries per second on a single machine
Cheaper
• Hadoop offline
• Caching online
18. Metrics/Evaluation
• Revenue (A/B Test)
• Relevance (Recall, Precision, DCG, etc.)
• Result Count
• Result Set Overlap
• Click Through Rate
• Feedback (site links)
• Human Judgment
• Competitive/Benchmark data
• “Gold” test sets
19. Thinking about rewrites
• Query length
• Intent identification
• Autocomplete, autosuggest
• Summarization
• Inference (ex: movie 9)
• Stemming
• Synonyms
• Spell checking
• Stopwords, noise words
• Abbreviations, acronyms
• Units, brands, sizes, dimensions
• Language detection
• Concept vs instantiation (ex: car vs honda)
• Phrases
• Bracketing
• Normalization
• Key term extraction
• Term relaxation / constraining
• Session context
• Trend detection
• Online feedback
• Temporal queries, recency
• Buzz
21. Synonym Candidates
Synonyms derived from top changes in successive queries
– frame / frames
– lamp / lamps
– case / cases
– grill / grille
– shoe / shoes
Synonyms derived from top queries in item query clusters
– texas instruments ba ii plus / ti ba ii plus
– brighton handbag / brighton purse
– lenovo x200 / thinkpad x200
– king bedspread / king coverlet
– rockabilly dress / swing dress
– 1963 ford falcon / 63 falcon
– jessica simpson hair extensions / jessica simpson hairdo
Abbreviations/acronyms derived from query transitions
– stanford ky / stanford kentucky
– dc sub / dc subwoofer
– meridian ms / meridian mississippi
– front royal va / front royal virginia
– baseball pin / baseball pinback
– snowboard helmet l / snowboard helmet large
– motorcycle cam / motorcycle camera
– diamond amp / diamond amplifier
– active sub / active subwoofer
– shapleigh me / shapleigh maine
23. Spell Check – Offline
• Successive queries qi and qi’ are candidates for spell correction analysis if the edit distance is within 40% of the average query length.
– qi and qi’ may have tokens in common, called anchors.
– Use transitivity to remove intermediate queries.
• Create a bipartite graph for spell correction candidates.
• The same query can exist on the source and sink sides of the graph.
• Compute the input and output degrees of each sink node, indicating how information flows in and out of a query.
• A correct spelling candidate is a sink node with far more flow into it than out of it.
[Figure: bipartite graph linking queries q1 to q6 with queries q1’ and q2’]
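The offline steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `min_flow_ratio` cut-off of 3 and the sample transitions are hypothetical and do not come from the talk, and the anchor and transitivity steps are left out. Only the 40% edit-distance filter and the in-flow versus out-flow test come from the slide.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def is_spell_candidate(q1: str, q2: str, ratio: float = 0.4) -> bool:
    """Queries differ, but by at most `ratio` of their average length."""
    distance = edit_distance(q1, q2)
    return 0 < distance <= ratio * (len(q1) + len(q2)) / 2


def likely_correct_spellings(transitions, min_flow_ratio: float = 3.0) -> set:
    """Return sink queries whose in-flow dominates their out-flow.

    `transitions` is an iterable of (query, next_query) pairs taken from
    successive user queries. `min_flow_ratio` is an assumed cut-off.
    """
    inflow, outflow = Counter(), Counter()
    for source, sink in transitions:
        if is_spell_candidate(source, sink):
            outflow[source] += 1
            inflow[sink] += 1
    return {q for q in inflow
            if inflow[q] >= min_flow_ratio * max(outflow[q], 1)}


transitions = (
    [("jetsky", "jetski")] * 3
    + [("jet sky", "jetski"), ("jetski", "jetsky"), ("jetski", "ipod")]
)
print(likely_correct_spellings(transitions))
```

In the sample data "jetski" receives four qualifying transitions and emits one, so it is kept, while "jetsky" is not; the jump to "ipod" is discarded by the edit-distance filter.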
24. Spell Check – Online
[Flowchart; box labels: query; Tokenize to tokens; In the white list?; (wi-2, wi-1, wi); Search in dictionary; Found a match; N-Gram Index; Edit distance, phonetics; A list of candidates; Calculate contextual possibility; Obtain entropy; Obtain cosine similarity; Priority Queue; Last? (No, go to next / Yes, get the best); Result]
26. Acronyms
• Expand User Queries
– Increase recall without sacrificing precision
– Better deals for buyers
• Examples
– BAPE: 2,540 results
– OR(Bathing Ape, Bape): 2,987 results
Rescue Project 26
27. Mining Acronyms From Query Reformulations
• Learn from user behavioral data
• Example
– UCB Sweatshirt [CSA] > University of California Berkeley Sweatshirt [CSA]
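One way to picture mining from reformulations is the sketch below. It assumes a deliberately simple rule that is not claimed to be eBay's: strip the tokens two successive queries share at either end, then accept the pair when a single leftover token equals the initials of the leftover phrase. The function name `mine_acronym` and the stop word list are hypothetical.

```python
STOP_WORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to"})


def mine_acronym(query: str, next_query: str):
    """Return (acronym, expansion) if the reformulation looks like an expansion.

    Tokens shared at the start and end of both queries are treated as context
    and stripped; what remains must be one token on the left and a phrase on
    the right whose non-stop-word initials spell that token.
    """
    left, right = query.lower().split(), next_query.lower().split()
    while left and right and left[0] == right[0]:      # shared leading tokens
        left, right = left[1:], right[1:]
    while left and right and left[-1] == right[-1]:    # shared trailing tokens
        left, right = left[:-1], right[:-1]
    if len(left) != 1 or len(right) < 2:
        return None
    initials = "".join(word[0] for word in right if word not in STOP_WORDS)
    if initials == left[0]:
        return left[0], " ".join(right)
    return None


print(mine_acronym("UCB Sweatshirt",
                   "University of California Berkeley Sweatshirt"))
```

A rule this strict only finds pure initialisms; a pair such as "dc sub" and "dc subwoofer" is an abbreviation and would need a looser match.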
28. Acronym Context & Specificity
• Need to express context sensitive expansions
– Categorical
○ ATC > Armored Troop Carrier in Toys and Hobbies
○ ATC > Artist trading card in ART
○ ATC > Automatic Tool Change in Business and Industrial
– Directional
○ Old > Antique
○ Yoga towels/mats > Yogitoes
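A minimal sketch of how categorical and directional expansions could be stored and applied. The table layout, the `expand_query` name and the use of `None` for "any category" are assumptions made for illustration; only the example terms come from the slide. Direction falls out of the table itself: "old" has an entry and "antique" does not.

```python
# (term, category) -> alternative phrasings; category None means "any category".
EXPANSIONS = {
    ("atc", "Toys and Hobbies"): ["armored troop carrier"],
    ("atc", "Art"): ["artist trading card"],
    ("atc", "Business and Industrial"): ["automatic tool change"],
    ("old", None): ["antique"],  # directional: no reverse entry for "antique"
}


def expand_query(term: str, category=None) -> str:
    """Rewrite a term as OR(term, expansions...) for the given category."""
    key = term.lower()
    alternatives = (EXPANSIONS.get((key, category))
                    or EXPANSIONS.get((key, None), []))
    if not alternatives:
        return key
    return "OR(" + ", ".join([key] + alternatives) + ")"


print(expand_query("ATC", "Art"))   # category-specific expansion
print(expand_query("old"))          # expands to antique
print(expand_query("antique"))      # left alone
```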
29. Acronym/Abbreviation Mining and Category Based Expansions
Acronym/Abbreviation Mining
• Acronyms/abbreviations mined from raw text and query logs
• Look for patterns of text
– long form (short form)
– short form (long form)
• Employ intelligent matching algorithms to mine candidates
Example title: new cheap Playstation portable (PSP)
Acronym discovered: PSP -> PlayStation Portable
Candidates mined are fed through a machine learning classifier to remove the false positives
Category Based Expansions
hp
– Electronics: Hewlett Packard
– Cars and Trucks: horsepower
System allows
• Category based expansions
• Directional expansions
• Positive and negative expansions
32. Talk Overview
• Motivation
– Introduction of the Acronym mining problem.
• Related Work
– Algorithm overview.
• eBay Acronym Mining algorithm.
– Architecture.
– Algorithm overview.
• Results.
• Conclusions.
33. Motivation
• User queries are an incomplete representation of users' information needs
– Spelling mistakes
○ Jetsky instead of Jetski
– Synonyms are not considered
○ PS3 and PlayStation 3 (acronym, the topic of this talk)
○ JetSki and Personal Watercraft
– Users are not experts in search engine technology
○ Example: Anniversary gifts for men
eBay, Inc. 33
34. Need for Query Rewrites
JetSky 2 results
Spelling Correction
JetSki 23782 results
Synonym Expansion
OR( Jetski, Personal WaterCraft) 24151 results
36. Where can we find Acronyms?
From Item Titles/Descriptions
– Grand Theft Auto III (GTA 3) (PlayStation 2, 2001)
– Grand Theft Auto IV (GTA 4) PS3 mint condition
– Warhawk (No Headset) PlayStation 3 (PS3) BRAND NEW!
– COLD LASER. Low Level Laser Therapy(LLLT) + Acupuncture
From Query Reformulations, i.e. how users change their queries
– New Uke > New Ukulele
38. Schwartz et al: Greedy Match Algorithm
Warhawk (No Headset) PlayStation 3 (PS3) BRAND NEW!
39. Identifying Abbreviation Definitions in Biomedical Text.
• Mining for patterns
– long form (short form)
– short form (long form)
– The long form is no more than min(|A| + 5, |A| * 2) words, where |A| is the number of characters in the short form.
– Roche et al. propose that this number be less than |A| * 3.
• The characters in the short form should match the long form in the same order, and the first character in the short form should be at the beginning of a word.
• Example:
– PS3 -> PlayStation 3
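The matching rule above can be sketched as follows. This is a reading of the published Schwartz and Hearst procedure written for illustration, not code from the talk: it handles only the "long form (short form)" pattern, restricts short forms to a single token of 2 to 10 characters, and omits the paper's extra validity checks. The names `best_long_form` and `find_pairs` are chosen here.

```python
import re

SHORT_IN_PARENS = re.compile(r"\((\w{2,10})\)")  # assumed short-form shape


def best_long_form(short_form: str, candidate: str):
    """Match short-form characters right to left against the candidate text."""
    si, li = len(short_form) - 1, len(candidate) - 1
    while si >= 0:
        ch = short_form[si].lower()
        if ch.isalnum():
            # Walk left until the character matches; the first short-form
            # character must additionally sit at the start of a word.
            while (li >= 0 and candidate[li].lower() != ch) or (
                si == 0 and li > 0 and candidate[li - 1].isalnum()
            ):
                li -= 1
            if li < 0:
                return None
            li -= 1
        si -= 1
    # Extend left to the beginning of the word that holds the first match.
    return candidate[candidate.rfind(" ", 0, li + 1) + 1:]


def find_pairs(text: str):
    """Return (short form, long form) pairs for 'long form (SHORT)' patterns."""
    pairs = []
    for match in SHORT_IN_PARENS.finditer(text):
        short = match.group(1)
        max_words = min(len(short) + 5, len(short) * 2)
        words = text[:match.start()].split()[-max_words:]
        long_form = best_long_form(short, " ".join(words))
        if long_form:
            pairs.append((short, long_form))
    return pairs


print(find_pairs("Warhawk (No Headset) PlayStation 3 (PS3) BRAND NEW!"))
print(find_pairs("new American Automobile Association (AAA) map of mexico"))
```

The second call shows the truncated long form problem: the right-to-left scan is satisfied by two letters inside "Association" and stops at "Automobile Association".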
40. Schwartz et al
• Pros:
– Finds almost all abbreviations and acronyms
• Cons:
– High false positive rate.
○ Foot Massage Diabetes Treatment (FEET)
– Suffers from the truncated long form problem.
○ Example: American Automobile Association (AAA), where the right-to-left match can stop early at "Automobile Association"
41. Acronym-Expansion Recognition and Ranking on the Web
• First few characters match
• Ignore stop words
• Example:
– COOL -> Cooperation in Ontology and Linguistics.
Alpa Jain, Silviu Cucerzan, Saliha Azzam. Acronym-Expansion Recognition and Ranking on the Web.
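A hedged sketch of the "first few characters" rule: every non-stop word of the expansion must contribute a non-empty prefix, and the prefixes concatenated in order must spell the acronym. This is an interpretation of the two bullets above rather than the exact procedure from the Jain et al. paper; the stop word list and the `prefix_match` name are assumptions.

```python
STOP_WORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to"})


def prefix_match(acronym: str, expansion: str) -> bool:
    """True if the acronym is a concatenation of word prefixes of the expansion."""
    words = [w.lower() for w in expansion.split() if w.lower() not in STOP_WORDS]
    target = acronym.lower()

    def match(pos: int, idx: int) -> bool:
        if idx == len(words):
            return pos == len(target)
        word = words[idx]
        for k in range(1, len(word) + 1):
            if not target.startswith(word[:k], pos):
                break  # longer prefixes of this word cannot match either
            if match(pos + k, idx + 1):
                return True
        return False

    return bool(words) and match(0, 0)


print(prefix_match("COOL", "Cooperation in Ontology and Linguistics"))
print(prefix_match("PS3", "PlayStation 3"))
```

The second call returns False because the "S" of PS3 sits in the middle of "PlayStation", which is the kind of pair this approach misses.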
42. Jain et al
• Pros:
– Low false positive rate
• Cons:
– Does not do a good job at identifying abbreviations
– Misses out on a lot of actual acronyms
○ Will not find PlayStation 3 and PS3 association.
43. eBay Acronym Mining Architecture
[Architecture diagram; components: User Data, Candidate Generator, Feature Extractor, Classifier, Human Judgment, Dictionary, A/B Test, Live on Site]
44. eBay Acronym/Abbreviation Mining Algorithm
• Desirable Properties
– Find all abbreviations and acronyms, like the greedy match
– Reduce the number of false positives
– Solve the truncated long form problem.
• What makes a good acronym-expansion pair?
– Characters in the acronym are found at the beginning of the words.
– Expansions generally do not have words that are skipped or not represented in the acronym.
– Can a cost metric capture the intuition?
45. Cost Based Approach for Mining Abbreviations
CIM ------- Computer Interface Module
Total Cost: Low cost
PVC ------- PolyVinyl Chloride
Total Cost: Medium cost
HSF ------- Heat shock transcription factor
Total Cost: High cost
46. Cost Based Recursive Algorithm
Title: new American Automobile Association (AAA) map of mexico
Objective: Find the longest form with the lowest cost
American Automobile Association (AAA)
Min ( American Automobile Associ (AA) , American Automobile Associ (AAA) ) + Cost so far
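The slides do not give the cost function, so the sketch below invents one purely to make the idea concrete: a character matched at the start of a word is free, a character matched inside a word costs `INNER_COST`, and a word that contributes nothing costs `SKIP_COST`. Both constants, the default threshold and all function names are assumptions. The recursion takes the minimum over alternatives and adds it to the cost so far, and the search keeps the longest candidate long form among those with the lowest cost.

```python
import math
from functools import lru_cache

INNER_COST = 1  # assumed penalty: acronym character matched inside a word
SKIP_COST = 2   # assumed penalty: word that contributes no character


def _is_subsequence(chunk: str, word: str) -> bool:
    remaining = iter(word)
    return all(ch in remaining for ch in chunk)


def _chunk_cost(chunk: str, word: str) -> float:
    """Cost of matching every character of `chunk` inside `word`."""
    if chunk[0] == word[0] and _is_subsequence(chunk[1:], word[1:]):
        return (len(chunk) - 1) * INNER_COST
    if _is_subsequence(chunk, word):
        return len(chunk) * INNER_COST
    return math.inf


def alignment_cost(acronym: str, long_form: str) -> float:
    """Lowest total cost of explaining the acronym with the long form."""
    letters = acronym.lower()
    words = long_form.lower().split()

    @lru_cache(maxsize=None)
    def best(i: int, w: int) -> float:
        if w == len(words):
            return 0 if i == len(letters) else math.inf
        options = [math.inf]
        if w > 0:  # the first word of a long form may not be skipped
            options.append(SKIP_COST + best(i, w + 1))
        for j in range(i + 1, len(letters) + 1):
            options.append(_chunk_cost(letters[i:j], words[w]) + best(j, w + 1))
        return min(options)

    return best(0, 0) if words else math.inf


def best_long_form(acronym: str, preceding_text: str, threshold: float = 2):
    """Pick the lowest-cost long form; ties go to the longer candidate."""
    words = preceding_text.split()
    best = None
    for start in range(len(words) - 1, -1, -1):  # shortest suffix first
        candidate = " ".join(words[start:])
        cost = alignment_cost(acronym, candidate)
        if cost <= threshold and (best is None or cost <= best[1]):
            best = (candidate, cost)
    return best


print(best_long_form("AAA", "new American Automobile Association"))
```

Under these assumed weights CIM, PVC and HSF cost 0, 1 and 2, which follows the low, medium and high ordering on the previous slide, and the full "American Automobile Association" wins over the truncated form.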
47. Salient Properties of the new algorithm
• If Cost > Threshold, then the long form is a false positive.
• As cost increases
– False positives increase
– The chance that a real acronym is not identified decreases
• As cost decreases
– False positives decrease
– The chance that a real acronym is not identified increases.
• At lower costs, the algorithm behaves like the first few characters match.
• At high costs, the algorithm behaves like the greedy match algorithm.
48. Experiments
Sample Dataset: 2.5 million item titles
Algorithm                    Total Candidates   False Positive Rate   Yield
Greedy Match                 2548               39%                   1554
First Few Characters Match   759                4%                    728
Cost Based Match, k1         1223               14%                   1051
Cost Based Match, k2         1604               16%                   1284
Cost Based Match, k3         2023               20%                   1554
49. Removing false positives
• Goal
– Develop a classification algorithm that will classify whether a candidate is an acronym or not.
• Classification algorithm
– Decision trees
○ TreeNet data mining tool.
• Candidates are tagged with many features.
• Classifier learns on the tagged golden set.
• New candidates are then run through the classifier.
50. Example of a Decision Tree
Training Data
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund
– Yes: NO
– No: MarSt
○ Married: NO
○ Single, Divorced: TaxInc
   < 80K: NO
   > 80K: YES
Acknowledgements: George Kollios, gkollios@cs.bu.edu
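The example tree can be written out as plain conditionals and checked against the training rows. The function name and the tuple layout are chosen here for illustration, and the label for row 3 is read as "No", which is an assumption because that cell is damaged in the slide text. This only illustrates how a learned tree is applied; it is not the TreeNet model used for acronyms.

```python
def predict_cheat(refund: bool, marital_status: str, taxable_income_k: int) -> bool:
    """Follow the slide's tree: Refund, then MarSt, then TaxInc."""
    if refund:
        return False
    if marital_status == "Married":
        return False
    # Single or Divorced: split on taxable income at 80K.
    return taxable_income_k > 80


# (refund, marital status, taxable income in K, cheat) for Tid 1 to 10.
TRAINING_DATA = [
    (True, "Single", 125, False),
    (False, "Married", 100, False),
    (False, "Single", 70, False),   # label assumed, see note above
    (True, "Married", 120, False),
    (False, "Divorced", 95, True),
    (False, "Married", 60, False),
    (True, "Divorced", 220, False),
    (False, "Single", 85, True),
    (False, "Married", 75, False),
    (False, "Single", 90, True),
]

errors = [row for row in TRAINING_DATA if predict_cheat(*row[:3]) != row[3]]
print(len(errors), "training rows misclassified")
```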
51. Features: Neighborhood Similarity
• Rationale: Two synonym candidates A and B will tend to have similar neighbors (viz. keywords) surrounding them.
Neighborhood similarity = |Neighbours(A) ∩ Neighbours(B)| / Min(|Neighbours(A)|, |Neighbours(B)|)
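A small sketch of this feature, assuming that the neighbours of a term are the other words in the item titles where the term occurs. That definition of "neighbour", the helper names and the sample titles are assumptions made for the example.

```python
def neighbours(term: str, titles) -> set:
    """Collect words that co-occur with `term` in the given titles."""
    term_tokens = term.lower().split()
    n = len(term_tokens)
    found = set()
    for title in titles:
        tokens = title.lower().split()
        if any(tokens[i:i + n] == term_tokens for i in range(len(tokens) - n + 1)):
            found.update(t for t in tokens if t not in term_tokens)
    return found


def neighbourhood_similarity(a: set, b: set) -> float:
    """Overlap of the two neighbour sets, normalised by the smaller set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


titles = [
    "ps3 console 80gb",
    "playstation 3 console slim",
    "ps3 controller wireless",
    "playstation 3 controller",
]
score = neighbourhood_similarity(neighbours("ps3", titles),
                                 neighbours("playstation 3", titles))
print(round(score, 3))
```

Dividing by the smaller set keeps a rare term from being penalised just because the common term has many more neighbours.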
52. Features: Mutual Information
• Rationale: The goal of this metric is to determine whether the candidates co-occur in the description significantly more often than they would by random chance.
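The slide's formula did not survive, so the sketch below uses pointwise mutual information, a standard way to compare observed co-occurrence with what independence would predict. Treating this as the feature used in the talk is an assumption, as are the function name and the count-based inputs.

```python
import math


def pointwise_mutual_information(count_a: int, count_b: int,
                                 count_ab: int, total: int) -> float:
    """log2 of observed co-occurrence over the rate expected by chance.

    count_a, count_b: descriptions containing each candidate;
    count_ab: descriptions containing both; total: all descriptions.
    """
    if min(count_a, count_b, count_ab) == 0:
        return float("-inf")
    p_a, p_b, p_ab = count_a / total, count_b / total, count_ab / total
    return math.log2(p_ab / (p_a * p_b))


# Co-occurring 50 times where chance predicts 1 gives a clearly positive score.
print(pointwise_mutual_information(100, 100, 50, 10_000))
```

A score near zero means the pair co-occurs about as often as chance predicts, so only clearly positive values support the candidate.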
53. Features: KL divergence
• Rationale: Two synonym candidates will have similar category distributions of their inventory.
54. KL divergence: Example
Ipods: Electronics (50), Clothing Shoes and Accessories (1)
Ipod: Electronics (100), Clothing Shoes and Accessories (3)
KL divergence: 0.83
Ipod: Electronics (100), Clothing Shoes and Accessories (3)
T-shirt: Clothing Shoes and Accessories (1000), Uniforms (50)
KL divergence: 128592.74
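For reference, here is a minimal sketch of KL divergence over category counts. It uses the textbook definition on normalised distributions with additive smoothing so that a category missing on one side does not cause a division by zero. The smoothing constant is an assumption, and because the slides do not state the exact formula behind their figures, the values printed here are not expected to reproduce 0.83 or 128592.74; only the ordering (similar pair low, dissimilar pair high) is comparable.

```python
import math


def kl_divergence(counts_p: dict, counts_q: dict, smoothing: float = 0.5) -> float:
    """KL(P || Q) between two category-count distributions.

    `smoothing` is added to every category count on both sides
    (an assumed choice, not taken from the talk).
    """
    categories = set(counts_p) | set(counts_q)
    total_p = sum(counts_p.values()) + smoothing * len(categories)
    total_q = sum(counts_q.values()) + smoothing * len(categories)
    divergence = 0.0
    for category in categories:
        p = (counts_p.get(category, 0) + smoothing) / total_p
        q = (counts_q.get(category, 0) + smoothing) / total_q
        divergence += p * math.log(p / q)
    return divergence


CSA = "Clothing Shoes and Accessories"
ipods = {"Electronics": 50, CSA: 1}
ipod = {"Electronics": 100, CSA: 3}
tshirt = {CSA: 1000, "Uniforms": 50}

print(kl_divergence(ipods, ipod))   # near zero: similar distributions
print(kl_divergence(ipod, tshirt))  # much larger: different distributions
```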
56. Classifier Results
• False positive rate at the candidate generation stage is 20%.
• False positive rate after going through the classifier is 5.5%.
• The remaining false positives are removed by human judges.
57. Conclusions
• We presented the state of the art algorithms for acronym mining and their limitations.
• We presented a new cost based algorithm for mining acronyms from raw text that seeks to address the limitations of the previous algorithms.
• We presented a classifier approach to remove false positives.
• We experimentally validated our approach and showed that it is viable for mining acronyms.
58. References
• [1] Ariel S. Schwartz, Marti A. Hearst. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text.
• [2] Youngja Park, Roy J. Byrd. Hybrid Text Mining for Finding Abbreviations and Their Definitions.
• [3] Mathieu Roche, Violaine Prince. Managing the Acronym/Expansion Identification Process for Text-Mining Applications.
59. References(2)
• [4] Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee. Efficient Web-Based Linkage of Short to Long Forms.
• [5] Alpa Jain, Silviu Cucerzan, Saliha Azzam. Acronym-Expansion Recognition and Ranking on the Web.
• [6] Xiaonan Ji, Gu Xu, James Bailey, Hang Li. Mining, Ranking and Using Acronym Patterns.