These are the slides used in our 3-hour tutorial at VLDB 2014.
Yunyao Li, Ziyang Liu, Huaiyu Zhu: Enterprise Search in the Big Data Era: Recent Developments and Open Challenges. PVLDB 7(13): 1717-1718 (2014)
Abstract:
Enterprise search allows users in an enterprise to retrieve desired information through a simple search interface. It is widely viewed as an important productivity tool within an enterprise. While Internet search engines have been highly successful, enterprise search remains notoriously challenging due to a variety of unique challenges, and is being made more so by the increasing heterogeneity and volume of enterprise data. On the other hand, enterprise
search also presents opportunities to succeed in ways beyond current Internet search capabilities. This tutorial presents an organized overview of these challenges and opportunities, and reviews the state-of-the-art techniques for building a reliable and high quality enterprise search engine, in the context of the rise of big data.
Building Search Systems for the Enterprise (Yunyao Li)
This is a nice high-level summary of Gumshoe, the enterprise search engine built by our group, which currently powers IBM intranet search. One of the SIGIR 2011 Industrial Track keynote talks.
Invited Talk at Modern Data Management Systems Summit on August 29-30, 2014 at Tsinghua University in Beijing, China.
http://ise.thss.tsinghua.edu.cn/MDMS/English/program.jsp
Abstract:
Modern enterprises are increasingly relying on complex analyses on large data sets to drive business decisions. Tasks such as root cause analysis from system logs and lead generation based on social media, customer retention and digital marketing are rapidly gaining importance. These applications generally consist of three major analytic phases: text analytics, semi-structured data processing (joins, group-by, aggregation), and statistical/predictive modeling. The size of the datasets in conjunction with the complexity of the analysis necessitates large-scale distributed processing of the analytical algorithms. At IBM we are building tools and technologies based on declarative languages to support each of these analytic phases. The declarative nature of the language abstracts away the need for programmer-optimization. Furthermore, the syntax of these languages is designed to appeal to the corresponding communities. As an example for statistical modeling, we expose a high-level language with syntax similar to R -- a very popular statistical processing language.
In this talk I will give an overview of some real-world big data applications we are currently working on and use that to motivate the need for declarative analytics consisting of the three major phases discussed above. I will then describe, in some detail, declarative systems for text analytics along with a discussion on speeds, feeds and comparisons.
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products and services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding and share our work over the past few years in addressing them. We will also showcase how a universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g. compliance), and present ongoing efforts towards seamlessly scaling existing NLP capabilities across languages with minimal effort.
SystemT: Declarative Information Extraction (Yunyao Li)
Slides used for my talk "SystemT: Declarative Information Extraction" at the event "University of Oregon Big Opportunities with Big Data Meeting" on August 8, 2014 (http://bigdata.uoregon.edu).
Human-in-the-Loop AI for Building Knowledge Bases (Yunyao Li)
The ability to build large-scale domain-specific knowledge bases that capture and extend the implicit knowledge of human experts is the foundation for many AI systems. We use an ontology-driven approach for the creation, representation and consumption of such domain-specific knowledge bases. This approach relies on several well-known building blocks: natural language processing, entity resolution, and data transformation and fusion. I will present several human-in-the-loop tools that target domain experts (rather than programmers), extracting domain knowledge from the human expert and mapping it into the "right" models or algorithms. I will also share successful use cases in several domains, including compliance, finance, and healthcare: using these tools, we can match the level of accuracy achieved by manual efforts, but at a significantly lower cost and much higher scale and automation.
Video: https://www.youtube.com/watch?v=Rt2oHibJT4k
Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have recently addressed the "Velocity" problem. But the "Variety" problem remains largely unaddressed: a great deal of manual "data wrangling" is still needed to manage data models.
These manual processes do not scale well. Not only is the variety of data increasing; the rate of change in data definitions is increasing as well. We can't keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it.
This talk will present tools and a methodology to manage Big Data Models in a rapidly changing world. This talk covers:
Creating Semantic Metadata Models of Big Data Resources
Graphical UI Tools for Big Data Models
Tools to synchronize Big Data Models and Application Code
Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
Using Big Data Models with Machine Learning to generate Predictive Models
Developer Collaborative/Coordination processes using Big Data Models and Git
Managing change – Big Data Models with rapidly changing Data Resources
The most profitable insurance organizations will outperform competitors in key areas such as personalized customer service, claims processing, subrogation recovery, fraud detection and product innovation. This requires thinking beyond the traditional data warehouse to the data fabric, an emerging data management architecture.
In this webinar Andy Sohn, Senior Advisor at NewVantage Partners, and Bob Parker, Senior Director for Insurance at Cambridge Semantics, explore the role of the data discovery and integration layer in an enterprise data fabric for the Insurance industry. These are their slides.
Presented at SQL Saturday Atlanta May 18, 2013
Text mining is projected to dominate data mining, and the reasons are evident: we have more text available than numeric data. Microsoft introduced a new technology to SQL Server 2012 called Semantic Search. This session's detailed description and demos give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demos include a web-based Silverlight application, and content documents from Wikipedia. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft data mining.
Speaker: Philippe Mizrahi - Associate Product Manager - Lyft
Abstract: Philippe Mizrahi works on Lyft’s data discovery and metadata engine, Amundsen. With the help of a Neo4j graph database, Amundsen has improved Lyft’s data discovery by reducing time to discover data by 10x.
During this session, Philippe will dive deep into Amundsen’s use cases, impact, and architecture, which effectively combines a comprehensive knowledge graph based upon Neo4j, centralized metadata and other search ranking optimizations to discover data quickly.
A data science project that applies data mining techniques (n-grams, TF-IDF text analytics, sentiment detection), combined with R and ggplot2 for exploratory data analysis, to predict stock market trends from world news events sourced from Reddit /r/worldnews. Models were built with decision trees and SVMs (support vector machines) in KNIME. All experiments ran on public cloud infrastructure, using Hive queries to prefilter data with HDInsight on Azure.
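The n-gram and TF-IDF featurization this project mentions can be sketched in a few lines of plain Python (the actual work used KNIME, R, and Hive; the documents and term choices below are invented toy data, not from the project):

```python
# Minimal n-gram TF-IDF sketch: rare terms in a corpus get higher weights
# than common ones. Toy documents only; not the project's real pipeline.
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs, n=1):
    """Map each document to {ngram: tf-idf weight} over a small corpus."""
    token_docs = [ngrams(d.lower().split(), n) for d in docs]
    # Document frequency: in how many documents each n-gram appears.
    df = Counter(g for toks in token_docs for g in set(toks))
    total = len(docs)
    out = []
    for toks in token_docs:
        tf = Counter(toks)
        out.append({g: (c / len(toks)) * math.log(total / df[g])
                    for g, c in tf.items()})
    return out

docs = ["markets rally on earnings", "sanctions hit markets", "rate cut expected"]
weights = tfidf(docs, n=1)
# "markets" appears in 2 of 3 docs, so its idf (and weight) is lower
# than that of "rally", which appears in only one.
print(weights[0][("rally",)] > weights[0][("markets",)])
```

In the project itself these weights would be fed as features to the decision tree and SVM learners in KNIME.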
Improving Search in Workday Products using Natural Language Processing (DataWorks Summit)
Workday is a leading provider of cloud-based enterprise software products such as Human Capital Management, Talent, Finance, Student, and Planning. These products produce a wealth of natural language data. However, this data is unstructured and denormalized, and retrieving relevant information from it is a challenging task. Simple index-based search methods can only take us so far. The Data Science team at Workday is determined to apply machine learning and AI to make search better across Workday's products.
In this session, we present how we use word embeddings to normalize the data and add structure to it. We will also talk about using word representations to make search intelligent. The specific use cases we will discuss are synonym detection and entity recommendation.
In this talk, we will focus on the word-embedding techniques explored, the metrics used to evaluate natural language processing models, the tools built, and future work on improving search.
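The synonym-detection use case mentioned above typically works by comparing embedding vectors: words whose vectors point in nearly the same direction are synonym candidates. A toy sketch (the 3-dimensional vectors and words below are invented; the abstract does not name a specific embedding model):

```python
# Embedding-based synonym detection via cosine similarity.
# The vectors here are hand-made toys; real systems would use
# pretrained embeddings with hundreds of dimensions.
import math

embeddings = {
    "salary":       [0.90, 0.10, 0.00],
    "compensation": [0.85, 0.15, 0.05],  # deliberately close to "salary"
    "vacation":     [0.10, 0.90, 0.20],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def synonym_candidates(word, threshold=0.95):
    """Other words whose cosine similarity to `word` exceeds threshold."""
    v = embeddings[word]
    return [w for w, u in embeddings.items()
            if w != word and cosine(u, v) > threshold]

print(synonym_candidates("salary"))
```

A search engine can then expand a query for one term with its high-similarity neighbors, which is one way the "intelligent search" described above can be realized.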
Speaker
Namrata Ghadi, Workday Inc, Software Development Engineer (Data Science)
Adam Baker, Workday Inc, Sr Software Engineer
Deep learning for e-commerce: current status and future prospects (Rakuten Group, Inc.)
Deep learning is the prime avenue for Artificial Intelligence, with spectacular accomplishments in diverse fields such as computer vision, natural language processing, and board games such as Go. Its impact on e-commerce is already significant and will continue to grow in future years. In this talk, we will review some of the successful deep learning algorithms in light of their current and expected impact on e-commerce.
In this webinar, data analytics gurus Sathish Thyagarajan and Steve Sarsfield introduce AnzoGraph™, our graph OLAP database, demonstrate the different types of analyses you can perform with it and how it complements Neo4j, AWS Neptune and other OLTP systems. Finally, they’ll show how you can get it up and running on your laptop in about 5 minutes.
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int... (Cambridge Semantics)
Knowledge graphs are on the rise at businesses hungry for greater automation and intelligence with use cases spreading across industries, from fraud detection and chatbots, to risk analysis and recommendation engines. In this webinar we dive into key technical and business considerations, use cases and best practices in leveraging knowledge graphs for better knowledge management.
Triplestores and inference; applications in finance and text mining; projects and solutions for financial media and publishers.
Keystone Industrial Panel, ISWC 2014, Riva del Garda, 18 Oct 2014.
Thanks to Atanas Kiryakov for this presentation; I just cut it to size.
Many powerful machine learning algorithms are based on graphs, e.g., PageRank (Pregel), recommendation engines (collaborative filtering), text summarization, and other NLP tasks. The recent developments in Graph Neural Networks connect the worlds of graphs and machine learning even further.
Data pre-processing and feature engineering, both vital tasks in machine learning pipelines, extend this relationship across the entire ecosystem. In this session, we will investigate the entire range of graphs and machine learning with many practical exercises.
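PageRank, the first graph algorithm named above, can be sketched as a short power iteration (the toy 3-node graph below is invented for illustration; Pregel-style systems run the same update in a distributed fashion):

```python
# Minimal PageRank power iteration over a toy directed graph.
# `links[n]` lists the nodes that n links out to.
def pagerank(links, damping=0.85, iters=50):
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Rank flowing into n from every node m that links to n,
            # split evenly among m's outgoing links.
            incoming = sum(rank[m] / len(links[m])
                           for m in nodes if n in links[m])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
# "c" receives links from both "a" and "b", so it ends up ranked highest.
print(max(ranks, key=ranks.get))
```

The same "gather from neighbors, update, repeat" pattern underlies many of the graph ML methods the session covers, including the message passing used by Graph Neural Networks.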
Enterprise Search: How do we get there from here? (Daniel Tunkelang)
Enterprise Search: How Do We Get There From Here?
by Daniel Tunkelang (Head of Query Understanding, LinkedIn)
Keynote at 2013 Enterprise Search Summit
We've been tackling the challenges of enterprise and site search for at least 3 decades. We've succeeded to the point that search is the gateway to many of our information repositories. Nonetheless, users of enterprise search systems are frustrated with these systems' shortcomings. We see this frustration in surveys, but, more importantly, most of us experience it personally in our daily work life. We all dream of a world where searching any information repository is as effective as searching the web—perhaps even more so. A world where we find what we're looking for, or quickly determine that it doesn't exist. Is this Utopia possible? If so, how do we get there from here? Or at least somewhere close? In this talk, Tunkelang reviews the track record of enterprise search. He talks about what's worked and what hasn't, especially as compared to web search. Finally, he proposes some paths to bring us closer to our dream.
--
Daniel Tunkelang is Head of Query Understanding at LinkedIn. Educated at MIT and CMU, he has spent his career working on big data, addressing key challenges in search, data mining, user interfaces, and network analysis. He co-founded enterprise search and business intelligence pioneer Endeca, where he spent a decade as its Chief Scientist. In 2011, Endeca was acquired by Oracle for over $1B. Prior to LinkedIn, he led a team at Google working on local search quality. Daniel has authored fifteen patents, written a textbook on faceted search, and created the annual symposium on human-computer interaction and information retrieval.
Social Data Analytics using IBM Big Data Technologies (Nicolas Morales)
Distilling Insights from Social Media Using Big Data Technologies
Have you ever wondered what your customers are saying about you in Social media, and the impact it might be having on your business? This session will focus on how BigInsights and Big Data technologies can be used to glean useful and actionable insights from social media data.
You'll see how data can be ingested and prepped, and how text analytics can be applied to social data in real time. Using Hadoop, we'll show you how you can store and analyze your large volume of historical social media data and reference data. This talk and demo will provide an introduction to text analytics and how it is used within the IBM Big Data platform for a social media solution.
Results from the Enterprise Search and Findability Survey 2012 (Findwise)
A few preliminary results from the Enterprise Search and Findability Survey. The dataset for the survey is very large, and the analysis of the complete dataset will appear in the report to be published in June.
This presentation is a mash-up of the versions presented at the Enterprise Search Summit (New York, US, 15 May 2012), Enterprise Search Europe (London, 30 May 2012), and the IKS Semantic Enterprise Technologies Workshop (Salzburg, Austria, 12 June 2012). It was also presented at Findability Day 2012 in Stockholm.
Search for the enterprise seems to have hit a wall. Bad search is the top complaint of users interacting with their internal data. Meanwhile, there is a seemingly never-ending flood of products, SaaS offerings and new solutions in the market all claiming and attempting to solve the problem.
In this roundtable, we will define what expectations organizations should really have about their search platforms and discuss what benefits to expect from using techniques like boosting, auto-classification, natural language processing, query expansion, entity extraction and ontologies. We will also explore what will supersede search in the enterprise.
The Enterprise Knowledge Graph is a disruptive platform that combines emerging Big Data and graph technologies to reinvent knowledge management inside organizations. The platform aims to organize and distribute the organization's knowledge, making it centralized and universally accessible to every employee. The Enterprise Knowledge Graph is a central place to structure, simplify and connect the knowledge of an organization. By removing complexity, the knowledge graph brings more transparency, openness and simplicity into organizations. That democratizes communication and empowers individuals to share knowledge and to make decisions based on comprehensive knowledge. This platform can change the way we work, challenge the traditional hierarchical approach to getting work done, and help unleash human potential!
Join Concept Searching and partner C/D/H for this thought-provoking webinar on what intelligent enterprise search should be.
Our solution is unique in the marketplace, and overcomes the limitations of other enterprise search engines. It was originally deployed as an enterprise search solution for engineers and support staff.
This webinar will focus on how one unified view of all unstructured, semi-structured, and structured data assets, including 2D and 3D images, can be integrated into the search interface, with previewers and navigational aids.
Both business and technical professionals will benefit from this session:
• Understand how the technology works, and how it can be set up with a platform and search engine of choice
• See how search returns results, and provides visual and navigational aids for all information retrieved
• Watch how to select an image based on color, size, or shape
• Learn how any business or artificial intelligence applications can benefit from the multi-term metadata created
• Find out why the search framework provides a responsive user interface for any tablet, PC or mobile device
Talk given for the #phpbenelux user group, March 27th in Gent (BE), with the goal of convincing developers who are used to building PHP/MySQL apps to broaden their horizons when adding search to their site. Be sure to also have a look at the notes for the slides; they explain some of the screenshots, etc.
An accompanying blog post about this subject can be found at http://www.jurriaanpersyn.com/archives/2013/11/18/introduction-to-elasticsearch/
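For readers new to dedicated search engines, here is a minimal Python sketch (not from the talk) of the inverted index, the core data structure behind engines like Elasticsearch; the documents and query are illustrative only:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = {
    1: "MySQL is a relational database",
    2: "Elasticsearch is a search engine",
    3: "Add search to your MySQL app",
}
index = build_index(docs)
print(search(index, "mysql search"))  # → {3}
```

A real engine adds tokenization, stemming, and relevance scoring on top of this structure, but term-to-postings lookup is the reason full-text queries stay fast as data grows.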
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamPrecisely
Valuable mainframe data is often the missing piece in a holistic infrastructure view within Splunk. But if you're not a mainframe expert, knowing which data sources, fields and calculations are needed to get results within Splunk can be a challenge. Even those with mainframe knowledge can sometimes struggle.
With Syncsort Ironstream® you can easily capture the elements you need in real-time – and Ironstream's new Mainframe Data Model makes it easier than ever to work with complex mainframe metrics in Splunk.
View this webinar on-demand to learn more about this new feature, as well as how to:
• See categorized mainframe metrics in easily understood terms
• Get results faster – no need to research data sources, fields and calculations
• Broaden access to more team members – without the need for deep mainframe knowledge
• Use built-in Splunk tooling to get up and running quickly
• Realize valuable ROI sooner and eliminate the mainframe blind spot
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Zaloni
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma, thought leader and coauthor of Architecting Data Lakes, offers lessons learned from the field to get you started.
Solving the Disconnected Data Problem in Healthcare Using MongoDBMongoDB
The data diversity in healthcare and life sciences is exploding and the market is fundamentally changing as a result of healthcare reform. The result is more and more data, but it is compartmentalized and disconnected. At Zephyr Health, we have developed a data platform that is able to provide connectivity between thousands of healthcare data assets using an ontology-driven approach, storing data in MongoDB. This session will show how we break down this very challenging problem and how some of MongoDB's more recent features have been utilized to do so.
MongoDB Days Germany: Data Processing with MongoDBMongoDB
Presented by Marc Schwering, Senior Solutions Architect, MongoDB
Modern architectures are moving away from "one size fits all" solutions. The best tool needs to be put to the job, and given the many options available today, chances are that you'll end up using MongoDB for your operational workload alongside data processing systems like Apache Flink or Spark for your high-speed data processing needs. When modeling documents or data structures, several key aspects need attention: the distribution of data nodes, streaming capabilities, performance, aggregation and queryability options, and how the different data processing software can benefit from subtle but substantial model changes. This session covers how to enhance your architecture using data processing technologies such as Apache Flink and Spark. It takes the audience through the evolution of an app from simple to complex, along with its architectural requirements. We'll look into similarities and differences of the available technologies, and you will walk away with an understanding of how to use MongoDB to fulfill more advanced tasks such as personalization through clustering algorithms.
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...MongoDB
This session covers how to capture and analyze customer behavior to create more relevant contexts for customers. We will cover how to use your current BI features and, more importantly, how newer technologies approach the challenge. You will walk away with a good idea of how to build and drive even more contextually relevant experiences for customers, leading to even more successful engagements.
CREATE SEARCH DRIVEN BUSINESS INTELLIGENCE APPLICATION USING FAST SEARCH FO...Netwoven Inc.
Understand the Importance of Search Based Applications in today’s enterprise and how to integrate Business Intelligence and Search for business benefit.
Role of Microsoft FAST Search in an enterprise for building Search-based Business Intelligence Applications.
Demonstration of FAST search-based BI applications.
As data sets continue to grow, search remains a key technology for many applications. But what is the current state of the enterprise search market? Which providers are gaining market share, and what are the latest developments and innovations? Based on experience from dozens of recent search projects using a range of technologies, this presentation will summarize market conditions, discuss current best practices for creating great search systems, and suggest some future trends to watch out for.
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
Topics covered include:
- Scaling Vertically
- Hardware Considerations
- Index Optimization
- Schema Design
- Sharding
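As a rough illustration of how horizontal scaling distributes data, here is a hedged Python sketch of hashed shard-key routing in the spirit of MongoDB's hashed sharding; the hash function and shard count below are illustrative, not MongoDB's actual implementation:

```python
import hashlib

def shard_for(key, num_shards):
    """Deterministically route a shard-key value to one of num_shards
    shards by hashing it (illustrative; MongoDB uses its own hash)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# The same key always routes to the same shard, so reads and writes
# for a given document land on a single node, while a good hash
# spreads keys roughly evenly across shards.
keys = [f"user{i}" for i in range(1000)]
placement = {k: shard_for(k, 4) for k in keys}
print({s: sum(1 for v in placement.values() if v == s) for s in range(4)})
```

This is why shard-key choice matters: a key that hashes to only a few values concentrates load on a few shards instead of spreading it.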
The Role of Patterns in the Era of Large Language ModelsYunyao Li
Slides for my keynote at the PAN-DL Workshop (Pattern-based Approaches to NLP in the Age of Deep Learning) at EMNLP'2023 (December 6, 2023).
In this talk, I share our initial learnings from constructing, growing and serving large knowledge graphs
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopYunyao Li
Keynote talk at HILDA'2023 at SIGMOD on June 18, 2023.
Abstract: The ability to build large-scale knowledge bases that capture and extend the implicit knowledge of human experts is the foundation for many AI systems. We use an ontology-driven approach for the building, growing and serving of such knowledge bases. This approach relies on several well-known building blocks: document conversion, natural language processing, entity resolution, data transformation and fusion. In this talk, I will discuss a wide range of real-world challenges related to building these blocks and present our work to address these challenges via better human-machine cooperation.
Meaning Representations for Natural Languages: Design, Models and ApplicationsYunyao Li
EMNLP'2022 Tutorial "Meaning Representations for Natural Languages: Design, Models and Applications"
Instructors: Jeffrey Flanigan, Ishan Jindal, Yunyao Li, Tim O’Gorman, Martha Palmer
Abstract:
We propose a cutting-edge tutorial that reviews the design of common meaning representations, SoTA models for predicting meaning representations, and the applications of meaning representations in a wide range of downstream NLP tasks and real-world applications. Reporting by a diverse team of NLP researchers from academia and industry with extensive experience in designing, building and using meaning representations, our tutorial has three components: (1) an introduction to common meaning representations, including basic concepts and design challenges; (2) a review of SoTA methods on building models for meaning representations; and (3) an overview of applications of meaning representations in downstream NLP tasks and real-world applications. We will also present qualitative comparisons of common meaning representations and a quantitative study on how their differences impact model performance. Finally, we will share best practices in choosing the right meaning representation for downstream tasks.
Invited talk at Document Intelligence workshop at KDD'2021.
Harvesting information from complex documents such as financial reports and scientific publications is critical to building AI applications for business and research. Such documents are often in PDF format, with critical facts and data conveyed in tables and graphs; extracting this information is essential to deriving insights from these documents. In IBM Research, we have a rich agenda in this area that we call Deep Document Understanding. In this talk, I will focus on our research on Deep Table Understanding — extracting and understanding tables from PDF documents. I will introduce key challenges in table extraction and understanding and how we address such challenges, from how to acquire data at scale to enable deep neural network models to how to build, customize and evaluate such models. I will also describe how our work enables real-world use cases in domains such as finance and life science. Finally, I will briefly present TableQA, an important downstream task enabled by Deep Table Understanding.
Explainability for Natural Language ProcessingYunyao Li
Final deck for our popular tutorial on "Explainability for Natural Language Processing" at KDD'2021. See links below for downloadable version (with higher resolution) and recording of the live tutorial.
Title: Explainability for Natural Language Processing
Presenters: Marina Danilevsky, Shipi Dhanorkar, Yunyao Li, Lucian Popa, Kun Qian and Anbang Xu
Website: http://xainlp.github.io/
Recording: https://www.youtube.com/watch?v=PvKOSYGclPk&t=2s
Downloadable version with higher resolution: https://drive.google.com/file/d/1_gt_cS9nP9rcZOn4dcmxc2CErxrHW9CU/view?usp=sharing
@article{kdd2021xaitutorial,
title={Explainability for Natural Language Processing},
author={Danilevsky, Marina and Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Xu, Anbang},
journal={KDD},
year={2021}
}
Abstract:
This lecture-style tutorial, which mixes in an interactive literature browsing component, is intended for the many researchers and practitioners working with text data and on applications of natural language processing (NLP) in data science and knowledge discovery. The focus of the tutorial is on the issues of transparency and interpretability as they relate to building models for text and their applications to knowledge discovery. As black-box models have gained popularity for a broad range of tasks in recent years, both the research and industry communities have begun developing new techniques to render them more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP/knowledge management researchers, our tutorial has two components: an introduction to explainable AI (XAI) in the NLP domain and a review of the state-of-the-art research; and findings from a qualitative interview study of individuals working on real-world NLP projects as they are applied to various knowledge extraction and discovery tasks at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP and HCI conferences. The second component reports on our qualitative interview study, which identifies practical challenges and concerns that arise in real-world development projects that require the modeling and understanding of text data.
Explainability for Natural Language ProcessingYunyao Li
NOTE: Please check out the final version here with small but important updates and links to downloadable version and recording: https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249992241
Updated version on our popular tutorial on "Explainability for Natural Language Processing" as a tutorial at KDD'2021.
Title: Explainability for Natural Language Processing
@article{kdd2021xaitutorial,
title={Explainability for Natural Language Processing},
author={Danilevsky, Marina and Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Xu, Anbang},
journal={KDD},
year={2021}
}
Presenters: Marina Danilevsky, Shipi Dhanorkar, Yunyao Li, Lucian Popa, Kun Qian and Anbang Xu
Website: http://xainlp.github.io/
Abstract:
This lecture-style tutorial, which mixes in an interactive literature browsing component, is intended for the many researchers and practitioners working with text data and on applications of natural language processing (NLP) in data science and knowledge discovery. The focus of the tutorial is on the issues of transparency and interpretability as they relate to building models for text and their applications to knowledge discovery. As black-box models have gained popularity for a broad range of tasks in recent years, both the research and industry communities have begun developing new techniques to render them more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP/knowledge management researchers, our tutorial has two components: an introduction to explainable AI (XAI) in the NLP domain and a review of the state-of-the-art research; and findings from a qualitative interview study of individuals working on real-world NLP projects as they are applied to various knowledge extraction and discovery tasks at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP and HCI conferences. The second component reports on our qualitative interview study, which identifies practical challenges and concerns that arise in real-world development projects that require the modeling and understanding of text data.
Slides for talk given at Women in Engineering on March 20, 2021.
Abstract:
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products/services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding. We will share our work from the past few years in addressing these challenges. We will also showcase how universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g. compliance) and show ongoing efforts toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
Explainability for Natural Language ProcessingYunyao Li
Tutorial at AACL'2020 (http://www.aacl2020.org/program/tutorials/#t4-explainability-for-natural-language-processing).
More recent version: https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249912819
Title: Explainability for Natural Language Processing
@article{aacl2020xaitutorial,
title={Explainability for Natural Language Processing},
author= {Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Wolf, Christine T and Xu, Anbang},
journal={AACL-IJCNLP 2020},
year={2020}
}
Presenters: Shipi Dhanorkar, Christine Wolf, Kun Qian, Anbang Xu, Lucian Popa and Yunyao Li
Video: https://www.youtube.com/watch?v=3tnrGe_JA0s&feature=youtu.be
Abstract:
We propose a cutting-edge tutorial that investigates the issues of transparency and interpretability as they relate to NLP. Both the research community and industry have been developing new techniques to render black-box NLP models more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP researchers, our tutorial has two components: an introduction to explainable AI (XAI) and a review of the state-of-the-art for explainability research in NLP; and findings from a qualitative interview study of individuals working on real-world NLP projects at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP, and HCI conferences. The second component reports on our qualitative interview study which identifies practical challenges and concerns that arise in real-world development projects which include NLP.
Towards Universal Language Understanding (2020 version)Yunyao Li
Keynote talk given at the Pacific Asia Conference on Language, Information and Computation (PACLIC 34) on October 24, 2020.
Title: Towards Universal Natural Language Understanding
Abstract:
Understanding the semantics of natural language is a fundamental task in artificial intelligence. English semantic understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Compare and Comply. However, scaling existing products/services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding. We will share our work in addressing these challenges over the past few years to provide the same unified semantic representation across languages. We will also showcase how such universal semantic understanding of natural languages can enable cross-lingual information extraction in concrete domains (e.g. insurance and compliance) and show promise toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
Towards Universal Semantic Understanding of Natural LanguagesYunyao Li
Keynote talk at TextXD 2019(https://www.textxd.org)
Abstract:
Understanding the semantics of natural language is a fundamental task in artificial intelligence. English semantic understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Compare and Comply. However, scaling existing products/services to support additional languages remains an open challenge. In this demo, we will present Polyglot, a multilingual semantic parser capable of semantically parsing sentences in 9 different languages from 4 different language groups into the same unified semantic representation. We will also showcase how such universal semantic understanding of natural languages can enable cross-lingual information extraction in concrete domains (e.g. insurance and compliance) and show promise toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
An In-depth Analysis of the Effect of Text Normalization in Social MediaYunyao Li
Poster corresponding to our NAACL'2015 paper "An In-depth Analysis of the Effect of Text Normalization in Social Media"
Abstract: Recent years have seen increased interest in text normalization in social media, as the informal writing styles found in Twitter and other social media data often cause problems for NLP applications. Unfortunately, most current approaches narrowly regard the normalization task as a “one size fits all” task of replacing non-standard words with their standard counterparts. In this work we build a taxonomy of normalization edits and present a study of normalization to examine its effect on three different downstream applications (dependency parsing, named entity recognition, and text-to-speech synthesis). The results suggest that how the normalization task should be viewed is highly dependent on the targeted application. The results also show that normalization must be thought of as more than word replacement in order to produce results comparable to those seen on clean text.
Paper: https://www.aclweb.org/anthology/N15-1045
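The naive word-replacement view of normalization that the paper critiques can be sketched in a few lines of Python; the lookup table below is hypothetical, whereas real systems learn such mappings from data:

```python
# Hypothetical non-standard -> standard mapping; real normalizers
# learn these pairs from annotated social media corpora.
NORM_DICT = {"u": "you", "gr8": "great", "2morrow": "tomorrow"}

def normalize(text):
    """Naive word-replacement normalization: swap each non-standard
    token for its standard form if one is known, else keep it as-is."""
    return " ".join(NORM_DICT.get(t.lower(), t) for t in text.split())

print(normalize("u will do gr8 2morrow"))  # → "you will do great tomorrow"
```

The paper's finding is precisely that this token-for-token view is insufficient: edits such as insertions, deletions, and reorderings matter differently depending on whether the downstream task is parsing, NER, or text-to-speech.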
Exploiting Structure in Representation of Named Entities using Active LearningYunyao Li
Slides for our COLING'18 paper: http://aclweb.org/anthology/C18-1058
Fundamental to several knowledge-centric applications is the need to identify named entities from their textual mentions. However, entities lack a unique representation and their mentions can differ greatly. These variations arise in complex ways that cannot be captured using textual similarity metrics. However, entities have underlying structures, typically shared by entities of the same entity type, that can help reason over their name variations. Discovering, learning and manipulating these structures typically requires high manual effort in the form of large amounts of labeled training data and handwritten transformation programs. In this work, we propose an active-learning based framework that drastically reduces the labeled data required to learn the structures of entities. We show that programs for mapping entity mentions to their structures can be automatically generated using human-comprehensible labels. Our experiments show that our framework consistently outperforms both handwritten programs and supervised learning models. We also demonstrate the utility of our framework in relation extraction and entity resolution tasks.
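To make the notion of entity "structures" concrete, here is a small sketch with two hand-written regex transformation programs for a PERSON-like entity type; the paper's contribution is learning such programs from a few human-comprehensible labels rather than writing them by hand, and the patterns below are illustrative only:

```python
import re

# Two hand-written structure patterns for person-name mentions.
# The framework in the paper would induce programs like these
# automatically from a small amount of labeled data.
PATTERNS = [
    re.compile(r"^(?P<last>\w+),\s*(?P<first>\w+)$"),  # "Li, Yunyao"
    re.compile(r"^(?P<first>\w+)\s+(?P<last>\w+)$"),   # "Yunyao Li"
]

def to_structure(mention):
    """Map a name mention to a canonical (first, last) structure,
    or None if no known pattern matches."""
    for pattern in PATTERNS:
        m = pattern.match(mention)
        if m:
            return (m.group("first"), m.group("last"))
    return None

# Surface forms differ, but the underlying structure is the same:
print(to_structure("Li, Yunyao") == to_structure("Yunyao Li"))  # → True
```

Reasoning over such canonical structures is what lets two textually dissimilar mentions be recognized as the same entity, which plain string-similarity metrics miss.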
K-SRL: Instance-based Learning for Semantic Role LabelingYunyao Li
Slides for our COLING'16 paper http://aclweb.org/anthology/C/C16/C16-1058.pdf
Abstract:
Semantic role labeling (SRL) is the task of identifying and labeling predicate-argument structures in sentences with semantic frame and role labels. A known challenge in SRL is the large number of low-frequency exceptions in training data, which are highly context-specific and difficult to generalize. To overcome this challenge, we propose the use of instance-based learning that performs no explicit generalization, but rather extrapolates predictions from the most similar instances in the training data. We present a variant of k-nearest neighbors (kNN) classification with composite features to identify nearest neighbors for SRL. We show that high-quality predictions can be derived from a very small number of similar instances. In a comparative evaluation we experimentally demonstrate that our instance-based learning approach significantly outperforms current state-of-the-art systems on both in-domain and out-of-domain data, reaching F1-scores of 89.28% and 79.91%, respectively.
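The instance-based idea can be sketched as a toy kNN over composite features; the feature names, labels, and overlap-based similarity below are illustrative, not the paper's actual feature set or distance function:

```python
from collections import Counter

def knn_predict(train, features, k=3):
    """Instance-based prediction: score each training instance by its
    feature overlap with the query, then majority-vote over the top k.
    No explicit generalization; predictions come straight from data."""
    ranked = sorted(
        train,
        key=lambda inst: -len(set(inst["features"]) & set(features)),
    )
    votes = Counter(inst["label"] for inst in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical SRL training instances: composite features -> role label.
train = [
    {"features": ["lemma=give", "pos=NN", "position=after"], "label": "A2"},
    {"features": ["lemma=give", "pos=NN", "position=before"], "label": "A0"},
    {"features": ["lemma=send", "pos=NN", "position=after"], "label": "A2"},
]
print(knn_predict(train, ["lemma=give", "pos=NN", "position=after"], k=1))  # → "A2"
```

Because the prediction is read off the most similar stored instances, a single memorized low-frequency exception can be reproduced exactly, which is precisely what eager generalizers struggle with.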
Natural Language Data Management and Interfaces: Recent Development and Open ...Yunyao Li
Slides deck for SIGMOD 2017 Tutorial.
ABSTRACT:
The volume of natural language text data has been rapidly increasing over the past two decades, due to factors such as the growth of the Web, the low cost associated with publishing and the progress on the digitization of printed texts. This growth, combined with the proliferation of natural language systems for searching and retrieving information, provides tremendous opportunities for studying some of the areas where database systems and natural language processing systems overlap. This tutorial explores two of the areas of overlap most relevant to the database community: (1) managing natural language text data in a relational database, and (2) developing natural language interfaces to databases. The tutorial presents state-of-the-art methods, related systems, research opportunities and challenges covering both areas.
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsYunyao Li
Poster for our ACL paper "Polyglot: Multilingual Semantic Role Labeling with Unified Labels".
Abstract:
We present POLYGLOT, a semantic role labeling system capable of semantically parsing sentences in 9 different languages from 4 different language groups. A core differentiator is that this system predicts English Proposition Bank labels for all supported languages. This means that, for instance, a Japanese sentence will be tagged with the same labels as an English sentence with similar semantics would be. This is made possible by training the system with target-language data that was automatically labeled with English PropBank labels using an annotation projection approach. We give an overview of our system and the automatically produced training data, and discuss possible applications and limitations of this work. We present a demonstrator that accepts sentences in English, German, French, Spanish, Japanese, Chinese, Arabic, Russian and Hindi and outputs a visualization of its shallow semantics.
Tyler Baldwin, Yunyao Li, Bogdan Alexe, Ioana Roxana Stanoi: Automatic Term Ambiguity Detection. ACL (2) 2013: 804-809
Abstract:
While the resolution of term ambiguity is important for information extraction (IE) systems, the cost of resolving each instance of an entity can be prohibitively expensive on large datasets. To combat this, this work looks at ambiguity detection at the term, rather than the instance, level. By making a judgment about the general ambiguity of a term, a system is able to handle ambiguous and unambiguous cases differently, improving throughput and quality. To address the term ambiguity detection problem, we employ a model that combines data from language models, ontologies, and topic modeling. Results over a dataset of entities from four product domains show that the proposed approach achieves an F-measure of 0.96, significantly above baseline.
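For reference, the F-measure reported above combines precision and recall into a single score; here is a minimal sketch, where the counts are hypothetical and chosen only to illustrate how a 0.96 score arises:

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall computed from
    true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 96 correct ambiguity judgments,
# 4 false positives, 4 false negatives -> precision = recall = 0.96.
print(round(f_measure(96, 4, 4), 2))  # → 0.96
```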
Information Extraction --- A One-Hour SummaryYunyao Li
This is the deck that I made when taking CS767 at the Univ. of Michigan in 2006. While it is a few years old, it is still a useful deck for people who are new to information extraction.
Adaptive Parser-Centric Text NormalizationYunyao Li
Wonderful work done with Congle Zhang (my summer intern in 2012) and my IBM colleagues. Nominated for best paper award and presented at ACL 2013.
Adaptive Parser-Centric Text Normalization
Congle Zhang, Tyler Baldwin, Howard Ho, Benny Kimelfeld, Yunyao Li
Proceedings of ACL, pp. 1159--1168, 2013
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
1. Enterprise Search in the
Big Data Era
Yunyao Li
Ziyang Liu
Huaiyu Zhu
IBM Research - Almaden
NEC Labs
IBM Research - Almaden
2. 1
Enterprise Search
• Providing intuitive access to an organization’s
various digital content
1
Report → Findings
• IDC report [IDC 05]: $5k/person/year in salary wasted due to poor search; 9-10 hr/person/week spent searching; unsuccessful 1/3-1/2 of the time
• Butler Group [Edwards 06]: 10% of salary cost wasted through ineffective search
• Accenture survey [Accenture 07]: middle managers spend 2 hr/day searching; >50% of what they found had no value
• Hawking, Enterprise Search, http://david-hawking.net/pubs/ModernIR2_Hawking_chapter.pdf
• [IDC 05] "The enterprise workplace: How it will change the way we work". IDC Report 32919
• [Edwards 06] www.butlergroup.com/pdf/PressReleases/ESRReportPressRelease.pdf
• [Accenture 07] http://newsroom.accenture.com/article_display.cfm?article_id=4484
3. 2
Search from User's Point of View
[Figure: to the user, search is "magic" — a query box returns a ranked list of results 1, 2, 3, 4, …]
INTRODUCTION SEARCH
4. 3
What Happens Behind the Scenes
Backend
• Collect data
• Analyze data
• Index data
Frontend
• Serve user queries
• Return results
[Diagram: data source → backend → index → frontend]
INTRODUCTION SEARCH
5. 4
How Does a Query Match a Document?
[Diagram: at indexing time, each document is analyzed (Analyze document) and added to the index (Build index); at query time, the query is analyzed (Analyze query), the index is searched (Search index), and results are presented (Present results) as a ranked list Doc 1, Doc 2, …]
INTRODUCTION SEARCH
6. 5
Search Is More Than Keyword Match
• Specific features in documents are important
– Title, url, person name, product, actions, …
• Features combine to form higher level concepts
– In document: Home page + person → personal homepage
– Cross document: URL link analysis, …
• The string representation in document may not match that in
user query
– Person name: Bill Clinton ↔ William Jefferson Clinton
• User queries may be ambiguous
– Multiple interpretations
• Presenting the results to user
– Ranking, grouping, interactive refinement
INTRODUCTION SEARCH
7. 6
Internet vs Enterprise – Web data
[Fagin WWW2003]
Creation of content
• Internet: democratic; appealing to the reader; links ≈ approval
• Enterprise: bureaucratic; conform to mandate; links ≈ internal structure
Relevant query results
• Internet: large number; overlapping information; a reasonable subset suffices; ranking is more universal
• Enterprise: small number; specific function; specific pages required; ranking is relative to the query
Spamming
• Internet: spam-infested; ranking can only be based on external authority
• Enterprise: mostly spam-free; ranking based on content or metadata is reliable
Search engine friendliness
• Internet: web pages designed to be search results; web page = document
• Enterprise: documents not designed to be search results; special treatment needed
INTRODUCTION ENTERPRISE VS INTERNET
8. 7
Internet vs Enterprise – Big Data
Content being searched
• Internet: sources: Web crawl; formats: html, xml, pdf, …
• Enterprise: variety of sources; variety of formats: email, database, application-specific access and formats
Search queries / expected results
• Internet: target: web pages, office documents; expect a list of documents; expect little personalization; return results directly
• Enterprise: target: rows, figures, experts, …; expect customized results; personalization required: geography, access, …; customize results
Related information
• Internet: link ≈ approval; small number of domain-specific knowledge sources; generic analysis
• Enterprise: link ≈ organization structure; large number of dynamic domain-specific knowledge sources; highly specialized analysis
Skill set of search admins
• Internet: large number of admins; search experts; facilitate update of search algorithms
• Enterprise: small number of admins; domain experts; facilitate use of domain knowledge
INTRODUCTION ENTERPRISE VS INTERNET
9. 8
Search Engine Components
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
[Diagram: data source → backend → index → frontend, overseen by admin]
INTRODUCTION TUTORIAL OVERVIEW COMPONENTS
10. 9
Search Engine Architecture
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control
Frontend
Interpret user query
Search index
Present results
Interact with user
[Diagram: data source → backend → index → frontend]
11. 10
Main Backend Functions
Analysis (Understand)
Information extraction
Analyze and transform data
Indexing (Prepare for search)
Generate terms suitable for matching queries
Index search terms
Document Ingestion (Collect)
Collect all the data to be searched
Transform and store as documents
Local Analysis
(in-document analysis)
Global Analysis
(cross-document analysis)
13. 12
Typical analytics pipeline
[Diagram: DI → LA → GA → Idx; local analysis produces per-document feature sets S1={f11, f12, …}, S2={f21, f22, …}, S3={f31, f32, …}; global analysis produces cross-document features G1={g1, …}, G2={g2, g3, …}]
Data ingestion (DI)
• Collect data
• Transform to uniform document format
• Store in document store
Local analysis (LA)
• Information extraction from each document
Global analysis (GA)
• Cross-document analysis
• Rank, group, merge, and filter documents
Indexing (Idx)
• Generate search terms
• Index documents by search terms
BACKEND OVERVIEW
14. 13
Digression: Classical IR
[Diagram: the same DI → LA → GA → Idx pipeline]
Data ingestion (DI)
• Given set of files
Local analysis (LA)
• Tokenize
• Stop-word removal
• Stemming
• Form n-grams
Global analysis (GA)
• Calculate statistics of terms in documents
Indexing (Idx)
• Generate search terms
• Index by terms with statistics
BACKEND OVERVIEW
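The classical local-analysis chain just listed (tokenize → stop-word removal → stemming → n-grams) can be sketched in a few lines. The tiny stop-word list and the toy suffix stripper below are illustrative placeholders; a real system would use a full stop-word list and a proper stemmer (e.g., Porter's).

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "is", "to", "in"}  # toy list, not exhaustive

def stem(token):
    """Toy suffix stripper standing in for a real stemmer (e.g., Porter's)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def local_analysis(text, n=2):
    """Classical IR local analysis: tokenize, drop stop words, stem, form n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = [stem(t) for t in tokens if t not in STOP_WORDS]
    ngrams = [" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)]
    return terms + ngrams
```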
15. 14
Digression: Classical Web search
[Diagram: the same DI → LA → GA → Idx pipeline]
Data ingestion (DI)
• Crawl web pages
Local analysis (LA)
• Extract out-links
Global analysis (GA)
• Calculate eigenvalues of the link connection matrix
Indexing (Idx)
• Generate search terms
• Index documents by search terms, with PageRank
BACKEND OVERVIEW
16. 15
Demands of Enterprise Search
[Diagram: the same DI → LA → GA → Idx pipeline]
Data ingestion (DI)
• Handle variety of sources
• Handle variety of formats
• Deal with access policy
• Deal with update policy
Local analysis (LA)
• Incorporate domain knowledge
• Extract rich set of semantics
• Categorize documents
Global analysis (GA)
• Cross-document analysis
• Rank, group, merge, and filter documents
Indexing (Idx)
• Generate search terms
• Index documents by search terms
BACKEND OVERVIEW
17. 16
Desiderata of Backend
• Efficient incremental updates
– Fast turn-around time for updates
• System performance and reliability
– Scaling with data size and resources available
– Fault tolerance
• Ease of administration and quality improvement
– Allow search admins to customize domain-specific configurations
BACKEND OVERVIEW CHALLENGES / OPPORTUNITIES
19. 18
Data Ingestion
BACKEND DATA INGESTION
[Diagram: a variety of sources (Web, DB, App, pdf files) are crawled or pushed; each item is converted to text and then to documents — e.g., an email with a pdf attachment becomes two documents (Docid: 0001 with From/To/Date fields, Docid: 0002 with title/date fields) — which are stored in the document store, supporting update & retention policies]
20. 19
Document-centric View
• Data as a collection of documents
– Document as unit of storage and search result.
– Three major components
• Unique document identifier in the whole system
• Metadata fields: url, date, language, …
• Content field: text to be searched
• Representation of data of different structures
– Web pages → each page is a document
– Relational data → each row is a document
– Hierarchical data → each node is a document
BACKEND DATA INGESTION
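A minimal sketch of this document-centric view in Python; the `Document` class and the `from_row` helper are hypothetical names for illustration, not an API of any system discussed here.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Unit of storage and search result."""
    docid: str                                     # unique in the whole system
    metadata: dict = field(default_factory=dict)   # url, date, language, ...
    content: str = ""                              # text to be searched

def from_row(table, pk, row):
    """A relational row becomes one document: every column is kept as
    metadata, and string columns are concatenated into searchable content."""
    text = " ".join(v for v in row.values() if isinstance(v, str))
    return Document(docid=f"{table}:{pk}", metadata=dict(row), content=text)
```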
21. 20
Push vs Pull
Pull
• Definition: search engine initiates the transfer of data (e.g., a Web crawler)
• Advantages: operated by the search engine; uses standard crawlers
• Disadvantages: difficult to access special data sources; difficult to adjust domain-specific treatment
• Applicability: prevalent for the Internet; also useful for the enterprise
Push
• Definition: content owner initiates the transfer of data (e.g., apps with push notice)
• Advantages: can handle special access methods; easy to adjust refresh rate; easy to handle special formats
• Disadvantages: needs synchronization with the content owner
• Applicability: rare for the Internet; very important for the enterprise
BACKEND DATA INGESTION
22. 21
Transform the Data
• Format conversion
– Convert content to text: pdf, doc, …
• Keep as much structure as possible
• Metadata conversion
– Obtain and transform metadata: HTTP headers,
DB table metadata, …
• Merge/split documents
– One-to-many: zip file, email thread, attachments
– Many-to-one: social tags merged into the original doc
BACKEND DATA INGESTION
23. 22
Storage options
SQL database
• Pro: traditional RDBMS strengths; supports insert, update, delete, fielded query
• Con: too much system overhead
Indexing engine (Lucene)
• Pro: closer to the document-centric view; supports insert, delete, fielded query
• Con: no direct in-document update; needs special treatment for distributed processing
NoSQL databases
• Pro: lightweight; sufficient for simple use
• Con: may lack features needed in the future; transactions?
Issues to consider
• In-document update
• Access/retention policy
• Parallel processing
BACKEND DATA INGESTION
25. 24
Local Analysis
• Annotating pages
– Extract structured elements: title, header, …
– Extract features for people, projects,
communities, …
– Extract features for cross-document analysis.
• Categorizing pages
– Label by standard categories
• Language, geography, date, …
– Label pages by custom categories
• IBM examples: HR, person, IT help, ISSI, sales information,
marketing, corporate standards, legal & IP-law, …
Local analysis is essentially information extraction
BACKEND LOCAL ANALYSIS
26. 25
Rule-based vs. Learning-based IE
Rule-based IE
• Pro: declarative; easy to comprehend; easy to maintain; easy to incorporate domain knowledge; easy to debug
• Con: heuristic; requires tedious manual labor
ML-based IE
• Pro: trainable; adaptable; reduces manual effort
• Con: requires labeled data; requires retraining for domain adaptation; requires ML expertise to use or maintain; opaque (not transparent)
BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
27. 26
Landscape of Entity Extraction Implementations
NLP papers (2003-2012): 3.5% rule-based, 21% hybrid, 75% machine-learning-based
Commercial vendors (2013), all vendors: 45% rule-based, 22% hybrid, 33% machine-learning-based
Commercial vendors (2013), large vendors: 67% rule-based, 17% hybrid, 17% machine-learning-based
Example industrial systems:
• GATE Information Extraction
• IBM InfoSphere BigInsights
• Microsoft FAST
• SAP HANA
• SAS Text Analytics
• HP Autonomy
• Attensity
• Clarabridge
Source: [CLR2013] Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!, EMNLP 2013
BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
28. 27
Local Analysis for Different Features [Zhu et al., WWW'07]
[Diagram: an intranet page flows through several extractors]
• NavPanel extraction: self-link identification → NavPanels
• Title extraction: matching title patterns → titles (e.g., "IBM Global Services Security Home" → "IBM Global Services Security")
• Person name in title: title extraction + dictionary match against the person-name dictionary (= employee directory) → title name (e.g., "G J Chaitin Home Page" → "G J Chaitin")
• URL extraction: matching URL patterns → URL names (e.g., http://w3-03.ibm.com/marketing/ → "marketing"; http://w3-03.ibm.com/isc/index.html → "isc"; http://chis.at.ibm.com/ → "chis")
BACKEND LOCAL ANALYSIS EXAMPLES
29. 28
Consolidation
– Example: Document language consolidation
• HTTP header: Accept-Language: en-us,en;q=0.5
• Meta tags: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
• Document text encoding
• URL: http://enterprise.com/hr/benefits/us/ca/
BACKEND LOCAL ANALYSIS TRANSFORMATIONS
31. 30
Global Analysis
• Deduplication
– Save resources, reduce result clutter
• Identify root of URL hierarchy
– Used for result grouping and ranking
• Anchor text analysis
– Assign external labels to documents
• Social tagging analysis
– Assign tags and their weights to documents
• Identify different versions of the same document
– Due to variations in date, language, …
• Enterprise-specific global analysis
– When certain documents co-exist, do this …
• …
BACKEND GLOBAL ANALYSIS
32. 31
Shingle-based Deduplication
(Leskovec, http://www.mmds.org/)
[Diagram: documents → shingle sets S1={s1, s2, …}, S2={s1, s3, …}, S3={s2, s3, …} → minhash signatures {h1(S1), h2(S1), …}, {h1(S2), h2(S2), …}, {h1(S3), h2(S3), …}]
Shingles:
• Character or token n-grams
• Possibly stemmed
• Possibly treated specially around stop words
Minhash:
• Maps sets to integers
• Based on a permutation of the universal set
Jaccard similarity: |A∩B| / |A∪B|
Theorem: the probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.
Works for a more diverse set of documents. More precise.
BACKEND GLOBAL ANALYSIS DEDUPLICATION
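A minimal sketch of shingling and minhash signatures, assuming the usual `(a*x + b) mod p` family of hash functions to simulate random permutations; parameter choices are illustrative.

```python
import random

def shingles(text, n=3):
    """Token n-grams of a document (character shingles are also common)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A∩B| / |A∪B|."""
    return len(a & b) / len(a | b)

def minhash_signature(shingle_set, num_hashes=128, seed=42):
    """One min value per simulated permutation h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    ids = [hash(s) % p for s in shingle_set]
    return [min((a * x + b) % p for x in ids) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of agreeing minhashes ≈ Jaccard similarity (the theorem above)."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

Near-duplicate candidates are then pairs whose estimated similarity exceeds a threshold; real systems add locality-sensitive hashing on top to avoid comparing all pairs.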
33. 32
Metadata-based Deduplication
(IBM Gumshoe search engine)
[Diagram: documents → metadata signatures S1=[h11, h12, …], S2=[h21, h22, …], S3=[h31, h32, …] → candidate groups G1={S1, …}, G2={S2, S3, …}]
Significant metadata:
• Document title
• Section headers
• Signatures from URL
Ensure that all similar candidates have the same signature.
Group by signature, then run in-group similarity analysis:
• Analyze documents within candidate groups in detail
More customizable for intranet. Less cost.
BACKEND GLOBAL ANALYSIS DEDUPLICATION
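The group-by-signature step can be sketched as below; the metadata field names (`title`, `url_path`) are hypothetical.

```python
from collections import defaultdict

def signature(doc):
    """Coarse signature from significant metadata (title + URL path here);
    all similar candidates must map to the same signature."""
    return (doc.get("title", "").strip().lower(), doc.get("url_path", ""))

def candidate_groups(docs):
    """Group documents by signature; only multi-member groups need the
    (more expensive) in-group similarity analysis."""
    groups = defaultdict(list)
    for d in docs:
        groups[signature(d)].append(d)
    return [g for g in groups.values() if len(g) > 1]
```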
34. 33
URL Root Analysis (Zhu et al., WWW’07)
[Diagram: a URL forest; host1/b/a and host1/b/c are roots, with descendants such as host1/b/a/~user1/, host1/b/a/~user1/pub, host1/b/a/x_index.htm, host1/b/c/d, host1/b/c/home.html, and host1/b/c/d/e/index.html (with ?a=us and ?a=uk variants)]
• Given a set of documents all with the same value V of feature X.
– E.g., at one time all webpages from the IBM Tucson site had the same title.
• Find the roots of the URL forest. These are the preferred results for query X=V.
– E.g., when searching for "Tucson home page", only the IBM Tucson homepage will match.
BACKEND GLOBAL ANALYSIS ROOT ANALYSIS
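Finding the roots of the URL forest reduces to keeping the URLs whose path is not an extension of any other URL in the set; a minimal sketch (query strings and other normalization details omitted).

```python
def path_key(url):
    """Split a host/path URL into (host, segment, segment, ...)."""
    return tuple(s for s in url.split("/") if s)

def url_roots(urls):
    """URLs in the set that are not descendants of any other URL in the set."""
    keys = {u: path_key(u) for u in urls}
    return sorted(u for u, k in keys.items()
                  if not any(k2 != k and k[:len(k2)] == k2
                             for k2 in keys.values()))
```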
35. 34
Label Assignment (Zhu et al., WWW’07)
BACKEND GLOBAL ANALYSIS LABEL ASSIGNMENT
[Diagram: documents A1 and A2 contain the anchor text "X home" linking to document B; bookmarks C1 ("X home"), C2 ("X"), and C3 ("Y home") point to document A2]
Anchor text global analysis:
• Assign label "X" and/or "Y" based on frequency
Social tagging global analysis:
• Assign labels "X home", "X", and "Y home" based on frequency
36. 35
Entity Integration using HIL
[Diagram: various data sources → information extraction (declarative IE, IBM SystemT [Chiticariu et al., ACL 2010]) → raw records → entity resolution → map, fuse, aggregate → unified entities]
Entity population rules:
• Create entities (from raw records, other entities, and links)
• Clean, normalize, aggregate, fuse
Entity resolution rules:
• Create links between raw records or entities
HIL [Hernández et al., EDBT'13] defines entity types (the logical data model of the integration flow) and (SQL-like) rules to specify the integration logic.
Optimizing compiler to a Big Data runtime (Jaql and Hadoop).
BACKEND GLOBAL ANALYSIS ENTITY INTEGRATION
38. 37
Indexing
• Generate and index search terms, to be
matched by terms generated at runtime from
user queries.
• Challenges:
– Extracted terms do not match user query terms
• Morphological changes, synonyms, …
– Importance of a term depends on the query
• Need for bucketing of indexes, …
– Support of incremental indexing
BACKEND INDEXING
39. 38
Term normalization
• Example: Date time normalization
– Given any of these
Wed Aug 27 10:06:11 PDT 2014
27 Aug 2014, 10:06:11
2014-08-27T10:06:11-07:00
27 Aug 2014
1409133971
– Normalize to 2014-08-27T10:06:11-07:00
– Other examples: Person names, product names,
…
BACKEND INDEXING TERM NORMALIZATION
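A sketch of date-time normalization by trying a list of candidate formats; the format list is illustrative, and a production system would need many more patterns (plus a lookup table for timezone abbreviations like PDT, which `strptime` cannot parse portably).

```python
from datetime import datetime, timezone

# Candidate input formats, tried in order (illustrative, not exhaustive).
FORMATS = [
    "%d %b %Y, %H:%M:%S",    # 27 Aug 2014, 10:06:11
    "%Y-%m-%dT%H:%M:%S%z",   # 2014-08-27T10:06:11-07:00
    "%d %b %Y",              # 27 Aug 2014
]

def normalize_datetime(s):
    """Normalize a date/time string to ISO 8601; None if unparseable."""
    s = s.strip()
    if s.isdigit():  # Unix epoch seconds
        return datetime.fromtimestamp(int(s), tz=timezone.utc).isoformat()
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt).isoformat()
        except ValueError:
            pass
    return None
```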
40. 39
Why Generate Variant Terms?
• Extracted feature string ≠ query string
– People names
• Document: "John Doe"; Search: "Doe, John" or "J Doe"
– Acronym expansions
• "gts" → "Global Technology Services"
– N-gram variant generation
• Title: "reimbursement of travel expenses"
• Terms: reimbursement, travel expenses, reimbursement travel, reimbursement of travel, reimbursement expenses
• Normalization alone is not a sufficient solution
– People names
• Document: "John Doe" → "J. Doe"; Search: "Jean Doe" → "J. Doe"
• These are not supposed to match
• Solution:
– Generate variant terms with different levels of approximation.
BACKEND INDEXING VARIANT TERM GENERATION
41. 40
Configurable Term Generation
• Configuration knobs determine the set of outputs
• Given "Mr. John (Jack) M. Doe Jr."
– Configuration 1:
Initial=both, Dot=with, NickName=both, MiddleName=both, NameSuffix=without, Title=without, Comma=both
Generates: John M. Doe / Doe, John M.; John Doe / Doe, John; J. M. Doe / Doe, J. M.; J. Doe / Doe, J.; Jack M. Doe / Doe, Jack M.; Jack Doe / Doe, Jack
– Configuration 2 (normalization):
Initial=without, Dot=without, NickName=without, MiddleName=without, NameSuffix=without, Title=without, Comma=without
Generates: John Doe
BACKEND INDEXING VARIANT TERM GENERATION
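A simplified sketch of configurable name-variant generation; the knobs below (initial, middle initial, nickname, comma form) are only a subset of those on the slide, and the function name is hypothetical.

```python
from itertools import product

def name_variants(first, middle, nick, last):
    """Generate person-name variants at several approximation levels by
    toggling: first name vs. nickname, spelled-out vs. initial, with or
    without the middle initial, and 'Last, Given' vs. 'Given Last'."""
    variants = set()
    for f, ini, mid, com in product((first, nick),
                                    ("with", "without"),   # first-name initial
                                    ("with", "without"),   # middle initial
                                    ("with", "without")):  # comma form
        given = f[0] + "." if ini == "with" else f
        if mid == "with":
            given += " " + middle[0] + "."
        variants.add(f"{last}, {given}" if com == "with" else f"{given} {last}")
    return variants
```

With normalization-style settings (everything "without"), only "John Doe" remains; the looser settings add "Doe, John M.", "J. Doe", "Jack Doe", and so on.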
43. 42
Search Engine Architecture
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control
Frontend
Interpret user query
Search index
Present results
Interact with user
[Diagram: data source → backend → index → frontend]
44. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
45. 44
1. Ambiguity
• Optimal keywords may not be used.
– Misspelled
• “datbase”
– Under-specified
• polysemy: “java”
• too general: “database papers”
– Over-specified:
• synonyms, acronyms, abbreviations &
alternative names: “green card” ≡
“permanent residency”
• too specific: “MS Office 2007 for Mac x64
edition”
– Non-quantitative:
• “small laptop”
[Each problem maps to a solution: misspelled → query cleaning, query autocompletion; under-specified → query refinement; over-specified, non-quantitative → query rewriting]
46. 45
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
47. 46
Graph-based Spelling Correction
(Bao et al., ACL'11)
• Repartition the query.
– Each partition (token) should be plausible: confidence(correction) > threshold.
– Confidence: a linear combination of multiple scores, with parameters learned by an SVM.
• Domain knowledge is often used in calculating confidence.
• For each partition, generate candidate corrections with high scores.
Example: "enterpricsea rch" can be repartitioned as "enterpricse arch", "enterpric search", "enter pric search", etc.; for the partition "pric", candidate corrections include price: 0.8, prim: 0.6, etc.
FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
48. 47
Graph-based Spelling Correction
(Bao et al., ACL'11)
• Build a graph that connects candidate corrections.
• Each full path is a candidate query.
– Find the k top-weighted full paths.
Weights:
1. correction score (node weight)
2. merge penalty (node weight)
3. split penalty (edge weight)
[Diagram: for "enterpricsea rch", candidate nodes include enterprise, enter, price, prim, pric (price: 0.8, prim: 0.6, etc.), sea, arc, rich, search; example full paths: enterprise → search; enter → price → sea → rich]
FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
49. 48
Graph-based Spelling Correction
(Bao et al., ACL'11)
• Path weight doesn't consider term correlations.
• Calculate a score for each path.
– The score includes term correlations, e.g., correlation("enterprise search") > correlation("enterprise arc").
• This ensures the cleaned query has good-quality results.
• Correlations are computed based on the number of co-occurrences.
• Finally, return the paths with high scores.
FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
50. 49
XClean (Lu et al., ICDE'11)
– Based on the noisy channel model that finds the intended word given the user's input word.
– Results on XML are subtrees rooted at entity nodes.
• A result quality score is calculated for each entity node in T, and then aggregated.
• E.g., if Johnny and Mike work in the same department, then "Johnn, Mike" → "Johnny, Mike" rather than "John, Mike".
– Processes each word individually, i.e., no merge or split.
[Diagram: an XML subtree rooted at department, with children head → Johnny and employees → …]
Related: query cleaning on relational data (Pu, VLDB'08)
FRONTEND AMBIGUITY QUERY CLEANING STRUCTURED DATA
51. 50
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
52. 51
Query Autocompletion
Problem Space Dimensions
showing keywords
vs.
showing results
single keyword
vs.
multiple keyword
exact matching
vs.
fuzzy matching
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
53. 52
Error-Tolerating Autocompletion
(Chaudhuri et al., SIGMOD'09)
[Same problem-space dimensions: showing keywords vs. showing results; single vs. multiple keywords; exact vs. fuzzy matching]
Example: "desr" → desert, dessert, deserve
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
54. 53
Error-Tolerating Autocompletion
(Chaudhuri et al., SIGMOD'09)
Data contains "search", "sand" and "text"; max. edit distance = 1.
[Diagram: a trie over the data strings; as the user types s, se, sen, the set of trie nodes within the edit-distance bound of the input is maintained]
Showing results instead of keywords can be achieved by associating inverted lists with trie nodes.
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
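Edit-distance-tolerant prefix matching over a trie can be sketched by maintaining a dynamic-programming row of edit distances per trie node; this is the standard trie/DP technique, not necessarily the paper's exact algorithm.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set at the node ending a word

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def fuzzy_complete(root, typed, max_edits=1):
    """Words having a prefix within `max_edits` edits of the typed string."""
    results = set()

    def collect(node):
        if node.word:
            results.add(node.word)
        for child in node.children.values():
            collect(child)

    def dfs(node, row):  # row = edit distances of this trie prefix vs `typed`
        if row[-1] <= max_edits:   # this trie prefix matches the input
            collect(node)
        if min(row) > max_edits:   # prune: no extension can recover
            return
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for i in range(1, len(typed) + 1):
                cost = 0 if typed[i - 1] == ch else 1
                new_row.append(min(row[i - 1] + cost,
                                   row[i] + 1,
                                   new_row[i - 1] + 1))
            dfs(child, new_row)

    dfs(root, list(range(len(typed) + 1)))
    return results
```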
55. 54
Tastier (Li et al., VLDBJ'11)
[Same problem-space dimensions: showing keywords vs. showing results; single vs. multiple keywords; exact vs. fuzzy matching]
Example: "have a nni" → show results for "have a nice day"
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
56. 55
Tastier (Li et al., VLDBJ'11)
• Trie-based (similar to the previous paper).
– Trie leaf nodes are associated with inverted lists.
• To handle multiple keywords:
– Each record/document is associated with a sorted list of the words in it (forward list).
• So that a binary search can determine whether a string appears in a record/document as a prefix.
• Why not hash? Because we need to match prefixes, not whole words.
– Example forward list: "have a nice day" → "a, day, have, nice"
• Inverted-list intersections are computed incrementally using a cache for improved efficiency.
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
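The forward-list prefix test can be sketched with a binary search: `bisect_left` lands on the first word ≥ the prefix, which is the only position where a prefix match can start.

```python
import bisect

def forward_list(doc_text):
    """Sorted list of the distinct words in a document (the 'forward list')."""
    return sorted(set(doc_text.lower().split()))

def has_prefix(fwd, prefix):
    """Binary search: does any word in the forward list start with `prefix`?
    (This is why a hash set would not do — we match prefixes, not whole words.)"""
    i = bisect.bisect_left(fwd, prefix)
    return i < len(fwd) and fwd[i].startswith(prefix)
```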
57. 56
Phrase Prediction (Nandi et al., VLDB'07)
[Same problem-space dimensions: showing keywords vs. showing results; single vs. multiple keywords; exact vs. fuzzy matching]
Example: "a nice" → suggest "have a nice day"
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
58. 57
Phrase Prediction (Nandi et al., VLDB'07)
• Suggest phrases given the user input phrase.
– Need to find a good length for a suggested phrase.
• Too short: utility is small.
• Too long: low chance of being accepted.
• (Modified) suffix tree-based.
– Each node is a word, rather than a letter.
– Why not use a trie: phrases have no definitive starting point. A phrase may start in the middle of a sentence (i.e., at a suffix of the sentence), hence a suffix tree.
• Significant phrases (e.g., "laptop", "have a nice day").
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
59. 58
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
60. 59
Query Refinement
• Motivation
– Some under-specified queries on a large data corpus have too many results.
– Ranking cannot always be perfect.
• Approaches
– Identifying important terms in results
(structured/unstructured)
– Clustering results
(structured/unstructured)
– Faceted search
(structured)
FRONTEND AMBIGUITY QUERY REFINEMENT
61. 60
Using Clustered Results (Liu et al., PVLDB'11)
Example query: "Java" — all suggested queries are about the programming language.
It is desirable to refine an ambiguous query by its distinct meanings.
FRONTEND AMBIGUITY QUERY REFINEMENT
62. 61
• → Input: clustered results
– clustering method is irrelevant.
– e.g., the result of “Java” may have 3 clusters
corresponding to Java language, Java island, and
Java tea.
• ← Output: one refined query for each cluster.
Each refined query:
– maximally retrieves the results in its cluster
(recall)
– minimally retrieves the results not in its cluster
(precision)
Using Clustered Results (Liu et al., PVLDB'11)
FRONTEND AMBIGUITY QUERY REFINEMENT
63. 62
Using Important Terms in Results (Tao et al., EDBT'09)
• For relational data only.
• Given a keyword query, it outputs top-k most
frequent non-keyword terms in the results,
without generating the results.
– Avoiding result generation is possible since the
terms are ranked only by frequency: tradeoff of
quality and efficiency.
Data Clouds (for structured data): Koutrika EDBT 09
(more sophisticated term ranking, but needs to generate query results first.)
related
FRONTEND AMBIGUITY QUERY REFINEMENT
64. 63
Faceted Search
[Diagram: a facet tree — all → location: Sunnyvale, CA / Phoenix, AZ / Amherst, MA; department: data management / machine learning; …]
Challenges:
1. How to select facets and facet conditions at each level, to minimize the user's expected navigation cost?
2. How to rank facets and facet conditions?
(Chakrabarti, SIGMOD'04; Kashyap, CIKM'10)
FRONTEND AMBIGUITY QUERY REFINEMENT
65. 64
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
66. 65
Query Rewriting
• Motivation
– Synonyms, alternative names: “green card” vs
“permanent residency”.
– Too specific: “MS Office 2007 for Mac x64 edition”
– Non-quantitative: “small laptop”
• Approaches
– Using query/click logs
– Finding rewriting rules from missing results
• e.g., replace “green card” with “permanent residency”.
– Using “differential queries”
FRONTEND AMBIGUITY QUERY REWRITING
67. 66
Using Query and Click Logs (cheng
icde 10)
The availability of query and click logs
can be used to assess ground truth.
query Q
query log
click log
synonyms
hypernyms
hyponyms
of Q
“query” “search”
synonym
“MySQL” “database”
hypernym
“database” “MySQL”
hyponym
find and return historical queries
whose “ground truth” (via click
log) significantly overlaps with
top-k results of Q.
idea
FRONTEND AMBIGUITY QUERY REWRITING
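The overlap idea above can be sketched in a few lines of Python. This is a minimal illustration with a hypothetical click log and threshold; the cited work additionally distinguishes synonyms, hypernyms, and hyponyms by the direction and degree of overlap, which this sketch omits.

```python
def related_queries(q_topk, click_log, min_overlap=0.5):
    """Return historical queries whose clicked documents (their
    'ground truth') significantly overlap the top-k results of Q."""
    q_topk = set(q_topk)
    related = []
    for hist_q, clicked in click_log.items():
        overlap = len(q_topk & set(clicked)) / len(q_topk)
        if overlap >= min_overlap:
            related.append(hist_q)
    return related

# hypothetical click log: historical query -> documents users clicked
click_log = {
    "search": ["d1", "d2", "d5"],
    "mysql":  ["d7", "d8"],
    "query":  ["d1", "d3"],
}
print(related_queries(["d1", "d2", "d3", "d4"], click_log))
```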
68. 67
Automatic Suggestion of Rewriting
Rules from Missing Results (bao sigir 12)
• Challenges for automatically generating
rewriting rules:
– rules should be semantically natural.
– a new rule designed for one query may eliminate
good results of another query.
FRONTEND AMBIGUITY QUERY REWRITING
“green card”
result d is missing / should
be ranked higher
result d contains phrase
“permanent residency”
rewriting rule:
green card → permanent residency
69. 68
→ Input: query q, missed
desirable results d
← Output: selected
set of rules
Generate candidate
rules L → R.
• L: n-grams in q.
• R: n-grams in high-quality fields of d.
Identify semantically
natural rules by
machine learning.
Greedily select a
subset of rules that
maximizes the
overall query quality.
Automatic Suggestion of Rewriting
Rules from Missing Results (bao sigir 12)
FRONTEND AMBIGUITY QUERY REWRITING
green card → permanent residency
green card → federal government
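The candidate-generation step above can be sketched as follows: pair every n-gram of the query with every n-gram of a high-quality field of the missed result. This produces both good candidates ("green card → permanent residency") and noisy ones ("green card → federal government"); filtering the noisy ones is what the machine-learning and greedy-selection steps are for, and those are not shown here. The n-gram length cap is an assumption of this sketch.

```python
def ngrams(words, n_max=2):
    """All contiguous word n-grams up to length n_max."""
    return {" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)}

def candidate_rules(query, missed_doc_field):
    """Candidate rewriting rules L -> R: L is an n-gram of the query,
    R is an n-gram of a high-quality field of the missed result."""
    return {(l, r) for l in ngrams(query.split())
                   for r in ngrams(missed_doc_field.split()) if l != r}

rules = candidate_rules("green card", "permanent residency application")
print(("green card", "permanent residency") in rules)  # True
```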
70. 69
Keyword++ (Entity Databases)
(xin pvldb 10)
“small IBM laptop”
ID Product Name BrandName Screen Size Description
1 ThinkPad E545 Lenovo 15 The IBM laptop...small
business…
2 ThinkPad X240 Lenovo 12 This notebook...
To “understand” a term, compare two queries that
differ on this term, and analyze the differences of
attribute value distributions in the results.
idea
e.g., to understand term “IBM”, we can compare the results of
“IBM laptop” vs. “laptop”.
FRONTEND AMBIGUITY QUERY REWRITING
71. 70
Suppose: “IBM laptop” → 50 results, 30 having “brand: Lenovo”
“laptop” → 500 results, only 50 having “brand: Lenovo”
The difference on “brand: Lenovo” is significant,
reflecting the meaning of “IBM”.
IBM brand: Lenovo
small order by size ASC
Offline: compute the best mapping for all terms in query log
Online: compute the best segmentation of the query (DP).
“laptop”
“small laptop”
likewise:
Keyword++ (Entity Databases)
(xin pvldb 10)
FRONTEND AMBIGUITY QUERY REWRITING
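The differential-query comparison above can be sketched directly: compute the attribute-value distributions of the two result sets and pick the value whose relative frequency rises the most when the term is added. This shows only the comparison step, not the offline mapping of all logged terms or the online DP segmentation; the small result lists are hypothetical.

```python
def distribution(results, attr):
    """Fraction of results carrying each value of the attribute."""
    counts = {}
    for r in results:
        counts[r[attr]] = counts.get(r[attr], 0) + 1
    return {v: c / len(results) for v, c in counts.items()}

def term_signal(results_with, results_without, attr):
    """The attribute value whose relative frequency rises the most when
    the term is added to the query: the differential-query comparison."""
    d1 = distribution(results_with, attr)
    d0 = distribution(results_without, attr)
    return max(d1, key=lambda v: d1[v] - d0.get(v, 0.0))

# "IBM laptop": 3 of 5 results are Lenovo; "laptop": only 1 of 5.
results_with = [{"brand": "Lenovo"}] * 3 + [{"brand": "HP"}] * 2
results_without = [{"brand": "Lenovo"}] + [{"brand": "HP"}] * 4
print(term_signal(results_with, results_without, "brand"))
```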
72. 71
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
73. 72
Offline: how many query forms, and which query
forms, should be generated?
• Too many – hard to find the relevant forms.
• Too few – limiting query expressiveness.
Online: how to identify query forms relevant to
users’ search needs?
Query Forms
Enabling users to issue precise structured queries
without mastering structured query languages.
advantage
challenges
Baid SIGMOD 09 Jayapandian PVLDB 08 Ramesh PVLDB 11 Tang TKDE 13
FRONTEND AMBIGUITY QUERY FORMS
74. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
75. 74
2. Ranking
Ranking Method Categories
Unstructured Data
• represents queries and documents using vectors
• each component is a term; the value is its weight
• ranking score = similarity (query vector, result vector)
Structured Data
• a document → a node or a result (subgraph/subtree)
vector space model
proximity based ranking
…
authority based ranking
…
FRONTEND RANKING
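The vector-space scoring above reduces to cosine similarity between term-weight vectors. A minimal sketch, with made-up term weights standing in for TF-IDF values:

```python
import math

def cosine(q, d):
    """Similarity between a query vector and a document vector, where
    each component is a term and its value is the term's weight."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q  = {"enterprise": 1.0, "search": 1.0}
d1 = {"enterprise": 0.8, "search": 0.5, "engine": 0.3}
d2 = {"laptop": 0.9, "search": 0.2}
print(cosine(q, d1) > cosine(q, d2))  # d1 ranks higher for this query
```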
76. 75
2. Ranking
Ranking Method Categories
Unstructured Data
• proximity of keyword matches in a document can
boost its ranking.
Structured Data
• weighted tree/graph size, total distance from root to
each leaf, semantic distance, etc.
vector space model
…
authority based ranking
…
proximity based ranking
FRONTEND RANKING
77. 76
2. Ranking
Ranking Method Categories
vector space model
…
…
Unstructured Data
• nodes linked by many other important nodes are
important.
Structured Data
• authority may flow in both directions of an edge
• different types of edges in the data (e.g., entity-entity
edge, entity-attribute edge) may be treated differently.
proximity based ranking
authority based ranking
FRONTEND RANKING
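The authority-based idea ("nodes linked by many other important nodes are important") is the PageRank recurrence; a power-iteration sketch on a tiny hypothetical link graph:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration: a node's score is boosted by incoming links
    from other high-scoring nodes."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:          # pass score along out-links
                    new[m] += share
            else:                       # dangling node: spread uniformly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank

links = {"a": ["c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
print(max(r, key=r.get))  # "c" is linked by the most pages
```

For structured data, as the slide notes, the same flow can be run over a data graph with per-edge-type weights, and authority may flow in both directions of an edge.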
78. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
79. 78
3. Representation
• An enterprise corpus can be much more
heterogeneous than a collection of documents or
web pages.
• Different searches may target different result types:
a document, a figure, a tuple, a subgraph,
an analytical keyword query, etc.
Result diversification
Result summarization
Result differentiation
solutions
FRONTEND REPRESENTATION
80. 79
Result Diversification
• Result diversification is essentially the same
problem as query refinement.
– e.g., Java → Java language, Java tea, Java island.
• Same techniques apply.
FRONTEND REPRESENTATION DIVERSIFICATION
81. 80
Result Summarization
• Unstructured data: lots of work on text
summarization in machine learning, natural
language processing and IR communities.
• Structured data:
– Size-l object summary (Relational)
– Result snippet (XML)
Das, CMU 07 (unpublished)
Nenkova, Mining Text Data 12
surveys
FRONTEND REPRESENTATION SUMMARIZATION
82. 81
Size-l Object Summary (fakas pvldb 11)
……Mike……
first
window
“Mike”
unstructured
Mike
paper paper patent patent…
conference John …
… … …
… …
?
structured
FRONTEND REPRESENTATION SUMMARIZATION
83. 82
Size-l Object Summary (fakas pvldb 11)
• Each tuple has:
– a static importance score.
• similar idea as PageRank
– a run-time relevance score.
• distance to result root
• connectivity properties to result root
• Objective: find a connected snippet of the result,
which consists of l tuples and has the maximum
score.
• Dynamic programming based solution.
Result snippet for XML: Liu TODS 10
related
FRONTEND REPRESENTATION SUMMARIZATION
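The objective above (a connected snippet of l tuples with maximum score, containing the result root) admits a tree dynamic program. A compact sketch of that flavor of DP, with the paper's static-importance and runtime-relevance scores collapsed into a single hypothetical per-tuple score:

```python
def best_summary(tree, scores, root, l):
    """Max total score of a connected snippet of exactly l tuples that
    includes the result root (DP over the result tree)."""
    def dp(node):
        # best[k] = max score of a size-k connected subtree rooted here
        best = [float("-inf")] * (l + 1)
        best[1] = scores[node]
        for child in tree.get(node, []):
            cbest = dp(child)
            merged = best[:]
            for k in range(1, l + 1):          # tuples used so far
                if best[k] == float("-inf"):
                    continue
                for j in range(1, l - k + 1):  # tuples taken from child
                    if cbest[j] > float("-inf"):
                        merged[k + j] = max(merged[k + j], best[k] + cbest[j])
            best = merged
        return best
    return dp(root)[l]

# toy result tree rooted at the "Mike" tuple, with made-up scores
tree = {"Mike": ["paper1", "paper2", "patent1"], "paper1": ["conf", "John"]}
scores = {"Mike": 5, "paper1": 3, "paper2": 2, "patent1": 1, "conf": 2, "John": 4}
print(best_summary(tree, scores, "Mike", 3))  # Mike + paper1 + John = 12
```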
84. 83
Result Differentiation
Result 1 Result 2
event: year 2000 2012
paper: title OLAP
data
mining
cloud
scalability
search
“NEC Labs Open House”
result 1: a large table with many
people / papers / posters
result 2: a large table with many
people / papers / posters
…
results result differentiation
vs. comparing different credit cards on a bank website:
only with pre-defined features.
FRONTEND REPRESENTATION DIFFERENTIATION
85. 84
4. Expert Search
documents in which a candidate and a topic co-occur
topics near a candidate in a document
problem solving / ticket routing history
user’s knowledge on a topic
• expert should be more knowledgeable
social relationship between expert and user
• problem solving is usually more effective if expert has a close
social relationship with user
external corpus
• many employees publish content externally, e.g., papers, blogs.
ways for judging an expert
Find an expert within an enterprise to solve a particular problem.
goal
FRONTEND EXPERT SEARCH
86. 85
Classical Methods
• Builds a feature vector for each expert using various
evidence
• Ranks experts based on query, using traditional
retrieval models
candidate model
• First finds documents related to query, then locates
experts in documents
• Mimics the process a human takes.
document model
Balog CIKM 08
survey
FRONTEND EXPERT SEARCH
87. 86
User-Oriented Model (smirnova ecir 11)
Users prefer experts who:
are more knowledgeable
than themselves.
knowledge gain: p(e|q) – p(u|q)
have a close social relationship
with themselves.
time-to-contact: shortest path
department
head
John
employees
…
e = expert
u = user
FRONTEND EXPERT SEARCH
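The two preferences above can be combined into a single score: knowledge gain p(e|q) - p(u|q), discounted by time-to-contact (shortest path in the organizational graph). A sketch under simplifying assumptions: knowledge levels and the weight lam are made-up stand-ins for the model's estimated probabilities and trade-off parameter.

```python
from collections import deque

def shortest_path(graph, src, dst):
    """BFS hop count in the social/organizational graph (time-to-contact)."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return float("inf")

def expert_score(knowledge, graph, user, expert, topic, lam=0.1):
    """Knowledge gain p(e|q) - p(u|q), discounted by time-to-contact."""
    gain = knowledge[expert][topic] - knowledge[user][topic]
    return gain - lam * shortest_path(graph, user, expert)

graph = {"john": ["head"], "head": ["john", "alice", "bob"],
         "alice": ["head"], "bob": ["head"]}
knowledge = {"john": {"db2": 0.1}, "alice": {"db2": 0.9}, "bob": {"db2": 0.95}}
# alice and bob are equally far from john; bob knows slightly more
print(max(["alice", "bob"],
          key=lambda e: expert_score(knowledge, graph, "john", e, "db2")))
```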
88. 87
Using Web Search Engine
(santos inf. process. manage. 11)
query q
result from intranet
web query q’ result from internet
formulate web query
search
intranet
corpus combine
candidate’s full name: “Jeff Smisek”
organization’s name: “IBM”
terms in q: “data integration”
excluding results from organization: “-site:ibm.com”
FRONTEND EXPERT SEARCH
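The web-query formulation step shown above (candidate name, organization name, the query terms, minus the organization's own site) is mechanical enough to sketch directly; the exact quoting conventions here are an assumption.

```python
def formulate_web_query(candidate_name, org_name, org_domain, query_terms):
    """Build the external web query for an expert candidate: full name,
    organization, query terms, excluding the organization's own pages."""
    parts = ['"%s"' % candidate_name, '"%s"' % org_name]
    parts += query_terms
    parts.append("-site:%s" % org_domain)
    return " ".join(parts)

print(formulate_web_query("Jeff Smisek", "IBM", "ibm.com",
                          ["data", "integration"]))
```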
89. 88
Ticket Routing (shao kdd 08)
new ticket: DB2 login failure
transferred to group A
transferred to group B
transferred to group C
resolved
How to find the best group and
reduce problem solving time?
Markov chain model
Using only previous routing
history (not ticket content)
FRONTEND EXPERT SEARCH
90. 89
Ticket Routing (shao kdd 08)
Pr(g|S)
probability of routing a ticket to
group g given previous groups S
Pr(g|S) includes the probability that:
• g can solve the ticket
• g can correctly re-route the ticket.
Train the Markov chain model from ticket routing history.
FRONTEND EXPERT SEARCH
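Training such a routing model from history can be sketched as follows. For brevity this sketch conditions only on the current group (an order-1 Markov chain), whereas the model above conditions Pr(g|S) on the whole set of previous groups; the ticket histories are hypothetical.

```python
from collections import defaultdict

def train_routing_model(histories):
    """Estimate Pr(next group | current group) from past ticket
    routing sequences (content-blind, order-1 simplification)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in histories:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

def next_group(model, current):
    """Route the ticket to the most probable next group."""
    return max(model[current], key=model[current].get)

histories = [["A", "B", "C"], ["A", "C"], ["A", "B"], ["B", "C"]]
model = train_routing_model(histories)
print(next_group(model, "A"))  # "B": A->B occurred twice, A->C once
```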
91. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
92. 91
5. Privacy
It is sometimes desirable that the search engine doesn’t
know which documents a user wants to retrieve.
• For users: privacy
• For enterprises: avoiding liability
user privacy
While a search engine answers individual keyword
searches, there are methods that perform multiple
searches and, from the answers, piece together
aggregate information about the underlying corpus.
• Enterprises may not want to disclose such information to all
users.
data privacy
93. 92
User Privacy
Private Information Retrieval (PIR)
• old topic, tons of theoretical papers
Modifying the search engine, e.g.,
• forcing it to forget user activities
• embellishing queries with decoy terms (Pang PVLDB 10)
Using ghost queries to obfuscate user intention (Pang ICDE 12)
• no change to search engine
• light-weight
solutions
It is sometimes desirable that the search engine doesn’t
know which documents a user wants to retrieve.
• For users: privacy
• For enterprises: avoiding liability
user privacy
94. 93
Private Information Retrieval (PIR)
• Idea: retrieve more documents than needed.
• Naïve: retrieve the entire corpus.
• How to minimize the number of retrieved &
unneeded documents?
• Tons of theoretical papers on different variations
of the problem, e.g.,
– different computation power of the search engine
– different numbers of non-communicating corpus
replicas.
Gasarch EATCS Bulletin 2004
survey
95. 94
Ghost Queries (pang icde 12)
• Challenges
– Generate ghost queries on topics different from user’s
topics of interest, and make it difficult for the search
engine to infer user’s topics.
– Ghost queries need to be meaningful/realistic, so that
they cannot be easily identified.
generate
ghost queries
ghost queries
discard ghost
query results
results
submit to
search engine
user query
96. 95
Ghost Queries (pang icde 12)
• (e1, e2) privacy model
– Given a user query, if the probability of a topic
increases more than e1, it should be reduced to
below e2 by the ghost queries.
• Topics are predefined.
• A ghost query must be coherent: all words in
the ghost query should describe common or
related topics.
• Randomized algorithm based solution.
97. 96
Data Privacy
While a search engine answers individual keyword searches, there
are methods that perform multiple searches and, from the answers,
piece together aggregate information about the underlying corpus.
• Enterprises may not want to disclose such information to all users.
data privacy
inserting dummy tuples OR randomly generating attribute values
• only applicable to structured data
disallowing certain queries OR return snippets
• search quality loss
altering a small number of results: adding dummy results;
modifying results, hiding some results (Zhang SIGMOD 12)
solutions
FRONTEND PRIVACY
98. 97
Aggregate Suppression (zhang sigmod 12)
• Example: consider corpus A and B.
– A: n documents
– B: 2n documents
– A ⊂ B
• Goal: suppress COUNT(*), i.e., adversary cannot tell which
corpus is larger.
• Naïve approach 1: deterministically remove n documents from B.
– achieves the goal, but with search utility loss: those n documents can
never be retrieved.
• Naïve approach 2: randomly drop half of the results at run time.
– no search utility loss, but fails to achieve the goal: a clever adversary
can still get the information.
FRONTEND PRIVACY
99. 98
Aggregate Suppression (zhang sigmod 12)
• Algorithm ideas
– carefully adjusting query degree (number of
documents matched by a query) and document
degree (number of queries matching a
document) by document hiding at run-time.
– decline a query if its result can be covered by a
small number of previous queries. Return
previous query results instead.
FRONTEND PRIVACY
100. 99
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
index
Data
source
Tutorial Outline
101. 100
Enterprise Search Administrators
• Main responsibilities
– Care and feeding of an enterprise search solution
• Monitor intranet help inboxes and respond to requests.
• Assist in troubleshooting intranet issues for content contributors
• Core skills required
– Understand general corporate business processes
– Experience in coordinating activities and managing
relationships
• with employees, content administrators, stakeholders, IT teams and
external agencies
Search Admin
Search administrators ≠ IR experts
Key Observation
Admin Overview
102. 101
What Does a Search Administrator Need?
Bad results
for query …
I’m missing the
golden URL…
Result 22 should
be ranked much
higher!
Enterprise Users
Query Logs
Query “global
campus” seems
unsatisfying
• Understand overall search
quality
• Overall trend
• YOY change
• By segmentation
• Understand individual search
results
• Why a certain result is or
isn’t brought back
• Its ranking
• Maintain search quality
• Underlying data evolves
• Terminology changes
• Policy/Business Process
changes
• Organization changes
• Hot topics
Search Admin
Admin Overview
105. 104
What Does a Search Administrator Need?
Bad results
for query …
I’m missing the
golden URL…
Result 22 should
be ranked much
higher!
Enterprise Users
Query Logs
Query “global
campus” seems
unsatisfying
• Understand overall search
quality
• Overall trend
• YOY change
• By segmentation
• Understand individual search
results
• Why a certain result is or
isn’t brought back
• Its ranking
• Maintain search quality
• Underlying data evolves
• Terminology changes
• Policy/Business Process
changes
• Organization changes
• Hot topics
Search Admin
Admin Examples
115. 114
Experience at IBM Internal Search
• IBM deployed a commercially available search engine
– Implementing standard IR techniques
• Search quality went down over time to the point that
Search results were unacceptable!
Success (≥ 1 relevant result): 14% on top-1, 23% on
top-5, 34% on top-50! [Zhu et al., WWW’07]
So, they implemented various solutions…
To the administrators managing the engine, exposed
control knobs were insufficient
Case Study Background
116. 115
Attempts to Improve Search
• Enhanced link analysis by
incorporating links to/from the
external WWW
• Creative hacks: added fake terms
to documents & queries
– # terms per document determined by
“popularity”: how much TF increase is
required for the needed rank boost?
• Hard-coded custom results for the
top 1200+ queries
Didn’t help…
Quality went down!
Maintenance nightmare:
Heuristic needs to be updated
upon each nontrivial change in
term stats./ranking parameters
Even bigger nightmare!
How to deal with continuously
changing terminology?
Case Study Background
117. 116
Goals of Gumshoe
Network Station Manager search
Thin Client Manager
Product names change:
Continually changing terminology
Domain-specific meaning
Paula Summa search
bring Paula Summa from
employee directories
per diem search
Domain-specific repetitions
popcorn search
conference call!
• Result 1: IBM Travel: Per Diem
• Result 2: IBM Travel: Per Diem Rates
• Result 3: IBM Travel: National perdiems
• Result 25: IBM Travel: Per Diem Policy
…
Gumshoe:
• Generic search solution, customizable & maintainable in many domains
– Simple customization with reasonable effort
– Ongoing search-quality management
• Philosophy: programmable search
Case Study Background
118. 117
Programmable Search: Main Idea
• Goals:
– Transparency
• Know “precisely” why every result item is being brought back
• Understand how changes in content/intents affect search
– Maintainability and “Debugability”
• Ranking logic is guided by explicit rules
• Properly react to changes in content/intents
• Building blocks:
– Deep analytics on documents
– Domain-specific analysis of queries
– Transparent customizable rule-driven ranking
runtime rules
backend
analytics
interpretations
Case Study Background
119. 118
Distributed Analytics Platform (IBM InfoSphere BigInsights)
Crawling, information extraction, token generation (TG), indexing
Search runtime
Index
Index and rule
update services
backend
analytics
runtime rules
interpretations
backend
frontend
Implementation Architecture
Case Study Background
120. 119
Backend Analytics: 3 Parts
Local Analysis
(per-page analysis)
Global Analysis
(cross-page analysis)
Token Generation
(TG)
index
Case Study Background
121. 120
Local Analysis
• Categorizing pages
– Label pages by custom categories
• IBM examples: HR, person, IT help, ISSI, sales information,
marketing, corporate standards, legal & IP-law, …
– Geo classification
• Associate documents with the relevant countries & regions
• Annotating pages
– Identify HomePage annotation for people, projects,
communities, …
Simply knowing where a page is physically hosted is not enough
(example: Czech Republic hosts all pages for IBM in Europe)
Case Study Backend Local Analysis
122. 121
• Declarative approach
– Define an operator for each basic operation
• Input tuple of annotations
• Output tuples of annotations
– Compose operators to build complex extractors
• Algebraic expression
• One document at a time → trivial parallelism.
• Benefits of declarative approach:
– Expressivity: Richer, cleaner rule semantics
– Performance: Better performance through optimization
Declarative IE System
Case Study Backend Local Analysis
123. 122
InfoSphere
Streams
Cost-based
optimization
...
SystemT – Overview
InfoSphere
BigInsights
SystemT Runtime
Input
Documents
Extracted
Objects
SystemT
IBM Engines
UIMA
SystemT
Highly embeddable runtime
AQL Extractors
Embedded machine
learning model
AQL Rules
create view SentimentForCompany as
select T.entity, T.polarity
from classifyPolarity(SentimentFeatures) T;
create view Company as
select ...
from ...
where ...
create view SentimentFeatures as
select ...
from ...;
Case Study Backend Local Analysis
124. 123
G J Chaitin Home Page
Homepage Identification
Title Extraction
Matching title patterns
Titles
Dictionary
Match
Home Page for
G J Chaitin
• http://w3.ibm.com/hr/idp/
• http://w3-03.ibm.com/isc/index.html
• http://chis.at.ibm.com/
URL Extraction
URLs
Matching URL patterns
Homepage for: idp isc chis
Employee
directory
… many more …
Intranet
page
[Zhu et al., WWW’07]
Case Study Backend Local Analysis
125. 124
Among the 38 pages with the exact same title,
which is the best for “Paula Summa”?
Role of Global Analysis
Case Study Backend Global Analysis
126. 125
Person
Title
Token Generation (TG)
Annotated values Index content
Ching-Tien T. (Howard) Ho
Ho Ching-Tien Tien Ho Ho, Tien
Howard Ho Ching-Tien H. ...
Global Technology Services
TG
Howard Ho Ching Tien ...
gts Global Technology Services
Global Technology Technology
Services Global Technology ...
GlobalTechnologyServices
nGramTG
spaceTG
……
…
…
…
Case Study Backend Token Generation
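The token generators above (spaceTG, nGramTG, acronymTG) can be sketched in Python. These sketches reproduce the tokens shown on the slide for "Global Technology Services"; the exact generation rules in Gumshoe may differ.

```python
def space_tg(value):
    """spaceTG: the individual words plus the concatenated form."""
    words = value.split()
    return set(words) | {"".join(words)}

def ngram_tg(value, n=2):
    """nGramTG: contiguous word n-grams of the annotated value."""
    words = value.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def acronym_tg(value):
    """acronymTG: first letters of the words, lowercased (e.g., 'gts')."""
    return {"".join(w[0] for w in value.split()).lower()}

title = "Global Technology Services"
print(acronym_tg(title))                              # {'gts'}
print("GlobalTechnologyServices" in space_tg(title))  # True
print(sorted(ngram_tg(title)))
```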
127. 126
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Case Study Frontend
128. 127
Phase 3: Result Construction
Phase 2: Relevance Ranking
Phase 1: Query Semantics
query search rewrite rules
queries
interpretations
partially ordered interpretations
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
Runtime Flow in More Details
Case Study Frontend
129. 128
Runtime Rules: Pattern-Action Language
(Fagin 2012)
Query Pattern Queries Matching Possible Action
EQUALS
[r=ibm|information|info]
[d=COUNTRY]
• ibm germany
• info india
Rewrite into “[country] hr”
(e.g., germany hr)
ENDS_WITH installation
• acrobat installation
• db2 on aix installation
Replace installation with ISSI
(e.g., acrobat ISSI)
CONTAINS directions to
[d=SITE]
• driving directions to almaden
• directions to watson from jfk
Pages of “siteserv” category
should be ranked higher
STARTS_WITH
[d=PERSON]
• john kelly biography
• steve mills announcement
Group together pages that
represent blog entries
Pattern expression,
matched against the
keyword query
Perform when match
Query pattern → Action
• Similar to the query-template rules of Agarwal et al. [WWW 2010]
Case Study Frontend Query Semantics
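A minimal pattern-action runtime of the kind the table illustrates can be sketched as follows. This is an assumption-laden simplification: it matches only plain string patterns, whereas the real rule language also binds dictionary variables such as [d=COUNTRY] and [d=PERSON]; the two sample rules echo rows of the table.

```python
def matches(pattern, kind, query):
    """Evaluate one query pattern against a keyword query."""
    q = query.lower()
    if kind == "EQUALS":
        return q == pattern
    if kind == "STARTS_WITH":
        return q.startswith(pattern)
    if kind == "ENDS_WITH":
        return q.endswith(pattern)
    if kind == "CONTAINS":
        return pattern in q
    raise ValueError("unknown pattern kind: %s" % kind)

def apply_rules(rules, query):
    """Fire the action of every rule whose pattern matches the query."""
    return [action(query) for pattern, kind, action in rules
            if matches(pattern, kind, query)]

rules = [
    ("installation", "ENDS_WITH",
     lambda q: q.replace("installation", "ISSI")),   # rewrite action
    ("directions to", "CONTAINS",
     lambda q: "boost category: siteserv"),          # ranking action
]
print(apply_rules(rules, "acrobat installation"))  # ['acrobat ISSI']
```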
131. 130
The most important IBM page for benefits
changes over time: currently it is netbenefits
What’s Best for Benefits?
Case Study Frontend Query Semantics
136. 135
Complex Rules
java jim and not in person category
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
java search
Case Study Frontend Query Semantics
137. 136
Interpretations
Scenario: An IBM employee wants
to download Lotus Symphony 1.3
Runtime interpretation:
download symphony 1.3 → category=issi software=symphony 1.3
Case Study Frontend Query Semantics
139. 138
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Case Study Frontend Relevance Ranking
140. 139
Person
Title
Recall: Token Generation (TG)
Annotated values Index content
Ching-Tien T. (Howard) Ho
Global Technology Services
TG
Howard Ho Ching Tien ...
gts Global Technology Services
Global Technology Technology
Services Global Technology ...
GlobalTechnologyServices
nGramTG
spaceTG
……
…
…
…
Ho Ching-Tien Tien Ho Ho, Tien
Howard Ho Ching-Tien H. ...
Person + personNameTG
Person + nGramTG
Title + acronymTG
Title + spaceTG
Title + nGramTG
Case Study Frontend Relevance Ranking
141. 140
Annotation + TG → Relevance Bucket
Howard Ho Ching Tien ...
GlobalTechnologyServices
……
Person + personNameTG
Person + nGramTG
Title + acronymTG
Title + spaceTG
Title + nGramTG
query search Relevance buckets
• Buckets are ranked
– Based on annotation type
– Based on TG quality
• A page can belong to
multiple buckets
• Within each bucket,
ranking is by
conventional IR
……
Case Study Frontend Relevance Ranking
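The bucket-then-IR ordering described above can be sketched as a two-level sort key: first the rank of a result's best bucket (buckets ordered by annotation type and TG quality), then the conventional IR score within the bucket. The bucket names, scores, and URLs below are made up for illustration.

```python
def bucket_rank(results, bucket_order):
    """Sort results first by the rank of their best (highest-priority)
    bucket, then by conventional IR score within the bucket."""
    def key(r):
        best = min(bucket_order.index(b) for b in r["buckets"])
        return (best, -r["ir_score"])   # lower bucket index wins; then IR
    return sorted(results, key=key)

bucket_order = ["Person + personNameTG", "Title + acronymTG", "Title + nGramTG"]
results = [
    {"url": "p1", "buckets": ["Title + nGramTG"],       "ir_score": 0.9},
    {"url": "p2", "buckets": ["Person + personNameTG"], "ir_score": 0.4},
    {"url": "p3", "buckets": ["Title + acronymTG",
                              "Title + nGramTG"],       "ir_score": 0.7},
]
print([r["url"] for r in bucket_rank(results, bucket_order)])
```

Note how p2 wins despite the lowest IR score: bucket priority dominates, and a page in multiple buckets (p3) is ranked by its best one.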
142. 141
Ranking by Relevance Buckets
grouping rules
ordered & grouped results final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
interpretations execution
partially ordered results
result aggregation
ordered results
employment verification search
Case Study Frontend Relevance Ranking
143. 142
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Case Study Frontend Result Construction
144. 143
Grouping Rules
• Grouping rules define how search results should be
grouped together
• Search administrators can improve the diversity of
search results (on the 1st page)
– Based on their familiarity with the data sources
Group pages of the same category
per diem travel, you-and-ibm
ANY ISSI, IT Help Central, Forum,
Bluepedia, Media Library, …
Query pattern
Case Study Frontend Result Construction
145. 144
Need first page diversity
Flooding with Similar Pages
Case Study Frontend Result Construction
146. 145
per diem travel, you-and-ibm
Grouping Rule to the Rescue
Case Study Frontend Result Construction
147. 146
per diem travel, you-and-ibm
final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results
per diem search
Grouping Rule to the Rescue
Case Study Frontend Result Construction
148. 147
Re-ranking Rules
• Re-ranking rules adjust ranking of
search results based on categories
• Example: search administrator specifies the
important sources of “hot/current topics”
Hot topics Rank these categories higher
Bluepedia, News, About-IBM
smarter planet, cloud
computing, centennial, …
Case Study Frontend Result Construction
149. 148
Bluepedia
Technical News
Homepages of
“About IBM”
Hot topics Rank these categories higher
Bluepedia, News, About-IBM
smarter planet, cloud
computing, centennial, …
Re-ranking Rule for Hot Topics
Case Study Frontend Result Construction
150. 149
Re-ranking Rules for Person Queries
[d=PERSON]
executive_corner, media_library,
organization_chart, files
Case Study Frontend Result Construction
152. 151
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Case Study Frontend
153. 152
What Administrators Need…
• Search administrators have major problems
with an opaque search engine
• Programmable search provides
– Customization to the specific domain
– Ongoing search-quality management
Allows the building of a search quality toolkit.
Recap:
Case Study Admin
157. 156
The Proof of the Pudding Is in the Eating
• Immediate positive impact within the first 3 months
– Improved natural clickthrough rate by 100%+
– Top 5 results: selected about 90% of the time
• Sustained search quality improvements over the 4 years since
going live
• Stable natural search clickthrough rate
Gumshoe (Aug. 2011– Oct. 2011)
Old Intranet Search (Aug. 2010– Aug. 2011)
Natural
clickthrough
rate
Case Study Results
158. 157
Summary
Programmable search:
Simple & flexible customization
Search quality management
Backend Analytics
Local analysis
(per-page analysis)
Global Analysis
(cross-page analysis)
Token Generation
(TG)
[Fagin et al.,
PODS’10,
PODS’11]
Tooling
• Search provenance
• Rule suggestion
• Utilization of relevance buckets
[Li et al.,
SIGIR’06,
Zhu et al.,
WWW’07]
Phase 1:
Query Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
[ Bao et al.,
ACL’2010,
SIGIR’2012
CIKM’2012]
Case Study Summary
160. 159
Search Engine Components
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
index
Data
source
161. 160
Future Directions
Data Heterogeneity
A rich variety of data types need to be searched in
enterprises.
• docs, databases, images, videos, social graphs, etc.
observations
How to automatically identify relevant data types, and
search and rank across different data types?
• e.g., for image search, should image recognition techniques
be incorporated in enterprise search engines? If so, how?
questions
162. 161
Future Directions
Data Freshness
New data is continuously collected and published in
enterprises, often at a very fast rate.
Web search engines are not required to index new websites
quickly, but in enterprises, new content may need to be
searchable as soon as possible.
observations
How to build efficient real-time indexes to ensure data
freshness in enterprise search?
questions
163. 162
Future Directions
Search Context
Enterprise search users have richer profiles than web users.
• activities, bio, position, projects, experiences, etc.
observations
How to utilize users’ contexts to provide customized results?
Is it possible to predict the information a user may want, and
push it to the user?
questions
164. 163
Future Directions
User Preference
Different users in an enterprise have different expertise, and
may prefer different ways to express queries.
• e.g., some users prefer pure keyword search, while
others may want lightly-structured queries.
observations
How to effectively support different users’ preferred ways of
expressing queries?
questions
165. 164
Future Directions
Question Answering
The purpose of many enterprise searches is to find
answers to questions.
• e.g., what is the previous name of a product, and when
did we change to the current name?
observations
Is it possible to effectively use natural language processing
techniques and domain knowledge to automatically answer
natural language questions?
questions
166. 165
Future Directions
Transactional Search
Over 1/3 of enterprise search queries are transactional. It would
be desirable if enterprise search engines could recommend
business processes to accomplish a certain task given a
transactional search.
• E.g., given a customer’s lengthy complaint letter, how to find
out the departments relevant to the complaints.
observations
How to better support transactional search? How to initiate
a business process based on the results of a search?
questions
167. 166
Future Directions
Big Data Analytics
Rich information and knowledge lie in big data. Many
employees (not just data analysts) may benefit from the
ability to perform analytics on the company’s big data.
observations
How to build a low-cost, interactive platform that allows a
large number of employees to issue analytical queries?
How to give employees the capabilities to analyze big data,
if they have little knowledge of SQL or MapReduce
programming?
questions
168. 167
Future Directions
Tooling for Search Quality Maintenance
Most enterprise search engines have to be manually
evaluated and tuned by a search administrator with domain
knowledge, in an ad-hoc fashion.
observations
Can we automate this process, or at least minimize manual
involvement?
Can we fully utilize explicit user feedback?
• Explicit user feedback is easier to obtain in enterprise
search, and there is less spam.
questions
169. Thanks.
Acknowledgement:
IBM Research: Sriram Raghavan, Fred Reiss, Shiv Vaithyanathan, Ron Fagin
IBM CIO’s Office: Nicole Dri, Brian C. Meyer
LogicBlox: Benny Kimelfeld*
TripAdvisor: Adriano Crestani Campos*
Facebook: Zhuowei Bao*
NJIT: Yi Chen
UNSW: Wei Wang
* work done while at IBM