This is a high-level summary of Gumshoe, the enterprise search engine built by our group, which currently powers IBM intranet search. It was a SIGIR 2011 Industrial Track keynote talk.
Google search vs Solr search for Enterprise search - Veera Shekar
- The document discusses implementing enterprise search and compares Google Search to alternative options. It covers topics like how search engines work, the architecture of Google Search, and top 5 requirements for enterprise search implementation.
- For each requirement, it identifies disadvantages of using Google Search and discusses alternative implementation options that may perform better like Apache Solr, Endeca, and Autonomy.
- The overall conclusion is that no single search engine fulfills all enterprise needs, and custom application development is often required to fully meet requirements, allowing the use of various tools.
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges - Yunyao Li
These are the slides used in our 3-hour tutorial at VLDB 2014.
Yunyao Li, Ziyang Liu, Huaiyu Zhu: Enterprise Search in the Big Data Era: Recent Developments and Open Challenges. PVLDB 7(13): 1717-1718 (2014)
Abstract:
Enterprise search allows users in an enterprise to retrieve desired information through a simple search interface. It is widely viewed as an important productivity tool within an enterprise. While Internet search engines have been highly successful, enterprise search remains notoriously challenging due to a variety of unique challenges, and is being made more so by the increasing heterogeneity and volume of enterprise data. On the other hand, enterprise
search also presents opportunities to succeed in ways beyond current Internet search capabilities. This tutorial presents an organized overview of these challenges and opportunities, and reviews the state-of-the-art techniques for building a reliable and high quality enterprise search engine, in the context of the rise of big data.
This document discusses options for integrating external data into SharePoint, including Business Connectivity Services (BCS). BCS allows SharePoint to connect to external data sources and make that data accessible via external lists. However, BCS has limitations and its future is uncertain. New options like Power BI and Logic Apps provide more flexibility for building applications that integrate external data without relying on BCS. Hybrid BCS enables accessing on-premises data from SharePoint Online by publishing data through an on-premises gateway.
An Introduction to Graph: Database, Analytics, and Cloud Services - Jean Ihm
Graph analysis employs powerful algorithms to explore and discover relationships in social network, IoT, big data, and complex transaction data. Learn how graph technologies are used in applications such as fraud detection for banking, customer 360, public safety, and manufacturing. This session will provide an overview and demos of graph technologies for Oracle Cloud Services, Oracle Database, NoSQL, Spark and Hadoop, including PGX analytics and PGQL property graph query language.
Presented at Analytics and Data Summit, March 20, 2018
The document describes a method for focused crawling to retrieve structured data from web pages. It involves using an online classifier trained on URL features to identify pages containing structured data. A bandit-based selection strategy is used to balance exploration and exploitation. Experiments show the adaptive approach retrieves 26% more relevant pages than static classification, and 66% more when focused on a specific objective. Decaying the bandit randomness over time improved results further. The method was able to retrieve hundreds of millions of structured data pages from billions of web pages.
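The bandit-based selection strategy described above can be illustrated with a minimal epsilon-greedy sketch. This is not the paper's implementation; the class, the bucket-based arms, and the decay scheme are illustrative assumptions.

```python
import random

class EpsilonGreedyCrawler:
    """Toy sketch of bandit-based URL frontier selection with decaying
    randomness. Arms are URL 'buckets' (e.g. grouped by site or URL
    pattern); a reward of 1 means a fetched page contained structured
    data. All names here are illustrative, not from the paper."""

    def __init__(self, arms, epsilon=0.3, decay=0.99):
        self.epsilon = epsilon      # exploration probability
        self.decay = decay          # shrinks epsilon after each selection
        self.pulls = {a: 0 for a in arms}
        self.wins = {a: 0 for a in arms}

    def select_arm(self):
        # Explore with probability epsilon; otherwise exploit the bucket
        # with the best observed success rate so far.
        if random.random() < self.epsilon:
            arm = random.choice(list(self.pulls))
        else:
            arm = max(self.pulls,
                      key=lambda a: self.wins[a] / (self.pulls[a] or 1))
        self.epsilon *= self.decay  # decay randomness over time
        return arm

    def update(self, arm, relevant):
        # Record the classifier's verdict for the fetched page.
        self.pulls[arm] += 1
        self.wins[arm] += int(relevant)
```

As epsilon decays, selection shifts from exploration toward exploiting buckets that have historically yielded structured-data pages, mirroring the improvement the experiments attribute to decaying the bandit randomness.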
This document provides an overview of Amundsen, an open source data discovery and metadata platform developed by Lyft. It begins with an introduction to the challenges of data discovery and outlines Amundsen's architecture, which uses a graph database and search engine to provide metadata about data resources. The document discusses how Amundsen impacts users at Lyft by reducing time spent searching for data and discusses the project's community and future roadmap.
Video: https://www.youtube.com/watch?v=Rt2oHibJT4k
Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have recently addressed the "Velocity" problem – but the "Variety" problem is largely unaddressed – there is a lot of manual "data wrangling" to manage data models.
These manual processes do not scale well. Not only is the variety of data increasing, but the rate of change in data definitions is increasing as well. We can’t keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it.
This talk will present tools and a methodology to manage Big Data Models in a rapidly changing world. This talk covers:
Creating Semantic Metadata Models of Big Data Resources
Graphical UI Tools for Big Data Models
Tools to synchronize Big Data Models and Application Code
Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
Using Big Data Models with Machine Learning to generate Predictive Models
Developer Collaborative/Coordination processes using Big Data Models and Git
Managing change – Big Data Models with rapidly changing Data Resources
Engineering patterns for implementing data science models on big data platforms - Hisham Arafat
A discussion of practically implementing data science models on big data platforms from an engineering perspective, and an eye-opener on the engineering factors involved in designing a working solution. We use a simple text-mining example on social media analytics for brand marketing. At first it appears to be a simple solution, but if you dig into the implementation aspects of even a simple analytics model, you discover the degree of complexity in each part of the solution. An abstraction of the key Big Data advantages is very helpful for selecting appropriate Big Data technology components from a very large landscape. Two referenced examples are given: one using the Lambda Architecture, and an unusual approach to image processing using the Big Data abstraction provided.
Implementing BCS-Business Connectivity Services - SharePoint 2013 - Office 365 - Shahzad S
BCS enables accessing external data from SharePoint and Office applications. It involves three phases - groundwork, SharePoint, and Office. Architectures include server-side only in SharePoint, client-side in Office, on-premises, cloud-only, and hybrid. Solutions can be built using Visual Studio or SharePoint Designer connecting to databases, web services, .NET assemblies, and OData sources. Security, performance, and limitations require consideration.
Amundsen is a metadata-driven application developed by Lyft to solve data discovery challenges. It provides a search-based UI and uses a distributed architecture with various microservices to index and serve metadata from multiple sources. Key components include a metadata service using Neo4j, a search service using Elasticsearch, and a frontend. The tool has been hugely successful at Lyft and is now open source. Future work includes expanding metadata coverage and integrating with other tools.
Solution architecture for big data projects
Presented at SQL Saturday Atlanta May 18, 2013
Text mining is projected to dominate data mining, and the reasons are evident: we have more text available than numeric data. Microsoft introduced a new technology to SQL Server 2012 called Semantic Search. This session's detailed description and demos give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demos include a web-based Silverlight application, and content documents from Wikipedia. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft data mining.
Big data architectures and the data lake - James Serra
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP) architectures
Relecura is a software platform that provides features for patent and portfolio analysis including searching, bucketing, trend analysis, taxonomy building, and collaboration tools. Key features allow users to initiate searches, import portfolios, bucket patents through auto, training or query methods, analyze growth trends, tag and share documents, map priority and citations, and generate automated reports. The platform also offers mobile access, custom dashboards, and support services.
The document discusses Lyft's data discovery tool called Amundsen. It provides an overview of Amundsen's architecture including its use of a graph database and Elasticsearch for metadata storage and search. It describes the challenges of data discovery that Amundsen addresses like time spent searching for data. The document outlines Amundsen's key components like its databuilder, metadata and search services. It discusses Amundsen's impact and popularity at Lyft and its open source community. Future roadmap plans include additional metadata types and deeper integrations with other tools.
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ... - South London Geek Nights
The document provides an overview of NoSQL databases, including what NoSQL means, the rise of NoSQL as an alternative to relational databases, different classifications of NoSQL databases, pros and cons, use cases, and real-world examples. It discusses how NoSQL databases provide more flexible schemas and scalability than relational databases for applications like logging, shopping carts, and user preferences, while relational databases remain better for transactions and business critical data. The presenter then demonstrates CouchDB as one example of a NoSQL database.
This document discusses how insurance companies use MongoDB. It provides examples of how MongoDB allows insurance companies to create a single customer view, consolidate data from multiple disparate systems, and distribute claims information globally in real-time. MongoDB provides a flexible schema, automatic replication of data, and the ability to query data locally for improved customer experience, risk analysis, fraud detection, and claims processing. The document highlights several insurance companies that have adopted MongoDB to unify customer data, modernize legacy systems, and power new data-driven applications and services.
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach - DATAVERSITY
After three decades of relational data modeling, everyone’s pretty comfortable with schemas, tables, and entity-relationships. As more and more Global 2000 companies choose NoSQL databases to power their Digital Economy applications, they need to think about how to best model their data. How do they move from a constrained, table-driven model to an agile, flexible data model based on JSON documents?
This webinar is intended for architects and application developers who want to learn about new JSON document data modeling approaches, techniques, and best practices. This webinar will show you how to get started building a JSON document data model, how to migrate a table-based data model to JSON documents, and how to optimize your design to enable fast query performance.
This webinar will provide practical, experience-based advice and best practices for modeling JSON documents, including:
- When to embed or not embed objects in your JSON document
- Data modeling using a practical data access pattern approach
- Indexing your JSON documents
- Querying your data using N1QL (SQL for JSON)
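The embed-vs-reference decision above can be sketched with plain JSON-style documents. The order/customer schema, field names, and totals here are hypothetical, invented for illustration; they are not from the webinar.

```python
# Two ways to model the same order data as JSON documents.

# Embedded: line items live inside the order document. A good fit when
# items are always read together with the order and the array stays small.
order_embedded = {
    "type": "order",
    "orderId": "o-1001",
    "customer": {"name": "Ada", "city": "London"},
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
}

# Referenced: items are separate documents linked by order id. A good fit
# when items change independently or the array would grow without bound.
order_ref = {"type": "order", "orderId": "o-1001", "customerId": "c-42"}
items_ref = [
    {"type": "item", "orderId": "o-1001", "sku": "A-1", "qty": 2, "price": 9.99},
    {"type": "item", "orderId": "o-1001", "sku": "B-7", "qty": 1, "price": 24.50},
]

def order_total(items):
    # Same aggregation works against either shape once the items are in hand.
    return sum(i["qty"] * i["price"] for i in items)
```

Either shape yields the same order total; the difference is in access patterns, document growth, and how many reads a query needs, which is exactly the trade-off the webinar's data-access-pattern approach weighs.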
Large Scale Graph Analytics with RDF and LPG Parallel Processing - Cambridge Semantics
Analytics that traverse large portions of large graphs have been problematic for both RDF and LPG graph engines. In this webinar Barry Zane, former co-founder of Netezza, Paraccel and SPARQL City and current VP of Engineering at Cambridge Semantics, discusses the native parallel-computing approach taken in AnzoGraph to yield interactive, scalable performance for RDF and LPG graphs.
Webinar: How to Drive Business Value in Financial Services with MongoDB - MongoDB
Huge upheaval in the finance industry has led to a major strain on existing IT infrastructure and systems. New finance industry regulation has meant increased volume, velocity and variability of data. This coupled with cost pressures from the business has led these institutions to seek alternatives. Top tier institutions like MetLife have turned to MongoDB because of the enormous business value it enables.
In this session, hear how MongoDB enabled these successful real world examples:
Single View of a Customer - 3 months and $2M for a single view of a customer across 50 source systems
Reference Data Management - $40M in cost savings from migrating to MongoDB for reference data management
Private cloud - MongoDB as a PaaS across a tier 1 bank for enabling agility for operations, not just the developer
The use cases are specific to financial services but the patterns of usage - agility, scale, global distribution - will be applicable across many industries.
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Connected World - Cambridge Semantics
Thomas Cook, director of sales, Cambridge Semantics, offers a primer on graph database technology and the rapid growth of knowledge graphs at Data Summit 2020 in his presentation titled "AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Connected World".
The Data Lake and Getting Businesses the Big Data Insights They Need - Dunn Solutions Group
Do terms like "Data Lake" confuse you? You’re not alone. With all of the technology buzzwords flying around today, it can be hard to keep up with and clearly understand each of them. However, a data lake is definitely something worth taking the time to understand. Leveraging data lake technology, companies are finally able to keep all of their disparate information and streams of data in one secure location, ready for consumption at any time – this includes structured, unstructured, and semi-structured data. For more information on our Big Data Consulting Services, don’t hesitate to visit us online at: http://bit.ly/2fvV5rR
Incorporating the Data Lake into Your Analytic Architecture - Caserta
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit out website at http://casertaconcepts.com/.
Data Lakes are early in the Gartner hype cycle, but companies are getting value from their cloud-based data lake deployments. Break through the confusion between data lakes and data warehouses and seek out the most appropriate use cases for your big data lakes.
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
Topics covered include:
- Scaling Vertically
- Hardware Considerations
- Index Optimization
- Schema Design
- Sharding
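The sharding topic in the list above can be illustrated with a toy hash-based routing function. This is an assumption-laden sketch of the general idea, not MongoDB's actual hashed-sharding implementation, which uses its own hash function and chunk ranges rather than a simple modulo.

```python
import hashlib

def shard_for(key, n_shards):
    """Route a document to a shard by hashing its shard key.
    Illustrative only: real MongoDB hashed sharding assigns hash
    ranges to chunks and balances chunks across shards."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n_shards

# Documents with the same shard key always land on the same shard,
# while distinct keys spread across the cluster.
docs = [{"user_id": u} for u in ("alice", "bob", "carol", "dan")]
placement = {d["user_id"]: shard_for(d["user_id"], 3) for d in docs}
```

The sketch shows why shard-key choice matters: a high-cardinality key spreads writes evenly, while a low-cardinality key would funnel everything to a few shards, which is part of what the webinar weighs against vertical scaling and schema-level optimizations.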
Using Compass to Diagnose Performance Problems - MongoDB
Speaker: Brian Blevins, Technical Services Engineer, MongoDB
Level: 200 (Intermediate)
Track: Performance
Since the performance of your application drives engagement and revenue, it can make or break the success of your organization. You can use the Compass graphical client from MongoDB to visualize your database schema, collect information on optimization opportunities and make database changes to improve performance. In this talk, we will briefly introduce Compass and then delve into the features supporting database performance optimization. The talk will combine instruction on the use of Compass with recommendations for performance best practices. We will also review the detection and resolution of slow queries and excessive network utilization. After attending the talk, audience members will have a better understanding of the capabilities of Compass, including how those capabilities can be used to find and correct performance bottlenecks in MongoDB databases. This session is designed for those with limited MongoDB experience. Attendees should have a basic understanding of MongoDB’s schema design, the server/database/collection layout, and how their application accesses and uses the MongoDB database.
What You Will Learn:
- Identify excessive network utilization, adjust queries appropriately and use Compass to confirm results.
- Understand how the Compass graphical client can help you improve performance in your MongoDB deployment.
- Use Compass real time statistics to identify slow queries and recognize when a query is a good candidate for adding an index.
Engineering patterns for implementing data science models on big data platformsHisham Arafat
Discussion of practically implementing data science models on big data platforms from engineering perspective. An eye opener on the engineering factors associated with designing and working solution. We use a simple text mining example on social media analytics for brand marketing. At the first while, it seems simple solution however if you go deeply and think on implementation aspects of even a simple analytics model, you can discover the degree of complexity at each part of the solution. An Abstraction of the Big Data key advantages would be very helpful to select appropriate Big Data technology components out of very large landscape. Two examples with reference are given for using Lambda Architecture and unusual way of image processing using Big Data abstraction provided.
Implementing BCS-Business Connectivity Services - Sharepoint 2013- Office 365Shahzad S
BCS enables accessing external data from SharePoint and Office applications. It involves three phases - groundwork, SharePoint, and Office. Architectures include server-side only in SharePoint, client-side in Office, on-premises, cloud-only, and hybrid. Solutions can be built using Visual Studio or SharePoint Designer connecting to databases, web services, .NET assemblies, and OData sources. Security, performance, and limitations require consideration.
Amundsen is a metadata-driven application developed by Lyft to solve data discovery challenges. It provides a search-based UI and uses a distributed architecture with various microservices to index and serve metadata from multiple sources. Key components include a metadata service using Neo4j, a search service using Elasticsearch, and a frontend. The tool has been hugely successful at Lyft and is now open source. Future work includes expanding metadata coverage and integrating with other tools.
Solution architecture for big data projects
solution architecture,big data,hadoop,hive,hbase,impala,spark,apache,cassandra,SAP HANA,Cognos big insights
Presented at SQL Saturday Atlanta May 18, 2013
Text mining is projected to dominate data mining, and the reasons are evident: we have more text available than numeric data. Microsoft introduced a new technology to SQL Server 2012 called Semantic Search. This session's detailed description and demos give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demos include a web-based Silverlight application, and content documents from Wikipedia. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft data mining.
Big data architectures and the data lakeJames Serra
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (
Relecura is a software platform that provides features for patent and portfolio analysis including searching, bucketing, trend analysis, taxonomy building, and collaboration tools. Key features allow users to initiate searches, import portfolios, bucket patents through auto, training or query methods, analyze growth trends, tag and share documents, map priority and citations, and generate automated reports. The platform also offers mobile access, custom dashboards, and support services.
The document discusses Lyft's data discovery tool called Amundsen. It provides an overview of Amundsen's architecture including its use of a graph database and Elasticsearch for metadata storage and search. It describes the challenges of data discovery that Amundsen addresses like time spent searching for data. The document outlines Amundsen's key components like its databuilder, metadata and search services. It discusses Amundsen's impact and popularity at Lyft and its open source community. Future roadmap plans include additional metadata types and deeper integrations with other tools.
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...South London Geek Nights
The document provides an overview of NoSQL databases, including what NoSQL means, the rise of NoSQL as an alternative to relational databases, different classifications of NoSQL databases, pros and cons, use cases, and real-world examples. It discusses how NoSQL databases provide more flexible schemas and scalability than relational databases for applications like logging, shopping carts, and user preferences, while relational databases remain better for transactions and business critical data. The presenter then demonstrates CouchDB as one example of a NoSQL database.
This document discusses how insurance companies use MongoDB. It provides examples of how MongoDB allows insurance companies to create a single customer view, consolidate data from multiple disparate systems, and distribute claims information globally in real-time. MongoDB provides a flexible schema, automatic replication of data, and the ability to query data locally for improved customer experience, risk analysis, fraud detection, and claims processing. The document highlights several insurance companies that have adopted MongoDB to unify customer data, modernize legacy systems, and power new data-driven applications and services.
Text mining is projected to dominate data mining, and the reasons are evident: we have more text available than numeric data. Microsoft introduced a new technology to SQL Server 2012 called Semantic Search. This session's detailed description and demos give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demos include a web-based Silverlight application, and content documents from Wikipedia. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft data mining.
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachDATAVERSITY
After three decades of relational data modeling, everyone’s pretty comfortable with schemas, tables, and entity-relationships. As more and more Global 2000 companies choose NoSQL databases to power their Digital Economy applications, they need to think about how to best model their data. How do they move from a constrained, table-driven model to an agile, flexible data model based on JSON documents?
This webinar is intended for architects and application developers who want to learn about new JSON document data modeling approaches, techniques, and best practices. This webinar will show you how to get started building a JSON document data model, how to migrate a table-based data model to JSON documents, and how to optimize your design to enable fast query performance.
This webinar will provide practical, experience-based advice and best practices for modeling JSON documents, including:
- When to embed or not embed objects in your JSON document
- Data modeling using a practical data access pattern approach
- Indexing your JSON documents
- Querying your data using N1QL (SQL for JSON)
Large Scale Graph Analytics with RDF and LPG Parallel ProcessingCambridge Semantics
Analytics that traverse large portions of large graphs have been problematic for both RDF and LPG graph engines. In this webinar Barry Zane, former co-founder of Netezza, Paraccel and SPARQL City and current VP of Engineering at Cambridge Semantics, discusses the native parallel-computing approach taken in AnzoGraph to yield interactive, scalable performance for RDF and LPG graphs.
Webinar: How to Drive Business Value in Financial Services with MongoDBMongoDB
Huge upheaval in the finance industry has led to a major strain on existing IT infrastructure and systems. New finance industry regulation has meant increased volume, velocity and variability of data. This, coupled with cost pressures from the business, has led these institutions to seek alternatives. Top-tier institutions like MetLife have turned to MongoDB because of the enormous business value it enables.
In this session, hear how MongoDB enabled these successful real world examples:
Single View of a Customer - 3 months and $2M for a single view of a customer across 50 source systems
Reference Data Management - $40M in cost savings from migrating to MongoDB for reference data management
Private cloud - MongoDB as a PaaS across a tier 1 bank for enabling agility for operations, not just the developer
The use cases are specific to financial services but the patterns of usage - agility, scale, global distribution - will be applicable across many industries.
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...Cambridge Semantics
Thomas Cook, director of sales, Cambridge Semantics, offers a primer on graph database technology and the rapid growth of knowledge graphs at Data Summit 2020 in his presentation titled "AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Connected World".
The Data Lake and Getting Businesses the Big Data Insights They NeedDunn Solutions Group
Do terms like "Data Lake" confuse you? You're not alone. With all of the technology buzzwords flying around today, it can be hard to keep up with and clearly understand each of them. However, a data lake is definitely worth taking the time to understand. Leveraging data lake technology, companies are finally able to keep all of their disparate information and streams of data in one secure location, ready for consumption at any time; this includes structured, unstructured, and semi-structured data. For more information on our Big Data Consulting Services, don't hesitate to visit us online at: http://bit.ly/2fvV5rR
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
Data Lakes are early in the Gartner hype cycle, but companies are getting value from their cloud-based data lake deployments. Break through the confusion between data lakes and data warehouses and seek out the most appropriate use cases for your big data lakes.
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
Topics covered include:
- Scaling Vertically
- Hardware Considerations
- Index Optimization
- Schema Design
- Sharding
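The sharding topic above rests on shard-key choice. A minimal sketch (pure Python, not MongoDB itself; shard count and key names are made up) of why a hashed shard key spreads a monotonically increasing key evenly across shards, avoiding the hot-shard problem of range-sharding on something like `_id`:

```python
import hashlib

# Assumption for illustration: a 4-shard cluster and user-id keys.
NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Mimic a hashed shard key: hash the value, then bucket it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Insert 10,000 monotonically increasing keys and count per shard.
counts = [0] * NUM_SHARDS
for i in range(10_000):
    counts[shard_for(f"user-{i}")] += 1

print(counts)  # roughly 2,500 per shard
```

With a range-based key, the same monotonically increasing inserts would all land on the last shard; hashing trades range-query locality for even write distribution.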
Using Compass to Diagnose Performance Problems MongoDB
Speaker: Brian Blevins, Technical Services Engineer, MongoDB
Level: 200 (Intermediate)
Track: Performance
Since the performance of your application drives engagement and revenue, it can make or break the success of your organization. You can use the Compass graphical client from MongoDB to visualize your database schema, collect information on optimization opportunities and make database changes to improve performance. In this talk, we will briefly introduce Compass and then delve into the features supporting database performance optimization. The talk will combine instruction on the use of Compass with recommendations for performance best practices. We will also review the detection and resolution of slow queries and excessive network utilization. After attending the talk, audience members will have a better understanding of the capabilities of Compass, including how those capabilities can be used to find and correct performance bottlenecks in MongoDB databases. This session is designed for those with limited MongoDB experience. Attendees should have a basic understanding of MongoDB’s schema design, the server/database/collection layout, and how their application accesses and uses the MongoDB database.
What You Will Learn:
- Identify excessive network utilization, adjust queries appropriately and use Compass to confirm results.
- Understand how the Compass graphical client can help you improve performance in your MongoDB deployment.
- Use Compass real time statistics to identify slow queries and recognize when a query is a good candidate for adding an index.
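The slow-query triage that Compass assists with can be sketched by hand against profiler-style entries (the entries below are simulated; field names mimic MongoDB's `system.profile` output, but thresholds and namespaces are illustrative):

```python
# Queries slower than this threshold are worth investigating.
SLOW_MS = 100

# Simulated profiler entries: a fast indexed query, a slow full
# collection scan, and a fast collection scan on a tiny collection.
profile = [
    {"op": "query", "ns": "app.users",  "millis": 8,   "planSummary": "IXSCAN {email: 1}"},
    {"op": "query", "ns": "app.orders", "millis": 450, "planSummary": "COLLSCAN"},
    {"op": "query", "ns": "app.events", "millis": 15,  "planSummary": "COLLSCAN"},
]

# A slow query doing a full collection scan is the classic candidate
# for a new index.
index_candidates = [
    e for e in profile
    if e["millis"] >= SLOW_MS and e["planSummary"] == "COLLSCAN"
]
print([e["ns"] for e in index_candidates])
```

Compass surfaces the same signal graphically through its real-time stats and explain-plan views, without writing this kind of script.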
Using Compass to Diagnose Performance Problems in Your ClusterMongoDB
Speaker: Brian Blevins, Technical Services Engineer, MongoDB
Date/Time: June 20, 1:50 PM
Track: Performance
The document discusses leveraging SharePoint 2013's web content management (WCM) features to build a dynamic, search-driven intranet. It describes how to create a taxonomy and term store to enable search-based navigation and filtering. Display templates and web parts like the content search web part (CSWP) can be used to build targeted, contextual experiences. The approach allows building dynamic team sites, applications for customer service, research portals, and more based on indexed content and user behavior analytics.
THAT Conference 2021 - State-of-the-art Search with Azure Cognitive SearchBrian McKeiver
In person at THAT Conference 2021 - How to add AI/machine learning to your website search through Azure Cognitive Search and its brand-new semantic search. Join the session to learn why semantic, AI-powered search improves the quality of search results.
Search Engine Optimisation and Growth Hacking| John Caldwell | CreatorSEOEnterprise Ireland
This document discusses search engine optimization (SEO) and growth hacking techniques for growing an international business online. It covers the search landscape, how search engines work, keywords, Google algorithms, international SEO best practices, and using data to guide growth strategies. The key aspects of SEO discussed are developing a digital marketing strategy, focusing on relevant keywords and content, optimizing websites for findability and conversions, and analyzing data to improve performance over time through testing and experimentation.
This document discusses a proposed search engine optimization (SEO) system. It includes an abstract describing SEO and its goals. The scope section discusses how SEO is commonly used to improve search engine rankings. The proposed system would allow users to search for content by keyword and refine results. It would display search results across different formats. The system requirements, design, testing approach, and screenshots are also outlined. In conclusion, the document states that SEO is an ongoing process that requires constant adaptation to changes in technology and search engine algorithms.
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamPrecisely
Valuable mainframe data is often the missing piece in a holistic infrastructure view within Splunk. But if you're not a mainframe expert, knowing which data sources, fields and calculations are needed to get results within Splunk can be a challenge. Even those with mainframe knowledge can sometimes struggle.
With Syncsort Ironstream® you can easily capture the elements you need in real-time – and Ironstream's new Mainframe Data Model makes it easier than ever to work with complex mainframe metrics in Splunk.
View this webinar on-demand to learn more about this new feature, as well as how to:
• See categorized mainframe metrics in easily understood terms
• Get results faster – no need to research data sources, fields and calculations
• Broaden access to more team members – without the need for deep mainframe knowledge
• Use built-in Splunk tooling to get up and running quickly
• Realize valuable ROI sooner and eliminate the mainframe blind spot
Presented at JavaOne 2013, Tuesday September 24.
"Data Modeling Patterns" co-created with Ian Robinson.
"Pitfalls and Anti-Patterns" created by Ian Robinson.
The document discusses global SEO performance tracking. It recommends tracking key performance indicators (KPIs) like keywords, landing pages, competitive domain and page authorities, internal and external links, and crawl stats. The presentation provides tips on keyword analysis including segmentation of head, body, and tail keywords. It also suggests tools for competitive analysis, site analysis, and tracking changes in keywords, landing pages, authorities, links, and crawl stats. The overall message is that tracking the right metrics across all geographies is essential to measure performance and make more money.
Microsoft Search Strategy Today - Exploring Office 365 Search in Real LifeJoel Oleson
You don't have to wait to get started with Microsoft Search. In this deck we discuss what is new and coming and how to apply strategies for what has been released.
This is from the Business Accelerator Marketing Academy. It takes an in-depth look at on-page and off-page ranking factors for search engine optimization and introduces a few of the tools that are available. We also introduce Google Analytics: how to navigate the platform and read the data.
MongoDB World 2018: How an Idea Becomes a MongoDB FeatureMongoDB
The document describes the software development lifecycle used by the MongoDB Database Engineering Team. It involves carefully scoping projects, designing features, implementing code, testing, and getting acceptance from product management. Key aspects include establishing consensus during scoping, addressing downstream impacts, writing comprehensive tests, and continuously improving processes over time.
TLC2018 Thomas Haver: Transform with Enterprise AutomationAnna Royzman
Thomas Haver explains how to build a robust automation solution across the enterprise to improve application quality and testing efficiency and lower operational costs. At Test Leadership Congress 2018, he shows how to leverage all current resources to achieve this goal without affecting project delivery time.
http://testleadershipcongress-ny.com
The document discusses various techniques for optimizing and scaling MongoDB deployments. It covers topics like schema design, indexing, monitoring workload, vertical scaling using resources like RAM and SSDs, and horizontal scaling using sharding. The key recommendations are to optimize the schema and indexes first before scaling, understand the workload, and ensure proper indexing when using sharding for horizontal scaling.
Oracle apps crm online training , oracle crm certification coursesmagnificsmile
This document provides information about Oracle Apps CRM online training courses offered by Magnific Training. It includes their contact information and locations in India and other countries. It also summarizes their approach to implementing a global CRM single instance, including consolidating data, using best-in-practice BI technology, and establishing governance processes to drive value. Visualizations and dashboards are highlighted to provide role-based intelligence for functions like sales, marketing, IT, and finance.
The document provides an overview of data science including definitions, careers, applications and tools. It defines data science, describes the typical steps in a data science project including understanding the problem, acquiring and preparing data, analyzing data, modeling data, visualizing results and deploying solutions. It also discusses careers in data engineering, data analysis, machine learning engineering and as a data scientist. Finally, it covers popular tools and frameworks used in data science like Anaconda, Jupyter Notebooks and examples of data science applications.
The document discusses the business case for website speed optimization. It notes that both users and search engines prefer faster sites, and cites studies showing that even small improvements in speed (e.g. 1 second faster load time) can increase key metrics like conversions by 2-14%. The document provides examples of companies like Walmart and Obama's 2012 campaign that saw increased revenue and donations from speed optimizations. It acknowledges IT concerns about speed work but argues a methodology is needed to prioritize and implement optimizations.
The document discusses technical SEO best practices for improving a website's performance and visibility in search engines. It provides tips for conducting a technical audit to identify and resolve issues, optimizing site speed, ensuring search engines have full access to content, and building good SEO practices into development processes. The document also outlines common technical SEO risks and solutions for working with large volumes of content.
Similar to Building Search Systems for the Enterprise (20)
Meaning Representations for-Natural Languages Design, Models, and Application...Yunyao Li
COLING-LREC'2024 Tutorial "Meaning Representations for Natural Languages: Design, Models and Applications"
Instructors: Julia Bonn, Jeffrey Flanigan, Jan Hajič, Ishan Jindal, Yunyao Li and Nianwen Xue
Abstract: This tutorial introduces a research area that has the potential to create linguistic resources and build computational models that provide critical components for interpretable and controllable NLP systems. While large language models have shown a remarkable ability to generate fluent text, the black-box nature of these models makes it difficult to know where to tweak them to fix errors, at least for now. For instance, LLMs are known to hallucinate, and there is no mechanism in these models to provide only factually correct answers. Addressing this issue requires, first, that the models have access to a body of verifiable facts and, second, that they use those facts effectively. Interpretability and controllability in NLP systems are critical in high-stakes applications such as the medical domain. There has been a steady accumulation of semantically annotated, increasingly richer resources, which can now be derived with high accuracy from raw texts. Hybrid models can be used to extract verifiable facts at scale to build controllable and interpretable systems, to ground human-robot interaction (HRI) systems, to support logical reasoning, or to operate in extremely low-resource settings. This tutorial will provide an overview of these semantic representations, the computational models trained on them, and the practical applications built with these representations, as well as future directions.
The Role of Patterns in the Era of Large Language ModelsYunyao Li
Slides for my keynote at PAN-DL Workshop (Pattern-based Approaches to NLP in the Age of Deep Learning) at EMNLP'2023 (December. 6, 2023).
In this talk, I share our initial learnings from constructing, growing and serving large knowledge graphs.
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopYunyao Li
Keynote talk at HILDA'2023 at SIGMOD on June 18, 2023.
Abstract: The ability to build large-scale knowledge bases that capture and extend the implicit knowledge of human experts is the foundation for many AI systems. We use an ontology-driven approach for the building, growing and serving of such knowledge bases. This approach relies on several well-known building blocks: document conversion, natural language processing, entity resolution, data transformation and fusion. In this talk, I will discuss a wide range of real-world challenges related to building these blocks and present our work to address these challenges via better human-machine cooperation.
Meaning Representations for Natural Languages: Design, Models and ApplicationsYunyao Li
EMNLP'2022 Tutorial "Meaning Representations for Natural Languages: Design, Models and Applications"
Instructors: Jeffrey Flanigan, Ishan Jindal, Yunyao Li, Tim O’Gorman, Martha Palmer
Abstract:
We propose a cutting-edge tutorial that reviews the design of common meaning representations, SoTA models for predicting meaning representations, and the applications of meaning representations in a wide range of downstream NLP tasks and real-world applications. Reporting by a diverse team of NLP researchers from academia and industry with extensive experience in designing, building and using meaning representations, our tutorial has three components: (1) an introduction to common meaning representations, including basic concepts and design challenges; (2) a review of SoTA methods on building models for meaning representations; and (3) an overview of applications of meaning representations in downstream NLP tasks and real-world applications. We will also present qualitative comparisons of common meaning representations and a quantitative study on how their differences impact model performance. Finally, we will share best practices in choosing the right meaning representation for downstream tasks.
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products/services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding. We will share our work from the past few years in addressing these challenges. We will also showcase how universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g. compliance) and show ongoing efforts toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
Invited talk at Document Intelligence workshop at KDD'2021.
Harvesting information from complex documents, such as financial reports and scientific publications, is critical to building AI applications for business and research. Such documents are often in PDF format, with critical facts and data conveyed in tables and graphs. Extracting this information is essential to deriving insights from these documents. At IBM Research, we have a rich agenda in this area that we call Deep Document Understanding. In this talk, I will focus on our research on Deep Table Understanding: extracting and understanding tables from PDF documents. I will introduce key challenges in table extraction and understanding and how we address them, from acquiring data at scale to enable deep neural network models, to building, customizing and evaluating such models. I will also describe how our work enables real-world use cases in domains such as finance and life science. Finally, I will briefly present TableQA, an important downstream task enabled by Deep Table Understanding.
Explainability for Natural Language ProcessingYunyao Li
Final deck for our popular tutorial on "Explainability for Natural Language Processing" at KDD'2021. See links below for downloadable version (with higher resolution) and recording of the live tutorial.
Title: Explainability for Natural Language Processing
Presenter: Marina Danilevsky, Shipi Dhanorkar, Yunyao Li and Lucian Popa and Kun Qian and Anbang Xu
Website: http://xainlp.github.io/
Recording: https://www.youtube.com/watch?v=PvKOSYGclPk&t=2s
Downloadable version with higher resolution: https://drive.google.com/file/d/1_gt_cS9nP9rcZOn4dcmxc2CErxrHW9CU/view?usp=sharing
@article{kdd2021xaitutorial,
title={Explainability for Natural Language Processing},
author= {Danilevsky, Marina and Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Xu, Anbang},
journal={KDD},
year={2021}
}
Abstract:
This lecture-style tutorial, which mixes in an interactive literature-browsing component, is intended for the many researchers and practitioners working with text data and on applications of natural language processing (NLP) in data science and knowledge discovery. The focus of the tutorial is on the issues of transparency and interpretability as they relate to building models for text and their applications to knowledge discovery. As black-box models have gained popularity for a broad range of tasks in recent years, both the research and industry communities have begun developing new techniques to render them more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP/knowledge management researchers, our tutorial has two components: an introduction to explainable AI (XAI) in the NLP domain and a review of the state-of-the-art research; and findings from a qualitative interview study of individuals working on real-world NLP projects as they are applied to various knowledge extraction and discovery tasks at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP and HCI conferences. The second component reports on our qualitative interview study, which identifies practical challenges and concerns that arise in real-world development projects that require the modeling and understanding of text data.
Explainability for Natural Language ProcessingYunyao Li
This document provides an outline for a tutorial on explainability for natural language processing. It begins with an introduction to the topic and then outlines the current state of explainable AI research for NLP. Specifically, it discusses the literature review methodology, different types of explanations, techniques for generating and presenting explanations, common operations to enable explainability like visualization techniques and communication paradigms, and the evaluation of explanations. The tutorial aims to provide an overview of the key concepts and challenges in explainable AI as it relates to natural language processing applications and models.
Human in the Loop AI for Building Knowledge Bases Yunyao Li
The ability to build large-scale domain-specific knowledge bases that capture and extend the implicit knowledge of human experts is the foundation for many AI systems. We use an ontology-driven approach for the creation, representation and consumption of such domain-specific knowledge bases. This approach relies on several well-known building blocks: natural language processing, entity resolution, data transformation and fusion. I will present several human-in-the-loop tools that target domain experts (rather than programmers) to extract the domain knowledge from the human expert and map it into the "right" models or algorithms. I will also share successful use cases in several domains, including Compliance, Finance, and Healthcare: by using these tools we can match the level of accuracy achieved by manual efforts, but at a significantly lower cost and much higher scale and automation.
Slides for talk given at Women in Engineering on March 20, 2021.
Abstract:
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products/services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding. We will share our work from the past few years in addressing these challenges. We will also showcase how universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g. compliance) and show ongoing efforts toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
Explainability for Natural Language ProcessingYunyao Li
Tutorial at AACL'2020 (http://www.aacl2020.org/program/tutorials/#t4-explainability-for-natural-language-processing).
More recent version: https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249912819
Title: Explainability for Natural Language Processing
@article{aacl2020xaitutorial,
title={Explainability for Natural Language Processing},
author= {Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Wolf, Christine T and Xu, Anbang},
journal={AACL-IJCNLP 2020},
year={2020}
}
Presenter: Shipi Dhanorkar, Christine Wolf, Kun Qian, Anbang Xu, Lucian Popa and Yunyao Li
Video: https://www.youtube.com/watch?v=3tnrGe_JA0s&feature=youtu.be
Abstract:
We propose a cutting-edge tutorial that investigates the issues of transparency and interpretability as they relate to NLP. Both the research community and industry have been developing new techniques to render black-box NLP models more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP researchers, our tutorial has two components: an introduction to explainable AI (XAI) and a review of the state-of-the-art for explainability research in NLP; and findings from a qualitative interview study of individuals working on real-world NLP projects at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP, and HCI conferences. The second component reports on our qualitative interview study which identifies practical challenges and concerns that arise in real-world development projects which include NLP.
Towards Universal Language Understanding (2020 version)Yunyao Li
The document discusses challenges and approaches for developing universal semantic understanding across languages. It describes generating semantic role labeling resources for many languages using parallel corpora and crowdsourcing techniques. The goal is to develop cross-lingual models and representations that can understand the semantics of text in different languages.
Towards Universal Semantic Understanding of Natural LanguagesYunyao Li
Keynote talk at TextXD 2019(https://www.textxd.org)
Abstract:
Understanding the semantics of natural language is a fundamental task in artificial intelligence. English semantic understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Compare and Comply. However, scaling existing products/services to support additional languages remains an open challenge. In this demo, we will present Polyglot, a multilingual semantic parser capable of semantically parsing sentences in 9 different languages from 4 different language groups into the same unified semantic representation. We will also showcase how such universal semantic understanding of natural languages can enable cross-lingual information extraction in concrete domains (e.g. insurance and compliance) and show promise toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
An In-depth Analysis of the Effect of Text Normalization in Social MediaYunyao Li
This document proposes a taxonomy for characterizing different types of text normalization edits based on their level of granularity and impact on downstream natural language processing applications like syntactic parsing, named-entity recognition, and text-to-speech synthesis. It examines the effects of coarse-grained edit types like addition, replacement, and removal as well as more fine-grained types like corrections to verb forms, determiners, capitalization, contractions, slang, and Twitter-specific terms. The analysis found that parsing is impacted by most normalization operations, entity recognition depends strongly on replacements, and speech synthesis benefits broadly from normalization but requires handling domain-specific terms.
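A few of the fine-grained edit types from the taxonomy (slang replacement, contraction expansion, capitalization repair) can be illustrated with a toy normalizer; the lexicons below are tiny made-up examples, not the paper's resources:

```python
# Toy lexicons standing in for the real normalization resources.
SLANG = {"u": "you", "gr8": "great", "pls": "please"}
CONTRACTIONS = {"can't": "cannot", "won't": "will not"}

def normalize(text: str) -> str:
    """Apply replacement-type edits token by token, then repair casing."""
    out = []
    for tok in text.split():
        low = tok.lower()
        low = CONTRACTIONS.get(low, low)  # contraction expansion
        low = SLANG.get(low, low)         # slang replacement
        out.append(low)
    sent = " ".join(out)
    return sent[:1].upper() + sent[1:]    # sentence-initial capitalization

print(normalize("u can't be gr8 every day"))
```

The paper's point is that downstream tasks react differently to each edit type: a parser cares about most of them, while named-entity recognition depends mainly on the replacement edits.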
Exploiting Structure in Representation of Named Entities using Active LearningYunyao Li
This document presents a framework called LUSTRE that uses active learning to efficiently learn structured representations of named entities with minimal human effort. LUSTRE iteratively selects the most informative entity mentions for user labeling, infers mapping rules from these labels, and updates its model. It outperforms baselines on various entity types, generalizes well to new domains, and reduces manual effort by learning from few labeling iterations. The learned structured representations improve downstream tasks like relation extraction by enabling matching of entity variations.
K-SRL: Instance-based Learning for Semantic Role LabelingYunyao Li
This document summarizes a research paper on instance-based learning for semantic role labeling (SRL). It presents a simple but effective k-nearest neighbors approach using composite features that outperforms previous SRL systems on both in-domain and out-of-domain evaluation. The approach models global argument constraints and addresses SRL challenges like heavy-tailed label distributions and low-frequency exceptions through explicit representation of local biases in composite features.
The document discusses generating high quality proposition banks for multilingual semantic role labeling. It presents frames for common verbs like "buy", "like", and "give" in English. It then shows how frames were generated for these same verbs in other languages like Chinese, French, and German by projecting annotations across languages. The document concludes by introducing a two-step approach used to curate the projected frame mappings, which involves filtering incorrect mappings and grouping usage synonyms. This led to the creation of a publicly released Universal Proposition Bank version 1.0 with curated frame mappings for several languages.
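The annotation projection step can be sketched in a few lines: copy SRL labels from an English sentence onto a word-aligned translation. The sentence, labels, and alignment below are made up for illustration:

```python
# English sentence with gold SRL labels (A0 = agent, A1 = patient).
en_tokens = ["John", "bought", "a", "car"]
en_labels = ["A0", "buy.01", "O", "A1"]

# Word-aligned German translation; alignment maps EN index -> DE index.
de_tokens = ["John", "kaufte", "ein", "Auto"]
alignment = {0: 0, 1: 1, 2: 2, 3: 3}

# Project each English label through the alignment onto German.
de_labels = ["O"] * len(de_tokens)
for en_i, de_i in alignment.items():
    de_labels[de_i] = en_labels[en_i]

print(list(zip(de_tokens, de_labels)))
```

The curation step described in the document then filters projections where the alignment or the frame mapping is wrong.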
The document discusses a multilingual strategy for information extraction using a system called PolyglotIE. The strategy involves defining language-independent rules in Annotation Query Language (AQL) that can be executed over multilingual text to extract semantic role labels. This allows information extraction to work immediately across languages with minimal effort. The rules are defined over universal proposition banks containing frames and roles that are shared across languages.
Natural Language Data Management and Interfaces: Recent Development and Open ...Yunyao Li
Slides deck for SIGMOD 2017 Tutorial.
ABSTRACT:
The volume of natural language text data has been rapidly increasing over the past two decades, due to factors such as the growth of the Web, the low cost associated with publishing, and progress on the digitization of printed texts. This growth, combined with the proliferation of natural language systems for searching and retrieving information, provides tremendous opportunities for studying areas where database systems and natural language processing systems overlap. This tutorial explores the two areas of overlap most relevant to the database community: (1) managing
natural language text data in a relational database, and (2) developing natural language interfaces to databases. The tutorial presents state-of-the-art methods, related systems, research opportunities and challenges covering both areas.
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsYunyao Li
Poster for our ACL paper "Polyglot: Multilingual Semantic Role Labeling with Unified Labels".
Abstract:
We present POLYGLOT, a semantic role labeling system capable of semantically parsing sentences in 9 different languages from 4 different language groups. A core differentiator is that this system predicts English Proposition Bank labels for all supported languages. This means that
for instance a Japanese sentence will be tagged with the same labels as an English sentence with similar semantics would be. This is made possible by training the system with target language data that was automatically labeled with English PropBank labels using an annotation projection approach. We give an overview of our system, the automatically produced training data, and discuss possible applications
and limitations of this work. We present a demonstrator that accepts sentences in English, German, French, Spanish, Japanese, Chinese, Arabic, Russian and Hindi and
outputs a visualization of its shallow semantics.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for
seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new type of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We will explain how to solve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary expenses, e.g. using a person document instead of a mail-in for shared mailboxes. We will show you such cases and their solutions. And of course we will explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder will introduce you to this new world. It will give you the tools and know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, which go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Principle of conventional tomography-Bibash Shahi ppt..pptx
Building Search Systems for the Enterprise
1. Building Search Systems
for the Enterprise
IBM Research – Almaden
ACM SIGIR 2011
Beijing, China
(on behalf of Shivakumar Vaithyanathan)
Yunyao Li
2. outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
3. Experience at IBM Internal Search
• IBM deployed a commercially available search engine
– Implementing standard IR techniques
• Search quality went down over time to the point that
search results were unacceptable!
Success (≥ 1 relevant result): 14% on top-1, 23% on
top-5, 34% on top-50! [Zhu et al., WWW’07]
So, they implemented various solutions…
For the administrators managing the engine,
the exposed knobs were insufficient
4. Attempts to Improve Search
• Enhanced link analysis by
incorporating external links
to/from external WWW
• Creative hacks: added fake
terms to documents & queries
– # terms per document determined by
“popularity”: how much TF increase
required for the needed rank boost?
• Hard-coded custom results for
the top 1200+ queries
Didn’t help…
Quality went down!
Maintenance nightmare:
heuristics need to be updated
upon each nontrivial change in
term statistics/ranking parameters
Even bigger nightmare!
How to deal with continuously
changing terminology?
5. What are the Problems?
Network Station Manager search
Product names change: Thin Client Manager
Continually changing terminology!
Domain-specific meaning!
Paula Summa search
bring Paula Summa from
employee directories
per diem search
Domain-specific repetitions!
popcorn search
conference call!
These problems are not specific
to enterprise search… but:
• Result 1: IBM Travel: Per Diem
• Result 2: IBM Travel: Per Diem Rates
• Result 3: IBM Travel: National perdiems
• Result 25: IBM Travel: Per Diem Policy
…
6. The Enterprise Challenge!
Domain-specific meaning! Domain-specific repetitions!
Generic search solution that is
customizable and maintainable in every
domain
Simple customization with reasonable effort!
Programmable Search
Ongoing search-quality management
Continually changing terminology!
7. outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
8. Programmable Search: Main Idea
• Goals:
– Transparency
• Know “precisely” why every result item is being brought back
• Understand how changes in content/intents affect search
– Maintainability and “Debugability”
• Ranking logic is guided by explicit rules
• Properly react to changes in content/intents
• Building blocks:
– Deep analytics on documents
– Domain-specific analysis of queries
– Transparent customizable rule-driven ranking
runtime rules
backend analytics
interpretations
9. Distributed Analytics Platform
Crawling, information extraction, token generation (TG), indexing
Search runtime
Index
Index and rule
update services
backend analytics
runtime rules
interpretations
Implementation Architecture
backend
frontend
10. outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
11. Backend Analytics: 3 Parts
Local Analysis (per-page analysis)
Global Analysis (cross-page analysis)
Token Generation (TG)
index
12. Local Analysis
• Categorizing pages
– Label pages by custom categories
• IBM examples: HR, person, IT help, ISSI, sales information,
marketing, corporate standards, legal & IP-law, …
– Geo classification
• Associate documents with the relevant countries & regions
• Annotating pages
– Identify HomePage annotation for people, projects,
communities, …
Simply knowing where a page is physically hosted is not enough
(example: Czech Republic hosts all pages for IBM in Europe)
13. G J Chaitin Home Page
Homepage Identification
Title Extraction
Matching title patterns
Titles
Dictionary Match
Home Page for
G J Chaitin
• http://w3.ibm.com/hr/idp/
• http://w3-03.ibm.com/isc/index.html
• http://chis.at.ibm.com/
URL Extraction
URLs
Matching URL patterns
Homepage for: idp isc chis
Employee directory
… many more …
Intranet page
More details in
[Zhu et al., WWW’07]
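The title- and URL-pattern matching above can be sketched as follows; the regexes, the `homepage_annotations` helper, and its output format are illustrative assumptions, not the system's actual patterns.

```python
# Hypothetical sketch of homepage identification: a page is tagged as a
# homepage when its title matches a "Home Page for X" pattern or its URL
# matches a known URL pattern. All regexes and names here are illustrative.
import re

TITLE_PATTERNS = [
    re.compile(r"^(?P<name>.+?)\s+home\s*page$", re.IGNORECASE),
    re.compile(r"^home\s*page\s+(?:for|of)\s+(?P<name>.+)$", re.IGNORECASE),
]
# Capture the last directory segment of an intranet URL, e.g. ".../hr/idp/".
URL_PATTERN = re.compile(r"^https?://[^/]*ibm\.com/(?:.*/)?(?P<app>[\w-]+)/$")

def homepage_annotations(title, url):
    """Return homepage annotations extracted from a page's title and URL."""
    found = []
    for pattern in TITLE_PATTERNS:
        match = pattern.match(title.strip())
        if match:
            found.append(("homepage_of", match.group("name").strip()))
            break
    match = URL_PATTERN.match(url)
    if match:
        found.append(("homepage_for_app", match.group("app")))
    return found

print(homepage_annotations("Home Page for G J Chaitin",
                           "http://w3.ibm.com/hr/idp/"))
# -> [('homepage_of', 'G J Chaitin'), ('homepage_for_app', 'idp')]
```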
14. IBM Confidential
Among the 38 pages with the exact same title,
which is the best for “Paula Summa”?
Role of Global Analysis
15. Person
Title
Token Generation (TG)
Annotated values Index content
Ching-Tien T. (Howard) Ho
Ho Ching-Tien Tien Ho Ho, Tien
Howard Ho Ching-Tien H. ...
Global Technology Services
TG
personNameTG
Howard Ho Ching Tien ...
gts Global Technology Services
Global Technology Technology
Services Global Technology ...
GlobalTechnologyServices
nGramTG
spaceTG
acronymTG
nGramTG
…
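A minimal sketch of the token generators named on this slide; the function names mirror spaceTG, acronymTG, and nGramTG, but the implementations below are illustrative assumptions, not the engine's actual TG code.

```python
# Each token generator (TG) maps an annotated value to index tokens.

def space_tg(value):
    # Whitespace-separated tokens: "Global Technology Services" -> 3 tokens.
    return value.split()

def acronym_tg(value):
    # First letters of each word: "Global Technology Services" -> "gts".
    return ["".join(word[0] for word in value.split()).lower()]

def ngram_tg(value, n=2):
    # Contiguous word n-grams: "Global Technology", "Technology Services".
    words = value.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Each (annotation, TG) pair is indexed as its own field, so the runtime
# can later weigh a Title+acronymTG hit differently from a Title+nGramTG hit.
title = "Global Technology Services"
for field, tokens in {
    "Title+spaceTG": space_tg(title),
    "Title+acronymTG": acronym_tg(title),
    "Title+nGramTG": ngram_tg(title),
}.items():
    print(field, tokens)
```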
16. outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
18. Phase 3: Result Construction
Phase 2: Relevance Ranking
Phase 1: Query Semantics
query search rewrite rules
queries
interpretations
partially ordered interpretations
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
Runtime Flow in More Detail
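The three-phase flow above can be sketched as a small pipeline; `run_query`, the toy index, and the rule signatures are hypothetical stand-ins for the real components.

```python
# Minimal sketch of the three-phase runtime flow: `search` is any callable
# mapping an interpretation to (score, doc) pairs; every rule is a plain
# Python function. All names here are illustrative.

def run_query(query, rewrite_rules, search, grouping_rules, reranking_rules):
    # Phase 1: query semantics -- rewrite rules turn the keyword query into
    # a set of interpretations, without touching the index.
    interpretations = [query]
    for rewrite in rewrite_rules:
        interpretations += rewrite(query)

    # Phase 2: relevance ranking -- execute every interpretation against
    # the index, then aggregate the partially ordered results.
    results = []
    for interpretation in interpretations:
        results += search(interpretation)
    results.sort(key=lambda pair: pair[0], reverse=True)

    # Phase 3: result construction -- apply the admin-supplied grouping
    # and re-ranking rules to produce the final result list.
    for rule in grouping_rules + reranking_rules:
        results = rule(results)
    return [doc for _, doc in results]

# Toy usage: one rewrite rule and a two-entry "index".
docs = {"germany hr": [(2.0, "w3/hr/de")], "ibm germany": [(1.0, "w3/geo/de")]}
country_hr = lambda q: ["germany hr"] if q == "ibm germany" else []
print(run_query("ibm germany", [country_hr], lambda q: docs.get(q, []), [], []))
# -> ['w3/hr/de', 'w3/geo/de']
```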
19. Runtime Rules: Pattern-Action Language
Query Pattern | Queries Matching | Possible Action
EQUALS [r=ibm|information|info] [d=COUNTRY]
• ibm germany
• info india
→ Rewrite into “[country] hr”
(e.g., germany hr)
ENDS_WITH installation
• acrobat installation
• db2 on aix installation
→ Replace installation with ISSI
(e.g., acrobat ISSI)
CONTAINS directions to [d=SITE]
• driving directions to almaden
• directions to watson from jfk
→ Pages of “siteserv” category
should be ranked higher
STARTS_WITH [d=PERSON]
• john kelly biography
• steve mills announcement
→ Group together pages that
represent blog entries
Query pattern → Action:
the pattern expression is matched against the
keyword query; the action is performed on a match
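The pattern-action idea can be sketched with the EQUALS and ENDS_WITH rules from the table above; the dictionary contents and helper functions are simplified assumptions, not the actual rule language.

```python
# Hedged sketch of pattern-action runtime rules.

COUNTRY = {"germany", "india"}  # stand-in for the [d=COUNTRY] dictionary

def token_matches(pattern_token, word):
    # "[d=COUNTRY]" matches via dictionary lookup; "a|b|c" matches any
    # of the listed alternatives (a literal token is the 1-alternative case).
    if pattern_token == "[d=COUNTRY]":
        return word in COUNTRY
    return word in pattern_token.split("|")

def equals(pattern, query):
    words = query.lower().split()
    return len(words) == len(pattern) and all(
        token_matches(p, w) for p, w in zip(pattern, words))

def ends_with(pattern, query):
    words = query.lower().split()
    return len(words) >= len(pattern) and all(
        token_matches(p, w) for p, w in zip(pattern, words[-len(pattern):]))

# Rule 1: EQUALS [ibm|information|info] [d=COUNTRY] -> rewrite to "<country> hr"
def rewrite_country_hr(query):
    if equals(["ibm|information|info", "[d=COUNTRY]"], query):
        return query.lower().split()[-1] + " hr"
    return query

# Rule 2: ENDS_WITH installation -> replace "installation" with "ISSI"
def rewrite_installation(query):
    if ends_with(["installation"], query):
        return query.lower().rsplit(" ", 1)[0] + " ISSI"
    return query

print(rewrite_country_hr("ibm germany"))             # -> germany hr
print(rewrite_installation("acrobat installation"))  # -> acrobat ISSI
```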
21. What’s Best for Benefits?
The most important IBM page for benefits
changes over time: currently it is netbenefits
23. Interpretations
Scenario: An IBM employee wants
to download Lotus Symphony 1.3
Runtime interpretation:
download symphony 1.3 → category=issi software=symphony 1.3
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
rewrite rules
queries
interpretations
partially ordered interpretations
download symphony 1.3 search
24. People with
first name Jim
How can we avoid pages
from the people category?
java jim
Complex Rules
25. java jim and not in person category
Complex Rules
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
java search
27. Person
Title
Recall: Token Generation (TG)
Annotated values Index content
Ching-Tien T. (Howard) Ho
Global Technology Services
TG
personNameTG
Howard Ho Ching Tien ...
gts Global Technology Services
Global Technology Technology
Services Global Technology ...
GlobalTechnologyServices
nGramTG
spaceTG
acronymTG
nGramTG
…
Ho Ching-Tien Tien Ho Ho, Tien
Howard Ho Ching-Tien H. ...
Person + personNameTG
Person + nGramTG
Title + acronymTG
Title + spaceTG
Title + nGramTG
28. Annotation + TG Relevance Bucket
Howard Ho Ching Tien ...
GlobalTechnologyServices
…
Person + personNameTG
Person + nGramTG
Title + acronymTG
Title + spaceTG
Title + nGramTG
query search
Relevance buckets
• Buckets are ranked
– Based on annotation type
– Based on TG quality
• A page can belong to
multiple buckets
• Within each bucket,
ranking is by
conventional IR
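A sketch of how relevance buckets might order results, assuming each match carries its (annotation + TG) bucket and a conventional IR score; the bucket ranks and the sort key are illustrative assumptions.

```python
# Each (annotation, TG) pair is a bucket with a fixed rank; results are
# ordered by their best bucket first, then by conventional IR score within
# the bucket. Bucket names follow the slide.

BUCKET_RANK = {
    "Person+personNameTG": 0,  # highest-quality match evidence
    "Person+nGramTG": 1,
    "Title+acronymTG": 2,
    "Title+spaceTG": 3,
    "Title+nGramTG": 4,
}

def rank_results(matches):
    """matches: (page, bucket, ir_score) triples; a page may appear in
    several buckets -- keep its best (lowest-numbered) bucket."""
    best = {}
    for page, bucket, ir_score in matches:
        key = (BUCKET_RANK[bucket], -ir_score)
        if page not in best or key < best[page]:
            best[page] = key
    return sorted(best, key=best.get)

matches = [
    ("pageA", "Title+nGramTG", 9.0),        # high IR score, weak bucket
    ("pageB", "Person+personNameTG", 2.0),  # low IR score, strong bucket
    ("pageA", "Title+acronymTG", 5.0),
]
print(rank_results(matches))  # -> ['pageB', 'pageA']
```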
31. Grouping Rules
• Grouping rules define how search results should
be grouped together
• Search administrators can improve the diversity
of search results (on the 1st page)
– Based on their familiarity with the data sources
Query pattern → Group pages of the same category
per diem → travel, you-and-ibm
ANY → ISSI, IT Help Central, Forum,
Bluepedia, Media Library, …
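A sketch of a grouping rule, assuming each result carries a category label; keeping only the first hit per listed category stands in for collapsing a group's remaining pages behind its first-page representative.

```python
# For the categories listed in a grouping rule, only the first hit per
# category stays in the list, so the first result page remains diverse.

def group_by_category(results, categories):
    grouped, seen = [], set()
    for page, category in results:
        if category in categories and category in seen:
            continue  # folded into the group already represented above
        seen.add(category)
        grouped.append((page, category))
    return grouped

results = [
    ("IBM Travel: Per Diem", "travel"),
    ("IBM Travel: Per Diem Rates", "travel"),
    ("You and IBM: Per Diem Policy", "you-and-ibm"),
    ("IBM Travel: National perdiems", "travel"),
]
print(group_by_category(results, {"travel", "you-and-ibm"}))
```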
32. Need first-page diversity
Flooding with Similar Pages
33. Grouping Rule to the Rescue
per diem travel, you-and-ibm
final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results
per diem search
34. • Re-ranking rules adjust ranking of
search results based on categories
• Example: search administrator specifies the
important sources of “hot/current topics”
Re-ranking Rules
Hot topics (smarter planet, cloud computing, centennial, …)
→ Rank these categories higher: Bluepedia, News, About-IBM
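A sketch of a hot-topics re-ranking rule; the trigger terms and boosted categories mirror the slide, while the stable-partition mechanics are an assumption.

```python
# When the query matches the rule's "hot topics" pattern, pages from the
# admin-listed categories move ahead of the rest; order is otherwise
# preserved (a stable partition).

HOT_TOPIC_TERMS = {"smarter planet", "cloud computing", "centennial"}
BOOSTED_CATEGORIES = {"Bluepedia", "News", "About-IBM"}

def rerank_hot_topics(query, results):
    if query.lower() not in HOT_TOPIC_TERMS:
        return results
    boosted = [r for r in results if r[1] in BOOSTED_CATEGORIES]
    rest = [r for r in results if r[1] not in BOOSTED_CATEGORIES]
    return boosted + rest

results = [("hr-page", "HR"), ("news-page", "News"), ("wiki-page", "Bluepedia")]
print(rerank_hot_topics("cloud computing", results))
```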
35. Bluepedia
Technical News
Re-ranking Rule for Hot Topics
Homepages of “About IBM”
Hot topics (smarter planet, cloud computing, centennial, …)
→ Rank these categories higher: Bluepedia, News, About-IBM
36. Re-ranking Rules for Person Queries
[d=PERSON]
executive_corner, media_library,
organization_chart, files
media_library
executive_corner
interpretations
partially ordered interpretations
rewrite rules
queries
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
Paula Summa search
38. What Administrators Need…
Recap:
• Search administrators have major problems
with an opaque search engine
• Programmable search provides
– Customization to the specific domain
– Ongoing search-quality management
Okay… but:
The proof of the pudding is in the eating!
The people in charge of search are not the SIGIR audience; they are IT admins; hence, all they can do is apply these hacks and hardcoded results.
“It may be the case that a day before, Thin Client Manager meant something else; so, intents change overnight as well.”
So we have different types of tokenization applied to the different types of annotated items; for each annotation type and TG type, the result is stored in a separate part of the index. In a few slides, I will explain how we use that during runtime.
In phase 1, we manipulate the search query, add variants and so on, without touching the index. The result is a set of queries. Next, in phase 2, we run the queries against the index and apply ranking, by a combination of conventional IR and relevance buckets that I will describe shortly. In phase 3, we build the final result by invoking the grouping and re-ranking rules supplied by the admins.
This slide gives a more detailed view of the runtime flow, showing where the three phases are. Next, I will discuss the different actions in the boxes here.