Keynote talk given at the 10th Russian Summer School in Information Retrieval (RuSSIR ’16), Saratov, Russia, August 2016.
Note: part of the work is still under review; those slides are not yet included.
Cost-based query optimization in Apache Hive 0.14 – Julian Hyde
Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive introduces cost-based optimization for the first time, based on the Optiq framework. Optiq's lead developer Julian Hyde shows the improvements that CBO is bringing in Apache Hive 0.14.
For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive with the Stinger.next initiative.
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat... – Databricks
Apache Spark is an excellent tool to accelerate your analytics, whether you’re doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization. This talk will cover best practices I’ve applied over years in the field helping customers write Spark applications as well as identifying what patterns make sense for your use case.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang – Databricks
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, comparing it with Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2, to show its generality and flexibility.
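The two requirements above, generality and flexibility, can be sketched in plain Python as a storage-agnostic reader contract with an optional pushdown hook. This is a conceptual analogy only, not the actual Spark Data Source V2 interfaces (which are defined in Java/Scala); the class and method names here are invented for illustration:

```python
from abc import ABC, abstractmethod

class DataSourceReader(ABC):
    """Generality: every storage system implements the same read contract."""

    @abstractmethod
    def read(self):
        """Return an iterable of row dicts."""

    def prune_columns(self, columns):
        """Flexibility: a source can override this to push column pruning
        down to the storage layer; the default prunes after reading."""
        return [{c: row[c] for c in columns} for row in self.read()]

class InMemorySource(DataSourceReader):
    """Toy source backed by a Python list, standing in for HDFS/Hive/etc."""
    def __init__(self, rows):
        self.rows = rows

    def read(self):
        return list(self.rows)

src = InMemorySource([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
print(src.prune_columns(["id"]))  # [{'id': 1}, {'id': 2}]
```

A capable source (e.g. a columnar file format) would override `prune_columns` to avoid reading the dropped columns at all, which is the kind of per-system optimization the V2 API is designed to allow.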
How Adobe Does 2 Million Records Per Second Using Apache Spark! – Databricks
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing.
Simplifying Big Data Integration with Syncsort DMX and DMX-h – Precisely
Modern data strategies have to manage more than growing data volumes. They must also address the added complexity of integrating diverse data sources and types, adhere to security and governance mandates, and ensure the right tools and skills are in place to deliver business value from the data.
Learn how the latest enhancements to Syncsort DMX and DMX-h can help you achieve your modern data strategy goals with a single interface for accessing and integrating all your enterprise data sources – batch and streaming – across Hadoop, Spark, Linux, Windows or Unix – on premises or in the cloud.
Watch this on-demand customer education webcast to learn the latest product features introduced this year, including:
• Best in class data ingestion capabilities with enhanced support for mainframes, RDBMSs, MPP, Avro/Parquet, Kafka, NoSQL and more.
• Single interface for streaming and batch processes – now with support for Kafka and MapR Streams
• Secure data access, data governance and lineage with seamless integration with Kerberos, Apache Ranger, Apache Ambari, Cloudera Manager, Cloudera Navigator and Sentry.
• Evolution of our design once, deploy anywhere architecture – now with support for Spark!
Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certifica... – Edureka!
This Edureka "Apache Spark Training" tutorial will talk about how Apache Spark works practically. We have demonstrated a Movie Recommendation Project using Apache Spark in this tutorial. Below are the topics covered in this tutorial:
1) Use Cases Of Real Time Analytics
2) Movie Recommendation System Using Spark
3) What Is Spark?
4) Getting Movie Dataset
5) Spark Streaming
6) Collaborative Filtering
7) Spark MLlib
8) Fetching Results
9) Storing Results
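Spark MLlib's collaborative filtering is built on alternating least squares (ALS); as a language-neutral illustration of the underlying idea, here is a tiny user-based collaborative filter in plain Python. The ratings data and function names are invented for the example:

```python
import math

# user -> {movie: rating}; toy data invented for the example
ratings = {
    "alice": {"Matrix": 5, "Titanic": 1, "Inception": 4},
    "bob":   {"Matrix": 4, "Titanic": 2, "Inception": 5},
    "carol": {"Matrix": 1, "Titanic": 5},
}

def cosine(u, v):
    """Cosine similarity over the movies two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    nu = math.sqrt(sum(u[m] ** 2 for m in common))
    nv = math.sqrt(sum(v[m] ** 2 for m in common))
    return dot / (nu * nv)

def recommend(user):
    """Suggest unseen movies rated highly by the most similar other user."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    seen = set(ratings[user])
    return [m for m, r in ratings[nearest].items() if m not in seen and r >= 4]

print(recommend("carol"))  # ['Inception']
```

Carol's ratings correlate most strongly with Bob's, so Bob's highly rated, unseen-by-Carol movie is suggested. ALS instead learns low-rank latent factors for users and items, which scales far better, but the "similar users like similar movies" intuition is the same.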
Web 3.0 explained with a stamp (pt I: the basics) – Freek Bijl
What does Web 3.0, or the semantic web, really mean? This presentation explains the meaning of Web 3.0 using the example of a stamp collection. It is a translation of a Dutch version made earlier; for more detailed information in Dutch, see BijlBrand.nl.
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... – Databricks
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you architect a pipeline that solves your business needs in the most resource-efficient manner.
In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are you going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem automatically brings much clarity on ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 – StreamNative
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, with results that can be joined with other data sources. This plugin will soon be renamed to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename, and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Exactly-Once Financial Data Processing at Scale with Flink and Pinot – Flink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset of trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by Xiang Zhang, Pratyush Sharma, and Xiaoman Dong
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ... – Databricks
Of all the developers’ delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs (RDDs, DataFrames, and Datasets) available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practice; 2) its performance and optimization benefits; and 3) scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (This talk is a vocalization of the blog post, along with the latest developments in Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab – CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2L4rPmM
This CloudxLab Basics of RDD tutorial helps you to understand Basics of RDD in detail. Below are the topics covered in this tutorial:
1) What is RDD - Resilient Distributed Datasets
2) Creating RDD in Scala
3) RDD Operations - Transformations & Actions
4) RDD Transformations - map() & filter()
5) RDD Actions - take() & saveAsTextFile()
6) Lazy Evaluation & Instant Evaluation
7) Lineage Graph
8) flatMap and Union
9) Scala Transformations - Union
10) Scala Actions - saveAsTextFile(), collect(), take() and count()
11) More Actions - reduce()
12) Can We Use reduce() for Computing Average?
13) Solving Problems with Spark
14) Compute Average and Standard Deviation with Spark
15) Pick Random Samples From a Dataset using Spark
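Two of the topics above, lazy evaluation and whether reduce() alone can compute an average, translate directly into plain Python. The sketch below uses generators to mimic lazy RDD transformations and functools.reduce for the average question; it is a conceptual illustration, not PySpark code:

```python
from functools import reduce

nums = [4, 8, 15, 16, 23, 42]

# Transformations are lazy: building the generators does no work yet...
doubled = (x * 2 for x in nums)          # like rdd.map(lambda x: x * 2)
big = (x for x in doubled if x > 20)     # like .filter(lambda x: x > 20)

# ...until an action forces evaluation, like collect() or take():
print(list(big))  # [30, 32, 46, 84]

# reduce() alone cannot compute an average: it folds pairs of values and
# loses the count. The standard trick is to reduce over (sum, count) pairs.
total, count = reduce(
    lambda acc, x: (acc[0] + x, acc[1] + 1), nums, (0, 0))
print(total / count)  # 18.0
```

The (sum, count) trick matters in Spark because the combine function must be associative so partial results from different partitions can be merged; a naive running average is not.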
Debugging PySpark: Spark Summit East talk by Holden Karau – Spark Summit
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they behave in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future at data property accumulators, which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
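The accumulator pattern the talk describes can be sketched in plain Python: an add-only shared counter that tasks bump as a side effect, here used to count malformed records without a separate pass over the data. This is a conceptual stand-in, not Spark's actual LongAccumulator API:

```python
class Accumulator:
    """Minimal stand-in for a Spark accumulator: an add-only counter."""
    def __init__(self):
        self.value = 0

    def add(self, n=1):
        self.value += n

bad_records = Accumulator()

def parse(line):
    """Parse 'key:value' lines, counting malformed ones as a side effect."""
    try:
        key, value = line.split(":")
        return (key, int(value))
    except ValueError:
        bad_records.add()
        return None

lines = ["a:1", "b:2", "oops", "c:not_a_number"]
parsed = [r for r in (parse(l) for l in lines) if r is not None]
print(parsed)             # [('a', 1), ('b', 2)]
print(bad_records.value)  # 2
```

The caveat from the paragraph above applies to the real thing: if Spark re-executes a task (cache miss, speculative execution), an accumulator updated inside a transformation can over-count, which is why accumulators are most trustworthy inside actions or as debugging aids rather than as exact metrics.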
NPM is a package manager for the JavaScript programming language. It is the default package manager for the JavaScript runtime environment Node.js. It consists of a command line client, also called npm, and an online database of public and paid-for private packages, called the npm registry.
Real time Analytics with Apache Kafka and Apache Spark – Rahul Jain
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing you to write streaming applications quickly and easily; it supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper, and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa... – StreamNative
Milvus is an open-source vector database that leverages a novel data fabric to build and manage vector similarity search applications. As the world's most popular vector database, it has already been adopted in production by thousands of companies around the world, including Lucidworks, Shutterstock, and Cloudinary. With the launch of Milvus 2.0, the community aims to introduce a cloud-native, highly scalable and extendable vector similarity solution, and the key design concept is log as data.
Milvus relies on Pulsar as the log pub/sub system. Pulsar helps Milvus reduce system complexity by loosely decoupling each microservice, and makes the system stateless by disaggregating log storage from computation, which also makes the system more extendable. We will introduce the overall design, the implementation details of Milvus, and its roadmap in this talk.
Takeaways:
1) Get a general idea of what a vector database is and its real-world use cases.
2) Understand the major design principles of Milvus 2.0.
3) Learn how to build a complex system with the help of a modern log system like Pulsar.
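The core operation a vector database serves, nearest-neighbour search over embedding vectors, can be shown in a few lines of plain Python. This is a brute-force sketch with invented toy data; Milvus itself uses approximate indexes (e.g. IVF, HNSW) to make this fast at billion-vector scale:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "collection": id -> embedding vector (invented data)
collection = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def search(query, top_k=2):
    """Brute-force top-k by cosine similarity; real systems use ANN indexes."""
    scored = sorted(collection.items(),
                    key=lambda kv: cosine_sim(query, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_k]]

print(search([1.0, 0.0, 0.1]))  # ['cat', 'dog']
```

Everything else in a system like Milvus (the log-as-data design, Pulsar-backed ingestion, disaggregated storage) exists to keep this query fast and consistent while vectors stream in concurrently.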
Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl
Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl’s extensive use of Hadoop to fulfill their mission of building an open and accessible web-scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
Visually Exploring Patent Collections for Events and Patterns – Xiaoyu Wang
My talk on Patent Visualization at The 3rd IEEE Workshop on Interactive Visual Text Analytics. Primary focus is to introduce the Scalable Visual Analytics research that my team is working on. Workshop paper can be found at: http://vialab.science.uoit.ca/textvis2013/papers/Ankam-TextVis2013.pdf
Lessons Learnt from the Named Entity rEcognition and Linking (NEEL) Challenge... – Marieke van Erp
Giuseppe Rizzo, Biana Pereira, Andra Varga, Marieke van Erp, Amparo Elizabeth Cano Basave
Presented on Wednesday 10 October at the 17th International Semantic Web Conference (ISWC 2018)
Paper: http://www.semantic-web-journal.net/content/lessons-learnt-named-entity-recognition-and-linking-neel-challenge-series
Conference: http://iswc2018.semanticweb.org/
This is Part II of the tutorial "Entity Linking and Retrieval" given at SIGIR 2013 (together with E. Meij and D. Odijk). For the complete tutorial material (including slides for the other parts) visit http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
Test Trend Analysis: Towards robust, reliable and timely tests – Hugh McCamphill
Slides from my talk at Selenium Conference 2016.
In this talk you will get ideas about how you can instrument test result information to provide actionable data, paving the way for more robust, reliable and timely test results.
By capturing this information over time, and when combined with visualization tools, we can answer different questions than with existing solutions (Allure / CI tool build history). Some examples of these are:
• Which tests are consistently flaky
• What are the common causes of failure across tests
• Which tests consistently take a long time to run
Using this information, we can move away from the ‘re-run’ culture and better support the continuous integration goals of having quick, reliable, deterministic tests.
Video of the talk is here: https://youtu.be/29fPYx7OJnE?list=PL_7kBU2XBlbKuRNVHeqjXUygXtToqMHsn
Information School, University of Washington, 2014-05-21: INFX 598 - Introducing Linked Data: concepts, methods and tools. Guest lecture (Module 9) "Doing Business with Semantic Technologies": Introduction to Ontotext and some of its products, clients and projects.
Also see video: https://voicethread.com/myvoice/#thread/5784646/29625471/31274564
Dynamic Search Using Semantics & Statistics – Paul Hofmann
This presentation shows three applications that successfully combine semantics and statistics for text mining and interactive search.
1) We predict the Lehman bankruptcy using statistical topic modeling, SAP Business Objects entity extraction and associative memories (powered by Saffron Technologies).
2) We semi-automatically handle service requests at Cisco using knowledge extraction and knowledge reuse.
3) We discover user intent for interactive retrieval. User intent is defined as a latent state. The observations of this latent state are the reformulated query sequence, and the retrieved documents, together with the positive or negative feedback provided by the user. Demo shows recognizing user’s intent for health care search.
Building a semantic search system - one that can correctly parse and interpret end-user intent and return the ideal results for users’ queries - is not an easy task. It requires semantically parsing the terms, phrases, and structure within queries, disambiguating polysemous terms, correcting misspellings, expanding to conceptually synonymous or related concepts, and rewriting queries in a way that maps the correct interpretation of each end user’s query into the ideal representation of features and weights that will return the best results for that user. Not only that, but the above must often be done within the confines of a very specific domain - ripe with its own jargon and linguistic and conceptual nuances.
This talk will walk through the anatomy of a semantic search system and how each of the pieces described above fit together to deliver a final solution. We'll leverage several recently-released capabilities in Apache Solr (the Semantic Knowledge Graph, Solr Text Tagger, Statistical Phrase Identifier) and Lucidworks Fusion (query log mining, misspelling job, word2vec job, query pipelines, relevancy experiment backtesting) to show you an end-to-end working Semantic Search system that can automatically learn the nuances of any domain and deliver a substantially more relevant search experience.
Librarian use of authority files dates back to Callimachus and the Great Library of Alexandria around 300 BC. With the evolution of powerful computerized searching and retrieval systems, authority data appears to some to have outlived its usefulness. However, the Semantic Web provides an opportunity to use authority data to enable computers to search, aggregate, and combine information on the Web. Join this webinar to learn about the amazing services that can result when the rich data included in name authority files, and other standardized vocabularies are linked via the Semantic Web.
Similar to Entity Search: The Last Decade and the Next (20)
What Does Conversational Information Access Exactly Mean and How to Evaluate It?krisztianbalog
This talk discusses a set of specific tasks and scenarios related to information access within the vast space that is casually referred to as conversational AI. While most of these problems have been identified in the literature for quite some time now, progress has been limited. Apart from the inherently challenging nature of these problems, the lack of progress, in large part, can be attributed to the shortage of appropriate evaluation methodology and resources. This talk presents some recent work towards filling this gap.
In one line of research, we investigate the presentation of tabular search results in a conversational setting. Instead of generating a static summary of a result table, we complement brief summaries with clues that invite further exploration, thereby taking advantage of the conversational paradigm. One of the main contributions of this study is the development of a test collection using crowdsourcing.
Another line of work focuses on large-scale evaluation of conversational recommender systems via simulated users. Building on the well-established agenda-based simulation framework from dialogue systems research, we develop interaction and preference models specific to the item recommendation scenario. For evaluation, we compare three existing conversational movie recommender systems with both real and simulated users, and observe high correlation between the two means of evaluation.
This talk has been given at the CIIR talk series at the University of Massachusetts Amherst in Jan 2021 as well as at the IR seminar series at the University of Glasgow in March 2021.
Entity Retrieval (tutorial organized by Radialpoint in Montreal)krisztianbalog
This is Part II of the tutorial "Entity Linking and Retrieval for Semantic Search" given at a tutorial organized by Radialpoint (together with E. Meij and D. Odijk).
Previous versions of the tutorial were given at WWW'13, SIGIR'13, and WSDM'14. The current version contains an overhaul of the type-aware ranking part.
For the complete tutorial material (including slides for the other parts) visit http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
This is Part II of the tutorial "Entity Linking and Retrieval for Semantic Search" given at WSDM 2014 (together with E. Meij and D. Odijk). For the complete tutorial material (including slides for the other parts) visit http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
This is Part II of the tutorial "Entity Linking and Retrieval" given at WWW 2013 (together with E. Meij and D. Odijk). For the complete tutorial material (including slides for the other parts) visit http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Enhancing Performance with Globus and the Science DMZGlobus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
The Metaverse and AI: how can decision-makers harness the Metaverse for their...Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
Entity Search: The Last Decade and the Next
1. Entity Search
The Last Decade and the Next
Krisztian Balog
University of Stavanger
@krisztianbalog
10th Russian Summer School in Information Retrieval (RuSSIR 2016) | Saratov, Russia, 2016
2. WHAT IS AN ENTITY?
• An entity is an "object" or "thing" in the real world that can be distinctly identified and is characterized by the following properties:
• unique identifier(s)
• name(s)
• type(s)
• attributes (or description)
• (typed) relationships to other entities
Examples: people, products, organizations, locations
11. TREC ENTERPRISE EXPERT FINDING
• How to rank entities that have no direct representations?
• Idea: Look at co-occurrences of entities and query terms in documents
[Figure: documents with highlighted query terms and entity mentions]
12. PROFILE-BASED METHODS
• Build a direct term-based entity representation based on associated language usage
• "You shall know a word by the company it keeps." [Firth, 1957]
• Use document retrieval techniques for ranking entity profile documents
[Figure: query q is matched against profile documents assembled for each entity e]
13. DOCUMENT-BASED METHODS
• First rank documents (or document snippets)
• Then aggregate evidence for the associated entities
[Figure: query q retrieves a ranked list of documents; entity mentions (e) in the retrieved documents are aggregated into an entity ranking]
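The document-based aggregation described above can be sketched as follows. This is a minimal illustration with uniform entity-document association weights and hypothetical document/entity IDs, not any particular TREC system's model:

```python
from collections import defaultdict

def rank_entities(doc_scores, doc_entities):
    """Aggregate document retrieval scores per mentioned entity.

    doc_scores:   {doc_id: retrieval score for the query}
    doc_entities: {doc_id: set of entities mentioned in the document}
    Returns entities sorted by aggregated score (descending).
    """
    entity_scores = defaultdict(float)
    for doc_id, score in doc_scores.items():
        mentioned = doc_entities.get(doc_id, set())
        for entity in mentioned:
            # Uniform association: each mentioned entity receives
            # an equal share of the document's retrieval score.
            entity_scores[entity] += score / len(mentioned)
    return sorted(entity_scores.items(), key=lambda kv: -kv[1])

# Toy example: d1 mentions e1 and e2, d2 mentions only e1.
ranking = rank_entities(
    {"d1": 2.0, "d2": 1.0},
    {"d1": {"e1", "e2"}, "d2": {"e1"}},
)
```

Real systems replace the uniform share with learned or heuristic entity-document association strengths.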
14. EVALUATION CAMPAIGNS
[Timeline 2005–2015 of evaluation campaigns: TREC Enterprise, TREC Entity, INEX Entity Ranking, SemSearch, INEX, Question Answering over Linked Data]
Task: entity ranking in Wikipedia
Input: keyword++ query (target types/examples)
Data collection: Wikipedia
Entity ID: Wikipedia article ID
Example query: Movies with eight or more Academy Awards
+category: best picture oscar
+category: british films
+category: american films
15. INEX ENTITY RANKING
Term-based representation: Movies with eight or more Academy Awards
Category-based representation: +category: best picture oscar; +category: british films; +category: american films
16. EVALUATION CAMPAIGNS
[Timeline 2005–2015 of evaluation campaigns: TREC Enterprise, TREC Entity, INEX Entity Ranking, SemSearch, INEX, Question Answering over Linked Data]
Task: related entity finding
Input: keyword++ query (input entity, target type)
Data collection: Web
Entity ID: entity homepage URL
Example query: airlines that currently use Boeing-747 planes
+entity: Boeing-747 (clueweb09-..292)
+target type: organization
17. EVALUATION CAMPAIGNS
[Timeline 2005–2015 of evaluation campaigns: TREC Enterprise, TREC Entity, INEX Entity Ranking, SemSearch, INEX, Question Answering over Linked Data]
Task: entity search in the Web of Data
Input: keyword query
Data collection: RDF triples
Entity ID: URI
Example queries: nokia e73; boroughs of New York City; disney orlando
18. FIELDED DOCUMENT REPRESENTATION FROM RDF TRIPLES
[Diagram: subject–predicate–object and subject–predicate–literal triples centered on the entity]
Example fields for dbpedia:Audi_A4:
foaf:name: Audi A4
rdfs:label: Audi A4
rdfs:comment: The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...]
dbpprop:production: 1994; 2001; 2005; 2008
rdf:type: dbpedia-owl:MeanOfTransportation; dbpedia-owl:Automobile
dbpedia-owl:manufacturer: dbpedia:Audi
dbpedia-owl:class: dbpedia:Compact_executive_car
owl:sameAs: freebase:Audi A4
is dbpedia-owl:predecessor of: dbpedia:Audi_A5
is dbpprop:similar of: dbpedia:Cadillac_BLS
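A minimal sketch of how such a fielded document might be assembled from triples. The helper and its field-naming scheme ("is &lt;predicate&gt; of" for incoming links) are illustrative assumptions rather than a specific system's implementation:

```python
from collections import defaultdict

def fielded_document(entity, triples):
    """Build a fielded document for `entity` from (s, p, o) triples.

    Outgoing triples (entity as subject) are indexed under the
    predicate name; incoming triples (entity as object) under an
    "is <predicate> of" field, mirroring the DBpedia-style layout.
    """
    doc = defaultdict(list)
    for s, p, o in triples:
        if s == entity:
            doc[p].append(o)
        elif o == entity:
            doc[f"is {p} of"].append(s)
    return dict(doc)

doc = fielded_document("dbpedia:Audi_A4", [
    ("dbpedia:Audi_A4", "foaf:name", "Audi A4"),
    ("dbpedia:Audi_A4", "dbo:manufacturer", "dbpedia:Audi"),
    ("dbpedia:Audi_A5", "dbo:predecessor", "dbpedia:Audi_A4"),
])
```

In practice URIs in object position would also be resolved to their labels so that keyword queries can match them.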
19. EVALUATION CAMPAIGNS
[Timeline 2005–2015 of evaluation campaigns: TREC Enterprise, TREC Entity, INEX Entity Ranking, SemSearch, INEX, Question Answering over Linked Data]
Task: question answering over RDF data
Input: natural language query
Data collection: RDF triples
Entity ID: URI
Example queries: Which German cities have more than 250000 inhabitants? Who is the youngest Pulitzer Prize winner?
20. EVALUATION CAMPAIGNS
[Timeline 2005–2015 of evaluation campaigns: TREC Enterprise, TREC Entity, INEX Entity Ranking, SemSearch, INEX, Question Answering over Linked Data]
Task: ad-hoc entity retrieval
Input: keyword query
Data collection: Wikipedia + RDF triples
Entity ID: Wikipedia article ID
Example query: NASA country German
22. DATA EVOLUTION
[Timeline 2005–2015: campaigns ordered from unstructured (TREC Enterprise) through semistructured (INEX Entity Ranking, INEX) to structured data (SemSearch, Question Answering over Linked Data)]
• Clear trend moving towards structured data
• No meaningful/successful attempt at combining unstructured and structured data
23. QUERY EVOLUTION
[Timeline 2005–2015: campaigns ordered from keyword queries through keyword++ to natural language queries]
• Keyword queries are still the most common way to search
• From providing explicit semantic annotations to natural language questions
24. WHAT HAVE WE BEEN DOING?
• Core focus has been on retrieval models, and more specifically on entity representations
• In terms of associated language usage, description, types, attributes
• Richer query representations (i.e., query annotations) were taken for granted
28. DATA
J. Benetka, K. Balog, and K. Nørvåg. Towards Building a Knowledge Base of Monetary Transactions from a News Collection. JCDL'17.
29. KNOWLEDGE BASES
• Modern entity-oriented search features are fueled by knowledge bases—need continuous updating
• Critical to be able to verify the validity of data
• Supply provenance information for each statement
• Validity check (still) needs to be performed by a human
• Can we help human editors to maintain and expand knowledge bases?
30. BUILDING A KNOWLEDGE BASE OF MONETARY TRANSACTIONS
[Screenshot: a "Find events" interface. Subject entity: Oracle; predicate filter: financial event "acquisition". Each retrieved object entity (counterpart: PeopleSoft, Siebel Systems, Hyperion Solutions, Retek) is listed with extracted event attributes (year, value), a confidence score, a supporting snippet, and a link to the source (NYT). For example, the PeopleSoft row: year 2004, value USD 10 300 000 000, confidence 82.8%, snippet "...Oracle finally acquired PeopleSoft for..."]
[Source excerpt, "A Boom in Merger Activity": In December 2004, after a battle for control that grew nasty, Oracle finally acquired PeopleSoft for about $10.3 billion, becoming the second-largest maker of business-management software.]
31. APPROACH
Semantic annotation of sentences: monetary value recognition, economic event recognition, entity recognition, date extraction, semantic role labeling
Event representation: generate all possible event interpretations (quintuples)
Clustering events: grouping sentences that discuss the same economic event
Supervised learning: assigning a confidence score to each interpretation
[Figure: sentences s1–s5 with pairwise similarity scores (e.g., 0.85, 0.91) are clustered into events, e.g., e1 = [C] &lt;rel&gt; [D] from s3 and s4, and e2 = [A] &lt;rel&gt; [B] from s1, s2, and s5]
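The event-clustering step could, for instance, be realized with single-link clustering over pairwise sentence similarities. The threshold and the union-find helper below are illustrative assumptions (the paper's actual clustering method may differ), and the similarity values echo the figure:

```python
def cluster_sentences(similarities, threshold=0.6):
    """Single-link clustering of sentences into candidate events.

    similarities: {(sent_a, sent_b): score in [0, 1]}
    Sentences connected by a similarity >= `threshold` end up in
    the same cluster (candidate event).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for (a, b), score in similarities.items():
        find(a), find(b)  # register both sentences as singletons
        if score >= threshold:
            union(a, b)

    clusters = {}
    for s in parent:
        clusters.setdefault(find(s), set()).add(s)
    return list(clusters.values())

events = cluster_sentences({
    ("s1", "s2"): 0.85, ("s2", "s5"): 0.77,
    ("s3", "s4"): 0.91, ("s1", "s3"): 0.43,
})
```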
33. SUMMARY
• Building a domain-specific knowledge base
• NLP pipeline for information extraction
• ML for establishing confidence for human processing
• Open research problems
• Long-tail entities
• Entities "not worthy" of a Wikipedia page
• What are the attributes that matter?
35. ANNOTATING QUERIES WITH ENTITIES
• Semantic annotations of queries were taken for granted so far
• How can automatic entity annotations of queries be leveraged to improve entity retrieval?
Example query: barack obama parents
36. APPROACH
[Figure: the query "barack obama parents" is processed two ways. Entity linking produces the (automatic) entity annotation &lt;Barack_Obama&gt;, which is matched against the entity-based representation D̂ of each candidate entity; the query terms are matched against the term-based representation D.]
Term-based representation (example entity: Ann Dunham):
&lt;rdfs:label&gt;: Ann Dunham
&lt;dbo:abstract&gt;: Stanley Ann Dunham, the mother of Barack Obama, was an American anthropologist who …
&lt;dbo:birthPlace&gt;: Honolulu Hawaii …
&lt;dbo:child&gt;: Barack Obama
&lt;dbo:wikiPageWikiLink&gt;: United States Family Barack Obama
Entity-based representation:
&lt;dbo:birthPlace&gt;: [&lt;Honolulu&gt;, &lt;Hawaii&gt;]
&lt;dbo:child&gt;: &lt;Barack_Obama&gt;
&lt;dbo:wikiPageWikiLink&gt;: [&lt;United_States&gt;, &lt;Family_of_Barack_Obama&gt;, …]
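One simple way to combine the two matching components is late fusion. The overlap-count scoring below is a deliberately crude stand-in for a proper retrieval model, and the interpolation weight `lam` is an assumed parameter:

```python
def combined_score(query_terms, query_entities, term_rep, entity_rep,
                   lam=0.5):
    """Late fusion of term-based and entity-based matching.

    term_rep / entity_rep: {field: list of terms / entity IDs}.
    Overlap counts stand in for a real retrieval model (e.g. a
    language model over fields); `lam` interpolates the components.
    """
    term_bag = [t for values in term_rep.values() for t in values]
    term_match = sum(term_bag.count(t) for t in query_terms)
    entity_bag = [e for values in entity_rep.values() for e in values]
    entity_match = sum(entity_bag.count(e) for e in query_entities)
    return lam * term_match + (1 - lam) * entity_match

# Toy representations for one candidate entity.
score = combined_score(
    ["barack", "obama", "parents"],
    ["<Barack_Obama>"],
    {"dbo:child": ["barack", "obama"]},
    {"dbo:child": ["<Barack_Obama>"]},
)
```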
39. SUMMARY
• Automatically annotating queries with entities can significantly improve retrieval performance
• Open research problem:
• How should a query be answered (list, fact, table, etc.)?
40. ENTITY SUMMARIES
F. Hasibi, K. Balog, and S. E. Bratsberg. Dynamic Factual Summaries for Entity Cards. SIGIR'17.
41. ENTITY SUMMARIES
• Summaries serve a dual purpose
• Synopsis of the entity
• Provide evidence why the entity is a good answer for the given query
• How to generate dynamic entity summaries that can directly address users' information needs?
• Two subtasks
• Fact ranking — What should be in the summary?
• Summary generation — How should it be presented?
42. EXAMPLE
Query: einstein awards
Static (query-independent) summary:
Born: March 14, 1879, Ulm, Germany
Died: April 18, 1955, Princeton, New Jersey, United States
Influenced by: Isaac Newton, Mahatma Gandhi, more
Spouse: Elsa Einstein, Mileva Marić
Children: Eduard Einstein, Lieserl Einstein, Hans A. Einstein
Dynamic (query-dependent) summary:
Born: March 14, 1879, Ulm, Germany
Died: April 18, 1955, Princeton, New Jersey, United States
Awards: Barnard Medal, Nobel Prize in Physics, more
Education: Swiss Federal Polytechnic, University of Zurich
Influenced by: Isaac Newton, Mahatma Gandhi, more
43. FACT RANKING
• Ranking entity facts according to various "goodness" criteria
• Importance: how well it describes the entity
• Relevance: how well it supports/explains why the entity is a relevant result for the given query (information need)
• Utility: combines importance and relevance
• Learning-to-rank approach with specific features designed for capturing importance and relevance
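As a toy illustration of the utility criterion, importance and relevance can be interpolated linearly. The actual approach is learning-to-rank over richer features, so the `alpha` weight and the scoring functions below are placeholders:

```python
def utility(importance, relevance, alpha=0.5):
    """Interpolate importance and relevance into a utility score."""
    return alpha * importance + (1 - alpha) * relevance

def rank_facts(facts, query, importance_fn, relevance_fn, alpha=0.5):
    """Rank entity facts by utility for the given query."""
    scored = [(f, utility(importance_fn(f), relevance_fn(f, query), alpha))
              for f in facts]
    return sorted(scored, key=lambda fs: -fs[1])

# Toy example: fact identifiers and scores are made up.
ranked = rank_facts(
    ["awards", "spouse"],
    "einstein awards",
    importance_fn={"awards": 0.6, "spouse": 0.8}.get,
    relevance_fn=lambda f, q: 1.0 if f in q else 0.0,
)
```

Note how the query-dependent relevance term is what lets a less "important" fact (awards) outrank a more important one for this particular query.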
44. SUMMARY GENERATION
• A summary is more than a ranked list of facts
• Semantically identical predicates (e.g., &lt;dbo:leader&gt; and &lt;dbp:leaderName&gt;)
• Multi-valued predicates (e.g., the three &lt;dbo:language&gt; values)
• Presentation (human-readable labels, size constraints)
Raw facts:
&lt;dbo:capital&gt; &lt;dbpedia:Oslo&gt;
&lt;dbo:currency&gt; &lt;dbpedia:Norwegian_krone&gt;
&lt;dbo:leader&gt; &lt;dbpedia:Harald_V_of_Norway&gt;
&lt;dbp:establishedDate&gt; 1814-05-17
&lt;dbp:leaderName&gt; &lt;dbpedia:Harald_V_of_Norway&gt;
&lt;foaf:homepage&gt; &lt;http://www.norway.no/&gt;
&lt;dbo:language&gt; &lt;dbpedia:Norwegian_language&gt;
&lt;dbo:language&gt; &lt;dbpedia:Romani_language&gt;
&lt;dbo:language&gt; &lt;dbpedia:Scandoromani_language&gt;
&lt;dbp:website&gt; &lt;http://www.norway.no/&gt;
&lt;dbo:leaderTitle&gt; President of the Storting
&lt;dbp:areaKm&gt; 385178
vs. rendered summary:
Capital: Oslo
Currency: Norwegian krone
Leader: Harald V of Norway
Homepage: http://www.norway.no/
Language: Norwegian, Romani, more
45. SUMMARY GENERATION ALGORITHM
[Figure: a summary of at most height τ_h and width τ_w, where each line i consists of heading_i and value_i]
1. Selecting line headings
• Recognizing semantically identical predicates
• Mapping predicates to human-readable labels
2. Collecting line values
• Grouping values for multi-valued predicates
• Adhering to size constraints
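The two steps above can be sketched as follows. The alias map (for semantically identical predicates), the label map, and the size limits are illustrative stand-ins for the paper's actual components:

```python
def generate_summary(facts, labels, aliases, max_lines=5, max_values=2):
    """Turn ranked (predicate, value) facts into summary lines.

    facts:   ranked list of (predicate, value) pairs
    labels:  {canonical predicate: human-readable heading}
    aliases: {predicate: canonical predicate} for semantically
             identical predicates
    Values of one canonical predicate are grouped and deduplicated;
    at most `max_values` are shown, with "more" appended if truncated.
    """
    grouped, order = {}, []
    for pred, value in facts:
        canon = aliases.get(pred, pred)
        if canon not in grouped:
            grouped[canon] = []
            order.append(canon)
        if value not in grouped[canon]:
            grouped[canon].append(value)
    lines = []
    for canon in order[:max_lines]:
        values = grouped[canon][:max_values]
        if len(grouped[canon]) > max_values:
            values.append("more")
        lines.append(f"{labels.get(canon, canon)}: {', '.join(values)}")
    return lines

lines = generate_summary(
    [("dbo:capital", "Oslo"),
     ("dbo:leader", "Harald V of Norway"),
     ("dbp:leaderName", "Harald V of Norway"),
     ("dbo:language", "Norwegian"),
     ("dbo:language", "Romani"),
     ("dbo:language", "Scandoromani")],
    {"dbo:capital": "Capital", "dbo:leader": "Leader",
     "dbo:language": "Language"},
    {"dbp:leaderName": "dbo:leader"},
)
```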
47. END-TO-END (SUMMARY) EVALUATION
• How do static and dynamic summaries compare against each other?
[Chart: user preference percentages (scale 0–100). Oracle (perfect) fact ranking: dynamic summary wins 47, static summary wins 16. Automatic fact ranking: dynamic summary wins 46, static summary wins 23. The remainders (37 and 31) are cases with no winner.]
48. SUMMARY
• Addressed the problem of generating dynamic (query-dependent) entity summaries
• Open research problems
• What should be on the entity card?
• Other forms of result presentation (tables, lists, graphs, etc.)
50. ZERO-QUERY SEARCH
• Proactive instead of reactive search
• "Anticipate user needs and respond with information appropriate to the current context without the user having to enter a query" — (Allan et al., SIGIR Forum 2012)
• Using a person's check-in activity as context, can we anticipate her information needs, and respond with a set of information cards that directly address those needs?
[Figure: example information cards: Terminal, Weather (21ºC), Traffic]
51. INFORMATION NEEDS FOR ACTIVITIES
• What are relevant information needs in the context of a given activity?
• Use POI categories (Foursquare) to represent activities
• Mine information needs from search suggestions
52. ANTICIPATING INFORMATION NEEDS
• Maximize the likelihood of satisfying the user's information needs by considering each possible activity that might follow next
• Transition probabilities are estimated based on historical check-in data
[Figure: from the current activity (Activity A), transitions to Activity B (45%), Activity C (34%), and Activity D (21%)]
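The expected-relevance idea above might be sketched like this; the activity names, information-need labels, and weights are hypothetical:

```python
def anticipate_needs(current_activity, transitions, needs):
    """Score information needs by expected relevance to what comes next.

    transitions: {current activity: {next activity: probability}}
    needs:       {activity: {info need: relevance weight}}
    Each need is weighted by the probability of the activity it
    belongs to, aggregated over all possible next activities.
    """
    scores = {}
    for nxt, prob in transitions.get(current_activity, {}).items():
        for need, weight in needs.get(nxt, {}).items():
            scores[need] = scores.get(need, 0.0) + prob * weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical transition model mirroring the figure's 45/34/21 split.
ranked = anticipate_needs(
    "airport",
    {"airport": {"hotel": 0.45, "restaurant": 0.34, "museum": 0.21}},
    {"hotel": {"check-in time": 1.0},
     "restaurant": {"menu": 1.0},
     "museum": {"opening hours": 1.0}},
)
```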
53. EVALUATION METHODOLOGY
[Figure: the check-in dataset is split per user into a training portion (80%) and a test portion; for held-out check-ins the system predicts information cards such as Terminal, Weather (21ºC), Traffic]
54. RESULTS
[Chart: NDCG@5 (0.00–0.90) at the top level and second level of the POI category hierarchy, for methods M0–M3]
M0: Most frequent information needs, regardless of the last activity
M1: Consider information needs for all possible upcoming activities
M2: In addition, consider the information needs relevant to the past activity (fixed weight for all info needs)
M3: Consider the temporal sensitivity of each information need individually
55. SUMMARY
• Identifying information needs that are relevant in the context of a given activity and proactively presenting information cards addressing those needs
• Open research problems
• Other contexts
• (Access to data, privacy...)
58. [Envisioned dialogue with a personal search assistant]
Agent: I see you're wasting time away on Facebook. Do you have time now to talk about your holiday plans?
User: Sure. I want an active holiday with the family in beautiful nature.
Agent: It sounds like you would definitely love Norway. A cabin in the mountains maybe?
User: Could be. But I want to go kayaking and also catch some fish. And not too much rain, please.
Agent: And something fun for the kids nearby, I suppose?
User: Of course.
Agent: How does Oltedal sound? People have been quite successful with catching lake trout based on what I found on Instagram. There is also a theme park and horse riding, both within 50 kms.
59.
User: And what about the weather?
Agent: You know we're talking about Norway, right…? Anyway, based on statistics from the past 30 years, this is one of the areas with the least amount of rain if you go in August.
User: I see. What about accommodation?
Agent: Here is a list of places that I think you might like.
User: Any opinions on this one?
Agent: According to the reviews that I can find on the web, the cabins are well equipped, the staff is nice and they even allow guests to borrow their kayaks.
60.
User: OK. Let's find a date that works for everyone.
Agent: According to your wife's calendar, her parents will be visiting you in the first week of August. School starts for the kids on the week of Aug 22. So there is a two-week window between Aug 8 and 21, assuming that I can cancel the regular weekly meetings with your PhD students.
User: That's fine. The students won't mind. Write them an email to upload their holiday plans to the group wiki, and add summer planning to the next group meeting's agenda.
[Screen: email draft. To: XXX, YYY, ZZZ. "Guys, What are your plans for the summer? Please upload your away times to the group wiki. -Kr" Send]
[Screen: agenda item "Summer planning" added]
61. In the mean'me, I called the cabin to
check availability. Their online
booking system is down at the
moment. They s'll have some cabins
available. Do you want to see them?
No, I had enough of this for today.
Mail the pictures to my wife with
some kind words.
Anything else I can do for you?
Order a water filter for my espresso
machine. I just found out that it'll
need to be replaced soon.
Darling,
You will love the place I found for us for a
vacation in August. It is by the water; at
night we will hear the waves. We will be
able to take our morning breakfasts on
the balcony, which ...
To: Wife
Send
63. UNDERSTANDING
INFORMATION NEEDS
• Natural language
conversational interface
• Anticipating information needs
• Proactive recommendations
It sounds like you would definitely
love Norway. A cabin in the
mountains maybe?
And something fun for the kids
nearby, I suppose?
I see you're wasting time away on
Facebook. Do you have time now to
talk about your holiday plans?
64. DATA
• Long-tail entities
• On-the-fly information extraction
• "Personal" knowledge base
• "Wife", "my students", "my group", "my
espresso machine", ... entities I care about
Here is a list of places that I think you
might like.
According to the reviews that I can
find on the web, ...
Order a water filter for my espresso
machine. I just found out that it'll
need to be replaced soon.
Breville BES860XL Barista
Express Espresso Machine
65. RESULT PRESENTATION
& USER INTERACTION
• Providing evidence
• "Actionable" entities
• Make booking, order item, write email, ...
• Helping the user to get things
done
• Support for task completion
... based on statistics from the past
30 years, ...
According to your wife's calendar, ...
Agenda item Summer planning added
Write them an email to upload their
holiday plans to the group wiki, and
add summer planning to the next
group meeting's agenda.
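"Actionable" entities can be pictured as entity cards that expose actions the assistant may execute on the user's behalf (book, order, email), returning evidence of what was done. The action names and dispatch mechanism below are illustrative assumptions, not an API from the talk.

```python
from typing import Callable

# Hypothetical action handlers; real ones would call booking/ordering
# services and report back to the user.
def order_item(entity: str) -> str:
    return f"Ordered: {entity}"

def make_booking(entity: str) -> str:
    return f"Booking requested: {entity}"

ACTIONS: dict[str, Callable[[str], str]] = {
    "order": order_item,
    "book": make_booking,
}

def execute(action: str, entity: str) -> str:
    """Dispatch a user-requested action on an entity, returning a
    confirmation message as evidence of task completion."""
    if action not in ACTIONS:
        return f"Sorry, I cannot '{action}' yet."
    return ACTIONS[action](entity)

print(execute("order", "water filter"))  # "Ordered: water filter"
```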
66. SUMMARY
Understanding
information needs
Data source(s)
Result presentation
& user interaction
Retrieval method
• Semantic annotations
• Anticipating info needs
• Natural language
conversational interfaces
• Long-tail entities
• Personal knowledge base
• On-the-fly information extraction
• Hybrid approaches
• Entity cards
• Actionable entities
• Support for task completion
67. ACKNOWLEDGMENTS
• Joint work with
• Faegheh Hasibi
• Jan Benetka
• Darío Garigliotti
• Kjetil Nørvåg
• Svein Erik Bratsberg