This document provides an overview of foundational research driven by text analytics. It begins with an outline covering text analytics in the big data era, information extraction systems and formalisms, foundational research challenges, and conclusions. It then discusses how text analytics has become important for applications such as semantic search, life-science mining, e-commerce, CRM/BI, and log analysis, and notes the need for database management systems and general-purpose development and management systems that help users with a wide range of skills extract value from big data. Core information extraction tasks such as named entity recognition, relation extraction, event extraction, temporal information extraction, and coreference resolution are discussed. Several formalisms for information extraction are presented, including X
"Searching for Meaning: The Hidden Structure in Unstructured Data". Presentation by Trey Grainger at the Southern Data Science Conference (SDSC) 2018. Covers linguistic theory, application in search and information retrieval, and knowledge graph and ontology learning methods for automatically deriving contextualized meaning from unstructured (free text) content.
Reflected Intelligence: Real World AI in Digital Transformation, by Trey Grainger
The goal of most digital transformations is to create competitive advantage by enhancing customer experience and employee success, so giving these stakeholders the ability to find the right information at their moment of need is paramount. Employees and customers increasingly expect an intuitive, interactive experience where they can simply type or speak their questions or keywords into a search box, their intent will be understood, and the best answers and content are then immediately presented.
Providing this compelling experience, however, requires a deep understanding of your content, your unique business domain, and the collective and personalized needs of each of your users. Modern artificial intelligence (AI) approaches are able to continuously learn from both your content and the ongoing stream of user interactions with your applications, and to automatically reflect back that learned intelligence in order to instantly and scalably deliver contextually-relevant answers to employees and customers.
In this talk, we'll discuss how AI is currently being deployed across the Fortune 1000 to accomplish these goals, both in the digital workplace (helping employees more efficiently get answers and make decisions) and in digital commerce (understanding customer intent and connecting them with the best information and products). We'll separate fact from fiction as we break down the hype around AI and show how it is being practically implemented today to power many real-world digital transformations for the next generation of employees and customers.
Natural Language Search with Knowledge Graphs (Chicago Meetup), by Trey Grainger
To interpret most natural language queries optimally, it's important to develop a highly nuanced, contextual interpretation of the domain-specific phrases, entities, commands, and relationships represented or implied within the query and within your domain.
In this talk, we'll walk through such a search system powered by Solr's Text Tagger and Semantic Knowledge Graph. We'll have fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "best bbq near activate" into:
{!func}mul(min(popularity,1),100) bbq^0.91032 ribs^0.65674 brisket^0.63386 doc_type:"restaurant" {!geofilt d=50 sfield="coordinates_pt" pt="38.916120,-77.045220"}
We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding like this within your search engine.
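As a rough illustration of how a rewritten query like the one above could be submitted, here is a minimal sketch using Solr's standard /select handler; the host, the "restaurants" collection name, and field names are assumptions for illustration, not part of the talk:

import requests  # assumes a local Solr instance with a hypothetical "restaurants" collection

params = {
    "q": '{!func}mul(min(popularity,1),100) bbq^0.91032 ribs^0.65674 brisket^0.63386',
    # Filter queries: restrict to restaurant documents within 50 km of the point
    "fq": ['doc_type:"restaurant"',
           '{!geofilt d=50 sfield=coordinates_pt pt=38.916120,-77.045220}'],
}
resp = requests.get("http://localhost:8983/solr/restaurants/select", params=params)
print(resp.json()["response"]["numFound"])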
South Big Data Hub: Text Data Analysis Panel, by Trey Grainger
Slides from Trey's opening presentation for the South Big Data Hub's Text Data Analysis Panel on December 8th, 2016. Trey provided a quick introduction to Apache Solr, described how companies are using Solr to power relevant search in industry, and offered a glimpse of where the industry is heading with regard to implementing more intelligent and relevant semantic search.
Semantic Web in Action: Ontology-driven information search, integration and analysis, by Amit Sheth
Amit Sheth's Keynote talk given at: “Semantic Web in Action: Ontology-driven information search, integration and analysis,” Net Object Days 2003 and MATES03, Erfurt, Germany, September 23, 2003. http://knoesis.org
Note: slides 51-55 have audio.
This document provides an overview of information retrieval models. It begins with definitions of information retrieval and how it differs from data retrieval. It then discusses the retrieval process and logical representations of documents. A taxonomy of IR models is presented including classic, structured, and browsing models. Boolean, vector, and probabilistic models are explained as examples of classic models. The document concludes with descriptions of ad-hoc retrieval and filtering tasks and formal characteristics of IR models.
The Relevance of the Apache Solr Semantic Knowledge Graph, by Trey Grainger
The Semantic Knowledge Graph is an Apache Solr plugin that can be used to discover and rank the relationships between any arbitrary queries or terms within the search index. It is a relevancy Swiss Army knife, able to discover related terms and concepts, disambiguate different meanings of terms given their context, clean up noise in datasets, discover previously unknown relationships between entities across documents and fields, rank lists of keywords based upon conceptual cohesion to reduce noise, summarize documents by extracting their most significant terms, generate recommendations and personalized search, and power numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. This talk will walk you through how to set up and use this plugin in concert with other open source tools (a probabilistic query parser, SolrTextTagger for entity extraction) to parse, interpret, and model the true intent of user searches far more accurately than traditional keyword-based approaches.
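For readers who want to try this themselves: the Semantic Knowledge Graph capability was later contributed to Apache Solr as the relatedness() aggregation of the JSON Facet API (Solr 7.4+). A minimal sketch follows; the collection name "mycoll" and field name "body" are placeholders:

import requests  # assumes a local Solr 7.4+ instance; names below are placeholders

payload = {
    "query": 'body:"data science"',
    "limit": 0,
    # Foreground: docs matching the query; background: the whole index
    "params": {"fore": 'body:"data science"', "back": "*:*"},
    "facet": {
        "related_terms": {
            "type": "terms", "field": "body",
            "sort": {"r": "desc"},
            "facet": {"r": "relatedness($fore,$back)"},
        }
    },
}
resp = requests.post("http://localhost:8983/solr/mycoll/query", json=payload)
print(resp.json()["facets"]["related_terms"]["buckets"][:5])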
"Searching for Meaning: The Hidden Structure in Unstructured Data". Presentation by Trey Grainger at the Southern Data Science Conference (SDSC) 2018. Covers linguistic theory, application in search and information retrieval, and knowledge graph and ontology learning methods for automatically deriving contextualized meaning from unstructured (free text) content.
Reflected Intelligence: Real world AI in Digital TransformationTrey Grainger
The goal of most digital transformations is to create competitive advantage by enhancing customer experience and employee success, so giving these stakeholders the ability to find the right information at their moment of need is paramount. Employees and customers increasingly expect an intuitive, interactive experience where they can simply type or speak their questions or keywords into a search box, their intent will be understood, and the best answers and content are then immediately presented.
Providing this compelling experience, however, requires a deep understanding of your content, your unique business domain, and the collective and personalized needs of each of your users. Modern artificial intelligence (AI) approaches are able to continuously learn from both your content and the ongoing stream of user interactions with your applications, and to automatically reflect back that learned intelligence in order to instantly and scalably deliver contextually-relevant answers to employees and customers.
In this talk, we'll discuss how AI is currently being deployed across the Fortune 1000 to accomplish these goals, both in the digital workplace (helping employees more efficiently get answers and make decisions) and in digital commerce (understanding customer intent and connecting them with the best information and products). We'll separate fact from fiction as we break down the hype around AI and show how it is being practically implemented today to power many real-world digital transformations for the next generation of employees and customers.
Natural Language Search with Knowledge Graphs (Chicago Meetup)Trey Grainger
To optimally interpret most natural language queries, its important to understand a highly-nuanced, contextual interpretation of the domain-specific phrases, entities, commands, and relationships represented or implied within the search and within your domain.
In this talk, we'll walk through such a search system powered by Solr's Text Tagger and Semantic Knowledge graph. We'll have fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "best bbq near activate" into:
{!func}mul(min(popularity,1),100) bbq^0.91032 ribs^0.65674 brisket^0.63386 doc_type:"restaurant" {!geofilt d=50 sfield="coordinates_pt" pt="38.916120,-77.045220"}
We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding like this within your search engine.
South Big Data Hub: Text Data Analysis PanelTrey Grainger
Slides from Trey's opening presentation for the South Big Data Hub's Text Data Analysis Panel on December 8th, 2016. Trey provided a quick introduction to Apache Solr, described how companies are using Solr to power relevant search in industry, and provided a glimpse on where the industry is heading with regard to implementing more intelligent and relevant semantic search.
Semantic Web in Action: Ontology-driven information search, integration and a...Amit Sheth
Amit Sheth's Keynote talk given at: “Semantic Web in Action: Ontology-driven information search, integration and analysis,” Net Object Days 2003 and MATES03, Erfurt, Germany, September 23, 2003. http://knoesis.org
Note: slides 51-55 have audio.
This document provides an overview of information retrieval models. It begins with definitions of information retrieval and how it differs from data retrieval. It then discusses the retrieval process and logical representations of documents. A taxonomy of IR models is presented including classic, structured, and browsing models. Boolean, vector, and probabilistic models are explained as examples of classic models. The document concludes with descriptions of ad-hoc retrieval and filtering tasks and formal characteristics of IR models.
The Relevance of the Apache Solr Semantic Knowledge GraphTrey Grainger
The Semantic Knowledge Graph is an Apache Solr plugin that can be used to discover and rank the relationships between any arbitrary queries or terms within the search index. It is a relevancy swiss army knife, able to discover related terms and concepts, disambiguate different meanings of terms given their context, cleanup noise in datasets, discover previously unknown relationships between entities across documents and fields, rank lists of keywords based upon conceptual cohesion to reduce noise, summarize documents by extracting their most significant terms, generate recommendations and personalized search, and power numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. This talk will walk you through how to setup and use this plugin in concert with other open source tools (probabilistic query parser, SolrTextTagger for entity extraction) to parse, interpret, and much more correctly model the true intent of user searches than traditional keyword-based search approaches.
Konsep Dasar Information Retrieval (Basic Concepts of Information Retrieval), by Edi Faizal (EdiFaizal2)
This document discusses key concepts in information retrieval including the differences between information retrieval, recommender systems, and search engines. It also covers different types of information retrieval models such as set theoretic, algebraic, probabilistic, classical, non-classical, and alternative models. The document will next cover topics related to preparing information retrieval systems like crawling, indexing, natural language processing, and text representation.
The document provides an overview of semantic technologies and discusses their increasing mainstream adoption. It notes that Microsoft purchased Powerset in 2008, Apple purchased Siri in 2010, and Google bought Metaweb in 2010 and released semantic search in 2013. It discusses how semantic technologies allow for interoperability through shared representations and reasoning. Examples are given of early semantic search applications from 1999-2002 and an operational semantic electronic medical record application deployed in 2006.
This document provides an overview of an information retrieval course. The course will cover topics related to information retrieval models, techniques, and systems. Students will complete exams, assignments, and a major project to build a search engine using both text-based and semantic retrieval techniques. The document defines key concepts in information retrieval and discusses different types of information retrieval systems and techniques.
The document discusses information retrieval models. It describes the Boolean retrieval model, which represents documents and queries as sets of terms combined with Boolean operators. Documents are retrieved if they satisfy the Boolean query, but there is no ranking of results. The Boolean model has limitations including difficulty expressing complex queries, controlling result size, and ranking results. It works best for simple, precise queries when users know exactly what they are searching for.
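A toy sketch of the Boolean model described above, using Python sets as the posting lists of an inverted index (the documents and query are invented for illustration):

# Build a tiny inverted index: term -> set of document ids
docs = {
    1: "boolean retrieval with set operators",
    2: "probabilistic retrieval models rank results",
    3: "boolean queries give no ranking",
}
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Query "boolean AND retrieval NOT probabilistic" as set operations
hits = (index.get("boolean", set())
        & index.get("retrieval", set())
        - index.get("probabilistic", set()))
print(sorted(hits))  # a document either satisfies the query or not; no ranking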
This document describes research on accessing and documenting relational databases through OWL ontologies. It introduces the topic and outlines the key contributions: a general approach for annotating data sources with ontologies, an extension of the Relational.OWL ontology to model relational databases, automatic extraction of ontologies from relational schemas, and applications of the framework. The paper presents an infrastructure for ontology extraction, including a data model ontology (DMO) to represent relational structure, a data source ontology (DSO) extracted from schemas, and a schema design ontology (SDO) that maps the DSO to the DMO. It also discusses query answering by rewriting SPARQL queries to SQL using the generated ontologies.
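As a purely illustrative example of the kind of rewriting such a framework automates (the prefix, schema, and mapping here are hypothetical, not taken from the paper):

# A SPARQL query over the extracted ontology...
sparql = """
PREFIX dso: <http://example.org/dso#>
SELECT ?name WHERE {
  ?p a dso:Person .
  ?p dso:name ?name .
}
"""
# ...could be rewritten to SQL over the underlying relational schema,
# assuming dso:Person maps to table Person and dso:name to column name:
sql = "SELECT name FROM Person;"
print(sparql.strip(), "\n-- rewrites to -->\n", sql)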
Recent Trends in Big Data Research
Big data research is becoming increasingly popular as more organizations recognize the competitive advantage of data analytics. However, big data research requires a broad set of skills that many organizations currently lack. As data volumes continue to grow exponentially from sources like IoT, making sense of big data will require integrating data from multiple sources and automating data collection and analysis. Successful big data research will depend on taking an end-to-end approach that considers all aspects of the data lifecycle, from collection to insights.
Managing Metadata for Science and Technology Studies: the RISIS case, by Rinke Hoekstra
Presentation of our paper at the WHISE workshop at ESWC 2016 on requirements for metadata over non-public datasets for the science & technology studies field.
The Apache Solr Semantic Knowledge Graph, by Trey Grainger
What if, instead of a query returning documents, you could return the keywords most related to the query: i.e., given a search for "data science", get back results like "machine learning", "predictive modeling", "artificial neural networks", etc.? Solr's Semantic Knowledge Graph does just that. It leverages the inverted index to automatically model the significance of relationships between every term in the inverted index (even across multiple fields), allowing real-time traversal and ranking of any relationship within your documents. Use cases for the Semantic Knowledge Graph include disambiguation of multiple meanings of terms (does "driver" mean truck driver, printer driver, a type of golf club, etc.), searching on vectors of related keywords to form a conceptual search (versus just a text match), powering recommendation algorithms, ranking lists of keywords based upon conceptual cohesion to reduce noise, summarizing documents by extracting their most significant terms, and numerous other applications involving anomaly detection, significance/relationship discovery, and semantic search. In this talk, we'll do a deep dive into the internals of how the Semantic Knowledge Graph works and walk you through getting up and running with an example dataset to explore the meaningful relationships hidden within your data.
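A simplified, hypothetical sketch of the kind of scoring involved internally: compare how often a term occurs in the foreground set of documents matching the query against its background rate across the whole index, in the style of a one-proportion z-score (the plugin's actual formula may differ in detail):

import math

def relatedness(fg_docs_with_term, fg_size, bg_docs_with_term, bg_size):
    p_bg = bg_docs_with_term / bg_size      # background probability of the term
    expected = fg_size * p_bg               # expected count in the foreground set
    sigma = math.sqrt(fg_size * p_bg * (1 - p_bg)) or 1.0
    return (fg_docs_with_term - expected) / sigma

# e.g. "machine learning" in 90 of 200 "data science" docs vs 500 of 100,000 overall
print(relatedness(90, 200, 500, 100_000))  # large positive score = strongly related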
Data science has taken the world by storm due to the rising need for web crawling and data acquisition to drive advancements in business intelligence and various technologies. We have compiled a list of the top 20 renowned data scientists who have taken quantum leaps in their fields with data science and are changing how we see data on a day-to-day basis.
This document provides a full syllabus with questions and answers related to the course "Information Retrieval" including definitions of key concepts, the historical development of the field, comparisons between information retrieval and web search, applications of IR, components of an IR system, and issues in IR systems. It also lists examples of open source search frameworks and performance measures for search engines.
Information retrieval is concerned with searching for documents and metadata about documents. Documents contain information to be retrieved. There is overlap between terms like data retrieval, document retrieval, information retrieval, and text retrieval. Automated information retrieval systems are used to reduce information overload. Libraries and universities use IR systems to provide access to materials. Web search engines are a prominent example of IR applications. The idea of using computers for information retrieval was popularized in 1945. Early automated systems emerged in the 1950s and large-scale systems in the 1970s such as the Lockheed Dialog system. Many measures exist for evaluating IR system performance including precision, recall, and precision-recall curves.
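For concreteness, the precision and recall measures mentioned above reduce to a few lines over the sets of retrieved and relevant documents; a minimal sketch:

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant; 3 of the 6 relevant ones were found
print(precision_recall([1, 2, 3, 4], [2, 3, 4, 5, 6, 7]))  # (0.75, 0.5)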
The document provides an overview of data mining concepts and techniques. It discusses what data mining is, the data mining process, different types of data mining techniques including characterization, association, classification, clustering and outlier analysis. It also covers major issues in data mining such as methodology, performance, handling different data types, and applications.
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”, by VOGIN-academie
Smartlogic provides semantic search and content intelligence solutions to unlock business value from unstructured content. Their solution, Semaphore, uses natural language processing and machine learning to automatically enrich content with metadata, extract entities and facts, and categorize content according to customizable semantic models or ontologies. This helps organizations more effectively search, discover, and leverage information across diverse content sources. Semaphore delivers enhanced search capabilities, automated categorization, and tools to build and manage semantic models collaboratively. Customers report benefits such as reduced time spent searching, lower classification costs, and reduced risk of non-compliance by making more information accessible.
The diversity and complexity of content available on the web have dramatically increased in recent years. Multimedia content such as images, videos, maps, and voice recordings is published more often than before. Document genres have also diversified: for instance, news, blogs, FAQs, and wikis. These diversified information sources are often dealt with separately. For example, in web search, users have to switch between search verticals to access different sources. Recently, there has been growing interest in finding effective ways to aggregate these information sources so as to hide the complexity of the information spaces from users searching for relevant information. For example, so-called aggregated search, investigated by the major search engine companies, provides search results from several sources in a single result page. Aggregation itself is not a new paradigm; for instance, aggregate operators are common in database technology.
This talk presents the challenges faced by the likes of web search engines and digital libraries in providing the means to aggregate information from several complex information spaces in a way that helps users in their information-seeking tasks. It also discusses how other disciplines, including databases, artificial intelligence, and cognitive science, can be brought into building effective and efficient aggregated search systems.
This curriculum vitae summarizes the qualifications and experience of Dr. Jie Bao. He is currently a research associate at Rensselaer Polytechnic Institute, a research affiliate at MIT, and a visiting scientist at Raytheon BBN Technologies. He received his Ph.D. in computer science from Iowa State University in 2007. His research focuses on areas including semantic web, linked data, description logics, and ontology engineering. He has over 50 publications and has served on numerous conference committees.
This document provides an overview of a course on data warehousing, filtering, and mining. The course is being taught in Fall 2004 at Temple University. The document includes the course syllabus which outlines topics like data warehousing, OLAP technology, data preprocessing, mining association rules, classification, cluster analysis, and mining complex data types. Grading will be based on assignments, quizzes, a presentation, individual project, and final exam. The document also provides introductory material on data mining including definitions and examples.
This document provides information about a computational intelligence and soft computing course including the instructor's contact information, class times, required text, and an overview of upcoming lectures on data mining with neural networks. It then discusses key issues in data mining such as theory, methods/algorithms, processes, applications, and tools/techniques. Several example data mining projects are also summarized along with homework and exam due dates for the course.
Post 1: What is text analytics? How does it differ from text mining? (.docx), by stilliegeorgiana
Post 1:
What is text analytics? How does it differ from text mining?
Text analytics is the application of statistical and machine learning techniques to predict, prescribe, or infer information from text-mined data, while text mining is a tool that helps get the data cleaned up.
Differences between Text Mining and Text Analytics:
• Text Mining and Text Analytics solve the same problems, but use different techniques and are complementary ways to automatically extract meaning from text.
• Text analytics developed within the field of computational linguistics. It encodes human understanding into a series of linguistic rules. Rules generated by humans are high in precision, but they do not automatically adapt and are usually fragile when tried in new situations.
• Text mining is a newer discipline arising out of the fields of statistics, data mining, and machine learning. Its strength is the ability to inductively create models from collections of historical data. Because statistical models are learned from training data, they are adaptive and can identify “unknown unknowns”, leading to better recall. Still, they can be prone to missing things that would seem obvious to a human, as sketched in the example after this list.
• Text analytics and text mining approaches have essentially equivalent performance. Text analytics requires an expert linguist to produce complex rule sets, whereas text mining requires the analyst to hand-label cases with outcomes or classes to create training data.
• Due to their different perspectives and strengths, combining text analytics with text mining often leads to better performance than either approach alone.
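To make the contrast in the bullets above concrete, here is a minimal, hypothetical sketch: a hand-written linguistic rule (precise but brittle) next to a statistical classifier induced from hand-labeled training data (adaptive, but it needs labels); the texts and labels are invented:

import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Rule-based: a human-authored pattern for prize amounts; high precision,
# but it will not adapt to phrasings it was never written for.
rule = re.compile(r"\$\d[\d,]*(?:\.\d+)?\s*million")
print(rule.findall("Watson received the first prize of $1 million."))

# Statistical: a model learned from hand-labeled cases, as described above.
texts = ["prize money awarded", "quiz show victory",
         "server hardware specs", "cpu and memory upgrade"]
labels = ["contest", "contest", "hardware", "hardware"]
vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(texts), labels)
print(model.predict(vec.transform(["the show awarded prize money"])))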
2. What technologies were used in building Watson (both hardware and software)?
Watson is an extraordinary computer system (a novel combination of advanced hardware and software) designed to answer questions posed in natural human language. It was developed in IBM's DeepQA project by a research team led by principal investigator David Ferrucci and named after IBM's first CEO, industrialist Thomas J. Watson. The system was specifically developed to answer questions on the quiz show Jeopardy!, and in 2011 Watson competed on Jeopardy! against former winners Brad Rutter and Ken Jennings.
Watson received the first prize of $1 million. The goal was to advance computer science by exploring new ways for computer technology to affect science, business, and society. IBM undertook a challenge to build a computer system that could compete at the human-champion level in real time on the American TV quiz show Jeopardy! The extent of the challenge in ...
Department of Commerce App Challenge: Big Data Dashboards, by Brand Niemann
The document summarizes Dr. Brand Niemann's presentation at the 2012 International Open Government Data Conference. It discusses open data principles and provides an example using EPA data. It also describes Niemann's beautiful spreadsheet dashboard for EPA metadata and APIs. Finally, it outlines Niemann's data science analytics approach for the conference, including knowledge bases, data catalog, and using business intelligence tools to analyze linked open government data.
1) DRSI's Synthesys technology provides tools for analyzing large amounts of unstructured data through natural language processing, entity extraction, and relationship analysis.
2) It ingests both structured and unstructured data, performs analytics to identify concepts, entities, and relationships, and stores this information in a Knowledge Base.
3) The Knowledge Base allows powerful search and retrieval of entities and relationships. Synthesys also provides link analysis, concept resolution, and other functions to analyze relationships within and across data.
The document summarizes a research paper on DBLP Search Support Engine (SSE), a system that aims to provide intelligent and personalized search beyond traditional search engines. It extracts users' research interests based on publication frequency and recency using interest retention models. The system represents users and their interests using RDF and provides additional functionalities like query refinement, domain analysis and tracking based on users' interests. Future work includes improving the interest prediction model and providing a unified architecture for different system functions.
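The paper's interest retention models are not reproduced here, but a hypothetical recency-weighted score of the kind described (interest in a topic decays when it stops appearing in recent publications) could look like this; the retention factor and data are invented:

def interest_score(term, publications, retention=0.8, current_year=2014):
    # Each occurrence of the term contributes less the older the publication is.
    return sum(retention ** (current_year - year)
               for topic, year in publications if topic == term)

pubs = [("information extraction", 2013), ("information extraction", 2009),
        ("semantic web", 2005)]
print(interest_score("information extraction", pubs))  # recent work dominates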
Data and Information Integration: Information Extraction, by IJMER
Information extraction is generally concerned with locating various items in a document, whether a textual or a web document. This paper is concerned with the methodologies and applications of information extraction. The field plays a very important role in the natural language processing community. The architecture of an information extraction system, which acts as the base for all languages and fields, is also discussed along with its components. Information is hidden in the large volume of web pages, and it is therefore necessary to extract useful information from web content, a task called information extraction. In information extraction, given a sequence of instances, we identify and pull out a sub-sequence of the input that represents the information we are interested in.
Manual data extraction from semi-structured web pages is a difficult task. This paper focuses on the study of various data extraction techniques, including web data extraction techniques. In recent years there has been a rapid expansion of activity in the information extraction area, and many methods have been proposed for automating the extraction process. We survey various web data extraction tools and introduce several real-world applications of information extraction, discussing the role information extraction plays in different fields. Current challenges faced by available information extraction techniques are briefly discussed, along with future work building on current research.
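The last sentence of the previous summary is easy to make concrete: given an input string, an extractor identifies a sub-sequence (with its offsets) that carries the information of interest. A minimal, hypothetical date extractor:

import re

text = "The conference was held in Erfurt on 23 September 2003."
# Identify and pull out the sub-sequence representing a date
pattern = (r"\d{1,2} (?:January|February|March|April|May|June|July|"
           r"August|September|October|November|December) \d{4}")
match = re.search(pattern, text)
if match:
    print(match.group(), match.span())  # the extracted span and its character offsets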
Here are some science-related events from EventKG that took place in Lyon:
- 1921: "À Lyon, fusion de la Société de médecine et de la Société des sciences médicales" (In Lyon, merger of the Medical Society and the Society of Medical Sciences)
- 1987: "The International Astronomical Union organizes its 24th General Assembly in Lyon"
- 1988: "The International Astronomical Union organizes its 25th General Assembly in Lyon"
- 2009: "The International Astronomical Union organizes its 26th General Assembly in Lyon"
- 2015: "The International Astronomical Union organizes its 29th General Assembly in Lyon"
Databases have been used for over 40 years to organize information in a variety of contexts like inventory, class schedules, and personal records. Relational databases remain popular today despite attempts to replace them with object-oriented databases. Cloud computing and big data have further transformed databases by allowing extremely large datasets to be analyzed for trends and patterns. Modern databases can provide targeted recommendations and offers by analyzing individual user information and behaviors.
Data Science - An Emerging Stream of Science with its Spreading Reach & Impact, by Dr. Sunil Kr. Pandey
This is my presentation on the topic "Data Science - An Emerging Stream of Science with its Spreading Reach & Impact". I have compiled statistics and data from different sources. It may be useful for students and anyone interested in this field of study.
A New Paradigm on Analytic-Driven Information and Automation V2.pdf, by ArmyTrilidiaDevegaSK
The document proposes an end-to-end methodology for developing analytic-driven information and automation systems based on big data, data science, and artificial intelligence. The methodology involves six steps: 1) collecting data from multiple sources, 2) preprocessing the data, 3) extracting features from the data, 4) clustering and interpreting the data, 5) designing applications, and 6) implementing and evaluating the systems. It then provides an example of applying this methodology to develop an early warning system for monitoring higher education institutions in Indonesia. The system would collect data from various sources, analyze it using machine learning techniques, and predict and prescribe interventions for student groups.
Rule-based Information Extraction for Airplane Crashes Reports, by CSCJournals
Over the last two decades, the internet has gained widespread use in various aspects of everyday life. The amount of data generated in both structured and unstructured forms has increased rapidly, posing a number of challenges. Unstructured data are hard to manage, assess, and analyse for decision making. Extracting information from these large volumes of data is time-consuming and requires complex analysis. Information extraction (IE) technology is part of a text-mining framework for extracting useful knowledge for further analysis.
Various competitions, conferences, and research projects have accelerated the development of IE. This project presents the main aspects of the information extraction field in detail, focusing on a specific domain: airplane crash reports. A set of reports from the 1001 Crash website was used to perform extraction tasks such as identifying the crash site, crash date and time, departure, destination, etc. Common structures and textual expressions were considered in designing the extraction rules.
The evaluation framework used to examine the system's performance was executed on both working and test texts. It shows that the system is more accurate at extracting entities and relations than events. Generally, the good results reflect the high quality and good design of the extraction rules. It can be concluded that the rule-based approach has proved its efficiency in delivering reliable results. However, this approach does require intensive work and a cyclical process of rule testing and modification.
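The paper's actual rule set is not reproduced here, but rules of this kind are easy to sketch; the report sentence and patterns below are hypothetical illustrations of the approach, keyed to common textual expressions:

import re

report = ("The aircraft departed Paris for Algiers and crashed "
          "near Perpignan on 27 November 2008 at 16:46.")

# Hypothetical extraction rules in the spirit of the paper's approach
rules = {
    "departure":   r"departed ([A-Z][a-zA-Z ]+?) for",
    "destination": r"for ([A-Z][a-zA-Z ]+?) and",
    "crash_site":  r"crashed near ([A-Z][a-zA-Z ]+?) on",
    "crash_date":  r"on (\d{1,2} \w+ \d{4})",
    "crash_time":  r"at (\d{1,2}:\d{2})",
}
for slot, pattern in rules.items():
    m = re.search(pattern, report)
    print(slot, "->", m.group(1) if m else None)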
Introduction to question answering for linked data & big data, by Andre Freitas
This document discusses question answering (QA) systems in the context of big data and heterogeneous data scenarios. It outlines the motivation and challenges for developing natural language interfaces for databases. The document covers the basic concepts and taxonomy of QA systems, including question types, answer types, data sources, and domains. It also discusses the anatomy and components of a typical QA system.
Search Solutions 2011: Successful Enterprise Search By Design, by Marianne Sweeny
When your colleagues say they want Google, they don’t mean the Google Search Appliance. They mean the Google Search user experience: pervasive, expedient and delivering the information that they need. Successful enterprise search does not start with the application features, is not part of the information architecture, does not come from a controlled vocabulary and does not emerge on its own from the developers. It requires enterprise-specific data mining, enterprise-specific user-centered design and fine tuning to turn “search sucks” into search success within the firewall. This presentation looks at action items, tools and deliverables for Discovery, Planning, Design and Post Launch phases of an enterprise search deployment.
The document provides an introduction to information retrieval, including its history, key concepts, and challenges. It discusses how information retrieval aims to retrieve relevant documents from a collection to satisfy a user's information need. The main challenge in information retrieval is determining relevance, as relevance depends on personal assessment, task, context, time, location, and device. Three main issues in information retrieval are determining relevance, representing documents and queries, and developing effective retrieval models and algorithms.
The document provides an introduction to information retrieval, including its history, key concepts, and challenges. It discusses how information retrieval aims to retrieve relevant documents from a collection to satisfy a user's information need. The main challenge in information retrieval is determining relevance, as relevance depends on personal assessment and can change based on context, time, location, and device. The document outlines the major issues and developments in the field over time from the 1950s to present day.
This document provides an overview of text mining and web mining. It defines data mining and describes the common data mining tasks of classification, clustering, association rule mining and sequential pattern mining. It then discusses text mining, defining it as the process of analyzing unstructured text data to extract meaningful information and structure. The document outlines the seven practice areas of text mining as search/information retrieval, document clustering, document classification, web mining, information extraction, natural language processing, and concept extraction. It provides brief descriptions of the problems addressed within each practice area.
Bringing Machine Learning and Knowledge Graphs Together
Six Core Aspects of Semantic AI:
- Hybrid Approach
- Data Quality
- Data as a Service
- Structured Data Meets Text
- No Black-box
- Towards Self-optimizing Machines
The case discusses Edward Snowden, a former NSA contractor who leaked classified information about mass surveillance programs to journalists in 2013. He contacted journalists Glenn Greenwald and Laura Poitras to disclose secret NSA documents, meeting them in Hong Kong. The documents revealed that the NSA was collecting phone records and metadata of millions of Americans under Section 215 of the Patriot Act. Snowden's leaks sparked a major debate about government surveillance and privacy worldwide.
Machine Learned Relevance at A Large Scale Search EngineSalford Systems
The document discusses machine learned relevance at a large scale search engine. It provides biographies of the two authors who have extensive experience in machine learning and search engines. It then outlines the topics to be covered, including an introduction to machine learned ranking for search, relevance evaluation methodologies, data collection and metrics, the Quixey search engine system, model training approaches, and conclusions.
Similar to Text Analytics - JCC2014 Kimelfeld (20)
This document provides a list of topics related to information systems, including:
- Bibliographic information systems such as Koha.
- Search engines such as Apache Solr.
- Extract, transform, and load (ETL) tools such as Talend Open Studio.
- Business intelligence tools such as Metabase.
- Data mining tools such as RapidMiner.
- Collaboration systems such as Citadel/UX.
The document presents an introduction to the concepts of human needs, wants, demands, products, and services. It then defines information services as the activities that information organizations carry out to satisfy users' information demands. Finally, it highlights some keys to the success of information services, such as not assuming users' needs, staying in contact with users, and investing in research on their needs.
This document deals with knowledge management and collaborative tools. It defines knowledge and traces the history of knowledge from Western and Eastern philosophies to the philosophical currents of the 20th century. It also describes knowledge management systems, intellectual capital, group work, and the integration of knowledge management systems with information systems. Finally, it presents tools for knowledge management such as wikis, blogs, and content management systems.
This document defines Business Intelligence (BI) and Big Data and describes their history, components, and trends. It explains that BI is a conceptual framework for supporting decision making through data analysis, while Big Data refers to large volumes of data of diverse types. It also summarizes the main BI techniques, such as data mining, data management, and predictive analytics, as well as leading tools such as Tableau, Power BI, and Hadoop for Big Data.
The document describes what a mobile library is, namely a motorized vehicle that transports library materials such as books, DVDs, CDs, and more to areas with little access to fixed libraries. It explains that the goals of mobile libraries include serving residents with less access, promoting equity of access to information, and providing flexible services to fluctuating populations. It also mentions examples such as bookmobiles, library trains, and more.
This document presents information on information management in library and archival science. It covers topics such as data, information, information systems, general systems theory, the dimensions of information systems, organizations, technology, news sources, business intelligence, specialized operating systems, the evolution of operating systems, data formats, library management software, and search trends related to the COVID-19 pandemic.
The document describes the book digitization process. It explains that digitization broadens access to books, enables 24-hour access, increases the number of concurrent users, and helps preserve and conserve printed books more durably. It then details the stages of the digitization process, including book selection, scanning, image editing, optical character recognition, and publication of the digital files.
This document summarizes an article on urban transit network design problems. It presents definitions of these problems and classifications of objectives, decision variables, and solution methods. The article offers a comprehensive review of the topic in order to provide an overview and enable the comparison of approaches.
This document describes an approach for automating the design of heuristics for packing problems via genetic programming. It presents a system that can automatically generate heuristics for 1-, 2-, and 3-dimensional packing problems, such as the knapsack and bin packing problems. The system uses genetic programming to explore the space of possible heuristics represented as trees. Experiments show that the automatically generated heuristics are competitive with existing heuristics.
This document describes artificial ants and their application to distributed problem solving. Artificial ants are based on the behavior of real ants, which can accomplish complex tasks as a group despite limited individual intelligence. Applications of artificial ants include route optimization, load balancing in networks, and routing. The document also presents an applet that models how ants find the shortest path between the nest and food by depositing pheromones.
This document presents a computing workshop for execution engineering students. It notes that the instructor is Pedro Guillermo Contreras Flores of the Universidad de Atacama and describes the syllabus, schedule, and assessment. It then introduces basic concepts of informatics, computing, and information technology, and briefly reviews the history of computers. Finally, it discusses topics such as the digital economy, the internet, wireless networks, and the importance of seizing the opportunities of the new digital economy.
This document presents a summary of modeling and simulation topics. It covers key definitions, the stages of building a simulation, advantages and disadvantages, and methods for generating pseudorandom numbers such as the mixed and multiplicative congruential methods. It also includes examples of simulation applications and a glossary of important terms.
This document presents an introduction to Java 3D, including its main features such as ease of use, scalable performance, and the ability to create 3D graphics for the web. It explains concepts such as the scene graph hierarchy, geometry definition, appearances, and behaviors. It also provides examples of how to view 3D models and identifies some markets and applications for Java 3D.
This document presents an introduction to supplementary programming topics, including recursion, backtracking, dynamic data structures such as records, files, and pointers, and programming styles such as imperative, object-oriented, functional, and logic programming. It also covers concepts such as problem solving, strategies for designing algorithms, modularity, and variable scope. Finally, it proposes a purchase-management exercise that uses records and modules.
The document discusses dynamic memory in C. It explains that there are two kinds of memory, static and dynamic: static memory is fixed and invariable, while dynamic memory changes according to the size of the variables. It also describes functions such as malloc(), calloc(), realloc(), and free() that allocate and release dynamic memory at runtime.
The document covers recursion. It explains that recursion means an object is defined in terms of itself. It presents recursive examples such as the natural numbers, tree structures, and the factorial function, and distinguishes between direct and indirect recursion. It describes the internal workings of recursive procedures and concepts such as the stopping condition and backtracking. Finally, it explains how to solve recursive problems such as the Towers of Hanoi and the N-queens problem.
This document covers records, files, and pointers in the C language. It explains that records allow new compound data types to be defined, files are used for data input and output, and pointers store memory addresses and allow the values of variables to be accessed and modified.
This document describes the basic functions for graphics programming in C, including how to activate graphics libraries, initialize the graphics configuration, work with text and background colors, render text on screen, and use functions such as cleardevice(), closegraph(), and graphresult() to clear, close, and check errors in the graphics configuration.
This document describes file types and how files are associated with applications and have attributes such as size and date. It explains two forms of file access: sequential, line by line, using functions such as fopen and fclose; and direct, by moving straight to a record using functions such as fseek. It also summarizes a test and workshop on C programming covering data structures, statements, and vectors/matrices.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, and faster batch ingestion.
2. Preamble
• Myself:
– Ph.D. @ HebrewU (DB uncertainty + search)
– IBM Almaden (DB theory, IR, Text Analytics)
– LogicBlox (ML in DB, Prob. Programming)
– Technion IL (Associate Prof., next year)
• This talk:
Infrastructure for text analytics
+ DB theory, formal languages, NLP, data mining,
computational complexity, …
3. Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
4. Text Analytics Matters
Some important applications are based on the
analysis of text-centric data; for example:
Semantic Search
Semantic understanding & indexing of
content to better match user's intent
Life-Science Mining
Extract knowledge bases from
scientific publications
e-Commerce
Comparison Shopping extracts &
compares inventory from online sources
CRM / BI
Monitor customer’s social-media activity
for sentiment & business leads
Log Analysis
Summarize, visualize and analyze logs
produced by machines
5. Database Management Systems
• Old news: Data management is involved!
– Data semantics, query/analysis semantics, storage,
query evaluation, indices, consistency, transactions,
backup, privacy, recovery, …
– From-scratch engineering is highly challenging
• Motivation for the concept of a general-purpose
Database Management System
– Most notably: relational model (pioneered by Edgar F.
Codd in 1969) and SQL
6. “Big Data” Phenomena
Past: proprietary data in orgs. (enterprises, governments, …) → Present: proliferation of publicly open data sources (Web, social, …)
Past: massive-data analyses incurred high machinery/personnel cost → Present: business models (cloud, crowd, open source) facilitate analyses
Past: data structured/controlled by admins, e-forms, software, … → Present: uncontrolled data from humans’ free text, heterogeneous KBs, …
Past: analyses by specialized teams of heavily trained experts → Present: analyses by a wide community featuring a wide range of skills
7. “By 2018, the United States alone could face a shortage of 140,000
to 190,000 people with deep analytical skills as well as 1.5 million
managers and analysts with the know-how to use the analysis of
big data to make effective decisions.”
“Big data: The next frontier for innovation, competition, and productivity”
McKinsey Report, May 2011
We need dev. & management systems to
facilitate value extraction from Big Data
by a wide range of users / skills
8. Core Task: Information Extraction (IE)
“Information Extraction (IE) is the name given to any process
which selectively structures and combines data which is found,
explicitly stated or implied, in one or more texts. The final
output of the extraction process varies; in every case, however, it
can be transformed so as to populate some type of database.”
J. Cowie and Y. Wilks., Handbook of
Natural Language Processing, 2000
“Information extraction is the identification, and consequent or concurrent
classification and structuring into semantic classes, of specific
information found in unstructured data sources, such as natural language
text, making the information more suitable for information processing tasks.”
M. F. Moens, Information Extraction: Algorithms
and Prospects in a Retrieval Context, 2006
In short: data-in-text (unstructured) → data-in-db (structured)
9. Popular Classes of IE Tasks
• Named Entity Recognition
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
[Annotations: Turing and Church as persons; Princeton University and Princeton as organizations]
10. Popular Classes of IE Tasks
[Annotations: an AdvisedBy relation and a WorksIn relation between the entities]
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
• Named Entity Recognition
• Relation Extraction
11. Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
[Annotation: a Graduation event with Who? and Where? roles]
• Named Entity Recognition
• Relation Extraction
• Event Extraction
12. Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
[Annotations: an Education period with Start and End, and a Graduation event with a When? role]
• Named Entity Recognition
• Relation Extraction
• Event Extraction
• Temporal IE
13. Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
[Annotations: SameEntity links, e.g., between “he” and “Turing” and between “Princeton” and “Princeton University”]
• Named Entity Recognition
• Relation Extraction
• Event Extraction
• Temporal IE
• Coreference Resolution
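Before moving on, here is a minimal, hypothetical dictionary-based NER sketch in Python over the running example; the dictionary and longest-match-first tie-breaking are illustrative assumptions, not the deck's systems:

```python
import re

# Tag persons and organizations in the running Turing example using a
# hand-built dictionary (illustrative only).
doc = ("From September 1936 to July 1938, Turing spent most of his time "
       "studying under Church at Princeton University. In June 1938, he "
       "obtained his PhD from Princeton.")

dictionary = {"Turing": "person", "Church": "person",
              "Princeton University": "organization", "Princeton": "organization"}

# Longest-match-first so "Princeton University" wins over "Princeton".
pattern = "|".join(sorted(map(re.escape, dictionary), key=len, reverse=True))
for m in re.finditer(pattern, doc):
    print(m.span(), m.group(), dictionary[m.group()])
```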
14. IE Paradigms: Rules & Statistics
[Figure: excerpt of the EMNLP 2013 survey by Chiticariu, Yunyao Li, and Frederick R. Reiss (IBM Research - Almaden): rule-based IE dominates the commercial market yet is regarded as dead-end technology in academia. Chart, “Implementations of Entity Extraction”: NLP papers (2003-2012): 3.5% rule-based, 21% hybrid, 75% machine-learning based; commercial vendors (2013): 45% / 22% / 33% across all vendors, 67% / 17% / 17% across large vendors.]
• Rules
• ML classification
• Probabilistic graphical models
• Soft logic
[Chiticariu, Li, Reiss, EMNLP’13]
• EMNLP, ACL, NAACL, 2003-2012
• 54 industrial vendors (Who’s Who in Text Analytics, 2012)
“[…] rules are effective,
interpretable, and are easy
to customize by non-experts
to cope with errors.”
Gupta & Manning, CONLL’14
15. Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
16. Xlog: Datalog for IE
• Extension of (non-recursive) Datalog
• Use case: DBLife (db research kb: dblife.cs.wisc.edu)
• Data types: string, document, span
– Focus on single-document programs
• “Procedural predicates” (p-predicates) are user-defined
functions that produce relations over spans
– Example: sentence(doc, span)
• Query-plan optimization
[Shen, Doan, Naughton, Ramakrishnan, VLDB 2007]
Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul
Otellini and the Intel board had no idea what they were in for when
the company announced it was acquiring McAfee on August 19,
2010.
Same string, different spans
Span [42,47)
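To pin down the span datatype, here is a tiny Python sketch (not Xlog or SystemT code) showing that the same string can occur at two different spans of the document; note that Python counts positions from 0 while the deck counts from 1:

```python
import re

# Spans as half-open [begin, end) intervals into a document.
doc = ("Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini "
       "and the Intel board had no idea what they were in for when the "
       "company announced it was acquiring McAfee on August 19, 2010.")

spans = [m.span() for m in re.finditer("Intel", doc)]
print(spans)                          # two distinct spans...
print({doc[i:j] for i, j in spans})   # ...but a single string: {'Intel'}
```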
17. Xlog Example
“Declarative Information Extraction using Datalog with Embedded Extraction Predicates”
[Shen, Doan, Naughton, Ramakrishnan, VLDB 2007]
[Figure: an example Xlog program combining a string regex with unary and binary regex formulas]
18. Instaread: Datalog + NLP
• Datalog syntax
– Types: string, span
• Built-in collection of p-predicates
– Various types of built-in regex formulas (unary and binary)
– Linguistic: deep parsing, coreference resolution, named-entity extraction
[Hoffmann, 2012]
19. IBM SystemT: SQL for IE
• Engine for AQL: SQL-like declarative IE lang.
– AQL = Annotation Query Language
• SystemT = AQL + Runtime + Dev. Tooling
– [Chiticariu et al., ACL 2010]: position SystemT as a
high-quality and high-efficiency IE solution
– System and IDE demos in ACL 2011, SIGMOD 2011
• Commercial product, high academic presence
– Integration of public financial records [Hernández et al., EDBT’13,
Balakrishnan et al. SIGMOD’10], NER [Chiticariu et al. EMNLP’10,
ACL’10, Nagesh et al. EMNLP’12, Roy et al. SIGMOD’13], IR [Zhu et
al. WWW’10, K et al. SIGIR’12, CIKM’12], sentiment analysis [Hu et
al., Interact’13], social media [Sindhwani et al., IBM Journal 2011]
21. Formal Framework
• Repeated concept: Extend a relational query
language with text transducers (p-predicates,
usually regex formulas)
• Research challenge: theoretical underpinnings
of this combined document/relation model
• Expressive power
– Query-plan optimization: Can we rewrite an operator via “easier”
building blocks?
– System extensions: Can we express a new operation using
existing ones, or prove impossibility?
• Next: a formal framework
– With Fagin, Reiss, Vansummeren, PODS’13, JACM
22. Terminology
Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul
Otellini and the Intel board had no idea what they were in for when
the company announced it was acquiring McAfee on August 19,
2010.
Company | CEO | CompanyCEO
[1,14) (Kaspersky Lab) | [19,36) (Eugene Kaspersky) | [1,36)
[42,47) (Intel) | [52,65) (Paul Otellini) | [42,65)
A relation over spans from the document; e.g., [52,65) is the span of “Paul Otellini”.
23. Document Spanners
Document d → relation over the spans of d
Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini and the Intel board had no idea what they were in for when the company announced it was acquiring McAfee on August 19, 2010.
x | y | z
[1,14) | [30,36) | [1,36)
[42,47) | [52,65) | [42,65)
[102,110) | [115,125) | [102,125)
Document Spanner: a function that maps every doc. (string) into a relation over the doc.’s spans
More formally:
• Finite alphabet Σ of symbols
• A spanner maps each doc. d ∈ Σ* into a relation over the spans [i,j) of d
• The relation has a fixed signature (set of attributes)
− The attributes come from an infinite domain of variables x, y, z, …
24. Spanners as Regex Formulas
• Regular expression with embedded span variables; examples:
.* x{\d\d\d\d} .*   (an ordinary regex around a span variable capturing four digits)
.* in w{Alabama | Alaska | Arizona | …} .*
(.* z{[A-Z][a-z]*, y{[A-Z][a-z]*}} .*) | …
• Restriction: each “evaluation” (parse tree) assigns one span to each variable (see [Fagin et al., PODS’13])
A representation system for spanners
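As a concrete (non-normative) illustration, a span variable can be emulated with a Python named group; spans are half-open [i,j) intervals, and finditer enumerates the possible assignments:

```python
import re

# A span variable x capturing a 4-digit year, emulated with a named group.
# Python counts positions from 0; the deck counts from 1.
formula = re.compile(r"(?P<x>\d{4})")

doc = "From September 1936 to July 1938, Turing spent most of his time at Princeton."
rows = [(m.span("x"), m.group("x")) for m in formula.finditer(doc)]
print(rows)  # [((15, 19), '1936'), ((28, 32), '1938')]
```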
25. Spanners as Datalog w/ Regex
• Non-recursive Datalog (NR-Datalog)
• Operate over a document (not a relational db)
Token(x) := [ (ε | .*_) x{[a-zA-Z]+} ( ((,|_) .*) | ε ) ]
State(x) := Token(x) , [.* x{Georgia|Virginia|Washington} .*]
Cap1st(x) := Token(x) , [.* x{[A-Z].*} .*]
CommaSp(x,y,z) := [.* z{x{.*} ,_ y{.*}} .*]
Loc(z) := CommaSp(x,y,z) , Cap1st(x) , State(y)
RETURN(x,z) := Cap1st(x) , [.* x{.*}_from_z{.*} .*] , Loc(z)   ← the query goal

Carter_from_Plains,_Georgia,_Washington_from_Westmoreland,_Virginia

x | z
[1,7) (Carter) | [13,28) (Plains,_Georgia)
[30,40) (Washington) | [46,69) (Westmoreland,_Virginia)

EDBs = Spanners! Another representation system for spanners
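The same style of evaluation can be mimicked in a few lines of Python (a sketch, not Xlog): each regex predicate yields a relation of spans, and the rule body becomes a relational join over the document:

```python
import re

doc = "Carter_from_Plains,_Georgia"

# Two "regex predicates", each producing a relation of spans (illustrative).
cap1st = [m.span() for m in re.finditer(r"[A-Z][a-z]+", doc)]                 # capitalized tokens
locs   = [m.span() for m in re.finditer(r"[A-Z][a-z]+,_[A-Z][a-z]+", doc)]    # "City,_State"

# RETURN(x, z) :- Cap1st(x), x "_from_" z, Loc(z)  -- a join over the document
answers = [(doc[a:b], doc[c:d])
           for (a, b) in cap1st for (c, d) in locs
           if doc[b:c] == "_from_"]
print(answers)  # [('Carter', 'Plains,_Georgia')]
```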
26. Spanners as Automata
[Figure: three automata over the string 10011101: an ordinary NFA; a var-stack automaton, whose accepting run interleaves variable operations such as x{ y{ } } and always closes the most recently opened variable; and a var-set automaton, whose operations name the variable being closed, e.g., x{ }x y{ }y]
• In an accepting run, each variable opens and later closes exactly once
⇒ Each accepting run defines an assignment to the variables
• Nondeterministic ⇒ multiple accepting runs ⇒ multiple tuples
Another representation system for spanners
28. Consequences
• Connections between Datalog+regex
spanners and other language formalisms
– Classic string relations [Berstel 79]
– Graph queries (CRPQs) [Cruz et al. 87]
• Extension with string equality & difference
– Expressiveness / closure properties
• Principles for cleaning inconsistencies
– Follow up work [PODS’14]
– Next…
29. Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
30. Next, we highlight three lines of foundational research that were motivated by our work on text analytics:
1. Database inconsistency w/ repair priorities
2. Frequent subgraph mining
3. Update propagation
31. Cleaning IE Inconsistencies
• Extractors may produce inconsistent results
– Data artifacts
– Developer limitations
• Rather than repairing the existing extractors, common practice is to clean (intermediate) results
– SystemT “consolidators” [Chiticariu et al. 10]
– GATE/JAPE “controls” [Cunningham 02]
– Implicit in other rule systems, e.g., WHISK [Soderland 99]
– POSIX regex disambiguation [Fowler 03]
[Example: “33 Martin Luther King Jr. Dr., SE, Atlanta, GA 30303” carries overlapping Person1, Person2, and Address1 annotations]
34. Cleaning via Prioritized Repairs
• Problem: existing policies are ad-hoc; how to
expose a language for user declaration?
• [Fagin, K, Reiss, Vansummeren 2014]: spanner
formalism for declarative cleaning
• Key: prioritized repairs [Staworko, et al. 12]
• Idea: Extend extraction programs with
– Denial constraints: which facts are in conflict?
– Priority declarations: preference between facts
• Captures SystemT, GATE, WHISK, POSIX, …
• We are now trying to improve our understanding
of prioritized repairs…
35. Prioritized Repairs: Definition
An inconsistent database instance consists of:
• Database: a collection of facts
• Denial constraints: which sets of facts cannot co-exist?
• Priority relation: a binary “is preferred to” relation
• [Arenas, Bertossi, Chomicki 99]: Inconsistent DB
represents a set of (equally likely) “repairs”
Then we can ask for the “possible” or “consistent” query answers
• [Staworko, Chomicki, Marcinkowski 12] add priorities:
• Let A and B be two consistent subsets of the database
• Say that A improves B if we can obtain A from B by a “profitable” exchange of facts (made precise later)
• A repair is a consistent subset that cannot be improved
36. Example
professor | university | city
Monica | ubiobio | Concepción
Monica | carleton | Ottawa
Jorge | uchile | Santiago
Jorge | ubiobio | Santiago
Pablo | uchile | Santiago
Violated constraints (functional dependencies):
• professor → university, city (“key constraint”)
• university → city
[Figure: the table shown twice more, each time highlighting a different maximal consistent subset, i.e., an “ordinary” repair [Arenas et al. 99]]
With tuple priorities, some repairs can be discarded [Staworko et al.]
A improves B if we obtain A from B by removing tuples and adding tuples, where each removed tuple is less preferred than some added tuple.
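For intuition, here is a brute-force Python sketch (exponential, for illustration only) that enumerates the "ordinary" repairs of this instance; adding a priority relation would then discard repairs that some other consistent subset improves:

```python
from itertools import combinations

# The conflicts encode the FDs professor -> university, city
# and university -> city from the slide's example.
facts = [
    ("Monica", "ubiobio",  "Concepción"),
    ("Monica", "carleton", "Ottawa"),
    ("Jorge",  "uchile",   "Santiago"),
    ("Jorge",  "ubiobio",  "Santiago"),
    ("Pablo",  "uchile",   "Santiago"),
]

def conflict(s, t):
    fd1 = s[0] == t[0] and (s[1], s[2]) != (t[1], t[2])  # professor -> univ, city
    fd2 = s[1] == t[1] and s[2] != t[2]                  # university -> city
    return fd1 or fd2

def consistent(sub):
    return all(not conflict(s, t) for s, t in combinations(sub, 2))

# "Ordinary" repairs [Arenas et al. 99] = maximal consistent subsets.
subsets = [frozenset(c) for r in range(len(facts) + 1)
           for c in combinations(facts, r) if consistent(c)]
repairs = [s for s in subsets if not any(s < t for t in subsets)]
for r in repairs:
    print(sorted(r))  # prints the three repairs of this instance
```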
37. Complexity of Testing Improvability
Theorem:
In the case of a single functional dependency, or two keys per relation, improvability can be tested in polynomial time.
For any other combination of FDs, the problem is NP-complete!
university | faculty | dean
UChile | Economics | Agosin
Technion | CS | Yavneh
Stanford | Law | Magill
(a relation with two keys)
Can a consistent subset be improved? Recent work (unpublished) w/ Fagin & Kolaitis
38. IE with Recurring Patterns
I want to buy my advisor a gift.
I really want to buy a gift to my advisor.
I want to buy a gift to the secretary and to my advisor.
1. Apply dependency parsing
[Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
39. IE with Recurring Patterns
I want to buy my advisor a gift.
I really want to buy a gift to my advisor.
I want to buy a gift to the secretary and to my advisor.
1. Apply dependency parsing
2. Find frequent recurring patterns
[Figure: the shared dependency pattern I → want → buy → {gift, advisor}]
[Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
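A rough sketch of these two steps in Python, assuming spaCy and its small English model are installed (simplified to counting single dependency edges rather than full subtree patterns):

```python
from collections import Counter
import spacy  # assumes spaCy and en_core_web_sm are installed

# (1) dependency-parse each sentence, (2) count recurring
# head --relation--> child patterns across the parses.
nlp = spacy.load("en_core_web_sm")
sentences = [
    "I want to buy my advisor a gift.",
    "I really want to buy a gift to my advisor.",
    "I want to buy a gift to the secretary and to my advisor.",
]

edges = Counter()
for sent in sentences:
    for tok in nlp(sent):
        if tok.dep_ != "ROOT":
            edges[(tok.head.lemma_, tok.dep_, tok.lemma_)] += 1

# Edges occurring in (nearly) every sentence approximate the shared pattern.
for edge, freq in edges.most_common(5):
    print(freq, edge)
```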
41. Complexity Study
• Naturally, there has been a lot of work on this problem
– SPIN [Huan et al. 04], MARGIN [Thomas et al. 10], …
• But little was known about the computational complexity
• Studied: impact of assumptions on comp. complexity
– Graph properties (e.g., trees, treewidth, etc.)
– Label repeatability
– Bounded #results desired
– Bounded threshold
• This work led to novel complexity results and new methodologies for mining maximal subgraphs
– [K & Kolaitis, ACM PODS’13, ACM TODS]
• Next, some complexity nuggets
42. Complexity Nuggets
• Good news: If labels do not repeat in each input
graph, then there are PTime solutions when
– The threshold is bounded; or
– Graphs are trees & few results are desired
• In general graphs w/o label repetition, you can
find 2 results in PTime
– Bad news: but finding a 3rd is NP-hard!
– Bad news: and if labels repeat and graphs are trees, then finding a 2nd is already NP-hard!
• Even for a bounded threshold
43. Improving Dictionaries w/ Feedback
[Figure: an IE flow over Web data: extractors driven by dictionaries (companies, countries, …) produce company occurrences and address occurrences from text fragments (sentences, tables, rows, …), which are joined into pairs such as “IBM, San Jose”, “IBM, Armonk”, “Apple, Cupertino”, “Yahoo!, Cupertino”; given feedback, automatically suggest a “good” fix to the IE program, where “good” = small effect on other results]
44. View Updates
• View-update problem: Translate an update on a view to
an update on the base relations
• Deletion propagation as a special case
– Update is delete(a set of view tuples)
• Motivation:
– Classic: database/view maintenance
• DB access only through views, hidden join keys, etc.
– Debugging
• [K&al.12]: deletion propagation for debugging text extractors
– Database causality [Meliou&al.10]
• Intuition: good propagation provides a good explanation of why we
have the tuples to begin with
• [Bertossi, Salimi 14]: “Unifying Causality, Diagnosis,
Repairs and View-Updates in Databases”
45. Example: File Access
Access(u,f) :– UserGroup(u,g), GroupFile(g,f)   [Cui&Widom01; Buneman&al.02]

UserGroup:
user | group
Emma | ai
Emma | db
Olivia | os
Olivia | db
Jacob | ai

GroupFile:
group | file
ai | a.txt
ai | b.txt
db | a.txt
db | b.txt
os | a.txt

Access = UserGroup ⋈ GroupFile:
user | file
Emma | a.txt
Emma | b.txt
Olivia | a.txt
Olivia | b.txt
Jacob | a.txt
Jacob | b.txt

Delete source rows s.t. Emma won’t access a.txt. But maintain maximum access permissions!
46.-47. Example: File Access (the same instance, repeated across two animation steps)
Decision variant is NP-complete [Buneman et al. 02]
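The instance above is small enough to solve by exhaustive search. Below is a brute-force Python sketch of minimum-side-effect deletion propagation for this example (purely illustrative; as the slide notes, the decision variant is NP-complete in general):

```python
from itertools import combinations

# Access(u,f) :- UserGroup(u,g), GroupFile(g,f); 10 source tuples -> 1024 subsets.
user_group = {("Emma", "ai"), ("Emma", "db"), ("Olivia", "os"),
              ("Olivia", "db"), ("Jacob", "ai")}
group_file = {("ai", "a.txt"), ("ai", "b.txt"), ("db", "a.txt"),
              ("db", "b.txt"), ("os", "a.txt")}

def access(ug, gf):
    # The join UserGroup(u,g), GroupFile(g,f), projected to (u,f).
    return {(u, f) for (u, g) in ug for (g2, f) in gf if g == g2}

target = ("Emma", "a.txt")           # the view tuple to delete
full_view = access(user_group, group_file)

sources = sorted(user_group | group_file)
best = None
for r in range(len(sources) + 1):
    for removed in combinations(sources, r):
        view = access(user_group - set(removed), group_file - set(removed))
        if target in view:
            continue
        side_effect = len((full_view - {target}) - view)
        if best is None or side_effect < best[0]:
            best = (side_effect, removed)

print("side effect:", best[0])  # 0 here, e.g., removing (Emma, ai) and (db, a.txt)
print("removed:", best[1])
```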
48. Trichotomy in Complexity
We have established a precise (easily testable) criterion that partitions all cases into 3 categories:
1. The problem is solvable in PTime, and even via a
straightforward algorithm [Buneman et al. 2001]
2. The problem is NP-hard, but constant-ratio
approximable in PTime (ILP relaxation)
3. The problem is inapproximable for every ratio
Fix a schema (w/ fds) and a CQ w/o self joins
What is the complexity of finding a solution with a minimal side effect?
[K, Vondrak, Williams, Woodruff, PODS11, PODS12, TODS12, VLDB14]
49. Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
50. Summary
• Text analytics & IE
• Rule systems for IE
• A formal framework for rules, relating IE to
traditional DB concepts such as Datalog
• Research directions motivated by IE
– Prioritized repairs
– Graph mining
– Update propagation
51. Outlook: DB w/ Deep Text Support
• We need a uniform & elegant data/query model to
combine structured data & text; usefulness for querying
both text and relations
• We need a principled, simple & transparent probability
model + effective quality + practical execution cost
• We need to balance automation and control: from full specification by experts to feature generation for non-experts
– Maximally realize the potential of every developer!
– LogicBlox is working on incorporating ML in Datalog!
53. Room for Both
[Figure: side-by-side stacks of a statistical solution (feature engineering; model space, runtime; cleaning + post-processing) and a rule system (building blocks, e.g., dictionaries, NER; cleaning + post-processing)]
“What doesn’t work: Anything requiring high
precision and full automation”
Feldman & Ungar, KDD’08 tutorial on text mining
54. String DB, Spanners, Interval Algebra
[Figure: example instances: a string relation {(Kaspersky, Kaspersky), (Intel, Otellini), (IBM, Rometty)}; a span relation and an interval relation, each {([10,20), [16,26)), ([32,37), [50,58)), ([105,108), [121,128))}]
String Databases: atomic value is a string; join by string conditions (e.g., x is a substring of y); apps: text predicates in DBs [Grahne & al. 99] [Benedikt & al. 03], string manipulation [Bonner & Mecca 98] [Ginsburg and Wang 98]
Spanners: atomic value is a span (pointing into a doc); join by interval+string conditions (e.g., x is a token in y); app: IE
Interval Algebra: atomic value is an interval (no text); join by interval conditions (e.g., x is a sub-interval of y); apps: temporal reasoning [Allen 83] [Vilain & Kautz 86] [Nebel & Bürckert 95] [Krokhin et al. 03]
55. Imp. 1: Connection to Known Concepts
• Connection to Recognizable Relations [Berstel 79]
– These are unions of cross products of regular languages
– THM: The class of regular spanners is closed under
a string-selection predicate iff the predicate is a
recognizable relation
• Connection to CRPQs [Cruz et al. 87]
– Conjunctive Regular Path Queries have been studied as a
query language for labeled graphs
– THM: Regular spanners have the same expressive
power as unions of CRPQs on paths “with marked
endpoints”
• Up to some simple and necessary adaptation between the models
[Figure: the string S I G M O D as a path with marked endpoints]
56. Imp. 2: Adding String Equality
Regular spanners = NR-Datalog w/ regex formulas; regularstr= spanners = regular spanners + a string-equality predicate (+ substring-of, prefix-of, …)
…application from Jane Doe,
social 012-345-6789, on Mar
20th… identified as John Doe,
012-345-6789, ask us to…
x1 | x2
[117,125) (Jane Doe) | [875,883) (John Doe)
⋮ | ⋮
NameSSN(x,y) := …
SameSSN(x1,x2) := NameSSN(x1,y1) , NameSSN(x2,y2) , str(y1)=str(y2)
Same string,
different spans
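A minimal Python sketch of the string-equality selection (the pattern and sample text are illustrative assumptions): two name spans are flagged because their SSN substrings are equal as strings, even though the spans differ:

```python
import re

# NameSSN-style extraction followed by a str-eq join: pair up name spans
# whose SSN *strings* coincide while the spans themselves differ.
doc = ("...application from Jane Doe, social 012-345-6789, on Mar 20th... "
       "identified as John Doe, 012-345-6789, ask us to...")

name_ssn = [(m.span(1), m.span(2)) for m in re.finditer(
    r"([A-Z][a-z]+ [A-Z][a-z]+)\D*?(\d{3}-\d{3}-\d{4})", doc)]

same_ssn = [(x1, x2)
            for (x1, y1) in name_ssn for (x2, y2) in name_ssn
            if x1 < x2 and doc[y1[0]:y1[1]] == doc[y2[0]:y2[1]]]

for x1, x2 in same_ssn:
    print(doc[x1[0]:x1[1]], x1, "<->", doc[x2[0]:x2[1]], x2)
```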
57. Difference with String Equality
• Are regularstr= spanners closed under difference?
– Why should they? Only positive operators are used…
– However, regex formulas (our EDBs) can introduce
“negative” operations (NFAs closed under complement)
• THM: The class of regular spanners is closed under
difference
• PROP: The class of regularstr= spanners is closed
under string-inequality selection
• THM: The class of regularstr= spanners is closed under string-containment selection, but not under non-string-containment selection!
• COR: The class of regularstr= is not closed under
difference
57
58. Formal Optimization Problem
Fixed: • Schema S w/ fun. dependencies
• Conjunctive query Q
Input: • Database instance I over S
• Set A⊆ Q(I) of answers to delete
Output: J ⊆ I s.t. Q(J) ∩ A = ∅
Goal: Minimize the side effect |(Q(I) – A) – Q(J)|