The document discusses techniques for indexing and querying graph data. It begins by categorizing graph queries as exact subgraph matching, similarity subgraph matching, or super graph matching. It then describes querying approaches for collection databases containing many small graphs versus large singular graphs. The document proceeds to summarize several graph indexing techniques including GraphGrep, gIndex, Grafil, C-tree, QuickSI, and others. It focuses on filtering techniques used to reduce the number of verification steps in subgraph matching queries over graph databases.
2. Outline
• Category of graph queries
• Querying in collection DB
• Querying in large graphs
• References
3. Category of Graph Queries: Matching Type
• Exact subgraph matching
– Find graphs in DB which have all components of the query graph
• Similarity subgraph matching
– Find graphs in DB which have some components of the query graph
– Similarity measure is needed
• Super graph matching
– Find graphs in DB which are contained in the query graph
[Figure: a query graph alongside an exact subgraph match and a similarity subgraph match]
4. Category of Graph Queries: Target DB
• Collection DB: large number of small graphs
– e.g. Chemical compounds
– Retrieval result: IDs of the graphs which contain matching parts
• Large graphs: small number of large graphs
– e.g. Social network, RDF graph
– Retrieval result: all matching subgraphs
[Figure: querying a collection DB of graphs G1–G7 returns a graph ID list (e.g. G1, G3, G5), while querying a large graph returns all matching subgraphs]
5. Query Processing in Collection DB
• Processing flow (sketched in code below)
• Verification uses a standard pair-wise subgraph isomorphism algorithm
• Most techniques focus on filtering
– The cost of verification is high
– The goal is to reduce the number of verification executions
Query → Filtering → Candidate graph set → Verification → Answer graphs
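To make this concrete, here is a minimal Python sketch of the filtering-and-verification pipeline. `passes_filter` and `is_subgraph_isomorphic` are hypothetical stand-ins for a real index lookup and a pair-wise isomorphism test (e.g. Ullmann or VF2); the point is only the two-phase flow, not any particular system's code.

```python
# Minimal sketch of the filtering-and-verification flow (illustrative only).
# `passes_filter` and `is_subgraph_isomorphic` are hypothetical stand-ins
# for an index lookup and a pair-wise isomorphism test (e.g. VF2).

def process_query(query, db, passes_filter, is_subgraph_isomorphic):
    # Filtering: a cheap index-based test prunes most of the database.
    candidates = [g for g in db if passes_filter(query, g)]
    # Verification: the expensive exact test runs only on the candidates.
    return [g for g in candidates if is_subgraph_isomorphic(query, g)]
```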
6. Query Processing in Large Graphs
• Processing flow
• Focus on node indexing
– To reduce search space
– Use structural information of nodes
• Build subgraphs by joining candidate nodes (see the sketch below)
– Join methods are comparatively under-researched
– Optimization via join ordering
Query → Index search → Candidate node sets → Building subgraphs → Answer subgraphs
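A rough sketch of the subgraph-building step, under the assumption that both graphs are given as dict-of-sets adjacency lists and that `candidate_sets` (query node → candidate data nodes) comes from the node index. The smallest-candidate-set-first join order is an illustrative heuristic, not a prescribed method.

```python
def consistent(query_adj, data_adj, emb, u, v):
    # Every already-mapped query neighbour of u must map to a data
    # neighbour of v (adjacency given as node -> set of neighbours).
    return all(emb[w] in data_adj[v] for w in query_adj[u] if w in emb)

def build_subgraphs(query_adj, data_adj, candidate_sets):
    # Join order: start with the most selective (smallest) candidate set.
    order = sorted(candidate_sets, key=lambda u: len(candidate_sets[u]))
    embeddings = [{}]
    for u in order:
        embeddings = [
            {**emb, u: v}
            for emb in embeddings
            for v in candidate_sets[u]
            if v not in emb.values() and consistent(query_adj, data_adj, emb, u, v)
        ]
    return embeddings  # each dict maps query nodes to data nodes
```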
7. Graph Indexing Techniques
Technique                          | Target Database | Query Type         | Approach
GraphGrep [Shasha et al., PODS’02] | Collection DB   | Exact              | Feature (path)-based index
gIndex [Yan et al., SIGMOD’04]     | Collection DB   | Exact              | Feature (graph)-based index
Grafil [Yan et al., SIGMOD’05]     | Collection DB   | Exact & Similarity | Feature-based similarity search
C-tree [He and Singh, ICDE’06]     | Collection DB   | Exact & Similarity | Closure-based index
QuickSI [Shang et al., VLDB’08]    | Collection DB   | Exact              | Verification algorithm
Tale [Tian and Patel, ICDE’08]     | Collection DB   | Exact & Similarity | Similarity search using node index
GraphQL [He and Singh, SIGMOD’08]  | Large graphs    | Exact              | Node indexing
Spath [Zhao and Han, VLDB’10]      | Large graphs    | Exact              | Node indexing using neighborhood information
8. Outline
• Category of graph queries
• Querying in collection DB
• Querying in large graphs
• References
9. GraphGrep(1/2) [Shasha et al. PODS’02]
• The first work to adopt the filtering-and-verification framework
• Path-based index
– Fingerprint of the database
– Enumerate all paths (length ≤ L) of all graphs in the DB
– For each path, the number of occurrences in each graph is stored in a hash table (see the sketch below)
[Figure: three database graphs g1, g2, g3 with node labels A–E]

Index (hashed path → occurrence count per graph):
Key     | g1 | g2 | g3
h(CA)   |  1 |  0 |  1
…       |    |    |
h(ABCB) |  2 |  2 |  0
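A sketch of the index construction under simplifying assumptions: graphs are adjacency dicts plus node-label dicts, only simple paths of at most L edges are enumerated, and the raw label string stands in for GraphGrep's hash key.

```python
from collections import defaultdict

def path_counts(adj, labels, L):
    # Count simple label paths with at most L edges, starting at every node.
    counts = defaultdict(int)
    def dfs(path):
        counts["".join(labels[n] for n in path)] += 1
        if len(path) <= L:                 # len(path) - 1 edges so far
            for nxt in adj[path[-1]]:
                if nxt not in path:        # keep paths simple
                    dfs(path + [nxt])
    for start in adj:
        dfs([start])
    return counts

def build_index(db, L):
    # db: graph_id -> (adj, labels); result: path -> {graph_id: count}
    index = defaultdict(dict)
    for gid, (adj, labels) in db.items():
        for p, c in path_counts(adj, labels, L).items():
            index[p][gid] = c
    return index
```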
10. GraphGrep(2/2): Query Processing
• Filtering (sketched below)
– Build the fingerprint of the query q: hash all paths (length <= L) of q
– Compare the query fingerprint with the database fingerprint
– Discard a graph if, for any path, its stored count is less than the count in the query fingerprint
• Verification
– Run subgraph isomorphism tests on the remaining candidates
Index:
Key     | g1 | g2 | g3
h(AB)   | 2  | 2  | 1
h(AC)   | 1  | 0  | 1
h(BAC)  | 2  | 0  | 1
Query fingerprint: AB: 1, AC: 1, BAC: 1
g2 is filtered out (its counts for AC and BAC are 0, below the query's 1); candidates = {g1, g3} proceed to verification.
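The filtering rule translates directly into a loop over the query fingerprint; a minimal sketch under the same assumed index layout as before:

```python
# Sketch of GraphGrep filtering: a database graph survives only if, for
# every path in the query fingerprint, its stored occurrence count is
# at least the query's count. Survivors still need verification by a
# full subgraph isomorphism test.
def filter_candidates(index, query_fingerprint, all_graph_ids):
    candidates = set(all_graph_ids)
    for path, q_count in query_fingerprint.items():
        occurrences = index.get(path, {})
        candidates = {g for g in candidates
                      if occurrences.get(g, 0) >= q_count}
    return candidates
```

On the slide's example this keeps {g1, g3}: g2 fails on both AC and BAC.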
11. gIndex(1/6) [Yan et al., SIGMOD’04]
• The path-based approach has weak points
– Paths are too simple: structural information is lost
– There are too many paths: the set of paths in a graph database is usually huge
• Solution
– Use graph structure instead of path as the basic index feature
(Figure: a sample database of carbon-chain graphs and a query graph; the paths occurring in the query graph cannot filter any graph in the database.)
12. gIndex(2/6): Frequent Fragment
• The number of graph structures is large → index only frequent subgraphs
• support(g)
– The number of graphs in D (the graph database) that contain g as a subgraph
• minSup
– Minimum support threshold
– Index a fragment g only if support(g) ≥ minSup
• Size-increasing support (sketched below)
– The number of frequent fragments grows as the fragment size increases
– Low minSup for small fragments, high minSup for large fragments
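A minimal sketch of a size-increasing support function follows. The concrete shape of the threshold curve is a tuning choice (a linear ramp is assumed here, and the parameter values are illustrative); the only property the technique needs is that small fragments get a low minSup and large fragments a high one.

```python
# Sketch of a size-increasing support threshold for gIndex-style
# indexing. Fragments of size 1 always pass with the low threshold;
# the threshold then ramps up linearly toward the high value.
def size_increasing_support(size, max_size=10, low=1, high=20):
    if size <= 1:
        return low
    frac = min(size, max_size) / max_size    # grows from 0 toward 1
    return round(low + frac * (high - low))

# index fragment g only if support(g) >= size_increasing_support(size(g))
```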
13. gIndex(3/6): Frequent Fragment
(Figure: candidate fragments over labels A and B of sizes 1 to 4, each annotated with its frequency F in the sample database, ranging from F = 4 down to F = 1; under size-increasing support, only fragments whose frequency meets the threshold for their size are indexed.)
14. gIndex(4/6): Discriminative Fragment
• Redundant fragment
– The graphs indexed by a fragment are also indexed by its subgraphs
– We don't need to include redundant fragments
• Discriminative fragment
– A fragment which is not redundant
– $|D_x| \ll |\bigcap_{f \in F \,\wedge\, f \subseteq x} D_f|$
(Example: fragments f1 and f2 of size 2 and f3 of size 3 over database graphs g1–g4, with Df1 = {g1, g2, g3}, Df2 = {g2, g3, g4}, and Df3 = {g2, g3} = Df1 ∩ Df2, so f3 is redundant.)
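The discriminative test can be sketched as a simple set computation. This is an illustration, not the paper's exact procedure: the ratio gamma is a tunable threshold assumed here, and ID lists are plain Python sets.

```python
# Sketch of the discriminative-fragment test: keep fragment x only if
# its ID list D_x is substantially smaller than the intersection of the
# ID lists of its already-indexed subfragments.
def is_discriminative(d_x, subfragment_id_lists, gamma=2.0):
    if not subfragment_id_lists:
        return True                     # nothing indexed below x: keep it
    intersection = set.intersection(*subfragment_id_lists)
    return len(intersection) >= gamma * len(d_x)

# Slide example: Df3 = {g2, g3} equals Df1 ∩ Df2 exactly, so f3 is redundant
print(is_discriminative({"g2", "g3"},
                        [{"g1", "g2", "g3"}, {"g2", "g3", "g4"}]))  # False
```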
15. gIndex(5/6): gIndex Tree
• Use a graph serialization method
– For fast graph isomorphism checking during index search
– DFS coding [Yan et al., ICDM’02]
– Translates a graph into a unique edge sequence
• gIndex tree
– A prefix tree over the edge sequences of discriminative fragments
– Records all size-n discriminative fragments at level n
– Black nodes: discriminative fragments, each with an ID list (the IDs of the graphs containing the fragment)
– White nodes: redundant fragments, kept for Apriori pruning
(Figure: DFS coding of a graph with vertices v0–v3 labelled X, X, Z, Y and edge labels a, b yields the edge sequence <(v0,v1),(v1,v2),(v2,v0),(v1,v3)>; the gIndex tree stores fragments f1, f2, f3 as edge-sequence paths e1, e2, e3 across levels 0, 1, 2, ….)
16. gIndex(6/6): Searching
• Searching process (sketched below)
– Given a query q, enumerate all of q's fragments (size <= maxSize)
– Locate the fragments in the gIndex tree
– Intersect the ID lists associated with the fragments
• Apriori pruning
– Generating every fragment is inefficient
– If a fragment is not in the gIndex tree, we need not check its supergraphs
– Redundant fragments must be recorded for Apriori pruning to work
(Example: for a query with edge sequence <e1, e2, e3, e4, e5>, enumeration proceeds <e1>, <e1, e2>, <e1, e2, e3>, then stops at <e1, e2, e3, e4> because that sequence is absent from the gIndex tree, and continues with <e2>, ….)
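A minimal sketch of the candidate computation, under assumed data structures: the gIndex tree is modelled as a dict from edge-sequence tuples to ID lists, with white (redundant) nodes mapped to None, and query fragments are processed shortest-first so prefixes are seen before their extensions.

```python
# Sketch of gIndex search with Apriori pruning: a fragment whose prefix
# is absent from the tree is skipped together with all its extensions;
# ID lists of located black nodes are intersected.
def gindex_candidates(tree, query_fragments, all_ids):
    candidates = set(all_ids)
    absent = set()                       # fragments known to be missing
    for seq in sorted(map(tuple, query_fragments), key=len):
        if len(seq) > 1 and (seq[:-1] in absent or seq[:-1] not in tree):
            absent.add(seq)              # Apriori: prefix missing, prune
            continue
        if seq not in tree:
            absent.add(seq)              # stop extending this branch
            continue
        ids = tree[seq]
        if ids is not None:              # black node carries an ID list
            candidates &= ids
    return candidates
```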
17. Grafil(1/4) [Yan et al., SIGMOD’05]
• Subgraph similarity search
• Feature-based approach
• Similarity search using relaxed queries
– Relax a query by deleting k edges
– Missed edges incur missed features
• Main question
– What is the maximum number of missed features (m_max) when relaxing a query by k missed edges?
Feature vectors: each database graph Gi has a feature vector {u1, u2, …, un}; the query has {v1, v2, …, vn}.
Subgraph exact search: $u_i \ge v_i$ for $1 \le i \le n$.
Subgraph similarity search: with $r(u_i, v_i) = 0$ if $u_i \ge v_i$ and $v_i - u_i$ otherwise, require $\sum_{i=1}^{n} r(u_i, v_i) \le m_{max}$.
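The relaxed filter is a short loop over the two vectors; a minimal sketch under the notation above (vectors as plain Python lists):

```python
# Sketch of Grafil's relaxed filter: sum the shortfalls r(u_i, v_i)
# between a graph's feature vector u and the query's vector v, and keep
# the graph while the total stays within m_max. Setting m_max = 0
# recovers the exact-search condition u_i >= v_i for all i.
def grafil_filter(u_vec, v_vec, m_max):
    total_miss = 0
    for u, v in zip(u_vec, v_vec):
        if u < v:                  # r(u, v) = v - u when the graph falls short
            total_miss += v - u
            if total_miss > m_max:
                return False
    return True
```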
18. Grafil(2/4): Feature Misses
(Example: the query contains features fa, fb, fc with counts 1, 2, 4, i.e. 7 feature occurrences in total. Relaxing the query by one edge yields relaxed queries retaining (1, 0, 3), (0, 1, 2), and (0, 1, 2) occurrences, i.e. totals 4, 3, and 3. The feature misses are therefore 7 − 4 = 3, 7 − 3 = 4, and 7 − 3 = 4, so m_max = 4.)
19. Grafil(3/4): Feature Miss Estimation
• Problem
– Given a query Q and the set of features contained in Q, if the relaxation ratio is given, what is the maximal number of features that can be missed?
• Use the edge-feature matrix
– Find the maximum number of columns that can be hit by k rows
– k: the number of missing edges in Q
• This is the classic maximum coverage problem (set k-cover)
– Proven NP-complete (a greedy sketch follows the matrix below)
Edge-Feature Matrix (query edges e1–e3 against feature columns):
   | fa | fb1 | fb2 | fc1 | fc2 | fc3 | fc4
e1 | 0  | 1   | 1   | 1   | 0   | 0   | 0
e2 | 1  | 1   | 0   | 0   | 1   | 0   | 1
e3 | 1  | 0   | 1   | 0   | 0   | 1   | 1
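Since exact maximum coverage is intractable, the greedy heuristic is the standard way to get a feel for the problem: pick, k times, the edge whose row hits the most still-uncovered feature columns. Note this only approximates the optimum from below, whereas Grafil's filter needs a safe upper bound, which the paper derives separately; the sketch below just illustrates the combinatorial core.

```python
# Greedy sketch of maximum coverage: k rounds, each taking the row (a
# set of feature columns hit by one edge) that covers the most
# still-uncovered columns.
def greedy_max_coverage(rows, k):
    covered = set()
    for _ in range(k):
        best = max(rows, key=lambda r: len(r - covered))
        covered |= best
    return len(covered)

# Rows of the slide's edge-feature matrix, as sets of hit columns
rows = [{"fb1", "fb2", "fc1"},
        {"fa", "fb1", "fc2", "fc4"},
        {"fa", "fb2", "fc3", "fc4"}]
print(greedy_max_coverage(rows, 1))   # 4: one deleted edge can hit 4 features
```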
20. Grafil(4/4): Feature Conjugation
• Compensate the misses of one feature with occurrences of other features in G
• Using all the features together in one filter would deteriorate the filtering performance
• Solution
– Use multiple filters
– Feature set selection
(Example: the query has features fa and fb with counts 3 and 4 and m_max = 4; a graph containing fb but no fa still passes a single combined filter, since (3 − 0) + 0 = 3 ≤ m_max: the surplus of fb conjugates away the misses of fa.)
21. C-tree(1/5) [He and Singh, ICDE’06]
• Closure-tree
– Tree-based index
– Each node holds the graph closure of its descendants
– Supports subgraph queries and similarity queries
• Pseudo subgraph isomorphism
– Performs pairwise graph comparisons using heuristic techniques
– Produces candidate answers in polynomial time
(Figure: a query graph is matched against the C-tree to obtain candidate graphs.)
22. C-tree(2/5): Closures
• A generalized graph that captures the structural information of a set of graphs (example and sketch below)
• Serves as the bounding container in a C-tree
(Example: database graphs G1–G5 over labels A, B, C, D.
C1 = closure(G1, G2): nodes A, B, C, {D, ε}
C2 = closure(G3, G4, G5): nodes {A, ε}, B, D, {D, ε}
C3 = closure(C1, C2): nodes {A, ε}, B, {C, D}, {C, D, ε})
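A minimal sketch of computing closure labels, assuming a node correspondence between the two graphs is already given (finding a good correspondence is the hard part, which C-tree handles heuristically); the dict layout and the epsilon marker are illustrative.

```python
# Sketch of closure labels under a fixed node correspondence: each
# closure node unions the member label sets, adding an epsilon label
# where one member graph has no corresponding node.
EPS = "eps"

def closure_labels(labels1, labels2, mapping):
    # labels1/labels2: node -> set of labels; mapping: node of graph 1
    # -> node of graph 2, or None when the node is unmatched
    out = {}
    matched = {m for m in mapping.values() if m is not None}
    for n1, labs in labels1.items():
        n2 = mapping.get(n1)
        out[n1] = labs | (labels2[n2] if n2 is not None else {EPS})
    for n2, labs in labels2.items():
        if n2 not in matched:
            out[("g2", n2)] = labs | {EPS}   # node only present in graph 2
    return out

# Slide example: closure(G1, G2) leaves the unmatched D node as {D, eps}
g2 = {1: {"A"}, 2: {"B"}, 3: {"C"}, 4: {"D"}}
g1 = {1: {"A"}, 2: {"B"}, 3: {"C"}}
print(closure_labels(g2, g1, {1: 1, 2: 2, 3: 3, 4: None}))
```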
23. C-tree(3/5): Structure
• Each node is the graph closure of its children
• The children of a leaf node are database graphs
• Similar in structure to tree-based spatial access methods, e.g. the R-tree
• Traversing a C-tree needs subgraph isomorphism tests
– Use an approximation technique: pseudo subgraph isomorphism
(Figure: a C-tree with root closure C3 over closures C1 and C2, whose children are the database graphs.)
24. C-tree(4/5): Pseudo Subgraph Isomorphism
• An approximation of subgraph isomorphism
• Given two graphs G1 and G2, use the adjacent tree structure of each node to map node pairs
(Diagram: subgraph isomorphism is approximated by level-n sub-isomorphism, defined via level-n compatibility over level-n adjacent subgraphs; pseudo sub-isomorphism approximates this further via level-n pseudo compatibility over level-n adjacent subtrees. Both compatibility notions are resolved by bipartite matching.)
25. C-tree(5/5): Pseudo Subgraph Isomorphism
(Example: checking pseudo subgraph isomorphism between a query graph G1 with vertices A, B, C and a database graph G2 with vertices A, B1, C1, B2, C2, by expanding the adjacent subtrees of each vertex at levels 0, 1, 2 and matching them level by level.)
26. QuickSI(1/6) [Shang et al., VLDB’08]
• The main paradigm for processing graph containment queries
– The filtering-and-verification framework
• Verification techniques
– Subgraph isomorphism testing
– Existing techniques are not efficient, especially when the query graph becomes large
• QuickSI develops efficient verification techniques
27. QuickSI(2/6): QI-Sequence
• A sequence that represents a rooted spanning tree of a query q
– Encodes the graph for efficient subgraph isomorphism testing
– Encodes the search order and topological information
– Consists of spanning entries and extra entries
• Spanning entry Ti
– Keeps basic information of the spanning tree
– Ti.v: records a vertex vk of the query graph q
– [Ti.p, Ti.l]: the parent vertex and the label of Ti.v
• Extra entry Rij
– Extra topology information
– Degree constraint [deg : d]: the degree of Ti.v
– Extra edge [edge : j]: an edge that does not appear in the spanning tree
28. QuickSI(3/6): QI-Sequence
• Several QI-sequences exist for one query graph q
– They induce different search spaces during subgraph isomorphism testing
(Query: a ring of six C vertices with an attached N.)
QI-Sequence SEQq:
Type | [Ti.p, Ti.l] | Ti.v
T1   | [0, N]       | v1
T2   | [1, C]       | v2
R21  | [deg : 3]    |
T3   | [2, C]       | v3
T4   | [3, C]       | v4
T5   | [4, C]       | v5
T6   | [5, C]       | v6
T7   | [6, C]       | v7
R71  | [edge : 2]   |
QI-Sequence SEQq':
Type | [Ti.p, Ti.l] | Ti.v
T1   | [0, C]       | v4
T2   | [1, C]       | v5
R61  | [edge : 3]   |
T3   | [2, C]       | v3
T4   | [3, C]       | v6
T5   | [4, C]       | v7
T6   | [5, C]       | v2
T7   | [6, C]       | v1
R61  | [deg : 3]    |
29. QuickSI(4/6): Effective QI-Sequence
• Constructing the optimal QI-sequence is hard
– Use heuristics to construct an effective QI-sequence (sketched below)
• Calculate the average inner support of each distinct vertex and edge
– The average number of possible mappings in the graphs that contain the edge or vertex
– Statistics are collected over the graphs in the candidate set after filtering
• Convert q to a weighted graph qw
– w(e) = øavg(e), w(v) = øavg(v)
• Find a minimum spanning tree of qw based on the edge weights
(Example: the query as a weighted graph; the (N, C) edge has weight 1.4, each (C, C) edge has weight 5.1.)
Average inner support:
Edge   | øavg(e)
(N, C) | 1.4
(C, C) | 5.1
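A minimal sketch of deriving a search order from these statistics, assuming the average inner supports are supplied as an edge-weight dict; the node names and weights mirror the slide, and Prim's algorithm stands in for whichever MST construction QuickSI actually uses.

```python
# Sketch: weight each query edge by its average inner support and grow
# a minimum spanning tree with Prim's algorithm; the visit order then
# serves as the spanning order of the QI-sequence.
import heapq

def mst_search_order(nodes, edge_weights):
    adj = {n: [] for n in nodes}
    for (a, b), w in edge_weights.items():
        adj[a].append((w, b))
        adj[b].append((w, a))
    (a, b), _ = min(edge_weights.items(), key=lambda kv: kv[1])
    order, seen, heap = [], set(), [(0.0, a)]   # start at the lightest edge
    while heap:
        w, n = heapq.heappop(heap)
        if n in seen:
            continue
        seen.add(n)
        order.append(n)
        for w2, m in adj[n]:
            if m not in seen:
                heapq.heappush(heap, (w2, m))
    return order

weights = {("N", "C1"): 1.4, ("C1", "C2"): 5.1, ("C2", "C3"): 5.1,
           ("C3", "C4"): 5.1, ("C4", "C5"): 5.1, ("C5", "C1"): 5.1}
print(mst_search_order(["N", "C1", "C2", "C3", "C4", "C5"], weights))
```

The rare N–C edge (weight 1.4) is chosen first, so the search starts from the most selective vertex, exactly the intuition behind the heuristic.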
30. QuickSI(5/6): Swift-Index
• Traditional filtering process
– Decompose the query graph into a set of features
– Identify every feature that also appears in the index
– Identifying a feature needs subgraph isomorphism
• Filtering using the Swift-Index
– Pre-compute QI-sequences for the features
– Maintain the QI-sequences in a prefix tree, the Swift-Index
– Given a query graph q, search the prefix tree top-down
– Reduces the computational cost of subgraph isomorphism testing
32. TALE(1/5) [Tian and Patel, ICDE’08]
• Motivation
– Approximate graph matching is needed
– Support for large queries is increasingly desired
• TALE (a Tool for Approximate Large graph matching)
– A novel disk-based indexing method
– High pruning power
– Index size linear in the database size
– An index-based matching algorithm
– Significantly outperforms existing methods
– Gracefully handles large queries and databases
33. TALE(2/5): Neighborhood Indexing
• Neighborhood
– The induced subgraph of a node and its neighbors (adjacent nodes)
• Properties of a neighborhood
– Degree: the number of neighbors
– Neighbor connection (nConn): how the neighbors connect to each other
– Neighbor array: the labels of the actual neighbors, stored as a bitmap
(Example: a database node labelled A with eight neighbors: Ndb.label = A, Ndb.degree = 8, Ndb.nConn = 3, and a neighbor array over the label set {A, B, C, D, E} of 1 1 0 1 1.)
34. TALE(3/5): Approximate Matching
Exact matching of a query neighborhood Nq against a database neighborhood Ndb:
– Nq.label = Ndb.label
– Nq.degree ≤ Ndb.degree
– Nq.nConn ≤ Ndb.nConn
– |(NOT Ndb.nArray) AND Nq.nArray| = 0
Approximate matching:
– group(Nq.label) = group(Ndb.label)
– Nq.degree ≤ Ndb.degree + ε
– Nq.nConn ≤ Ndb.nConn + δ
– |(NOT Ndb.nArray) AND Nq.nArray| ≤ ε
(Example: a database neighborhood with label A, degree 4, nConn 2 approximately matches a query neighborhood with label A, degree 5, nConn 3, given matching neighbor arrays.)
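These conditions translate almost one-to-one into code; a minimal sketch, assuming neighborhoods are plain dicts, neighbor arrays are integer bitmaps over the label alphabet, and the label grouping is passed in as a function (the concrete bit patterns below are illustrative):

```python
# Sketch of TALE's neighbourhood comparison. The count of
# query-neighbour labels the database unit lacks is
# |(NOT nArray_db) AND nArray_q|; eps and delta are the slack
# parameters, and eps = delta = 0 recovers the exact conditions.
def neighborhood_match(nq, ndb, label_group, eps=1, delta=1):
    missing = bin(~ndb["nArray"] & nq["nArray"]).count("1")
    return (label_group(nq["label"]) == label_group(ndb["label"])
            and nq["degree"] <= ndb["degree"] + eps
            and nq["nConn"] <= ndb["nConn"] + delta
            and missing <= eps)

# The slide's example: degrees 5 vs 4 and nConn 3 vs 2 pass with slack 1
nq  = {"label": "A", "degree": 5, "nConn": 3, "nArray": 0b11011}
ndb = {"label": "A", "degree": 4, "nConn": 2, "nArray": 0b11011}
print(neighborhood_match(nq, ndb, lambda l: l))    # True
```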
35. TALE(4/5): Hybrid Index Structure
• Supports efficient search for database neighborhoods satisfying the approximate matching conditions of the previous slide
• B+-tree index on (group, degree, nConn)
• Bitmap index on nArray
36. TALE(5/5): Matching Algorithm
• Step 1: match the important nodes of the query
– A good match should be more tolerant of missing unimportant nodes than of missing important nodes
– Use degree centrality to measure the importance of nodes
• Step 2: progressively extend the node matches
37. Outline
• Category of graph queries
• Querying in collection DB
• Querying in large graphs
• References
38. GraphQL(1/5) [He and Singh, SIGMOD’08]
• Motivation
– Need a language to query and manipulate graphs with arbitrary attributes and structures
– Need native access methods that exploit graph structural information
• Formal language for graphs
– Notation for manipulating graph structures
– Basis of the graph query language
– Concatenation, disjunction, repetition
• Graph query language
– Subgraph isomorphism + predicate evaluation
Graph motif (a triangle on v1, v2, v3):
graph G1 {
  node v1, v2, v3;
  edge e1 (v1, v2);
  edge e2 (v2, v3);
  edge e3 (v3, v1);
}
Graph pattern (structure plus predicates):
graph P {
  node v1, v2;
  edge e1 (v1, v2);
} where v1.name = "A" and v2.year > 2000;
39. GraphQL(2/5): Access Methods
• Feasible mates
– The set of nodes in the graph that satisfy a pattern node's predicates
• Graph pattern matching
– Retrieve the feasible mates for each node in the pattern
– Search the resulting space for subgraph isomorphism matches (sketched after the example below)
• Reducing the search space
– Neighborhood subgraphs
– Profiles of neighborhood subgraphs
(Example: pattern nodes A, B, C matched against a graph with nodes A1, A2, B1, B2, C1, C2.)
Basic algorithm (search order A → B → C):
for A in {A1, A2}: for B in {B1, B2}: for C in {C1, C2}: …
Search space: {A1, A2} × {B1, B2} × {C1, C2}
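A minimal sketch of this basic loop in Python, with label equality standing in for full predicate evaluation (the helper name and data layout are assumptions, not GraphQL's implementation):

```python
# Sketch of the basic GraphQL-style matching loop: compute feasible
# mates per pattern node, then enumerate the cartesian search space,
# keeping injective assignments whose pattern edges exist in the graph.
# g_adj is assumed symmetric (undirected graph).
from itertools import product

def basic_match(pattern_labels, pattern_edges, g_labels, g_adj):
    mates = {p: [v for v, l in g_labels.items() if l == pl]
             for p, pl in pattern_labels.items()}
    names = list(pattern_labels)
    for combo in product(*(mates[p] for p in names)):
        if len(set(combo)) < len(combo):
            continue                      # mapping must be injective
        assign = dict(zip(names, combo))
        if all(assign[b] in g_adj[assign[a]] for a, b in pattern_edges):
            yield assign
```

The access methods on the next slides exist precisely to shrink the `mates` lists before this enumeration runs.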
40. GraphQL(3/5)
(Example: each graph node with its radius-1 neighborhood subgraph and its profile: A1: ABC, A2: AB, B1: ABCC, B2: ABC, C1: BC, C2: ABBC.)
Resulting search space:
– Retrieve by nodes: {A1, A2} × {B1, B2} × {C1, C2}
– Retrieve by neighborhood subgraphs: {A1} × {B1} × {C2}
– Retrieve by profiles of neighborhood subgraphs: {A1} × {B1, B2} × {C2}
42. GraphQL(5/5)
• Cost model (example: pattern A–B–C with search space {A1} × {B1, B2} × {C2}; worked code follows)
Result size of a join i: Size(i) = size(i.left) × size(i.right) × γ(i), where γ(i) is the reduction factor.
(a) (A ⋈ B) ⋈ C: Cost(Join1) = 1 × 2 = 2, Size(Join1) = 2γ, Cost(Join2) = 2γ, so Cost(Join1 + Join2) = 2 + 2γ
(b) (A ⋈ C) ⋈ B: Cost(Join1) = 1 × 1 = 1, Size(Join1) = γ, Cost(Join2) = 2γ, so Cost(Join1 + Join2) = 1 + 2γ
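A minimal sketch of this left-deep cost model, using a single reduction factor gamma for all joins (the slide's per-join γ(i) is simplified to one constant here):

```python
# Sketch of the join-order cost model: each join costs the product of
# its input sizes, and its result (scaled by gamma) feeds the next
# join. With |A| = 1, |B| = 2, |C| = 1 this reproduces the slide's
# totals of 2 + 2*gamma and 1 + 2*gamma.
def plan_cost(sizes, order, gamma):
    cost, inter = 0.0, sizes[order[0]]
    for nxt in order[1:]:
        cost += inter * sizes[nxt]            # work done by this join
        inter = inter * sizes[nxt] * gamma    # estimated result size
    return cost

sizes = {"A": 1, "B": 2, "C": 1}
print(plan_cost(sizes, ["A", "B", "C"], 0.5))   # 3.0 = 2 + 2*0.5
print(plan_cost(sizes, ["A", "C", "B"], 0.5))   # 2.0 = 1 + 2*0.5
```

Joining A with C first wins because both candidate lists are singletons, which is why join ordering matters.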
43. GADDI(1/6) [Zhang et al., EDBT’09]
• Employs a novel indexing method, the NDS distance
– Captures the local graph structure between each pair of vertices
– More pruning power than indexes based on the information of a single vertex
• Matching algorithm based on two-way pruning
– Candidate matching using the NDS distance
– Removal of impossible vertices after some vertices are matched
44. GADDI(2/6): NDS Distance
• Neighboring discriminating substructure (NDS) distance
– Defined for a substructure P and a pair of vertices v1 and v2
– The number of matches of P in the induced subgraph of the common neighborhood of v1 and v2
(Example: in the database graph, the k = 3 neighborhoods of v1 and v2 overlap, and P occurs three times in the induced subgraph of their common neighborhood, so dNDS(G, v1, v2, P) = 3.)
45. GADDI(3/6): Pruning Condition
• Pruning condition (checked in the sketch below)
– If v in Q has a neighbor v′ with n occurrences of a substructure P between them, a matching candidate u in G must have a neighbor u′ with at least n occurrences of P between u and u′
– dNDS(Q, v, v′, P) ≤ dNDS(G, u, u′, P)
(Example: dNDS(Q, v, v′, P1) = 2 and dNDS(Q, v, v′, P2) = 2, while dNDS(G, u, u′, P1) = 3 and dNDS(G, u, u′, P2) = 2, so u is a candidate for v.)
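A minimal sketch of this test, assuming the precomputed dNDS values are stored in dicts keyed by (vertex, neighbor, substructure); how the candidates for each query neighbor are paired up is simplified here to an existence check:

```python
# Sketch of GADDI's pruning test: u in G remains a candidate for v in Q
# only if, for every substructure P and every neighbour v' of v, some
# neighbour u' of u offers at least as large a dNDS value.
def is_candidate(v, u, q_nbrs, g_nbrs, d_q, d_g, substructures):
    for vp in q_nbrs[v]:
        for P in substructures:
            need = d_q.get((v, vp, P), 0)
            if not any(d_g.get((u, up, P), 0) >= need for up in g_nbrs[u]):
                return False        # no neighbour of u covers this demand
    return True
```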
46. GADDI(4/6): Candidate Matches
• For each vertex v within distance L of vq in Q, there must exist a vertex v′ in the neighborhood of vg in G such that
– L(v) = L(v′)
– dNDS(Q, vq, v, P) ≤ dNDS(G, vg, v′, P) for every substructure P
– d(Q, vq, v) ≥ d(G, vg, v′)
(Example: the candidate vertices of the database graph that satisfy all three conditions for a given query vertex.)
47. GADDI(5/6): Index Structure
• Index structure
– Precompute the dNDS values for every pair of neighboring vertices and every substructure P
• Pruning process
– Compute the dNDS values of each query vertex v for each neighborhood and each P
– Check the pruning conditions against the index
(Figure: the GADDI index; one dNDS matrix per substructure P1–P4, indexed by vertex pairs (ui, uj). At query time the values dNDS(Q, v1, v2, P1), …, dNDS(Q, v1, vn, P1) are compared against these matrices.)
48. GADDI(6/6): Matching Algorithm
• After matching a query graph vertex to a candidate vertex, remove the database graph vertices that have become impossible to match
(Example: a database graph and its pruned version after part of the query has been matched.)
49. DSI(1/3) [Kou et al., WAIM’10]
• The index is built on discriminative structures
• Distance set
– The distinct distances of all paths between a vertex v and the substructures in k-N(v)
– The paths must not contain an edge of P
(Example: graph G with vertices A1, B1, C1, D1, A2 and substructure P1 = (A1, B1); distances for k = 3: P1.A→A1: 0; P1.B→A1: 2, 3; P1.A→A2: 2, 3; P1.B→A2: 3, (4).)
Vector representation (distance positions 0–3):
         | A    | B
(P1, A1) | 1000 | 0011
(P1, A2) | 0011 | 0001
50. DSI(2/3): Pruning Condition
• Condition for including a vertex v of G in the candidate set of a vertex u of Q
– For each substructure P in k-N(u), DDSV(u, P) is dominated by DDSV(v, P)
(Example: the query vertex A has DDSV (P1, A) = 1000 | 0010, which is dominated by (P1, A1) = 1000 | 0011 in G but not by (P1, A2) = 0011 | 0001, so only A1 remains a candidate for A.)
51. DSI(3/3): Query Processing
• Search space generation
– For each node u in the query, build its DDSV
– For each structure and each indexed vertex, check the pruning condition
– Form the candidate set for u
• Subgraph matching in the resulting search space
(Figure: the distance set index; the per-structure bit vectors P1, P2, P3, P4, … of each query vertex are compared against the indexed vectors of the database vertices A1, B1, C1, D1, A2 to build the candidate sets.)
52. SPath(1/7) [Zhao and Han, VLDB’10]
• Problems of previous graph matching methods
– Designed for special classes of graphs
– Limited guarantees on query performance and scalability
– Lack of scalable graph indexing mechanisms and a cost-effective graph query optimizer
• SPath
– A compact index structure using the local structural information of vertices: neighborhood signatures
– Query processing moves from vertex-at-a-time to path-at-a-time
• Target graphs
– Connected, undirected, simple graphs with no edge weights
– Labeled vertices
53. SPath(2/7): Neighborhood Signature
• A path-based graph indexing technique
– Uses shortest paths to capture the local structural information around a vertex
• Neighborhood signature NS(u)
– The k-distance sets of u, from k = 0 up to the neighborhood scope k0 (a parameter)
– k-distance set: the set of vertices exactly k hops away from u, where k is the shortest-path length
(Example: NS(u1) = {{A: {1}}, {B: {2}, C: {3}}, {A: {4, 6}, B: {5}}} for k = 0, 1, 2.)
54. SPath(3/7): NS Containment
• Given $u \in V(G)$ and $v \in V(Q)$, NS(v) is contained in NS(u), denoted $NS(v) \sqsubseteq NS(u)$, if $\forall k \le k_0,\ \forall l \in \Sigma:\ \sum_{k' \le k} |S^{l}_{k'}(v)| \le \sum_{k' \le k} |S^{l}_{k'}(u)|$
• Example (sketched below): in the network G, NS(u1) = {{A: {1}}, {B: {2}, C: {3}}, {A: {4, 6}, B: {5}}}; in the query Q, NS(v1) = {{A: {1}}, {B: {2}, C: {3}}, {C: {4}}}. The query demands two C vertices within 2 hops of v1 but u1 offers only one, so $NS(v_1) \not\sqsubseteq NS(u_1)$ and we can safely prune u1 from C(v1).
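A minimal sketch of the containment test under the definition above, with NS values modelled as lists (indexed by distance) of {label: set of vertex ids}; the data layout is an assumption for illustration:

```python
# Sketch of NS containment: for every label, the cumulative count of
# query-side vertices within k hops must not exceed the database
# side's, for each k up to k0.
def ns_contained(ns_q, ns_g, k0):
    labels = set()
    for d in range(k0 + 1):
        labels |= set(ns_q[d]) | set(ns_g[d])
    for lab in labels:
        cq = cg = 0
        for k in range(k0 + 1):
            cq += len(ns_q[k].get(lab, ()))
            cg += len(ns_g[k].get(lab, ()))
            if cq > cg:
                return False    # query needs more lab-vertices within k hops
    return True

# Slide example: v1 needs a second C within 2 hops that u1 cannot supply
ns_u1 = [{"A": {1}}, {"B": {2}, "C": {3}}, {"A": {4, 6}, "B": {5}}]
ns_v1 = [{"A": {1}}, {"B": {2}, "C": {3}}, {"C": {4}}]
print(ns_contained(ns_v1, ns_u1, 2))   # False -> prune u1 from C(v1)
```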
55. SPath(4/7): Implementation
• Lookup table
– $H : l^* \to \{u \mid l(u) = l^*\}$, for $l^* \in \Sigma$
– Easily finds matching candidates by label
• Histogram
– Succinct distance-wise counts $|S^l_k(u)|$ for $0 < k \le k_0$
• ID-list
– The exact vertex identifiers in each $S^l_k(u)$
• The lookup table and histograms are stored in main memory; the ID-lists are on disk
(Example: for vertex v3 of the network, the histogram records distance 1: A×3, B×2 and distance 2: A×1, C×2, and the ID-list stores the corresponding vertex identifiers.)
56. SPath(5/7): Graph Query Processing
• Compute NS(v) for each $v \in V(Q)$
• Pruning
– Examine the matching candidates C(v)
– NS containment testing
– Produces the reduced matching candidates C′(v)
• Query decomposition
– Select shortest paths of Q that are also shortest paths in G
• Path selection and join
– Reconstruct Q from the selected paths
– The selected shortest paths should be cost-effective
57. SPath(6/7): Query Decomposition
• Select shortest paths of Q that are also shortest paths in G
(Example: for query vertex v1 with label A, the decomposed paths are (v1, v2), (v1, v5), and (v1, v2, v3), each verified against the histogram and ID-list for v1.)
58. SPath(7/7): Path Selection
• Given a join path, estimate its total join cost
• Selectivity is a function of path length
59. References
• [Shasha et al., PODS’02] Dennis Shasha, Jason T. L. Wang, Rosalba Giugno. Algorithmics and Applications of Tree and Graph Searching. PODS, 2002.
• [Yan et al., SIGMOD’04] Xifeng Yan, Philip S. Yu, Jiawei Han. Graph Indexing: A Frequent Structure-based Approach. SIGMOD, 2004.
• [Yan et al., SIGMOD’05] Xifeng Yan, Philip S. Yu, Jiawei Han. Substructure Similarity Search in Graph Databases. SIGMOD, 2005.
• [Tian and Patel, ICDE’08] Yuanyuan Tian, Jignesh M. Patel. TALE: A Tool for Approximate Large Graph Matching. ICDE, 2008.
• [He and Singh, SIGMOD’08] Huahai He, Ambuj K. Singh. Graphs-at-a-time: Query Language and Access Methods for Graph Databases. SIGMOD, 2008.
• [Zhao and Han, VLDB’10] Peixiang Zhao, Jiawei Han. On Graph Query Optimization in Large Networks. VLDB, 2010.
• [He and Singh, ICDE’06] Huahai He, Ambuj K. Singh. Closure-Tree: An Index Structure for Graph Queries. ICDE, 2006.
• [Shang et al., VLDB’08] Haichuan Shang, Ying Zhang, Xuemin Lin, Jeffrey Xu Yu. Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism. VLDB, 2008.
60. References
• [Zhang et al., EDBT’09] Shijie Zhang, Shirong Li, Jiong Yang. GADDI: Distance Index based Subgraph Matching in Biological Networks. EDBT, 2009.
• [Zhang et al., CIKM’10] Shijie Zhang, Shirong Li, Jiong Yang. SUMMA: Subgraph Matching in Massive Graphs. CIKM, 2010.
• [Kou et al., WAIM’10] Yubo Kou, Yukun Li, Xiaofeng Meng. DSI: A Method for Indexing Large Graphs Using Distance Set. WAIM, 2010.