This document presents an algorithm for matching properties between linked databases using Kullback-Leibler divergence (KL-Divergence). It first creates documents representing the distributions of objects linked to properties in each database. It then computes the normalized KL-Divergence between all document pairs to identify the most similar properties. The property with the lowest KL-Divergence score to a given property is returned as its match. Experimental results on real linked datasets found the algorithm could accurately match properties over 90% of the time.
Property Matching and Query Expansion on Linked Data Using Kullback-Leibler Divergence
1. Property Matching and Query Expansion on
Linked Data Using Kullback-Leibler Divergence
Sean Golliher, Nathan Fortier, Logan Perreault
December 12, 2013
1 / 25
3. def: Query Expansion
Query expansion (QE) is the process of reformulating a seed
query to improve retrieval performance in information retrieval
operations.
7. Property Matching Problem
How do we find all actors in both databases?
Don’t want to manually inspect all databases
Can we use SPARQL query language to infer across all datasets?
SELECT ?p
WHERE { ?s ?p ?o }
Can only match total sizes of returned triple sets
8. Original Bayesian Approach
Problems with Bayesian Approach
Had to create, and track, a large vocabulary for training
Smoothing issues with very sparse text
Underflow issues – small confidence values
Complexity of likelihood was growing:
n different features in feature set X and c classes + tunable parameters.
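The underflow problem can be reproduced in a few lines. This is a minimal sketch: the per-term likelihood of 1e-5 and the count of 100 terms are made-up values, not figures from the original experiments.

```python
import math

# Assumed per-term likelihoods for a sparse text (illustrative values only).
probabilities = [1e-5] * 100

# Multiplying many small likelihoods underflows to 0.0 before the loop ends.
product = 1.0
for p in probabilities:
    product *= p

# Summing logarithms keeps the same information in a representable range.
log_sum = sum(math.log(p) for p in probabilities)

print(product)   # 0.0
print(log_sum)   # a large negative number instead of 0.0
```

Working in log space avoids the underflow, but it does not address the vocabulary and smoothing problems listed above.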
9. KL-Divergence
Original paper from 1951 entitled “On Information and Sufficiency”
Also referred to as “relative entropy”
A system gains entropy when it moves to a state with more possible
arrangements, for example when a liquid becomes a gas.
Used in a paper from 2003 for text categorization:
“Using KL-Distance for Text Categorization”
Elegant and efficient method for plagiarism detection
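As a small worked example, the standard (asymmetric) relative entropy between two discrete distributions can be computed as below. The distributions p and q are arbitrary illustrative values, and the function name relative_entropy is a label chosen here, not one from the slides.

```python
import math


def relative_entropy(p, q):
    """D(P || Q) = sum over x of P(x) * log(P(x) / Q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


p = [0.5, 0.5]  # illustrative distribution
q = [0.9, 0.1]  # illustrative distribution

forward = relative_entropy(p, q)   # about 0.51 nats
backward = relative_entropy(q, p)  # about 0.37 nats
same = relative_entropy(p, p)      # 0.0: a distribution does not diverge from itself
```

The two directions give different values, which is why text-categorization work often uses a symmetric variant of the measure.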
14. Formal Problem Statement
Given:
Two databases DB1 and DB2
A predicate p1 ∈ DB1
An object type S1 where some triple “s p1 o” exists in DB1
where s ∈ S1
Find predicate p2 in DB2 where p2 is equivalent to p1
15. High Level Description
Create a document d1 containing labels of all objects linked
by p1
Find an object type S2 ∈ DB2 where S1 is equivalent to S2
For each predicate p2 used by S2 create a document d2
containing labels of all objects linked by p2
Remove stop words and language tags from d1 and d2
For each document compute the normalized KL-Divergence,
KLD*(d1, d2)
Return predicate corresponding to the document with the
lowest KL-Divergence
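The document-building and cleaning steps above can be sketched as follows. This is only an illustration under assumptions: the stop-word list is a tiny placeholder, labels are assumed to carry RDF-style language tags such as @en, and the name build_document is hypothetical.

```python
import re

# Assumed, deliberately tiny stop-word list (a real run would use a fuller one).
STOP_WORDS = {"the", "of", "and", "a", "in"}

# Matches a trailing RDF-style language tag such as @en or @en-US.
LANGUAGE_TAG = re.compile(r"@[A-Za-z]{2,3}(-[A-Za-z0-9]+)*$")


def build_document(labels, stop_words=STOP_WORDS):
    """Turn the labels of all linked objects into one bag of lower-cased tokens."""
    tokens = []
    for label in labels:
        text = LANGUAGE_TAG.sub("", label.strip())
        tokens.extend(
            word for word in re.findall(r"\w+", text.lower())
            if word not in stop_words
        )
    return tokens


d1 = build_document(['"The Lord of the Rings"@en', '"Tom Hanks"@en'])
print(d1)  # ['lord', 'rings', 'tom', 'hanks']
```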
16. Algorithm 1 FindPredicate(DB1, DB2, p1, S1)
Create document d1 containing labels of all objects linked by p1
Find an object type S2 ∈ DB2 where S1 is equivalent to S2
for each predicate pi used by S2 do
    Create document di containing labels of all objects linked by pi
end for
Remove stop words and language tags from d1 and each di
min ← 1
for each predicate pi used by S2 do
    k ← KLD*(d1, di)
    if k < min then
        min ← k
        pmap ← pi
    end if
end for
return pmap
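The selection loop of Algorithm 1 might look like this in Python. It is a sketch under assumptions: documents are already built as token lists, the divergence argument is any callable standing in for the normalized KL-Divergence, and jaccard_distance and the ex: predicate names exist only for the demo. Unlike the pseudocode, the running minimum starts at infinity rather than 1, so a predicate is returned even when every score is 1 or higher.

```python
def find_predicate(d1, candidate_docs, divergence):
    """Return the predicate whose document has the lowest divergence from d1.

    d1: token list for the source predicate p1.
    candidate_docs: dict mapping each predicate used by S2 to its token list.
    divergence: callable(doc_a, doc_b) -> float, lower means more similar.
    """
    best_score, best_predicate = float("inf"), None
    for predicate, doc in candidate_docs.items():
        score = divergence(d1, doc)
        if score < best_score:
            best_score, best_predicate = score, predicate
    return best_predicate


def jaccard_distance(a, b):
    """Stand-in divergence for this demo only (not KL-Divergence)."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)


# Hypothetical candidate predicates and their documents.
candidates = {
    "ex:starring": ["tom", "hanks", "meg", "ryan"],
    "ex:director": ["nora", "ephron"],
}
match = find_predicate(["tom", "hanks", "bill", "pullman"], candidates, jaccard_distance)
print(match)  # ex:starring
```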
17. Computing KL-Divergence
KL-Divergence is computed as
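$$\mathrm{KLD}(d_i, d_j) = \sum_{k \in V} \bigl(P(t_k, d_i) - P(t_k, d_j)\bigr) \times \log \frac{P(t_k, d_i)}{P(t_k, d_j)} \qquad (1)$$
Where
$$P(t_k, d_i) = \frac{tf(t_k, d_i)}{\sum_{x \in d_i} tf(t_x, d_i)} \qquad (2)$$
If tk does not occur in di then P(tk, di) ← ε, a small back-off probability
KL-Divergence is then normalized as follows:
$$\mathrm{KLD}^*(d_i, d_j) = \frac{\mathrm{KLD}(d_i, d_j)}{\mathrm{KLD}(d_i, 0)} \qquad (3)$$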
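The three equations can be sketched in Python as follows. Several choices here are assumptions rather than details from the slides: the back-off value EPSILON = 1e-6, exact-match term counting (Algorithm 2 uses a similarity threshold instead), a vocabulary V taken as the union of both documents' terms, and the reading of the "0" in equation (3) as the empty document.

```python
import math
from collections import Counter

EPSILON = 1e-6  # assumed back-off probability for terms absent from a document


def term_probabilities(tokens, vocabulary, epsilon=EPSILON):
    """Equation (2): relative term frequency, or epsilon for absent terms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        term: (counts[term] / total if counts[term] else epsilon)
        for term in vocabulary
    }


def kld(doc_i, doc_j, epsilon=EPSILON):
    """Equation (1): symmetric divergence summed over the vocabulary V."""
    vocabulary = set(doc_i) | set(doc_j)
    p_i = term_probabilities(doc_i, vocabulary, epsilon)
    p_j = term_probabilities(doc_j, vocabulary, epsilon)
    return sum(
        (p_i[t] - p_j[t]) * math.log(p_i[t] / p_j[t]) for t in vocabulary
    )


def normalized_kld(doc_i, doc_j, epsilon=EPSILON):
    """Equation (3): divide by the divergence from the empty document.

    Assumes doc_i is non-empty; otherwise the denominator is zero.
    """
    return kld(doc_i, doc_j, epsilon) / kld(doc_i, [], epsilon)


actors = ["tom", "hanks", "meg", "ryan"]
close = normalized_kld(actors, ["tom", "hanks", "julia", "roberts"])
far = normalized_kld(actors, ["drama", "comedy"])
print(close < far)  # True: the overlapping document scores lower
```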
18. Algorithm 2 tf(tk, di)
tf ← 0
for each term tx in di do
    if sim(tk, tx) > τ then
        tf ← tf + 1
    end if
end for
return tf
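Algorithm 2 translates almost directly into Python. The slides do not define sim or give a value for τ, so this sketch assumes a case-insensitive character-level ratio from the standard library's difflib.SequenceMatcher and a threshold of 0.8. Both are placeholders.

```python
from difflib import SequenceMatcher


def sim(a, b):
    """Hypothetical similarity in [0, 1]; the slides leave sim undefined."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tf(term, document, tau=0.8):
    """Count the terms in `document` whose similarity to `term` exceeds tau."""
    return sum(1 for other in document if sim(term, other) > tau)


# "hank" is close enough to "hanks" to be counted; "ryan" is not.
count = tf("hanks", ["hanks", "hank", "ryan"])
print(count)  # 2
```

Counting near matches instead of exact matches lets labels that differ by small spelling variations contribute to the same term frequency.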