This document discusses using the Krimp algorithm to perform triclustering on triadic data. Krimp is adapted to find triclusters by identifying sets of attribute-condition pairs that compress a dataset. Experiments applying this approach to movie and bibliography datasets found that it produced high-density triclusters quickly, though allowing singleton itemsets led to many more triclusters. While Krimp shows promise for triadic data analysis, there is a tradeoff between coverage and the number of triclusters when larger minimum itemset sizes are required.
Since the advent of the horseshoe priors for regularization, global-local shrinkage methods have proved to be a fertile ground for the development of Bayesian theory and methodology in machine learning. They have achieved remarkable success in computation, and enjoy strong theoretical support. Much of the existing literature has focused on the linear Gaussian case. The purpose of the current talk is to demonstrate that the horseshoe priors are useful more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models; multivariate models; and deep neural networks. We also outline the recent computational developments in horseshoe shrinkage for complex models along with a list of available software implementations that allows one to venture out beyond the comfort zone of the canonical linear regression problems.
Medical pathology images are visually evaluated by experts for disease diagnosis, but the connection between image features and the state of the cells in an image is typically unknown. To understand this relationship, we describe a multimodal modeling and inference framework that estimates shared latent structure of joint gene expression levels and medical image features. The method is built around probabilistic canonical correlation analysis (PCCA), which is jointly fit to image embeddings that are learned using convolutional neural networks and linear embeddings of paired gene expression data. We finally discuss a set of theoretical and empirical challenges in domain adaptation settings arising from genomics data. (Based on work in collaboration with Gregory Gundersen and Barbara E. Engelhardt.)
We provide a review of the recent literature on statistical risk bounds for deep neural networks. We also discuss some theoretical results that compare the performance of deep ReLU networks to other methods such as wavelets and spline-type methods. The talk will moreover highlight some open problems and sketch possible new directions.
In this paper we discuss the pseudo-integral of a measurable function based on a strict pseudo-addition and pseudo-multiplication. Furthermore, we establish several important properties of the pseudo-integral of a measurable function based on a strict pseudo-addition decomposable measure.
Typically quantifying uncertainty requires many evaluations of a computational model or simulator. If a simulator is computationally expensive and/or high-dimensional, working directly with a simulator often proves intractable. Surrogates of expensive simulators are popular and powerful tools for overcoming these challenges. I will give an overview of surrogate approaches from an applied math perspective and from a statistics perspective with the goal of setting the stage for the "other" community.
A One-Pass Triclustering Approach: Is There any Room for Big Data? (Dmitrii Ignatov)
An efficient one-pass online algorithm for triclustering of binary data (triadic formal contexts) is proposed. This algorithm is a modified version of the basic algorithm of the OAC-triclustering approach, but it has linear time and memory complexities with respect to the cardinality of the underlying ternary relation and can be easily parallelized for the analysis of big datasets. The results of computer experiments show the efficiency of the proposed algorithm.
The goal of this work is to advance our understanding of what new insights can be gained about crypto-tokens by analyzing the topological structure of the Ethereum transaction network. By introducing a novel combination of tools from topological data analysis and functional data depth into blockchain data analytics, we show that the Ethereum network can provide critical insights on price strikes of crypto-tokens that are otherwise largely inaccessible with conventional data sources and traditional analytic methods.
On the Family of Concept Forming Operators in Polyadic FCA (Dmitrii Ignatov)
Triadic Formal Concept Analysis (3FCA) was introduced by Lehmann and Wille almost two decades ago, and many researchers in Data Mining and Formal Concept Analysis work with the notions of closed sets, Galois and closure operators, and closure systems. However, even though different researchers actively work on mining triadic and n-ary relations, a proper closure operator for the enumeration of triconcepts, i.e. maximal triadic cliques of tripartite hypergraphs, has not yet been introduced. In this talk we show that the previously introduced operators for obtaining triconcepts are not always consistent, describe their family, and study their properties. We also introduce the notion of a maximal switching generator to explain why such concept-forming operators are not closure operators: they violate the monotonicity property.
Multi Model Ensemble (MME) predictions are a popular ad-hoc technique for improving predictions of high-dimensional, multi-scale dynamical systems. The heuristic idea behind the MME framework is simple: given a collection of models, one considers predictions obtained through the convex superposition of the individual probabilistic forecasts in the hope of mitigating model error. However, it is not obvious whether this is a viable strategy and which models should be included in the MME forecast in order to achieve the best predictive performance. I will present an information-theoretic approach to this problem which allows for deriving a sufficient condition for improving dynamical predictions within the MME framework; moreover, this formulation gives rise to systematic and practical guidelines for optimising data assimilation techniques which are based on multi-model ensembles. Time permitting, the role and validity of “fluctuation-dissipation” arguments for improving imperfect predictions of externally perturbed non-autonomous systems - with possible applications to climate change considerations - will also be addressed.
Survey slides for contextual bandits.
Main reference: Li Zhou. A Survey on Contextual Multi-armed Bandits. arXiv, 2015. (https://arxiv.org/abs/1508.03326)
Searching for optimal patterns in Boolean tensors (Dmitrii Ignatov)
These are our slides for a spotlight talk at the Learning with Tensors workshop at NIPS 2016. We briefly summarise a comparison of five different triclustering algorithms (TRIAS, TriBox, OACPrime, OACBox, and SpecTric).
For the canonical regression setup where one wants to discover the relationship between Y and a p-dimensional vector x, BART (Bayesian Additive Regression Trees) approximates the conditional mean E[Y|x] with a sum-of-regression-trees model, where each tree is constrained by a regularization prior to be a weak learner. Fitting and inference are accomplished via a scalable iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior. Effectively, BART is a nonparametric Bayesian regression approach which uses dimensionally adaptive random basis elements. Motivated by ensemble methods in general, and boosting algorithms in particular, BART is defined by a statistical model: a prior and a likelihood. This approach enables full posterior inference including point and interval estimates of the unknown regression function as well as the marginal effects of potential predictors. By keeping track of predictor inclusion frequencies, BART can also be used for model-free variable selection. To further illustrate the modeling flexibility of BART, we introduce two elaborations, MBART and HBART. Exploiting the potential monotonicity of E[Y|x] in components of x, MBART incorporates such monotonicity with a multivariate basis of monotone trees. To allow for the possibility of heteroscedasticity, HBART incorporates an additional product of regression trees model component for the conditional.
Digital Signal Processing [ECEG-3171] - Ch1_L07 (Rediet Moges)
This Digital Signal Processing lecture material is the property of the author (Rediet M.). It is not for publication, nor is it to be sold or reproduced.
For the discovery of a regression relationship between y and x, a vector of p potential predictors, the flexible nonparametric nature of BART (Bayesian Additive Regression Trees) allows for a much richer set of possibilities than restrictive parametric approaches. To exploit the potential monotonicity of the predictors, we introduce mBART, a constrained version of BART that incorporates monotonicity with a multivariate basis of monotone trees, thereby avoiding the further confines of a full parametric form. Using mBART to estimate such effects yields (i) function estimates that are smoother and more interpretable, (ii) better out-of-sample predictive performance and (iii) less post-data uncertainty. By using mBART to simultaneously estimate both the increasing and the decreasing regions of a predictor, mBART opens up a new approach to the discovery and estimation of the decomposition of a function into its monotone components.
(This is joint work with H. Chipman, R. McCulloch and T. Shively).
Conventional Implicature via Dependent Type Semantics (Daisuke BEKKI)
Guest lecture in "expressive content" course (by Eric McCready and Daniel Gutzmann) in the 27th European Summer School in Logic, Language and Information (ESSLLI 2015), Barcelona, Spain.
Quarterly national accounts: methodological advances and prospects for innovation
Seminar
Rome, 21 April 2016
Istat, Aula Magna
Via Cesare Balbo, 14
Similar to Turning Krimp into a Triclustering Technique on Sets of Attribute-Condition Pairs that Compress (20)
Interpretable Concept-Based Classification with Shapley Values (Dmitrii Ignatov)
The slides contain our talk on Shapley values as an interpretable machine learning technique for the JSM method, a rule-based classification and reasoning technique; the values rank the particular attributes of an undetermined example under classification.
https://doi.org/10.1007/978-3-030-57855-8_7
These are the opening slides of the 8th International Conference on Analysis of Images, Social Networks and Texts (AIST 2019). We summarise general facts about the AIST conference series. See the http://aistconf.org website for more details.
A short introduction to Sequential Pattern Mining (in Russian). We consider frequent and frequent closed sequences along with two algorithms (SPADE and PrefixSpan). A demographic case study is provided as well. One can find links and references to relevant literature and software. We mainly follow the Han & Kamber Data Mining book (2nd edition, Chapter 8.3).
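The summary above mentions SPADE and PrefixSpan; as a minimal, hypothetical illustration of the underlying notion they build on (the support of a sequential pattern), not of either algorithm itself:

```python
# Sketch: support counting for sequential patterns (illustrative only,
# not SPADE or PrefixSpan; all names here are hypothetical).
# A pattern is a subsequence of a sequence if its items occur in the
# same order, possibly with gaps in between.
def is_subsequence(pattern, seq):
    it = iter(seq)
    # `item in it` advances the iterator, enforcing left-to-right order
    return all(item in it for item in pattern)

def support(pattern, db):
    """Number of sequences in db containing pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in db)

db = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
# ["a", "c"] occurs in the first two sequences, so with min_sup = 2
# it counts as a frequent sequential pattern.
```

SPADE and PrefixSpan avoid this brute-force scan by using id-lists and prefix-projected databases, respectively, but the support definition they optimise is the one above.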
NIPS 2016, Tensor-Learn@NIPS, and IEEE ICDM 2016 (Dmitrii Ignatov)
Some photo impressions from NIPS & ICDM 2016 in Barcelona mixed with workshops like Learning with Tensors (http://tensor-learn.org/) and related stuff.
Experimental Economics and Machine Learning workshop (Dmitrii Ignatov)
This presentation summarises recent activities around the EEML workshop organisation. It is a successful event which attracts economists and computer scientists who would like to use recent advances in machine learning and data mining to understand human behavior in different domains related to Economics and Social Science.
Pattern-based classification of demographic sequences (Dmitrii Ignatov)
We have proposed prefix-based gapless sequential patterns for the classification of demographic sequences. In comparison to black-box machine learning techniques, this approach provides interpretable patterns suitable for treatment by professional demographers. As the pattern language, we have used Pattern Structures, an extension of Formal Concept Analysis to complex data such as sequences, graphs, and intervals.
This paper presents an interesting idea of how to compute a consensus of several k-partitions of a set by finding an antichain in the concept lattice of an appropriate formal context.
AIST is a scientific conference on Analysis of Images, Social Networks, and Texts. The conference is intended for computer scientists and practitioners whose research interests involve Internet mathematics and other related fields of data science. Similar to the previous year, the conference will be focused on applications of data mining and machine learning techniques to various problem domains: image processing, analysis of social networks, and natural language processing. We hope that the participants will benefit from the interdisciplinary nature of the conference and exchange experience.
In our previous work, an efficient one-pass online algorithm for triclustering of binary data (triadic formal contexts) was proposed. This algorithm is a modified version of the basic algorithm of the OAC-triclustering approach; it has linear time and memory complexities. In this paper we parallelise it via the MapReduce framework in order to make it suitable for big datasets. The results of computer experiments show the efficiency of the proposed algorithm; for example, it outperforms the online counterpart on the Bibsonomy dataset with ≈800,000 triples.
Context-Aware Recommender System Based on Boolean Matrix Factorisation (Dmitrii Ignatov)
In this work we propose and study an approach for collaborative filtering which is based on Boolean matrix factorisation and exploits additional (context) information about users and items. To avoid similarity loss in the case of a Boolean representation, we use an adjusted type of projection of a target user onto the obtained factor space.
We have compared the proposed method with an SVD-based approach on the MovieLens dataset. The experiments demonstrate that the proposed method has better MAE and Precision and comparable Recall and F-measure. We also report an increase in quality when context information is present.
Pattern Mining and Machine Learning for Demographic Sequences (Dmitrii Ignatov)
In this talk, we present the results of our first studies applying pattern mining and machine learning to the analysis of demographic sequences in Russia, based on data covering 11 generations from 1930 to 1984. The main goal is not prediction or the data mining methods themselves, but rather the extraction of interesting patterns and knowledge acquisition from substantial datasets of demographic data. We use decision trees as a technique for demographic event prediction and emerging patterns for finding significant and potentially useful sequences.
RAPS: A Recommender Algorithm Based on Pattern Structures (Dmitrii Ignatov)
We propose a new algorithm for recommender systems with numeric ratings which is based on Pattern Structures (RAPS). As input, the algorithm takes a rating matrix, e.g., one that contains movies rated by users. For a target user, the algorithm returns a ranked list of items (movies) based on the user's previous ratings and the ratings of other users. We compare the results of the proposed algorithm in terms of precision and recall with Slope One, one of the state-of-the-art item-based algorithms, on the MovieLens dataset; RAPS demonstrates the best or comparable quality.
Mining frequent itemsets (of attributes or products) and association rules (Dmitrii Ignatov)
A short introduction to association rule mining in terms of Formal Concept Analysis. Example tasks: near-duplicate document detection, website traffic analysis, contextual advertising.
Boolean matrix factorisation for collaborative filtering (Dmitrii Ignatov)
We propose a new approach for collaborative filtering which is based on Boolean Matrix Factorisation (BMF) and Formal Concept Analysis. In a series of experiments on real data (the MovieLens dataset) we compare the approach with an SVD-based one in terms of Mean Absolute Error (MAE). One of the experimental findings is that binary-scaled rating data suffice for BMF to obtain almost the same MAE as the SVD-based algorithm achieves on non-scaled data.
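As a minimal illustration of the Boolean matrix product behind BMF (a sketch only, not the paper's factorisation algorithm): reconstruction replaces the sum of an ordinary matrix product with logical OR.

```python
# Sketch: Boolean matrix product used to reconstruct a binary matrix
# from its Boolean factors. (A o B)[i][j] = OR over l of (A[i][l] AND B[l][j]).
def bool_product(A, B):
    k = len(B)
    return [[int(any(A[i][l] and B[l][j] for l in range(k)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Toy user-factor matrix A and factor-item matrix B (names hypothetical).
A = [[1, 0],
     [1, 1]]
B = [[1, 1, 0],
     [0, 1, 1]]
R = bool_product(A, B)  # binary "user likes item" matrix from the factors
# R == [[1, 1, 0], [1, 1, 1]]
```

In the binary-scaled setting of the abstract, the factorisation task is the reverse direction: find Boolean A and B whose product approximates the observed rating matrix R.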
Online Recommender System for Radio Station Hosting: Experimental Results Rev... (Dmitrii Ignatov)
We present a new recommender system developed for the Russian interactive radio network FMhost based on a previously proposed model. The underlying model combines a collaborative user-based approach with information from the tags of listened tracks in order to match user and radio station profiles. It follows an adaptive online learning strategy based on the user history. We compare the proposed algorithms with an industry-standard technique based on singular value decomposition (SVD) in terms of precision, recall, and NDCG measures; experiments show that in our case the fusion-based approach gives the best results.
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx (RASHMI M G)
Abnormal or anomalous secondary growth in plants. It defines secondary growth as an increase in plant girth due to vascular cambium or cork cambium. Anomalous secondary growth does not follow the normal pattern of a single vascular cambium producing xylem internally and phloem externally.
Seminar of U.V. Spectroscopy by SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V... (Wasswaderrick3)
In this book, we use conservation of energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/velocity and then from this we derive the Poiseuille flow equation, the transition flow equation and the turbulent flow equation. In situations where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross sectional areas connected together. We also extend our techniques of energy conservation to a sphere falling in a viscous medium under the effect of gravity. We demonstrate Stokes' equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium. We also look at the general equation of terminal velocity.
Richard's adventures in two entangled wonderlands (Richard Gill)
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... (Sérgio Sacani)
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige... (University of Maribor)
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Nucleophilic Addition of carbonyl compounds.pptx (SSR02)
Nucleophilic addition is the most important reaction of carbonyls. Not just aldehydes and ketones, but also carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx (MAGOTI ERNEST)
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poor-quality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for the cultivation of fish, crustacean, and shellfish larvae. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
Nutraceutical market, scope and growth: Herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional foods, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people increasingly want natural and preventive health solutions, this industry is growing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and to offer significant opportunities for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
BREEDING METHODS FOR DISEASE RESISTANCE.pptx (RASHMI M G)
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes.
Mudde & Rovira Kaltwasser. - Populism in Europe and the Americas - Threat Or...
Turning Krimp into a Triclustering Technique on Sets of Attribute-Condition Pairs that Compress
1. Turning Krimp into a Triclustering Technique on Sets of Attribute-Condition Pairs that Compress
Maxim Yurov and Dmitry I. Ignatov
National Research University Higher School of Economics, Moscow, Russia
Data Analysis and AI Dept. & Intelligent Systems and Structural Analysis Lab @
Computer Science Faculty
IJCRS 2017
Olsztyn, Poland
03.07.2017
3. Research Domain
Frequent Itemset Mining (FIM) is one of the basic problems in Data Mining.
One of the first FIM tasks was market basket analysis (Agrawal et al., 1993).
One of the first FIM algorithms was Apriori (Agrawal et al., 1994).
4. Frequent Itemset Mining
Problem: a humongous number of frequent itemsets, which complicates the search for the most interesting patterns among them.
Q: How to solve it?
A: For example, use the Minimum Description Length (MDL) principle:
MDL principle
The best set of frequent itemsets compresses the input data best (Siebes A., Vreeken J., van Leeuwen M., Itemsets that compress, 2011).
7. Krimp Algorithm
Input
A database D of transactions over a set of items I (like purchases in a supermarket).
Code Table
The code table CT is a table with two columns: itemsets on the left and their codes on the right.
The left column contains at least all singleton itemsets.
The codes are unique.
9. Figure: Code table example. The width of the Code column shows the
length of the code. I = {A, B, C}. NB: the Usage column is not part
of the code table.2
2 Siebes A., Vreeken J., van Leeuwen M.: Itemsets that compress (2011).
10. Figure: Example of a database, its cover, and the encoded database
based on the code table from Fig. 1. I = {A, B, C}.3
3 Siebes A., Vreeken J., van Leeuwen M.: Itemsets that compress (2011).
11. Figure: Example of the standard code table for the database from
Fig. 2, its cover, and the encoded database.4
4 Siebes A., Vreeken J., van Leeuwen M.: Itemsets that compress (2011).
12. Minimal Coding Set Problem
Let I be a set of items, D be a dataset of transactions (itemsets)
over I, cover be a coverage function, and F be a set of candidate
itemsets. Find the minimal coding set CS ⊆ F such that the
resulting code table CT minimizes the total size of the encoded
database plus the code table, L(D, CT):
L(D, CT) = L(D | CT) + L(CT | D)
L(CT | D) = Σ_{X ∈ CT : usage_D(X) ≠ 0} (L(code_ST(X)) + L(code_CT(X)))
L(D | CT) = Σ_{t ∈ D} L(t | CT)
L(t | CT) = Σ_{X ∈ cover(CT, t)} L(code_CT(X))
L(code_CT(X)) = |code_CT(X)|
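A runnable sketch of computing L(D, CT), under two simplifying assumptions not stated on the slide: code lengths are Shannon-optimal for the observed usage counts, and the L(code_ST(X)) part of L(CT | D) is omitted for brevity; `cover` is a greedy cover function.

```python
from math import log2

def cover(ct, t):
    """Greedily cover transaction t with itemsets from ct
    (ct is assumed to be sorted in Standard Cover Order)."""
    remaining, used = set(t), []
    for X in ct:
        if X <= remaining:
            used.append(X)
            remaining -= X
    return used

def total_encoded_size(D, ct):
    """L(D, CT) = L(D|CT) + L(CT|D), with L(CT|D) reduced to the
    CT-code side (the standard-code part is omitted here)."""
    covers = [cover(ct, t) for t in D]
    usage = {}
    for c in covers:
        for X in c:
            usage[X] = usage.get(X, 0) + 1
    total = sum(usage.values())
    length = {X: -log2(u / total) for X, u in usage.items()}
    l_d_given_ct = sum(length[X] for c in covers for X in c)
    l_ct_given_d = sum(length.values())  # one CT code per used itemset
    return l_d_given_ct + l_ct_given_d

D = [frozenset("AB"), frozenset("ABC"), frozenset("A")]
ct = [frozenset("AB"), frozenset("A"), frozenset("B"), frozenset("C")]
size = total_encoded_size(D, ct)
```

Here {A, B} covers two transactions and the singletons mop up the rest, so the pattern {A, B} earns a short code and lowers the total size.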
13. Krimp algorithm
The algorithmic strategy
It starts with the standard code table ST, which contains only the
singletons X ∈ I.
Then it adds the other itemsets (candidates) from F one by one.
If the resulting code table achieves better compression, Krimp
keeps it and continues the search. Otherwise, Krimp discards this
itemset.
14. Krimp algorithm
Standard Cover Order
Let us order X ∈ CT by decreasing cardinality, then by decreasing
support, and finally in lectic order:
|X| ↓ suppD(X) ↓ lexicographically ↑
Standard Candidate Order
Frequent and long itemsets have priority:
suppD(X) ↓ |X| ↓ lexicographically ↑
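Both orders can be written as Python sort keys (a sketch; `supp` is an assumed dict mapping each itemset to its support in D):

```python
def standard_cover_key(X, supp):
    # |X| descending, support descending, lexicographic ascending
    return (-len(X), -supp[X], sorted(X))

def standard_candidate_key(X, supp):
    # support descending, |X| descending, lexicographic ascending
    return (-supp[X], -len(X), sorted(X))

supp = {frozenset("AB"): 5, frozenset("ABC"): 3, frozenset("C"): 5}
F = sorted(supp, key=lambda X: standard_candidate_key(X, supp))
# {A, B} precedes {C}: at equal support the longer itemset wins;
# {A, B, C} comes last because its support is lowest.
```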
15. Krimp algorithm
Input: a transaction database D and a candidate set F over an
input set of items I.
Output: a heuristic solution to the Minimal Coding Set Problem,
the code table CT.
CT ← StandardCodeTable(D)
F0 ← F in Standard Candidate Order
for F ∈ F0 \ {{i} | i ∈ I} do
  CTc ← (CT ∪ F) in Standard Cover Order
  if L(D, CTc) < L(D, CT) then
    CT ← CTc
  end
end
return CT
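The loop above, translated into a self-contained Python sketch. It reuses the same simplified scoring assumptions as before (Shannon-optimal code lengths from usage counts; the standard-code part of L(CT | D) omitted), so it illustrates the search strategy rather than reproducing the original implementation.

```python
from math import log2

def cover(ct, t):
    remaining, used = set(t), []
    for X in ct:                       # ct is kept in Standard Cover Order
        if X <= remaining:
            used.append(X)
            remaining -= X
    return used

def L(D, ct):
    """Simplified total encoded size L(D, CT)."""
    covers = [cover(ct, t) for t in D]
    usage = {}
    for c in covers:
        for X in c:
            usage[X] = usage.get(X, 0) + 1
    total = sum(usage.values())
    length = {X: -log2(u / total) for X, u in usage.items()}
    return sum(length[X] for c in covers for X in c) + sum(length.values())

def krimp(D, F, items):
    singles = [frozenset({i}) for i in items]
    supp = {X: sum(X <= t for t in D) for X in set(F) | set(singles)}
    cover_key = lambda X: (-len(X), -supp[X], sorted(X))
    cand_key = lambda X: (-supp[X], -len(X), sorted(X))
    ct = sorted(singles, key=cover_key)           # standard code table ST
    for X in sorted((X for X in F if len(X) > 1), key=cand_key):
        ct_c = sorted(ct + [X], key=cover_key)
        if L(D, ct_c) < L(D, ct):                 # keep only if it compresses better
            ct = ct_c
    return ct

D = [frozenset("AB")] * 4 + [frozenset("C")]
CT = krimp(D, [frozenset("AB")], "ABC")
assert frozenset("AB") in CT                      # {A, B} pays for itself
```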
16. Krimp algorithm
Figure: The scheme of Krimp.
5 Siebes A., Vreeken J., van Leeuwen M.: Itemsets that compress (2011).
17. Triadic Data
Folksonomy is a ternary relation over sets of objects, attributes and
conditions.6
From a ternary relation to a dyadic one
(Obj., Attr., Cond.) → (Obj., Attr. × Cond.),
where A × B is the Cartesian product of A and B.
6 T. Vander Wal: Folksonomy Coinage and Definition (2007) –
http://vanderwal.net/folksonomy.html
19. Data
1. A sample of Top-250 movies from www.IMDB.com.
The objects are movie titles, the attributes are keywords, and
the conditions are genres.
2. A sample from bibliography sharing system BibSonomy.org.
The objects are users, the attributes are tags, and the
conditions are electronic bookmarks.
20. Example of data transformation
If there is a movie description in terms of keywords and genres
{Star Wars} × {Princess, Empire} × {Adventure, Sci-Fi, Action},
then this piece of data can be transformed into object-attribute
form as follows:
{Star Wars} × {(Princess, Adventure), (Princess, Sci-Fi),
(Princess, Action), (Empire, Adventure),
(Empire, Sci-Fi), (Empire, Action)}.
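This flattening is just the Cartesian product of the keyword set and the genre set, e.g. with itertools:

```python
from itertools import product

keywords = ["Princess", "Empire"]
genres = ["Adventure", "Sci-Fi", "Action"]
pairs = set(product(keywords, genres))   # the new dyadic attributes
assert len(pairs) == 2 * 3
assert ("Empire", "Sci-Fi") in pairs
```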
21. Biclustering
[Mirkin, 1995]
Coinage of the term bicluster
The term bicluster(ing) was proposed by B. Mirkin in the book
Mathematical Classification and Clustering. Kluwer Academic
Publishers (1996).
p. 296
The term biclustering refers to simultaneous clustering of
both row and column sets in a data matrix. Biclustering
addresses the problems of aggregate representation of the
basic features of interrelation between rows and columns
as expressed in the data.
22. Concept-based biclustering
[D. Ignatov and S. Kuznetsov, 2010]
Let K = (G, M, I ⊆ G × M) be a formal context.
Definition 1
If (g, m) ∈ I, then (m′, g′) is called an object-attribute bicluster or
OA-bicluster with density ρ(m′, g′) = |I ∩ (m′ × g′)| / (|m′| · |g′|).7
7 (·)′ : 2^G → 2^M and (·)′ : 2^M → 2^G are the derivation operators applied to
{g} ⊆ G and {m} ⊆ M in the sense of [Ganter & Wille, 1999].
23. Geometric interpretation of OA-bicluster: connection with RST
[D. Ignatov and S. Kuznetsov, 2010]
Figure: Geometric view of the OA-bicluster (m′, g′), with the labeled
regions g, m, g′, m′, g′′, m′′.
24. Triadic FCA and Triclustering
[Lehmann & Wille, 1993]
Consider K = (G, M, B, J ⊆ G × M × B), a triadic context; in
what follows we will refer to a triset T = (X, Y, Z) with X ⊆ G,
Y ⊆ M, Z ⊆ B as an object-attribute-condition tricluster or simply
tricluster8.
8 Ignatov, D.I., Gnatyshak, D.V., Kuznetsov, S.O., Mirkin, B.G.: Triadic
formal concept analysis and triclustering: searching for optimal patterns.
Machine Learning 101(1-3) (2015) 271–302
25. KRIMP-based triclusters
Each coding set of (attribute, condition) pairs found by Krimp
is contained as a coding block in the description of some
object g ∈ G.
Let S be a coding set returned by Krimp that consists of n
attribute-condition pairs from M × B.
Then the first component X of the corresponding tricluster is
X = {g | (g, m, b) ∈ J for all (m, b) ∈ S}.
The remaining two components are the projections of S:
Y = {m | ∃b : (m, b) ∈ S} and Z = {b | ∃m : (m, b) ∈ S}.
S is not necessarily equal to Y × Z, so some amount of
missing triples is allowed inside T = (X, Y, Z). The quality of
such a tricluster can be assessed by its density.
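A sketch of this construction over a ternary relation J stored as a set of (object, attribute, condition) triples; the function and dataset names are illustrative:

```python
def tricluster_from_coding_set(J, objects, S):
    """Build T = (X, Y, Z) from a coding set S of (attribute, condition)
    pairs: X collects the objects whose descriptions contain all of S;
    Y and Z are the projections of S."""
    X = {g for g in objects if all((g, m, b) in J for (m, b) in S)}
    Y = {m for (m, _) in S}
    Z = {b for (_, b) in S}
    return X, Y, Z

# Toy data in the spirit of the IMDB example:
J = {("Star Wars", "Princess", "Sci-Fi"),
     ("Star Wars", "Empire", "Sci-Fi"),
     ("Return of the Jedi", "Princess", "Sci-Fi"),
     ("Return of the Jedi", "Empire", "Sci-Fi"),
     ("The Terminator", "Cyborg", "Sci-Fi")}
objects = {g for (g, _, _) in J}
S = {("Princess", "Sci-Fi"), ("Empire", "Sci-Fi")}
X, Y, Z = tricluster_from_coding_set(J, objects, S)
```

Here both Star Wars films contain every pair in S, so X holds exactly those two movies.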
26. Quality measures
Density
ρ(T_i) = |J ∩ (X × Y × Z)| / (|X| · |Y| · |Z|)
For the tricluster collection:
ρ(T) = (Σ_{T_i ∈ T} ρ(T_i)) / |T|
Coverage
coverage(T, K) = |(∪_{(X,Y,Z) ∈ T} X × Y × Z) ∩ J| / |J|
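Both measures in a short sketch, assuming J is a set of (object, attribute, condition) triples and a tricluster is a triple of sets (X, Y, Z):

```python
from itertools import product

def density(J, T):
    """Fraction of the tricluster's cells that are actual triples of J."""
    X, Y, Z = T
    inside = sum((g, m, b) in J for (g, m, b) in product(X, Y, Z))
    return inside / (len(X) * len(Y) * len(Z))

def coverage(J, triclusters):
    """Fraction of J covered by at least one tricluster."""
    covered = {t for (X, Y, Z) in triclusters
               for t in product(X, Y, Z) if t in J}
    return len(covered) / len(J)

J = {(1, "a", "x"), (1, "b", "x"), (2, "a", "x"), (2, "b", "x"), (3, "c", "y")}
T = ({1, 2}, {"a", "b"}, {"x"})
assert density(J, T) == 1.0        # every cell of T is in J
assert coverage(J, [T]) == 4 / 5   # one of five triples stays uncovered
```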
30. Examples of triclusters
Three triclusters extracted by Krimp from IMDB dataset.
Tricluster 1.
Keyword-genre component:
{(Princess,Adventure), (Princess,Fantasy), (Empire,Sci-Fi),
(Empire,Adventure), (Empire,Action), (Princess,Sci-Fi),
(Princess,Action), (Empire,Fantasy), (Death Star,Sci-Fi),
(Death Star,Fantasy), (Death Star,Adventure),
(Death Star,Action)},
(2,2)
Movies component:
{Star Wars: Episode VI – Return of the Jedi (1983),
Star Wars (1977)}
31. Examples of triclusters
Three triclusters extracted by Krimp from IMDB dataset.
Tricluster 2.
Keyword-genre component:
{(Future,Sci-Fi), (Future,Thriller), (Future,Action), (Cyborg,Thriller),
(Cyborg,Sci-Fi), (Cyborg,Action), (The Terminator,Thriller),
(The Terminator,Sci-Fi), (The Terminator,Action) },
(2,2)
Movies component:
{The Terminator (1984), Terminator 2: Judgment Day (1991)}
Tricluster 3.
Keyword-genre component:
{(Gotham,Thriller), (Gotham,Drama), (Gotham,Crime), (Gotham,Action),
(Batman,Thriller), (Batman,Drama), (Batman,Crime), (Batman,Action)},
(2,2)
Movies component:
{Batman Begins (2005), The Dark Knight (2008)}.
32. Conclusion
Krimp (or its descendants) can be considered a prospective
method for triadic data analysis.
The positive features:
fast computation time (although on datasets of rather
moderate size with the lowest minimal support minsup = 2);
absolutely dense triclusters (however, this may not be the case
for sparse and noisy datasets);
we can select a rather small set of “large” triclusters (e.g., by
imposing higher support for non-singletons).
The negative features:
the strong trade-off between coverage and the number of
triclusters (when switching from coding sets with singletons to
itemsets of higher size);
an even higher number of triclusters than the number of
triconcepts when the usage of singletons is allowed.