Given a very large dataset of moderate-to-high dimensionality, how can we mine useful patterns from it? In such cases, dimensionality reduction is essential to overcome the "curse of dimensionality". Although there exist algorithms to reduce the dimensionality of Big Data, unfortunately, they all fail to identify and eliminate non-linear correlations between attributes. This paper tackles the problem by exploring concepts of Fractal Theory and massive parallel processing to present Curl-Remover, a novel dimensionality reduction technique for very large datasets. Our contributions are: Curl-Remover eliminates linear and non-linear attribute correlations as well as irrelevant attributes; it is unsupervised and suits analytical tasks in general, not only classification; it presents linear scale-up; it does not require the user to guess the number of attributes to be removed; and it preserves the attributes' semantics. We performed experiments on synthetic and real data spanning up to 1.1 billion points, and Curl-Remover outperformed a PCA-based algorithm, being up to 8% more accurate.
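The core idea behind fractal-based feature selection can be sketched in a few lines of plain Python: a dataset's fractal (intrinsic) dimension does not grow when an attribute that is correlated with the others, or irrelevant, is added, so an attribute whose removal leaves the fractal dimension unchanged can be dropped. The box-counting estimator and the backward-elimination loop below are illustrative assumptions for a small in-memory dataset, not the actual Curl-Remover implementation:

```python
import math

def box_count_dimension(points, radii=(0.5, 0.25, 0.125, 0.0625)):
    """Estimate the box-counting (fractal) dimension of a point set in [0,1)^d
    by counting occupied grid cells at several scales and fitting
    log(count) vs log(1/r) with least squares."""
    xs, ys = [], []
    for r in radii:
        cells = {tuple(int(c / r) for c in p) for p in points}
        xs.append(math.log(1.0 / r))
        ys.append(math.log(len(cells)))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)

def select_attributes(points, tol=0.1):
    """Greedy backward elimination: drop an attribute if removing it leaves
    the fractal dimension unchanged, meaning the attribute is correlated
    with the remaining ones or irrelevant."""
    keep = list(range(len(points[0])))
    full_d = box_count_dimension(points)
    for a in sorted(keep, reverse=True):
        reduced = [[p[i] for i in keep if i != a] for p in points]
        if abs(box_count_dimension(reduced) - full_d) < tol:
            keep.remove(a)  # attribute carries no new intrinsic dimension
            full_d = box_count_dimension([[p[i] for i in keep] for p in points])
    return keep
```

For points lying on the line y = x, the intrinsic dimension is 1 even though there are two attributes, so the sketch keeps only the first attribute.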
An Optimal Approach For Knowledge Protection In Structured Frequent Patterns (Waqas Tariq)
Data mining is a valuable technology for extracting useful patterns and trends from large volumes of data. When these patterns are to be shared in a collaborative environment, they must be shared protectively among the parties concerned in order to preserve the confidentiality of the sensitive data. The shared information may take the form of datasets or of structured patterns such as trees, graphs, and lattices. This paper proposes a sanitization algorithm for protecting sensitive data in a structured frequent pattern (tree).
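The essence of pattern sanitization can be illustrated with a tiny sketch: before release, drop every frequent pattern that would reveal a sensitive itemset. The function name and the subset-based rule are assumptions for illustration, not the paper's algorithm, which operates on a tree structure:

```python
def sanitize(frequent_patterns, sensitive):
    """Pattern sanitization sketch: remove every frequent pattern (a set of
    items) that contains a sensitive itemset, so the released patterns
    cannot leak the confidential knowledge."""
    return [p for p in frequent_patterns
            if not any(s <= p for s in sensitive)]
```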
Generative Adversarial Networks: Basic architecture and variants (ananth)
In this presentation we review the fundamentals behind GANs and look at different variants. We quickly review the theory, including the cost functions, training procedure, and challenges, and go on to look at variants such as CycleGAN, SAGAN, etc.
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio (Marina Santini)
attribute selection, constructing decision trees, decision trees, divide and conquer, entropy, gain ratio, information gain, machine learning, pruning, rules, surprisal
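The three attribute-selection measures from the lecture fit in a few lines of Python; this is a minimal self-contained sketch (the helper names are my own, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) = -sum p_i * log2(p_i) over the label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, A) = H(S) - sum_v (|S_v|/|S|) * H(S_v), splitting on attr."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in split.values())

def gain_ratio(rows, labels, attr):
    """Gain ratio = Gain(S, A) / SplitInfo(S, A); penalizes attributes
    with many distinct values."""
    si = entropy([row[attr] for row in rows])
    return information_gain(rows, labels, attr) / si if si else 0.0
```

For a perfectly separating binary attribute on a balanced two-class sample, the information gain is 1 bit and the gain ratio is 1.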
Introduction to the R Statistical Computing Environment (izahn)
Get an introduction to R, the open-source system for statistical computation and graphics. With hands-on exercises, learn how to import and manage datasets, create R objects, and conduct basic statistical analyses. Full workshop materials can be downloaded from http://projects.iq.harvard.edu/rtc/event/introduction-r
Ranking and Diversity in Recommendations - RecSys Stammtisch at SoundCloud, B... (Alexandros Karatzoglou)
Slides from my talk at the RecSys Stammtisch at SoundCloud in Berlin. The presentation is split into two parts: one focusing on ranking and relevance, and one on diversity and how to achieve it using genres. We introduce a novel diversity metric called Binomial Diversity.
Jan vitek distributedrandomforest_5-2-2013 (Sri Ambati)
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
A TALE of DATA PATTERN DISCOVERY IN PARALLEL (Jenny Liu)
In the era of IoT and A.I., distributed and parallel computing is embracing big-data-driven and algorithm-focused applications and services. Despite rapid progress on parallel frameworks, algorithms, and accelerated computing capacity, it remains challenging to deliver an efficient and scalable data analysis solution. This talk shares research experience on data pattern discovery in domain applications. In particular, the research scrutinizes key factors in analysis workflow design and data parallelism improvement on the cloud.
Strata 2013: Tutorial -- How to Create Predictive Models in R using Ensembles (Intuit Inc.)
This tutorial, based on a published book by Giovanni Seni, offers a hands-on intro to ensemble models, which combine multiple models into a single predictive system that’s often more accurate than the best of its components. Participants will use data sets and snippets of R code to experiment with the methods to gain a practical understanding of this breakthrough technology.
Giovanni Seni is currently a Senior Data Scientist with Intuit where he leads the Applied Data Sciences team. As an active data mining practitioner in Silicon Valley, he has over 15 years R&D experience in statistical pattern recognition and data mining applications. He has been a member of the technical staff at large technology companies, and a contributor at smaller organizations. He holds five US patents and has published over twenty conference and journal articles. His book with John Elder, “Ensemble Methods in Data Mining – Improving accuracy through combining predictions”, was published in February 2010 by Morgan & Claypool. Giovanni is also an adjunct faculty at the Computer Engineering Department of Santa Clara University, where he teaches an Introduction to Pattern Recognition and Data Mining class.
MS CS - Selecting Machine Learning Algorithm (Kaniska Mandal)
ML algorithms usually solve an optimization problem in which we need to find the parameters of a given model that minimize:
— Loss function (prediction error)
— Model simplicity (regularization)
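The two terms above can be made concrete with a one-parameter ridge regression, which minimizes squared prediction error plus an L2 simplicity (regularization) penalty. This is a toy illustrative sketch; the function name, learning rate, and step count are assumptions:

```python
def ridge_fit(xs, ys, lam=0.1, lr=0.01, steps=5000):
    """Fit y ~ w*x by gradient descent on
       J(w) = mean squared error + lam * w**2,
    i.e. a loss term (prediction error) plus a regularization term
    (model simplicity) that shrinks w toward zero."""
    w, n = 0.0, len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n + 2 * lam * w
        w -= lr * grad
    return w
```

With lam=0 the fit recovers the unregularized slope; a positive lam pulls the slope slightly toward zero, matching the closed-form solution w = sum(x*y) / (sum(x^2) + n*lam).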
Jose Rodrigues, Agma J M Traina, Caetano Traina Jr (2003) Frequency Plot and Relevance Plot to Enhance Visual Data Exploration In: XVI Brazilian Symposium on Computer Graphics and Image Processing 117-124 IEEE Press.
@inproceedings { DBLP:conf/sibgrapi/RodriguesTT03,
title = "Frequency Plot and Relevance Plot to Enhance Visual Data Exploration",
year = "2003",
author = "Jose Rodrigues and Agma J M Traina and Caetano Traina Jr",
booktitle = "XVI Brazilian Symposium on Computer Graphics and Image Processing",
pages = "117-124",
publisher = "IEEE Press",
doi = "10.1109/SIBGRA.2003.1240999",
url = "http://www.icmc.usp.br/~junio/PublishedPapers/RodriguesJr_et_al_Frequency_Plot-SIBGRAPI2003.pdf",
urllink = "http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1240999&",
abstract = "We present two techniques aiming at exploring databases through multivariate visualizations. Both techniques intend to deal with the problem caused by the limited amount of elements that can be presented simultaneously in traditional visual exploration procedures. The first technique, the Frequency Plot, combines data frequency with interactive filtering to identify clusters and trends in subsets of the database. Thus, graphical elements (lines, pixels, icons, or graphical marks) are color differentiated proportionally to how frequent the value being represented is, while interactive filtering allows the selection of interesting partitions of the database. The second technique, the Relevance Plot, corresponds to assigning different levels of color distinguishably to visual elements according to their relevance to a user's specified data properties set, which can be chosen visually and dynamically.",
keywords = "Computer science , Data analysis , Data visualization , Filtering , Frequency , Humans , Image databases , Information retrieval , Layout , Visual databases"}
Jose Rodrigues, Agma J M Traina, Christos Faloutsos, Caetano Traina Jr (2006) SuperGraph Visualization In: 8th IEEE International Symposium on Multimedia 227-234 IEEE Press.
@inproceedings { DBLP:conf/ism/RodriguesTFT06,
title = "SuperGraph Visualization",
year = "2006",
author = "Jose Rodrigues and Agma J M Traina and Christos Faloutsos and Caetano Traina Jr",
booktitle = "8th IEEE International Symposium on Multimedia",
pages = "227-234",
publisher = "IEEE Press",
doi = "10.1109/ISM.2006.143",
url = "http://www.icmc.usp.br/~junio/PublishedPapers/RodriguesJr_et_al-ISM2006.pdf",
urllink = "http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4061172",
abstract = "Given a large social or computer network, how can we visualize it, find patterns, outliers, communities? Although several graph visualization tools exist, they cannot handle large graphs with hundred thousand nodes and possibly million edges. Such graphs bring two challenges: interactive visualization demands prohibitive processing power and, even if we could interactively update the visualization, the user would be overwhelmed by the excessive number of graphical items. To cope with this problem, we propose a formal innovation on the use of graph hierarchies that leads to GMine system. GMine promotes scalability using a hierarchy of graph partitions, promotes concomitant presentation for the graph hierarchy and for the original graph, and extends analytical possibilities with the integration of the graph partitions in an interactive environment.",
keywords = "Application software , Bipartite graph , Computer networks , Computer science , Data structures , Scalability , Technological innovation , Tree graphs , Visualization , Web pages"}
Can we use information from social media and crowdsourced images to detect smoke and assist rescue forces? While there are computer vision methods for detecting smoke, they require movement information extracted from video data. In this paper we propose SmokeBlock: a method that is able to segment and detect smoke in still images. SmokeBlock uses superpixel segmentation and extracts local color and texture features from images to spot smoke. We used real data from Flickr and compared SmokeBlock against state-of-the-art methods for feature extraction. Our method achieved performance superior to its competitors for the task of smoke detection. Our findings shall support further investigations in the field of image analysis, in particular concerning images captured with mobile devices.
Several graph visualization tools exist. However, they are not able to handle large graphs, and/or they do not allow interaction. We are interested in large graphs, with hundreds of thousands of nodes. Such graphs bring two challenges: the first one is that any straightforward interactive manipulation will be prohibitively slow. The second one is sensory overload: even if we could plot and replot the graph quickly, the user would be overwhelmed by the vast volume of information because the screen would be too cluttered as nodes and edges overlap each other. The GMine system addresses both these issues by using summarization and multi-resolution. GMine offers multi-resolution graph exploration by partitioning a given graph into a hierarchy of communities-within-communities and storing it into a novel R-tree-like structure which we name G-Tree. GMine offers summarization by implementing an innovative subgraph extraction algorithm and then visualizing its output.
On the Support of a Similarity-Enabled Relational Database Management System ... (Universidade de São Paulo)
Crowdsourcing solutions can be helpful to extract information from disaster-related data during crisis management. However, certain information can only be obtained through similarity operations. Some of them also depend on additional data stored in a Relational Database Management System (RDBMS). In this context, several works focus on crisis management supported by data. Nevertheless, none of them provides a methodology for employing a similarity-enabled RDBMS in disaster-relief tasks. To fill this gap, we introduce a similarity-enabled methodology together with a supporting architecture named Data-Centric Crisis Management (DCCM), which employs our methods over an RDBMS. We evaluate our proposal through three tasks: classification of incoming data regarding current events, identifying relevant information to guide rescue teams; filtering of incoming data, enhancing decision support by removing near-duplicate data; and similarity retrieval of historical data, supporting analytical comprehension of the crisis context. To make this possible, similarity-based operations were implemented within one popular, open-source RDBMS. Results using real data from Flickr show that the proposed methodology over DCCM is feasible for real-time applications. In addition to high performance, accurate results were obtained with a proper combination of techniques for each task. Finally, given its accuracy and efficiency, we expect our work to provide a framework for further developments on crisis management solutions.
StructMatrix: large-scale visualization of graphs by means of structure detec... (Universidade de São Paulo)
Given a large-scale graph with millions of nodes and edges, how to reveal macro patterns of interest, like cliques, bipartite cores, stars, and chains? Furthermore, how to visualize such patterns altogether, getting insights from the graph to support wise decision-making? Although there are many algorithmic and visual techniques to analyze graphs, none of the existing approaches is able to present the structural information of graphs at large scale. Hence, this paper describes StructMatrix, a methodology aimed at highly scalable visual inspection of graph structures with the goal of revealing macro patterns of interest. StructMatrix combines algorithmic structure detection and adjacency matrix visualization to present cardinality, distribution, and relationship features of the structures found in a given graph. We performed experiments on real, large-scale graphs with up to one million nodes and millions of edges. StructMatrix revealed that graphs of high relevance (e.g., Web, Wikipedia and DBLP) have characterizations that reflect the nature of their corresponding domains; our findings have not been seen in the literature so far. We expect that our technique will bring deeper insights into large graph mining, leveraging their use for decision making.
Currently, link recommendation has gained more attention as networked data becomes abundant in several scenarios. However, existing methods for this task have failed to consider solely the structure of dynamic networks for improved performance and accuracy. Hence, in this work, we present a methodology based on the use of multiple topological metrics in order to achieve prospective link recommendations considering time constraints. The combination of such metrics is used as input to binary classification algorithms that state whether two pairs of authors will/should define a link. We experimented with five algorithms, which allowed us to reach high rates of accuracy and to evaluate the different classification paradigms. Our results also demonstrated that time parameters and the activity profile of the authors can significantly influence the recommendation. In the context of DBLP, this research is strategic as it may assist in identifying potential partners, research groups with similar themes, research competition (absence of obvious links), and related work.
Multimodal graph-based analysis over the DBLP repository: critical discoverie... (Universidade de São Paulo)
The use of graph theory for analyzing network-like data has gained central importance with the rise of the Web 2.0. However, many graph-based techniques are not well-disseminated nor explored to their full potential, which might require a complementary approach achieved by combining multiple techniques. This paper describes the systematic use of graph-based techniques of different types (multimodal), combining the resulting analytical insights around a common domain, the Digital Bibliography & Library Project (DBLP). To do so, we introduce an analytical ensemble based on statistical (degree and weakly-connected component distribution), topological (average clustering coefficient and effective diameter evolution), algorithmic (link prediction/machine learning), and algebraic techniques to inspect non-evident features of DBLP, at the same time interpreting the heterogeneous discoveries found along the work. As a result, we have put together a set of techniques demonstrating over DBLP what we call multimodal analysis, an innovative process of information understanding that demands wide technical knowledge and a deep understanding of the data domain. We expect that our methodology and our findings will foster other multimodal analyses and also that they will shed light on Computer Science research.
Techniques for effective and efficient fire detection from social media images (Universidade de São Paulo)
Social media provides information, in the form of images, that is valuable to a vast set of human activities, including salvage and rescue in the case of crisis situations (such as accidents, explosions, and fire). However, these services produce images at a rate that is impossible for human beings to absorb and analyze; thus, methods for automatic analysis are a requirement. Despite the multiple works on image analysis, there are no studies on the specific topic of fire detection over social media. To fill this gap, this work describes the use and the evaluation of an ample set of content-based image retrieval and classification techniques in the task of fire detection. To this end, we (1) built a ground-truth set of annotated images regarding fire occurrence; (2) engineered the Fast-Fire Detection and Retrieval (FFDnR) architecture to combine configurations of feature extractors and distance functions to work with instance-based learning; and (3) evaluated 36 image descriptors in the task of fire detection. Our results demonstrated that, for fire detection, the best image descriptors concerning efficacy (F-measure, Precision-Recall, and ROC) and processing efficiency (wall-clock time) are achieved with the MPEG-7 feature extractors Color Structure and Scalable Color, and with the distance functions City-Block and Euclidean. Our work shall provide a basis for further developments regarding the monitoring of images from social media.
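As a rough illustration of the instance-based setup (not the actual FFDnR code), labeling a query descriptor by a city-block kNN vote over an annotated database might look like this; all names and the toy two-dimensional descriptors are assumptions:

```python
def city_block(a, b):
    """City-block (L1 / Manhattan) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_label(query, database, k=3):
    """Instance-based fire detection sketch: label a query descriptor by
    majority vote among its k nearest neighbours in the annotated
    database of (descriptor, label) pairs."""
    nearest = sorted(database, key=lambda item: city_block(query, item[0]))[:k]
    votes = sum(1 for _, label in nearest if label == "fire")
    return "fire" if votes > k // 2 else "no-fire"
```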
Fire Detection on Unconstrained Videos Using Color-Aware Spatial Modeling and... (Universidade de São Paulo)
The semantic segmentation of events in emergency contexts involves the identification of previously defined events of interest. In this work, the semantic event of focus is the presence of fire in videos. The literature presents several methods for automatic video fire detection, but these methods were built under assumptions, such as stationary cameras and controlled lighting conditions, that are often in contrast to the videos acquired by hand-held devices. To fill this gap, we propose a fire detection method called SPATFIRE. Our method innovates on three aspects: (1) it relies on a specifically tailored color model, named Fire-like Pixel Detector, able to improve the accuracy of fire detection; (2) it employs a new technique for motion compensation, diminishing the problems observed in videos captured with non-stationary cameras; and (3) it defines a segmentation method able to identify not only the presence of fire in a video, but also the segments in the video where fire occurs. We evaluated our proposal on two video datasets with different characteristics and summarize the results to demonstrate its superior efficacy, in terms of true positives and negatives, compared to state-of-the-art methods.
Relational databases are rigid-structured data sources characterized by complex relationships among a set of relations (tables). Making sense of such relationships is a challenging problem because users must consider multiple relations, understand their ensemble of integrity constraints, interpret dozens of attributes, and draw complex SQL queries for each desired data exploration. In this scenario, we introduce a twofold methodology: we use a hierarchical graph representation to efficiently model the database relationships and, on top of it, we designed a visualization technique for rapid relational exploration. Our results demonstrate that the exploration of databases is profoundly simplified, as the user is able to visually browse the data with little or no knowledge about its structure, dispensing with the need for complex SQL queries. We believe our findings will bring a novel paradigm to relational data comprehension.
Vertex Centric Asynchronous Belief Propagation Algorithm for Large-Scale Graphs (Universidade de São Paulo)
Inference problems on networks and their algorithms have always been important subjects, but more so now with so much data available and so little time to make sense of it.
Common applications range from product recommendation to social networks and protein interaction.
One of the main inferences in these types of networks is the guilt-by-association method, where labeled nodes propagate their information throughout the network, towards unlabeled nodes.
While there is a widely used algorithm for this context, called Belief Propagation, it lacks the necessary convergence guarantees for loopy networks.
More recently, a new alternative method was proposed, called LinBP, and while it solved the convergence issue, the scalability for large graphs that do not fit in memory remains a challenge.
Additionally, most works that try to use BP on large-scale graphs rely on specific infrastructure such as supercomputers and computational clusters.
Therefore, we propose a new algorithm that leverages state-of-the-art asynchronous vertex-centric parallel processing techniques in conjunction with the state-of-the-art BP alternative LinBP, to provide a scalable framework for large graph inference that runs on a single commodity machine.
Our results show that our algorithm is up to 200 times faster than LinBP's SQL implementation on the tested networks, while achieving the same accuracy rate.
We also show that, due to the asynchronous processing, our algorithm actually needs fewer iterations to converge than LinBP when using the same parameters.
Finally, we believe that our methodology highlights the not yet fully explored parallelism available on commodity machines, leaning towards a more cost-efficient computational paradigm.
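A minimal serial sketch of the linearized update at the heart of LinBP-style guilt-by-association: each node's belief is its prior plus a small coupling constant times the sum of its neighbours' beliefs, iterated to a fixed point. This is illustrative only (the real algorithm is vertex-centric, asynchronous, and out-of-core, and the coupling eps must stay below a spectral-radius bound for convergence); all names are assumptions:

```python
def linbp(adj, prior, eps=0.05, iters=100):
    """Linearized belief propagation sketch: belief[i] converges to
    prior[i] + eps * sum of neighbours' beliefs, so label information
    from labeled nodes spreads toward unlabeled ones."""
    n = len(prior)
    belief = prior[:]
    for _ in range(iters):
        belief = [prior[i] + eps * sum(belief[j] for j in adj[i])
                  for i in range(n)]
    return belief
```

On a three-node path with one labeled endpoint, belief decays with distance from the labeled node, which is exactly the guilt-by-association behavior.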
Fast Billion-scale Graph Computation Using a Bimodal Block Processing Model (Universidade de São Paulo)
Recent graph computation approaches have demonstrated that a single PC can perform efficiently on billion-scale graphs. While these approaches achieve scalability by optimizing I/O operations, they do not fully exploit the capabilities of modern hard drives and processors. To surpass their performance, in this work we introduce the Bimodal Block Processing (BBP), an innovation that is able to boost graph computation by minimizing the I/O cost even further. With this strategy, we achieved the following contributions: (1) M-Flash, the fastest graph computation framework to date; (2) a flexible and simple programming model to easily implement popular and essential graph algorithms, including the first single-machine billion-scale eigensolver; and (3) extensive experiments on real graphs with up to 6.6 billion edges, demonstrating M-Flash's consistent and significant speedup.
Analysis of Feature Selection Algorithms (Branch & Bound and Beam Search) (Parinda Rajapaksha)
The Branch & Bound and Beam Search algorithms are illustrated in the context of the feature selection domain. The presentation is structured as follows:
- Motivation
- Introduction
- Analysis
- Algorithm
- Pseudo Code
- Illustration of examples
- Applications
- Observations and Recommendations
- Comparison between two algorithms
- References
Learning to rank (LTR) for information retrieval (IR) involves the application of machine learning models to rank artifacts, such as webpages, in response to a user's need, which may be expressed as a query. LTR models typically employ training data, such as human relevance labels and click data, to train discriminatively towards an IR objective. The focus of this lecture will be on the fundamentals of neural networks and their applications to learning to rank.
Dimensionality Reduction and feature extraction.pptx – Sivam Chinna
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.
Tutorial presented at Forum for Information Retrieval Evaluation (FIRE 2019) conference in Kolkata, India.
Information about the study and career opportunities promoted by the Computer Science program of the University of São Paulo, São Carlos campus.
An introduction to the Business Intelligence tools of the Hadoop ecosystem:
Business Intelligence and Big Data
Big Data warehousing
Data warehouse architecture
Hadoop and Apache Hive
Extract Transform Load
Data warehouse vs. operational database
OLAP – Online Analytical Processing
Apache Kylin
Conventional OLAP solutions
Advanced Analytics with Apache Mahout
MetricSPlat - a platform for quick development, testing and visualization of content-based retrieval techniques – Universidade de São Paulo
Jose Rodrigues, Luciana A S Romani, Luciana Zaina, Ricardo Ciferri (2009) MetricSPlat - A platform for quick development, testing and visualization of content-based retrieval techniques. In: Simpósio Brasileiro de Bancos de Dados - SBBD2009, 1-6.
@inproceedings { RodriguesSBBD09,
title = "MetricSPlat - A platform for quick development, testing and visualization of content-based retrieval techniques",
year = "2009",
author = "Jose Rodrigues and Luciana A S Romani and Luciana Zaina and Ricardo Ciferri",
booktitle = "Simpósio Brasileiro de Bancos de Dados - SBBD2009",
pages = "1-6",
url = "http://www.icmc.usp.br/~junio/PublishedPapers/RodriguesSBBD09-MetricSPlat.pdf",
urllink = "http://www.icmc.usp.br/~junio/MetricSPlat/index.htm",
abstract = "The development and testing of content-based data retrieval systems is a time-consuming task. Over the concept of metric space, such systems must integrate the three factors that define an indexing environment. These factors are features extraction, metric structures and distance functions, not to mention a suitable user interface. This integration deviates the work from the real focus of research, suppressing quick experimentation of ideas. In this context, we present the Metric Space Platform (MetricSPlat), a system designed for content-based retrieval enabled with plug-in features. With minimal effort, MetricSPlat substantially speeds up the experimentation of new techniques by providing a well-defined framework aided with interactive data visualization techniques.",
note = "8 pages",
keywords = "visualization, content-based data retrieval"}
Hierarchical Visual Filtering, pragmatic and epistemic actions for database visualization – Universidade de São Paulo
Jose Rodrigues, Carlos E Cirilo, Luciana A M Zaina, Antonio F Prado (2013) Hierarchical Visual Filtering, pragmatic and epistemic actions for database visualization. In: Proceedings of the ACM Symposium on Applied Computing, 946-952, ACM Press.
@inproceedings { ref35,
title = "Hierarchical Visual Filtering, pragmatic and epistemic actions for database visualization",
year = "2013",
author = "Jose Rodrigues and Carlos E Cirilo and Luciana A M Zaina and Antonio F Prado",
booktitle = "Proceedings of the ACM Symposium on Applied Computing",
editor = "ACM Press",
pages = "946-952",
publisher = "ACM Press",
doi = "10.1145/2480362.2480545",
url = "http://www.icmc.usp.br/~junio/PublishedPapers/RodriguesJr_et_al-ACMSAC2013.pdf",
urllink = "http://www.icmc.usp.br/~junio/VisTree/VisTree.htm",
abstract = "Visualization techniques of all sorts suffer from visual cluttering, the occlusion of visual information due to the overlap of graphical items, and from excessive complexity in analytical tasks due to multiple parallel perspectives drawn from the data at hand. To cope with these problems, we introduce Hierarchical Visual Filtering, a novel interaction principle that brings pragmatic and epistemic actions to visualization techniques. Pragmatic actions here mean that the analyst is able to visually select and filter information, determining visual configurations that reveal different perspectives of the data; epistemic actions mean that the analyst can record, annotate, and recall intermediate visualizations created over his pragmatic actions. To do so, we use a tree-like structure to keep multiple visualization workspaces linked according to the analytical decisions taken by the user. Our goal is to promote an innovative systematization that can augment the potential for database visual inspection, and for visualization systems in general. It is our contention that Hierarchical Visual Filtering can inspire a novel scheme of visualization environments in which space limitations and complexity are treated by means of interactive tasks.",
keywords = "Information Visualization, Multiple Views, Visual Data Analysis, Databases, Interactive Filtering, Hierarchical Filtering"}
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps avoid duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
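The converged-vertex skipping idea above can be sketched in a few lines. This is a hedged, single-threaded illustration, not the STICD implementation: the function name, tolerance, and damping defaults are illustrative, and it assumes a graph with no dangling nodes.

```python
def pagerank_skip_converged(adj, d=0.85, tol=1e-10, max_iter=100):
    """PageRank that freezes vertices once their rank has converged.

    adj: dict mapping each vertex to its list of out-neighbors (every
    vertex must have at least one out-link, i.e., no dangling nodes).
    A vertex whose rank changed by less than `tol` is marked converged
    and skipped in later iterations, saving its per-iteration work.
    """
    verts = list(adj)
    n = len(verts)
    rank = {v: 1.0 / n for v in verts}
    # Precompute in-neighbors once (an adjacency inversion, CSR-style).
    ins = {v: [] for v in verts}
    for u, outs in adj.items():
        for v in outs:
            ins[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new_rank = dict(rank)
        for v in verts:
            if v in converged:
                continue  # frozen vertex: skip its rank update entirely
            s = sum(rank[u] / len(adj[u]) for u in ins[v])
            new_rank[v] = (1.0 - d) / n + d * s
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:
            break
    return rank
```

On chain-heavy graphs such as road networks this freezing kicks in early, which is exactly the iteration-time saving described above.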
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... – John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph: SHORT REPORT / NOTES – Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
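The storage-type experiment in the list above (float vs bfloat16) can be mimicked in a hedged way by comparing narrow float32 accumulation against a wide float64 accumulator, since bfloat16 is not a stock NumPy dtype; `naive_sum` is an illustrative helper, not code from the notes.

```python
import numpy as np

def naive_sum(x):
    """Sequential accumulation in the array's own dtype. With a narrow
    storage/accumulator type the rounding error grows with n; float32
    here stands in for the low-precision case."""
    acc = x.dtype.type(0)
    for v in x:
        acc += v
    return acc

x = np.full(10**5, 0.1, dtype=np.float32)
lossy = float(naive_sum(x))                 # drifts from the true sum
exact = float(np.sum(x, dtype=np.float64))  # wide accumulator
```

The gap between `lossy` and `exact` is the accuracy cost that the storage-type benchmarks trade against memory bandwidth.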
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Effective and Unsupervised Fractal-based Feature Selection for Very Large Datasets: removing linear and non-linear attribute correlations
1. Effective and Unsupervised Fractal-based Feature Selection for Very Large Datasets
Removing linear and non-linear attribute correlations
Antonio Canabrava Fraideinberze
Jose F Rodrigues-Jr
Robson Leonardo Ferreira Cordeiro
Databases and Images Group
University of São Paulo
São Carlos - SP - Brazil
25. General Idea
Removes the E - ⌈D2⌉ least relevant attributes, one at a time, in ascending order of relevance.
26. General Idea
Builds partial trees for the full dataset and for its E (E-1)-dimensional projections.
27. General Idea
TreeID + cell spatial position → partial count of points.
28. General Idea
Sums partial point counts and reports log(r) and log(sum2) for each tree.
29. General Idea
Computes D2 for the full dataset and pD2 for each of its E (E-1)-dimensional projections.
30. General Idea
The least relevant attribute is the one not in the projection that minimizes |D2 - pD2|.
31. General Idea
Spots the second least relevant attribute …
32. General Idea
3 Main Issues
33. General Idea
3 Main Issues
1° Too much data to be shuffled – one data pair per cell/tree.
34. General Idea
3 Main Issues
2° One data pass per irrelevant attribute.
35. General Idea
3 Main Issues
3° Not enough memory for mappers.
36. Proposed Method
Curl-Remover
1° Issue - Too much data to be shuffled; one data pair per cell/tree.
Our solution - Two-phase dimensionality reduction:
a) Serial feature selection on a tiny data sample (one reducer), used only to speed up processing;
b) All mappers project the data into a fixed subspace.
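A hedged sketch of this two-phase idea, with `score_attr`, `m`, and `sample_size` as illustrative stand-ins (the actual method ranks attributes by fractal relevance, and phase (b) runs distributed across mappers rather than as one array slice):

```python
import numpy as np

def two_phase_reduce(data, score_attr, m, sample_size=1000, seed=0):
    """Phase (a): serial feature selection on a tiny sample (the job of
    the single reducer), ranking attributes with
    score_attr(sample, attr) -> float, higher meaning more relevant.
    Phase (b): every mapper then projects its partition onto the fixed
    subspace of the m top-ranked attributes, so nothing extra is
    shuffled."""
    rng = np.random.default_rng(seed)
    take = min(sample_size, len(data))
    sample = data[rng.choice(len(data), take, replace=False)]
    scores = [score_attr(sample, a) for a in range(data.shape[1])]
    keep = np.sort(np.argsort(scores)[::-1][:m])   # fixed subspace
    return data[:, keep], keep
```

The key point is that the expensive ranking happens once on the sample; every partition afterwards applies the same cheap column selection.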
37. Proposed Method
Curl-Remover
Builds/reports N (2 or 3) tree levels of lowest resolution…
38. Proposed Method
Curl-Remover
… plus the points projected into the M (2 or 3) most relevant attributes of the sample.
39. Proposed Method
Curl-Remover
Builds the full trees from their low-resolution level cells and the projected points.
40. Proposed Method
Curl-Remover
High resolution cells are never shuffled.
41. Proposed Method
Curl-Remover
2° Issue - One data pass per irrelevant attribute.
Our solution - Stores/reads the tree level of highest resolution instead of the original data.
42. Proposed Method
Curl-Remover
Rdb = cost to read the dataset; TWRtree = cost to transfer, write, and read the last tree level in the next reduce step.
If (Rdb > TWRtree) then writes the tree.
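The cost test on this slide can be written out as a small decision function; the byte-size and bandwidth parameters are illustrative assumptions, not values or an API from the paper.

```python
def should_write_tree(db_bytes, tree_bytes, read_bw, write_bw, net_bw):
    """Persist the tree's last resolution level only when re-reading the
    raw dataset (Rdb) would cost more than transferring, writing, and
    reading that tree level in the next reduce step (TWRtree).
    Bandwidths are in bytes per second."""
    r_db = db_bytes / read_bw
    twr_tree = (tree_bytes / net_bw
                + tree_bytes / write_bw
                + tree_bytes / read_bw)
    return r_db > twr_tree
```

For a 1 TB dataset whose last tree level is only 1 GB, re-reading the raw data dominates, so writing the tree wins; when the tree level is nearly as large as the dataset, re-reading the data is cheaper.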
43. Proposed Method
Curl-Remover
44. Proposed Method
Curl-Remover
Writes the tree's last level in HDFS.
45. Proposed Method
Curl-Remover
Reads the tree's last level from HDFS.
46. Proposed Method
Curl-Remover
Reads the dataset only twice.
48. Proposed Method
Curl-Remover
Sorts its local points and builds "tree slices", monitoring memory consumption.
59. Conclusions
Accuracy - eliminates both linear and non-linear attribute correlations, as well as irrelevant attributes; up to 8% more accurate than sPCA;
Scalability - linear scalability on the data size (theoretical analysis); experiments with up to 1.1 billion points;
Unsupervised - does not require the user to guess the number of attributes to be removed, nor a training set;
Semantics - it is a feature selection method, thus preserving the attributes' semantics;
Generality - suits analytical tasks in general, not only classification.