With the recent growth of graph-based data, large-scale graph processing is becoming increasingly important. Exploring and extracting knowledge from such data requires graph mining methods such as community detection. Legacy graph processing tools rely mainly on the computational capacity of a single machine and cannot process large graphs with billions of nodes. The main challenge for new tools and frameworks therefore lies in developing paradigms that are scalable, efficient and flexible. In this paper, we review the new paradigms of large graph processing and their applications to graph mining, using the distributed, shared-nothing approach adopted for large data by major Internet players.
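To make the mining side concrete: community detection is often expressed in the vertex-centric style that shared-nothing frameworks such as Pregel or Giraph scale out across machines. The sketch below is a deliberately tiny single-machine version of label propagation in Python; the example graph, function name and deterministic tie-breaking rule are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def label_propagation(adj, rounds=10):
    """Toy synchronous label propagation for community detection.

    adj: dict mapping node -> list of neighbours (every node is assumed
    to have at least one neighbour). Each node starts in its own
    community; every round it adopts the most common label among its
    neighbours. A minimal single-machine sketch of the idea that
    vertex-centric frameworks parallelise per node.
    """
    labels = {v: v for v in adj}
    for _ in range(rounds):
        new_labels = {}
        for v, nbrs in adj.items():
            counts = Counter(labels[u] for u in nbrs)
            best = max(counts.values())
            # break ties deterministically by smallest label
            new_labels[v] = min(l for l, c in counts.items() if c == best)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# two triangles joined by a single bridge edge
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
labels = label_propagation(adj)
```

On this graph the two triangles settle into two distinct communities; a distributed runtime would execute the same per-vertex update in parallel with message passing.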
Workshop on Real-time & Stream Analytics, IEEE BigData 2016 - Sabri Skhiri
Introductory presentation for the Workshop on Real-time & Stream Analytics, co-located with the IEEE Big Data Conference.
We have seen new business models emerge that require real-time features. This real-time nature affects IT systems in three areas: (1) data architecture, (2) stream mining and (3) stream processor technologies. All three remain very interesting research areas. The papers presented at the workshop cover these three areas and offer interesting viewpoints.
Many data mining and knowledge discovery methodologies and process models have been developed, with varying degrees of success. Three main methods are used to discover patterns in data: KDD, SEMMA and CRISP-DM. They are presented in many publications in the area and are used in practice. To our knowledge, there is no clear methodology developed to support link mining. However, there is a well-known methodology in knowledge discovery in databases, the Cross Industry Standard Process for Data Mining (CRISP-DM), developed by a consortium of several industrial companies, which can be relevant to the study of link mining. In this study, CRISP-DM has been adapted to the field of link mining to detect anomalies. An important goal in link mining is the task of inferring links that are not yet known in a given network. This approach is implemented through a case study of real-world co-citation data. The case study uses mutual information to interpret the semantics of anomalies identified in the co-citation dataset, which can provide valuable insights for determining the nature of a given link and potentially identifying important future link relationships.
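The mutual-information step can be illustrated with pointwise mutual information over co-cited reference pairs: a pair whose co-occurrence deviates strongly from what its members' individual frequencies predict is an anomaly candidate. The toy corpus and function below are hypothetical and only show the shape of the computation, not the study's actual data or code.

```python
import math
from collections import Counter
from itertools import combinations

def copair_pmi(papers):
    """Pointwise mutual information for co-cited reference pairs.

    papers: list of sets of cited references. PMI > 0 means the pair
    is co-cited more often than independence predicts; PMI < 0 less.
    """
    n = len(papers)
    single = Counter()
    pair = Counter()
    for refs in papers:
        single.update(refs)
        pair.update(frozenset(p) for p in combinations(sorted(refs), 2))
    pmi = {}
    for p, c in pair.items():
        a, b = tuple(sorted(p))
        # log of observed joint probability over product of marginals
        pmi[p] = math.log((c / n) / ((single[a] / n) * (single[b] / n)))
    return pmi

# hypothetical corpus: each set is the references cited by one paper
papers = [{"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"C", "D"}, {"A", "D"}]
scores = copair_pmi(papers)
```

Here the frequently co-cited pair (A, B) scores above zero while the one-off pair (A, D) scores below it, flagging the latter as the more surprising link.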
Query Optimization Techniques in Graph Databases - ijdms
Graph databases (GDB) have recently arisen to overcome the limits of traditional databases for storing and managing data with graph-like structure. Today, they are a requirement for many applications that manage graph-like data, such as social networks. Most of the techniques applied to optimize queries in graph databases have been used in traditional databases or distributed systems, or are inspired by graph theory. However, their reuse in graph databases should take into account the main characteristics of graph databases, such as dynamic structure, highly interconnected data, and the ability to efficiently access data relationships. In this paper, we survey query optimization techniques in graph databases. In particular, we focus on the features they have in…
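One of the characteristics named above, the ability to efficiently access data relationships, is often realised through adjacency indexing (so-called index-free adjacency). A minimal Python contrast between scanning an edge list, as a relational join would, and a precomputed adjacency index; the node names are made up for illustration.

```python
from collections import defaultdict

# Edge list, as a relational table might store relationships.
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "dave"), ("carol", "dave")]

def neighbours_scan(edges, v):
    # O(|E|) per query: scan every edge, as a naive relational join would.
    return [t for s, t in edges if s == v]

# Index-free-adjacency style: materialise node -> neighbour list once,
# so each traversal step costs O(degree) instead of O(|E|).
index = defaultdict(list)
for s, t in edges:
    index[s].append(t)

nbrs = index["alice"]
```

Both paths return the same neighbours; the point is the per-hop cost, which is what makes multi-hop traversals tractable in graph stores.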
Prov-O-Viz is a visualisation service for provenance graphs expressed using the W3C PROV vocabulary. It uses the Sankey-style visualisation from D3.js.
See http://provoviz.org
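D3-style Sankey layouts typically consume a small JSON document of named nodes and index-based weighted links. A rough Python sketch of deriving such a document from PROV-like used/generated records; the record format and file names here are invented for illustration and are not Prov-O-Viz's actual internals.

```python
import json

# Hypothetical PROV-style records: (activity, entity, relation)
records = [
    ("clean", "raw.csv", "used"),
    ("clean", "clean.csv", "generated"),
    ("train", "clean.csv", "used"),
    ("train", "model.pkl", "generated"),
]

# Sankey input shape: {"nodes": [{"name": ...}],
#                      "links": [{"source": i, "target": j, "value": w}]}
names = []

def nid(name):
    # assign each node a stable index on first sight
    if name not in names:
        names.append(name)
    return names.index(name)

links = []
for activity, entity, rel in records:
    # entities flow into activities that used them, and out of
    # activities that generated them
    src, dst = (entity, activity) if rel == "used" else (activity, entity)
    links.append({"source": nid(src), "target": nid(dst), "value": 1})

sankey = {"nodes": [{"name": n} for n in names], "links": links}
print(json.dumps(sankey, indent=2))
```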
Content + Signals: The value of the entire data estate for machine learning - Paul Groth
Content-centric organizations have increasingly recognized the value of their material for analytics and decision support systems based on machine learning. However, as anyone involved in machine learning projects will tell you, the difficulty is not in the provision of the content itself but in the production of the annotations necessary to make use of that content for ML. The transformation of content into training data often requires manual human annotation. This is expensive, particularly when the nature of the content requires subject matter experts to be involved.
In this talk, I highlight emerging approaches to tackling this challenge using what is known as weak supervision: using other signals to help annotate data. I discuss how content companies often overlook resources that they have in-house to provide these signals. I aim to show how looking at a data estate in terms of signals can amplify its value for artificial intelligence.
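A minimal sketch of the weak-supervision idea: several cheap, noisy labeling functions vote on each document, and items where no function fires stay unlabeled. The rules and example texts below are entirely hypothetical signals, not from the talk.

```python
def lf_keyword(doc):
    # crude keyword rule (hypothetical signal); None means abstain
    return 1 if "refund" in doc else 0 if "thanks" in doc else None

def lf_length(doc):
    # assumed heuristic: long messages tend to be complaints
    return 1 if len(doc.split()) > 8 else None

def lf_metadata(doc):
    # assumed heuristic: exclamation endings tend to be friendly
    return 0 if doc.endswith("!") else None

def weak_label(doc, lfs):
    """Majority vote over non-abstaining labeling functions; None if all abstain."""
    votes = [lf(doc) for lf in lfs]
    votes = [v for v in votes if v is not None]
    if not votes:
        return None
    return max(set(votes), key=votes.count)

lfs = [lf_keyword, lf_length, lf_metadata]
label = weak_label("I want a refund for this broken item I bought last week", lfs)
```

Production systems (e.g. data-programming frameworks) replace the majority vote with a learned model of each function's accuracy, but the division of labour is the same.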
A Novel Data Mining Technique to Discover Patterns from Huge Text Corpus - IJMER
Today, we have far more information than we can handle: from business transactions and scientific data to satellite pictures, text reports and military intelligence. Information retrieval alone is no longer enough for decision-making. Confronted with huge collections of data, we now have new needs to help us make better managerial choices: automatic summarization of data, extraction of the "essence" of the information stored, and the discovery of patterns in raw data. This is where data mining with inventory patterns came into existence and became popular. Data mining finds these patterns and relationships using data analysis tools and techniques to build models.
Semantic technology enhances big data analysis by allowing sophisticated analysis of texts. Through Linked Data technology, tremendous amounts of information can be connected. However, this introduces ambiguity when the data must be manipulated for purposes such as natural language interfaces, semantic search and question answering. Few works address ambiguity in semantic search. This paper introduces a technique based on self-adaptive disambiguation which uses the possible concept annotations of terms in natural language queries. This allows users to compose queries in natural language and receive accurate answers without having to master the formal syntax of a semantic query language.
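The disambiguation idea can be approximated by scoring each candidate concept annotation against the other words of the query and keeping the best match, in the spirit of Lesk-style context overlap. The concept inventory below is a made-up toy, not the paper's actual method or data; a real system would draw candidate annotations from Linked Data.

```python
# Toy concept inventory: ambiguous term -> candidate concepts with context words.
CONCEPTS = {
    "jaguar": {
        "Jaguar_(animal)": {"cat", "prey", "jungle", "spots"},
        "Jaguar_(car)": {"engine", "drive", "car", "brand"},
    }
}

def disambiguate(term, query_words):
    """Pick the candidate concept whose context words overlap the query most."""
    candidates = CONCEPTS[term]
    return max(candidates, key=lambda c: len(candidates[c] & set(query_words)))

concept = disambiguate("jaguar", "how fast can a jaguar run in the jungle".split())
```

The word "jungle" in the query tips the score toward the animal sense, so the natural-language question can be mapped to the right concept before a formal semantic query is built.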
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
For instance, we could be interested in finding representatives of homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection). Of late, clustering techniques have been applied in areas that involve browsing the gathered data or categorizing the results returned by search engines in reply to user queries. In this paper, we provide a comprehensive survey of document clustering.
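As a concrete miniature of document clustering: represent documents as term-count vectors and assign each to the most cosine-similar of two seed documents. Real systems use TF-IDF weighting and iterative k-means rather than fixed seeds; the documents and seeds here are invented.

```python
import math
from collections import Counter

def vec(doc):
    # bag-of-words term counts
    return Counter(doc.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, seeds):
    """Assign each document to the nearest seed document by cosine similarity."""
    vs, ss = [vec(d) for d in docs], [vec(s) for s in seeds]
    return [max(range(len(ss)), key=lambda i: cosine(v, ss[i])) for v in vs]

docs = [
    "stock market prices fell sharply",
    "the market rallied as stock prices rose",
    "the team won the football match",
    "football fans cheered the winning team",
]
assignments = cluster(docs, seeds=["stock market prices", "football team match"])
```

The finance documents land in one cluster and the sports documents in the other, which is exactly the "useful grouping" the abstract describes for browsing and search-result categorization.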
Presentation for NEC Lab Europe.
Knowledge graphs are increasingly built using complex, multifaceted machine-learning-based systems relying on a wide range of different data sources. To be effective, these systems must constantly evolve and thus be maintained. I present work on combining knowledge graph construction (e.g. information extraction) and refinement (e.g. link prediction) in end-to-end systems. In particular, I will discuss recent work on using inductive representations for link prediction. I then discuss the challenges of ongoing system maintenance, knowledge graph quality and traceability.
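A classic baseline for the link-prediction (refinement) step is the Adamic-Adar score: shared neighbours of two nodes, weighted by how rare each shared neighbour is. Inductive representation methods go well beyond it, but it shows concretely what "inferring missing links" means. The toy graph below is hypothetical.

```python
import math

def adamic_adar(adj, u, v):
    """Adamic-Adar link-prediction score for a candidate edge (u, v).

    adj: dict node -> set of neighbours; every shared neighbour is
    assumed to have degree >= 2 so log(degree) is positive. Higher
    scores suggest a missing or future edge.
    """
    shared = adj[u] & adj[v]
    return sum(1 / math.log(len(adj[w])) for w in shared)

adj = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
}
score = adamic_adar(adj, "ann", "dan")  # ann and dan share bob and cat
```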
A comprehensive survey of link mining and anomalies detection - csandit
This survey introduces the emergence of link mining and its application to detecting anomalies, which can include events that are unusual, out of the ordinary or rare, unexpected behaviour, or outliers.
R. Zafarani, M. A. Abbasi, and H. Liu, Social Media Mining: An Introduction, Cambridge University Press, 2014.
Free book and slides at http://socialmediamining.info/
Graph mining 2: Statistical approaches for graph mining - tuxette
Workshop "Advanced mathematics for network analysis"
organized by Institut des Systèmes Complexes de Toulouse
http://isc-t.fr/evenements/?event_id1=2
Luchon, France
May, 3rd 2016
Social Network Analysis: What It Is, Why We Should Care, and What We Can Learn - Xiaohan Zeng
The advent of social networks has completely changed our daily lives. The deluge of data collected on Social Network Services (SNS) and recent developments in complex network theory have enabled marvelous predictive analyses, which tell us many amazing stories.
Why do we often feel that "the world is so small"? Is six-degree separation pure imagination, or is it based on mathematical insights? Why do just a few rock stars enjoy extreme popularity while most of us stay unknown to the world? When science meets coffee-shop knowledge, things are bound to be intriguing.
I will first briefly describe what social networks are, in the mathematical sense. Then I will introduce some ways to extract characteristics of networks, and show how these analyses can explain many anecdotes in our lives. Finally, I'll show an example of what we can learn from social network analysis, based on data from Groupon.
Part 1: Concepts and Cases (the language of networks, networks in organizations, case studies and key concepts)
Part 2: (Starts on #44) Mapping Organizational, Personal, and Enterprise Networks: Tools
An update to last year's Social Network Analysis Introduction and Tools...
Social Network Analysis (SNA) and its implications for knowledge discovery in Informal Networks - ACMBangalore
Social Network Analysis (SNA) and its implications for knowledge discovery in informal networks: talk by Dr Jai Ganesh, SETLabs, Infosys, at the Search and Social Platforms tutorial, part of Compute 2009, ACM Bangalore.
Large Graph Mining – Patterns, tools and cascade analysis, by Christos Faloutsos - BigMine
What do graphs look like? How do they evolve over time? How does influence/news/viruses propagate, over time? We present a long list of static and temporal laws, and some recent observations on real graphs. We show that fractals and self-similarity can explain several of the observed patterns, and we conclude with cascade analysis and a surprising result on virus propagation and immunization.
Mining Social Web APIs with IPython Notebook (PyCon 2014) - Matthew Russell
From the tutorial description at https://us.pycon.org/2014/schedule/presentation/134/ -
Description
Social websites such as Twitter, Facebook, LinkedIn, Google+, and GitHub have vast amounts of valuable insights lurking just beneath the surface, and this workshop minimizes the barriers to exploring and mining this valuable data by presenting turn-key examples from the thoroughly revised 2nd Edition of Mining the Social Web.
Abstract
This workshop teaches you fundamental data mining techniques as applied to popular social websites by adapting example code from Mining the Social Web (2nd Edition, O'Reilly 2013) in a tutorial-style step-by-step manner that is designed specifically to accommodate attendees with very little programming or domain experience. This workshop's extensive use of IPython Notebook facilitates interactive learning with turn-key examples against a Vagrant-based virtual machine that takes care of installing all 3rd party dependencies that are needed. The barriers to entry are truly minimal, which allows maximal use of the time to be spent on interactive learning.
The workshop is somewhat broadly designed and acclimates you to mining social data from Twitter, Facebook, LinkedIn, Google+, and GitHub APIs in five corresponding modules with the following memorable approach for each of them:
* Aspire - Set out to answer a question or test a hypothesis as part of a data science experiment
* Acquire - Collect and store the data that you need to answer the question or test the hypothesis
* Analyze - Use fundamental data mining techniques to explore and exploit the data
* Summarize - Present analytical findings in a compact and meaningful way
Each module consists of a brief period in which each attendee will customize the corresponding notebook for the module with their own account credentials with the remainder of the module devoted to learning what data is available from the API and exercises demonstrating analysis of the data—all from a pre-populated IPython Notebook. Time will be set aside at the end of each module for attendees to hack on the code, discuss examples, and ask any lingering questions.
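The Aspire/Acquire/Analyze/Summarize loop above can be compressed into a few lines. Acquisition normally goes through a social API with credentials; to keep this sketch self-contained and runnable without an account, the sample tweets below are invented.

```python
from collections import Counter

# Aspire: which hashtags dominate a sample of tweets?
# Acquire: normally via the Twitter API; here a hypothetical in-memory sample.
tweets = [
    "Loving the talks at #pycon this year #python",
    "Slides from my #python data mining tutorial are up",
    "See you next year, #pycon!",
]

# Analyze: extract and count hashtags.
tags = Counter(
    word.strip("!.,").lower()
    for tweet in tweets
    for word in tweet.split()
    if word.startswith("#")
)

# Summarize: a compact top-k view.
for tag, n in tags.most_common(2):
    print(tag, n)
```

Each workshop module follows this same arc against a real API, with the notebook handling authentication and pagination.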
An introductory-to-mid-level presentation on complex network analysis: network metrics, analysis of online social networks, approximated algorithms, memorization issues, storage.
Data Mining Seminar - Graph Mining and Social Network Analysis - vwchu
Delivered a formal presentation on course material for the Data Mining (EECS 4412) course at York University, Canada, about graph mining. Graphs have become increasingly important in modeling sophisticated structures and their interactions, with broad applications including chemical informatics, bioinformatics, computer vision, video indexing, text retrieval, and Web analysis. The formal seminar was 50 to 60 minutes followed by 10 to 20 minutes for questions.
https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412
https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412/lectures
A Fast and Dirty Intro to NetworkX (and D3) - Lynn Cherny
Using the python lib NetworkX to calculate stats on a Twitter network, and then display the results in several D3.js visualizations. Links to demos and source files. I'm @arnicas and live at www.ghostweather.com.
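Two of the per-node statistics such an analysis typically starts from are degree and the local clustering coefficient (NetworkX exposes the latter as nx.clustering). A dependency-free Python version of both, on a made-up four-node graph, to show what the library computes.

```python
def degrees(adj):
    # degree of each node in an undirected graph
    return {v: len(nbrs) for v, nbrs in adj.items()}

def clustering_coeff(adj, v):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in adj[nbrs[i]]
    )
    return 2 * links / (k * (k - 1))

# tiny undirected graph as node -> set of neighbours (hypothetical data)
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
deg = degrees(adj)
cc_a = clustering_coeff(adj, "a")
```

On a real Twitter graph these per-node numbers become the histograms and node sizes fed into the D3 visualizations.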
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMINOUS DATA - acijjournal
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items. It is an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in the scientific domain is voluminous, and processing such data requires state-of-the-art computing machines. Setting up such an infrastructure is expensive, so a distributed environment such as a cluster is employed to tackle such scenarios. Apache Hadoop is one of the cluster frameworks for distributed environments; it helps by distributing voluminous data across a number of nodes in the framework. This paper focuses on a map/reduce design and implementation of the Apriori algorithm for structured data analysis.
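The map/reduce decomposition of Apriori's candidate counting can be sketched in plain Python: the mapper emits (itemset, 1) pairs per transaction, and the reducer sums per itemset and filters by minimum support. Hadoop would shard both steps across nodes; the transactions below are invented, and candidate generation from frequent (k-1)-itemsets is omitted for brevity.

```python
from collections import defaultdict
from itertools import combinations

def mapper(transaction, k):
    """Map step: emit (candidate k-itemset, 1) for each transaction."""
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1

def reducer(pairs, min_support):
    """Reduce step: sum counts per itemset, keep those meeting min_support."""
    counts = defaultdict(int)
    for itemset, n in pairs:
        counts[itemset] += n
    return {i: c for i, c in counts.items() if c >= min_support}

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"milk", "eggs"}]
pairs = (p for t in transactions for p in mapper(t, 2))
frequent = reducer(pairs, min_support=2)
```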
Massive Data Analysis: Challenges and Applications - Vijay Raghavan
We highlight a few trends of massive data that are available for corporations, government agencies and researchers, and some examples of opportunities that exist for turning this data into knowledge. We provide a brief overview of some of the state-of-the-art technologies in the massive data analysis landscape. Then, we describe two applications from two diverse areas in detail: recommendations in e-commerce and link discovery from biomedical literature. Finally, we present some challenges and open problems in the field of massive data analysis.
A Comparative Study of Various Data Mining Techniques: Statistics, Decision Trees and Neural Networks - Editor IJCATR
In this paper we focus on some techniques for solving data mining tasks: statistics, decision trees and neural networks. The new approach has succeeded in defining some new criteria for the evaluation process, and it has obtained valuable results based on what each technique is, the environment for using it, its advantages and disadvantages, the consequences of choosing it to extract hidden predictive information from large databases, and its methods of implementation. Finally, the paper presents some valuable recommendations in this field.
Classifier Model using Artificial Neural Network - AI Publications
When it comes to AI and ML, precision in categorization is of the utmost importance. In this research, the use of supervised instance selection (SIS) to improve the performance of artificial neural networks (ANNs) in classification is investigated. The goal of SIS is to enhance the accuracy of future classification tasks by identifying and selecting a subset of examples from the original dataset. The purpose of this research is to provide light on how useful SIS is as a preprocessing tool for artificial neural network-based classification. The work aims to improve the input dataset to ANNs by using SIS, which may help with problems caused by noisy or redundant data. The ultimate goal is to improve ANNs' ability to identify data points properly across a wide range of application areas.
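One simple instance-selection scheme in the spirit of SIS is condensed nearest neighbour: keep only the instances that the subset built so far misclassifies, so the retained examples still carry the decision boundary. The paper's exact procedure may differ; the 1-D two-class data below is hypothetical.

```python
def dist(a, b):
    # squared Euclidean distance between feature tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_1nn(store, x):
    # label of the nearest stored instance
    return min(store, key=lambda s: dist(s[0], x))[1]

def select_instances(data):
    """Condensed-nearest-neighbour-style selection.

    Scan the data once, adding an instance to the store only when the
    current store would misclassify it.
    """
    store = [data[0]]
    for x, y in data[1:]:
        if predict_1nn(store, x) != y:
            store.append((x, y))
    return store

# two well-separated 1-D classes (hypothetical data)
data = [((0.0,), 0), ((0.2,), 0), ((0.1,), 0), ((5.0,), 1), ((5.2,), 1), ((4.9,), 1)]
subset = select_instances(data)
```

Here six instances collapse to two while every original point is still classified correctly, which is the kind of cleaner, smaller training set the paper feeds to the ANN.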
A SURVEY ON DATA MINING IN STEEL INDUSTRIES - IJCSES Journal
In industrial environments, huge amounts of data are generated and collected in databases and data warehouses from all involved areas, such as planning, process design, materials, assembly, production, quality, process control, scheduling, fault detection, shutdown, and customer relationship management. Data mining has become a useful tool for knowledge acquisition in the industrial process of iron and steel making. Due to the rapid growth of data mining, various industries have started using data mining technology to search for hidden patterns, which might further supply the system with new knowledge and inform new models that enhance production quality, productivity, optimal cost, maintenance and so on. The continuous improvement of all steel production processes, regarding the avoidance of quality deficiencies and the related improvement of production yield, is an essential task for steel producers. A zero-defect strategy is therefore popular today, and several quality assurance techniques are used to maintain it. The present report explains the methods of data mining and describes their application in the industrial environment and especially in the steel industry.
Re-Mining Association Mining Results Through Visualization, Data Envelopment ...ertekg
İndirmek için Bağlantı > https://ertekprojects.com/gurdal-ertek-publications/blog/re-mining-association-mining-results-through-visualization-data-envelopment-analysis-and-decision-trees/
Re-mining is a general framework which suggests the execution of additional data mining steps based on the results of an original data mining process. This study investigates the multi-faceted re-mining of association mining results, develops and presents a practical methodology, and shows the applicability of the developed methodology through real world data. The methodology suggests re-mining using data visualization, data envelopment analysis, and decision trees. Six hypotheses, regarding how re-mining can be carried out on association mining results, are answered in the case study through empirical analysis.
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATAcscpconf
In Data mining applications, which often involve complex data like multiple heterogeneous data sources, user preferences, decision-making actions and business impacts etc., the complete useful information cannot be obtained by using single data mining method in the form of informative patterns as that would consume more time and space, if and only if it is possible to join large relevant data sources for discovering patterns consisting of various aspects of useful information. We consider combined mining as an approach for mining informative patterns
from multiple data-sources or multiple-features or by multiple-methods as per the requirements. In combined mining approach, we applied Lossy-counting algorithm on each data-source to get the frequent data item-sets and then get the combined association rules. In multi-feature combined mining approach, we obtained pair patterns and cluster patterns and then generate incremental pair patterns and incremental cluster patterns, which cannot be directly generated by the existing methods. In multi-method combined mining approach, we combine FP-growth and Bayesian Belief Network to make a classifier to get more informative knowledge.
Combined mining approach to generate patterns for complex datacsandit
In Data mining applications, which often involve complex data like multiple heterogeneous data
sources, user preferences, decision-making actions and business impacts etc., the complete
useful information cannot be obtained by using single data mining method in the form of
informative patterns as that would consume more time and space, if and only if it is possible to
join large relevant data sources for discovering patterns consisting of various aspects of useful
information. We consider combined mining as an approach for mining informative patterns
from multiple data-sources or multiple-features or by multiple-methods as per the requirements.
In combined mining approach, we applied Lossy-counting algorithm on each data-source to get
the frequent data item-sets and then get the combined association rules. In multi-feature
combined mining approach, we obtained pair patterns and cluster patterns and then generate
incremental pair patterns and incremental cluster patterns, which cannot be directly generated
by the existing methods. In multi-method combined mining approach, we combine FP-growth
and Bayesian Belief Network to make a classifier to get more informative knowledge.
We are living in a world, where a vast amount of digital data which is called big data. Plus as the world becomes more and more connected via the Internet of Things (IoT). The IoT has been a major influence on the Big Data landscape. The analysis of such big data brings ahead business competition to the next level of innovation and productivity.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
1. Large Graph Mining
Recent Developement, Challenges and Potential
Solutions
EBISS,
20 of July 2012
Brussels
SABRI SKHIRI / RESEARCH DIRECTOR EURA NOVA
2. PASSIONATE ABOUT COMPUTER SCIENCE, TECHNOLOGY &
RESEARCH
THE SPEAKER
Research director @ EURA NOVA
Make the link between Research & Customer challenges
Supervising 3 PhD theses and 6 Master theses with 3 Belgian
universities
Head of the EU R&D Architecture
for a Telco equipment provider
Guiding the transition from Telco to Service provider with new technologies
Committer on open source
projects launched @ EURA NOVA
RoQ-Messaging, NAIAD, Wazaabi
3. Ramp-up test to wake up the room after lunch on a Friday afternoon …
Before starting
I will use a few well-known people to illustrate the topic in this tutorial
Can you give me their names?
Leonard Sheldon Moss Larry Page
Looks like we are ready to start learning about Graph Processing!
4. AGENDA
1 / Introduction
2 / Focus on two graph mining algorithms
3 / Introduction of Distributed Processing Framework
4 / Graph Data warehouse – an emerging challenge
5 / Conclusion
5. AGENDA
1 / Introduction
2 / Focus on two graph mining algorithms
3 / Introduction of Distributed Processing Framework
4 / Graph Data warehouse – an emerging challenge
5 / Conclusion
6. Graph Mining needs another approach
EXECUTIVE SUMMARY
Data Mining
Mature, algorithmic, libraries & products
New Needs
Linked data & reasoning on relationships
What do we need?
Is traditional data mining still applicable?
Graph Data Warehouse
Is traditional data warehouse still applicable?
Flat data, relational data,
multi-dimensional data
No Linked data
Biology
Chemistry
Social Networks
Internet - Networks
Graph-based similarity
Algorithm re-design for graphs
Scalability for storage & processing
Conceptual modeling
Query
Processing Stack & materialization
Storage
7. LET’S START
WITH DATA MINING
Process of discovering patterns or models of data. Those
patterns often consist of previously unknown and implicit
information and knowledge embedded within a data set [1]
[1] M.-S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowl. Data Eng.,
8(6):866–883, 1996.
8. Techniques have been developed over the last 20 years
DATA MINING
Process of analyzing data from different perspectives and
summarizing it into useful information
Pattern recognition
We mine data to retrieve pre-determined patterns
Clustering
Data are grouped within partitions
according to criteria
Association
Enables to link data between each other
Classification
We position data in a pre-determined
group
Feature extraction
We transform the input data into a set of
features (data set reduction)
Summarization
Ranking, such as PageRank
9. Manages & processes data as a collection of independent instances
DATA MINING
The Mining usually does not consider the global relations
between the objects
Almost all clustering algorithms compute the similarity between all pairs of
objects in the data set
10. Taking into account the relation between data in mining
Why the relationship matters?
Imagine to cluster people from their profiles
11. Taking into account the relation between data in mining
Why the relationship matters?
Imagine to cluster people not only from their profiles but also
… by their social interactions
New emergent industrial needs lead to dealing with this kind of structured data
More complete Data structure
Greater expressive power
Better model of real-life cases
16. New emergent industrial needs
1. Biochemical Networks
What happens if I drop a compound in the system ?
Drug simulation in drug design
Predict a metabolic pathway given a metabolic network and seed reactions
Subgraph extraction
Find which genes are involved in the fat reduction pathway?
Genetic therapy
Predict a metabolic network from a genetic signature given a protein interaction
graph & a regulation network
17. New emergent industrial needs
2. Chemical Databases
Database specifically designed to store chemical
information.
Atoms
Bonds
Graphs are the natural representation for chemical compounds; most of the
mining algorithms focus on mining chemical graphs
18. New emergent industrial needs
2. Chemical Databases
A typical request: Structural similarity search
Gd = (Vd, Ed), Vd = (d1, …, dn)
Gd is the graph query
The objective is to maximize the probability that
the i-th θ equals α, knowing the measures a, b:
max P(θi = αi | a, b), with {a, b} ⊆ V
19. New emergent industrial needs
2. Chemical Databases
Structural indexing
Indexing the structural properties of the molecules
Structural similarity search
Similar molecules will have similar effects
Structure-Activity-Relationship
How to modify the Structure for changing its activity
3D molecule conformation
Based on similar molecule conformations
20. New emergent industrial needs
2. Chemical Databases
Structure-Activity-Relationship
Example of sucralose, where 3 hydroxyl groups have been replaced with
chlorine (Cl)
Sugar C12H22O11
Diet Sugar C12H19Cl3O8
http://en.wikipedia.org/wiki/Sucralose
21. New emergent industrial needs
3. Social network analytics
The Social Graph models the (direct or indirect) Social
interactions between users
22. Example of Trust from a bipartite Graph
3. Social network analytics
The Goal is to infer trust connections between actors in
set A only connected through Item I
Daire O'Doherty, Salim Jouili, Peter Van Roy:
Towards trust inference from bipartite social
networks. DBSocial 2012: 13-18
23. Example of Trust from a bi-partite Graph
3. Social network analytics
The Goal is to infer trust connections between actors in
set A only connected through Item I
Measure to compare similarity and diversity
Highly connected shared item will have higher
distance values
Daire O'Doherty, Salim Jouili, Peter Van Roy:
Towards trust inference from bipartite social
networks. DBSocial 2012: 13-18
24. Example of Trust from a bi-partite Graph
3. Social network analytics
Daire O'Doherty, Salim Jouili, and Peter Van Roy. Trust-
Based Recommendation: An Empirical Analysis, Sixth
ACM Workshop on Social Network Mining and Analysis
(SNA-KDD 2012), Beijing, China, Aug. 12, 2012.
25. New emergent industrial needs
3. Social network analytics
People you may know
Structural similarity based
Trust computation on structural properties
Used for accurate recommendation
Collaborative filtering
You tend to like what your friends like
Influence management
Used in marketing models
26. Marketing model to influence users
3. Social network analytics
SOCIAL KNOWLEDGE
TRADITIONAL
MARKETING MODELS
Bolton 1998
Bolton & Lemon 1999
SOCIAL MODELS
Nitzan & Libai 2011 / Singer 2012
INFLUENCE NETWORK Able to predict much more accurately
> How to influence influencers to reach objectives
Viral marketing maven
Accurate
churners
Product (content, services, etc.)
adoption
Loyal users to reward to optimize the subscriber base
Decrease
acquisition
costs
27. Building an interaction-based model for INFLUENCE
3. Social network analytics
Vertex similarity distance
Edge weight computing
Betweenness centrality computation
Temporal analysis and version at
vertex/edge
When all social interaction variables are
considered within the same model we end up
with a very powerful Social Profile model
29. What changes with graphs?
Problem Statement
Similarity & Distances
Must be graph-based
Structural nature of the data model
Makes mining algorithms more challenging to implement
Scalability issue
Most of the graph mining problems involve very large graphs
Most of the existing graph mining algorithms deal with data in main
memory -> not possible anymore
30. Let’s position this tutorial
Problem Statement
BSP approach
Using fully distributed approach
Google Pregel, Apache HAMA
In-memory/MPI/HPC
Use multi-processors implementations
SNAP
Graph DB
Focus on storage & graph traversal
Neo4J, Dex, OrientDB
31. Let’s position this tutorial
Problem Statement
BSP approach
Using fully distributed approach
Google Pregel, Apache HAMA
Given a set of data mining algorithms, how can we adapt
them to fully leverage the distributed processing approach?
32. The base data model is not the same anymore
Using the distributed way
(Distributed) Storage
Graph Model
(Distributed) graph processing
Mining algorithm
The algorithm implementation will depend on the underlying
distributed processing paradigm
33. AGENDA
1 / Introduction
2 / Focus on two graph mining algorithms
3 / Introduction of Distributed Processing Framework
4 / Graph Data warehouse – an emerging challenge
5 / Conclusion
35. A ranking algorithm
Page Rank
The web is a network of web pages
In addition to the page content, the page linkage represents a useful
source of knowledge and information
Compute a ranking on every web page based only on
the linkage structure
L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking:
Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November
1999. Previous number = SIDL-WP-1999-0120.
37. Random surfer who browses the pages
Page Rank
Either,
1. The surfer chooses an outgoing link of the current vertex
uniformly at random, and follows that link to the destination
vertex, or
2. it “teleports” to a completely random Web page, independent of
the links out of the current vertex.
Intuitively, the random surfer traverses frequently “important” vertices with many
vertices pointing to it
38. Random surfer who browses the pages
Page Rank
Let G = (V,E) be the web graph
The PageRank equation
PR(v) = (1 - p) / |V| + p · Σ_{u ∈ d_in(v)} PR(u) / d_out(u)

d_in(v): the vertices with edges incoming to vertex v
d_out(u): the number of outgoing edges from vertex u
p: the damping factor (0.85)
We will see how to implement it in a distributed processing framework in the 2nd
part of this tutorial
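Before the distributed version, the equation above can be evaluated directly by power iteration. The sketch below is my own minimal Python illustration (not code from the tutorial), on a hypothetical three-page graph.

```python
def pagerank(graph, p=0.85, iterations=50):
    """Power iteration of PR(v) = (1-p)/|V| + p * sum(PR(u)/d_out(u) for u in d_in(v))."""
    n = len(graph)
    pr = {v: 1.0 / n for v in graph}  # start from a uniform distribution
    for _ in range(iterations):
        new = {}
        for v in graph:
            # sum the contributions of every vertex u that links to v
            incoming = sum(pr[u] / len(outs) for u, outs in graph.items() if v in outs)
            new[v] = (1 - p) / n + p * incoming
        pr = new
    return pr

# Tiny example: a -> b, a -> c, b -> c, c -> a
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Since every vertex here has outgoing edges, the ranks stay a probability distribution (they sum to 1); "c", with two in-links, ends up ranked above "b".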
39. Introduction
Graph clustering
Probably the most important topic studied in graph mining
In the graph area, it is referred to as community detection
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to
Cluster Analysis (Wiley Series in Probability and Statistics). Wiley-Interscience,
Mar. 2005.
Goal
Given a set of instances, grouping them into groups which share
common characteristics based on similarity
40. Example in targeting advertisement
Graph clustering
Brands
Method to cluster a
new user
Display
Ads
Track
Behavior
Improve
model
Classified group
User grouped by brand affinity
Social Interactions
Usage Patterns
Social Graph
Let us see 2 kinds of clustering algorithms
(1) a generalization of K-Means & (2) a divisive algorithm that uses the structure
41. The original algorithm concept
K-Means based clustering
Goal: finding clusters by minimizing the sum of the distances
between the data instances and the corresponding centroid
The k: the number of groups
A similarity measure D(oi, oj)
Steps
1. Select K instance as initial centroids
2. Each data instance is assigned to the nearest
cluster
3. Each cluster center is recomputed as the
average of the data instances in the cluster
4. Repeat step [2-3]
42. What do we need to change?
Adapting K-Means to Graph model
Extending K-Means to take advantage of the linkage
information
A Graph-aware selection of the
vertex center
A Graph-aware similarity
measure D(oi, oj)
The simplest is the geodesic distance
Number of edges (hops)
Median Vertex
Minimizes the sum of distances to all other vertices
vm = argmin_{v ∈ C} Σ_{u ∈ C} D(u, v)
43. What do we need to change?
Adapting K-Means to Graph model
Extending K-Means to take advantage of the linkage
information
A Graph-aware selection of the
vertex center
A Graph-aware similarity
measure D(oi, oj)
The simplest is the geodesic distance
Number of edges (hops)
Closeness Centrality
a node is the more central the lower its total distance
to all other nodes
CC(v) = (|V| - 1) / Σ_{u ∈ V, u ≠ v} D(v, u)
We usually take the shortest path
as distance
as distance
M. J. Rattigan, M. E. Maier, and D. Jensen. Graph clustering with network
structure indices. In Z. Ghahramani, editor, ICML, volume 227 of ACM
International Conference Proceeding Series, pages 783–790. ACM, 2007.
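The graph-adapted K-Means of the last two slides can be sketched in a few lines. This is my own illustration, assuming unweighted edges (BFS hop count as the geodesic distance) and the median-vertex rule for recomputing centers; the function names are mine, not from any library.

```python
import random
from collections import deque

def geodesic(graph, source):
    """BFS hop distance from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def graph_kmeans(graph, k, iterations=10, seed=0):
    centers = random.Random(seed).sample(sorted(graph), k)
    clusters = {}
    for _ in range(iterations):
        dists = {c: geodesic(graph, c) for c in centers}
        clusters = {c: [] for c in centers}
        for v in graph:  # assign each vertex to the nearest center (geodesic distance)
            best = min(centers, key=lambda c: dists[c].get(v, float("inf")))
            clusters[best].append(v)
        # recompute each center as the median vertex of its cluster
        new_centers = [min(members,
                           key=lambda v: sum(geodesic(graph, v).get(u, len(graph))
                                             for u in members))
                       for members in clusters.values()]
        if set(new_centers) == set(centers):
            break
        centers = new_centers
    return clusters

# Hypothetical example: two triangles joined by the edge (2, 3)
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
clusters = graph_kmeans(g, k=2)
```

Each center always keeps itself (distance 0), so no cluster is ever empty, and the medians of disjoint clusters are distinct vertices.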
44. A divisive method
Centrality-based clustering
From the graph, iteratively cut specific edges
Progressively cut into smaller communities
The cutting strategy should select the edges
connecting communities as much as possible
[1] proposed to use the edge betweenness
centrality to select the edges to be cut
M. Girvan and M. E. J. Newman. Community structure in social and biological
networks. Proceedings of the National Academy of Sciences, 99(12):7821–
7826,2002
45. Definition
Edge betweenness centrality
Structurally locates the “well-connected” edges
If it is located on many shortest paths
S. Wasserman and K. Faust. Social Network Analysis: Methods and
Applications. Number 8 in Structural analysis in the social sciences. Cambridge
University Press, 1 edition, 1994.
BC(e) = Σ_{v,w ∈ V} b_vw(e) / b_vw

b_vw(e) = the number of shortest paths from v to w passing through e
b_vw = the total number of shortest paths from v to w
46. Step by step description
Centrality-based clustering
Steps
1. Compute the betweenness of all existing edges
2. Remove the edge with the highest betweenness centrality
3. Repeat step [1,2] until the communities are suitably found
BC(e) = Σ_{v,w ∈ V} b_vw(e) / b_vw
Extremely useful for web & social graphs
Characterized by Small-World structure property
R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data
mining, KDD ’06, pages 611–617, New York, NY, USA, 2006. ACM.
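As a concrete sketch of step 1 above, edge betweenness can be computed brute-force by enumerating all shortest paths between every pair (in practice Brandes' algorithm is used; this naive version, my own illustration, only suits small graphs).

```python
from collections import deque, defaultdict
from itertools import combinations

def all_shortest_paths(graph, s, t):
    """Enumerate every shortest path from s to t (BFS + predecessor lists)."""
    dist, preds = {s: 0}, defaultdict(list)
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u is a predecessor on a shortest path
                preds[v].append(u)
    if t not in dist:
        return []
    def build(v):
        if v == s:
            return [[s]]
        return [path + [v] for u in preds[v] for path in build(u)]
    return build(t)

def edge_betweenness(graph):
    """BC(e) = sum over pairs (v, w) of b_vw(e) / b_vw."""
    bc = defaultdict(float)
    for v, w in combinations(sorted(graph), 2):
        paths = all_shortest_paths(graph, v, w)
        for path in paths:
            for e in zip(path, path[1:]):
                bc[frozenset(e)] += 1.0 / len(paths)
    return bc

# Hypothetical example: two triangles joined by a bridge (2, 3)
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
bc = edge_betweenness(g)
```

The bridge carries the unique shortest path of all 9 cross-triangle pairs, so it has the highest betweenness; removing it (step 2) splits the graph into the two communities.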
47. AGENDA
1 / Introduction
2 / Focus on two graph mining algorithms
3 / Introduction of Distributed Processing Framework
4 / Graph Data warehouse – an emerging challenge
5 / Conclusion
48. Why do we need a distributed approach?
Scalability issues
The graphs can reach a significant size: hundreds of millions of nodes,
billions of edges
Most of the Graph mining frameworks & libraries use in-memory graph data => we need another paradigm
49. (really) Short introduction to
distributed computing
How to distribute a processing over a huge data set?
The ability to run software simultaneously on different
processors in order to increase its performance, while the
distributed concept emphasizes the notion of loose
coupling between those processors.
50. From the resource sharing & the paradigm viewpoint
Distributed architectures
Shared memory
Shared Disks
Shared Nothing
Explicit parallel programming Implicit parallel programming
51. Distributed architecture
Shared memory
Distributed systems that share a common memory space
Case of distributed machine, it can be a distributed cache
Pros
High speed transfer
Cons
The shared memory must manage the data
consistency &
The access from different clients
Can be costly when adding new memory nodes
Can be highly expensive
52. Distributed architecture
Shared disk
Distributed systems that share a common shared disk space
Typically through a LAN
Pros
Almost transparent for the applications
Less costly when adding new storage nodes
Cons
Access contention & data consistency issue
when clients increase
Expensive
53. Distributed architecture
Shared Nothing
Distributed systems where each machine has its own memory
space
Pros
Can be implemented on cheap or expensive
servers
With an adapted distributed processing
framework the application does not need to deal
with the distributed aspect
Highly elastic
Cons
Applications need to be re-designed
54. Distributed architecture
Shared Nothing
This kind of system needs to distribute the data
Partitioning policy
[Diagram: vertices 1–5 partitioned across machines, with replicated boundary vertices 1’–4’]
This leads to the interesting concept of data locality
Executing a process where the data is located
55. Distributed architecture: programming model viewpoint
Explicit parallel programming
The developer will have to explicitly program the parallel
aspects
Creating tasks, synchronization, managing threads & processes, thread-safe operations, etc.
Not the advised solution
Pros
Richer expressivity, gives very low-level control
over the distributed processing (main pain point
in Hadoop MR)
Cons
Serious complexity
Error-prone
56. Distributed architecture: programming model viewpoint
Implicit parallel programming
The developer will NOT have to take care of those details
The compiler or the framework handles all aspects related to parallel execution
The code to run, the scheduling, the location of execution, etc
Most of the examples we present here are implicit programming with
shared-nothing data resources
Pros
Much easier – hidden complexity
Highly scalable
Cons
Much less control on the execution as it is
completely handled by the framework
57. Let’s talk about graph processing
How can I process a graph using implicit parallel
programming and shared-nothing processing?
58. The well-known framework from Google, and Hadoop, its open source version
Map Reduce
Created by Google to index crawled web pages
The 3 main strengths of Hadoop [1]
Data Locality
Can schedule a process where the data is
[1] A. Bialecki, M. Cafarella, D. Cutting, and O. O’Malley. Hadoop: A
framework for running applications on large clusters built of commodity
hardware, http://lucene.apache.org/hadoop/, 2005
Fault Tolerant
Automatic re-scheduling of failing tasks
Parallel processing
On different chunks of data
59. Short introduction – 2 main phases Map & Reduce
Map Reduce
Main concepts
Map Phase
[1] A. Bialecki, M. Cafarella, D. Cutting, and O. O’Malley. Hadoop: A
framework for running applications on large clusters built of commodity
hardware, http://lucene.apache.org/hadoop/, 2005
The problem is partitioned into a set of smaller sub-problems
Distributed over the workers in the cluster
& processed independently
Reduce Phase All answers to all sub-problems are gathered from the worker nodes
and then merged
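The two phases can be mimicked in a few lines of plain Python to make the data flow concrete. This is a toy word count of my own, not Hadoop code; in Hadoop the shuffle/grouping step is performed by the framework itself.

```python
from itertools import groupby

def map_reduce(records, mapper, reducer):
    # Map phase: each record is processed independently into (key, value) pairs
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle: group intermediate pairs by key
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: merge all values gathered for each key
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

# Toy word count over two "documents"
counts = map_reduce(["graph mining", "graph processing"],
                    mapper=lambda doc: [(w, 1) for w in doc.split()],
                    reducer=lambda key, values: sum(values))
```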
60. Is it really suited for Graph Processing & mining?
The developer only focuses on the algorithm, and it
gives a simple way to deal with large data sets in a
completely distributed way
However… it is not really suited for Graph
processing
1. Does not manipulate a Graph model – makes
the algorithm complex
2. Is not suited for iterative processing
1 iteration = 1 MR
Requiring a lot of I/O, data migration, unnecessary computation
61. Optimizing data transfer for iterative algorithms
Map Reduce Improvements
A few works have been done in this direction
R. Chen, X. Weng, B. He, and M. Yang. Large graph processing in the cloud. In Proceedings of the 2010
international conference on Management of data, SIGMOD ’10, pages 1123–1126, New York, NY, USA,
2010. ACM.
J.Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox.Twister: a runtime for iterative
mapreduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed
Computing, HPDC ’10, pages 810–818, New York, NY, USA, 2010. ACM.
U. Kang, C. Tsourakakis, A. Appel, C. Faloutsos, and J. Leskovec. Hadi: Fast diameter estimation and
mining in massive graphs with hadoop. CMU-ML-08-117, 2008.
U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In W. Wang,
H. Kargupta, S. Ranka, P. S. Yu, and X. Wu, editors, ICDM, pages 229–238. IEEE Computer Society,
2009.
Despite the improvements, these solutions lack a graph-based model since they deal
with multi-dimensional data
62. Methods for dealing with linked structures using Map reduce concept
Then comes Google with Pregel
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser,
and G. Czajkowski. Pregel: a system for large-scale graph processing. In
A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages
135–146. ACM, 2010.
Providing a distributed computing framework
dedicated to graph processing
Bulk Synchronous Parallel (BSP) for graph processing
In a BSP model an algorithm is executed as a
sequence of supersteps separated by a global
synchronization point until termination.
In 1 Superstep a processor can:
1. Perform computation on local data
2. Send or receive messages
63. Learning a distributed graph processing framework
Concept of superstep @ Pregel
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser,
and G. Czajkowski. Pregel: a system for large-scale graph processing. In
A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages
135–146. ACM, 2010.
The vertices of the graph execute the same
user-defined function (compute) in parallel
Modification of the state of a vertex or its outgoing edges
Read messages sent to the vertex from previous supersteps
Send messages to other vertices that will be received in the next supersteps
Modification of the Graph Topology
64. Learning a distributed graph processing framework
Concept of superstep @ Pregel
How do I stop the processing?
Use “vertex voting”
Each vertex votes to halt -> becomes inactive unless it receives a non-empty message
Inactive vertices no longer take part in the
processing.
The processing stops when all vertices are inactive.
65. Methods for dealing with linked structures using the MapReduce concept
Open source implementation of Pregel
Apache Giraph
From Google Pregel
BSP for distributed
graph processing
Distributed Graph Processing
Processing
HDFS
66. Let’s play with Giraph
Implementing a single-source shortest path (SSSP)
67. Thinking in terms of supersteps & messages
Re-thinking the SSSP for Giraph Processing
1. Init vertex value to the largest possible value for all vertices except the source
2. On each step
1. The vertex reads the messages from its neighbors
2. Each message contains the distance between the source & current
vertex through the last vertex
3. We take the min value between the current value & the received
value
4. Send the message to all neighbors as min distance + edge weight
Definition of the vertex value
The distance to reach the current vertex
from the source
Definition of the messages
Vertex sends its current value + edge weight
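The superstep logic above can be sketched as a single-machine simulation. Giraph itself is a distributed Java framework, so the function name, graph encoding and message loop below are purely illustrative, not the Giraph API; the example graph in the test mirrors the worked example from the speaker notes (vertex 1 as source, edge (1,2) of weight 6).

```python
INF = float("inf")

def sssp_supersteps(edges, source):
    """Simulate Pregel-style SSSP; edges: {vertex: [(neighbor, weight), ...]}."""
    value = {v: INF for v in edges}   # 1. init to the largest possible value
    value[source] = 0
    # superstep 0: the source sends its value + edge weight to its neighbors
    inbox = {v: [] for v in edges}
    for n, w in edges[source]:
        inbox[n].append(value[source] + w)
    while any(inbox.values()):        # vertices with no messages stay inactive
        outbox = {v: [] for v in edges}
        for v in edges:
            if inbox[v]:
                best = min(inbox[v])          # min of received distances
                if best < value[v]:           # improved -> update & propagate
                    value[v] = best
                    for n, w in edges[v]:
                        outbox[n].append(best + w)
        inbox = outbox                # global sync barrier between supersteps
    return value
```

The loop terminates exactly as vote-to-halt does: when no vertex produced a message in a superstep, everyone stays inactive and the algorithm stops.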
68. Thinking in terms of supersteps & messages
Re-thinking the SSSP for Giraph Processing
73. Thinking in terms of supersteps & messages
Re-thinking PageRank for Giraph Processing
Remember the PageRank equation
Definition of the vertex value
?
Definition of the messages
?
PR(v) = (1 - p) / |V| + p · Σ_{u ∈ in(v)} PR(u) / d_out(u)
3 Mins to think !
74. Thinking in terms of supersteps & messages
Re-thinking PageRank for Giraph Processing
Remember the PageRank equation
Definition of the vertex value
The tentative PageRank
Definition of the messages
The tentative PageRank divided by the number
of outgoing edges
PR(v) = (1 - p) / |V| + p · Σ_{u ∈ in(v)} PR(u) / d_out(u)
75. Dive into the algorithm
PageRank in Giraph
1. Init vertex value with 1 / size of the graph
2. On each step
1. The vertex reads the messages from its neighbors
2. Each message contains the tentative PR of an in-neighbor
3. Compute the PageRank of the current vertex with p = 0.85
4. Send the message along all outgoing edges
5. After a fixed number of supersteps (iterations), vertices vote to halt
Definition of the vertex value
The tentative PageRank
Definition of the messages
The tentative PageRank divided by the number
of outgoing edges
PR(v) = (1 - p) / |V| + p · Σ_{u ∈ in(v)} PR(u) / d_out(u)
[1] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking:
Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November
1999. Previous number = SIDL-WP-1999-0120.
One could find a suitable setup to run
until convergence of values [1]
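The five steps above can be sketched as a single-machine simulation of the superstep logic (not Giraph Java code; the function name and graph encoding are illustrative). Note the sketch assumes every vertex has at least one outgoing edge, i.e. no dangling vertices:

```python
def pagerank_supersteps(out_edges, p=0.85, supersteps=30):
    """Simulate Pregel-style PageRank; out_edges: {vertex: [neighbors]}.
    Assumes every vertex has at least one outgoing edge."""
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}      # 1. init with 1/|V|
    for _ in range(supersteps):                  # fixed number of supersteps
        inbox = {v: 0.0 for v in out_edges}
        for v, nbrs in out_edges.items():
            share = value[v] / len(nbrs)         # tentative PR / #out edges
            for u in nbrs:
                inbox[u] += share                # delivered next superstep
        for v in out_edges:
            # PR(v) = (1 - p)/|V| + p * sum of received shares
            value[v] = (1 - p) / n + p * inbox[v]
    return value
```

On a 3-vertex cycle the values stay at the fixed point 1/3 per vertex, which is a quick sanity check that the per-superstep update matches the equation.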
76. A deeper look at the algorithm
PageRank algorithm distilled
77. For a Geek like me, code is easier to get
PageRank for Giraph Processing
*Moss, IT Crowd
https://github.com/apache/giraph
78. For the geeks – what is the meaning of sendMsgToAllEdges?
PageRank for Giraph Processing
79. Up to you guys – Classification of customers by product
Test: Write a classification example
1. Starting from n root nodes, each having one color
2. Propagate the color to all neighboring nodes
3. The color is propagated only if there is no nearer colored root node
4. Use the SSSP to define the distance
Definition of the vertex value
?
Definition of the messages
?
15 mins
public enum Color {
GREEN, RED, ORANGE
}
80. Up to you guys – Classification of customers by product
Test: Write a classification example
1. Starting from n root nodes, each having one color
2. Propagate the color to all neighboring nodes
3. The color is propagated only if there is no nearer colored root node
4. Use the SSSP to define the distance
Definition of the vertex value
[Color Label, Distance to the root node of
this color]
Definition of the messages
[Color, Distance to the root node of this
color]
10 mins
public enum Color {
GREEN, RED, ORANGE
}
81. Up to you guys – Classification of customers by product
Test: Write a classification example
Definition of the vertex value
[Color Label, Distance to the root node of this
color]
Definition of the messages
[Color, Distance to the root node of this color]
8 mins
82. 1. Init vertex value to the largest possible value for all vertices except the
colored root vertices
2. On each step
1. The vertex reads the messages from its neighbors
2. Each message contains the distance between the root & the current
vertex through the last vertex, and the propagated color
3. If the received value is less than the current value, we update the value
and set the color
4. Send the message to all neighbors as min distance + edge weight
Up to you guys – Classification of customers by product
Test: Write a classification example
Definition of the vertex value
[Color Label, Distance to the root node of this
color]
Definition of the messages
[Color, Distance to the root node of this color]
83. Up to you guys – Classification of customers by product
Test: Write a classification example
84. Intermediate Conclusion
Can I use graph mining algorithms on huge graphs
using distributed frameworks coming from the web?
85. Can we do graph mining on large graphs using the distributed approach?
Intermediate Conclusion
Yes you can, but …
1. Need to choose an implicit distributed framework
2. This will constrain the programming model & the storage
3. Need to re-design the algorithms to fully exploit the framework
If I can mine the graph - does it mean that I have a data warehouse?
What do we miss to have a full graph data warehouse?
86. AGENDA
1 / Introduction
2 / Focus on two graph mining algorithms
3 / Introduction of Distributed Processing Framework
4 / Graph Data warehouse – an emerging challenge
5 / Conclusion
88. Definition of interactions
Data warehouse & mining
Data mining algorithms are involved in many
steps of the DW
1. Identifying key attributes
2. Finding related measures
3. Limiting the scope of queries
Mining space
Multi-dimensional cube space for mining
Generating features & targets
By using OLAP queries
Multi-step OLAP process
Using data mining as building blocks
Speeding up model construction
Using data cube computation
OLAP frameworks are often integrated with
mining frameworks
-> OLAM (On-Line Analytical Mining) &
exploratory multi-dimensional mining [1]
[1] J. Han and M. Kamber. Data Mining: Concepts and Techniques.
Morgan Kaufmann, 2000.
89. Graph is fine, but stop playing,
be an adult
Come back to a professional & business
environment, come back to relational DBs
90. It is not because it is fun, it is because the relationship model brings value
The graph is a constraint
Let’s take the Social Network example
1. We can model a friend relationship as an m-n relation
2. On average, ~100 friends
3. A friends-of-friends request – 100² join operations
Storing a SN in a Relational DB is not a problem
Unless you need traversal queries for mining
91. Two main important issues
A Graph in a relational DB
1. Cost of joins when traversing
2. Transferring almost the totality of the graph between the client and the
DB
We have seen that distributed graph processing frameworks use data locality
to minimize this cost
92. I got a distributed processing
framework & mining
algorithms
Now do I have a Graph Data warehouse?
…BTW what is exactly a Data warehouse?
93. Let’s take a look
Traditional Data warehouse
Aims at providing software, modeling approaches & tools to
analyze a set of data in a collection of DBs
E. Malinowski and E. Zimanyi. Advanced data warehouse design: From conventional to
spatial and temporal applications. Springer-Verlag, 2008.
94. An important topic of research
Conceptual modeling
Aims at providing software, modeling approaches & tools to
analyze a set of data in a collection of DBs
Research topic focus
1. Improvement of the snowflake & star
models
2. Models enabling the definition of levels of
hierarchies
3. Role played by a measure in different
dimensions
4. Properties such as additivity and derived measures
E. Malinowski and E. Zimanyi. Multidimensional conceptual modeling. In J. Wang, editor,
Encyclopedia of Data Warehousing and Mining, pages 293–300. IGI Global, second edition,
2008.
Measures
Fact
Dimensions
95. The MultiDim model – a conceptual model for Data Warehouse & OLAP applications
Conceptual modeling
Measures
Fact
Hierarchy of dimensions
Child-parent cardinality
Conceptual modeling reached a certain level of maturity
96. Operations & queries on the model
OLAP queries
Extracting information by queries
1. Roll-up (increasing the level of aggregation)
2. Drill-down (decreasing the level of aggregation, i.e. increasing detail)
along one or more dimension hierarchies
3. Slice and dice (selection and projection)
4. Pivot (re-orienting the multidimensional view of data)
S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP
technology. SIGMOD Record, 26(1):65–74, 1997.
98. I got a distributed processing
framework & mining
algorithms
Now do I have a Graph Data warehouse?!
99. Define what is missing if we have a graph model instead of a relational model
Let’s take the Data warehouse process
Global process overview
Need to be able to model intermediate structures, keeping the
relationship in a central place, while defining navigation paths, roles in
navigation, summarization properties, etc.
100. A central element in the traversal, and hence in graph mining
Why navigation paths matter
Define the way one could traverse the graph
Person
Friends of
Group
Belongs to / Members
Item
Bought by
Bought
Used in
1. Classification
2. Ranking
3. Collaborative filtering
Roles in paths
Hierarchies in paths
Additivity in paths
101. Dealing with distributed frameworks while keeping a high-level query layer
Processing layers
QUERY LAYER
TRANSLATION LAYER
DISTRIBUTED PROCESSING FRAMEWORK
GRAPH STORAGE
102. Dealing with distributed frameworks while keeping a high-level query layer
Challenges @ Processing layers
How to deal with the graph nature?
If I have a graph DB, how do I use Giraph?
How to deal with the distributed aspects?
Integration of the processing framework?
How to infer a physical execution plan?
The data materialization issue is completely different from OLAP
What kind of query language to expose?
SQL – Pig Latin – SPARQL?
103. From Google & Microsoft Research
The most advanced research
Zhao et al., Graph Cube: on warehousing and OLAP multidimensional networks,
in Proceedings of the 2011 international conference on Management of data
Combining social interaction information with user profiles
Targeted ads, marketing, etc.
New Warehousing & OLAP multi-dimensional network model
A graph on which vertex = tuple in a table
Attributes of this table = multi-dimensional spaces
104. From Google & Microsoft Research
The most advanced research
1. Showed we can execute standard OLAP operations while leveraging the
graph aspects
2. Defined the algorithms to obtain the aggregated networks from queries
3. Presented a materialization approach
Zhao et al., Graph Cube: on warehousing and OLAP multidimensional networks,
in Proceedings of the 2011 international conference on Management of data
New Warehousing & OLAP multi-dimensional network model
A graph on which vertex = tuple in a table
Attributes of this table = multi-dimensional spaces
105. Examples for operation on multi-dimensional networks
Showing structural behaviors
Zhao et al., Graph Cube: on warehousing and OLAP multidimensional networks,
in Proceedings of the 2011 international conference on Management of data
Summarizing the multi-dimensional network on
the dimension “Gender”
Summarizing the multi-dimensional network on
the dimensions “Gender” & “Location”
2 females in CA account for 55.6% of the total
Male-Female connections
Drill-down operation
What is the network structure as grouped by
both gender & location?
106. 1. The cuboid queries
Queries on GraphCube
Outputs the aggregate network corresponding to a
specific aggregation of the multi-dimensional network
What is the network structure
between various location
& profession
combinations?
Zhao et al., Graph Cube: on warehousing and OLAP multidimensional networks,
in Proceedings of the 2011 international conference on Management of data
The answer = the aggregated network in the
desired cuboid in the graph cube
107. 2. Crossboid query
Queries on GraphCube
Queries which cross multiple multi-dimensional spaces of
the network (cuboids)
What is the network structure
between the user “3” and
various locations?
Zhao et al., Graph Cube: on warehousing and OLAP multidimensional networks,
in Proceedings of the 2011 international conference on Management of data
108. From Google & Microsoft Research
The most advanced research
New Warehousing & OLAP multi-dimensional network model
A graph on which vertex = tuple in a table
Attributes of this table = multi-dimensional spaces
1. Showed we can execute standard OLAP operations while leveraging
the graph aspects
2. Defined the algorithms to obtain the aggregated networks from queries
3. Presented a materialization approach
Zhao et al., Graph Cube: on warehousing and OLAP multidimensional networks,
in Proceedings of the 2011 international conference on Management of data
Only considers vertices of the same type
Only centralized processing
The materialization policy is inspired by legacy centralized DW
109. AGENDA
1 / Introduction
2 / Focus on two graph mining algorithms
3 / Introduction of Distributed Processing Framework
4 / Graph Data warehouse – an emerging challenge
5 / Conclusion
110. Conclusion
Today the building blocks exist to mine large graphs
Up to you to assemble them for a dedicated purpose
DISTRIBUTED PROCESSING FRAMEWORK
GRAPH STORAGE
MINING LIBRARIES
NON-GRAPH BASED
111. Conclusion
Structuring linked data as graphs is an emerging & important
requirement
Important challenges for mining algorithms
Adapting the logic to include the global relationships
Important challenges for the processing layer
Re-designing algorithms – integrating the storage layer – using emerging Big Data frameworks
However, implicit distributed graph processing
frameworks are emerging
Still far from the concept of a Graph Data Warehouse
Lack of modeling – uniform stack – query language – re-designed materialization
112. THANK YOU
EBISS,
20 of July 2012
Brussels
sabri.skhiri@euranova.eu / twitter@sskhiri /http://blog.euranova.eu
SABRI SKHIRI / RESEARCH DIRECTOR EURA NOVA
Data mining has been developed for two decades; we have mature algorithms, libraries, and even products, mainly focused on relational and flat data.
New requirements are coming from research and industry, such as biology, chemistry, social networks, the internet, etc.
Then the question is: “are traditional data mining algorithms, but also the processing stack, still equally applicable to this new data model?” So, what do we need as a processing paradigm and framework, and what can we change from the algorithmic viewpoint?
Those techniques have been heavily developed in recent years in Business Intelligence
[1,2], especially for databases and flat data, in order to feed market analysis, business
management, and assisted-decision tools [16]. It is worth saying
that data mining stands at the intersection of different disciplines such
as statistics, machine learning, information retrieval and pattern recognition. In
fact, there is no question that data mining appropriately uses algorithms from
these well-studied fields. Indeed, almost all mining algorithms can be divided
into the following families: (1) classification, for which we position data in
pre-determined groups, (2) clustering, in which data are grouped within partitions
according to different criteria, (3) associations, which enable linking data
to each other, (4) pattern recognition, in which we mine data to retrieve
pre-determined patterns, (5) feature extraction and (6) summarization (ranking,
such as PageRank).
The instances of data to be mined are considered independent, without relationships between them. For
example, in the case of a clustering algorithm, in which the input data set is divided into groups of similar objects, it is assumed that there is no relation
between the objects. Hence almost all clustering algorithms compute the similarity between all pairs of objects in the data set by means of a distance measure.
Indeed, traditional data mining work is focused on multi-dimensional and
text data.
Taking the structural relationships into account gives us additional information about the objects, their links and their interactions; in social networks we even start speaking about the social user profile instead of the user profile. We directly see an evolution of mining by considering another data model.
Description of the schema:
Catalysers, inhibitors, compounds, proteins and the genes coding for those proteins.
Each gene that codes for one of those proteins can be activated or blocked -> signal transduction network or gene regulation network.
If you take a systemic approach, we end up with a huge graph.
Gene, Regulator, Protein, Compound
To calculate this measure we use the classic Jaccard index, a widely used measure to compare the similarity and diversity of sample sets.
The second structural feature to analyse is our intuition of the popularity of shared items: the more highly connected a shared item is, the higher its distance value will be.
alpha + beta + gamma = 1
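The Jaccard index mentioned above is simply the ratio of intersection to union of the two sets; a minimal sketch (the function name is illustrative):

```python
def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B| (1.0 for identical, 0.0 for disjoint sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0
```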
I will show you how the new generation of distributed processing frameworks provides powerful tools for this kind of mining.
Make the infrastructure evolve to support mobile apps.
Re-designing service life-cycle management: to be competitive you need to deploy to market in cycles of less than 3 weeks; how to re-design the complete chain, governance and integration.
p is the teleportation probability of the surfer.
We stop the algorithm after a pre-defined number of iterations.
For example, a co-authorship graph is a bipartite graph; by clustering this kind of graph we can see papers and authors in different clusters and easily identify the papers relevant to a specific domain. This kind of thing is also used in targeted advertising.
K-means takes 2 parameters: the number k of groups and a similarity measure between two object instances.
Convergence means that no object moves during the last 2 rounds.
Graph-aware means taking the graph and linkage information into account for the computation.
The size of the graph makes it impossible to work in memory, so we need another kind of solution. One of them is to distribute the data among distributed storage and link the processing to this storage.
The distributed architectures can be classified & described according to the resources the machines or the processors share with each other, but also according to the programming paradigm they offer.
NEC SAN storage
This policy defines the location of the data; the distributed computing framework can then send dedicated tasks where the data is located. This represents the notion of data locality.
Iterative processing, as is the case in k-means, clustering, PageRank, etc.
Within each superstep a processor (or a virtual processor) may perform
the following operations: (1) perform computations on a set of local data (only)
and (2) send or receive messages. Similarly, in Pregel, within a superstep the
vertices of the graph execute the same user-defined function, in parallel. This function
can include: a modification of the state of a vertex or that of its outgoing
edges, reading messages sent to the vertex in the previous superstep, sending messages
to other vertices that will be received in the next superstep, or even a modification
of the topology of the graph (deleting or adding vertices and/or edges)
[49].
Pregel uses a “vertex voting to halt” technique to determine algorithm
termination. Each vertex has two possible states: active or inactive. An algorithm
is considered terminated when all the vertices are in the inactive state.
Practically, in the initial superstep (superstep 0), all vertices are in the active
state; then, in each subsequent superstep, each vertex can explicitly vote to halt
to deactivate itself. An inactive vertex does not participate in any superstep unless
it receives a non-empty message.
The initial step consists of setting the values associated with all the other vertices to infinity. In superstep 1, the vertices
(2), (3) and (4) receive from vertex (1) (sent in superstep 0), respectively, the messages containing their distances to (1). For instance, vertex (2) receives a message that contains 6, which is the sum of the value of vertex (1) and the
weight of the outgoing edge (1)→(2). Moreover, in superstep 1, the source vertex is in the inactive state because it does not receive any message in this superstep. The
next supersteps follow the same procedure until all the vertices are in the inactive
state.
Vertex value = tentative PageRank
Message = the tentative PageRank divided by the number of outgoing edges of the sending vertex, giving the term to sum at the current vertex
https://github.com/apache/giraph
Equally functional features of a data warehouse, but for a graph model?
GoldenOrb does not use generics to get messages, so you have to cast yourself!
@Override
public void compute(Collection<IntMessage> messages) {
    int _maxValue = 0;
    for (IntMessage m : messages) {
        int msgValue = ((IntWritable) m.getMessageValue()).get();
        _maxValue = Math.max(_maxValue, msgValue);
    }
}
The result is that they greatly minimize the amount of data to transfer and even optimize the data locality. Today, in the relational world, this vision is emerging with the concept of SQL-MR DBs.
Let's focus on the data warehouse and OLAP tier.
Most of the research topics focus on the improvement of the snowflake and star schema [51]. Some researchers try to add
a graphical representation [60] based on the ER model [61, 66] or on UML [2, 47]; others focus on models that enable defining different levels of hierarchies [60, 7, 31, 39], while others provide models taking into account the role played by
a measure in different dimensions [47, 1]. The model described by [51] tries to summarize the main limitations of the snowflake and star models and proposes a new model that includes most of the previous research in this area.
I will not go into the details of this model; my only goal on this slide is to show that we have reached a certain level of maturity in conceptual modeling for data warehouses.
The processing layer will take the cube and the queries to generate an optimized physical execution plan that will materialize the queries.
The DB can be an existing DB, such as a buying log in fraud detection, the call data records in telecom, or graph storage.
Then, as soon as we have our consolidated graph, there is a gap in the conceptual modeling: what model will we have, and what kind of queries?
There is no equally functional conceptual modeling approach as the ones we can find in data warehousing.
The MultiDim model defined by Esteban could be applied here, but with some semantic modifications to take our constraints into account.
I have 3 ways to navigate the graph:
Friends
Belongs / Members
Bought / Bought by
You need to deal with many data types: non-structured, graph, semi-structured, structured.
The size of the data will perhaps lead you to consider a distributed approach.
Then a translation layer is required to transform your query into a physical execution plan.
1st graph:
Shows the aggregated network, which is the result of the condensing aggregation operation grouped by gender. The vertices are the vertices condensed by the aggregation; the edges represent the relations between aggregated vertices.
The weights on the edges are the result of the count operation of the group-by.
The graph cube is obtained by restructuring all possible aggregations of A.