Multimodal Biometrics Recognition by Dimensionality Diminution Method (IJERA Editor)
A multimodal biometric system combines two or more biometric modalities, e.g., face, ear, fingerprint, signature, or palmprint, to improve the recognition accuracy of conventional unimodal methods. We propose a new dimensionality reduction method called Dimension Diminish Projection (DDP) in this paper. DDP not only preserves local information by capturing the intra-modal geometry, but also effectively extracts the between-class structures relevant for classification. Experimental results show that our proposed method outperforms other algorithms, including PCA, LDA, and MFA.
Healthcare deserts: How accessible is US healthcare? (Data Con LA)
Data Con LA 2020
Description
In 2018, healthcare spending in the US accounted for 17% of the nation’s GDP. With such significant spending, how can we better understand what that means for healthcare and treatment accessibility? When policy changes occur, how can we gauge the impact on rural areas, which are disproportionately affected by inadequate access to healthcare (or “healthcare deserts”)? Using publicly available data and records, it is possible to locate all major hospitals in the U.S. and, for every residential ZIP code, model the population affected by healthcare deserts at various travel mileage thresholds. This talk will focus on:
· The several public datasets that are available to address this question
· The logic and algorithm(s) used to compute this efficiently in Python
· Visualizing the problem and telling the story in Tableau
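The core distance-threshold computation described above can be sketched roughly as follows; the ZIP centroids, hospital coordinates, and threshold are hypothetical stand-ins for the talk's public datasets:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in miles.
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def population_in_desert(zip_centroids, hospitals, threshold_miles):
    # zip_centroids: list of (lat, lon, population); hospitals: list of (lat, lon).
    # A ZIP is in a "desert" if no hospital lies within threshold_miles of its centroid.
    total = 0
    for lat, lon, pop in zip_centroids:
        if all(haversine_miles(lat, lon, h_lat, h_lon) > threshold_miles
               for h_lat, h_lon in hospitals):
            total += pop
    return total

# Hypothetical example: one ZIP centroid, one hospital about 69 miles east.
desert_pop = population_in_desert([(0.0, 0.0, 100)], [(0.0, 1.0)], threshold_miles=50)
```

In practice the pairwise loop would be replaced by a spatial index, but the thresholded nearest-hospital test is the essential step.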
Speaker
Andrew Kaszpurenko, Manager of Advanced Analytics, Edwards Lifesciences THV Division
High performance intrusion detection using modified k-means & naïve Bayes (eSAT Journals)
Abstract
Internet technology is growing at an exponential rate, making the data security of computer systems more complex and critical. Multiple methodologies have been implemented for this purpose in recent years, as detailed in [1], [3]. The availability of larger bandwidth has connected many large server networks worldwide, increasing the need to secure data; an intrusion detection system (IDS) is one of the most efficient techniques for maintaining the security of a computer system. The proposed system is designed to help identify malicious behavior and improper use of computer systems. In this report we propose a hybrid technique for intrusion detection using data mining algorithms. Our main objective is a complete analysis of an intrusion detection dataset to test the implemented system. We propose a new methodology in which a modified k-means algorithm is used for clustering and Naïve Bayes for classification. These two data mining techniques are combined for intrusion detection in large, horizontally distributed databases.
Keywords: Intrusion Detection, Modified K-Means, Naïve Bayes
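A rough sketch of the two-stage idea, with plain Lloyd's k-means standing in for the paper's modified variant and a minimal Gaussian Naïve Bayes; the synthetic "connection records" are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "connection records": two separable feature clouds, 0 = normal, 1 = attack.
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(5, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

def kmeans(X, k, iters=20):
    # Plain Lloyd's k-means (stands in for the paper's "modified" variant).
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels

def gaussian_nb_fit(X, y):
    # Per-class feature means, variances (smoothed), and priors.
    return {c: (X[y == c].mean(0), X[y == c].var(0) + 1e-9, (y == c).mean())
            for c in np.unique(y)}

def gaussian_nb_predict(stats, X):
    classes = list(stats)
    scores = [(-0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)).sum(1) + np.log(p)
              for mu, var, p in stats.values()]
    return np.array(classes)[np.argmax(np.array(scores), axis=0)]

# Stage 1: cluster the records; Stage 2: append the cluster id as an extra
# feature and classify with Naïve Bayes.
cluster_ids = kmeans(X, 2)
X_aug = np.hstack([X, cluster_ids.reshape(-1, 1).astype(float)])
stats = gaussian_nb_fit(X_aug, y)
accuracy = (gaussian_nb_predict(stats, X_aug) == y).mean()
```

Here the cluster assignment is simply fed to the classifier as an extra feature; the paper's actual coupling of the two stages may differ.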
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES (cscpconf)
This paper presents a new fault detection and isolation (FDI) technique applied to industrial systems. The technique is based on neural-network fault-free and faulty behaviour models (NNFMs). NNFMs are used for residual generation, while a decision tree architecture is used for residual evaluation. The decision tree is built from data collected at the NNFMs' outputs and is used to isolate detectable faults according to a computed threshold. Each branch of the tree corresponds to a specific residual. With the decision tree, it becomes possible to take the appropriate decision regarding the actual process behaviour by evaluating only a few residuals. Compared with the usual systematic evaluation of all residuals, the proposed technique requires less computational effort and can be used for online diagnosis. An application example is presented to illustrate and confirm the effectiveness and accuracy of the proposed approach.
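The residual-evaluation idea can be illustrated with a tiny hand-built tree; the fault signatures and the threshold below are invented for the example:

```python
# Each fault leaves a characteristic pattern across two residuals r1, r2.
# A small decision tree evaluates only the residuals it needs along its path,
# instead of systematically checking every residual.
THRESHOLD = 0.5  # in the paper this is computed from fault-free data; fixed here

def diagnose(r1, r2):
    # Root: does residual r1 exceed its threshold?
    if abs(r1) <= THRESHOLD:
        if abs(r2) <= THRESHOLD:
            return "fault-free"
        return "fault B"   # only r2 deviates
    # r1 deviates; check r2 to separate fault A from fault C.
    if abs(r2) <= THRESHOLD:
        return "fault A"   # only r1 deviates
    return "fault C"       # both deviate
```

For the fault-free case the tree touches both residuals here, but in larger trees most paths terminate after a small subset, which is the claimed computational saving.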
On Tracking Behavior of Streaming Data: An Unsupervised Approach (Waqas Tariq)
In recent years, data streams have been the focus of many researchers across different domains. All of them face the same difficulty when discovering unknown patterns within data streams: concept change. The notion of concept change refers to points where the underlying distribution of the data changes over time. Different methods have been proposed to detect changes in a data stream, but most rely on the unrealistic assumption that data labels are available to the learning algorithm. In real-world problems, however, labels for streaming data are rarely available, which is the main reason the data stream community has recently focused on unsupervised approaches. This study is based on the observation that unsupervised approaches to learning from data streams are not yet mature; they provide only mediocre performance, especially when applied to multi-dimensional data streams. In this paper, we propose a method for tracking changes in the behavior of instances using the cumulative density function, abbreviated TrackChCDF. Our method can accurately detect change points along an unlabeled data stream and can also determine the trend of the data, called closing or opening. The advantages of our approach are threefold. First, it detects change points accurately. Second, it works well on multi-dimensional data streams. Last but not least, it can determine the type of change, namely the closing or opening of instances over time, which has broad applications in fields such as economics, the stock market, and medical diagnosis. We compare our algorithm to the state-of-the-art method for concept change detection in data streams, and the results are very promising.
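A minimal sketch of CDF-based change detection on an unlabeled stream; a standard two-sample Kolmogorov–Smirnov statistic stands in for the paper's TrackChCDF machinery, and the window size and threshold are illustrative:

```python
import numpy as np

def ks_statistic(a, b):
    # Maximum gap between the two empirical CDFs (two-sample KS statistic).
    data = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), data, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), data, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def detect_changes(stream, window=50, threshold=0.5):
    # Compare each window with the preceding one; flag large CDF shifts.
    changes = []
    for i in range(window, len(stream) - window, window):
        ref = stream[i - window:i]
        cur = stream[i:i + window]
        if ks_statistic(ref, cur) > threshold:
            changes.append(i)
    return changes

# Unlabeled toy stream whose distribution shifts at index 200.
rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])
change_points = detect_changes(stream)
```

No labels are used anywhere: the detector reacts purely to a shift between the empirical CDFs of consecutive windows.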
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION... (cscpconf)
Constructing a classification model for a particular task is important in machine learning. A classification process involves assigning objects to predefined groups or classes based on a number of observed attributes of those objects. The artificial neural network is one classification algorithm that can be used in many application areas. This paper investigates the potential of applying the feed-forward neural network architecture to the classification of medical datasets. The migration-based differential evolution (MBDE) algorithm is chosen and applied to the feed-forward neural network to enhance the learning process, and the network's learning is validated in terms of convergence rate and classification accuracy. In this paper, the MBDE algorithm with various migration policies is proposed for classification problems in medical diagnosis.
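A bare-bones sketch of training a tiny feed-forward network with classic differential evolution (DE/rand/1/bin); the migration between subpopulations that characterizes MBDE is omitted, and the data and network size are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class "medical" data: one informative feature.
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

def net_predict(w, X):
    # 1-2-1 feed-forward network; w packs all 7 weights and biases.
    h = np.tanh(X @ w[:2].reshape(1, 2) + w[2:4])
    return 1 / (1 + np.exp(-(h @ w[4:6] + w[6])))

def loss(w):
    # Cross-entropy of the network's outputs against the labels.
    p = np.clip(net_predict(w, X), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Classic DE/rand/1/bin evolving the weight vectors directly (no gradients).
NP, D, F, CR = 20, 7, 0.7, 0.9
pop = rng.normal(0, 1, (NP, D))
fitness = np.array([loss(w) for w in pop])
for _ in range(150):
    for i in range(NP):
        a, b, c = pop[rng.choice([j for j in range(NP) if j != i], 3, replace=False)]
        mutant = a + F * (b - c)                     # differential mutation
        trial = np.where(rng.random(D) < CR, mutant, pop[i])  # binomial crossover
        f_trial = loss(trial)
        if f_trial <= fitness[i]:                    # greedy selection
            pop[i], fitness[i] = trial, f_trial

best = pop[np.argmin(fitness)]
accuracy = ((net_predict(best, X) > 0.5) == y).mean()
```

MBDE would run several such populations in parallel and periodically migrate the best individuals between them according to a migration policy.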
Credit Default Swap (CDS) Rate Construction by Machine Learning Techniques (Zhongmin Luo)
1. Financial institutions need to construct proxy CDS rates for counterparties lacking liquid CDS quotes; these are required for CVA pricing, CVA risk charge calculation, etc.
2. Existing CDS proxy methods do not meet regulatory requirements and are vulnerable to arbitrage.
3. After investigating the eight most popular machine learning algorithms, we show that machine learning techniques can be used to construct reliable CDS proxies that meet regulatory requirements while remaining free from the above problem.
4. Feature variable selection can be critical to the performance of CDS-proxy construction methods.
5. The effects of feature variable correlations on classification performance have to be investigated in the case of financial data.
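A toy illustration of classification-based proxy construction: a plain k-nearest-neighbours classifier (one of many possible choices; the paper compares eight algorithms) assigns an illiquid counterparty to a spread bucket of similar liquid names. All features, buckets, and spreads here are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical features for counterparties WITH liquid CDS quotes:
# (encoded sector, encoded region, rating score) -> observed spread bucket.
X_liquid = np.vstack([rng.normal(m, 0.3, (40, 3)) for m in (0.0, 1.0, 2.0)])
bucket = np.repeat([0, 1, 2], 40)                  # 0 = tight, 1 = mid, 2 = wide
bucket_spread_bp = np.array([60.0, 150.0, 400.0])  # illustrative bucket averages

def proxy_spread(x, k=5):
    # k-NN over liquid names: the proxy spread is the average spread of the
    # majority bucket among the k most similar liquid counterparties.
    d = np.linalg.norm(X_liquid - x, axis=1)
    votes = bucket[np.argsort(d)[:k]]
    return bucket_spread_bp[np.bincount(votes).argmax()]

# An illiquid counterparty whose features resemble the "wide spread" group:
spread = proxy_spread(np.array([2.1, 1.9, 2.0]))
```

The regulatory and arbitrage-freeness considerations raised in the bullet points are, of course, properties of the full method, not of this sketch.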
New Ethernet standards, such as 40 GbE or 100 GbE, are already being deployed commercially along with the corresponding Network Interface Cards (NICs) for servers. However, network measurement solutions are lagging behind: while several tools are available for monitoring 10 or 20 Gbps networks, higher speeds pose a harder challenge that requires new ideas, different from those applied previously, and so fewer applications are available. In this paper, we present a system capable of capturing, timestamping, and storing 40 Gbps network traffic using a tailored network driver together with Non-Volatile Memory express (NVMe) technology and the Storage Performance Development Kit (SPDK) framework. We also present core ideas that can be extended to capture at higher rates: a multicore architecture that synchronizes with minimal overhead and reduces reordering of received frames, methods to filter traffic and discard unwanted frames without being computationally expensive, and an intermediate buffer that allows simultaneous access from several applications to the same data as well as efficient disk writes. Finally, we show a testbed for reliable benchmarking of our solution using custom DPDK traffic generators and replayers, which have been made freely available to the network measurement community.
Today, the proliferation of mobile devices and of Internet access over wireless technologies in home environments forces a change in the methodologies used for network measurements. For measurements to faithfully represent the conditions offered to users, the performance of the measurement equipment and the number of devices employed must be adapted to the real conditions of a deployment. To make measurements under these conditions easier and cheaper, this work presents an evaluation of the capabilities of several low-cost, general-purpose platforms. Our results show that, although limitations appear related to how they are connected to the network and to the protocols employed, they are suitable for measuring a wide variety of situations.
The Internet is an ever-growing network. Network equipment has to be improved to cope with this growth, including the devices used to classify network traffic. Internet service providers and network operators need to apply different QoS policies to specific protocols, so such classification systems are critical. However, classification by port does not provide good results, and it is necessary to apply other, more complex techniques. These classification techniques have to be fast enough to work at line rate. This paper presents a system that unifies the entire flow classification process at high speed. It captures the traffic, builds flows from the received packets, and finally classifies them on a GPU. The whole process runs at 10 Gbps on commodity hardware. Our results show that the achieved performance is strongly influenced by the number of protocols to detect and is limited by the number of network flows. In any case, our system reaches up to 24.4 Gbps using commodity hardware.
Current network services such as Voice over IP or IP television pose new challenges to network providers. Network operators need to know whether their services are being properly provided. However, the quality-of-service parameters commonly used in data networks (e.g., throughput, packet delay, packet loss) do not give a clear view of what users are experiencing. It is therefore necessary to translate such measured quality parameters into a quality-of-experience value. Several models are being developed to address this problem. For instance, some approaches have used the packet loss rate to evaluate the experienced quality of an IP television channel. Unfortunately, packet loss explains only a fraction of the quality behavior, so we go one step further and take into account the different MPEG frame types that are transmitted. In this paper, we define a model that predicts the experienced quality as a function of the losses of the different MPEG frame types, providing a mean opinion score for the delivered service. The final results show that our model predicts the quality of experience of such video services better than using the packet loss rate alone.
This talk presents a research topic to students of the Máster Universitario de Investigación en TIC at the Universidad de Valladolid.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies such as blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence, and it leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies data acquisition with an intuitive interface and robust search tools, letting you effortlessly explore, discover, and access the data you need so you can focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
These privacy-preserving datasets can be used for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they are working with. By combining distributed ledger technology with rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at its core: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices, i.e., those with the same in-links, avoids duplicate computations and can thus also reduce iteration time. Road networks often contain chains that can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos with a distributed, domain-oriented data ownership model that assigns clear owners and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
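As an illustration of the automated-validation point above, a minimal rule-based validator might look like the following sketch. The record fields and rules are hypothetical examples, not a real schema.

```python
# Hypothetical sketch of automated data validation at the point of ingestion.
# Each rule is a (field, predicate, message) triple; the fields are made up.

def validate_record(record, rules):
    """Return a list of human-readable violations for one record."""
    errors = []
    for field, check, message in rules:
        if field not in record:
            errors.append(f"{field}: missing")
        elif not check(record[field]):
            errors.append(f"{field}: {message}")
    return errors

RULES = [
    ("user_id", lambda v: isinstance(v, int) and v > 0, "must be a positive integer"),
    ("email",   lambda v: isinstance(v, str) and "@" in v, "must look like an email address"),
    ("amount",  lambda v: isinstance(v, (int, float)) and v >= 0, "must be non-negative"),
]

good = {"user_id": 7, "email": "a@b.com", "amount": 12.5}
bad  = {"user_id": -1, "email": "nope", "amount": 3}
```

Running such checks at the source, before data enters shared pipelines, is what keeps errors from propagating downstream.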
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories
1. Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories
David Muelas, Miguel Gordo, José Luis García-Dorado, Jorge E. López de Vergara
Email: {dav.muelas, jl.garcia, jorge.lopez vergara}@uam.es, miguel.gordo@estudiante.uam.es
Universidad Autónoma de Madrid
CNSM 2015 – November 2015
2. Network Health Check
Network managers must monitor network vital signs to assure it is healthy:

[Figure: (a) an ECG trace; (b) a Dictyogram (normalized version) over one day, with flow categories Cat1–Cat10.]
But... what exactly is a Dictyogram?
3. Dictyogram (from δίκτυο, network in Greek): a method to graphically trace network flow behavior versus time. Its graphical results can look like a network electrogram, showing the network's vital signs.
4. Outline
1 Introduction
Context
Our Goals
2 Method definition
Probability integral transform
Modeling CDFs
3 Experimental results
Model evaluation
Dictyogram visualization
4 Conclusions
D. Muelas, M. Gordo, J.L. García-Dorado, J.E. López de Vergara, Dictyogram, slide 4
5. Context
Network flow-based monitoring has proven useful to detect network intrusions, malfunctions, and other types of anomalies.
Unfortunately, network managers have to deal with tons of measurement data, and its interpretation has become a challenge.
Data summaries: it is difficult to reach a good trade-off between detail and simplification; insufficient data can lead to restricted or even erroneous conclusions.
Not only the measurements are important from the point of view of network management: the application of suitable techniques improves the quality and depth of the knowledge that can be extracted from them.
6. Our Goals
Our proposal is intended to ease network managers' work by proposing a novel approach to study the behavior of network flow characteristics. Our main goal is to define comprehensive summaries of network flow data:
Our approach is based on the study of the ECDFs of different flow characteristics (e.g., flow size or duration distributions).
Using those ECDFs, we define flow categories via the probability integral transform (e.g., using decile-delimited intervals).
As we will see, this approach improves the detection of network anomalies and the visualization of the network state.
7. Method description
Probability integral transform:
Let X be a continuous random variable with cumulative distribution function F_X. Then F_X(X) follows a uniform distribution on [0, 1].
[Figure: (a) the CDF F_X with probability levels P_i and the resulting category bounds C_i = F_X^{-1}(P_i); (b) the resulting uniform distribution on [0, 1].]

Then, we define flow categories from the CDF of a given flow characteristic using a set of probability levels.
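The category-definition step just described (empirical deciles as the bounds C_i = F_X^{-1}(P_i)) can be sketched in Python. The lognormal flow-size sample below is synthetic; only the decile mechanism follows the method.

```python
# Sketch of defining flow categories from decile thresholds C_i = F_X^{-1}(P_i)
# estimated from an empirical CDF. The flow-size sample here is synthetic.
import random

random.seed(0)
flow_sizes = [random.lognormvariate(5, 2) for _ in range(10_000)]

sorted_sizes = sorted(flow_sizes)
n = len(sorted_sizes)
# Empirical decile thresholds for P_i = 0.1, ..., 0.9 (order statistics).
deciles = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
thresholds = [sorted_sizes[int(p * n)] for p in deciles]

def category(size):
    """Map a flow size to a category 1..10 by its decile interval."""
    for i, c in enumerate(thresholds):
        if size <= c:
            return i + 1
    return 10
```

With such thresholds fixed from a reference period, each observed flow falls into one of ten equally likely categories, which is what makes the normalized Dictyogram bands comparable over time.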
8. Keep an eye on the hypotheses!
[Figure: sample CDFs and their transformed values for (c) a Gaussian distribution and (d) a Poisson distribution.]
9. How can we model a CDF?
Glivenko-Cantelli theorem: the ECDF converges to the CDF
as the number of observations increases.
Nonetheless, computational cost increases when we
accumulate all the values of the characteristic under analysis.
Alternative approach: Functional Data Analysis:
Mean function: F_X^mean = (1/n) · Σ_{i=1}^{n} F_{X_i}
Problem: not robust.
Functional depth:
Maximum depth observation.
Median function (the function that maximizes the functional depth we use).
Problem: more computationally expensive.
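A pointwise sketch of the mean-function estimator on a common evaluation grid follows. The per-day samples are synthetic; only the estimator F_X^mean = (1/n) · Σ F_{X_i} mirrors the slide.

```python
# Sketch of the "mean function" model: average several empirical CDFs
# pointwise on a common evaluation grid. The daily samples are synthetic.
import random

random.seed(1)
days = [[random.expovariate(0.01) for _ in range(500)] for _ in range(7)]

def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at the point x."""
    return sum(v <= x for v in sample) / len(sample)

grid = [10 * k for k in range(1, 51)]   # evaluation points
mean_cdf = [sum(ecdf(day, x) for day in days) / len(days) for x in grid]
```

As the slide warns, this estimator is not robust: a single atypical day pulls the whole averaged curve, which is what motivates the depth-based alternatives despite their extra computational cost.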
10. Dataset for the evaluation
To assess the advantages of our method, we have used a real dataset:
Flow records, Spanish Academic Network: more than one
million users, more than 7 years of data.
Exporters: 5 Netflow exporters, different geographical
locations (all of them in Spain).
Packet level sampling: rate of one out of 100 packets.
Period selected for the evaluation of the CDF estimation
methods: 30 days.
11. Analyzing ECDFs to get a model of the typical behavior
[Figure axes: flow size (bytes, log scale) vs. P(X>x). Marked decile points: (40, 0.9), (44, 0.8), (53, 0.7), (80, 0.6), (149, 0.5), (501, 0.4), (1452, 0.3), (1500, 0.2), (3000, 0.1).]
Figure: Comparison between observed CCDFs (orange line, no marker)
for Exporter A, and models obtained using the mean (blue line, circles),
deepest (black line, diamonds) and median (red line, triangles) functions.
12. Empirical comparison (I)
[Figure: five panels (A–E), one per exporter, plotting the test-statistic against the day of the evaluation period (0–30); legend: Mean, Deepest, Median.]
Figure: Evolution of the Pearson’s test-statistic for all exporters. (Less is
better.)
13. Empirical comparison (II)
Table: Summary of the evaluation of the different methods to estimate
the CDF.
Exporter  Method           # Best
A         Mean function    0
A         Deepest obs.     3
A         Median function  25
B         Mean function    0
B         Deepest obs.     6
B         Median function  22
C         Mean function    20
C         Deepest obs.     8
C         Median function  0
D         Mean function    0
D         Deepest obs.     23
D         Median function  5
E         Mean function    0
E         Deepest obs.     28
E         Median function  0
14. Final visualization of Dictyogram
[Figure: three panels of concurrent flows for each category vs. time of day: (a) Mean, (b) Deepest Observation, (c) Median, with two recurring events marked 1 and 2.]
Figure: Dictyogram representation of f_i(t), with the respective size intervals delimited by the deciles given by the (a) mean, (b) deepest observed ECDF, and (c) median.
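The f_i(t) series in the figure above can be sketched as a per-category count of active flows in each time bin. The (start, end, category) tuples and the bin width below are illustrative assumptions, not the authors' data format.

```python
# Sketch of building Dictyogram series f_i(t): for each time bin, count how
# many flows of each category are active. Flows are hypothetical
# (start, end, category) tuples; the bin width is an arbitrary choice.

def dictyogram_series(flows, t0, t1, step, n_categories=10):
    """Return {category: [active-flow count per time bin]}."""
    bins = list(range(t0, t1, step))
    series = {c: [0] * len(bins) for c in range(1, n_categories + 1)}
    for start, end, cat in flows:
        for k, t in enumerate(bins):
            if start < t + step and end > t:   # flow overlaps bin [t, t + step)
                series[cat][k] += 1
    return series

flows = [(0, 30, 1), (10, 50, 1), (20, 25, 2)]
series = dictyogram_series(flows, 0, 60, 10, n_categories=2)
```

Stacking these per-category series over the day produces the banded "electrogram" view shown in the figure.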
15. Final visualization of Dictyogram
[Figure: zoom into the median-based Dictyogram over one day (00:00–23:20), with the two marked events.]
16. Key remarks
Our method:
Is manager-friendly: it provides statistical summaries based on certain probability levels, which eases the study of the flows traversing the network.
Links statistical properties to time evolution: it eases the detection of changes in the statistical properties of the characteristics under analysis.
Improves network flow data visualization: it lets managers control the resolution at which the distribution of network flow characteristics is visualized.
17. Future work
We plan to:
Study how to summarize several different network behaviors in a multivariate uniform distribution, and use other well-known distributions (not only the uniform) for signatures.
Study the distribution of the Pearson's test-statistic to detect anomalous events.
Test the stability of the estimation of the CDF (to define criteria to recalibrate the model).
Explore other representations with higher dimensionality.
19. Annex: Functional depth
We use the definition given by:
MS_{n,H}(x) = min{ SL_n(x), IL_n(x) }    (1)

where

SL_n(x) = (1 / (n·λ(I))) · Σ_{i=1}^{n} λ{ t ∈ I : x(t) ≤ x_i(t) },
IL_n(x) = (1 / (n·λ(I))) · Σ_{i=1}^{n} λ{ t ∈ I : x(t) ≥ x_i(t) }    (2)
With it, we consider:
Maximum depth observation.
Median Function (it is the function that maximizes the
functional depth we use).
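For curves sampled on a common grid, the depth in equations (1)-(2) reduces to averaging indicator fractions over the grid points. A minimal sketch (not the authors' implementation; the example curves are made up):

```python
# Sketch of the band-type functional depth from equations (1)-(2), for curves
# sampled on a common grid: SL_n averages the fraction of the interval where
# each sample curve lies above x, IL_n where it lies below, and the depth
# MS_{n,H}(x) is the minimum of the two.

def functional_depth(x, curves):
    """MS_{n,H}(x) = min(SL_n(x), IL_n(x)) for grid-sampled curves."""
    n, m = len(curves), len(x)
    sl = sum(sum(x[t] <= c[t] for t in range(m)) / m for c in curves) / n
    il = sum(sum(x[t] >= c[t] for t in range(m)) / m for c in curves) / n
    return min(sl, il)

def deepest(curves):
    """Maximum-depth observation: the sample curve of highest depth."""
    return max(curves, key=lambda c: functional_depth(c, curves))

# Three constant curves: the middle one is the deepest observation.
curves = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
```

The median function of the slides is the curve maximizing this depth, which is why it is more robust than the mean function but costlier to compute.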