This document summarizes key concepts about data that were covered in a lecture. It discusses the definition of data as a collection of objects and their attributes. It describes different types of attributes and data, including record data, document data, and transaction data. It also covers important characteristics of data like dimensionality and size. Finally, it discusses various data preprocessing techniques such as sampling, discretization, and feature selection.
Date: September 11, 2017
Course: UiS DAT630 - Web Search and Data Mining (fall 2017) (https://github.com/kbalog/uis-dat630-fall2017)
Presentation based on resources from the 2016 edition of the course (https://github.com/kbalog/uis-dat630-fall2016) and the resources shared by the authors of the book used through the course (https://www-users.cs.umn.edu/~kumar001/dmbook/index.php).
Please cite, link to or credit this presentation when using it or part of it in your work.
2. • Last week:
– Course Motivation
– Data Mining basics
• This week:
– Data
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides 2
Summary – last week
3. Agenda
– Attributes and Objects
– Types of Data
– Data Quality
– Similarity and Distance
– Data Preprocessing
4. What is Data?
• A collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(columns = attributes, rows = objects)
5. Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for a particular object
• Distinction between attributes and attribute values
– The same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: attribute values for ID and age are integers
– But the properties of an attribute can be different from the properties of the values used to represent it
6. Types of Attributes
There are different types of attributes, or measurement scales:
– Nominal: used to categorise/classify objects, individuals, groups, or even phenomena. Categories are mutually exclusive.
• Examples: Gender, Country, Religion, Ethnicity
– Ordinal: the data can be categorized and ranked. The rank is known, but the magnitude of the difference between ranks is unknown.
• Example: rate the smartphone brands (iPhone, Samsung, HTC)
– Interval: you can categorize, rank, and infer equal intervals between neighbouring data points, but the origin (zero point) is arbitrary.
• Example: "My university education prepared me well for my career." (Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree)
– Ratio: you can categorize, rank, and infer equal intervals between neighbouring data points, and there is a true zero point.
• Examples: Weight, Income, Age, Sales volume
7. Properties of Attribute Values

Categorical (Qualitative):
– Nominal: attribute values only distinguish (=, ≠). Examples: Gender, Country, Religion, Ethnicity. Operations: mode, frequency (%, count).
– Ordinal: attribute values also order objects (<, >). Example: rate the smartphone brands (iPhone, Samsung, HTC). Operations: median, percentiles, rank correlation, run tests, sign tests.
Numeric (Quantitative):
– Interval: differences between values are meaningful (+, −). Example: "My university education prepared me well for my career." (Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree). Operations: mean, standard deviation, Pearson's correlation, t and F tests.
– Ratio: both differences and ratios are meaningful (*, /). Examples: Weight, Income, Age, Sales volume. Operations: geometric mean, harmonic mean, percent variation.
8. Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite number of digits
– Continuous attributes are typically represented as floating-point variables
9. Asymmetric Attributes
• Only presence (a non-zero attribute value) is regarded as important
– Words present in documents
– Items present in customer transactions
• If we met a friend in the grocery store, would we ever say the following?
"I see our purchases are very similar since we didn't buy most of the same things."
10. Important Characteristics of Data
– Dimensionality (number of attributes)
• High-dimensional data brings a number of challenges
– Sparsity
• Only presence counts
– Resolution
• Patterns depend on the scale
– Size
• The type of analysis may depend on the size of the data
11. Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
12. Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
13. Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m-by-n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
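As a minimal sketch (attribute names and values follow the data matrix example above), such a matrix can be stored as rows of numeric values, with a small helper to pull out one attribute across all objects:

```python
# A data set where every object has the same fixed set of numeric
# attributes can be stored as an m-by-n matrix:
# one row per object, one column per attribute.

attributes = ["proj_x_load", "proj_y_load", "distance", "load", "thickness"]

data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],   # object 1
    [12.65, 6.25, 16.22, 2.2, 1.1],   # object 2
]

def column(matrix, attr_name):
    """Return all values of one attribute (one column of the matrix)."""
    j = attributes.index(attr_name)
    return [row[j] for row in matrix]

distances = column(data_matrix, "distance")
```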
14. Document Data
• Each document becomes a 'term' vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the corresponding term occurs in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2      0        2
Document 2    0     7      0     2     1      0     0    3      0        0
Document 3    0     1      0     0     1      2     2    0      3        0
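A minimal sketch of building such term vectors in Python (the two toy documents are invented for illustration; real pipelines would also tokenize and normalize the text):

```python
from collections import Counter

# Build term-frequency vectors over a shared vocabulary: each document
# becomes one row, each term one component, and the value is the number
# of times that term occurs in the document.

docs = [
    "team play team win game game score lost",
    "coach coach timeout",
]

# Shared vocabulary: every term that occurs in any document.
vocabulary = sorted({term for doc in docs for term in doc.split()})

def term_vector(doc, vocab):
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vectors = [term_vector(doc, vocabulary) for doc in docs]
```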
15. Transaction Data
• A special type of data, where
– Each transaction involves a set of items
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items
– Transaction data can be represented as record data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
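Recasting transaction data as record data can be sketched as follows, using the transactions above (the one-binary-attribute-per-item layout is one common choice, not the only one):

```python
# Transaction data as record data: one row per transaction,
# one binary attribute per item (1 = item present in the basket).

transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# The attribute set is the union of all items, in a fixed order.
items = sorted({item for basket in transactions.values() for item in basket})

binary_records = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}
```

Note that the resulting attributes are asymmetric binary attributes: only the 1s (items actually purchased) carry information.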
16. Graph Data
• Examples: Generic graph, a molecule, and webpages
[Figures: a generic graph, the benzene molecule C6H6, and linked webpages]
17. Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
18. Ordered Data
• Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
19. Data Quality
• Poor data quality negatively affects many data processing efforts
• Data mining example: a classification model for detecting people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default
20. Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
– Noise and outliers
– Wrong data
– Fake data
– Missing values
– Duplicate data
21. Noise
• For objects, noise is an extraneous (irrelevant) object
• For attributes, noise refers to the modification of original values
– Examples: distortion of a person's voice when talking on a poor phone connection and "snow" on a television screen
– The figures below show two sine waves of the same magnitude and different frequencies, the waves combined, and the two sine waves with random noise
• The magnitude and shape of the original signal are distorted
22. Outliers
• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
– Case 1: outliers are noise that interferes with data analysis
– Case 2: outliers are the goal of our analysis
• Credit card fraud
• Intrusion detection
• Causes?
23. Data Preprocessing
• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation
24. Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
– Data reduction: reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
– More "stable" data: aggregated data tends to have less variability
25. Sampling
• Sampling is the main technique employed for data reduction
– It is often used for both the preliminary investigation of the data and the final data analysis
• Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming
• Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming
26. Sample Size
• The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data set, if the sample is representative
– A sample is representative if it has approximately the same properties (of interest) as the original set of data

[Figure: the same point set at three sample sizes - 8000 points, 2000 points, 500 points]
27. Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
– Sampling without replacement
• As each item is selected, it is removed from the population
– Sampling with replacement
• Objects are not removed from the population as they are selected for the sample
• In sampling with replacement, the same object can be picked more than once
• Stratified Sampling
– Split the data into several partitions; then draw random samples from each partition
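The three sampling schemes can be sketched with Python's standard `random` module (the population and the even/odd strata here are purely illustrative):

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
population = list(range(100))

# Simple random sampling without replacement: each selected item is
# effectively removed from the pool, so no duplicates can occur.
without_repl = random.sample(population, 10)

# Sampling with replacement: the same object can be picked more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: partition the data, then draw a random sample
# from each partition (here: even vs. odd numbers).
strata = {"even": [x for x in population if x % 2 == 0],
          "odd":  [x for x in population if x % 2 == 1]}
stratified = {name: random.sample(part, 5) for name, part in strata.items()}
```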
28. Discretization
• Discretization is the process of converting a continuous attribute into an ordinal attribute
– A potentially infinite number of values are mapped into a small number of categories
– Discretization is used in both unsupervised and supervised settings
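A minimal sketch of unsupervised equal-width discretization (one common scheme among several; the income values reuse the earlier taxable-income column):

```python
# Equal-width discretization: map a continuous attribute into k ordinal
# categories by splitting its range [min, max] into k equal-width bins.

def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        # min() keeps the maximum value in the last bin (index k - 1)
        # instead of spilling into a non-existent bin k.
        b = min(int((v - lo) / width), k - 1)
        labels.append(b)
    return labels

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels = equal_width_bins(incomes, 4)
```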
29. Binarization
• Binarization maps a continuous or categorical attribute into one or more binary variables
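For a categorical attribute, one common form of binarization is one-hot encoding, sketched here with illustrative marital-status values:

```python
# One-hot binarization: map a categorical attribute with k distinct
# values into k binary (0/1) variables, one per category.

def one_hot(values):
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

encoded, cats = one_hot(["Single", "Married", "Single", "Divorced"])
```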
30. Attribute Transformation
• An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
– Simple functions: x^k, log(x), e^x, |x|
– Normalization
• Refers to various techniques to adjust for differences among attributes in terms of frequency of occurrence, mean, variance, range
• Takes out unwanted, common signal, e.g., seasonality
– In statistics, standardization refers to subtracting off the mean and dividing by the standard deviation
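Standardization as described can be sketched directly (this uses the population standard deviation; the sample values are illustrative):

```python
import math

# Standardization: subtract the mean and divide by the standard
# deviation, so the transformed attribute has mean 0 and standard
# deviation 1.

def standardize(values):
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

z = standardize([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])
```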
31. Dimensionality Reduction
• Purpose:
– Avoid the curse of dimensionality
– Reduce the amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
• Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition (SVD)
– Others: supervised and non-linear techniques
• The goal is to find a projection that captures the largest amount of variation in the data
32. Feature Subset Selection
• Another way to reduce the dimensionality of the data
• Redundant features
– Duplicate much or all of the information contained in one or more other attributes
– Example: the purchase price of a product and the amount of sales tax paid
• Irrelevant features
– Contain no information that is useful for the data mining task at hand
– Example: students' IDs are often irrelevant to the task of predicting students' GPAs
• Many techniques have been developed, especially for classification
33. Feature Creation
• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
• Three general methodologies:
– Feature extraction
• Example: extracting edges from images
– Feature construction
• Example: dividing mass by volume to get density
– Mapping data to a new space
• Example: Fourier and wavelet analysis
34. Summary
– Attributes and Objects
– Types of Data
– Data Quality
– Similarity and Distance
– Data Preprocessing
35. Next week
• Association Rule Mining