The document provides an overview of data mining, including:
1) It defines data mining as the process of extracting patterns from large datasets that are valid, novel, useful and understandable.
2) It discusses some of the challenges of data mining like dealing with noise and missing data and not overfitting models.
3) It outlines several common data mining tasks like classification, clustering, association rule mining and sequential pattern mining.
This presentation introduces data preprocessing in the field of data mining. Images, examples and other material are adapted from "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber and Jian Pei.
Data preprocessing techniques are applied before mining. They can improve the overall quality of the patterns mined and reduce the time required for the actual mining.
These slides describe important data preprocessing steps that should be applied before running a data mining algorithm on any data set.
Data Mining: Concepts and Techniques (3rd ed.), Chapter 3: Preprocessing (Salah Amean)
The chapter contains:
Data Preprocessing: An Overview,
Data Quality,
Major Tasks in Data Preprocessing,
Data Cleaning,
Data Integration,
Data Reduction,
Data Transformation and Data Discretization,
Summary.
Keynote address delivered on 23rd March 2011 at the Workshop on Data Mining and Computational Biology in Bioinformatics, sponsored by DBT India and organised by the Unit of Simulation and Informatics, IARI, New Delhi.
I do not claim any originality for either the slides or their content, and I acknowledge various web sources.
Date: September 11, 2017
Course: UiS DAT630 - Web Search and Data Mining (fall 2017) (https://github.com/kbalog/uis-dat630-fall2017)
Presentation based on resources from the 2016 edition of the course (https://github.com/kbalog/uis-dat630-fall2016) and the resources shared by the authors of the book used through the course (https://www-users.cs.umn.edu/~kumar001/dmbook/index.php).
Please cite, link to or credit this presentation when using it or part of it in your work.
Data mining Basics and complete description (Sulman Ahmed)
This course is all about data mining techniques: how we mine the data and get optimized results.
Data Mining Data Lecture Notes for Chapter 2 Introduc (OllieShoresna)
Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining
by Tan, Steinbach, Kumar

What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
− Examples: eye color of a person, temperature, etc.
− Attribute is also known as variable, field, characteristic, or feature
A collection of attributes describe an object
− Object is also known as record, point, case, sample, entity, or instance
[Figure labels: Attributes, Objects]

Attribute Values
Attribute values are numbers or symbols assigned to an attribute
Distinction between attributes and attribute values
− Same attribute can be mapped to different attribute values
  Example: height can be measured in feet or meters
− Different attributes can be mapped to the same set of values
  Example: Attribute values for ID and age are integers
  But properties of attribute values can be different: ID has no limit but age has a maximum and minimum value

Types of Attributes
There are different types of attributes
− Nominal. Examples: ID numbers, eye color, zip codes
− Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
− Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
− Ratio. Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
− Distinctness: = ≠
− Order: < >
− Addition: + -
− Multiplication: * /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties

Table: Attribute Type, Description, Examples, Operations
Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test
Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests
Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests
Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation

Table: Attribute Level, Transformation, Comments
Nominal
  Transformation: Any permutation of values
  Comments: If all employee ID numbers were reassigned, would it make any difference?
Ordinal
  Transformation: An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function.
  Comments: An attribut ...
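The order-preserving transformation for ordinal attributes can be illustrated with a small sketch. The labels, the numeric codes and the function f below are hypothetical choices made for illustration; any strictly increasing f would do.

```python
# Hypothetical ordinal scale; labels and codes are illustrative only.
old_codes = {"good": 1, "better": 2, "best": 3}


def f(old_value):
    """A strictly increasing function chosen for illustration."""
    return 10 * old_value + 5


# new_value = f(old_value) for every label
new_codes = {label: f(code) for label, code in old_codes.items()}


def ranking(codes):
    """Return the labels sorted by their numeric code."""
    return sorted(codes, key=codes.get)


print(new_codes)
print(ranking(old_codes) == ranking(new_codes))
```

The codes change, but the ordering of the labels does not, which is all an ordinal attribute is meant to convey.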
Introduction to Data Mining (Chapter 1). Data Mining Concepts and Techniques, by R. Deepa (IT), Batch (2016-2019), published on Oct-13 2018 from NS College of Arts and Science, Theni.
Data Analysis in Research: Descriptive Statistics & Normality (Ikbal Ahmed)
A presentation on data analysis using descriptive statistics and normality. The presentation covers:
1) What is Data
2) Types of Data
3) What is Data analysis
4) Descriptive Statistics
5) Tools for assessing normality
2. Agenda
• Introduction
• What is Data Mining?
• Challenges and Trends
• The origin of DM
• Data Mining Tasks
• Types of Data
• Data Quality
• Takeaways
3.
4. What is Data Mining?
Definition:
Data mining is the process of extracting patterns from data. (Wikipedia)
Process of semi-automatically analyzing large databases to find patterns that are:
− valid: hold on new data with some certainty
− novel: non-obvious to the system
− useful: should be possible to act on the item
− understandable: humans should be able to interpret the pattern
Other Definitions:
− Non-trivial extraction of implicit, previously unknown and potentially useful information from data
− Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
5. What is Data Mining? (Other Definitions)
− Extracting or “mining” knowledge from large amounts of data
− Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data
− Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data
6. What is Data Mining?
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
7. What is Data Mining?
Moore's Law:
The law is named after Intel co-founder Gordon E. Moore, who described the trend in his 1965 paper. The paper noted that the number of components in integrated circuits had doubled every year from the invention of the integrated circuit in 1958 until 1965, and predicted that the trend would continue "for at least ten years".
The number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years. The trend has continued for more than half a century and is not expected to stop until 2015 or later.
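The doubling arithmetic behind this trend can be worked through in a few lines. This is only a sketch: the starting count of 1,000 is a hypothetical figure, not one taken from the slides, and the two-year doubling period is the approximate value quoted above.

```python
def projected_count(initial_count, years, doubling_period_years=2.0):
    """Project a component count assuming one doubling per period."""
    return initial_count * 2 ** (years / doubling_period_years)


# Hypothetical starting point of 1,000 transistors.
for years in (2, 10, 20):
    print(years, projected_count(1_000, years))
```

Twenty years at one doubling every two years is ten doublings, a factor of 1,024.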
9. What is Data Mining? History
• Data Analysis?
• Data Warehouse?
• Data Mining and Statistics
− Standard Deviation
• Data Mining and AI and Machine Learning
− Identify Possible Heart Attack cases
[Timeline labels: 1980's, 1990's, 2000's, 2010's]
12. Data Mining Challenges and Trends
• Computationally expensive to investigate all possibilities
• Dealing with noise/missing information and errors in data
• Choosing appropriate attributes/input representation
• Finding the minimal attribute space
• Finding adequate evaluation function(s)
• Extracting meaningful information
• Not overfitting
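As one illustration of dealing with missing information, the sketch below fills gaps with the mean of the observed values. Mean imputation is just one simple strategy and is not named on the slide; the income figures are hypothetical.

```python
from statistics import mean

# Hypothetical income readings in thousands; None marks a missing value.
incomes = [125, None, 70, 120, None, 60]

# Use only the observed values to compute the fill value.
observed = [x for x in incomes if x is not None]
fill_value = mean(observed)

# Replace each missing entry with the mean of the observed entries.
imputed = [fill_value if x is None else x for x in incomes]

print(fill_value)
print(imputed)
```

Other choices (dropping the record, using the median, or predicting the value from other attributes) may suit a given data set better.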
13. Data Mining Challenges and Trends
[Figure: business insight plotted against the role of software. Role of software: Passive, Interactive, Proactive. Phases: Presentation, Exploration, Discovery. Stages: canned reporting, ad-hoc reporting, online analytical processing, data mining, predictive analysis.]
14. Data Mining Tasks
Prediction Methods
− Use some variables to predict unknown or future values of other variables.
Description Methods
− Find human-interpretable patterns that describe the data.
16. Data Mining Tasks
Concept/Class description: Characterization and discrimination
− Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Association (correlation and causality)
− Multi-dimensional or single-dimensional association
− age(X, “20-29”) ^ income(X, “60-90K”) ⇒ buys(X, “TV”)
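The association rule on this slide (age and income implying a TV purchase) is usually judged by its support and confidence. These two measures are standard but are not defined on the slide, and the customer records below are invented for illustration.

```python
# Hypothetical customer records; the values are invented for illustration.
customers = [
    {"age": "20-29", "income": "60-90K", "buys_tv": True},
    {"age": "20-29", "income": "60-90K", "buys_tv": True},
    {"age": "20-29", "income": "60-90K", "buys_tv": False},
    {"age": "30-39", "income": "60-90K", "buys_tv": True},
    {"age": "20-29", "income": "30-60K", "buys_tv": False},
]


def antecedent(c):
    """Left-hand side of the rule: age 20-29 and income 60-90K."""
    return c["age"] == "20-29" and c["income"] == "60-90K"


matches = [c for c in customers if antecedent(c)]
both = [c for c in matches if c["buys_tv"]]

# Support: fraction of all records that satisfy the whole rule.
support = len(both) / len(customers)
# Confidence: fraction of antecedent matches that also buy a TV.
confidence = len(both) / len(matches)

print(support, confidence)
```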
17. Data Mining Tasks
Classification and Prediction
− Finding models (functions) that describe and distinguish classes or concepts for future prediction
− Example: classify countries based on climate, or classify cars based on gas mileage
− Presentation: IF-THEN rules, decision-tree, classification rule, neural network
− Prediction: Predict some unknown or missing numerical values
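A minimal sketch of the gas-mileage example: learn a single IF-THEN rule (a one-level decision tree) from labeled cars. The training data and the class names "low" and "high" are hypothetical, and the threshold search is one simple way to fit such a rule, not a method prescribed by the slides.

```python
# Hypothetical training data: (miles per gallon, class label).
training = [(15, "low"), (18, "low"), (22, "low"),
            (30, "high"), (35, "high"), (41, "high")]


def learn_threshold(examples):
    """Pick the midpoint between adjacent mpg values with the fewest errors."""
    values = sorted(mpg for mpg, _ in examples)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        # Count examples the rule "mpg > t means high" gets wrong.
        return sum(("high" if mpg > t else "low") != label
                   for mpg, label in examples)

    return min(candidates, key=errors)


threshold = learn_threshold(training)


def classify(mpg):
    # IF mpg > threshold THEN "high" ELSE "low"
    return "high" if mpg > threshold else "low"


print(threshold, classify(28), classify(20))
```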
18. Data Mining Tasks
Cluster analysis
− Class label is unknown: Group data to form new classes
  Example: cluster houses to find distribution patterns
− Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
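The clustering principle can be sketched with k-means on one numeric attribute. k-means is one common algorithm and is not named on the slide; the house prices and the starting centers are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on numbers: assign to nearest center, then recompute."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical house prices (in thousands) and hand-picked starting centers.
prices = [100, 110, 120, 300, 310, 330]
centers, clusters = kmeans_1d(prices, centers=[100, 330])

print(centers)
print(clusters)
```

Prices inside each group are close to one another (high intra-class similarity) and far from the other group (low interclass similarity).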
19. Data Mining Tasks
Outlier analysis
− Outlier: a data object that does not comply with the general behavior of the data
− Mostly considered as noise or exception, but is quite useful in fraud detection, rare events analysis
Trend and evolution analysis
− Trend and deviation: regression analysis
− Sequential pattern mining, periodicity analysis
21. What is Data?
Collection of data objects and
their attributes
An attribute is a property or
characteristic of an object
− Examples: eye color of a
person, temperature, etc.
− Attribute is also known as
variable, field, characteristic,
or feature
A collection of attributes
describe an object
− Object is also known as
record, point, case, sample,
entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Columns are attributes; rows are objects.)
22. Attribute Values
Attribute values are numbers or symbols assigned to an
attribute
Distinction between attributes and attribute values
− Same attribute can be mapped to different attribute values
Example: height can be measured in feet or meters
− Different attributes can be mapped to the same set of
values
Example: Attribute values for ID and age are integers
But properties of attribute values can be different
− ID has no limit but age has a maximum and
minimum value
23. Types of Attribute
There are different types of attributes
– Nominal
Examples: ID numbers, eye color, zip codes
– Ordinal
Examples: rankings (e.g., taste of potato chips on
a scale from 1-10), grades, height in {tall, medium,
short}
– Interval
Examples: calendar dates, temperatures in Celsius
or Fahrenheit.
– Ratio
Examples: temperature in Kelvin, length, time,
counts
24. Properties of Attribute Values
The type of an attribute depends on which of
the following properties it possesses:
− Distinctness: = ≠
− Order: < >
− Addition: + -
− Multiplication: * /
− Nominal attribute: distinctness
− Ordinal attribute: distinctness & order
− Interval attribute: distinctness, order & addition
− Ratio attribute: all 4 properties
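The hierarchy above can be written down as a small lookup table. This is only an illustrative sketch; the names `PROPERTIES`, `ATTRIBUTE_TYPES`, and `supports` are made up for this example.

```python
# The four properties, in the order the slide lists them.
PROPERTIES = ("distinctness", "order", "addition", "multiplication")

# Each attribute type possesses a prefix of the property list:
# nominal has the first one, ratio has all four.
ATTRIBUTE_TYPES = {
    "nominal": PROPERTIES[:1],
    "ordinal": PROPERTIES[:2],
    "interval": PROPERTIES[:3],
    "ratio": PROPERTIES[:4],
}


def supports(attribute_type: str, prop: str) -> bool:
    """Return True if the given attribute type possesses the property."""
    return prop in ATTRIBUTE_TYPES[attribute_type]
```

For example, `supports("interval", "multiplication")` is `False`: 20 degrees Celsius is not "twice as warm" as 10 degrees Celsius, whereas ratios of Kelvin temperatures are meaningful.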
25. Attribute Type: Description, Examples, Operations
Nominal
− Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
− Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
− Operations: mode, entropy, contingency correlation, χ² test
Ordinal
− Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
− Examples: hardness of minerals, {good, better, best}, grades, street numbers
− Operations: median, percentiles, rank correlation, run tests, sign tests
Interval
− Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
− Examples: calendar dates, temperature in Celsius or Fahrenheit
− Operations: mean, standard deviation, Pearson's correlation, t and F tests
Ratio
− Description: For ratio variables, both differences and ratios are meaningful. (*, /)
− Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
− Operations: geometric mean, harmonic mean, percent variation
26. Discrete and Continuous Attributes
Discrete Attribute
− Has only a finite or countably infinite set of values
− Examples: zip codes, counts, or the set of words in a collection
of documents
− Often represented as integer variables.
− Note: binary attributes are a special case of discrete attributes
Continuous Attribute
− Has real numbers as attribute values
− Examples: temperature, height, or weight.
− Practically, real values can only be measured and represented
using a finite number of digits.
− Continuous attributes are typically represented as floating-point
variables.
28. Types of Record Sets
Record
− Data Matrix
− Document Data
− Transaction Data
Graph
− World Wide Web
− Molecular Structures
Ordered
− Spatial Data
− Temporal Data
− Sequential Data
− Genetic Sequence Data
29. Record Sets
Data that consists of a collection of
records, each of which consists of a fixed
set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
30. Data Matrix
If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
Such data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
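A minimal sketch of the m by n representation, using plain Python lists. The attribute names and the values are hypothetical.

```python
# Hypothetical numeric attributes shared by every object (the n columns).
attributes = ["height_cm", "weight_kg", "age_years"]

# One row per object (the m rows); each row is a point in 3-dimensional space.
data_matrix = [
    [170.0, 65.0, 30.0],
    [182.0, 80.0, 45.0],
    [158.0, 52.0, 27.0],
    [175.0, 72.0, 38.0],
]

m = len(data_matrix)  # number of objects
n = len(attributes)   # number of attributes

# Every object must have the same fixed set of attributes.
assert all(len(row) == n for row in data_matrix)


def column(matrix, j):
    """Return all values of attribute j, one per object."""
    return [row[j] for row in matrix]


weights = column(data_matrix, attributes.index("weight_kg"))
```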
31. Document Data
Each document becomes a `term' vector,
− each term is a component (attribute) of the
vector,
− the value of each component is the number of
times the corresponding term occurs in the
document.
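The term-vector idea can be sketched in a few lines. The two documents below are made up for illustration, and tokenization is assumed to be a simple whitespace split.

```python
from collections import Counter

# Hypothetical documents; real text would need proper tokenization.
documents = [
    "the team won the game",
    "the coach praised the team",
]

# The vocabulary fixes the order of the vector components (one per term).
vocabulary = sorted({term for doc in documents for term in doc.split()})


def term_vector(doc, vocab):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]


vectors = [term_vector(doc, vocabulary) for doc in documents]
```

Here `vocabulary` is `['coach', 'game', 'praised', 'team', 'the', 'won']`, so the first document becomes `[0, 1, 0, 1, 2, 1]`.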
32. Transaction Data
A special type of record data, where
− each record (transaction) involves a set of
items.
− For example, consider a grocery store. The
set of products purchased by a customer
during one shopping trip constitute a
transaction, while the individual products that
were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
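The table above maps naturally onto a dictionary of sets. The sketch below, with the made-up helper name `count_containing`, counts how many transactions contain a given set of items, which is the basic operation behind association analysis.

```python
# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


diaper_and_milk = count_containing({"Diaper", "Milk"}, transactions)
beer_and_diaper = count_containing({"Beer", "Diaper"}, transactions)
```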
38. Data Quality
What kinds of data quality problems?
How can we detect problems with the
data?
What can we do about these problems?
Examples of data quality problems:
− Noise and outliers
− missing values
− duplicate data
39. Noise
Noise refers to modification of original
values
− Examples: distortion of a person’s voice when
talking on a poor phone and “snow” on
television screen
[Figure panels: Two Sine Waves; Two Sine Waves + Noise]
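The sine-wave illustration can be reproduced roughly as follows. The frequencies (1 Hz and 3 Hz), the Gaussian noise model, and the noise level are assumptions made for this sketch; the slide does not specify them.

```python
import math
import random


def two_sine_waves(t):
    """Sum of two sine waves (assumed frequencies: 1 Hz and 3 Hz)."""
    return math.sin(2 * math.pi * 1.0 * t) + math.sin(2 * math.pi * 3.0 * t)


def add_noise(values, sigma, seed=0):
    """Modify each original value by adding zero-mean Gaussian noise."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    return [v + rng.gauss(0.0, sigma) for v in values]


times = [i / 100 for i in range(100)]       # one second sampled at 100 Hz
clean = [two_sine_waves(t) for t in times]  # original values
noisy = add_noise(clean, sigma=0.3)         # original values + noise
```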
40. Outliers
Outliers are data objects with
characteristics that are considerably
different than most of the other data
objects in the data set
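One simple way to flag such objects is a z-score test, sketched below on the Taxable Income column (in thousands) from the example table earlier in the deck. The z-score method and the threshold of 2 standard deviations are assumptions chosen for illustration; the slides do not prescribe a detection method.

```python
from statistics import mean, pstdev

# Taxable Income values (in K) from the ten example records.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]


def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mu) / sigma > threshold]


outliers = z_score_outliers(incomes)
```

With these values the mean is 104 and only the 220K record is flagged.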
41. Missing Values
Reasons for missing values
− Information is not collected
(e.g., people decline to give their age and
weight)
− Attributes may not be applicable to all cases
(e.g., annual income is not applicable to
children)
Handling missing values
− Eliminate Data Objects
− Estimate Missing Values
− Ignore the Missing Value During Analysis
− Replace with all possible values (weighted by
their probabilities)
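The first three handling strategies can be sketched as follows. The records are hypothetical, `None` marks a missing value, and estimating with the mean of the observed values is just one possible estimator.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 40, "weight": None},
    {"age": 31, "weight": 64.0},
]


def eliminate(records):
    """Eliminate data objects that have any missing value."""
    return [r for r in records if None not in r.values()]


def estimate(records, field):
    """Estimate missing values of `field` with the mean of the observed ones."""
    fill = mean([r[field] for r in records if r[field] is not None])
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]


def mean_ignoring_missing(records, field):
    """Ignore missing values during analysis: average only the observed ones."""
    return mean([r[field] for r in records if r[field] is not None])


complete = eliminate(records)
filled = estimate(records, "age")
avg_weight = mean_ignoring_missing(records, "weight")
```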
42. Duplicate Data
Data set may include data objects that are
duplicates, or almost duplicates of one
another
− Major issue when merging data from
heterogeneous sources
Examples:
− Same person with multiple email addresses
Data cleaning
− Process of dealing with duplicate data issues
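A minimal sketch of merging near-duplicate records for the "same person with multiple email addresses" case. It assumes, for illustration only, that a case- and whitespace-normalized name identifies a person; real deduplication usually needs stronger matching keys. The records are made up.

```python
# Hypothetical records merged from two sources.
records = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada  lovelace", "email": "a.lovelace@example.org"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]


def normalize(name):
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def merge_duplicates(records):
    """Group records by normalized name and collect their distinct emails."""
    merged = {}
    for r in records:
        key = normalize(r["name"])
        person = merged.setdefault(key, {"name": key, "emails": []})
        if r["email"] not in person["emails"]:
            person["emails"].append(r["email"])
    return list(merged.values())


people = merge_duplicates(records)
```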
43. Six Rules of Data Quality
1. Data that is not used cannot be correct for very long
2. Data Quality in an information system is a function of
its use, not its collection
3. Data quality will ultimately be no better than its most
stringent use
4. Data quality problems tend to become worse with
the age of the system
5. The less likely it is that some data element will change,
the more traumatic it will be when it finally does change
6. Information overload affects data quality
44. Takeaways
What is Data Mining?
Challenges and Trends
The origin of DM
Data Mining Tasks
Data Types
Data Quality