From Data to Wisdom
• Data
  - The raw material of information
• Information
  - Data organized and presented by someone
• Knowledge
  - Information read, heard, or seen, then understood and integrated
• Wisdom
  - Distilled knowledge and understanding that can lead to decisions
Figure: The Information Hierarchy (top to bottom: Wisdom, Knowledge, Information, Data)
Why Data Mining?
The explosive growth of data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, images, video, documents, Internet, …
Source: Intel
How much data?
• Google: ~20-30 PB a day
• Wayback Machine: ~4 PB + 100-200 TB/month
• Facebook: ~3 PB of user data + 25 TB/day
• eBay: ~7 PB of user data + 50 TB/day
• CERN’s Large Hadron Collider generates 15 PB a year
• In 2010, enterprises stored 7 exabytes = 7,000,000,000 GB
“640K ought to be enough for anybody.”
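The figures above mix terabytes, petabytes, and exabytes. As a quick sanity check, the following minimal sketch converts between them. It assumes decimal (SI) units, which the slide does not state explicitly, and the helper name `convert` is made up for illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
# Assumption: the slide means decimal units (1 GB = 10**9 bytes).
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18}

def convert(value, src, dst):
    """Convert a quantity from unit `src` to unit `dst`."""
    return value * UNITS[src] / UNITS[dst]

# 7 exabytes expressed in gigabytes (the slide's enterprise-storage figure).
print(f"{convert(7, 'EB', 'GB'):,.0f} GB")

# 25 TB/day (the Facebook figure) accumulated over 365 days, in petabytes.
print(f"{convert(25 * 365, 'TB', 'PB'):.3f} PB")
```

Under the decimal-unit assumption, the first line reproduces the 7,000,000,000 GB quoted on the slide.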
Big Data Growing
The Untapped Data Gap: most of the useful data will not be tagged or analyzed, partly due to a skill shortage.
IDC predicts: from 2005 to 2020, the digital universe will double every 2 years and grow from 130 exabytes to 40,000 exabytes, or 5,200 GB per person in 2020.
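The IDC numbers can be cross-checked with a little arithmetic. The sketch below derives the overall growth factor, the doubling time it implies, and the population implied by the per-person figure. It uses only the numbers quoted above and assumes decimal units (1 EB = 10^18 bytes, 1 GB = 10^9 bytes).

```python
import math

start_eb, end_eb = 130, 40_000   # exabytes, as quoted from IDC above
years = 2020 - 2005

growth_factor = end_eb / start_eb        # overall growth over the period
doublings = math.log2(growth_factor)     # number of doublings this implies
doubling_time = years / doublings        # years per doubling

# Population implied by "5,200 GB per person" (decimal units assumed).
implied_population = end_eb * 10**18 / (5_200 * 10**9)

print(f"growth factor: about {growth_factor:.0f}x")
print(f"implied doubling time: {doubling_time:.2f} years")
print(f"implied population: {implied_population / 1e9:.2f} billion")
```

The implied doubling time comes out a little under two years, so "double every 2 years" is best read as a rounded figure.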
What Is Data Mining?
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining, the automated analysis of massive data sets
Data Mining: A Definition
The non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data in large data repositories
• Non-trivial: obvious knowledge is not useful
• Implicit: hidden, difficult-to-observe knowledge
• Previously unknown
• Potentially useful: actionable; easy to understand
Data Mining: Confluence of Multiple Disciplines
Figure: Data Mining at the center, drawing on Machine Learning, Statistics, Applications, Algorithms, Pattern Recognition, High-Performance Computing, Visualization, and Database Technology
Data Mining’s Virtuous Cycle
1. Identifying the problem
2. Mining data to transform it into actionable information
3. Acting on the information
4. Measuring the results
The Knowledge Discovery Process
• Data Mining vs. Knowledge Discovery in Databases (KDD)
  - DM and KDD are often used interchangeably
  - strictly speaking, DM is only one part of the KDD process
- The KDD Process
Types of Knowledge Discovery
• Two kinds of knowledge discovery: directed and undirected
• Directed Knowledge Discovery
  - Purpose: explain the value of some field in terms of all the others (goal-oriented)
  - Method: select the target field based on some hypothesis about the data; ask the algorithm to tell us how to predict or classify new instances
  - Examples:
    which products show increased sales when cream cheese is discounted
    which banner ad to use on a web page for a given user coming to the site
• Undirected Knowledge Discovery
  - Purpose: find patterns in the data that may be interesting (no target field)
  - Method: clustering, affinity grouping
  - Examples:
    which products in the catalog often sell together
    market segmentation (find groups of customers/users with similar characteristics or behavioral patterns)
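The contrast between the two kinds of discovery can be sketched in a few lines. This is only an illustration: the records, the field names, and the two helpers (`fit_majority_rule`, `two_means`) are all invented. The directed half uses a simple majority-vote rule, and the undirected half a two-cluster k-means on one numeric field. Neither technique is prescribed by the slides.

```python
from collections import Counter, defaultdict

# Toy records; every field name and value here is invented for illustration.
customers = [
    {"age_group": "young", "spend": 20, "clicked_ad": "yes"},
    {"age_group": "young", "spend": 25, "clicked_ad": "yes"},
    {"age_group": "young", "spend": 30, "clicked_ad": "no"},
    {"age_group": "senior", "spend": 210, "clicked_ad": "no"},
    {"age_group": "senior", "spend": 190, "clicked_ad": "no"},
    {"age_group": "senior", "spend": 200, "clicked_ad": "yes"},
]

# Directed: a target field ("clicked_ad") is chosen up front, and we learn
# how to predict it from another field by majority vote per feature value.
def fit_majority_rule(rows, feature, target):
    votes = defaultdict(Counter)
    for row in rows:
        votes[row[feature]][row[target]] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}

rule = fit_majority_rule(customers, "age_group", "clicked_ad")
print(rule)

# Undirected: no target field; we only look for groups of similar records.
# Here: two-cluster k-means on the single numeric field "spend".
def two_means(values, iterations=10):
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the old center if a group happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

centers, groups = two_means([c["spend"] for c in customers])
print(centers, groups)
```

The directed half needs a labeled target to learn from. The undirected half discovers the low-spend and high-spend segments without being told that such segments exist.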
From Data Mining to Data Science
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  - Relational databases, data warehouses, transactional databases
  - Object-relational databases, heterogeneous databases, and legacy databases
• Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structured data, graphs, social networks, and information networks
  - Spatial data and spatiotemporal data
  - Multimedia databases
  - Text databases
  - The World Wide Web
Data Mining: What Kind of Data?
• Structured Databases
  - relational, object-relational, etc.
  - can use SQL to perform parts of the process
    e.g., SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category
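To see the slide's query run end to end, here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database. The `Items` table and its `type` and `category` columns come from the slide. The `name` column and the sample rows are invented. The query is written with a quoted string literal and the grouping column in the SELECT list, and an `ORDER BY` is added only to make the output order deterministic.

```python
import sqlite3

# In-memory database with a hypothetical Items table (sample rows invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (name TEXT, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [
        ("Clip A", "video", "sports"),
        ("Clip B", "video", "sports"),
        ("Clip C", "video", "news"),
        ("Song D", "audio", "music"),
    ],
)

# Count video items per category.
rows = conn.execute(
    "SELECT category, COUNT(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()
print(rows)
conn.close()
```

Aggregations like this cover the summarizing and selection steps of the process. The pattern discovery itself usually happens outside SQL.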
Data Mining: What Kind of Data?
• Flat Files
  - most common data source
  - can be text (or HTML) or binary
  - may contain transactions, statistical data, measurements, etc.
• Transactional databases
  - set of records, each with a transaction id, time stamp, and a set of items
  - may have an associated “description” file for the items
  - typical source of data used in market basket analysis
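The transactional record layout described above (transaction id, time stamp, set of items) maps directly onto the basic quantities of market basket analysis. The sketch below computes support and confidence for single items and item pairs. The transactions are invented, and the helper names `support` and `confidence` are illustrative, not taken from any particular library.

```python
from collections import Counter
from itertools import combinations

# Each transaction: (transaction id, time stamp, set of items). Invented data.
transactions = [
    (1, "2024-01-05T09:12", {"bread", "milk"}),
    (2, "2024-01-05T09:40", {"bread", "butter", "milk"}),
    (3, "2024-01-05T10:02", {"beer", "chips"}),
    (4, "2024-01-05T10:15", {"bread", "butter"}),
]

item_counts = Counter()
pair_counts = Counter()
for _tid, _timestamp, items in transactions:
    item_counts.update(items)
    # Sort so that each pair is always stored under the same key.
    pair_counts.update(combinations(sorted(items), 2))

n = len(transactions)

def support(*items):
    """Fraction of transactions containing all given items (one or two)."""
    key = tuple(sorted(items))
    count = item_counts[key[0]] if len(key) == 1 else pair_counts[key]
    return count / n

def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent, consequent) / support(antecedent)

print(support("bread", "milk"))
print(confidence("butter", "bread"))
```

This sketch handles only single items and pairs. Algorithms for larger itemsets, such as Apriori, build on the same counts.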
Data Mining: What Kind of Data?
• Other Types of Databases
  - legacy databases
  - multimedia databases (usually very high-dimensional)
  - spatial databases (containing geographical information, such as maps or satellite imaging data)
  - time-series and temporal data (time-dependent information such as stock market data; usually very dynamic)
• World Wide Web
  - basically a large, heterogeneous, distributed database
  - need for new or additional tools and techniques:
    information retrieval, filtering, and extraction
    agents to assist in browsing and filtering
    Web content, usage, and structure (linkage) mining tools
• The “social Web”
  - user-generated metadata, social networks, shared resources, etc.
What Can Data Mining Do?
• Many data mining tasks
  - often inter-related
  - often need to try different techniques/algorithms for each task
  - each task may require different types of knowledge discovery
• What are some data mining tasks?
  - Classification
  - Prediction
  - Clustering
  - Affinity Grouping / Association discovery
  - Sequence Analysis
  - Characterization
  - Discrimination
Some Applications of Data Mining
• Business data analysis and decision support
  - Marketing focalization
  - Recognizing specific market segments that respond to particular characteristics
  - Return on mailing campaign (target marketing)
  - Customer profiling
  - Segmentation of customers for marketing strategies and/or product offerings
  - Customer behavior understanding
  - Customer retention and loyalty
  - Mass customization / personalization
Some Applications of Data Mining
• Business data analysis and decision support (cont.)
  - Market analysis and management
  - Provide summary information for decision-making
  - Market basket analysis, cross-selling, market segmentation
  - Resource planning
  - Risk analysis and management
  - "What if" analysis
  - Forecasting
  - Pricing analysis, competitive analysis
  - Time-series analysis (e.g., stock market)
Some Applications of Data Mining
• Fraud detection
  - Detecting telephone fraud:
    Telephone call model: destination of the call, duration, time of day or week
    Analyze patterns that deviate from an expected norm
    British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud scheme
  - Detection of credit-card fraud
  - Detecting suspicious money transactions (money laundering)
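"Patterns that deviate from an expected norm" can be made concrete with a very simple statistical check. The sketch below flags calls whose duration lies far from a subscriber's historical mean, measured in standard deviations (a z-score). It is one illustrative approach under invented data and an arbitrary threshold. It models only call duration, and it is not the method British Telecom used.

```python
from statistics import mean, stdev

# Hypothetical call durations in minutes for one subscriber (invented data).
history = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 3.0]
new_calls = [3.8, 42.0, 2.2]

# The "expected norm": mean and sample standard deviation of past calls.
mu, sigma = mean(history), stdev(history)

def is_anomalous(duration, threshold=3.0):
    """Flag a call whose duration is more than `threshold` standard
    deviations away from the subscriber's historical mean."""
    return abs(duration - mu) / sigma > threshold

flags = [is_anomalous(d) for d in new_calls]
print(flags)
```

A fuller call model would also include the destination and the time of day or week, as the slide lists, and would likely use a more robust notion of "normal" than a single mean.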
• Text mining:
  - Message filtering (e-mail, newsgroups, etc.)
  - Newspaper article analysis
  - Text and document categorization
• Web Mining
  - Mining patterns from the content, usage, and structure of Web resources
Types of Web Mining
Figure: Web Mining branches into Web Content Mining, Web Structure Mining, and Web Usage Mining
Web Content Mining applications:
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools
Web Usage Mining applications:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• recommender systems
Web Structure Mining applications:
• document retrieval and ranking (e.g., Google)
• discovery of “hubs” and “authorities”
• discovery of Web communities
• social network analysis
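The “hubs” and “authorities” mentioned above are commonly computed with the HITS iteration. The slides name the concept but not an algorithm, so treating it as HITS is an assumption. The sketch below runs that iteration on a tiny invented link graph: a page is a good authority if good hubs link to it, and a good hub if it links to good authorities.

```python
import math

# Tiny invented link graph: page -> pages it links to.
links = {
    "A": ["C", "D"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

def hits(graph, iterations=50):
    """Return (hub, authority) scores after a fixed number of HITS updates."""
    pages = list(graph)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
print("top authority:", best_authority, "| top hub:", best_hub)
```

In this graph, page D receives links from every other page and ends up as the top authority, while page A, which links to both C and D, is the top hub.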
The Knowledge Discovery Process
- The KDD Process
• Next: we first focus on understanding the data and on data preparation/transformation

Data mining and knowledge discovery

  • 1. From Data to Wisdom. Data: the raw material of information. Information: data organized and presented by someone. Knowledge: information read, heard, or seen, then understood and integrated. Wisdom: distilled knowledge and understanding that can lead to decisions. The Information Hierarchy (top to bottom): Wisdom, Knowledge, Information, Data.
  • 2. Why Data Mining? The explosive growth of data: from terabytes to petabytes. Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society. Major sources of abundant data: business (Web, e-commerce, transactions, stocks, …); science (remote sensing, bioinformatics, scientific simulation, …); society and everyone (news, images, video, documents, the Internet, …).
  • 4. How much data? Google: ~20-30 PB a day. Wayback Machine: ~4 PB + 100-200 TB/month. Facebook: ~3 PB of user data + 25 TB/day. eBay: ~7 PB of user data + 50 TB/day. CERN’s Large Hadron Collider generates 15 PB a year. In 2010, enterprises stored 7 exabytes = 7,000,000,000 GB. “640K ought to be enough for anybody.”
  • 5. Big Data Growing. The untapped data gap: most of the useful data will not be tagged or analyzed, partly due to a skills shortage. IDC predicts that from 2005 to 2020 the digital universe will double every 2 years, growing from 130 exabytes to 40,000 exabytes, or 5,200 GB per person in 2020.
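As a rough consistency check on the IDC figures quoted above, the following sketch works out how many doublings the growth from 130 to 40,000 exabytes implies, and what population the per-person figure assumes. Only the numbers come from the slide; the variable names are illustrative.

```python
import math

# Figures quoted on the slide (IDC forecast).
start_eb, end_eb = 130, 40_000        # exabytes in 2005 and in 2020
years = 2020 - 2005
gb_per_person = 5_200

growth_factor = end_eb / start_eb     # overall growth over the period
doublings = math.log2(growth_factor)  # number of doublings that growth implies
doubling_time = years / doublings     # implied years per doubling

# 1 EB = 10**9 GB, so the per-person figure implies a population size.
implied_population = end_eb * 10**9 / gb_per_person

print(f"growth factor: {growth_factor:.0f}x")
print(f"doublings: {doublings:.2f}")
print(f"doubling time: {doubling_time:.2f} years")
print(f"implied population: {implied_population:.2e}")
```

The implied doubling time comes out slightly under two years, so "double every 2 years" reads best as a rounded figure.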
  • 6. What Is Data Mining? We are drowning in data, but starving for knowledge! “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets. Data Mining: A Definition. The non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data in large data repositories. Non-trivial: obvious knowledge is not useful. Implicit: hidden, difficult-to-observe knowledge. Previously unknown. Potentially useful: actionable; easy to understand.
  • 7. Data Mining: Confluence of Multiple Disciplines. Data mining draws on machine learning, statistics, applications, algorithms, pattern recognition, high-performance computing, visualization, and database technology.
  • 8. Data Mining’s Virtuous Cycle: 1. Identifying the problem. 2. Mining data to transform it into actionable information. 3. Acting on the information. 4. Measuring the results.
  • 9. The Knowledge Discovery Process. Data Mining vs. Knowledge Discovery in Databases (KDD): DM and KDD are often used interchangeably; actually, DM is only one part of the KDD process. [Figure: The KDD Process]
  • 10. Types of Knowledge Discovery. Two kinds of knowledge discovery: directed and undirected. Directed Knowledge Discovery. Purpose: explain the value of some field in terms of all the others (goal-oriented). Method: select the target field based on some hypothesis about the data; ask the algorithm to tell us how to predict or classify new instances. Examples: which products show increased sales when cream cheese is discounted; which banner ad to use on a web page for a given user coming to the site. Undirected Knowledge Discovery. Purpose: find patterns in the data that may be interesting (no target field). Method: clustering, affinity grouping. Examples: which products in the catalog often sell together; market segmentation (finding groups of customers/users with similar characteristics or behavioral patterns).
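The distinction between directed and undirected discovery can be sketched on toy data. The customer records, the `responded` target field, and both helper functions below are hypothetical; the slides do not prescribe any particular algorithm. A nearest-neighbour classifier stands in for the directed case and a two-cluster k-means for the undirected case.

```python
from math import dist

# Hypothetical toy data: (visits per month, average basket value).
customers = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 90)]

# Directed: a target field exists (did the customer respond to an offer?).
responded = [False, False, False, True, True, True]

def predict_nearest(point):
    """Classify a new instance by its nearest labelled neighbour (1-NN)."""
    nearest = min(range(len(customers)), key=lambda i: dist(point, customers[i]))
    return responded[nearest]

# Undirected: no target field; look for groups (k-means with k = 2).
def two_means(points, iterations=10):
    centers = [points[0], points[-1]]  # simple deterministic starting centers
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            closer = 0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
            groups[closer].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return groups

print(predict_nearest((9, 70)))  # uses the target field
print(two_means(customers))      # uses only the attributes
```

The first call needs the labelled target to make a prediction; the second finds the two customer segments without being told what to look for.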
  • 11. From Data Mining to Data Science
  • 12. Data Mining: On What Kinds of Data? Database-oriented data sets and applications: relational databases, data warehouses, transactional databases; object-relational databases, heterogeneous databases, and legacy databases. Advanced data sets and advanced applications: data streams and sensor data; time-series data, temporal data, sequence data (incl. bio-sequences); structure data, graphs, social networks, and information networks; spatial data and spatiotemporal data; multimedia databases; text databases; the World-Wide Web.
  • 13. Data Mining: What Kind of Data? Structured databases: relational, object-relational, etc. SQL can be used to perform parts of the process, e.g., SELECT count(*) FROM Items WHERE type = 'video' GROUP BY category
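A minimal runnable version of the query above, using Python's built-in sqlite3 module. The table layout and sample rows are assumptions made for illustration, and `category` is added to the SELECT list so that each count is labelled with its group.

```python
import sqlite3

# In-memory database with a hypothetical Items table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (name TEXT, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [
        ("A", "video", "comedy"),
        ("B", "video", "comedy"),
        ("C", "video", "drama"),
        ("D", "book", "drama"),
    ],
)

# Count the video items in each category; 'video' is a quoted string literal.
rows = conn.execute(
    "SELECT category, count(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()
print(rows)
conn.close()
```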
  • 14. Data Mining: What Kind of Data? Flat files: the most common data source; can be text (or HTML) or binary; may contain transactions, statistical data, measurements, etc. Transactional databases: a set of records, each with a transaction ID, a time stamp, and a set of items; may have an associated “description” file for the items; the typical source of data used in market basket analysis.
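To illustrate how transactional records feed market basket analysis, here is a small sketch that counts how often pairs of items appear in the same transaction. The transactions are invented, time stamps are omitted for brevity, and pair support is only one of several measures such an analysis might use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: transaction ID -> set of items bought together.
transactions = {
    1: {"bread", "milk"},
    2: {"bread", "butter", "milk"},
    3: {"butter", "milk"},
    4: {"bread", "butter"},
    5: {"bread", "eggs", "milk"},
}

# Count every unordered pair of items that occurs within a transaction.
pair_counts = Counter()
for items in transactions.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Support: the fraction of all transactions that contain the pair.
n = len(transactions)
support = {pair: count / n for pair, count in pair_counts.items()}

for pair, count in pair_counts.most_common(3):
    print(pair, count, f"{support[pair]:.2f}")
```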
  • 15. Data Mining: What Kind of Data? Other types of databases: legacy databases; multimedia databases (usually very high-dimensional); spatial databases (containing geographical information, such as maps or satellite imaging data); time-series and temporal data (time-dependent information such as stock market data; usually very dynamic). World Wide Web: basically a large, heterogeneous, distributed database; needs new or additional tools and techniques: information retrieval, filtering, and extraction; agents to assist in browsing and filtering; Web content, usage, and structure (linkage) mining tools. The “social Web”: user-generated metadata, social networks, shared resources, etc.
  • 16. What Can Data Mining Do? Many data mining tasks: often inter-related; often need to try different techniques/algorithms for each task; each task may require different types of knowledge discovery. What are some data mining tasks? Classification; prediction; clustering; affinity grouping / association discovery; sequence analysis; characterization; discrimination.
  • 17. Some Applications of Data Mining. Business data analysis and decision support. Marketing focalization: recognizing specific market segments that respond to particular characteristics; return on mailing campaigns (target marketing). Customer profiling: segmentation of customers for marketing strategies and/or product offerings; customer behavior understanding; customer retention and loyalty; mass customization / personalization.
  • 18. Some Applications of Data Mining. Business data analysis and decision support (cont.). Market analysis and management: provide summary information for decision-making; market basket analysis, cross-selling, market segmentation; resource planning. Risk analysis and management: "what if" analysis; forecasting; pricing analysis, competitive analysis; time-series analysis (e.g., stock market).
  • 19. Some Applications of Data Mining. Fraud detection. Detecting telephone fraud: the telephone call model covers the destination of the call, its duration, and the time of day or week; analyze patterns that deviate from an expected norm. British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud scheme. Detection of credit-card fraud. Detecting suspicious money transactions (money laundering). Text mining: message filtering (e-mail, newsgroups, etc.); newspaper article analysis; text and document categorization. Web mining: mining patterns from the content, usage, and structure of Web resources.
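One simple way to "analyze patterns that deviate from an expected norm" is a z-score test on a single attribute of the call model. The sketch below applies it to call duration; the data, the threshold of three standard deviations, and the function name are assumptions, and practical fraud detection would combine many attributes.

```python
from statistics import mean, stdev

# Hypothetical call durations (minutes) seen so far for one account.
history = [3, 5, 4, 6, 5, 4, 3, 5]
new_calls = [4, 45, 6]

# The "expected norm" is summarized by the mean and sample standard deviation.
mu, sigma = mean(history), stdev(history)

def is_unusual(duration, threshold=3.0):
    """Flag a call whose duration lies more than `threshold` deviations from the mean."""
    return abs(duration - mu) / sigma > threshold

flagged = [d for d in new_calls if is_unusual(d)]
print(flagged)
```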
  • 20. Types of Web Mining. Web mining divides into Web content mining, Web structure mining, and Web usage mining.
  • 21. Types of Web Mining. Applications of Web content mining: document clustering or categorization; topic identification / tracking; concept discovery; focused crawling; content-based personalization; intelligent search tools.
  • 22. Types of Web Mining. Applications of Web usage mining: user and customer behavior modeling; Web site optimization; e-customer relationship management; Web marketing; targeted advertising; recommender systems.
  • 23. Types of Web Mining. Applications of Web structure mining: document retrieval and ranking (e.g., Google); discovery of “hubs” and “authorities”; discovery of Web communities; social network analysis.
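The discovery of “hubs” and “authorities” is commonly associated with the HITS algorithm, which the slides do not name. The following is a minimal sketch of that iteration on an invented four-page link graph: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to.

```python
from math import sqrt

# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}

def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of all pages linking in.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: add up the authority scores of all pages linked to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits(links)
print("best authority:", max(auth, key=auth.get))
print("hub scores:", {p: round(v, 3) for p, v in hub.items()})
```

In this graph, page "d" receives the most links and ends up as the strongest authority, while "a" and "b", which both point to the well-linked pages, share the top hub score.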
  • 24. The Knowledge Discovery Process. [Figure: The KDD Process] Next: we first focus on understanding the data and on data preparation/transformation.