DATAMINING
Seval Ünver
E1900810 | CENG 553
Middle East Technical University
Computer Engineering Department
14.05.2013 CEN...
Outline
• Introduction
• Data vs. Information
• Who uses datamining?
• Common uses of datamining
• Datamining is…
• Superv...
Introduction
• Nowadays, large data sets have become available
due to advances in technology.
• As a result, there is an i...
What is Datamining?
• Process of semi-automatically analyzing large
databases to find patterns that are *:
– valid: hold o...
Big data: Cash Register
• Past: It was a
calculator.
• Now: It saves every
detail of every
action.
– The movements of
each...
Data vs. Information
• Data is useless by itself.
• Data is not just numbers
or letters. It consists of
numbers, letters a...
Who uses Datamining?
• CapitalOne Bank
– future prediction
• Netflix (the largest DVD-by-mail rental company)
– Recommenda...
Common uses of Datamining:
• Direct mail marketing
• Web site personalization
• Credit card fraud detection
• Gas & jewelr...
Application Areas
08.10.2013 Seval Ünver | CENG 553 9
Industry Application
Finance Credit Card Analysis
Insurance Claims, ...
Datamining is…
Datamining is not…
• Data warehousing
• SQL / Ad Hoc Queries / Reporting
• Software Agents
• Online Analytical Processing ...
Supervised vs. Unsupervised Learning
• Supervised:
– Problem solving
– Driven by real business problems and historical d...
Predictive Models
Datamining Process
Some Popular Data Mining Algorithms
Supervised
— Regression models
— Decision trees
— k-Nearest-Neighbor
— Neural networks...
A very simple problem set
Regression Models
Regression Models
Decision Trees
A series of nested if/then rules.
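
As a minimal sketch of "nested if/then rules", the function below hard-codes a tiny tree for a loan-applicant example. The attribute names, thresholds and labels are invented for illustration; a real tree would be learned from historical data rather than written by hand.

```python
def classify_applicant(income, years_employed, has_prior_default):
    """Hand-written decision tree: each `if` is an internal node,
    each `return` is a leaf holding the predicted class label.
    All thresholds are illustrative assumptions, not learned values."""
    if has_prior_default:
        if income > 80_000:
            return "review"
        return "reject"
    if years_employed >= 2:
        return "approve"
    if income > 50_000:
        return "approve"
    return "review"


if __name__ == "__main__":
    print(classify_applicant(60_000, 1, False))
```

Because every path from the root to a leaf is a conjunction of simple tests, the same tree could also be written as a set of if-then-else rules or as SQL conditions.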
Decision Tree Models
K-Nearest Neighbor Algorithm
• Find nearest data point and do the same thing
as you did for that record.
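
The idea of "find the nearest record and do the same thing" could be sketched as below. The function name `knn_predict`, the Euclidean distance and the majority vote over `k` neighbours are assumptions made for this illustration; the slide itself only describes the single-nearest-neighbour case (`k=1`).

```python
import math
from collections import Counter


def knn_predict(records, query, k=1):
    """Predict a label for `query` from its k nearest labelled records.

    records: list of (feature_tuple, label) pairs.
    Distance is Euclidean here (an assumption; other metrics are possible).
    """
    nearest = sorted(records, key=lambda r: math.dist(r[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    history = [((1, 1), "a"), ((1, 2), "a"), ((5, 5), "b"), ((6, 5), "b")]
    print(knn_predict(history, (1.5, 1.5), k=3))
```

There is no training step: all the work happens at prediction time, when the query is compared against every stored record.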
K-Nearest Neighbor Models
Neural Networks
• Set of nodes connected by directed weighted edges.
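
To make "nodes connected by directed weighted edges" concrete, here is a small forward-pass sketch. The node names, the weights and the sigmoid activation are all assumptions chosen for illustration; training (adjusting the weights) is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Directed weighted edges: (source node, target node) -> weight.
# The values are made up for this sketch, not learned.
EDGES = {
    ("x1", "h1"): 0.5, ("x2", "h1"): -0.4,
    ("x1", "h2"): 0.3, ("x2", "h2"): 0.8,
    ("h1", "out"): 1.2, ("h2", "out"): -0.7,
}
# Non-input nodes, listed so that every node comes after its sources.
ORDER = ["h1", "h2", "out"]


def forward(inputs):
    """Propagate input values along the edges and return the output node's value."""
    values = dict(inputs)
    for node in ORDER:
        total = sum(w * values[src] for (src, dst), w in EDGES.items() if dst == node)
        values[node] = sigmoid(total)
    return values["out"]


if __name__ == "__main__":
    print(forward({"x1": 1.0, "x2": 0.0}))
```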
Neural Networks Models
Neural Networks Models
Pros and Cons of Neural Networks
· Pros
+ Can learn more complicated
class boundaries
+ Fast application
+ Can handle ...
Supervised Algorithm Summary
• Decision Trees
– Understandable
– Relatively fast
– Easy to translate into SQL queries
• kN...
K-Means Clustering
• User starts by specifying the number of clusters (K)
• K datapoints are randomly selected
• Repeat un...
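
The loop on this slide (choose K, pick K starting points at random, then alternate between partitioning the data and recomputing the K centroids until nothing changes) might be sketched as follows. The slide describes the partitioning step as generating hyperplanes that separate the K points; this sketch assigns each point to its nearest centroid, which amounts to the same partition. The function name, the `seed` and `max_iter` parameters and the Euclidean distance are assumptions of the sketch.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Returns (centroids, assignment), where assignment[i] is the cluster
    index of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K data points chosen at random
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # no change: stop
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(k_means(data, k=2))
```

The result depends on the random starting points, so in practice the procedure is often run several times with different seeds.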
Data Warehouse
A data warehouse is a database used for
reporting and data analysis.
Data Mining works with Warehouse Data
• Data Mining provides
the Enterprise with
inte...
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in th...
Example of Star Schema
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city
state_o...
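
A star schema like the one on this slide (a sales fact table keyed to dimension tables) could be tried out with Python's built-in `sqlite3` module. This is only a sketch: the `_dim` and `_fact` table-name suffixes, the sample rows and the reduced column set are assumptions, and the item and branch dimensions are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (subset of the columns shown on the slide).
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
    quarter INTEGER, year INTEGER
);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, country TEXT
);
-- Fact table: foreign keys to the dimensions plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);
-- Invented sample data.
INSERT INTO time_dim VALUES (1, 14, 5, 2, 2013), (2, 8, 10, 4, 2013);
INSERT INTO location_dim VALUES
    (1, 'Main St', 'Ankara', 'Turkey'), (2, 'High St', 'Boston', 'USA');
INSERT INTO sales_fact VALUES (1, 1, 3, 30.0), (2, 1, 2, 20.0), (2, 2, 5, 75.0);
""")

# A typical query joins the fact table to its dimensions and aggregates a measure.
rows = conn.execute("""
    SELECT l.city, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON l.location_key = f.location_key
    JOIN time_dim AS t ON t.time_key = f.time_key
    GROUP BY l.city, t.quarter
    ORDER BY l.city, t.quarter
""").fetchall()

if __name__ == "__main__":
    for row in rows:
        print(row)
```

Every dimension is one join away from the fact table, which is what gives the schema its star shape.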
Example of Snowflake Schema
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city_ke...
Example of Fact Constellation
time_key
day
day_of_the_week
month
quarter
year
time
location_key
street
city
...
Evolution of OLTP, OLAP and Data Warehouse
Time
Evolutionary Step Business Question Enabling Technology
Data Collection
(1960s)
"What was my total revenue in the last
fiv...
As a Result
• In order to apply data mining, a large amount of
quality data is required.
• The aim of datamining is acquir...
Thank You
If you have questions, you can contact me
via email: e1900810@ceng.metu.edu.tr
Seval Ünver | METU CENG
What is Datamining? Which algorithms can be used for Datamining?

This presentation covers what datamining is and which techniques and algorithms are available for it. It helps you understand the concepts of datamining.
  • The US Government uses data mining to track fraud. A supermarket becomes an information broker. Basketball teams use it to track game strategy. Other uses: cross-selling, target marketing, holding on to good customers, weeding out bad customers.
  • Regression: (linear or any other polynomial) a*x1 + b*x2 + c = Ci. Nearest neighbour. Decision tree classifier: divides the decision space into piecewise constant regions. Probabilistic/generative models. Neural networks: partition by non-linear boundaries.
  • Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels. Widely used learning method. Easy to interpret: can be re-represented as if-then-else rules. Approximates the function by piecewise constant regions. Does not require any prior knowledge of the data distribution; works well on noisy data. Has been applied to classify medical patients by disease, equipment malfunctions by cause, and loan applicants by likelihood of payment.
  • Pros: reasonable training time; fast application; easy to interpret; easy to implement; can handle a large number of features. Cons: cannot handle complicated relationships between features; simple decision boundaries; problems with lots of missing data.
  • Pros: fast training. Cons: slow during application; no feature selection; notion of proximity is vague.
  • Set of nodes connected by directed weighted edges. Useful for learning complex data like handwriting, speech and image recognition.
  • Pros: can learn more complicated class boundaries; fast application; can handle a large number of features. Cons: slow training time; hard to interpret; hard to implement (trial and error for choosing the number of nodes).
  • Data warehouse mining: assimilate data from operational sources; mine static data. Mining log data. Continuous mining: example in process control. Stages in mining: data selection, pre-processing (cleaning), transformation, mining, result evaluation, visualization.

    1. DATAMINING Seval Ünver E1900810 | CENG 553 Middle East Technical University Computer Engineering Department 14.05.2013 CENG 553 In Summary
    2. Outline • Introduction • Data vs. Information • Who uses datamining? • Common uses of datamining • Datamining is… • Supervised and Unsupervised Learning • Predictive Models • Datamining Process • Some Popular Datamining Algorithms • Data Warehouse • Conceptual Modelling of Data Warehouse • Example of Star Schema, Snowflake Schema, Fact Constellation • Evolution of OLTP, OLAP and Data Warehouse 08.10.2013 Seval Ünver | CENG 553 2
    3. Introduction • Nowadays, large data sets have become available due to advances in technology. • As a result, there is an increasing interest in various scientific communities to explore the use of emerging data mining techniques for the analysis of these large data sets *. • Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in data **. (* Grossman et al., 2001; ** Shmueli G, 2012)
    4. What is Datamining? • Process of semi-automatically analyzing large databases to find patterns that are *: – valid: hold on new data with some certainty – novel: non-obvious to the system – useful: should be possible to act on the item – understandable: humans should be able to interpret the pattern • Also known as Knowledge Discovery in Databases (* Prof. S. Sudarshan, CSE Dept, IIT Bombay)
    5. Big data: Cash Register • Past: It was a calculator. • Now: It saves every detail of every action. – The movements of each product. – The movements of each user.
    6. Data vs. Information • Data is useless by itself. • Data is not just numbers or letters. It consists of numbers, letters and their meaning. The meaning is called metadata. • Information is interpreted data. • Converting the data to information is called data processing.
    7. Who uses Datamining? • CapitalOne Bank – future prediction • Netflix (the largest DVD-by-mail rental company) – recommendation (you might also be interested in…) • Amazon.com – recommendation • British law enforcement – crime trends or security threats • Facebook – predicting how active a user will be after 3 months • Children's Hospital in Boston – detecting domestic abuse • Pandora (an Internet music radio) – chooses the next song to play
    8. Common uses of Datamining: • Direct mail marketing • Web site personalization • Credit card fraud detection • Gas & jewelry • Bioinformatics • Text analysis – SAS lie detector • Market basket analysis – Beer & baby diapers
    9. Application Areas (Industry: Application) • Finance: Credit card analysis • Insurance: Claims, fraud analysis • Telecommunication: Call record analysis • Transport: Logistics management • Consumer goods: Promotion analysis • Data service providers: Value-added data • Utilities: Power usage analysis
    10. Datamining is…
    11. Datamining is not… • Data warehousing • SQL / Ad Hoc Queries / Reporting • Software Agents • Online Analytical Processing (OLAP) • Data Visualization
    12. Supervised vs. Unsupervised Learning • Supervised: – Problem solving – Driven by real business problems and historical data – Quality of results dependent on quality of data • Unsupervised: – Exploration (aka clustering) – Relevance often an issue • Beer and baby diapers – Useful when trying to get an initial understanding of the data – Non-obvious patterns can sometimes pop out of a completed data analysis project
    13. Predictive Models
    14. Datamining Process
    15. Some Popular Data Mining Algorithms • Supervised: – Regression models – Decision trees – k-Nearest-Neighbor – Neural networks – Rule induction • Unsupervised: – K-means clustering – Self-organizing map
    16. A very simple problem set
    17. Regression Models
    18. Regression Models
    19. Decision Trees • A series of nested if/then rules.
    20. Decision Tree Models
    21. K-Nearest Neighbor Algorithm • Find the nearest data point and do the same thing as you did for that record.
    22. K-Nearest Neighbor Models
    23. Neural Networks • Set of nodes connected by directed weighted edges.
    24. Neural Networks Models
    25. Neural Networks Models
    26. Pros and Cons of Neural Networks • Pros: + Can learn more complicated class boundaries + Fast application + Can handle large number of features • Cons: - Slow training time - Hard to interpret - Hard to implement: trial and error for choosing number of nodes
    27. Supervised Algorithm Summary • Decision Trees – Understandable – Relatively fast – Easy to translate into SQL queries • kNN – Quick and easy – Models tend to be very large • Neural Networks – Difficult to interpret – Can require significant amounts of time to train
    28. K-Means Clustering • User starts by specifying the number of clusters (K) • K datapoints are randomly selected • Repeat until no change: – Hyperplanes separating K points are generated – K centroids of each cluster are computed
    29. Data Warehouse • A data warehouse is a database used for reporting and data analysis.
    30. Data Mining works with Warehouse Data • Data Mining provides the Enterprise with intelligence • Data Warehousing provides the Enterprise with a memory
    31. Conceptual Modeling of Data Warehouses • Modeling data warehouses: dimensions & measures – Star schema: A fact table in the middle connected to a set of dimension tables – Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake – Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
    32. Example of Star Schema • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales • time: time_key, day, day_of_the_week, month, quarter, year • item: item_key, item_name, brand, type, supplier_type • branch: branch_key, branch_name, branch_type • location: location_key, street, city, state_or_province, country
    33. Example of Snowflake Schema • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales • time: time_key, day, day_of_the_week, month, quarter, year • item: item_key, item_name, brand, type, supplier_key • supplier: supplier_key, supplier_type • branch: branch_key, branch_name, branch_type • location: location_key, street, city_key • city: city_key, city, state_or_province, country
    34. Example of Fact Constellation • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales • Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped • time: time_key, day, day_of_the_week, month, quarter, year • item: item_key, item_name, brand, type, supplier_type • branch: branch_key, branch_name, branch_type • location: location_key, street, city, province_or_state, country • shipper: shipper_key, shipper_name, location_key, shipper_type
    35. Evolution of OLTP, OLAP and Data Warehouse (figure; axis label: Time)
    36. Evolutionary Step: Business Question; Enabling Technology • Data Collection (1960s): "What was my total revenue in the last five years?"; computers, tapes, disks • Data Access (1980s): "What were unit sales in New England last March?"; faster and cheaper computers with more storage, relational databases • Data Warehousing and Decision Support: "What were unit sales in New England last March? Drill down to Boston."; faster and cheaper computers with more storage, on-line analytical processing (OLAP), multidimensional databases, data warehouses • Data Mining: "What's likely to happen to Boston unit sales next month? Why?"; faster and cheaper computers with more storage, advanced computer algorithms
    37. As a Result • In order to apply data mining, a large amount of quality data is required. • The aim of datamining is to acquire rules and equations that can be used to predict the future. • Success in such work depends on database experts and data mining specialists working together. • The work may take a long time; you need time and patience.
    38. Thank You • If you have questions, you can contact me via email: e1900810@ceng.metu.edu.tr Seval Ünver | METU CENG
