Data Mining Primitives,
  Languages and System
       Architecture

CSE 634-Datamining Concepts and Techniques
        Professor Anita Wasilewska

               Presented By
     Sushma Devendrappa - 105526184
       Swathi Kothapalli - 105531380
Sources/References
   Data Mining Concepts and Techniques –Jiawei Han and Micheline Kamber,
    2003

   Handbook of Data Mining and Discovery- Willi Klosgen and Jan M Zytkow,
    2002

   Lydia: A System for Large-Scale News Analysis - String Processing and
    Information Retrieval: 12th International Conference, SPIRE 2005,
    Buenos Aires, Argentina, November 2-4, 2005.

   Information Retrieval: Data Structures and Algorithms - W. Frakes and R.
    Baeza-Yates, 1992

   Geographical Information System -
    http://erg.usgs.gov/isb/pubs/gis_poster/
Content

   Data mining primitives
   Languages
   System architecture
   Application – Geographical information system (GIS)
   Paper - Lydia: A System for Large-Scale News Analysis
Introduction

   Motivation- need to extract useful information and knowledge from a large
    amount of data (data explosion problem)

   Data Mining tools perform data analysis and may uncover important data
    patterns, contributing greatly to business strategies, knowledge bases, and
    scientific and medical research.
What is Data Mining???

   Data mining refers to extracting or “mining” knowledge from large
    amounts of data. It is also referred to as Knowledge Discovery in Databases.

   It is a process of discovering interesting knowledge from large amounts of
    data stored either in databases, data warehouses, or other information
    repositories.
Architecture of a typical data mining system


                         Graphical user interface



                           Pattern evaluation
                                                                 Knowledge base


                           Data mining engine



                 Database or data warehouse server

    Data cleansing
   Data Integration                                  Filtering



                      Database           Data warehouse
   Misconception: Data mining systems can autonomously dig out all of the
    valuable knowledge from a given large database, without human
    intervention.

   If there were no user intervention, the system would uncover a large set
    of patterns that may even surpass the size of the database. Hence, user
    interaction is required.

   This user communication with the system is provided by using a set of data
    mining primitives.
Data Mining Primitives
    Data mining primitives define a data mining task, which can be specified in
    the form of a data mining query.

   Task Relevant Data

   Kinds of knowledge to be mined

   Background knowledge

   Interestingness measure

   Presentation and visualization of discovered patterns
Task relevant data

   Data portion to be investigated.

   Attributes of interest (relevant attributes) can be specified.

   Initial data relation

   Minable view
Example

   If a data mining task is to study associations between items frequently
    purchased at AllElectronics by customers in Canada, the task relevant data
    can be specified by providing the following information:
      Name of the database or data warehouse to be used (e.g., AllElectronics_db)
      Names of the tables or data cubes containing relevant data (e.g., item, customer,
       purchases and items_sold)
      Conditions for selecting the relevant data (e.g., retrieve data pertaining to
       purchases made in Canada for the current year)
      The relevant attributes or dimensions (e.g., name and price from the item table
       and income and age from the customer table)
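
The four pieces of information above map naturally onto a relational query. The following is a minimal sketch using Python's built-in sqlite3 module; the column names, the toy rows, and the choice to read "purchases made in Canada" as "customer country is Canada" are assumptions made for illustration, not part of the AllElectronics example.

```python
import sqlite3

# Hypothetical miniature of AllElectronics_db. Only the table names come from
# the example; all columns and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, income INTEGER, age INTEGER, country TEXT);
CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE purchases (trans_id INTEGER PRIMARY KEY, cust_id INTEGER, year INTEGER);
CREATE TABLE items_sold (trans_id INTEGER, item_id INTEGER);
INSERT INTO customer VALUES (1, 45000, 34, 'Canada'), (2, 30000, 22, 'USA');
INSERT INTO item VALUES (10, 'VCR', 150.0), (11, 'computer', 900.0);
INSERT INTO purchases VALUES (100, 1, 2005), (101, 2, 2005), (102, 1, 2004);
INSERT INTO items_sold VALUES (100, 10), (101, 11), (102, 11);
""")

CURRENT_YEAR = 2005  # assumed "current year" for the toy data

# Relevant attributes (name, price, income, age) restricted by the
# selection conditions: this result set plays the role of the minable view.
minable_view = conn.execute("""
    SELECT i.name, i.price, c.income, c.age
    FROM item i
    JOIN items_sold s ON s.item_id = i.item_id
    JOIN purchases p ON p.trans_id = s.trans_id
    JOIN customer c ON c.cust_id = p.cust_id
    WHERE c.country = 'Canada' AND p.year = ?
""", (CURRENT_YEAR,)).fetchall()

print(minable_view)
```

Only the rows that satisfy both conditions reach the mining step, which is the point of specifying task-relevant data up front.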
Kind of knowledge to be mined

   It is important to specify the knowledge to be mined, as this determines the
    data mining function to be performed.

   Kinds of knowledge include concept description, association, classification,
    prediction and clustering.

   Users can also provide pattern templates, also called metapatterns,
    metarules or metaqueries.
Example

A user studying the buying habits of AllElectronics customers may
choose to mine association rules of the form (a metarule):
P (X:customer,W) ^ Q (X,Y) => buys (X,Z)


Rules such as the following match this metarule; each is annotated with
[support, confidence]:
age (X, “30…39”) ^ income (X, “40K…49K”) => buys (X, “VCR”)
                                                 [2.2%, 60%]
occupation (X, “student”) ^ age (X, “20…29”) => buys (X, “computer”)
                                                  [1.4%, 70%]
Background knowledge

   It is the information about the domain to be mined

   A concept hierarchy is a powerful form of background knowledge.

   Four major types of concept hierarchies:
    schema hierarchies
    set-grouping hierarchies
    operation-derived hierarchies
    rule-based hierarchies
Concept hierarchies (1)
   Defines a sequence of mappings from a set of low-level concepts to higher-
    level (more general) concepts.

   Allows data to be mined at multiple levels of abstraction.

   These allow users to view data from different perspectives, allowing further
    insight into the relationships.

   Example (location)
Example


                                                    all                                       Level 0




                                                                         USA                  Level 1
                      Canada




       British                           Ontario              New York             Illinois   Level 2
      Columbia




Vancouver        Victoria      Toronto         Ottawa     New York       Buffalo   Chicago    Level 3
Concept hierarchies (2)
   Rolling Up - Generalization of data
     Allows data to be viewed at more meaningful and explicit abstractions.
     Makes it easier to understand
     Compresses the data
     Would require fewer input/output operations
   Drilling Down - Specialization of data
     Concept values replaced by lower level concepts
   There may be more than one concept hierarchy for a given attribute or
    dimension, based on different user viewpoints
   Example:
    Regional sales manager may prefer the previous concept hierarchy but
    marketing manager might prefer to see location with respect to linguistic
    lines in order to facilitate the distribution of commercial ads.
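
Rolling up can be sketched as re-aggregating a measure along the child-to-parent mappings of the location hierarchy shown earlier. This is an illustrative sketch only; the sales figures and the function name `roll_up` are hypothetical.

```python
from collections import Counter

# Level 3 (city) -> Level 2 (province/state) -> Level 1 (country),
# taken from the location hierarchy example.
CITY_TO_REGION = {
    "Vancouver": "British Columbia", "Victoria": "British Columbia",
    "Toronto": "Ontario", "Ottawa": "Ontario",
    "New York": "New York", "Buffalo": "New York",
    "Chicago": "Illinois",
}
REGION_TO_COUNTRY = {
    "British Columbia": "Canada", "Ontario": "Canada",
    "New York": "USA", "Illinois": "USA",
}

def roll_up(totals, mapping):
    """Generalize totals one level up by summing over each parent's children."""
    rolled = Counter()
    for key, amount in totals.items():
        rolled[mapping[key]] += amount
    return dict(rolled)

# Hypothetical sales counts per city, for illustration only.
sales_by_city = {"Vancouver": 5, "Victoria": 2, "Toronto": 7, "Chicago": 4}
sales_by_region = roll_up(sales_by_city, CITY_TO_REGION)
sales_by_country = roll_up(sales_by_region, REGION_TO_COUNTRY)
print(sales_by_region, sales_by_country)
```

Each roll-up yields fewer, more general rows, which is why generalized data is more compact. Drilling down is the reverse lookup and needs the lower-level data to be retained.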
Schema hierarchies
   Schema hierarchy is the total or partial order among attributes in the
    database schema.

   May formally express existing semantic relationships between attributes.

   Provides metadata information.

   Example: location hierarchy
    street < city < province/state < country
Set-grouping hierarchies
   Organizes values for a given attribute into groups or sets or range of values.

   Total or partial order can be defined among groups.

   Used to refine or enrich schema-defined hierarchies.

   Typically used for small sets of object relationships.

   Example: Set-grouping hierarchy for age
    {young, middle_aged, senior} ⊂ all (age)
    {20…39} ⊂ young
    {40…59} ⊂ middle_aged
    {60…89} ⊂ senior
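
A set-grouping hierarchy of this kind amounts to a lookup from a raw value to its group. The sketch below assumes the ranges 20 to 39, 40 to 59 and 60 to 89 (as in the DMQL age_hierarchy definition later in the deck); the function name is hypothetical.

```python
# Level 2 value ranges mapped to level 1 concepts (upper bounds are exclusive
# in range(), so range(20, 40) covers ages 20 through 39).
AGE_GROUPS = {
    "young": range(20, 40),
    "middle_aged": range(40, 60),
    "senior": range(60, 90),
}

def age_group(age):
    """Return the level 1 concept for an age, or None if it is outside the hierarchy."""
    for name, ages in AGE_GROUPS.items():
        if age in ages:
            return name
    return None

print(age_group(25), age_group(45), age_group(72))
```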
Operation-derived hierarchies
   Operation-derived:
    based on operations specified
    operations may include
        decoding of information-encoded strings
        information extraction from complex data objects
        data clustering
    Example: URL or email address
    xyz@cs.iitm.in gives login name < dept. < univ. < country
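
Decoding an information-encoded string can be sketched as plain string splitting. This assumes, purely for illustration, that the domain always has exactly three labels (department, university, country) as in the example address; real addresses vary.

```python
def email_hierarchy(address):
    """Split an address into [login name, dept., univ., country], lowest level first.

    Assumes a domain of the form dept.univ.country; other shapes raise ValueError.
    """
    login, domain = address.split("@", 1)
    dept, univ, country = domain.split(".")
    return [login, dept, univ, country]

print(email_hierarchy("xyz@cs.iitm.in"))
```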
Rule-based hierarchies
   Rule-based:
    Occurs when either the whole or a portion of a concept hierarchy is defined as a
    set of rules and is evaluated dynamically based on the current database data and
    the rule definitions.

   Example: The following rules categorize items as low_profit_margin,
    medium_profit_margin and high_profit_margin.
    low_profit_margin(X) <= price(X,P1)^cost(X,P2)^((P1-P2)<50)
    medium_profit_margin(X) <= price(X,P1)^cost(X,P2)^((P1-P2)≥50)^((P1-P2)≤250)
    high_profit_margin(X) <= price(X,P1)^cost(X,P2)^((P1-P2)>250)
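
Because the rules are evaluated against current data, they behave like a function of price and cost. A minimal sketch of the three rules, with a hypothetical function name:

```python
def profit_margin_level(price, cost):
    """Classify an item by the rule-based hierarchy (thresholds 50 and 250)."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if margin <= 250:  # here margin is already known to be >= 50
        return "medium_profit_margin"
    return "high_profit_margin"

print(profit_margin_level(120, 100), profit_margin_level(400, 200), profit_margin_level(900, 500))
```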
Interestingness measure (1)
   Used to confine the number of uninteresting patterns returned by the
    process.

   Based on the structure of patterns and statistics underlying them.

   Associate a threshold which can be controlled by the user.

   Patterns not meeting the threshold are not presented to the user.

   Objective measures of pattern interestingness:
    simplicity
    certainty (confidence)
    utility (support)
    novelty
Interestingness measure (2)
   Simplicity
    A pattern's interestingness is based on its overall simplicity for human
    comprehension.
    Example: Rule length is a simplicity measure

   Certainty (confidence)
    Assesses the validity or trustworthiness of a pattern.
    Confidence is a certainty measure:
    confidence (A=>B) = (# tuples containing both A and B) / (# tuples containing A)
    A confidence of 85% for the rule buys(X, “computer”)=>buys(X,“software”)
    means that 85% of all customers who purchased a computer also bought
    software
Interestingness measure (3)
   Utility (support)
    Usefulness of a pattern.
    support (A=>B) = (# tuples containing both A and B) / (total # of tuples)
    A support of 30% for the previous rule means that 30% of all customers in
    the computer department purchased both a computer and software.

   Association rules that satisfy both the minimum confidence and the minimum
    support thresholds are referred to as strong association rules.

   Novelty
    Patterns contributing new information to the given pattern set are called
    novel patterns (example: Data exception).
    Removing redundant patterns is a strategy for detecting novelty.
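
The support and confidence definitions can be checked on a small example. The sketch below treats each tuple as a set of purchased items; the transactions and the two threshold values are made up for illustration.

```python
def support(transactions, a, b):
    """Fraction of all tuples that contain both itemsets a and b."""
    both = sum(1 for t in transactions if a <= t and b <= t)
    return both / len(transactions)

def confidence(transactions, a, b):
    """Among tuples containing a, the fraction that also contain b."""
    has_a = [t for t in transactions if a <= t]
    both = sum(1 for t in has_a if b <= t)
    return both / len(has_a)

# Hypothetical transactions.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"software"},
]
A, B = {"computer"}, {"software"}
s = support(transactions, A, B)      # 2 of the 5 tuples contain both
c = confidence(transactions, A, B)   # 2 of the 3 tuples containing A

# A rule is "strong" when it meets both user-set thresholds (values assumed here).
MIN_SUPPORT, MIN_CONFIDENCE = 0.30, 0.60
is_strong = s >= MIN_SUPPORT and c >= MIN_CONFIDENCE
print(s, c, is_strong)
```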
Presentation and visualization
   For data mining to be effective, data mining systems should be able to
    display the discovered patterns in multiple forms, such as rules, tables,
    crosstabs (cross-tabulations), pie or bar charts, decision trees, cubes, or
    other visual representations.

   User must be able to specify the forms of presentation to be used for
    displaying the discovered patterns.
Data mining query languages

   Data mining language must be designed to facilitate flexible and effective
    knowledge discovery.

   Having a query language for data mining may help standardize the
    development of platforms for data mining systems.

   But designing a language is challenging because data mining covers a wide
    spectrum of tasks and each task has different requirements.

   Hence, the design of a language requires deep understanding of the
    limitations and underlying mechanism of the various kinds of tasks.
Data mining query languages (2)
   So…how would you design an efficient query language???

   Based on the primitives discussed earlier.

   DMQL allows mining of different kinds of knowledge from relational
    databases and data warehouses at multiple levels of abstraction.
DMQL
   Adopts SQL-like syntax

   Hence, can be easily integrated with relational query languages

   Defined in BNF grammar
     [ ] represents 0 or one occurrence
     { } represents 0 or more occurrences
     Words in sans serif represent keywords
DMQL-Syntax for task-relevant data specification

   Names of the relevant database or data warehouse, conditions and relevant
    attributes or dimensions must be specified
   use database ‹database_name› or use data warehouse ‹data_warehouse_name›


   from ‹relation(s)/cube(s)› [where condition]
   in relevance to ‹attribute_or_dimension_list›
   order by ‹order_list›
   group by ‹grouping_list›
   having ‹condition›
Example
Syntax for Kind of Knowledge to be Mined
   Characterization :
    ‹Mine_Knowledge_Specification› ::=
         mine characteristics [as ‹pattern_name›]
         analyze ‹measure(s)›
   Example:
    mine characteristics as customerPurchasing analyze count%

   Discrimination:
     ‹Mine_Knowledge_Specification› ::=
          mine comparison [as ‹ pattern_name›]
          for ‹target_class› where ‹target_condition›
          {versus ‹contrast_class_i› where ‹contrast_condition_i›}
          analyze ‹measure(s)›
   Example:
    mine comparison as purchaseGroups
          for bigspenders where avg(I.price) >= $100
          versus budgetspenders where avg(I.price) < $100
          analyze count
Syntax for Kind of Knowledge to be Mined (2)
   Association:
    ‹Mine_Knowledge_Specification› ::=
         mine associations [as ‹pattern_name›]
         [matching ‹metapattern›]
   Example: mine associations as buyingHabits
              matching P(X: customer, W) ^ Q(X,Y) => buys (X,Z)

   Classification:
          ‹Mine_Knowledge_Specification› ::=
             mine classification [as ‹pattern_name›]
             analyze ‹classifying_attribute_or_dimension›
   Example: mine classification as classifyCustomerCreditRating
               analyze credit_rating
Syntax for concept hierarchy specification
   More than one concept hierarchy per attribute can be specified
   use hierarchy ‹hierarchy_name› for ‹attribute_or_dimension›
   Examples:
    Schema concept hierarchy (ordering is important)
          define hierarchy location_hierarchy on address as
           [street,city,province_or_state,country]

     Set-Grouping concept hierarchy
          define hierarchy age_hierarchy for age on customer as

                 level1: {young, middle_aged, senior} < level0: all
                 level2: {20, ..., 39} < level1: young
                 level2: {40, ..., 59} < level1: middle_aged
                 level2: {60, ..., 89} < level1: senior
Syntax for concept hierarchy specification (2)
   operation-derived concept hierarchy
      define hierarchy age_hierarchy for age on customer as

      {age_category(1), ..., age_category(5)} := cluster (default, age, 5) <
       all(age)

   rule-based concept hierarchy
       define hierarchy profit_margin_hierarchy on item as

        level_1: low_profit_margin < level_0: all
                 if (price - cost)< $50
        level_1: medium_profit_margin < level_0: all
                 if ((price - cost) >= $50) and ((price - cost) <= $250)
        level_1: high_profit_margin < level_0: all
                 if (price - cost) > $250
Syntax for interestingness measure specification
   with [‹interest_measure_name›] threshold = ‹threshold_value›

   Example:
    with support threshold = 5%
    with confidence threshold = 70%
Syntax for pattern presentation and visualization
                          specification
   display as ‹result_form›

   The result form can be rules, tables, cubes, crosstabs, pie or bar charts,
    decision trees, curves or surfaces.

   To facilitate interactive viewing at different concept levels or different
    angles, the following syntax is defined:
    ‹Multilevel_Manipulation› ::= roll up on ‹attribute_or_dimension›
                                     | drill down on ‹attribute_or_dimension›
                                     | add ‹attribute_or_dimension›
                                     | drop ‹attribute_or_dimension›
Architectures of Data Mining System
   With the popular and diverse applications of data mining, it is expected that a
    good variety of data mining systems will be designed and developed.
   Comprehensive information processing and data analysis infrastructures will be
    built continuously and systematically around data warehouses and
    databases.
   A critical question in design is whether we should integrate data mining
    systems with database systems.
   This gives rise to four architectures:
    -     No coupling
    -     Loose Coupling
    -     Semi-tight Coupling
    -     Tight Coupling
Cont.

   No Coupling: The DM system does not utilize any functionality of a DB or DW
    system.

   Loose Coupling: The DM system uses some facilities of a DB or DW system,
    such as storing data in either system and using it for data retrieval.

   Semi-tight Coupling: Besides linking a DM system to a DB/DW system,
    efficient implementations of a few DM primitives are provided in the
    DB/DW system.

   Tight Coupling: The DM system is smoothly integrated into the DB/DW system.
    The DM system and the DB/DW system are each treated as a main functional
    component of the information system.
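
The difference between loose and semi-tight coupling can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table and rows are invented, and aggregation is used here as one example of a primitive that could be pushed into the database.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items_sold (trans_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO items_sold VALUES (?, ?)",
    [(1, "computer"), (1, "software"), (2, "computer"), (3, "printer")],
)

# Loose coupling: the database only stores and retrieves rows;
# the counting is done by the data mining code itself.
rows = conn.execute("SELECT item FROM items_sold").fetchall()
loose_counts = dict(Counter(item for (item,) in rows))

# Semi-tight coupling (sketch): the aggregation primitive runs inside the
# database, and the data mining code consumes the precomputed counts.
semi_tight_counts = dict(
    conn.execute("SELECT item, COUNT(*) FROM items_sold GROUP BY item").fetchall()
)

print(loose_counts, semi_tight_counts)
```

Both paths give the same counts; what changes is where the work happens and how much data crosses the boundary between the two systems.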
Paper Discussion


     Lydia: A System for Large-Scale News Analysis
           Levon Lloyd, Dimitrios Kechagias,
                     Steven Skiena
            Department of Computer Science
      State University of New York at Stony Brook
      Published in the 12th International Conference on String Processing
      and Information Retrieval (SPIRE 2005), Buenos Aires, Argentina, November 2-4, 2005
Abstract

   This paper describes a text mining system called Lydia.

   Periodical publications represent a rich and recurrent source of
    knowledge on both current and historical events.

   The Lydia project seeks to build a relational model of people, places,
    and things through natural language processing of news sources and
    the statistical analysis of entity frequencies and co-locations.

   Perhaps the most familiar news analysis system is Google News
Lydia Text Analysis System


   Lydia is designed for high-speed analysis of online text



   Lydia performs a variety of interesting analyses on named entities in
    text, breaking them down by source, location and time.
Block Diagram of Lydia System

   [Block diagram. Labelled components: Document Extractor, POS Tagging,
    Syntax Tagging, Actor Classification, Rule Based Processing,
    Geographic Normalization, DataBase, Applications, Juxtaposition
    Analysis, Synset Identification, Heatmap Generation.]
Processes Involved

   Spidering and Article Classification

   Named Entity Recognition

   Juxtaposition Analysis

   Co-reference Set Identification

   Temporal and Spatial Analysis
News Analysis with Lydia


   Juxtapositional Analysis.

   Spatial Analysis

   Temporal entity analysis
Juxtaposition Analysis
     Mental model of where an entity fits into the world depends largely
      upon how it relates to other entities.
     For each entity, we compute a significance score for every other
      entity that co-occurs with it, and rank its juxtapositions by this
      score.

      Martin Luther King                  Israel                       North Carolina
      Entity                    Score     Entity           Score       Entity         Score
      Jesse Jackson             545.97    Mahmoud Abbas    9,635.51    Duke           2,747.8
      Coretta Scott King        454.51    Palestinians     9,041.70    ACC            1,666.92
      Atlanta, GA               286.73    Ariel Sharon     3,620.84    Virginia       1,283.61
      Ebenezer Baptist Church   260.84    Gaza             4,391.05    Wake Forest    1,554.92
Cont.

To determine the significance of a juxtaposition, they
bound the probability that two entities would co-occur in
as many articles as they actually do if their occurrences
were generated by a random process. To estimate this
probability they use a Chernoff bound.
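
As a rough illustration of this idea, the sketch below applies the standard multiplicative Chernoff bound, P(X >= (1 + d) * m) <= (e^d / (1 + d)^(1 + d))^m for a sum X of independent indicators with mean m. The independence model used for the expected co-occurrence count, the function names, and the article counts are all assumptions; the paper's exact formulation may differ.

```python
import math

def expected_cooccurrences(n_a, n_b, n_articles):
    """Expected number of articles mentioning both entities if mentions
    were independent (an assumed null model)."""
    return n_a * n_b / n_articles

def chernoff_upper_bound(observed, expected):
    """Upper bound on P(X >= observed) for X with mean `expected`.

    Uses the multiplicative Chernoff bound, computed in log space.
    Returns 1.0 when observed <= expected, where the bound says nothing.
    """
    if observed <= expected:
        return 1.0
    delta = observed / expected - 1.0
    log_bound = expected * (delta - (1.0 + delta) * math.log(1.0 + delta))
    return math.exp(log_bound)

# Hypothetical counts: 1000 articles, entity A in 100, entity B in 50,
# and 20 articles mentioning both.
mean = expected_cooccurrences(100, 50, 1000)
bound = chernoff_upper_bound(20, mean)
print(mean, bound)
```

A smaller bound means the observed co-occurrence is harder to explain by chance, so the juxtaposition is more significant.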
Spatial Analysis

   It is interesting to see where in the country people are talking about
    particular entities. Each newspaper has a location and a circulation,
    and each city has a population. These facts allow them to
    approximate a sphere of influence for each newspaper. The heat an
    entity generates in a city is then a function of its frequency of
    reference in each of the newspapers that have influence over that
    city.
Cont.
Temporal Analysis
   The ability to track all references to entities, broken down by article type,
    makes it possible to monitor trends. The figure tracks the ebbs and flows
    in the interest in Michael Jackson as his trial progressed in May
    2005.
How the paper is related to DM?
   To classify articles into categories such as news and sports, the Lydia
    system uses a Bayesian classifier.

   A Bayesian classifier is a classification and prediction algorithm.

   Data classification is a DM technique carried out in two stages:
    - building a model from a set of data with predetermined classes
    - using the model to predict the class of new input data
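
The two stages can be sketched with a small naive Bayes text classifier. This is a generic illustration with add-one smoothing, not Lydia's actual classifier; the training documents and labels are invented.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Stage 1: build the model from documents whose classes are known."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(model, words):
    """Stage 2: return the class with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training articles, already tokenized.
training = [
    (["match", "goal", "team"], "sports"),
    (["team", "coach", "season"], "sports"),
    (["election", "minister", "vote"], "news"),
    (["vote", "parliament", "law"], "news"),
]
model = train(training)
print(predict(model, ["team", "goal"]), predict(model, ["vote", "law"]))
```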
Application




GIS (Geographical Information System)
What is GIS???



A GIS is a computer system capable of capturing, storing, analyzing,
and displaying geographically referenced information;

Example: A GIS might be used to find wetlands that need protection
from pollution.
How does a GIS work?


   A GIS works by relating information from different sources

   The power of a GIS comes from the ability to relate different information
    in a spatial context and to reach a conclusion about this relationship.

   Most of the information we have about our world contains a location
    reference, placing that information at some point on the globe.
U.S. Geological Survey (USGS) Digital Line Graph (DLG) of roads.
Digital Line Graph of rivers.
Data capture
   If the data to be used are not already in digital form
    - Maps can be digitized by hand-tracing with a computer mouse
    - Electronic scanners can also be used

   Co-ordinates for the maps can be collected using Global Positioning System
    (GPS) receivers

   Putting the information into the system involves identifying the objects on
    the map, their absolute location on the Earth's surface, and their spatial
    relationships.
Data integration
   A GIS makes it possible to link, or integrate, information that is difficult to
    associate through any other means.




Mapmaking
   Researchers are working to incorporate the mapmaking processes of
    traditional cartographers into GIS technology for the automated
    production of maps.
What is special about GIS??
   Information retrieval: What do you know about the swampy area at the
    end of your street? With a GIS you can "point" at a location, object, or area
    on the screen and retrieve recorded information about it from off-screen
    files. Using scanned aerial photographs as a visual guide, you can ask a GIS
    about the geology or hydrology of the area or even about how close a swamp
    is to the end of a street. This type of analysis allows you to draw conclusions
    about the swamp's environmental sensitivity.
Cont.

   Topological modeling: Have there ever been gas stations or factories that
    operated next to the swamp? Were any of these uphill from and within 2 miles of the
    swamp? A GIS can recognize and analyze the spatial relationships among mapped
    phenomena. Conditions of adjacency (what is next to what), containment (what is
    enclosed by what), and proximity (how close something is to something else) can be
    determined with a GIS.
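
Two of these relationships, proximity and containment, can be sketched with simple planar geometry. The coordinates, elevations and feature names below are hypothetical, and a real GIS would work with projected map coordinates and full geometries rather than points.

```python
import math

# Hypothetical mapped features: planar coordinates in miles, elevation in feet.
SWAMP = {"x": 0.0, "y": 0.0, "elevation": 10.0}
SITES = {
    "gas_station": {"x": 1.2, "y": 0.9, "elevation": 30.0},
    "factory": {"x": 3.0, "y": 4.0, "elevation": 50.0},
    "depot": {"x": 0.5, "y": 0.5, "elevation": 5.0},
}

def distance_miles(a, b):
    """Proximity: straight-line distance between two point features."""
    return math.hypot(a["x"] - b["x"], a["y"] - b["y"])

def inside(box, feature):
    """Containment: is the feature inside the box (xmin, ymin, xmax, ymax)?"""
    xmin, ymin, xmax, ymax = box
    return xmin <= feature["x"] <= xmax and ymin <= feature["y"] <= ymax

def uphill_and_near(sites, target, max_miles):
    """Names of sites higher than the target and within max_miles of it."""
    return sorted(
        name for name, site in sites.items()
        if site["elevation"] > target["elevation"]
        and distance_miles(site, target) <= max_miles
    )

# "Were any of these uphill from and within 2 miles of the swamp?"
risky = uphill_and_near(SITES, SWAMP, max_miles=2.0)
print(risky)
```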
Cont.
   Networks: When nutrients from farmland are running off into streams,
    it is important to know in which direction the streams flow and which
    streams empty into other streams. This is done by using a linear network. It
    allows the computer to determine how the nutrients are transported
    downstream. Additional information on water volume and speed
    throughout the spatial network can help the GIS determine how long it will
    take the nutrients to travel downstream.
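
A linear network of this kind can be sketched as a mapping from each stream segment to the segment it empties into. The segment names, lengths and flow speeds are hypothetical, and the sketch assumes the network has no cycles.

```python
# Each segment empties into at most one downstream segment.
FLOWS_INTO = {"creek_a": "river_main", "creek_b": "river_main", "river_main": "estuary"}
LENGTH_MILES = {"creek_a": 3.0, "creek_b": 2.0, "river_main": 10.0, "estuary": 1.0}
SPEED_MPH = {"creek_a": 1.5, "creek_b": 1.0, "river_main": 2.5, "estuary": 0.5}

def downstream_path(start):
    """Follow the flow direction from a segment to the end of the network."""
    path = [start]
    while path[-1] in FLOWS_INTO:
        path.append(FLOWS_INTO[path[-1]])
    return path

def travel_hours(start):
    """Time for nutrients entering at `start` to pass through every segment downstream."""
    return sum(LENGTH_MILES[s] / SPEED_MPH[s] for s in downstream_path(start))

print(downstream_path("creek_a"), travel_hours("creek_a"))
```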
Data Output

   A critical component of a GIS is its ability to produce graphics on the
    screen or on paper to convey the results of analyses to the people
    who make decisions about resources.
The future of GIS

   GIS and related technology will help analyze large datasets, allowing
    a better understanding of terrestrial processes and human activities
    to improve economic vitality and environmental quality
How is it related to DM?


   To represent the data in graphical format, most likely as a graph,
    cluster analysis is performed on the data set.

   Clustering is a data mining concept which is a process of grouping together
    the data into clusters or classes.
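
Grouping data into clusters can be sketched with k-means on one-dimensional values. K-means is used here only as a familiar example of a clustering algorithm; the slides do not say which method a GIS would use, and the data and starting centers are made up.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on one-dimensional data with given initial centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to its cluster mean (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, clusters)
```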
?

More Related Content

What's hot

Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data miningUjjawal
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data MiningValerii Klymchuk
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedYugal Kumar
 
Data mining approaches and methods
Data mining approaches and methodsData mining approaches and methods
Data mining approaches and methodssonangrai
 
Tutorial Knowledge Discovery
Tutorial Knowledge DiscoveryTutorial Knowledge Discovery
Tutorial Knowledge DiscoverySSSW
 
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kambererror007
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Researchkevinlan
 
DATA MINING.doc
DATA MINING.docDATA MINING.doc
DATA MINING.docbutest
 
Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining:  Concepts and Techniques — Chapter 2 —Data Mining:  Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —Salah Amean
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHarry Potter
 
Privacy preservation techniques in data mining
Privacy preservation techniques in data miningPrivacy preservation techniques in data mining
Privacy preservation techniques in data miningeSAT Publishing House
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretizationKrish_ver2
 
Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)snegacmr
 
Application of KDD & its future scope
Application of KDD & its future scopeApplication of KDD & its future scope
Application of KDD & its future scopeTanmay Sethi
 

What's hot (18)

Data mining
Data miningData mining
Data mining
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data mining
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data Mining
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Data mining approaches and methods
Data mining approaches and methodsData mining approaches and methods
Data mining approaches and methods
 
02 Data Mining
02 Data Mining02 Data Mining
02 Data Mining
 
Tutorial Knowledge Discovery
Tutorial Knowledge DiscoveryTutorial Knowledge Discovery
Tutorial Knowledge Discovery
 
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
DATA MINING.doc
DATA MINING.docDATA MINING.doc
DATA MINING.doc
 
Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining:  Concepts and Techniques — Chapter 2 —Data Mining:  Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Privacy preservation techniques in data mining
Privacy preservation techniques in data miningPrivacy preservation techniques in data mining
Privacy preservation techniques in data mining
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
 
Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)
 
Application of KDD & its future scope
Application of KDD & its future scopeApplication of KDD & its future scope
Application of KDD & its future scope
 
Seminar Presentation
Seminar PresentationSeminar Presentation
Seminar Presentation
 

Viewers also liked

Patiosososososo
PatiosososososoPatiosososososo
Patiosososososojuradox
 
Periodico el informativo
Periodico el informativoPeriodico el informativo
Periodico el informativoslinki
 
Corporate profile of abloom realty services pvt.ltd
Corporate profile of abloom realty services pvt.ltdCorporate profile of abloom realty services pvt.ltd
Corporate profile of abloom realty services pvt.ltdArun Verma
 
Choose Your Own Reformation
Choose Your Own ReformationChoose Your Own Reformation
Choose Your Own ReformationAWESOMEABBY
 
Guests of Fryday Afterwork!
Guests of Fryday Afterwork!Guests of Fryday Afterwork!
Guests of Fryday Afterwork!Playtestix
 
Periodico el informativo
Periodico el informativoPeriodico el informativo
Periodico el informativoslinki
 
Fryday franchisee information april 2014
Fryday franchisee information april 2014Fryday franchisee information april 2014
Fryday franchisee information april 2014Playtestix
 
Russell biodiversity values
Russell biodiversity valuesRussell biodiversity values
Russell biodiversity valuesrussellriver
 

Viewers also liked (8)

Patiosososososo
PatiosososososoPatiosososososo
Patiosososososo
 
Periodico el informativo
Periodico el informativoPeriodico el informativo
Periodico el informativo
 
Corporate profile of abloom realty services pvt.ltd
Corporate profile of abloom realty services pvt.ltdCorporate profile of abloom realty services pvt.ltd
Corporate profile of abloom realty services pvt.ltd
 
Choose Your Own Reformation
Choose Your Own ReformationChoose Your Own Reformation
Choose Your Own Reformation
 
Guests of Fryday Afterwork!
Guests of Fryday Afterwork!Guests of Fryday Afterwork!
Guests of Fryday Afterwork!
 
Periodico el informativo
Periodico el informativoPeriodico el informativo
Periodico el informativo
 
Fryday franchisee information april 2014
Fryday franchisee information april 2014Fryday franchisee information april 2014
Fryday franchisee information april 2014
 
Russell biodiversity values
Russell biodiversity valuesRussell biodiversity values
Russell biodiversity values
 

Data mining-2

  • 1. Data Mining Primitives, Languages and System Architecture
      CSE 634: Data Mining Concepts and Techniques, Professor Anita Wasilewska
      Presented by Sushma Devendrappa (105526184) and Swathi Kothapalli (105531380)
  • 2. Sources/References
      - Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2003
      - Willi Klosgen and Jan M. Zytkow, Handbook of Data Mining and Knowledge Discovery, 2002
      - Lydia: A System for Large-Scale News Analysis, in String Processing and Information Retrieval: 12th International Conference, SPIRE 2005, Buenos Aires, Argentina, November 2-4, 2005
      - W. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, 1992
      - Geographical Information System, http://erg.usgs.gov/isb/pubs/gis_poster/
  • 3. Content
      - Data mining primitives
      - Languages
      - System architecture
      - Application: Geographical information system (GIS)
      - Paper: Lydia: A System for Large-Scale News Analysis
  • 4. Introduction
      - Motivation: the need to extract useful information and knowledge from a large amount of data (the data explosion problem).
      - Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research.
  • 5. What is Data Mining?
      - Data mining refers to extracting or “mining” knowledge from large amounts of data. It is also referred to as Knowledge Discovery in Databases.
      - It is a process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.
  • 6. Architecture of a typical data mining system (layers, top to bottom)
      - Graphical user interface
      - Pattern evaluation
      - Data mining engine
      - Database or data warehouse server
      - Data cleansing, data integration, filtering
      - Database, data warehouse
      - Knowledge base (shown beside the pattern evaluation and data mining engine layers)
  • 7. Misconception: data mining systems can autonomously dig out all of the valuable knowledge from a given large database, without human intervention.
      - Without user intervention, the system would uncover a set of patterns so large that it may even surpass the size of the database. Hence, user interaction is required.
      - The user communicates with the system through a set of data mining primitives.
  • 8. Data Mining Primitives
      Data mining primitives define a data mining task, which can be specified in the form of a data mining query.
      - Task-relevant data
      - Kinds of knowledge to be mined
      - Background knowledge
      - Interestingness measures
      - Presentation and visualization of discovered patterns
  • 9. Task-relevant data
      - The portion of the data to be investigated.
      - Attributes of interest (relevant attributes) can be specified.
      - Initial data relation
      - Minable view
  • 10. Example
      If a data mining task is to study associations between items frequently purchased at AllElectronics by customers in Canada, the task-relevant data can be specified by providing the following information:
      - Name of the database or data warehouse to be used (e.g., AllElectronics_db)
      - Names of the tables or data cubes containing the relevant data (e.g., item, customer, purchases, and items_sold)
      - Conditions for selecting the relevant data (e.g., retrieve data pertaining to purchases made in Canada for the current year)
      - The relevant attributes or dimensions (e.g., name and price from the item table, and income and age from the customer table)
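The four pieces of information in this example can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table layouts, column names, and sample rows below are hypothetical, since the slides name the tables and attributes but not their schema.

```python
import sqlite3

# In-memory stand-in for AllElectronics_db; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, income INTEGER, age INTEGER, country TEXT);
    CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE purchases (cust_id INTEGER, item_id INTEGER, year INTEGER);
    INSERT INTO customer VALUES (1, 45000, 34, 'Canada'), (2, 38000, 27, 'USA');
    INSERT INTO item VALUES (10, 'VCR', 199.0), (11, 'computer', 899.0);
    INSERT INTO purchases VALUES (1, 10, 2005), (1, 11, 2004), (2, 11, 2005);
""")

def task_relevant_data(conn, country, year):
    """Return (name, price, income, age) rows for purchases matching the conditions."""
    query = """
        SELECT i.name, i.price, c.income, c.age
        FROM purchases p
        JOIN customer c ON c.cust_id = p.cust_id
        JOIN item i ON i.item_id = p.item_id
        WHERE c.country = ? AND p.year = ?
        ORDER BY i.name
    """
    return conn.execute(query, (country, year)).fetchall()

# Relevant tables: item, customer, purchases. Condition: Canada, one given year.
rows = task_relevant_data(conn, "Canada", 2005)
print(rows)
```

The result plays the role of the initial data relation: only the relevant attributes of the rows that satisfy the selection conditions are handed to the mining step.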
  • 11. Kind of knowledge to be mined
      - It is important to specify the knowledge to be mined, as this determines the data mining function to be performed.
      - Kinds of knowledge include concept description, association, classification, prediction, and clustering.
      - The user can also provide pattern templates, also called metapatterns, metarules, or metaqueries.
  • 12. Example
      A user studying the buying habits of AllElectronics customers may choose to mine association rules that match the metarule:
          P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
      Rules matching this metarule include (each followed by [support, confidence]):
          age(X, “30...39”) ^ income(X, “40K...49K”) => buys(X, “VCR”) [2.2%, 60%]
          occupation(X, “student”) ^ age(X, “20...29”) => buys(X, “computer”) [1.4%, 70%]
  • 13. Background knowledge
      - Information about the domain to be mined.
      - Concept hierarchies are a powerful form of background knowledge.
      - Four major types of concept hierarchies:
          - schema hierarchies
          - set-grouping hierarchies
          - operation-derived hierarchies
          - rule-based hierarchies
  • 14. Concept hierarchies (1)
      - A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level (more general) concepts.
      - It allows data to be mined at multiple levels of abstraction.
      - It lets users view data from different perspectives, giving further insight into the relationships.
      - Example (location)
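  • 15. Example (location concept hierarchy)
      all (level 0)
          Canada (level 1)
              British Columbia (level 2)
                  Vancouver, Victoria (level 3)
              Ontario (level 2)
                  Toronto, Ottawa (level 3)
          USA (level 1)
              New York (level 2)
                  New York, Buffalo (level 3)
              Illinois (level 2)
                  Chicago (level 3)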
  • 16. Concept hierarchies (2)
      - Rolling up: generalization of data
          - Allows data to be viewed at more meaningful and explicit abstractions.
          - Makes the data easier to understand.
          - Compresses the data.
          - Requires fewer input/output operations.
      - Drilling down: specialization of data
          - Concept values are replaced by lower-level concepts.
      - There may be more than one concept hierarchy for a given attribute or dimension, based on different user viewpoints.
      - Example: a regional sales manager may prefer the previous concept hierarchy, but a marketing manager might prefer to see location organized along linguistic lines in order to facilitate the distribution of commercial ads.
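Rolling up along the location hierarchy can be sketched as a lookup table of child-to-parent links plus a counter. The hierarchy comes from the location example in the slides; the per-city sales counts are hypothetical, and the city of New York is keyed as "New York City" here only to keep it distinct from the state.

```python
from collections import Counter

# Child -> parent links from the location concept hierarchy.
PARENT = {
    "Vancouver": "British Columbia", "Victoria": "British Columbia",
    "Toronto": "Ontario", "Ottawa": "Ontario",
    "New York City": "New York", "Buffalo": "New York",
    "Chicago": "Illinois",
    "British Columbia": "Canada", "Ontario": "Canada",
    "New York": "USA", "Illinois": "USA",
    "Canada": "all", "USA": "all",
}

def roll_up(counts, levels=1):
    """Generalize each key `levels` steps up the hierarchy and sum the counts."""
    result = Counter()
    for concept, n in counts.items():
        for _ in range(levels):
            concept = PARENT.get(concept, concept)  # "all" has no parent and stays put
        result[concept] += n
    return dict(result)

# Hypothetical sales counts per city (level 3).
city_sales = {"Vancouver": 5, "Victoria": 2, "Toronto": 7, "Chicago": 4, "Buffalo": 1}
by_province = roll_up(city_sales, levels=1)  # level 2
by_country = roll_up(city_sales, levels=2)   # level 1
print(by_province)
print(by_country)
```

Rolling up compresses five city rows into two country rows. Drilling down is the reverse direction and needs the lower-level data to still be available, since the totals alone cannot be split back into cities.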
  • 17. Schema hierarchies
      - A schema hierarchy is a total or partial order among attributes in the database schema.
      - It may formally express existing semantic relationships between attributes.
      - It provides metadata information.
      - Example: location hierarchy street < city < province/state < country
  • 18. Set-grouping hierarchies
      - Organize the values of a given attribute into groups, sets, or ranges of values.
      - A total or partial order can be defined among the groups.
      - Used to refine or enrich schema-defined hierarchies.
      - Typically used for small sets of object relationships.
      - Example: set-grouping hierarchy for age
          {young, middle_aged, senior} ⊂ all(age)
          {20...39} ⊂ young
          {40...59} ⊂ middle_aged
          {60...89} ⊂ senior
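A set-grouping hierarchy for age can be sketched as a short table of ranges. The ranges follow the DMQL definition of age_hierarchy given later in the deck (20 to 39, 40 to 59, 60 to 89); mapping any age outside those ranges straight to the top-level concept is an assumption made for this sketch.

```python
# level2 ranges -> level1 group names; bounds are inclusive.
AGE_GROUPS = [
    (20, 39, "young"),
    (40, 59, "middle_aged"),
    (60, 89, "senior"),
]

def age_group(age):
    """Map a raw age to its level1 concept, or to 'all' when no range covers it."""
    for low, high, name in AGE_GROUPS:
        if low <= age <= high:
            return name
    return "all"

print([age_group(a) for a in (25, 47, 72)])
```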
  • 19. Operation-derived hierarchies
      - Based on specified operations, which may include:
          - decoding of information-encoded strings
          - information extraction from complex data objects
          - data clustering
      - Example: a URL or email address. xyz@cs.iitm.in gives login name < dept. < univ. < country
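Decoding an information-encoded string, as in the email example, might look like the sketch below. It assumes addresses of the exact shape login@dept.univ.country; real addresses vary, so anything else is rejected rather than guessed at.

```python
def email_hierarchy(address):
    """Decode 'login@dept.univ.country' into its levels, most specific first."""
    login, _, domain = address.partition("@")
    parts = domain.split(".")
    if not login or len(parts) != 3:
        raise ValueError(f"expected login@dept.univ.country, got {address!r}")
    dept, univ, country = parts
    return {"login": login, "dept": dept, "univ": univ, "country": country}

levels = email_hierarchy("xyz@cs.iitm.in")
print(levels)
```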
  • 20. Rule-based hierarchies
      - Occur when the whole or a portion of a concept hierarchy is defined as a set of rules and is evaluated dynamically based on the current database data and the rule definitions.
      - Example: the following rules categorize items as low_profit_margin, medium_profit_margin, and high_profit_margin.
          low_profit_margin(X) <= price(X, P1) ^ cost(X, P2) ^ ((P1 - P2) < 50)
          medium_profit_margin(X) <= price(X, P1) ^ cost(X, P2) ^ ((P1 - P2) ≥ 50) ^ ((P1 - P2) ≤ 250)
          high_profit_margin(X) <= price(X, P1) ^ cost(X, P2) ^ ((P1 - P2) > 250)
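The three profit-margin rules translate directly into a small function evaluated on the current price and cost of an item. The function name is hypothetical; the boundaries (below 50, from 50 to 250 inclusive, above 250) are the ones stated in the rules.

```python
def profit_margin_category(price, cost):
    """Classify an item by the margin rules: <50 low, 50..250 medium, >250 high."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if margin <= 250:
        return "medium_profit_margin"
    return "high_profit_margin"

print(profit_margin_category(100, 60))   # margin 40
print(profit_margin_category(300, 50))   # margin 250
print(profit_margin_category(400, 100))  # margin 300
```

Because the category is computed from the data rather than stored, a price change moves an item between categories without any change to the hierarchy definition.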
  • 21. Interestingness measure (1)
      - Used to limit the number of uninteresting patterns returned by the process.
      - Based on the structure of patterns and the statistics underlying them.
      - Each measure is associated with a threshold that can be controlled by the user.
      - Patterns not meeting the threshold are not presented to the user.
      - Objective measures of pattern interestingness:
          - simplicity
          - certainty (confidence)
          - utility (support)
          - novelty
  • 22. Interestingness measure (2)
      - Simplicity: a pattern's interestingness is based on its overall simplicity for human comprehension.
          Example: rule length is a simplicity measure.
      - Certainty (confidence): assesses the validity or trustworthiness of a pattern. Confidence is a certainty measure:
          confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
          A confidence of 85% for the rule buys(X, “computer”) => buys(X, “software”) means that 85% of all customers who purchased a computer also bought software.
  • 23. Interestingness measure (3)
      - Utility (support): the usefulness of a pattern.
          support(A => B) = (# tuples containing both A and B) / (total # of tuples)
          A support of 30% for the previous rule means that 30% of all customers in the computer department purchased both a computer and software.
      - Association rules that satisfy both the minimum confidence and the minimum support threshold are referred to as strong association rules.
      - Novelty: patterns contributing new information to the given pattern set are called novel patterns (example: a data exception). Removing redundant patterns is a strategy for detecting novelty.
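Support and confidence, and the test for a strong association rule, can be computed directly from their definitions. In this sketch each transaction is a Python set of items; the transactions and thresholds are hypothetical.

```python
def support(transactions, a, b):
    """Fraction of all transactions that contain every item of both a and b."""
    both = sum(1 for t in transactions if a <= t and b <= t)
    return both / len(transactions)

def confidence(transactions, a, b):
    """Among transactions containing a, the fraction that also contain b."""
    has_a = [t for t in transactions if a <= t]
    if not has_a:
        return 0.0
    return sum(1 for t in has_a if b <= t) / len(has_a)

def is_strong(transactions, a, b, min_support, min_confidence):
    """A rule is strong when it meets both the support and confidence thresholds."""
    return (support(transactions, a, b) >= min_support
            and confidence(transactions, a, b) >= min_confidence)

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"software"},
]
a, b = {"computer"}, {"software"}
print(support(transactions, a, b), confidence(transactions, a, b))
```

Here two of five transactions contain both items and two of the three computer purchases include software, so the rule passes a 30% support threshold but fails a 70% confidence threshold.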
  • 24. Presentation and visualization
      - For data mining to be effective, data mining systems should be able to display the discovered patterns in multiple forms, such as rules, tables, crosstabs (cross-tabulations), pie or bar charts, decision trees, cubes, or other visual representations.
      - The user must be able to specify the forms of presentation to be used for displaying the discovered patterns.
  • 25. Data mining query languages
      - A data mining language must be designed to facilitate flexible and effective knowledge discovery.
      - Having a query language for data mining may help standardize the development of platforms for data mining systems.
      - But designing a language is challenging because data mining covers a wide spectrum of tasks, and each task has different requirements.
      - Hence, the design of a language requires a deep understanding of the limitations and underlying mechanisms of the various kinds of tasks.
  • 26. Data mining query languages (2)
      - So how would you design an efficient query language?
      - Base it on the primitives discussed earlier.
      - DMQL allows mining of different kinds of knowledge from relational databases and data warehouses at multiple levels of abstraction.
  • 27. DMQL
      - Adopts an SQL-like syntax.
      - Hence, it can be easily integrated with relational query languages.
      - Defined in a BNF grammar:
          - [ ] represents zero or one occurrence
          - { } represents zero or more occurrences
          - Words in sans serif represent keywords
  • 28. DMQL syntax for task-relevant data specification
      - The names of the relevant database or data warehouse, the conditions, and the relevant attributes or dimensions must be specified:
          use database ‹database_name› or use data warehouse ‹data_warehouse_name›
          from ‹relation(s)/cube(s)› [where ‹condition›]
          in relevance to ‹attribute_or_dimension_list›
          order by ‹order_list›
          group by ‹grouping_list›
          having ‹condition›
  • 30. Syntax for Kind of Knowledge to be Mined
      - Characterization:
          ‹Mine_Knowledge_Specification› ::= mine characteristics [as ‹pattern_name›] analyze ‹measure(s)›
        Example: mine characteristics as customerPurchasing analyze count%
      - Discrimination:
          ‹Mine_Knowledge_Specification› ::= mine comparison [as ‹pattern_name›] for ‹target_class› where ‹target_condition› {versus ‹contrast_class_i› where ‹contrast_condition_i›} analyze ‹measure(s)›
        Example: mine comparison as purchaseGroups for bigspenders where avg(I.price) >= $100 versus budgetspenders where avg(I.price) < $100 analyze count
  • 31. Syntax for Kind of Knowledge to be Mined (2)
      - Association:
          ‹Mine_Knowledge_Specification› ::= mine associations [as ‹pattern_name›] [matching ‹metapattern›]
        Example: mine associations as buyingHabits matching P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
      - Classification:
          ‹Mine_Knowledge_Specification› ::= mine classification [as ‹pattern_name›] analyze ‹classifying_attribute_or_dimension›
        Example: mine classification as classifyCustomerCreditRating analyze credit_rating
  • 32. Syntax for concept hierarchy specification
      - More than one concept hierarchy per attribute can be specified.
      - use hierarchy ‹hierarchy_name› for ‹attribute_or_dimension›
      - Examples:
          Schema concept hierarchy (ordering is important):
              define hierarchy location_hierarchy on address as [street, city, province_or_state, country]
          Set-grouping concept hierarchy:
              define hierarchy age_hierarchy for age on customer as
                  level1: {young, middle_aged, senior} < level0: all
                  level2: {20, ..., 39} < level1: young
                  level2: {40, ..., 59} < level1: middle_aged
                  level2: {60, ..., 89} < level1: senior
  • 33. Syntax for concept hierarchy specification (2)
      - Operation-derived concept hierarchy:
          define hierarchy age_hierarchy for age on customer as
              {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
      - Rule-based concept hierarchy:
          define hierarchy profit_margin_hierarchy on item as
              level_1: low_profit_margin < level_0: all if (price - cost) < $50
              level_1: medium_profit_margin < level_0: all if ((price - cost) >= $50) and ((price - cost) <= $250)
              level_1: high_profit_margin < level_0: all if (price - cost) > $250
  • 34. Syntax for interestingness measure specification
      - with [‹interest_measure_name›] threshold = ‹threshold_value›
      - Example:
          with support threshold = 5%
          with confidence threshold = 70%
  • 35. Syntax for pattern presentation and visualization specification
      - display as ‹result_form›
      - The result form can be rules, tables, cubes, crosstabs, pie or bar charts, decision trees, curves, or surfaces.
      - To facilitate interactive viewing at different concept levels or from different angles, the following syntax is defined:
          ‹Multilevel_Manipulation› ::= roll up on ‹attribute_or_dimension›
                                      | drill down on ‹attribute_or_dimension›
                                      | add ‹attribute_or_dimension›
                                      | drop ‹attribute_or_dimension›
  • 36. Architectures of Data Mining Systems
      - With the popular and diverse applications of data mining, a good variety of data mining systems is expected to be designed and developed.
      - Comprehensive information processing and data analysis infrastructures will continue to be built systematically around databases and data warehouses.
      - A critical design question is whether data mining systems should be integrated with database systems.
      - This gives rise to four architectures:
          - No coupling
          - Loose coupling
          - Semi-tight coupling
          - Tight coupling
  • 37. Cont.
      - No coupling: the DM system does not utilize any functionality of a DB or DW system.
      - Loose coupling: the DM system uses some facilities of a DB or DW system, such as storing the data in one of these systems and using it for data retrieval.
      - Semi-tight coupling: besides linking a DM system to a DB/DW system, efficient implementations of a few DM primitives are provided.
      - Tight coupling: the DM system is smoothly integrated with the DB/DW system. Each of the DM and DB/DW systems is treated as a main functional component of an information retrieval system.
  • 38. Paper Discussion
      Lydia: A System for Large-Scale News Analysis
      Levon Lloyd, Dimitrios Kechagias, Steven Skiena
      Department of Computer Science, State University of New York at Stony Brook
      Published in the 12th International Conference, SPIRE 2005, Buenos Aires, Argentina, November 2-4, 2005
  • 39. Abstract
      - This paper describes a text mining system called Lydia.
      - Periodical publications represent a rich and recurrent source of knowledge on both current and historical events.
      - The Lydia project seeks to build a relational model of people, places, and things through natural language processing of news sources and the statistical analysis of entity frequencies and co-locations.
      - Perhaps the most familiar news analysis system is Google News.
  • 40. Lydia Text Analysis System
      - Lydia is designed for high-speed analysis of online text.
      - Lydia performs a variety of interesting analyses on named entities in text, breaking them down by source, location, and time.
  • 41. Block Diagram of Lydia System
      Components: Document Extractor, POS Tagging, Syntax Tagging, Actor Classification, Rule Based Processing, Geographic Normalization, Synset Identification, Juxtaposition Analysis, Heatmap Generation, Database, Applications
  • 42. Process Involved
      - Spidering and article classification
      - Named entity recognition
      - Juxtaposition analysis
      - Co-reference set identification
      - Temporal and spatial analysis
  • 43. News Analysis with Lydia
      - Juxtapositional analysis
      - Spatial analysis
      - Temporal entity analysis
  • 44. Juxtaposition Analysis
      - A mental model of where an entity fits into the world depends largely upon how it relates to other entities.
      - For each entity, we compute a significance score for every other entity that co-occurs with it, and rank its juxtapositions by this score.

      Martin Luther King                 | Israel                   | North Carolina
      Entity                    Score    | Entity         Score     | Entity       Score
      Jesse Jackson             545.97   | Mahmoud Abbas  9,635.51  | Duke         2,747.8
      Coretta Scott King        454.51   | Palestinians   9,041.70  | ACC          1,666.92
      Atlanta, GA               286.73   | Ariel Sharon   3,620.84  | Virginia     1,283.61
      Ebenezer Baptist Church   260.84   | Gaza           4,391.05  | Wake Forest  1,554.92
  • 45. Cont.
      To determine the significance of a juxtaposition, they bound the probability that two entities would co-occur in as many articles as they actually do if occurrences were generated by a random process. To estimate this probability they use a Chernoff bound.
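The bound itself is not preserved in this transcript. As a stand-in, the sketch below uses the standard multiplicative Chernoff bound, P(X >= (1 + d) * mu) <= (e^d / (1 + d)^(1 + d))^mu, under a simple independence model. The model, the function names, and the use of the negative log of the bound as a score are assumptions for illustration and may differ from the formulation in the Lydia paper.

```python
import math

def chernoff_log_bound(n_a, n_b, n_both, n_articles):
    """Natural log of a Chernoff upper bound on P(X >= n_both).

    Assumed model: X counts the articles mentioning both entities when each of
    the n_articles articles independently mentions both with probability
    (n_a / n_articles) * (n_b / n_articles), so E[X] = n_a * n_b / n_articles.
    Returns 0.0 (a bound of 1) when n_both does not exceed the expected count.
    """
    mu = n_a * n_b / n_articles
    if mu == 0 or n_both <= mu:
        return 0.0
    delta = n_both / mu - 1.0
    # ln of (e^delta / (1 + delta)^(1 + delta))^mu
    return mu * (delta - (1.0 + delta) * math.log(1.0 + delta))

def significance(n_a, n_b, n_both, n_articles):
    """Hypothetical score: larger means the co-occurrence is less likely by chance."""
    return -chernoff_log_bound(n_a, n_b, n_both, n_articles)

# Hypothetical counts: entity A in 100 articles, B in 50, out of 10,000 articles.
print(significance(100, 50, 10, 10_000))
print(significance(100, 50, 5, 10_000))
```

Working in log space avoids underflow when the bound is extremely small, which is the interesting case for ranking juxtapositions.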
  • 46. Spatial Analysis
      - It is interesting to see where in the country people are talking about particular entities. Each newspaper has a location and a circulation, and each city has a population. These facts allow them to approximate a sphere of influence for each newspaper. The heat an entity generates in a city is then a function of its frequency of reference in each of the newspapers that have influence over that city.
  • 47. Cont.
  • 48. Temporal Analysis
      - The ability to track all references to entities, broken down by article type, makes it possible to monitor trends. The figure tracks the ebbs and flows in the interest in Michael Jackson as his trial progressed in May 2005.
  • 49. How is the paper related to DM?
      - To classify articles into categories such as news and sports, the Lydia system uses a Bayesian classifier.
      - A Bayesian classifier is a classification and prediction algorithm.
      - Data classification is a DM technique carried out in two stages:
          - building a model using a predetermined set of data classes
          - predicting the class of the input data
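The two stages of classification can be sketched with a small multinomial naive Bayes text classifier using add-one smoothing. This is a generic textbook variant, not the classifier actually used in Lydia, and the training headlines are made up.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Stage 1: build the model from (text, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def predict(model, text):
    """Stage 2: return the category with the highest posterior log-probability."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set with predetermined classes.
training = [
    ("team wins the championship game", "sports"),
    ("striker scores twice in the final game", "sports"),
    ("parliament passes the new budget", "news"),
    ("minister announces election date", "news"),
]
model = train(training)
print(predict(model, "the team scores in the game"))
print(predict(model, "minister passes budget"))
```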
  • 51. What is GIS?
      A GIS is a computer system capable of capturing, storing, analyzing, and displaying geographically referenced information.
      Example: a GIS might be used to find wetlands that need protection from pollution.
  • 52. How does a GIS work?
      - A GIS works by relating information from different sources.
      - The power of a GIS comes from the ability to relate different information in a spatial context and to reach a conclusion about this relationship.
      - Most of the information we have about our world contains a location reference, placing that information at some point on the globe.
  • 53. U.S. Geological Survey (USGS) Digital Line Graph (DLG) of roads.
  • 54. Digital Line Graph of rivers.
  • 55. Data capture
      - If the data to be used are not already in digital form:
          - Maps can be digitized by hand-tracing with a computer mouse.
          - Electronic scanners can also be used.
      - Coordinates for the maps can be collected using Global Positioning System (GPS) receivers.
      - Putting the information into the system involves identifying the objects on the map, their absolute location on the Earth's surface, and their spatial relationships.
  • 56. Data integration
      - A GIS makes it possible to link, or integrate, information that is difficult to associate through any other means.
  • 57. Mapmaking
      - Researchers are working to incorporate the mapmaking processes of traditional cartographers into GIS technology for the automated production of maps.
  • 58. What is special about GIS??  Information retrieval: What do you know about the swampy area at the end of your street? With a GIS you can "point" at a location, object, or area on the screen and retrieve recorded information about it from off-screen files. Using scanned aerial photographs as a visual guide, you can ask a GIS about the geology or hydrology of the area or even about how close a swamp is to the end of a street. This type of analysis allows you to draw conclusions about the swamp's environmental sensitivity.
  • 59. Cont.  Topological modeling: Have there ever been gas stations or factories that operated next to the swamp? Were any of these uphill from and within 2 miles of the swamp? A GIS can recognize and analyze the spatial relationships among mapped phenomena. Conditions of adjacency (what is next to what), containment (what is enclosed by what), and proximity (how close something is to something else) can be determined with a GIS.
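The proximity and containment conditions can be sketched in a few lines. This is an illustrative sketch, not how any particular GIS implements them: coordinates are assumed to be planar and measured in miles, elevations in feet, and every site and the swamp outline are invented. Adjacency is omitted for brevity.

```python
import math

def within_distance(a, b, max_dist):
    """Proximity: True if points a and b are at most max_dist apart."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) <= max_dist

def contains(polygon, point):
    """Containment: ray-casting test for a point inside a polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical swamp and sites: (name, (x, y) in miles, elevation in feet).
swamp_center, swamp_elevation = (0.0, 0.0), 10
swamp_outline = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
sites = [
    ("gas station A", (1.0, 1.0), 40),
    ("factory B", (3.0, 0.5), 60),
    ("gas station C", (0.5, -1.0), 5),
]

# "Uphill from and within 2 miles of the swamp"
uphill_nearby = [
    name for name, loc, elev in sites
    if within_distance(loc, swamp_center, 2.0) and elev > swamp_elevation
]
print(uphill_nearby)
print(contains(swamp_outline, (0.0, 0.0)), contains(swamp_outline, (3.0, 0.5)))
```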
  • 60. Cont.  Networks: When nutrients from farmland are running off into streams, it is important to know in which direction the streams flow and which streams empty into other streams. This is done by using a linear network. It allows the computer to determine how the nutrients are transported downstream. Additional information on water volume and speed throughout the spatial network can help the GIS determine how long it will take the nutrients to travel downstream.
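A linear network of this kind can be sketched as a directed graph in which each stream segment points to the segment it empties into. The segment names, lengths, and flow speeds below are hypothetical, and the travel time is simply length divided by speed summed along the downstream path.

```python
# Hypothetical stream network:
# segment -> (segment it empties into or None, length in km, flow speed in km/h)
streams = {
    "creek_a": ("river_main", 4.0, 2.0),
    "creek_b": ("river_main", 9.0, 3.0),
    "river_main": ("lake_outlet", 10.0, 5.0),
    "lake_outlet": (None, 0.0, 1.0),
}

def downstream_path(network, start):
    """Follow the flow direction from `start` until the network ends."""
    path, seen = [], set()
    segment = start
    while segment is not None:
        if segment in seen:
            raise ValueError(f"cycle detected at {segment}")
        seen.add(segment)
        path.append(segment)
        segment = network[segment][0]
    return path

def travel_time_hours(network, start):
    """Sum length / speed over every segment on the downstream path."""
    return sum(network[s][1] / network[s][2] for s in downstream_path(network, start))

print(downstream_path(streams, "creek_a"))
print(travel_time_hours(streams, "creek_a"), travel_time_hours(streams, "creek_b"))
```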
  • 61. Data Output  A critical component of a GIS is its ability to produce graphics on the screen or on paper to convey the results of analyses to the people who make decisions about resources.
  • 62. The future of GIS  GIS and related technology will help analyze large datasets, allowing a better understanding of terrestrial processes and human activities to improve economic vitality and environmental quality.
  • 63. How is it related to DM?  To present the data graphically, most likely as a graph, cluster analysis is performed on the data set.  Clustering is a data mining technique that groups data into clusters or classes.
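Clustering can be illustrated with a small k-means sketch. The slides do not say which clustering method is used, so k-means is an assumption made here for illustration, as are the sample points and the deterministic choice of the first k points as starting centroids.

```python
import math

def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters by repeated assign-and-update steps."""
    centroids = list(points[:k])  # assumption: first k points seed the centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest centroid.
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: math.hypot(p[0] - centroids[i][0], p[1] - centroids[i][1]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters

# Hypothetical map coordinates forming two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (0.5, 1.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
for center, members in zip(centroids, clusters):
    print(center, members)
```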
  • 64. ?