This document provides details about a course on Data Warehousing and Data Mining. The course is taught by Ms. Qurat-ul-Ain and covers topics such as data warehousing concepts, OLAP tools, data transformation, data mining algorithms, and decision trees. The course carries 3 credit hours and has DBMS as its prerequisite. Several textbooks on data warehousing and data mining are recommended.
The document provides an overview of key concepts in data warehousing and business intelligence, including:
1) It defines data warehousing concepts such as the characteristics of a data warehouse (subject-oriented, integrated, time-variant, non-volatile), grain/granularity, and the differences between OLTP and data warehouse systems.
2) It discusses the evolution of business intelligence and key components of a data warehouse such as the source systems, staging area, presentation area, and access tools.
3) It covers dimensional modeling concepts like star schemas, snowflake schemas, and slowly and rapidly changing dimensions.
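A slowly changing dimension keeps history when a descriptive attribute changes. The following is a minimal sketch of the common "Type 2" approach (close the current row, append a new version), not the method of any particular deck. The `CustomerRow` class, its field names, and the sample customer are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRow:
    """One version of a customer in a hypothetical customer dimension."""
    surrogate_key: int
    customer_id: str          # natural (business) key
    city: str                 # the attribute whose history is tracked
    valid_from: date
    valid_to: Optional[date]  # None means "still current"


def apply_scd2(dim: List[CustomerRow], customer_id: str,
               new_city: str, change_date: date) -> None:
    """Type 2 change: close the current row and append a new version."""
    current = next((r for r in dim
                    if r.customer_id == customer_id and r.valid_to is None), None)
    if current is not None and current.city == new_city:
        return  # attribute unchanged, so history stays as it is
    if current is not None:
        current.valid_to = change_date  # expire the old version
    next_key = max((r.surrogate_key for r in dim), default=0) + 1
    dim.append(CustomerRow(next_key, customer_id, new_city, change_date, None))


dim: List[CustomerRow] = []
apply_scd2(dim, "C001", "Lahore", date(2023, 1, 1))
apply_scd2(dim, "C001", "Karachi", date(2024, 6, 1))
for row in dim:
    print(row)
```

After the two calls the dimension holds two rows for the same customer: the expired Lahore version and the current Karachi version, each with its own surrogate key.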
This document provides an introduction and overview of a course on data warehousing and data mining. It discusses the need for data warehousing to help organizations make intelligent decisions based on analyzing large amounts of integrated historical data. A data warehouse combines data from multiple sources and stores it in a way that supports ad-hoc querying. It allows knowledge workers to analyze patterns and trends in the data to address business questions. The document contrasts this with traditional transaction systems and management information systems.
This document provides an overview of data warehousing. It defines data warehousing as collecting data from multiple sources into a central repository for analysis and decision making. The document outlines the history of data warehousing and describes its key characteristics like being subject-oriented, integrated, and time-variant. It also discusses the architecture of a data warehouse including sources, transformation, storage, and reporting layers. The document compares data warehousing to traditional DBMS and explains how data warehouses are better suited for analysis versus transaction processing.
The document provides an overview of data warehousing, decision support, online analytical processing (OLAP), and data mining. It discusses what data warehousing is, how it can help organizations make better decisions by integrating data from various sources and making it available for analysis. It also describes OLAP as a way to transform warehouse data into meaningful information for interactive analysis, and lists some common OLAP operations like roll-up, drill-down, slice and dice, and pivot. Finally, it gives a brief introduction to data mining as the process of extracting patterns and relationships from data.
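The OLAP operations named above can be illustrated on a tiny in-memory fact set. This is a sketch using only the Python standard library; the sales figures, field names, and the `aggregate` helper are invented for illustration and do not come from any OLAP tool.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, region, product, amount)
SALES = [
    (2023, "Q1", "North", "Laptop", 100),
    (2023, "Q1", "South", "Laptop", 80),
    (2023, "Q2", "North", "Phone", 60),
    (2024, "Q1", "North", "Laptop", 120),
    (2024, "Q1", "South", "Phone", 90),
]
FIELDS = ("year", "quarter", "region", "product", "amount")
ROWS = [dict(zip(FIELDS, r)) for r in SALES]


def aggregate(rows, dims):
    """Sum 'amount' grouped by the given dimension names."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)


# Roll-up: summarize at the coarser year level.
by_year = aggregate(ROWS, ["year"])
# Drill-down: bring quarter back in for finer detail.
by_year_quarter = aggregate(ROWS, ["year", "quarter"])
# Slice: fix one dimension to a single value.
north_only = [r for r in ROWS if r["region"] == "North"]
# Dice: restrict several dimensions at once.
diced = [r for r in ROWS if r["region"] == "North" and r["product"] == "Laptop"]
# Pivot: present regions as rows and years as columns.
pivot = defaultdict(dict)
for (region, year), total in aggregate(ROWS, ["region", "year"]).items():
    pivot[region][year] = total

print(by_year)
print(dict(pivot))
```

Roll-up and drill-down differ only in how many dimensions are passed to the grouping step, which is why a single helper covers both here.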
Types of database processing, OLTP vs. Data Warehouses (OLAP), Subject-oriented, Integrated, Time-variant, Non-volatile, Functionalities of Data Warehouse, Roll-Up (Consolidation), Drill-down, Slicing, Dicing, Pivot, KDD Process, Application of Data Mining
The document provides an overview of data warehousing and decision support systems. It discusses how data warehouses evolved from databases used for transaction processing to integrated databases designed for analysis and decision making. Key points include:
- Data warehouses store historical data from multiple sources to support analysis and decision making.
- They address limitations of transactional databases that are optimized for real-time queries rather than complex analysis.
- Effective data warehousing requires resolving data conflicts, documenting assumptions, and learning from mistakes in the implementation process.
This document discusses building a data warehouse. It defines key components of a data warehouse including the data warehouse database, transformation tools, metadata, access tools, and data marts. It describes two common approaches to building a data warehouse - top-down and bottom-up. Top-down involves building a centralized data warehouse first while bottom-up involves building departmental data marts initially. The document also outlines considerations for designing, implementing, and accessing a data warehouse.
The document discusses key concepts related to data warehousing including:
1) What data warehousing is, its main components, and differences from OLTP systems.
2) The typical architecture of a data warehouse including operational data sources, storage, and end-user access tools.
3) Important considerations like data flows, integration, management of metadata, and tools/technologies used.
4) Additional topics such as benefits, challenges, administration, and data marts.
The document discusses building a data warehouse. It defines a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data used for decision making. It describes the components of a data warehouse including staging, data warehouse database, transformation tools, metadata, data marts, access tools and administration. It also discusses approaches to building a data warehouse, design considerations, implementation steps, extraction/transformation tools, and user levels. The benefits of a data warehouse include locating the right information, presentation of information, testing hypotheses, discovery of information, and sharing analysis.
The document provides an overview of data warehousing and data mining. It discusses what a data warehouse is, how it is structured, and how it can help organizations make better decisions by integrating data from multiple sources and facilitating online analytical processing (OLAP). It also covers key components of a data warehousing architecture, such as the data manager, data acquisition, the metadata repository, and the middleware that connects the data warehouse to operational databases and analytical tools.
What is OLAP - Data Warehouse Concepts - IT Online Training @ Newyorksys (NEWYORKSYS-IT SOLUTIONS)
NEWYORKSYSTRAINING offers IT online training and IT consulting services.
A data warehouse is a subject-oriented, consolidated collection of integrated data from multiple sources used to support management decision making. It is separate from operational databases and contains historical data for analysis. Data warehouses use a star schema with fact and dimension tables and support online analytical processing (OLAP) for complex analysis and reporting.
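A star schema places one fact table at the center and joins it to surrounding dimension tables. The sketch below builds a small one in an in-memory SQLite database using Python's standard `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and rows are hypothetical, chosen only to show the shape of a typical analytical query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table holds measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    units INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Computers'), (2, 'Phone', 'Mobile');
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 20240101, 2, 2000.0),
    (2, 20240101, 5, 1500.0),
    (1, 20240201, 1, 1000.0);
""")

# Join the fact table to its dimensions, then aggregate the measure.
query = """
SELECT p.category, d.year, SUM(f.revenue) AS revenue
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_key = f.product_key
JOIN dim_date AS d ON d.date_key = f.date_key
GROUP BY p.category, d.year
ORDER BY p.category
"""
result = conn.execute(query).fetchall()
print(result)
```

Every analytical query against this layout follows the same pattern: filter and group on dimension attributes, aggregate the measures stored in the fact table.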
The document discusses data warehousing concepts and technologies. It defines a data warehouse as a subject-oriented, integrated, time-variant, and non-volatile collection of data used for decision making. Key aspects covered include multidimensional data modeling using facts, dimensions, and cubes; data warehouse architectures; and efficient cube computation methods such as ROLAP-based algorithms.
This document provides an overview of data warehousing and related concepts. It begins with definitions of key terms like data warehousing, data marts, and OLAP. It then covers the history and evolution of data warehousing in organizations. The document outlines the typical architecture of a data warehouse, including sources, integration, and metadata. It discusses benefits like providing a customer-centric view and removing barriers between functions. It also notes some disadvantages like latency and maintenance costs. Finally, it briefly touches on strategic uses, data mining, and text mining.
The document discusses building a data warehouse, including approaches and design considerations. It describes a top-down approach, which builds an enterprise data warehouse as a centralized repository, and a bottom-up approach, which builds departmental data marts incrementally. Successful data warehouses are based on a dimensional model and contain both historical and current integrated data from multiple sources at detailed and summarized levels.
What is a Data Warehouse and How Do I Test It? (RTTS)
ETL Testing: A primer for Testers on Data Warehouses, ETL, Business Intelligence and how to test them.
Are you hearing and reading about Big Data, Enterprise Data Warehouses (EDW), the ETL Process and Business Intelligence (BI)? The software markets for EDW and BI are quickly approaching $22 billion, according to Gartner, and Big Data is growing at an exponential pace.
Are you being tasked to test these environments or would you like to learn about them and be prepared for when you are asked to test them?
RTTS, the Software Quality Experts, provided this groundbreaking webinar, based upon our many years of experience in providing software quality solutions for more than 400 companies.
You will learn the answer to the following questions:
• What is Big Data and what does it mean to me?
• What are the business reasons for building a Data Warehouse and for using Business Intelligence software?
• How do Data Warehouses, Business Intelligence tools and ETL work from a technical perspective?
• Who are the primary players in this software space?
• How do I test these environments?
• What tools should I use?
This slide deck is geared towards:
• QA Testers
• Data Architects
• Business Analysts
• ETL Developers
• Operations Teams
• Project Managers
• ...and anyone else who (a) is new to the EDW space, (b) wants to be educated in the business and technical sides, or (c) wants to understand how to test them.
The document defines a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data to support management decision making. A data warehouse is maintained separately from operational databases and provides a platform for consolidated historical data analysis. Key features of a data warehouse include dimensional modeling using facts, dimensions, and star or snowflake schemas.
Business Intelligence Data Warehouse System (Kiran kumar)
This document provides an overview of data warehousing and business intelligence concepts. It discusses:
- What a data warehouse is and its key properties like being integrated, non-volatile, time-variant and subject-oriented.
- Common data warehouse architectures including dimensional modeling, ETL processes, and different layers like the data storage layer and presentation layer.
- How data marts are subsets of the data warehouse that focus on specific business functions or departments.
- Different types of dimension tables and slowly changing dimensions.
- How business intelligence uses the data warehouse for analysis, querying, reporting and generating insights to help with decision making.
The document provides an overview of data mining and data warehousing concepts through a series of lectures. It discusses the evolution of database technology and data analysis, defines data mining and knowledge discovery, describes data mining functionalities like classification and clustering, and covers data warehouse concepts like dimensional modeling and OLAP operations. It also presents sample queries in a proposed data mining query language.
The document discusses various concepts related to database design and data warehousing. It describes how DBMS minimize problems like data redundancy, isolation, and inconsistency through techniques like normalization, indexing, and using data dictionaries. It then discusses data warehousing concepts like the need for data warehouses, their key characteristics of being subject-oriented, integrated, and time-variant. Common data warehouse architectures and components like the ETL process, OLAP, and decision support systems are also summarized.
The document provides information about data warehousing fundamentals. It discusses key concepts such as data warehouse architectures, dimensional modeling, fact and dimension tables, and metadata. The three common data warehouse architectures described are the basic architecture, architecture with a staging area, and architecture with staging area and data marts. Dimensional modeling is optimized for data retrieval and uses facts, dimensions, and attributes. Metadata provides information about the data in the warehouse.
Data Warehouse – Introduction, characteristics, architecture, schema and modelling, differences between operational database systems and data warehouse.
The document provides an overview of data warehousing concepts. It defines a data warehouse as a subject-oriented, integrated, time-variant, and non-volatile collection of data. It discusses the differences between OLTP and OLAP systems. It also covers data warehouse architectures, components, and processes. Additionally, it explains key concepts like facts and dimensions, star schemas, normalization forms, and metadata.
This document discusses data warehousing and integration. It outlines the challenges of integrating data from heterogeneous sources and managing data across large enterprises. It describes how data warehousing addresses these issues by collecting and combining data in advance and storing it in a warehouse for direct querying and analysis. Key aspects of data warehousing covered include architectures, types of data stored, extraction of data from sources, and issues in designing and maintaining a data warehouse.
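Integrating heterogeneous sources largely means mapping differing field names, codes, and formats onto one agreed representation before loading. The following sketch assumes two hypothetical extracts (`CRM_ROWS` and `BILLING_ROWS`) that describe customers in different conventions. All names, codes, and values are invented for illustration.

```python
from datetime import date, datetime

# Two hypothetical source extracts that encode the same kind of fact differently.
CRM_ROWS = [{"cust": "C001", "gender": "M", "joined": "2023-01-15"}]
BILLING_ROWS = [{"customer_id": "C002", "sex": "female", "signup": "15/02/2023"}]

# One agreed warehouse encoding for gender, whatever the source uses.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}


def from_crm(row):
    """Map a CRM record onto the warehouse layout (ISO dates at source)."""
    return {
        "customer_id": row["cust"],
        "gender": GENDER_MAP[row["gender"].lower()],
        "joined": date.fromisoformat(row["joined"]),
    }


def from_billing(row):
    """Map a billing record onto the warehouse layout (day/month/year at source)."""
    return {
        "customer_id": row["customer_id"],
        "gender": GENDER_MAP[row["sex"].lower()],
        "joined": datetime.strptime(row["signup"], "%d/%m/%Y").date(),
    }


# Load step: both sources end up in one uniform collection.
warehouse = [from_crm(r) for r in CRM_ROWS] + [from_billing(r) for r in BILLING_ROWS]
for record in warehouse:
    print(record)
```

Because the conversion happens once at load time, later queries can compare customers from either source without knowing where a record originated.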
This document discusses data warehousing and integration. It outlines the challenges of integrating data from heterogeneous sources and managing data across large enterprises. It describes how data warehousing addresses these issues by collecting and combining data in advance into a centralized warehouse for unified access and querying. Key aspects of data warehousing covered include architectures, extraction of data from sources, integration of data, and ongoing maintenance of the warehouse.
2. Course Details
Course Title: Data Warehousing & Data Mining
Credit Hours: 3
Course Prerequisite: DBMS
3. Course Contents
Data Warehousing Concepts, Data Warehousing System And Components
Data Transformation Process Functions
Online Analytical Processing (OLAP) And OLAP Tools.
Data Crawling & Programming With Python
Data Warehousing Applications
Concepts Of Data Mining
Data Pre-processing, Pre-mining And Outlier Detection
Data Mining Learning Methods & Data Mining Classes (Association Rule Mining, Clustering, Classification)
Fundamentals of Other Algorithms Related To Data Mining (Fuzzy Logic, Genetic Algorithm And Neural Network)
Decision Trees
Web Mining
4. Text Books
Data Warehousing Fundamentals - Paulraj Ponniah
The Data Warehouse Toolkit by Ralph Kimball - John Wiley & Sons Publications.
Decision Support in the Data Warehouse by Paul Gray, Hugh J. Watson - Prentice Hall.
Jiawei Han, “Data Mining: Concepts and Techniques”, Second Edition and above
Data Mining and Analysis: Fundamental Concepts and Algorithms, 1st Edition, M. Zaki & W. Meira
Data Mining: Concepts and Techniques, 3rd Edition, Jiawei Han, Micheline Kamber, Jian Pei, 2011
Anything that you can find to help you learn.
5. History of IT
The “dark ages”: paper forms in file cabinets
Computerized systems emerge
Initially for big projects like Social Security
Same functionality as old paper-based systems
The “golden age”: databases are everywhere
Most activities tracked electronically
Stored data provides detailed history of activity
The next step: use data for decision-making
The focus of this course!
Made possible by omnipresence of IT
Identify inefficiencies in current processes
Quantify likely impact of decisions
6. What Is a Data Warehouse?
In many organizations, we want a central “store” of all of our entities,
concepts, metadata, and historical information
For doing data validation, complex mining, analysis, prediction, …
This is the data warehouse
To this point we’ve focused on scenarios where the data “lives” in the
sources – here we may have a “master” version (and archival version) in a
central database
For performance reasons, availability reasons, archival reasons, …
7. What Is a Data Warehouse?
More specifically, a collective data repository:
– Containing snapshots of the operational data (history)
– Obtained through data cleansing (Extract-Transform-Load process)
– Useful for analytics
8. What Is a Data Warehouse?
Experts say…
– Ralph Kimball: “a copy of transaction data specifically structured for
query and analysis”
– Bill Inmon: “A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions.”
9. Properties of a Data Warehouse?
Subject-oriented
– The data in the DWH is organized so that all the data elements relating to the same real-world event or object are linked together
– Typical subject areas in DWs are Customer, Product, Order, Claim, Account,…
10. Properties of a Data Warehouse?
Non-Volatile
– Data in the DW is never over-written or deleted - once committed, the data
is static, read-only, and retained for future reporting
– Data is loaded, but not updated
– When subsequent changes occur, a new version or snapshot record is
written,…
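A minimal Python sketch of the non-volatile idea: rows are only appended, and a change arrives as a new version rather than an overwrite. The class names, fields, and sample cities are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a loaded row cannot be modified afterwards
class CustomerVersion:
    customer_id: int
    city: str
    valid_from: date


class CustomerHistory:
    """Append-only store: rows are loaded, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, customer_id, city, valid_from):
        # A change in the source system becomes a new version row.
        self._rows.append(CustomerVersion(customer_id, city, valid_from))

    def as_of(self, customer_id, when):
        """Return the version that was current on the given date, or None."""
        candidates = [
            r for r in self._rows
            if r.customer_id == customer_id and r.valid_from <= when
        ]
        return max(candidates, key=lambda r: r.valid_from, default=None)


history = CustomerHistory()
history.load(1, "Lahore", date(2020, 1, 1))
history.load(1, "Karachi", date(2022, 6, 1))  # customer moved: new snapshot
```

Because old versions are kept, a report can ask what was true on any past date, for example `history.as_of(1, date(2021, 1, 1))`.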
11. Properties of a Data Warehouse?
Time-varying
– The changes to the data in the DW are tracked and recorded so that
reports show changes over time
– Different environments have different time horizons associated
• While for operational systems a 60-to-90 day time horizon is
normal, DWs have a 5-to-10 year horizon
12. General Definition
– A large repository of some organization’s electronically stored data
– Specifically designed to facilitate reporting and analysis
13. Characteristics of DW
Subject oriented: Data are organized by how users refer to them
Integrated: Inconsistencies are removed in both nomenclature and conflicting information (i.e. data are ‘clean’)
Non-volatile: Read-only data. Data do not change over time.
Time variant: Data are time series, not current status
14. Subject Oriented
Data Warehouse is designed around
“subjects” rather than processes
A company may have
Retail Sales System
Outlet Sales System
Catalog Sales System
DW will have a Sales Subject Area
16. Integrated
Heterogeneous Source Systems
Need to Integrate source data
For Example: Product codes could be different in different systems
Arrive at common code in DW
Information integrated in advance
Stored in DW for direct querying and analysis
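The product-code example above can be sketched in Python as follows. The source names, code values, and mapping table are all made up for illustration; a real warehouse would keep such mappings in its metadata.

```python
# Hypothetical mapping from each source system's product code
# to the common code used in the warehouse.
CODE_MAP = {
    "retail": {"R-100": "P001", "R-200": "P002"},
    "catalog": {"CAT_A": "P001", "CAT_B": "P002"},
}


def integrate(source, rows):
    """Translate source-specific codes into the common warehouse code."""
    mapping = CODE_MAP[source]
    return [
        {"product": mapping[r["code"]], "amount": r["amount"], "source": source}
        for r in rows
    ]


def total_by_product(rows):
    """Sum sales per common product code across all sources."""
    totals = {}
    for r in rows:
        totals[r["product"]] = totals.get(r["product"], 0.0) + r["amount"]
    return totals


retail_rows = [{"code": "R-100", "amount": 50.0}, {"code": "R-200", "amount": 20.0}]
catalog_rows = [{"code": "CAT_A", "amount": 30.0}]

# Integration happens once, at load time, so later queries see one code per product.
warehouse = integrate("retail", retail_rows) + integrate("catalog", catalog_rows)
```

With the codes unified, `total_by_product(warehouse)` can add retail and catalog sales of the same product directly.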
18. Non-Volatile
Operational update of data does not occur in the data warehouse
environment.
Does not require transaction processing, recovery, and concurrency
control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data.
20. Time Variant
The time horizon for the data warehouse is significantly longer than that of
operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective
(e.g., past 5-10 years)
21. Time Variant
Most business analysis has a time component
Trend Analysis (historical data is required)
[Figure: sales trend by year, 2001 to 2004]
22. Why a Data Warehouse (DWH)?
Data recording and storage is growing.
History is an excellent predictor of the future.
Gives a total view of the organization.
Intelligent decision support is required for decision-making.
23. Reason-1: Why a Data Warehouse?
Data sets are growing.
The size of data sets is going up.
The cost of data storage is coming down.
The amount of data the average business collects and stores is doubling every year.
Total hardware and software cost to store and manage 1 MB of data:
1990: ~$15
2002: ~15¢ (down 100 times)
By 2007: <1¢ (down 150 times)
24. Reason-1: Why a Data Warehouse?
A Few Examples
WalMart: 24 TB
France Telecom: ~100 TB
CERN: Up to 20 PB by 2006
Stanford Linear Accelerator Center (SLAC): 500 TB
27. Reason-2: Why a Data Warehouse?
Businesses demand Intelligence (BI).
Complex questions from integrated data.
“Intelligent Enterprise”
28. Reason-2: Why a Data Warehouse?
DBMS Approach
List of all items that were sold last month?
List of all items purchased by Tariq Majeed?
The total sales of the last month grouped by branch?
How many sales transactions occurred during the month of January?
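Questions of the DBMS kind are known in advance, so each one maps to a fixed SQL query. A small sketch using Python's built-in `sqlite3` module follows; the `sales` table, its columns, the sample rows, and the choice of January 2024 as "last month" are assumptions made only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (item TEXT, customer TEXT, branch TEXT, "
    "sale_date TEXT, amount REAL)"
)
# Made-up sample rows.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("Tea", "Tariq Majeed", "Lahore", "2024-01-05", 300.0),
        ("Rice", "Tariq Majeed", "Lahore", "2024-01-20", 900.0),
        ("Tea", "Sara Khan", "Karachi", "2024-01-11", 300.0),
        ("Soap", "Sara Khan", "Karachi", "2024-02-02", 150.0),
    ],
)

# List of all items purchased by Tariq Majeed.
items_by_customer = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT item FROM sales WHERE customer = ? ORDER BY item",
        ("Tariq Majeed",),
    )
]

# Total sales of the month grouped by branch.
sales_by_branch = dict(
    conn.execute(
        "SELECT branch, SUM(amount) FROM sales "
        "WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31' "
        "GROUP BY branch"
    )
)

# Number of sales transactions during January.
january_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE sale_date LIKE '2024-01-%'"
).fetchone()[0]
```

Each query returns exactly what was asked and nothing more, which is the limit of this approach: it cannot suggest which question to ask.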
29. Reason-2: Why a Data Warehouse?
Intelligent Enterprise
Which items sell together? Which items to stock?
Where and how to place the items? What discounts to offer?
How best to target customers to increase sales at a branch?
Which customers are most likely to respond to my next promotional campaign, and why?
30. Reason-3: Why a Data Warehouse?
Businesses want much more…
Stages of Data Warehouse:
What happened?
Why did it happen?
What will happen?
What is happening?
What do you want to happen?
31. What is a Data Warehouse?
A complete repository of historical corporate data extracted from transaction systems that is available for ad-hoc access by knowledge workers.
Complete repository
History
Transaction System
Ad-Hoc access
Knowledge workers
32. What is a Data Warehouse?
Transaction System
Management Information System (MIS)
Could be typed sheets (NOT transaction system)
Ad-Hoc access
Does not have a fixed access pattern.
Queries not known in advance.
Difficult to write SQL in advance.
Knowledge workers
Typically NOT IT literate (Executives, Analysts, Managers).
NOT clerical workers.
Decision makers.
33. Features of a DWH
– DW typically
– Reside on computers dedicated to this function
– Run on enterprise scale DBMS such as Oracle, IBM DB2,
Teradata, or Microsoft SQL Server
– Retain data for long periods of time
– Consolidate data obtained from a variety of sources
– Are built around their own carefully designed data model
34. Data Management in Enterprises
Vertical fragmentation of informational systems
Result of application (user)-driven development of operational systems
[Figure: operational systems split by department (Sales, Administration, Finance, Manufacturing, ...), with separate applications such as Sales Planning, Stock Management, Suppliers, Debt Management, Num. Control, and Inventory]
35. Data Management in Enterprises
Two approaches for accessing data:
Query-Driven (Lazy)
Warehouse (Eager)
[Figure: two data sources to be accessed]
36. The Need for DW
Query-driven (lazy, on-demand)
[Figure: clients query an integration system (with metadata), which reaches each source through a wrapper]
38. The Warehousing Approach
Information integrated in advance
Stored in WH for direct querying and analysis
[Figure: an extractor/monitor on each source feeds an integration system (with metadata), which loads the data warehouse queried by clients]
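The difference between the query-driven (lazy) and warehousing (eager) approaches can be sketched in a few lines of Python. The two in-memory "sources", the call counter, and the `Warehouse` class are hypothetical stand-ins for real source systems and extractors.

```python
# Hypothetical operational sources holding sales amounts.
SOURCES = {"retail": [120.0, 80.0], "catalog": [40.0]}
calls = {"count": 0}  # counts how often the sources are touched


def read_source(name):
    """Stand-in for a wrapper or extractor that reads one source."""
    calls["count"] += 1
    return list(SOURCES[name])


def lazy_total():
    """Query-driven: every query goes back to all sources."""
    return sum(sum(read_source(name)) for name in SOURCES)


class Warehouse:
    """Eager: data is copied in advance, queries run on the local copy."""

    def __init__(self):
        self.rows = []

    def refresh(self):
        self.rows = [amt for name in SOURCES for amt in read_source(name)]

    def total(self):
        return sum(self.rows)
```

After one `refresh()`, any number of `total()` calls leave the sources untouched, whereas each `lazy_total()` call reads every source again. The trade-off is that the warehouse copy is only as fresh as its last refresh.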
39. Advantages of DWH Approach
High query performance
Doesn’t interfere with local processing at sources
Information copied at warehouse
Can modify, annotate, summarize, restructure, etc.
Can store historical information
Security, no auditing
40. Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras,
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining, the automated analysis of massive data sets
41. Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
42. What Is Data Mining?
Alternative name
Knowledge discovery in databases (KDD)
Watch out: Is everything “data mining”?
Query processing
Expert systems or statistical programs
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge amount of
data
43. What Is Data Mining?
Let’s start data mining with an interesting statement.
The statement, given by Donald Rumsfeld, Defense Secretary of the USA, in an interview, is as under.
As we know, there are known knowns. There are things we know that we know, like your name and your parents’ names. We also know there are known unknowns.
That is to say, we know that there are some things we do not know, like what someone is thinking about you, what you will eat after six days, what the result of a lottery will be, and so on.
But there are also unknown unknowns, the ones we don't know that we don't know.
Are they beneficial if you know them? Or is it harmful not to know them?
44. What Is Data Mining?
There are also unknown knowns, things we'd like to know, but don't know, but know someone who can doctor them and pass them off as known knowns. To associate Rumsfeld’s above quotation with data mining, we identify four core phrases as
1. Known knowns
2. Known unknowns
3. Unknown unknowns
4. Unknown knowns
The items 1, 3, and 4 deal with “Knowns”. Data mining has relevance to the third point.
It is the art of digging out what exactly we don’t know that we must know in our business.
The methodology is to first convert “unknown unknowns” into “known unknowns” and then finally into “known knowns”.
45. What is Data Mining?: Slightly Informal
Tell me something that I should know. When you don’t know what you should be knowing, how do you write SQL?
You can’t!
“Tell me something that I should know” means you ask your DWH or data repository to tell you something that you don’t know but should know. Since we don’t know what we actually don’t know and what we must know, we can’t write SQL queries to get answers as we do in OLTP systems.
Data mining is an exploratory approach, where browsing through data using data mining techniques may reveal something of interest to the user that was previously unknown. Hence, in data mining we don’t know the results in advance.
46. Why Data Mining?—Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management (CRM),
market basket analysis, market segmentation
Risk analysis and management
Forecasting, customer retention, quality control, competitive
analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (news group, email, documents) and Web mining
Stream data mining
Bioinformatics and bio-data analysis
47. Market Analysis and Management
Where does the data come from?
Credit card transactions, discount coupons, customer complaint calls
Target marketing
Find clusters of “model” customers who share the same characteristics:
interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis
Associations/co-relations between product sales, & prediction based on such
association
Customer profiling
What types of customers buy what products
Customer requirement analysis
Identifying the best products for different customers
Predict what factors will attract new customers
48. Fraud Detection & Mining Unusual Patterns
Approaches: Clustering & model construction for frauds, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
Medical insurance
Professional patients, and ring of doctors
Unnecessary or correlated screening tests
Telecommunications:
Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
Retail industry
Analysts estimate that 38% of retail shrink is due to dishonest
employees
49. Other Applications
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to Web access logs for
market-related pages to discover customer preference and behavior
pages, analyzing effectiveness of Web marketing, improving Web
site organization, etc.
50. Data Mining: A KDD Process
Data mining: core of the knowledge discovery process
[Figure: KDD pipeline: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
51. Steps of a KDD Process
Learning the application domain
Relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction.
Choosing functions of data mining
Summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
Visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
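The core KDD steps can be chained as small functions. This is only a toy sketch in Python: the records, the cleaning rule, the "mining" step (simple frequency counting), and the `min_count` threshold are all assumptions made for illustration, not a prescribed method.

```python
from collections import Counter

# Made-up raw records, including one incomplete and one noisy row.
raw = [
    {"customer": "c1", "item": "tea", "amount": 3.0},
    {"customer": "c2", "item": None, "amount": 2.0},    # incomplete record
    {"customer": "c3", "item": "tea", "amount": 4.0},
    {"customer": "c4", "item": "rice", "amount": 9.0},
    {"customer": "c5", "item": "tea", "amount": -1.0},  # noisy record
]


def clean(records):
    """Data cleaning: drop incomplete rows and impossible amounts."""
    return [r for r in records if r["item"] is not None and r["amount"] > 0]


def select(records):
    """Data selection and reduction: keep only the task-relevant attribute."""
    return [r["item"] for r in records]


def mine(items):
    """Data mining: here, simply count how often each item occurs."""
    return Counter(items)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that occur often enough."""
    return {item: n for item, n in patterns.items() if n >= min_count}


patterns = evaluate(mine(select(clean(raw))), min_count=2)
```

Most of the work sits in `clean` and `select`, which matches the point above that preprocessing can take the majority of the effort.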
52. Architecture: Typical Data Mining System
[Figure: layered architecture, top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, data integration, and filtering; databases and data warehouse; with a knowledge base alongside]
54. Data mining evolved as a mechanism to address the limitations of OLTP systems in dealing with massive data sets with high dimensionality, new data types, multiple heterogeneous data sources, etc.
The conventional systems couldn’t keep pace with the ever-changing and ever-increasing data sets.
Data mining algorithms are built to deal with high-dimensional data, new data types (images, video, etc.), complex associations among data items, distributed data sources, and associated issues (security, etc.).
55. How Is Data Mining Different?
Traditional Database (Transactions): querying data in well-defined processes; reliable storage.
56. Data Mining: On What Kinds of Data?
Relational database
Data warehouse
Transactional database
Advanced database and information repository
Spatial and temporal data
Time-series data
Stream data
Multimedia database
Text databases & WWW
57. Data Mining Functionalities
Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics
Association (correlation and causality)
Diaper → Beer [support 0.5%, confidence 75%]
Classification and Prediction
Construct models (functions) that describe and distinguish classes or
concepts for future prediction
Presentation: decision-tree, classification rule, neural network
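The two numbers attached to an association rule are its support and confidence. A short Python sketch of how they are computed follows; the eight shopping baskets are invented so that the confidence of the diaper and beer rule comes out to 75%.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also
    contain consequent. Assumes the antecedent occurs at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Made-up baskets for illustration.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"bread", "milk"},
    {"beer"},
    {"milk"},
    {"bread"},
]

rule_support = support(baskets, {"diaper", "beer"})          # 3 of 8 baskets
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})  # 3 of 4 diaper baskets
```

Support says how common the pattern is overall; confidence says how reliable the rule is once the left-hand side is present.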
58. Data Mining Functionalities
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: a data object that does not comply with the general behavior
of the data
Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
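One simple way to flag outliers is the interquartile-range rule (Tukey's fences), sketched here in Python with the standard `statistics` module. This is one technique among many, not the method of any particular tool; the call durations and the conventional factor of 1.5 are assumptions for the example.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up call durations in minutes; one call is far from the usual pattern.
call_minutes = [10, 12, 11, 13, 12, 11, 95]
suspicious = find_outliers(call_minutes)
```

In a fraud-detection setting the flagged values are candidates for review, not proof of fraud, since rare but legitimate events are also outliers.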
59. Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: Not all of them are
interesting
Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or
test data with some degree of certainty, potentially useful, novel, or validates
some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty.
60. Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on database systems, statistics, machine learning, visualization, algorithms, and other disciplines]
61. Data Mining: Classification Schemes
Different views, different classifications
Kinds of data to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
62. Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text,
multi-media, heterogeneous, WWW
Knowledge to be mined
Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier analysis,
etc.
Multiple/integrated functions and mining at multiple
levels
63. Multi-Dimensional View of Data Mining
Techniques utilized
Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-
data mining, stock market analysis, Web mining, etc.
64. OLAP Mining: Integration of Data Mining and Data Warehousing
Data mining systems, DBMS, Data warehouse systems
coupling
On-line analytical mining data
Integration of mining and OLAP technologies
Interactive mining multi-level knowledge
Necessity of mining knowledge and patterns at different levels of
abstraction.
Integration of multiple mining functions
Characterized classification, first clustering and then association
67. Data Mining
A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems of neurons, either organic or artificial in nature.
Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.
68. Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data
types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing
one: knowledge fusion
69. Major Issues in Data Mining
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of
abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
70. Summary
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide
applications
A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining
71. Where to Find References?
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
Data mining and KDD
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journal: Data Mining and Knowledge Discovery, KDD Explorations
Database systems
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
Journals: Machine Learning, Artificial Intelligence, etc.
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. visualization and computer graphics, etc.