DATA WAREHOUSING AND MINING
Akhil Singhal 3263
Aunj Gaikwad 3268
Anushka Srivastava 3269
Rahul Raisinghani 3293
Russall DMello 3322
Angad Chattwal 3323
What is a Data Warehouse?
A data warehouse is an information system that contains historical and cumulative data
from single or multiple sources. It simplifies the reporting and analysis process of the
organization.
It also serves as a single version of truth for a company's decision making and forecasting.
Characteristics of a Data Warehouse
A data warehouse has the following characteristics:
Subject-Oriented
Integrated
Time-variant
Non-volatile
 Data Warehouse Architectures
There are mainly three types of data warehouse architectures:
 Single-tier architecture
The objective of a single layer is to minimize the amount of data stored by
removing data redundancy. This architecture is not frequently used in practice.
 Two-tier architecture
A two-layer architecture physically separates the available sources from the data warehouse. This
architecture is not expandable and does not support a large number of end users. It
also has connectivity problems because of network limitations.
 Three-tier architecture
 This is the most widely used architecture.
It consists of the Top, Middle and Bottom Tier.
 Bottom Tier: The database of the data warehouse serves as the bottom tier. Data is
cleansed, transformed, and loaded into this layer using back-end tools.
 Middle Tier: The middle tier in a data warehouse is an OLAP server, which is
implemented using either the ROLAP or the MOLAP model.
 Top Tier: The top tier is a front-end client layer. It holds the tools and APIs that you
use to connect to the data warehouse and get data out of it.
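The three tiers can be illustrated with a small sketch. This is only an illustration under stated assumptions: an in-memory SQLite table stands in for the warehouse database, the `sales` table and the `roll_up` function are hypothetical names, and a single SQL GROUP BY stands in for a ROLAP-style middle tier.

```python
import sqlite3

# Bottom tier: the warehouse database (an in-memory SQLite table stands in for it).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Table", 120.0), ("North", "Chair", 40.0),
     ("South", "Table", 100.0), ("South", "Chair", 60.0)],
)

# Only these column names may be interpolated into the query text.
ALLOWED_DIMENSIONS = {"region", "product"}

def roll_up(connection, dimension):
    """Middle tier: a ROLAP-style aggregation, translated into a SQL GROUP BY."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    query = (
        f"SELECT {dimension}, SUM(amount) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return dict(connection.execute(query).fetchall())

# Top tier: a front-end client asks for totals and displays them.
totals_by_region = roll_up(conn, "region")
for region, total in totals_by_region.items():
    print(region, total)
```

The client code never touches the table directly; it only calls the middle-tier function, which is the separation the three-tier design aims for.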
Data Warehouse Components
There are mainly five components of a data warehouse:
 Data Warehouse Database
 The central database is the foundation of the data
warehousing environment. This database is implemented on
RDBMS technology.
 Sourcing, Acquisition, Clean-up and Transformation Tools
(ETL)
 The data sourcing, transformation, and migration tools
perform all the conversions, summarizations,
and other changes needed to transform data into a unified
format in the data warehouse.
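A minimal sketch of the extract, transform and load steps described above. The two source lists, their field names and the `purchases` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw records from two source systems with inconsistent formats.
source_a = [{"customer": " alice ", "amount": "120.50"}, {"customer": "BOB", "amount": "80"}]
source_b = [{"name": "Alice", "total_cents": 3000}]

def extract():
    """Extract: yield raw rows tagged with the source they came from."""
    for row in source_a:
        yield "a", row
    for row in source_b:
        yield "b", row

def transform(source, row):
    """Transform: clean the name and convert every amount to a float."""
    if source == "a":
        return row["customer"].strip().title(), float(row["amount"])
    return row["name"].strip().title(), row["total_cents"] / 100

def load(connection, rows):
    """Load: write the unified rows into the warehouse table."""
    connection.execute("CREATE TABLE IF NOT EXISTS purchases (customer TEXT, amount REAL)")
    connection.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    connection.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, [transform(source, row) for source, row in extract()])
summary = warehouse.execute(
    "SELECT customer, SUM(amount) FROM purchases GROUP BY customer ORDER BY customer"
).fetchall()
print(summary)
```

After loading, both sources share one format, so a single query can total the spending per customer.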
 Metadata
 Metadata is data about data; it defines the data warehouse. It
is used for building, maintaining and managing the data
warehouse.
 Metadata can be classified into the following categories:
 Technical Metadata
 Business Metadata
 Query Tools
 One of the primary objectives of data warehousing is to provide
information to businesses to make strategic decisions. Query tools
allow users to interact with the data warehouse system.
 These tools fall into four different categories:
 Query and reporting tools
 Application development tools
 Data mining tools
 OLAP tools
 Data Warehouse Bus Architecture
 The data warehouse bus determines the flow of data
in your warehouse. The data flow in a data
warehouse can be categorized as Inflow, Upflow,
Downflow, Outflow and Meta flow.
 Data Marts
 A data mart is an access layer which is used to get
data out to the users. It is presented as an alternative
to a large data warehouse because it takes less time
and money to build.
Data Mining
 Data mining is defined as a process used to extract usable data from a
larger set of raw data.
 It involves analysing data patterns in large batches of data using one or
more software tools.
 For segmenting the data and evaluating the probability of future
events, data mining uses sophisticated mathematical
algorithms. Data mining is also known as Knowledge Discovery in Data
(KDD).
Key features of data mining
 Automatic pattern predictions based on trend and
behaviour analysis.
 Prediction based on likely outcomes.
 Creation of decision-oriented information.
 Focus on large data sets and databases for analysis.
 Clustering based on finding and visually documenting
groups of facts not previously known.
Data Mining Functionalities
 Data mining functionalities specify the kind of pattern to be found in data
mining tasks. There are two types of tasks:
Descriptive Tasks:
 These tasks present the general properties of the data stored in the
database. Descriptive tasks are used to find patterns in
data, such as clusters, correlations, trends and anomalies.
Predictive Tasks:
 Predictive data mining tasks predict the value of one attribute
on the basis of the values of other attributes. The attribute being predicted is known as the
target or dependent variable, and the attributes used for making
the prediction are known as independent variables.
Clustering
 Clustering is used to identify data objects that are similar to one another. It is the process of
partitioning a set of objects into groups, called clusters, so that objects in the same cluster are similar.
 It is used in machine learning, pattern recognition, image analysis and information
retrieval. For example, an insurance company can cluster its customers based on age,
residence, income, etc.
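The insurance example can be sketched with k-means, one common clustering algorithm (the slides do not name a specific one). The customer data, the choice of two clusters and the starting centroids are all assumptions made for illustration.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((p - c) ** 2 for p, c in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(values) / len(cluster) for values in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers as (age, annual income in thousands).
customers = [(23, 30), (25, 32), (27, 35), (52, 90), (55, 95), (58, 88)]
centroids, clusters = k_means(customers, centroids=[customers[0], customers[3]])
print(clusters)
```

In practice the attributes would usually be scaled first so that one attribute (here income) does not dominate the distance.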
Associations and correlations:
 Association discovers the connections among a set of items.
 A retailer can identify the products that customers normally purchase together, or even
find the customers who respond to promotions of the same kind of products.
 For example, a table and a chair form a set of items that are often purchased together.
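A rough sketch of how the strength of such an association can be measured, using the standard support and confidence measures from association-rule mining. The baskets are hypothetical data, and only item pairs are considered.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
baskets = [
    {"table", "chair"},
    {"table", "chair", "lamp"},
    {"table", "lamp"},
    {"chair"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def support(pair):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted(pair))] / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return pair_counts[tuple(sorted((antecedent, consequent)))] / item_counts[antecedent]

print(support(("table", "chair")), confidence("table", "chair"))
```

Here the table and chair appear together in half of the baskets, and two out of three baskets with a table also contain a chair.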
Summarization
 A set of relevant data is summarized, which results in a smaller set that gives aggregated
information about the data.
 For example, the shopping done by a customer can be summarized into total products,
total spending, offers used, etc.
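The shopping example might look like this in code. The purchase records and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical purchase records: (customer, product, price, offer_used).
purchases = [
    ("asha", "table", 120.0, True),
    ("asha", "chair", 40.0, False),
    ("ravi", "lamp", 25.0, True),
    ("asha", "lamp", 25.0, True),
]

def summarize(records):
    """Collapse individual purchases into one aggregated row per customer."""
    summary = defaultdict(
        lambda: {"total_products": 0, "total_spending": 0.0, "offers_used": 0}
    )
    for customer, _product, price, offer_used in records:
        entry = summary[customer]
        entry["total_products"] += 1
        entry["total_spending"] += price
        entry["offers_used"] += int(offer_used)
    return dict(summary)

customer_summary = summarize(purchases)
print(customer_summary)
```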
Data mining under Descriptive Task
Prediction
 A prediction task predicts the possible values of future data.
 Prediction involves developing a model based on the available data; this model is then
used to predict future values for a new data set of interest.
 For example, a model can predict the income of an employee based on education,
experience and other demographic factors such as place of stay and gender.
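As a minimal sketch of the income example, the code below fits a straight line by ordinary least squares using a single attribute, years of experience. The choice of a linear model and the training figures are assumptions for illustration; a real model would use more attributes.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Hypothetical training data: years of experience and income (in thousands).
experience = [1, 2, 3, 4, 5]
income = [32, 34, 36, 38, 40]

slope, intercept = fit_line(experience, income)

# Use the fitted model to predict income for a new employee with 7 years of experience.
predicted_income = slope * 7 + intercept
print(predicted_income)
```

Here experience is the independent variable and income is the target (dependent) variable.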
Time-Series Analysis
 A time series is a sequence of events where the next event is determined by one or more
of the preceding events.
 Time-series analysis includes methods to analyze time-series data in order to extract
useful patterns, trends, rules and statistics. Stock market prediction is an important
application of time-series analysis.
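One of the simplest time-series methods is a moving average, sketched below. The prices are hypothetical, and using the last smoothed value as a forecast is a deliberately naive assumption, not a recommended way to predict stock prices.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical daily closing prices.
prices = [10, 12, 11, 13, 15, 14]

# Smoothing removes day-to-day noise and exposes the upward trend.
smoothed = moving_average(prices, window=3)

# Naive forecast: assume the next value equals the latest smoothed value.
naive_forecast = smoothed[-1]
print(smoothed, naive_forecast)
```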
Classification:
 Classification builds a model from data with predefined classes; the model
is then used to classify new instances whose class is not known.
 For example, one may classify an employee’s potential salary on the basis of the salary
classification of similar employees in the company.
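The salary example can be sketched with a k-nearest-neighbours classifier, one of several possible classification methods (the slides do not name one). The employee records, the two attributes and the salary bands are hypothetical.

```python
from collections import Counter

# Hypothetical labelled employees: (years of experience, education level) -> salary band.
labelled = [
    ((1, 1), "low"), ((2, 1), "low"), ((3, 2), "medium"),
    ((5, 2), "medium"), ((8, 3), "high"), ((10, 3), "high"),
]

def classify(instance, training, k=3):
    """k-nearest neighbours: majority class among the k closest labelled examples."""
    by_distance = sorted(
        training,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], instance)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Classify a new employee whose salary band is not known.
predicted_band = classify((9, 3), labelled)
print(predicted_band)
```

The new employee is assigned the band held by most of the similar employees, which mirrors the example above.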
Data mining under Predictive Task
Applications of Data Mining
 Sales and Marketing
 Banking and Finance
 Healthcare and Insurance
 Retail Industry
 Telecommunications Industry
 Higher Education
Amazon Web Services, Inc.
(IT service management company)
 AWS allows you to take advantage of all of the core benefits associated with on-demand
computing, such as access to seemingly limitless storage and compute capacity, and the
ability to scale your system in parallel with the growing amount of data collected, stored,
and queried, paying only for the resources you provision.
 Further, AWS offers a broad set of managed services that integrate seamlessly with each
other so that you can quickly deploy an end-to-end analytics and data warehousing
solution.
Amazon Redshift
 Amazon Redshift is a fast, fully managed, and cost-effective data
warehouse that gives you petabyte-scale data warehousing and exabyte-scale
data lake analytics together in one service.
 According to AWS, Amazon Redshift is up to ten times faster than traditional on-premises
data warehouses. Get unique insights by querying across petabytes of
data in Redshift and exabytes of structured data or open file formats in
Amazon S3, without the need to move or transform your data.
 According to AWS, Redshift is 1/10th the cost of traditional on-premises data warehouse
solutions. You can start small for just $0.25 per hour with no commitments,
scale out to petabytes of data for $250 to $333 per uncompressed terabyte
per year, and extend analytics to your Amazon S3 data lake for as little as
$0.05 for every 10 gigabytes of data scanned.
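The quoted prices can be turned into a quick back-of-the-envelope calculation. The sketch uses only the figures stated above, which may not match current AWS pricing; the function names and the 30-day month are assumptions.

```python
# Figures quoted in the slide above; actual AWS pricing may differ.
HOURLY_START_PRICE = 0.25        # dollars per hour
PER_TB_YEAR_RANGE = (250, 333)   # dollars per uncompressed terabyte per year
SCAN_PRICE_PER_10_GB = 0.05      # dollars per 10 GB scanned in S3

def yearly_storage_cost(terabytes):
    """Low and high yearly cost for the given number of uncompressed terabytes."""
    low, high = PER_TB_YEAR_RANGE
    return terabytes * low, terabytes * high

def scan_cost(gigabytes_scanned):
    """Cost of scanning the given amount of S3 data."""
    return gigabytes_scanned / 10 * SCAN_PRICE_PER_10_GB

# Running the smallest option continuously for a 30-day month.
start_small_per_month = HOURLY_START_PRICE * 24 * 30

print(start_small_per_month, yearly_storage_cost(1000), scan_cost(1000))
```

At these rates, scanning 1,000 GB costs about $5, and 1,000 TB of uncompressed data costs between $250,000 and $333,000 per year.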
Amazon Redshift Customer Success
 “Amazon Redshift enables faster business insights and growth, and provides an
easy-to-manage infrastructure to support our data workloads. Redshift has given us
the confidence to run more data and analytics workloads on AWS and helps us meet
the growing needs of our customers.”
(Abhi Bhatt, Director Global Data & Analytics, McDonald’s)
 “Amazon Redshift allows us to ingest, optimize, transform, and aggregate billions
of transactional events per day at scale, coming to us from a variety of first and
third party sources. We query live data across our data warehouse and data lake,
and now with the new Amazon Redshift Federated Query feature we can easily
query and analyse live data across our relational databases as well.”
(Alex Tverdohleb, Vice President Data Services, Consumer Products & Engineering,
FOX Corporation)
 “At WD we use Amazon Redshift to enable the enterprise to gain value and insights
from large, complex, and dispersed datasets. Our data is nearly doubling every year
and we run six Redshift clusters with a total of 78 nodes and 631+ TB of compressed
data stored to get insights that our business analysts and leadership depend on.”
(Fayaz Syed, Sr. Manager, Big Data Platform, Western Digital)
