Project report on the design and build of a data warehouse from unstructured and structured data sources (Quandl, Yelp and the UK Office for National Statistics) using SQL Server 2016, MongoDB and IBM Watson, and on the design and implementation of business intelligence visualisations using Tableau to answer cross-domain business questions.
Data warehousing and business intelligence project report (sonalighai)
Developed a data warehouse project with structured, semi-structured and unstructured data sources and generated business intelligence reports. The topic of the project was tobacco product consumption in America. The study examined which products are most popular across the population and found that middle-school students are soft targets for tobacco companies, since most people start using tobacco products at that age.
Tools used: SSMS, SSIS, SSAS, SSRS, R-Studio, Power BI, Excel
A data warehouse is a large collection of integrated data from multiple sources that is structured for analysis and reporting. It allows users to gain insights from historical data to support business decisions and identify trends. Data is extracted from operational systems, transformed for consistency and quality, and loaded into the data warehouse where it is stored in a multidimensional structure to enable analysis. This involves fact and dimension tables along with techniques like denormalization to optimize query performance.
The document discusses the key concepts and components of a data warehouse. It defines a data warehouse as a subject-oriented, integrated, non-volatile, and time-variant collection of data used for decision making, and explains each of these characteristics. It also describes the common components of a data warehouse such as the source data, data staging, data storage, information delivery, and metadata. Finally, the document provides examples of applications and uses of data warehouses.
The document discusses the basic structure of a data warehouse, including extracting source data, processing and storing data in a data staging area, populating data marts from the data warehouse, and providing user access through query and reporting tools. It also covers dimensional modeling, building conformed dimensions across data marts, handling slowly changing dimensions, and designing descriptive dimension tables.
The document discusses the need for data warehousing and provides examples of how data warehousing can help companies analyze data from multiple sources to help with decision making. It describes common data warehouse architectures like star schemas and snowflake schemas. It also outlines the process of building a data warehouse, including data selection, preprocessing, transformation, integration and loading. Finally, it discusses some advantages and disadvantages of data warehousing.
This document outlines a course on data warehousing and data mining. It introduces key concepts like relational databases, data warehouses, dimensional modeling, and data mining techniques. It also details the course objectives, schedule, assignments, and policies. The goal is for students to gain experience applying data mining methods and understanding the relationship between data mining and other fields.
Data warehouse dimensional modeling and design (Sarita Kataria)
This document provides an overview of data warehousing, dimensional modeling, and online analytical processing (OLAP). It defines key concepts in data warehousing like the data mart, metadata, cube, extraction transformation and loading (ETL), and data mining. Dimensional modeling is presented as an important technique for data warehouse design that uses facts, dimensions, and star or snowflake schemas. Finally, the document discusses OLAP features like multidimensional views and time intelligence, and different OLAP system types including multidimensional, relational, and hybrid OLAP.
1. The document discusses methodological approaches for data warehousing projects, including conceptual design using the dimensional fact model and logical design using star schemas.
2. It compares top-down and bottom-up approaches, noting that bottom-up incrementally builds data marts and is lower cost but may provide only a partial view, while top-down provides a complete picture but is higher cost with long implementations.
3. The document also discusses supply-driven and demand-driven design methodologies, noting the pros and cons of each approach depending on the availability of data sources and user requirements.
This document provides a project report on data warehousing. It includes an abstract describing data warehousing and how it transforms operational databases into informational warehouses for analysis. It also describes the introduction, background, architecture, advantages, and conclusion of data warehousing. The report is submitted by Sana Alvi and includes references.
This document introduces an online course on data warehousing from Edureka. It provides an overview of key topics that will be covered in the course, including what a data warehouse is, its architecture, the ETL process, and modeling dimensions and facts. It also shows examples of using PostgreSQL to create tables and Talend to populate them as part of a hands-on project in the course. The course modules will cover data warehousing introduction, dimensions and facts, normalization, modeling, ETL concepts, and a project building a data warehouse using Talend.
The document defines and describes key concepts related to data warehousing. It provides definitions of data warehousing, data warehouse features including being subject-oriented, integrated, and time-variant. It discusses why data warehousing is needed, using scenarios of companies wanting consolidated sales reports. The 3-tier architecture of extraction/transformation, data warehouse storage, and retrieval is covered. Data marts are defined as subsets of the data warehouse. Finally, the document contrasts databases with data warehouses and describes OLAP operations.
This document provides an overview of data mining and data warehousing. It discusses the history and evolution of databases from the 1960s to today. Data mining is defined as using automated tools to extract hidden patterns from large databases to address the problem of data explosion. Descriptive and predictive models are used in data mining. Data warehousing involves integrating data from multiple sources into a centralized database to support analysis and decision making.
This document provides an overview of big data adoption and analytics technologies. It discusses prerequisites for organizations adopting big data such as data governance frameworks and skillsets. It also outlines the typical big data analytics lifecycle including stages like data identification, analysis, and utilization of results. Finally, it describes various enterprise technologies that support big data analytics like extract-transform-load (ETL) processes, data warehouses, online transaction processing (OLTP), and online analytical processing (OLAP).
The document provides an introduction to data warehousing. It defines a data warehouse as a subject-oriented, integrated, time-varying, and non-volatile collection of data used for organizational decision making. It describes key characteristics of a data warehouse such as maintaining historical data, facilitating analysis to improve understanding, and enabling better decision making. It also discusses dimensions, facts, ETL processes, and common data warehouse architectures like star schemas.
Gulab's PPT on Data Warehousing and Mining (gulab sharma)
The document provides an overview of data warehousing, decision support, and OLAP. It discusses how a data warehouse can integrate data from various operational sources to provide a single point of access for analysis. It also compares the differences between operational databases designed for transactions versus data warehouses designed for analytics and decision making. Key points covered include data extraction, transformation and loading into the warehouse, as well as refresh strategies to propagate changes from source systems.
This document provides an introduction to data warehousing. It discusses why data warehouses are used, as they allow organizations to store historical data and perform complex analytics across multiple data sources. The document outlines common use cases and decisions in building a data warehouse, such as normalization, dimension modeling, and handling changes over time. It also notes some potential issues like performance bottlenecks and discusses strategies for addressing them, such as indexing and considering alternative data storage options.
(1) The document discusses data warehousing, business intelligence, and their relationship to addressing challenges from multiple data sources.
(2) A layered scalable architecture is presented as a reference architecture for data warehouses to provide reliable, consistent, and understandable data from different source systems.
(3) Big data is also discussed in relation to data warehousing, noting differences in schema and consistency needs between traditional warehouses and big data systems handling high volumes and varieties of data.
This document discusses multidimensional databases and provides comparisons to relational databases. It describes how multidimensional databases are optimized for data warehousing and online analytical processing (OLAP) applications. Key aspects covered include dimensional modeling using star and snowflake schemas, data storage in cubes with dimensions and members, and performance benefits of multidimensional databases for interactive analysis of large datasets to support decision making.
This document discusses data warehousing and OLAP technology. It defines a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data used for analysis and decision making. A key aspect of data warehousing is the multidimensional data model which organizes data into cubes with facts and dimensions for analysis. Common schemas include star schemas with dimensions connected to a central fact table and snowflake schemas which normalize dimensional hierarchies.
The document describes a proposed data warehouse for a collection agency. It would help the agency identify profitable clients and account types, measure employee performance, and make better strategic decisions. The data warehouse would collect, categorize, and analyze data on accounts, clients, collection methods, employees, and more. Key performance indicators would measure costs and revenues by category, compare employee collections to targets, and analyze revenues by geography to guide business growth.
The document discusses dimensional modeling and data warehousing. It describes how dimensional models are designed for understandability and ease of reporting rather than updates. Key aspects include facts and dimensions, with facts being numeric measures and dimensions providing context. Slowly changing dimensions are also covered, with types 1-3 handling changes to dimension attribute values over time.
Lecture 04 - Granularity in the Data Warehouse (phanleson)
This chapter discusses the importance of determining the proper level of granularity, or level of detail, for data in a data warehouse. It notes that granularity affects all dependent systems and should begin with estimates of data volumes. Feedback from users is also important for refining granularity over time. The chapter provides examples of different levels of granularity needed in various banking data and how the data warehouse must support the lowest level required by any dependent data marts. Proper granularity design is vital for success of the overall architecture.
A complete presentation on data mining and data warehousing within database management systems.
Basic Introduction of Data Warehousing from Adiva Consulting (adivasoft)
This document provides an overview of Hyperion Essbase & Planning Training. It discusses key concepts like raw data transformation into information, online transaction processing (OLTP) systems, challenges with current data management, the purpose of data warehousing and data marts. It also covers dimensional modeling best practices, types of fact and dimension tables, and how Essbase is tuned for analysis and provides advantages over traditional databases for analytics.
This document discusses the basics of data integration. It covers concepts like ETL (extract, transform, load), data mapping, data staging, data extraction, transformation, and loading. It also discusses metadata and its types, data quality, and data profiling concepts. The key objectives are to understand data integration approaches, metadata, data quality, and perform data cleaning/profiling. The document is from a chapter about data integration in the textbook "Fundamentals of Business Analytics".
The document discusses building a data warehouse in SQL Server. It provides an agenda that covers topics like an overview of data warehousing, data warehouse design, dimension and fact tables, and physical design. It also discusses components of a data warehousing solution like the data warehouse database, ETL processes, and security considerations.
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!! (Caserta)
Joe Caserta went over the details inside the big data ecosystem and the Caserta Concepts Data Pyramid, which includes Data Ingestion, Data Lake/Data Science Workbench and the Big Data Warehouse. He then dove into the foundation of dimensional data modeling, which is as important as ever in the top tier of the Data Pyramid. Topics covered:
- The 3 grains of Fact Tables
- Modeling the different types of Slowly Changing Dimensions
- Advanced Modeling techniques like Ragged Hierarchies, Bridge Tables, etc.
- ETL Architecture.
He also talked about ModelStorming, a technique used to quickly convert business requirements into an Event Matrix and Dimensional Data Model.
This was a jam-packed, abbreviated version of four days of rigorous training on these techniques, taught in September by Joe Caserta (co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit) and Lawrence Corr (author of Agile Data Warehouse Design).
For more information, visit http://casertaconcepts.com/.
The document discusses an overview of enterprise data governance. It describes the goals of data governance as making data usable, consistent, open, available and reliable across an organization. It outlines the roles and responsibilities involved in data governance including an oversight committee, data stewards, data custodians and various initiatives around master data management, data quality, naming conventions, metadata management and more. The document also discusses why organizations implement data governance and how to effectively implement a data governance program.
Data warehousing is an architectural model that gathers data from various sources into a single unified data model for analysis purposes. It consists of extracting data from operational systems, transforming it, and loading it into a database optimized for querying and analysis. This allows organizations to integrate data from different sources, provide historical views of data, and perform flexible analysis without impacting transaction systems. While implementation and maintenance of a data warehouse requires significant costs, the benefits include a single access point for all organizational data and optimized systems for analysis and decision making.
Spatial Network Inc. Data Management and Transformation with FME (Safe Software)
Spatial Networks' geospatial data assets run the gamut of human geography domains and geospatial intelligence, with a contextually relevant and intimate understanding from local professional experts collecting data on the ground. We are transforming this data into insightful analytics and rich, robust data sets of 21 unique data types to help our clients solve challenging geospatial problems. FME has helped us streamline our workflows and automate several processes along our data pipeline. We look to scale our operations significantly in 2018, and FME will ease many of our challenges as we move forward.
This document provides an overview of data warehousing. It defines a data warehouse as a subject-oriented, integrated collection of data used to support management decision making. The benefits of data warehousing include high returns on investment and increased productivity. A data warehouse differs from an OLTP system in its design for analytics rather than transactions. The typical architecture includes data sources, an operational data store, warehouse manager, query manager and end user tools. Key components are extracting, cleaning, transforming and loading data, and managing metadata. Data flows include inflows from sources and upflows of summarized data to users.
- Data warehousing aims to help knowledge workers make better decisions by integrating data from multiple sources and providing historical and aggregated data views. It separates analytical processing from operational processing for improved performance.
- A data warehouse contains subject-oriented, integrated, time-variant, and non-volatile data to support analysis. It is maintained separately from operational databases. Common schemas include star schemas and snowflake schemas.
- Online analytical processing (OLAP) supports ad-hoc querying of data warehouses for analysis. It uses multidimensional views of aggregated measures and dimensions. Relational and multidimensional OLAP are common architectures. Measures are metrics like sales, and dimensions provide context like products and time periods.
Various Applications of Data Warehouse.ppt (RafiulHasan19)
The document discusses various applications of data warehousing. It begins by describing problems with traditional transactional systems and how data warehouses address these issues. It then defines key components of a data warehouse including the extraction, transformation, and loading of data from various sources. The document outlines how online analytical processing (OLAP) tools, metadata repositories, and data mining techniques analyze and explore the collected data. Finally, it weighs the benefits of a data warehouse against the costs of implementation and maintenance.
A data warehouse is a collection of integrated data from multiple sources organized to support management decision making. It contains subject-oriented, integrated, time-variant and non-volatile data stored in a way that is optimized for query and analysis. There are different types of data warehouses including data marts, operational data stores and enterprise data warehouses. Key components of a data warehouse include data sources, extraction, loading, a comprehensive database, metadata and middleware tools.
The document discusses database management and data resource management. It covers logical data elements, database structures, types of databases, and the advantages of database management over traditional file processing. Database management software helps businesses by allowing data to be accessed and maintained in an integrated way. It also discusses database development, interrogation, maintenance, and application development functions performed with database management systems.
A data warehouse is a collection of data integrated from multiple sources to support decision making. It contains subject-oriented, integrated, time-variant, and non-volatile data stored in a way that makes it readily available for analysis. Data marts can be dependent on the warehouse or independent subsets designed for specific departments. Successful implementation requires identifying data sources and governance, planning data quality and modeling, selecting ETL and database tools, and supporting end users. Key challenges include unrealistic expectations, technical issues, and ensuring ongoing value.
Ray Scott - Agile Solutions – Leading with Test Data Management - EuroSTAR 2012 (TEST Huddle)
Ray Scott discusses test data management in agile environments. He notes that while development may be agile, supporting test data often cannot keep up with frequent changes. Traditional test data generation methods take weeks but agile needs data in hours. He advocates treating test data management as a development project and service. Testers should own the data by determining usage, mapping test conditions to data conditions, and ensuring versioning. With solid data provisioning focusing on business rules and repeatability, testing can add value in agile projects.
Data: it's big, so grab it, store it, analyse it, make it accessible... mine, warehouse and visualise... use the pictures in your mind and others will see it your way!
The document discusses ETL (extraction, transformation, and loading) processes which are used to update data warehouses. It describes two common data warehousing strategies, the enterprise-wide and data mart approaches. The document also discusses recent developments in ETL, including more frequent updates, handling of clickstream data, challenges with dirty or inconsistent source data, and the importance of metadata.
This presentation covers the following points:
1. Introduction to ETL testing
2. What is the use of testing
3. What are quality and standards
4. Responsibilities of an ETL tester
This document discusses using machine learning techniques like clustering and decision trees to analyze crime data from Chicago between 2014-2016. It aims to identify crime hot spots and patterns to help police allocate resources more efficiently. The document applies k-means clustering to crime data grouped by location and type, identifying a "vice" cluster with crimes like prostitution and drugs in two adjacent wards. It suggests police could use temporal and hourly crime patterns from the analysis to optimize staff scheduling and deployment. The document also discusses using decision trees and k-nearest neighbors algorithms on the crime data supplemented with temperature and unemployment data to further explore crime patterns.
The Prepared Executive: A Linguistic Exploration (Tom Donoghue)
This document provides an abstract for a research project that analyzes executive answers during the question and answer section of corporate earnings calls. It aims to explore linguistic features in executive answers to see if they can indicate the executive's level of preparedness. The research will examine features of uncertainty, avoidance, and repetition in a sample of earnings call transcripts from the drinks industry. A domain expert will provide labels for a subset of executive answers to use as a baseline for comparison. Models using word lists and document similarity techniques will be developed and evaluated to see if they can accurately detect these linguistic features and determine an executive's preparedness. The results will help uncover new aspects of "executive speak" and company communication strategies.
Crime Analysis using Regression and ANOVA (Tom Donoghue)
A statistical analysis of damage to property using a predictive regression model. Also an investigation to ascertain possible differences in reported divisional burglary rates using ANOVA.
Exploration of Call Transcripts with MapReduce and Zipf's Law (Tom Donoghue)
This study implements a proof-of-concept pipeline to capture web-based call transcripts and produce a word-frequency dataset ready for textual analysis.
This document summarizes challenges in processing data from the growing Internet of Things (IoT). It discusses how the large volume and uneven frequency of data from heterogeneous IoT devices can overwhelm cloud infrastructure. It reviews literature on using distributed computing approaches like fog computing to help address these issues by bringing computation and storage closer to where data is generated at the network edge. Fog computing could help with data locality, initial processing, and partitioning data to relieve strain on centralized cloud systems as more IoT devices generate data.
This paper describes the concept of a data lake and how it compares to a data warehouse. It reviews recent research and discusses the definition of both repositories: what types of data are catered for? Does ingesting data make it available for forging information, and beyond into knowledge? What types of people, processes and tools need to be involved to realise the benefits of using a data lake?
2. Data Warehouse
• Tells business-user-friendly stories about past events (including near-real-time events)
• Designed to support decision making
• Serves a digest of answers in grouped and aggregated form (see the sketch below)
• More meaningful and therefore more important to the business
• Ingests data from disparate sources, which must be merged to enable business-friendly queries
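To make the "digest of answers" bullet concrete, here is a minimal sketch of a grouped, aggregated query over a tiny in-memory fact table. The table name, columns and values are invented for illustration; the slides themselves do not prescribe any particular schema.

```python
# Hypothetical fact table queried in grouped, aggregated form.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (product TEXT, region TEXT, amount REAL);
    INSERT INTO fact_sales VALUES
        ('Widget', 'North', 120.0),
        ('Widget', 'South',  80.0),
        ('Gadget', 'North', 200.0);
""")

# Serve grouped, aggregated answers rather than raw transaction rows.
for product, total in conn.execute(
        "SELECT product, SUM(amount) FROM fact_sales GROUP BY product"):
    print(product, total)
```

The point of the grouping is that business users see a digest (total per product) rather than every operational row.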
3. Data Warehouse Definition
• A consolidating bolt-on to existing operational systems
• Structured data associated with a specific user base and a specific set of predefined business queries
• The data schema is predefined and structured to facilitate regular and ad-hoc queries (a star-schema sketch follows this slide)
• Populating the data warehouse requires multiple ETL processes designed in advance
• Halts the proliferation of reports
O'Leary (2014)
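As a sketch of what a predefined schema might look like, the snippet below creates one fact table keyed to two dimension tables in the star style. All table and column names are assumptions made for this example, not taken from the slides.

```python
# A minimal star schema: one fact table referencing two dimensions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,
        full_date TEXT,
        year      INTEGER,
        month     INTEGER
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name        TEXT,
        category    TEXT
    );
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );
""")
print(conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall())
```

Regular and ad-hoc queries then join the fact table to whichever dimensions give the grouping the business user wants.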
4. Data Warehouse Basic Architecture
[Architecture diagram: multiple operational data sources feed an ETL staging area (data preparation); the staging area loads the data warehouse; business users run their queries against the warehouse. A toy end-to-end sketch follows.]
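Below is a hedged, toy rendition of that flow in Python: records from two invented sources are extracted, normalised in a staging list, and loaded into a single warehouse table that business queries can then hit. Everything here (the sources, field names and table) is hypothetical.

```python
# Extract: rows from two disparate "source systems" (invented data).
import sqlite3

source_a = [{"prod": "widget", "amt": "120.5"}, {"prod": "gadget", "amt": "80"}]
source_b = [{"prod": "WIDGET", "amt": "99.9"}]

# Transform: normalise case and types in a staging area before loading.
staging = [(row["prod"].lower(), float(row["amt"]))
           for source in (source_a, source_b) for row in source]

# Load: merged, consistent rows land in the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?)", staging)

# Business users query the merged data.
print(conn.execute(
    "SELECT product, SUM(amount) FROM warehouse_sales GROUP BY product"
).fetchall())
```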
5. Data Warehouse Requirements
• Organisational data is easy to access
• Information is presented consistently
• Adaptive and resilient to change
• Secure
• Serves as a base for improved decision making
• Accepted by the business community
(Kimball, 2002)
6. Machine Learning
• A data warehouse provides historic information for decision making
• Machine learning uses algorithms to process features in the data to learn patterns, make predictions and produce solution outcomes
• Image recognition, classification, forecasting, anomaly detection
• Learning is supervised (labelled with the desired outcome) or unsupervised (unlabelled; the model learns unaided)
7. Machine Learning - Supervised
• A predictive model is trained using a labelled training data set and the outcome evaluated on its performance
• The model is tweaked to improve performance
• The model is then run against a test data set, which is unlabelled, and evaluated on its performance in identifying the correct label
• Examples (a minimal decision-tree sketch follows):
• k-Nearest Neighbours
• Linear and Logistic Regression
• Decision Trees
• Support Vector Machines
(Lantz, 2015)
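A minimal supervised sketch of the train/tweak/test loop described above, using a decision tree. It assumes scikit-learn is installed and uses the bundled iris data purely as a stand-in labelled data set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # a labelled data set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=3)  # a tweakable hyperparameter
model.fit(X_train, y_train)                  # train on labelled examples

# Evaluate on held-out data: how often is the correct label identified?
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```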
8. Machine Learning - Unsupervised
• The training data set is unlabelled
• The descriptive model is trained and evaluated on its performance
• Examples (a k-means sketch follows):
• Clustering - k-Means
• Association Rules
• Natural Language Processing
(Lantz, 2015)
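And a minimal unsupervised counterpart: k-means groups unlabelled points into clusters without ever seeing a desired outcome. The synthetic two-group data is an assumption made for illustration; scikit-learn and NumPy are assumed installed.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled data: two loose groups of 2-D points.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

# The descriptive model learns the grouping unaided.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster centres:\n", model.cluster_centers_)
```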
9. Machine Learning as an Extension to Data Warehousing
• Much of the hard work to cleanse and transform the data has already been accomplished
• Ask the business question: what is the objective? Is it descriptive or predictive?
• Does the data contain the desired features?
• Is further data transformation required?
• Which ML algorithm is optimal for answering the question?
• Iterative approach, assessing and evaluating the performance of the model(s)
• Present the solution (a warehouse-to-model sketch follows)
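One hedged sketch of that handover: features are pulled from an assumed warehouse summary table with SQL and fed straight to a predictive model. The table, columns and churn framing are all hypothetical, chosen only to show the shape of the workflow.

```python
import sqlite3
from sklearn.linear_model import LogisticRegression

# Stand-in for a warehouse table that ETL has already cleansed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_summary (visits INTEGER, spend REAL, churned INTEGER);
    INSERT INTO customer_summary VALUES
        (1, 20.0, 1), (2, 35.0, 1), (3, 60.0, 1),
        (7, 250.0, 0), (8, 310.0, 0), (9, 280.0, 0);
""")

rows = conn.execute(
    "SELECT visits, spend, churned FROM customer_summary").fetchall()
X = [[visits, spend] for visits, spend, _ in rows]  # features from the warehouse
y = [churned for _, _, churned in rows]             # predictive target

# A predictive model answers the business question posed above.
model = LogisticRegression().fit(X, y)
print("churn probability for (4 visits, 90.0 spend):",
      round(model.predict_proba([[4, 90.0]])[0][1], 3))
```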
10. References
• Kimball, R., Ross, M., Thornthwaite, W., Mundy, J. and Becker, B. (2008) The Data Warehouse Lifecycle Toolkit. 2nd ed. Indianapolis: Wiley Publishing, Inc.
• Lantz, B. (2015) Machine Learning with R. 2nd ed. Birmingham: Packt.
• O'Leary, D. E. (2014) 'Embedding AI and Crowdsourcing in the Big Data Lake', IEEE Intelligent Systems, Volume 29, Issue 5, pp. 70-73.