Data warehousing and machine learning primer


An introduction to the data warehouse and machine learning. A simple primer on how a data warehouse assists machine learning.

Published in: Data & Analytics

  1. Data Warehousing and Machine Learning - Tom Donoghue
  2. Data Warehouse
     • Business-user-friendly stories about past events (including near real time)
     • Designed to support decision making
     • Serves a digest of answers in grouped and aggregated ways
     • More meaningful and therefore more important to the business
     • Ingests data from disparate sources, which must be merged to enable business-friendly queries
  3. Data Warehouse Definition
     • A consolidating bolt-on to existing operational systems
     • Structured data associated with a specific user base and a specific set of predefined business queries
     • The data schema is predefined and structured to facilitate regular and ad hoc queries
     • Populating the data warehouse requires multiple ETL (extract, transform, load) processes designed in advance
     • Halts the proliferation of reports
     (O'Leary, 2014)
  4. Data Warehouse Basic Architecture (diagram): Operational Data Sources (several Source Data feeds) → Data Preparation (ETL, Staging Area) → Data Warehouse → Business Queries (Business Users)
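The flow in the architecture slide (source data, staging and ETL, warehouse, business query) can be sketched in a few lines. This is only an illustrative sketch, not the presenter's implementation: the two source extracts, the `fact_sales` table and every function name are assumptions made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source extracts; names and values are illustrative only.
crm_rows = [
    {"customer_id": "C001", "region": "north"},
    {"customer_id": "C002", "region": "South "},
]
sales_rows = [
    ("c001", "2024-01-05", "120.50"),
    ("C002", "2024-01-06", "80.00"),
    ("c001", "2024-02-01", "40.00"),
]


def transform(crm, sales):
    """Staging step: normalise keys and types so the two sources can be merged."""
    regions = {r["customer_id"].upper(): r["region"].strip().title() for r in crm}
    staged = []
    for customer, sale_date, amount in sales:
        key = customer.upper()
        staged.append((key, regions.get(key, "Unknown"), sale_date, float(amount)))
    return staged


def load(conn, staged):
    """Load step: write the merged rows into a predefined warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(customer_id TEXT, region TEXT, sale_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", staged)
    conn.commit()


def sales_by_region(conn):
    """Business query: a grouped, aggregated answer rather than raw rows."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM fact_sales GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
load(conn, transform(crm_rows, sales_rows))
totals = sales_by_region(conn)
print(totals)
```

The cleansing happens once, in `transform`, so every later business query (and any later model) reads rows that already share consistent keys and types.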
  5. Data Warehouse Requirements
     • Organisational Data is easy to access
     • Information is presented consistently
     • Adaptive and resilient to change
     • Secure
     • Serves as a base for improved decision making
     • Accepted by the business community
     (Kimball, 2002)
  6. Machine Learning
     • A Data Warehouse provides historic information for decision making
     • Machine Learning uses algorithms to process features in the data to learn patterns, make predictions and produce solution outcomes
     • Applications: Image recognition, Classification, Forecasting, Anomaly detection
     • Learning is Supervised (the data is labelled with the desired outcome) or Unsupervised (the data is unlabelled and the model learns unaided)
  7. Machine Learning - Supervised
     • A predictive model is trained using a labelled training data set and evaluated on its performance
     • The model is tweaked to improve performance
     • The model is then run against a test data set whose labels are withheld, and evaluated on how well it identifies the correct label
     • Examples:
       • k-Nearest Neighbours
       • Linear and Logistic Regression
       • Decision Trees
       • Support Vector Machines
     (Lantz, 2015)
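As a minimal sketch of the supervised loop on this slide, the example below uses k-Nearest Neighbours, one of the listed algorithms, written in plain Python. The data set (monthly spend and visit count mapped to a customer segment) and the helper names `knn_predict` and `accuracy` are assumptions invented for illustration; the test labels are held back from the model and used only for scoring.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Share of test points whose predicted label matches the withheld label."""
    correct = sum(knn_predict(train, features, k) == label for features, label in test)
    return correct / len(test)


# Hypothetical labelled data: (monthly_spend, visits) -> customer segment.
train = [
    ((5.0, 1.0), "low"), ((6.0, 2.0), "low"), ((4.0, 1.5), "low"),
    ((50.0, 9.0), "high"), ((55.0, 10.0), "high"), ((48.0, 8.0), "high"),
]
test = [((5.5, 1.2), "low"), ((52.0, 9.5), "high")]

test_accuracy = accuracy(train, test, k=3)
print(test_accuracy)
```

Here "tweaking the model" would mean trying other values of `k` (or rescaling the features) and comparing the resulting accuracy, ideally on a separate validation split rather than on the final test set.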
  8. Machine Learning - Unsupervised
     • The training data set is unlabelled
     • The descriptive model is trained and evaluated on its performance
     • Examples:
       • Clustering - k-Means
       • Association Rules
       • Natural Language Processing
     (Lantz, 2015)
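The k-Means example on this slide can be sketched as follows. It is a simplified illustration under stated assumptions: the points are invented, and the first `k` points are used as the starting centroids so the run is repeatable, whereas practical implementations usually start from random or k-means++ initialisation.

```python
import math


def kmeans(points, k, iterations=10):
    """Basic k-means (Lloyd's algorithm) with a deterministic start."""
    # Assumption for repeatability: the first k points seed the centroids.
    centroids = [points[i] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical unlabelled observations forming two loose groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

No labels are supplied anywhere: the grouping comes purely from distances between points, which is what makes the resulting model descriptive rather than predictive.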
  9. Machine Learning as an Extension to Data Warehousing
     • Much of the hard work to cleanse and transform data has already been accomplished
     • Ask the Business Question – what is the objective? Is it descriptive or predictive?
     • Does the data contain the desired features?
     • Is further data transformation required?
     • Which ML algorithm is optimal for answering the question?
     • Iterative approach, assessing and evaluating the performance of the model(s)
     • Present the Solution
  10. References
     • Kimball, R., Ross, M., Thornthwaite, W., Mundy, J. and Becker, B. (2008) The Data Warehouse Lifecycle Toolkit. 2nd ed. Indianapolis: Wiley Publishing, Inc.
     • Lantz, B. (2015) Machine Learning with R. 2nd ed. Birmingham: Packt.
     • O'Leary, D. E. (2014) ‘Embedding AI and Crowdsourcing in the Big Data Lake’, IEEE Intelligent Systems, 29(5), pp. 70-73.