AGILE DATA WAREHOUSING SOLUTION
Data warehousing + Reporting Solution
Presented by: Sneha Challa
Date: 7/14/2016
Location: San Ramon, CA
Standard approach to DW
Agile Data Warehousing for the Enterprise - Ralph Hughes
Traditional EDW Model
• Brittle to changing requirements
Source: Agile Data Warehousing for the Enterprise
Two Traditional Approaches
• Traditional integration layer – model it in 3NF or higher; ETL loads data into the integration layer before transforming it to populate the star schemas of the presentation layer (a minimal star-schema sketch follows this list).
• Conformed dimensional data warehouse – skips the integration layer and loads the company's data directly into star schemas.
Cons:
• Both approaches lead to data warehouses that are very difficult to modify once data is loaded.
• Brittle in the face of changing requirements.
• Costly redesign and data conversion.
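To make the brittleness concrete, here is a minimal star-schema sketch of the kind a presentation layer typically ends up with. It uses SQLite purely for illustration, and the table and column names are invented for this example, not taken from the deck.

    # Minimal star-schema sketch (illustrative names only): one fact table
    # surrounded by dimension tables, as a traditional presentation layer would have.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
        CREATE TABLE fact_sales (
            sales_key    INTEGER PRIMARY KEY,
            customer_key INTEGER REFERENCES dim_customer(customer_key),
            date_key     INTEGER REFERENCES dim_date(date_key),
            amount       REAL
        );
    """)
    # Adding a new attribute or relationship later means altering these fixed
    # tables and the ETL that loads them -- the brittleness described above.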
Agile Data Engineering
• No need to have the entire data model designed up front.
• Development adapts to changing business requirements.
• No need to re-engineer the existing schema when new entities and relationships arise.
• Simple, reusable ETL modules.
Agile Data Engineering
Source: https://www.youtube.com/watch?v=3QOSOeN8vcY
Example
7 Tables
Source: https://www.youtube.com/watch?v=3QOSOeN8vcY
HNF (Hyper Normalized Form)
Source: Agile Data Warehousing for the Enterprise
HNF model
18 tables – we have hyper-normalized the original 7-table model.
Source: https://www.youtube.com/watch?v=3QOSOeN8vcY
HNF
• Parameter-driven data transformation using ETL scripts: one ETL module for all business-key tables, one (the yellow module) for linking tables, and one that takes all remaining attributes from the source and sends them to the target tables (a sketch of the idea follows below).
• Easily adapts to changing business requirements, even after billions of records have been loaded.
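A rough sketch of what a parameter-driven ETL module might look like in Python, assuming a SQLite-style DB-API connection; the table and column names are hypothetical, not from the deck. The point is that the same module is reused for every business-key table and only the metadata passed in changes.

    # Hypothetical parameter-driven loader: one generic module, driven by metadata.
    def load_business_keys(conn, source_table, key_column, target_anchor_table):
        """Copy distinct business keys from a source table into an anchor table."""
        rows = conn.execute(
            f"SELECT DISTINCT {key_column} FROM {source_table}"
        ).fetchall()
        conn.executemany(
            f"INSERT OR IGNORE INTO {target_anchor_table} (business_key) VALUES (?)",
            rows,
        )

    # The same module handles every entity; only the parameters differ:
    # load_business_keys(conn, "stg_customer", "customer_id", "anchor_customer")
    # load_business_keys(conn, "stg_product",  "product_id",  "anchor_product")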
Data model
Source: https://www.youtube.com/watch?v=3QOSOeN8vcY
HNF
• Caveat: data retrieval gets complex. The SQL needed to reassemble records can get very involved, with outer joins and correlated subqueries (a query sketch follows this list). But does it matter that much? Remember that HNF is used for the integration layer, not so much for the presentation and semantic layers.
• Data is loaded into the integration layer from the source systems using only 3 reusable ETL modules.
• Build the DW a slice at a time and adapt to new business requirements.
• http://www.anchormodeling.com/
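To illustrate the caveat above, here is a hedged sketch of the kind of SQL needed to put one business entity back together from a hyper-normalized (anchor-style) layout. The table names are invented for illustration, and point-in-time logic is omitted.

    # Illustrative only: reassembling a customer row from anchor + attribute tables
    # typically needs one LEFT JOIN per attribute.
    REASSEMBLE_CUSTOMER_SQL = """
    SELECT a.business_key,
           n.name_value,
           r.region_value
    FROM   anchor_customer a
    LEFT JOIN attr_customer_name   n ON n.customer_id = a.customer_id
    LEFT JOIN attr_customer_region r ON r.customer_id = a.customer_id
    """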
Hyper Generalized Form
• Computer-generated warehouse presentation and semantic layers – a labor-saving approach.
• Hand-crafted logical and physical data models are eliminated.
• Can operate at the business level.
• Builds on the notion of special-purpose tables.
• Requires acquiring a data warehouse automation tool that can generate the entire data warehouse infrastructure.
• The entire dataset is represented as 6 tables (a rough illustration follows below).
• Generates the EDW and ETL schemas for all layers.
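The deck does not spell out the six tables, so the following is only a loose illustration of the hyper-generalized idea under my own assumed names: a handful of generic tables (thing types, things, link types, links, attribute types, attributes) instead of one table per entity. It is not the layout of any specific automation tool.

    # Loose illustration of a hyper-generalized layout; names are assumptions.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE thing_type     (type_id INTEGER PRIMARY KEY, type_name TEXT);
        CREATE TABLE thing          (thing_id INTEGER PRIMARY KEY, type_id INTEGER, business_key TEXT);
        CREATE TABLE link_type      (link_type_id INTEGER PRIMARY KEY, link_name TEXT);
        CREATE TABLE link           (link_id INTEGER PRIMARY KEY, link_type_id INTEGER,
                                     from_thing_id INTEGER, to_thing_id INTEGER);
        CREATE TABLE attribute_type (attr_type_id INTEGER PRIMARY KEY, attr_name TEXT);
        CREATE TABLE attribute      (thing_id INTEGER, attr_type_id INTEGER,
                                     attr_value TEXT, valid_from TEXT);
    """)
    # New entities or relationships become new *rows* in thing_type / link_type,
    # not new tables -- which is why the model can adapt without re-engineering.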
Step 1
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Step 2
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Step 3
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Step 4 (Add temporality)
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Step 5
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Step 6
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Big Data Technologies
• Power an iterative discovery and engineering process.
• Read and transform massive amounts of data on cheap commodity hardware using massively parallel processing.
• Schema on read: no need to impose structure on every piece of information gathered (a sketch follows this list).
• Hadoop with more SQL-like features, or a traditional EDW with big data packages – which is more useful?
• Complex event analysis.
• Real-time analytics of high-volume data streams.
• Complex event processing.
• Data mining software.
• Text analytics.
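A small schema-on-read sketch using PySpark, assuming a Spark installation and a hypothetical events.json file with user_id and event_type fields: the structure is inferred when the data is read, not imposed when it is stored.

    # Schema on read with PySpark: no table definition up front; the schema is
    # inferred from the JSON at read time. The file path and fields are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

    events = spark.read.json("hdfs:///raw/events.json")   # structure inferred here
    events.printSchema()                                   # inspect what was found
    events.select("user_id", "event_type").show(5)         # query only what you need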
Big Data Technologies
Products:
• Hadoop and HDFS
• NoSQL databases
• Big Data extensions to RDBMS
Apache Hadoop S/W Components
Source: Agile Data Warehousing for the Enterprise by Ralph Hughes
Hadoop
Reasons to use Hadoop:
• Build a data warehouse for the future: gear up your skills for Hadoop and Big Data as data volumes grow. Major distributions such as Hortonworks, Cloudera, and MapR offer enterprise hub editions that can be deployed.
• A common complaint is that Hadoop is not suited to quick interactive querying, but Cloudera's Impala and Hortonworks' Stinger initiative have made interactive querying much faster.
• The Hortonworks platform provides indexing and search features via Apache Solr, which can make search and querying faster.
• Hortonworks also ships Apache Zeppelin, which brings data visualization and collaboration features to Hadoop and Spark.
• Apache Sqoop to load data from an RDBMS (a PySpark sketch of the same idea follows this list).
• Pig and MapReduce for ETL.
• Hourly, weekly, and monthly workflow schedules.
• Apache Flume to load web log data.
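Sqoop itself is a command-line tool, so as a rough Python-flavored equivalent here is a PySpark JDBC read that pulls a table out of an RDBMS into the cluster. The connection URL, credentials, table name, and output path are placeholders, and this is shown as the same idea, not as Sqoop.

    # Not Sqoop itself -- a PySpark JDBC read illustrating RDBMS ingestion in Python.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdbms-ingest-demo").getOrCreate()

    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/sales")   # placeholder URL
        .option("dbtable", "public.orders")                     # placeholder table
        .option("user", "etl_user")
        .option("password", "***")
        .load()
    )
    orders.write.mode("overwrite").parquet("hdfs:///warehouse/orders")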
Data Virtualization
NoSQL Databases
Advantages:
• Schemaless reads
• Auto-sharding
• Cloud computing (AWS)
• Replication
• No separate application or expensive add-ons
• Integrated caching – in-memory caching for high throughput and low latency
• Open source – e.g., Cassandra, HBase
Types:
• Document stores
• Graph stores – Neo4j and Giraph
• Key-value stores – Riak, Berkeley DB, Redis; complex info stored as BLOBs in value columns (a sketch follows this list)
• Wide-column stores – Cassandra and HBase
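A minimal key-value example with Redis via the redis-py client, storing a complex record as a single value. The host, port, key, and record contents are placeholders.

    # Key-value store sketch with redis-py: a whole customer record is stored
    # as one JSON blob under a single key.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    record = {"customer_id": 42, "name": "Acme Corp", "region": "West"}
    r.set("customer:42", json.dumps(record))        # write the blob
    restored = json.loads(r.get("customer:42"))     # read it back
    print(restored["name"])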
Why Implement NoSQL?
• Big Data keeps getting bigger, and new sources of data continually emerge.
• More users are going online.
• Open source – can be downloaded, implemented, and scaled at little cost.
• A viable alternative to expensive proprietary software.
• Increases the speed and agility of development.
• When requirements change, the data model can change with them.
5 Considerations to Evaluate NoSQL
• Data model
  o Document model – MongoDB, CouchDB; natural mapping of the document object model to OOP; query on any field (a sketch follows this list)
  o Graph databases – traversing relationships is the key; e.g., social networks and supply chains
  o Columnar and wide-column databases
• Query model
• Consistency model
• APIs
• Commercial support & community strength
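A short pymongo sketch of the document model and the "query on any field" point; the connection string, database, collection, and field names are placeholders.

    # Document-model sketch with pymongo: insert a nested document, then query
    # on an arbitrary field without declaring a schema first.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    orders = client["demo_db"]["orders"]

    orders.insert_one({
        "order_id": 1001,
        "customer": {"name": "Acme Corp", "region": "West"},
        "items": [{"sku": "A-1", "qty": 3}],
    })

    # Query on any field, including nested ones:
    for doc in orders.find({"customer.region": "West"}):
        print(doc["order_id"])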
DWaaS
Amazon Redshift
• Cost effective: about $1,000 per terabyte per year
• Columnar storage – fast access, parallelized queries
• MPP data warehouse architecture
• Cheap, simple, secure, and compatible with a SQL interface (a connection sketch follows this list)
• Automated provisioning, configuration, and monitoring of a cloud data warehouse
• Integrations with Amazon S3, Amazon DynamoDB, Amazon Elastic MapReduce, and Amazon Kinesis
• Security is built in
• Managed through the AWS Management Console
• Network isolation using a Virtual Private Cloud (VPC)
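Because Redshift exposes a PostgreSQL-compatible SQL interface, a standard Postgres driver is enough to query it. A hedged sketch with psycopg2 follows; the cluster endpoint, credentials, and table are placeholders.

    # Querying Redshift through its SQL interface with psycopg2.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder
        port=5439,
        dbname="analytics",
        user="report_user",
        password="***",
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region;")
        for region, total in cur.fetchall():
            print(region, total)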
Presentation/Visualization
• Tableau
  o Easy-to-use drag-and-drop interface
  o No code required
  o Connects to Hadoop, cloud, and SQL databases
  o Offers free training
  o Trend analysis, regression, and correlation analysis
  o In-memory data analysis
  o Data blending
  o Clutter-free GUI
• QlikView
  o Faster in-memory computation
Analytics and forecasting
• R, Python, and Apache Spark – for predictive modeling and forecasting.
• Connect the data warehouse to R, Python, and Spark (a scikit-learn sketch follows below).
• R libraries – rpart, randomForest, ROCR, mboost
• Python – scikit-learn, NumPy, pandas, SciPy
• Spark – spark.ml and MLlib
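A tiny scikit-learn sketch of the predictive-modeling step once warehouse data has been pulled into Python. The feature matrix here is random placeholder data; in practice X and y would come from a query against the warehouse.

    # Predictive-modeling sketch with scikit-learn on placeholder data.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))               # stand-in for engineered features
    y = X[:, 0] * 2.0 + rng.normal(size=500)    # stand-in for the target metric

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))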

Editor's Notes

  • #13 Interesting questions: Which flavor of HNF is best for each use case? What does a physical HNF model look like? What are the best platforms on which to model an HNF schema for performance? How do we fold in data governance? Where do we place columns that hold applied business rules (derived columns)? How do we merge an HNF warehouse with a 3NF EDW? Can an HNF warehouse support self-service BI? How do the advantages of HNF compare to hyper generalization? (Ceregenics)
  • #14 Points of comparison between HNF and HGF: What do the physical models look like? How do you calculate and store derived values? Performance and platform considerations. Merging a new model style into existing EDWs.
  • #19 The source data is converted into an integration layer of 6 tables which contain all the information; this can be conveniently projected into data marts and presentation layers. Convert a drawing into thing types, link types, ...
  • #21 The latest productivity tools for data analytics, such as data virtualization, data warehouse automation, and big data management systems, offer the team a new type of application development cycle that dramatically reduces the labor required to design, build, and deploy each incremental version of the EDW.
  • #25 Data from multiple databases is made accessible through a single virtualization layer.