AGILE DATA WAREHOUSING SOLUTION
Data warehousing + Reporting Solution
Presented by: Sneha Challa
Date: 7/14/2016
Location: San Ramon, CA
Standard approach to DW
Agile Data Warehousing for the Enterprise - Ralph Hughes
Traditional EDW Model
• Brittle to changing requirements
Source: Agile Data Warehousing for the Enterprise
Two Traditional Approaches
• Traditional integration layer – model it in 3NF or higher; ETL loads data into the integration layer before transforming it to populate the star schemas of the presentation layer (a minimal star-schema sketch follows this list).
• Conformed dimensional data warehouse – skips the integration layer and loads the company's data directly into star schemas.
Cons:
• Both approaches lead to data warehouses that are very difficult to modify once data is loaded.
• Brittle in the face of changing requirements.
• Costly redesign and data conversion.
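To make the brittleness concrete, here is a minimal star-schema sketch of the kind a presentation layer typically ends up with. It uses SQLite purely for illustration, and the table and column names are invented for this example, not taken from the deck.

    # Minimal star-schema sketch (illustrative names only): one fact table
    # surrounded by dimension tables, as a traditional presentation layer would have.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
        CREATE TABLE fact_sales (
            sales_key    INTEGER PRIMARY KEY,
            customer_key INTEGER REFERENCES dim_customer(customer_key),
            date_key     INTEGER REFERENCES dim_date(date_key),
            amount       REAL
        );
    """)
    # Adding a new attribute or relationship later means altering these fixed
    # tables and the ETL that loads them -- the brittleness described above.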
Agile Data Engineering
• No need to have the entire data model designed up front.
• Development adapts to changing business requirements.
• No need to re-engineer the existing schema when new entities and relationships arise.
• Simple, reusable ETL modules.
Agile Data Engineering
Source: https://www.youtube.com/watch?v=3QOSOeN8vcY
Example
7 Tables
Source: https://www.youtube.com/watch?v=3QOSOeN8vcY
HNF (Hyper Normalized Form)
Source: Agile Data Warehousing for the Enterprise
HNF model
18 tables – we have hyper-normalized the original 7-table model.
Source: https://www.youtube.com/watch?v=3QOSOeN8vcY
HNF
• Parameter-driven data transformation using ETL scripts: one ETL module for all business-key tables, one (the yellow module) for linking tables, and one that takes all remaining attributes from the source and sends them to the target tables (a sketch of the idea follows below).
• Easily adapts to changing business requirements, even after billions of records have been loaded.
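A rough sketch of what a parameter-driven ETL module might look like in Python, assuming a SQLite-style DB-API connection; the table and column names are hypothetical, not from the deck. The point is that the same module is reused for every business-key table and only the metadata passed in changes.

    # Hypothetical parameter-driven loader: one generic module, driven by metadata.
    def load_business_keys(conn, source_table, key_column, target_anchor_table):
        """Copy distinct business keys from a source table into an anchor table."""
        rows = conn.execute(
            f"SELECT DISTINCT {key_column} FROM {source_table}"
        ).fetchall()
        conn.executemany(
            f"INSERT OR IGNORE INTO {target_anchor_table} (business_key) VALUES (?)",
            rows,
        )

    # The same module handles every entity; only the parameters differ:
    # load_business_keys(conn, "stg_customer", "customer_id", "anchor_customer")
    # load_business_keys(conn, "stg_product",  "product_id",  "anchor_product")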
Data model
Source: https://www.youtube.com/watch?v=3QOSOeN8vcY
HNF
• Caveat: data retrieval gets complex. The SQL needed to reassemble records can get very involved, with outer joins and correlated subqueries (a query sketch follows this list). But does it matter that much? Remember that HNF is used for the integration layer, not so much for the presentation and semantic layers.
• Data is loaded into the integration layer from the source systems using only 3 reusable ETL modules.
• Build the DW a slice at a time and adapt to new business requirements.
• http://www.anchormodeling.com/
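To illustrate the caveat above, here is a hedged sketch of the kind of SQL needed to put one business entity back together from a hyper-normalized (anchor-style) layout. The table names are invented for illustration, and point-in-time logic is omitted.

    # Illustrative only: reassembling a customer row from anchor + attribute tables
    # typically needs one LEFT JOIN per attribute.
    REASSEMBLE_CUSTOMER_SQL = """
    SELECT a.business_key,
           n.name_value,
           r.region_value
    FROM   anchor_customer a
    LEFT JOIN attr_customer_name   n ON n.customer_id = a.customer_id
    LEFT JOIN attr_customer_region r ON r.customer_id = a.customer_id
    """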
Hyper Generalized Form
• Computer-generated warehouse presentation and semantic layers – a labor-saving approach.
• Hand-crafted logical and physical data models are eliminated.
• Can operate at the business level.
• Builds on the notion of special-purpose tables.
• Requires acquiring a data warehouse automation tool that can generate the entire data warehouse infrastructure.
• The entire dataset is represented as 6 tables (a rough illustration follows below).
• Generates the EDW and ETL schemas for all layers.
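The deck does not spell out the six tables, so the following is only a loose illustration of the hyper-generalized idea under my own assumed names: a handful of generic tables (thing types, things, link types, links, attribute types, attributes) instead of one table per entity. It is not the layout of any specific automation tool.

    # Loose illustration of a hyper-generalized layout; names are assumptions.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE thing_type     (type_id INTEGER PRIMARY KEY, type_name TEXT);
        CREATE TABLE thing          (thing_id INTEGER PRIMARY KEY, type_id INTEGER, business_key TEXT);
        CREATE TABLE link_type      (link_type_id INTEGER PRIMARY KEY, link_name TEXT);
        CREATE TABLE link           (link_id INTEGER PRIMARY KEY, link_type_id INTEGER,
                                     from_thing_id INTEGER, to_thing_id INTEGER);
        CREATE TABLE attribute_type (attr_type_id INTEGER PRIMARY KEY, attr_name TEXT);
        CREATE TABLE attribute      (thing_id INTEGER, attr_type_id INTEGER,
                                     attr_value TEXT, valid_from TEXT);
    """)
    # New entities or relationships become new *rows* in thing_type / link_type,
    # not new tables -- which is why the model can adapt without re-engineering.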
Step 1
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Step 2
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Step 3
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Step 4 (Add temporality)
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Step 5
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Step 6
Source: https://www.youtube.com/watch?v=aNtUoVkeq_Q
Big Data Technologies
• Power an iterative discovery and engineering process.
• Read and transform massive amounts of data on cheap commodity hardware using massively parallel processing.
• Schema on read: no need to impose structure on every piece of information gathered (a sketch follows this list).
• Hadoop with more SQL-like features, or a traditional EDW with big data packages – which is more useful?
• Complex event analysis.
• Real-time analytics of high-volume data streams.
• Complex event processing.
• Data mining software.
• Text analytics.
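A small schema-on-read sketch using PySpark, assuming a Spark installation and a hypothetical events.json file with user_id and event_type fields: the structure is inferred when the data is read, not imposed when it is stored.

    # Schema on read with PySpark: no table definition up front; the schema is
    # inferred from the JSON at read time. The file path and fields are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

    events = spark.read.json("hdfs:///raw/events.json")   # structure inferred here
    events.printSchema()                                   # inspect what was found
    events.select("user_id", "event_type").show(5)         # query only what you need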
Big Data Technologies
Products:
• Hadoop and HDFS
• NoSQL databases
• Big Data extensions to RDBMS
Apache Hadoop S/W Components
Source: Agile Data Warehousing for the Enterprise by Ralph Hughes
Hadoop
Reasons to use Hadoop:
• Build a data warehouse for the future: gear up your skills for Hadoop and Big Data as data volumes grow. Major distributions such as Hortonworks, Cloudera, and MapR offer enterprise hub editions that can be deployed.
• A common complaint is that Hadoop is not suited to quick interactive querying, but Cloudera's Impala and Hortonworks' Stinger initiative have made interactive querying much faster.
• The Hortonworks platform provides indexing and search features via Apache Solr, which can make search and querying faster.
• Hortonworks also ships Apache Zeppelin, which brings data visualization and collaboration features to Hadoop and Spark.
• Apache Sqoop to load data from an RDBMS (a PySpark sketch of the same idea follows this list).
• Pig and MapReduce for ETL.
• Hourly, weekly, and monthly workflow schedules.
• Apache Flume to load web log data.
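Sqoop itself is a command-line tool, so as a rough Python-flavored equivalent here is a PySpark JDBC read that pulls a table out of an RDBMS into the cluster. The connection URL, credentials, table name, and output path are placeholders, and this is shown as the same idea, not as Sqoop.

    # Not Sqoop itself -- a PySpark JDBC read illustrating RDBMS ingestion in Python.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdbms-ingest-demo").getOrCreate()

    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/sales")   # placeholder URL
        .option("dbtable", "public.orders")                     # placeholder table
        .option("user", "etl_user")
        .option("password", "***")
        .load()
    )
    orders.write.mode("overwrite").parquet("hdfs:///warehouse/orders")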
Data Virtualization
NoSQL Databases
Advantages:
• Schemaless reads
• Auto-sharding
• Cloud computing (AWS)
• Replication
• No separate application or expensive add-ons
• Integrated caching – in-memory caching for high throughput and low latency
• Open source – e.g., Cassandra, HBase
Types:
• Document stores
• Graph stores – Neo4j and Giraph
• Key-value stores – Riak, Berkeley DB, Redis; complex info stored as BLOBs in value columns (a sketch follows this list)
• Wide-column stores – Cassandra and HBase
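A minimal key-value example with Redis via the redis-py client, storing a complex record as a single value. The host, port, key, and record contents are placeholders.

    # Key-value store sketch with redis-py: a whole customer record is stored
    # as one JSON blob under a single key.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    record = {"customer_id": 42, "name": "Acme Corp", "region": "West"}
    r.set("customer:42", json.dumps(record))        # write the blob
    restored = json.loads(r.get("customer:42"))     # read it back
    print(restored["name"])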
Why Implement NoSQL?
• Big Data keeps getting bigger, and new sources of data continually emerge.
• More users are going online.
• Open source – can be downloaded, implemented, and scaled at little cost.
• A viable alternative to expensive proprietary software.
• Increases the speed and agility of development.
• When requirements change, the data model can change with them.
5 Considerations to Evaluate NoSQL
• Data model
  o Document model – MongoDB, CouchDB; natural mapping of the document object model to OOP; query on any field (a sketch follows this list)
  o Graph databases – traversing relationships is the key; e.g., social networks and supply chains
  o Columnar and wide-column databases
• Query model
• Consistency model
• APIs
• Commercial support & community strength
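A short pymongo sketch of the document model and the "query on any field" point; the connection string, database, collection, and field names are placeholders.

    # Document-model sketch with pymongo: insert a nested document, then query
    # on an arbitrary field without declaring a schema first.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    orders = client["demo_db"]["orders"]

    orders.insert_one({
        "order_id": 1001,
        "customer": {"name": "Acme Corp", "region": "West"},
        "items": [{"sku": "A-1", "qty": 3}],
    })

    # Query on any field, including nested ones:
    for doc in orders.find({"customer.region": "West"}):
        print(doc["order_id"])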
DWaaS
Amazon Redshift
• Cost effective: about $1,000 per terabyte per year
• Columnar storage – fast access, parallelized queries
• MPP data warehouse architecture
• Cheap, simple, secure, and compatible with a SQL interface (a connection sketch follows this list)
• Automated provisioning, configuration, and monitoring of a cloud data warehouse
• Integrations with Amazon S3, Amazon DynamoDB, Amazon Elastic MapReduce, and Amazon Kinesis
• Security is built in
• Managed through the AWS Management Console
• Network isolation using a Virtual Private Cloud (VPC)
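Because Redshift exposes a PostgreSQL-compatible SQL interface, a standard Postgres driver is enough to query it. A hedged sketch with psycopg2 follows; the cluster endpoint, credentials, and table are placeholders.

    # Querying Redshift through its SQL interface with psycopg2.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder
        port=5439,
        dbname="analytics",
        user="report_user",
        password="***",
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region;")
        for region, total in cur.fetchall():
            print(region, total)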
Presentation/Visualization
• Tableau
  o Easy-to-use drag-and-drop interface
  o No code required
  o Connects to Hadoop, cloud, and SQL databases
  o Offers free training
  o Trend analysis, regression, and correlation analysis
  o In-memory data analysis
  o Data blending
  o Clutter-free GUI
• QlikView
  o Faster in-memory computation
Analytics and forecasting
• R, Python, and Apache Spark – for predictive modeling and forecasting.
• Connect the data warehouse to R, Python, and Spark (a scikit-learn sketch follows below).
• R libraries – rpart, randomForest, ROCR, mboost
• Python – scikit-learn, NumPy, pandas, SciPy
• Spark – spark.ml and MLlib
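A tiny scikit-learn sketch of the predictive-modeling step once warehouse data has been pulled into Python. The feature matrix here is random placeholder data; in practice X and y would come from a query against the warehouse.

    # Predictive-modeling sketch with scikit-learn on placeholder data.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))               # stand-in for engineered features
    y = X[:, 0] * 2.0 + rng.normal(size=500)    # stand-in for the target metric

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))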

Editor's Notes

  • #13 Interesting questions: Which flavor of HNF is best for each use case? What does a physical HNF model look like? What are the best platforms on which to model an HNF schema for performance? How do we fold in data governance? Where do we place columns that hold applied business rules (derived columns)? How do we merge an HNF warehouse with a 3NF EDW? Can an HNF warehouse support self-service BI? How do the advantages of HNF compare to hyper generalization? (Ceregenics)
  • #14 Points of comparison between HNF and HGF: What do the physical models look like? How do you calculate and store derived values? Performance and platform considerations. Merging a new model style into existing EDWs.
  • #19 The source data is converted into an integration layer of 6 tables which contain all the information; this can be conveniently projected into data marts and presentation layers. Convert a drawing into thing types, link types, ...
  • #21 The latest productivity tools for data analytics, such as data virtualization, data warehouse automation, and big data management systems, offer the team a new type of application development cycle that dramatically reduces the labor required to design, build, and deploy each incremental version of the EDW.
  • #25 Data from multiple databases is made accessible through a single virtualization layer.