AGILE DATA WAREHOUSING SOLUTION
Data warehousing + Reporting Solution
Presented by: Sneha Challa
Date: 7/14/2016
Location: San Ramon, CA
Standard approach to DW
Source: Agile Data Warehousing for the Enterprise, Ralph Hughes
Traditional EDW Model
• Brittle to changing requirements
Source: Agile Data Warehousing for the Enterprise
Two traditional approaches
 Traditional integration layer – model it in 3NF or higher. ETL loads data into the integration layer before transforming it to populate the star schemas of the presentation layer.
 Conformed dimensional data warehouse – skips the integration layer and loads the company’s data directly into star schemas.
Cons:
 Both approaches lead to data warehouses that are very difficult to modify once data is loaded.
 Brittle in the face of changing requirements.
 Costly redesign and data conversion.
Agile Data Engineering
 No need to have the entire data model designed upfront.
 Development adapts to changing business requirements.
 No need to re-engineer the existing schema when new entities and relationships arise.
 Simple, reusable ETL modules.
Agile Data Engineering
Source:
https://www.youtube.com/watch?v=3QO
SOeN8vcY
Example
7 Tables
Source:
https://www.youtube.com/watch?v=3QO
SOeN8vcY
HNF (Hyper Normalized Form)
Source: Agile Data Warehousing for the Enterprise
HNF model
18 tables – the model has been hyper-normalized.
Source:
https://www.youtube.com/watch?v=3QO
SOeN8vcY
HNF
• Parameter-driven data transforms using ETL scripts: one ETL module for all business-key tables, one (yellow) module for the linking tables, and one that takes all other attributes from the source and loads them into the target tables.
• Easily adapts to changing business requirements even after billions of records have been loaded.
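The reusable-module idea above can be sketched as one parameter-driven function: the table name, business-key column, and source rows are all parameters, so the same code loads every business-key table. The names below (`hub_customer`, `customer_id`) are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

def load_business_keys(conn, table, key_column, source_rows):
    """Generic ETL module: loads distinct business keys into a target table.
    The same code serves every business-key table; only the parameters change."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({key_column} TEXT PRIMARY KEY)")
    for row in source_rows:
        # INSERT OR IGNORE dedupes on the primary key
        conn.execute(f"INSERT OR IGNORE INTO {table} ({key_column}) VALUES (?)",
                     (row[key_column],))

conn = sqlite3.connect(":memory:")
load_business_keys(conn, "hub_customer", "customer_id",
                   [{"customer_id": "C1"}, {"customer_id": "C2"}, {"customer_id": "C1"}])
load_business_keys(conn, "hub_product", "product_id", [{"product_id": "P9"}])

keys = [r[0] for r in conn.execute("SELECT customer_id FROM hub_customer ORDER BY 1")]
print(keys)  # ['C1', 'C2']
```

Adding a new source entity means a new call with new parameters, not new ETL code.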
Data model
Source:
https://www.youtube.com/watch?v=3QO
SOeN8vcY
HNF
• Caveat: data retrieval gets complex. The SQL needed to read data back out can involve many outer joins and correlated subqueries. But does that matter much? Remember, HNF is used for the integration layer, not so much for the presentation and semantic layers.
• Data is stored into the integration layer from the source systems using only three reusable ETL modules.
• Build the DW a slice at a time and adapt to new business requirements.
• http://www.anchormodeling.com/
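To illustrate the retrieval caveat, here is a minimal anchor-style model in SQLite: even with one anchor and two attribute tables, reassembling a single logical entity takes one LEFT JOIN per attribute, and real hyper-normalized models multiply this. All table and column names are invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_anchor (cid INTEGER PRIMARY KEY);
CREATE TABLE customer_name  (cid INTEGER, name TEXT);
CREATE TABLE customer_email (cid INTEGER, email TEXT);
INSERT INTO customer_anchor VALUES (1), (2);
INSERT INTO customer_name  VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO customer_email VALUES (1, 'ada@example.com');  -- customer 2 has no email
""")

# One LEFT JOIN per attribute table reassembles the entity;
# absent attributes come back as NULL instead of dropping the row.
rows = conn.execute("""
    SELECT a.cid, n.name, e.email
    FROM customer_anchor a
    LEFT JOIN customer_name  n ON n.cid = a.cid
    LEFT JOIN customer_email e ON e.cid = a.cid
    ORDER BY a.cid
""").fetchall()
print(rows)  # [(1, 'Ada', 'ada@example.com'), (2, 'Grace', None)]
```

With dozens of attribute tables the join list grows accordingly, which is why this layout stays in the integration layer.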
Hyper Generalized Form
• Computer-generates the warehouse presentation and semantic layers: a labor-saving approach.
• The logical and physical data models are eliminated.
• Can operate at the business level.
• Builds on the notion of a special-purpose table.
• Requires acquiring an automated data warehouse tool that can generate the entire data warehouse infrastructure.
• The entire dataset is represented as six tables.
• Generates the EDW and ETL schema for all layers.
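A rough sketch of the generalization idea, assuming a (thing, attribute, value) layout: every fact is a row in the same generic structure, so new entity types and attributes need new rows, not new tables. The identifiers below are made up; the actual six-table form used by hyper-generalization tools is richer than this.

```python
# Generic store: every fact is a (thing_id, attribute, value) row.
facts = []

def put(thing_id, attribute, value):
    facts.append((thing_id, attribute, value))

put("cust:1", "type", "Customer")
put("cust:1", "name", "Ada")
put("prod:9", "type", "Product")   # a brand-new entity type: no schema change
put("prod:9", "price", 19.99)

def get(thing_id):
    """Reassemble one logical entity from its generic rows."""
    return {a: v for t, a, v in facts if t == thing_id}

print(get("prod:9"))  # {'type': 'Product', 'price': 19.99}
```

The presentation-layer star schemas are then generated from this generic representation rather than hand-modeled.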
Step 1
Source:
https://www.youtube.com/watch?v=aNt
UoVkeq_Q
Step 2
Source:
https://www.youtube.com/watch?v=aNt
UoVkeq_Q
Step 3
Source:
https://www.youtube.com/watch?v=aNt
UoVkeq_Q
Step 4
(Add temporality)
Source:
https://www.youtube.com/watch?v=aNt
UoVkeq_Q
Step 5
Source:
https://www.youtube.com/watch?v=aNt
UoVkeq_Q
Step 6
source:
https://www.youtube.com/watch?v=aNt
UoVkeq_Q
Big Data Technologies
 Power an iterative discovery and engineering process.
 Read and transform massive amounts of data on cheap commodity hardware using massively parallel processing.
 Schema on read: no need to impose structure on every piece of information gathered.
 Hadoop with more SQL-like features, or a traditional EDW with big data packages: which is more useful?
 Complex event analysis.
 Real-time analytics of high-volume data streams.
 Complex event processing.
 Data mining software.
 Text analytics.
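Schema on read can be sketched in a few lines: raw records land as-is, and structure is imposed only when a question is asked. The event fields below are hypothetical.

```python
import json

# Raw events are stored untouched; note the records do not share a schema.
raw = [
    '{"user": "u1", "action": "click", "page": "/home"}',
    '{"user": "u2", "action": "purchase", "amount": 42.5}',
]

def read(raw_lines, fields):
    """Project whichever fields the current query needs; fields a record
    lacks come back as None instead of breaking the load."""
    for line in raw_lines:
        rec = json.loads(line)
        yield tuple(rec.get(f) for f in fields)

result = list(read(raw, ["user", "amount"]))
print(result)  # [('u1', None), ('u2', 42.5)]
```

Contrast with schema on write, where the second record would have been rejected or force-fitted at load time.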
Big Data Technologies
Products:
 Hadoop and HDFS
 NoSQL databases
 Big Data extensions to RDBMS
Apache Hadoop software components
Source: Agile Data Warehousing for the Enterprise by Ralph Hughes
Hadoop
Reasons to use Hadoop:
 Build a data warehouse for the future: gear up your skills for Hadoop and Big Data as data sizes grow. Major distributions such as Hortonworks, Cloudera and MapR offer enterprise editions that can be deployed.
 A common complaint is that Hadoop is not suitable for quick interactive querying; however, Cloudera’s Impala and the Hortonworks Stinger initiative have made interactive querying much faster.
 The Hortonworks platform provides indexing and search features via Apache Solr, which can make search and querying faster.
 Hortonworks also ships Apache Zeppelin, which brings data visualization and collaboration features to Hadoop and Spark.
 Apache Sqoop loads data from relational databases.
 Pig and MapReduce for ETL.
 Hourly, weekly and monthly workflow schedules.
 Apache Flume loads web log data.
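The MapReduce programming model that Pig jobs ultimately compile down to can be shown in miniature: map emits key-value pairs, a shuffle groups them by key, and reduce aggregates each group. A single-process sketch over invented web-log lines (a real job distributes each phase across the cluster):

```python
from collections import defaultdict
from itertools import chain

logs = ["GET /home", "GET /cart", "POST /cart"]

# Map phase: each input record emits (key, 1) pairs, keyed by URL path.
mapped = chain.from_iterable(
    ((path, 1) for path in line.split()[1:]) for line in logs
)

# Shuffle phase: group all values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group independently (hence parallelizable).
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'/home': 1, '/cart': 2}
```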
Data Virtualization
NoSQL Databases
Advantages:
 Schemaless reads
 Auto-sharding
 Cloud computing (AWS)
 Replication
 No separate application or expensive add-ons
 Integrated caching
 In-memory caching for high throughput and low latency
 Open source
 Cassandra, HBase
 Document stores
 Graph stores – Neo4j and Giraph
 Key-value stores – Riak, Berkeley DB, Redis; complex info stored as BLOBs in value columns
 Wide-column stores – Cassandra and HBase
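Auto-sharding typically rests on hash-based placement: hashing a key deterministically picks the node that owns it, so data spreads across the cluster without a central directory. A toy sketch (node names are hypothetical; real systems such as Cassandra use a token ring rather than a simple modulo, so that adding nodes moves less data):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def shard_for(key, nodes=NODES):
    """Deterministically map a key to the node that stores it."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

# Every client computes the same placement, with no coordinator lookup.
print(shard_for("user:42"))
```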
Why Implement NoSQL?
 Big Data keeps getting bigger, and new sources of data emerge continually.
 More users are going online.
 Open source: can be downloaded, implemented and scaled at little cost.
 A viable alternative to expensive proprietary software.
 Increases the speed and agility of development.
 When requirements change, the data model also changes.
5 considerations to evaluate NoSQL
 Data model
 Document model – MongoDB, CouchDB
 Natural mapping of the document object model to OOP
 Query on any field
 Graph databases – when traversing relationships is the key
 Social networks and supply chains
 Columnar and wide-column databases
 Query model
 Consistency model
 APIs
 Commercial support & community strength
DWaaS
Amazon Redshift
 Cost-effective: $1,000 per terabyte per year
 Columnar storage – fast access, parallelized queries
 MPP data warehouse architecture
 Cheap, simple, secure and compatible with a SQL interface
 Automates provisioning, configuring and monitoring of a cloud data warehouse
 Integrations with Amazon S3, Amazon DynamoDB, Amazon Elastic MapReduce and Amazon Kinesis
 Security is built in
 Managed through the AWS Management Console
 Network isolation using Amazon Virtual Private Cloud (VPC)
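Why columnar storage speeds up analytic queries can be sketched by contrasting layouts: a row store must touch whole records to aggregate one field, while a column store scans a single contiguous column (which also compresses well and parallelizes across slices). The figures below are made up.

```python
# Row layout: each record is stored together; aggregating one field
# still drags every other field through the scan.
rows = [
    {"id": 1, "region": "west", "sales": 10.0},
    {"id": 2, "region": "east", "sales": 20.0},
]
row_total = sum(r["sales"] for r in rows)

# Columnar layout: each column is stored contiguously; an aggregate
# reads only the one column it needs.
columns = {
    "id": [1, 2],
    "region": ["west", "east"],
    "sales": [10.0, 20.0],
}
col_total = sum(columns["sales"])

print(row_total == col_total, col_total)  # True 30.0
```

Same answer either way; the columnar layout simply reads far fewer bytes per analytic query.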
Presentation/Visualization
 Tableau
o Easy-to-use drag-and-drop interface
o No code
o Connects to Hadoop, cloud and SQL databases
o Offers free training
o Trend analysis, regression and correlation analysis
o In-memory data analysis
o Data blending
o Clutter-free GUI
 QlikView
o Faster in-memory computation
Analytics and forecasting
• R, Python and Apache Spark for predictive modeling and forecasting
• Connect the data warehouse to R, Python and Spark.
• R libraries – rpart, randomForest, ROCR, mboost
• Python – scikit-learn, NumPy, pandas, SciPy
• Spark – ML and MLlib
Agile data warehousing

  • 1. AGILE DATA WAREHOUSIN G SOLUTION Data warehousing + Reporting Solution Presented by: Sneha Challa Date: 7/14/2016 Location: San Ramon, CA
  • 2. Standard approach to DW Agile Data Warehousing for the enterprise- Ralph Hughes
  • 3. Traditional EDW Model • Brittle to changing requirements Source: Agile Datawarehousing for the enterprise
  • 4. Two traditional approaches  Traditional Integration layer – model it in 3NF and upwards. ETL loads into IL before transforming it to populate the star schema of the presentation layer.  Conformed Dimensional data warehouse skips integration layer to load company’s data directly into star schemas Cons:  Both these approaches lead to DW that are very difficult to modify once the data is loaded.  Brittle in the face of changing requirements.  Costly redesign and data conversion
  • 5. Agile Data Engineering  No need to have the entire data model designed upfront.  Development adapts to changing business requirements.  No need to re-engineer the existing schema when new entities and relationships arise.  Simple, reusable ETL modules.
  • 9. HNF model 18 tables. We have hyper-normalized the 7-table source model. Source: https://www.youtube.com/watch?v=3QOSOeN8vcY
  • 10. HNF • Parameter-driven data transforms using ETL scripts: one ETL module for all business-key tables, one (the yellow module) for linking tables, and one that takes all other attributes from the source and sends them to the target tables. • Easily adapts to new business requirements even after loading billions of records.
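A minimal sketch of what "one parameter-driven ETL module for many targets" can mean in practice. The table and column names are illustrative assumptions, not the deck's actual schema; sqlite3 stands in for whatever ETL platform is used.

```python
import sqlite3

# Hypothetical parameter-driven ETL module: the same function loads any
# attribute table, given the table name, key column, and attribute column
# as parameters instead of hand-coding one job per target table.
def load_attribute(conn, table, key_col, attr_col, rows):
    """Load (business_key, value) pairs into one attribute table."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ({key_col} TEXT, {attr_col} TEXT)"
    )
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
# One module, many targets: customer names and customer emails.
load_attribute(conn, "customer_name",  "customer_id", "name",  [("C1", "Ada")])
load_attribute(conn, "customer_email", "customer_id", "email", [("C1", "ada@example.com")])
print(conn.execute("SELECT name FROM customer_name").fetchone()[0])  # → Ada
```

Adding a new attribute to the warehouse then means one more call with new parameters, not a new ETL job.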
  • 12. HNF • Caveat: data retrieval gets complex. The SQL to get data back can involve many outer joins and correlated subqueries. But does it matter that much? Remember, HNF is used for the integration layer, not so much for the presentation and semantic layers. • Data is stored into the integration layer from the source systems using only 3 reusable ETL modules. • Build the DW a slice at a time and adapt to new business requirements. • http://www.anchormodeling.com/
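The retrieval caveat can be sketched concretely. Below is a tiny hyper-normalized fragment (illustrative names, not the deck's model): one anchor table plus one narrow table per attribute, so reassembling a presentation-layer row costs one outer join per attribute.

```python
import sqlite3

# Hyper-normalized sketch: product anchor + one table per attribute.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product       (product_id INTEGER PRIMARY KEY);
CREATE TABLE product_name  (product_id INTEGER, name TEXT);
CREATE TABLE product_price (product_id INTEGER, price REAL);
INSERT INTO product      VALUES (1);
INSERT INTO product_name VALUES (1, 'Widget');
-- No price loaded yet: the outer joins below still return the product.
""")
row = conn.execute("""
    SELECT p.product_id, n.name, pr.price
    FROM product p
    LEFT OUTER JOIN product_name  n  ON n.product_id  = p.product_id
    LEFT OUTER JOIN product_price pr ON pr.product_id = p.product_id
""").fetchone()
print(row)  # → (1, 'Widget', None)
```

With 18 tables instead of 3, the join list grows accordingly, which is why this shape is kept in the integration layer rather than exposed to report writers.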
  • 13. Hyper Generalized Form • Computer-generates the warehouse presentation and semantic layers: a labor-saving approach. • Logical and physical data models are eliminated. • Can operate at the business level. • Builds on the notion of special-purpose tables. • Need to acquire an automated data warehouse tool that can generate the entire data warehouse infrastructure. • Entire dataset represented as 6 tables. • Generates the EDW and ETL schema for all layers.
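The "entire dataset as 6 tables" idea can be illustrated with a toy in-memory version. The generic table names here are assumptions for illustration, not the schema any particular automation tool generates: the model itself becomes rows of data, so new entities are inserts, not DDL changes.

```python
# Hyper generalization sketch (illustrative): a few generic tables hold the
# whole model as data instead of one physical table per business entity.
things     = [(1, "CUSTOMER"), (2, "ORDER")]           # (thing_id, thing_type)
attributes = [(1, "name", "Ada"), (2, "total", "99")]  # (thing_id, attr, value)
links      = [(2, "placed_by", 1)]                     # (thing_id, link_type, thing_id)

# "Adding an entity type" is now just inserting rows, with no schema change:
things.append((3, "SHIPMENT"))
links.append((3, "fulfills", 2))
print(len(things))  # → 3
```

A warehouse automation tool then projects these generic tables into the star schemas of the presentation layer.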
  • 20. Big Data Technologies  Power an iterative discovery and engineering process.  Read and transform massive amounts of data on cheap commodity hardware using massively parallel processing.  Schema on read: no need to impose structure on every piece of information gathered.  Hadoop with more SQL-like features, or a traditional EDW with big data packages: which is more useful?  Complex event analysis.  Real-time analytics of high-volume data streams  Complex event processing  Data mining software  Text analytics
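A small sketch of schema on read: raw records land with whatever shape they arrived in, and structure is imposed only at query time, so a new field in the source needs no schema change or reload. The field names are made up for illustration.

```python
import json

# Schema-on-read sketch: ingest raw JSON lines as-is, apply structure when reading.
raw = [
    '{"user": "ada", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "country": "US"}',  # extra field, no reload needed
]
records = [json.loads(line) for line in raw]

# The "schema" lives in the query: read only the fields this analysis needs.
total_clicks = sum(r.get("clicks", 0) for r in records)
print(total_clicks)  # → 10
```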
  • 21. Big Data Technologies Products:  Hadoop and HDFS  NoSQL databases  Big Data extensions to RDBMS
  • 22. Apache Hadoop S/W components Source: Agile Data Warehousing for the Enterprise by Ralph Hughes
  • 23. Hadoop Reasons to use Hadoop:  Building a data warehouse for the future: gear up your skills for Hadoop and Big Data as data sizes grow larger. Major distributions like Hortonworks, Cloudera, and MapR have enterprise editions which can be deployed.  There is a complaint that Hadoop is not suitable for quick interactive querying, but Cloudera’s Impala and Hortonworks’ Stinger initiative have made interactive querying much faster.  The Hortonworks platform provides indexing and search features using Apache Solr, which can make search and querying faster.  Hortonworks also offers Apache Zeppelin, which brings data visualization and collaboration features to Hadoop and Spark.  Provides Apache Sqoop to load data from an RDBMS.  Pig and MapReduce for ETL  Hourly, weekly and monthly workflow schedules  Apache Flume to load web log data.
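The MapReduce model that Pig scripts and ETL jobs ultimately compile down to can be sketched in a few lines of plain Python (a toy word count, not Hadoop's actual API): a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit (key, value) pairs from one input record.
    for word in line.split():
        yield word, 1

def reduce_phase(groups):
    # Reduce: aggregate all values emitted under each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["hadoop stores logs", "hadoop processes logs"]
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)  # shuffle: group values by key
counts = reduce_phase(groups)
print(counts["hadoop"])  # → 2
```

On a cluster, the map and reduce calls run in parallel across nodes over HDFS blocks; the control flow is the same.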
  • 25. NoSQL Databases. Advantages:  Schema-less reads  Auto-sharding  Cloud computing (AWS)  Replication  No separate application or expensive add-ons  Integrated caching  In-memory caching for high throughput and low latency  Open source  Cassandra, Redshift, HBase  Document based  Graph stores – Neo4j and Giraph  Key-value stores – Riak, Berkeley DB, Redis; complex info stored as BLOBs in value columns  Wide-column stores – Cassandra and HBase
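The key-value point ("complex info as BLOBs in value columns") can be sketched with a plain dict standing in for the store; the key format and order record are illustrative assumptions, and the store sees the value only as opaque bytes.

```python
import pickle

# Key-value sketch: serialize a complex object into an opaque BLOB,
# keyed by a business identifier, as a store like Riak or Berkeley DB would hold it.
store = {}

order = {"id": "O-17", "items": ["widget", "gadget"], "total": 42.5}
store[b"order:O-17"] = pickle.dumps(order)  # value column holds a BLOB

restored = pickle.loads(store[b"order:O-17"])
print(restored["total"])  # → 42.5
```

The store never interprets the value, which is what makes the model schema-less: the application alone decides the structure.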
  • 26. Why Implement NoSQL?  Big Data keeps getting bigger, and new sources of data emerge continually.  More users are going online.  Open source: can be downloaded, implemented and scaled at little cost.  A viable alternative to expensive proprietary software.  Increases speed and agility of development.  When requirements change, the data model changes with them.
  • 27. 5 considerations to evaluate NoSQL  Data model  Document model – MongoDB, CouchDB  Natural mapping of documents to the OOP object model  Query on any field  Graph databases – traversing relationships is the key  Social networks and supply chains  Columnar and wide-column databases  Query model  Consistency model  APIs  Commercial support & community strength
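The document-model points ("natural mapping to OOP", "query on any field") can be sketched as follows. The `find()` helper and dotted-path syntax are illustrative, not MongoDB's actual API; the idea is that a document is the same nested structure application code already works with.

```python
# Document-model sketch: documents as nested dicts, queryable on any field.
docs = [
    {"name": "Ada", "address": {"city": "San Ramon", "state": "CA"}},
    {"name": "Bob", "address": {"city": "Austin", "state": "TX"}},
]

def find(collection, path, value):
    """Match documents on a dotted field path, e.g. 'address.city'."""
    def get(doc, path):
        for part in path.split("."):
            doc = doc.get(part, {})
        return doc
    return [d for d in collection if get(d, path) == value]

# Any field, top-level or nested, can drive a query with no schema declared.
print(find(docs, "address.state", "CA")[0]["name"])  # → Ada
```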
  • 28. DWaaS: Amazon Redshift  Cost effective: ~$1,000 per terabyte per year  Columnar storage – fast access, parallelized queries  MPP DW architecture  Cheap, simple, secure and compatible with a SQL interface  Automates provisioning, configuring and monitoring of a cloud data warehouse.  Integrations with Amazon S3, Amazon DynamoDB, Amazon Elastic MapReduce, Amazon Kinesis.  Security is built in.  AWS Management Console.  Network isolation using Virtual Private Cloud.
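Why columnar storage makes analytic scans fast can be shown with a toy layout comparison (made-up sales data): an aggregate over one column only needs that column's values, not every field of every row.

```python
# Row layout: each record stored whole; an aggregate must touch every record.
rows = [("2016-07-01", "west", 120.0), ("2016-07-02", "east", 80.0)]
row_total = sum(r[2] for r in rows)

# Column layout: each column stored contiguously; the scan reads one list
# and ignores the date and region columns entirely.
columns = {
    "date":   ["2016-07-01", "2016-07-02"],
    "region": ["west", "east"],
    "sales":  [120.0, 80.0],
}
col_total = sum(columns["sales"])
print(col_total)  # → 200.0
```

Contiguous same-typed values also compress well and split cleanly across MPP nodes, which is the architecture the slide describes.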
  • 29. Presentation/Visualization  Tableau o Easy-to-use drag-and-drop interface o No code o Connects to Hadoop, cloud, and SQL databases o Offers free training o Trend analysis, regression and correlation analysis o In-memory data analysis o Data blending o Clutter-free GUI  QlikView o Faster in-memory computation
  • 30. Analytics and forecasting • R, Python, Apache Spark – for predictive modeling and forecasting • Connect the data warehouse with R, Python and Spark. • R libraries – rpart, randomForest, ROCR, mboost • Python – scikit-learn, NumPy, pandas, SciPy • Spark – ML and MLlib
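As a minimal stand-in for the richer rpart/scikit-learn/MLlib pipelines named above, here is a least-squares trend fit in pure Python on made-up monthly sales, forecasting the next period.

```python
# Least-squares trend fit: sales against month number (illustrative data).
months = [1, 2, 3, 4]
sales  = [10.0, 12.0, 14.0, 16.0]

n = len(months)
mx = sum(months) / n
my = sum(sales) / n
# slope = covariance(x, y) / variance(x)
slope = sum((x - mx) * (y - my) for x, y in zip(months, sales)) \
      / sum((x - mx) ** 2 for x in months)
intercept = my - slope * mx

forecast = slope * 5 + intercept  # predict month 5
print(forecast)  # → 18.0
```

In practice the same fit would run via scikit-learn or Spark MLlib against data pulled from the warehouse; the arithmetic is identical.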

Editor's Notes

  1. Interesting questions: Which flavor of HNF is best for each use case? What does a physical HNF model look like? What are the best platforms for modeling an HNF schema for performance? How do we fold in data governance? Where do we place columns that hold applied business rules (derived columns)? How do we merge an HNF warehouse with a 3NF EDW? Can an HNF warehouse support self-service BI? How do HNF’s advantages compare to hyper generalization? Ceregenics
  2. Points of comparison between HNF and HGF: What do the physical models look like? How do you calculate and store derived values? Performance and platform considerations. Merging a new model style into existing EDWs.
  3. The source data is converted into an integration layer with 6 tables which contain all the information. This can be conveniently projected into data marts and presentation layers. Convert a drawing to thing types, link types,
  4. The latest productivity tools for data analytics, such as data virtualization, data warehouse automation, and big data management systems, offer the team a new type of application development cycle that dramatically reduces the labor required to design, build and deploy each incremental version of the EDW.
  5. Where data from multiple databases is made accessible through a single virtualization layer.