CONFIDENTIAL. COPYRIGHT © 2015 GODADDY INC. ALL RIGHTS RESERVED.
DATA WAREHOUSING
USING HADOOP
MICHELLE UFFORD
DATA PLATFORM @ GODADDY
10 JUNE 2015
WHO AM I?
Some relevant experience…
Data platform architecture
Data warehousing
ETL frameworks & automation
Data quality & monitoring
Operationalizing predictive models
8 Years @ GoDaddy
Product Owner – Data Platform
Principal Engineer – Data Platform
Lead Engineer – Customer Data Warehouse
Senior Engineer – Personalization
Senior DBA – Business Intelligence
SQL Server DBA – Traffic & Messaging
Some highlights…
Data Platform & Cloud ETL Advisor (MSFT)
Published author (books, whitepapers, articles, etc.)
Award-winning blogger & open-source contributor
Microsoft Most Valuable Professional (MVP)
DATA COLLOCATION IS EXTREMELY POWERFUL
Moving our data warehouse to Hadoop enabled greater data integration & allowed us to support all phases of analytics
[Figure: business value by data set]
• Data sets: Customer, Sales, Product, Web, Clickstream, Social Media
• Data types: Structured, Semi-Structured, Unstructured
• Analytics Ascendency Model: Descriptive, Diagnostic, Predictive, Prescriptive
Today’s Agenda
• Platform architecture
• Team principles
• Design patterns
• Batch processing
• New insights
• Tips & suggestions
DATA PLATFORM @ GODADDY
[Architecture diagram; labels recovered from the slide]
• Feeds: Sensors, Events, Logs
• Corporate data: MySQL, SQL Server
• External data sources: Public, Partnerships, Subscriptions, Google Analytics
• Applications & services: Personalization, Hosting, WSB, WebPro, Search, etc.
• Ingress: Automated Ingress, Secure Ingress, DataCollector, Pig & Sqoop, Kafka, Ad Hocs
• Stream processing: Storm, Spark
• Serving platform: ElasticSearch, Cassandra, MySQL, SQL Server
• Data visualization: Tableau, Excel
• Other labels: Teradata, OpenStack
DATA WAREHOUSING @ GODADDY
OR, “THE METHOD TO OUR MADNESS”
TEAM PRINCIPLES
HOW WE GET WORK DONE
MAKE IT EASY
• data should be easy to discover, understand, consume, & maintain
• favor simplicity in both design & process
• automate!
DELIVER VALUE
• weekly Agile sprints
• deliver quickly
• ‘Data First’ design
• be flexible
• focus incessantly on business needs & value
• automate!
ENSURE QUALITY
• data quality is critical to adoption
• expect failures
• data quality at every touch-point
• self-healing design
• proactive visibility into failures, warnings, & exceptions
• automate!
DATA WAREHOUSE – DESIGN DECISIONS
TO SUPPORT ALL TYPES OF ANALYTICS
• Basically, a variant of Kimball
• Wide, denormalized facts
• Integrated, conformed dimensions
• Maintain data at the lowest granularity
• Preserve source data in full fidelity
• Type 4 SCD (“history table”)
• Some differences
• Minimize “reference tables” and “flat dimensions” to reduce the need for expensive joins
• Minimize the need for updates (e.g., store birthdate rather than age)
• Natural keys (instead of surrogate keys)
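The "minimize the need for updates" decision can be illustrated with a short Python sketch: if only the immutable birthdate is stored, age is derived at query time and the row never needs rewriting. The function name `age_on` and the sample customer row are hypothetical and do not come from the deck.

```python
from datetime import date

def age_on(birthdate: date, as_of: date) -> int:
    """Derive age at query time from an immutable birthdate."""
    # True once the birthday has been reached in the as_of year.
    had_birthday = (as_of.month, as_of.day) >= (birthdate.month, birthdate.day)
    return as_of.year - birthdate.year - (0 if had_birthday else 1)

# Hypothetical customer row: only the birthdate is stored, never the age.
customer = {"cust_id": 111, "birthdate": date(1980, 6, 10)}
print(age_on(customer["birthdate"], date(2015, 6, 10)))
```

Because the stored value never changes, an append-only table does not need a new row every time a customer has a birthday.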
TRANSACTIONAL
DESIGN PATTERN
EXAMPLE DIMENSION TABLE – TRANSACTIONAL
DIM_CUSTOMER_TX
ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code
2015-06-01 9:00 AM | {ecomm, 2015-06-01 8:55 AM, update} | False | 111 | | 2007-07-02 | | N/A
2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA
2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 111 | | 2007-07-02 | USA | US
2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A
2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
EXAMPLE DIMENSION TABLE – SNAPSHOT
DIM_CUSTOMER
ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code
2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US
2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA
2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
TRANSACTIONAL VS. SNAPSHOT
TRANSACTIONAL TABLE
• Immutable, append-only
• Unique on [Transaction Timestamp + Natural Key]
• Records have logical deletion indicators
• Sourced from raw imported data
• Populated by Pig script (data engineer)
• New data is always appended
• Minimizes complexity
• Provides dynamic “point-in-time” query functionality
• Typically used for PA, ML, & SOX
• Optimized for ETL processes
SNAPSHOT TABLE
• Mutable “snapshot” that rolls up transactions
• Unique on [Natural Key]
• May either use logical deletion or exclude deleted records
• Sourced from the processed, transactional table
• Populated using an automated snapshotting process
• Replaces the prior snapshot each time it executes
• Automates complexity
• Provides historical visibility via “archives”
• Default data source for most queries & reports
• Optimized for querying
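The relationship between the two tables can be sketched in Python: a snapshot keeps the latest transaction per natural key, and passing a cut-off timestamp yields a "point-in-time" view. This is a minimal illustration only, not the deck's Pig-based snapshotting process; the field names (`tx_ts`, `cust_id`, `deleted`, `country`) are assumptions, and the rows are modeled on the fictitious DIM_CUSTOMER_TX example.

```python
from datetime import datetime

# Rows modeled on the fictitious DIM_CUSTOMER_TX example; field names are assumptions.
TX_ROWS = [
    {"tx_ts": datetime(2015, 6, 1, 8, 55), "cust_id": 111, "deleted": False, "country": "N/A"},
    {"tx_ts": datetime(2015, 6, 1, 9, 14), "cust_id": 222, "deleted": False, "country": "CA"},
    {"tx_ts": datetime(2015, 6, 1, 9, 14), "cust_id": 111, "deleted": False, "country": "US"},
    {"tx_ts": datetime(2015, 6, 1, 9, 22), "cust_id": 333, "deleted": False, "country": "N/A"},
    {"tx_ts": datetime(2015, 6, 1, 9, 37), "cust_id": 111, "deleted": True, "country": "US"},
]

def snapshot(tx_rows, as_of=None):
    """Roll append-only transactions up to one row per natural key."""
    latest = {}
    for row in sorted(tx_rows, key=lambda r: r["tx_ts"]):
        if as_of is not None and row["tx_ts"] > as_of:
            continue  # ignore transactions after the point in time
        latest[row["cust_id"]] = row  # later transactions overwrite earlier ones
    # Logical deletion: deleted customers stay in the snapshot as inactive.
    return {
        cust_id: {"active": not row["deleted"], "country": row["country"]}
        for cust_id, row in latest.items()
    }

current = snapshot(TX_ROWS)
as_of_915 = snapshot(TX_ROWS, as_of=datetime(2015, 6, 1, 9, 15))
```

Here `current` marks customer 111 as inactive, matching the DIM_CUSTOMER example, while `as_of_915` still shows 111 as active and does not yet contain customer 333.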
APPEND-ONLY
DESIGN PATTERN
EXAMPLE FACT TABLE – APPEND-ONLY
FACT_USER_EVENT
ETL Timestamp | Event Timestamp | Event Type | Customer ID | Location | Event JSON
2015-06-01 9:00 AM | 2015-06-01 8:55 AM | wsb.login | 111 | UK | {"event":"wsb.login","cust_id":111,"wsb_id":579}
2015-06-01 9:15 AM | 2015-06-01 9:14 AM | call.inbound | 222 | IN | {"event":"call.inbound","cust_id":222,"rep_id":25}
2015-06-01 9:15 AM | 2015-06-01 9:14 AM | account.create | 333 | US | {"event":"account.create","cust_id":333}
2015-06-01 9:30 AM | 2015-06-01 9:22 AM | wsb.config | 111 | UK | {"event":"wsb.config","cust_id":111,"wsb_id":579}
2015-06-01 9:45 AM | 2015-06-01 9:37 AM | o365.provision | 222 | IN | {"event":"o365.provision","cust_id":222,"rep_id":25}
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
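One plausible reading of this layout is that the fixed columns serve common filters while the JSON column carries event-specific attributes that can vary without a schema change. The Python sketch below shows how a consumer might read such a payload; the helper `payload_field` and the reduced three-column tuples are assumptions made for illustration.

```python
import json

# Subset of the fictitious FACT_USER_EVENT rows: (event_type, cust_id, event_json).
FACT_USER_EVENT = [
    ("wsb.login", 111, '{"event":"wsb.login","cust_id":111,"wsb_id":579}'),
    ("call.inbound", 222, '{"event":"call.inbound","cust_id":222,"rep_id":25}'),
    ("account.create", 333, '{"event":"account.create","cust_id":333}'),
]

def payload_field(rows, event_type, field):
    """Return `field` from the JSON payload of every row of one event type."""
    values = []
    for row_type, _cust_id, event_json in rows:
        if row_type == event_type:
            # .get() yields None when an event does not carry the attribute.
            values.append(json.loads(event_json).get(field))
    return values

print(payload_field(FACT_USER_EVENT, "wsb.login", "wsb_id"))
```

Filtering on the typed `Event Type` column first keeps the JSON parsing limited to the rows that are actually needed.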
BATCH PROCESSING
OR, “MAKING IT ALL HAPPEN”
HDFS DRILL-DOWN
LOGICAL DATA LAYERS
[Layer diagram; labels recovered from the slide]
• Data Ingress Layer (raw data): Raw Feeds, External Data
• Enterprise Data Layer (data warehouse): Transactional Table (tx), Snapshot Table (snap), Hourly Snapshot (view)
• Data Consumption Layer (user / team): Transactional Table (tx), Snapshot Table (snap), Hourly Snapshot (view)
• Sources & staging: Kafka, RDBMS, Stage VMs, HDFS
• Data categories: Append-Only Data, Integrated Data, Pre-Aggregated Data, Transformed Data
• Legend: Mostly or Fully Automated; Requires Manual Intervention
Logical construct only! Users & processes can consume from any layer
BATCH PROCESSING DRILL-DOWN
INCREMENTAL PATTERN
Process
• next(tx_date) = $date
• foreach destination server: prep()
• execute(script.pig)
  1. filter transactional source (tx_date=$date)
  2. store transactional to HDFS (tx_date=$date)
  3. store aggregations to HDFS (tx_date=$date)
  4. store to destination server(s)
• execute(data_quality_tests)
• if (tests=pass): merge(destination)
• replace(dim_customer_snapshot)
HDFS
• Data Ingress Layer (raw data): data-ingress / ecomm / customer_tx / tx_date=20150601, / tx_date=20150602
• Enterprise Data Layer (data warehouse): data-mgmt / dim_customer_tx / tx_date=20150601, / tx_date=20150602
• Data Consumption Layer (user / team): data-rpt / new_customers / tx_date=20150601, / tx_date=20150602
• Snapshots: customer (snapshot), dim_customer (snapshot)
SERVING PLATFORM: MySQL, SQL Server, Cassandra
Legend: Mostly or Fully Automated; Requires Manual Intervention
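The control flow of the incremental pattern (filter one `tx_date` partition, stage it, test it, and merge only on success) can be sketched in Python. This is an illustration under stated assumptions, not the deck's Pig implementation: `run_incremental`, the two sample tests, and the in-memory `warehouse` dictionary are all hypothetical stand-ins.

```python
def run_incremental(tx_date, source_rows, warehouse, quality_tests):
    """One pass of the incremental pattern for a single tx_date partition."""
    # 1. Filter the transactional source down to the partition being processed.
    batch = [r for r in source_rows if r["tx_date"] == tx_date]
    # 2. Stage the partition; nothing is published until the tests pass.
    staged = {"tx_date": tx_date, "rows": batch}
    # 3. Run every data quality test against the staged partition.
    failures = [t.__name__ for t in quality_tests if not t(staged)]
    if failures:
        return {"merged": False, "failures": failures}
    # 4. Merge. Assumption: the partition is replaced wholesale, so reruns are idempotent.
    warehouse[tx_date] = batch
    return {"merged": True, "failures": []}

def not_empty(staged):
    """Hypothetical test: the partition must contain at least one row."""
    return len(staged["rows"]) > 0

def has_natural_key(staged):
    """Hypothetical test: every row must carry its natural key."""
    return all(r.get("cust_id") is not None for r in staged["rows"])

source = [
    {"tx_date": "20150601", "cust_id": 111},
    {"tx_date": "20150602", "cust_id": 222},
]
warehouse = {}
result = run_incremental("20150601", source, warehouse, [not_empty, has_natural_key])
```

Gating the merge on the test results keeps a failed partition out of the destination, which is one way to read "data quality at every touch-point".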
REAL BUSINESS RESULTS
OR, “HADOOP: NOT JUST HYPE!”
HADOOP ENABLES GREATER & QUICKER DW VALUE
• Better use of data engineers
• data ingress is largely automated
• reduces (not eliminates) the traditional 75-80% of project time spent on ETL
• Well-suited for Agile
• source data is preserved in full fidelity
• minimizes permanence of design decisions
• roll out changes weekly
• Data integration
• access to the other 79.7% of the company’s data
• flexible data models using complex data types
• Single source of data processing
• Process once => export 0:N destinations; supports all data consumers
• Frees up expensive database resources
• Single enterprise solution for data quality, monitoring, etc.
PROCESSED DATA HAS GREATER REACH
• Descriptive: What has happened / is happening?
• Diagnostic: Why did it happen?
• Predictive: What will likely happen?
• Prescriptive: How can we make it happen?
The same attributes used for reporting can be inputs into PA models & ML algorithms!
[Figure annotations: primarily uses snapshots; primarily uses transactional; uses both snapshots & transactional]
UNLOCKING NEW, ACTIONABLE INSIGHTS
[Figure; labels recovered from the slide]
• Insights: Churn Analysis, Customer Dashboard, New Attributes, Sentiment Analysis
• Data combinations: Enterprise data + Clickstream data; Enterprise data + Product data + Event data; Enterprise data + External data; Enterprise data + Call transcripts
• Axis labels: Customer Experience, Business Value, Complex Data, Complex Analysis
TIPS & LESSONS LEARNED
OR, “HOW TO BE AWESOME”
SUGGESTIONS TO IMPROVE YOUR HADOOP DW PROJECT
• It’s not that new: Traditional DW & Big DW are more similar than dissimilar
• Standardize on technology: Pig for ETL, Hive for analysis
• Focus on simplicity: if your data isn’t easy to use, you’ve failed
• Embrace flexibility: don’t shy away from complex data types
• Be predictable: use HCatalog & consistent naming standards
• Don’t be afraid of change: use data abstraction to minimize impact to consumers
• Do quick prototyping: use external tables & data extractions via Hive ODBC
• Democratize data: amazing insights can come from anywhere; embrace new data consumers
• Don’t stop at data: focus on the Big Picture – the final outcome – to identify bottlenecks
QUESTIONS?
THANK YOU FOR ATTENDING! 
APPENDIX
BATCH PROCESSING EXAMPLE – PRE-DEPLOYMENT
DIM_CUSTOMER_TX – TRANSACTIONAL
ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code
2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA
2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A
2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
DIM_CUSTOMER_TX – TRANSACTIONAL
ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code | Gender
2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA |
2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A |
2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US |
2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:14 AM, deploy} | False | 222 | Female | 2010-06-06 | CA | CA | Female
2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:22 AM, deploy} | False | 333 | Male | 2015-06-01 | blah | N/A | Male
2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:37 AM, deploy} | True | 111 | | 2007-07-02 | USA | US | Female
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
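The post-deployment table suggests that a new column is rolled out by appending one "deploy" row per natural key rather than by updating existing rows. The Python sketch below shows that idea under stated assumptions: `deploy_column`, the field names, and the `derive` callback are hypothetical, and the deck does not say how the new Gender value is actually derived.

```python
from datetime import datetime

def deploy_column(tx_rows, column, derive, etl_ts):
    """Append one 'deploy' row per natural key carrying the new column."""
    latest = {}
    for row in tx_rows:  # rows are assumed to be in append (ETL) order
        latest[row["cust_id"]] = row
    backfill = [
        # Copy the latest row, stamp it with the deployment time, and add the new column.
        {**row, "etl_ts": etl_ts, "action": "deploy", column: derive(row)}
        for row in latest.values()
    ]
    return tx_rows + backfill  # existing rows are never modified

rows = [
    {"etl_ts": datetime(2015, 6, 1, 9, 15), "action": "update", "cust_id": 222, "original_gender": "Female"},
    {"etl_ts": datetime(2015, 6, 1, 9, 30), "action": "insert", "cust_id": 333, "original_gender": "Male"},
]
# Hypothetical derivation: reuse the original value as the new column's value.
deployed = deploy_column(rows, "gender", lambda r: r["original_gender"], datetime(2015, 6, 1, 10, 0))
```

Because the older rows are left untouched, the table stays append-only and the next snapshot run can pick up the new column from the latest row per key.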
BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
DIM_CUSTOMER – SNAPSHOT
ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code | Gender
2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US | Female
2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA | Female
2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A | Male
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only

More Related Content

What's hot

Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Vivek Aanand Ganesan
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the Lakehouse
Databricks
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Durga Gadiraju
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
C4Media
 
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisation
Ori Reshef
 
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduMoving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache Kudu
Cloudera, Inc.
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Data Mesh
Data MeshData Mesh
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
Amazon Web Services
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
Alluxio, Inc.
 
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Edureka!
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
Denodo
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
Kashif Khan
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
EDB
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Data engineering
Data engineeringData engineering
Data engineering
Parimala Killada
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Amazon Web Services
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
DataWorks Summit
 

What's hot (20)

Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the Lakehouse
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisation
 
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduMoving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache Kudu
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Data Mesh
Data MeshData Mesh
Data Mesh
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Data engineering
Data engineeringData engineering
Data engineering
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 

Viewers also liked

Big data hadoop rdbms
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbms
Arjen de Vries
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Cloudera, Inc.
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
DataWorks Summit
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
Guy Harrison
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
Zheng Shao
 
Sage Business Intelligence Solutions Comparison
Sage Business Intelligence Solutions ComparisonSage Business Intelligence Solutions Comparison
Sage Business Intelligence Solutions Comparison
RKLeSolutions
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
hadooparchbook
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
Caserta
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
sudhakara st
 

Viewers also liked (9)

Big data hadoop rdbms
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbms
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Sage Business Intelligence Solutions Comparison
Sage Business Intelligence Solutions ComparisonSage Business Intelligence Solutions Comparison
Sage Business Intelligence Solutions Comparison
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 

Similar to Data Warehousing using Hadoop

Data Warehousing Patterns for Hadoop
Data Warehousing Patterns for HadoopData Warehousing Patterns for Hadoop
Data Warehousing Patterns for Hadoop
Michelle Ufford
 
Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.
Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.
Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.
Milomir Vojvodic
 
Driving Real Insights Through Data Science
Driving Real Insights Through Data ScienceDriving Real Insights Through Data Science
Driving Real Insights Through Data Science
VMware Tanzu
 
Pivotal Big Data Roadshow
Pivotal Big Data Roadshow Pivotal Big Data Roadshow
Pivotal Big Data Roadshow
VMware Tanzu
 
Balance agility and governance with #TrueDataOps and The Data Cloud
Balance agility and governance with #TrueDataOps and The Data CloudBalance agility and governance with #TrueDataOps and The Data Cloud
Balance agility and governance with #TrueDataOps and The Data Cloud
Kent Graziano
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard Rails
Denodo
 
Automate Hadoop Jobs with Real World Business Impact
Automate Hadoop Jobs with Real World Business ImpactAutomate Hadoop Jobs with Real World Business Impact
Automate Hadoop Jobs with Real World Business Impact
CA Technologies
 
Realtime Reporting with GoldenGate
Realtime Reporting with GoldenGateRealtime Reporting with GoldenGate
Realtime Reporting with GoldenGate
Emtec Inc.
 
Agile Data Warehousing
Agile Data WarehousingAgile Data Warehousing
Agile Data Warehousing
Davide Mauri
 
A Revolutionary Approach to Modernizing the Data Warehouse
A Revolutionary Approach to Modernizing the Data WarehouseA Revolutionary Approach to Modernizing the Data Warehouse
A Revolutionary Approach to Modernizing the Data Warehouse
Inside Analysis
 
Achieving Agility and Scale for Your Data Lake - Talend
Achieving Agility and Scale for Your Data Lake - TalendAchieving Agility and Scale for Your Data Lake - Talend
Achieving Agility and Scale for Your Data Lake - Talend
Talend
 
Business Data Lake Best Practices
Business Data Lake Best PracticesBusiness Data Lake Best Practices
Business Data Lake Best Practices
Capgemini
 
Test Data Management 101—Featuring a Tour of CA Test Data Manager (Formerly G...
Test Data Management 101—Featuring a Tour of CA Test Data Manager (Formerly G...Test Data Management 101—Featuring a Tour of CA Test Data Manager (Formerly G...
Test Data Management 101—Featuring a Tour of CA Test Data Manager (Formerly G...
CA Technologies
 
Insight Facts & Figures
Insight Facts & FiguresInsight Facts & Figures
Insight Facts & Figures
Vince Caldwell
 
Sabre: Mastering a strong foundation for operational excellence and enhanced ...
Sabre: Mastering a strong foundation for operational excellence and enhanced ...Sabre: Mastering a strong foundation for operational excellence and enhanced ...
Sabre: Mastering a strong foundation for operational excellence and enhanced ...
Orchestra Networks
 
Oracle Data Integration CON9737 at OpenWorld
Oracle Data Integration CON9737 at OpenWorldOracle Data Integration CON9737 at OpenWorld
Oracle Data Integration CON9737 at OpenWorld
Jeffrey T. Pollock
 
Fast Data: Achieving Real-Time Data Analysis Across the Financial Data Continuum
Fast Data: Achieving Real-Time Data Analysis Across the Financial Data ContinuumFast Data: Achieving Real-Time Data Analysis Across the Financial Data Continuum
Fast Data: Achieving Real-Time Data Analysis Across the Financial Data Continuum
VoltDB
 
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB
 
Drive Business Outcomes for Big Data Environments
Drive Business Outcomes for Big Data EnvironmentsDrive Business Outcomes for Big Data Environments
Drive Business Outcomes for Big Data Environments
Cisco Services
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual Workshop
CCG
 

Similar to Data Warehousing using Hadoop (20)

Data Warehousing Patterns for Hadoop
Data Warehousing Patterns for HadoopData Warehousing Patterns for Hadoop
Data Warehousing Patterns for Hadoop
 
Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.
Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.
Milomir Vojvodic - Business Analytics And Big Data Partner Forum Dubai 15.11.
 
Driving Real Insights Through Data Science
Driving Real Insights Through Data ScienceDriving Real Insights Through Data Science
Driving Real Insights Through Data Science
 
Pivotal Big Data Roadshow
Pivotal Big Data Roadshow Pivotal Big Data Roadshow
Pivotal Big Data Roadshow
 
Balance agility and governance with #TrueDataOps and The Data Cloud
Balance agility and governance with #TrueDataOps and The Data CloudBalance agility and governance with #TrueDataOps and The Data Cloud
Balance agility and governance with #TrueDataOps and The Data Cloud
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard Rails
 
Automate Hadoop Jobs with Real World Business Impact
Automate Hadoop Jobs with Real World Business ImpactAutomate Hadoop Jobs with Real World Business Impact
Automate Hadoop Jobs with Real World Business Impact
 
Realtime Reporting with GoldenGate
Realtime Reporting with GoldenGateRealtime Reporting with GoldenGate
Realtime Reporting with GoldenGate
 
Agile Data Warehousing
Agile Data WarehousingAgile Data Warehousing
Agile Data Warehousing
 
A Revolutionary Approach to Modernizing the Data Warehouse
A Revolutionary Approach to Modernizing the Data WarehouseA Revolutionary Approach to Modernizing the Data Warehouse
A Revolutionary Approach to Modernizing the Data Warehouse
 
Achieving Agility and Scale for Your Data Lake - Talend





Data Warehousing using Hadoop

  • 1. CONFIDENTIAL. COPYRIGHT © 2015 GODADDY INC. ALL RIGHTS RESERVED. DATA WAREHOUSING USING HADOOP MICHELLE UFFORD DATA PLATFORM @ GODADDY 10 JUNE 2015
  • 2. WHO AM I?
      Some relevant experience:
      • Data platform architecture
      • Data warehousing
      • ETL frameworks & automation
      • Data quality & monitoring
      • Operationalizing predictive models
      8 years @ GoDaddy:
      • Product Owner – Data Platform
      • Principal Engineer – Data Platform
      • Lead Engineer – Customer Data Warehouse
      • Senior Engineer – Personalization
      • Senior DBA – Business Intelligence
      • SQL Server DBA – Traffic & Messaging
      Some highlights:
      • Data Platform & Cloud ETL Advisor (MSFT)
      • Published author (books, whitepapers, articles, etc.)
      • Award-winning blogger & open-source contributor
      • Microsoft Most Valuable Professional (MVP)
  • 3. DATA COLLOCATION IS EXTREMELY POWERFUL
      Moving our data warehouse to Hadoop enabled greater data integration & allowed us to support all phases of analytics.
      [Figure: business value by data set. Data sets: Customer, Sales, Product, Web Clickstream, Social Media. Data types: Structured, Semi-Structured, Unstructured. Analytics Ascendency Model: Descriptive, Diagnostic, Predictive, Prescriptive.]
      Today’s Agenda
      • Platform architecture
      • Team principles
      • Design patterns
      • Batch processing
      • New insights
      • Tips & suggestions
  • 4. DATA PLATFORM @ GODADDY
      [Architecture diagram; labels only]
      • Feeds: Sensors, Events, Logs
      • Corporate data: MySQL, SQL Server
      • External data sources: Public, Partnerships, Subscriptions, Google Analytics
      • Stream processing: Storm, Spark
      • Serving platform: ElasticSearch, Cassandra, MySQL, SQL Server
      • Data visualization: Tableau, Excel
      • Applications & services: Personalization, Hosting, WSB, WebPro, Search, etc.
      • Other labels: Teradata, OpenStack, Automated Ingress, Secure Ingress, DataCollector, Pig & Sqoop, Ad Hocs, Kafka
  • 5. DATA WAREHOUSING @ GODADDY, OR “THE METHOD TO OUR MADNESS”
  • 6. TEAM PRINCIPLES: HOW WE GET WORK DONE
      MAKE IT EASY
      • data should be easy to discover, understand, consume, & maintain
      • favor simplicity in both design & process
      • automate!
      DELIVER VALUE
      • weekly Agile sprints
      • deliver quickly
      • ‘Data First’ design
      • be flexible
      • focus incessantly on business needs & value
      • automate!
      ENSURE QUALITY
      • data quality is critical to adoption
      • expect failures
      • data quality at every touch-point
      • self-healing design
      • proactive visibility into failures, warnings, & exceptions
      • automate!
  • 7. DATA WAREHOUSE – DESIGN DECISIONS TO SUPPORT ALL TYPES OF ANALYTICS
      • Basically, a variant of Kimball
          • Wide, denormalized facts
          • Integrated, conformed dimensions
          • Maintain data at the lowest granularity
          • Preserve source data in full fidelity
          • Type 4 SCD (“history table”)
      • Some differences
          • Minimize “reference tables” and “flat dimensions” to reduce the need for expensive joins
          • Minimize the need for updates (e.g. store birthdate rather than age)
          • Natural keys (instead of surrogate keys)
  • 8. TRANSACTIONAL DESIGN PATTERN
  • 9. EXAMPLE DIMENSION TABLE – TRANSACTIONAL
      DIM_CUSTOMER_TX
      ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code
      2015-06-01 9:00 AM | {ecomm, 2015-06-01 8:55 AM, update} | False | 111 | | 2007-07-02 | | N/A
      2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA
      2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 111 | | 2007-07-02 | USA | US
      2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A
      2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US
      Obligatory disclaimer: this is fictitious data used for demonstration purposes only.
  • 10. EXAMPLE DIMENSION TABLE – SNAPSHOT
      DIM_CUSTOMER
      ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code
      2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US
      2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA
      2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A
      Obligatory disclaimer: this is fictitious data used for demonstration purposes only.
  • 11. TRANSACTIONAL VS. SNAPSHOT
      TRANSACTIONAL TABLE
      • Immutable, append-only
      • Unique on [Transaction Timestamp + Natural Key]
      • Records have logical deletion indicators
      • Sourced from raw imported data
      • Populated by Pig script (data engineer)
      • New data is always appended
      • Minimizes complexity
      • Provides dynamic “point-in-time” query functionality
      • Typically used for PA, ML, & SOX
      • Optimized for ETL processes
      SNAPSHOT TABLE
      • Mutable “snapshot” that rolls up transactions
      • Unique on [Natural Key]
      • May either use logical deletion or exclude deleted records
      • Sourced from the processed, transactional table
      • Populated using an automated snapshotting process
      • Replaces the prior snapshot each time it executes
      • Automates complexity
      • Provides historical visibility via “archives”
      • Default data source for most queries & reports
      • Optimized for querying
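The relationship between the two table types on slides 9 to 11 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the deck says the transactional table is populated by Pig and the snapshot by an automated process, but shows neither, so the field names (`tx_ts`, `cust_id`, `deleted`, `country_code`) and the `snapshot` function are hypothetical. The rows are a subset of the columns in the fictitious DIM_CUSTOMER_TX example.

```python
from datetime import datetime

# Rows mirror the fictitious DIM_CUSTOMER_TX example (slide 9).
# Field names are illustrative assumptions, not the real schema.
TX_ROWS = [
    {"tx_ts": datetime(2015, 6, 1, 8, 55), "cust_id": 111, "deleted": False, "country_code": "N/A"},
    {"tx_ts": datetime(2015, 6, 1, 9, 14), "cust_id": 222, "deleted": False, "country_code": "CA"},
    {"tx_ts": datetime(2015, 6, 1, 9, 14), "cust_id": 111, "deleted": False, "country_code": "US"},
    {"tx_ts": datetime(2015, 6, 1, 9, 22), "cust_id": 333, "deleted": False, "country_code": "N/A"},
    {"tx_ts": datetime(2015, 6, 1, 9, 37), "cust_id": 111, "deleted": True, "country_code": "US"},
]


def snapshot(tx_rows, as_of=None):
    """Roll up append-only transactions to one row per natural key.

    With as_of set, later transactions are ignored, which gives a
    point-in-time view without updating the transactional table.
    Deleted customers are kept and flagged inactive (logical deletion).
    """
    latest = {}
    for row in sorted(tx_rows, key=lambda r: r["tx_ts"]):
        if as_of is not None and row["tx_ts"] > as_of:
            continue
        latest[row["cust_id"]] = row  # the most recent transaction wins
    return {
        cust_id: {"active": not row["deleted"], "country_code": row["country_code"]}
        for cust_id, row in latest.items()
    }


current = snapshot(TX_ROWS)
at_nine = snapshot(TX_ROWS, as_of=datetime(2015, 6, 1, 9, 0))
```

Here `current` marks customer 111 as inactive with country code US, in line with the DIM_CUSTOMER example on slide 10, while `at_nine` sees only the 8:55 AM transaction for customer 111. Excluding deleted records instead of flagging them, the other option the slide mentions, would be a one-line filter on the result.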
  • 12. APPEND-ONLY DESIGN PATTERN
  • 13. EXAMPLE FACT TABLE – APPEND-ONLY
      FACT_USER_EVENT
      ETL Timestamp | Event Timestamp | Event Type | Customer ID | Location | Event JSON
      2015-06-01 9:00 AM | 2015-06-01 8:55 AM | wsb.login | 111 | UK | {"event":"wsb.login","cust_id":111,"wsb_id":579}
      2015-06-01 9:15 AM | 2015-06-01 9:14 AM | call.inbound | 222 | IN | {"event":"call.inbound","cust_id":222,"rep_id":25}
      2015-06-01 9:15 AM | 2015-06-01 9:14 AM | account.create | 333 | US | {"event":"account.create","cust_id":333}
      2015-06-01 9:30 AM | 2015-06-01 9:22 AM | wsb.config | 111 | UK | {"event":"wsb.config","cust_id":111,"wsb_id":579}
      2015-06-01 9:45 AM | 2015-06-01 9:37 AM | o365.provision | 222 | IN | {"event":"o365.provision","cust_id":222,"rep_id":25}
      Obligatory disclaimer: this is fictitious data used for demonstration purposes only.
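Because each FACT_USER_EVENT row on slide 13 keeps the full event as JSON, attributes that were never promoted to columns remain queryable. The sketch below shows the idea in Python with the standard `json` module. The tuple layout and the `attribute_values` helper are assumptions made for illustration, not the deck's implementation, which would more likely be Pig or Hive.

```python
import json
from collections import Counter

# (event_type, cust_id, event_json) taken from the fictitious
# FACT_USER_EVENT example (slide 13).
FACT_USER_EVENT = [
    ("wsb.login", 111, '{"event":"wsb.login","cust_id":111,"wsb_id":579}'),
    ("call.inbound", 222, '{"event":"call.inbound","cust_id":222,"rep_id":25}'),
    ("account.create", 333, '{"event":"account.create","cust_id":333}'),
    ("wsb.config", 111, '{"event":"wsb.config","cust_id":111,"wsb_id":579}'),
    ("o365.provision", 222, '{"event":"o365.provision","cust_id":222,"rep_id":25}'),
]


def attribute_values(rows, attribute):
    """Yield (cust_id, value) for events whose JSON payload has the attribute."""
    for _event_type, cust_id, payload in rows:
        event = json.loads(payload)
        if attribute in event:
            yield cust_id, event[attribute]


# Count events handled per support rep, using a field that only exists in the JSON.
rep_contacts = Counter(rep for _cust, rep in attribute_values(FACT_USER_EVENT, "rep_id"))
```

Events without the requested attribute are skipped rather than treated as errors, which suits a table where every event type carries a different payload.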
  • 14. BATCH PROCESSING, OR “MAKING IT ALL HAPPEN”
  • 15. DRILL-DOWN: LOGICAL DATA LAYERS
      [Diagram; labels only]
      • Data Ingress Layer (raw data)
      • Enterprise Data Layer (data warehouse)
      • Data Consumption Layer (user / team)
      • Sources and storage: Stage VMs, Raw Feeds, External Data, Kafka, RDBMS, HDFS
      • Table types: Transactional Table (tx), Snapshot Table (snap), Hourly Snapshot (view)
      • Data categories: Append-Only Data, Transformed Data, Integrated Data, Pre-Aggregated Data
      • Legend: mostly or fully automated; requires manual intervention
      Logical construct only! Users & processes can consume from any layer.
  • 16. BATCH PROCESSING DRILL-DOWN: INCREMENTAL PATTERN
      Process
      • next(tx_date) = $date
      • foreach destination server prep()
      • execute(script.pig)
          1. filter transactional source (tx_date=$date)
          2. store transactional to HDFS (tx_date=$date)
          3. store aggregations to HDFS (tx_date=$date)
          4. store to destination server(s)
      • execute(data_quality_tests)
      • if (tests=pass) merge(destination)
      • replace(dim_customer_snapshot)
      HDFS partitions shown (each with tx_date=20150601 and tx_date=20150602)
      • Data Ingress Layer (raw data): data-ingress / ecomm / customer_tx
      • Enterprise Data Layer (data warehouse): data-mgmt / dim_customer_tx
      • Data Consumption Layer (user / team): data-rpt / new_customers
      Other diagram labels: customer (snapshot); dim_customer (snapshot); Serving platform: MySQL, SQL Server, Cassandra. Legend: mostly or fully automated; requires manual intervention.
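The process listed on slide 16 reads as a small control loop, sketched here in Python. Everything in it is a stand-in: the four callables represent the Pig script, the data quality tests, the destination merge, and the snapshot job, none of which the deck shows, and `partition_path` simply follows the `tx_date=YYYYMMDD` layout visible on the slide. The sketch also assumes the snapshot is replaced only after the tests pass, which the slide does not state explicitly.

```python
from datetime import date, timedelta


def partition_path(layer, dataset, tx_date):
    """Build a path such as data-mgmt/dim_customer_tx/tx_date=20150601."""
    return f"{layer}/{dataset}/tx_date={tx_date:%Y%m%d}"


def run_incremental(last_processed, run_etl, run_quality_tests, merge, replace_snapshot):
    """Process the next tx_date partition; merge only if the quality tests pass.

    All four callables are hypothetical stand-ins for the real jobs.
    Returns (tx_date, succeeded).
    """
    tx_date = last_processed + timedelta(days=1)  # next(tx_date) = $date
    source = partition_path("data-ingress/ecomm", "customer_tx", tx_date)
    target = partition_path("data-mgmt", "dim_customer_tx", tx_date)
    run_etl(source, target)  # execute(script.pig)
    if not run_quality_tests(target):
        return tx_date, False  # leave destinations and the snapshot untouched
    merge(target)
    replace_snapshot("dim_customer")
    return tx_date, True


# Dry run with stubs that only record what would have happened.
log = []
result = run_incremental(
    date(2015, 5, 31),
    run_etl=lambda src, dst: log.append(("etl", src, dst)),
    run_quality_tests=lambda dst: True,
    merge=lambda dst: log.append(("merge", dst)),
    replace_snapshot=lambda name: log.append(("snapshot", name)),
)
```

Processing one `tx_date` partition per run keeps each run re-runnable: a failed quality test leaves earlier partitions and the previous snapshot as they were, and the same date can be retried.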
  • 17. REAL BUSINESS RESULTS, OR “HADOOP: NOT JUST HYPE!”
  • 18. HADOOP ENABLES GREATER & QUICKER DW VALUE
      • Better use of data engineers
          • data ingress is largely automated
          • reduces (not eliminates) the traditional 75-80% of project time spent on ETL
      • Well-suited for Agile
          • source data is preserved in full fidelity
          • minimizes permanence of design decisions
          • roll out changes weekly
      • Data integration
          • access to the other 79.7% of the company’s data
          • flexible data models using complex data types
      • Single source of data processing
          • process once => export to 0:N destinations; supports all data consumers
          • frees up expensive database resources
          • single enterprise solution for data quality, monitoring, etc.
  • 19. PROCESSED DATA HAS GREATER REACH
      • Descriptive: What has / is happening?
      • Diagnostic: Why did it happen?
      • Predictive: What will likely happen?
      • Prescriptive: How can we make it happen?
      The same attributes used for reporting can be inputs into PA models & ML algorithms!
      Legend: primarily uses snapshots; primarily uses transactional; uses both snapshots & transactional.
  • 20. UNLOCKING NEW, ACTIONABLE INSIGHTS
      [Figure; labels only]
      • Insights: Churn Analysis, Customer Dashboard, New Attributes, Sentiment Analysis
      • Data combinations: Enterprise data + Clickstream data; Enterprise data + Product data + Event data; Enterprise data + External data; Enterprise data + Call transcripts
      • Axis and region labels: Customer Experience, Business Value, Complex Data, Complex Analysis
  • 21. TIPS & LESSONS LEARNED, OR “HOW TO BE AWESOME”
  • 22. SUGGESTIONS TO IMPROVE YOUR HADOOP DW PROJECT
      • It’s not that new: traditional DW & Big DW are more similar than dissimilar
      • Standardize on technology: Pig for ETL, Hive for analysis
      • Focus on simplicity: if your data isn’t easy to use, you’ve failed
      • Embrace flexibility: don’t shy away from complex data types
      • Be predictable: use HCatalog & a consistent naming standard
      • Don’t be afraid of change: use data abstraction to minimize impact to consumers
      • Do quick prototyping: use external tables & data extractions via Hive ODBC
      • Democratize data: amazing insights can come from anywhere; embrace new data consumers
      • Don’t stop at data: focus on the Big Picture (the final outcome) to identify bottlenecks
  • 23. QUESTIONS? THANK YOU FOR ATTENDING!
  • 24. APPENDIX
  • 25. BATCH PROCESSING EXAMPLE – PRE-DEPLOYMENT
      DIM_CUSTOMER_TX – TRANSACTIONAL
      ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code
      2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA
      2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A
      2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US
      Obligatory disclaimer: this is fictitious data used for demonstration purposes only.
  • 26. BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
      DIM_CUSTOMER_TX – TRANSACTIONAL
      ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code | Gender
      2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA |
      2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A |
      2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US |
      2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:14 AM, deploy} | False | 222 | Female | 2010-06-06 | CA | CA | Female
      2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:22 AM, deploy} | False | 333 | Male | 2015-06-01 | blah | N/A | Male
      2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:37 AM, deploy} | True | 111 | | 2007-07-02 | USA | US | Female
      Obligatory disclaimer: this is fictitious data used for demonstration purposes only.
  • 27. BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
      DIM_CUSTOMER – SNAPSHOT
      ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code | Gender
      2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US | Female
      2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA | Female
      2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A | Male
      Obligatory disclaimer: this is fictitious data used for demonstration purposes only.
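The appendix slides (25 to 27) suggest that deploying a new attribute appends "deploy" rows to the transactional table instead of updating history. A minimal Python sketch of that idea follows. It makes several assumptions: every existing row is re-appended (the three-row example does not show whether only the latest row per key would be), the field names are hypothetical, and `GENDER_LOOKUP` is an invented stand-in for whatever logic actually derives the new Gender value.

```python
from datetime import datetime


def backfill(tx_rows, new_column, derive, etl_ts):
    """Return new 'deploy' rows carrying the added column.

    Existing rows are not modified, so the table stays append-only.
    """
    return [
        {**row, "etl_ts": etl_ts, "action": "deploy", new_column: derive(row)}
        for row in tx_rows
    ]


# Subset of the fictitious pre-deployment rows (slide 25).
existing = [
    {"etl_ts": datetime(2015, 6, 1, 9, 15), "action": "update", "cust_id": 222, "original_gender": "Female"},
    {"etl_ts": datetime(2015, 6, 1, 9, 30), "action": "insert", "cust_id": 333, "original_gender": "Male"},
    {"etl_ts": datetime(2015, 6, 1, 9, 45), "action": "delete", "cust_id": 111, "original_gender": None},
]

# Hypothetical lookup: the deck does not say how the value for customer 111 is derived.
GENDER_LOOKUP = {111: "Female"}


def derive_gender(row):
    """Prefer the source value; fall back to the hypothetical lookup."""
    return row["original_gender"] or GENDER_LOOKUP.get(row["cust_id"])


table = existing + backfill(existing, "gender", derive_gender, datetime(2015, 6, 1, 10, 0))
```

After the backfill the table holds six rows, as on slide 26: the three originals without the new column, followed by three "deploy" rows stamped 10:00 AM that carry it. The next snapshot run then picks up the new attribute without any special handling.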

Editor's Notes

  1. One of the leaders on the Data Platform team. Responsible for making our petabytes of data easy to discover, understand, and consume. This includes:
      • Data processing methodologies
      • Building enterprise datasets
      • Data egress
      • Data catalog
      • Making it easy for data consumers to onboard