April 2018
Building an Enterprise
Data Environment in the Cloud
Agenda for today
1. Broaden our focus
2. Create a strategic direction and create buy-in
3. Build a team and iterate to build a new architecture
4. Dive into the architecture
5. Explain our philosophy
6. Demonstrate the work
Think of data more globally
Finance | HR/Payroll | Student
Thanks to:
• KPI Partners
• Amazon Web Services
• Microsoft
Thanks to Accenture
3xSize
Skills
Responsibilities
AWS Products & Challenges
• Over 140 AWS products
• Designing for the future
• Need to break down barriers between teams
Architecture diagram components:
• S3: Simple Storage Service
• Source systems: RAAS (Report As A Service) & other systems
• Apache Airflow: workflow manager
• EMR: Elastic MapReduce (Hadoop)
• Tableau: data viz on EC2
• Data Cookbook: SaaS data governance
• Wherescape: ETL on EC2
• Glue Data Catalog
• Glue crawler
• Redshift: analytics database
• Redshift Spectrum
• Connectors and file types: .py (Python), sqoop, .sql, parquet
Development Philosophy
Push the limits on automation. Save time for hands-on development where it’s absolutely needed.
S3 to Redshift

Raw
Primarily CSV or JSON
CD,VAL
C,44
D,75
E,92

Load
Contains only data from Raw
CD  VAL
C   44
D   75
E   92

Key
Redshift does not have a sequence concept
CD  KEY
C   3
D   4
E   5

Bronze
Merges history & Load data, attaches key
KEY  CD  VAL
1    A   12
2    B   81
3    C   44
4    D   75
5    E   92

Silver
Enrichment of data with business rules, slowly changing dimensions, etc.
ID  CD  VAL  CALC
1   A   12   24
2   B   81   162
3   C   44   88
4   D   75   150
5   E   92   184

Gold
Contains cross-system data (e.g. Workday + Peoplesoft)

Automated via Python (.py)
• S3 folder auto setup
• Table auto DDL generation
• DDL changes & migrations from old structure
• Ability to handle deleted columns from source

Developed in Wherescape (ws)
• Script-generating ETL tool
• Ability to track lineage
• Ability to manage table DDL
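The "Table auto DDL Generation" item could look something like the sketch below. It is a minimal illustration, not the team's script: the TYPE_MAP entries, the build_create_table function, and the load.example table name are all assumptions made for the example.

```python
# Hypothetical mapping from source metadata types to Redshift column types.
TYPE_MAP = {
    "text": "VARCHAR(256)",
    "integer": "BIGINT",
    "decimal": "DECIMAL(18,4)",
    "date": "DATE",
    "boolean": "BOOLEAN",
}


def build_create_table(table, columns):
    """Return a CREATE TABLE statement for a list of (name, source_type) pairs."""
    lines = []
    for name, source_type in columns:
        # Unknown source types fall back to a wide VARCHAR instead of failing.
        redshift_type = TYPE_MAP.get(source_type, "VARCHAR(65535)")
        lines.append(f"    {name} {redshift_type}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"


# Illustrative call; names are not validated or quoted in this sketch.
ddl = build_create_table("load.example", [("cd", "text"), ("val", "integer")])
print(ddl)
```

Generating the statement from metadata means a new source column only needs a metadata entry, not a hand-written DDL change.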
Demo
Raw to Load (S3 to Redshift)

Yesterday
Raw:
CD,VAL
C,44
D,75
E,92
Load:
CD  VAL
C   44
D   75
E   92

Today
Raw:
CD,VAL,ATTRB
F,23,COLD
G,94,WARM
H,22,HOT
Load:
CD  VAL  ATTRB
F   23   COLD
G   94   WARM
H   22   HOT

Automation via Python (.py)
1. The load table is dropped every time the process is run.
2. DDL for the new table is created by looking at Workday’s metadata to determine column types.
3. Column names are autogenerated by the Python script. Abbreviations for common words are defined in a word table in Redshift, and all other words just have their vowels removed for shorter column names. E.g. report_organization_value would become something like rpt_org_val automatically in the load layer.
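The column-naming rule described on this slide can be sketched as follows. This is a hedged illustration only: the ABBREVIATIONS dictionary stands in for the word table kept in Redshift, the function names are hypothetical, and keeping the first letter of a word is an assumption, since the slide does not say how leading vowels are handled.

```python
# Hypothetical stand-in for the abbreviation word table kept in Redshift.
ABBREVIATIONS = {"report": "rpt", "organization": "org", "value": "val"}
VOWELS = set("aeiou")


def shorten_word(word):
    """Abbreviate one word: table lookup first, otherwise drop the vowels."""
    lower = word.lower()
    if lower in ABBREVIATIONS:
        return ABBREVIATIONS[lower]
    # Assumption: keep the first letter so vowel-initial words stay readable.
    return lower[:1] + "".join(ch for ch in lower[1:] if ch not in VOWELS)


def shorten_column(name):
    """Shorten every underscore-separated word of a column name."""
    return "_".join(shorten_word(part) for part in name.split("_"))


print(shorten_column("report_organization_value"))  # rpt_org_val
print(shorten_column("employee_status"))            # emply_stts
```

Words found in the table get a curated abbreviation, and everything else still gets a predictable short name without manual work.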
Load to Bronze (Redshift)

Yesterday
Load:
CD  VAL
C   44
D   75
E   92
Bronze:
KEY  CD  VAL
1    A   12
2    B   81
3    C   44
4    D   75
5    E   92

Today
Load (ATTRB is a new column):
CD  VAL  ATTRB
E   12   COLD
F   24   WARM
G   48   HOT
Bronze:
KEY  CD  VAL  ATTRB
1    A   12
2    B   81
3    C   44
4    D   75
5    E   12   COLD   (existing record with updated value)
6    F   24   WARM   (new record)
7    G   48   HOT    (new record)

Automation via Python (.py)
1. Compares yesterday’s Bronze table and today’s Load table to see if new columns have come in.
2. Creates / updates DDL (if necessary), and loads the new structure with the correct column order.
3. Updates only the records whose data has changed.
4. Inserts new records.
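The four steps above can be sketched in plain Python using the slide's own example rows. This is an in-memory illustration under stated assumptions, not the team's Redshift code: merge_into_bronze is a hypothetical name, CD is assumed to be the natural key, and surrogate keys are assumed to continue from the current maximum KEY.

```python
def merge_into_bronze(bronze_columns, bronze_rows, load_columns, load_rows,
                      natural_key="CD"):
    """Merge today's load rows into yesterday's bronze rows (lists of dicts)."""
    # Steps 1-2: append columns that are new in the load, keeping column order.
    columns = list(bronze_columns)
    for col in load_columns:
        if col not in columns:
            columns.append(col)

    # Existing rows get None for columns they never had.
    merged = [{col: row.get(col) for col in columns} for row in bronze_rows]
    by_key = {row[natural_key]: row for row in merged}
    next_key = max((row["KEY"] for row in merged), default=0) + 1

    for incoming in load_rows:
        existing = by_key.get(incoming[natural_key])
        if existing is not None:
            # Step 3: overwrite only the fields whose values changed.
            for col, value in incoming.items():
                if existing.get(col) != value:
                    existing[col] = value
        else:
            # Step 4: insert a new record with the next surrogate key.
            new_row = {col: incoming.get(col) for col in columns}
            new_row["KEY"] = next_key
            next_key += 1
            merged.append(new_row)
            by_key[incoming[natural_key]] = new_row
    return columns, merged


bronze_columns = ["KEY", "CD", "VAL"]
bronze_rows = [
    {"KEY": 1, "CD": "A", "VAL": 12},
    {"KEY": 2, "CD": "B", "VAL": 81},
    {"KEY": 3, "CD": "C", "VAL": 44},
    {"KEY": 4, "CD": "D", "VAL": 75},
    {"KEY": 5, "CD": "E", "VAL": 92},
]
load_columns = ["CD", "VAL", "ATTRB"]
load_rows = [
    {"CD": "E", "VAL": 12, "ATTRB": "COLD"},
    {"CD": "F", "VAL": 24, "ATTRB": "WARM"},
    {"CD": "G", "VAL": 48, "ATTRB": "HOT"},
]

columns, rows = merge_into_bronze(bronze_columns, bronze_rows,
                                  load_columns, load_rows)
for row in rows:
    print([row[col] for col in columns])
```

With the slide's data, record E keeps key 5 and picks up its new value and attribute, while F and G receive keys 6 and 7.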
S3 to Redshift

Raw
Primarily CSV or JSON
CD,VAL
C,44
D,75
E,92

Bronze
Combines history & Load data, attaches key
KEY  CD  VAL
1    A   12
2    B   81
3    C   44
4    D   75
5    E   92

Silver
Enrichment of data with business rules, etc.
ID  CD  VAL  CALC
1   A   12   24
2   B   81   162
3   C   44   88
4   D   75   150
5   E   92   184

Gold
Contains cross-system data (e.g. Workday + Peoplesoft)

Tableau (Data Viz on EC2) | SQL Clients (.sql) | Analysis / Statistical Software (stats)

Experimental
Primarily CSV or JSON
CD,VAL
X,102
Y,203
Z,922

Glue crawler, Glue Data Catalog, Redshift Spectrum
CD  VAL
X   102
Y   203
Z   922
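One way the experimental S3 data could become queryable is by exposing the Glue Data Catalog database to Redshift as an external schema. The sketch below only builds the SQL text; the function name, schema name, catalog database name, and IAM role ARN are all placeholders invented for illustration, and the slides do not show the actual statement used.

```python
def external_schema_sql(schema, catalog_database, iam_role_arn):
    """Build a Redshift statement exposing a Glue catalog database as a schema."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


# Placeholder values; substitute real names before running against Redshift.
sql = external_schema_sql(
    "experimental",
    "experimental_db",
    "arn:aws:iam::123456789012:role/example-spectrum-role",
)
print(sql)
```

Once such a schema exists, tables registered by a Glue crawler can be queried from Redshift alongside the Bronze, Silver, and Gold tables without loading the files first.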
Agenda for today
1. Broaden our focus
2. Create a strategic direction and create buy-in
3. Build a team and iterate to build a new architecture
4. Dive into the architecture
5. Explain our philosophy
6. Demonstrate the work
“Add On” Items
Things we don’t have time to cover today, but that are critical to our success and that we are happy to discuss:
• Data Governance (Laura and Meenal)
• Data Science and Machine Learning
• Repurposing our data architecture to assist decentralized colleges and departments
• Operational Design
• Data/report discovery
Steve Fischer
fischer.141@osu.edu
Nate Polek
polek.3@osu.edu

Editor's Notes

  1. Focus of the team and scope of data problems we could solve was limited. Kind of like what’s going on here: good for cruising the neighborhoods, but not a global solution.
  2. Great partnerships with Enterprise Security and Infrastructure. No formal cloud strategy. Working through the AWS service offering. We went piece by piece. Accenture.
  3. Architecture is always evolving; you’re never finished. Introduce Nate.
  4. Over 140 products, and we’ve actively discussed over 20 of them. How do we design something that can withstand the test of time? Even more when you consider third-party software that can be installed on AWS. It has been challenging to determine roles & responsibilities across IT teams. Need to break down barriers between teams. We haven’t been able to engage with AWS professional services before a high-level contract.
  5. Apache Airflow: schedule workflows; ability to run Python against multiple services. RAAS: need to get data out of Workday; Workday custom reports enabled as web services.