Steve Fischer & Nate Polek, The Ohio State University
The Ohio State University is moving its enterprise data environment to the cloud (AWS). This new environment includes an enterprise data lake, data warehouse, and Tableau environment. Hear the strategies that drove the decision, the process for gaining security approval, and the new opportunities to better serve the university community. We will provide an overview of the architecture and specific components. Learn the skills and technologies that have carried over and the new ones gained. Hear the challenges we faced and the newfound advantages gained through the cloud and modern technologies.
2. 2
Agenda for today
1. Broaden our focus
2. Create a strategic direction and create buy-in
3. Build a team and iterate to build a new architecture
4. Dive into the architecture
5. Explain our philosophy
6. Demonstrate the work
17. 17
Architecture components (diagram)
S3: Simple Storage Service
& Other Systems
RAAS: Report As A Service
Apache Airflow: Workflow Manager
EMR: Elastic MapReduce (Hadoop)
Tableau: Data Viz on EC2
Data Cookbook: SaaS Data Governance
Wherescape: ETL on EC2
Glue: Data Catalog
Redshift: Analytics Database
Connector labels in the diagram: .py (python), .sql, sqoop, parquet, Glue crawler, Redshift Spectrum
19. 19
Data layers (S3 to Redshift)

Raw (S3)
Primarily CSV or JSON
CD,VAL
C,44
D,75
E,92

Load (Redshift)
Contains only data from Raw
CD  VAL
C   44
D   75
E   92

Key
Redshift does not have a sequence concept
CD  KEY
C   3
D   4
E   5

Bronze
Merges history & Load data, attaches key
KEY  CD  VAL
1    A   12
2    B   81
3    C   44
4    D   75
5    E   92

Silver
Enrichment of data with business rules, slowly changing dimensions, etc.
ID  CD  VAL  CALC
1   A   12   24
2   B   81   162
3   C   44   88
4   D   75   150
5   E   92   184

Gold
Contains cross-system data (e.g. Workday + Peoplesoft)
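
The Key and Silver steps above can be sketched in Python. This is only an illustration under stated assumptions: since Redshift has no sequence object, the sketch assumes a new key is the highest existing key plus one, and the doubling rule for CALC is made up because it happens to reproduce the sample table on the slide. The names `assign_keys` and `enrich` are hypothetical, not the team's actual scripts.

```python
def assign_keys(key_table, codes):
    """Return a copy of key_table with a surrogate key for every code.

    Existing codes keep their key; unseen codes get max(existing key) + 1,
    which stands in for a database sequence (an assumption of this sketch).
    """
    keys = dict(key_table)
    next_key = max(keys.values(), default=0) + 1
    for code in codes:
        if code not in keys:
            keys[code] = next_key
            next_key += 1
    return keys


def enrich(bronze_rows):
    """Silver-style enrichment: add CALC using a made-up rule (VAL * 2)."""
    return [
        {"ID": r["KEY"], "CD": r["CD"], "VAL": r["VAL"], "CALC": r["VAL"] * 2}
        for r in bronze_rows
    ]


# Sample data taken from the slide: A and B are history, C/D/E arrive in Load.
history_keys = {"A": 1, "B": 2}
load_rows = [{"CD": "C", "VAL": 44}, {"CD": "D", "VAL": 75}, {"CD": "E", "VAL": 92}]

keys = assign_keys(history_keys, [r["CD"] for r in load_rows])
bronze = [{"KEY": 1, "CD": "A", "VAL": 12}, {"KEY": 2, "CD": "B", "VAL": 81}]
bronze += [{"KEY": keys[r["CD"]], **r} for r in load_rows]
silver = enrich(bronze)
```

With the slide's data, C, D and E receive keys 3, 4 and 5, matching the Key table shown.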
Automated via Python
• S3 folder auto setup
• Table auto DDL generation
• DDL changes & migrations from old structure
• Ability to handle deleted columns from source
Developed in Wherescape
• Script-generating ETL tool
• Ability to track lineage
• Ability to manage table DDL
(Diagram: each step between layers is labeled either .py for Python or ws for Wherescape.)
29. 29
Raw to Load (S3 to Redshift)
Automation via Python
1. The load table is dropped every time the process is run.
2. DDL for the new table is created by looking at Workday’s metadata to determine column types.
3. Also, column names are autogenerated by the Python script. Abbreviations for common words are defined in a word table in Redshift, and all other words just have their vowels removed for shorter column names. E.g. report_organization_value would become something like rpt_org_val automatically in the load layer.
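
The naming and DDL steps above can be sketched as follows. This is a minimal illustration, not the actual script: the word table contents, the source type names, the type mapping, and the function names (`shorten_word`, `shorten_column`, `build_load_ddl`) are all assumptions, and the table name `load.wd_report` is hypothetical.

```python
VOWELS = set("aeiou")

# Stand-in for the word table kept in Redshift (contents are assumed).
WORD_TABLE = {"report": "rpt", "organization": "org", "value": "val"}


def shorten_word(word, word_table=WORD_TABLE):
    """Use the defined abbreviation if one exists, else strip the vowels."""
    word = word.lower()
    if word in word_table:
        return word_table[word]
    stripped = "".join(ch for ch in word if ch not in VOWELS)
    return stripped or word  # keep all-vowel words rather than return ""


def shorten_column(name, word_table=WORD_TABLE):
    """Shorten each underscore-separated word of a column name."""
    return "_".join(shorten_word(w, word_table) for w in name.split("_"))


def build_load_ddl(table, source_columns, type_map):
    """Build DROP + CREATE statements for a load table.

    source_columns: list of (source_name, source_type) pairs from metadata.
    type_map: source type -> Redshift column type (an assumed mapping).
    """
    cols = ",\n".join(
        f"    {shorten_column(name)} {type_map.get(src_type, 'VARCHAR(256)')}"
        for name, src_type in source_columns
    )
    return f"DROP TABLE IF EXISTS {table};\nCREATE TABLE {table} (\n{cols}\n);"


# Hypothetical metadata for one report.
ddl = build_load_ddl(
    "load.wd_report",
    [("report_organization_value", "text"), ("effective_date", "date")],
    {"text": "VARCHAR(256)", "date": "DATE", "numeric": "DECIMAL(18,4)"},
)
print(ddl)
```

Here `report_organization_value` is shortened through the word table, while `effective_date` has no entries and falls back to vowel removal.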
Yesterday
Raw (S3):
CD,VAL
C,44
D,75
E,92
Load (Redshift):
CD  VAL
C   44
D   75
E   92

Today
Raw (S3):
CD,VAL,ATTRB
F,23,COLD
G,94,WARM
H,22,HOT
Load (Redshift):
CD  VAL  ATTRB
F   23   COLD
G   94   WARM
H   22   HOT
30. 30
Load to Bronze

Yesterday’s Load:
CD  VAL
C   44
D   75
E   92

Yesterday’s Bronze:
KEY  CD  VAL
1    A   12
2    B   81
3    C   44
4    D   75
5    E   92

Today’s Load (new column: ATTRB):
CD  VAL  ATTRB
E   12   COLD
F   24   WARM
G   48   HOT

Today’s Bronze:
KEY  CD  VAL  ATTRB
1    A   12
2    B   81
3    C   44
4    D   75
5    E   12   COLD   (existing record with updated value)
6    F   24   WARM   (new record)
7    G   48   HOT    (new record)
Automation via Python
1. Compares yesterday’s Bronze table and today’s Load table to see if new columns have come in.
2. Creates / updates DDL (if necessary), and loads the new structure with the correct column order.
3. Updates only the records whose data has changed.
4. Inserts new records.
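
The four steps above can be sketched in plain Python, using the sample tables from this slide. This is an in-memory illustration only: the real process runs against Redshift, and the function name `merge_into_bronze`, the choice of CD as the matching column, the rule that new columns are appended at the end, and the max-plus-one key assignment are all assumptions.

```python
def merge_into_bronze(bronze_cols, bronze_rows, load_cols, load_rows):
    """Merge today's load rows into yesterday's bronze rows, matched on CD.

    Returns (columns, rows). Rows are dicts; missing values are None.
    """
    # 1. Detect columns that appear in the load table but not in bronze.
    new_cols = [c for c in load_cols if c not in bronze_cols]
    # 2. New structure: existing columns first, new columns appended.
    columns = list(bronze_cols) + new_cols
    rows = [{c: r.get(c) for c in columns} for r in bronze_rows]
    by_code = {r["CD"]: r for r in rows}
    next_key = max((r["KEY"] for r in rows), default=0) + 1

    for incoming in load_rows:
        existing = by_code.get(incoming["CD"])
        if existing is not None:
            # 3. Update only when at least one value actually changed.
            if any(existing[c] != incoming[c] for c in load_cols):
                existing.update(incoming)
        else:
            # 4. Insert a new record with the next surrogate key.
            row = {c: None for c in columns}
            row.update(incoming)
            row["KEY"] = next_key
            next_key += 1
            rows.append(row)
            by_code[row["CD"]] = row
    return columns, rows


# Sample data from the slide.
bronze_cols = ["KEY", "CD", "VAL"]
bronze_rows = [
    {"KEY": 1, "CD": "A", "VAL": 12},
    {"KEY": 2, "CD": "B", "VAL": 81},
    {"KEY": 3, "CD": "C", "VAL": 44},
    {"KEY": 4, "CD": "D", "VAL": 75},
    {"KEY": 5, "CD": "E", "VAL": 92},
]
load_cols = ["CD", "VAL", "ATTRB"]
load_rows = [
    {"CD": "E", "VAL": 12, "ATTRB": "COLD"},
    {"CD": "F", "VAL": 24, "ATTRB": "WARM"},
    {"CD": "G", "VAL": 48, "ATTRB": "HOT"},
]

columns, rows = merge_into_bronze(bronze_cols, bronze_rows, load_cols, load_rows)
```

Rows A through D keep their values and get an empty ATTRB, E is updated in place, and F and G are appended with keys 6 and 7, as in today's Bronze table on the slide.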
31. 31
Data layers and consumers (diagram)

S3
Raw: Primarily CSV or JSON
CD,VAL
C,44
D,75
E,92
Experimental: Primarily CSV or JSON
CD,VAL
X,102
Y,203
Z,922

Glue crawler, Glue Data Catalog, Redshift Spectrum
CD  VAL
X   102
Y   203
Z   922

Redshift
Bronze: Combines history & Load data, attaches key
KEY  CD  VAL
1    A   12
2    B   81
3    C   44
4    D   75
5    E   92
Silver: Enrichment of data with business rules, etc.
ID  CD  VAL  CALC
1   A   12   24
2   B   81   162
3   C   44   88
4   D   75   150
5   E   92   184
Gold: Contains cross-system data (e.g. Workday + Peoplesoft)

Consumers
Tableau: Data Viz on EC2
SQL Clients (.sql)
Analysis: Statistical Software (stats)
33. 33
Agenda for today
1. Broaden our focus
2. Create a strategic direction and create buy-in
3. Build a team and iterate to build a new architecture
4. Dive into the architecture
5. Explain our philosophy
6. Demonstrate the work
34. 34
“Add On” Items
Things we don’t have time to cover today, but that are critical to our success and that we are happy to discuss
• Data Governance (Laura and Meenal)
• Data Science and Machine Learning
• Repurposing our data architecture to assist decentralized colleges and departments
• Operational Design
• Data/report discovery
The focus of the team and the scope of data problems we could solve were limited
Kind of like what’s going on here…good for cruising the neighborhoods but not a global solution
Great partnerships with Enterprise Security and Infrastructure
No formal cloud strategy
Working through the AWS service offerings
We went piece by piece
Accenture
Architecture is always evolving… you’re never finished
Introduce Nate
Over 140 products, and we’ve actively discussed over 20 of them
Even more when you consider 3rd party software that can be installed on AWS
How do we design something that can withstand the test of time?
It has been challenging to determine roles & responsibilities across IT teams. We need to break down barriers between teams.
We haven’t been able to engage with AWS Professional Services before a high-level contract.
Apache Airflow
Schedule workflows
Ability to run Python against multiple services
RAAS
Need to get data out of Workday
Workday custom reports enabled as a web service