10 Things Learned Releasing Databricks Enterprise Wide

Jake Kulas
Senior Big Data Developer
Western Governors University

Agenda
§ Speaker Background
§ What is WGU?
§ Implementation
§ 10 Things Learned
§ Q&A

Who Am I?
• Jake Kulas
• https://www.linkedin.com/in/jakekulas/
• Senior Big Data Developer / Data
Engineer at Western Governors
University
• Wisconsin transplant living in Utah
• BS / MS in Information Systems at the
University of Utah
• Working with Apache Spark/Databricks
for 4 years

What is Western Governors University?
§ Founded in 1997 by 19 US governors
§ Nonprofit, all-online, competency-based
§ Undergraduate and Graduate degrees
§ Regionally and Nationally accredited
§ 8 State affiliates
§ 228,000+ graduates
§ 135,000+ active students
§ 8,000+ employees
Education without boundaries

Introduction
Implementation Reasoning
§ Unify data platforms
▪ Data engineering
▪ Analysts / Researchers
▪ Data Scientists
▪ Psychometricians / Statisticians
§ EDW rearchitected on Delta
▪ EDW
▪ LakeHouse Architecture
Implementation Without Education
§ New languages
▪ Scala
▪ Python
§ New platform
§ New to cloud architecture / design
▪ AWS internal difficulties
§ Rolled out to entire enterprise
▪ 8 business units
▪ 140+ direct users
▪ 300+ jobs

Key Mistakes and Challenges
• Understanding of Apache Spark and Delta
• Optimizing JDBC
• Delta optimizations
• Multilingual empowerment
• Code management in a new environment
• CICD Your Way
• Reduce, Reuse, Recycle
• Cost management
• Job/Cluster management
• User management
• User groups / permissions
• Cluster segregation
• Leveraging secrets
• Training / Best Practices

10 Things Learned (#1-3):
Understanding Apache Spark / Delta

Understanding Apache Spark / Delta
§ Optimizing JDBC
§ Delta optimizations
§ Multilingual empowerment

Optimizing JDBC
• Why are our reads so slow?
• Understanding key properties
• fetchSize
• numPartitions
• partitionColumn
• lowerBound
• upperBound
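
The properties above map onto Spark's JDBC reader options. What follows is a minimal sketch, not the configuration used at WGU: the helper name `jdbc_read_options`, the connection URL, the table, and the column are all hypothetical. Without the partitioning options Spark reads the table through a single connection, which is a common reason for slow reads.

```python
def jdbc_read_options(url, table, partition_column, lower_bound, upper_bound,
                      num_partitions, fetch_size=10_000):
    """Build Spark JDBC reader options for a parallel, range-partitioned read.

    Hypothetical helper for illustration; only the option keys are Spark's.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": url,
        "dbtable": table,
        # partitionColumn must be a numeric, date, or timestamp column.
        "partitionColumn": partition_column,
        # The bounds only set the partition stride; they do not filter rows.
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        # Upper limit on parallel reads, and so on concurrent connections.
        "numPartitions": str(num_partitions),
        # Rows fetched per round trip to the database.
        "fetchsize": str(fetch_size),
    }


# Assumed example values; replace with a real URL, table, and key range.
options = jdbc_read_options(
    url="jdbc:postgresql://db.example.com:5432/research",
    table="public.enrollments",
    partition_column="enrollment_id",
    lower_bound=1,
    upper_bound=8_000_000,
    num_partitions=16,
)

# In a notebook where `spark` is defined:
# df = spark.read.format("jdbc").options(**options).load()
```

A reasonable starting point is to pick bounds close to the real minimum and maximum of the partition column, so the rows spread evenly across partitions.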

Delta optimizations
• Optimize to improve file size
• Understanding data skipping
• Utilizing partitioning
• Utilizing Z-Ordering (ZORDER BY)
[Chart: Delta File Optimizations. Query time (axis 0 to 30) for Parquet Unoptimized, Delta, and Optimized Delta]
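
File compaction and Z-Ordering are both triggered through the Delta `OPTIMIZE` command. The sketch below only assembles the SQL text; the helper name `optimize_statement` and the table and column names are hypothetical. On Delta, the optional `WHERE` clause has to reference partition columns.

```python
def optimize_statement(table, zorder_columns=(), where=None):
    """Build a Delta OPTIMIZE statement, optionally with WHERE and ZORDER BY."""
    statement = f"OPTIMIZE {table}"
    if where:
        # Restricts compaction to the matching partitions.
        statement += f" WHERE {where}"
    if zorder_columns:
        # Co-locates related values in the same files to improve data skipping.
        statement += f" ZORDER BY ({', '.join(zorder_columns)})"
    return statement


# Assumed table, partition column, and Z-Order column.
statement = optimize_statement(
    "edw.enrollments",
    zorder_columns=["student_id"],
    where="term_date >= '2021-01-01'",
)
print(statement)

# In a notebook where `spark` is defined:
# spark.sql(statement)
```

Z-Ordering tends to pay off on high-cardinality columns that appear often in filters, while partitioning suits low-cardinality columns such as dates.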

Multilingual Empowerment
• Databricks allows multilingual
coding in notebooks
• Utilize what you know best to
get the job done
• Mixing languages based on task
at hand
• Python/Scala + SQL
• Scala + SQL + R
• R + SQL
• Empowering less experienced
analysts/engineers/users

10 Things Learned (#4-5):
Managing Your Code

Managing Your Code
§ CICD Your Way
§ Reduce, Reuse, Recycle

CICD Your Way
• No defined way to do CICD
• Dependent on architecture
• Protecting production
• Creating production workspace folders
• Limiting permissions for users
• Git Integration in notebooks
• Empowering users to push
their own code
• Utilize Projects
• Now called Repos API
• https://docs.databricks.com/repos.html
Folder Permissions: https://docs.databricks.com/security/access-control/workspace-acl.html
[Figure: Git Integration]

Pipeline POC
• GitHub / Databricks Git
▪ Users develop code
▪ Code push using git integration
▪ Create pull request
▪ Approvals by other users/managers
• AWS Code Pipeline
▪ Code pipeline picks up repository modification
▪ Lambda executes projects (now repos) API call to update repository
• Databricks Repos
▪ Workspace repo is updated and already linked job is now up to date
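
The Lambda step in the pipeline above could look roughly like the following. This is a sketch under assumptions, not the code used in the POC: the environment variable names and the handler layout are hypothetical, while the endpoint is the Repos API update call (`PATCH /api/2.0/repos/{repo_id}` with a `branch` field). Reporting the result back to CodePipeline is left out for brevity.

```python
import json
import os
import urllib.request


def build_repo_update_request(host, token, repo_id, branch):
    """Build a PATCH request that moves a workspace repo to the head of a branch."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/repos/{repo_id}",
        data=json.dumps({"branch": branch}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )


def lambda_handler(event, context):
    """Lambda entry point; the environment variable names are assumptions."""
    request = build_repo_update_request(
        host=os.environ["DATABRICKS_HOST"],
        token=os.environ["DATABRICKS_TOKEN"],
        repo_id=os.environ["DATABRICKS_REPO_ID"],
        branch=os.environ.get("DATABRICKS_BRANCH", "main"),
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return {"status": response.status}


# Build (but do not send) a request with placeholder values.
example = build_repo_update_request(
    "https://example.cloud.databricks.com", "dapi-placeholder", 123, "main"
)
```

Because jobs point at notebooks inside the workspace repo, updating the repo is enough to bring the linked jobs up to date. In practice the token would come from a secret store rather than a plain environment variable.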

Reduce, Reuse, Recycle
• Write code once and share
• Utilizing Databricks %run command
• Enterprise-wide shared functions
• JDBC strings
• Secret retrieval
• Common transformations
• Notebook splitting based on
functionality
• Core/Master (main notebook)
• Operations (functions/method definitions)
• Configuration (properties definitions)

10 Things Learned (#6):
Managing Your Costs

Managing Your Costs
§ Job/Cluster management
▪ Understanding your job
requirements
▪ Understanding cluster costs
▪ Use dashboarding to visualize
that cost

Monitoring Suggestions
▪ Use the correct cluster for the right job
▪ What is the main purpose of the cluster?
▪ Need for memory
▪ Need for processors
▪ Need for both
▪ Need for ML capabilities
▪ Use the Ganglia UI
▪ What is your job doing
▪ What are its requirements
▪ Frequency
▪ Completion times
▪ Data sizes
▪ Test, test, test -- Using Ganglia UI to compare
• Job Management
• Cluster Management

Dashboard Examples
• Job Monitoring in Tableau
• Job successes, failures and skips
in past 24 hours
• Failures in past 14 days
• Most recent failure explanations
• Stale table lists
• Usage Monitoring in Tableau
• Split between business units
• Allowing managers to see cluster
costs per month from both AWS
and Databricks
• Showing top 10 highest costing
jobs
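
The aggregations behind such a usage dashboard can be sketched in a few lines. The record layout and the sample values below are invented for illustration; real numbers would come from the Databricks and AWS billing data.

```python
from collections import defaultdict


def cost_by_business_unit(runs):
    """Sum Databricks and AWS cost per business unit."""
    totals = defaultdict(float)
    for run in runs:
        totals[run["business_unit"]] += run["dbu_cost"] + run["aws_cost"]
    return dict(totals)


def top_costing_jobs(runs, n=10):
    """Return the n jobs with the highest combined cost, most expensive first."""
    totals = defaultdict(float)
    for run in runs:
        totals[run["job"]] += run["dbu_cost"] + run["aws_cost"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Invented sample records; the field names are assumptions.
runs = [
    {"job": "edw_nightly_load", "business_unit": "Data Engineering", "dbu_cost": 40.0, "aws_cost": 25.0},
    {"job": "edw_nightly_load", "business_unit": "Data Engineering", "dbu_cost": 42.0, "aws_cost": 26.0},
    {"job": "retention_model", "business_unit": "Data Science", "dbu_cost": 18.0, "aws_cost": 12.0},
    {"job": "survey_scoring", "business_unit": "Psychometrics", "dbu_cost": 5.0, "aws_cost": 3.0},
]

print(cost_by_business_unit(runs))
print(top_costing_jobs(runs, n=2))
```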

10 Things Learned (#7-10):
Managing Your Users

Managing Your Users
§ User groups / permissions
§ Cluster segregation
§ Leveraging secrets
§ Training / Best Practices

Databricks Groups & Permissions
• Utilize Databricks groups to separate by
business unit or by user function
• Data Engineers
• Analysts
• Job Users
• Admins
• Utilize cloud permissions to limit access
• IAM Roles assumed by clusters using Instance Profiles

Cluster Segregation
• Would you trust 120+ users to manage
their own clusters?
• Possibilities with Cluster Policies
• Setting up team based shared clusters:
• Ad Hoc
• Machine Learning
• ETL
• All jobs run on automated clusters
• Jobs are owned and cost is assigned to each
business unit
• Cluster restrictions managed by cluster policies
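
A cluster policy is a JSON document that fixes or limits cluster attributes. The sketch below builds one for a team-based shared cluster; the helper name, the tag name `BusinessUnit`, the instance types, and the limits are assumptions chosen for illustration, not the policies used at WGU.

```python
import json


def team_cluster_policy(business_unit, allowed_node_types, max_workers,
                        autotermination_minutes=60):
    """Build a cluster policy definition for one business unit."""
    return {
        # Users may only pick from an approved list of instance types.
        "node_type_id": {"type": "allowlist", "values": list(allowed_node_types)},
        # Cap autoscaling so one cluster cannot grow without bound.
        "autoscale.max_workers": {"type": "range", "maxValue": max_workers},
        # Force idle clusters to shut down and hide the setting from users.
        "autotermination_minutes": {
            "type": "fixed",
            "value": autotermination_minutes,
            "hidden": True,
        },
        # Tag every cluster so cost can be assigned to the business unit.
        "custom_tags.BusinessUnit": {"type": "fixed", "value": business_unit},
    }


policy = team_cluster_policy("Research", ["i3.xlarge", "i3.2xlarge"], max_workers=8)
print(json.dumps(policy, indent=2))
```

The fixed tag is what ties this back to cost management: every cluster created under the policy carries the owning business unit.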

Databricks Secrets
• Users retrieve secrets without direct
access to the stored values
• Permissions to access scopes
• Return secrets through functions
• Example returning database
credentials:
readDBProperties("research")
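
The slide does not show the body of `readDBProperties`, so the following is only a guess at its shape, written in Python as `read_db_properties`. The scope name and the key naming scheme are assumptions. The secret lookup is passed in as a function so the sketch runs outside Databricks; in a notebook that function would be `dbutils.secrets.get`.

```python
def read_db_properties(database, get_secret, scope="jdbc"):
    """Return JDBC connection properties for a named database.

    get_secret is any callable taking (scope, key), e.g. dbutils.secrets.get.
    The scope name and the "<database>-<field>" key scheme are assumptions.
    """
    return {
        "url": get_secret(scope, f"{database}-url"),
        "user": get_secret(scope, f"{database}-user"),
        "password": get_secret(scope, f"{database}-password"),
    }


# Stand-in secret store so the example runs without a Databricks workspace.
fake_store = {
    ("jdbc", "research-url"): "jdbc:postgresql://research-db.example.com:5432/research",
    ("jdbc", "research-user"): "svc_research",
    ("jdbc", "research-password"): "not-a-real-password",
}
props = read_db_properties("research", lambda scope, key: fake_store[(scope, key)])

# In a notebook where `dbutils` is defined:
# props = read_db_properties("research", dbutils.secrets.get)
```

Access is then controlled by who has permission on the secret scope, and users call the function without ever handling the credential text directly.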

Training & Best Practices
• It is hard to train hundreds of users on a new product
• Let your users learn and train them on how you want the
product to be used
• Utilize Databricks Academy for new hires / users
• Monthly "tech talks" going over best practices or new features
• Open weekly office hours assisting engineers and analysts with
their code and general questions

Understanding and addressing just half of these challenges
before releasing enterprise-wide could have saved a year's
worth of work and tens of thousands of dollars, and would have
given us a more secure and efficient operating environment.

Q & A
email: jake.kulas@wgu.edu
linkedin: https://www.linkedin.com/in/jakekulas/
