10 Things Learned Releasing Databricks Enterprise Wide


Implementing a new tool, let alone an entire unified data platform like Databricks, can be quite an undertaking. Implementing a tool whose ins and outs you have not yet learned can be even more frustrating. Have you ever wished you could take some of that uncertainty away? Four years ago, Western Governors University (WGU) took on the task of rewriting all of our ETL pipelines in Scala and Python and migrating our Enterprise Data Warehouse into Delta, all on the Databricks platform. Starting with 4 users and rapidly growing to over 120 users across 8 business units, our Databricks environment became a unified platform serving individuals of all skill levels, data needs, and internal security requirements.

Through this process, our team had the opportunity to learn while making plenty of mistakes. Looking back at those mistakes, there is a lot we wish we had known before opening the platform to our enterprise.

We would like to share 10 things we wish we had known before WGU started operating in our Databricks environment. We cover user management from both an AWS and a Databricks perspective, understanding and managing costs, building custom pipelines for efficient code management, Apache Spark snippets that saved us a fortune, and more. We also offer recommendations for overcoming these pitfalls, to help new, current, and prospective users make their environments easier, safer, and more reliable to work in.

  1. 10 Things Learned Releasing Databricks Enterprise Wide. Jake Kulas, Senior Big Data Developer, Western Governors University
  2. Agenda § Speaker Background § What is WGU? § Implementation § 10 Things Learned § Q&A
  3. Who Am I? • Jake Kulas • https://www.linkedin.com/in/jakekulas/ • Senior Big Data Developer / Data Engineer at Western Governors University • Wisconsin transplant living in Utah • BS / MS in Information Systems from the University of Utah • Working with Apache Spark/Databricks for 4 years
  4. What is Western Governors University?
  5. What is Western Governors University? § Founded in 1997 by 19 US governors § Non-profit, all-online, competency-based § Undergraduate and graduate degrees § Regionally and nationally accredited § 8 state affiliates § 228,000+ graduates § 135,000+ active students § 8,000+ employees. "Education without boundaries"
  6. Introduction. Implementation reasoning: § Unify data platforms ▪ Data engineering ▪ Analysts / Researchers ▪ Data Scientists ▪ Psychometricians / Statisticians § EDW rearchitected on Delta ▪ EDW ▪ Lakehouse architecture. Implementation without education: § New languages ▪ Scala ▪ Python § New platform § New to cloud architecture / design ▪ AWS internal difficulties § Rolled out to the entire enterprise ▪ 8 business units ▪ 140+ direct users ▪ 300+ jobs
  7. Implementation Architecture
  8. Key Mistakes and Challenges • Understanding of Apache Spark and Delta • Optimizing JDBC • Delta optimizations • Multilingual empowerment • Code management in a new environment • CICD Your Way • Reduce, Reuse, Recycle • Cost management • Job/Cluster management • User management • User groups / permissions • Cluster segregation • Leveraging secrets • Training / Best Practices
  9. 10 Things Learned (#1-3): Understanding Apache Spark / Delta
  10. Understanding Apache Spark / Delta § Optimizing JDBC § Delta optimizations § Multilingual empowerment
  11. Optimizing JDBC • Why are our reads so slow? • Understanding key properties • fetchSize • numPartitions • partitionColumn • lowerBound • upperBound
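The properties on this slide control how Spark parallelizes a JDBC read. As a rough illustration, here is a simplified sketch of the stride logic Spark applies when it turns `lowerBound`/`upperBound`/`numPartitions` into per-partition WHERE clauses (the column name and bounds below are hypothetical, and the real Spark implementation handles more edge cases):

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Split [lower_bound, upper_bound) into WHERE-clause predicates,
    mimicking (in simplified form) how Spark partitions a JDBC read.
    Rows outside the bounds are still read: the first and last
    partitions are open-ended."""
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower_bound + i * stride
        if i == 0:
            # First partition also picks up NULLs and anything below lowerBound
            predicates.append(f"{column} < {lo + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition picks up anything at or above its lower edge
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return predicates

# In a Databricks notebook the equivalent read might look like (connection
# details hypothetical):
#   df = (spark.read.format("jdbc")
#         .option("url", jdbc_url).option("dbtable", "events")
#         .option("partitionColumn", "id")
#         .option("lowerBound", 0).option("upperBound", 100)
#         .option("numPartitions", 4).option("fetchsize", 10000)
#         .load())
```

Note that the bounds only shape the partition ranges; skewed bounds mean one executor does most of the work, which is one common answer to "why are our reads so slow?".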
  12. Delta optimizations • Optimize to improve file size • Understanding data skipping • Utilizing partitioning • Utilizing Z-order [Chart: "Delta File Optimizations", query time for Parquet vs. unoptimized Delta vs. optimized Delta]
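The maintenance behind this slide is standard Delta Lake SQL. A hedged sketch, assuming a hypothetical `events` table frequently filtered on `user_id`:

```sql
-- Compact many small files into fewer large ones (improves scan efficiency)
OPTIMIZE events;

-- Co-locate related rows so data skipping can prune files on a common filter column
OPTIMIZE events ZORDER BY (user_id);
```

Partitioning is chosen at table-creation time (e.g. `PARTITIONED BY (event_date)`) and works best on low-cardinality columns; Z-ordering complements it for high-cardinality filter columns.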
  13. Multilingual Empowerment • Databricks allows multilingual coding in notebooks • Utilize what you know best to get the job done • Mixing languages based on the task at hand • Python/Scala + SQL • Scala + SQL + R • R + SQL • Empowering less experienced analysts/engineers/users
  14. 10 Things Learned (#4-5): Managing Your Code
  15. Managing Your Code § CICD Your Way § Reduce, Reuse, Recycle
  16. CICD Your Way • No defined way to do CICD • Dependent on architecture • Protecting production • Creating production workspace folders • Limiting permissions for users • Git integration in notebooks • Empowering users to push their own code • Utilize Projects • Now called the Repos API • https://docs.databricks.com/repos.html [Screenshots: Folder Permissions (https://docs.databricks.com/security/access-control/workspace-acl.html), Git Integration]
  17. Pipeline POC ▪ Users develop code ▪ Code push using git integration ▪ Create pull request ▪ Approvals by other users/managers (GitHub, Databricks Git) ▪ Code pipeline picks up repository modification (AWS CodePipeline) ▪ Lambda executes a Projects (now Repos) API call to update the repository ▪ Workspace repo is updated and the already-linked job is now up to date (Databricks Repos)
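The Lambda step in this pipeline boils down to one Databricks Repos API call that points the workspace repo at the freshly merged branch. A minimal sketch of building that request (host, repo id, and branch are hypothetical; the actual send would go through something like `urllib.request` with a `Bearer` token header):

```python
import json

def build_repo_update_request(host, repo_id, branch):
    """Build the (method, url, body) triple for the Databricks Repos API call
    that updates a workspace repo to a given branch."""
    url = f"https://{host}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch})
    return "PATCH", url, body

method, url, body = build_repo_update_request(
    "example.cloud.databricks.com",  # hypothetical workspace host
    "123",                           # hypothetical repo id
    "main",
)
```

Because the job is linked to the workspace repo path rather than to a specific commit, updating the repo is all it takes for the next job run to use the new code.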
  18. Reduce, Reuse, Recycle • Write code once and share • Utilizing the Databricks %run command • Enterprise-use functions • JDBC strings • Secret retrieval • Common transformations • Notebook splitting based on functionality • Core/Master (main notebook) • Operations (function/method definitions) • Configuration (property definitions)
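The configuration / operations / core split described above can be sketched as plain Python; in a Databricks workspace each section would live in its own notebook, with the core notebook pulling the others in via `%run`. All names and connection details here are hypothetical:

```python
# --- configuration "notebook": shared property definitions ---
DB_CONFIG = {
    "research": {"host": "research-db.internal", "port": 5432, "db": "research"},
}

# --- operations "notebook": shared helpers (in Databricks: %run ./operations) ---
def jdbc_url(env):
    """Build an enterprise-standard JDBC string from the shared config."""
    cfg = DB_CONFIG[env]
    return f"jdbc:postgresql://{cfg['host']}:{cfg['port']}/{cfg['db']}"

# --- core/master "notebook": orchestrates the actual job ---
url = jdbc_url("research")
```

The payoff is that a connection change is made once in the configuration notebook and every downstream job picks it up on its next run.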
  19. 10 Things Learned (#6): Managing Your Costs
  20. Managing Your Costs § Job/Cluster management ▪ Understanding your job requirements ▪ Understanding cluster costs ▪ Use dashboarding to visualize that cost
  21. Monitoring Suggestions • Cluster management: ▪ Use the correct cluster for the right job ▪ What is the main purpose of the cluster? ▪ Need for memory ▪ Need for processors ▪ Need for both ▪ Need for ML capabilities ▪ Use the Ganglia UI • Job management: ▪ What is your job doing? ▪ What are its requirements? ▪ Frequency ▪ Completion times ▪ Data sizes ▪ Test, test, test, using the Ganglia UI to compare
  22. Ganglia UI
  23. Dashboard Examples • Job monitoring in Tableau • Job successes, failures, and skips in the past 24 hours • Failures in the past 14 days • Most recent failure explanations • Stale table lists • Usage monitoring in Tableau • Split between business units • Allowing managers to see cluster costs per month from both AWS and Databricks • Showing the top 10 highest-costing jobs
  24. 10 Things Learned (#7-10): Managing Your Users
  25. Managing Your Users § User groups / permissions § Cluster segregation § Leveraging secrets § Training / Best Practices
  26. Databricks Groups & Permissions • Utilize Databricks groups to separate by business unit or by user function • Data Engineers • Analysts • Job Users • Admins • Utilize cloud permissions to limit access • IAM roles assumed by clusters using instance profiles
  27. Cluster Segregation • Would you trust 120+ users to manage their own clusters? • Possibilities with cluster policies • Setting up team-based shared clusters: • Ad hoc • Machine learning • ETL • All jobs run on automated clusters • Jobs are owned and cost is assigned to each business unit • Cluster restrictions managed by cluster policies
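A cluster policy of the kind this slide describes is a JSON document that fixes, limits, or allow-lists cluster attributes. A hedged sketch (node types, tag names, and limits are hypothetical choices, not WGU's actual policy):

```json
{
  "autotermination_minutes": { "type": "fixed", "value": 60, "hidden": true },
  "node_type_id": { "type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"] },
  "autoscale.max_workers": { "type": "range", "maxValue": 10 },
  "custom_tags.business_unit": { "type": "fixed", "value": "research" }
}
```

Fixing a `custom_tags.*` value is what makes the per-business-unit cost attribution in the Tableau dashboards possible, since the tag flows through to the AWS and Databricks billing data.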
  28. Databricks Secrets • User retrieval without physical access • Permissions to access scopes • Return secrets through functions • Example returning database credentials: readDBProperties("research")
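A sketch of what a `readDBProperties`-style helper might look like. In Databricks the secret lookup would be `dbutils.secrets.get(scope, key)`; here the getter is injected so the sketch stays runnable anywhere, and the scope, key names, and values are all hypothetical:

```python
def read_db_properties(scope, get_secret):
    """Return JDBC credentials from a secret store without ever exposing the
    raw values to the user. In Databricks, pass get_secret=dbutils.secrets.get."""
    return {
        "user": get_secret(scope, "db-user"),
        "password": get_secret(scope, "db-password"),
    }

# Stand-in secret store, purely for illustration
fake_store = {
    ("research", "db-user"): "svc_research",
    ("research", "db-password"): "s3cret",
}
props = read_db_properties("research", lambda scope, key: fake_store[(scope, key)])
```

Combined with scope ACLs, this lets analysts connect to databases they are allowed to query without anyone handing them a password.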
  29. Training & Best Practices • It is hard to train hundreds of users on a new product • Let your users learn, and train them on how you want the product to be used • Utilize Databricks Academy for new hires / users • Monthly "tech talks" going over best practices or new features • Open weekly office hours assisting engineers and analysts with their code and general questions
  30. Understanding and accomplishing even half of these challenges before releasing enterprise wide could have saved a year's worth of work and tens of thousands of dollars, and left us with a more secure, more efficient operating environment.
  31. Q & A email: jake.kulas@wgu.edu linkedin: https://www.linkedin.com/in/jakekulas/
  32. Feedback Your feedback is important to us. Don't forget to rate and review the sessions.
