
Using Databricks as an Analysis Platform



Over the past year, YipitData spearheaded a full migration of its data pipelines to Apache Spark via the Databricks platform. Databricks now empowers its 40+ data analysts to independently create data ingestion systems, manage ETL workflows, and produce meaningful financial research for our clients.

Published in: Data & Analytics


  1. Using Databricks as an Analysis Platform. Anup Segu
  2. Agenda: extending Databricks to provide a robust analytics platform. Why a platform? What is in our platform?
  3. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
  4. YipitData’s Platform
  5. YipitData Answers Key Investor Questions ▪ 70+ research products covering U.S. and international companies ▪ Email reports, Excel files, and data downloads ▪ Transaction data, web data, app data, targeted interviews, and adding more ▪ Clients include over 200 investment funds and Fortune 500 companies ▪ 53 data analysts and 3 data engineers ▪ 22 engineers total ▪ We are rapidly growing and hiring!
  6. About Me ▪ Senior Software Engineer ▪ Manage platform for ETL workloads ▪ Based out of NYC ▪ linkedin.com/in/anupsegu
  7. We Want Analysts To Own The Product: Data Collection → Data Exploration → ETL Workflows → Report Generation
  8. Engineers: Providing a Platform. Analysts: Answering Questions.
  9. Python Library Inside Notebooks
  10. Ingesting Data
  11. Wide Range of Table Sizes and Schemas: 1 PB compressed Parquet, 60K tables, 1.7K databases
  12. Readypipe: From URLs To Parquet ▪ Repeatedly capture a snapshot of the website ▪ Websites frequently change ▪ Makes data available quickly for analysis
  13. Streaming As JSON Is Great ▪ Append-only data in S3 ▪ We don’t know the schema ahead of time ▪ Only flat column types. (Diagram: Kinesis Firehose → JSON bucket → Parquet bucket, backed by the Glue metastore; JSON lands at s3://{json_bucket}/{project_name}/{table}/...)
  14. Parquet Makes Data “Queryable” ▪ Create or update databases, tables, and schemas as needed ▪ Partitioned by the date of ingestion ▪ Spark cluster subscribed to write events. (Diagram: same pipeline; Parquet lands at s3://{parquet_bucket}/{project_name}/{table}/dt={date}...)
  15. Compaction = Greater Performance ▪ Insert files into new S3 locations ▪ Update partitions in Glue ▪ Pick appropriate column lengths for optimal file counts. (Diagram: same pipeline; compacted data lands at s3://{parquet_bucket}/{project_name}/{table}/compacted/dt={date}...)
  16. With 3rd Party Data, We Strive for Uniformity. Challenges: various file formats, permissions (403 Access Denied), data lineage, data refreshes
  17. Databricks Helps Manage 3rd Party Data ▪ Upload files and convert to Parquet with additional metadata ▪ Configure data access by assuming IAM roles within notebooks
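Slide 17's second bullet, configuring data access by assuming IAM roles inside notebooks, can be sketched as a small helper. This is a hedged illustration, not YipitData's code: the function name and bucket argument are hypothetical, while the Hadoop S3A property names (`fs.s3a.bucket.<bucket>.*` and `TemporaryAWSCredentialsProvider`) are real.

```python
def s3a_options_for_role(bucket, sts_credentials):
    """Map an STS AssumeRole credential set onto per-bucket Hadoop S3A
    options so Spark reads of s3://<bucket>/ use the assumed role's
    temporary credentials."""
    prefix = "fs.s3a.bucket.%s" % bucket
    return {
        prefix + ".aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        prefix + ".access.key": sts_credentials["AccessKeyId"],
        prefix + ".secret.key": sts_credentials["SecretAccessKey"],
        prefix + ".session.token": sts_credentials["SessionToken"],
    }

# In a notebook one might apply this (shown for context, not executed):
#   creds = boto3.client("sts").assume_role(
#       RoleArn=role_arn, RoleSessionName="notebook")["Credentials"]
#   hconf = spark.sparkContext._jsc.hadoopConfiguration()
#   for key, value in s3a_options_for_role("vendor-bucket", creds).items():
#       hconf.set(key, value)
```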
  18. Table Utilities
  19. Table: Database + Name + Data
  20. Table Hygiene Pays Off: validate table naming conventions, keep the storage layer organized, maintain prior versions of tables, automate table maintenance
  21. However, Our Team Is Focused On Analysis, so best practices are built into “create_table”
  22. Cluster Management
  23. Wide Range Of Options For Spark Clusters. Hardware: driver instance, worker instances, EBS volumes. Permissions: metastore, S3 access, IAM roles. Spark configuration: runtime, Spark properties, environment variables.
  25. T-Shirt Sizes For Clusters. “SMALL”: 3 r5.xlarge instances, with a warm instance pool for fast starts. “MEDIUM”: 10 r5.xlarge instances, with larger EBS volumes available if needed. “LARGE”: 30 r5.xlarge instances, with larger EBS volumes for heavy workloads. All sizes share standard IAM roles, metastore, S3 access, and environment variables.
  26. Launch Spark Jobs With Ease
  27. Names Map To Databricks Configurations
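Slides 25-27 describe mapping t-shirt size names onto full cluster configurations. A minimal sketch of that mapping: the instance counts and type come from slide 25, but the runtime version string, the role-ARN placeholder, and the function name are illustrative assumptions, not YipitData's actual settings.

```python
# Size names and worker counts from slide 25.
CLUSTER_SIZES = {
    "SMALL": {"num_workers": 3},
    "MEDIUM": {"num_workers": 10},
    "LARGE": {"num_workers": 30},
}

def cluster_config(size, **overrides):
    """Resolve a friendly size name into a Databricks cluster spec.
    Field names (node_type_id, num_workers, aws_attributes) follow the
    Databricks Clusters API; values marked below are assumptions."""
    base = {
        "node_type_id": "r5.xlarge",
        "spark_version": "7.3.x-scala2.12",  # assumed pinned runtime
        "aws_attributes": {"instance_profile_arn": "<standard-iam-role-arn>"},
    }
    base.update(CLUSTER_SIZES[size.upper()])
    base.update(overrides)  # escape hatch for non-standard needs
    return base
```

Keeping the mapping in one dict is what lets a plain name like "MEDIUM" stand in for the whole configuration surface shown on slides 23-24.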
  28. Databricks Does The Heavy Lifting ▪ Provisions compute resources via a REST API ▪ Scales instances for cluster load ▪ Applies a wide range of Spark optimizations
  29. ETL Workflow Automation
  30. Airflow Is Our Preferred ETL Tool
  31. Airflow Is Our Preferred ETL Tool, but it requires someone to manage this code
  32. We use the Databricks API to construct DAGs programmatically
  33. 1 DAG = 1 Folder, 1 Task = 1 Notebook
  34. Templated Notebooks For DAGs (/folder contains: commands, notebook_a, notebook_b, notebook_c)
  35. Translate Notebooks Into DAG Files via /api/2.0/workspace/list and /api/2.0/workspace/export
  36. Automatically Create Workflows ▪ Pipelines are deployed without engineers ▪ Robust logging and error handling ▪ Easy to modify DAGs ▪ All happens within Databricks. (Diagram: Task A → Task B → Task C)
  37. Platform Visibility
  38. Tailored Monitoring Solutions
  39. Standardize Logs As Data
  40. Visualize Logs In Notebooks
  41. A Platform Invites New Solutions ▪ Establish standard queries and notebooks ▪ Trigger one DAG from another ▪ Trigger reporting processes after ETL jobs
  42. Thank You. Interested in working with us? We are hiring! yipitdata.com/careers
  43. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
  44. Appendix I: Compaction Code
  45. Compacting Partitions
  46. Compacting Partitions (cont.)
  47. Compacting Partitions (cont.)
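The compaction code on slides 45-47 was embedded in images and did not survive extraction. A rough sketch of the idea from slide 15, rewriting a partition's many small files into fewer large ones at a new S3 location: the 512 MB target size, function names, and path arguments are assumptions, not the original code.

```python
import math

def target_file_count(total_bytes, target_file_bytes=512 * 1024 * 1024):
    """How many output files keep each file near the target size
    (512 MB is an assumed target, not YipitData's actual setting)."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

def compact_partition(spark, source_path, compacted_path, total_bytes):
    """Rewrite one ingestion-date partition (e.g. .../table/dt=2020-06-01)
    into .../table/compacted/dt=2020-06-01 with fewer, larger files.
    Afterwards the Glue partition would be repointed to the new location
    (slide 15: "Update partitions in Glue")."""
    df = spark.read.parquet(source_path)
    (df.repartition(target_file_count(total_bytes))
       .write.mode("overwrite").parquet(compacted_path))
```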
  48. Appendix II: Table Creation Code
  49. Capturing metadata with source data
  50. Creating a table
  51. Creating a table (cont.)
  52. Creating a table (cont.)
  53. Creating a table (cont.)
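Slides 49-53 showed the `create_table` implementation as images. Here is a hedged reconstruction of what slide 20's hygiene rules could look like inside such a wrapper; the naming regex, bucket default, and timestamp-versioning scheme are guesses for illustration, not the actual YipitData library.

```python
import re
from datetime import datetime, timezone

NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")  # assumed naming convention

def table_path(bucket, database, table, version):
    """Organized storage layout: one prefix per database/table/version,
    so prior versions of a table are kept rather than overwritten."""
    return "s3://%s/%s/%s/%s" % (bucket, database, table, version)

def create_table(df, database, table, bucket="analyst-tables"):
    """Write df as a Parquet table with hygiene checks built in, so
    analysts get best practices for free (slide 21)."""
    for name in (database, table):
        if not NAME_RE.match(name):
            raise ValueError("%r violates table naming conventions" % name)
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    path = table_path(bucket, database, table, version)
    (df.write.format("parquet").mode("overwrite")
       .option("path", path).saveAsTable("%s.%s" % (database, table)))
    return path
```

Because validation runs before any write, a badly named table fails fast instead of littering the metastore.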
  54. Appendix III: Databricks Jobs Code
  55. Create a Databricks Job
  56. Create a Databricks Job (cont.)
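The job-creation code on slides 55-56 is likewise image-only. A sketch of the request body such code would send, assuming the Databricks runs-submit endpoint (`POST /api/2.0/jobs/runs/submit`); the helper name and defaults are illustrative.

```python
import json

def notebook_run_payload(notebook_path, cluster, params=None):
    """Build the JSON body that launches one notebook as a one-off run.
    `cluster` could come from the t-shirt-size mapping on slide 25."""
    return json.dumps({
        "run_name": "run:" + notebook_path,
        "new_cluster": cluster,
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": params or {},
        },
    })

# Sending it (not executed here) would be an authenticated POST to
# https://<workspace>/api/2.0/jobs/runs/submit with this body.
```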
  57. Appendix IV: Airflow Code
  58. Automatic DAG Creation
  59. Automatic DAG Creation (cont.)
  60. Automatic DAG Creation (cont.)
  61. Automatic DAG Creation (cont.)
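The Airflow code on slides 58-61 also did not survive extraction. Slides 33-35 say one folder becomes one DAG, each notebook becomes one task, and notebooks are discovered via `/api/2.0/workspace/list`. A sketch of that translation, with an assumed sequential ordering by notebook name (the deck does not show the real dependency logic):

```python
def tasks_from_listing(objects):
    """Turn the `objects` array of a /api/2.0/workspace/list response
    into a chain of task specs: one task per notebook, each depending
    on the previous notebook in name order (ordering is an assumption)."""
    notebooks = sorted(
        o["path"] for o in objects if o.get("object_type") == "NOTEBOOK"
    )
    tasks = []
    for i, path in enumerate(notebooks):
        tasks.append({
            "task_id": path.rsplit("/", 1)[-1],
            "notebook_path": path,
            "upstream": tasks[i - 1]["task_id"] if i else None,
        })
    return tasks

# In a generated DAG file, each spec would typically become a
# DatabricksSubmitRunOperator wired up via its upstream task.
```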
