SQL Analytics Powering Telemetry Analysis at Comcast

Comcast is one of the leading providers of communications, entertainment, and cable products and services. At the heart of it is Comcast RDK, which provides the backbone of telemetry to the industry. RDK (Reference Design Kit) is pre-bundled open-source firmware for a complete home platform covering video, broadband and IoT devices. The RDK team at Comcast analyzes petabytes of data, collected every 15 minutes from 70 million video, broadband and IoT devices installed in customer homes. The team runs ETL and aggregation pipelines and publishes analytical dashboards daily to help reduce customer calls and to support firmware rollouts. The analysis is also used to calculate the Wi-Fi happiness index, a critical KPI for Comcast customer experience.

In addition, the RDK team tracks releases by analyzing RDK firmware quality. SQL Analytics allows customers to operate a lakehouse architecture that provides data warehousing performance at data lake economics, with up to 4x better price/performance for SQL workloads than traditional cloud data warehouses.

We present the results of the “Test and Learn” with SQL Analytics and the Delta Engine, which we carried out in partnership with the Databricks team. We give a quick demo introducing the native SQL interface and cover the challenges we faced with migration, the results of the execution, and our journey of productionizing this at scale.
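
To give a sense of the scale described in the abstract, here is a minimal back-of-the-envelope sketch. It assumes every one of the 70 million devices sends exactly one telemetry report per 15-minute interval, which the abstract does not state explicitly; the function name `reports_per_day` is hypothetical.

```python
# Figures taken from the talk abstract.
DEVICES = 70_000_000
INTERVAL_MINUTES = 15


def reports_per_day(devices: int, interval_minutes: int) -> int:
    """Number of telemetry reports per day, assuming one report
    per device per collection interval (an assumption, not a
    figure from the talk)."""
    intervals_per_day = (24 * 60) // interval_minutes  # 96 for 15 minutes
    return devices * intervals_per_day


daily_reports = reports_per_day(DEVICES, INTERVAL_MINUTES)
print(f"{daily_reports:,} reports per day")  # prints "6,720,000,000 reports per day"
```

Under that assumption the pipeline ingests roughly 6.7 billion reports a day, which is consistent with the petabyte-scale storage the abstract mentions.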

  1. 1. SQL Analytics powering Telemetry Analysis at Comcast Suraj Nesamani, Principal Engineer, RDK Analytics @ Comcast and Molly Nagamuthu, Sr Resident Solutions Architect @ Databricks
  2. 2. Agenda ▪ Introduction to the Lakehouse Platform ▪ SQL Analytics - under the hood by Molly Nagamuthu ▪ RDK challenges at Comcast ▪ SQL Analytics Test and Learn by Suraj Nesamani
  3. 3. 20+ years in Product Development, Engineering & Professional Services Telecom Healthcare & Media Finance Molly Nagamuthu Sr Resident Solutions Architect @ Databricks
  4. 4. Databricks’ vision is to enable data-driven innovation to all enterprises
  5. 5. Lakehouse Platform
  6. 6. Today, most enterprises struggle with data Siloed stacks increase data architecture complexity Data Warehousing Data Engineering Streaming Data Science & Machine Learning Extract Load Transform Streaming data sources Streaming Data Engine Real-time Database Analytics and BI Data marts Data warehouse Structured data Structured, semi-structured and unstructured data Structured, semi-structured and unstructured data Data Lake Data prep Data Lake Machine Learning Data Science
  7. 7. Today, most enterprises struggle with data Siloed stacks increase data architecture complexity Data Warehousing Data Engineering Streaming Data Science & Machine Learning Extract Load Transform Streaming data sources Streaming Data Engine Real-time Database Analytics and BI Data marts Data warehouse Structured data Structured, semi-structured and unstructured data Structured, semi-structured and unstructured data Data Lake Data prep Data Lake Machine Learning Data Science Amazon Redshift Teradata Azure Synapse Google BigQuery Snowflake IBM Db2 SAP Oracle Autonomous Data Warehouse Hadoop Apache Airflow Amazon EMR Apache Spark Google Dataproc Cloudera Jupyter Amazon SageMaker Azure ML Studio MATLAB Domino Data Labs SAS TensorFlow PyTorch Apache Kafka Apache Spark Apache Flink Amazon Kinesis Azure Stream Analytics Google Dataflow Tibco Spotfire Confluent Disconnected systems and proprietary data formats make integration difficult
  8. 8. Today, most enterprises struggle with data Siloed stacks increase data architecture complexity Data Warehousing Data Engineering Streaming Data Science & Machine Learning Extract Load Transform Streaming data sources Streaming Data Engine Real-time Database Analytics and BI Data marts Data warehouse Structured data Structured, semi-structured and unstructured data Structured, semi-structured and unstructured data Data Lake Data prep Data Lake Machine Learning Data Science Amazon Redshift Teradata Azure Synapse Google BigQuery Snowflake IBM Db2 SAP Oracle Autonomous Data Warehouse Hadoop Apache Airflow Amazon EMR Apache Spark Google Dataproc Cloudera Jupyter Amazon SageMaker Azure ML Studio MATLAB Domino Data Labs SAS TensorFlow PyTorch Apache Kafka Apache Spark Apache Flink Amazon Kinesis Azure Stream Analytics Google Dataflow Tibco Spotfire Confluent Disconnected systems and proprietary data formats make integration difficult Data Scientists Data Engineers Data Analysts Data Engineers Siloed data teams decrease productivity
  9. 9. Structured Semi-structured Unstructured Streaming Lakehouse Platform Data Engineering BI & SQL Analytics Real-time Data Applications Data Science & Machine Learning Data Management & Governance Open Data Lake SIMPLE OPEN COLLABORATIVE
  10. 10. SQL Analytics
  11. 11. Databricks SQL Analytics Delivering analytics on the freshest data with data warehouse performance and data lake economics ■ Query your lakehouse with better price / performance ■ Simplify discovery and sharing of new insights ■ Connect to familiar BI tools, like Tableau or Power BI ■ Simplify administration and governance
  12. 12. Broad integration with BI tools Connect your preferred BI tools to your data lake with optimized connectors that provide fast performance, low latency, and high user concurrency. Coming soon:
  13. 13. Use Cases Collaborative exploratory data analysis on your data lake Data-enhanced applications Connect existing BI tools and use one source of truth for all your data
  14. 14. Connect your existing BI tools to your data lake with optimized connectors and ODBC/JDBC Drivers Databricks SQL Analytics Under the hood: SQL Analytics SQL Endpoints Vectorized Query Engine Build a curated cloud data lake on an open format with Delta Lake Filtered, Cleaned, Augmented Silver Raw Ingestion and History Bronze Business-level Aggregates Gold Curated Data Structured, Semi-Structured, and Unstructured Data BI/SQL Connectors SQL Interface Next generation query engine provides real-life performance for all queries Based on Redash, query your entire data lake with SQL and visualize results Quickly set up SQL / BI optimized compute with best price/performance and track usage
  15. 15. Demo
  16. 16. Cluster size: check your cloud provider quotas! Check documentation for latest cluster size: https://docs.sql.azuredatabricks.net/sql/admin/sql-endpoints.html https://docs.sql.databricks.com/sql/admin/sql-endpoints.html
  17. 17. Additional Resources ➢ https://docs.databricks.com/sql/index.html ➢ https://databricks.com/product/data-lakehouse ➢ https://databricks.com/discover/demos/sql-analytics ➢ Customer Success Offering SQL Analytics MVP available in Q2
  18. 18. Related Talks WEDNESDAY • 03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks • 04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn & Alex Behm, Databricks • 04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya, Plume THURSDAY • 11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics • 03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano, Databricks FRIDAY • 10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast & Molly Nagamuthu, Databricks
  19. 19. Reference Design Kit (RDK)
  20. 20. Suraj Nesamani Principal Engineer, RDK@Comcast 15-plus years of experience in engineering, mostly specializing in RDK telemetry and big data analysis. Experienced in handling petabyte-scale IoT data
  21. 21. The RDK is a Whole Home Open Source Software Platform powering Video, Broadband and IoT Devices. It enables operators to manage devices and easily customize their UIs and apps, and provides analytics to improve the customer experience and drive business results. ➢ https://rdkcentral.com/ RDK - Reference Design Kit
  22. 22. RDK Telemetry Pipeline (diagram labels): Raw Telemetry Data; Raw Telemetry Data; Formatted Data; Copy data to Redshift
  23. 23. Node: dc2.8xlarge; Cluster size: 12 nodes; Queries executed: 1000 per day; CPU usage: 80% most of the time; Disk usage: 60%
  24. 24. Storage Capacity Scalable Complex Queries Pricing Data retention Sort and distribution keys Concurrent Execution AWS Redshift
  25. 25. RDK Analytics Challenge ● Queries that take more CPU and time in the cluster ● Workload Management (WLM) aborts ● Expensive to add more nodes ● Query output still needed to run the business
  26. 26. RDK-Databricks Partnership ● Migrated a complex Redshift pipeline using Spark 3.0 on Databricks ● Migrated some EMR workloads ● Optimizations and Databricks training ● Delta Lake Test and Learn ● Upgraded to the E2 version of the Databricks platform, which is more secure, more scalable and simpler to manage
  27. 27. The Quest
  28. 28. SQL Analytics (Test and Learn): Scope. Analyze Redshift workloads • 10 slowest-performing queries on Redshift • Average runtime for each of these queries is 30 mins • Currently use 12 dc2.8xl nodes. Disclaimer: For the Test and Learn, we could not reproduce the Redshift production environment, so we only considered the query runtimes for comparison.
  29. 29. Test and Learn Design. Timeline: 2-3 weeks. Migrate workloads to Databricks SQL Analytics • Metastore decisions • Table access control lists (ACLs) • Convert Redshift queries to Spark SQL • Test workloads, as is, in Parquet format • Convert to Delta Lake and test against Photon
  30. 30. Results
  31. 31. Observations: The native SQL interface was very intuitive and easy to use. Creating endpoints is extremely simple, which helps SQL analysts to a great extent. SQL Analytics does not support UDFs. We did not test ACLs extensively, but they seemed simple enough. A centralized catalog would be really nice.
  32. 32. Looking forward to ...
  33. 33. Databricks Unity Catalog - Coming soon! S3 Data Source GCS Data Source Cluster or SQL Endpoint Database Data Source Data source credentials Users Unity Catalog SQL GRANT permissions Audit log
  34. 34. Thank you!
  35. 35. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
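
Slide 29 lists "Convert Redshift queries to Spark SQL" as one of the migration steps, and slide 24 mentions Redshift sort and distribution keys. The talk does not show how the team did the conversion, so the following is only an illustrative sketch of what a mechanical first pass might look like. The rule set, the names `REWRITE_RULES` and `redshift_to_spark_sql`, and the sample table and column names are all hypothetical; a real migration would need many more rules and a manual review of every query.

```python
import re

# Hypothetical, deliberately tiny rule set (an assumption for illustration,
# not the tooling used in the talk).
REWRITE_RULES = [
    # Redshift GETDATE() -> Spark SQL current_timestamp()
    (re.compile(r"\bGETDATE\s*\(\s*\)", re.IGNORECASE), "current_timestamp()"),
    # Redshift table attributes have no Spark SQL counterpart, so drop them.
    (re.compile(r"\s+DISTSTYLE\s+\w+", re.IGNORECASE), ""),
    (re.compile(r"\s+DISTKEY\s*\([^)]*\)", re.IGNORECASE), ""),
    (
        re.compile(
            r"\s+(?:COMPOUND\s+|INTERLEAVED\s+)?SORTKEY\s*\([^)]*\)",
            re.IGNORECASE,
        ),
        "",
    ),
]


def redshift_to_spark_sql(query: str) -> str:
    """Apply each rewrite rule in order and return the converted query text."""
    for pattern, replacement in REWRITE_RULES:
        query = pattern.sub(replacement, query)
    return query


# Sample statements with made-up table and column names.
select_stmt = "SELECT device_id, GETDATE() AS loaded_at FROM telemetry"
ddl_stmt = "CREATE TABLE t (id INT) DISTKEY(id) SORTKEY(id)"

print(redshift_to_spark_sql(select_stmt))
print(redshift_to_spark_sql(ddl_stmt))
```

A regex pass like this only handles surface syntax. Differences in function semantics, data types, and physical layout (for example, replacing sort keys with a Delta Lake layout strategy) would still have to be handled case by case.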
