Self-Service Analytics on Hadoop: Lessons Learned
June 29, 2016
Drew Leamon
Director – Advanced Technology Solutions
Comcast: Shaping the Future of Media and Technology
High Speed Internet
Video
IP Telephony
Home Security / Automation
Universal Parks
Media Properties
Forecast
Engineering Design
Budget
Engineering Analysis: Global Central Analysis Team
Animals are Best Suited in Their Native Habitat
Spreadsheets: The Natural Habitat of Analysts
Evolution of Self Service Analytics
SSRS
Self Service: Native Habitat
Limitations of the Spreadsheet Native Habitat
Self Service
• 1 Million Row Max
• Not Even Medium Data
• Not Collaborative
• No Automation
• Not Repeatable
IT Analyst
Self Service: How We Started
Analyst goes to IT, makes a request, and waits weeks for results
SSRS
• 10 TB Storage
• 1 Compute Node
Not Self Service
• 10 TB (Medium Data)
• Limited Compute
• IT Hand-off
• Consultative service
• Not self service.
IT Analysts
Bigger database still meant building dashboards for team
IT Analysts
Still Not Self Service
• 100s TBs (Large Data)
• Data silos
• IT Hand-off
• Consultative service
• Analysts not SQL experts
Graduated to Specialized Databases
• Clustered Storage
• Columnar Compression
• Clustered Compute
Datameer, native on Hadoop, enables self-service for big data
Analysts
True Self Service
• PB == Big Data
• Data Lake
• Excel-like UI
• No more waiting for IT
Self Service: The New Way
• Clustered Storage
• Columnar Compression
• Clustered Compute
• Liberated Data
Multiple Configurations for Big Data
Engineering Analysis
IP Telephony
Video Research
IP Video Engineering
X1 Operations
Advanced Advertising
Web Analytics
Enterprise Business Intelligence
Network Engineering
Mature
Evolving
On-Boarded
On-Deck
Expanding Use Cases with Datameer
Use Case #1: Comcast Digital Voice
One Of The Largest IP Telephony Networks
Anonymized Call Detail Records (CDR) Data Set
Data complexity from network
Data size: TBs/month
Discovered Unusual Patterns
Noticed large spikes for high cost areas
Hypothesis: Network Abuse
30% of this traffic was coming from three accounts.
Analysis Shows Traffic Concentration in a Few Accounts
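
The concentration check described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow the analysts built in Datameer: the record layout (account ID, minutes), the function name top_account_share, and every sample value are hypothetical.

```python
from collections import Counter


def top_account_share(records, top_n=3):
    """Return the top_n accounts by total minutes and their share of all minutes.

    `records` is an iterable of (account_id, minutes) pairs; this layout is a
    hypothetical stand-in for anonymized call detail records.
    """
    minutes_by_account = Counter()
    for account_id, minutes in records:
        minutes_by_account[account_id] += minutes

    total = sum(minutes_by_account.values())
    if total == 0:
        return [], 0.0

    top = minutes_by_account.most_common(top_n)
    share = sum(minutes for _, minutes in top) / total
    return [account for account, _ in top], share


# Made-up sample: three heavy accounts plus fourteen ordinary ones.
sample = [("acct-A", 70), ("acct-A", 50), ("acct-B", 100), ("acct-C", 80)]
sample += [(f"acct-{i:02d}", 50) for i in range(14)]

accounts, share = top_account_share(sample)
print(accounts, f"{share:.0%}")
```

Aggregating per account before ranking matters because a single account usually appears in many records; ranking raw records would understate how concentrated the traffic is.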
Ongoing Monitoring of Future Abuse
Analyst scheduled a Tableau Data Extract and built a Tableau dashboard
- Now the business can keep an eye out for further abuse.
Result: Future Abuse Prevented and More
Abuse detected
Analysts empowered
Resources saved
No IT hand-off
Value to organization
Automated and repeatable
Engineering Analysis
IP Telephony
Video Research
IP Video Engineering
X1 Operations
Advanced Advertising
Web Analytics
Enterprise Business Intelligence
Network Engineering
Mature
Evolving
On-Boarded
On-Deck
Expanding Use Cases with Datameer
Use Case #2: Customer Perspective
How to measure customer experience from the customer perspective
Millions of Viewing Experiences
Improved Customer Experience through Data Analytics
Findings / Analysis
Best Practices
Improved Customer Experience
Data driven scheduling
Dataflow Automation
Solution:
- Build views quickly & aggregate large datasets.
- Early visibility of data in Hadoop
- Create repeatable processes through automated workflow
• Aggregations of large datasets from disparate data sources.
- RDBMS, HDFS, APIs
• Data Joins / Data Quality Checks / Pipeline between clusters
Result: Data-driven Customer Viewing Experience Enhancements
Customer Experience Improved
Analysts empowered
Capital Spend Directed Intelligently
No IT hand-off
Value to organization
Automated and repeatable

Editor's Notes

  • #2 Welcome. Self-introduction. Journey to self-service big data, based on lessons learned from the work that we have done at Comcast.
  • #3 Comcast introduction. Cable organization: High Speed Internet, Emmy-winning video platform, Home Security & Automation, IP Telephony. NBC Universal: media properties, Universal theme parks. Scale: 10s of millions of customers, 100s of millions of devices.
  • #4 Intro to my team. Initial charter: start with massive amounts of data; deliver budget guidance, forecasts, and engineering design guidance. My specific goal is to empower all of these activities and more with technology.
  • #5 Hadoop Summit: Data Lake. Safari in Africa. Musth: testosterone spikes 60x. You will never experience that in a zoo, nor a theme park. Native habitat is critical.
  • #7 We started with self-service analytics: Excel on a laptop. Single resource, no handoffs, contained; scaled to 1M rows. Migrated to SQL Server / SSRS: not self service; IT infrastructure and handoff; limit at 250 GB. 8 years ago, before Big Data was cool, we had big data problems. Enter Vertica: columnar data store, 100s of TBs, but still silos. Enter Datameer on Hadoop to bring us back to self-service analytics.
  • #8 Technical limitations: 1M row max (not even medium data); not collaborative; no automation; not repeatable.
  • #9 Consultative model. Limit for SQL Server at ~250 GB. IT handoffs. Model is consultative. Actually moved away from self service: in Excel, analysts had access to data. IT Service Analytics
  • #10 Now we can store TBs of data in clusters of servers. If you really have big data, you are still going to end up with silos. IT handoff and still consultative. Analysts don’t know SQL, or at least don’t know it well enough to avoid causing problems.
  • #11 Open source. Dataset blending. Have true self-service. No IT handoffs. Datameer 5000-row sample.
  • #12 Multiple configurations/distributions. Mixture of bare metal and virtualized. Multiple distributions. When we use “big data” like this, we do so in compliance with all applicable privacy and security requirements and laws.
  • #13 Diagram details the maturity of different use cases. Many are being targeted and are in varying levels of maturity. At this point I’m going to focus in on a specific use case. Lots of consultative work. Invested to make them self-sufficient with Datameer.
  • #14 Comcast Digital Voice. We are one of the largest telephone carriers in the country. This is an important line of business for us. There are many parts of the business, including wholesale and peering relationships, that need to be managed. All of these rely on data to make decisions on how to manage the network and the relationships.
  • #15 IP Telephony is complex: a deep engineering field. Intricacies: session border controllers, media gateways. SMEs and analysts have deep engineering knowledge; my team did not. Consultative approach was very challenging: handoff errors, built the wrong thing, extremely iterative and costly.
  • #16 Solution: get the SMEs and analysts into the data with Datameer. Data: anonymized CDRs, TBs of data per day. Datameer UI: 5000-row sample of data, real-time feedback. Create your data pipeline via an XLS-like UI with instantaneous feedback.
  • #17 What happened (second hand). Data discovery. Profiling: understand the data. Let the data tell its story. Noticed something strange in the data: spike to high-cost areas (international?). Question: what does it mean?
  • #18 Hypothesis: network abuse. Not legitimate use. Violation of the terms of service. Not going to give a course in how to abuse our services.
  • #19 The SMEs'/analysts' hypothesis. Dug deeper, created aggregations. A large percentage of traffic was coming from a handful of accounts.
  • #20 Datameer has visualization capability (infographics). Tableau is fairly well adopted; using Datameer's integration with Tableau, SMEs created an automation in Datameer to push a TDE to Tableau Server.
  • #21 Abuse detected and addressed. Analyst directed and empowered. No IT handoff. Value delivered to the organization. Automated and repeatable.
  • #22 Diagram details the maturity of different use cases. Many are being targeted and are in varying levels of maturity. At this point I’m going to focus in on a specific use case. Lots of consultative work. Invested to make them self-sufficient with Datameer.
  • #25 Sausage funnel. Inputs: 3rd-party QoE, network QoS, in-home QoS. Outputs: improved IP video QoE, improved NPS.
  • #26 Blend, analyze, share. Rapid prototyping: disparate data sets.
  • #27 Changing how we prioritize capital spend. Optimizing for CX: the right KPI.