Big Data Science Challenges in Media
Presented at the Data Science London meetup, this presentation lists some of the Big Data Science challenges faced at Sky.

Transcript

  • 1. Big Data Science Challenges
      Chandan Rajah, Chief Architect, Big Data
      crajah@parallelai.com [ @chandanrajah ]
  • 2. Why Big Data Science?
      Big Data Value & Vision
        • Machine learning (clustering, classification, regression, pattern mining, behaviour analysis, semantic analysis, topic extraction)
        • Real time analytics & recommendations
        • Central smelting pot
        • Cost to data benefits
      Volume & Variety
        • 10 million subscribers; 10 different touch points
        • Petabytes of data; structured and unstructured
        • Event logs, program data, content metadata, purchase history, etc.
        • Too big for traditional data warehouse
      Velocity & Veracity
        • 140 MB/s, approx. 12 TB/day
        • Too fast; 95% of the data dropped
        • Inconsistent data structure
        • No single version of truth
  • 3. Big Data Science Challenges
      Stages: Data Quality, Feature Extraction, Machine Learning, Visualisation & Verification, Productizing
        • Dirty unstructured data with inconsistent labels
        • Start but no end events
        • Field shifts between extracts
        • XML fragmented data; 100k frags
        • Data too big to run in R; requires subsampling and effective implementation
        • 100s of features; too big for Scala / Scalding tuple
        • No clearly identifiable keys
        • Algorithm implementation issues (e.g. parallelism, scalability, testability)
        • Collaborative filtering, topic modelling, incremental clustering, sentiment analysis
        • Real time versus batch algorithm design
        • Visualisation tool support
        • Automated testing frameworks
        • R -> Scala / Scalding not easy
        • Disaster recovery & cross data centre
        • On the fly analytics; data streams
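
The velocity figures on slide 2 (140 MB/s, roughly 12 TB/day, 95% dropped) can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes decimal units (1 TB = 1,000,000 MB) and assumes the 95% drop applies to data volume, neither of which the slide states. The function name `daily_volume_tb` is made up for this example.

```python
# Back-of-the-envelope check of the ingest figures quoted on slide 2.
MB_PER_SECOND = 140              # sustained rate from the slide
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
DROPPED_FRACTION = 0.95          # assumption: "95% dropped" is by volume


def daily_volume_tb(mb_per_second: float) -> float:
    """Convert a sustained MB/s rate to TB/day (decimal units)."""
    return mb_per_second * SECONDS_PER_DAY / 1_000_000


daily_tb = daily_volume_tb(MB_PER_SECOND)
retained_tb = daily_tb * (1 - DROPPED_FRACTION)

print(f"{daily_tb:.2f} TB/day ingested, {retained_tb:.2f} TB/day kept")
```

Under these assumptions the rate works out to about 12.1 TB/day, consistent with the slide's "approx. 12 TB/day", of which roughly 0.6 TB/day would be retained.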
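
Slide 3 notes that the data is too big to run in R and requires subsampling. The slides do not say which sampling method was used; one common option for data that arrives as a stream of unknown length is reservoir sampling (Algorithm R), sketched below. The function name `reservoir_sample` and the synthetic `event-N` records are hypothetical and exist only for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Uses Algorithm R: a single pass and O(k) memory, so the full
    data set never has to fit in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Synthetic stand-in for an event log; real records would come from storage.
events = (f"event-{n}" for n in range(100_000))
sample = reservoir_sample(events, k=1_000, seed=42)
print(len(sample))
```

The resulting sample could then be exported for analysis in R. Whether a uniform sample is appropriate depends on the analysis; stratified or per-subscriber sampling may be a better fit when rare events matter.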