How Conviva used Spark to Speed Up Video Analytics by 30x
Dilip Antony Joseph (@DilipAntony)
  1. How Conviva used Spark to Speed Up Video Analytics by 30x (Dilip Antony Joseph, @DilipAntony)
  2. Conviva monitors and optimizes online video for premium content providers
  3. What happens if you don’t use Conviva?
  4. 65 million video streams a day
  5. Conviva data processing architecture: the Video Player feeds live data processing, which drives the Monitoring Dashboard; Hadoop processes historical data into Reports; Spark handles ad-hoc analysis
  6. Group By queries dominate our Reporting workload

     SELECT videoName, COUNT(1)
     FROM summaries
     WHERE date='2012_08_22' AND customer='XYZ'
     GROUP BY videoName;

     10s of metrics, 10s of group bys
  7. Group By query in Spark

     val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](
         pathToSessionSummaryOnHdfs,
         classOf[SessionSummary], classOf[NullWritable])
       .flatMap { case (key, value) => value.fieldsOfInterest }

     val cachedSessions = sessions.filter(
         whereConditionToFilterSessionsForTheDesiredDay)
       .cache

     val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
     val reduceFn : (Long, Long) => Long = { (a, b) => a + b }

     val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
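The filter / map / reduceByKey pipeline in the slide above can be sketched in plain Python (toy records, no Spark involved; the field names and values here are illustrative, not Conviva's schema):

```python
from collections import defaultdict

# Toy session summaries standing in for SessionSummary records;
# videoName is the only field this query needs.
sessions = [
    {"videoName": "clip_a", "date": "2012_08_22"},
    {"videoName": "clip_b", "date": "2012_08_22"},
    {"videoName": "clip_a", "date": "2012_08_22"},
    {"videoName": "clip_a", "date": "2012_08_21"},
]

# filter: keep only sessions for the desired day
cached = [s for s in sessions if s["date"] == "2012_08_22"]

# map: emit (videoName, 1) pairs, like mapFn in the Scala code
pairs = [(s["videoName"], 1) for s in cached]

# reduceByKey: sum the counts per key, like reduceFn
counts = defaultdict(int)
for key, n in pairs:
    counts[key] += n

print(dict(counts))  # {'clip_a': 2, 'clip_b': 1}
```

Spark runs the same three steps, but partitioned across the cluster and with the filtered set held in memory for reuse.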
  8. Spark is 30x faster than Hive: 45 minutes versus 24 hours for the weekly Conviva Geo Report
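A quick sanity check on the headline number (simple arithmetic on the slide's own figures, not a benchmark):

```python
# 24 hours in Hive versus 45 minutes in Spark
hive_minutes = 24 * 60
spark_minutes = 45

speedup = hive_minutes / spark_minutes
print(speedup)  # 32.0 -- rounded down to "30x" on the slide
```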
  9. How much data are we talking about?
     • 150 GB / week of compressed summary data
     • Compressed ~ 3:1
     • Stored as Hadoop Sequence Files
     • Custom binary serialization format
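The bullets above imply the raw volume behind the 150 GB figure; a back-of-the-envelope calculation, assuming the ~3:1 ratio holds uniformly across the week:

```python
compressed_gb_per_week = 150
compression_ratio = 3  # ~3:1 per the slide

raw_gb_per_week = compressed_gb_per_week * compression_ratio
raw_gb_per_day = raw_gb_per_week / 7

print(raw_gb_per_week)        # 450
print(round(raw_gb_per_day))  # 64
```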
  10. Spark is faster because it avoids reading data from disk multiple times
  11. Hive/MapReduce: every query (Group By Country, Group By State, Group By Video, 10s of Group Bys ...) repeats Read from HDFS, Decompress, Deserialize, on top of Hive/MapReduce startup overhead and the cost of flushing intermediate data to disk.
      Spark: the first query (Group By Country) reads from HDFS, decompresses, deserializes, and caches only the columns of interest in memory; every later query (Group By State, Group By Video, 10s of Group Bys ...) reads the data from memory.
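The disk-versus-memory contrast above can be sketched with toy cost numbers (the unit costs are arbitrary placeholders, not measurements):

```python
# Simulate N group-by queries over the same dataset.
# Without caching (Hive/MapReduce style), every query pays the
# read-from-HDFS + decompress + deserialize cost; with caching
# (Spark style), only the first query does.

DISK_READ_COST = 10   # read + decompress + deserialize, arbitrary units
MEMORY_READ_COST = 1  # reading already-cached data, arbitrary units

def cost_without_cache(num_queries):
    return num_queries * DISK_READ_COST

def cost_with_cache(num_queries):
    # first query loads and caches; the rest read from memory
    return DISK_READ_COST + (num_queries - 1) * MEMORY_READ_COST

queries = 30  # "10s of Group Bys"
print(cost_without_cache(queries))  # 300
print(cost_with_cache(queries))     # 39
```

The gap grows with the number of queries, which is why a workload of tens of group-bys over one dataset benefits so much.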
  12. Why not <some other big data system>?
      • Hive
      • MySQL
      • Oracle / SAP HANA
      • Column-oriented DBs
  13. Spark just worked for us
      • 30x speed-up
      • Great fit for our report computation model
      • Open-source
      • Flexible
      • Scalable
  14. 30% of our reports use Spark
  15. We are working on more projects that use Spark
      • Streaming Spark for unifying batch and live computation
      • SHadoop: run existing Hadoop jobs on Spark
  16. Problems with Spark: the architecture diagram from slide 5 again, with "No Logo" marked next to Spark
  17. The same architecture diagram, repeated without the "No Logo" annotation (Video Player, Live data processing, Monitoring Dashboard, Reports, Hadoop for historical data, Spark, Ad-hoc analysis)
  18. Spark queries are not very succinct

      Hive:
      SELECT videoName, COUNT(1)
      FROM summaries
      WHERE date='2012_08_22' AND customer='XYZ'
      GROUP BY videoName;

      Spark:
      val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](
          pathToSessionSummaryOnHdfs,
          classOf[SessionSummary], classOf[NullWritable])
        .flatMap { case (key, value) => value.fieldsOfInterest }

      val cachedSessions = sessions.filter(
          whereConditionToFilterSessionsForTheDesiredDay)
        .cache

      val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
      val reduceFn : (Long, Long) => Long = { (a, b) => a + b }

      val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
  19. There is a learning curve associated with Scala, but ...
      • Type-safety offered by Scala is a great boon
        – Code completion via the Eclipse Scala plugin
      • Complex queries are easier in Scala than in Hive
        – Cascading IF()s in Hive
      • No need to write Scala in the future
        – Shark
        – Java, Python APIs
  20. Additional Hiccups
      • Always on the bleeding edge: getting dependencies right
      • More maintenance/debugging tools required
  21. Spark has been working great for Conviva for over a year
  22. We are Hiring
      jobs@conviva.com
      http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva