
Accelerating Real Time Video Analytics on a Heterogenous CPU + FPGA Platform


The current uptrend in computational power has led to a more mature ecosystem for image processing and video analytics. Deep neural networks for image recognition and object detection now achieve better-than-human accuracy. Industrial sectors led by retail and finance want to apply these developments to real-time analysis of video content for fraud detection, surveillance, and many other applications.

There are two main challenges in a real-world implementation of a video analytics solution:
1) Most video analytics use cases are effective only when response times are in milliseconds. This very-low-latency requirement creates a need for software and hardware acceleration.
2) Such solutions need widespread deployment and are expected to have a low total cost of ownership (TCO).
To address these two challenges, we propose a video analytics solution leveraging Spark Structured Streaming and a deep learning framework (such as Intel's Analytics Zoo with TensorFlow), built on a heterogeneous CPU+FPGA hardware platform.

The proposed solution accelerates the video analytics pipeline by more than 3x compared to a CPU-only implementation, requires zero code change on the application side, and achieves more than a 2x decrease in TCO. Our pipeline covers ingestion of the video stream, H.264 decode to image frames, image transformation, and image inference using a deep neural network. The FPGA-based solution offloads the entire pipeline computation to the FPGA, while the CPU-only solution implements the pipeline using OpenCV, Spark Structured Streaming, and Intel's Analytics Zoo DL library.

Key Takeaways:
1. Optimizing the performance of a Spark Streaming + DL pipeline.
2. Accelerating the video analytics pipeline with an FPGA to deliver high throughput at low latency and reduced TCO.
3. Performance data benchmarking the CPU-only and CPU+FPGA solutions.

Accelerating Real Time Video Analytics on a Heterogenous CPU + FPGA Platform

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Bhoomika Sharma, Megh Computing: Accelerating Real-Time Video Analytics Using a Heterogeneous CPU+FPGA Environment. #UnifiedDataAnalytics #SparkAISummit
  3. Megh Computing
     • A startup based in Portland, Oregon, USA, with a development office in Bangalore, India
     • Vision: enabling the third wave of computing in the data center
     • Mission: accelerating real-time analytics using FPGAs
  4. Agenda
     1. Introduction to real-time analytics
     2. Existing software-based real-time video analytics solutions
     3. Video analytics pipeline acceleration using a CPU+FPGA platform
     4. Benchmarking the CPU-only and CPU+FPGA solutions
  5. Real-Time Analytics
  6. Why Real-Time? [Chart: value of data to decision making vs. time from event (real time, seconds, minutes, hours, days, months); the information half-life in decision making runs from time-critical, predictive/preventive, actionable decisions down to reactive, historical, traditional "batch" business intelligence]
  7. Real-Time Insights
     • Hard real time: < 1 µs
     • Regular trading: ~100 µs
     • Fraud prevention: milliseconds
     • Edge computing: tens of milliseconds
     • Dashboard (inference): hundreds of milliseconds
     • Operational insights: seconds
  8. Existing Real-Time Analytics Solution. *ETL = Extract, Transform, Load
  9. Real-Time Video Analytics: extracting value from video to impact the business. [Images: object detection, surveillance ("You are being watched"), fraud detection. Image sources: Pinterest, Towards Data Science, pointofsale.com]
  10. Main Phases of the Video Analytics Pipeline: Ingest → Transform → Infer
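The three phases above compose naturally as stages over a stream of raw bytes. A minimal standalone sketch of that composition; the types and the stage bodies are illustrative stand-ins, not Megh's or Spark's actual APIs:

```scala
// Minimal sketch of the ingest → transform → infer pipeline over an
// iterator of raw stream bytes. Stage contents are placeholders.
object PipelineSketch {
  case class EncodedChunk(bytes: Array[Byte])        // ingested H.264 data
  case class Frame(pixels: Array[Int])               // decoded image frame
  case class Prediction(label: String, score: Float) // inference result

  def ingest(stream: Iterator[Array[Byte]]): Iterator[EncodedChunk] =
    stream.map(EncodedChunk(_))

  def transform(chunks: Iterator[EncodedChunk]): Iterator[Frame] =
    chunks.map(c => Frame(c.bytes.map(_.toInt)))     // stand-in for H.264 decode

  def infer(frames: Iterator[Frame]): Iterator[Prediction] =
    frames.map(_ => Prediction("cat", 0.9f))         // stand-in for DL inference

  def run(stream: Iterator[Array[Byte]]): Iterator[Prediction] =
    infer(transform(ingest(stream)))
}
```

Keeping each phase a separate function is what later lets the FPGA solution swap the stage implementations without touching application code.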
  11. CPU Software-Based Solutions
  12. Architecture of the CPU-Based Pipeline. *RTSP = Real Time Streaming Protocol
  13. Transform Phase
     • FFmpeg library
     • H.264 decoding
     • Image extraction
     [Diagram: RTSP video stream → persistent Megh microservice → JavaCV image extraction and transformation → image frame]
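The same decode step can also be driven through the ffmpeg command line rather than the FFmpeg/JavaCV libraries. A hedged sketch that only builds (does not run) such a command; the RTSP URL, frame rate, and output pattern are illustrative:

```scala
// Build an ffmpeg invocation that ingests an RTSP stream, decodes the
// H.264 video, and writes one image file per sampled frame.
object DecodeCommand {
  def ffmpegFrames(rtspUrl: String, fps: Int, outPattern: String): Seq[String] =
    Seq(
      "ffmpeg",
      "-i", rtspUrl,       // input: the RTSP stream (ffmpeg decodes the H.264)
      "-vf", s"fps=$fps",  // sample the decoded video at the given frame rate
      outPattern           // e.g. "frame_%04d.png": one file per frame
    )
}
```

The command could then be handed to `scala.sys.process` or a container entrypoint; the library-based path in the slide avoids the per-frame file I/O this implies.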
  14. Inference Phase: Spark Structured Streaming reads from the custom data source. [Diagram: persistent Megh microservice → image frame → Spark Structured Streaming → deep learning inference → processed image]
  15. From DStreams to Structured Streaming
     • Ease of use: based on the simple DataFrame API; handles backpressure
     • Consistency: output tables are always consistent with all the records
  16. Custom Connector: reading from a custom data source. [Diagram: the Spark driver's MicroBatchReader plans input partitions 1..n and commits them; on each Spark worker, an InputPartitionReader is created from its reader config and reads data of size n from the Megh microservice]
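The partition-planning step in the diagram can be illustrated with a small standalone sketch: the driver splits the micro-batch's offset range into near-equal input partitions, one per worker-side reader. The names mirror the diagram but are not Spark's actual DataSourceV2 interfaces:

```scala
// Split a micro-batch offset range [startOffset, endOffset) into
// numPartitions contiguous, near-equal input partitions, as a driver-side
// planner might before handing each partition to a worker reader.
object PartitionPlanner {
  case class InputPartition(start: Long, end: Long) // half-open [start, end)

  def planInputPartitions(startOffset: Long, endOffset: Long,
                          numPartitions: Int): Seq[InputPartition] = {
    val total = endOffset - startOffset
    val base  = total / numPartitions
    val rem   = total % numPartitions
    // The first `rem` partitions get one extra record each.
    val sizes = Seq.tabulate(numPartitions)(i => base + (if (i < rem) 1 else 0))
    sizes.scanLeft(startOffset)(_ + _)      // running boundaries
         .sliding(2)
         .map { case Seq(s, e) => InputPartition(s, e) }
         .toSeq
  }
}
```

Each partition can then be read independently and committed back, which is what gives the connector its parallelism across Spark workers.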
  17. Code Snippet: reading from the custom data source (connection properties are loaded for the custom source).

     val streamData = SQLContext
       .getOrCreate(sc)
       .sparkSession
       .readStream
       .format("com.meghcomputing.videoanalytics.spark.receivers.MeghImageSourceV2")
       .options(Map(
         "MEGH_RPC_HOST" -> prop.getProperty("rpc.server.host"),
         "MEGH_RPC_PORT" -> prop.getProperty("rpc.server.port"),
         "MEGH_MAX_RECORD" -> prop.getProperty("rpc.max.record")
       ))
       .load()
  18. Deep Learning Inference. [Diagram: an input image is passed through a deep learning topology, which predicts a label such as "cat"]
  19. Deep Learning Inference
     • Unified analytics + AI platform
     • BigDL for deep learning
     • Pretrained quantized SqueezeNet model
  20. Code Snippet: a UDF that classifies an image into its category and predicts its label. The model, labels, and feature-transformation steps are broadcast to all workers, and the image bytes are converted into Analytics Zoo's default ImageFeature type.

     val predictImageUDF = udf(
       (uri: String, data: Array[Byte], latency: String) => {
         val st = System.nanoTime()
         val featureSteps = featureTransformersBC.value.clonePreprocessing()
         val localModel = modelBroadCast.value
         val labels = labelBroadcast.value
         val bytesData = Base64.getDecoder.decode(data)
         val imf = ImageFeature(bytesData, uri = uri)
         val imgSet: ImageSet = ImageSet.array(Array(imf))
         var inputTensor = featureSteps(imgSet.toLocal().array.iterator).next()
         inputTensor = inputTensor.reshape(Array(1) ++ inputTensor.size())
         val prediction = localModel
           .doPredict(inputTensor)
           .toTensor[Float]
           .squeeze()
           .toArray()
         val predictClass = prediction.zipWithIndex.maxBy(_._1)._2
         if (predictClass < 0 || predictClass > labels.length - 1) "unknown"
         else labels(predictClass).toString()
       }
     )
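The label lookup at the end of the UDF is just an argmax over the class scores. A standalone sketch with hypothetical scores and labels (not the Analytics Zoo API):

```scala
// Pick the label with the highest class score; fall back to "unknown"
// when the winning index is out of range, mirroring the guard in the UDF.
object ArgmaxLabel {
  def predictLabel(scores: Array[Float], labels: Array[String]): String = {
    val predictClass = scores.zipWithIndex.maxBy(_._1)._2
    if (predictClass < 0 || predictClass > labels.length - 1) "unknown"
    else labels(predictClass)
  }
}
```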
  21. Performance of the CPU-Based Solution
     • Infrastructure: cluster with one worker node with a Xeon Bronze processor
     • Video specification: 1080p resolution
     • Throughput: ~22 FPS (*FPS = frames per second)
     • Latency: > 250 ms
  22. Challenges with Existing Software-Based Solutions
     • Latency: does not always meet real-time requirements
     • Throughput: non-linear relation with the number of nodes
     • TCO: increases with an increase in input feeds (*TCO = total cost of ownership)
  23. Hardware Accelerators: An Alternate Solution
     • CPU (1st wave): general-purpose architecture; non-deterministic latency; sub-optimal resource utilization
     • GPU (2nd wave): suitable for high batch sizes; non-deterministic latency; sub-optimal resource utilization
     • FPGA (3rd wave): direct I/O for ingestion; deterministic latency; efficient resource utilization
  24. FPGA in Brief
     • Field Programmable Gate Array
     • Customizable hardware
     • Direct I/O for ingestion
     • Processing at line rates
     • Support for parallel processing
  25. Heterogeneous CPU+FPGA Solution
  26. Heterogeneous CPU+FPGA-Based Pipeline. *FU = Functional Unit
  27. Performance of the CPU+FPGA-Based Solution
     • Infrastructure: cluster with one worker node with two Arria 10 FPGAs and a Xeon Bronze processor
     • Video specification: 1080p resolution
     • Throughput: ~240 FPS
     • Latency: < 100 ms
  28. Distributed System Configuration
  29. Megh Solution Stack: reduces the complexity of programming the FPGA
  30. Benchmarking and Demo
  31. Video Analytics Value Proposition: Higher Throughput for the Same CPU Configuration
     Throughput of the complete pipeline, including H.264 video decoding and deep learning inference, measured in fps on the target configuration (8 channels). Server cost estimates: Dell.com.
     • CPU only: Xeon E5 3106 Bronze (1 server, 2 sockets); power 150 W; cost $6,000; throughput ~22 fps (max on CPU); latency > 250 ms
     • CPU + FPGA: Xeon E5 3106 Bronze (1 server, 2 sockets) plus Arria 10 (2 cards); power 250 W (150 W + 2 × 50 W); cost $16,000 ($6,000 + 2 × $5,000); throughput ~240 fps (max ~400 fps); latency < 100 ms
     PERFORMANCE (throughput): > 10x
  32. Video Analytics Value Proposition: Lower TCO per Channel for the Same Throughput
     Total cost of the system to process one channel of video stream, including H.264 video decoding and deep learning inference, based on costs for 8 channels. Server cost estimates: Dell.com.
     • CPU only: Xeon E5 8180 Platinum (2 servers, 4 sockets); power 800 W (400 W × 2); cost $60,000 ($30,000 × 2); throughput 240 fps (projected; max 150 fps × 2); latency > 250 ms
     • CPU + FPGA: Xeon E5 3106 Bronze (1 server, 2 sockets) plus Arria 10 (2 cards); power 250 W (150 W + 2 × 50 W); cost $16,000 ($6,000 + 2 × $5,000); throughput 240 fps (max ~400 fps); latency < 100 ms
     COST SAVING (TCO): more than 3x per channel
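The per-channel comparison above can be checked with simple arithmetic from the slide's own figures; the ratio works out to 3.75x, consistent with the roughly 3x TCO reduction quoted in the summary:

```scala
// Recompute the per-channel TCO comparison from the 8-channel system costs.
object TcoCheck {
  val channels    = 8
  val cpuOnlyCost = 60000.0 // 2 Xeon Platinum servers at $30,000 each
  val cpuFpgaCost = 16000.0 // 1 Xeon Bronze server + 2 Arria 10 cards

  val cpuPerChannel  = cpuOnlyCost / channels    // $7,500 per channel
  val fpgaPerChannel = cpuFpgaCost / channels    // $2,000 per channel
  val costRatio      = cpuPerChannel / fpgaPerChannel // 3.75x saving
}
```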
  33. Summary
     • Structured Streaming for real-time analytics
     • Optimized Structured Streaming for a custom data source
     • The Megh solution reduces end-to-end latency by offloading the entire video analytics pipeline to the FPGA
     • TCO with the CPU+FPGA setup is reduced by 3x
     • The difficulty of programming the FPGA is mitigated by the Megh Solution Stack/SDK
  34. More Information
     megh.com | info@meghcomputing.com | +1-888-428-2396
     Twitter: meghcomputing | LinkedIn: megh-computing-inc
     Megh Computing, Inc., 1600 NE Compton Drive, Suite 202, Hillsboro, OR 97006
     Megh Computing Pvt Ltd, 11 O'Shaughnessy Road, Suite 202, Bangalore - 560025
  35. Thank You. info@meghcomputing.com
  36. Don't forget to rate and review the sessions. Search "Spark + AI Summit".
