Supercharge large IoT analytics

An OS approach

Published in: Technology

  1. 1. Supercharge large IoT analytics: An OS approach. Felix Xiaozhu Lin
  2. 2. This talk has three parts that can be sampled independently. “Kernel is firmware”: if you care about kernel, security, file system, drivers, TrustZone, heterogeneous SoCs, binary translation… go to slide 6. Large stream analytics on the edge: if you care about stream processing, 3D-stacked memory, parallelism, memory management, in-memory computing… go to slide 26. Large video analytics on edge & cameras: if you care about video intelligence, deep learning, storage, IoT, edge computing… go to slide 51.
  3. 3. Bio • 2014 – now. Asst. prof, Purdue ECE • Xroads Systems Exploration Lab • 2014 PhD in CS. Rice • Thesis: OS for mobile computing • 2008 MS + BS. Tsinghua 3
  4. 4. What I do • Layer? • Operating system (in a broad sense) • Scenarios? • Edge & IoT (mostly) • Objectives? • Speed, efficiency, & security. My premises for OS research …
  5. 5. The remaining OS is defined by scenarios Kernel is firmware Entrees: 45 mins Appetizers: 5 mins
  6. 6. 6
  7. 7. Kernel is firmware • Entangled subsystems • Difficult to re-architect or extract • Has its own evolution plan • Likely to reject new ideologies • Little respect for stable internal interfaces • New additions quickly become obsolete • Open source (a white box) • Can we retrofit the kernel as firmware?
  8. 8. 8 Case 1: Trustworthy file systems for smart devices Retrofitting the kernel (1): reuse in vivo
  9. 9. 9 Hw Case 1: Trustworthy file systems for smart devices Kernel Apps Retrofitting the kernel (1): reuse in vivo
  10. 10. 10 App Normal world Secure world Isolating secure apps in TrustZone Retrofitting the kernel (1): reuse in vivo Kernel
  11. 11. Keep secure data persistent? Retrofitting the kernel (1): reuse in vivo. [Diagram labels: Normal world, Secure world, Kernel, FS, App]
  12. 12. 12 VFS FS Block layer Reuse kernel file system in vivo Retrofitting the kernel (1): reuse in vivo Normal world Secure world App
  13. 13. 13 VFS FS Block layer • Cloud: a safer execution environment • A pair of twin file systems • File data never leaves the device’s secure world Metadata-only FS Replica Untrusted Trusted Cloud continuously verifies fs behaviors Retrofitting the kernel (1): reuse in vivo App [arXiv:1902.06327] "Let the Cloud Watch Over Your IoT File Systems," Liwei Guo, Yiying Zhang, and Felix Xiaozhu Lin, 2019.
  14. 14. 14 Case 2: Unmodified drivers for TrustZone HW App Normal world Secure world Retrofitting the kernel (2): code transformation
  15. 15. Transplant Linux drivers? Retrofitting the kernel (2): code transformation. [Diagram labels: Kernel source tree; Device code; Driver libs; Kernel libs; Core services; SPI, CSI, WiFi, Eth, USB, CAM; Other code; App; Secure world]
  16. 16. Statically miniaturize the whole kernel. Retrofitting the kernel (2): code transformation. [Diagram labels: Kernel source tree; Driver kernel; Device code; Driver libs; Kernel libs; Core services; SPI, CSI, WiFi, Eth, USB, CAM; Other code]
  17. 17. 17 Statically miniaturize the whole kernel Retrofitting the kernel (2): code transformation A kernel for all → A kernel specialized for the driver only App Driver kernel Normal world Secure world
  18. 18. 18 Case 3: Kernel IO paths on co-processors Retrofit the kernel (3): binary translation
  19. 19. 19 Retrofit the kernel (3): binary translation Weak co-processors
  20. 20. 20 CPU Co Proc 2.5GHz 50MHz DRAM IO A heterogeneous SoC Retrofit the kernel (3): binary translation
  21. 21. Weak co-processors suit low-power IO tasks! Retrofit the kernel (3): binary translation. [Diagram labels: CPU 2.5GHz; Co Proc 50MHz; DRAM; IO; Linux Kernel IO tasks; high efficiency]
  22. 22. 22 CPU Co Proc 2.5GHz 50MHz DRAM IO Kernel execution on weak co-processors? Retrofit the kernel (3): binary translation Linux Kernel IO tasks Diff ISA No MMU No POSIX …
  23. 23. 23 CPU Co Proc DRAM IO Co-processor translates unmodified kernel binary Retrofit the kernel (3): binary translation Dynamic Binary Translation Linux Kernel IO tasks [arXiv:1811.05000] "Transkernel: An Executor for Commodity Kernels on Peripheral Cores," Liwei Guo, Shuang Zhai, Yi Qiao, and Felix Xiaozhu Lin
  24. 24. Retrofit kernel as firmware 1. Reuse in vivo Unmodified file systems for TrustZone 2. Source transformation Unmodified device drivers for TrustZone 3. Binary translation Unmodified IO paths for co-processors
  25. 25. algorithms + resources + objectives OS defined by scenarios
  26. 26. OSes defined by two IoT scenarios Hot springs Edge Icebergs
  27. 27. High-throughput. Sub-second delay. Timely processing before data gets cold! 27 “Hot springs”: telemetry events Power sensor 140M events/day Oil rig 1-2TB/day Manufacturing machines PBs/day
  28. 28. Finding high-power sensors. Edge: cleanse & summarize data. Pipeline stages shown: Ingestion, Window, Groupby SensorID, Average per sensor, TopK. Example event: Time 10:01, ID: 0x1024, Value: 200. Windows: 10:00-10:05, 10:05-10:10. [Figure values: 130, 500, 302, 100, 150, 500, 302]
  29. 29. Stream analytics: state of the art • Classic engines? • StreamBase, Aurora, TelegraphCQ, NiagaraST… • Single-threaded. Do not scale well. • Modern engines for datacenters? • Apache Flink, Spark Streaming, Beam… • Designed for tens to hundreds of machines. Scale out. • Assume it is okay if individual nodes perform poorly • As analytics moves to the edge → bad
  30. 30. Project StreamBox stream analytics at the memory speed 30 • RDMA / 10GbE • Co-designed with mm/scheduling Stream pipeline Threads Ingestion Scheduler Mem • Squeeze parallelism for multi/manycore • Manage NUMA domains Exploit high-bandwidth memory [ASPLOS'19] "StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory," Hongyu Miao, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin [USENIX ATC'17] "StreamBox: Modern Stream Processing on a Multicore Machine," Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin, in Proc. USENIX Annual Technical Conference, 2017. [ASPLOS'16] "memif: Towards Programming Heterogeneous Memory Asynchronously," Felix Xiaozhu Lin and Xu Liu, in Proc. ACM Int. Conf. Architectural Support for Programming Languages and Operating Systems, 2016.
  31. 31. High-bandwidth hybrid memory. Cores with 3D DRAM (16 GB, 375 GB/s) and normal DRAM (~100 GB, 80 GB/s). Tradeoffs: capacity vs. bandwidth. Untraditional memory hierarchy: no latency benefit (unlike SRAM+DRAM).
  32. 32. Already on off-the-shelf machines: Intel Xeon Phi Knights Landing (KNL)
  33. 33. Cool. But benefit is not free • Two alternative configurations: • Hw-managed: HBM as a cache • Sw-managed: one flat address space • Throw existing analytics engines on HBM? • Almost no benefit (or even hurts) 33
  34. 34. Existing engines: 3 inadequacies Algorithm • HBM sequential access + high parallelism • Existing engines: grouping is hash w/ random access Capacity • HBM: capacity limited • Streaming: high data volume + high velocity Dynamism • Streaming: fluctuating workloads • How to map to two memory types? 34 Ingress Group by key Average per key Window TopK
  35. 35. Algorithm: HBM can accelerate grouping! • Hash vs Sort: duals for Grouping • Algorithmic complexity: Sort is worse than Hash • Hash for in-core; sort for out-of-core [VLDB’09, VLDB’13, SIGMOD’15] • Yet, Sort outperforms Hash with … • High data parallelism (bitonic sort + AVX-512) • High task parallelism (parallel merge sort) • High mem bw (stacked DRAM). [VLDB’09] C. Kim et al., Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. [VLDB’13] C. Balkesen et al., Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. [SIGMOD’15] O. Polychroniou et al., Rethinking SIMD Vectorization for In-Memory Databases.
  36. 36. Grouping -- Sort vs Hash? [Figure: two panels plotted against # cores. Throughput (million pairs/sec) and Mem bandwidth (GB/sec)]
  37. 37. Grouping -- Sort vs Hash? [Same panels; series: Hash DRAM]
  38. 38. Grouping -- Sort vs Hash? [Same panels; series: Hash DRAM, Sort DRAM]
  39. 39. Grouping - Sort vs Hash choice reversed! [Same panels; series: Hash DRAM, Sort DRAM, Sort HBM]
  40. 40. Capacity: Use HBM only for grouping indexes. [Diagram labels: Cores; HBM; Normal DRAM; Streaming data; Data Bundles; Index {<key, pointer>}]
  41. 41. Dynamism: the art of pressure balance. [Diagram labels: Cores; HBM; Normal DRAM; DRAM Bandwidth; HBM Capacity]
  42. 42. StreamBox vs. Apache Flink: 5-10x faster. [Figure: Throughput (MRec/s) vs. # Cores; series: Flink @ x56, Flink @ KNL, StreamBox @ KNL; RDMA ingestion limit marked] KNL: Intel Xeon Phi Knights Landing w/ HBM. 64 cores @ 1.3GHz. $5,000. x56: Intel Xeon E7-4830v4. 4x14 cores @ 2.0GHz. 256GB. $23,000. Benchmark: Yahoo Stream Benchmark. Output delay: 1 second.
  43. 43. StreamBox vs. Apache Flink: 5-10x faster. [Same figure and setup as slide 42, with annotations: ~5GB/sec! and 5-10x]
  44. 44. StreamBox scales well. [Figure: Throughput (Mrec/s) vs. # cores; series: StreamBox] HW: Intel Xeon Phi Knights Landing w/ HBM. 64 cores @ 1.3GHz. $5,000. Benchmark: TopK per key. Output delay: 1 second.
  45. 45. HBM matters. [Same axes and setup; labels: StreamBox, DRAM only, Not using HBM]
  46. 46. Runtime memory management matters. [Same axes and setup; labels: StreamBox, DRAM only, 3D mem as cache, HW-managed HBM]
  47. 47. In-HBM index matters. [Same axes and setup; labels: StreamBox, DRAM only, 3D mem as cache, 3D mem as cache; full records, No in-mem indexes]
  48. 48. StreamBox lessons • An analytics engine built from ground up • 2.5 years. ~60,000 lines of C++11. http://xsel.rocks/p/streambox • Hardware often badly underutilized, even with production software • Performance requires careful optimization everywhere 48
  49. 49. The software engineer’s guide to 3D DRAM. Make sure to pack all of the following: • Cheap VM (huge page) • Fast net stack (40 GbE or RDMA) • High task parallelism • Custom mem allocator • Sequential mem access • Thread pool + custom task scheduler • Wide SIMD (AVX-512) • Hybrid memory. [Diagram layers: Apps, Runtime, OS kernel]
  50. 50. OSes defined by two IoT scenarios Hot springs Edge Icebergs [EuroSys'19] "VStore: Reinventing Data Stores for Video Analytics," Tiantu Xu, Luis Materon Botelho, and Felix Xiaozhu Lin, to appear at Proc. Eurosys Conference, 2019.
  51. 51. Cheap cameras. Large videos. 51 130M surveillance cameras shipped per year Many institutions run > 200 cameras 24x7 A single camera produces 24 GB video per day Must be consumed by algorithms! $25 on Amazon
  52. 52. Video analytics is expensive. IoT camera: 30 FPS, $25. NVIDIA Quadro P6000: NN object detection at 5 FPS, $4,500. Object detection: deep neural network model YOLOv3.
  53. 53. Storage is cheap. IoT camera: 24 GB/day, $25. Seagate Surveillance HDD 8TB: one year of video, $250.
  54. 54. Video analytics on the edge: ingestion. [Diagram labels: Cameras, Edge, Edge, Cloud]
  55. 55. Video analytics on the edge: query. [Diagram labels: Cameras, Edge, Edge]
  56. 56. A retrospective query: “Find all white buses that appeared yesterday”. As a cascade of operators • selects operators from a lib • specifies target accuracies for operators. [Figure labels: Frame diff detector, Shallow neural net, Deep neural net; ~10,000x, ~1,000x, ~10x] Image credits: NoScope: Optimizing Neural Network Queries over Video at Scale, Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, Matei Zaharia, VLDB 2017.
  57. 57. Project VStore The video data store for AI 57 Ingestion Storage Retrieval Consume Video Data Operator @ accuracy Query “Aren’t there many video databases already?” For human consumers. Not for AI consumers [EuroSys'19] "VStore: Reinventing Data Stores for Video Analytics," Tiantu Xu, Luis Materon Botelho, and Felix Xiaozhu Lin, Eurosys Conference, 2019.
  58. 58. The first-class concern: controlling video formats 58 Ingestion Storage Retrieval Consume Video Data Operator @ accuracy Query
  59. 59. Extensive video format knobs. Fidelity: Quality, Crop, Res, Sample. Coding: Speed, KeyFrame Interval. [Diagram labels: Ingestion, Storage, Retrieval, Consume, Video Data, Operator @ accuracy, Query]
  60. 60. Extensive video format knobs. Fidelity: Quality, Crop, Res, Sample. [Same diagram as slide 59]
  61. 61. Knob Impacts: High & Complex. Fidelity knobs (Quality, Crop, Res, Sample) affect Ingestion, Storage, Retrieval, and Consumption. [Same diagram as slide 59]
  62. 62. Knob Impacts: High & Complex. Example: Quality = Bad, Crop = 100%, Res = 100p, Sample = 2/3. [Impacts shown on Ingestion, Storage, Retrieval, Consumption]
  63. 63. Knob Impacts: High & Complex. Example: Quality = Best, Crop = 100%, Res = 100p, Sample = 1/30. [Impacts shown on Ingestion, Storage, Retrieval, Consumption]
  64. 64. Knob Impacts: High & Complex. Example: Quality = Good, Crop = 75%, Res = 100p, Sample = 1/2. [Impacts shown on Ingestion, Storage, Retrieval, Consumption]
  65. 65. Configuration Space 65 Ingestion Storage Retrieval Consume <motion,0.95> M Storage Formats N Consumption Formats <motion, 0.7> … <OCR, 0.95> <OCR, 0.90> … <NN, 0.95> Operator @ accuracy K Consumers
  66. 66. Configuration Space 66 <motion,0.95> M Storage Formats N Consumption Formats <motion, 0.7> … <OCR, 0.95> <OCR, 0.90> … <NN, 0.95> K Consumers Ingestion Storage Retrieval Consume Operator @ accuracy
  67. 67. Configuration Constraints 67 <motion,0.95> <motion, 0.7> … <OCR, 0.95> <OCR, 0.90> … <NN, 0.95> M Storage Formats N Consumption Formats K Consumers Ingestion Storage Retrieval Consume Operator @ accuracy Richer!
  68. 68. Configuration Constraints 68 <motion,0.95> <motion, 0.7> … <OCR, 0.95> <OCR, 0.90> … <NN, 0.95> M Storage Formats N Consumption Formats Ingestion Storage Retrieval Consume Operator @ accuracy K Consumers Faster!
  69. 69. Objectives for Configuration: Retrieval is never the bottleneck. Satisfy accuracy. Respect resource budgets. [Diagram labels: Ingestion, Storage, Retrieval, Consume; M Storage Formats; N Consumption Formats; K Consumers; Operator @ accuracy: <motion, 0.95>, <motion, 0.7>, … <OCR, 0.95>, <OCR, 0.90>, … <NN, 0.95>]
  70. 70. Challenges: Many possible formats (~15k combos of knobs). Many possible configurations (~4M for 4 operators). Objectives: Retrieval is never the bottleneck. Satisfy accuracy. Respect resource budgets. [Same diagram as slide 69]
  71. 71. Key Idea: Deriving Configuration Backwards. Backward derivation of formats. [Diagram labels: Ingestion, Storage, Retrieval, Consume; M Storage Formats; N Consumption Formats; K Consumers; Operator @ accuracy: <motion, 0.95>, <motion, 0.7>, … <OCR, 0.95>, <OCR, 0.90>, … <NN, 0.95>]
  72. 72. Technique 1: Profiling. [Same diagram as slide 71, step 1 marked]
  73. 73. Technique 2: Coalescing. [Same diagram as slide 71, steps 1 and 2 marked]
  74. 74. Technique 3: Eroding. [Same diagram as slide 71, steps 1, 2, and 3 marked; inset figure labels: Storage, Video Age, Min, Max, (All deleted)]
  75. 75. A sample configuration by VStore 75 Storage formats Hundreds of knobs. Only possible through auto config! Consumption formats Ingestion Storage Retrieval Consume Operator @ accuracy
  76. 76. Query speedup. [Figure: Query speed (x realtime, 1x to 1000x) vs. Target Accuracy (1, 0.95, 0.9, 0.8); series: Ours, Unified format] Test platform. CPU: 56-core Xeon E7-4830v4. DRAM: 260 GB. HDD: 4×1TB 10K RPM SAS 12Gbps in RAID 5. GPU: NVIDIA Quadro P6000. Query: car detector. Why? • Lower accuracy → query tolerates cheaper video formats • VStore ensures video decoding is proportionally cheaper!
  77. 77. Query speedup. Query 1-hour video in 10 secs! [Same figure and setup as slide 76, one panel per video: jackson, miami, tucson, dashcam, park, airport; annotation: 16x]
  78. 78. Ongoing: Video icebergs on cameras Edge Icebergs Edge Little Icebergs
  79. 79. Why would this happen? • Wireless cameras are easy to deploy • Wireless bandwidth is precious • Public WiFi: typically < 1MB/sec • Complaints about cameras slowing down WiFi • Streaming videos → NOT scalable • On-camera storage is cheap. https://community.netgear.com/t5/Nighthawk-WiFi-Routers/Wireless-cameras-slowing-router-too-much/td-p/513047 https://www.securitycameraking.com/securityinfo/forum/networking/ip-cameras-are-slowing-down-your-network/ https://www.security-camera-warehouse.com/ip-camera/wifi-enabled/
  80. 80. Video icebergs on cameras. Cameras capture videos & stay silent. Only respond to queries. [Diagram labels: Cameras, Edge, Edge, Cloud]
  81. 81. Feasible? • Users are waiting • On-camera video is large (tens of GB) • Wireless bandwidth is scarce (1MB/sec) • $20 cameras are wimpy (one frame in 30 secs) 81
  82. 82. Feasible? Yes • Ingestion: cameras learn videos slowly but surely • Query: continuously refine results • Edge bootstraps specialized NNs for cameras to run • 1000x cheaper than full NN • Process 1-hour video in secs (working with edge)
  83. 83. Lesson: conquering video icebergs • Lazy ingestion: pay as little as possible • Eager query-time optimizations • Take specialization opportunities • Users know their queries better • Resonates with compiler/PL wisdom! • Just-in-time compilation & lazy evaluation
  84. 84. Supercharge IoT analytics Two important scenarios Large stream analytics on the edge Large video analytics on edge & cameras OS plays key roles Map AI to new hardware Dynamically configure AI Trade off among competing objectives 84
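
The "finding high-power sensors" pipeline on slide 28 (window, group by sensor ID, average per sensor, TopK) can be sketched in a few lines of Python. This is a minimal single-threaded illustration, not StreamBox's implementation: the function name `top_k_sensors`, the helper `hhmm`, the tuple layout of an event, and all sample events other than the first (Time 10:01, ID 0x1024, Value 200) are assumptions made for the example.

```python
from collections import defaultdict
import heapq

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows, like 10:00-10:05 on the slide


def top_k_sensors(events, k):
    """events: iterable of (timestamp_seconds, sensor_id, value) tuples.

    Returns {window_start: [(sensor_id, average), ...]} holding the k highest
    per-sensor averages in each window, highest first.
    """
    # window_start -> sensor_id -> [running sum, count]
    windows = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for ts, sensor_id, value in events:
        window_start = ts - ts % WINDOW_SECONDS  # assign the event to its window
        acc = windows[window_start][sensor_id]   # group by sensor ID
        acc[0] += value
        acc[1] += 1

    result = {}
    for window_start, per_sensor in windows.items():
        averages = [(sid, total / count) for sid, (total, count) in per_sensor.items()]
        result[window_start] = heapq.nlargest(k, averages, key=lambda pair: pair[1])
    return result


def hhmm(hours, minutes):
    """Hypothetical helper: time of day as seconds since midnight."""
    return hours * 3600 + minutes * 60


# Made-up sample events; only the first one appears on the slide.
events = [
    (hhmm(10, 1), 0x1024, 200.0),
    (hhmm(10, 2), 0x1024, 400.0),
    (hhmm(10, 3), 0x2048, 150.0),
    (hhmm(10, 6), 0x2048, 500.0),
]
top = top_k_sensors(events, k=1)
```

The sketch buffers every window in a hash table and emits results only at the end; a streaming engine would instead close a window and emit its TopK as soon as the window's data is complete.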
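
Slides 35 and 40 describe hash and sort as duals for grouping, and keeping only a compact index of <key, pointer> pairs in HBM. The sketch below shows that duality in plain Python under stated assumptions: the function names are hypothetical, a list position stands in for the pointer, and Python's built-in `sorted` stands in for the bitonic and parallel merge sorts named on slide 35. It demonstrates that both strategies yield the same groups; it does not reproduce the performance results or any memory placement.

```python
from collections import defaultdict
from itertools import groupby


def group_by_hash(records):
    """Hash-based grouping: one hash-table probe (a random access) per record."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(groups)


def group_by_sort(records):
    """Sort-based grouping over a compact <key, pointer> index.

    Only the small index is sorted and then scanned sequentially; full records
    stay in place and are reached through the pointer (here, a list position).
    """
    index = sorted((key, ptr) for ptr, (key, _) in enumerate(records))
    groups = {}
    for key, run in groupby(index, key=lambda pair: pair[0]):
        groups[key] = [records[ptr][1] for _, ptr in run]
    return groups


# Made-up records: (key, payload).
records = [(3, "a"), (1, "b"), (3, "c"), (2, "d"), (1, "e")]
by_hash = group_by_hash(records)
by_sort = group_by_sort(records)
```

Sorting (key, pointer) tuples orders equal keys by pointer, so values within a group keep their arrival order, matching the hash-based result.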
