This talk has three parts that can be
sampled independently
“Kernel is firmware”
If you care about: kernel, security, file system, drivers,
TrustZone, heterogeneous SoCs, binary translation…
Go to slide 6
Large stream analytics on the edge
If you care about: stream processing, 3D-stacked memory,
parallelism, memory mgmt, in-memory computing…
Go to slide 26
Large video analytics on edge & cameras
If you care about: video intelligence, deep learning, storage,
IoT, edge computing…
Go to slide 51
2
3.
Bio
• 2014–now. Asst. prof., Purdue ECE
• Xroads Systems Exploration Lab
• 2014 PhD in CS. Rice
• Thesis: OS for mobile computing
• 2008 MS + BS. Tsinghua
4.
What I do
• Layer?
• Operating system (in a broad sense)
• Scenarios?
• Edge & IoT (mostly)
• Objectives?
• Speed, efficiency, & security
My premises for OS research …
5.
The remaining OS is defined by scenarios
Kernel is firmware
Entrees: 45 mins
Appetizers: 5 mins
Kernel is firmware
• Entangled subsystems
• Difficult to re-architect or extract
• Has its own evolution plan
• Likely to reject new ideologies
• Little respect for stable internal interfaces
• New additions quickly become obsolete
• Open source (a white box)
• Can we retrofit the kernel as firmware?
7
8.
Case 1: Trustworthy file systems for smart devices
Retrofitting the kernel (1): reuse in vivo
9.
Hw
Case 1: Trustworthy file systems for smart devices
Kernel
Apps
Retrofitting the kernel (1): reuse in vivo
13
VFS
FS
Block layer
• Cloud: a safer execution environment
• A pair of twin file systems
• File data never leaves the device’s secure world
Metadata-only
FS Replica
Untrusted Trusted
Cloud continuously verifies fs behaviors
Retrofitting the kernel (1): reuse in vivo
App
[arXiv:1902.06327] "Let the Cloud Watch Over Your IoT File Systems," Liwei Guo, Yiying Zhang, and Felix Xiaozhu Lin, 2019.
14.
Case 2: Unmodified drivers for TrustZone
HW
App
Normal
world
Secure
world
Retrofitting the kernel (2): code transformation
15.
Device code
Driver libs
Kernel libs
Core services
SPI CSI
WiFi Eth
USB
Kernel source tree
Other code
Transplant Linux drivers?
CAM
Retrofitting the kernel (2): code transformation
App
CAM
Secure
world
16.
Device code
Driver libs
Kernel libs
Core services
SPI CSI
WiFi Eth
USB
CAM
Driver
kernel
Other code
Statically miniaturize the whole kernel
Retrofitting the kernel (2): code transformation
CAM
Kernel source tree
17.
Statically miniaturize the whole kernel
Retrofitting the kernel (2): code transformation
A kernel for all → A kernel specialized for the driver only
App
Driver
kernel
Normal
world
Secure
world
18.
Case 3: Kernel IO paths on co-processors
Retrofit the kernel (3): binary translation
21
CPU (2.5 GHz), co-processor (50 MHz)
DRAM, IO
Weak co-processors: well suited to low-power IO tasks!
Retrofit the kernel (3): binary translation
high
efficiency
Linux Kernel
IO
tasks
22.
CPU (2.5 GHz), co-processor (50 MHz)
DRAM, IO
Kernel execution on weak co-processors?
Retrofit the kernel (3): binary translation
Linux Kernel
IO
tasks
Diff ISA
No MMU
No POSIX
…
23.
CPU, co-processor
DRAM, IO
Co-processor translates unmodified kernel binary
Retrofit the kernel (3): binary translation
Dynamic
Binary
Translation
Linux Kernel
IO
tasks
[arXiv:1811.05000] "Transkernel: An Executor for Commodity Kernels on Peripheral Cores,"
Liwei Guo, Shuang Zhai, Yi Qiao, and Felix Xiaozhu Lin
24.
Retrofit kernel as firmware
1. Reuse in vivo
Unmodified file systems
for TrustZone
2. Source transformation
Unmodified device drivers
for TrustZone
3. Binary translation
Unmodified IO paths for co-processors
Stream analytics: state of the art
• Classic engines?
• StreamBase, Aurora, TelegraphCQ, NiagaraST…
• Single threaded. Not scaling well.
• Modern engines for datacenters?
• Apache Flink, Spark Streaming, Beam…
• Designed for tens to hundreds of machines. Scaling out.
• Assume it is okay if individual nodes perform poorly
• As analytics moves to the edge → bad
29
30.
Project StreamBox
stream analytics at memory speed
• RDMA / 10GbE
• Co-designed with
mm/scheduling
Stream pipeline Threads
Ingestion
Scheduler
Mem
• Squeeze parallelism for
multi/manycore
• Manage NUMA domains
Exploit high-bandwidth memory
[ASPLOS'19] "StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory," Hongyu Miao, Myeongjae Jeon, Gennady Pekhimenko,
Kathryn S. McKinley, and Felix Xiaozhu Lin
[USENIX ATC'17] "StreamBox: Modern Stream Processing on a Multicore Machine," Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady
Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin, in Proc. USENIX Annual Technical Conference, 2017.
[ASPLOS'16] "memif: Towards Programming Heterogeneous Memory Asynchronously," Felix Xiaozhu Lin and Xu Liu, in Proc. ACM Int. Conf.
Architectural Support for Programming Languages and Operating Systems, 2016.
31.
Cores
High-bandwidth hybrid memory
3D DRAM: 16 GB, 375 GB/s
Normal DRAM: ~100 GB, 80 GB/s
Tradeoffs: capacity vs. bandwidth
Untraditional memory hierarchy
No latency benefit (unlike SRAM + DRAM)
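To make the capacity vs. bandwidth tradeoff concrete, here is a back-of-the-envelope sketch using the slide's figures (16 GB at 375 GB/s for 3D DRAM, about 100 GB at 80 GB/s for normal DRAM). It assumes peak bandwidth is fully achieved, which real workloads rarely manage; the helper name `scan_seconds` is made up for illustration.

```python
# Figures from the slide: 16 GB of 3D-stacked DRAM at 375 GB/s,
# about 100 GB of normal DRAM at 80 GB/s.
HBM_CAPACITY_GB = 16
HBM_BANDWIDTH_GBPS = 375
DRAM_CAPACITY_GB = 100
DRAM_BANDWIDTH_GBPS = 80


def scan_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Lower bound on the time to stream size_gb once at peak bandwidth."""
    return size_gb / bandwidth_gbps


bandwidth_ratio = HBM_BANDWIDTH_GBPS / DRAM_BANDWIDTH_GBPS  # about 4.7x more bandwidth
capacity_ratio = DRAM_CAPACITY_GB / HBM_CAPACITY_GB         # about 6.3x more capacity

# Streaming the same 16 GB once from each memory type:
hbm_scan = scan_seconds(HBM_CAPACITY_GB, HBM_BANDWIDTH_GBPS)    # about 0.043 s
dram_scan = scan_seconds(HBM_CAPACITY_GB, DRAM_BANDWIDTH_GBPS)  # 0.2 s
```

The two ratios pull in opposite directions, which is why neither memory type can simply replace the other.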
Cool. But the benefit is not free
• Two alternative configurations:
• Hw-managed: HBM as a cache
• Sw-managed: one flat address space
• Throw existing analytics engines on HBM?
• Almost no benefit (or even hurts)
33
34.
Existing engines: 3 inadequacies
Algorithm
• HBM: sequential access + high parallelism
• Existing engines: grouping is hash-based, with random access
Capacity
• HBM: capacity limited
• Streaming: high data volume + high velocity
Dynamism
• Streaming: fluctuating workloads
• How to map to two memory types?
Ingress → Group by key → Average per key → Window TopK
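The example pipeline (ingress, group by key, average per key, windowed TopK) can be sketched in a few lines. This is a batch illustration, not StreamBox's implementation: it assumes records are `(timestamp, key, value)` tuples and fixed, non-overlapping (tumbling) windows, and the function name is hypothetical.

```python
from collections import defaultdict
import heapq


def window_topk_of_averages(records, window_size, k):
    """Group records by window and key, average per key, keep the top k per window."""
    windows = defaultdict(lambda: defaultdict(list))
    for ts, key, value in records:
        # Assumption: tumbling windows, identified by integer division.
        windows[ts // window_size][key].append(value)

    result = {}
    for win, groups in sorted(windows.items()):
        averages = {key: sum(vals) / len(vals) for key, vals in groups.items()}
        result[win] = heapq.nlargest(k, averages.items(), key=lambda kv: kv[1])
    return result


records = [(0, "a", 1.0), (1, "a", 3.0), (2, "b", 5.0), (10, "a", 4.0), (11, "c", 9.0)]
top1 = window_topk_of_averages(records, window_size=10, k=1)
```

A real engine would process records incrementally and in parallel; the sketch only fixes what each pipeline stage computes.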
35.
Algorithm: HBM can accelerate grouping!
• Hash vs Sort: duals for Grouping
• Algorithmic complexity: Sort is worse than Hash
• Hash for in-core; sort for out-of-core [VLDB’09, VLDB’13, SIGMOD’15]
• Yet, Sort outperforms Hash with …
• High data parallelism (bitonic sort + avx512)
• High task parallelism (parallel merge sort)
• High mem bw (stacked DRAM)
[VLDB’09] C. Kim et al., "Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs"
[VLDB’13] C. Balkesen et al., "Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited"
[SIGMOD’15] O. Polychroniou et al., "Rethinking SIMD Vectorization for In-Memory Databases"
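The "duals for grouping" point can be illustrated with two functions that produce the same groups. This is only a sketch of the two access patterns: Python's built-in `sorted` stands in for the parallel bitonic and merge sorts named on the slide, and both function names are made up here.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_by_hash(records):
    """Hash grouping: one hash-table probe per record (random memory access)."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(groups)


def group_by_sort(records):
    """Sort grouping: sort by key, then collect each run in one sequential pass."""
    ordered = sorted(records, key=itemgetter(0))  # stable, so value order is kept
    return {
        key: [value for _, value in run]
        for key, run in groupby(ordered, key=itemgetter(0))
    }


records = [("b", 1), ("a", 2), ("b", 3), ("a", 4), ("c", 5)]
by_hash = group_by_hash(records)
by_sort = group_by_sort(records)
```

Both calls return the same mapping; what differs is the memory access pattern, which is what the slide argues matters on high-bandwidth memory.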
HBM matters
45
[Plot: throughput (Mrec/s, 0 to 35) vs. # cores (0 to 60). Series: StreamBox; DRAM only (not using HBM).]
HW: Intel Xeon Phi Knights Landing w/ HBM. 64 cores @ 1.3 GHz. $5,000. Benchmark: TopK per key
Output delay: 1 second
46.
Runtime memory management matters
[Plot: throughput (Mrec/s, 0 to 35) vs. # cores (0 to 60). Series: StreamBox; HW-managed HBM (3D mem as cache); DRAM only.]
HW: Intel Xeon Phi Knights Landing w/ HBM. 64 cores @ 1.3 GHz. $5,000. Benchmark: TopK per key
Output delay: 1 second
47.
In-HBM index matters
[Plot: throughput (Mrec/s, 0 to 35) vs. # cores (0 to 60). Labels: StreamBox; 3D mem as cache; 3D mem as cache, full records; DRAM only. Annotation: no in-mem indexes.]
HW: Intel Xeon Phi Knights Landing w/ HBM. 64 cores @ 1.3 GHz. $5,000. Benchmark: TopK per key
Output delay: 1 second
48.
StreamBox lessons
• An analytics engine built from the ground up
• 2.5 years. ~60,000 lines of C++11.
http://xsel.rocks/p/streambox
• Hardware often badly underutilized, even with
production software
• Performance requires careful optimization
everywhere
49.
Cheap VM
(huge page)
Apps
OS
kernel
Fast net stack
(40 GbE or RDMA)
High task
parallelism
Custom mem
allocator
Sequential mem
access
Runtime
Thread pool
+ custom task scheduler
Wide SIMD
(avx512)
Hybrid
memory
The software engineer’s guide to 3D DRAM
Make sure to pack all the following
50.
OSes defined by two IoT scenarios
Hot springs
Edge
Icebergs
[EuroSys'19] "VStore: Reinventing Data Stores for Video Analytics," Tiantu Xu, Luis Materon Botelho, and Felix
Xiaozhu Lin, to appear at Proc. Eurosys Conference, 2019.
51.
Cheap cameras. Large videos.
130M surveillance cameras
shipped per year
Many institutions run > 200
cameras 24x7
A single camera produces
24 GB video per day
Must be consumed by
algorithms!
$25 on Amazon
52.
Video analytics is expensive
NVIDIA Quadro P6000
NN object detection: 5 FPS
$4,500
Object detection: deep neural network model YOLOv3
IoT Camera
30 FPS
$25
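A quick calculation with the numbers on these two slides shows the gap between how cheaply video is produced and how expensively it is analyzed. It takes 200 cameras as a lower bound for "> 200", uses 1 TB = 1000 GB, and assumes detection throughput scales linearly with the number of GPUs, which is an idealization.

```python
# All inputs are taken from the slides.
CAMERAS = 200                 # "> 200 cameras", used here as a lower bound
GB_PER_CAMERA_PER_DAY = 24
CAMERA_FPS = 30               # frames produced by one $25 camera
DETECTOR_FPS = 5              # YOLOv3 object detection on one GPU
GPU_PRICE_USD = 4500

daily_volume_tb = CAMERAS * GB_PER_CAMERA_PER_DAY / 1000  # 4.8 TB per day

# GPUs needed to run detection on every frame of one camera in real time,
# assuming perfect linear scaling across GPUs.
gpus_per_camera = CAMERA_FPS / DETECTOR_FPS               # 6.0
gpu_cost_per_camera = gpus_per_camera * GPU_PRICE_USD     # 27000.0 USD
gpus_for_site = CAMERAS * gpus_per_camera                 # 1200.0
```

Under these assumptions, keeping up with a single $25 camera would take $27,000 of GPUs.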
A retrospective query
“Find all white buses that appeared
yesterday”
As a cascade of operators
• selects operators from a lib
• specifies target accuracies for
operators
56
Image credits: NoScope: Optimizing Neural Network Queries over Video at Scale, Daniel Kang, John Emmons, Firas Abuzaid, Peter
Bailis, Matei Zaharia, VLDB 2017
Frame diff detector → Shallow neural net → Deep neural net
~10,000x ~1,000x ~10x
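A cascade of operators can be sketched as a list of stages ordered from cheapest to costliest, where each stage either decides a frame or passes it on. The stage functions, thresholds, and the dict-based frame format below are all hypothetical; precomputed scores stand in for real detectors so the example stays self-contained.

```python
from typing import Callable, Iterable, List, Optional

# A stage returns True (match), False (no match), or None (unsure: escalate).
Stage = Callable[[dict], Optional[bool]]


def run_cascade(frames: Iterable[dict], stages: List[Stage]) -> List[dict]:
    matches = []
    for frame in frames:
        for stage in stages:
            verdict = stage(frame)
            if verdict is None:
                continue      # this stage cannot decide; try the next, costlier one
            if verdict:
                matches.append(frame)
            break             # a definite verdict ends the cascade for this frame
    return matches


def frame_diff(frame: dict) -> Optional[bool]:
    # Cheapest stage: discard frames with almost no motion.
    return False if frame["motion"] < 0.1 else None


def shallow_net(frame: dict) -> Optional[bool]:
    # Mid-cost stage: decide only when confident either way.
    score = frame["shallow_score"]
    if score > 0.9:
        return True
    if score < 0.1:
        return False
    return None


def deep_net(frame: dict) -> Optional[bool]:
    # Costliest stage: always decides.
    return frame["deep_score"] > 0.5


frames = [
    {"id": 0, "motion": 0.0, "shallow_score": 0.5, "deep_score": 0.9},
    {"id": 1, "motion": 0.5, "shallow_score": 0.95, "deep_score": 0.9},
    {"id": 2, "motion": 0.5, "shallow_score": 0.5, "deep_score": 0.8},
    {"id": 3, "motion": 0.5, "shallow_score": 0.05, "deep_score": 0.9},
    {"id": 4, "motion": 0.5, "shallow_score": 0.5, "deep_score": 0.2},
]
matched_ids = [f["id"] for f in run_cascade(frames, [frame_diff, shallow_net, deep_net])]
```

Only frames 2 and 4 reach the deep network here; the cheaper stages settle the rest.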
57.
Project VStore
The video data store for AI
Ingestion Storage Retrieval Consume
Video
Data
Operator
@ accuracy
Query
“Aren’t there many video databases already?”
For human consumers. Not for AI consumers
[EuroSys'19] "VStore: Reinventing Data Stores for Video Analytics," Tiantu Xu, Luis Materon Botelho, and Felix
Xiaozhu Lin, Eurosys Conference, 2019.
58.
The first-class concern: controlling
video formats
Ingestion Storage Retrieval Consume
Video
Data
Operator
@ accuracy
Query
59.
Extensive video format knobs
Ingestion Storage Retrieval Consume
Video
Data
Operator
@ accuracy
Query
Quality Crop Res Sample Speed
KeyFrame
Interval
Fidelity Coding
60.
Extensive video format knobs
Ingestion Storage Retrieval Consume
Video
Data
Operator
@ accuracy
Query
Quality Crop Res Sample
Fidelity
61.
Knob Impacts: High & Complex
Ingestion Storage Retrieval Consume
Video
Data
Operator
@ accuracy
Query
Quality Crop Res Sample
Ingestion
Storage
Retrieval
Consumption
Fidelity
62.
Knob Impacts: High & Complex
Ingestion Storage Retrieval Consume
Video
Data
Operator
@ accuracy
Query
Quality Crop Res Sample
Bad 100% 100p 2/3
Ingestion
Storage
Retrieval
Consumption
63.
Knob Impacts: High & Complex
Ingestion Storage Retrieval Consume
Video
Data
Operator
@ accuracy
Query
Quality Crop Res Sample
Best 100% 100p 1/30
Ingestion
Storage
Retrieval
Consumption
64.
Knob Impacts: High & Complex
Ingestion Storage Retrieval Consume
Video
Data
Operator
@ accuracy
Query
Quality Crop Res Sample
Good 75% 100p 1/2
Ingestion
Storage
Retrieval
Consumption
65.
Configuration Space
Ingestion Storage Retrieval Consume
<motion,0.95>
M Storage
Formats
N Consumption
Formats
<motion, 0.7>
…
<OCR, 0.95>
<OCR, 0.90>
…
<NN, 0.95>
Operator
@ accuracy
K Consumers
Objectives for Configuration
69
<motion,0.95>
M Storage
Formats
N Consumption
Formats
<motion, 0.7>
…
<OCR, 0.95>
<OCR, 0.90>
…
<NN, 0.95>
K Consumers
Ingestion Storage Retrieval Consume
Operator
@ accuracy
Retrieval is never
the bottleneck
Satisfy accuracy
Respect resource
budgets
70.
Challenges
<motion,0.95>
M Storage
Formats
N Consumption
Formats
<motion,0.7>
…
<OCR, 0.95>
<OCR, 0.90>
…
<NN, 0.95>
K Consumers
Ingestion Storage Retrieval Consume
Operator
@ accuracy
Retrieval is never
the bottleneck
Satisfy accuracy
Respect resource
budgets
Many possible formats (~15k combos of knobs)
Many possible configurations (~4M for 4 operators)
71.
M Storage
Formats
N Consumption
Formats
Ingestion Storage Retrieval Consume
Operator
@ accuracy
K Consumers
Key Idea: Deriving Configuration Backwards
Backward derivation of formats
<motion,0.95>
<motion, 0.7>
…
<OCR, 0.95>
<OCR, 0.90>
…
<NN, 0.95>
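One way to picture backward derivation: start from what each consumer needs, pick the cheapest consumption format that meets its accuracy target, then choose a storage format rich enough to serve all of those. The sketch below is a simplified illustration, not VStore's actual algorithm. The profile table, the two-knob fidelity `(resolution, fps)`, the cost proxy, and the single storage format are all assumptions made up for the example.

```python
# Hypothetical profile: accuracy each operator reaches at a (resolution, fps) fidelity.
PROFILE = {
    "motion": {(180, 1): 0.72, (360, 5): 0.96, (720, 30): 0.99},
    "OCR": {(360, 5): 0.80, (720, 5): 0.92, (720, 30): 0.97},
}


def cost(fidelity):
    """Crude cost proxy (assumption): pixels per second grow with resolution^2 * fps."""
    resolution, fps = fidelity
    return resolution * resolution * fps


def consumption_format(operator, target_accuracy):
    """Step 1: cheapest fidelity that still meets the consumer's accuracy target."""
    adequate = [f for f, acc in PROFILE[operator].items() if acc >= target_accuracy]
    return min(adequate, key=cost)  # raises ValueError if no fidelity is adequate


def storage_format(consumption_formats):
    """Step 2: one stored version that can be downgraded into every consumption format."""
    return (
        max(f[0] for f in consumption_formats),
        max(f[1] for f in consumption_formats),
    )


consumers = [("motion", 0.95), ("motion", 0.7), ("OCR", 0.90)]
cfs = [consumption_format(op, acc) for op, acc in consumers]
sf = storage_format(cfs)
```

Working backwards keeps the search small: each consumer is examined once, instead of enumerating every combination of storage and consumption formats.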
74
M Storage
Formats
N Consumption
Formats
Ingestion Storage Retrieval Consume
Operator
@ accuracy
K Consumers
Technique 3: Eroding
<motion,0.95>
<motion, 0.7>
…
<OCR, 0.95>
<OCR, 0.90>
…
<NN, 0.95>
[Diagram: storage vs. video age (Min to Max), with steps 1, 2, 3, ending at "all deleted"]
75.
A sample configuration by VStore
Storage formats
Hundreds of knobs. Only possible through auto config!
Consumption formats
Ingestion Storage Retrieval Consume
Operator
@ accuracy
Why would this happen?
• Wireless cameras easy to deploy
• Wireless bandwidth is precious
• Public WiFi: typically < 1MB/sec
• Complaints on cams slowing down WiFi
• Streaming videos → NOT scalable
• On-camera storage is cheap
79
https://community.netgear.com/t5/Nighthawk-WiFi-Routers/Wireless-cameras-slowing-router-too-much/td-p/513047
https://www.securitycameraking.com/securityinfo/forum/networking/ip-cameras-are-slowing-down-your-network/
https://www.security-camera-warehouse.com/ip-camera/wifi-enabled/
Feasible?
• Users are waiting
• On-camera video is large (tens of GB)
• Wireless bandwidth is scarce (1MB/sec)
• $20 cameras are wimpy (one frame in 30 secs)
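The obstacles listed above can be put into rough numbers. The sketch assumes 24 GB of stored video (one camera-day, the figure given earlier in the talk, standing in for "tens of GB"), a 30 FPS capture rate, and 1 GB = 1000 MB.

```python
VIDEO_GB = 24                 # assumption: one camera-day of video
LINK_MB_PER_S = 1             # wireless bandwidth from the slide
CAPTURE_FPS = 30              # camera frame rate from the earlier slide
SECONDS_PER_INFERENCE = 30    # "one frame in 30 secs" on the camera

# Uploading everything to the edge for analysis:
upload_hours = VIDEO_GB * 1000 / LINK_MB_PER_S / 3600   # about 6.7 hours

# Running the full NN on the camera instead: each second of video holds
# CAPTURE_FPS frames, and each frame takes SECONDS_PER_INFERENCE seconds.
slowdown_vs_realtime = CAPTURE_FPS * SECONDS_PER_INFERENCE  # 900x slower than real time
```

Neither shipping all video nor analyzing every frame on the camera fits a user who is waiting for an answer.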
81
82.
Yes
• Ingestion: cameras learn videos slowly but surely
• Query: continuously refining results
• Edge bootstraps specialized NNs for cameras to run
• 1000x cheaper than full NN.
• Process 1-hour video in secs (working with edge)
83.
Lesson: conquering video icebergs
• Lazy ingestion: pay as little as possible
• Eager query-time optimizations
• Take specialization opportunities
• Users know their queries better
• Resonates with compiler/PL wisdom!
• Just-in-time compilation & lazy evaluation
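The lazy-evaluation analogy can be shown with a Python generator: no frame is examined until a result is requested, and work stops as soon as the consumer has enough. The predicate and the integer "frames" are placeholders for illustration, not part of the system described in the talk.

```python
from itertools import islice


def lazy_query(frames, is_match):
    """Yield matches one at a time; a frame is examined only when a result is requested."""
    for frame in frames:
        if is_match(frame):
            yield frame


examined = []


def expensive_is_match(frame):
    examined.append(frame)    # record the work actually done
    return frame % 3 == 0     # hypothetical predicate


results = lazy_query(range(100), expensive_is_match)  # nothing has run yet
first_two = list(islice(results, 2))                  # examines frames 0 through 3 only
```

Asking for two results touches four of the hundred frames; the rest are never processed unless the user asks for more.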
84.
Supercharge IoT analytics
Two important scenarios
Large stream analytics on the edge
Large video analytics on edge & cameras
OS plays key roles
Map AI to new hardware
Dynamically configure AI
Trade off among competing objectives