The document discusses containerizing Spark clusters on Kubernetes. It describes how the author's Spark cluster looked in 2014 running on Mesos with networked storage. It then covers motivations for microservices architectures and how Spark fits into this. The document outlines architectures for analytics and applications, including responsibilities like transformation, aggregation, training models, and more. It also discusses legacy architectures like data warehouses and Hadoop-style data lakes. Finally, it covers practical considerations and potential pitfalls of containerized Spark clusters like scheduling, security, and storage options.
89. THE KAPPA ARCHITECTURE
events → queue for “raw data” topic
transform → queue for “preprocessed data” topic
analysis → queue for “analysis results” topic
reporting, end-user UI
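The flow on this slide can be sketched in a few lines of plain Python. This is only a toy, assuming an in-memory append-only log in place of a real message queue. The `Log` class, `run_stage` helper, and the `transform` and `analysis` functions are hypothetical names invented for illustration; they are not from the talk, from Spark, or from any queueing library.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class Log:
    """Toy append-only log with named topics (a stand-in for a real message queue)."""

    def __init__(self) -> None:
        self.topics: Dict[str, List[dict]] = defaultdict(list)

    def append(self, topic: str, record: dict) -> None:
        self.topics[topic].append(record)

    def read(self, topic: str, offset: int = 0) -> List[dict]:
        # Records are never mutated or deleted, so any consumer can replay from offset 0.
        return list(self.topics[topic][offset:])


def run_stage(log: Log, source: str, sink: str,
              fn: Callable[[List[dict]], List[dict]]) -> None:
    """Read everything in `source`, apply `fn`, and append the results to `sink`."""
    for record in fn(log.read(source)):
        log.append(sink, record)


def transform(records: List[dict]) -> List[dict]:
    # Hypothetical preprocessing: drop malformed events and normalize the fields.
    return [
        {"user": r["user"].lower(), "amount": float(r["amount"])}
        for r in records
        if "user" in r and "amount" in r
    ]


def analysis(records: List[dict]) -> List[dict]:
    # Hypothetical analysis: total amount per user.
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["user"]] += r["amount"]
    return [{"user": u, "total": t} for u, t in sorted(totals.items())]


log = Log()
for event in [
    {"user": "Ada", "amount": "3"},
    {"user": "ada", "amount": "4.5"},
    {"malformed": True},
    {"user": "Bob", "amount": "1"},
]:
    log.append("raw data", event)

run_stage(log, "raw data", "preprocessed data", transform)
run_stage(log, "preprocessed data", "analysis results", analysis)

# A reporting tool or end-user UI would read from this last topic.
results = log.read("analysis results")
```

Because the raw events stay in the log, changing `transform` or `analysis` only requires replaying the "raw data" topic into fresh downstream topics. That is the property the sketch is meant to show.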
90. DATA FEDERATION IN THE COMPUTE LAYER
events; databases; file, object storage
transform, aggregate, train models, archive
management, web and mobile, reporting, developer UI
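As a rough illustration of federating data in the compute layer, the sketch below reads from three differently shaped sources, transforms each into a common record shape, and aggregates across them in application code instead of first loading everything into a single store. The sample data, the field names, and the helpers `from_events`, `from_file`, and `aggregate_by_region` are all assumptions made up for this example.

```python
import csv
import io
from collections import defaultdict

# Three hypothetical sources, each in its own native shape.
events = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": 5.0},
    {"customer_id": 1, "amount": 2.5},
]
database_rows = {
    1: {"name": "Ada", "region": "EU"},
    2: {"name": "Bob", "region": "US"},
}
archived_file = io.StringIO("customer_id,amount\n2,10.0\n1,7.5\n")


def from_events(evts):
    """Transform: project live events onto the common purchase shape."""
    for e in evts:
        yield {"customer_id": e["customer_id"], "amount": e["amount"]}


def from_file(handle):
    """Transform: parse archived CSV rows into the same purchase shape."""
    for row in csv.DictReader(handle):
        yield {"customer_id": int(row["customer_id"]), "amount": float(row["amount"])}


def aggregate_by_region(purchases, customers):
    """Aggregate: join purchases with customer rows and total amounts per region."""
    totals = defaultdict(float)
    for p in purchases:
        region = customers[p["customer_id"]]["region"]
        totals[region] += p["amount"]
    return dict(totals)


purchases = list(from_events(events)) + list(from_file(archived_file))
totals = aggregate_by_region(purchases, database_rows)
```

The join between purchases and customer rows happens in the compute step, so each source can stay in whatever system suits it best.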
121. STORAGE
Kubernetes: app 1, app 2, app 3, app 4, app 5, app 6
object store
✓ interoperability
✓ fine-grained access control
✓ many implementations
✗ consistency model
✗ performance
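To make the "consistency model" drawback concrete, here is a toy model of one behaviour that can surprise applications written for a filesystem: a listing that lags behind recent writes. `ToyObjectStore` and its `settle` method are invented for this sketch and do not correspond to any real object store API. Actual guarantees differ between implementations and have changed over time, so check the documentation of the store you use.

```python
class ToyObjectStore:
    """Toy store: GET by key sees a write at once, LIST sees it only after settle()."""

    def __init__(self):
        self._objects = {}
        self._listed = set()

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix):
        # Only keys that were present at the last settle() are visible here.
        return sorted(k for k in self._listed if k.startswith(prefix))

    def settle(self):
        # Stand-in for "enough time has passed for the listing to catch up".
        self._listed = set(self._objects)


store = ToyObjectStore()
store.put("job-1/part-0", b"a")
store.put("job-1/part-1", b"b")
store.settle()
store.put("job-1/part-2", b"c")

stale = store.list_keys("job-1/")   # the newest part is missing from the listing
direct = store.get("job-1/part-2")  # but a direct read by key succeeds
store.settle()
fresh = store.list_keys("job-1/")   # the listing has now caught up
```

A job that discovers its input by listing a prefix could silently skip `part-2` in the stale window. That is the kind of hazard an application has to design around under a model like this one.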
122. “…in a cloud native architecture, the benefit of HDFS is actually very small and that is why many cloud-first organizations no longer run HDFS, or only run it as a caching layer for S3.”
—Reynold Xin on Quora (http://qr.ae/TAF4cN)