9. Engineering Challenges
● The area of SF: 46.87 mi²
● For the purpose of this project, each cluster covers 0.09 mi²
● This means k is roughly 500 (46.87 / 0.09 ≈ 521)
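The estimate of k follows from dividing the city's area by the area assigned to each cluster. A minimal sketch of that arithmetic, using only the two figures from the slide (the variable names are illustrative, not from the project):

```python
# Figures taken from the slide; the names are illustrative.
SF_AREA_SQ_MI = 46.87       # area of San Francisco in square miles
CLUSTER_AREA_SQ_MI = 0.09   # area covered by one cluster in this project

# Number of clusters needed to cover the city at that granularity.
k_estimate = SF_AREA_SQ_MI / CLUSTER_AREA_SQ_MI

print(round(k_estimate))  # → 521, which the slide rounds to "roughly 500"
```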
10. Engineering Challenges
● Parameters to tune:
– Time it takes to produce the messages
– Processing time for k-means in Spark Streaming
– The update interval for a fixed data point in the database
11. Goal
● Tune the parameters in order to have a stable system
● The total delay after processing each batch must be constant and comparable to the batch interval.
● You can check this in the Spark API
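One way to make the stability goal concrete is a small check over a series of observed total delays. The sketch below is a heuristic only: the function name `is_stable`, the thresholds `max_ratio` and `max_growth`, and the sample delay values are all assumptions made for illustration, not part of the project or of Spark.

```python
from statistics import mean

def is_stable(total_delays_s, batch_interval_s, max_ratio=1.5, max_growth=0.1):
    """Heuristic stability check; the thresholds are illustrative assumptions.

    total_delays_s: total delay (seconds) observed for consecutive batches.
    batch_interval_s: the configured batch interval in seconds.
    """
    if len(total_delays_s) < 2:
        raise ValueError("need at least two batches to judge a trend")
    half = len(total_delays_s) // 2
    early = mean(total_delays_s[:half])
    late = mean(total_delays_s[half:])
    # "Comparable to the batch interval": no batch lags far behind it.
    comparable = max(total_delays_s) <= max_ratio * batch_interval_s
    # "Constant": later batches are not noticeably slower than earlier ones.
    constant = (late - early) <= max_growth * batch_interval_s
    return comparable and constant

# Made-up delay series for a 2-second batch interval.
steady = [2.1, 1.9, 2.0, 2.2, 2.0, 1.9]
growing = [2.0, 2.6, 3.3, 4.1, 5.0, 6.2]
print(is_stable(steady, batch_interval_s=2.0))   # → True
print(is_stable(growing, batch_interval_s=2.0))  # → False
```

A steadily growing delay means batches are queuing up faster than they are processed, which is the unstable case the tuning is meant to avoid.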
12. Tackling Challenges
● Having multiple producers and consumers ✔
● Kafka sends messages quickly and is not the bottleneck
● Establishing some safe limits:
– Using spark.streaming.receiver.maxRate to control the input rate ✔
– Understanding the complexity of the process in Spark Streaming ✔
– Choosing the right batch interval ✔
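The limits above interact: spark.streaming.receiver.maxRate caps the records per second that each receiver accepts, so together with the number of receivers and the batch interval it bounds the size of every batch. The sketch below shows that arithmetic. The helper names, the 80% headroom, and the throughput figure are hypothetical values chosen for illustration; a real cap would come from measuring the k-means processing time.

```python
def safe_max_rate(processing_throughput_rps, num_receivers, headroom_pct=80):
    """Per-receiver cap that keeps total input below the measured throughput.

    processing_throughput_rps: records per second the job can process
        (a measured value; the figure used below is made up).
    headroom_pct: share of that throughput to actually use (an assumption).
    """
    if num_receivers < 1:
        raise ValueError("need at least one receiver")
    return processing_throughput_rps * headroom_pct // (100 * num_receivers)

def max_records_per_batch(max_rate, num_receivers, batch_interval_s):
    """Upper bound on the records in one batch when each receiver is capped."""
    return max_rate * num_receivers * batch_interval_s

rate = safe_max_rate(processing_throughput_rps=10_000, num_receivers=4)
print(rate)                               # → 2000
print(max_records_per_batch(rate, 4, 2))  # → 16000

# The resulting value could then be passed to the job, for example:
#   spark-submit --conf spark.streaming.receiver.maxRate=2000 ...
```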
16. About Me
● A long time ago - B.S. in pure math, University of Toronto
● More recently - M.S. in applied math, University of British Columbia
● The exciting now - A data engineer who wants to go camping with other data engineers