Validating Data at Scale 
Spenser Skates 
CEO at Amplitude
Doing things at scale is noisy
• Code is supposed to run the same way every time. But if you run the
same loop a million times on a million different machines, how
confident are you that it will always run the same?
Data from phones is noisier
• Code runs on tens of thousands of different platforms with
hundreds of thousands of different software configurations on
hundreds of millions of phones
• Platforms have the craziest settings
How data can get messed up
• HTTP requests get mangled in transit
• The phone might not get the acknowledgement from the server
• People’s clocks are off
• People are running weird versions of Android
• Memory/disk corruption
• Gamma ray events
You can’t trust data from the client
Problem: Data gets mangled in transit
• Parameters from POST requests get dropped
• Within a parameter, a chunk of the data may never reach the
server
Solution: Checksumming
• Send a checksum that’s a function of all the fields
• If the checksum is wrong or missing, the server knows it hasn’t
received all the data, and tells the phone the upload wasn’t successful
• The phone then attempts to reupload the data
Problem: Client sends the same data twice
• How does the phone know that the server has received the data,
so that it doesn’t upload the same piece of data twice? It gets an
acknowledgement back
• How does the server know that the phone has received the
acknowledgement? It doesn’t!
• Equivalent to the two generals problem
• 5% of the time, a request that the server received successfully
fails to get its acknowledgement back to the phone
• The phone then reuploads that data, so all counts are inflated by
about 5%!
Solution: Deduplication
• Your system must be idempotent at the event level: it must be
able to receive an event it has received before without changing its
state
• Create a unique key for every event that is sent
• When you see an event, check your list of keys; if the key is already
present, discard the event
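A minimal sketch of event-level deduplication, assuming the client attaches a random UUID to each event and the server keeps the seen keys in memory. The field name `insert_id` and the `EventStore` class are hypothetical, and a production system would need a persistent, bounded key store.

```python
import uuid


def new_event(event_type: str) -> dict:
    # Client side: the key is generated once, when the event is created,
    # so every reupload of the same event carries the same key.
    return {"insert_id": str(uuid.uuid4()), "type": event_type}


class EventStore:
    def __init__(self) -> None:
        self.seen_keys: set[str] = set()
        self.counts: dict[str, int] = {}

    def ingest(self, event: dict) -> bool:
        # Idempotent: an event whose key is already known leaves state unchanged.
        key = event["insert_id"]
        if key in self.seen_keys:
            return False
        self.seen_keys.add(key)
        self.counts[event["type"]] = self.counts.get(event["type"], 0) + 1
        return True
```

If the phone reuploads an event because the acknowledgement was lost, the second `ingest` call returns `False` and the count for that event type stays the same.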
Problem: Clocks are off
• Phones are often offline, so an analytics SDK needs to cache data
locally before uploading, including the time the event occurred
• But people’s clocks are often off, occasionally by years!
• We can’t just timestamp events with the upload time: 5% of data is
uploaded >24 hours after the event happened
Solution: Get an estimate of the actual time an event was logged
• Have the phone timestamp the upload as well as each event
• For each event, let’s compare:
  • The difference between the phone event timestamp and the server
  upload time
  • The difference between the phone upload timestamp and the server
  upload time
Solution: Get an estimate of the actual time an event was logged
• From each event timestamp, subtract the difference between the
phone’s upload time and the server’s upload time
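The correction can be sketched as below. It assumes all timestamps are in milliseconds, that network latency is negligible, and that the phone's clock error stays constant between the event and the upload; the function and parameter names are hypothetical.

```python
def corrected_event_time(
    client_event_ms: int,
    client_upload_ms: int,
    server_upload_ms: int,
) -> int:
    # How far the phone's clock is ahead of (positive) or behind (negative)
    # the server's clock, measured at upload time.
    clock_offset_ms = client_upload_ms - server_upload_ms
    # Shift the phone's event timestamp onto the server's clock.
    return client_event_ms - clock_offset_ms
```

This is equivalent to taking the server's upload time and subtracting how long the event waited on the phone (`client_upload_ms - client_event_ms`), so a wrong phone clock cancels out as long as it is wrong by the same amount at both moments.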
Other Problems
• People are running weird versions of Android
• MD5 library
• Memory/disk corruption
• Gamma ray events
Clean Data
Questions? 
Always happy to talk about analytics problems! 
spenser@amplitude.com 
blog.amplitude.com 
twitter: @amplitudemobile 
MOBILE ANALYTICS FOR DECISION MAKERS
