Validating big data at scale


When you're collecting data from hundreds of millions of devices simultaneously, things get noisy. We go over key problems and solutions for collecting and validating data at scale.

Published in: Data & Analytics

  1. Validating Data at Scale
     Spenser Skates, CEO at Amplitude
  2. Doing things at scale is noisy
     • Code is supposed to run the same way, but what if you run the same loop a million times on a million different machines? How confident are you that it will always run the same?
  3. Data from phones is noisier
     • Running on tens of thousands of different platforms with hundreds of thousands of different software configurations on hundreds of millions of phones
     • Platforms have the craziest settings
  4. How data can get messed up
     • HTTP requests get mangled in transit
     • Phone might not get the acknowledgement from the server
     • People’s clocks are off
     • People are running weird versions of Android
     • Memory/disk corruption
     • Gamma ray events
  5. You can’t trust data from the client
  6. Problem: Data gets mangled in transit
     • Parameters from POST requests get dropped
     • Within a parameter, a chunk of data may not actually reach the server
  7. Solution: Checksumming
     • Send a checksum that’s a function of all the fields
     • If the checksum is wrong or not present, you know that you haven’t got all the data. Tell the phone the upload wasn’t successful
     • The phone will attempt to reupload the data
  8. Problem: Client sends the same data twice
     • How does the phone know that the server has received the data, so that it doesn’t reupload the same piece of data twice? It gets an acknowledgement back
     • How does the server know that the phone has received the acknowledgement? It doesn’t!
     • Equivalent to the two generals problem
     • 5% of the time, a request that the server received successfully fails to get its acknowledgement back to the phone
     • That means all counts are inflated by about 5%!
  9. Solution: Deduplication
     • Your system must be idempotent at the event level: it must be able to receive an event it has received before without changing its state
     • Create a unique key for every event that has been sent
     • When you see an event, check your list of keys; if the key is already present, discard the event
  10. Problem: Clocks are off
      • Phones are often offline, so an analytics SDK needs to cache data locally before uploading, including the time the event occurred
      • But people’s clocks are often off, occasionally by years!
      • We can’t timestamp to the upload time: 5% of data is uploaded more than 24 hours after the event happened
  11. Solution: Get an estimate of the actual time an event was logged
      • Timestamp the upload from the phone
      • For each event, let’s compare:
        • The difference between the phone event timestamp and the server upload time
        • The difference between the phone upload timestamp and the server upload time
  12. Solution: Get an estimate of the actual time an event was logged
      • For each event timestamp, subtract the difference between the phone’s upload time and the server’s upload time
  13. Other Problems
      • People are running weird versions of Android
      • MD5 library
      • Memory/disk corruption
      • Gamma ray events
  14. Clean Data
  15. Questions?
      Always happy to talk about analytics problems!
      Twitter: @amplitudemobile
      MOBILE ANALYTICS FOR DECISION MAKERS
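
The checksumming idea from slide 7 can be sketched roughly as follows. This is only an illustration, not the implementation described in the talk: the payload layout (`fields`, `checksum`), the function names, the use of canonical JSON, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json


def compute_checksum(fields: dict) -> str:
    """Hash a canonical serialization of every field in the upload.

    Sorting keys makes the serialization stable, so the client and the
    server derive the same digest from the same fields.
    (SHA-256 and JSON are assumptions chosen for this sketch.)
    """
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_upload(payload: dict) -> bool:
    """Return True only if the checksum is present and matches the fields.

    On False, the server would tell the phone the upload failed so that
    the phone retries.
    """
    fields = payload.get("fields")
    claimed = payload.get("checksum")
    if not isinstance(fields, dict) or not isinstance(claimed, str):
        return False  # a whole parameter was dropped in transit
    return compute_checksum(fields) == claimed


# Client side: attach the checksum before sending (hypothetical event).
fields = {"event_type": "open_app", "device_id": "abc123", "time": 1400000000}
upload = {"fields": fields, "checksum": compute_checksum(fields)}

# Simulate a chunk of one parameter going missing in transit.
mangled = {"fields": {**fields, "device_id": "abc"}, "checksum": upload["checksum"]}

intact_ok = verify_upload(upload)                 # checksum matches
mangled_ok = verify_upload(mangled)               # fields changed, checksum stale
missing_ok = verify_upload({"fields": fields})    # checksum parameter dropped
```

Because a rejected upload is retried by the phone, this check only works well together with deduplication on the server; otherwise retries would themselves inflate counts.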
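
The deduplication rule from slide 9 might look something like the sketch below. The `EventStore` class, the `event_id` field, and the use of a client-generated UUID as the unique key are hypothetical; the in-memory set stands in for whatever persistent key store a real system would need.

```python
import uuid
from collections import Counter


def make_event(event_type: str) -> dict:
    """Client side: give every event a unique key when it is created."""
    return {"event_id": str(uuid.uuid4()), "event_type": event_type}


class EventStore:
    """Server side: ingesting the same event twice leaves state unchanged."""

    def __init__(self) -> None:
        self.seen_keys = set()   # keys of every event accepted so far
        self.counts = Counter()  # per-event-type totals

    def ingest(self, event: dict) -> bool:
        """Return True if the event was new, False if it was a duplicate."""
        key = event["event_id"]
        if key in self.seen_keys:
            return False  # already counted: discard without changing state
        self.seen_keys.add(key)
        self.counts[event["event_type"]] += 1
        return True


store = EventStore()
event = make_event("purchase")

first = store.ingest(event)   # original upload
# The acknowledgement is lost, so the phone uploads the same event again.
retry = store.ingest(event)

purchase_count = store.counts["purchase"]
```

With this in place, the server can acknowledge a retried upload as successful without the lost acknowledgement inflating the count.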
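
The time correction from slides 11 and 12 comes down to one subtraction, shown here with a worked example. The function name and the numbers are made up for illustration, and the sketch assumes that timestamps are integer epoch seconds, that the phone's clock error stays constant between the event and the upload, and that network delay is negligible.

```python
def estimate_event_time(event_ts: int, phone_upload_ts: int, server_upload_ts: int) -> int:
    """Estimate when an event really happened, in server time.

    The phone's clock error is the difference between the upload time as
    the phone saw it and the upload time as the server saw it. Subtracting
    that error from the phone's event timestamp corrects the event time.
    """
    clock_offset = phone_upload_ts - server_upload_ts
    return event_ts - clock_offset


# Hypothetical example: the phone's clock runs one year (365 days) fast.
ONE_YEAR = 365 * 24 * 60 * 60                     # 31,536,000 seconds
server_upload_ts = 1_400_000_000                  # server clock at upload
phone_upload_ts = server_upload_ts + ONE_YEAR     # phone clock at upload
event_ts = phone_upload_ts - 2 * 60 * 60          # logged 2 hours before upload

corrected = estimate_event_time(event_ts, phone_upload_ts, server_upload_ts)
# corrected is 2 hours before the server's upload time: 1_399_992_800
```

The corrected value depends only on how long before the upload the event was logged, so it is unaffected by how wrong the phone's clock is, as long as the error did not change in between.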