Half of the work that it takes to do data science is plumbing and wrangling. I’ll discuss some tricks we’ve learned while building AddThis over the years to collect and process data at web scale.
@numbakrrunch
4. Our Data
We process tool data
● Sharing
● Following
● Visitation
● Content Classification
And feed it back to sites
● Analytics
● Trending Content
● Personalized Recommendations
@numbakrrunch
5. At Scale...
● 14 million domains
● 100 billion views/month
● 45k events/sec
● 160k concurrent firewall sessions
● 500k unique metrics in ganglia
6. Counting Things
Common operations:
● Cardinality
● Set membership
● Top-k elements
● Frequency
● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus
○ Distributed counting
○ Checkpointing
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-webanalytics-data-mining/
Stream-lib: https://github.com/clearspring/stream-lib
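To show why mergeability matters, here is a toy HyperLogLog cardinality estimator in Python (the deck's stream-lib has production-grade Java versions). The key property: the sketch of a union of two streams is just the bucket-wise max of their sketches, which is exactly what distributed counting and checkpointing need.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: ~1.6% cardinality error with p=12 (4096 buckets)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.buckets = [0] * self.m

    def _hash(self, item: str) -> int:
        # 64-bit hash from the first 8 bytes of SHA-1.
        return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

    def add(self, item: str) -> None:
        x = self._hash(item)
        idx = x >> (64 - self.p)                      # top p bits pick a bucket
        rest = x & ((1 << (64 - self.p)) - 1)         # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading-zero run + 1
        self.buckets[idx] = max(self.buckets[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Mergeability: union of two streams == bucket-wise max of sketches.
        self.buckets = [max(a, b) for a, b in zip(self.buckets, other.buckets)]

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -b for b in self.buckets)
        if est <= 2.5 * self.m:                       # small-range correction
            zeros = self.buckets.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return round(est)
```

Each node can keep its own sketch, ship it to an aggregator, and the merged estimate costs a few kilobytes per counter instead of a set of raw IDs.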
7. Distributed ID Generation
● Session IDs are generated in the browser
● We concatenate time and a random value
[Diagram: 64-bit layout, bits 63–32 = time, bits 31–0 = rand]
Hex: 4f6934b6f54bd7c1 / Base64: T2k0to403VS
● Time-bounded probabilistic uniqueness
○ C(m, 2) / n ≈ 0.142 collisions/sec (at 35k rq/sec)
● Naturally time ordered, built-in DoB
● Compare to Twitter Snowflake
https://github.com/twitter/snowflake/
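A minimal sketch of that concatenation, assuming (from the hex example) the top 32 bits are Unix seconds and the bottom 32 bits are random:

```python
import secrets
import time

def session_id() -> int:
    """64-bit ID: top 32 bits = Unix seconds, bottom 32 bits = random.
    Sorting IDs sorts by creation time, and the timestamp doubles as a
    built-in date-of-birth for the session."""
    ts = int(time.time()) & 0xFFFFFFFF
    return (ts << 32) | secrets.randbits(32)

print(f"{session_id():016x}")  # 16 hex chars, like the slide's example
```

No coordination is needed between browsers: uniqueness is probabilistic, and the birthday bound only has to hold among IDs minted in the same second.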
8. Joining Data
● Value of data increases with higher dimensionality
○ Geo, user profile, page attributes, external data
● Join and de-normalize data when you ingest
○ Disk is cheap
● Join your data in client-side storage
○ Browsers as a lossy distributed database
● Mutability?
“The value is in the join” (or something like that)
https://github.com/stewartoallen
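A sketch of ingest-time de-normalization: each event is enriched once, at write time, so downstream jobs never need a second lookup. The lookup tables here are hypothetical stand-ins for real geo and profile stores.

```python
# Hypothetical lookup tables standing in for geo / user-profile services.
GEO = {"203.0.113.9": "US"}
PROFILES = {"u42": {"segment": "sports"}}

def enrich(event: dict) -> dict:
    """De-normalize at ingest: copy joined attributes into the event.
    Disk is cheap; re-joining later in every batch job is not."""
    out = dict(event)
    out["geo"] = GEO.get(event.get("ip"), "unknown")
    out["segment"] = PROFILES.get(event.get("uid"), {}).get("segment")
    return out
```

The trade-off is mutability: once a joined copy is written (or pushed out to a browser), updating the source table does not update the copies.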
9. Sharding and Sampling
● Choose your shard keys wisely
○ High cardinality field to reduce lumpiness
○ What do you need to co-locate?
● Shards also useful for sampling
○ Law of large numbers
● Can yield statistical significance
○ Depending on the question
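One way sharding yields sampling, sketched in Python: hash the shard key so the sample is a consistent cohort (the same user is always in or out) rather than a random set of rows. The 1% rate and key format are illustrative.

```python
import hashlib

def in_sample(shard_key: str, percent: float = 1.0) -> bool:
    """Deterministic sampling on the shard key: hash it and keep the
    keys that fall in the first `percent` of the hash space."""
    h = int.from_bytes(hashlib.sha1(shard_key.encode()).digest()[:8], "big")
    return (h % 10_000) < percent * 100

# A high-cardinality key spreads the sample evenly across shards.
users = [f"user-{i}" for i in range(100_000)]
rate = sum(in_sample(u) for u in users) / len(users)
```

Because membership is a pure function of the key, any job on any machine agrees on who is in the sample, and a 1% cohort followed over time can still be statistically significant, depending on the question.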
10. Tunable QoS
● URL metadata stored in a 90-node Cassandra cluster
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
○ Depending on write rate per record
● CDN cache
● Global TTL knob
○ Turn up to reduce load for maintenance
○ Turn down to improve responsiveness
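A hedged sketch of how a global knob might compose with per-record write rates. All constants here are illustrative, not AddThis's actual values.

```python
def cache_ttl(writes_per_day: float, global_knob: float = 1.0,
              base_ttl: float = 86_400.0,
              min_ttl: float = 60.0, max_ttl: float = 7 * 86_400.0) -> float:
    """Hot records (written often) get short TTLs so readers see fresh
    data; cold records can be cached for days. The global knob scales
    every TTL at once: >1 sheds backend load during maintenance,
    <1 improves responsiveness."""
    ttl = base_ttl / max(writes_per_day, 1.0)
    return min(max(ttl * global_knob, min_ttl), max_ttl)
```

The useful property is that QoS becomes one number an operator can turn, instead of a redeploy: doubling the knob roughly halves read traffic to the cluster at the cost of staler answers.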
11. Deployment
● Continuous Deploy?
● Deploying our JavaScript costs $3k
○ Have to invalidate 1.4B browser caches
○ Several hours to flush to browsers (clench)
● 2PB of CDN data served per month
● Have DDOSed ourselves
○ Very interesting bugs
● Simulation is weak
○ The internet is a dirty place
○ Embrace incremental deploys
12. Columnar Compression
● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● by @abramsm
[Diagram: input data rows with fields Time, IP, UID, URL, Geo are regrouped into per-field blocks (of a fixed block size) in the stored data]
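The idea can be sketched in a few lines of Python: regroup row records by field and compress each column separately, so each compressor sees long runs of homogeneous values. The sample records and the choice of zlib are illustrative, not the deck's actual format or compressors.

```python
import zlib

# Illustrative clickstream-like rows with repetitive fields.
rows = [
    {"time": 1335000000 + i, "ip": f"10.0.0.{i % 4}", "uid": f"u{i % 50}",
     "url": f"http://example.com/p{i % 20}", "geo": "US"}
    for i in range(1000)
]

# Row-oriented: serialize record by record.
row_blob = "\n".join(
    f'{r["time"]},{r["ip"]},{r["uid"]},{r["url"]},{r["geo"]}' for r in rows
).encode()
row_size = len(zlib.compress(row_blob, 9))

# Column-oriented: regroup values by field, then compress each column
# on its own so repeated values (geo, URL paths) collapse to near zero.
columns = {k: "\n".join(str(r[k]) for r in rows).encode() for k in rows[0]}
col_size = sum(len(zlib.compress(blob, 9)) for blob in columns.values())
```

Per-column compression also lets each field get a compressor suited to it (delta coding for timestamps, dictionary coding for URLs), which is where savings beyond generic zlib come from.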
13. Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency for perfect
○ I’m still struggling with this