Half of the work that it takes to do data science is plumbing and wrangling. I’ll discuss some tricks we’ve learned while building AddThis over the years to collect and process data at web scale.
@numbakrrunch
4. Our Data
We process tool data
● Sharing
● Following
● Visitation
● Content Classification
And feed it back to sites
● Analytics
● Trending Content
● Personalized Recommendations
@numbakrrunch
5. At Scale...
● 14 million domains
● 100 billion views/month
● 45k events/sec
● 160k concurrent firewall sessions
● 500k unique metrics in ganglia
6. Counting Things
Common operations:
● Cardinality
● Set membership
● Top-k elements
● Frequency
● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus
○ Distributed counting
○ Checkpointing
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-webanalytics-data-mining/
Stream-lib: https://github.com/clearspring/stream-lib
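To show why mergeability matters, here is a toy HyperLogLog cardinality estimator in Python (the deck's stream-lib has production-grade Java versions). The key property: the sketch of a union of two streams is just the bucket-wise max of their sketches, which is exactly what distributed counting and checkpointing need.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: ~1.6% cardinality error with p=12 (4096 buckets)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.buckets = [0] * self.m

    def _hash(self, item: str) -> int:
        # 64-bit hash from the first 8 bytes of SHA-1.
        return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

    def add(self, item: str) -> None:
        x = self._hash(item)
        idx = x >> (64 - self.p)                      # top p bits pick a bucket
        rest = x & ((1 << (64 - self.p)) - 1)         # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading-zero run + 1
        self.buckets[idx] = max(self.buckets[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Mergeability: union of two streams == bucket-wise max of sketches.
        self.buckets = [max(a, b) for a, b in zip(self.buckets, other.buckets)]

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -b for b in self.buckets)
        if est <= 2.5 * self.m:                       # small-range correction
            zeros = self.buckets.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return round(est)
```

Each node can keep its own sketch, ship it to an aggregator, and the merged estimate costs a few kilobytes per counter instead of a set of raw IDs.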
7. Distributed ID Generation
● Session IDs are generated in the browser
● We concatenate time and a random value
[Diagram: 64-bit layout, bits 63–32 = time, bits 31–0 = rand]
Hex: 4f6934b6f54bd7c1 / Base64: T2k0to403VS
● Time-bounded probabilistic uniqueness
○ C(m, 2) / n ≈ 0.142 collisions/sec (at 35k rq/sec)
● Naturally time ordered, built-in DoB
● Compare to Twitter Snowflake
https://github.com/twitter/snowflake/
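A minimal sketch of that concatenation, assuming (from the hex example) the top 32 bits are Unix seconds and the bottom 32 bits are random:

```python
import secrets
import time

def session_id() -> int:
    """64-bit ID: top 32 bits = Unix seconds, bottom 32 bits = random.
    Sorting IDs sorts by creation time, and the timestamp doubles as a
    built-in date-of-birth for the session."""
    ts = int(time.time()) & 0xFFFFFFFF
    return (ts << 32) | secrets.randbits(32)

print(f"{session_id():016x}")  # 16 hex chars, like the slide's example
```

No coordination is needed between browsers: uniqueness is probabilistic, and the birthday bound only has to hold among IDs minted in the same second.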
8. Joining Data
● Value of data increases with higher dimensionality
○ Geo, user profile, page attributes, external data
● Join and de-normalize data when you ingest
○ Disk is cheap
● Join your data in client-side storage
○ Browsers as a lossy distributed database
● Mutability?
“The value is in the join” (or something like that)
https://github.com/stewartoallen
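A sketch of ingest-time de-normalization: each event is enriched once, at write time, so downstream jobs never need a second lookup. The lookup tables here are hypothetical stand-ins for real geo and profile stores.

```python
# Hypothetical lookup tables standing in for geo / user-profile services.
GEO = {"203.0.113.9": "US"}
PROFILES = {"u42": {"segment": "sports"}}

def enrich(event: dict) -> dict:
    """De-normalize at ingest: copy joined attributes into the event.
    Disk is cheap; re-joining later in every batch job is not."""
    out = dict(event)
    out["geo"] = GEO.get(event.get("ip"), "unknown")
    out["segment"] = PROFILES.get(event.get("uid"), {}).get("segment")
    return out
```

The trade-off is mutability: once a joined copy is written (or pushed out to a browser), updating the source table does not update the copies.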
9. Sharding and Sampling
● Choose your shard keys wisely
○ High cardinality field to reduce lumpiness
○ What do you need to co-locate?
● Shards also useful for sampling
○ Law of large numbers
● Can yield statistical significance
○ Depending on the question
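One way sharding yields sampling, sketched in Python: hash the shard key so the sample is a consistent cohort (the same user is always in or out) rather than a random set of rows. The 1% rate and key format are illustrative.

```python
import hashlib

def in_sample(shard_key: str, percent: float = 1.0) -> bool:
    """Deterministic sampling on the shard key: hash it and keep the
    keys that fall in the first `percent` of the hash space."""
    h = int.from_bytes(hashlib.sha1(shard_key.encode()).digest()[:8], "big")
    return (h % 10_000) < percent * 100

# A high-cardinality key spreads the sample evenly across shards.
users = [f"user-{i}" for i in range(100_000)]
rate = sum(in_sample(u) for u in users) / len(users)
```

Because membership is a pure function of the key, any job on any machine agrees on who is in the sample, and a 1% cohort followed over time can still be statistically significant, depending on the question.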
10. Tunable QoS
● URL metadata stored in a 90-node Cassandra cluster
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
○ Depending on write rate per record
● CDN cache
● Global TTL knob
○ Turn up to reduce load for maintenance
○ Turn down to improve responsiveness
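A hedged sketch of how a global knob might compose with per-record write rates. All constants here are illustrative, not AddThis's actual values.

```python
def cache_ttl(writes_per_day: float, global_knob: float = 1.0,
              base_ttl: float = 86_400.0,
              min_ttl: float = 60.0, max_ttl: float = 7 * 86_400.0) -> float:
    """Hot records (written often) get short TTLs so readers see fresh
    data; cold records can be cached for days. The global knob scales
    every TTL at once: >1 sheds backend load during maintenance,
    <1 improves responsiveness."""
    ttl = base_ttl / max(writes_per_day, 1.0)
    return min(max(ttl * global_knob, min_ttl), max_ttl)
```

The useful property is that QoS becomes one number an operator can turn, instead of a redeploy: doubling the knob roughly halves read traffic to the cluster at the cost of staler answers.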
11. Deployment
● Continuous Deploy?
● Deploying our JavaScript costs $3k
○ Have to invalidate 1.4B browser caches
○ Several hours to flush to browsers (clench)
● 2PB of CDN data served per month
● Have DDOSed ourselves
○ Very interesting bugs
● Simulation is weak
○ The internet is a dirty place
○ Embrace incremental deploys
12. Columnar Compression
● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● by @abramsm
[Diagram: input data rows with fields Time, IP, UID, URL, Geo are regrouped into per-field blocks (of a fixed block size) in the stored data]
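The idea can be sketched in a few lines of Python: regroup row records by field and compress each column separately, so each compressor sees long runs of homogeneous values. The sample records and the choice of zlib are illustrative, not the deck's actual format or compressors.

```python
import zlib

# Illustrative clickstream-like rows with repetitive fields.
rows = [
    {"time": 1335000000 + i, "ip": f"10.0.0.{i % 4}", "uid": f"u{i % 50}",
     "url": f"http://example.com/p{i % 20}", "geo": "US"}
    for i in range(1000)
]

# Row-oriented: serialize record by record.
row_blob = "\n".join(
    f'{r["time"]},{r["ip"]},{r["uid"]},{r["url"]},{r["geo"]}' for r in rows
).encode()
row_size = len(zlib.compress(row_blob, 9))

# Column-oriented: regroup values by field, then compress each column
# on its own so repeated values (geo, URL paths) collapse to near zero.
columns = {k: "\n".join(str(r[k]) for r in rows).encode() for k in rows[0]}
col_size = sum(len(zlib.compress(blob, 9)) for blob in columns.values())
```

Per-column compression also lets each field get a compressor suited to it (delta coding for timestamps, dictionary coding for URLs), which is where savings beyond generic zlib come from.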
13. Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency for perfect
○ I’m still struggling with this