SlideShare a Scribd company logo
Charlie Reverte
VP Engineering
@numbakrrunch

Data Lessons Learned at Scale
Topic
Half of the work that it takes to do data science is
plumbing and wrangling
I’ll discuss some tricks we’ve learned over the
years to collect and process data at web scale

@numbakrrunch
About AddThis
We make tools for websites:

@numbakrrunch
Our Data
We process tool data
● Sharing
● Following
● Visitation
● Content Classification
And feed it back to sites
● Analytics
● Trending Content
● Personalized
Recommendations

@numbakrrunch
At Scale...
●
●
●
●
●

14 million domains
100 billion views/month
45k events/sec
160k concurrent firewall sessions
500k unique metrics in ganglia

@numbakrrunch
Counting Things
Common operations:
● Cardinality
● Set membership
● Top-k elements
● Frequency
●
●
●
●
http://highlyscalable.wordpress.
com/2012/05/01/probabilistic-structures-webanalytics-data-mining/

Estimate when possible
Sample when possible
Often streaming vs. batch
Mergeability is a big plus
○
○

Distributed counting
Checkpointing

Stream-lib: https://github.com/clearspring/stream-lib
@numbakrrunch
Distributed ID Generation
●
●

Session IDs are generated in the browser
We concatenate time and a random value
time
63

●

Hex: 4f6934b6f54bd7c1

rand
31

Base64: T2k0to403VS
0

Time-bounded probabilistic uniqueness
○ (m 2) / n = 0.142 collisions/sec (at 35k rq/sec)

●

Naturally time ordered, built-in DoB

Compare to Twitter Snowflake
https://github.com/twitter/snowflake/
@numbakrrunch
Joining Data
●

Value of data increases with higher dimensionality
○

●

Join and de-normalize data when you ingest
○

●

Disk is cheap

Join your data in client-side storage
○

●

Geo, user profile, page attributes, external data

Browsers as a lossy distributed database

Mutability?
“The value is in the join”
(or something like that)

https://github.com/stewartoallen

@numbakrrunch
Sharding and Sampling
● Choose your shard keys wisely
○ High cardinality field to reduce lumpiness
○ What do you need to co-locate
● Shards also useful for sampling
○ Law of big numbers
● Can yield statistical significance
○ Depending on the question

@numbakrrunch
Tunable QoS
●

●
●
●
●

URL Metadata stored in a 90-node
Cassandra cluster
We scrape and classify 20M URLs/day
750 million active records
2.2B reads/day
Variable cache TTLs
○

●

Depending on write rate per record

6

CDN cache

Global TTL knob
○
○

Turn up to reduce load for maintenance
Turn down to improve responsiveness

@numbakrrunch
Deployment
● Continuous Deploy?
● Deploying our javascript costs $3k
○ Have to invalidate 1.4B browser caches
○ Several hours to flush to browsers (clench)

● 2PB of CDN data served per month
● Have DDOSed ourselves
○ Very interesting bugs

● Simulation is weak
○ The internet is a dirty place
○ Embrace incremental deploys
Columnar Compression
●
●
●
●
●

Columnar storage techniques for row data
Better compressor efficiency
Different compressors per column
>20% size savings
by @abramsm

Input Data
Time

IP

UID

URL

Stored Data
Geo

Time
IP

Block
Size

UID

URL
Geo

@numbakrrunch
Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency for perfect
○ I’m still struggling with this

@numbakrrunch
Questions?
@numbakrrunch

More Related Content

What's hot

MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB
 
Improve your SQL workload with observability
Improve your SQL workload with observabilityImprove your SQL workload with observability
Improve your SQL workload with observability
OVHcloud
 
Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices
confluent
 
Scalable Application Development @ Picnic
Scalable Application Development @ PicnicScalable Application Development @ Picnic
Scalable Application Development @ Picnic
Sander Mak (@Sander_Mak)
 
The Big Bad Data
The Big Bad DataThe Big Bad Data
The Big Bad Data
Przemysław Pastuszka
 
Big data on google platform dev fest presentation
Big data on google platform   dev fest presentationBig data on google platform   dev fest presentation
Big data on google platform dev fest presentation
Przemysław Pastuszka
 
An introduction to the WSO2 Analytics Platform
An introduction to the WSO2 Analytics Platform   An introduction to the WSO2 Analytics Platform
An introduction to the WSO2 Analytics Platform
Sriskandarajah Suhothayan
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
Julien Le Dem
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
Julien Le Dem
 
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big DataVoxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Stavros Kontopoulos
 
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
Julien Le Dem
 
DocumentDB - NoSQL on Cloud at Reboot2015
DocumentDB - NoSQL on Cloud at Reboot2015DocumentDB - NoSQL on Cloud at Reboot2015
DocumentDB - NoSQL on Cloud at Reboot2015
Vidyasagar Machupalli
 
Kafka Streams - From the Ground Up to the Cloud
Kafka Streams - From the Ground Up to the CloudKafka Streams - From the Ground Up to the Cloud
Kafka Streams - From the Ground Up to the Cloud
VMware Tanzu
 
Pomerania Cloud case study - Openstack Day Warsaw 2017
Pomerania Cloud case study - Openstack Day Warsaw 2017Pomerania Cloud case study - Openstack Day Warsaw 2017
Pomerania Cloud case study - Openstack Day Warsaw 2017
Łukasz Klimek
 
FIWARE Global Summit - QuantumLeap: Time-series and Geographic Queries
FIWARE Global Summit - QuantumLeap: Time-series and Geographic QueriesFIWARE Global Summit - QuantumLeap: Time-series and Geographic Queries
FIWARE Global Summit - QuantumLeap: Time-series and Geographic Queries
FIWARE
 
Big data @ uber vu (1)
Big data @ uber vu (1)Big data @ uber vu (1)
Big data @ uber vu (1)
Mihnea Giurgea
 
Dataspace presentatie
Dataspace presentatieDataspace presentatie
Dataspace presentatie
Roland Cornelissen
 
M-PIL-3.2 Public Session
M-PIL-3.2 Public SessionM-PIL-3.2 Public Session
M-PIL-3.2 Public Session
Helix Nebula The Science Cloud
 
Scalable Dynamic Data Consumption on the Web
Scalable Dynamic Data Consumption on the WebScalable Dynamic Data Consumption on the Web
Scalable Dynamic Data Consumption on the Web
Ruben Taelman
 

What's hot (20)

MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
 
Improve your SQL workload with observability
Improve your SQL workload with observabilityImprove your SQL workload with observability
Improve your SQL workload with observability
 
Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices
 
Scalable Application Development @ Picnic
Scalable Application Development @ PicnicScalable Application Development @ Picnic
Scalable Application Development @ Picnic
 
The Big Bad Data
The Big Bad DataThe Big Bad Data
The Big Bad Data
 
Big data on google platform dev fest presentation
Big data on google platform   dev fest presentationBig data on google platform   dev fest presentation
Big data on google platform dev fest presentation
 
An introduction to the WSO2 Analytics Platform
An introduction to the WSO2 Analytics Platform   An introduction to the WSO2 Analytics Platform
An introduction to the WSO2 Analytics Platform
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big DataVoxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
 
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
DocumentDB - NoSQL on Cloud at Reboot2015
DocumentDB - NoSQL on Cloud at Reboot2015DocumentDB - NoSQL on Cloud at Reboot2015
DocumentDB - NoSQL on Cloud at Reboot2015
 
Kafka Streams - From the Ground Up to the Cloud
Kafka Streams - From the Ground Up to the CloudKafka Streams - From the Ground Up to the Cloud
Kafka Streams - From the Ground Up to the Cloud
 
Pomerania Cloud case study - Openstack Day Warsaw 2017
Pomerania Cloud case study - Openstack Day Warsaw 2017Pomerania Cloud case study - Openstack Day Warsaw 2017
Pomerania Cloud case study - Openstack Day Warsaw 2017
 
FIWARE Global Summit - QuantumLeap: Time-series and Geographic Queries
FIWARE Global Summit - QuantumLeap: Time-series and Geographic QueriesFIWARE Global Summit - QuantumLeap: Time-series and Geographic Queries
FIWARE Global Summit - QuantumLeap: Time-series and Geographic Queries
 
Big data @ uber vu (1)
Big data @ uber vu (1)Big data @ uber vu (1)
Big data @ uber vu (1)
 
Dataspace presentatie
Dataspace presentatieDataspace presentatie
Dataspace presentatie
 
M-PIL-3.2 Public Session
M-PIL-3.2 Public SessionM-PIL-3.2 Public Session
M-PIL-3.2 Public Session
 
Scalable Dynamic Data Consumption on the Web
Scalable Dynamic Data Consumption on the WebScalable Dynamic Data Consumption on the Web
Scalable Dynamic Data Consumption on the Web
 

Viewers also liked

Functional Prototyping For Mobile Apps
Functional Prototyping For Mobile AppsFunctional Prototyping For Mobile Apps
Functional Prototyping For Mobile Apps
Movel
 
Data Lessons Learned at Scale
Data Lessons Learned at ScaleData Lessons Learned at Scale
Data Lessons Learned at Scale
Charlie Reverte
 
Privacy Friendly Personalization
Privacy Friendly PersonalizationPrivacy Friendly Personalization
Privacy Friendly Personalization
Charlie Reverte
 
.Gov to .com
.Gov to .com.Gov to .com
.Gov to .com
Charlie Reverte
 
UI Testing Automation
UI Testing AutomationUI Testing Automation
UI Testing Automation
AgileEngine
 
"Docker is NOT Container." ~ Dockerとコンテナ技術、PaaSの関係を理解する
"Docker is NOT Container." ~ Dockerとコンテナ技術、PaaSの関係を理解する"Docker is NOT Container." ~ Dockerとコンテナ技術、PaaSの関係を理解する
"Docker is NOT Container." ~ Dockerとコンテナ技術、PaaSの関係を理解するEtsuji Nakai
 

Viewers also liked (6)

Functional Prototyping For Mobile Apps
Functional Prototyping For Mobile AppsFunctional Prototyping For Mobile Apps
Functional Prototyping For Mobile Apps
 
Data Lessons Learned at Scale
Data Lessons Learned at ScaleData Lessons Learned at Scale
Data Lessons Learned at Scale
 
Privacy Friendly Personalization
Privacy Friendly PersonalizationPrivacy Friendly Personalization
Privacy Friendly Personalization
 
.Gov to .com
.Gov to .com.Gov to .com
.Gov to .com
 
UI Testing Automation
UI Testing AutomationUI Testing Automation
UI Testing Automation
 
"Docker is NOT Container." ~ Dockerとコンテナ技術、PaaSの関係を理解する
"Docker is NOT Container." ~ Dockerとコンテナ技術、PaaSの関係を理解する"Docker is NOT Container." ~ Dockerとコンテナ技術、PaaSの関係を理解する
"Docker is NOT Container." ~ Dockerとコンテナ技術、PaaSの関係を理解する
 

Similar to Data Lessons Learned at Scale - Big Data DC

Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
C4Media
 
Netflix Big Data Paris 2017
Netflix Big Data Paris 2017Netflix Big Data Paris 2017
Netflix Big Data Paris 2017
Jason Flittner
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
Prasad Wagle
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
Big data at scrapinghub
Big data at scrapinghubBig data at scrapinghub
Big data at scrapinghub
Dana Brophy
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
DataStax
 
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Web performance mercadolibre - ECI 2013
Web performance   mercadolibre - ECI 2013Web performance   mercadolibre - ECI 2013
Web performance mercadolibre - ECI 2013
Santiago Aimetta
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Omid Vahdaty
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
Hari Shreedharan
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Spark Summit
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
Kriangkrai Chaonithi
 
Web performance optimization - MercadoLibre
Web performance optimization - MercadoLibreWeb performance optimization - MercadoLibre
Web performance optimization - MercadoLibre
Pablo Moretti
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
Itai Yaffe
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
Krivoy Rog IT Community
 

Similar to Data Lessons Learned at Scale - Big Data DC (20)

Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Netflix Big Data Paris 2017
Netflix Big Data Paris 2017Netflix Big Data Paris 2017
Netflix Big Data Paris 2017
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Big data at scrapinghub
Big data at scrapinghubBig data at scrapinghub
Big data at scrapinghub
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Web performance mercadolibre - ECI 2013
Web performance   mercadolibre - ECI 2013Web performance   mercadolibre - ECI 2013
Web performance mercadolibre - ECI 2013
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Web performance optimization - MercadoLibre
Web performance optimization - MercadoLibreWeb performance optimization - MercadoLibre
Web performance optimization - MercadoLibre
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 

Recently uploaded

GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 

Recently uploaded (20)

GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 

Data Lessons Learned at Scale - Big Data DC

  • 2. Topic Half of the work that it takes to do data science is plumbing and wrangling I’ll discuss some tricks we’ve learned over the years to collect and process data at web scale @numbakrrunch
  • 3. About AddThis We make tools for websites: @numbakrrunch
  • 4. Our Data We process tool data ● Sharing ● Following ● Visitation ● Content Classification And feed it back to sites ● Analytics ● Trending Content ● Personalized Recommendations @numbakrrunch
  • 5. At Scale... ● ● ● ● ● 14 million domains 100 billion views/month 45k events/sec 160k concurrent firewall sessions 500k unique metrics in ganglia @numbakrrunch
  • 6. Counting Things Common operations: ● Cardinality ● Set membership ● Top-k elements ● Frequency ● ● ● ● http://highlyscalable.wordpress. com/2012/05/01/probabilistic-structures-webanalytics-data-mining/ Estimate when possible Sample when possible Often streaming vs. batch Mergeability is a big plus ○ ○ Distributed counting Checkpointing Stream-lib: https://github.com/clearspring/stream-lib @numbakrrunch
  • 7. Distributed ID Generation ● ● Session IDs are generated in the browser We concatenate time and a random value time 63 ● Hex: 4f6934b6f54bd7c1 rand 31 Base64: T2k0to403VS 0 Time-bounded probabilistic uniqueness ○ (m 2) / n = 0.142 collisions/sec (at 35k rq/sec) ● Naturally time ordered, built-in DoB Compare to Twitter Snowflake https://github.com/twitter/snowflake/ @numbakrrunch
  • 8. Joining Data ● Value of data increases with higher dimensionality ○ ● Join and de-normalize data when you ingest ○ ● Disk is cheap Join your data in client-side storage ○ ● Geo, user profile, page attributes, external data Browsers as a lossy distributed database Mutability? “The value is in the join” (or something like that) https://github.com/stewartoallen @numbakrrunch
  • 9. Sharding and Sampling ● Choose your shard keys wisely ○ High cardinality field to reduce lumpiness ○ What do you need to co-locate ● Shards also useful for sampling ○ Law of big numbers ● Can yield statistical significance ○ Depending on the question @numbakrrunch
  • 10. Tunable QoS ● ● ● ● ● URL Metadata stored in a 90-node Cassandra cluster We scrape and classify 20M URLs/day 750 million active records 2.2B reads/day Variable cache TTLs ○ ● Depending on write rate per record 6 CDN cache Global TTL knob ○ ○ Turn up to reduce load for maintenance Turn down to improve responsiveness @numbakrrunch
  • 11. Deployment ● Continuous Deploy? ● Deploying our javascript costs $3k ○ Have to invalidate 1.4B browser caches ○ Several hours to flush to browsers (clench) ● 2PB of CDN data served per month ● Have DDOSed ourselves ○ Very interesting bugs ● Simulation is weak ○ The internet is a dirty place ○ Embrace incremental deploys
  • 12. Columnar Compression ● ● ● ● ● Columnar storage techniques for row data Better compressor efficiency Different compressors per column >20% size savings by @abramsm Input Data Time IP UID URL Stored Data Geo Time IP Block Size UID URL Geo @numbakrrunch
  • 13. Summary ● Are you more like the post office or the bank? ● Look for good-enough answers ● Fight your nerd tendency for perfect ○ I’m still struggling with this @numbakrrunch