Charlie Reverte
VP Engineering
@numbakrrunch

Data Lessons Learned at Scale
Topic
Half of the work that it takes to do data science is plumbing and wrangling.
I’ll discuss some tricks we’ve learned over the years to collect and process data at web scale.

About AddThis
We make tools for websites:

Our Data
We process tool data
● Sharing
● Following
● Visitation
● Content Classification
And feed it back to sites
● Analytics
● Trending Content
● Personalized Recommendations

At Scale...
● 14 million domains
● 100 billion views/month
● 45k events/sec
● 160k concurrent firewall sessions
● 500k unique metrics in ganglia

Counting Things
Common operations:
● Cardinality
● Set membership
● Top-k elements
● Frequency
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-webanalytics-data-mining/

● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus
○ Distributed counting
○ Checkpointing

Stream-lib: https://github.com/clearspring/stream-lib
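
To make the mergeability point concrete, here is a minimal sketch of one mergeable cardinality estimator, a K-Minimum-Values (KMV) sketch. It is an illustration only and not how stream-lib implements its estimators; the class name `KMVSketch` and the parameter `k` are assumptions made for this example.

```python
import hashlib
import heapq


class KMVSketch:
    """Hypothetical K-Minimum-Values sketch: keeps the k smallest hashes seen."""

    def __init__(self, k=256):
        self.k = k
        self.values = set()  # k smallest hashes, each normalized into [0, 1)

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self.values:
            return  # duplicates never change the sketch
        if len(self.values) < self.k:
            self.values.add(h)
        else:
            largest = max(self.values)
            if h < largest:
                self.values.remove(largest)
                self.values.add(h)

    def merge(self, other):
        # Assumes both sketches use the same k. The k smallest hashes of the
        # union are all the merged sketch needs, so partial counts from
        # different machines (or checkpoints) combine without the raw data.
        merged = KMVSketch(self.k)
        merged.values = set(heapq.nsmallest(self.k, self.values | other.values))
        return merged

    def estimate(self):
        if len(self.values) < self.k:
            return float(len(self.values))  # exact while under capacity
        return (self.k - 1) / max(self.values)
```

Two workers can each build a sketch over their own slice of the stream and call `merge` later; the merged sketch is identical to one built over the combined stream.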
Distributed ID Generation
● Session IDs are generated in the browser
● We concatenate time and a random value

Bit layout: time in bits 63-32, rand in bits 31-0
Hex: 4f6934b6f54bd7c1
Base64: T2k0to403VS

● Time-bounded probabilistic uniqueness
○ (m choose 2) / n = 0.142 collisions/sec (m = 35k rq/sec, n = 2^32 random values)
● Naturally time ordered, built-in DoB

Compare to Twitter Snowflake
https://github.com/twitter/snowflake/
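
The layout and the collision estimate can be sketched as follows. This is an illustration in Python rather than the browser code the slide refers to; the function names are hypothetical, and the 32-bit seconds / 32-bit random split is an assumption read off the bit positions on the slide.

```python
import math
import secrets
import time


def make_session_id(now=None):
    """Build a 64-bit ID: 32 bits of Unix seconds, then 32 random bits."""
    seconds = int(time.time() if now is None else now) & 0xFFFFFFFF
    rand = secrets.randbits(32)
    return (seconds << 32) | rand


def id_timestamp(session_id):
    """Recover the creation time (Unix seconds) from the high 32 bits."""
    return session_id >> 32


def expected_collisions_per_second(requests_per_second, rand_bits=32):
    """Pairs of IDs minted in the same second, divided by the random space."""
    return math.comb(requests_per_second, 2) / 2**rand_bits


sid = make_session_id()
print(format(sid, "016x"), id_timestamp(sid))
print(round(expected_collisions_per_second(35_000), 4))
```

Because the timestamp sits in the high bits, sorting IDs numerically also sorts them by creation time, and uniqueness only has to hold among IDs minted within the same second.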
Joining Data
● Value of data increases with higher dimensionality
○ Geo, user profile, page attributes, external data
● Join and de-normalize data when you ingest
○ Disk is cheap
● Join your data in client-side storage
○ Browsers as a lossy distributed database
● Mutability?

“The value is in the join”
(or something like that)

https://github.com/stewartoallen
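
As a rough illustration of joining at ingest time, the sketch below copies lookup data onto each event as it arrives, so later queries need no join. The field names (`ip`, `url`, `geo`, `page`) and the two lookup tables are hypothetical and not taken from the talk.

```python
def denormalize(event, geo_by_ip, page_attrs_by_url):
    """Return a copy of the event with geo and page attributes joined in."""
    enriched = dict(event)  # leave the raw event untouched
    enriched["geo"] = geo_by_ip.get(event["ip"])  # None when the IP is unknown
    enriched["page"] = page_attrs_by_url.get(event["url"], {})
    return enriched


event = {"uid": "u1", "ip": "203.0.113.7", "url": "http://example.com/a"}
geo_by_ip = {"203.0.113.7": "US"}
page_attrs_by_url = {"http://example.com/a": {"topic": "news"}}
print(denormalize(event, geo_by_ip, page_attrs_by_url))
```

The stored record is wider than the raw event, which is the trade the slide accepts with "Disk is cheap".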

Sharding and Sampling
● Choose your shard keys wisely
○ High cardinality field to reduce lumpiness
○ What do you need to co-locate?
● Shards also useful for sampling
○ Law of large numbers
● Can yield statistical significance
○ Depending on the question
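
A minimal sketch of how one hash-based shard can double as a sample: hashing a high-cardinality key spreads records evenly, so counting a single shard and scaling by the shard count approximates the total. The function names and the choice of MD5 as the hash are assumptions for illustration.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def estimate_total(keys, num_shards, sampled_shard=0):
    """Count keys in one shard and scale up by the number of shards."""
    in_sample = sum(1 for k in keys if shard_for(k, num_shards) == sampled_shard)
    return in_sample * num_shards


uids = [f"uid-{i}" for i in range(20_000)]
print(estimate_total(uids, num_shards=10))
```

Whether the estimate is good enough depends on the question: totals and common values hold up well, while rare values may be missing from the sampled shard entirely.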

Tunable QoS
● URL Metadata stored in a 90-node Cassandra cluster
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
○ Depending on write rate per record
● Global TTL knob
○ Turn up to reduce load for maintenance
○ Turn down to improve responsiveness

[Diagram: CDN cache]
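
One way the two TTL controls could combine is sketched below. The formula, the parameter names, and the clamping bounds are all assumptions made for illustration; the talk does not say how the TTLs are actually computed.

```python
def cache_ttl(base_ttl_seconds, writes_per_day, global_multiplier=1.0,
              min_ttl=60, max_ttl=86_400):
    """Hypothetical TTL rule: frequently written records expire sooner,
    and a single global multiplier scales every TTL up or down."""
    per_record = base_ttl_seconds / (1.0 + writes_per_day)
    ttl = per_record * global_multiplier
    return int(min(max_ttl, max(min_ttl, ttl)))


print(cache_ttl(3600, writes_per_day=1))                         # normal operation
print(cache_ttl(3600, writes_per_day=1, global_multiplier=2.0))  # maintenance window
```

Raising `global_multiplier` lengthens every TTL, so caches absorb more reads and the backing store sees less load; lowering it makes fresh data visible sooner at the cost of more reads.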

Deployment
● Continuous Deploy?
● Deploying our JavaScript costs $3k
○ Have to invalidate 1.4B browser caches
○ Several hours to flush to browsers (clench)

● 2PB of CDN data served per month
● Have DDoSed ourselves
○ Very interesting bugs

● Simulation is weak
○ The internet is a dirty place
○ Embrace incremental deploys
Columnar Compression
● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● by @abramsm

[Diagram: Input Data rows (Time, IP, UID, URL, Geo) become Stored Data blocks (Block Size) holding the Time, IP, UID, URL, and Geo values]
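
The idea can be sketched as follows: within a block of rows, group the values of each column together and compress each column on its own, optionally with a column-specific encoding first. This is an illustrative sketch, not the implementation the slide credits to @abramsm; the use of JSON, zlib, and delta encoding for the time column are assumptions.

```python
import json
import zlib
from itertools import accumulate

FIELDS = ["time", "ip", "uid", "url", "geo"]


def delta_encode(values):
    """Store the first value, then differences; small deltas compress well."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    return list(accumulate(deltas))


def pack_block(rows):
    """Compress one block of row dicts column by column."""
    blobs = {}
    for field in FIELDS:
        column = [row[field] for row in rows]
        if field == "time":  # per-column choice of encoding
            column = delta_encode(column)
        blobs[field] = zlib.compress(json.dumps(column).encode("utf-8"))
    return blobs


def unpack_block(blobs):
    """Rebuild the original rows from the per-column blobs."""
    columns = {f: json.loads(zlib.decompress(blobs[f])) for f in FIELDS}
    columns["time"] = delta_decode(columns["time"])
    return [dict(zip(FIELDS, vals)) for vals in zip(*(columns[f] for f in FIELDS))]
```

Grouping similar values next to each other gives the compressor longer runs of repeated patterns than it would see in interleaved rows, which is where the size savings come from.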

Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency for perfect
○ I’m still struggling with this

Questions?

Data Lessons Learned at Scale - Big Data DC
