SlideShare a Scribd company logo
Charlie Reverte
VP Engineering
@numbakrrunch
Data Lessons Learned at Scale
@numbakrrunch
Topic
Half of the work that it takes to do data science is
plumbing and wrangling
Here are some lessons we’ve learned..
@numbakrrunch
About AddThis
We make tools for websites
@numbakrrunch
Our Data
We process website data
● Visitation
● Sharing
● Following
● Content Classification
And use it to improve the site
● Content Recommendation
● Personalization
● Analytics
@numbakrrunch
At Scale...
● 14 million domains
● 100 billion views/month
● 50k events/sec
● 160k concurrent firewall sessions
● 500k unique ganglia metrics
@numbakrrunch
Distributed ID Generation
● Session IDs are generated in the browser
● We concatenate time and a random value
Hex: 4f6934b6f54bd7c1
Base64: T2k0to403VS
● Time-bounded probabilistic uniqueness
○ (m 2) / n = 0.142 collisions/sec (at 35k rq/sec)
● Naturally time ordered, built-in DoB
Compare to Twitter Snowflake
https://github.com/twitter/snowflake/
time rand
63 31 0
@numbakrrunch
Counting Things
● Cardinality
● Set membership
● Top-k elements
● Frequency
● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus
○ Distributed counting
○ Checkpointing
Stream-lib: https://github.com/clearspring/stream-lib
http://highlyscalable.wordpress.
com/2012/05/01/probabilistic-structures-web-
analytics-data-mining/
@numbakrrunch
Joining Data
● Value of data increases with higher dimensionality
○ Geo, user profile, page attributes, external data
● Join and de-normalize data when you ingest
○ Disk is cheap
● Join your data in client-side storage
○ Browsers as a lossy distributed database
● Oceans of data in the cloud..
“The value is in the join”
(or something like that)
https://github.com/stewartoallen
@numbakrrunch
Sharding and Sampling
● Choose your shard keys wisely
○ High cardinality field to reduce lumpiness
○ What do you need to co-locate
○ Storage is cheap, multiple copies?
● Shards also useful for sampling
○ Complete data subsets
● Can yield statistical significance
○ Depending on the question
Deployment
● Continuous Deploy?
● Deploying our javascript costs $3k
○ Have to invalidate 1.4B browser caches
○ Several hours to flush to browsers (clench)
● 2PB of CDN data served per month
● Have DDOSed ourselves
○ Very interesting bugs
● Simulation is weak
○ The internet is a dirty place
○ Embrace incremental deploys
@numbakrrunch
The Log
Jay Kreps: “Real-time data’s unifying abstraction”
● Centralized logging
● Loosely coupled consumers
Divide your dependencies:
● Synchronous - 0mq
● Asynchronous - Kafka
Distributed event logging
● Does determinism matter?
Log format durability?
● Protobuf?
http://bit.ly/thelog
@numbakrrunch
Columnar Compression
● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● https://github.com/addthis/columncompressor
○ by @abramsm
Time IP UID URL Geo Time
IP
UID
URL
Geo
Input Data Stored Data
Block
Size
@numbakrrunch
Tunable QoS
Cassandra URL Store
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
○ Depending on write rate per record
● Global TTL knob
○ Turn up to reduce load for maintenance
○ Turn down to improve responsiveness
6
CDN cache
Hydra
Our custom processing system
Optimized for real-time data
Just open sourced:
https://github.com/addthis/hydra
Go see @csby’s talk
Great Hall North @3:55pm
@numbakrrunch
Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency for perfect
○ I’m still struggling with this
Questions?
@numbakrrunchSlides: http://bit.ly/datalessons

More Related Content

What's hot

Hyperloglog Lightning Talk
Hyperloglog Lightning TalkHyperloglog Lightning Talk
Hyperloglog Lightning Talk
Simon Prickett
 
Piano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processing
MartinStrycek
 
Cost Effective Presto on AWS with Spot Nodes - Strata SF 2019
Cost Effective Presto on AWS with Spot Nodes - Strata SF 2019Cost Effective Presto on AWS with Spot Nodes - Strata SF 2019
Cost Effective Presto on AWS with Spot Nodes - Strata SF 2019
Shubham Tagra
 
MongoDB IoT City Tour EINDHOVEN: Managing the Database Complexity
MongoDB IoT City Tour EINDHOVEN: Managing the Database ComplexityMongoDB IoT City Tour EINDHOVEN: Managing the Database Complexity
MongoDB IoT City Tour EINDHOVEN: Managing the Database Complexity
MongoDB
 
Atmosphere 2014: Centralized log management based on Logstash and Kibana - ca...
Atmosphere 2014: Centralized log management based on Logstash and Kibana - ca...Atmosphere 2014: Centralized log management based on Logstash and Kibana - ca...
Atmosphere 2014: Centralized log management based on Logstash and Kibana - ca...
PROIDEA
 
Python crash course for geologists in the mining industry
Python crash course for geologists in the mining industryPython crash course for geologists in the mining industry
Python crash course for geologists in the mining industry
Laurent Wagner
 
MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...
MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...
MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...
MongoDB
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
Adam Doyle
 
Fluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerFluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker container
Treasure Data, Inc.
 
Mongo db present
Mongo db presentMongo db present
Mongo db present
scottmsims
 
Logging in The World of DevOps
Logging in The World of DevOps Logging in The World of DevOps
Logging in The World of DevOps
DevOps Indonesia
 
Geo data analytics
Geo data analyticsGeo data analytics
Geo data analytics
Daniel Marcous
 
umeng analytical arch
umeng analytical archumeng analytical arch
umeng analytical arch
Yan Zhang
 
KDB+ Lite
KDB+ LiteKDB+ Lite
KDB+ Lite
Sayanosauras
 
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy IndustriesWebinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
MongoDB
 
Moodle performance testing presentation - Jonathon Moore
 Moodle performance testing presentation - Jonathon Moore Moodle performance testing presentation - Jonathon Moore
Moodle performance testing presentation - Jonathon Moore
Ireland & UK Moodlemoot 2012
 
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience SharingClickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Vianney FOUCAULT
 
MongoDB IoT City Tour STUTTGART: Managing the Database Complexity, by Arthur ...
MongoDB IoT City Tour STUTTGART: Managing the Database Complexity, by Arthur ...MongoDB IoT City Tour STUTTGART: Managing the Database Complexity, by Arthur ...
MongoDB IoT City Tour STUTTGART: Managing the Database Complexity, by Arthur ...
MongoDB
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB
 
Logging for Containers
Logging for ContainersLogging for Containers
Logging for Containers
Eduardo Silva Pereira
 

What's hot (20)

Hyperloglog Lightning Talk
Hyperloglog Lightning TalkHyperloglog Lightning Talk
Hyperloglog Lightning Talk
 
Piano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processing
 
Cost Effective Presto on AWS with Spot Nodes - Strata SF 2019
Cost Effective Presto on AWS with Spot Nodes - Strata SF 2019Cost Effective Presto on AWS with Spot Nodes - Strata SF 2019
Cost Effective Presto on AWS with Spot Nodes - Strata SF 2019
 
MongoDB IoT City Tour EINDHOVEN: Managing the Database Complexity
MongoDB IoT City Tour EINDHOVEN: Managing the Database ComplexityMongoDB IoT City Tour EINDHOVEN: Managing the Database Complexity
MongoDB IoT City Tour EINDHOVEN: Managing the Database Complexity
 
Atmosphere 2014: Centralized log management based on Logstash and Kibana - ca...
Atmosphere 2014: Centralized log management based on Logstash and Kibana - ca...Atmosphere 2014: Centralized log management based on Logstash and Kibana - ca...
Atmosphere 2014: Centralized log management based on Logstash and Kibana - ca...
 
Python crash course for geologists in the mining industry
Python crash course for geologists in the mining industryPython crash course for geologists in the mining industry
Python crash course for geologists in the mining industry
 
MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...
MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...
MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
 
Fluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerFluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker container
 
Mongo db present
Mongo db presentMongo db present
Mongo db present
 
Logging in The World of DevOps
Logging in The World of DevOps Logging in The World of DevOps
Logging in The World of DevOps
 
Geo data analytics
Geo data analyticsGeo data analytics
Geo data analytics
 
umeng analytical arch
umeng analytical archumeng analytical arch
umeng analytical arch
 
KDB+ Lite
KDB+ LiteKDB+ Lite
KDB+ Lite
 
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy IndustriesWebinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
 
Moodle performance testing presentation - Jonathon Moore
 Moodle performance testing presentation - Jonathon Moore Moodle performance testing presentation - Jonathon Moore
Moodle performance testing presentation - Jonathon Moore
 
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience SharingClickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
 
MongoDB IoT City Tour STUTTGART: Managing the Database Complexity, by Arthur ...
MongoDB IoT City Tour STUTTGART: Managing the Database Complexity, by Arthur ...MongoDB IoT City Tour STUTTGART: Managing the Database Complexity, by Arthur ...
MongoDB IoT City Tour STUTTGART: Managing the Database Complexity, by Arthur ...
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and Visualization
 
Logging for Containers
Logging for ContainersLogging for Containers
Logging for Containers
 

Similar to Data Lessons Learned at Scale

Data Lessons Learned at Scale - Big Data DC
Data Lessons Learned at Scale - Big Data DCData Lessons Learned at Scale - Big Data DC
Data Lessons Learned at Scale - Big Data DC
Charlie Reverte
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
C4Media
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
Hotstar
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
Corey Huinker
 
Streaming datasets for personalization
Streaming datasets for personalizationStreaming datasets for personalization
Streaming datasets for personalization
Shriya Arora
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
Web performance mercadolibre - ECI 2013
Web performance   mercadolibre - ECI 2013Web performance   mercadolibre - ECI 2013
Web performance mercadolibre - ECI 2013
Santiago Aimetta
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
Hari Shreedharan
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
PingCAP
 
MongoDB@sfr.fr
MongoDB@sfr.frMongoDB@sfr.fr
MongoDB@sfr.frbeboutou
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Omid Vahdaty
 
PyGrunn2013 High Performance Web Applications with TurboGears
PyGrunn2013  High Performance Web Applications with TurboGearsPyGrunn2013  High Performance Web Applications with TurboGears
PyGrunn2013 High Performance Web Applications with TurboGears
Alessandro Molina
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalSub Szabolcs Feczak
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Imply
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
Ruslan Meshenberg
 

Similar to Data Lessons Learned at Scale (20)

Data Lessons Learned at Scale - Big Data DC
Data Lessons Learned at Scale - Big Data DCData Lessons Learned at Scale - Big Data DC
Data Lessons Learned at Scale - Big Data DC
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Streaming datasets for personalization
Streaming datasets for personalizationStreaming datasets for personalization
Streaming datasets for personalization
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Web performance mercadolibre - ECI 2013
Web performance   mercadolibre - ECI 2013Web performance   mercadolibre - ECI 2013
Web performance mercadolibre - ECI 2013
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
MongoDB@sfr.fr
MongoDB@sfr.frMongoDB@sfr.fr
MongoDB@sfr.fr
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
PyGrunn2013 High Performance Web Applications with TurboGears
PyGrunn2013  High Performance Web Applications with TurboGearsPyGrunn2013  High Performance Web Applications with TurboGears
PyGrunn2013 High Performance Web Applications with TurboGears
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on Druid
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 

Recently uploaded

Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 

Recently uploaded (20)

Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 

Data Lessons Learned at Scale

  • 2. @numbakrrunch Topic Half of the work that it takes to do data science is plumbing and wrangling Here are some lessons we’ve learned..
  • 4. @numbakrrunch Our Data We process website data ● Visitation ● Sharing ● Following ● Content Classification And use it to improve the site ● Content Recommendation ● Personalization ● Analytics
  • 5. @numbakrrunch At Scale... ● 14 million domains ● 100 billion views/month ● 50k events/sec ● 160k concurrent firewall sessions ● 500k unique ganglia metrics
  • 6. @numbakrrunch Distributed ID Generation ● Session IDs are generated in the browser ● We concatenate time and a random value Hex: 4f6934b6f54bd7c1 Base64: T2k0to403VS ● Time-bounded probabilistic uniqueness ○ (m 2) / n = 0.142 collisions/sec (at 35k rq/sec) ● Naturally time ordered, built-in DoB Compare to Twitter Snowflake https://github.com/twitter/snowflake/ time rand 63 31 0
  • 7. @numbakrrunch Counting Things ● Cardinality ● Set membership ● Top-k elements ● Frequency ● Estimate when possible ● Sample when possible ● Often streaming vs. batch ● Mergeability is a big plus ○ Distributed counting ○ Checkpointing Stream-lib: https://github.com/clearspring/stream-lib http://highlyscalable.wordpress. com/2012/05/01/probabilistic-structures-web- analytics-data-mining/
  • 8. @numbakrrunch Joining Data ● Value of data increases with higher dimensionality ○ Geo, user profile, page attributes, external data ● Join and de-normalize data when you ingest ○ Disk is cheap ● Join your data in client-side storage ○ Browsers as a lossy distributed database ● Oceans of data in the cloud.. “The value is in the join” (or something like that) https://github.com/stewartoallen
  • 9. @numbakrrunch Sharding and Sampling ● Choose your shard keys wisely ○ High cardinality field to reduce lumpiness ○ What do you need to co-locate ○ Storage is cheap, multiple copies? ● Shards also useful for sampling ○ Complete data subsets ● Can yield statistical significance ○ Depending on the question
  • 10. Deployment ● Continuous Deploy? ● Deploying our javascript costs $3k ○ Have to invalidate 1.4B browser caches ○ Several hours to flush to browsers (clench) ● 2PB of CDN data served per month ● Have DDOSed ourselves ○ Very interesting bugs ● Simulation is weak ○ The internet is a dirty place ○ Embrace incremental deploys
  • 11. @numbakrrunch The Log Jay Kreps: “Real-time data’s unifying abstraction” ● Centralized logging ● Loosely coupled consumers Divide your dependencies: ● Synchronous - 0mq ● Asynchronous - Kafka Distributed event logging ● Does determinism matter? Log format durability? ● Protobuf? http://bit.ly/thelog
  • 12. @numbakrrunch Columnar Compression ● Columnar storage techniques for row data ● Better compressor efficiency ● Different compressors per column ● >20% size savings ● https://github.com/addthis/columncompressor ○ by @abramsm Time IP UID URL Geo Time IP UID URL Geo Input Data Stored Data Block Size
  • 13. @numbakrrunch Tunable QoS Cassandra URL Store ● We scrape and classify 20M URLs/day ● 750 million active records ● 2.2B reads/day ● Variable cache TTLs ○ Depending on write rate per record ● Global TTL knob ○ Turn up to reduce load for maintenance ○ Turn down to improve responsiveness 6 CDN cache
  • 14. Hydra Our custom processing system Optimized for real-time data Just open sourced: https://github.com/addthis/hydra Go see @csby’s talk Great Hall North @3:55pm
  • 15. @numbakrrunch Summary ● Are you more like the post office or the bank? ● Look for good-enough answers ● Fight your nerd tendency for perfect ○ I’m still struggling with this