Data Lessons Learned at Scale


Charlie Reverte, VP of Engineering at AddThis, discusses lessons learned from processing large-scale web data. AddThis processes data from 14 million domains, including 100 billion monthly page views and 50,000 events per second. Reverte outlines challenges around distributed ID generation, counting unique values, joining distributed data, sampling large datasets, and deploying systems that invalidate over 1.4 billion browser caches. He advocates for loose coupling between systems using approaches like Kafka for asynchronous event logging. Reverte also discusses techniques for columnar compression, tunable quality of service, and open sourcing Hydra, AddThis' custom processing system optimized for real-time data.

Charlie Reverte
VP Engineering
@numbakrrunch
Data Lessons Learned at Scale

Topic
Half of the work that it takes to do data science is
plumbing and wrangling
Here are some lessons we’ve learned.

About AddThis
We make tools for websites

Our Data
We process website data
● Visitation
● Sharing
● Following
● Content Classification
And use it to improve the site
● Content Recommendation
● Personalization
● Analytics

At Scale...
● 14 million domains
● 100 billion views/month
● 50k events/sec
● 160k concurrent firewall sessions
● 500k unique ganglia metrics

Distributed ID Generation
● Session IDs are generated in the browser
● We concatenate time and a random value
Hex: 4f6934b6f54bd7c1
Base64: T2k0to403VS
● Time-bounded probabilistic uniqueness
○ C(m, 2) / n ≈ 0.142 collisions/sec (m = 35k rq/sec, n = 2^32 random values)
● Naturally time ordered, built-in DoB
Compare to Twitter Snowflake
https://github.com/twitter/snowflake/
Bit layout: time in bits 63-32, rand in bits 31-0
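
The time-plus-random scheme and its collision estimate can be sketched as follows. This is a minimal illustration in Python, not the browser code AddThis runs: the 32/32 bit split follows the layout on the slide, while the function names and the use of `secrets` are assumptions made for the example. The random space of 2^32 is inferred from the 0.142 figure, which it reproduces at 35k requests per second.

```python
import secrets
import time
from math import comb

RAND_BITS = 32  # low half of the ID, per the 63..32 / 31..0 layout


def make_session_id(now=None):
    """Return a 64-bit ID: Unix seconds in the high 32 bits, random low 32 bits."""
    seconds = int(time.time() if now is None else now) & 0xFFFFFFFF
    return (seconds << RAND_BITS) | secrets.randbits(RAND_BITS)


def id_timestamp(session_id):
    """Recover the creation time (the built-in "date of birth") from an ID."""
    return session_id >> RAND_BITS


def expected_collisions_per_sec(ids_per_sec, rand_bits=RAND_BITS):
    """Birthday-style estimate: ID pairs created in one second / random space."""
    return comb(ids_per_sec, 2) / 2**rand_bits


sid = make_session_id()
print(f"{sid:016x}")                         # 16 hex digits, time first
print(expected_collisions_per_sec(35_000))   # about 0.1426
```

Because the timestamp occupies the high bits, sorting IDs numerically also sorts them by creation second, which is what makes them naturally time ordered.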

Counting Things
● Cardinality
● Set membership
● Top-k elements
● Frequency
● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus
○ Distributed counting
○ Checkpointing
Stream-lib: https://github.com/clearspring/stream-lib
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
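
To make "estimate when possible" and "mergeability" concrete, here is a small cardinality sketch. It is a K-minimum-values estimator written in Python for illustration only: it is not stream-lib's API (stream-lib is a Java library), and the class name, the default `k`, and the choice of SHA-1 as the hash are assumptions.

```python
import hashlib
import heapq


class KMVSketch:
    """K-minimum-values sketch: remember only the k smallest hash values seen."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # the kept hashes, for duplicate detection

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)

    def add(self, item):
        self._add_hash(self._hash(item))

    def _add_hash(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other):
        """Combine two sketches built on different machines or checkpoints."""
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._add_hash(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))   # fewer than k distinct items: exact
        return (self.k - 1) / -self._heap[0]


# Two workers count overlapping streams, then their sketches are merged.
a, b = KMVSketch(), KMVSketch()
for i in range(0, 6000):
    a.add(f"user{i}")
for i in range(4000, 10000):
    b.add(f"user{i}")
print(round(a.merge(b).estimate()))  # close to 10000 distinct users
```

Memory stays bounded by `k` regardless of stream length, and the merge step is what allows distributed counting and checkpointing: each partial sketch can be stored and combined later without revisiting the raw events.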

Sharding and Sampling
● Choose your shard keys wisely
○ High cardinality field to reduce lumpiness
○ What do you need to co-locate
○ Storage is cheap, multiple copies?
● Shards also useful for sampling
○ Complete data subsets
● Can yield statistical significance
○ Depending on the question

Deployment
● Continuous Deploy?
● Deploying our JavaScript costs $3k
○ Have to invalidate 1.4B browser caches
○ Several hours to flush to browsers (clench)
● 2PB of CDN data served per month
● Have DDoSed ourselves
○ Very interesting bugs
● Simulation is weak
○ The internet is a dirty place
○ Embrace incremental deploys

The Log
Jay Kreps: “Real-time data’s unifying abstraction”
● Centralized logging
● Loosely coupled consumers
Divide your dependencies:
● Synchronous - 0mq
● Asynchronous - Kafka
Distributed event logging
● Does determinism matter?
Log format durability?
● Protobuf?
http://bit.ly/thelog
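
The idea of a central log with loosely coupled consumers can be sketched in a few lines. This is an in-memory Python toy, not Kafka or its client API; the `Log` and `Consumer` names are assumptions for the example. The point it shows is that producers only append, while each consumer tracks its own offset and reads at its own pace.

```python
class Log:
    """Append-only event log; consumers pull from offsets they track themselves."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        self._entries.append(event)
        return len(self._entries) - 1  # offset of the new entry

    def read(self, offset, max_events=100):
        return self._entries[offset:offset + max_events]


class Consumer:
    """Reads the log independently of the producer and of other consumers."""

    def __init__(self, log, handler):
        self.log, self.handler, self.offset = log, handler, 0

    def poll(self):
        batch = self.log.read(self.offset)
        for event in batch:
            self.handler(event)
        self.offset += len(batch)
        return len(batch)


log = Log()
counts, archive = {}, []
counter = Consumer(log, lambda e: counts.update({e["type"]: counts.get(e["type"], 0) + 1}))
archiver = Consumer(log, archive.append)

log.append({"type": "view", "url": "/a"})
log.append({"type": "share", "url": "/a"})
counter.poll()                      # the counter is caught up
log.append({"type": "view", "url": "/b"})
archiver.poll()                     # the archiver starts late and still sees everything
counter.poll()
print(counts, len(archive))
```

A slow or offline consumer does not block the producer or the other consumers, which is the asynchronous half of the dependency split described above.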

Columnar Compression
● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● https://github.com/addthis/columncompressor
○ by @abramsm
[Diagram: Input Data rows of (Time, IP, UID, URL, Geo) become Stored Data grouped by column (Time, IP, UID, URL, Geo) within each block of a given Block Size]
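
The block-wise pivot can be illustrated with a short Python sketch. It is not the columncompressor implementation (that project is Java); the field names, the JSON column encoding, and the codec assigned to each column are all assumptions chosen to show the shape of the technique.

```python
import bz2
import json
import zlib

# Hypothetical per-column codec choice; the real project may pick differently.
CODECS = {"zlib": (zlib.compress, zlib.decompress), "bz2": (bz2.compress, bz2.decompress)}
COLUMN_CODEC = {"time": "zlib", "ip": "zlib", "uid": "zlib", "url": "bz2", "geo": "zlib"}
FIELDS = list(COLUMN_CODEC)


def encode_block(rows):
    """Pivot a block of row dicts into columns and compress each column separately."""
    block = {}
    for field in FIELDS:
        column = json.dumps([row[field] for row in rows]).encode("utf-8")
        compress, _ = CODECS[COLUMN_CODEC[field]]
        block[field] = compress(column)
    return block


def decode_block(block):
    """Decompress each column and zip the values back into row dicts."""
    columns = []
    for field in FIELDS:
        _, decompress = CODECS[COLUMN_CODEC[field]]
        columns.append(json.loads(decompress(block[field]).decode("utf-8")))
    return [dict(zip(FIELDS, values)) for values in zip(*columns)]


# Synthetic rows for illustration only.
rows = [
    {"time": 1332294838 + i, "ip": f"10.0.{i % 4}.{i % 250}", "uid": f"{i % 50:016x}",
     "url": f"http://example.com/page/{i % 20}", "geo": "US"}
    for i in range(1000)
]
block = encode_block(rows)
assert decode_block(block) == rows  # lossless round trip
print({field: len(data) for field, data in block.items()})  # compressed bytes per column
```

Grouping similar values together gives each compressor a more uniform input, and the block size bounds how many rows must be decoded to read any one record. Actual savings depend on the data, so measure on real logs rather than on synthetic rows like these.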

Tunable QoS
Cassandra URL Store
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
○ Depending on write rate per record
● Global TTL knob
○ Turn up to reduce load for maintenance
○ Turn down to improve responsiveness
[Diagram label: CDN cache]
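
One way to picture the variable TTLs and the global knob is a cache whose per-record TTL is multiplied by a single tunable factor. The Python below is a sketch under assumptions: the class, the `base_ttl` policy, and its constants are hypothetical and do not describe the actual AddThis URL store or its Cassandra setup.

```python
import time


class TunableTTLCache:
    """Cache whose per-record TTL is scaled by one global multiplier."""

    def __init__(self, loader, clock=time.monotonic):
        self.loader = loader          # fetches a record from the backing store
        self.clock = clock
        self.global_ttl_scale = 1.0   # the knob: >1 sheds load, <1 is fresher
        self._entries = {}            # key -> (value, base_ttl, stored_at)

    @staticmethod
    def base_ttl(writes_per_hour):
        """Hypothetical policy: frequently written records get shorter TTLs."""
        return max(60.0, 3600.0 / (1.0 + writes_per_hour))

    def get(self, key, writes_per_hour=0.0):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None:
            value, ttl, stored_at = entry
            # The knob is applied at read time, so turning it takes effect immediately.
            if now - stored_at < ttl * self.global_ttl_scale:
                return value
        value = self.loader(key)
        self._entries[key] = (value, self.base_ttl(writes_per_hour), now)
        return value


cache = TunableTTLCache(loader=lambda url: {"url": url, "category": "news"})
cache.get("http://example.com/a", writes_per_hour=5)
cache.global_ttl_scale = 4.0   # maintenance window: serve cached records longer
cache.global_ttl_scale = 0.5   # back to normal, favouring freshness
```

Applying the multiplier when an entry is read, rather than when it is stored, means the operator can trade freshness for backend load without flushing or rewriting the cache.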

Hydra
Our custom processing system
Optimized for real-time data
Just open sourced:
https://github.com/addthis/hydra
Go see @csby’s talk
Great Hall North @3:55pm

Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency for perfect
○ I’m still struggling with this

Questions?
@numbakrrunch
Slides: http://bit.ly/datalessons


FIDO Alliance Osaka Seminar: Overview.pdf

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...

Securing your Kubernetes cluster_ a step-by-step guide to success !

Epistemic Interaction - tuning interfaces to provide information for AI support

The Future of Platform Engineering

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...

DevOps and Testing slides at DASA Connect

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality

State of ICS and IoT Cyber Threat Landscape Report 2024 preview

Leading Change strategies and insights for effective change management pdf 1.pdf

To Graph or Not to Graph Knowledge Graph Architectures and LLMs

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf

FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...

Assuring Contact Center Experiences for Your Customers With ThousandEyes

Elevating Tactical DDD Patterns Through Object Calisthenics

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...

Data Lessons Learned at Scale

1. Charlie Reverte VP Engineering @numbakrrunch Data Lessons Learned at Scale

2. @numbakrrunch Topic Half of the work that it takes to do data science is plumbing and wrangling Here are some lessons we’ve learned..

3. @numbakrrunch About AddThis We make tools for websites

4. @numbakrrunch Our Data We process website data ● Visitation ● Sharing ● Following ● Content Classification And use it to improve the site ● Content Recommendation ● Personalization ● Analytics

5. @numbakrrunch At Scale... ● 14 million domains ● 100 billion views/month ● 50k events/sec ● 160k concurrent firewall sessions ● 500k unique ganglia metrics

6. @numbakrrunch Distributed ID Generation ● Session IDs are generated in the browser ● We concatenate time and a random value Hex: 4f6934b6f54bd7c1 Base64: T2k0to403VS ● Time-bounded probabilistic uniqueness ○ $\binom{m}{2} / n = 0.142$ collisions/sec (at 35k rq/sec, with m = IDs generated per second and n = $2^{32}$ possible random values) ● Naturally time ordered, built-in DoB Compare to Twitter Snowflake https://github.com/twitter/snowflake/ [Figure: 64-bit ID layout, time in bits 63-32, rand in bits 31-0]
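The ID scheme on slide 6 can be sketched in a few lines of Python. This is a minimal illustration, not AddThis' actual browser code: it assumes the high 32 bits hold Unix seconds and the low 32 bits hold a random value (which matches the sample hex ID and the 0.142 collisions/sec figure), and the function names are hypothetical.

```python
import math
import secrets
import time

RAND_BITS = 32  # assumption: low 32 bits random, high 32 bits Unix seconds


def make_session_id(now=None):
    """Return a 64-bit ID: Unix seconds in bits 63..32, random value in bits 31..0."""
    seconds = int(time.time() if now is None else now) & 0xFFFFFFFF
    return (seconds << RAND_BITS) | secrets.randbits(RAND_BITS)


def id_timestamp(session_id):
    """Recover the creation time (the built-in 'date of birth') from an ID."""
    return session_id >> RAND_BITS


def to_hex(session_id):
    """Render the ID as 16 hex digits, time first, so IDs sort by creation time."""
    return format(session_id, "016x")


def expected_collisions_per_second(ids_per_second, rand_bits=RAND_BITS):
    """Birthday-style estimate: C(m, 2) pairs, each colliding with probability 1/n."""
    return math.comb(ids_per_second, 2) / (1 << rand_bits)
```

Because the timestamp occupies the high bits, uniqueness only has to hold among IDs created within the same second, which is why the collision estimate is stated per second.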

7. @numbakrrunch Counting Things ● Cardinality ● Set membership ● Top-k elements ● Frequency ● Estimate when possible ● Sample when possible ● Often streaming vs. batch ● Mergeability is a big plus ○ Distributed counting ○ Checkpointing Stream-lib: https://github.com/clearspring/stream-lib http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
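To make "estimate when possible" and "mergeability" concrete, here is a small K-minimum-values (KMV) cardinality sketch in Python. It is an illustration only: it is not the stream-lib API, and the class name, default `k`, and hash choice are assumptions made for this example.

```python
import hashlib


class KMVSketch:
    """Estimate the number of distinct values by keeping the k smallest hashes."""

    def __init__(self, k=256):
        self.k = k
        self.mins = set()  # the k smallest normalized hash values seen so far

    @staticmethod
    def _hash(value):
        # Map any value to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, value):
        h = self._hash(value)
        if h in self.mins:
            return
        if len(self.mins) < self.k:
            self.mins.add(h)
            return
        largest = max(self.mins)
        if h < largest:
            self.mins.remove(largest)
            self.mins.add(h)

    def merge(self, other):
        # The k smallest hashes of a union are the k smallest of the two sketches combined.
        merged = KMVSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # fewer than k distinct values: exact count
        return (self.k - 1) / max(self.mins)
```

Each node can build its own sketch over its share of the stream and ship only `k` numbers; merging the sketches gives the same result as a single sketch over all the data, which is what makes distributed counting and checkpointing cheap.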

8. @numbakrrunch Joining Data ● Value of data increases with higher dimensionality ○ Geo, user profile, page attributes, external data ● Join and de-normalize data when you ingest ○ Disk is cheap ● Join your data in client-side storage ○ Browsers as a lossy distributed database ● Oceans of data in the cloud.. “The value is in the join” (or something like that) https://github.com/stewartoallen

9. @numbakrrunch Sharding and Sampling ● Choose your shard keys wisely ○ High cardinality field to reduce lumpiness ○ What do you need to co-locate ○ Storage is cheap, multiple copies? ● Shards also useful for sampling ○ Complete data subsets ● Can yield statistical significance ○ Depending on the question
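A minimal sketch of how shards double as samples, assuming events are dictionaries and the shard key is a high-cardinality string field such as a user ID. The function names and the choice of MD5 as the hash are hypothetical, chosen only for illustration.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a shard key to a shard number with a stable hash (same key, same shard)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def sample_by_shard(events, key_field, num_shards, keep_shards):
    """Keep every event whose key lands in one of the chosen shards.

    Each key is either fully included or fully excluded, so the sample is a
    complete data subset for the keys it contains.
    """
    keep = set(keep_shards)
    return [e for e in events if shard_for(e[key_field], num_shards) in keep]
```

For example, `sample_by_shard(events, "uid", 10, [0])` keeps roughly a tenth of the users together with all of their events, which suits per-user questions better than dropping individual events at random.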

10. Deployment ● Continuous Deploy? ● Deploying our javascript costs $3k ○ Have to invalidate 1.4B browser caches ○ Several hours to flush to browsers (clench) ● 2PB of CDN data served per month ● Have DDOSed ourselves ○ Very interesting bugs ● Simulation is weak ○ The internet is a dirty place ○ Embrace incremental deploys

11. @numbakrrunch The Log Jay Kreps: “Real-time data’s unifying abstraction” ● Centralized logging ● Loosely coupled consumers Divide your dependencies: ● Synchronous - 0mq ● Asynchronous - Kafka Distributed event logging ● Does determinism matter? Log format durability? ● Protobuf? http://bit.ly/thelog

12. @numbakrrunch Columnar Compression ● Columnar storage techniques for row data ● Better compressor efficiency ● Different compressors per column ● >20% size savings ● https://github.com/addthis/columncompressor ○ by @abramsm [Figure: within each block, input rows of (Time, IP, UID, URL, Geo) are regrouped by column in the stored data]
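The idea of applying columnar techniques to a block of row data can be sketched as follows. This is not the columncompressor API: it is a simplified illustration that assumes every field is a string without newlines and uses zlib for every column, whereas the slide describes choosing a different compressor per column.

```python
import zlib


def compress_block_columnar(rows):
    """Transpose a block of rows into columns and compress each column separately."""
    columns = list(zip(*rows))
    return [zlib.compress("\n".join(col).encode("utf-8")) for col in columns]


def decompress_block_columnar(blobs):
    """Invert compress_block_columnar, returning the original rows as tuples."""
    columns = [zlib.decompress(b).decode("utf-8").split("\n") for b in blobs]
    return [tuple(row) for row in zip(*columns)]
```

Grouping similar values (all timestamps together, all URLs together) tends to give the compressor longer repeated runs than interleaved rows do; the actual savings depend on the data and on the block size.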

13. @numbakrrunch Tunable QoS Cassandra URL Store ● We scrape and classify 20M URLs/day ● 750 million active records ● 2.2B reads/day ● Variable cache TTLs ○ Depending on write rate per record ● Global TTL knob ○ Turn up to reduce load for maintenance ○ Turn down to improve responsiveness 6 CDN cache
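One way to picture the variable TTLs and the global knob is the sketch below. The formula, parameter names, and default values are all hypothetical; the slides only state that TTL depends on each record's write rate and that a global multiplier can be turned up or down.

```python
def cache_ttl_seconds(writes_per_hour, global_knob=1.0,
                      base_ttl=3600.0, min_ttl=60.0, max_ttl=86400.0):
    """Return a cache TTL that shrinks as a record's write rate grows.

    global_knob > 1 lengthens every TTL (less backend load, e.g. during maintenance);
    global_knob < 1 shortens every TTL (fresher data, more backend load).
    """
    ttl = base_ttl / (1.0 + writes_per_hour)  # frequently written records expire sooner
    ttl *= global_knob                        # one multiplier applied to all records
    return max(min_ttl, min(max_ttl, ttl))    # clamp to sane bounds
```

With these assumed defaults, a record written once an hour gets a 30-minute TTL, and turning the knob to 2.0 doubles that to an hour without touching per-record settings.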

14. Hydra Our custom processing system Optimized for real-time data Just open sourced: https://github.com/addthis/hydra Go see @csby’s talk Great Hall North @3:55pm

15. @numbakrrunch Summary ● Are you more like the post office or the bank? ● Look for good-enough answers ● Fight your nerd tendency for perfect ○ I’m still struggling with this

16. Questions? @numbakrrunch Slides: http://bit.ly/datalessons
