Fixing Twitter and Finding your own Fail Whale document discusses Twitter operations. The Twitter operations team focuses on software performance, availability, capacity planning, and configuration management using metrics, logs, and science. They use a dedicated managed services team and run their own servers instead of cloud services. The document outlines Twitter's rapid growth and challenges in maintaining performance. It discusses strategies for monitoring, analyzing metrics to find weak points, deploying changes, and improving processes through configuration management and peer reviews.
The Secrets of Building Realtime Big Data Systemsnathanmarz
The architectural principles behind building systems that scale to vast amounts of data and operate on that data in realtime.
Presented at POSSCON '11.
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scaleLinkedIn
This is one of two presentations given by LinkedIn engineers at Java One 2009.
This presentation was given by David Raccah and Dhananjay Ragade from LinkedIn.
For more information, check out http://blog.linkedin.com/
Using Riak for Events storage and analysis at Booking.comDamien Krotkine
At Booking.com, we have a constant flow of events coming from various applications and internal subsystems. This critical data needs to be stored for real-time, medium and long term analysis. Events are schema-less, making it difficult to use standard analysis tools.This presentation will explain how we built a storage and analysis solution based on Riak. The talk will cover: data aggregation and serialization, Riak configuration, solutions for lowering the network usage, and finally, how Riak's advanced features are used to perform real-time data crunching on the cluster nodes.
The Secrets of Building Realtime Big Data Systemsnathanmarz
The architectural principles behind building systems that scale to vast amounts of data and operate on that data in realtime.
Presented at POSSCON '11.
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scaleLinkedIn
This is one of two presentations given by LinkedIn engineers at Java One 2009.
This presentation was given by David Raccah and Dhananjay Ragade from LinkedIn.
For more information, check out http://blog.linkedin.com/
Using Riak for Events storage and analysis at Booking.comDamien Krotkine
At Booking.com, we have a constant flow of events coming from various applications and internal subsystems. This critical data needs to be stored for real-time, medium and long term analysis. Events are schema-less, making it difficult to use standard analysis tools.This presentation will explain how we built a storage and analysis solution based on Riak. The talk will cover: data aggregation and serialization, Riak configuration, solutions for lowering the network usage, and finally, how Riak's advanced features are used to perform real-time data crunching on the cluster nodes.
Webinar: Diagnosing Apache Cassandra Problems in ProductionDataStax Academy
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Viewers will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster.
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...DataStax
When we were preparing for PlayStation4 launch we faced 1 hard problem: how to make our games and videos storage system super-fast, highly available and fault tolerant. Moving from relation database to Cassandra is not easy, and it even harder if you want to support different search and query use cases. Join our talk to learn how we managed to build highly available platform, that supports tens of millions of active users and can execute multiple user specific queries in less than a millisecond. And all of it without using Solr or ElasticSearch.
About the Speaker
Alexander Filipchik, Sony
Alex spent last 4 years of his life building the next generation of PlayStation Network. He is honored to be a part of a small team of engineers who managed to build from scratch a platform that scaled from 0 users to 1 million PS4s in just 1 day and have being landing 1.5 million of new devices per month since and now reached tenth of millions of active users. He is passionate about technology, innovations, walking his dog and building scalable software using Cassandra.
Wordnik's technical co-founder Tony Tam describes the reason for going NoSQL. During his talk Tony will discuss the selection criteria, testing + evaluation and successful, zero-downtime migration to MongoDB. Additionally details on Wordnik's speed and stability will be covered as well as how NoSQL technologies have changed the way Wordnik scales.
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
"This is a technical architect's case study of how Loggly has employed the latest social-media-scale technologies as the backbone ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. This presentation describes design details of how we built a second-generation system fully leveraging AWS services including Amazon Route 53 DNS with heartbeat and latency-based routing, multi-region VPCs, Elastic Load Balancing, Amazon Relational Database Service, and a number of pro-active and re-active approaches to scaling computational and indexing capacity.
The talk includes lessons learned in our first generation release, validated by thousands of customers; speed bumps and the mistakes we made along the way; various data models and architectures previously considered; and success at scale: speeds, feeds, and an unmeltable log processing engine."
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataLightbend
In a recent survey, 90% of over 2400 developers reported having at least some real-time functionality in their systems. Enterprises are realizing that the ability to extract value from streaming data in near real-time is the new competitive advantage.
Two technologies–Akka Streams and Kafka Streams–have emerged as popular tools to use with Apache Kafka for addressing the shared requirements of availability, scalability, and resilience for both streaming microservices and Fast Data. So which one should you use for specific use cases?
Kafka and Storm - event processing in realtimeGuido Schmutz
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. It is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Storm is a distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. This session presents the main concepts of Kafka and Storm and then shows how a simple stream processing application is implemented using these two technologies.
Abstract:
Reactive applications need to be able to respond to demand, be elastic and ready to scale up, down, in and out—taking full advantage of mobile, multi-core and cloud computing architectures—in real time.
In this talk we will discuss the guiding principles making this possible through the use of share-nothing and non-blocking designs, applied all the way down the stack. We will learn how to deliver systems that provide reactive supply to changing demand.
I gave this talk at React Conf 2014 in London. Recording available here: https://www.youtube.com/watch?v=mBFdj7w4aFA
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...confluent
We use machine learning to delve deep into the internals of how systems like Kafka work. In this talk I'll dive into what variables affect performance and reliability, including previously unknown leading indicators of major performance problems, failure conditions and how to tune for specific use cases. I'll cover some of the specific methodology we use, including Bayesian optimization, and reinforcement learning. I'll also talk about our own internal infrastructure that makes heavy use of Kafka and Kubernetes to deliver real-time predictions to our customers.
Webinar: Diagnosing Apache Cassandra Problems in ProductionDataStax Academy
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Viewers will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster.
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...DataStax
When we were preparing for PlayStation4 launch we faced 1 hard problem: how to make our games and videos storage system super-fast, highly available and fault tolerant. Moving from relation database to Cassandra is not easy, and it even harder if you want to support different search and query use cases. Join our talk to learn how we managed to build highly available platform, that supports tens of millions of active users and can execute multiple user specific queries in less than a millisecond. And all of it without using Solr or ElasticSearch.
About the Speaker
Alexander Filipchik, Sony
Alex spent last 4 years of his life building the next generation of PlayStation Network. He is honored to be a part of a small team of engineers who managed to build from scratch a platform that scaled from 0 users to 1 million PS4s in just 1 day and have being landing 1.5 million of new devices per month since and now reached tenth of millions of active users. He is passionate about technology, innovations, walking his dog and building scalable software using Cassandra.
Wordnik's technical co-founder Tony Tam describes the reason for going NoSQL. During his talk Tony will discuss the selection criteria, testing + evaluation and successful, zero-downtime migration to MongoDB. Additionally details on Wordnik's speed and stability will be covered as well as how NoSQL technologies have changed the way Wordnik scales.
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
"This is a technical architect's case study of how Loggly has employed the latest social-media-scale technologies as the backbone ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. This presentation describes design details of how we built a second-generation system fully leveraging AWS services including Amazon Route 53 DNS with heartbeat and latency-based routing, multi-region VPCs, Elastic Load Balancing, Amazon Relational Database Service, and a number of pro-active and re-active approaches to scaling computational and indexing capacity.
The talk includes lessons learned in our first generation release, validated by thousands of customers; speed bumps and the mistakes we made along the way; various data models and architectures previously considered; and success at scale: speeds, feeds, and an unmeltable log processing engine."
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataLightbend
In a recent survey, 90% of over 2400 developers reported having at least some real-time functionality in their systems. Enterprises are realizing that the ability to extract value from streaming data in near real-time is the new competitive advantage.
Two technologies–Akka Streams and Kafka Streams–have emerged as popular tools to use with Apache Kafka for addressing the shared requirements of availability, scalability, and resilience for both streaming microservices and Fast Data. So which one should you use for specific use cases?
Kafka and Storm - event processing in realtimeGuido Schmutz
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. It is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Storm is a distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. This session presents the main concepts of Kafka and Storm and then shows how a simple stream processing application is implemented using these two technologies.
Abstract:
Reactive applications need to be able to respond to demand, be elastic and ready to scale up, down, in and out—taking full advantage of mobile, multi-core and cloud computing architectures—in real time.
In this talk we will discuss the guiding principles making this possible through the use of share-nothing and non-blocking designs, applied all the way down the stack. We will learn how to deliver systems that provide reactive supply to changing demand.
I gave this talk at React Conf 2014 in London. Recording available here: https://www.youtube.com/watch?v=mBFdj7w4aFA
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...confluent
We use machine learning to delve deep into the internals of how systems like Kafka work. In this talk I'll dive into what variables affect performance and reliability, including previously unknown leading indicators of major performance problems, failure conditions and how to tune for specific use cases. I'll cover some of the specific methodology we use, including Bayesian optimization, and reinforcement learning. I'll also talk about our own internal infrastructure that makes heavy use of Kafka and Kubernetes to deliver real-time predictions to our customers.
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInLinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn. This was a presentation made at QCon 2009 and is embedded on LinkedIn's blog - http://blog.linkedin.com/
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...Javier García Magna
Good technical practices you can follow with (micro)services but can be applied to almost anything: discovery (microphone/consul), security, resilience (polly), composition, ssecurity (jwt/oauth2)... And then an example with a CQRS application, and how docker can be used in Windows 2016. Lastly a brief summary of what Service Fabric is and its programming models.
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownDataStax
A brief intro to how Barracuda Networks uses Cassandra and the ways in which they are replacing their MySQL infrastructure, with Cassandra. This presentation will include the lessons they've learned along the way during this migration.
Speaker: Michael Kjellman, Software Engineer at Barracuda Networks
Michael Kjellman is a Software Engineer, from San Francisco, working at Barracuda Networks. Michael works across multiple products, technologies, and languages. He primarily works on Barracuda's spam infrastructure and web filter classification data.
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionDataStax Academy
This sessions covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster. This talk is intended for people with a general understanding of Cassandra, but it not required to have experience running it in production.
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionDataStax Academy
Speaker(s): Jon Haddad, Apache Cassandra Evangelist at DataStax
This sessions covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster. This talk is intended for people with a general understanding of Cassandra, but it not required to have experience running it in production.
Cassandra Day London 2015: Diagnosing Problems in ProductionDataStax Academy
Speaker(s): Jon Haddad, Apache Cassandra Evangelist at DataStax
This sessions covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Attendees will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster. This talk is intended for people with a general understanding of Cassandra, but it not required to have experience running it in production.
Scalable and Reliable Logging at PinterestKrishna Gade
At Pinterest, hundreds of services and third-party tools that are implemented in various programming languages generate billions of events every day. To achieve scalable and reliable low latency logging, there are several challenges: (1) uploading logs that are generated in various formats from tens of thousands of hosts to Kafka in a timely manner; (2) running Kafka reliably on Amazon Web Services where the virtual instances are less reliable than on-premises hardware; (3) moving tens of terabytes data per day from Kafka to cloud storage reliably and efficiently, and guaranteeing exact one time persistence per message.
In this talk, we will present Pinterest’s logging pipeline, and share our experience addressing these challenges. We will dive deep into the three components we developed: data uploading from service hosts to Kafka, data transportation from Kafka to S3, and data sanitization. We will also share our experience in operating Kafka at scale in the cloud.
Diagnosing Problems in Production - CassandraJon Haddad
This presentation covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Readers will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster. This presentation is intended for people with a general understanding of Cassandra, but it not required to have experience running it in production.
Introduction to memcached, a caching service designed for optimizing performance and scaling in the web stack, seen from perspective of MySQL/PHP users. Given for 2nd year students of professional bachelor in ICT at Kaho St. Lieven, Gent.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Fixing twitter
1. Fixing Twitter
... and Finding your own Fail Whale
John Adams
Twitter Operations
<jna@twitter.com>
2. Operations
• Small team, growing rapidly.
• What do we do?
• Software Performance (back-end)
• Availability
• Capacity Planning (metrics-driven)
• Configuration Management
• We don’t deal with the physical plant.
3. Managed Services
• Dedicated team (NTTA)
• 24/7 Hands on remote support
• No clouds. We tried that!
• Need raw processing power, latency too
high in existing cloud offerings
• Frees us to deal with real, intellectual,
computer science problems.
4. 752%
2008 Growth
5
3.75
2.5
1.25
0
Dec 07 Feb 08 Apr 08 Jun 08 Aug 08 Oct 08 Dec 08
Unique Visitors (in Millions)
13. Monitoring
• Graph and report critical metrics in as near
real time as possible
• You already have the tools.
• RRD
• Ganglia + custom gMetric scripts
• MRTG
14. Dashboards
• “Criticals” view
• Smokeping/MRTG
• Google Analytics
• Not just for
HTTP 200s/SEO
• XML Feeds from
managed services
• Data Porn!
15. Analyze
• Turn data into information
• Where is the code base going?
• Are things worse than they were?
• Understand the impact of the last
software deploy
• Run check scripts during and after
deploys
• Capacity Planning, not Fire Fighting!
16. Forecasting Curve-fitting for capacity planning
(R, fityk, Mathematica, CurveFit)
unsigned int (32 bit)
Twitpocolypse
status_id
signed int (32 bit)
Twitpocolypse
r2=0.99
17. Deploys
• Graph time-of-deploy along side server
CPU and Latency
• Display time-of-last-deploy on dashboard
last deploy times
18. Whale-Watcher
• Simple shell script,
• MASSIVE WIN.
• Whale = HTTP 503 (timeout)
• Robot = HTTP 500 (error)
• Examines last 100,000 lines of aggregated
daemon / www logs
• “Whales per Second” > Wthreshold
• Thar be whales! Call in ops.
20. Feature “Darkmode”
• Specific site controls to enable and
disable computationally or IO-Heavy site
function
• The “Emergency Stop” button
• Changes logged and reported to all teams
• Around 60 switches we can throw
• Static / Read-only mode
21. Configuration
Management
• Start automated configuration management
EARLY in your company.
• Don’t wait until it’s too late.
• Twitter started within the first few months.
22. Configuration
Management
• Complex Environment
• Multiple Admins
• Unknown Interactions
• Solution: 2nd set of eyes.
24. Reviewboard
www.review-board.org
• SVN pre-commit hook causes a failure if
the log message doesn’t include
‘reviewed’
• SVN post-commit hook informs people
what changed via email
• Watches the entire SVN tree
27. Many limiting factors in the request pipeline
Apache Rails
MPM Model (mongrel)
MaxClients 2:1 oversubscribed
TCP Listen queue depth
to cores
Memcached
# connections
MySQL
Varnish (search) # db connections
# threads
29. CPU: More with Less
• Reduction in 40% of CPU by replacing dual
and quad core machines with 8 core
• Switching from AMD to Intel Xeon = 30%
gain
• Saved data center space, power, cost per
month.
• Not the best option if you own machines.
Capital expenditure = hard to realize new
technology gains.
30. Rails
• Stop blaming Rails.
• Analysis found:
• Caching + Cache invalidation problems
• Bad queries generated by ActiveRecord,
resulting in slow queries against the db
• Queue Latency
• Memcache / Page Cache Corruption
• Replication Lag
31. Disk is the new Tape.
• Social Networking application profile has
many O(ny) operations.
• Page requests have to happen in < 500mS
or users start to notice. Goal: 250-300mS
• Web 2.0 isn’t possible without lots of RAM
• What to do?
32. Caching
• We’re the real-time web, but lots of caching
opportunity
• Most caching strategies rely on long TTLs
(>60 s)
• Separate memcache pools for different data
types to prevent eviction
• Optimize Ruby Gem to libmemcached +
FNV Hash instead of Ruby + MD5
• Twitter now largest contributor to
libmemcached
33. Caching 50% decrease in load with Native C
gem + libmemcached
34. Cache Money!
• Active Record Plugin
• Cache when reading from the DB
• Cache when writing to the DB
• Transparently provides caching
• Removes need for set/get cache code
• Open Source!
35. Caching
• “Cache Everything!” not the best policy
• Invalidating caches at the right time is
difficult.
• Cold Cache problem
• Network Memory Bus != Infinite
36. Memcached
• memcached isn’t perfect.
• Memcached SEGVs hurt us early on.
• Evictions make the cache unreliable for
important configuration data
(loss of darkmode flags, for example)
• Data and Hash Corruption (even in 1.2.6)
• Exposed corruption issue with specific
inputs causing SEGV and unexpected
behavior
37. API + Caching (search)
• Cache and control abusive clients
• Varnish between two Apache Virtual Hosts
(failover to another backend if Varnish
dies)
• Remove Cache busting query strings before
applying hash algorithm
• Using ESI to cache jQuery requests when
specifying a callback= parameter - big win.
38. Relational Databases
not a Panacea
• Good for:
• Users, Relational Data, Transactions
• Bad:
• Queues. Polling operations. Caching.
• You don’t need ACID for everything.
• Enter the message queue...
39. Queues
• Many message queue solutions on the
market
• At high loads, most perform poorly when
used in ‘durable’ mode.
• Erlang based queues work well
(RabbitMQ), but you need in house Erlang
experience.
• We wrote our own.
• Kestrel to the rescue!
40. Kestrel
Falco tinnunculus
• Works like memcache (same protocol)
• SET = enqueue | GET = dequeue
• No strict ordering of jobs
• No shared state between servers
• Written in Scala.
41. Asynchronous
Requests
• Inbound traffic consumes a mongrel
• Outbound traffic consumes a mongrel
• The request pipeline should not be used to
handle 3rd party communications or
back-end work.
• Daemons, Daemons, Daemons.
42. Don’t make services
dependent
• Move operations out of the synchronous
request cycle
• Email
• Complex object generation (timelines)
• 3rd party services (bit.ly, sms, etc.)
43. Daemons
• Many different types at Twitter.
• # of daemons have to match the workload
• Early Kestrel would crash if queues filled
• “Seppaku” patch
• Kill daemons after n requests
• Long-running daemons = low memory
44. MySQL Challenges
• Replication Delay
• Single threaded. Slow.
• Social Networking not good for RDBMS
• N x N relationships and social graph /
tree traversal
• Sharding importance
• Disk issues (FS Choice, noatime,
scheduling algorithm)
45. MySQL
• Replication delay and cache eviction
produce inconsistent results to the end
user.
• Locks create resource contention for
popular data
46. Database Replication
• Major issues around users and statuses
tables
• Multiple functional masters (FRP, FWP)
• Make sure your code reads and writes to
the write DBs. Reading from master = slow
death
• Monitor the DB. Find slow / poorly
designed queries
• Kill long running queries before they kill
you (mkill)
47. status.twitter.com
• Keep users in the loop, or suffer.
• Hosted on different service (Tumblr)
• No matter how little information you have
available.
48. Key Points
• Databases not always the best store.
• Instrument everything.
• Use metrics to make decisions, not guesses.
• Don’t make services dependent
• Process asynchronously when possible