No bid left behind

The talk covers techniques for building resiliency into a real-time bidding system on the JVM. It describes detecting problems through logging, heartbeats that do real work, and metrics. It then covers recovery: rollbacks and precondition checks for data errors, and circuit breakers, bulkheads, exponential retries, and failover for system integration errors.

No bid left behind
My day-to-day handling a resilient real-time bidding platform in a JVM environment.
Marc de Palol
Trovit

Hey hi,
• Studied here (good to be back)
• Some research on supercomputing
• Moved to London, discovered Hadoop & data-intensive systems.
• Came back, still in the ‘Data Engineering’ stuff.

A classified search engine for property, jobs, cars, products and holiday rentals
• 180 million ads
• 170 TB in the cluster
• 65 million uniques / 170 million visits
• 10 apps (iOS, Android)
• Cool office in Barcelona.
Have a look at http://www.trovit.es

Real Time Bidding
It’s about selling ads.
• Per impression basis.
• Programmatic instantaneous auction

We are using ‘DoubleClick Ad Exchange’ (Google)
• Response under 100 ms.
• If 15% of our responses are invalid or timed out, we progressively stop getting bid requests.
• Currently 10,000 QPS.
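
The response deadline above can be sketched as follows. This is a minimal illustration, not the code from the talk: the class name `DeadlineBidder`, the `NO_BID` placeholder and the 80 ms budget are all assumptions. It assumes Java 9+ for `CompletableFuture.completeOnTimeout`, and that answering with a valid "no bid" is preferable to an invalid or late response.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical names: DeadlineBidder, NO_BID and the 80 ms budget are
// illustrative assumptions, not taken from the talk.
class DeadlineBidder {
    static final String NO_BID = "NO_BID";

    // Runs the bid computation asynchronously and falls back to a valid
    // "no bid" answer if it fails or does not finish within the budget.
    static String bidWithin(Supplier<String> computeBid, long budgetMillis) {
        return CompletableFuture.supplyAsync(computeBid)
                .completeOnTimeout(NO_BID, budgetMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> NO_BID)
                .join();
    }

    // Small helper so the demo can simulate a slow dependency.
    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // A fast computation returns its own bid.
        System.out.println(bidWithin(() -> "BID 0.42", 80));
        // A slow computation is cut off and answered with NO_BID.
        System.out.println(bidWithin(() -> {
            sleep(500);
            return "BID 9.99";
        }, 80));
    }
}
```

The budget is kept below the 100 ms limit to leave room for network time; the right margin depends on the deployment.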

This system, literally, spends money. So, it must be rock solid.
Our system is coded carefully, with love and tests.

Still, sh*t happens.

Resiliency
The ability to recover from unexpected errors.
The ability to sleep at night.

Detect: Monitoring
Recover: Resiliency Patterns
Warn: Notifications

Monitoring, in a sensible way

• Logging with ‘mailAppender’
log4j.appender.mail=org.apache.log4j.net.SMTPAppender
log4j.appender.mail.SMTPHost=localhost
log4j.appender.mail.From=Error <error-bla@trovit.com>
log4j.appender.mail.To=tech@trovit.com, ceo@trovit.com
log4j.appender.mail.Subject=[ERROR] WE ARE GOING TO DIE
log4j.appender.mail.layout=org.apache.log4j.PatternLayout
log4j.appender.mail.threshold=ERROR

Probably no e-mail when you’ve got an OOM (OutOfMemoryError).

Let’s talk about OOM for a minute.
🚫 ps ax | grep java
👍 JVMOpts="-XX:OnOutOfMemoryError=/usr/local/bin/slack-msg.sh"

Some cool ideas for improving memory usage
• byte[] serialization in objects ❗
• Varying Memory Conditions ❗

• Logging with ‘mailAppender’
  • Bad when OOM.
• Heartbeat
  • Doing some real work
• Supervision with actors
  • If you’re using Akka
  • control flow != data flow
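
A heartbeat that does real work could look roughly like the sketch below. It is an assumption about the idea, not the author's implementation: `WorkingHeartbeat` and its probe are hypothetical, and the probe stands in for whatever end-to-end check makes sense (for example, pushing a synthetic bid request through the pipeline). A beat is only recorded when the probe succeeds, so a process that is up but broken stops beating.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: the class name and the probe are illustrative.
class WorkingHeartbeat {
    private final BooleanSupplier probe;
    private final AtomicLong lastSuccessMillis = new AtomicLong(0);

    WorkingHeartbeat(BooleanSupplier probe) {
        this.probe = probe;
    }

    // One beat: run the real-work probe and record the time only on success.
    void beat() {
        try {
            if (probe.getAsBoolean()) {
                lastSuccessMillis.set(System.currentTimeMillis());
            }
        } catch (RuntimeException e) {
            // A failing probe must not kill the scheduler; the missing
            // beat is what the monitoring side will notice.
        }
    }

    // True if a successful beat happened within the allowed silence window.
    boolean isAlive(long maxSilenceMillis) {
        return System.currentTimeMillis() - lastSuccessMillis.get() <= maxSilenceMillis;
    }

    // Schedules beat() at a fixed rate on a dedicated thread.
    ScheduledExecutorService start(long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::beat, 0, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in probe; a real one would exercise the bidding path.
        WorkingHeartbeat heartbeat = new WorkingHeartbeat(() -> "bid".length() == 3);
        ScheduledExecutorService scheduler = heartbeat.start(100);
        Thread.sleep(300);
        System.out.println("alive: " + heartbeat.isAlive(1000));
        scheduler.shutdownNow();
    }
}
```

An external monitor (Nagios, for instance) would poll something equivalent to `isAlive` rather than just checking that the process exists.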

Our Monitoring:
• Nagios.
• Logging (to Sentry)
• Heartbeats with real work.
• graphite comparison

Have graphs.

Now we know that something is going wrong.

Recovery

Bad data in the system
or / and
Errors in the system

Data errors.
Roll back (when possible)
• Keeping different versions in the DB.
• Keep the old version around.
• Know how to do a rollback.
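
As a rough in-memory illustration of keeping the old version around, here is a sketch under stated assumptions: `VersionedValue` is a hypothetical name, and a real system would keep the versions in the database rather than in a deque. The point is only that publishing new data never destroys the previous version, so rolling back is a cheap, rehearsed operation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: keeps the newest version at the head of a deque
// and a bounded number of older versions behind it.
class VersionedValue<T> {
    private final Deque<T> versions = new ArrayDeque<>();
    private final int maxVersions;

    VersionedValue(T initial, int maxVersions) {
        if (maxVersions < 2) {
            throw new IllegalArgumentException("need room for at least one old version");
        }
        this.maxVersions = maxVersions;
        versions.push(initial);
    }

    // The version currently being served.
    synchronized T current() {
        return versions.peek();
    }

    // Publishes a new version and evicts the oldest one if over the limit.
    synchronized void publish(T next) {
        versions.push(next);
        if (versions.size() > maxVersions) {
            versions.removeLast();
        }
    }

    // Drops the current version and goes back to the previous one.
    // Returns false when there is nothing older to roll back to.
    synchronized boolean rollback() {
        if (versions.size() < 2) {
            return false;
        }
        versions.pop();
        return true;
    }

    public static void main(String[] args) {
        VersionedValue<String> campaigns = new VersionedValue<>("campaigns-v1", 3);
        campaigns.publish("campaigns-v2-bad-data");
        campaigns.rollback();
        System.out.println(campaigns.current());
    }
}
```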

Checks & asserts with Google Guava (static imports from com.google.common.base.Preconditions).
checkArgument(i >= 0,
    "Argument was %s but expected nonnegative", i);
checkArgument(i < j,
    "Expected i < j, but %s >= %s", i, j);
checkNotNull(myList,
    "List should not be null");
checkState(object.isValid(),
    "Object is not valid");

System errors
These happen mostly between system integrations.
• Your code and the DB.
• Your code and the 3rd party library.
• Your code and the queue.

DBs, a necessary supervillain
• Lost connection.
• Timeouts
• Can give you corrupted data.
• Can give you 0 data.
• Can give you too much data.

Circuit Breaker and his friend, the Bulkhead Pattern.
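
For readers who have not met the pattern, a circuit breaker can be sketched in a few lines. This is a deliberately simplified version, not the implementation used at Trovit: `SimpleCircuitBreaker`, its thresholds and the injectable clock are assumptions for illustration. After a number of consecutive failures the breaker opens and calls go straight to the fallback; after a cool-down it lets calls through again (half-open) and closes on the first success.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker. In the half-open state it lets
// every caller through until one call finishes, which a production
// implementation would usually restrict to a single trial call.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // Current state; an open breaker becomes half-open after the cool-down.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    // Runs the action unless the breaker is open; failures go to the fallback.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker =
                new SimpleCircuitBreaker(2, 1000, System::currentTimeMillis);
        Supplier<String> brokenDb = () -> { throw new IllegalStateException("db down"); };
        for (int i = 0; i < 3; i++) {
            System.out.println(breaker.call(brokenDb, () -> "cached value")
                    + " / " + breaker.state());
        }
    }
}
```

Passing the clock in as a `LongSupplier` is a design choice that keeps the cool-down testable without sleeping.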

Once the circuit breaker is open,
• Notify
• Try again! Maybe.
• Try to avoid DoSing your own system.
• Exponential retry.
• Failover
• Restart
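
The "exponential retry" point can be illustrated with the sketch below. It is one common way to do it, not necessarily what the talk used: `BackoffRetry`, the base delay and the cap are hypothetical, and the random jitter is an added assumption meant to keep many clients from retrying in lockstep and hammering a recovering system.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Hypothetical sketch of retrying with capped exponential backoff and jitter.
class BackoffRetry {
    // Upper bound for the wait before retry number `attempt` (0-based):
    // base * 2^attempt, capped at capMillis.
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long exponential = baseMillis << Math.min(attempt, 20);
        return Math.min(capMillis, exponential);
    }

    // Calls the action up to maxAttempts times, sleeping a random time
    // between 0 and the current backoff ceiling after each failure.
    static <T> T retry(Supplier<T> action, int maxAttempts, long baseMillis, long capMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts - 1) {
                    long ceiling = delayMillis(attempt, baseMillis, capMillis);
                    sleep(ThreadLocalRandom.current().nextLong(ceiling + 1));
                }
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("queue unavailable");
            }
            return "delivered";
        }, 5, 10, 200);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```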

Some other bits and pieces:
• Tight coupling leads to fast propagation of errors.
• Event driven stuff
• Complete parameter checking
• Avoid SPOFs (single points of failure). Pretty please.
• Stateless is better.
• Bounded queues!
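
To make the bounded-queues point concrete, here is a small sketch using the standard `java.util.concurrent` classes. The helper names (`BoundedPool`, `trySubmit`) and the sizes are assumptions. A fixed pool with a bounded queue and `AbortPolicy` rejects work once it is saturated instead of queueing without limit; giving each dependency its own such pool is also one simple way to approximate a bulkhead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper around a ThreadPoolExecutor with a bounded queue.
class BoundedPool {
    // Fixed number of threads, fixed queue capacity, reject when full.
    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    // Returns false instead of letting work pile up without limit.
    static boolean trySubmit(ThreadPoolExecutor pool, Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = create(1, 2);
        CountDownLatch release = new CountDownLatch(1);
        // Simulates a slow dependency: every task blocks until released.
        Runnable blocked = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        for (int i = 0; i < 5; i++) {
            System.out.println("task " + i + " accepted: " + trySubmit(pool, blocked));
        }
        release.countDown();
        pool.shutdown();
    }
}
```

A rejected task is a signal the caller can act on immediately (answer "no bid", shed load), which is usually better than discovering an unbounded queue through an OOM.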

Your turn.
mdepalol@trovit.com
@lant
http://www.maxisciences.com/destruction/wallpaper

