Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard Laskey

In this InfluxDays NYC 2019 session, Richard Laskey from the Wayfair Storefront team will share their monitoring best practices using InfluxEnterprise. These efforts are critical and help improve the user experience by driving forward site-wide improvements, establishing best practices, and driving change through many different teams.

Published in: Technology

  1. InfluxDB @ Wayfair: Nuance, at Scale
  2. Introduction
  3. What is Wayfair?
     Website / App / Services, e-commerce focused on home goods
     “Everyone Should Live in a Home They Love”
     We are a tech-focused company
     ● We innovate for a better customer experience
     HQ in Boston, w/ a growing EU presence in Berlin
     ● More than 2,300 Engineers and Data Scientists
  4. Leaving Graphite
     ● We’ve had Graphite for years, and it mostly worked
     ● BUT it became harder and harder to maintain. Chaos Reigned
       ○ So many developers, so many series created => many disk full errors
       ○ Carbon storage units are independent, and graphite-web doesn’t maintain a holistic view
     ● We ALSO wanted better insights, beyond means and fixed percentiles
     ● Horizontal expansion with Carbon was tough re: storage model
       ○ Every series had to be relocated, based on a consistent hash
  5. Deciding on InfluxDB
     ● Resilience & HA: replication + restoring from a backup
       ○ Internal metrics for capacity planning and tracking overall system health
     ● Granular retention policies help with control over sharding logic
       ○ For scaling, it helped to define per instance
     ● SQL-like API was great for training new developers
     ● Ability to capture raw data for tracking rare events
     ● Active development and official support channels
       ○ Tooling ecosystem: Telegraf, Kapacitor, Chronograf
       ○ Cloud friendly
  6. Challenge #1: Tackling RUM
  7. Why I’m Here
     I manage the Storefront Performance Team at Wayfair
     ● We want our website and apps to be fast
     Our job: Amplify a Performance Culture
     ● Captains, Consulting, and R&D
     ● In Storefront and Beyond
     Challenge: Scale alongside a growing company
  8. RUM: Real User Monitoring
     Extremely noisy re: networks, devices, other processes
     Points drastically affected by any customer encountering:
     ● Being on bad WiFi
     ● Having too many tabs open
     ● Entering Battery Saver mode
     We retaliate by collecting TONS of data
     Data size scales directly with traffic: many requests / second over hundreds of hosts
  9. What’s That Noise? Check COUNT
     With low RUM counts overnight, there’s even more noise
     BUT, after looking at the raw data:
     ● Caches are cold
     ● Performance is actually worse
     We track >> 300 unique pages
     ● Boss-Level Cardinality Challenge
     TODO: sample differently based on volume
  10. When to tag HOST
      RUM cardinality was destroyed by a HOST tag w/ hundreds of unique values
      ● When adding a tag: “Is GROUP BY useful here?”
      ● Thank you, InfluxData: SELECT .. INTO
      ● Dropping the HOST tag meant much faster SELECTs
      Avoid proxy measurements:
      ● Tag what is actionable, where it matters
      ● RUM: a bad proxy for CPU / system load monitoring
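
To illustrate why one extra tag can hurt so much: the number of series in a measurement is bounded by the product of the distinct values of each tag. The sketch below uses made-up counts (the slides only say ">> 300 unique pages" and "hundreds" of hosts), and the helper name is hypothetical.

```python
from math import prod

def series_cardinality(tag_value_counts):
    """Upper bound on series count: product of distinct values per tag.

    Real cardinality is usually lower, because not every combination occurs.
    """
    return prod(tag_value_counts.values())

# Hypothetical counts, for illustration only (not Wayfair's real numbers).
with_host = {"platform": 3, "dc": 2, "route": 300, "host": 400}
without_host = {k: v for k, v in with_host.items() if k != "host"}

print(series_cardinality(with_host))
print(series_cardinality(without_host))
```

Under these assumed counts, dropping the host tag shrinks the bound by a factor equal to the number of hosts, which is the kind of saving the slide attributes to removing HOST.
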
  11. Addendum: Observer Effect
      https://en.wikipedia.org/wiki/Observer_effect_(information_technology)
      UDP is great, but it’s not a solution to all problems
      ● We ran a test where we hit our PHP max_children limit on some hosts
      ● Server needed more time to send all data out
      ● Failure w/ > 800 Points / request + DNS delay
      ● register_shutdown_function => reduced visibility
  12. Lions and Tigers and Interns
  13. Storefront InfluxDB Tiger Team
      We have many thousands of PHP files w/ StatsD instrumentation
      Shutting off Graphite meant we needed to teach building a schema
      ● Hundreds of developers over many groups, each with their own style
      ● “I still don’t understand tags and fields”
      ● “Can’t we just change the measurement name?”
      Tiger Team: a small group of Engineers turns the tide
      Dedicated Slack channel + many small projects
  14. Interns Let Loose
      Undergraduate students from Northeastern joined the Tiger Team
      Party Bus
      ● Python script which rewrote parts of our PHP code
      ● Consulting with other groups
      ● Driving large swaths of conversion
      Advanced DAO Instrumentation
      ● Different sampling rates for memcache
      ● Record long-running SQL queries
      Don’t underestimate what a few people can do
  15. Tiger Challenge: StatsD vs. Schema
      StatsD:
      ● rum.client_timers.desktop.wayfair_com.index.speed_index.bo1.timer.mean
      ● rum.client_timers.desktop.wayfair_com.index.page_load.bo1.timer.mean
      InfluxDB:
      ● Measurement: rum
      ● Tags: platform=desktop; dc=bo1; store=wayfair_com; route=index
      ● Fields: speed_index=800; total_page_load_time=2300
      It’s a feature, not a bug; BUT features require thinking
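
The StatsD-to-schema mapping on this slide can be sketched in a few lines of Python. This is an illustration only, not Wayfair's conversion script: the nine-part key layout is inferred from the two example keys, all function names are hypothetical, line-protocol escaping is skipped, and renaming a field (the slide turns page_load into total_page_load_time) is left as a separate schema decision.

```python
def statsd_key_to_point(key, value):
    """Split a dotted StatsD key into (measurement, tags, fields).

    Assumed layout, inferred from the slide's two examples:
    <measurement>.client_timers.<platform>.<store>.<route>.<metric>.<dc>.timer.mean
    """
    parts = key.split(".")
    if len(parts) != 9:
        raise ValueError(f"unexpected key shape: {key}")
    measurement, _, platform, store, route, metric, dc, _, _ = parts
    tags = {"platform": platform, "dc": dc, "store": store, "route": route}
    return measurement, tags, {metric: value}

def to_line_protocol(measurement, tags, fields):
    """Render one Point as line protocol (no escaping, no timestamp)."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str}"

def merge_points(samples):
    """Merge samples that share measurement and tags into one multi-field Point."""
    points = {}
    for key, value in samples:
        measurement, tags, fields = statsd_key_to_point(key, value)
        series = (measurement, tuple(sorted(tags.items())))
        points.setdefault(series, {}).update(fields)
    return [to_line_protocol(m, dict(t), f) for (m, t), f in points.items()]

lines = merge_points([
    ("rum.client_timers.desktop.wayfair_com.index.speed_index.bo1.timer.mean", 800),
    ("rum.client_timers.desktop.wayfair_com.index.page_load.bo1.timer.mean", 2300),
])
print(lines)
```

The two StatsD series collapse into a single Point with two fields, which is the shift in thinking the slide asks developers to make.
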
  16. Developer Experience
  17. Crossing Streams
      I’m the primary owner for our PHP InfluxDB Client
      I’m the one you called
      ● Inherited from a developer who moved on to another group
      ● PHP is our most common language at Wayfair, though there are others
      Many problems ensued from mixing StatsD and InfluxDB paradigms
      ● Uniqueness for accumulators works, IF you only have strings
      ● Add a Schema, with Tags and Fields, and There Will Be Bugs
  18. Back to the Drawing Board
      Rebuilding from scratch is super expensive
      ● I had to do it anyway. The API was wrong. Expectations always failed
      ● Key advice: build clients w/ the right mental model
      End result: one client that serves all of our PHP codebase
      ● Works for Storefront and beyond
      Standard Software Best Practices:
      ● Composition over Inheritance. Fluent Interfaces. Separate Responsibilities
  19. Accumulators != Points
      An Accumulator (Counter / Stopwatch) is distinct from a single Point
      ● Counter::findOrCreate(array $uniqueness, Influx_Point $initialPoint);
      ● Goal: one Point with value 2, instead of two Points each w/ value 1
      ● Importantly, $uniqueness is separated from the value itself
      Use case:
      ● Two versions of a system, one which uses SQLite, another which doesn’t
      ● SQLite slower per iteration, but has no 100ms startup cost
      ● Which version is faster overall? Track per entire request
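
A minimal Python sketch of the accumulator idea follows. It is not the PHP client described in the talk: the class name, method signatures, and registry are hypothetical, chosen only to show a uniqueness key that is kept separate from the accumulated value.

```python
class RequestCounter:
    """Hypothetical accumulator: many increments, one Point at flush time."""

    _registry = {}  # uniqueness key -> counter, scoped to one request

    def __init__(self, measurement, tags, field):
        self.measurement = measurement
        self.tags = dict(tags)
        self.field = field
        self.value = 0

    @classmethod
    def find_or_create(cls, uniqueness, measurement, tags, field):
        # The uniqueness key identifies the accumulator; it never holds the value.
        key = tuple(sorted(uniqueness.items()))
        if key not in cls._registry:
            cls._registry[key] = cls(measurement, tags, field)
        return cls._registry[key]

    def increment(self, by=1):
        self.value += by
        return self  # fluent style, as the talk recommends for clients

    @classmethod
    def flush(cls):
        """Emit one (measurement, tags, fields) Point per accumulator, then reset."""
        points = [(c.measurement, c.tags, {c.field: c.value})
                  for c in cls._registry.values()]
        cls._registry.clear()
        return points

uniq = {"system": "dao", "backend": "sqlite"}
RequestCounter.find_or_create(uniq, "dao", {"backend": "sqlite"}, "queries").increment()
RequestCounter.find_or_create(uniq, "dao", {"backend": "sqlite"}, "queries").increment()
points = RequestCounter.flush()
print(points)
```

Two increments against the same uniqueness key yield one Point with value 2 rather than two Points with value 1, matching the goal stated on the slide.
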
  20. Beware the `value` Field Key
      Field type conflicts: lost data, tons of noise, confusion for developers
      Frequent developer retaliation: one measurement per field
      ● Same cardinality, but much harder to organize
      ● TooSpecificMeasurementStopwatch: Float
      ● TooSpecificMeasurementCounter: Integer
      ● Chronograf gets slow
      ● Dashboards are hard to create
      ● InfluxQL limitation re: multi-measurement math fixed in re-combining
  21. Language Limitations
      PHP is still one of my favorite languages, but its simplicity can be problematic
      ● Nothing is shared from one request to the next
      ● We’ve thought about using SQLite / cron, but it’s complicated
      With some C# systems at Wayfair, measurements go to a separate thread
      ● Easier to aggregate Points across individual requests
      ● Helps with the Observer Effect
      ● Complexity also implies a management cost
  22. Fighting the Firehose
  23. Feature Toggles FTW
      At Wayfair, we deploy code many times every day
      We built a robust system for toggling off and on branches of our code
      ● Uses percentages and many other fine-grained filters
      ● One single adjustment propagates instantly across all systems
      ● Not tied to any deployment process, so they’re always unblocked
      Designed to safely test new functionality in Production
      ● Works exceedingly well at scaling down measurements
      ● “Do we have enough data at 7%?” => $influxPoint->setSamplingRate(0.07);
  24. Feature Toggle: Example
      Helps w/ volume, not cardinality. Percentage => Boolean
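
The slide's example is not preserved in this transcript. As a stand-in, here is a minimal sketch of the "Percentage => Boolean" step: turning a sampling rate such as 0.07 into a per-Point decision. The function name and the injectable random source are assumptions made for illustration, not the Wayfair toggle system.

```python
import random

def should_record(sampling_rate, rng=random.random):
    """Turn a rate in [0.0, 1.0] into a yes/no decision for one Point.

    rng must return a float in [0.0, 1.0); it is injectable so tests can be
    deterministic. A rate of 0.0 never records, 1.0 always records.
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    return rng() < sampling_rate

# Roughly 7% of Points survive; the set of series (cardinality) is unchanged.
kept = sum(should_record(0.07) for _ in range(10_000))
print(kept)
```

Sampling like this reduces write volume only. Every tag combination can still appear, which is why the slide notes that it does not help with cardinality.
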
  25. Dev, meet Ops
      We give great opportunities for junior developers
      Our Timeseries team has built up resilience
      ● PHP => Telegraf => Kafka => Telegraf => InfluxDB
      ● Limits threats to any one piece
      ● MirrorMaker allows for multi-DC cohesion
      Tremor, built by Wayfair, allows us to shape any traffic
      ● We can blacklist / rate limit a given measurement by inspecting line protocol
  26. Infrastructure Updates
  27. Yo Dawg
      We heard you like Influx, so we put Influx in your Influx so you can .. measure how your clusters are doing when the target system is under attack
      2018 Clusters Layout:
      ● C1: General: for most measurements
      ● C2: Storefront: specific raw data at high volume
      ● C3: monitors C1, C2, Kafka, Puppet, Celery, ++
      ● Data Centers in Boston, Seattle, and Beyond
  28. “Influx is Slow!”
      It makes me want to cry every time I hear this
      ● My response: “do you know what you’re asking Influx to do?”
      Developers try to fetch the 99th percentile on 30+ days of data
      ● We have a 30 second timeout for Grafana, etc.
      ● We often hit that limit when processing over 400 million Points
      Problem: developers have been used to 10 second aggregations in Graphite
      ● They only had count, mean, 90th at that window
  29. Mitigation Strategies
      We’ve tried a variety of solutions, w/ InfluxData, to provide that 10 second windowing:
      ● Telegraf plugins
      ● Continuous Queries: CPU load
      ● Kapacitor: our best path forward yet
      Future: processing line protocol further with Tremor
      Challenge: speed_index vs. count_speed_index, etc.
      ● Users want magic: swap out a Retention Policy and see the same data
      ● Danger: percentile(90th_speed_index, 90) => “what does this mean?”
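
The "Danger" bullet can be made concrete with a small example: a percentile taken over per-window percentiles is generally not the percentile of the raw data. The sketch assumes a nearest-rank percentile definition and uses invented numbers; it does not reproduce any particular InfluxQL or Kapacitor behavior.

```python
def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer form of ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]

# Two hypothetical 10 second windows with very different traffic.
windows = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],  # busy window, fast requests
    [100, 200],                        # quiet window, slow requests
]

per_window_p90 = [percentile(w, 90) for w in windows]        # what downsampling stores
p90_of_p90s = percentile(per_window_p90, 90)                 # what the user then queries
true_p90 = percentile([v for w in windows for v in w], 90)   # what they meant to ask

print(per_window_p90, p90_of_p90s, true_p90)
```

The quiet window counts as much as the busy one once each is reduced to a single value, so the two answers diverge. Keeping a count field next to each aggregate (the count_speed_index idea on the slide) at least lets readers see how much data sits behind each window.
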
  30. Scaling Up
      6 Data Centers & Growing
      ● On-premise and Cloud
      ● Pictured: our first 3 DCs
      2 Telegrafs + Microphone (Kafka)
      Resilient, whole-system view
  31. Speaking to Strengths
  32. Looking at Our Data
      With downsampled Graphite, moving means / medians were less helpful
      ● InfluxDB gives us all the functions we could want
      ● GROUP BY time(:interval:) is super helpful w/ analysis
      InfluxDB lets us follow the advice of John Rauser
      ● Grafana is excellent for our Wayfair Operations Center
      ● Full analysis of the points themselves, however, is a treasure trove
      ● We can look at our raw data
  33. 625,441 Points vs. 30,466
  34. Graphing Outliers
      Keep a window of the most recent 128 Points, unsampled
      Calculate the Mean and Standard Deviation of this raw data
      For each next Point, check if we are 2 Standard Deviations away
      ● If we have an outlier, record that Point
      ● Else move on to the next Point
      5% of the data for the same picture, w/ < 50 lines of Python
      Means / Medians / etc. are clearly different, but “reality” is sometimes overrated
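
The steps on this slide can be sketched as below. This is a reconstruction from the bullet points, not the speaker's script. It assumes that the window slides forward with every Point (outliers included), that nothing is recorded until the window is full, and that the population standard deviation is used; the function name is hypothetical.

```python
from collections import deque
from statistics import mean, pstdev

def sample_outliers(values, window_size=128, threshold=2.0):
    """Keep only values more than `threshold` standard deviations from the
    mean of the preceding `window_size` raw values."""
    window = deque(maxlen=window_size)  # most recent raw Points, unsampled
    kept = []
    for v in values:
        if len(window) == window_size:  # assumption: skip the warm-up period
            mu = mean(window)
            sigma = pstdev(window)
            if abs(v - mu) > threshold * sigma:
                kept.append(v)  # outlier: record it
        window.append(v)  # every Point enters the window, recorded or not
    return kept

# 128 warm-up values alternating 10 and 12 (mean 11, std dev 1), then three more.
series = [10.0, 12.0] * 64 + [11.5, 20.0, 11.0]
outliers = sample_outliers(series)
print(outliers)
```

For roughly normal data, about 5% of values lie more than 2 standard deviations from the mean, which is consistent with the "5% of the data" figure on the slide. Recomputing the mean and deviation for every Point is slow on large inputs; a running-statistics variant would be the natural next step.
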
  35. Q&A: 5 Minutes
