1
Monitoring Java Applications at Scale In the Cloud:
Lessons from eBay
Raju Kolluru (Sr. Manager), eBay Inc., and
Mahesh Somani (Principal Architect), eBay Inc.
2
Agenda
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
3
eBay: The Biggest eCommerce Marketplace Platform
Founded in September 1995, eBay is a global online marketplace where
practically anyone can trade practically anything
From Devices to Diamonds . . .
4
eBay: The Biggest eCommerce Marketplace Platform
From Clothing to Cameras . . . and more
5
eBay: The Biggest eCommerce Marketplace Platform
Cards, Missile Base, Cities, Jets, Yachts . . .
6
What are we up against?
eBay manages …
– Over 100 million active users
– Over 2 Billion photos
– eBay averages 2 billion page views per day
– eBay has over 300 million items for sale in over
50,000 categories
– eBay site stores over 5 Petabytes of data
– eBay Analytics Infrastructure processes 80+
PB of data per day
– eBay handles 40 billion service calls per month
In a dynamic environment
– 300+ features per quarter
– Roll 100,000+ code lines every 2 weeks
– 40+ million lines of code
• In 40+ countries, in 20+ languages, 24x7x365
>100 Billion SQL executions/day!
An SUV is sold every 5 minutes
A sporting good sells every 2 seconds
Over ½ million pounds of kimchi are sold every year!
7
Agenda: Motivation
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
8
Monitoring: Philosophy
“You can’t control what you don’t measure.
And, what you measure tends to improve
over time.”
9
Monitoring: Scale and Complexity
• Billions of events
• 100s of DBs
• Thousands of services
• Billions of service calls
• More than 1,000 applications
• More than 50K servers
• 2 billion hits
10
Monitoring: Data Volume
– 50,000+ servers monitored
– 1,000+ applications
– 3,000+ load balancers monitored
– 150 TB logging volume per day
– 60K metrics/second (6 billion/day)
– 1,000+ alert rules
11
Monitoring Customers
• Customer Behavior
• Performance
• Root Cause
• Threats
• Sensitive Data
• Biz Metrics
• A/B Test
• Anomaly
• Errors
12
Agenda: Logs
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
13
Logs: Data Quality
[Diagram: Alerts, Metrics, and Logs positioned between Volume and Quality]
14
Logs Processing Architecture
• Data volume scale and extensibility needed
– Open source and Big Data technologies adoption (Hadoop)
– HDFS and TSDB/HBase
[Diagram components: Client (Metrics and Logs), Transport, Metrics Processing, Metrics Store (TSDB/HBase), Logs Store]
15
Logs
• Advantages
– Temporal record
– Detailed
– Provides instance level information
– Distributed, with correlations
• Traditional Challenges
– Unstructured
– Decentralized
– Storage and retention
– Processing requires parsing
16
Logs: Dealing with Challenges
• Client APIs
– Log different kinds of information
• Transaction: Nested activities
• Transaction: Start and end of activity
– Additional structures
• Types (URL, Service, SQL)
• Names (Request name, Query name)
• Server
– Centralized storage
– Distributed processing (Hadoop)
– Volume: 150 TB / day (uncompressed). 5x compression
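A minimal sketch of what a client API for nested, typed, named transactions could look like. This is an illustration under stated assumptions, not eBay's actual logging API: the class `TransactionLog`, its methods, and the sample names `ViewItem` and `FindItemById` are all hypothetical. Timestamps are passed in explicitly so the example stays deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical client-side structured logger (not an actual eBay API). */
final class TransactionLog {
    /** One structured record: nesting depth, START/END marker, type, name, duration. */
    record Entry(int depth, String marker, String type, String name, long durationMs) {}

    /** An activity that has started but not yet ended. */
    private record Open(String type, String name, long startMs) {}

    private final Deque<Open> open = new ArrayDeque<>();
    private final List<Entry> entries = new ArrayList<>();

    /** Marks the start of an activity; activities may nest inside each other. */
    void start(String type, String name, long nowMs) {
        entries.add(new Entry(open.size(), "START", type, name, 0));
        open.push(new Open(type, name, nowMs));
    }

    /** Marks the end of the innermost open activity and records its duration. */
    void end(long nowMs) {
        if (open.isEmpty()) {
            throw new IllegalStateException("end() called with no open activity");
        }
        Open o = open.pop();
        entries.add(new Entry(open.size(), "END", o.type(), o.name(), nowMs - o.startMs()));
    }

    /** Returns an immutable snapshot of the records logged so far. */
    List<Entry> entries() {
        return List.copyOf(entries);
    }
}
```

Because every record carries a type (URL, Service, SQL) and a name, server-side processing can group and aggregate without parsing free-form text.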
17
Logs: Processing
• Processing
– Generate on-going reports and aggregation
– Convert logs to metrics
• Data breakdown along different dimensions
– Requests, Browsers, Experiments, Errors, Machines, IP
addresses, Geo
• On-demand processing. Distributed processing
– Search
– Pig / Hive / MapReduce jobs
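The "logs to metrics" step can be sketched as a group-by over one dimension at a time. The version below runs in memory on a small list; at the volumes quoted in this deck the same aggregation would run as a distributed job. `LogsToMetrics`, `LogRecord`, and its fields are hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical in-memory aggregation of parsed log records into metrics. */
final class LogsToMetrics {
    /** A parsed log record; the fields are illustrative dimensions. */
    record LogRecord(String request, String browser, String machine, boolean error) {}

    /** Counts records per value of one dimension, e.g. per browser or per machine. */
    static Map<String, Long> countBy(List<LogRecord> logs,
                                     Function<LogRecord, String> dimension) {
        return logs.stream()
                .collect(Collectors.groupingBy(dimension, TreeMap::new, Collectors.counting()));
    }

    /** Error rate (errors / total) per value of one dimension. */
    static Map<String, Double> errorRateBy(List<LogRecord> logs,
                                           Function<LogRecord, String> dimension) {
        return logs.stream()
                .collect(Collectors.groupingBy(dimension, TreeMap::new,
                        Collectors.averagingDouble((LogRecord r) -> r.error() ? 1.0 : 0.0)));
    }
}
```

Passing the dimension as a function (for example `LogRecord::browser`) keeps one aggregation routine reusable across requests, browsers, machines, and so on.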
18
Logs: Viewer
19
Agenda: Metrics
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
20
Metrics and Events
• A Metric is a measure sampled over time.
– Has a metric id as its unique identifier
– Has a value
• A gauge, that is, a measurement
• A counter that increments (error counts, bytes transferred)
– Has “tags” that distinguish one instance of the metric from others
• An Event is an occurrence of something of interest. Events are aperiodic.
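The metric model described here (an id, a value that is either a gauge or a counter, and tags that tell instances apart) can be sketched as follows. `SimpleMetrics` and its method names are hypothetical and do not correspond to any particular library; the tag keys in the usage note are likewise made up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical in-process metric store; illustrative only. */
final class SimpleMetrics {
    /** A metric instance is identified by its metric id plus its tags. */
    record Key(String id, Map<String, String> tags) {
        Key {
            tags = Map.copyOf(tags); // immutable copy so the key cannot change later
        }
    }

    private final Map<Key, LongAdder> counters = new ConcurrentHashMap<>();
    private final Map<Key, AtomicLong> gauges = new ConcurrentHashMap<>();

    /** Counter: only ever increments (error counts, bytes transferred). */
    void increment(String id, Map<String, String> tags, long delta) {
        if (delta < 0) {
            throw new IllegalArgumentException("counters only increment");
        }
        counters.computeIfAbsent(new Key(id, tags), k -> new LongAdder()).add(delta);
    }

    /** Gauge: stores the latest measurement, replacing the previous one. */
    void setGauge(String id, Map<String, String> tags, long value) {
        gauges.computeIfAbsent(new Key(id, tags), k -> new AtomicLong()).set(value);
    }

    /** Current counter total, or 0 if this instance was never incremented. */
    long counter(String id, Map<String, String> tags) {
        LongAdder adder = counters.get(new Key(id, tags));
        return adder == null ? 0 : adder.sum();
    }

    /** Latest gauge value, or 0 if this instance was never set. */
    long gauge(String id, Map<String, String> tags) {
        AtomicLong value = gauges.get(new Key(id, tags));
        return value == null ? 0 : value.get();
    }
}
```

For example, `increment("errors", Map.of("host", "a"), 1)` and `increment("errors", Map.of("host", "b"), 1)` update two separate instances of the same metric id.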
21
Metrics: When to use?
• Balance between volume and quality
– Short SLA (~seconds)
• Periodicity enables trending
• Client
– Convenience for users
– Dealing with volume
• Server
– Caching and in-memory processing
– Feed to other systems with real-time data
• Aggregation: on both the client and the server end
[Diagram: Volume versus Quality]
22
Metrics: Which metrics?
• SAAS: Null searches, SEO traffic, Shipping option selection, Unsuccessful login rate
• PAAS: Requests per second, Errors per second, Latency, Services, GC Overhead
• IAAS: CPU, Network, Memory, Disk, Load Balancer
23
Metrics: Dashboard and UI
24
Agenda: Alerts
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
25
Alerts
• Static thresholds
– E.g., Machine CPU > 70%, Response time > 500 ms
• Cliffs
– Bollinger bands
• Slow poison
– Day over day or week over week comparison
• Alerts and Alarms
– Multiple correlated alerts => Alarm(s)
– Alarms are time sensitive
• Proactive vs. Reactive detection
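A rough sketch of a static threshold check and of rolling several alerts up into an alarm. The slides do not say how alerts are correlated, so the rule used here (at least `minAlerts` alerts from the same source within `windowMs` of the newest one) is purely an assumption, and `AlertCorrelation` and `Alert` are hypothetical names.

```java
import java.util.List;

/** Hypothetical alert-to-alarm correlation; the correlation rule is an assumption. */
final class AlertCorrelation {
    /** One fired alert: where it came from, which rule fired, and when. */
    record Alert(String source, String rule, long timestampMs) {}

    /** Static threshold rule, e.g. CPU > 70% or response time > 500 ms. */
    static boolean exceeds(double value, double limit) {
        return value > limit;
    }

    /**
     * Raises an alarm when at least minAlerts alerts from the same source
     * fall within windowMs of that source's newest alert.
     */
    static boolean shouldAlarm(List<Alert> alerts, String source, long windowMs, int minAlerts) {
        long newest = alerts.stream()
                .filter(a -> a.source().equals(source))
                .mapToLong(Alert::timestampMs)
                .max()
                .orElse(Long.MIN_VALUE);
        long recent = alerts.stream()
                .filter(a -> a.source().equals(source) && newest - a.timestampMs() <= windowMs)
                .count();
        return recent >= minAlerts;
    }
}
```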
26
Cliff Detection: Bollinger Band
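A Bollinger band is a moving average plus and minus k standard deviations; a point outside the band is a candidate cliff. The sketch below assumes a fixed window of recent samples and a caller-supplied k. The window size and k value that eBay used are not given in the slides, and the class name `BollingerBand` is hypothetical.

```java
/** Hypothetical cliff detector based on a Bollinger band over recent samples. */
final class BollingerBand {
    /**
     * Returns true if latest falls outside mean ± k * stddev of the window.
     * Uses the population standard deviation of the window.
     */
    static boolean isCliff(double[] window, double latest, double k) {
        if (window.length == 0) {
            throw new IllegalArgumentException("window must not be empty");
        }
        double sum = 0;
        for (double v : window) {
            sum += v;
        }
        double mean = sum / window.length;

        double squaredDeviation = 0;
        for (double v : window) {
            squaredDeviation += (v - mean) * (v - mean);
        }
        double stddev = Math.sqrt(squaredDeviation / window.length);

        double upper = mean + k * stddev;
        double lower = mean - k * stddev;
        return latest > upper || latest < lower;
    }
}
```

Unlike a static threshold, the band widens and narrows with the metric's own recent variability, so a noisy metric does not alert on every ordinary swing.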
27
Slow Poison: Week Over Week Analysis
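Slow poison is a gradual degradation that never trips a cliff detector, so it is caught by comparing against the same period one week earlier. A minimal sketch, assuming both windows are already aligned to the same time slots; `WeekOverWeek` and the `maxIncrease` parameter are hypothetical.

```java
/** Hypothetical week-over-week comparison for slow degradation. */
final class WeekOverWeek {
    /** Arithmetic mean of a non-empty window. */
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("window must not be empty");
        }
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Relative change of the current window's mean versus the window one week earlier. */
    static double relativeChange(double[] current, double[] weekAgo) {
        double baseline = mean(weekAgo);
        if (baseline == 0) {
            throw new IllegalArgumentException("baseline mean is zero");
        }
        return (mean(current) - baseline) / baseline;
    }

    /** Flags the metric when it has risen by more than maxIncrease (0.10 means 10%). */
    static boolean isSlowPoison(double[] current, double[] weekAgo, double maxIncrease) {
        return relativeChange(current, weekAgo) > maxIncrease;
    }
}
```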
28
Agenda: Self-healing
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
29
Self-Healing: In the Cloud
• Provision (IAAS)
• Deploy (PAAS)
• Monitor (PAAS)
• Remediate (PAAS)
30
Agenda: Summary
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
31
Summary: Monitoring at Scale
• Data at eBay is BIG and getting BIGGER. Need Big Data for scale
• Scope of monitoring includes Logs, Metrics, Alerts and Self-Healing
• Data quality versus data volume
• Multiple client sensors
• Monitoring and management at scale needs Self-Healing
32
Connect with us
• raju@ebay.com (@raju_kolluru)
• msomani@ebay.com (@mahesh_somani)
We are Hiring!
• Opportunities in Java, Big Data, software applications and systems
indiajobs@ebay.com
Q & A

JavaOne 2013: Monitoring at Scale in the Cloud

