JavaOne 2013: Monitoring at Scale in the Cloud

1
Monitoring Java Applications at Scale In the Cloud:
Lessons from eBay
Raju Kolluru (Sr. Manager), eBay Inc., and
Mahesh Somani (Principal Architect), eBay Inc.

2
Agenda
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary

3
eBay: The Biggest eCommerce Marketplace Platform
Founded in September 1995, eBay is a global online marketplace where
practically anyone can trade practically anything
From Devices to Diamonds . . .

4
From Clothing to Cameras . . . and more

5
Cards, Missile Base, Cities, Jets, Yachts . . .

6
What we’re up against
eBay manages …
– Over 100 million active users
– Over 2 billion photos
– An average of 2 billion page views per day
– Over 300 million items for sale in over 50,000 categories
– Over 5 petabytes of data stored on the site
– 80+ PB of data processed per day by the Analytics Infrastructure
– 40 billion service calls per month
In a dynamic environment
– 300+ features per quarter
– 100,000+ lines of code rolled out every 2 weeks
– 40+ million lines of code
– In 40+ countries, in 20+ languages, 24x7x365
>100 billion SQL executions/day!
An SUV is sold every 5 minutes
A sporting good sells every 2 seconds
Over ½ million pounds of kimchi are sold every year!

7
Agenda: Motivation
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary

8
Monitoring: Philosophy
“You can’t control what you don’t measure.
And, what you measure tends to improve
over time.”

9
Monitoring: Scale and Complexity
– Billions of events
– 100s of DBs
– Thousands of services
– Billions of service calls
– More than 1,000 applications
– More than 50K servers
– 2 billion hits

10
Monitoring: Data Volume
– 50,000+ servers monitored
– 1,000+ applications
– 3,000+ load balancers monitored
– 150 TB logging volume per day
– 60K metrics/second (6 billion/day)
– 1,000+ alert rules

11
Monitoring Customers
– Customer Behavior
– Performance
– Root Cause
– Threats
– Sensitive Data
– Biz Metrics
– A/B Test
– Anomaly
– Errors

12
Agenda: Logs
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary

13
Logs: Data Quality
[Figure: Alerts, Metrics, and Logs compared by Volume and Quality]

14
Architecture
• Data volume scale and extensibility needed
– Open source and Big Data technologies adoption (Hadoop)
– HDFS and TSDB/HBase
[Diagram components: Client (Metrics and Logs), Transport, Logs Processing, Metrics Processing, Metrics Store (TSDB/HBase), Logs Store]

15
Logs
• Advantages
– Temporal record
– Detailed
– Provides instance level information
– Distributed, with correlations
• Traditional Challenges
– Unstructured
– Decentralized
– Storage and retention
– Processing requires parsing

16
Logs: Dealing with Challenges
• Client APIs
– Log different kinds of information
• Transaction: Nested activities
• Transaction: Start and end of activity
– Additional structures
• Types (URL, Service, SQL)
• Names (Request name, Query name)
• Server
– Centralized storage
– Distributed processing (Hadoop)
– Volume: 150 TB/day uncompressed; 5x compression brings that to about 30 TB/day
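
The client API ideas above (nested transactions with a start and an end, plus a type and a name on every activity) can be sketched roughly as follows. This is a minimal illustration, not eBay's actual API: the class `TransactionLog`, its methods, and the `Entry` layout are all hypothetical, and it assumes Java 16 or later for records.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical structured logger: every name here is illustrative only.
final class TransactionLog {
    /** One finished activity: nesting depth, type (URL, Service, SQL), name, duration. */
    record Entry(int depth, String type, String name, long durationNanos) {}

    /** An activity that has started but not yet ended. */
    private record Frame(String type, String name, long startNanos) {}

    private final Deque<Frame> open = new ArrayDeque<>();
    private final List<Entry> entries = new ArrayList<>();

    /** Marks the start of an activity; a call made while another is open becomes its child. */
    void start(String type, String name) {
        open.push(new Frame(type, name, System.nanoTime()));
    }

    /** Marks the end of the most recently started activity and records it. */
    void end() {
        Frame frame = open.pop();
        long elapsed = System.nanoTime() - frame.startNanos();
        // After the pop, the number of still-open activities is this activity's depth.
        entries.add(new Entry(open.size(), frame.type(), frame.name(), elapsed));
    }

    /** Finished activities in completion order (children before their parents). */
    List<Entry> entries() {
        return List.copyOf(entries);
    }
}
```

Because each entry carries a type, a name, and a depth, the server side can group or filter records without parsing free-form text.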

17
Logs: Processing
• Processing
– Generate on-going reports and aggregation
– Convert logs to metrics
• Data breakdown along different dimensions
– Requests, Browsers, Experiments, Errors, Machines, IP addresses, Geo
• On-demand, distributed processing
– Search
– Pig / Hive / MapReduce jobs
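
As a small-scale illustration of converting logs to metrics and breaking them down by dimension, the sketch below counts records and computes error rates per dimension value in memory. The `LogRecord` fields are an assumed schema, and a real pipeline at this volume would run as distributed jobs rather than a single-process stream.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical roll-up of parsed log records into per-dimension metrics.
final class LogRollup {
    /** A parsed log line; these fields are assumed, not an actual eBay schema. */
    record LogRecord(String request, String browser, String machine, boolean error) {}

    /** Counts records per value of one dimension (for example request name or browser). */
    static Map<String, Long> countBy(List<LogRecord> logs,
                                     Function<LogRecord, String> dimension) {
        return logs.stream()
                .collect(Collectors.groupingBy(dimension, TreeMap::new, Collectors.counting()));
    }

    /** Fraction of records flagged as errors, per value of one dimension. */
    static Map<String, Double> errorRateBy(List<LogRecord> logs,
                                           Function<LogRecord, String> dimension) {
        return logs.stream()
                .collect(Collectors.groupingBy(dimension, TreeMap::new,
                        Collectors.averagingDouble((LogRecord r) -> r.error() ? 1.0 : 0.0)));
    }
}
```

Passing a different accessor, such as `LogRecord::browser` or `LogRecord::machine`, produces the same breakdown along another dimension.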

19
Agenda: Metrics
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary

20
Metrics and Events
• A Metric is a measure sampled over time.
– Has a metric ID as its unique identifier
– Has a value
• A gauge, that is, a measurement
• A counter that increments (error counts, bytes transferred)
– Has “tags” that distinguish one instance of the metric from others
• An Event is an occurrence of something of interest. Events are aperiodic.
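
One possible way to model these definitions in Java is sketched below. The type names (`MetricKey`, `Sample`, `Counter`, `Event`) are hypothetical and are not taken from the talk or from any particular metrics library; the sketch assumes Java 16 or later for records.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical data model for metrics and events; all names are illustrative.
final class MetricModel {
    enum Kind { GAUGE, COUNTER }

    /** Identity of one time series: the metric id plus the tags that set this instance apart. */
    record MetricKey(String id, Map<String, String> tags) {
        MetricKey {
            tags = Map.copyOf(tags); // defensive, immutable copy
        }
    }

    /** One periodic observation of a metric. */
    record Sample(MetricKey key, Kind kind, long timestampMillis, double value) {}

    /** A gauge sample is simply the reading taken at sampling time. */
    static Sample gauge(MetricKey key, long nowMillis, double reading) {
        return new Sample(key, Kind.GAUGE, nowMillis, reading);
    }

    /** A counter only increments; each sample reports the running total. */
    static final class Counter {
        private final MetricKey key;
        private final AtomicLong total = new AtomicLong();

        Counter(MetricKey key) {
            this.key = key;
        }

        void increment(long delta) {
            total.addAndGet(delta);
        }

        Sample sample(long nowMillis) {
            return new Sample(key, Kind.COUNTER, nowMillis, total.get());
        }
    }

    /** An event is aperiodic: it is recorded when it happens, not on a sampling schedule. */
    record Event(String name, long timestampMillis, Map<String, String> attributes) {}
}
```

Two keys with the same id but different tags (for instance a different host) compare as different series, which is the role the slide gives to tags.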

21
Metrics: When to use?
• Balance between volume and quality
– Short SLA (~seconds)
• Periodicity enables trending
• Client
– Convenience for users
– Dealing with volume
• Server
– Caching and in-memory processing
– Feed to other systems with real-time data
• Aggregation: at both the client and the server end
[Figure: Volume versus Quality]
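
Client-side aggregation is one way to deal with volume: rather than sending every observation, the client keeps a running summary per metric and ships it once per flush interval. The sketch below is an assumption about how that could look; `ClientAggregator`, `observe`, and `flush` are hypothetical names, and the flush scheduling and transport are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical client-side aggregator: collapses many observations into one summary per metric.
final class ClientAggregator {
    /** Running summary of the observations seen since the last flush. */
    record Summary(long count, double sum, double min, double max) {
        static Summary of(double value) {
            return new Summary(1, value, value, value);
        }

        Summary add(double value) {
            return new Summary(count + 1, sum + value,
                    Math.min(min, value), Math.max(max, value));
        }
    }

    private final Map<String, Summary> pending = new HashMap<>();

    /** Records one observation locally; nothing is sent yet. */
    synchronized void observe(String metricId, double value) {
        pending.merge(metricId, Summary.of(value), (old, unused) -> old.add(value));
    }

    /** Returns the summaries accumulated so far and resets the buffer. */
    synchronized Map<String, Summary> flush() {
        Map<String, Summary> snapshot = Map.copyOf(pending);
        pending.clear();
        return snapshot;
    }
}
```

The server can combine summaries from many clients in the same way, since counts and sums add and minima and maxima merge.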

22
Metrics: Which metrics?
• SaaS: null searches, SEO traffic, shipping option selection, unsuccessful login rate
• PaaS: requests per second, errors per second, latency, services, GC overhead
• IaaS: CPU, network, memory, disk, load balancer

24
Agenda: Alerts
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary

25
Alerts
• Static thresholds
– E.g., Machine CPU > 70%, Response time > 500 ms
• Cliffs
– Bollinger bands
• Slow poison
– Day over day or week over week comparison
• Alerts and Alarms
– Multiple correlated alerts => Alarm(s)
– Alarms are time sensitive
• Proactive vs. Reactive detection
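
A static threshold and the step from alerts to alarms can be sketched as below. The thresholds in the usage note (CPU above 70%, response time above 500 ms) come from the slide; the correlation rule itself is an assumption made for illustration, namely that an alarm is raised when enough distinct rules fire for the same application within a time window. `AlertCorrelator` and its methods are hypothetical.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical alerting helper; the correlation rule is an assumption, not the talk's method.
final class AlertCorrelator {
    /** One fired alert: which application, which rule, and when. */
    record Alert(String app, String rule, long timeMillis) {}

    /** Static threshold: produces an alert only when the value exceeds the limit. */
    static Optional<Alert> staticThreshold(String app, String rule,
                                           double value, double limit, long timeMillis) {
        return value > limit
                ? Optional.of(new Alert(app, rule, timeMillis))
                : Optional.empty();
    }

    /**
     * Assumed correlation rule: alarm when at least minAlerts distinct rules
     * have fired for the application within the last windowMillis.
     */
    static boolean shouldAlarm(List<Alert> alerts, String app,
                               long nowMillis, long windowMillis, int minAlerts) {
        long distinctRules = alerts.stream()
                .filter(a -> a.app().equals(app))
                .filter(a -> a.timeMillis() <= nowMillis
                        && nowMillis - a.timeMillis() <= windowMillis)
                .map(Alert::rule)
                .distinct()
                .count();
        return distinctRules >= minAlerts;
    }
}
```

For example, `staticThreshold("checkout", "cpu", 85.0, 70.0, now)` and `staticThreshold("checkout", "latency", 650.0, 500.0, now)` would both fire, and `shouldAlarm` with `minAlerts = 2` would then treat them as one alarm for the hypothetical `checkout` application.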

26
Cliff Detection: Bollinger Band
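
A Bollinger band is a moving average plus and minus a multiple of the moving standard deviation. The sketch below flags a sample as a cliff when it falls outside the band computed from the preceding window. The window size, the multiplier `k`, and the choice to exclude the current sample from its own band are assumptions for illustration; the slide does not give the parameters used at eBay.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative cliff detector based on Bollinger bands; parameters are assumptions.
final class BollingerCliffDetector {
    /**
     * Returns the indices i for which values[i] lies outside
     * mean ± k * stddev of the previous `window` values.
     */
    static List<Integer> cliffs(double[] values, int window, double k) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        List<Integer> flagged = new ArrayList<>();
        for (int i = window; i < values.length; i++) {
            double sum = 0;
            for (int j = i - window; j < i; j++) {
                sum += values[j];
            }
            double mean = sum / window;

            double squares = 0;
            for (int j = i - window; j < i; j++) {
                double d = values[j] - mean;
                squares += d * d;
            }
            double stddev = Math.sqrt(squares / window); // population standard deviation

            double upper = mean + k * stddev;
            double lower = mean - k * stddev;
            if (values[i] > upper || values[i] < lower) {
                flagged.add(i);
            }
        }
        return flagged;
    }
}
```

Note that a perfectly flat window has zero standard deviation, so any change at all would be flagged; a production rule would likely add a minimum band width.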

27
Slow Poison: Week Over Week Analysis
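
Slow poison is a gradual drift that never trips a cliff detector, so it is caught by comparing a window against the same window one week earlier. The sketch below is one simple way to do that; the class name, the use of window totals, and the tolerance parameter are assumptions for illustration.

```java
// Illustrative week-over-week comparison; names and tolerance are assumptions.
final class WeekOverWeekCheck {
    /** Relative change of this week's window total versus the same window last week. */
    static double relativeChange(double[] lastWeek, double[] thisWeek) {
        if (lastWeek.length == 0 || lastWeek.length != thisWeek.length) {
            throw new IllegalArgumentException("windows must be non-empty and of equal length");
        }
        double before = 0;
        double after = 0;
        for (int i = 0; i < lastWeek.length; i++) {
            before += lastWeek[i];
            after += thisWeek[i];
        }
        if (before == 0) {
            throw new IllegalArgumentException("baseline window sums to zero");
        }
        return (after - before) / before;
    }

    /** Flags the window when the change in either direction exceeds the tolerance. */
    static boolean isSlowPoison(double[] lastWeek, double[] thisWeek, double tolerance) {
        return Math.abs(relativeChange(lastWeek, thisWeek)) > tolerance;
    }
}
```

Comparing against the same weekday and time of day keeps normal daily and weekly traffic cycles from registering as a change.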

28
Agenda: Self-healing
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary

29
Self-Healing: In the Cloud
– Remediate (PaaS)
– Deploy (PaaS)
– Monitor (PaaS)
– Provision (IaaS)

30
Agenda: Summary
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary

31
Summary: Monitoring at Scale
Data at eBay is BIG and getting BIGGER; Big Data technologies are needed for scale
The scope of monitoring includes Logs, Metrics, Alerts, and Self-Healing
Data Quality versus Data Volume
Multiple Client Sensors
Monitoring and management at scale need Self-Healing

32
Connect with us
– raju@ebay.com (@raju_kolluru)
– msomani@ebay.com (@mahesh_somani)
We are Hiring!
– Opportunities in Java, Big Data, software applications and systems
indiajobs@ebay.com
Q & A
