1
Monitoring Java Applications at Scale In the Cloud:
Lessons from eBay
Raju Kolluru (Sr. Manager), eBay Inc., and
Mahesh Somani (Principal Architect), eBay Inc.
2
Agenda
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
3
eBay: The Biggest eCommerce Marketplace Platform
Founded in September 1995, eBay is a global online marketplace where
practically anyone can trade practically anything
From Devices to Diamonds . . .
4
From Clothing to Cameras . . . and more
5
Cards, Missile Base, Cities, Jets, Yachts . . .
6
What are we up against?
eBay manages …
– Over 100 million active users
– Over 2 billion photos
– An average of 2 billion page views per day
– Over 300 million items for sale in over 50,000 categories
– Over 5 petabytes of data stored on the site
– 80+ PB of data processed per day by the Analytics Infrastructure
– 40 billion service calls per month
In a dynamic environment
– 300+ features per quarter
– 100,000+ lines of code rolled out every 2 weeks
– 40+ million lines of code
• In 40+ countries, in 20+ languages, 24x7x365
>100 billion SQL executions/day!
An SUV is sold every 5 minutes
A sporting good sells every 2 seconds
Over ½ million pounds of kimchi are sold every year!
7
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
Agenda: Motivation
8
Monitoring: Philosophy
“You can’t control what you don’t measure.
And, what you measure tends to improve
over time.”
9
Monitoring: Scale and Complexity
• Billions of events
• 100s of DBs
• Thousands of services
• Billions of service calls
• More than 1,000 applications
• More than 50K servers
• 2 billion hits
10
Monitoring: Data Volume
– 50,000+ servers monitored
– 1,000+ applications
– 3,000+ load balancers monitored
– 150 TB logging volume per day
– 60K metrics/second (6 billion/day)
– 1,000+ alert rules
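To put the figures above in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes) and an even spread of load across the day and across servers, neither of which the slide states. A sustained 60K metrics/second works out to roughly 5.2 billion metrics/day, the same order of magnitude as the slide's rounded figure.

```python
# Back-of-the-envelope check of the volume figures quoted on the slide.
# Assumptions (not from the slide): decimal units and evenly spread load.
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400

metrics_per_second = 60_000               # "60K metrics/second"
metrics_per_day = metrics_per_second * SECONDS_PER_DAY

log_tb_per_day = 150                      # "150 TB logging volume per day"
log_bytes_per_second = log_tb_per_day * 10**12 / SECONDS_PER_DAY

servers = 50_000                          # "50,000+ servers monitored"
log_mb_per_server_per_day = log_tb_per_day * 10**6 / servers

print(f"{metrics_per_day / 1e9:.2f} billion metrics/day")
print(f"{log_bytes_per_second / 1e9:.2f} GB/s of logs on average")
print(f"{log_mb_per_server_per_day:.0f} MB of logs per server per day")
```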
11
Monitoring Customers
• Customer Behavior
• Performance
• Root Cause
• Threats
• Sensitive Data
• Biz Metrics
• A/B Test
• Anomaly
• Errors
12
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
Agenda: Logs
13
Logs: Data Quality
[Diagram labels: Alerts, Metrics, Logs; Volume, Quality]
14
Logs Processing Architecture
• Data volume scale and extensibility needed
– Open source and Big Data technologies adoption (Hadoop)
– HDFS and TSDB/HBase
[Diagram components: Client (Metrics and Logs), Transport, Metrics Processing, Metrics Store (TSDB/HBase), Logs Store]
15
Logs
• Advantages
– Temporal record
– Detailed
– Provides instance level information
– Distributed, with correlations
• Traditional Challenges
– Unstructured
– Decentralized
– Storage and retention
– Processing requires parsing
16
Logs: Dealing with Challenges
• Client APIs
– Log different kinds of information
• Transaction: Nested activities
• Transaction: Start and end of activity
– Additional structures
• Types (URL, Service, SQL)
• Names (Request name, Query name)
• Server
– Centralized storage
– Distributed processing (Hadoop)
– Volume: 150 TB / day (uncompressed). 5x compression
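A minimal sketch of what such a client API could look like, assuming a context-manager style. The `TransactionLogger` class, its field names, and the example request names are hypothetical and are not taken from eBay's library. Each transaction carries a type (URL, Service, SQL) and a name, records its start and end, and remembers its parent so that nested activities can be reconstructed on the server.

```python
import json
import time
from contextlib import contextmanager


class TransactionLogger:
    """Hypothetical client API: records nested, typed, named transactions."""

    def __init__(self):
        self.records = []   # finished transactions, in completion order
        self._stack = []    # names of the currently open transactions

    @contextmanager
    def transaction(self, type_, name):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = time.perf_counter()
        status = "OK"
        try:
            yield
        except Exception:
            status = "ERROR"
            raise
        finally:
            self._stack.pop()
            self.records.append({
                "type": type_,
                "name": name,
                "parent": parent,
                "depth": len(self._stack),
                "status": status,
                "duration_ms": round((time.perf_counter() - start) * 1000, 3),
            })


log = TransactionLogger()
with log.transaction("URL", "ViewItem"):
    with log.transaction("Service", "GetItemDetails"):
        with log.transaction("SQL", "SelectItemById"):
            pass  # the real work would happen here

for record in log.records:
    print(json.dumps(record))
```

Because every record is already structured, the server does not need to parse free text to recover the type, name, or nesting of an activity.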
17
Logs: Processing
• Processing
– Generate ongoing reports and aggregations
– Convert logs to metrics
• Data breakdown along different dimensions
– Requests, Browsers, Experiments, Errors, Machines, IP addresses, Geo
• On-demand processing. Distributed processing
– Search
– PIG / Hive / MR jobs
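The log-to-metric conversion can be illustrated with a small in-memory sketch. The record fields and sample values below are made up for illustration. At the volumes quoted in the talk the same grouping would run as a distributed job (Pig, Hive, or MapReduce) and not in a single process.

```python
from collections import Counter, defaultdict

# Hypothetical structured log records, one per completed request.
records = [
    {"request": "ViewItem", "browser": "Chrome",  "status": "OK",    "duration_ms": 120},
    {"request": "ViewItem", "browser": "Firefox", "status": "ERROR", "duration_ms": 900},
    {"request": "Search",   "browser": "Chrome",  "status": "OK",    "duration_ms": 80},
    {"request": "ViewItem", "browser": "Chrome",  "status": "OK",    "duration_ms": 100},
]


def breakdown(records, dimension):
    """Aggregate count, error count and mean latency per value of one dimension."""
    counts, errors, total_ms = Counter(), Counter(), defaultdict(float)
    for r in records:
        key = r[dimension]
        counts[key] += 1
        errors[key] += r["status"] == "ERROR"   # True counts as 1
        total_ms[key] += r["duration_ms"]
    return {
        key: {
            "count": counts[key],
            "errors": errors[key],
            "avg_ms": total_ms[key] / counts[key],
        }
        for key in counts
    }


by_request = breakdown(records, "request")
by_browser = breakdown(records, "browser")
print(by_request)
print(by_browser)
```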
18
Logs: Viewer
19
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
Agenda: Metrics
20
Metrics and Events
• A Metric is a measure sampled over time.
– Has a metric ID as its unique identifier
– Has a value
• A gauge, that is, a measurement
• A counter that increments (error counts, bytes transferred)
– Has “tags” that distinguish one instance of the metric from others
• An Event is an occurrence indicating something of interest. Events are aperiodic.
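The definitions above can be sketched as a small data model. The class names, tag keys, and sample values are assumptions made for illustration. The idea of identifying a series by metric ID plus tags mirrors how tag-based time-series stores generally work, but this is not eBay's schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    GAUGE = "gauge"      # a point-in-time measurement
    COUNTER = "counter"  # a value that only increments


@dataclass
class Metric:
    metric_id: str
    kind: Kind
    tags: dict = field(default_factory=dict)
    samples: list = field(default_factory=list)  # (timestamp, value) pairs

    def series_key(self):
        """Metric ID plus sorted tags identifies one instance of the metric."""
        return (self.metric_id, tuple(sorted(self.tags.items())))

    def record(self, timestamp, value):
        if self.kind is Kind.COUNTER and self.samples and value < self.samples[-1][1]:
            raise ValueError("a counter must not decrease")
        self.samples.append((timestamp, value))


@dataclass(frozen=True)
class Event:
    """An aperiodic occurrence: it has a timestamp but no sampling interval."""
    name: str
    timestamp: float
    detail: str = ""


cpu = Metric("cpu.utilization", Kind.GAUGE, {"host": "app-01", "pool": "search"})
errors = Metric("http.errors", Kind.COUNTER, {"host": "app-01"})
for t, v in [(0, 41.0), (60, 63.5), (120, 38.2)]:
    cpu.record(t, v)        # a gauge may go up or down
for t, v in [(0, 3), (60, 3), (120, 7)]:
    errors.record(t, v)     # a counter only grows
deploy = Event("deployment.finished", timestamp=75.0, detail="app-01")
print(cpu.series_key(), errors.samples[-1], deploy)
```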
21
Metrics: When to use?
• Balance between volume and quality
– Short SLA (~seconds)
• Periodicity enables trending
• Client
– Convenience for users
– Dealing with volume
• Server
– Caching and in-memory processing
– Feed to other systems with real-time data
• Aggregation: Both client and server end
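Aggregating at both ends could look roughly like the sketch below. It assumes each client folds raw observations into one summary per flush interval and the server merges those summaries. `ClientAggregator` and `merge` are hypothetical names. Summaries made of count, sum, min, and max are convenient because they can be merged without access to the raw values.

```python
from collections import defaultdict


class ClientAggregator:
    """Hypothetical client-side buffer: folds raw observations into one
    summary per metric per flush interval before anything is sent."""

    def __init__(self):
        self._values = defaultdict(list)

    def observe(self, metric_id, value):
        self._values[metric_id].append(value)

    def flush(self):
        """Return one summary per metric and reset the buffer."""
        summaries = {
            metric_id: {"count": len(v), "sum": sum(v), "min": min(v), "max": max(v)}
            for metric_id, v in self._values.items()
        }
        self._values.clear()
        return summaries


def merge(summaries):
    """Server side: combine per-client summaries for the same interval."""
    merged = {}
    for summary in summaries:
        for metric_id, s in summary.items():
            m = merged.setdefault(
                metric_id, {"count": 0, "sum": 0, "min": s["min"], "max": s["max"]}
            )
            m["count"] += s["count"]
            m["sum"] += s["sum"]
            m["min"] = min(m["min"], s["min"])
            m["max"] = max(m["max"], s["max"])
    return merged


a, b = ClientAggregator(), ClientAggregator()
for ms in (120, 80, 100):
    a.observe("latency_ms", ms)
for ms in (300, 50):
    b.observe("latency_ms", ms)
total = merge([a.flush(), b.flush()])
print(total)
```

In this example five raw observations leave the clients as two summaries, which is how client-side aggregation reduces the volume the server has to handle.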
22
Metrics: Which metrics?
• SAAS: Null searches, SEO traffic, Shipping option selection, Unsuccessful login rate
• PAAS: Requests per second, Errors per second, Latency, Services, GC overhead
• IAAS: CPU, Network, Memory, Disk, Load balancer
23
Metrics: Dashboard and UI
24
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
Agenda: Alerts
25
Alerts
• Static thresholds
– E.g., Machine CPU > 70%, Response time > 500 ms
• Cliffs
– Bollinger bands
• Slow poison
– Day over day or week over week comparison
• Alerts and Alarms
– Multiple correlated alerts => Alarm(s)
– Alarms are time sensitive
• Proactive vs. Reactive detection
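A minimal sketch of static thresholds and of promoting correlated alerts to an alarm. The two thresholds come from the slide's examples. The metric names, the sample data, and the rule "two or more simultaneous alerts on one host make an alarm" are assumptions chosen for illustration.

```python
# Thresholds from the slide's examples; rule and metric names are hypothetical.
RULES = {
    "cpu_percent": 70,        # Machine CPU > 70%
    "response_time_ms": 500,  # Response time > 500 ms
}


def evaluate(sample):
    """Return the names of all static-threshold rules the sample violates."""
    return sorted(name for name, limit in RULES.items() if sample.get(name, 0) > limit)


def correlate(alerts_by_host, min_alerts=2):
    """Raise an alarm for hosts on which several alerts fire at the same time."""
    return {
        host: alerts
        for host, alerts in alerts_by_host.items()
        if len(alerts) >= min_alerts
    }


samples = {
    "app-01": {"cpu_percent": 92, "response_time_ms": 740},
    "app-02": {"cpu_percent": 75, "response_time_ms": 210},
    "app-03": {"cpu_percent": 35, "response_time_ms": 180},
}
alerts = {host: evaluate(s) for host, s in samples.items()}
alarms = correlate(alerts)
print(alerts)
print(alarms)
```

Here `app-02` produces a single alert, while `app-01` violates both rules at once and is escalated to an alarm.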
26
Cliff Detection: Bollinger Band
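A Bollinger band is a moving average plus and minus a multiple of the moving standard deviation. A sample that falls outside the band signals a sudden change (a "cliff"). The sketch below uses a 5-point window and a multiplier of 2 on made-up data. The talk does not state its window or multiplier, so both are assumptions.

```python
from statistics import mean, pstdev


def bollinger_breaches(series, window=5, k=2.0):
    """Return (index, value) pairs that fall outside mean ± k * stddev
    of the preceding `window` points."""
    breaches = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mid = mean(history)
        width = k * pstdev(history)
        if abs(series[i] - mid) > width:
            breaches.append((i, series[i]))
    return breaches


# Hypothetical requests-per-second samples with a sudden drop at index 8.
requests_per_second = [100, 102, 98, 101, 99, 100, 101, 99, 40, 99]
print(bollinger_breaches(requests_per_second))
```

Unlike a static threshold, the band adapts to the recent behaviour of the metric, so the same code can watch series with very different normal levels.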
27
Slow Poison: Week Over Week Analysis
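A slow poison degrades gradually, so no single sample looks like a cliff. Comparing each value with the same day of the previous week exposes the drift. The sketch below assumes a 10% tolerance and uses made-up daily latency figures. Neither is from the talk.

```python
def week_over_week(current, previous, tolerance=0.10):
    """Compare each day of this week with the same day last week and
    return the days whose relative change exceeds `tolerance`."""
    flagged = {}
    for day, value in current.items():
        baseline = previous.get(day)
        if not baseline:
            continue  # no baseline (missing or zero): nothing to compare against
        change = (value - baseline) / baseline
        if abs(change) > tolerance:
            flagged[day] = round(change, 3)
    return flagged


# Hypothetical average response times in milliseconds.
last_week = {"Mon": 200, "Tue": 210, "Wed": 205, "Thu": 198, "Fri": 220}
this_week = {"Mon": 204, "Tue": 218, "Wed": 226, "Thu": 231, "Fri": 262}
print(week_over_week(this_week, last_week))
```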
28
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
Agenda: Self-healing
29
Self-Healing: In the Cloud
• Remediate (PAAS)
• Deploy (PAAS)
• Monitor (PAAS)
• Provision (IAAS)
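One way the four stages could fit together is sketched below as a toy in-memory loop. Every function name, the pool structure, and the "replace an unhealthy host with a fresh one" policy are hypothetical. The slide names only the stages and the layer each belongs to.

```python
# All names below are hypothetical; the slide only names the four stages.

def provision(pool):
    """IAAS: obtain a fresh machine and return its name."""
    host = f"{pool['name']}-{pool['next_id']:02d}"
    pool["next_id"] += 1
    return host


def deploy(host, pool):
    """PAAS: install the application on the host and mark it healthy."""
    pool["hosts"][host] = "healthy"


def monitor(pool):
    """PAAS: report the hosts that fail their health check."""
    return [h for h, state in pool["hosts"].items() if state != "healthy"]


def remediate(pool):
    """PAAS: replace every unhealthy host with a newly provisioned one."""
    replaced = []
    for bad in monitor(pool):
        del pool["hosts"][bad]
        fresh = provision(pool)
        deploy(fresh, pool)
        replaced.append((bad, fresh))
    return replaced


pool = {
    "name": "search",
    "next_id": 3,
    "hosts": {"search-01": "healthy", "search-02": "unresponsive"},
}
print(remediate(pool))
print(pool["hosts"])
```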
30
• Background
• Monitoring
– Logs
– Metrics
– Alerts
– Self-healing
• Summary
Agenda: Summary
31
Summary: Monitoring at Scale
Data in eBay is BIG and getting BIGGER. Need Big Data for Scale
Scope of Monitoring includes Logs, Metrics, Alerts and Self-Healing
Data Quality versus Data Volume
Multiple Client Sensors
Monitoring and management at Scale needs Self Healing
32
Connect with us
o raju@ebay.com (@raju_kolluru)
o msomani@ebay.com (@mahesh_somani)
We are Hiring!
o Opportunities in Java, Big Data, software applications and systems
indiajobs@ebay.com
Q & A


Editor's Notes

  1. Platform Services pioneers the SAAS vision at eBay
  2. [Raju] Can we have some other pic ? Should we change the topic to Monitoring Philosophy and Overview ?
  3. Fonts are not very visible
  4. Put Hadoop logos(elephant) & UI logo
  5. We currently have 3 logging slides; will be good if we reduce by 1.
  6. Can't see the text
  7. Pretty Busy slide; reduce content by 20%
  8. Show a Stack Diagram like the one in NHT
  9. Dilbert cartoon
  10. Alerts and Alarms: Differences; Static Threshold; Cliffs; Slow poison
  11. Dilbert cartoon