This presentation describes how we deployed Splunk within the Forex and Fixed Interest divisions of one of Australia's big 4 trading banks.
The deployment enabled the bank to move closer to a DevOps environment, while also saving it considerable money through the consolidation of its FX & FI platforms.
2. Objectives & Deliverables:
Reduce IT overheads by integrating the FX & FI platforms.
Provide an e-trading platform for internal and external use, with a browser-based, custom user interface (UI).
3. Objectives & Deliverables:
A unified view for Business & IT Operations, providing real-time BI & actionable business analytics.
4. Objectives & Deliverables:
Establish real-time monitoring of trading activity & underlying technology, supporting issue resolution & analysis.
Monitoring & Analysis Targets (Inputs)
● Business Transactions: Functions | Actors | Flows
● Technology: Apps | App Infra | Integration | Servers/Storage | Comms
Customer Project (Functions)
● Monitoring & Alerting
● Status Dashboards & Query
● Historical Analysis
● Support & Incident Management
● Investigation & Resolution
Project for Business
● Client & Channel Monitoring, Business Function & Flow Monitoring, User Support Case Mgmt, Event & Transaction Investigation, Business Performance Analysis
Project for Technology
● Application & Integration Monitoring, Infrastructure Status Monitoring, Technology Incident Case Mgmt, Technology Investigation, Fix/Test Support, Technical Performance Analysis
5. Business Benefits
● Identify “stuck” trades, each worth millions of dollars
● Identify potential system impacts to trades
● Quickly identify all parties involved in a trade and its details
Enablement of BizOps and DevOps
● Business Operations can see into IT systems
● IT Operations can see business impacts
Faster feedback on development and testing
● Bugs identified in SIT and Staging environments
6. Splunk as a Solution for Client Project:
Example:
A BUY order for $5,000,000 AUD/HKD at a rate of 7.20354 has taken more than 5 seconds to clear the booking system.
Flag as RED and drill into the transaction.
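The rule in this example can be sketched in a few lines. This is only an illustration, not the dashboard's actual search: the 5-second threshold comes from the example above, while the `Trade` fields, the `flag` function, the "OK" label and the 6.2-second clearing time are hypothetical.

```python
from dataclasses import dataclass

# Threshold taken from the slide's example; everything else is illustrative.
CLEAR_THRESHOLD_SECONDS = 5.0


@dataclass
class Trade:
    side: str
    notional: float
    pair: str
    rate: float
    seconds_to_clear: float  # hypothetical field: time taken to clear booking


def flag(trade: Trade) -> str:
    """Return "RED" when the trade took longer than the threshold to clear."""
    if trade.seconds_to_clear > CLEAR_THRESHOLD_SECONDS:
        return "RED"
    return "OK"  # hypothetical label for trades within the threshold


example = Trade("BUY", 5_000_000, "AUD/HKD", 7.20354, seconds_to_clear=6.2)
print(flag(example))
```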
8. Challenges:
Constraints for the Solution
● Had to use Simple XML, as the solution needed to be accessible to bank developers and business operations
● Few moving parts (initially no Nagios or other products)
● Performance: needed as little page reloading as possible
● Initially a very small deployment to test out the technology
Requirements for Splunk
● Real-time views and alerting
● Environment-aware Service Model
9. Business Ops and IT Ops
Business Flows
● Login
● Credit Check
● Deal Capture
● Reference Data
● Price Distribution
IT Components
● Apache WebServer
● Apache Tomcat
● WebStreaming
● FX Trading Core
● Integration Server
● Credit
● Rates Adaptor
● Cache
● DB
● RedHat Linux
● Network/Storage
10. Business Process Status Flows
Business flows relate directly to system components
Client/User Login Processes
Pricing / Reference Data
Deal Capture / STP
Credit Check
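One way to picture the relationship between flows and components is a small lookup table with a roll-up rule. This is a sketch under stated assumptions: the slides list the flows and the components but not which component supports which flow, so the `FLOW_COMPONENTS` mapping is hypothetical, and the "worst component wins" rule is an assumed roll-up, not a documented one.

```python
# Hypothetical mapping: the slides do not say which components back which flow.
FLOW_COMPONENTS = {
    "Login": ["Apache WebServer", "Apache Tomcat"],
    "Credit Check": ["Credit", "Integration Server"],
    "Deal Capture": ["FX Trading Core", "Integration Server", "DB"],
    "Price Distribution": ["Rates Adaptor", "WebStreaming", "Cache"],
}

# Higher number means worse health.
SEVERITY = {"Up": 0, "Degraded": 1, "Down": 2}


def flow_status(flow, component_status):
    """Assumed roll-up rule: a flow is only as healthy as its worst component."""
    statuses = [component_status[name] for name in FLOW_COMPONENTS[flow]]
    return max(statuses, key=SEVERITY.__getitem__)


# Usage: every component Up except a degraded Integration Server.
status = {name: "Up" for names in FLOW_COMPONENTS.values() for name in names}
status["Integration Server"] = "Degraded"
print(flow_status("Deal Capture", status))
print(flow_status("Login", status))
```

Keeping the mapping in a plain table mirrors the lookup-based approach: the table can be edited without touching the roll-up logic.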
11. Business Ops & Support Dashboard
• Trade Search
• Process Status
• In-Flight Trades
• Rate Updates
• In-Flight Trade Detail
• Trade Detail/Search Results
12. Trade Search
Trade search allows you to search for any trade booked
The searchable period is limited by data capture volume versus storage space
Estimated to be 4+ years based on testing estimates (1.37 GB per day compressed to 400 MB on a 500 GB index)
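As a rough cross-check of these figures, the retention period can be worked out directly. This is a back-of-envelope sketch that assumes the whole 500 GB index is available for this data, that 1 GB is 1000 MB, and that daily volume stays constant. On those assumptions the plain division gives about 1250 days, or roughly 3.4 years, a little short of the 4+ years quoted, so the quoted estimate presumably rests on inputs not given on the slide.

```python
index_size_gb = 500
raw_per_day_gb = 1.37
compressed_per_day_gb = 0.4  # 400 MB, treating 1 GB as 1000 MB

# Days of data that fit if the whole index holds only this data.
retention_days = index_size_gb / compressed_per_day_gb
retention_years = retention_days / 365

# Fraction of the raw volume that remains after compression.
compression_ratio = compressed_per_day_gb / raw_per_day_gb

print(f"{retention_days:.0f} days, about {retention_years:.1f} years")
print(f"compressed size is about {compression_ratio:.0%} of raw")
```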
You can search for trades on:
● ID e.g. XXX300614-0926474596
● Price (All-in Rate) e.g. 0.94500
● Currency Pair e.g. AUDUSD or AUD/USD (drop-down selection pre-populated with the last 30 days' worth of values)
● Client WID (Legal Entity) e.g. 5100230
● Search Period (default Today → driven by server location → London Time)
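Because a currency pair can be entered as either AUDUSD or AUD/USD, a search form needs to treat both spellings as the same value. A minimal sketch of that normalisation follows; the function name and the choice of the six-letter form as the canonical one are assumptions for illustration, not the deployed implementation.

```python
def normalise_pair(text: str) -> str:
    """Accept "AUDUSD" or "AUD/USD" and return the six-letter form."""
    pair = text.replace("/", "").strip().upper()
    if len(pair) != 6 or not pair.isalpha():
        raise ValueError(f"not a currency pair: {text!r}")
    return pair


print(normalise_pair("AUD/USD"))
print(normalise_pair("audusd"))
```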
13. Component Status
• Provides a high-level overview of the health of all StarXchange components:
– Apache HTTPD [3 node cluster]
– Tomcat [3 node cluster]
– Frontend (Web streaming) [3 node cluster]
– Core [3 node cluster]
– Backend (Integration) [3 node cluster]
– Credit [Single instance failover across 3 servers]
– Rates Adaptor [Single instance failover across 3 servers]
– Cache [2 node cluster]
– DB
• Clicking on any of the processes takes you to the Tech Dashboard, which provides more details about the process status
• Status
– Up = all nodes for a given process are up and running
– Degraded = 1 or 2 of the nodes for a given process are down (except ESB → Degraded only if 1 process is down)
– Down = no nodes are running
– Exceptions are Credit and Pricing Adaptor → these are single node, so they only show Up or Down status
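The status rules above can be expressed as a small function. This is a sketch, not the deployed logic: the function and parameter names are hypothetical, and for the ESB exception it assumes that more than one node down is reported as Down, which the slide implies but does not state.

```python
def component_status(nodes_up, nodes_total, max_down_for_degraded=None):
    """Classify a component as Up, Degraded or Down from its node counts.

    max_down_for_degraded is a hypothetical knob for the ESB exception:
    when set, more nodes down than this is treated as Down (an assumption).
    """
    nodes_down = nodes_total - nodes_up
    if nodes_down == 0:
        return "Up"
    if nodes_up == 0:
        return "Down"
    if max_down_for_degraded is not None and nodes_down > max_down_for_degraded:
        return "Down"
    return "Degraded"


print(component_status(3, 3))                           # 3 node cluster, all up
print(component_status(1, 3))                           # 2 of 3 nodes down
print(component_status(1, 3, max_down_for_degraded=1))  # ESB-style rule
print(component_status(0, 1))                           # single instance down
```

Single-instance components such as Credit are modelled with `nodes_total=1`, so they can only ever return Up or Down.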
14. Rate Updates
• Splunk consumes FIX logs from the application, which contain a message for every rate update
• By default, rate updates are ordered by the number of seconds since the last rate change
• All currency pairs are shown by default
• Captured rate updates denote whether a rate is dealable or non-dealable
• Green = rate update within the last 15 seconds
• Orange = last rate update >15 seconds and <30 seconds ago
• Red = last rate update detected >30 seconds ago
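The colour bands can be sketched as a simple threshold function. The thresholds are taken from the legend above; the function name is hypothetical, and since the legend does not say how exactly 30 seconds is treated, this sketch assumes it counts as Red.

```python
def rate_colour(seconds_since_update: float) -> str:
    """Map the age of the last rate update to a dashboard colour."""
    if seconds_since_update <= 15:
        return "Green"
    if seconds_since_update < 30:
        return "Orange"
    # Assumption: exactly 30 seconds is treated the same as >30 seconds.
    return "Red"


for age in (3, 15, 22, 45):
    print(age, rate_colour(age))
```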
15. Inflight Trades
• Real-time search that displays all booked trades that have not received an Execution Report back
• In theory this panel should be empty at all times
• Any trades that appear in this view should be checked manually to ensure STP of Risk Capture
• Possible reasons why a trade might appear in this view:
– Intermediate queues are down
– Integration Backend is down
– Booking systems are down
– Deal might have been captured in the Dealing system, but the Execution Report was not received by the system to confirm booking
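Conceptually, the in-flight view is the set of booked trades with no matching Execution Report. The sketch below shows that matching on plain lists. The real panel is a Splunk real-time search, so this is only an illustration of the idea, and the function name and trade IDs are hypothetical.

```python
def in_flight(booked_ids, execution_report_ids):
    """Return booked trade IDs with no Execution Report, in booking order."""
    confirmed = set(execution_report_ids)
    return [trade_id for trade_id in booked_ids if trade_id not in confirmed]


booked = ["T-1001", "T-1002", "T-1003"]  # hypothetical trade IDs
reports = ["T-1001", "T-1003"]           # Execution Reports seen so far
print(in_flight(booked, reports))
```

An empty result corresponds to the healthy state in which the panel shows nothing.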
16. Tech Dashboard
• Active Connections Count from respective component
• JVM Status for respective process
• Process Status (same as eTeam Dashboard)
• Disk Space Usage per server being monitored
• CPU Utilisation per server being monitored
17. Data Extraction for Status & Events
There are 3 types of messages sent to Splunk:
1) Business Events: FX transactions
2) Service Events: infra and application notifications
3) Service Status: polled status of system components
These messages are generated by:
1) Consumed log files
2) Process and OS monitoring
3) JMX agent monitoring
Collection components
● Splunk JMX Agent
● Splunk Unix Pack
● Splunk Forwarder: log monitoring
● Splunk Stream
Monitoring Layers
● Network: network quality and performance for ingress and egress connections.
● Hardware: physical machine health and performance metrics of servers and storage.
● OS: metrics and events such as CPU, memory and storage.
● Process: status of the OS process running the monitored component.
● Framework and Runtime (JVM): events and metrics of the Java Virtual Machine, such as threads and garbage collection.
● Application: events and metrics which relate directly to processing of business transactions.
The graphic on this slide shows the Splunk coverage over the monitored layers of system components and supporting infrastructure (for M1 only).
18. Learnings
Splunk
● Creation of a lookup-based service model – this will be moved to a CMDB
● Developed a small AngularJS app to expose some widgets
Splunk Extensions:
● Java agent customized to run standalone with a plugin system, used to scrape JMX
● Lookup Editor used to easily edit business alerts
Adoption and Integration into DevOps
● Automated deployment of Splunk forwarders and Splunk Servers via Chef
● Splunk apps are fully managed in a Git repo, with binaries distributed via Artifactory
● The main Splunk app is packaged with a Vagrantfile, eventgen samples and development settings, and can fully replicate production
Future – expansion to other use cases and a broader scope
19. Troy Bebee is a Managing Consultant at Ecetera and was the lead consultant on this engagement. With over 12 years' experience working directly with Telco and Banking IT teams, Troy is a highly regarded Application Performance Management & DevOps specialist.
troy.bebee@ecetera.com.au @trizow
Our mission is to rid the world of badly behaving applications and sites.
We measure and monitor the performance and availability of enterprise applications.
We diagnose the source of performance issues and provide solutions to improve application functionality.