CIO Interview about Flopsar APM - Application Performance Management
Galaxy
or the escape from illusion
Michał Zabiełło
A new way to visualize system performance developed by a Polish company has been gaining
recognition. The solution is already used by several dozen Polish companies and resolutely cuts through
the well-known weaknesses of APM solutions.
One of the elements which can deliver rational savings in IT is the group of tools for application
performance management (APM). Large corporations are investing in purchases of APM tools. The
providers of such solutions are implementing tens of dashboards, hundreds of graphs and flow
diagrams. They define thousands of various alerts and inundate the mailboxes of relevant recipients
with messages about the “health check” of business processes. This is designed to convince everyone that the
scattered IT infrastructure is under control. It all works until a serious malfunction occurs. IT specialists
then try to identify the cause of the problem and have to analyze millions of out-of-date, unnecessary or erroneous
pieces of information coming from the implemented tools.
Bombarded by alerts
The tools to diagnose or monitor applications are of key importance. Good tools are expensive – they
require many laboratory checks, tests, and a precise manufacturing process. Good and expensive tools
are, in turn, complicated.
It is worth noting that such products have a specific methodology connected with performance
management: we install a tool, configure the scope of reported metrics and build a complicated “health
check” application to warn us about problems occurring in the monitored applications. In practice, the
system warns us about a problem that has occurred – but the cost of using, maintaining and developing
the application is often higher than planned.
Dashboards have become, paradoxically, the Achilles’ heel of those tools – every monitored application
has to have a set of hierarchical dashboards, and each piece of information presented on them requires a set
of defined SLA parameters which determine the result of the “health check”, signaled by the
colors green, yellow, or red. This signaling is not unequivocal – it is not clear whether it means a failure
of the system or just a slowdown, whether the problem concerns a single function or a whole set.
The tools are bombarding the administrators with information. The command center has its hands full
with sifting and separating false alarms from those responsible for disruptions in data processing. The
implementation specialists responsible for tools are constantly working on updating and adapting the
dashboards to frequently changing applications or requirements concerning notifications about
application problems.
The command center has its hands full with
separating false alarms. The implementation
specialists responsible for tools are constantly
working on updating and adapting the
dashboards to frequently changing applications
or requirements concerning notifications about
application problems. That is how APM
operates.
In search of an intuitive APM
In 2012 a group of programmers experienced in implementing and administering APM solutions
formed a company. Its goal was to create a solution which would overcome the weaknesses and
limitations of monitoring systems and increase the performance of applications. “Our point of departure
in creating the system was a fundamental question: Do data from monitored systems, alerts and trends
have to be represented in a way which requires huge outlays?” – says Grzegorz Pawluk, CTO and one of
the co-founders of Flopsar Technology.
Perhaps it is possible to show in a simple, intuitive manner what is most important for IT services:
that a malfunction has just occurred;
that the users may complain about the system working inefficiently;
that the provider implemented a badly written application which cannot function in an overloaded environment;
that the application is using up too much of the power of the expensive equipment.
Those commonsensical assumptions are behind Flopsar (Flop Search and Rescue). The creators of
Flopsar Suite asked themselves one more question: “What is really important in the tangle of
information reported from the monitored system?” And they formulated the following answers:
1. Simple implementation and no need for an advanced configuration: Plug-and-play.
2. No need to train people who benefit from the tool.
3. SIMPLE, intuitive interface (preferably one window).
4. Maximum productivity - to discover a problem and to find its cause, the user should not need to
perform more than three operations.
5. No “early warning systems” based on labor-intensive development.
Flopsar Galaxy
Innovation can be seen in the approach to
the project. The Flopsar project started with
designing the infrastructure: messages,
protocols, engines, data structure,
mechanisms for load-balancing and
bypassing the malfunction. The entire
infrastructure was programmed in the C
language.
Flopsar does not aggregate data. It does not
show averages, medians or quartiles. With
unstable systems the sample is too large and
such statistics are therefore not credible. The galaxy shows EVERY
single operation performed within the monitored
system. Each time a transfer is performed or
someone logs into an application, a dot
appears, positioned by the time of the event
(X axis) and the response time (Y axis). The
majority of “correct” times (the ones with
sufficient processing quality) are concentrated
within the lower registers of the galaxy. The dots
form a multicolored plane there. If an application
or its function has slowed down or
malfunctioned, the dots migrate into the upper
registers of the galaxy and form various
concentration patterns. The fact that those
concentrations appear in the galaxy is the reason
for further investigation. The concentrations are
automatically detected by a system based on
artificial intelligence algorithms or may be
marked manually in order to identify the reason
for their occurrence. After marking, the user
receives a precise diagnosis of what is
not working correctly in the system and why.
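
The idea of plotting every operation as its own dot and then looking for concentrations in the upper registers can be sketched in a few lines of Python. This is only an illustration: the `Operation` record, the 1-second threshold, the 60-second bucket and the minimum dot count are assumptions made for the example, and the simple bucket count stands in for Flopsar's own detection algorithms, which the article does not describe.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One monitored operation: a single dot in the galaxy (hypothetical record)."""
    timestamp: float      # X axis: when the operation happened, in seconds
    response_time: float  # Y axis: how long it took, in seconds


def find_concentrations(ops, slow_threshold=1.0, bucket_seconds=60, min_dots=3):
    """Return start times of time buckets holding at least min_dots slow operations.

    Every operation is counted individually; nothing is averaged.
    Threshold, bucket size and count are illustrative assumptions.
    """
    slow_per_bucket = Counter(
        int(op.timestamp // bucket_seconds) * bucket_seconds
        for op in ops
        if op.response_time > slow_threshold
    )
    return sorted(start for start, n in slow_per_bucket.items() if n >= min_dots)


# Synthetic data: steady fast traffic plus a short slowdown after t = 60 s.
ops = [Operation(t, 0.2) for t in range(0, 180, 10)]
ops += [Operation(t, 4.0) for t in (65, 70, 75, 80)]

concentrations = find_concentrations(ops)
print(concentrations)  # [60]
```

The fast operations never leave the lower part of the plot, so only the bucket containing the four slow dots is reported as worth investigating.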
After several days of working with the Flopsar
system administrators begin to feel that they
know what they see. Based on events observed in
the past and interpreted concentrations they may
say “the queue system got disconnected again,”
Flopsar in UFG:
Production monitoring of critical applications
Reduction of production problems related to application performance
Code optimization – shorter response times
Reduced use of hardware infrastructure
How quickly does one learn to draw conclusions
based on Flopsar visualization?
“We collect millions of records on policies,
drivers and road events. It is critical to
ensure the reliability and quality of
operation of the IT systems which perform
our statutory tasks. We selected the Flopsar
Suite because of its intuitiveness and
functionality. The tool was implemented
within a few hours and its effective
operation by the team of administrators
started immediately after the
implementation. The factors in favor of
choosing Flopsar also included costs, the
level of after-sales service, flexibility and the
range of additional services offered
by the provider. The data obtained from
monitoring indicate unequivocally where
the problem has occurred and, therefore,
who is responsible for its servicing or repair.
Today, we use the information obtained
from Flopsar software in many cases as an
argument in our negotiations with our IT
service providers” – says Grzegorz
Rymarski, IT Department Director, The
Insurance Guarantee Fund (UFG).
or “web service is not working again” or even ignore the pattern as something natural.
The system works without configuration – there is no need to construct dashboards, to define static SLA
for selected methods, to provide expensive system maintenance. Once the monitoring system has been
switched on, the application server processes data, the monitor starts showing concentrations and the
administrator starts looking for unnatural and disturbed concentration patterns.
Innovation through going back to the roots
Is the “galactic” way of showing data innovative and unique? Scatter plots are used in statistics to visualize
data. Grzegorz Pawluk explains: “Flopsar reports every transaction performed in the monitored system
separately. It connects stack frames into stack traces and then reports the aggregated duration of the
transaction as one point (with full access to all the remaining data). In this type of service, the volume of
data which needs to be recorded in the monitoring base is gigantic. Therefore, it is the database
infrastructure (data persistence) and not the data-generating agent which is the ‘heart’ of the Flopsar
system.”
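
As a rough illustration of reporting one point per transaction, the sketch below groups frame records by transaction and reduces each group to a single point while keeping the full trace attached. The tuple layout `(transaction_id, method, start, end)` and the method names are hypothetical, chosen only for this example; the article does not describe Flopsar's actual data format.

```python
from collections import defaultdict


def transactions_to_points(frames):
    """Group frames by transaction and reduce each transaction to one point.

    frames: iterable of (transaction_id, method, start, end) tuples with
    times in seconds. This record layout is an assumption for illustration.
    Returns {transaction_id: {"x": start, "y": duration, "trace": [...]}}.
    """
    traces = defaultdict(list)
    for tx_id, method, start, end in frames:
        traces[tx_id].append((method, start, end))

    points = {}
    for tx_id, trace in traces.items():
        trace.sort(key=lambda frame: frame[1])        # order frames by start time
        begin = trace[0][1]
        finish = max(end for _, _, end in trace)
        # One dot per transaction; the trace stays available for drill-down.
        points[tx_id] = {"x": begin, "y": finish - begin, "trace": trace}
    return points


frames = [
    ("tx1", "Servlet.service", 10.0, 10.5),
    ("tx1", "Dao.query", 10.1, 10.4),
    ("tx2", "Servlet.service", 11.0, 13.0),
]
points = transactions_to_points(frames)
print(points["tx1"]["y"], points["tx2"]["y"])
```

Because nothing is discarded, the number of stored records grows with every call, which is consistent with the remark that persistence rather than the agent is the demanding part.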
Innovation – or perhaps rather the return to healthy roots – can be seen in the approach to the project.
The Flopsar project started with designing the infrastructure: messages, protocols, engines,
data structure, mechanisms for load-balancing and bypassing the malfunction. The entire
infrastructure was programmed in the C language – the most efficient programming language. The
code, which has 5,000,000 lines, was written from scratch and entirely without using any
external (e.g. open-source) libraries. The engineers and Flopsar support are responsible for
100% of the solution. Tests and production implementations prove that Flopsar can process
around 40,000 metrics per second or a cumulated load at the level of 200 MB/sec for a single
database instance in the 24/7/365 mode.
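
To put those throughput figures in perspective, here is a back-of-the-envelope calculation. It assumes decimal megabytes (1 MB = 1,000,000 bytes) and, for the per-metric figure, that both numbers describe the same sustained load; neither assumption is confirmed by the article.

```python
METRICS_PER_SECOND = 40_000           # figure quoted in the article
BYTES_PER_SECOND = 200 * 10**6        # 200 MB/sec, assuming decimal megabytes
SECONDS_PER_DAY = 24 * 60 * 60

# Only meaningful if both figures refer to the same load (an assumption).
bytes_per_metric = BYTES_PER_SECOND // METRICS_PER_SECOND

metrics_per_day = METRICS_PER_SECOND * SECONDS_PER_DAY
terabytes_per_day = BYTES_PER_SECOND * SECONDS_PER_DAY / 10**12

print(bytes_per_metric, metrics_per_day, terabytes_per_day)
```

Under these assumptions a single database instance would take in about 3.5 billion metrics and roughly 17 TB per day at full load, with about 5 KB of data per metric.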
In 2013 Flopsar Technology, as the only APM software provider in the Polish market to do so,
implemented its solution on approximately 100 production application servers, and in cooperation with
strategic business partners it carried out several dozen projects to optimize critical systems.
During the same period, the competitors recorded a few individual license sales in
Poland. At this time, the company, together with a number of partners, is running a few Proof of
Concept projects. “We estimate that by the end of 2014 the number of implementations will
exceed 300 monitored application servers in mission-critical systems. This will make
Flopsar Technology an unrivalled market leader in the field of monitoring and managing the
performance of critical applications based on Java servers” – says Grzegorz Pawluk. In the boxes
you can see examples of using Flopsar at UFG and Generali – together with their top IT
managers’ comments.
CIO Magazine asked Michał Zaremba, IT Infrastructure Project Manager, IT Department Support and
Infrastructure Section, Generali Group, to comment on detailed changes related to the Generali Group
APM solution implementation.
The Generali Group:
Sales management system production monitoring
Complete detection of all production issues (failures, delays, defects)
Full control over IT system production version acceptance – early issue detection, application code optimization suggestions, architecture and performance issue consulting
Code refactoring – processing optimization (performance increase)
Capacity requirement estimation for increased data processing periods
Flopsar Suite – Who should manage quality and efficiency?
Until recently Flopsar Suite was utilized by the Generali Group only for early detection of performance
issues in production systems. It was handled by the team responsible for IT system and service
monitoring. During performance testing developers were using it to discover inefficient methods and
queries. Further experiences with the Flopsar Suite helped develop a different, more effective
application performance monitoring model.
If you take a closer look at the tool, it is difficult to decide whether this is an advanced application
server performance monitoring system or a reporting system designed for analyzing IT system
operation performance. In the first case Flopsar may be perceived as just another monitoring system
utilized in maintenance activities, and in the second case as an additional system for supporting
application development and service transition from the development to the maintenance stage.
However, one must realize that in order to provide our customers with top value and performance, a
very deep synergy of these areas is required. This also opens up extensive process optimization
capabilities by eliminating unnecessary IT resource consumers which provide no value to service
recipients.
Department structure transformation and transition to a dev-ops concept enabled Flopsar Suite to
finally end up in a spot where its full capabilities may be utilized – in the hands of a team responsible for
IT applications and services, covering both their development and operational activities. The important fact is
that system utilization in both areas is very similar, and therefore requires no changes in teamwork
style or mode, or any additional training.
Theoretical conclusions and diagnosis are supposedly delivered by Flopsar very quickly. How quickly,
and have you been successful in transforming them into IT process and product optimization?
The use of Flopsar enables us to greatly improve the speed of handling incidents in a production
environment. The time between an anomaly appearing in a production system and corrective actions
being launched by the team is nearly zero. In the past, if an end-user had a subjective feeling that the
system was not performing well, such information had to pass through multiple IT organization levels. Now
this information is visible to an expert precisely when the user begins to feel the system becoming less
responsive. All in all, the user reports problems to the service desk like before, but the service desk
already knows about faulty system operations, and about an intervention being underway. This greatly
cuts down on the time required to resolve incidents, because we are able to find the problem-causing
method, service, or query in a quick and intuitive fashion.
Application development and test processes have also been optimized. Thanks to monitoring
applications in development and test environments, we are able to discover operations with execution
time beyond acceptable limits.
By analyzing the number of particular calls in a given period of time we are able to define business
activity patterns, and as a result, properly manage IT service capacity, performance, and demands. This
also enables us to properly schedule change management processes, including planned maintenance
outages.
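
A minimal sketch of this kind of analysis is shown below: it counts calls per hour of day and picks the quietest hour as a candidate maintenance window. The function names and the synthetic traffic are invented for the example, and choosing the window by raw call count alone is a simplifying assumption, not the procedure used at Generali.

```python
from collections import Counter
from datetime import datetime


def calls_per_hour(call_times):
    """Count calls by hour of day (0-23) from an iterable of datetimes."""
    return Counter(t.hour for t in call_times)


def quietest_hour(call_times):
    """Return the hour of day with the fewest calls (hours with no calls count as 0)."""
    counts = calls_per_hour(call_times)
    return min(range(24), key=lambda hour: (counts[hour], hour))


# Synthetic example: busy office hours, light night traffic, almost none at 03:00.
call_times = []
for hour in range(24):
    if hour == 3:
        n = 1
    elif 8 <= hour < 18:
        n = 50
    else:
        n = 5
    call_times += [datetime(2014, 1, 6, hour, i % 60) for i in range(n)]

pattern = calls_per_hour(call_times)
maintenance_hour = quietest_hour(call_times)
print(pattern[9], pattern[22], maintenance_hour)
```

The same per-hour counts, collected over weeks or months, are what would reveal the recurring activity patterns the answer refers to.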
Based on those patterns and query statistics, is it possible to optimize other organizational processes
and activities? Can the solution become a source of other innovations?
If the business process is performed in an IT system which is covered by Flopsar analysis, all system
operations are registered and may be analyzed. Specific data visualization enables us to identify
business process activities which are performed inefficiently.
Usually a business process performed in an IT system is treated by a business user as an operation with a
definite start and end. In reality, this process includes multiple operations which reach beyond the
application, towards the integration architecture, the database, and other systems. Advanced BPM
systems feature a Business Activity Monitoring (BAM) component, which may be utilized to optimize
business processes. However, if applications are developed in-house, a business process monitoring tool
supported by the particular applications should also be provided. If the owner decides not to
implement such functionality in the developed application, database-based deduction, which
may be provided by the Flopsar system, may be helpful.
Has capacity demand forecast accuracy improved? Has this led to optimizing infrastructure usage?
In terms of infrastructure optimization for application performance Generali relies on three base
techniques: monitoring technical parameters of infrastructure components (using SNMP, WMI, etc.),
optimizing load balancing, and application performance monitoring using the Flopsar Suite.
The first and second techniques are known and used by many organizations, but only an analysis of
correlations between all of the above provides a complete picture for capacity forecasting. This may be
done by relating technical parameters of infrastructure components to the execution time of an
operation in a monitored application.
The character of recent Generali marketing activities required a temporary multi-fold capacity increase
in Merkury 2.0 – the primary sales system utilized by Generali. At first, we considered linear server
infrastructure component scaling. When testing the solution with Flopsar, it turned out that there are
multiple factors which may greatly influence performance and may be modified in order to increase
system capacity. We noticed that standard load balancing techniques may have an adverse effect on the
time required to perform operations by a single user. Load balancing conditioning based on
infrastructure and system parameters enabled us to provide a solution which featured the same
efficiency for every user. Curiously, the tests have shown that the Flopsar Suite impact on environment load
falls below 1–2%. Finally, after completing several optimizations, we reached a state where the
system load increase could be handled without modifying the server infrastructure at all. After
completing this marketing activity we were able to reduce that infrastructure.
How did the transition to the new method of observing sales efficiency go, especially in case of
interpreting event distribution visualizations? Did the users easily reach a new deduction process?
Flopsar Suite is an intuitive package. The system is currently used by the IT department, but we are
seriously considering sharing its data with business users, who might then use it to optimize business
processes.
However, you have to consider the fact that business users often require numerical data, not graphical
presentations, in order to perform data analysis. If Flopsar were to be used for sales efficiency analysis, it
would be good if it had an option to provide results in a numerical format. For example: Departments
responsible for sales care not only about how the system performance influences product sales, but also
what the product search operation distribution is during particular hours, within given months or within
the year.
The fact that Generali reached such an advanced level of tool use proves that the system is easy to
handle. We also noticed that the tool may be used in an even more optimized fashion if additional
expertise is gained pertaining to its operation: analysis, result interpretation, as well as building report
extensions. It is worth mentioning that all the data collected in the Flopsar database are available to our
developers through a dedicated API.
Are process and factor complexity considered limitations for the application performance
visualization method proposed by Flopsar? If so, how can this be circumvented?
Most probably everyone who was ever responsible for IT system performance optimization has faced
uncertainty whether the system operates the same way between measurements as during
measurements. This is typical for systems where performance is measured at established time intervals.
Flopsar analyzes every operation within the system. If we do not filter particular calls in a so-called
galaxy, every point represents one system call. If the processes performed are of high complexity, we
are forced to operate on a large number of geometrically correlated points. In such a case data analysis
requires verifying particular calls amongst a larger number of those measured and presented. This might
become a limitation due to the speed of data analysis by an expert. It may also adversely impact the
application server load due to Flopsar collecting data. This can be circumvented if we utilize techniques
to exclude particular calls which are outside our interest. This is possible at the system
administration level, which enables monitoring to be developed individually for every application.
Another method to reduce the data which do not require analysis is an option to filter by minimum
and maximum operation time in the analyzed system. Finally, in case of systems working on several
application servers, we are able to change the point colors depending on the server. I believe that it
would be useful if there was an option to define item colors in a custom fashion, e.g. based on the type
of system operation or on the execution time.
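
The two data-reduction ideas mentioned in this answer, filtering by operation time and coloring points by server, can be sketched as follows. The point layout, the server names and the palette are assumptions made for the example, and reading the filter as inclusive lower and upper bounds on duration is an interpretation of the answer rather than a documented Flopsar option.

```python
def filter_by_duration(points, min_seconds=None, max_seconds=None):
    """Keep only points whose duration lies within the given bounds (inclusive)."""
    kept = []
    for point in points:
        if min_seconds is not None and point["duration"] < min_seconds:
            continue
        if max_seconds is not None and point["duration"] > max_seconds:
            continue
        kept.append(point)
    return kept


def color_by_server(points, palette=("red", "green", "blue", "orange")):
    """Map each server name to a color, in order of first appearance."""
    colors = {}
    for point in points:
        server = point["server"]
        if server not in colors:
            colors[server] = palette[len(colors) % len(palette)]
    return colors


# Hypothetical points: one dict per system call.
points = [
    {"server": "app1", "operation": "login", "duration": 0.05},
    {"server": "app2", "operation": "search", "duration": 2.40},
    {"server": "app1", "operation": "search", "duration": 3.10},
    {"server": "app2", "operation": "login", "duration": 0.30},
]

visible = filter_by_duration(points, min_seconds=0.1, max_seconds=5.0)
colors = color_by_server(visible)
print(len(visible), colors)
```

Keying the color map on `"operation"` instead of `"server"` would give the custom coloring wished for at the end of the answer; that variant is a suggestion from the interview, not a feature the article says exists.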