Analysis Software Benchmark

Presentation Transcript

• ROOT analysis and implications for the analysis model in ATLAS. Akira Shibata, New York University. ACAT 08, Erice, Nov 05, 2008.
• Are we ready to face data from LHC collisions?
  Grid computing? Do we have enough CPU? Tape? Disks? RAM? Do we need T1? T2? T3? AF? Do we need backdoor access? Are the machines maintained? Is it scary? Are they online? Do we have enough bandwidth? Can we copy data across the world? Can we reach the data we need?
  Can we reduce the data size? ESD? AOD? D1PD? D2PD? D3PD? Can we download them? Do we need interactive access?
  How do we write an analysis? How fast do they run? Do we need to buy more disk? How big is my ntuple? Do we need to buy more CPU? Disks? RAM? Are we up to date? Do I look cool if I buy a Mac? Is a virtual machine useful? Why do we use ROOT? What is PROOF? Is Python fast enough? Is it easy to code? How often will I need to process my data? How fast will my analysis run? What can I do to get faster? What are the options? What is the future technology?
• Analysis in the Era of Grid Computing
  [Diagram: tiered computing model with rough size estimates. ESD (~500 kB/evt) at T1, used for central AOD/DPD making (ROOT + POOL); AOD (~100 kB/evt) and D1PD at T2 for Grid analysis and DPD making; D1PD/D2PD/D3PD (30-80 / 10-50 / 1-10 kB/evt) at T3, read via ROOT/ARA; local ntuples and histograms (~1 kB/evt) on the user's desktop.]
  A tiered computing model; a levelled approach is needed to optimize the system. Above all, how well does it work from the physicists' point of view?
• Derived Physics Data
  • DPDs are created using the following operations:
    • Skimming: selecting the events one needs
    • Thinning: selecting the objects one needs
    • Slimming: removing information from objects
  • ESDs hold the full information from reconstruction. AOD and DnPDs are derived with increasing levels of derivation.
  • The primary purpose of the D1PD is to give access to parts of the ESD information that are otherwise difficult to get to.
  • D1/2PD are in POOL format. D3PD refers to any DPD in ntuple format.
  • ESD, AOD and D1PD contents are defined by groups; several types of D1PD are defined by performance groups. D2PD and D3PD are defined by users.
  • First-level analysis (variable calculation, object reconstruction etc.) may be done when D2/3PDs are created.
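The skimming and slimming operations above map directly onto plain ROOT trees as well. Below is a minimal PyROOT sketch, not from the talk, that copies a tree while keeping only selected events (skimming) and a subset of branches (slimming); the file, tree and branch names (`input.root`, `CollectionTree`, `el_*`, `el_n`) are illustrative placeholders.

```python
import ROOT

# Open the source file and tree (names here are illustrative placeholders).
fin = ROOT.TFile.Open("input.root")
tree = fin.Get("CollectionTree")

# Slimming: keep only the electron branches, drop everything else.
tree.SetBranchStatus("*", 0)
tree.SetBranchStatus("el_*", 1)

# Skimming: copy only events passing a simple selection into a new file.
fout = ROOT.TFile("skimmed.root", "RECREATE")
skimmed = tree.CopyTree("el_n >= 2")   # keep events with at least two electrons
skimmed.Write()
fout.Close()
fin.Close()
```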
• Motivation for Profiling ROOT Analysis
  • The primary use of the Grid is event reconstruction, storage and production of reduced data. This is done using the ATLAS software framework, Athena. Some analysis happens here too.
  • However, post-Grid (non-Athena) ROOT analysis is the main stage for physics analysis.
  • This is mostly a user-level decision due to the private nature of physics analysis, but:
    • the situation is becoming more complex due to the availability of new technology;
    • no good summary exists comparing the available options;
    • it is an important ingredient for an efficient analysis model;
    • it is needed for estimating resource requirements.
  • Technical discussions do not always answer practical questions. This study benchmarks analysis "modes" in realistic settings based on wall-time measurements.
• "Flat" vs POOL Persistency
  • Much of the complexity in the current situation is due to the POOL technology (an additional layer on top of the ROOT persistency technology) used in ATLAS. POOL supports:
    • metadata lookup, used by TAG to access events in a large file without having to read the full contents;
    • more flexibility in writing out complex objects, with its own scheme for transient/persistent (T/P) separation and schema evolution.
  • When the decision was made, ROOT persistency was not as mature as it is now:
    • problems writing out STL objects;
    • problems referring to objects in different trees/files.
  • ROOT persistency has improved and now has fewer such issues.
  • ARA enables reading POOL objects from ROOT by calling POOL converters on demand (P->T conversion), at the cost of extra read time.
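As an illustration of the point about STL support, current ROOT versions can write STL containers straight into a flat tree without any POOL layer. A minimal PyROOT sketch (not from the talk; the file name, tree name, branch name and fill values are illustrative):

```python
import ROOT

fout = ROOT.TFile("flat_ntuple.root", "RECREATE")
tree = ROOT.TTree("physics", "flat ntuple with an STL branch")

# ROOT I/O handles std::vector branches natively.
el_pt = ROOT.std.vector('float')()
tree.Branch("el_pt", el_pt)

for event in range(100):           # dummy event loop
    el_pt.clear()
    el_pt.push_back(25.0 + event)  # placeholder electron pT values
    tree.Fill()

tree.Write()
fout.Close()
```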
• Summary of Existing Analysis Modes
  The modes compared are Draw, CINT, ACLiC, PyRoot, g++ and Athena, covering both ntuple and POOL input. Draw, CINT and PyRoot are interpreted, ACLiC and g++ are compiled, and Athena supports both; the languages are C++ (with Draw using a reduced "(C++)--" cut syntax) and Python (PyRoot and PyAthena, interactive). Additional packages layered on top of the bare modes include MakeClass/MakeSelector, SFrame, SPyroot, AMA and the standard Athena development environment and components.
  The most common options were implemented; all code is available in the ATLAS CVS: users/ashibata/RootBenchmark
• Benchmark Analysis Contents
  • A simple Z→ee reconstruction analysis implemented for every mode:
    1. Access the electron container (POOL) / electron kinematics branches (ntuple)
    2. Select electrons using isEM, pT and charge
    3. Fill histograms with electron kinematics (pT and multiplicity)
    4. Combine electrons to reconstruct the Z
    5. Fill a histogram with the Z mass
    6. Write the histograms out in finalize
  • The above was repeated 10 times.
  • Not complex enough for a real analysis, but not entirely trivial.
  • For Draw, electrons are plotted after the isEM/pT/charge selection; no four-vector arithmetic.
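For the ntuple case, the event loop described above amounts to something like the following PyROOT sketch. This is not the code from the ATLAS CVS package, just a minimal illustration under assumed branch names (`el_n`, `el_pt`, `el_eta`, `el_phi`, `el_E`, `el_charge`, `el_isEM`), an assumed tree name `CollectionTree`, and placeholder selection cuts.

```python
import ROOT

fin = ROOT.TFile.Open("ntuple.root")          # placeholder input file
tree = fin.Get("CollectionTree")              # assumed tree name

h_pt = ROOT.TH1F("h_el_pt", "Electron pT;pT [GeV];Electrons", 100, 0.0, 200.0)
h_n  = ROOT.TH1F("h_el_n",  "Electron multiplicity;N;Events",  10, -0.5, 9.5)
h_mZ = ROOT.TH1F("h_mZ",    "Z candidate mass;m(ee) [GeV];Candidates", 100, 0.0, 200.0)

for event in tree:
    # Select electrons on isEM quality and pT; keep the charge for pairing below.
    selected = []
    for i in range(event.el_n):
        if event.el_isEM[i] == 0 and event.el_pt[i] > 20.0:
            v = ROOT.TLorentzVector()
            v.SetPtEtaPhiE(event.el_pt[i], event.el_eta[i],
                           event.el_phi[i], event.el_E[i])
            selected.append((v, event.el_charge[i]))
            h_pt.Fill(event.el_pt[i])
    h_n.Fill(len(selected))

    # Combine opposite-charge pairs into Z candidates and fill the mass.
    for a in range(len(selected)):
        for b in range(a + 1, len(selected)):
            if selected[a][1] * selected[b][1] < 0:
                h_mZ.Fill((selected[a][0] + selected[b][0]).M())

fout = ROOT.TFile("histos.root", "RECREATE")  # "finalize": write histograms out
for h in (h_pt, h_n, h_mZ):
    h.Write()
fout.Close()
```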
• Obtaining Reliable Results
  • Use POSIX measurements as much as possible: wall time from the time module, avoiding the somewhat unstable measurements from TStopwatch.
  • Measurements are affected by other activity on the machine; overcome this by taking multiple measurements.
  • Machine: an Acas (BNL) node under normal load, 3.34 GB memory, 2-core Xeon @ 2.00 GHz, data on NFS.
  • Disk caching leads to misleading results: CPU time = wall time once the data is in memory. Force disk reads by flushing RAM, and do not re-read a file until all other files have been read; alternate between AOD and ntuple analyses.
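A wall-time measurement in this spirit can be as simple as the following Python sketch (illustrative only; `run_analysis` stands in for whichever analysis mode is being timed):

```python
import time

def time_wall(run_analysis, n_events, n_repeats=5):
    """Return the mean wall time (s) over several repeats of the same job."""
    measurements = []
    for _ in range(n_repeats):
        start = time.time()          # POSIX wall clock
        run_analysis(n_events)       # placeholder for one benchmark job
        measurements.append(time.time() - start)
    return sum(measurements) / len(measurements)
```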
• Methodology
  1. Measure the time taken to process an increasing number of events.
  2. Repeat the measurements and take the average for each point.
  3. Fit a straight line to obtain the overhead (offset) and the rate (evt/sec).
  4. Calculate errors from the standard deviation.
  Only the rate is used when comparing the modes; the overhead varies between a fraction of a second and tens of seconds.
  [Plot: wall time (s) vs number of events, AOD input, with fitted init/rate per mode: gpp (init 6.64e+01 s, rate 5.35e+02 Hz), SFrame (3.62e+01 s, 3.15e+02 Hz), Draw (4.62e+01 s, 1.25e+02 Hz), PyAthena (2.74e+01 s, 9.65e+01 Hz), Athena (3.08e+01 s, 6.86e+01 Hz), CINT (5.25e+01 s, 1.85e+01 Hz), PyRoot (2.50e+00 s, 1.24e+01 Hz).]
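Step 3 above is a plain linear fit. A minimal PyROOT sketch of that extraction (illustrative; the event counts and wall times below are made-up placeholders, not measurements from the talk):

```python
import ROOT
from array import array

# Wall time measured at several event counts (placeholder values).
n_events  = array('d', [10000, 20000, 30000, 40000, 50000])
wall_time = array('d', [85.0, 104.0, 122.0, 141.0, 160.0])

graph = ROOT.TGraph(len(n_events), n_events, wall_time)
graph.Fit("pol1", "Q")                 # straight line: time = offset + slope * N

fit = graph.GetFunction("pol1")
overhead = fit.GetParameter(0)         # init overhead in seconds
rate = 1.0 / fit.GetParameter(1)       # events per second
print(f"overhead = {overhead:.1f} s, rate = {rate:.0f} Hz")
```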
• Data and Format

  Contents                                  | POOL                        | Ntuple
  Full contents                             | AOD, 144.22 kB/evt          | CBNT, not tried
  DPD contents (trigger/jets/leptons etc.)  | TopD1PD, 31.42 kB/evt       | TopD3PD, 4.87 kB/evt
  Small DPD contents (tracks + electrons)   | SmallD2PD, 18.74 kB/evt     | SmallD3PD, 0.71 kB/evt
  Very small DPD (electrons)                | VerySmallD2PD, 1.06 kB/evt  | VerySmallD3PD, 0.37 kB/evt

  All samples were derived from FDR2 AODs and produced on PANDA (except AOD and D1PD). Around 10,000 events per file; the total sample size for one data type ranges between 1 GB and 100 GB. This is a use-case driven comparison: the input file sizes are different.
• AOD Analysis Results
  AOD input (rate, error per mode): gpp 535 Hz (3%), SFrame 321 Hz (13%), Draw 138 Hz (35%), Athena 98 Hz (8%), PyAthena 95 Hz (11%), CINT 21 Hz (15%), TSelector 19 Hz (2%), PyRoot 17 Hz (18%)
  • Compiled non-framework analysis is the fastest.
  • Only a small difference between C++ and Python in Athena.
  • CINT is by far the slowest. The slowest modes appear to be reading all containers in the files.
• D1PD Level Comparison (rate, error per mode)
  Top D1PD input (POOL): gpp 1130 Hz (15%), SFrame 721 Hz (17%), Athena 313 Hz (6%), Draw 298 Hz (55%), PyAthena 204 Hz (4%), PyRoot 43 Hz (9%), CINT 26 Hz (6%), TSelector 22 Hz (2%)
  Top D3PD input (ntuple): ACLiC_Opt 58719 Hz (16%), ACLiC 48494 Hz (20%), gpp 45869 Hz (21%), TSelector_ACLiC 18551 Hz (18%), SFrame 9453 Hz (19%), Draw 2343 Hz (15%), Athena 838 Hz (1%), PyRoot 300 Hz (21%), PyAthena 242 Hz (30%), TSelector 39 Hz (3%), CINT 32 Hz (2%)
  Ntuple/POOL ratio: gpp 40.6, SFrame 13.1, Draw 7.9, PyRoot 7.1, Athena 2.7, TSelector 1.8, PyAthena 1.2, CINT 1.2
  An order of magnitude advantage for using ntuples with g++ analysis; much less difference with the non-compiled modes.
• D2PD Level Comparison (rate, error per mode)
  Small D2PD input (POOL): gpp 2132 Hz (6%), SFrame 1679 Hz (29%), Athena 596 Hz (5%), PyAthena 326 Hz (4%), Draw 300 Hz (29%), PyRoot 100 Hz (10%), CINT 29 Hz (4%), TSelector 23 Hz (1%)
  Small D3PD input (ntuple): gpp 71003 Hz (7%), ACLiC_Opt 58223 Hz (18%), TSelector_ACLiC 33579 Hz (23%), SFrame 14597 Hz (26%), Draw 6358 Hz (17%), Athena 855 Hz (3%), PyRoot 382 Hz (22%), PyAthena 367 Hz (28%), TSelector 40 Hz (2%), CINT 32 Hz (1%)
  Ntuple/POOL ratio: gpp 33.3, Draw 21.2, SFrame 8.7, PyRoot 3.8, TSelector 1.7, Athena 1.4, PyAthena 1.1, CINT 1.1
  POOL analysis is about 4x faster than with AOD input. There is a larger difference between Athena and PyAthena with smaller input files. Why?
• Very Small Input Comparison (rate, error per mode)
  Very Small D2PD input (POOL): gpp 2798 Hz (5%), SFrame 2519 Hz (12%), Athena 667 Hz (8%), PyRoot 416 Hz (19%), PyAthena 307 Hz (14%), Draw 294 Hz (47%), CINT 31 Hz (0%)
  Very Small D3PD input (ntuple): ACLiC_Opt 63555 Hz (9%), gpp 48516 Hz (17%), TSelector_ACLiC 34201 Hz (22%), SFrame 13751 Hz (28%), Draw 6777 Hz (16%), Athena 854 Hz (5%), PyAthena 343 Hz (28%), PyRoot 331 Hz (25%), TSelector 40 Hz (1%), CINT 32 Hz (1%)
  Ntuple/POOL ratio: Draw 23.0, gpp 17.3, SFrame 5.5, Athena 1.3, PyAthena 1.1, CINT 1.0, PyRoot 0.8
  D2PD is getting even closer to D3PD; a few thousand Hz is possible with ARA. The ntuple mode is still a factor of 5-10 faster in the C++ modes.
• Event Size and I/O Dependency Comparison
  [Plots: event size * execution rate (kB/s) vs event size (kB) for each mode, for POOL analysis and for ntuple analysis.]
  There is a clear I/O constraint above ~20 kB/evt in the POOL analysis, coming from the event size in the file, NOT the read-out size. Ntuples are usually smaller than 20 kB/evt.
• Summary
  • Very clear performance advantage for the ROOT native ntuple format: an order of magnitude difference. Ballpark figures: thousands of events/sec versus hundreds of Hz. These numbers should be taken as upper limits; real analyses would be more complex.
  • Compiled mode is roughly two orders of magnitude faster than the non-compiled options.
  • Use of frameworks, even quite simple ones, can slow things down, though any realistic analysis requires some infrastructure. Choose/write frameworks wisely!
  • With Athena, the overhead of the framework seems large, though typical DPD jobs can be highly CPU intensive.
  • The effect of file caching by the system ties the input event size to the execution rate, regardless of the actual read-out. Above 20 kB/evt the analysis is bound by this effect. This is a very tight slimming/thinning requirement for D1/2PD; it may be possible to improve on it with high-performance disks.
• Acknowledgement
  I have bothered a lot of people with this project, including (in random order): Scott Snyder, Wim Lavrijsen, Sebastien Binet, Emil Obrekov, David Quarrie, Kyle Cranmer, David Adams, Sven Menke, Shuwei Ye, Sergey Panitkin, Stephanie Majeski, Hong Ma, Tadashi Maeno, Attila Krasznahorkay, Jim Cochran, roottalk, Paolo Calafiura. Many thanks.