Hadoop & HEP

Simon Metson of Bristol University and CERN’s CMS experiment, discussing how Hadoop can be used to process CMS event data and other data generated in and by the experiment.



  1. Hadoop and HEP
     Simon
     Wednesday, 12 August 2009
  2. About us
     • CMS will take 1-10 PB of data a year
     • We’ll generate approximately the same again in simulation data
     • The experiment could run for 20-30 years
     • There are ~80 large computing centres around the world (>0.5 PB and hundreds of job slots each)
     • ~3,000 members of the collaboration
  3. Why so much data?
     • We have a very big digital camera
     • Each event is ~1 MB for normal running
     • Event size increases for heavy-ion (HI) running and upgrade studies
     • We need many millions of events to get statistically significant results for rare processes
     • In my thesis I started with ~5M events to see an eventual “signal” of ~300 (a rough scaling is worked through below)
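
As a back-of-the-envelope check (my arithmetic, not from the slides), the quoted numbers hang together: at ~1 MB per event, 10 PB per year is about ten billion events, and a thesis sample of ~5M events yielding a signal of ~300 means digging out an effect of a few parts in a hundred thousand:

$$
N_{\text{events}} \approx \frac{10\ \text{PB}}{1\ \text{MB/event}}
                 = \frac{10^{16}\ \text{B}}{10^{6}\ \text{B/event}}
                 = 10^{10}\ \text{events per year},
\qquad
\frac{N_{\text{signal}}}{N_{\text{sample}}} \approx \frac{300}{5\times10^{6}} \approx 6\times10^{-5}.
$$
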
  4. What’s an event?
     • We have protons colliding, and protons contain quarks
     • The quarks interact to produce excited states of matter
     • These excited states decay, and we record the decay products
     • We then work back from the products to “see” the original event
     • Many events happen at once
     • Think of working out how a carburettor works by crashing six cars together on a motorway
  5. An event
  6. Duplication of data
     • We keep events in multiple “tiers” of data
     • Each tier contains a subset of the information in its parent tier
     • We do this to let people work on huge amounts of data quickly
     • In reality this style of working hasn’t really taken off yet, but it’s early days
     • Data is hosted at more than one site
  7. Duplication of work
     • One person’s signal is another’s background
     • There is a common analysis framework (CMSSW), but very little ability to share large amounts of work
     • People coalesce into working groups, but these are generally small
     • Everyone is trying to do the same thing, but they’re all trying to do it in different ways
     • I suspect this is different from, say, Yahoo or last.fm
  8. How we work
     • Large, essentially dedicated compute farms
     • PBS/Torque/Maui/SGE, accessed via a grid interface
     • ACLs to prevent misuse of resources
     • We’re not worried about people reading our data, but we are worried they might delete it accidentally (see the permissions sketch below)
     • Prevent DDoS
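
For HDFS specifically, “world-readable but not world-writable” is a couple of calls. A minimal sketch using the Hadoop FileSystem API; the dataset path, owner, and group names are hypothetical, and setOwner requires HDFS superuser rights:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class ProtectDataset {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical dataset path; real CMS namespace layouts differ.
        Path dataset = new Path("/store/data/Run2009A");

        // rwxr-xr-x: anyone can read, but only the owning production
        // account can write or delete files by accident.
        fs.setPermission(dataset, new FsPermission((short) 0755));
        fs.setOwner(dataset, "cmsprod", "cms");

        fs.close();
    }
}
```
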
  9. Where we use Hadoop
     • We currently use Hadoop’s HDFS at some of our T2 sites, mainly in the US (a minimal client sketch follows this slide)
     • The effort has been led by Nebraska and has been very successful to date
     • I suspect more sites will switch as centres expand
     • The administration tools, as well as the performance, are particularly appreciated
     • The alternatives are academic/research projects and tend to have a different focus (pub for details/rants)
     • Maintenance and stability of code are a big issue
     • Storage on worker nodes (WNs) is also interesting
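
For readers who haven’t touched HDFS: from the client side it behaves like an ordinary filesystem. A minimal sketch of opening and streaming a file with the Java API; the namenode address is made up, and since real event files are binary, the line-oriented read here is purely illustrative:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; a real site sets this in core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode.example.edu:9000");

        FileSystem fs = FileSystem.get(conf);

        // Stream the file named on the command line and echo it to stdout.
        try (FSDataInputStream in = fs.open(new Path(args[0]));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```
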
  10. What would we have to do to run analysis with Hadoop?
      • Split events sensibly over the cluster: by event? By file? Does it matter?
      • Data files are ~2 GB, so we would need to reliably reconstruct them for export if we split them up (one way to sidestep this is sketched below)
      • Get CMSSW to run inside Hadoop
      • There are many, many pitfalls there; it may not even be possible...
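
One hedge against the file-reconstruction problem is to never split the files at all: a custom InputFormat can pin one whole file to one map task and hand the mapper the file’s path, which the mapper could pass to an external analysis process such as CMSSW. A sketch against the Hadoop 0.20+ mapreduce API; the class name is mine:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// One map task per file: each ~2 GB event file stays intact, so nothing
// ever has to be reassembled for export.
public class OneFilePerTaskInputFormat extends FileInputFormat<NullWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split a file across map tasks
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new RecordReader<NullWritable, Text>() {
            private Path file;
            private boolean done = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext c) {
                file = ((FileSplit) s).getPath();
            }

            @Override
            public boolean nextKeyValue() {
                if (done) return false;
                done = true;   // exactly one record: the file itself
                return true;
            }

            @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override public Text getCurrentValue() { return new Text(file.toString()); }
            @Override public float getProgress() { return done ? 1.0f : 0.0f; }
            @Override public void close() {}
        };
    }
}
```
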
  11. Metadata
      • There is a lot of metadata associated with the data itself
      • Moving that to HBase or similar and mining it with Hadoop would be interesting (a sketch follows)
      • Currently this metadata is stored in big Oracle databases
      • Also log mining, though it’s probably harder to get people interested in that
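
To make the HBase idea concrete, here is a minimal sketch using the classic (pre-1.0) HBase Java client: one row per dataset, one column family of metadata. The table name, row key, and columns are invented for illustration; the real schema lives in Oracle:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DatasetMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table; assumes it was created with an "info" family.
        HTable table = new HTable(conf, "dataset_metadata");

        // Row key: dataset name; columns hold dataset-level metadata.
        Put put = new Put(Bytes.toBytes("/MinBias/Summer09/GEN-SIM-RECO"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("events"), Bytes.toBytes("5000000"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("site"), Bytes.toBytes("T2_US_Nebraska"));
        table.put(put);

        // Read the row back and print one cell.
        Result row = table.get(new Get(Bytes.toBytes("/MinBias/Summer09/GEN-SIM-RECO")));
        System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("info"), Bytes.toBytes("site"))));

        table.close();
    }
}
```
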
  12. Issues
      • Some analyses don’t map onto MapReduce
      • The data is complex and in a weird file format
      • CMSSW has a large memory footprint
      • It’s not efficient to run only a few events, as start-up/tear-down is expensive (one mitigation is sketched below)
      • Sociologically, it would be difficult to persuade people to move to MapReduce algorithms until they see the benefits
      • Demonstrating those benefits is hard; physicists don’t think in cost terms
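
On the start-up/tear-down point, Hadoop itself offers a couple of levers: large input splits, so each task processes many events, and task-JVM reuse, so per-process initialisation is paid once per node rather than once per task. A driver sketch using Hadoop-0.20-era settings; the class and job names are mine:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnalysisDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Reuse one JVM for an unlimited number of tasks per node, so heavy
        // per-process initialisation happens once, not once per task.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        Job job = new Job(conf, "event-analysis");
        job.setJarByClass(AnalysisDriver.class);

        // Make splits at least 1 GB so every task chews through many events
        // and the start-up cost is amortised.
        FileInputFormat.setMinInputSplitSize(job, 1024L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Identity mapper by default; a real job would set a Mapper that
        // drives the analysis.

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
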
