Hadoop & HEP

Simon Metson of Bristol University and CERN's CMS experiment discusses how to use Hadoop for processing CERN event data, or other data generated in or by the experiment.

Published in: Technology, Education

Transcript

  • 1. Hadoop and HEP
    Simon, Wednesday 12 August 2009
  • 2. About us
    • CMS will take 1-10PB of data a year
      • We'll generate approx. the same in simulation data
    • It could run for 20-30 years
    • Have ~80 large computing centres around the world (>0.5PB, 100s of job slots each)
    • ~3000 members of the collaboration
  • 3. Why so much data?
    • We have a very big digital camera
    • Each event is ~1MB for normal running
      • Size increases for heavy-ion (HI) running and upgrade studies
    • Need many millions of events to get statistically significant results for rare processes
    • In my thesis I started with ~5M events to see an eventual "signal" of ~300
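
As a rough back-of-envelope illustration of the scales on this slide (the per-event size, yearly volume, and thesis numbers are the approximate figures quoted above, nothing more precise):

```python
# Back-of-envelope numbers based on the approximate figures quoted on the slide.
EVENT_SIZE_BYTES = 1e6          # ~1 MB per event during normal running
YEARLY_VOLUME_BYTES = 10e15     # upper estimate: ~10 PB of detector data per year

events_per_year = YEARLY_VOLUME_BYTES / EVENT_SIZE_BYTES
print(f"Events recorded per year: ~{events_per_year:.0e}")   # ~1e10 events

# Thesis example: ~5M selected events yielding a "signal" of ~300.
selected_events = 5e6
signal_events = 300
print(f"Signal fraction of the selection: ~{signal_events / selected_events:.1e}")  # ~6e-5
```
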
  • 4. What's an event?
    • We have protons colliding, which contain quarks
    • Quarks interact to produce excited states of matter
    • These excited states decay and we record the decay products
    • We then work back from the products to "see" the original event
    • Many events happen at once
    • Think of working out how a carburettor works by crashing 6 cars together on a motorway
  • 5. An event
  • 6. Duplication of data
    • We keep events in multiple "tiers" of data
    • Each tier contains a subset of the information of the parent tier
    • We do this to let people work on huge amounts of data quickly
    • In reality this style of working hasn't really kicked off yet, but it's early days
    • Data is housed at >1 site
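
A minimal sketch of the tiering idea: each derived tier keeps only a subset of the fields of its parent, so analyses that only need the smaller tier touch far less data. The tier names and field lists below are illustrative stand-ins, not the actual CMS event content.

```python
# Illustrative only: each tier is defined by the subset of per-event fields it keeps.
TIER_FIELDS = {
    "RAW":  {"detector_readout", "trigger_bits", "tracks", "clusters", "summary"},
    "RECO": {"trigger_bits", "tracks", "clusters", "summary"},   # drops the raw readout
    "AOD":  {"trigger_bits", "summary"},                         # small analysis-level subset
}

def derive_tier(parent_events, tier):
    """Project each parent event down to the fields kept by the child tier."""
    keep = TIER_FIELDS[tier]
    return [{k: v for k, v in event.items() if k in keep} for event in parent_events]
```
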
  • 7. Duplication of work
    • One person's signal is another's background
    • Common framework (CMSSW) for analysis, but very little ability to share large amounts of work
    • People coalesce into working groups, but these are generally small
    • While everyone is trying to do the same thing, they're all trying to do it in different ways
    • I suspect this is different from, say, Yahoo or last.fm
  • 8. How we work
    • Large, ~dedicated compute farms
    • PBS/Torque/Maui/SGE accessed via a grid interface
    • ACLs to prevent misuse of resources
    • Not worried about people reading our data, but worried they might delete it accidentally
    • Prevent DDoS
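
For context, a sketch of what a direct submission to one of these batch systems looks like; the script contents and resource requests are hypothetical, and in practice jobs arrive through the grid interface rather than a local qsub.

```python
# Hypothetical sketch: submitting a single analysis job straight to PBS/Torque with qsub.
# Real CMS jobs go through the grid middleware rather than logging in and running qsub.
import subprocess

job_script = """#!/bin/bash
#PBS -N cms_analysis
#PBS -l nodes=1:ppn=1,walltime=08:00:00
cd $PBS_O_WORKDIR
cmsRun analysis_cfg.py        # CMSSW driver; the config name is invented for this example
"""

# qsub reads the job script from stdin and prints the assigned job id.
result = subprocess.run(["qsub"], input=job_script, capture_output=True, text=True)
print("Submitted job:", result.stdout.strip())
```
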
  • 9. Where we use Hadoop
    • We currently use Hadoop's HDFS at some of our T2 sites, mainly in the US
    • Led by Nebraska; it has been very successful to date
    • I suspect more people will switch as centres expand
    • Administration tools, as well as performance, are particularly appreciated
    • Alternatives are academic/research projects and tend to have a different focus (pub for details/rants)
    • Maintenance & stability of code is a big issue
    • Storage in WNs (worker nodes) is also interesting
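
A small sketch of the kind of HDFS interaction this implies, driven through the standard `hadoop fs` command-line interface from Python; the paths and file names are invented for illustration.

```python
# Sketch: copying a data file into HDFS and listing the target directory.
# Uses the standard `hadoop fs` CLI; the /store/... paths are made up for this example.
import subprocess

def hdfs(*args):
    """Run a `hadoop fs` sub-command and return its stdout."""
    result = subprocess.run(["hadoop", "fs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-mkdir", "/store/user/simon")
hdfs("-put", "local_events.root", "/store/user/simon/events.root")
print(hdfs("-ls", "/store/user/simon"))
```
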
  • 10. What would we have to do to run analysis with Hadoop?
    • Split events sensibly over the cluster
      • By event? By file? Don't care?
    • Data files are ~2GB - we need to reliably reconstruct these files for export if we split them up
    • Have CMSSW run in Hadoop
      • Many, many pitfalls there; it may not even be possible...
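
One conceivable route (a sketch, not something the slides claim exists) is Hadoop Streaming, where the mapper is an ordinary executable: here a wrapper that receives one file path per input line and shells out to the CMSSW driver. The config name and paths are hypothetical, and this glosses over the splitting, file-format, and memory issues raised elsewhere in the talk.

```python
#!/usr/bin/env python
# cmssw_mapper.py - hypothetical Hadoop Streaming mapper.
# Each input line is the path of one ~2GB CMS data file; the mapper runs the
# CMSSW driver (cmsRun) over it and emits "filename<TAB>1" as a completion marker.
#
# Submitted with something like:
#   hadoop jar hadoop-streaming.jar \
#       -input /store/filelists/run123.txt -output /store/results/run123 \
#       -mapper cmssw_mapper.py -file cmssw_mapper.py
import subprocess
import sys

for line in sys.stdin:
    data_file = line.strip()
    if not data_file:
        continue
    # analysis_cfg.py is an invented CMSSW configuration for this sketch.
    subprocess.run(["cmsRun", "analysis_cfg.py", f"inputFiles=file:{data_file}"],
                   check=True, capture_output=True)
    print(f"{data_file}\t1")
```
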
  • 11. Metadata
    • Lots of metadata associated with the data itself
    • Moving that to HBase or similar and mining it with Hadoop would be interesting
    • Currently this is stored in big Oracle databases
    • Also log mining - probably harder to get people interested in this
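
A sketch of what storing and scanning that metadata in HBase could look like from Python, here using the happybase client library over HBase's Thrift gateway; the table name, column family, and fields are invented for illustration.

```python
# Sketch: per-file metadata in an HBase table keyed by logical file name.
# Table and column names are invented; happybase is one Python client for HBase.
import happybase

connection = happybase.Connection("hbase-thrift.example.org")
table = connection.table("file_metadata")

# Write one row of metadata for a (made-up) data file.
table.put(b"/store/data/Run2009A/file0001.root", {
    b"meta:dataset":  b"/MinimumBias/Run2009A/RECO",
    b"meta:events":   b"18243",
    b"meta:size_gb":  b"2.1",
})

# Mine it back: total events per dataset from a scan of the table.
totals = {}
for _key, data in table.scan(columns=[b"meta:dataset", b"meta:events"]):
    dataset = data[b"meta:dataset"]
    totals[dataset] = totals.get(dataset, 0) + int(data[b"meta:events"])
print(totals)
```
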
  • 12. Issues
    • Some analyses don't map onto MapReduce
    • Data is complex and in a weird file format
    • CMSSW has a large memory footprint
    • Not efficient to run only a few events, as start-up/tear-down is expensive
    • Sociologically it would be difficult to persuade people to move to MapReduce algorithms
      • Until people see benefits - and demonstrating those benefits is hard; physicists don't think in cost terms
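
To make the "map onto MapReduce" point concrete, here is the kind of analysis that does fit the model: filling a histogram, written as a Hadoop Streaming mapper/reducer pair. The input format (one event summary per line with a numeric mass field) is an invented stand-in for the real, far more complex event data.

```python
#!/usr/bin/env python
# histogram_mr.py - sketch of a histogram as a Streaming mapper/reducer pair.
# Assumes (purely for illustration) text input with one event per line: "run event mass".
# Used as: -mapper "histogram_mr.py map" -reducer "histogram_mr.py reduce" in a Streaming job.
import sys

BIN_WIDTH = 5.0  # GeV, arbitrary for this sketch

def mapper():
    # Emit "bin<TAB>1" per event; Hadoop sorts and groups identical bins for the reducer.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) != 3:
            continue
        mass = float(fields[2])
        bin_low = int(mass // BIN_WIDTH) * BIN_WIDTH
        print(f"{bin_low:.1f}\t1")

def reducer():
    # Input arrives sorted by key, so counts for each bin are contiguous.
    current_bin, count = None, 0
    for line in sys.stdin:
        bin_low, _ = line.split("\t")
        if bin_low != current_bin:
            if current_bin is not None:
                print(f"{current_bin}\t{count}")
            current_bin, count = bin_low, 0
        count += 1
    if current_bin is not None:
        print(f"{current_bin}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```
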
