The document discusses the substantial data-handling requirements of the CMS experiment, which produces and processes vast amounts of proton-collision data for physics research. It outlines the challenges of duplicated data and duplicated effort across computing centers, along with the current use of Hadoop's HDFS for data management at several of those centers. It also highlights the difficulties of expressing analyses of complex event data as MapReduce algorithms, emphasizing the need for more effective collaboration and shared methodologies within the physics community.
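
For context on the MapReduce style mentioned above, the following is a minimal sketch of how a simple event-counting job might be written as a Hadoop Streaming mapper and reducer in Python. The tab-separated input layout and the dataset-name field are assumptions made for illustration only, not the document's actual data format.

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming sketch: count events per dataset.

Assumes each input line is a tab-separated event record whose first
field is a dataset name -- a hypothetical layout used only to
illustrate the map/reduce pattern discussed in the document.
"""
import sys


def mapper(lines):
    # Emit (dataset, 1) for every event record seen.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")


def reducer(lines):
    # Hadoop delivers mapper output sorted by key, so counts for a
    # dataset arrive contiguously and can be summed in one pass.
    current_key, count = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value or 0)
    if current_key is not None:
        print(f"{current_key}\t{count}")


if __name__ == "__main__":
    # Run as "script.py" for the map phase or "script.py reduce"
    # for the reduce phase, reading records from standard input.
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer(sys.stdin)
    else:
        mapper(sys.stdin)
```

Even this trivial example hints at the mismatch the document raises: realistic physics analyses operate on nested, structured event data rather than flat key-value records, which is awkward to express in this model.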