Hadoop and HEP

Simon

Wednesday, 12 August 2009

About us

• CMS will take 1-10 PB of data a year
  • we’ll generate approximately the same again in simulation data
• It could run for 20-30 years
• There are ~80 large computing centres around the world (>0.5 PB and hundreds of job slots each)
• ~3000 members of the collaboration

Why so much data?

• We have a very big digital camera
• Each event is ~1 MB for normal running
  • the size increases for heavy-ion (HI) running and upgrade studies
• We need many millions of events to get statistically significant results for rare processes (a rough back-of-envelope on event counts follows below)
  • for my thesis I started with ~5M events to see an eventual “signal” of ~300

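For scale, a rough back-of-envelope from the numbers on these slides (my arithmetic, not a figure quoted in the talk): at ~1 MB per event, a petabyte-scale yearly dataset corresponds to on the order of a billion recorded events.

\[
\frac{1\text{--}10\ \mathrm{PB/year}}{\sim 1\ \mathrm{MB/event}} \approx 10^{9}\text{--}10^{10}\ \mathrm{events/year}
\]
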
What’s an event?

• We collide protons, which contain quarks
• The quarks interact to produce excited states of matter
• These excited states decay and we record the decay products
• We then work back from the products to “see” the original event
• Many events happen at once
• Think of working out how a carburettor works by crashing six cars together on a motorway

An event

Duplication of data

• We keep events in multiple “tiers” of data
• Each tier contains a subset of the information in its parent tier
• We do this so people can work on huge amounts of data quickly
  • in reality this style of working hasn’t really taken off yet, but it’s early days
• Data is housed at more than one site

Duplication of work

• One person’s signal is another’s background
• We have a common analysis framework (CMSSW), but very little ability to share large amounts of work
  • people coalesce into working groups, but these are generally small
• While everyone is trying to do the same thing, they’re all trying to do it in different ways
• I suspect this is different from, say, Yahoo or last.fm

How we work

• Large, ~dedicated compute farms
• PBS/Torque/Maui/SGE, accessed via a grid interface
• ACLs to prevent misuse of resources
  • we’re not worried about people reading our data, but we are worried they might delete it accidentally
  • and to prevent DDoS

Where we use Hadoop

• We currently use Hadoop’s HDFS at some of our T2 sites, mainly in the US (a minimal access sketch follows below)
• The effort has been led by Nebraska and has been very successful to date
  • I suspect more sites will switch as their centres expand
• The administration tools, as well as the performance, are particularly appreciated
• The alternatives are academic/research projects and tend to have a different focus (find me in the pub for details/rants)
  • maintenance and stability of code is a big issue
• Using the storage in the worker nodes (WNs) is also interesting

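To make the HDFS usage concrete, here is a minimal sketch (not CMS production code) of reading a file that a site has stored in HDFS, using the standard Hadoop FileSystem API. The namenode address and the file path are made-up examples.

    // A minimal sketch, not CMS production code: read a file that a T2 site has
    // stored in HDFS using the standard Hadoop FileSystem API. The namenode
    // address and the file path below are made-up examples.
    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(
                    new URI("hdfs://namenode.example.org:9000"), conf);

            Path file = new Path("/store/data/example/file.root"); // hypothetical
            FileStatus status = fs.getFileStatus(file);
            System.out.println("size=" + status.getLen()
                    + " blockSize=" + status.getBlockSize()
                    + " replication=" + status.getReplication());

            // Stream the start of the file, as a quick sanity check might.
            try (InputStream in = fs.open(file)) {
                byte[] buf = new byte[4096];
                int n = in.read(buf);
                System.out.println("read " + n + " bytes");
            }
        }
    }
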
What would we have to do to run analysis with Hadoop?

• Split events sensibly over the cluster
  • by event? by file? don’t care?
• Data files are ~2 GB - we need to be able to reliably reconstruct them for export if we split them up
• Get CMSSW running in Hadoop (one possible shape of this is sketched below)
  • many, many pitfalls there; it may not even be possible...

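As an illustration only, here is one way the “split by file” option could look: feed Hadoop a text file listing the data files, one per line, and have each map task hand a single file to cmsRun. The cmsRun arguments, the configuration file name and the file-list format are all assumptions made for this sketch, and none of the pitfalls above (memory footprint, file format, start-up cost) are addressed here.

    // A minimal sketch, assuming a file list as input (one data file name per
    // line) and a placeholder CMSSW configuration "analysis_cfg.py". Each map
    // task runs cmsRun on exactly one file; there is no reduce step.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CmsRunPerFile {

        // Each map task receives one line of the file list and shells out to cmsRun.
        public static class RunMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable offset, Text fileName, Context ctx)
                    throws IOException, InterruptedException {
                ProcessBuilder pb = new ProcessBuilder(
                        "cmsRun", "analysis_cfg.py", "inputFile=" + fileName);
                pb.inheritIO();
                int rc = pb.start().waitFor();
                // Record which file was processed and its exit code.
                ctx.write(new Text(fileName + " exit=" + rc), NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "cmsRun per file");
            job.setJarByClass(CmsRunPerFile.class);
            job.setMapperClass(RunMapper.class);
            job.setNumReduceTasks(0);                        // map-only job
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1);    // one file per task
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // file list
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // status report
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
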
Metadata

• There is a lot of metadata associated with the data itself
• Moving that to HBase or similar and mining it with Hadoop would be interesting (a small sketch of the idea follows below)
• Currently it is stored in big Oracle databases
• Also log mining - probably harder to get people interested in this

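A small sketch of what that could look like, assuming a hypothetical HBase table "file_metadata" keyed by logical file name with a single column family "meta"; the table layout and field names are illustrative, not an actual CMS schema.

    // A minimal sketch, assuming a hypothetical table "file_metadata" with one
    // column family "meta". Stores a couple of fields for a made-up file, then
    // scans one column back - the kind of query that could feed a MapReduce job.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FileMetadata {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("file_metadata"))) {

                // Store a few metadata fields for one (made-up) file.
                Put put = new Put(Bytes.toBytes("/store/data/example/file.root"));
                put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("dataset"),
                        Bytes.toBytes("/ExampleDataset/Run2009/RECO"));
                put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("events"),
                        Bytes.toBytes("25000"));
                table.put(put);

                // Scan the dataset column across all files.
                Scan scan = new Scan();
                scan.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("dataset"));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()) + " -> "
                                + Bytes.toString(r.getValue(
                                        Bytes.toBytes("meta"),
                                        Bytes.toBytes("dataset"))));
                    }
                }
            }
        }
    }

Once the metadata lives in a table like this, the same kind of scan can feed MapReduce jobs directly via HBase's TableInputFormat.
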
Issues

• Some analyses don’t map onto MapReduce
• The data is complex and in a weird file format
• CMSSW has a large memory footprint
• It’s not efficient to run over only a few events, as start-up/tear-down is expensive
• Sociologically it would be difficult to persuade people to move to MapReduce algorithms
  • at least until people see the benefits - and demonstrating those benefits is hard, since physicists don’t think in cost terms
