Big Science, Big Data: Simon Metson at Eduserv Symposium 2012

This talk looks at experience and lessons from LHC computing applicable to the current 'Big Data' trend; how these tools and techniques can be applied to other disciplines and the potential impact on provision of computing at universities.


  1. Big Science, Big Data
  Simon Metson, simon@cloudant.com

  2. Outline
  • Who’s this guy?
  • Computing for the LHC experiments
  • NoSQL tools
  • Landslide modelling
  • Issues arising for Universities

  3. Who am I?
  • Until March I was a research associate working on computing for CMS - one of the LHC experiments at CERN
  • In March I began transitioning to working for Cloudant as an “ecology engineer”
  • Have dealt with multi-petabyte datasets for the last 10 years

  4. The CMS Experiment
  5. Workflow ladder (axis: number of users)
  Use Grid compute and storage exclusively:
  • Large datasets (>100 TB), complex computation
  • Large datasets (>100 TB), simple computation
  • Shared datasets (>500 GB), complex computation
  Work on departmental resources, store resulting datasets to Grid storage:
  • Shared datasets (10-500 GB), complex computation
  • Shared datasets (10-100 GB), simple computation
  Work on laptop/desktop machines, store resulting datasets to local/Grid storage:
  • Shared datasets (0.1-10 GB), simple computation
  • Private datasets (0.1-10 GB), simple computation
  6. Warning: Obligatory formula on next slide

  7. The formula

  8. The formula (terms annotated “Fixed” and “Usually fixed”)

  9. The formula
  10. People are important
  • Be nice to people working on weekends
  • The “cost” of a person is one place you can make savings - e.g. by giving them the ability to do more
  • Building a suitable team is hard, takes time and is essential for success

  11. Observation
  • What’s interesting is that big data isn’t interesting any more
    • Unless you are of a similar scale to Google you don’t need to write your own system
    • Doesn’t mean it’s easy, though!
  • A terabyte was quite a lot 10 years ago, now it’s commodity hardware
  12. ****

  13. When all you have is a hammer, everything looks like a nail

  14. General NoSQL observations
  • Good for startups with limited resources and exposure to risk
  • Good for large companies who build data centres with lots of loosely related data and large DevOps teams
  • How does this fit with University researchers?

  15. LHC computing evolution
  • Our current system works, but at a high staff cost
  • Expect simplification of the system, retiring bespoke components in favour of generic tools

  16. Why are landslides an issue?

  17. Use cases
  • Needs to be usable by geographically dispersed, non-expert field engineers
  • Need expert approval step
  • Need to be accessible on low-end hardware
  • Need to run 1000s of simulations per slope/storm and analyse the result data
  18-22. Aside: Complexity (the table is built up one row per slide)

  Slopes | Cut slope angles | Stochastic parameters | Variations for each stochastic parameter | Output files | Runtime
  1      | 1                | 0                     | 0                                        | 1            | 0.25 CPU hours
  1      | 25               | 0                     | 0                                        | 25           | 6.25 CPU hours
  1      | 25               | 5                     | 10                                       | 1250         | 312.5 CPU hours
  100    | 25               | 5                     | 10                                       | 125000       | 31250 CPU hours
  100    | 25               | 5                     | 10                                       | 125000       | ~3.5 years
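  The scaling in the table is plain multiplication, which is exactly why it gets out of hand. A minimal sketch of the arithmetic (Python; the 0.25 CPU hours per output file is inferred from the first row of the table rather than stated in the talk):

  ```python
  # Sketch of the scaling shown in the table above (illustrative only).
  HOURS_PER_RUN = 0.25  # inferred from row one: 1 output file ~ 0.25 CPU hours

  def campaign(slopes, angles, stochastic_params, variations_per_param):
      """Return (output files, CPU hours) for one storm."""
      runs = slopes * angles
      if stochastic_params:
          runs *= stochastic_params * variations_per_param
      return runs, runs * HOURS_PER_RUN

  for args in [(1, 1, 0, 0), (1, 25, 0, 0), (1, 25, 5, 10), (100, 25, 5, 10)]:
      runs, hours = campaign(*args)
      print(f"{args}: {runs} output files, {hours} CPU hours "
            f"(~{hours / (24 * 365):.1f} years on a single core)")
  ```

  The last case reproduces the 125000 output files and roughly 3.5 years of the final rows; add a second storm or another stochastic parameter and the numbers multiply again.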
  23. Aside: Complexity
  • The above is for one storm; simulate many
  • Can easily have more stochastic parameters, or vary them in a more fine-grained manner
  • May want to compare across software versions - standard datasets

  24. Problem solved?

  25. Use cases
  • Needs to be usable by geographically dispersed, non-expert field engineers
  • Need expert approval step
  • Need to be accessible on low-end hardware
  • Need to run 1000s of simulations per slope/storm and analyse the result data
  Impossible scale with current tools/manpower
  26. Design schematic
  [Diagram: geographers validate the input data; a job submission daemon fans the work out into many jobs; results are written back; field engineers and government sit at the edges of the flow.]

  27. Design schematic
  [Diagram: as slide 26, with the results replicated out and field engineers uploading measurements via …]
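  As a rough sketch of what the job submission daemon in the schematic could look like (hypothetical function and field names; the talk only describes the flow, not the implementation): validated input documents fan out into jobs, and results are written back so they can be replicated out.

  ```python
  # Hypothetical sketch of the job submission daemon from the schematic above.
  # The document store, field names and helpers are illustrative, not the real system.
  import time

  def validated_inputs(db):
      """Input documents that geographers have validated but that are not yet submitted."""
      return [d for d in db["inputs"] if d.get("validated") and not d.get("submitted")]

  def expand_to_jobs(doc):
      """Fan one validated slope/storm document out into its many simulation jobs."""
      return [{"input_id": doc["id"], "variation": i} for i in range(doc["variations"])]

  def daemon_loop(db, run_job, poll_seconds=60):
      """Poll for validated inputs, run the jobs, and write results back.

      Results appended to db["results"] are what gets replicated out to
      government and other downstream consumers in the schematic."""
      while True:
          for doc in validated_inputs(db):
              for job in expand_to_jobs(doc):
                  db["results"].append(run_job(job))
              doc["submitted"] = True
          time.sleep(poll_seconds)
  ```

  In this sketch, measurements uploaded by field engineers would land in db["inputs"] for the geographers to validate, which closes the loop shown in the second schematic.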
  28. Web 3.0

  29. Visualisation

  30. Big Data and Universities
  • Data intensive research will become the norm (already is in many fields)
  • Universities will need access to Big Data resources
  • Expect significant use from non-traditional fields
  • Expect new fields to emerge
  31. Workflow ladder (as before; axis: number of users)
  Use Grid compute and storage exclusively:
  • Large datasets (>100 TB), complex computation
  • Large datasets (>100 TB), simple computation
  • Shared datasets (>500 GB), complex computation
  Work on departmental resources, store resulting datasets to Grid storage:
  • Shared datasets (10-500 GB), complex computation
  • Shared datasets (10-100 GB), simple computation
  Work on laptop/desktop machines, store resulting datasets to local/Grid storage:
  • Shared datasets (0.1-10 GB), simple computation
  • Private datasets (0.1-10 GB), simple computation

  32. Workflow ladder (axis: number of users)
  • Large datasets (>100 TB), complex computation
  • Large datasets (>100 TB), simple computation
  • Shared datasets (>500 GB), complex computation
  • Shared datasets (10-500 GB), complex computation
  • Shared datasets (10-100 GB), simple computation
  • Shared datasets (0.1-10 GB), simple computation
  • Private datasets (0.1-10 GB), simple computation
  33. Implications for the future
  • Quality and scale of Big Data resources will have a direct impact on the ability of Universities to do research at an international level
  • Universities will need to provide data intensive compute resources to complement traditional HPC

  34. Implications for the future
  • Big data clusters are very different architecturally to HPC clusters
  • Data is stateful; harder to manage than HPC
  • Interesting legal issues arise

  35. Implications for the future
  • Cost savings from SaaS vendors can be hard to realise at an institute level
  • Building DevOps teams is not something University funding easily supports

  36. Summary
  • Big data is mainstream
  • Should be seen as an enabling technology for academics
  • Not trivial to adopt
  • Universities need to build up teams to support these activities, or find ways to outsource
