Big Data Challenges at NASA
Big Data Challenges at NASA Presentation Transcript

  • 1. Big Data Challenges at NASA
      Chris A. Mattmann
      Senior Computer Scientist, NASA
      Adjunct Assistant Professor, USC
      Member, Apache Software Foundation
  • 2. And you are?
      • Senior Computer Scientist at NASA JPL in Pasadena, CA USA
      • Software Architecture/Engineering Prof at Univ. of Southern California
      • Apache Member involved in
        – OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor)
      13-Jun-12 HADOOPSUMMIT12
  • 3. Agenda
      • Big Data Challenges and where we’re headed
      • Example systems at NASA and other agencies
      • Apache OODT: a primer
      • Apache OODT + Hadoop
      • Where we’re headed and wrapup
  • 4. Some “Big Data” Grand Challenges I’m interested in
      • How do we handle 700 TB/sec of data coming off the wire when we actually have to keep it around?
        – Required by the Square Kilometre Array
      • Joe scientist says I’ve got an IDL or Matlab algorithm that I will not change and I need to run it on 10 years of data from the Colorado River Basin and store and disseminate the output products
        – Required by the Western Snow Hydrology project
      • How do we compare petabytes of climate model output data in a variety of formats (HDF, NetCDF, Grib, etc.) with petabytes of remote sensing data to improve climate models for the next IPCC assessment?
        – Required by the 5th IPCC assessment and the Earth System Grid and NASA
      • How do we catalog all of NASA's current planetary science data?
        – Required by the NASA Planetary Data System
      Copyright 2012. Jet Propulsion Laboratory, California Institute of Technology. US Government Sponsorship Acknowledged.
      Image Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295
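To get a feel for the first challenge, here is a rough back-of-the-envelope calculation of what a sustained 700 TB/sec stream adds up to per day. It is only a sketch: the constant rate and the decimal units (1 EB = 1,000,000 TB) are simplifying assumptions, not figures from the Square Kilometre Array project, and the function name is made up for illustration.

```python
# Back-of-the-envelope volume for a sustained 700 TB/sec stream.
# Assumptions (not from the slides): the rate is constant all day and
# units are decimal, i.e. 1 EB = 1,000,000 TB.
RATE_TB_PER_SEC = 700
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_volume_eb(rate_tb_per_sec: float) -> float:
    """Return exabytes accumulated in one day at the given TB/sec rate."""
    return rate_tb_per_sec * SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    print(f"{daily_volume_eb(RATE_TB_PER_SEC):.2f} EB/day")
```

Under those assumptions the stream comes to about 60 exabytes per day, which is why "keeping it around" is the hard part of the question.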
  • 5. The NASA ESDS Context
      Where is open source most useful? Which area should produce open source software?
  • 6. Lessons from 90’s era missions
      • Increasing data volumes (exponential growth)
      • Increasing complexity of instruments and algorithms
      • Increasing availability of proxy/sim/ancillary data
      • Increasing rate of technology refresh
      … all of this while NASA Earth Mission funding was decreasing
      A data system framework based on a standard architecture and reusable software components for supporting all future missions.
  • 7. Where do Big Data technologies fit into this?
      U.S. National Climate Assessment (pic credit: Dr. Tom Painter)
      SKA South Africa: Square Kilometre Array (pic credit: Dr. Jasper Horrell, Simon Ratcliffe)
  • 8. Credit: Cameron Goodale
  • 9. EVLA demonstration architecture
      [Architecture diagram. Recoverable labels: EVLA, dataset day2_TDEM0003_10s_norx, Staging Area, Crawler, FM, WM, Curator, Browser, Monitor, Operator, CAS Data Services, PCS Services, Science Data System, Disk Area, evlascube event, host ska-dc.jpl.nasa.gov. Legend: data flow, control flow, Apache OODT.]
  • 10. Apache OODT
      • Entered incubation at the Apache Software Foundation in 2010
      • Selected as a top level Apache Software Foundation project in January 2011
      • Developed by a community of participants from many companies, universities, and organizations
      • Used for a diverse set of science data system activities in planetary science, earth science, radio astronomy, biomedicine, astrophysics, and more
      http://oodt.apache.org
      OODT development & user community includes: (logos shown on slide)
  • 11. Apache OODT: OSS “big data” platform originally pioneered at NASA
      • OODT is meant to be a set of tools to help build data systems
        – It's not meant to be turn key
        – It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science
        – Each discipline/project extends
      • Deployed operationally at
        – Decadal-survey recommended NASA Earth science missions, NIH, and NCI, CHLA, USC, South African SKA project
      • Why Apache?
        – Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop)
        – Differs from other open source communities; it provides a governance and management structure
  • 12. Why Apache and OODT?
      • OODT is meant to be a set of tools to help build data systems
        – It's not meant to be turn key
        – It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science
        – Each discipline/project extends
      • Apache is the elite open source community for software developers
        – Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop)
        – Differs from other open source communities; it provides a governance and management structure
  • 13. Governance Model + NASA = ♥
      • NASA and other government agencies have tons of process
        – They like that
  • 14. OODT Framework and PCS
      [Framework diagram. Recoverable labels: OODT/Science Web Tools, Archive Client, Navigation Service, Object Oriented Data Technology Framework, Catalog & Archive Service (CAS), Process Control System (PCS), Profile Service, Product Service, Query Service, Bridge to External Services, Other Service 1, Other Service 2, Profile XML Data, Data System 1, Data System 2.]
      CAS has recently become known as Process Control System when applied to mission work.
  • 15. Current PCS deployments
      • Orbiting Carbon Observatory (OCO-2) - spectrometer instrument
        – NASA ESSP Mission, launch date: TBD 2013
        – PCS supporting Thermal Vacuum Tests, ground-based instrument data processing, space-based instrument data processing and Science Computing Facility
        – EOM Data Volume: 61-81 TB in 3 yrs
        – Processing Throughput: 200-300 jobs/day
      • NPP Sounder PEATE - infrared sounder
        – Joint NASA/NPOESS mission, launch date: October 2011
        – PCS supporting Science Computing Facility (PEATE)
        – EOM Data Volume: 600 TB in 5 yrs
        – Processing Throughput: 600 jobs/day
      • QuikSCAT - scatterometer
        – NASA Quick-Recovery Mission, launch date: June 1999
        – PCS supporting instrument data processing and science analyst sandbox
        – Originally planned as a 2-year mission
      • SMAP - high-res radar and radiometer
        – NASA decadal study mission, launch date: 2014
        – PCS supporting radar instrument and science algorithm development testbed
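The end-of-mission (EOM) volumes above can be turned into rough average daily rates. The sketch below assumes uniform ingest over the mission, 365-day years, and decimal units (1 TB = 1,000 GB); none of these assumptions come from the slides, and the helper name is hypothetical.

```python
# Average daily data rates implied by the EOM volumes quoted on the slide.
# Assumptions (not from the slides): uniform ingest, 365-day years,
# and 1 TB = 1,000 GB.
def avg_gb_per_day(total_tb: float, years: float) -> float:
    """Average GB/day needed to reach total_tb after the given years."""
    return total_tb * 1000 / (years * 365)


# (total TB, mission years) taken from the slide text.
DEPLOYMENTS = {
    "OCO-2 (low estimate)": (61, 3),
    "OCO-2 (high estimate)": (81, 3),
    "NPP Sounder PEATE": (600, 5),
}

if __name__ == "__main__":
    for name, (total_tb, years) in DEPLOYMENTS.items():
        print(f"{name}: {avg_gb_per_day(total_tb, years):.1f} GB/day")
```

With these assumptions OCO-2 works out to roughly 56 to 74 GB/day and NPP Sounder PEATE to roughly 329 GB/day, averaged over the whole mission.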
  • 16. Other PCS applications
      • Astronomy and Radio
        – Prototype work on MeerKAT with South Africans and KAT-7 telescope
        – Discussions ongoing with NRAO Socorro (EVLA and ALMA)
      • Bioinformatics
        – National Institutes of Health (NIH) National Cancer Institute's (NCI) Early Detection Research Network (EDRN)
        – Children's Hospital LA Virtual Pediatric Intensive Care Unit (VPICU)
      • Earth Science
        – National Climate Assessment – Snow Hydrology in the Western US and Alaska
        – National Climate Assessment – Regional Climate Modeling and Evaluation
      • Technology Demonstration
        – JPL's Active Mirror Telescope (AMT)
        – White Sands Missile Range
  • 17. PCS Core Components
      • All Core components implemented as web services
        – XML-RPC used to communicate between components
        – Servers implemented in Java
        – Clients implemented in Java, scripts, Python, PHP and web-apps
        – Service configuration implemented in ASCII and XML files
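As an illustration of the XML-RPC pattern the slide describes, the sketch below runs a toy server and client using only Python's standard library. It is not OODT code: the method name `filemgr.ping` and its reply are hypothetical stand-ins, and the real PCS servers are written in Java with their own interfaces.

```python
# Toy XML-RPC round trip using only the Python standard library.
# The method name "filemgr.ping" and its reply are hypothetical; they are
# NOT the real XML-RPC interface of the OODT File Manager.
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def ping(product_name):
    """Stand-in for a server-side operation: echo what was asked for."""
    return {"product": product_name, "status": "ok"}


def demo():
    """Start a local server, make one remote call, and return the reply."""
    # Port 0 asks the OS for any free port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(ping, "filemgr.ping")
    port = server.server_address[1]
    worker = threading.Thread(target=server.serve_forever, daemon=True)
    worker.start()
    try:
        client = ServerProxy(f"http://127.0.0.1:{port}/")
        return client.filemgr.ping("granule-001")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The client side is only a few lines, which suggests why clients in several languages (Java, Python, PHP) can talk to the same servers: any language with an XML-RPC library can make the call.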
  • 18. Core Capabilities
      • File Manager does Data Management
        – Tracks all of the stored data, files & metadata
        – Moves data to appropriate locations before and after initiating PGE (Product Generation Executive) runs and from staging area to controlled access storage
      • Workflow Manager does Pipeline Processing
        – Automates processing when all run conditions are ready
        – Monitors and logs processing status
      • Resource Manager does Resource Management
        – Allocates processing jobs to computing resources
        – Monitors and logs job & resource status
        – Copies output data to storage locations where space is available
        – Provides the means to monitor resource usage
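Two of the ideas above, running a task only when all of its conditions are met and allocating jobs to computing resources, can be sketched in a few lines. This is a toy model under stated assumptions: the class names, the "required files" notion of a run condition, and the greedy most-free-slots policy are all hypothetical and are not taken from the OODT Workflow Manager or Resource Manager.

```python
# Toy model of two ideas from the slide: run a task only when all of its
# preconditions hold, and hand each job to a compute node with capacity.
# All names and the greedy policy are hypothetical, not OODT classes.
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    requires: set = field(default_factory=set)  # input files that must exist


@dataclass
class Node:
    name: str
    free_slots: int


def ready_tasks(tasks, available_files):
    """Return the tasks whose required inputs are all present."""
    return [t for t in tasks if t.requires <= available_files]


def allocate(tasks, nodes):
    """Greedy placement: pick the node with the most free slots.

    Tasks that cannot be placed are mapped to None.
    """
    plan = {}
    for task in tasks:
        node = max(nodes, key=lambda n: n.free_slots)
        if node.free_slots == 0:
            plan[task.name] = None
            continue
        node.free_slots -= 1
        plan[task.name] = node.name
    return plan


if __name__ == "__main__":
    tasks = [Task("L1", {"raw.dat"}), Task("L2", {"raw.dat", "cal.dat"})]
    nodes = [Node("node-a", 1), Node("node-b", 2)]
    runnable = ready_tasks(tasks, {"raw.dat"})  # only L1 has all its inputs
    print(allocate(runnable, nodes))
```

Keeping "is it ready?" separate from "where does it run?" mirrors the split between the Workflow Manager and Resource Manager roles listed on the slide.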
  • 19. File/Metadata Capabilities
  • 20. Advanced Workflow Monitoring
  • 21. Resource Monitoring
  • 22. How do we deploy PCS for a mission?
      • We implement the following mission-specific customizations
        – Server Configuration
          • Implemented in ASCII properties files
        – Product metadata specification
          • Implemented in XML policy files
        – Processing Rules
          • Implemented as Java classes and/or XML policy files
        – PGE Configuration
          • Implemented in XML policy files
        – Compute Node Usage Policies
          • Implemented in XML policy files
      • Here's what we don't change
        – All PCS Servers (e.g. File Manager, Workflow Manager, Resource Manager)
          • Core data management, pipeline process management and job scheduling/submission capabilities
        – File Catalog schema
        – Workflow Model Repository Schema
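To make the "configuration, not code" idea concrete, here is a minimal sketch of reading the two kinds of files mentioned above: an ASCII properties file and an XML policy file. The property keys, port numbers, and XML element and attribute names are invented for illustration and do not follow OODT's actual policy schema.

```python
# Sketch of loading two kinds of mission-specific files: an ASCII
# properties file and an XML policy file. The keys and the XML element and
# attribute names are invented for illustration; they do not follow
# OODT's actual policy schema.
import xml.etree.ElementTree as ET

PROPERTIES_TEXT = """\
# hypothetical server settings
filemgr.port=9000
workflow.port=9001
"""

POLICY_XML = """\
<productTypes>
  <productType name="Level1Granule">
    <metadata key="Instrument"/>
    <metadata key="StartTime"/>
  </productType>
</productTypes>
"""


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_policy(xml_text):
    """Map each product type name to its list of metadata keys."""
    root = ET.fromstring(xml_text)
    return {
        pt.get("name"): [m.get("key") for m in pt.findall("metadata")]
        for pt in root.findall("productType")
    }


if __name__ == "__main__":
    print(parse_properties(PROPERTIES_TEXT))
    print(parse_policy(POLICY_XML))
```

The point of the split on the slide is that a new mission edits files like these while the servers themselves stay unchanged.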
  • 23. Server and PGE Configuration
  • 24. Latest Apache OODT release: 0.3
      • First appearance of PCS
        – Core, Services (JAX-RS)
      • Web Applications
        – Balance (PHP), and Wicket (Java)-based apps for file management and workflow monitoring
      • First release deployed to Maven Central
        – We did backport 0.2 there after this
        – Over 60 issues fixed in JIRA
      • June 2011: recommended stable release
  • 25. Working on: 0.4
      • Operator Interface (OODT-157)
      • Workflow2 integration (OODT-215) and all of its sub-issues
        – Global workflow conditions, dynamic workflows, parallel/sequential model, new workflow engine, etc.
      • OODT RADIX for super easy deployment (OODT-120)
      • Solr sync with File Manager (OODT-326)
      • Improvements to XMLPS (OODT-333) and new crawler actions (OODT-33, OODT-34, OODT-35, OODT-36, OODT-37)
      • CLI rewrite and refactor
      • Over 130 issues currently resolved
      • Likely to come before end of Q2 2012
  • 26. How do these fit together?
      • Hadoop HDFS
        – OODT file manager leveraging HDFS for virtual disk path, replication, archiving, scalability
      • Hadoop M/R
        – Work done in OODT branch to connect OODT Workflow + Resource Mgmt to Hadoop (pre YARN)
      • Hadoop HIVE used in Regional Climate Modeling DB
  • 27. Where are we headed with OODT + Hadoop?
      • Investigate and integrate YARN
        – Workflow and Resource Mgmt
      • Plug in HBase as File Manager Catalog
        – Already plugged in HIVE
        – Potentially leverage Gora?
      • OODT + Hadoop Virtual Machines and RPMs
        – Easy Installation leveraging OODT RADIX
      • Remote file acquisition (Push/Pull) as Hadoop M/R
  • 28. Key Takeaway
      Apache OODT, Apache Hadoop, other big data technologies preparing the world to handle all of these diverse use cases!
      Constantly evolving and improving frameworks – join up and help.
      Free and open source from Apache and helping government demonstrate the public good
  • 29. Apache OODT Project Contact Info
      • Learn more and track our progress at:
        – http://oodt.apache.org
        – WIKI: https://cwiki.apache.org/OODT/
        – JIRA: https://issues.apache.org/jira/browse/OODT
      • Join the mailing list:
        – dev@oodt.apache.org
      • Chat on IRC:
        – #oodt on irc.freenode.net
      • Acknowledgements
        – Key Members of the OODT teams: Chris Mattmann, Daniel J. Crichton, Steve Hughes, Andrew Hart, Sean Kelly, Sean Hardman, Paul Ramirez, David Woollard, Brian Foster, Dana Freeborn, Emily Law, Mike Cayanan, Luca Cinquini, Heather Kincaid
        – Projects, Sponsors, Collaborators: Planetary Data System, Early Detection Research Network, Climate Data Exchange, Virtual Pediatric Intensive Care Unit, NASA SMAP Mission, NASA OCO-2 Mission, NASA NPP Sounder Peate, NASA ACOS Mission, Earth System Grid Federation
  • 30. Alright, I'll shut up now
      • Any questions?
      • THANK YOU!
        – chris.a.mattmann@nasa.gov
        – @chrismattmann on Twitter