Big Data Challenges at NASA: Presentation Transcript

    • Big Data Challenges at NASA
      Chris A. Mattmann
      Senior Computer Scientist, NASA
      Adjunct Assistant Professor, USC
      Member, Apache Software Foundation
    • And you are?
      • Senior Computer Scientist at NASA JPL in Pasadena, CA, USA
      • Software Architecture/Engineering Prof at the Univ. of Southern California
      • Apache Member involved in:
        – OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor), Gora (Champion), MRUnit (Mentor), Airavata (Mentor)
    • Agenda
      • Big Data challenges and where we're headed
      • Example systems at NASA and other agencies
      • Apache OODT: a primer
      • Apache OODT + Hadoop
      • Where we're headed and wrap-up
    • Some "Big Data" Grand Challenges I'm interested in
      • How do we handle 700 TB/sec of data coming off the wire when we actually have to keep it around?
        – Required by the Square Kilometre Array
      • Joe Scientist says: "I've got an IDL or Matlab algorithm that I will not change, and I need to run it on 10 years of data from the Colorado River Basin and store and disseminate the output products."
        – Required by the Western Snow Hydrology project
      • How do we compare petabytes of climate model output data in a variety of formats (HDF, NetCDF, GRIB, etc.) with petabytes of remote sensing data to improve climate models for the next IPCC assessment?
        – Required by the 5th IPCC assessment, the Earth System Grid, and NASA
      • How do we catalog all of NASA's current planetary science data?
        – Required by the NASA Planetary Data System
      Copyright 2012 Jet Propulsion Laboratory, California Institute of Technology. US Government Sponsorship Acknowledged. Image credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295
    • The NASA ESDS Context
      – Where is open source most useful? Which area should produce open source software?
    • Lessons from '90s-era missions
      • Increasing data volumes (exponential growth)
      • Increasing complexity of instruments and algorithms
      • Increasing availability of proxy/sim/ancillary data
      • Increasing rate of technology refresh
      ... all of this while NASA Earth mission funding was decreasing.
      The answer: a data system framework based on a standard architecture and reusable software components for supporting all future missions.
    • Where do Big Data technologies fit into this?
      – U.S. National Climate Assessment (pic credit: Dr. Tom Painter)
      – SKA South Africa: Square Kilometre Array (pic credit: Dr. Jasper Horrell, Simon Ratcliffe)
    • [Image slide. Credit: Cameron Goodale]
    • EVLA demonstration architecture
      – [Architecture diagram: EVLA data (e.g., day2_TDEM0003_10s_norx) lands in a staging area on ska-dc.jpl.nasa.gov, where an Apache OODT CAS Crawler ingests it into the PCS data system (File Manager, Workflow Manager, Resource Monitor); products, metadata, and status flow to CAS Data Services (Browser, Curator) and out to the science system via the web. A legend distinguishes data flow from control flow.]
    • Apache OODT
      • Entered incubation at the Apache Software Foundation in 2010
      • Selected as a top-level Apache Software Foundation project in January 2011
      • Developed by a community of participants from many companies, universities, and organizations
      • Used for a diverse set of science data system activities in planetary science, earth science, radio astronomy, biomedicine, astrophysics, and more
      http://oodt.apache.org
      [Logo montage: the OODT development and user community]
    • Apache OODT: OSS "big data" platform originally pioneered at NASA
      • OODT is meant to be a set of tools to help build data systems
        – It's not meant to be turnkey
        – It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science
        – Each discipline/project extends it
      • Projects that are deploying it operationally:
        – Decadal-survey-recommended NASA Earth science missions, NIH and NCI, CHLA, USC, and the South African SKA project
      • Why Apache?
        – Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop)
        – Differs from other open source communities; it provides a governance and management structure
      Copyright 2012 Jet Propulsion Laboratory, California Institute of Technology. US Government Sponsorship Acknowledged.
    • Why Apache and OODT?
      • OODT is meant to be a set of tools to help build data systems
        – It's not meant to be turnkey
        – It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science
        – Each discipline/project extends it
      • Apache is the elite open source community for software developers
        – Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop)
        – Differs from other open source communities; it provides a governance and management structure
    • Governance Model + NASA = ♥
      • NASA and other government agencies have tons of process
        – They like that
    • OODT Framework and PCS
      – [Architecture diagram: the Object Oriented Data Technology framework. Client web tools sit atop a navigation service, which fronts the Profile, Product, Query, and Bridge-to-External-System services plus the Catalog & Archive Service (CAS) / Process Control System (PCS); beneath these sit profile data, XML, and the underlying data systems.]
      – CAS has recently become known as the Process Control System (PCS) when applied to mission work.
    • Current PCS deployments
      – Orbiting Carbon Observatory (OCO-2), spectrometer instrument. NASA ESSP mission; launch date: TBD 2013. PCS supports thermal vacuum tests, ground-based instrument data processing, space-based instrument data processing, and a Science Computing Facility. EOM data volume: 61-81 TB in 3 yrs. Processing throughput: 200-300 jobs/day.
      – NPP Sounder PEATE, infrared sounder. Joint NASA/NPOESS mission; launch date: October 2011. PCS supports the Science Computing Facility (PEATE). EOM data volume: 600 TB in 5 yrs. Processing throughput: 600 jobs/day.
      – QuikSCAT, scatterometer. NASA quick-recovery mission; launch date: June 1999. PCS supports instrument data processing and a science analyst sandbox. Originally planned as a 2-year mission.
      – SMAP, high-res radar and radiometer. NASA decadal study mission; launch date: 2014. PCS supports the radar instrument and science algorithm development testbed.
    • Other PCS applications
      – Astronomy and radio: prototype work on MeerKAT with the South Africans and the KAT-7 telescope; discussions ongoing with NRAO Socorro (EVLA and ALMA)
      – Bioinformatics: the National Institutes of Health (NIH) National Cancer Institute's (NCI) Early Detection Research Network (EDRN); Children's Hospital LA's Virtual Pediatric Intensive Care Unit (VPICU)
      – Earth science: National Climate Assessment snow hydrology in the Western US and Alaska; National Climate Assessment regional climate modeling and evaluation
      – Technology demonstration: JPL's Active Mirror Telescope (AMT); White Sands Missile Range
    • PCS Core Components
      • All core components are implemented as web services (a minimal client sketch follows this slide)
        – XML-RPC is used to communicate between components
        – Servers are implemented in Java
        – Clients are implemented in Java, scripts, Python, PHP, and web apps
        – Service configuration is implemented in ASCII and XML files
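Because the PCS components speak plain XML-RPC, a thin client can be written in any language with an XML-RPC library. Here is a minimal Python sketch of a liveness check against the File Manager; the localhost:9000 endpoint and the filemgr.isAlive method name reflect common OODT defaults, but treat both as assumptions and verify them against your own deployment's configuration.

```python
# Minimal XML-RPC liveness check against an OODT File Manager.
# Assumptions: the server listens on localhost:9000 and exposes a
# "filemgr.isAlive" method (typical OODT defaults; verify against
# your own deployment's configuration).
import xmlrpc.client

def filemgr_is_alive(url="http://localhost:9000"):
    """Return True if the File Manager answers its liveness call."""
    server = xmlrpc.client.ServerProxy(url)
    try:
        return bool(server.filemgr.isAlive())
    except (OSError, xmlrpc.client.Error):
        return False

if __name__ == "__main__":
    print("File Manager up:", filemgr_is_alive())
```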
    • Core Capabilities
      • The File Manager does data management (see the toy sketch after this slide)
        – Tracks all of the stored data, files & metadata
        – Moves data to the appropriate locations before and after initiating PGE runs, and from the staging area to controlled-access storage
      • The Workflow Manager does pipeline processing
        – Automates processing when all run conditions are ready
        – Monitors and logs processing status
      • The Resource Manager does resource management
        – Allocates processing jobs to computing resources
        – Monitors and logs job & resource status
        – Copies output data to storage locations where space is available
        – Provides the means to monitor resource usage
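To make the File Manager's ingest responsibility concrete, here is a toy Python sketch (not OODT code) of the pattern described above: move a product out of the staging area into controlled-access storage and record its metadata in a catalog. All names and paths are invented for illustration.

```python
# Toy sketch of the File Manager's ingest pattern: move a staged
# product into controlled-access storage and catalog its metadata.
# Illustrative only; OODT's real File Manager is a Java XML-RPC
# service with pluggable catalogs and data-transfer policies.
import shutil
from pathlib import Path

CATALOG = {}  # stand-in for the file/metadata catalog

def ingest(product: Path, archive_root: Path, metadata: dict) -> Path:
    """Archive one staged product and record its metadata."""
    archive_root.mkdir(parents=True, exist_ok=True)
    dest = archive_root / product.name
    shutil.move(str(product), str(dest))  # staging -> archive
    CATALOG[product.name] = {**metadata, "location": str(dest)}
    return dest
```

A downstream component (in OODT's case, the Workflow Manager) can then look the product up in the catalog when assembling a PGE's inputs.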
    • File/Metadata Capabilities [screenshot slide]
    • Advanced Workflow Monitoring [screenshot slide]
    • Resource Monitoring [screenshot slide]
    • How do we deploy PCS for a mission?
      • We implement the following mission-specific customizations (a hedged policy-file sketch follows this slide):
        – Server configuration: implemented in ASCII properties files
        – Product metadata specification: implemented in XML policy files
        – Processing rules: implemented as Java classes and/or XML policy files
        – PGE configuration: implemented in XML policy files
        – Compute node usage policies: implemented in XML policy files
      • Here's what we don't change:
        – All PCS servers (e.g., File Manager, Workflow Manager, Resource Manager): core data management, pipeline process management, and job scheduling/submission capabilities
        – The file catalog schema
        – The workflow model repository schema
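The point of this split is that missions customize by editing files, not servers. The sketch below shows the flavor of a product-type policy entry being read by unchanged code; the element and attribute names are invented for illustration (real OODT policy files have their own schemas).

```python
# Hypothetical product-type policy entry, parsed with the standard
# library. Element/attribute names are invented for illustration;
# real OODT policy files follow their own schemas.
import xml.etree.ElementTree as ET

POLICY = """
<productTypes>
  <type name="OCO2_L1B_Spectra" versioner="DateTimeVersioner">
    <repositoryPath>/archive/oco2/l1b</repositoryPath>
    <metadata key="Instrument" value="Spectrometer"/>
  </type>
</productTypes>
"""

for ptype in ET.fromstring(POLICY).iter("type"):
    print(ptype.get("name"), "->", ptype.findtext("repositoryPath"))
```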
    • Server and PGE Configuration [screenshot slide]
    • Latest Apache OODT release: 0.3
      • First appearance of PCS
        – Core, Services (JAX-RS)
      • Web applications
        – Balance (PHP)- and Wicket (Java)-based apps for file management and workflow monitoring
      • First release deployed to Maven Central
        – We did backport 0.2 there after this
      • Over 60 issues fixed in JIRA
      • June 2011: recommended stable release
    • Working on: 0.4
      • Operator Interface (OODT-157)
      • Workflow2 integration (OODT-215) and all of its sub-issues
        – Global workflow conditions, dynamic workflows, parallel/sequential model, new workflow engine, etc.
      • OODT RADIX for super-easy deployment (OODT-120)
      • Solr sync with the File Manager (OODT-326)
      • Improvements to XMLPS (OODT-333) and new crawler actions (OODT-33, OODT-34, OODT-35, OODT-36, OODT-37)
      • CLI rewrite and refactor
      • Over 130 issues currently resolved
      • Likely to come before the end of Q2 2012
    • How do these fit together?
      • Hadoop HDFS
        – OODT File Manager leveraging HDFS for its virtual disk path, replication, archiving, and scalability (see the sketch after this slide)
      • Hadoop M/R
        – Work done in an OODT branch to connect OODT Workflow + Resource Mgmt to Hadoop (pre-YARN)
      • Hadoop Hive used in the Regional Climate Modeling DB
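One way to picture the HDFS pairing: the archive step targets an hdfs:// destination instead of a local disk path. Below is a minimal Python sketch that shells out to the standard hadoop fs CLI; the paths are invented, and the OODT branch mentioned above wires HDFS in through the File Manager's data-transfer layer rather than a shell call.

```python
# Minimal sketch: archive a staged product into HDFS via the standard
# "hadoop fs" CLI. Paths are invented; OODT's actual HDFS integration
# goes through the File Manager's data-transfer layer, not a shell call.
import subprocess

def archive_to_hdfs(local_path, hdfs_dir="/archive/products"):
    subprocess.run(["hadoop", "fs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hadoop", "fs", "-put", "-f", local_path, hdfs_dir], check=True)

if __name__ == "__main__":
    archive_to_hdfs("/staging/day2_TDEM0003_10s_norx.tar")
```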
    • Where are we headed with OODT + Hadoop?
      • Investigate and integrate YARN
        – Workflow and Resource Mgmt
      • Plug in HBase as the File Manager catalog
        – Already plugged in Hive
        – Potentially leverage Gora?
      • OODT + Hadoop virtual machines and RPMs
        – Easy installation leveraging OODT RADIX
      • Remote file acquisition (Push/Pull) as Hadoop M/R
    • Key Takeaway
      – Apache OODT, Apache Hadoop, and other big data technologies are preparing the world to handle all of these diverse use cases!
      – These are constantly evolving and improving frameworks: join up and help.
      – Free and open source from Apache, and helping government demonstrate the public good.
    • Apache OODT Project Contact Info
      • Learn more and track our progress at:
        – http://oodt.apache.org
        – Wiki: https://cwiki.apache.org/OODT/
        – JIRA: https://issues.apache.org/jira/browse/OODT
      • Join the mailing list:
        – dev@oodt.apache.org
      • Chat on IRC:
        – #oodt on irc.freenode.net
      • Acknowledgements
        – Key members of the OODT teams: Chris Mattmann, Daniel J. Crichton, Steve Hughes, Andrew Hart, Sean Kelly, Sean Hardman, Paul Ramirez, David Woollard, Brian Foster, Dana Freeborn, Emily Law, Mike Cayanan, Luca Cinquini, Heather Kincaid
        – Projects, sponsors, collaborators: Planetary Data System, Early Detection Research Network, Climate Data Exchange, Virtual Pediatric Intensive Care Unit, NASA SMAP Mission, NASA OCO-2 Mission, NASA NPP Sounder PEATE, NASA ACOS Mission, Earth System Grid Federation
    • Alright, I'll shut up now
      • Any questions?
      • THANK YOU!
        – chris.a.mattmann@nasa.gov
        – @chrismattmann on Twitter