A Look into the Apache OODT Ecosystem Chris A. Mattmann NASA JPL/Univ. Southern California/ASF [email_address]  November 9, 2011
Apache Member involved in OODT (VP, PMC), Tika (VP,PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor) Senior Computer Scientist at NASA JPL in Pasadena, CA USA Software Architecture/Engineering Prof at Univ. of Southern California  And you are?
Welcome to the Apache in Space! (OODT) Track
Agenda Overview of OODT and its history How we got it to Apache How other projects can follow our model Existing successful deployments of OODT Pointers to papers, and more information including case studies
Lessons from 90’s era missions Increasing data volumes (exponential growth) Increasing complexity of instruments and algorithms Increasing availability of proxy/sim/ancillary data Increasing rate of technology refresh …  all of this while NASA Earth Mission funding was decreasing A data system framework based on a standard architecture and reusable software components for supporting all future missions.
Enter OODT Object Oriented Data Technology http://oodt.apache.org Funded initially in 1998 by NASA’s Office of Space Science Envisaged as a national software framework for sharing data across heterogeneous, distributed data repositories OODT is both an architecture and a reference implementation providing Data Production Data Distribution Data Discovery Data Access OODT is Open Source and available from the Apache Software Foundation
Apache OODT Originally funded by NASA to focus on distributed science data system environments science data generation  data capture, end-to-end Distributed access to science data repositories by the community A set of building blocks/services to exploit common system patterns for reuse  Supports deployment based on a rich information model Selected as a top level Apache Software Foundation project in January 2011 Runner up for NASA Software of the Year Used for a number of science data system activities in planetary, earth, biomedicine, astrophysics http://oodt.apache.org
Apache OODT Press
Why Apache and OODT? OODT is meant to be a set of tools to help build data systems It’s not meant to be “turn key” It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science Each discipline/project extends Apache is the elite open source community for software developers Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop) Differs from other open source communities; it provides a governance and management structure
Governance Model+NASA=♥ NASA and other government  agencies have tons of process They like that
Publicly accessible and  searchable archives http://svnsearch.org/svnsearch/repos/ASF/search?path=%2Foodt   http://mail-archives.apache.org/mod_mbox/oodt-dev/   http://mail-archives.apache.org/mod_mbox/oodt-user/   100+ ML  list subscriptions
Great Metrics and Insight http://www.ohloh.net/p/oodt
Movement to the ASF Meeting held June 15, 2007 at JPL with ASF President Justin Erenkrantz Develop plan moving forward to bring first NASA project to Apache Discuss obstacles, sponsorship Discuss outlook
2007: original goals Come up with incubation proposal Chris Mattmann was one of the principal contributors to the proposal for the Tika project, and to other Incubation activities (Apache SIS) Send out emails to the Incubator mailing list Look for mentors Get sponsorship from ranking Apache PMC member or board member Justin and others Top-level  project versus sub project outlook heading out of incubation
OODT Incubator Planning Monthly Updates (for first 3 months, then quarterly) Status Progress Community Acceptance Plan for exiting incubation How to have a solid user base How to operate as a unit in the Apache way Maintenance of user interest and community going forward
OODT’s next steps circa 2007 JPL to tackle legal issues Is OODT releasable as an Apache product http://www.apache.org/licenses/software-grant.txt This needs to be signed by parties that be by JPL Contributor License Agreement Do we need a corporate one? In parallel to this Draft OODT incubation proposal Start identifying who would initially be interested More external, non-JPL people who are interested, the better Justin to get slides from other incubator people
… 2 years later Worked it out with JPL legal Turns out the ALv2 license is extremely friendly and is something that JPL (note not all of NASA) was amenable to Developed OODT incubator proposal http://wiki.apache.org/incubator/OODTProposal   Found willing Apache mentors besides Justin Jean-Frederic Clere, Ross Gardler, Ian Holsman … Put OODT at Apache!
Apache OODT Community Includes PMC members from NASA JPL, Univ. of Southern California, Google, Children’s Hospital Los Angeles (CHLA), Vdio, South African SKA Project Projects that are deploying it operationally at Decadal-survey recommended NASA Earth science missions, NIH, and NCI, CHLA, USC, South African SKA project Use in the classroom My graduate-level software architecture and search engines courses
OODT Framework OBJECT ORIENTED DATA TECHNOLOGY FRAMEWORK OODT/Science Web Tools Archive Client Profile XML Data Data System 1 Data System 2 Catalog & Archive Service Profile Service Product Service Query Service Bridge to  External Services Navigation Service Other Service 1 Other Service 2 Andrew Hart and Emily Law will talk about these later You’ll hear about this later today I’ll tell you about these now
Architectural Principles Division of Labor Don’t make one component the workhorse! Technology Independence Don’t get bitten in the rear when a software vendor decides to charge you a lot of $$$ for their previously low cost technology Metadata as a first-class citizen Descriptions of resources come in handy Separation of software and data models Allow each to evolve independently
OODT Architecture Reference Architecture Four pairs of component types Product Client/Server, Profile Client/Server, Query Client/Server, Catalog and Archive Client/Server Two connector types Messaging layer discussed in  http://sunset.usc.edu/~mattmann/pubs/ICSE.pdf   Handler connector (discussed in this presentation) Instantiated for different domains using these fundamental building blocks
Product Client and Server Product Client (A) Product Server (A) RAID Disk Deliver data from underlying data store Accept uniform query structure that identifies 0 or more “products” (data items) to retrieve Many-to-Many Web site MSSQL Product Client (B) Product Client (C) Product Server (B) Product Server (C)
How about an example of a product?
Profile Client and Server Profile  Client (A) Profile  Server (A) Oracle Deliver metadata from underlying metadata store Metadata gives user enough information about where to find actual data Housekeeping information Resource information Domain-specific information Many-to-Many Web site MSSQL Profile Client (B) Profile  Client (C) Profile Server (B) Profile  Server (C)
How about an example of a profile? Attributes Relationships Credit: A. Hart
Query Client and Server Query Server seeded with initial set of pointers to Profile Servers Profile Servers point to actual resources (Product Servers, even  other Profile Servers ) Interactive (metadata returned) and non-interactive (data returned) Many-to-Many Query  Server (A) Query  Client (A) Product Server (A) Product Server (B) Discovered Profile Server (B) Profile Server (A) Initial set Query  Client (B) Query  Client (C) Query  Server (B) Query  Server (C)
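The discovery step described on this slide can be pictured as a graph walk: start from the initial set of Profile Servers and follow their pointers until every reachable Product Server has been found. The following is only a minimal sketch of that idea, not OODT's implementation; the class names, fields, and the `discover_product_servers` function are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProductServer:
    """Hypothetical stand-in for a server that holds actual data items."""
    name: str


@dataclass
class ProfileServer:
    """Hypothetical stand-in for a server that points at other resources."""
    name: str
    resources: list = field(default_factory=list)  # product or profile servers


def discover_product_servers(initial_profile_servers):
    """Follow profile-server pointers and return every reachable product server."""
    seen = set()      # server names already visited (guards against cycles)
    found = []
    stack = list(initial_profile_servers)
    while stack:
        server = stack.pop()
        if server.name in seen:
            continue
        seen.add(server.name)
        if isinstance(server, ProductServer):
            found.append(server)
        else:
            stack.extend(server.resources)
    return found


# Profile Server (A) is the initial set; Profile Server (B) is discovered via A.
product_a = ProductServer("product-a")
product_b = ProductServer("product-b")
profile_b = ProfileServer("profile-b", [product_b])
profile_a = ProfileServer("profile-a", [product_a, profile_b])
profile_b.resources.append(profile_a)  # profile servers may point at each other

names = sorted(s.name for s in discover_product_servers([profile_a]))
```

Tracking visited servers by name is an assumption made here so that profile servers pointing back at one another do not cause an endless loop.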
Catalog and Archive Client and Server (CAS) Ingest data into repository and metadata into registry Run processing algorithms on data/metadata upon ingestion Workflow support Serve back Repository data with Product Server Serve back Registry metadata with Profile Server Many-to-Many Archive  Client (A) Archive  Server (A) Repository Registry Profile  Server (A) Product  Server (A) Archive Server (B) Archive Client (B) Archive Client (C)
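To make the ingest flow on this slide concrete, here is a minimal in-memory sketch, assuming the repository and registry can be modeled as two dictionaries and that "processing algorithms" are plain functions run at ingestion time. The `ArchiveServer` class and its methods are hypothetical names for illustration and are not the CAS API.

```python
class ArchiveServer:
    """Toy archive: data goes to a repository, metadata goes to a registry."""

    def __init__(self, extractors=None):
        self.repository = {}  # product id -> raw bytes
        self.registry = {}    # product id -> metadata dict
        self.extractors = list(extractors or [])

    def ingest(self, product_id, data, metadata):
        """Store the data, run each extractor on it, and store the merged metadata."""
        record = dict(metadata)
        for extract in self.extractors:
            record.update(extract(data))
        self.repository[product_id] = data
        self.registry[product_id] = record
        return product_id

    def get_data(self, product_id):
        """Role played by the Product Server on the slide."""
        return self.repository[product_id]

    def get_metadata(self, product_id):
        """Role played by the Profile Server on the slide."""
        return self.registry[product_id]


def size_extractor(data):
    """Example processing step: derive a metadata field from the data itself."""
    return {"FileSize": len(data)}


server = ArchiveServer(extractors=[size_extractor])
server.ingest("granule-001", b"abcdef", {"Mission": "example"})
```

Keeping data and metadata in separate stores mirrors the slide's split between repository and registry, so each can be served back by a different component.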
Some notes about CAS All Core components implemented as web services XML-RPC used to communicate between components Servers implemented in Java Clients implemented in Java, scripts, Python,  PHP and web-apps Service configuration implemented in ASCII and XML files  Credit: D. Woollard
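Because the components talk over XML-RPC, a call is just an XML document sent over HTTP. The sketch below uses Python's standard `xmlrpc.client` module to show what such a payload looks like and how it round-trips; the method name `archive.ingest` and the metadata keys are made up for illustration and are not taken from the OODT interfaces.

```python
import xmlrpc.client

# Marshal a hypothetical call with one struct argument into XML-RPC request XML.
request_xml = xmlrpc.client.dumps(
    ({"ProductType": "ExampleFile", "Filename": "granule-001.dat"},),
    methodname="archive.ingest",  # assumed name, for illustration only
)

# Unmarshal it again, as a server would before dispatching the call.
params, method = xmlrpc.client.loads(request_xml)
metadata = params[0]
```

Printing `request_xml` shows the `<methodCall>` envelope, which is why clients in Java, Python, PHP, or shell scripts can all talk to the same server.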
Handler Connectors RAID Disk Encapsulate (meta-)data coordination and communication Allow for dynamic addition and removal of different classes of back end metadata and data stores DBMS Product Handler Flat File Product Handler Web Site Product Handler Product/Profile Server Product/Profile Server Web site MSSQL MSSQL
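One way to read the handler connector idea is as a registry of pluggable back ends behind a single query interface. The sketch below is an assumption-laden illustration of that pattern: `HandlerServer`, `add_handler`, `remove_handler`, and the two toy handlers are hypothetical names, and the "query" is reduced to a keyword match.

```python
class HandlerServer:
    """Toy server that fans one query out to whichever handlers are installed."""

    def __init__(self):
        self._handlers = {}  # handler name -> callable(keyword) -> list of ids

    def add_handler(self, name, handler):
        self._handlers[name] = handler

    def remove_handler(self, name):
        self._handlers.pop(name, None)

    def query(self, keyword):
        """Return zero or more matching item ids from all installed handlers."""
        results = []
        for handler in self._handlers.values():
            results.extend(handler(keyword))
        return results


# Two stand-in back ends: a "flat file" store and a "DBMS" table.
flat_files = {"a.txt": "ocean temperature", "b.txt": "soil moisture"}
db_rows = [("r1", "ocean salinity")]


def flat_file_handler(keyword):
    return [name for name, text in flat_files.items() if keyword in text]


def dbms_handler(keyword):
    return [row_id for row_id, text in db_rows if keyword in text]


server = HandlerServer()
server.add_handler("flat", flat_file_handler)
server.add_handler("dbms", dbms_handler)
both = server.query("ocean")

server.remove_handler("dbms")  # back ends can be removed at run time
only_flat = server.query("ocean")
nothing = server.query("lunar")
```

The client only ever sees `query`; which stores sit behind it can change without the client noticing, which is the technology-independence point made earlier in the talk.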
Example handler connectors XMLPS http://oodt.apache.org/components/maven/xmlps/   XML config file specifies recipe for extracting records from an RDBMS and turning them into a NoSQL repository PS XML configurable profile server to unlock OPeNDAP datasets and pass them to OODT
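The XMLPS idea of an XML file acting as a recipe for pulling records out of a relational database can be sketched as follows. This is not the XMLPS configuration schema; the `<mapping>` and `<field>` elements, the table, and the column names are invented for illustration, and an in-memory SQLite database stands in for the RDBMS.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping file: exposed field name -> database column.
CONFIG = """
<mapping table="specimens">
  <field name="SpecimenId" column="id"/>
  <field name="Organ" column="organ_site"/>
</mapping>
"""


def extract_records(conn, config_xml):
    """Read the mapping and return each row as a dict keyed by exposed field name."""
    root = ET.fromstring(config_xml)
    table = root.get("table")
    fields = [(f.get("name"), f.get("column")) for f in root.findall("field")]
    columns = ", ".join(column for _, column in fields)
    # Identifiers come from a trusted config file here; never build SQL this
    # way from untrusted input.
    rows = conn.execute(f"SELECT {columns} FROM {table} ORDER BY id")
    return [
        {name: value for (name, _), value in zip(fields, row)}
        for row in rows
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimens (id INTEGER, organ_site TEXT)")
conn.executemany("INSERT INTO specimens VALUES (?, ?)", [(1, "lung"), (2, "colon")])
records = extract_records(conn, CONFIG)
```

The resulting list of dictionaries is schema-free from the caller's point of view, which is roughly what the slide means by turning relational records into a NoSQL-style repository.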
So, how do you piece them together: NASA VODC NASA’s Virtual Oceanographic Data Center (VODC) http://vodc.jpl.nasa.gov   Information integration using OODT components Profile, Product, Query, also uses Apache Solr, and Plone
So, how do you piece them together: JPL’s CDX CDX = Climate Data Exchange Provide comparison of remote sensing data and model outputs Existing systems remain in place; services expose data and functions over the network; support the era of the IPCC 5th assessment and distributed petabytes of data
Who’s doing what? Children’s Hospital Los Angeles Improving upon XMLPS, and CAS (Andrew Hart + Ricky Nguyen will talk about this) Supporting data analytics  Google Brian Foster working on command line improvements and data protocol push/pull SKA South Africa Deploying file manager and crawler for use in KAT-7 pipeline ingestion NIH/NCI Maintaining the XMLPS components, and CAS components Helping with user interfaces Various JPL and NASA research projects OPeNDAPps, XMLPS Various NASA missions Workflow, PCS, services, OPSui, other web apps
Latest release: 0.3 First appearance of PCS Core, Services (JAX-RS) Web Applications Balance (PHP), and Wicket (Java)-based apps for file management and workflow monitoring First release deployed to Maven Central We did backport 0.2 there after this Over 60 issues fixed in JIRA June 2011: recommended stable release
Working on: 0.4 Operator Interface (OODT-157) Andrew Hart and I will talk about this Workflow2 integration (OODT-215) and all of its sub-issues Global workflow conditions, dynamic workflows, parallel/sequential model, new workflow engine, etc. OODT RADIX for super easy deployment (OODT-120) Paul Ramirez and Cameron Goodale will discuss this Solr sync with File Manager (OODT-326) Improvements to XMLPS (OODT-333) and new crawler actions (OODT-33, OODT-34, OODT-35, OODT-36, OODT-37) Over 48 issues currently resolved Likely to come before end of Q4 2011
Using Apache OODT as a testbed for software process Missions maintain their own local CMs Local mission CMs contain forks of existing OSS software Forks can be patch based or CM based Changes found particularly effective are discussed within the comm. and eventually brought before a CCB that reviews their generality, etc. Credit: D. Freeborn
Some Grand Challenges I’m interested in: OODT can help! How do we handle 700 TB/sec of data coming off the wire when we actually have to keep it around? Required by the Square Kilometre Array Joe scientist says I’ve got an IDL or Matlab algorithm that I  will not change  and I need to run it on 10 years of data from the Colorado River Basin and store and disseminate the output products Required by the Western Snow Hydrology project
Some Grand Challenges I’m interested in: OODT can help! How do we compare petabytes of climate model output data in a variety of formats (HDF, NetCDF, GRIB, etc.) with petabytes of remote sensing data to improve climate models for the next IPCC assessment? Required by the 5th IPCC assessment and the Earth System Grid and NASA How do we catalog all of NASA’s current planetary science data?
Key Takeaway OODT is already doing and/or preparing the world to handle all of these diverse use cases! It’s a constantly evolving and improving framework – join up and help. It’s free and open source from Apache and helping government demonstrate the public good
OODT Project Contact Info Learn more and track our progress at: http://oodt.apache.org WIKI: https://cwiki.apache.org/OODT/ JIRA: https://issues.apache.org/jira/browse/OODT Join the mailing list: [email_address] Chat on IRC: #oodt on irc.freenode.net Acknowledgements Key Members of the OODT teams: Chris Mattmann, Daniel J. Crichton, Steve Hughes, Andrew Hart, Sean Kelly, Sean Hardman, Paul Ramirez, David Woollard, Brian Foster, Dana Freeborn, Emily Law, Mike Cayanan, Luca Cinquini, Heather Kincaid Projects, Sponsors, Collaborators: Planetary Data System, Early Detection Research Network, Climate Data Exchange, Virtual Pediatric Intensive Care Unit, NASA SMAP Mission, NASA OCO-2 Mission, NASA NPP Sounder PEATE, NASA ACOS Mission, Earth System Grid Federation
Alright, I’ll shut up now Any questions? THANK YOU! [email_address] @chrismattmann on Twitter
