2. Introduction
•Cal Poly Pomona Computer Science student
•Found out about SIRI Internship through professor
•Interested in building something from the ground up
3. Problem
• PDS IMG houses over 700 TB of digital image archives
• Currently have a loose understanding of the data that we have
• Need to enable a better picture of the archive data
• Need to inventory and query archive easily
4. Solution – AIMS Inventory Component
• Crawl through the archive to examine each file or directory
• Represent each item in the archive as an Archive Product
• Extract the file metadata for each Archive Product
• Track all information about each Archive Product to maintain the inventory
• Index and store each Archive Product
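The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the AIMS code: the `ArchiveProduct` class, the `crawl` and `build_inventory` functions, and the `file_size` key are hypothetical names, and a plain dictionary stands in for the real index and store.

```python
import os
from dataclasses import dataclass, field


@dataclass
class ArchiveProduct:
    """Hypothetical stand-in for one file or directory in the archive."""
    path: str
    is_directory: bool
    metadata: dict = field(default_factory=dict)


def crawl(root):
    """Yield an ArchiveProduct for every directory and file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        yield ArchiveProduct(path=dirpath, is_directory=True)
        for name in filenames:
            yield ArchiveProduct(path=os.path.join(dirpath, name),
                                 is_directory=False)


def build_inventory(root):
    """Crawl root, attach basic file metadata, and index products by path."""
    inventory = {}
    for product in crawl(root):
        if not product.is_directory:
            # Assumption: file size is one of the tracked metadata fields.
            product.metadata["file_size"] = os.path.getsize(product.path)
        inventory[product.path] = product
    return inventory
```

Calling `build_inventory` on an archive root returns one entry per file and directory, which is the "inventory" in the simplest possible form.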
6. Archive Product
• Create an object to organize the metadata associated with each file/directory in the archive
• Metadata Extractor for Archive Product
7.
8. File Manager
WHAT IT DOES
•Open-source software part of Apache OODT CAS
•Goal is to collect, catalog, and store files
•Similar to idea of iTunes Library
•More powerful, can store any type of data
WHAT WE NEED
•Collect, catalog and store Archive Products
•Extend and configure software for use with Archive Products
• Use Archive Product Metadata Extractor
• Alter a few xml files
9. Crawler
• Apache OODT CAS
• Traverse the many directories and files within the data archive
• Push each Archive Product to the File Manager
• No additional extension for crawler is needed
• Minor configuration changes
10. Solr
• Use of the Apache Solr software to index and store information on each Archive Product
• Allows user to query the indexed data
• Many possible add-ons
• Google-like search of PDS documentation
• Create a core where the archive will be stored
• Create a special configuration for PDS IMG Archive Products
• Change the solrconfig.xml file to allow the core to use a manually edited schema file
• Includes a modified schema.xml file which indexes the metadata fields specific to Archive Products
• Modify the filemgr.properties file to integrate Filemgr with Solr
14. Documentation
• Added documentation to the Confluence wiki page for other PDS IMG developers
• Future extensions to the software will be easier
• Added instructions to extract more metadata
• Wiki Page
15. Future
• Utilize the Banana software, which runs on top of Solr
• Offers a rich and more flexible user interface
• Free search PDS documentation
16. What I learned
• Used many different open-source projects
• Learned about the software creation process
• Learned more about the data systems field
• Opportunity to apply Computer Science knowledge
1. My name’s Gabe, CPP CS student who will be graduating in the Spring
2. Found out about SIRI internship through an email from the CPP advisor for this internship
3. Chose computer science because I was always interested in creating something from the ground up
wanted to design something, implement it, test it, and see others use it
4. This project specifically seemed like a good opportunity to build something from the ground up
Transition: The problem I have been trying to solve during this internship is…
Transition: The solution to this problem is….
The solution to this problem is the AIMS Inventory Component
This software does the following tasks:
This architecture shows the various pieces of software that we used to accomplish these tasks
And shows how those pieces are connected to each other
1. Crawler takes each file directly from the Data Archive
2. Each file and its file metadata is represented as an Archive Product
3. File Manager extracts the metadata from the Archive Product
4. Solr indexes and stores the Archive Product and the extracted metadata
Started this project with the creation of the Archive Product to represent each file/directory within the archive
Also had to create methods to extract the metadata for each archive product
Diagram shows that the Inventory will consist of Archive Product objects
This is the metadata that each Archive Product will contain
This is a screenshot of my code.
These are the headers of the functions I created to extract the metadata for each Archive Product
Each function name explains its task (for example, get MD5 checksum gets the checksum for each file)
All of these functions are called by the doExtract method
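For readers without the screenshot, here is a rough Python sketch of that pattern: small, single-purpose functions gathered by one extract method. The real extractor plugs into the OODT File Manager and is not reproduced here. The names `get_md5_checksum`, `get_file_size`, `get_mission`, and `do_extract` are illustrative, and deriving the mission from the first directory below the archive root is an assumption, not something the project confirms.

```python
import hashlib
import os


def get_md5_checksum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def get_file_size(path):
    """Return the file size in bytes."""
    return os.path.getsize(path)


def get_mission(path, archive_root):
    """Assumption: the mission is the first directory below the archive root."""
    relative = os.path.relpath(path, archive_root)
    return relative.split(os.sep)[0]


def do_extract(path, archive_root):
    """Call every extraction function and collect the results in one dict."""
    return {
        "file_size": get_file_size(path),
        "md5_checksum": get_md5_checksum(path),
        "mission": get_mission(path, archive_root),
    }
```

Keeping each metadata field in its own function means a new field can be added by writing one more function and one more line in `do_extract`.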
Apache Object Oriented Data Technology Catalog and Archive Services
iTunes Library: want to store music
also want to store music metadata – if the product being stored is a song, metadata would include Title, Album, Year, Track Number
Chose this software for its extensibility: it can be configured and extended to work on any type of data
Need to collect, catalog, and store archive products
Instead of song metadata like album and track number, extract ArchiveProduct metadata such as file size, checksum, mission
To extend for use with Archive Products: made use of the metadata extractor code shown earlier, altered a few xml files
Each file ingested by File Manager would be pushed to Solr for storage
While File Manager could technically handle storage, this allows for storage and querying in a more user-friendly way
Possibility of Google-like search, many add-ons
Example of query, will be demoed
Can display certain fields, can search by field
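As a rough idea of what such a query looks like outside the demo, the sketch below builds a URL for Solr's standard `select` handler, using `q` to search by field and `fl` to choose which fields are displayed. The host, the core name `archive`, and the field names `mission`, `file_name`, and `file_size` are placeholders, not the project's actual configuration.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, core, field_name, value, display_fields):
    """Build a Solr select URL that searches one field and returns chosen fields."""
    params = {
        "q": f'{field_name}:"{value}"',   # search by field
        "fl": ",".join(display_fields),   # display only these fields
        "wt": "json",                     # ask for a JSON response
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_solr_query(
    "http://localhost:8983/solr",
    "archive",
    "mission",
    "Cassini",
    ["file_name", "file_size"],
)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client against a running Solr instance.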
This is the test archive I have been testing the project on
Consists of two missions: Cassini and MSL
Within each mission, there are sample files:
Includes img files and files that would be classified as Old Volumes Data, Staged Data, Safed Data, or Extra
Documented the whole process of creating this software on our wiki page
All modifications to the xml files mentioned are listed here
Example of the usefulness of this wiki: the instructions provided make it easier to extract other metadata in the future
Open source: learned how to read documentation, implement software, extend and configure it for specific needs
Learned about some of the problems with open source: documentation is sparse and sometimes out of date
Software creation process: designing, implementing, documenting, writing reports, presenting
Learned a lot in my classes, but hadn’t been able to apply it until this internship