A system for scalable visualization of geographic archival records


A poster presented at the LDAV 2011 IEEE Symposium on Large-Scale Data Analysis and Visualization in Providence, RI, on Oct. 24, 2011. Authors: Jefferson Heard, RENCI, and Richard Marciano, UNC/SALT lab. Research funded by the NARA Information Services / Applied Research division (continuing grant NSF/OCI-0848296).


A system for scalable visualization of geographic archival records

Jefferson R. Heard and Richard J. Marciano
Renaissance Computing Institute (RENCI) and Sustainable Archives & Leveraging Technologies lab (SALT)
University of North Carolina at Chapel Hill

ABSTRACT

We present a system that visualizes large collections of archival geographic records. The system consists of a data grid containing a 60TB test collection drawn from the US National Archives and three web applications: an indexer and two web- and mobile-device-based visualizations that focus on collection understanding in a geographic context.

KEYWORDS: Visualization, archival records, large collections.

INDEX TERMS: Big data, data-intensive research, preservation

1 INTRODUCTION

The visualization of large collections of documents has received a significant amount of attention over the last few years. The problem of indexing and visually browsing archival records includes that problem but is more complex. Archival metadata includes file attributes, location, provenance, etc. Archival records are therefore complex semi-structured data, and scaling to millions or billions of records is not trivial.

An important special case of archiving is that of large archives of geographic records. These are common in the governmental collections we have studied in the CI-BER project, CyberInfrastructure for Billions of Electronic Records [1]. Each geographic record may contain large amounts of internal metadata that is not readily indexed by common methods. In this paper, we present a system for indexing and web-based visualization of this kind of archive in a scalable fashion using RENCI's Geoanalytics cyber-infrastructure [2].

2 PROBLEM DESCRIPTION

The CI-BER project is about scaling archiving systems to handle archives of billions of electronic records. We have built a test collection, the CI-BER Testbed [1], that currently contains over 60 terabytes of archival records from the US Government's National Archives and Records Administration (NARA). These records cover hundreds of different agencies and currently comprise roughly 60M archival records. Large chunks of geographic data are spread throughout these archives.

Geographic data falls into roughly two categories: vector and raster. For these there are several dozen file formats. Some are no longer readable, but many can be opened using open source tools like GDAL [3]. In addition to the different formats, thousands of geographic projections are used by different datasets.

Our problem is to interactively visualize the metadata from these records and to get a clear picture of what physical areas these collections cover, allowing a user to "drill down" to the actual file if desired.

3 APPROACH

This section discusses our approaches to archiving, indexing, and visualization.

3.1 Archiving

Archiving is done through the iRODS system [4]. It provides the ability to write rules that are run on files and directories as they are entered into the system, and it allows for extensible metadata collocated with a file or directory itself. iRODS additionally forms a Data Grid [5] that can be federated and expanded as requirements grow or policies require.

All of our record groups were copied into a central iRODS repository, built on top of a DataDirect Networks DDN9900 storage rack and managed by a metadata catalog, iCAT. We considered using a federated grid, but determined that for performance in visualization and indexing it was best to collocate compute and data resources.

3.2 Indexing

[Figure 1. Geoanalytics architecture.]

Indexing happens through RENCI's Geoanalytics [2] cyber-infrastructure, chosen because it provides facilities for managing large amounts of geographic data. Its architecture is briefly described in Figure 1. We take advantage of its distributed task queue, Celery [6], and its document-oriented data store, MongoDB [7], to handle our indexing process. The indexing process is started through a web application.

Our indexer has thus far indexed the largest of the geographic data collections, around 12TB of data. The indexer is incremental in nature and can be run on new collections as they are incorporated into the archive. Incremental indexing does not affect the availability of visualizations of the already indexed data.
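
The incremental behaviour described above can be illustrated with a small sketch. This is not the Geoanalytics implementation: the `IncrementalIndex` class, the extension lists, and the example paths are all hypothetical, and a plain dictionary stands in for the document store. It only shows the idea that a repeated run queues files that have not been indexed before and leaves existing entries untouched, so data that is already indexed stays available.

```python
from dataclasses import dataclass, field

# Hypothetical extension lists; the poster does not enumerate the formats handled.
GIS_EXTENSIONS = {".shp", ".tif", ".tiff", ".e00", ".dem"}
ARCHIVE_EXTENSIONS = {".tar", ".tgz", ".zip"}


def extension(path: str) -> str:
    """Return the lower-cased final extension of a path, or '' if there is none."""
    name = path.rsplit("/", 1)[-1]
    dot = name.rfind(".")
    return name[dot:].lower() if dot > 0 else ""


@dataclass
class IncrementalIndex:
    """In-memory stand-in for the document store that holds index entries."""
    entries: dict = field(default_factory=dict)

    def pending(self, paths):
        """Keep candidate GIS files and archives that are not yet indexed."""
        wanted = GIS_EXTENSIONS | ARCHIVE_EXTENSIONS
        return [p for p in paths
                if extension(p) in wanted and p not in self.entries]

    def add(self, path, metadata):
        """Record one indexed file; existing entries are never rewritten."""
        self.entries.setdefault(path, metadata)


index = IncrementalIndex()
first_run = ["/coll/a/roads.shp", "/coll/a/readme.txt", "/coll/a/scans.zip"]
for p in index.pending(first_run):
    index.add(p, {"collection": "a"})

# A later run over old and new paths only queues the new candidate.
second_run = first_run + ["/coll/b/elevation.tif"]
queue = index.pending(second_run)
print(queue)
```

In a real deployment the membership test would be a lookup against the data store rather than a dictionary, and the queued paths would be handed to the task queue instead of being printed.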
The indexing architecture in Figure 2 scales to multiple machines and CPUs. Our current indexer uses five four-CPU machines, each with a single 1GBit network interface to the grid, to index data. The indexing process is as follows:

  1. A request is made to index a collection stored in iRODS.
  2. The indexer identifies a set of nodes in the Geoanalytics cluster to perform the indexing, and has them start a new iRODS session.
  3. The indexer asks one node to perform the "crawl" task, which recursively iterates over the collection.
  4. The "crawl" task marks potential GIS files and archives containing them (tarballs, zipfiles), and queues them with Celery to be indexed.
  5. All other nodes pull items off the indexing queue and perform the following:
     a. Retrieve the resource with iget.
     b. Optionally unarchive the resource.
     c. Identify GIS files.
     d. Identify a program that opens the file, transform it to lat/lon, and index.

[Figure 2. Indexer architecture.]

3.3 Visualization

For archival purposes, understanding the context of a document is critical. Collection understanding [7] is the task of developing tools that help the user comprehend the collection as a whole and contextualize documents' place in that whole. We chose to focus on the collection understanding task because of the size of the collections we were given and because we wanted to build tools that would be broadly applicable to other collections.

To create tools that can be used by a wide audience, we chose to create web-based visualizations that can be reformatted to appear on mobile devices, such as the iPad and iPhone 4. We have created two visualizations which represent "bottom up" and "top down" views of a geographic collection.

The "bottom up" visualization shown in Figure 3 allows a user to start with a collection, shown as a tree-map similar to the visualization in [9]. The user is presented with a tree-map containing gray, red, yellow, and blue boxes. Each box corresponds to a directory in the collection, which may contain a number of subdirectories. Each box is scaled to the number of files it contains. The colors correspond to entries containing vector records only (red), raster only (blue), both raster and vector (yellow), and no geographic files (gray). Next to the tree-map is a physical map, provided by OpenLayers [10]. Tapping (on a touch-enabled mobile device) or clicking on a box in the tree-map shows the bounding box of all the files in that box and shows a listing of all the geographic metadata for its directory, or in the case of a single record, the metadata for that record.

[Figure 3. The "bottom-up" visualization.]

The "top down" visualization begins with an OpenLayers physical map, and allows the user to navigate, pan, and zoom, then draw a bounding box. Once the box is drawn, the collections it covers are listed on the left. The user can then tap on a collection and see the subdirectories in that collection, and can continue to "drill down" until reaching an actual metadata record. If the user taps on a metadata record, the detailed accounting of the metadata for that record replaces the map.

[Figure 4. The "top-down" visualization on the iPad.]

4 CONCLUSION

We have presented a system that indexes and visualizes large archival record sets containing geographic data. We have an indexer that can scale to use multiple CPUs on a cluster of machines and two web-based interactive visualizations that show this index in a geographic context. Our future work will include unifying these visual interfaces and providing statistics on the scalability of the indexer relative to data grid size.

This project is funded by NSF/OCI grant 0848296 as part of a cooperative research agreement between NARA's Applied Research division, the National Science Foundation (NSF), and the University of North Carolina at Chapel Hill. The Project Director is Richard Marciano, with visualization expert Jeff Heard. Project collaborators include Stan Ahalt, Leesa Brieger, Chien-Yi Hou, Arcot Rajasekar, Sarah Lippincott, Brendan O'Connell, and Sheau-Yen Chen.
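
As an illustration of the two views in Section 3.3, the following sketch shows how the tree-map color legend and the bounding-box operations could be expressed. It is a minimal sketch under stated assumptions, not the authors' code: the function names, the `(min_lon, min_lat, max_lon, max_lat)` tuple layout, and the sample coordinates are hypothetical, and boxes that cross the antimeridian are not handled.

```python
from typing import Iterable, Optional, Tuple

# Assumed layout: (min_lon, min_lat, max_lon, max_lat) in degrees.
BBox = Tuple[float, float, float, float]


def box_color(n_vector: int, n_raster: int) -> str:
    """Color for a tree-map box, following the legend given in the text."""
    if n_vector and n_raster:
        return "yellow"
    if n_vector:
        return "red"
    if n_raster:
        return "blue"
    return "gray"


def union_bbox(boxes: Iterable[BBox]) -> Optional[BBox]:
    """Smallest box enclosing every file box in a directory (None if empty)."""
    boxes = list(boxes)
    if not boxes:
        return None
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def intersects(a: BBox, b: BBox) -> bool:
    """True if two lon/lat boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Bottom-up view: the extent shown when a tree-map box is tapped.
directory_files = [(-79.1, 35.8, -78.9, 36.0), (-80.0, 35.0, -79.5, 35.5)]
extent = union_bbox(directory_files)

# Top-down view: does a box drawn on the map cover this directory?
drawn = (-79.0, 35.9, -78.0, 36.5)
print(box_color(2, 0), extent, intersects(extent, drawn))
```

With an index held in a data store, the same overlap test would normally be pushed down into a spatial query rather than evaluated in application code.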
REFERENCES

[1] CI-BER: CyberInfrastructure for Billions of Electronic Records. http://ci-ber.blogspot.com/
[2] J. Heard. The Geoanalytics system. http://www.renci.org/. Technical Note, Renaissance Computing Institute. 2011.
[3] The Open Source Geospatial Foundation. GDAL/OGR. http://gdal.org. 2011.
[4] Introduction to iRODS. https://www.irods.org/index.php/Introduction_to_iRODS.
[5] R. Moore, C. Baru, R. Marciano, A. Rajasekar, M. Wan. "Data Intensive Computing", Chapter 5 in "The Grid: Blueprint for a New Computing", edited by I. Foster and C. Kesselman. Morgan Kaufmann, San Francisco, 1999.
[6] The Celery Group. Celery. http://celeryproject.org. 2011.
[7] The MongoDB Group. MongoDB. http://www.mongodb.org. 2011.
Chang, M., Leggett, J.J., Furuta, R., Kerne, A., Williams, J.P., Burns, S.A., and Blas, R.G. Collection Understanding [visualization tools in information retrieval]. Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, pages 334-342. IEEE Press, June 2004.
[8] B. Shneiderman. Tree visualization with tree maps: 2-d space-filling approach. ACM Transactions on Graphics (TOG), Volume 11, Number 1, pages 91-99. 1992.
[9] Jiu, W., Esteva, M., and Dott, S.J. Visualization for Archival Appraisal of Large Digital Collections. Proceedings of the IS&T Archiving Conference 2010 (The Hague), pages 157-162. 2010.
[10] The Open Source Geospatial Foundation. OpenLayers. http://openlayers.org. 2011.