E-ARK: Open Data Mining for Government Archives


Copyright Ross King at CeDEM14

Published in: Government & Nonprofit

  1. E-ARK: Open Data Mining for Government Archives
     CeDEM14, Sensor and Government Data: their Role in Public Policy, 2014-05-22
     Dr. Ross King, Senior Scientist
     Thematic Coordinator: Next Generation Content Management Systems
     Safety & Security Department, AIT Austrian Institute of Technology
  2. NGCMS Research
  3. Digital Preservation: Challenges
     Ensuring preservation of and access to digital information over very long periods (centuries)
       • Preservation: maintaining bit integrity and replication
       • Access: long-term management and access for multimedia data resources
     Managing ever-increasing volumes of digital information
       • Very large number of files, size of files, and high ingest rates
       • Example: DK National broadcast archive, ca. 1 petabyte, growth rate ca. 100 terabytes/year
     Dealing with complexity and heterogeneity of data over long periods of time
       • Identification, normalization, and validation of content
       • Example: Web archive information extraction and organization of very large data sets
  4. Digital Preservation: Solutions
     Web Archiving
       • Monitoring with Monitrix
       • Quality Assurance with W3ACT (open source, developed by AIT)
     Database Preservation
     Continuous archiving versus application retirement
       • CHRONOS (proprietary, developed by CSP GmbH & Co. KG)
       • SIARD (free software, developed by the Swiss Federal Archives)
     Bit Preservation
       • LOCKSS: Lots of Copies Keep Stuff Safe (open source, developed by the Stanford University Library)
  5. Big Data: Challenges
     Data Management
       • Managing petabytes of information in scalable storage
     Quality Assurance
       • Automated ingest, normalization
     Data Mining and Analytics
       • Distributed processing
       • Visualization of large data sets
       • Navigating processed results
     Information Retrieval and Open Data
       • Multimedia indexing
       • Sharing data with the world
     (from digitalbevaring.dk)
  6. Big Data: Solutions
     AIT SCALUP
     Bring computation to your data, integrate with your repository
     Coordinate workflows with complete provenance information
     Scalable distributed processing for multimedia
     Indexing
       • Enhance access to digital collections
     Clustering
       • Discover patterns in large data sets
     Analysis
       • Natural language processing
       • Feature extraction
         – Shape detection
         – Facial recognition
     (from digitalbevaring.dk)
  7. E-ARK Project
  8. E-ARK: Facts and Figures
     EU CIP PCP ICT Programme
     Objective 2.5: eArchiving services
     Pilot B
       • The pilot should share information on integration, operation and interoperability issues throughout the EU in order to facilitate the creation and maintenance of a European archiving infrastructure for government and public services, thus promoting the re-use of archival data.
     36 months: February 2014 – January 2017
     6 M€ budget, 3 M€ funded
     16 partners
  9. E-ARK Participants
     5 archives (Estonia, Slovenia, Norway, Denmark, Hungary; plus input from Sweden)
     4 research institutions (University of Portsmouth, Austrian Institute of Technology, University of Köln, Instituto Tecnico Lisbon)
     3 SMEs (Magenta, ES Solutions, KEEP Solutions)
     2 government departments (AMA, MINHAP)
     2 umbrella organisations (DLM Forum, DPC)
     3 External Advisory Boards (Archival, Data Provider, Commercial / Technical)
     7 archive pilots: Estonia (+Estonian Business Archives), Slovenia, Hungary, Portugal, Norway, Denmark
  10. E-ARK Goals
     Provide better indexing and access
     Integrate OAIS with Big Data
     Scalable storage and computation
     Provide Big Data / Open Data technology for national archives
  11. WP6: Archival Storage, Services, and Integration
     [Architecture diagram. Labels: Data Management Application; ESSArch Preservation Platform; Scalable Computation; Staging Area; Lily, Hadoop, HBase, HDFS; Data Connector API; CRUD API; Query API; Archive Storage (WORM); AIP Storage; EARK-AIP; Data Management Integration; Re-use and Data Mining; Query and Indexing; Data Mining API; Data Mining Showcase]
  12. Production and Research
     [Diagram. Labels: nodes 1, 2, ..., n; HDFS; Hadoop; Pig; Lily; EPP; SDB; RODA; Digital Objects]
  13. Pilot Implementations
     [Diagram. Labels: nodes 1, 2, ..., n; HDFS; Hadoop; Pig; Lily; EPP; SDB; RODA; Digital Objects; Data Connector API; Query API; Data Mining API]
  14. Reference Implementation
     [Diagram. Labels: nodes 1, 2, ..., n; HDFS; Hadoop; Pig; Lily; EPP; Digital Objects; Data Connector API; Query API; Data Mining API]
  15. E-ARK Challenges
     Heterogeneous Data Sets
       • Documents
       • Relational Databases
       • Geographical Data
     Data Mining Use Cases
     Policy
       • Open versus restricted data sets
       • Licensing and Revenue
  16. Contact Information
     http://www.ait.ac.at/
     http://eark-project.eu/
     Dr. Ross King
     AIT Austrian Institute of Technology GmbH
     Donau-City-Strasse 1, A-1220 Wien
     ross.king@ait.ac.at
  17. Thank you for your attention! Questions?
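
Editorial sketch for slide 3 ("maintaining bit integrity and replication"): the idea can be illustrated with a small fixity check that compares each replica of an archived file against a recorded checksum and restores damaged copies from an intact one. This is only a minimal illustration and not the method of LOCKSS or of any tool named in the slides. The function names (sha256_of, audit_replicas, repair) and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_replicas(replicas: list[Path], expected: str) -> dict[Path, bool]:
    """Map each replica to True if it exists and matches the recorded digest."""
    return {
        path: path.is_file() and sha256_of(path) == expected
        for path in replicas
    }


def repair(replicas: list[Path], expected: str) -> list[Path]:
    """Overwrite damaged or missing replicas from the first intact copy.

    Hypothetical helper: returns the list of replicas that were rewritten.
    """
    status = audit_replicas(replicas, expected)
    intact = [path for path, ok in status.items() if ok]
    if not intact:
        raise RuntimeError("no intact replica left to repair from")
    damaged = [path for path, ok in status.items() if not ok]
    for path in damaged:
        path.write_bytes(intact[0].read_bytes())
    return damaged
```

In this sketch the recorded digest is the source of truth, so the check works even when most replicas are damaged, as long as one intact copy survives. A real archive would also have to protect the digest itself and stream large files rather than copy them in memory.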