Content profiling and C3PO

This presentation was given as part of a SCAPE Training event on ‘Effective Evidence-Based Preservation Planning’ in Aarhus, Denmark, 13-14 November 2013.

Artur Kulmukhametov, Vienna University of Technology, introduced the importance of content profiling and showed how it can be done with C3PO, a tool developed in the SCAPE project. Content profiling is based on characteristics extracted from the files’ metadata and helps the user plan digital preservation. C3PO can be integrated with both PLATO and Scout.

Published in: Technology, News & Politics

Transcript

  1. Content Profiling and C3PO. Artur Kulmukhametov, Vienna University of Technology. SCAPE PW Training Event, Aarhus, 13-14 November 2013
  2. Agenda • Motivation: collection scale and heterogeneity • An approach to gaining control • Characterisation tools • C3PO, a tool for content profiling. This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  3. What is it? [Figure; source: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012]
  4. Large Synoptic Survey Telescope: 30 terabytes of data nightly [Figure; source: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012]
  5. Variety of Data • Personal • Cultural Heritage • Scientific Data • Government Documents • … a huge variety of formats and information
  6. [Figure; source: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012]
  7. Conclusions? … that’s a lot of data … Do you know what that data is? Do you want to do something with it?
  8. Place for Characterization [Figure; source: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012]
  9. Characterization [Figure; source: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012]
  10. Characterization [Figure; source: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012]
  11. Characterization: one size does not fit all! [Figure; source: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012]
  12. Scalability [Figure; source: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012]
  13. Tools for Characterization: fido, Exif, jpylyzer, ffident, Exiftool, Droid
  14. A few Problems… • A lot of tools to manage and invoke • Different output schemas • Different configurations/environments • No or poor higher-level management • Difficult to spot differences
  15. File Information Tool Set • FITS is software designed to identify, validate, and extract technical metadata for various file formats • Developed by Harvard University Library in 2009 • v0.6.2, LGPL • Wraps other tools • New version every 6-12 months
  16. File Information Tool Set. Main features: • Consolidates output • Can include raw output • Configurable/Extendable. FITS includes: • Droid • Metadata Extractor • Jhove • Exiftool • FFident • File Utility. http://code.google.com/p/fits/
  17. FITS Output
      <fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd" version="0.6.0" timestamp="12/27/11 10:49 AM">
        <identification>
          <identity format="Portable Document Format" mimetype="application/pdf" toolname="FITS" toolversion="0.6.0">
            <tool toolname="Jhove" toolversion="1.5" />
            <tool toolname="file utility" toolversion="5.03" />
            <tool toolname="Exiftool" toolversion="7.74" />
            <tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
            <tool toolname="ffident" toolversion="0.2" />
            <version toolname="Jhove" toolversion="1.5">1.4</version>
            <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/18</externalIdentifier>
          </identity>
        </identification>
        <fileinfo>
          <size toolname="Jhove" toolversion="1.5">39586</size>
          <creatingApplicationName toolname="NLNZ Metadata Extractor" toolversion="3.4GA" status="SINGLE_RESULT">/XPP</creatingApplicationName>
          <lastmodified toolname="Exiftool" toolversion="7.74" status="SINGLE_RESULT">2011:12:27 10:44:28+01:00</lastmodified>
          <created toolname="Exiftool" toolversion="7.74" status="SINGLE_RESULT">2002:04:25 13:02:24Z</created>
          <filepath toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">/home/petrov/taverna/tmp/000/000009.pdf</filepath>
  18. FITS Output Conflict
      <?xml version="1.0" encoding="UTF-8"?>
      <fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd" version="0.6.1" timestamp="7/21/12 3:51 PM">
        <identification status="CONFLICT">
          <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
            <tool toolname="Jhove" toolversion="1.5" />
          </identity>
          <identity format="Rich Text Format" mimetype="application/rtf, text/rtf" toolname="FITS" toolversion="0.6.1">
            <tool toolname="Droid" toolversion="3.0" />
            <version toolname="Droid" toolversion="3.0" status="CONFLICT">1.5</version>
            <version toolname="Droid" toolversion="3.0" status="CONFLICT">1.6</version>
            <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/50</externalIdentifier>
            <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/51</externalIdentifier>
          </identity>
          <identity format="Rich Text Format" mimetype="text/rtf" toolname="FITS" toolversion="0.6.1">
            <tool toolname="ffident" toolversion="0.2" />
          </identity>
        </identification>
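A conflict like the one on this slide could be detected programmatically. The following is a minimal Python sketch, assuming the FITS output structure shown in the transcript (an `identification` element with a `status` attribute and one `identity` child per candidate format). The function name `summarize_identification` and the trimmed `SAMPLE` document are illustrative and are not part of FITS or C3PO.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the FITS output shown on the slide.
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Trimmed copy of the slide's conflict example (illustrative only).
SAMPLE = """
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="0.6.1">
  <identification status="CONFLICT">
    <identity format="Plain text" mimetype="text/plain">
      <tool toolname="Jhove" toolversion="1.5" />
    </identity>
    <identity format="Rich Text Format" mimetype="application/rtf, text/rtf">
      <tool toolname="Droid" toolversion="3.0" />
    </identity>
    <identity format="Rich Text Format" mimetype="text/rtf">
      <tool toolname="ffident" toolversion="0.2" />
    </identity>
  </identification>
</fits>
"""


def summarize_identification(xml_text):
    """Return whether identification is in conflict and which tools voted for which format."""
    root = ET.fromstring(xml_text.strip())
    ident = root.find("fits:identification", FITS_NS)
    if ident is None:
        return {"conflict": False, "candidates": []}
    candidates = []
    for identity in ident.findall("fits:identity", FITS_NS):
        tools = [t.get("toolname") for t in identity.findall("fits:tool", FITS_NS)]
        candidates.append((identity.get("format"), tools))
    return {"conflict": ident.get("status") == "CONFLICT", "candidates": candidates}


result = summarize_identification(SAMPLE)
for fmt, tools in result["candidates"]:
    print(fmt, "<-", ", ".join(tools))
```

Listing the tools behind each candidate format makes it easier to see which tool disagrees with the others, which is the first step in deciding how to resolve the conflict.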
  19. Conflicts. 3 types of conflicts: 1. Inconsistent property naming, e.g. image_width and imagewidth 2. Competing characterisation results, e.g. tool1 identifies a file as plain text, but tool2 identifies the file as PDF 3. Close but not identical property values, e.g. application/xhtml+xml vs. application/xml.
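The first two conflict types can be illustrated with a short Python sketch. It assumes tool results arrive as `(tool, property, value)` tuples, which is a simplification made for this example. The helper names `normalize_property` and `find_value_conflicts` are hypothetical and do not reflect how FITS or C3PO resolve conflicts. Type 3 conflicts (close but different values) are not handled here, since they need domain knowledge about which values count as equivalent.

```python
import re


def normalize_property(name):
    """Collapse case, spaces, underscores and hyphens so that
    'image_width' and 'imagewidth' map to the same key (conflict type 1)."""
    return re.sub(r"[\s_\-]+", "", name).lower()


def find_value_conflicts(results):
    """Group results by normalized property and keep only properties
    for which the tools reported more than one distinct value (conflict type 2).

    results: iterable of (tool, property, value) tuples.
    Returns {property: {value: [tools reporting that value]}}.
    """
    by_prop = {}
    for tool, prop, value in results:
        by_prop.setdefault(normalize_property(prop), {}).setdefault(value, []).append(tool)
    return {prop: values for prop, values in by_prop.items() if len(values) > 1}


# Illustrative input modelled on the slide's examples.
results = [
    ("tool1", "image_width", "800"),
    ("tool2", "imagewidth", "800"),
    ("tool1", "format", "Plain text"),
    ("tool2", "format", "PDF"),
]
conflicts = find_value_conflicts(results)
print(conflicts)
```

After normalization the two width properties agree, so only the `format` property is reported as a conflict.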
  20. Yet Another? Advantages • All-in-one • Unified output schema • Broad type coverage. Disadvantages • Consolidation is hard • Low performance: runs all the tools on every file • Conflicts
  21. Content Profiling • Global view of content • Distribution of characteristics • Statistics (size, min, max, …) • Sampling [Figure; source: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012]
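As a rough illustration of what a content profile contains, the Python sketch below aggregates a format distribution and basic size statistics from per-file records. The record layout (a dict with `format` and `size` keys), the function name `build_profile`, and the sample values are assumptions made for this example. This is not C3PO's data model or its map/reduce implementation.

```python
from collections import Counter
from statistics import mean


def build_profile(records):
    """Aggregate a global view of a collection from per-file characteristics."""
    if not records:
        raise ValueError("cannot profile an empty collection")
    sizes = [r["size"] for r in records]
    return {
        "count": len(records),
        # Distribution of one characteristic (here: the identified format).
        "formats": Counter(r["format"] for r in records),
        # Basic statistics over file sizes in bytes.
        "size": {"min": min(sizes), "max": max(sizes), "avg": mean(sizes), "sum": sum(sizes)},
    }


# Made-up records for illustration; a real run would read characterisation output.
records = [
    {"format": "Portable Document Format", "size": 39586},
    {"format": "Portable Document Format", "size": 1000},
    {"format": "Plain text", "size": 200},
]
profile = build_profile(records)
print(profile["formats"].most_common())
print(profile["size"])
```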
  22. Representative Sampling • Based upon metadata • Outlier identification • As few as possible, as many as necessary • Stratification across file type, size, time or any other characteristic relevant to the use case [Figure; source: E. Poltorak, Representative sampling, Flickr, http://www.flickr.com/photos/44461316@N08/4110321514/, 2009]
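Stratification can be sketched in a few lines of Python: group the objects by a chosen characteristic, then draw a small random sample from each group so that rare strata are still represented. The function `stratified_sample` and the record layout are hypothetical. This is a generic stratified sampler, not one of the samplers shipped with C3PO.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum=1, seed=0):
    """Draw up to `per_stratum` records from each stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for name in sorted(strata):  # deterministic stratum order
        members = strata[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Made-up collection: many PDFs, one RTF and one plain-text file.
records = [
    {"id": 1, "format": "pdf"},
    {"id": 2, "format": "pdf"},
    {"id": 3, "format": "pdf"},
    {"id": 4, "format": "rtf"},
    {"id": 5, "format": "txt"},
]
sample = stratified_sample(records, key=lambda r: r["format"])
print([r["format"] for r in sample])
```

A plain random sample of three objects from this collection could easily miss the single RTF and text files. Sampling per stratum guarantees that each format appears at least once.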
  23. Clever, Crafty Content Profiling of Objects. C3PO is a tool for content profile generation. • Uses characterization results • Deeper content analysis with nice visuals through the web-app • Generates content profiles (map/reduce). http://github.com/openplanets/c3po [Figure caption: “Sometimes, I don’t understand human behavior?!”; source: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012]
  24. Clever, Crafty Content Profiling of Objects • CLI-app: parses and processes FITS and Apache Tika files; stores data in MongoDB; output: XML profile + CSV; supports new adaptors • Web-app: overview and browsing; filtering; representative sample set generation; REST API (Scout)
  25. C3PO: Representative Samples: Size'o'Matic 3000, DistSampler, SysSampler [Figures; sources: Statistical Consultants Ltd, http://www.statisticalconsultants.co.nz/weeklyfeatures/WF7.html, 2013; D. Lane, Online Statistics Education, http://onlinestatbook.com/2/sampling_distributions/samp_dist_mean.html, 2013]
  26. C3PO: Performance • CPU: 2.3 GHz 2-core, RAM: 4 GB, HDD • CLI + Web-app • Govdocs1: 945699 FITS files; ingest 1h 48m; profile 12 minutes; 112 different object properties • Internet Memory Web Archive Data: 958638 FITS files; ingest 2h 58m; profile 13.5 minutes; 105 different object properties
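The figures on this slide can be turned into throughput rates with simple arithmetic. The Python sketch below does this using only the file counts and durations quoted above. The helper `files_per_second` and the dictionary layout are illustrative.

```python
def files_per_second(files, hours=0, minutes=0):
    """Average number of files handled per second over the given duration."""
    seconds = hours * 3600 + minutes * 60
    return files / seconds


# File counts and (hours, minutes) durations as quoted on the slide.
runs = {
    "Govdocs1": {"files": 945699, "ingest": (1, 48), "profile": (0, 12)},
    "Internet Memory": {"files": 958638, "ingest": (2, 58), "profile": (0, 13.5)},
}

rates = {
    name: {stage: files_per_second(run["files"], *run[stage]) for stage in ("ingest", "profile")}
    for name, run in runs.items()
}

for name, stages in rates.items():
    print(f"{name}: ingest {stages['ingest']:.1f} files/s, profile {stages['profile']:.1f} files/s")
```

On these numbers, ingest runs at roughly 90 to 146 files per second while profile generation handles well over 1000 files per second, so ingest is the slower stage on both datasets.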
  27. C3PO: Performance • CPU: 2.3 GHz 2-core, RAM: 4 GB, HDD • CLI + noDB adaptor (not publicly available yet) • SB (Denmark) dataset, 12 TB of data: 563M FITS files; no ingest; profile 49 hours; 5314 different object properties
  28. C3PO: Roadmap • Conflict reduction: conflicts of type 2 are solved • Use the PW ontology for alignment with other tools: consistent naming of properties, values, measures; the ontology will solve conflicts of type 1 • Data Connector API: a common interface to interact with repositories
  29. Summary • Characterization is time-consuming • It can be faulty • Know your tools • A tool for content profiling? C3PO!
