PLAT-13 Metadata Extraction and Transformation


In this session, we will look first at the rich metadata that documents in your repository have, how to control the mapping of this on to your content model, and some of the interesting things this can deliver. We’ll then move on to the content transformation and rendition services, and see how you can easily and powerfully generate a wide range of media from the content you already have. Finally, we’ll look at how to extend these services to support additional formats.

Published in: Technology, Education

PLAT-13 Metadata Extraction and Transformation

  1. 1. Metadata Extraction, Content Transformations and Renditions
      Nick Burch • Senior Engineer, Alfresco • twitter: @Gagravarr
  2. 2. Introduction: 3 Content Related Services
      • Metadata Extractor
      • Content Transformer
      • Renditions
      Covering:
      • Service Uses
      • Interfaces
      • Calling the Service
      • Java & JS APIs
      • Demos
      • Configuration
      • Extending
      • Apache Tika
  3. 3. Why Now? Aren't these old Services?
      • The Metadata Extractor and Content Transformer are core repository services
      • They've been around since the early days
      • For a long time, not a lot changed with them: "They're boring and just work...."
      • In Alfresco 3.4 we added support for delegating some of the work to Apache Tika
      • This has led to a large improvement in the number of file formats that are supported!
      • Renditions arrived in Alfresco 3.3
  4. 4. What did Alfresco 3.3 Support?
      • PDF
      • Word, PowerPoint, Excel
      • HTML
      • Open Document Formats (OpenOffice)
      • RFC822 Email
      • Outlook .msg Email
      • And that's it...
  5. 5. Supported Formats in Alfresco 4.0
      • Audio – WAV, RIFF, MIDI
      • DWG and PRT (CAD Formats)
      • Epub
      • RSS and ATOM Feeds
      • True Type Fonts
      • HTML
      • Images – JPEG, GIF, PNG, TIFF, Bitmap
          • Includes EXIF Metadata where present
  6. 6. Alfresco 4.0 Formats – Continued
      • iWork (Keynote, Pages, Numbers)
      • RFC822 MBox Mail
      • Microsoft Outlook .msg Email
      • Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
      • Microsoft Office (OOXML, 2007+) – Word, PowerPoint, Excel
      • Open Document Format (OpenOffice)
      • Old-style OpenOffice (.sxw etc)
  7. 7. Alfresco 4.0 Formats – Still Continued
      • MP3 (ID3 v1 and v2)
      • Ogg Vorbis and FLAC
      • CDF, HDF (Scientific Data)
      • RDF
      • RTF
      • PDF
      • Adobe Illustrator (PDF based)
      • Adobe PSD (expected shortly)
      • Plain Text
  8. 8. Alfresco 4.0 Formats – Final Set!
      • Zip, Tar, Compress etc (Archive Formats)
      • FLV Video
      • XML
      • Java Class Files
      • CHM (Windows Help Files)
      • Configurable External Programs
      • And probably some others too!
  9. 9. Services Overview
  10. 10. The Metadata Extractor Service
      What, How, Why?
      • For a given piece of content, returns the Metadata held within that Document
      • Metadata is converted into the content model
      • Typically used with uploaded binary files
          • Upload a PDF, extract out the Title and Description, save these as the properties on the Alfresco Node
      • Powered internally by a number of different extractors
      • Service picks the appropriate extractor for you
      • Since Alfresco 3.4, makes heavy use of Apache Tika
  11. 11. The Content Transformer Service
      What, How, Why?
      • Transforms content from one format to another
      • Driven by source and destination mime types
      • Used to generate plain text versions for indexing
      • Used to generate SWF versions for preview
      • Used to generate PDF versions for web download
      • Powered by a large number of different transformers internally
      • Transformers can be chained together, eg .doc → .pdf via OpenOffice, then .pdf → .swf via pdf2swf
      • Since Alfresco 3.4, makes heavy use of Apache Tika
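Chaining can be pictured as a lookup keyed on source and target mimetype that falls back to a two-step route through an intermediate type. The sketch below is plain Java with no Alfresco dependencies: `ChainSketch`, `register` and `transform` are hypothetical names, and the string-to-string functions merely stand in for real transformers working on content streams. It is an illustration of the idea, not how the Content Transformer Service is implemented.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration only: not an Alfresco class.
class ChainSketch {
    // source mimetype -> (target mimetype -> transformer)
    private final Map<String, Map<String, UnaryOperator<String>>> direct = new LinkedHashMap<>();

    void register(String source, String target, UnaryOperator<String> transformer) {
        direct.computeIfAbsent(source, k -> new LinkedHashMap<>()).put(target, transformer);
    }

    String transform(String content, String source, String target) {
        Map<String, UnaryOperator<String>> fromSource =
                direct.getOrDefault(source, Collections.emptyMap());

        // Prefer a direct transformer when one is registered
        UnaryOperator<String> single = fromSource.get(target);
        if (single != null) {
            return single.apply(content);
        }

        // Otherwise look for a two-step chain: source -> intermediate -> target
        for (Map.Entry<String, UnaryOperator<String>> first : fromSource.entrySet()) {
            UnaryOperator<String> second =
                    direct.getOrDefault(first.getKey(), Collections.emptyMap()).get(target);
            if (second != null) {
                return second.apply(first.getValue().apply(content));
            }
        }
        throw new IllegalArgumentException("No transformer from " + source + " to " + target);
    }
}
```

Registering a .doc → .pdf step and a .pdf → .swf step is enough for a .doc → .swf request to be served through the intermediate PDF, which mirrors the OpenOffice plus pdf2swf example on the slide.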
  12. 12. The Rendition Service (Alfresco 3.3+)
      What, How, Why?
      • Can turn content from one kind to another
      • Or can just alter some content in the same format
      • Used to manipulate images, eg crop and resize
      • Used to generate HTML previews from .docx in the Web Quick Start
      • Often uses the Content Transformation Service to do the actual heavy lifting
      • The Thumbnail Service has been re-written to use the Rendition Service (all thumbnail actions now delegate to the Rendition Service)
      • Renditions are all Actions
  13. 13. Apache Tika
      • Apache Tika – Apache Project which started in 2006
      • Grew out of the Lucene community, now widely used in both Search and Content settings
      • Provides detection of files – eg this binary blob is really a Word file
      • Plain text, HTML and XHTML versions of a wide range of different file formats
      • Consistent Metadata from different files
      • Tika hides the complexity of different formats and their libraries behind a simple, powerful API
      • Easy to use and extend
  14. 14. Any questions so far?
  15. 15. Metadata Extractor Service
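  16. 16. Metadata Extractor – Java Use

          import java.io.Serializable;
          import java.util.HashMap;
          import java.util.Map;
          import org.alfresco.model.ContentModel;
          import org.alfresco.repo.content.metadata.MetadataExtracter;
          import org.alfresco.repo.content.metadata.MetadataExtracterRegistry;
          import org.alfresco.service.cmr.repository.ContentReader;
          import org.alfresco.service.namespace.QName;

          // context, contentService and nodeRef are supplied by the caller
          MetadataExtracterRegistry registry = (MetadataExtracterRegistry)
              context.getBean("metadataExtracterRegistry");
          ContentReader reader =
              contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
          // The registry picks the extractor based on the mimetype
          MetadataExtracter extractor =
              registry.getExtracter(reader.getMimetype());
          Map<QName, Serializable> properties =
              new HashMap<QName, Serializable>();
          extractor.extract(reader, properties);
          System.err.println(properties);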
  17. 17. Metadata Extractor – JavaScript Use
      • Full access is not directly available in JS
          • You can't get at the extractor registry
          • You can't get at the raw properties
      • You can, however, easily trigger the extraction on a given node
      • This is done via the Script Actions service

          var action = actions.create("extract-metadata");
          action.execute(document);
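  18. 18. Calling Apache Tika

          import org.apache.tika.config.TikaConfig;
          import org.apache.tika.detect.DefaultDetector;
          import org.apache.tika.metadata.Metadata;
          import org.apache.tika.parser.AutoDetectParser;
          import org.apache.tika.parser.ParseContext;
          import org.apache.tika.parser.Parser;
          import org.apache.tika.sax.BodyContentHandler;
          import org.xml.sax.ContentHandler;

          // Get a content detector, and an auto-selecting Parser
          // In Alfresco we already know the type, so we don't need to Auto Detect!
          TikaConfig config = TikaConfig.getDefaultConfig();
          DefaultDetector detector = new DefaultDetector(
              config.getMimeRepository() );
          Parser parser = new AutoDetectParser(detector);

          // We'll only want the plain text contents
          ContentHandler handler = new BodyContentHandler();

          // Tell the parser what we have (filename is supplied by the caller)
          Metadata metadata = new Metadata();
          metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

          // Have it processed (input is the InputStream of the content)
          parser.parse(input, handler, metadata, new ParseContext());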
  19. 19. Metadata Extractor – Mappings
      • Mappings control how to turn the metadata an extractor produces into node properties
      • Maps from extractor names to your content model
      • Typically set in a properties file, one per extractor
      • Can also be done in Spring when defining the bean
      • An OverwritePolicy controls what happens when extracting for a 2nd (or subsequent) time
      • One output from a metadata extractor can map to multiple properties on your node
      • Not all outputs need to be mapped, some can be (and often are) ignored
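The mapping rules above (one output to many properties, unmapped outputs dropped, a policy for repeat extractions) can be sketched in a few lines of plain Java. Everything here is an assumption made for illustration: `MappingSketch` and its two-value `Overwrite` enum are hypothetical and much simpler than Alfresco's real OverwritePolicy, and property names are plain strings rather than QNames.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not an Alfresco class.
class MappingSketch {
    /** Simplified stand-in for an overwrite policy (assumption, not the real enum). */
    enum Overwrite { ALWAYS, ONLY_IF_MISSING }

    static Map<String, String> apply(Map<String, String> rawMetadata,
                                     Map<String, List<String>> mappings,
                                     Map<String, String> existingProperties,
                                     Overwrite policy) {
        // Start from what the node already has
        Map<String, String> result = new LinkedHashMap<>(existingProperties);
        for (Map.Entry<String, String> raw : rawMetadata.entrySet()) {
            List<String> targets = mappings.get(raw.getKey());
            if (targets == null) {
                continue; // unmapped extractor output: ignored
            }
            // One extractor output may feed several node properties
            for (String property : targets) {
                boolean alreadySet = result.containsKey(property);
                if (policy == Overwrite.ONLY_IF_MISSING && alreadySet) {
                    continue; // keep the value from the earlier extraction
                }
                result.put(property, raw.getValue());
            }
        }
        return result;
    }
}
```

With `ONLY_IF_MISSING`, a second extraction leaves a property that is already set untouched, while `ALWAYS` replaces it with the newly extracted value.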
  20. 20. Geo Content Model (cm:geographic)

          <aspect name="cm:geographic">
            <title>Geographic</title>
            <properties>
              <property name="cm:latitude">
                <title>Latitude</title>
                <type>d:double</type>
              </property>
              <property name="cm:longitude">
                <title>Longitude</title>
                <type>d:double</type>
              </property>
            </properties>
          </aspect>
  21. 21. Metadata Extractor – Geo Mapping

          # Namespaces

          # Geo Mappings
          # Note – escape : in metadata keys inside properties files!
          geo\:lat=cm:latitude
          geo\:long=cm:longitude

          # Normal Mappings
          author=cm:author
          title=cm:title
          description=cm:description
          created=cm:created
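The note about escaping `:` follows from how `java.util.Properties` parses a line: the key ends at the first unescaped `=`, `:` or whitespace. The small sketch below uses only the JDK; `ColonEscapeSketch` and `load` are hypothetical helper names, and it assumes the mapping file is read with the standard properties parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for trying out properties-file escaping.
class ColonEscapeSketch {
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            // Parse the text exactly as a .properties file would be parsed
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

Loading the line `geo:lat=cm:latitude` yields a key `geo` with the value `lat=cm:latitude`, so the mapping is silently wrong. Loading `geo\:lat=cm:latitude` yields the intended key `geo:lat` with the value `cm:latitude`. The colon in the value needs no escaping.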
  22. 22. Demo: Geo-Tagged Image Upload
  23. 23. Demo Tika + Geo-Tagged Images

          java -jar tika-app-1.0-SNAPSHOT.jar --metadata geotagged.jpg

          date: 2009-08-11T09:09:45
          exif:DateTimeOriginal: 2009-08-11T09:09:45
          exif:ExposureTime: 6.25E-4
          exif:FNumber: 5.6
          exif:Flash: false
          exif:FocalLength: 194.0
          exif:IsoSpeedRatings: 400
          geo:lat: 12.54321
          geo:long: -54.1234
          subject: canon-55-250
          tiff:BitsPerSample: 8
          tiff:ImageLength: 68
          tiff:ImageWidth: 100
          tiff:Make: Canon
          tiff:Model: Canon EOS 40D
          tiff:ResolutionUnit: Inch
          tiff:Software: Adobe Photoshop CS3 Macintosh
          tiff:XResolution: 240.0
          tiff:YResolution: 240.0
  24. 24. Demo: Audio File Upload
  25. 25. Ways to Customise and Extend
      Customise
      • Identify already available metadata of interest
      • Define a content model for this
      • Add mappings
      • tika-app.jar can be very helpful here
      Extend
      • Locate/Write library or program to read file format
      • Write either Tika Plugin, or whole Extractor
      • Define mappings
      has more
  26. 26. Content Transformer Service
  27. 27. Out-of-the-box Transformations
      These are the main ones, there are others
      • Plain Text, HTML & XHTML for all Apache Tika supported text and document formats (~30)
      • PDF to Image and SWF (thumbnails and previews)
      • Office File Formats to PDF (via OpenOffice direct / JODConverter in Enterprise)
      • Plain Text and XML to PDF
      • Zip listing to Text
      • Image to other Images (via ImageMagick)
      • With FFmpeg, video transforms and thumbnails
      • Can chain transformers together, eg text preview via txt -> pdf -> swf
  28. 28. Checking Supported Transformations
      Checking active Transformations and Extractors
      • New webscript in 3.4 exposes information on the available transformers and extractors
      • http://localhost:8080/alfresco/service/mimetypes
      • Shows live information as of when the page is requested
      • As transforms come and go (eg OpenOffice dies), the list will show what's currently active
      • Only shows the current transformer, not inactive or lower preference ones
      • Includes information on transformations both from and to each mimetype, plus the metadata extractor
  29. 29. Demo: Mimetype Information WebScript
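  30. 30. Content Transformer – Java Use

          import org.alfresco.model.ContentModel;
          import org.alfresco.repo.content.transform.ContentTransformer;
          import org.alfresco.repo.content.transform.ContentTransformerRegistry;
          import org.alfresco.service.cmr.repository.ContentReader;
          import org.alfresco.service.cmr.repository.ContentWriter;
          import org.alfresco.service.cmr.repository.TransformationOptions;

          // context, contentService and the node refs are supplied by the caller
          ContentTransformerRegistry registry = (ContentTransformerRegistry)
              context.getBean("contentTransformerRegistry");
          // Pick a transformer by source and target mimetype
          ContentTransformer transformer = registry.getTransformer(
              "application/", "text/csv", new TransformationOptions());
          ContentReader reader =
              contentService.getReader(sourceNodeRef, ContentModel.PROP_CONTENT);
          ContentWriter writer =
              contentService.getWriter(destNodeRef, ContentModel.PROP_CONTENT);
          transformer.transform(reader, writer);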
  31. 31. Content Transformer – JavaScript Use
      • Full access is not directly available in JS
          • You can't get at the transformer registry
          • You can't control which property is transformed, it's always Content
      • You can, however, easily trigger the transformation of a given node
      • This is done via the Script Actions service

          var action = actions.create("transform");
          // Transform into the same folder
          action.parameters["destination-folder"] = document.parent;
          action.parameters["assoc-type"] = "{ ontent/1.0}contains";
          action.parameters["assoc-name"] = + "transformed";
          action.parameters["mime-type"] = "text/html";
          // Execute
          action.execute(document);
  32. 32. Custom Command Line Transformer

          <bean id="transformer.worker.helloWorldCMD"
                class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
            <property name="mimetypeService"><ref bean="mimetypeService"/></property>
            <property name="transformCommand">
              <bean class="org.alfresco.util.exec.RuntimeExec">
                <property name="commandsAndArguments"><map>
                  <entry key=".*"><list>
                    <value>/bin/bash</value>
                    <value>-c</value>
                    <value>/bin/echo Hello World - ${source} &gt; ${target}</value>
                  </list></entry>
                </map></property>
                <property name="errorCodes"><value>1,127</value></property>
              </bean>
            </property>
            <property name="explicitTransformations">
              <list><bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                <property name="sourceMimetype"><value>text/plain</value></property>
                <property name="targetMimetype"><value>hello/world</value></property>
              </bean></list>
            </property>
          </bean>
          <bean id="transformer.helloWorldCMD"
                class="org.alfresco.repo.content.transform.ProxyContentTransformer"
                parent="baseContentTransformer">
            <property name="worker"><ref bean="transformer.worker.helloWorldCMD"/></property>
          </bean>
  33. 33. Content Transformers and Tika
      • Tika generates HTML-like SAX events as it parses
          • Uses Java SAX API
          • Events can be captured or transformed
      • The Body Content Handler is used for plain text
      • Both HTML and XHTML are available
      • You can customise with your own handler, with XSLT or with E4X from JavaScript
      • Text Indexing just uses a Body Content Handler
      • The Excel to CSV transformer has a text altering SAX handler
      • The Web Quick Start Word→HTML transformer alters text, tags and embedded resources
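The point that SAX events "can be captured or transformed" can be tried out with the JDK's own SAX parser, without Tika on the classpath. The sketch below is not Tika's BodyContentHandler: `TextOnlyHandler` and `extractText` are hypothetical names, and a small XHTML string stands in for the events a Tika parser would emit. It shows a handler that keeps only the character events and drops the tags.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: collects text and ignores element events.
class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character events carry the text between tags
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }

    static String extractText(String xhtml) {
        TextOnlyHandler handler = new TextOnlyHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.getText();
    }
}
```

A text-altering handler, like the one the Excel to CSV transformer is described as using, would follow the same pattern but also override `startElement` and `endElement` to emit separators as cells and rows open and close.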
  34. 34. Tika Plugins
      • Tika ships with Parsers for a wide range of file formats as standard
      • All of these Parsers depend on libraries that are Apache Licensed or similar
      • For other Parsers, Tika provides a mechanism for having the Parser auto-loaded
      • Typically used by GPL or Proprietary plugins
      • Great way to have your custom formats handled
      • Alfresco will auto-load these if available
      • Current list of known third party plugins is:
  35. 35. Custom Tika Plugins
      • Writing a new Tika Plugin is very straightforward
      • Only 2 methods needed – getSupportedTypes to list which mimetypes you support, and parse
      • Magic file used for detecting new plugins is META-INF/services/org.apache.tika.parser.Parser
      • With the service file, the Tika Auto-Detect parser will load and use the parser
      • Without it, you can explicitly configure it into Alfresco via TikaSpringConfiguredContentTransformer
      • Very easy way to add indexing and metadata support for custom file formats
  36. 36. Custom Tika Parser – "Hello World"

          import java.io.InputStream;
          import java.util.HashSet;
          import java.util.Set;
          import org.apache.tika.metadata.Metadata;
          import org.apache.tika.mime.MediaType;
          import org.apache.tika.parser.AbstractParser;
          import org.apache.tika.parser.ParseContext;
          import org.apache.tika.sax.XHTMLContentHandler;
          import org.xml.sax.ContentHandler;
          import org.xml.sax.SAXException;

          public class HelloWorldParser extends AbstractParser {
            public Set<MediaType> getSupportedTypes(ParseContext context) {
              Set<MediaType> types = new HashSet<MediaType>();
              types.add(MediaType.parse("hello/world"));
              return types;
            }

            public void parse(InputStream stream, ContentHandler handler,
                  Metadata metadata, ParseContext context) throws SAXException {
              XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
              xhtml.startDocument();

              // Document Heading
              xhtml.element("h1", "Hello World!");

              // To prove this worked, add some extra text to search for
              xhtml.startElement("p");
              xhtml.characters("To show that this went via the parser, we have ");
              xhtml.characters("some special text that we can search for. ");
              xhtml.characters("BADGER BADGER BADGER BADGER BADGER ");
              xhtml.characters("BADGER BADGER BADGER MUSHROOM MUSHROOM ");
              xhtml.endElement("p");

              // All Done
              xhtml.endDocument();

              metadata.set("hello", "world");
              metadata.set("title", "Hello World!");
              metadata.set("custom1", "Hello, Custom Metadata 1!");
              metadata.set("custom2", "Hello, Custom Metadata 2!");
            }
          }
  37. 37. Demo "Hello World" Transformer Round-Trip

          var action = actions.create("transform");
          action.parameters["destination-folder"] = document.parent;
          action.parameters["assoc-type"] = "{}contains";
          action.parameters["assoc-name"] = + "HW";
          if (document.mimetype == "hello/world") {
             // It's currently a "Hello World" file
             // Use Apache Tika to create a plain text version
             action.parameters["mime-type"] = "text/plain";
          } else {
             // It's a regular new text file
             // Have the command line tool make a "Hello World" version
             action.parameters["mime-type"] = "hello/world";
          }
          action.execute(document);
  38. 38. Demo Excel to HTML, CSV and Text

          var nameBase =,"."));
          var action = actions.create("transform");
          action.parameters["destination-folder"] = document.parent;
          action.parameters["assoc-type"] = "{}contains";

          action.parameters["assoc-name"] = nameBase + ".txt";
          action.parameters["mime-type"] = "text/plain";
          action.execute(document);

          action.parameters["assoc-name"] = nameBase + ".csv";
          action.parameters["mime-type"] = "text/csv";
          action.execute(document);

          action.parameters["assoc-name"] = nameBase + ".html";
          action.parameters["mime-type"] = "text/xml";
          action.execute(document);
  39. 39. Rendition Service
  40. 40. Standard Rendition Engines
      Renditions Supported in Alfresco v4.0
      • reformat – access to the Content Transformation Service
      • image – crop, resize, etc
      • freemarker – runs a Freemarker Template against the content of the node
      • html – turns .docx files into clean HTML + images
      • xslt – runs an XSLT Transformation against the content of the node, XML content nodes only!
      • composite – executes several renditions in a series, eg reformat followed by image crop
  41. 41. Persisted and Transient Definitions
      For Complicated or Simple Renditions
      • To run a rendition, first create a Rendition Definition for a given Rendering Engine
      • Next, set your parameters on the definition
      • Finally, execute this against a source node
      • For very complicated, or very commonly used renditions, you don't want to have to create these definitions every time
      • Instead, save them to the Data Dictionary, and load via the Rendition Service on demand
      • Rendition Service provides Save and Load methods
  42. 42. Rendition Service – Call from Java

          // Retrieve the existing Rendition Definition
          QName renditionName = QName.createQName(
              NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
          RenditionDefinition renditionDef =
              renditionService.loadRenditionDefinition(renditionName);

          // Make some changes.
          renditionDef.setParameterValue(
              AbstractRenderingEngine.PARAM_MIME_TYPE, MimetypeMap.MIMETYPE_PDF);
          renditionDef.setParameterValue(
              RenditionService.PARAM_ORPHAN_EXISTING_RENDITION, true);

          // Persist the changes.
          renditionService.saveRenditionDefinition(renditionDef);

          // Run the Rendition
          ChildAssociationRef assoc = renditionService.render(
              sourceNode, renditionDef);
  43. 43. Renditions from JavaScript

          // Crop the image and place in a specified location
          var renditionDef = renditionService.createRenditionDefinition(
              "cm:cropResize", "imageRenderingEngine");
          renditionDef.parameters["destination-path-template"] =
              "/Company Home/Cropped Images/${name}.jpg";
          renditionDef.parameters["isAbsolute"] = true;
          renditionDef.parameters["xsize"] = 50;
          renditionDef.parameters["ysize"] = 50;
          renditionService.render(nodeRef, renditionDef);

          var renditions = renditionService.getRenditions(nodeRef);
  44. 44. Rendition Service – More Ways to Call
      Actions, Rules, CMIS
      • Renditions are Actions, but by default hidden
          • Don't show up in Share when defining Rules
          • Don't show up in Explorer for Run Custom Action
          • They are available from Java and JS
      • Solution – create a JS script to call the Rendition, then run that script from your Rule / from Explorer
      • No dedicated REST API is available
          • Renditions show up in CMIS
          • Or you can use standard Action and Node APIs
  45. 45. Custom Rendition Engines
      For when a composite just isn't enough...
      • Rendition Engines are just a special kind of Action Executor, within the Action Framework
      • If you know how to write Custom Actions, you can write your own Rendering Engine!
      • org.alfresco.repo.rendition.executor.AbstractRenderingEngine provides a helpful superclass, with handy methods
      • See the Actions talk for more on Custom Actions and Custom Action Executors!
  46. 46. Demo Crop and Resize an Image (Using Share Rules)

          var renditionDef = renditionService.createRenditionDefinition(
              "cm:cropResize", "imageRenderingEngine");
          renditionDef.parameters["destination-path-template"] =
              "/Company Home/Cropped Images/${name}.jpg";
          renditionDef.parameters["isAbsolute"] = true;
          renditionDef.parameters["xsize"] = 50;
          renditionDef.parameters["ysize"] = 50;
          renditionDef.parameters["percent_crop"] = true;
          renditionDef.parameters["crop-width"] = 75;
          renditionDef.parameters["crop-height"] = 60;
          renditionDef.parameters["crop_x"] = 20;
          renditionDef.parameters["crop_y"] = 150;
          renditionDef.execute(document);
  47. 47. Demo: Video Thumbnailing and Rendition
  48. 48. Demo: Word .docx → HTML & Images (Uses Web Quick Start)
  49. 49. Metadata Extraction, Content Transformations and Renditions
  50. 50. Any Questions?
  51. 51. Learn More
      _Extraction_with_Apache_Tika
      twitter: @Alfresco, @Gagravarr