Metadata Extraction and Content Transformation



In this session, we will first look at the rich metadata held by the documents in your repository, how to control its mapping onto your content model, and some of the interesting things this can deliver. We'll then move on to the content transformation and rendition services, and see how you can easily and powerfully generate a wide range of media from the content you already have.

Published in: Technology, Education

  • Use devconf-files/geotagged.jpg (AKA quickGEO.jpg from projects/repository/source/test-resources/quick/). Upload it into Share. Go to the details page, and show the EXIF and GEO bits.
  • Create a script with the code shown, also available as devconf-js/contentTransformHellowWorld.js. Create a simple .txt entry in Explorer. Run the script against it. Show that we get a .bin with content type of hello/world. Maybe view the contents of this. Run the script again. Show that we get a new plain text file, with the simple contents.
  • Use quick.xls or similar from projects/repository/source/test-resources/quick/. Upload it into Explorer. Run a custom action which is a script:
           var nameBase =,"."));
           var action = actions.create("transform");
           action.parameters["destination-folder"] = document.parent;
           action.parameters["assoc-type"] = "{}contains";
           action.parameters["assoc-name"] = nameBase + ".txt";
           action.parameters["mime-type"] = "text/plain";
           action.execute(document);
           action.parameters["assoc-name"] = nameBase + ".csv";
           action.parameters["mime-type"] = "text/csv";
           action.execute(document);
           action.parameters["assoc-name"] = nameBase + ".html";
           action.parameters["mime-type"] = "text/xml";
           action.execute(document);
  • In Share, create a rule on a folder to execute the script below. Ensure the folder that the script writes to isn’t a child! Upload an image, at least 600x400, into the folder. In Share, use the repo browser to get to the special folder, and show the smaller cropped version.
           var renditionDef = renditionService.createRenditionDefinition("Test", "htmlRenderingEngine");
           renditionDef.parameters["destination-path-template"] = "/Company Home/Test/${name}.html";
           renditionDef.execute(document);
           renditionDef.parameters["destination-path-template"] = "/Company Home/Test/BodyOnly/${name}.html";
           renditionDef.parameters["bodyContentsOnly"] = true;
  • In Share, create a Video folder, and a Video Drop subfolder. On the subfolder, create a rule: the rule is copy + transform, the target mimetype is flash video, the target directory is the parent (Videos), and it doesn’t run in the background. Upload devconf-files/Video.mp4. Wait. Hopefully show the thumbnail of the .mp4 (refresh if needed). Go to the videos directory. Show the large preview image of the new .flv image.
  • Upload a .docx (including images) to a Web Quick Start folder with .docx extraction enabled, via Share. View the HTML in Share, to show what it looks like. Flip to the WCM site, and show it with images. BACKUP – Create a script based on devconf-files/renderTest.js. In Share, create a folder with a rule of execute script. Pick the script. Upload devconf-files/word3imgs.docx. Go to Company Home/Test. Show the html. Show the images. (You can’t show the two together without WCM, sorry....)

    1. Metadata Extraction and Content Transformations
       Nick Burch
       Software Engineer, Alfresco
       twitter: @gagravarr
    2. Introduction – 3 Content Related Services
         • Metadata Extractor
         • Content Transformer
         • Renditions
       Covering
         • Uses
         • Interfaces
         • Calling the Services
         • Java & JavaScript APIs
         • Demos
         • Extensions
         • Apache Tika
    3. The Metadata Extractor Service
       What, How, Why?
         • For a given piece of content, returns the metadata held within it
         • Document metadata is converted into the content model
         • Typically used with uploaded binary files
         • e.g. upload a PDF, extract the Title and Description, and save these as properties on the Alfresco node
         • Powered internally by a number of different extractors
         • The service picks the appropriate extractor for you
         • Since Alfresco 3.4, makes heavy use of Apache Tika
    4. The Content Transformation Service
       What, How, Why?
         • Transforms content from one format to another
         • Driven by mime types, source and destination
         • Used to generate plain text versions for indexing
         • Used to generate SWF versions for preview
         • Used to generate PDF versions for web download
         • Powered by a large number of different transformers
         • Transformers can be linked together, e.g. .doc -> .pdf via Open Office, then .pdf -> .swf via pdf2swf
         • Since Alfresco 3.4, makes heavy use of Apache Tika
    5. The Rendition Service
       What, How, Why?
         • Can turn content from one kind to another
         • Or can just alter some content as-is
         • Used to manipulate images, e.g. crop and resize
         • Used to generate HTML previews of .docx files in Web Quick Start
         • Often uses the Content Transformation Service
         • Replaced the Thumbnail Service
         • Renditions are actions
    6. Apache Tika
       Apache Tika –
         • Apache project which started in 2006
         • Grew out of the Lucene community, now widely used
         • Provides detection of files, e.g. this binary blob is really a Word file
         • Plain text, HTML and XHTML versions of a wide range of different file formats
         • Consistent metadata from different files
         • Tika hides the complexity of the different formats, and presents a simple, powerful API
         • Easy to use and extend
    7. Metadata Extractor Service
    8. Alfresco 3.3 - Supported Formats
       File formats supported out of the box
         • PDF
         • Word, PowerPoint, Excel
         • HTML
         • Open Document Formats (OpenOffice)
         • RFC822 Email
         • Outlook .msg Email
    9. Alfresco 3.4 - Supported Formats – Page 1
       File formats supported out of the box, Page 1
         • Audio – WAV, RIFF, MIDI
         • DWG (CAD)
         • Epub
         • RSS and ATOM Feeds
         • True Type Fonts
         • HTML
         • Images – JPEG, GIF, PNG, TIFF, Bitmap (including EXIF where found)
         • iWork (Keynote, Pages etc)
         • RFC822 mbox Mail
    10. Alfresco 3.4 - Supported Formats – Page 2
       File formats supported out of the box, Page 2
         • Microsoft Outlook .msg Email
         • Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
         • Microsoft Office (OOXML) – Word, PowerPoint, Excel
         • MP3 (id3 v1 and v2)
         • CDF (Scientific Data)
         • Open Document Format (Open Office)
         • Old-style Open Office (.sxw etc)
         • PDF
    11. Alfresco 3.4 - Supported Formats – Page 3
       File formats supported out of the box, Page 3
         • Zip and Tar archives
         • RDF
         • Plain Text
         • FLV Video
         • XML
         • Java class files
       And I probably forgot one...!
    12. Calling Apache Tika
           // Get a content detector, and an auto-selecting Parser
           TikaConfig config = TikaConfig.getDefaultConfig();
           ContainerAwareDetector detector = new ContainerAwareDetector(
               config.getMimeRepository()
           );
           Parser parser = new AutoDetectParser(detector);
           // We’ll only want the plain text contents
           ContentHandler handler = new BodyContentHandler();
           // Tell the parser what we have
           Metadata metadata = new Metadata();
           metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
           // Have it processed
           parser.parse(input, handler, metadata, new ParseContext());
    13. Metadata Extractor – Java Use
           MetadataExtracterRegistry registry = (MetadataExtracterRegistry) context.getBean("metadataExtracterRegistry");
           MetadataExtracter extractor = registry.getExtracter("application/");
           Map<QName, Serializable> properties = new HashMap<QName, Serializable>();
           ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
           extractor.extract(reader, properties);
           System.err.println(properties);
    14. Metadata Extractor – JavaScript Use
       JavaScript
           var action = actions.create("extract-metadata");
           action.execute(document);
         • Full access is not directly available
         • You can’t get at the raw properties
         • You can, however, trigger extraction and saving to the node easily
         • Available via an action
    15. Metadata Extractor – Geo Content Model
           <aspect name="cm:geographic">
             <title>Geographic</title>
             <properties>
               <property name="cm:latitude">
                 <title>Latitude</title>
                 <type>d:double</type>
               </property>
               <property name="cm:longitude">
                 <title>Longitude</title>
                 <type>d:double</type>
               </property>
             </properties>
           </aspect>
    16. Metadata Extractor – Geo Mapping
           # Namespaces

           # Geo Mappings
           geo:lat=cm:latitude
           geo:long=cm:longitude
           # Normal Mappings
           author=cm:author
           title=cm:title
           description=cm:description
           created=cm:created
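The mapping file above pairs each raw metadata key with a content model property. The sketch below shows one way such a mapping could be applied to a set of extracted values. It is not Alfresco's implementation: the class `MappingSketch`, the method `applyMapping` and the sample values are made up for illustration, and it assumes the mapping text is read with `java.util.Properties`. Under that assumption a `:` inside a key has to be escaped (`geo\:lat`), because an unescaped colon ends the key.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

public class MappingSketch {

    // Hypothetical helper: renames raw metadata keys to content model
    // property names, dropping any key that has no mapping.
    public static Map<String, String> applyMapping(String mappingText, Map<String, String> raw) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(mappingText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        Map<String, String> mapped = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = mapping.getProperty(entry.getKey());
            if (target != null) {
                mapped.put(target, entry.getValue());
            }
        }
        return mapped;
    }

    public static void main(String[] args) {
        // Colons inside keys are escaped, as java.util.Properties requires.
        String mappingText =
                "geo\\:lat=cm:latitude\n"
              + "geo\\:long=cm:longitude\n"
              + "title=cm:title\n";

        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("geo:lat", "51.5");            // made-up sample values
        raw.put("geo:long", "-0.12");
        raw.put("title", "Quick test image");
        raw.put("unmapped", "ignored");        // no mapping, so it is dropped

        System.out.println(applyMapping(mappingText, raw));
    }
}
```

Keys without a mapping are silently dropped here; a real extractor may well treat them differently.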
    17. Demo: Geo Tagged Image in Share
    18. Content Transformation Service
    19. Supported Transformations
       Transformations supported in Alfresco v3.4
         • Plain Text, HTML & XHTML for all Apache Tika supported text and document formats (around 30 file formats)
         • PDF to Image
         • PDF to SWF (for preview)
         • Office file formats to PDF (via Open Office, using JODConverter in Enterprise)
         • Plain Text and XML to PDF
         • Zip listing to Text
         • Image to other Images (via ImageMagick)
    20. Content Transformer and Tika
       Handlers
           // Plain text: the body content handler collects the text
           ContentHandler handler = new BodyContentHandler();
           String text = handler.toString();   // read once parsing has finished

           // XML: serialise the SAX events into a StringWriter
           SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
           TransformerHandler handler = factory.newTransformerHandler();
           handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
           handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
           StringWriter sw = new StringWriter();
           handler.setResult(new StreamResult(sw));
           String text = sw.toString();        // read once parsing has finished
         • Tika generates HTML-like SAX events as it parses
         • Uses the Java SAX API
         • Events can be captured or transformed
         • Body Content Handler used for plain text
         • HTML and XHTML available
         • Can customise with your own handler, with XSLT or with E4X from JavaScript
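To make the "events can be captured or transformed" point concrete, here is a small sketch that uses only the JDK's SAX and TrAX classes, with no Tika on the classpath. The method `fireEvents` is a hypothetical stand-in for a parser, and the class and method names are illustrative. The same stream of events is either reduced to plain text (roughly what a body content handler does) or serialised as XML by a `TransformerHandler`.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class SaxHandlerSketch {

    // Stand-in for a parser: fires a few HTML-like SAX events at the handler.
    static void fireEvents(ContentHandler handler) throws SAXException {
        char[] text = "Hello, World!".toCharArray();
        handler.startDocument();
        handler.startElement("", "h1", "h1", new AttributesImpl());
        handler.characters(text, 0, text.length);
        handler.endElement("", "h1", "h1");
        handler.endDocument();
    }

    // Plain text: keep only the character data and ignore the markup events.
    public static String asPlainText() {
        final StringBuilder sb = new StringBuilder();
        try {
            fireEvents(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    sb.append(ch, start, length);
                }
            });
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    // XML: let the JDK's TransformerHandler serialise the same events.
    public static String asXml() {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sw = new StringWriter();
            handler.setResult(new StreamResult(sw));
            fireEvents(handler);
            return sw.toString();
        } catch (SAXException | TransformerConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(asPlainText());
        System.out.println(asXml());
    }
}
```

Note the cast on `SAXTransformerFactory.newInstance()`: the static method is inherited from `TransformerFactory` and returns that type.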
    21. Content Transformer – Java Use
           ContentTransformerRegistry registry = (ContentTransformerRegistry) context.getBean("contentTransformerRegistry");
           ContentTransformer transformer = registry.getTransformer("application/", "text/csv", new TransformationOptions());
           ContentReader reader = contentService.getReader(sourceNodeRef, ContentModel.PROP_CONTENT);
           ContentWriter writer = contentService.getWriter(destNodeRef, ContentModel.PROP_CONTENT, true);
           transformer.transform(reader, writer);
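The lookup above is driven purely by source and target mime types, and the deck notes that transformers can be linked together (.doc -> .pdf -> .swf). The following sketch shows one possible way to find such a chain, using a breadth-first search over registered source/target pairs. It is an illustration only and not how Alfresco's registry works internally; the class `TransformChainSketch` and its methods are hypothetical, and the transformer names are just labels.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TransformChainSketch {

    // source mime type -> (target mime type -> transformer name)
    private final Map<String, Map<String, String>> direct = new HashMap<>();

    public void register(String source, String target, String transformerName) {
        direct.computeIfAbsent(source, k -> new LinkedHashMap<>()).put(target, transformerName);
    }

    // Breadth-first search for the shortest chain of transformers from source
    // to target. Returns an empty list if no chain is found (or none is needed).
    public List<String> findChain(String source, String target) {
        Map<String, String> cameFrom = new HashMap<>(); // mime type -> previous mime type
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(source, null);
        queue.add(source);

        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                break;
            }
            Map<String, String> next =
                    direct.getOrDefault(current, Collections.<String, String>emptyMap());
            for (String candidate : next.keySet()) {
                if (!cameFrom.containsKey(candidate)) {
                    cameFrom.put(candidate, current);
                    queue.add(candidate);
                }
            }
        }

        List<String> chain = new ArrayList<>();
        if (!cameFrom.containsKey(target)) {
            return chain;
        }
        // Walk back from the target, prepending the transformer used for each hop.
        for (String step = target; cameFrom.get(step) != null; step = cameFrom.get(step)) {
            chain.add(0, direct.get(cameFrom.get(step)).get(step));
        }
        return chain;
    }

    public static void main(String[] args) {
        TransformChainSketch registry = new TransformChainSketch();
        registry.register("application/msword", "application/pdf", "OpenOffice");
        registry.register("application/pdf", "application/x-shockwave-flash", "pdf2swf");
        System.out.println(
                registry.findChain("application/msword", "application/x-shockwave-flash"));
    }
}
```

Breadth-first search returns the chain with the fewest hops, which is a reasonable default here because each extra transformation step costs time and may lose fidelity.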
    22. Content Transformer – JavaScript Use
       JavaScript
           var action = actions.create("transform");
           // Transform into the same folder
           action.parameters["destination-folder"] = document.parent;
           action.parameters["assoc-type"] = "{}contains";
           action.parameters["assoc-name"] = + "transformed";
           action.parameters["mime-type"] = "text/html";
           // Execute
           action.execute(document);
         • Full access is not directly available
         • You can’t control which property is transformed, it’s always Content
         • You can control where the transformed version goes
         • Triggering the transformation is easier than in Java
         • Available via an action
    23. Custom Tika Parsers - Interface
       Interface
           public interface Parser {
               Set<MediaType> getSupportedTypes(ParseContext context);
               void parse(
                   InputStream stream, ContentHandler handler,
                   Metadata metadata, ParseContext context)
                   throws IOException, SAXException, TikaException;
           }
         • The Tika Parser interface is quite simple
         • Need to provide a list of supported mime types, so that auto-detection can work
         • Accept an input stream, populate the Metadata object, and fire SAX events to the supplied handler
         • That’s it!
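The reason each parser declares its supported mime types is so that something else can pick the right parser once the type of a file has been detected. The sketch below illustrates that dispatch idea in plain Java. `MiniParser` is a hypothetical, much-reduced stand-in for the interface above, and the lookup shown is an assumption made for illustration rather than Tika's actual selection code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ParserDispatchSketch {

    // Hypothetical, much-reduced stand-in for the Parser interface.
    public interface MiniParser {
        Set<String> getSupportedTypes();
        String parse(String input);
    }

    private final Map<String, MiniParser> byType = new HashMap<>();

    // Index the parser under every mime type it declares.
    public void register(MiniParser parser) {
        for (String type : parser.getSupportedTypes()) {
            byType.put(type, parser);
        }
    }

    // Pick the parser registered for the detected mime type and run it.
    public String parse(String mimeType, String input) {
        MiniParser parser = byType.get(mimeType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + mimeType);
        }
        return parser.parse(input);
    }

    public static void main(String[] args) {
        ParserDispatchSketch registry = new ParserDispatchSketch();
        registry.register(new MiniParser() {
            @Override
            public Set<String> getSupportedTypes() {
                return new HashSet<>(Arrays.asList("hello/world"));
            }

            @Override
            public String parse(String input) {
                return "Hello, World!";
            }
        });
        System.out.println(registry.parse("hello/world", "ignored"));
    }
}
```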
    24. Custom Tika Parser – Hello World Parser
           public class HelloWorldParser implements Parser {
               public Set<MediaType> getSupportedTypes(ParseContext context) {
                   Set<MediaType> types = new HashSet<MediaType>();
                   types.add(MediaType.parse("hello/world"));
                   return types;
               }
               public void parse(InputStream stream, ContentHandler handler,
                       Metadata metadata, ParseContext context) throws SAXException {
                   XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
                   xhtml.startDocument();
                   xhtml.startElement("h1");
                   xhtml.characters("Hello, World!");
                   xhtml.endElement("h1");
                   xhtml.endDocument();
                   metadata.set("hello","world");
                   metadata.set("title","Hello World!");
               }
           }
    25. Custom Command Line Transformer
           <bean id="transformer.worker.helloWorldCMD" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
             <property name="mimetypeService"><ref bean="mimetypeService"/></property>
             <property name="transformCommand">
               <bean class="org.alfresco.util.exec.RuntimeExec">
                 <property name="commandsAndArguments"><map>
                   <entry key=".*"><list>
                     <value>/bin/bash</value>
                     <value>-c</value>
                     <value>/bin/echo 'Hello World - ${source}' &gt; ${target}</value>
                   </list></entry>
                 </map></property>
                 <property name="errorCodes"><value>1,127</value></property>
               </bean>
             </property>
             <property name="explicitTransformations">
               <list><bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                 <property name="sourceMimetype"><value>text/plain</value></property>
                 <property name="targetMimetype"><value>hello/world</value></property>
               </bean></list>
             </property>
           </bean>
           <bean id="transformer.helloWorldCMD" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
             <property name="worker"><ref bean="transformer.worker.helloWorldCMD"/></property>
           </bean>
    26. Custom Transformer – Demo
       JS Code
           var action = actions.create("transform");
           action.parameters["destination-folder"] = document.parent;
           action.parameters["assoc-type"] = "{}contains";
           action.parameters["assoc-name"] = + "HW";
           if (document.mimetype == "hello/world") {
               action.parameters["mime-type"] = "text/plain";
           } else {
               action.parameters["mime-type"] = "hello/world";
           }
           action.execute(document);
         • Use our command line transformer to generate a “hello/world” version
         • Use our Tika transformer to turn this back into plain text
         • Uses the JavaScript API to access the content transformation service
    27. Demo 2: Excel to Plain Text, CSV and HTML
    28. Rendition Service
    29. Standard Rendition Engines
       Renditions supported in Alfresco v3.4
         • reformat – access to the Content Transformation Service
         • image – crop, resize, etc
         • freemarker – runs a Freemarker template against the content of the node
         • html – turns .docx files into clean HTML + images
         • xslt – runs an XSLT transformation against the content of the node, XML content nodes only!
         • composite – executes several renditions in series, e.g. reformat followed by image crop
    30. Persisted vs Transient Definitions
       For your more complicated renditions
         • To run a rendition, first create a rendition definition for a given rendering engine
         • Then set all the parameters against it
         • Finally execute it against a node
         • For very complicated / common renditions, you can save the definition to the data dictionary
         • It can then be retrieved and run
         • Rendition Service provides support to create, load, save and execute definitions
    31. Rendition Service – Calling From Java
       Load, Edit, Save, Run
           // Retrieve the existing Rendition Definition
           QName renditionName = QName.createQName(NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
           RenditionDefinition renditionDef = renditionService.loadRenditionDefinition(renditionName);
           // Make some changes.
           renditionDef.setParameterValue(AbstractRenderingEngine.PARAM_MIME_TYPE, MimetypeMap.MIMETYPE_PDF);
           renditionDef.setParameterValue(RenditionService.PARAM_ORPHAN_EXISTING_RENDITION, true);
           // Persist the changes.
           renditionService.saveRenditionDefinition(renditionDef);
           // Run the Rendition
           ChildAssociationRef assoc = renditionService.render(sourceNode, renditionDef);
    32. Rendition Service – Calling From JavaScript
       Create, Run, List
           var renditionDef = renditionService.createRenditionDefinition("cm:cropResize", "imageRenderingEngine");
           renditionDef.parameters["destination-path-template"] = "/Company Home/Cropped Images/${name}.jpg";
           renditionDef.parameters["isAbsolute"] = true;
           renditionDef.parameters["xsize"] = 50;
           renditionDef.parameters["ysize"] = 50;
           renditionService.render(nodeRef, renditionDef);
           var renditions = renditionService.getRenditions(nodeRef);
    33. Rendition Service – More Calling Options
       Actions, Rules, CMIS
         • Renditions are Actions, but normally hidden ones
         • They won’t show up in Share when defining Rules, or in Explorer for running a Custom Action
         • Solution – create a JS Script, or some custom Java
         • Use this from your Rule / to run as an Action
         • No dedicated REST API, but Renditions are available through CMIS
         • More details available in the CMIS talks!
    34. Custom Rendition Engines
       When a composite just isn’t enough
         • Rendition Engines are a special kind of Action Executor
         • This delivers lots of flexibility, and means anyone who can write Custom Actions already knows enough to write Custom Rendition Engines!
         • org.alfresco.repo.rendition.executer.AbstractRenderingEngine provides a helpful superclass
         • To learn more about Custom Actions and Custom Action Executors, see Neil McErlean’s talk
    35. Demo 1: Crop and Resize an Image (Using Share Rules)
    36. Demo 2: Video Rendition
    37. Demo 3: Word .docx -> HTML & Images (Using Web Quick Start)
    38. Any Questions?
    39. Learn More
       twitter: @AlfrescoECM @Gagravarr