PLAT-13 Metadata Extraction and Transformation

In this session, we will look first at the rich metadata that documents in your repository have, how to control the mapping of this on to your content model, and some of the interesting things this can deliver. We’ll then move on to the content transformation and rendition services, and see how you can easily and powerfully generate a wide range of media from the content you already have. Finally, we’ll look at how to extend these services to support additional formats.

Presentation Transcript

  • Metadata Extraction, Content Transformations and Renditions
      Nick Burch • Senior Engineer, Alfresco • twitter: @Gagravarr
  • Introduction: 3 Content Related Services
      • Metadata Extractor
      • Content Transformer
      • Renditions
      Covering:
      • Service Uses
      • Interfaces
      • Calling the Service
      • Java & JS APIs
      • Demos
      • Configuration
      • Extending
      • Apache Tika
  • Why Now? Aren't these old Services?
      • The Metadata Extractor and Content Transformer are core repository services
      • They've been around since the early days
      • For a long time, not a lot changed with them: “They're boring and just work....”
      • In Alfresco 3.4 we added support for delegating some of the work to Apache Tika
      • This has led to a large improvement in the number of file formats that are supported!
      • Renditions came in with Alfresco 3.3
  • What did Alfresco 3.3 Support?
      • PDF
      • Word, PowerPoint, Excel
      • HTML
      • Open Document Formats (OpenOffice)
      • RFC822 Email
      • Outlook .msg Email
      • And that's it...
  • Supported Formats in Alfresco 4.0
      • Audio – WAV, RIFF, MIDI
      • DWG and PRT (CAD Formats)
      • Epub
      • RSS and ATOM Feeds
      • True Type Fonts
      • HTML
      • Images – JPEG, GIF, PNG, TIFF, Bitmap (includes EXIF Metadata where present)
  • Alfresco 4.0 Formats – Continued
      • iWorks (Keynote, Pages, Numbers)
      • RFC822 MBox Mail
      • Microsoft Outlook .msg Email
      • Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
      • Microsoft Office (OOXML, 2007+) – Word, PowerPoint, Excel
      • Open Document Format (OpenOffice)
      • Old-style OpenOffice (.sxw etc)
  • Alfresco 4.0 Formats – Still Continued
      • MP3 (id3 v1 and v2)
      • Ogg Vorbis and FLAC
      • CDF, HDF (Scientific Data)
      • RDF
      • RTF
      • PDF
      • Adobe Illustrator (PDF based)
      • Adobe PSD (expected shortly)
      • Plain Text
  • Alfresco 4.0 Formats – Final Set!
      • Zip, Tar, Compress etc (Archive Formats)
      • FLV Video
      • XML
      • Java Class Files
      • CHM (Windows Help Files)
      • Configurable External Programs
      • And probably some others too!
  • Services Overview
  • The Metadata Extractor Service: What, How, Why?
      • For a given piece of content, returns the Metadata held within that Document
      • Metadata is converted into the content model
      • Typically used with uploaded binary files
      • Upload a PDF, extract out the Title and Description, save these as the properties on the Alfresco Node
      • Powered internally by a number of different extractors
      • Service picks the appropriate extractor for you
      • Since Alfresco 3.4, makes heavy use of Apache Tika
  • The Content Transformer Service: What, How, Why?
      • Transforms content from one format to another
      • Driven by source and destination mime types
      • Used to generate plain text versions for indexing
      • Used to generate SWF versions for preview
      • Used to generate PDF versions for web download
      • Powered by a large number of different transformers internally
      • Transformers can be chained together, eg .doc → .pdf via OpenOffice, then .pdf → .swf via pdf2swf
      • Since Alfresco 3.4, makes heavy use of Apache Tika
  • The Rendition Service (Alfresco 3.3+): What, How, Why?
      • Can turn content from one kind to another
      • Or can just alter some content in the same format
      • Used to manipulate images, eg crop and resize
      • Used to generate HTML previews from .docx in the Web Quick Start
      • Often uses the Content Transformation Service to do the actual heavy lifting
      • The Thumbnail Service has been re-written to use the Rendition Service (all thumbnail actions now delegate to the Rendition Service)
      • Renditions are all Actions
  • Apache Tika
      • Apache Tika – http://tika.apache.org/
      • Apache Project which started in 2006
      • Grew out of the Lucene community, now widely used in both Search and Content settings
      • Provides detection of files – eg this binary blob is really a Word file
      • Plain text, HTML and XHTML versions of a wide range of different file formats
      • Consistent Metadata from different files
      • Tika hides the complexity of different formats and their libraries; instead it's a simple, powerful API
      • Easy to use and extend
  • Any questions so far?
  • Metadata Extractor Service
  • Metadata Extractor – Java Use
      // Note: the Alfresco class and bean names are spelt "Extracter"
      MetadataExtracterRegistry registry = (MetadataExtracterRegistry)
          context.getBean("metadataExtracterRegistry");
      ContentReader reader = contentService.getReader(
          nodeRef, ContentModel.PROP_CONTENT);
      // Ask the registry for the extracter that handles this mimetype
      MetadataExtracter extractor = registry.getExtracter(reader.getMimetype());
      // The extracted, already-mapped properties are written into this map
      Map<QName, Serializable> properties = new HashMap<QName, Serializable>();
      extractor.extract(reader, properties);
      System.err.println(properties);
  • Metadata Extractor – JavaScript Use
      • Full access is not directly available in JS
      • You can't get at the extractor registry
      • You can't get at the raw properties
      • You can, however, easily trigger the extraction on a given node
      • This is done via the Script Actions service
      JavaScript:
      var action = actions.create("extract-metadata");
      action.execute(document);
  • Calling Apache Tika
      // Get a content detector, and an auto-selecting Parser
      // In Alfresco we already know the type, so we don’t need to Auto Detect!
      TikaConfig config = TikaConfig.getDefaultConfig();
      DefaultDetector detector = new DefaultDetector(
          config.getMimeRepository() );
      Parser parser = new AutoDetectParser(detector);
      // We’ll only want the plain text contents
      ContentHandler handler = new BodyContentHandler();
      // Tell the parser what we have
      Metadata metadata = new Metadata();
      metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
      // Have it processed
      parser.parse(input, handler, metadata, new ParseContext());
  • Metadata Extractor – Mappings
      • Mappings control how to turn the metadata an extractor produces into node properties
      • Maps from extractor names to your content model
      • Typically set in a properties file, one per extractor
      • Can also be done in Spring when defining the bean
      • An OverwritePolicy controls what happens when extracting for a 2nd (or subsequent) time
      • One output from a metadata extractor can map to multiple properties on your node
      • Not all outputs need to be mapped, some can be (and often are) ignored
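The mapping rules on this slide can be pictured with a small, self-contained Java sketch. It is not Alfresco's implementation: the class `MappingSketch`, the two-value `Overwrite` enum and the property name `my:displayTitle` are all invented for illustration, and the real OverwritePolicy is not described on the slides. The sketch only shows the three behaviours listed above: unmapped outputs are ignored, one output can feed several properties, and a policy decides what happens to values that already exist.

```java
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for the mapping step; not Alfresco code. */
final class MappingSketch {
    /** Simplified overwrite choices, invented for this sketch. */
    enum Overwrite { ALWAYS, ONLY_IF_MISSING }

    /**
     * Copies extracted values onto node properties.
     * mappings: extractor output name -> one or more target property names.
     */
    static Map<String, Serializable> apply(
            Map<String, Serializable> extracted,
            Map<String, List<String>> mappings,
            Map<String, Serializable> existing,
            Overwrite policy) {
        Map<String, Serializable> result = new LinkedHashMap<>(existing);
        for (Map.Entry<String, Serializable> entry : extracted.entrySet()) {
            List<String> targets = mappings.get(entry.getKey());
            if (targets == null) {
                continue; // outputs without a mapping are simply ignored
            }
            for (String target : targets) {
                boolean alreadySet = result.get(target) != null;
                if (policy == Overwrite.ALWAYS || !alreadySet) {
                    result.put(target, entry.getValue());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Serializable> extracted = new LinkedHashMap<>();
        extracted.put("title", "New Title");
        extracted.put("unmapped", "ignored");
        Map<String, List<String>> mappings = new LinkedHashMap<>();
        mappings.put("title", List.of("cm:title", "my:displayTitle"));
        Map<String, Serializable> existing = new LinkedHashMap<>();
        existing.put("cm:title", "Old Title");
        System.out.println(apply(extracted, mappings, existing, Overwrite.ONLY_IF_MISSING));
    }
}
```

With `ONLY_IF_MISSING` the existing `cm:title` is kept and only the missing `my:displayTitle` is filled in; with `ALWAYS` both are replaced by the freshly extracted value.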
  • Geo Content Model (cm:geographic)
      <aspect name="cm:geographic">
        <title>Geographic</title>
        <properties>
          <property name="cm:latitude">
            <title>Latitude</title>
            <type>d:double</type>
          </property>
          <property name="cm:longitude">
            <title>Longitude</title>
            <type>d:double</type>
          </property>
        </properties>
      </aspect>
  • Metadata Extractor – Geo Mapping
      # Namespaces
      namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
      # Geo Mappings
      # Note – escape : in metadata keys inside properties files!
      geo\:lat=cm:latitude
      geo\:long=cm:longitude
      # Normal Mappings
      author=cm:author
      title=cm:title
      description=cm:description
      created=cm:created
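The note about escaping ":" follows from how Java properties files are parsed: an unescaped ":" ends the key just as "=" does. The short sketch below uses only `java.util.Properties` from the JDK to show the difference; the class name `PropertiesEscapeSketch` is made up for the example, and it assumes the mapping file is read with the standard properties parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Shows why a ':' inside a properties key has to be escaped. */
final class PropertiesEscapeSketch {
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Unescaped: the key stops at the first ':', so the key is just "geo"
        Properties unescaped = parse("geo:lat=cm:latitude\n");
        System.out.println(unescaped.getProperty("geo"));      // lat=cm:latitude
        System.out.println(unescaped.getProperty("geo:lat"));  // null

        // Escaped: the backslash keeps the ':' as part of the key
        Properties escaped = parse("geo\\:lat=cm:latitude\n");
        System.out.println(escaped.getProperty("geo:lat"));    // cm:latitude
    }
}
```

A ":" in the value (as in `cm:latitude`) needs no escaping, because only the first unescaped separator on the line splits key from value.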
  • Demo: Geo-Tagged Image Upload
  • Demo: Tika + Geo-Tagged Images
      java -jar tika-app-1.0-SNAPSHOT.jar --metadata geotagged.jpg
      date: 2009-08-11T09:09:45
      exif:DateTimeOriginal: 2009-08-11T09:09:45
      exif:ExposureTime: 6.25E-4
      exif:FNumber: 5.6
      exif:Flash: false
      exif:FocalLength: 194.0
      exif:IsoSpeedRatings: 400
      geo:lat: 12.54321
      geo:long: -54.1234
      subject: canon-55-250
      tiff:BitsPerSample: 8
      tiff:ImageLength: 68
      tiff:ImageWidth: 100
      tiff:Make: Canon
      tiff:Model: Canon EOS 40D
      tiff:ResolutionUnit: Inch
      tiff:Software: Adobe Photoshop CS3 Macintosh
      tiff:XResolution: 240.0
      tiff:YResolution: 240.0
  • Demo: Audio File Upload
  • Ways to Customise and Extend
      Customise
      • Identify already available metadata of interest
      • Define a content model for this
      • Add mappings
      • tika-app.jar can be very helpful here
      Extend
      • Locate/Write library or program to read file format
      • Write either Tika Plugin, or whole Extractor
      • Define mappings
      • http://blogs.alfresco.com/wp/nickb/ has more
  • Content Transformer Service
  • Out-of-the-box Transformations
      • These are the main ones, there are others
      • Plain Text, HTML & XHTML for all Apache Tika supported text and document formats (~30)
      • PDF to Image and SWF (thumbnails and previews)
      • Office File Formats to PDF (via Open Office direct / JODConverter in Enterprise)
      • Plain Text and XML to PDF
      • Zip listing to Text
      • Image to other Images (via ImageMagick)
      • With FFMpeg, video transforms and thumbnails
      • Can chain transformers together, eg text preview via txt -> pdf -> swf
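Chaining, as in txt -> pdf -> swf, amounts to finding a route between two mimetypes through the available direct transformations. The sketch below does that with a plain breadth-first search. It is an illustration only: the slides do not say how Alfresco selects or orders transformers, and the class `TransformChainSketch`, the `Step` type and the transformer name `textToPdf` are invented here (`OpenOffice` and `pdf2swf` are the tools named on the slides).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: find a chain of direct transformations by breadth-first search. */
final class TransformChainSketch {
    /** One direct transformation, e.g. application/pdf -> application/x-shockwave-flash. */
    static final class Step {
        final String source;
        final String target;
        final String transformerName;

        Step(String source, String target, String transformerName) {
            this.source = source;
            this.target = target;
            this.transformerName = transformerName;
        }
    }

    /** Returns the shortest chain from source to target, or empty if there is no route. */
    static Optional<List<Step>> findChain(List<Step> available, String source, String target) {
        Map<String, Step> reachedBy = new HashMap<>(); // mimetype -> step that first reached it
        Deque<String> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                break;
            }
            for (Step step : available) {
                boolean unseen = !step.target.equals(source) && !reachedBy.containsKey(step.target);
                if (step.source.equals(current) && unseen) {
                    reachedBy.put(step.target, step);
                    queue.add(step.target);
                }
            }
        }
        // Walk backwards from the target to rebuild the chain
        List<Step> chain = new ArrayList<>();
        String cursor = target;
        while (!cursor.equals(source)) {
            Step step = reachedBy.get(cursor);
            if (step == null) {
                return Optional.empty();
            }
            chain.add(step);
            cursor = step.source;
        }
        Collections.reverse(chain);
        return Optional.of(chain);
    }

    public static void main(String[] args) {
        List<Step> available = List.of(
            new Step("text/plain", "application/pdf", "textToPdf"),
            new Step("application/msword", "application/pdf", "OpenOffice"),
            new Step("application/pdf", "application/x-shockwave-flash", "pdf2swf"));
        findChain(available, "text/plain", "application/x-shockwave-flash")
            .ifPresent(chain -> chain.forEach(
                step -> System.out.println(step.source + " -> " + step.target
                        + " via " + step.transformerName)));
    }
}
```

Breadth-first search is used so that the shortest chain wins; a real service would likely also weigh transformer preference and availability, which this sketch leaves out.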
  • Checking Supported Transformations: Checking active Transformations and Extractors
      • New webscript in 3.4 exposes information on the available transformers and extractors
      • http://localhost:8080/alfresco/service/mimetypes
      • Shows live information as of when the page is requested
      • As transforms come and go (eg OpenOffice dies), the list will show what's currently active
      • Only shows the current transformer, not inactive or lower preference ones
      • Includes information on transformations both from and to each mimetype, plus the metadata extractor
  • Demo: Mimetype Information WebScript
  • Content Transformer – Java Use
      ContentTransformerRegistry registry = (ContentTransformerRegistry)
          context.getBean("contentTransformerRegistry");
      // Look up a transformer by source and target mimetype
      ContentTransformer transformer = registry.getTransformer(
          "application/vnd.ms-excel", "text/csv", new TransformationOptions());
      ContentReader reader = contentService.getReader(
          sourceNodeRef, ContentModel.PROP_CONTENT);
      ContentWriter writer = contentService.getWriter(
          destNodeRef, ContentModel.PROP_CONTENT);
      transformer.transform(reader, writer);
  • Content Transformer – JavaScript Use
      • Full access is not directly available in JS
      • You can't get at the transformer registry
      • You can't control which property is transformed, it's always Content
      • You can, however, easily trigger the transformation of a given node
      • This is done via the Script Actions service
      JavaScript:
      var action = actions.create("transform");
      // Transform into the same folder
      action.parameters["destination-folder"] = document.parent;
      action.parameters["assoc-type"] =
          "{http://www.alfresco.org/model/content/1.0}contains";
      action.parameters["assoc-name"] = document.name + "transformed";
      action.parameters["mime-type"] = "text/html";
      // Execute
      action.execute(document);
  • Custom Command Line Transformer
      <bean id="transformer.worker.helloWorldCMD"
            class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
        <property name="mimetypeService"><ref bean="mimetypeService"/></property>
        <property name="transformCommand">
          <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandsAndArguments"><map>
              <entry key=".*"><list>
                <value>/bin/bash</value>
                <value>-c</value>
                <value>/bin/echo Hello World - ${source} &gt; ${target}</value>
              </list></entry>
            </map></property>
            <property name="errorCodes"><value>1,127</value></property>
          </bean>
        </property>
        <property name="explicitTransformations">
          <list><bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
            <property name="sourceMimetype"><value>text/plain</value></property>
            <property name="targetMimetype"><value>hello/world</value></property>
          </bean></list>
        </property>
      </bean>
      <bean id="transformer.helloWorldCMD"
            class="org.alfresco.repo.content.transform.ProxyContentTransformer"
            parent="baseContentTransformer">
        <property name="worker"><ref bean="transformer.worker.helloWorldCMD"/></property>
      </bean>
  • Content Transformers and Tika
      • Tika generates HTML-like SAX events as it parses
      • Uses Java SAX API
      • Events can be captured or transformed
      • The Body Content Handler is used for plain text
      • Both HTML and XHTML are available
      • You can customise with your own handler, with XSLT or with E4X from JavaScript
      • Text Indexing just uses a Body Content Handler
      • The Excel to CSV transformer has a text altering SAX handler
      • The Web Quick Start Word→HTML transformer alters text, tags and embedded resources
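To make the "capture the SAX events" idea concrete, here is a minimal handler written against the JDK's own SAX API, with no Tika dependency. It keeps only the character data found inside `<body>`, which is roughly the job the slides attribute to the Body Content Handler; it is not Tika's implementation, and the names `BodyTextSketch` and `BodyTextHandler` are invented for this example. The JDK's SAX parser stands in for Tika as the source of events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of a text-collecting SAX handler; not Tika's BodyContentHandler. */
final class BodyTextSketch {
    /** Collects character data inside <body>, ignoring the head and all tags. */
    static final class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // 0 means we are outside <body>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
                // Put a line break after block-level elements so the text stays readable
                if ("p".equals(qName) || "h1".equals(qName)) {
                    text.append('\n');
                }
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Parses well-formed XHTML and returns the text of its body. */
    static String extractBodyText(String xhtml) {
        BodyTextHandler handler = new BodyTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
            + "<body><h1>Hello World!</h1><p>Some text</p></body></html>";
        System.out.print(extractBodyText(xhtml));
    }
}
```

A text-altering handler, such as the Excel to CSV one mentioned above, would follow the same shape but change what it appends in `characters` and `endElement`.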
  • Tika Plugins Tika ships with Parsers for a wide range of file formats as standard All of these Parsers depend on libraries that are Apache Licensed or similar For other Parsers, Tika provides a mechanism for having the Parser auto-loaded Typically used by GPL or Proprietary plugins Great way to have your custom formats handled Alfresco will auto-load these if available Current list of known third party plugins is:http://wiki.apache.org/tika/3rd%20party%20parser%20plugins
  • Custom Tika Plugins Writing a new Tika Plugin is very straightforward Only 2 methods needed – getSupportedTypes to list which mimetypes you support, and parse Magic file used for detecting new plugins isMETA-INF/services/org.apache.tika.parser.Parser With the service file, the Tika Auto-Detect parser will load and use the parser Without it, you can explicitly configure it into Alfresco via TikaSpringConfiguredContentTransformer Very easy way to add indexing and metadata support for custom file formats
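The "magic file" is a plain text list of implementation class names. The sketch below assumes it follows the same layout as the JDK's `java.util.ServiceLoader` provider files (one fully qualified class name per line, `#` starting a comment, blank lines ignored) and shows how such a file can be read. The class `ServiceFileSketch` and the parser name `com.example.HelloWorldParser` are hypothetical; Tika's own loading code is not shown on the slides.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: read class names from a META-INF/services style file. */
final class ServiceFileSketch {
    /** Returns the provider class names listed in the file contents, in order, without duplicates. */
    static List<String> providerClassNames(String serviceFileContents) {
        List<String> names = new ArrayList<>();
        for (String line : serviceFileContents.split("\\R")) {
            int comment = line.indexOf('#');
            if (comment >= 0) {
                line = line.substring(0, comment); // drop everything after '#'
            }
            line = line.trim();
            if (!line.isEmpty() && !names.contains(line)) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Contents of a hypothetical META-INF/services/org.apache.tika.parser.Parser file
        String contents = "# Custom parsers shipped in this jar\n"
            + "com.example.HelloWorldParser\n"
            + "\n";
        System.out.println(providerClassNames(contents));
    }
}
```

In practice the file only needs to be present on the classpath, inside the jar that holds the parser, listing each parser class on its own line.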
  • Custom Tika Parser – “Hello World”
      import java.io.InputStream;
      import java.util.HashSet;
      import java.util.Set;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.mime.MediaType;
      import org.apache.tika.parser.AbstractParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.sax.XHTMLContentHandler;
      import org.xml.sax.ContentHandler;
      import org.xml.sax.SAXException;

      public class HelloWorldParser extends AbstractParser {
        // The mimetypes this parser handles
        public Set<MediaType> getSupportedTypes(ParseContext context) {
          Set<MediaType> types = new HashSet<MediaType>();
          types.add(MediaType.parse("hello/world"));
          return types;
        }

        // Emits fixed XHTML SAX events and metadata, whatever the input is
        public void parse(InputStream stream, ContentHandler handler,
                          Metadata metadata, ParseContext context) throws SAXException {
          XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
          xhtml.startDocument();
          // Document Heading
          xhtml.element("h1", "Hello World!");
          // To prove this worked, add some extra text to search for
          xhtml.startElement("p");
          xhtml.characters("To show that this went via the parser, we have ");
          xhtml.characters("some special text that we can search for. ");
          xhtml.characters("BADGER BADGER BADGER BADGER BADGER ");
          xhtml.characters("BADGER BADGER BADGER MUSHROOM MUSHROOM ");
          xhtml.endElement("p");
          // All Done
          xhtml.endDocument();
          metadata.set("hello","world");
          metadata.set("title","Hello World!");
          metadata.set("custom1","Hello, Custom Metadata 1!");
          metadata.set("custom2","Hello, Custom Metadata 2!");
        }
      }
  • Demo: “Hello World” Transformer Round-Trip
      var action = actions.create("transform");
      action.parameters["destination-folder"] = document.parent;
      action.parameters["assoc-type"] =
          "{http://www.alfresco.org/model/content/1.0}contains";
      action.parameters["assoc-name"] = document.name + "HW";
      if(document.mimetype == "hello/world") {
         // It's currently a "Hello World" file
         // Use Apache Tika to create a plain text version
         action.parameters["mime-type"] = "text/plain";
      } else {
         // It's a regular new text file
         // Have the command line tool make a "Hello World" version
         action.parameters["mime-type"] = "hello/world";
      }
      action.execute(document);
  • Demo: Excel to HTML, CSV and Text
      var nameBase = document.name.substring(0, document.name.lastIndexOf("."));
      var action = actions.create("transform");
      action.parameters["destination-folder"] = document.parent;
      action.parameters["assoc-type"] =
          "{http://www.alfresco.org/model/content/1.0}contains";
      action.parameters["assoc-name"] = nameBase + ".txt";
      action.parameters["mime-type"] = "text/plain";
      action.execute(document);
      action.parameters["assoc-name"] = nameBase + ".csv";
      action.parameters["mime-type"] = "text/csv";
      action.execute(document);
      action.parameters["assoc-name"] = nameBase + ".html";
      action.parameters["mime-type"] = "text/xml";
      action.execute(document);
  • Rendition Service
  • Standard Rendition Engines: Renditions Supported in Alfresco v4.0
      • reformat – access to the Content Transformation Service
      • image – crop, resize, etc
      • freemarker – runs a Freemarker Template against the content of the node
      • html – turns .docx files into clean HTML + images
      • xslt – runs an XSLT Transformation against the content of the node, XML content nodes only!
      • composite – execute several renditions in a series, eg reformat followed by image crop
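The composite engine is, in essence, function composition: each rendition takes content in and hands new content to the next one. The sketch below illustrates that idea with plain Java functions. Everything in it is invented for the example (the `Content` type, `reformat`, `crop` and `composite`); it does not use or mirror the Rendition Service API.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of "several renditions in a series"; not the Rendition Service API. */
final class CompositeRenditionSketch {
    /** Minimal stand-in for a piece of content: a mimetype plus image dimensions. */
    static final class Content {
        final String mimetype;
        final int width;
        final int height;

        Content(String mimetype, int width, int height) {
            this.mimetype = mimetype;
            this.width = width;
            this.height = height;
        }
    }

    /** Stand-in for the reformat engine: only the mimetype changes. */
    static UnaryOperator<Content> reformat(String targetMimetype) {
        return c -> new Content(targetMimetype, c.width, c.height);
    }

    /** Stand-in for an image crop: never grows the image beyond its current size. */
    static UnaryOperator<Content> crop(int maxWidth, int maxHeight) {
        return c -> new Content(c.mimetype,
            Math.min(c.width, maxWidth), Math.min(c.height, maxHeight));
    }

    /** Runs the given steps in order, feeding each result into the next step. */
    static UnaryOperator<Content> composite(List<UnaryOperator<Content>> steps) {
        return content -> {
            Content current = content;
            for (UnaryOperator<Content> step : steps) {
                current = step.apply(current);
            }
            return current;
        };
    }

    public static void main(String[] args) {
        UnaryOperator<Content> rendition =
            composite(List.of(reformat("image/jpeg"), crop(50, 50)));
        Content result = rendition.apply(new Content("image/tiff", 800, 600));
        System.out.println(result.mimetype + " " + result.width + "x" + result.height);
    }
}
```

Because a composite is itself just another step, composites can be nested inside other composites without any extra machinery.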
  • Persisted and Transient Definitions: For Complicated or Simple Renditions
      • To run a rendition, first create a Rendition Definition for a given Rendering Engine
      • Next, set your parameters on the definition
      • Finally, execute this against a source node
      • For very complicated, or very commonly used renditions, you don't want to have to create these definitions every time
      • Instead, save them to the Data Dictionary, and load via the Rendition Service on demand
      • Rendition Service provides Save and Load methods
  • Rendition Service – Call from Java
      // Retrieve the existing Rendition Definition
      QName renditionName = QName.createQName(
          NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
      RenditionDefinition renditionDef =
          renditionService.loadRenditionDefinition(renditionName);
      // Make some changes.
      renditionDef.setParameterValue(
          AbstractRenderingEngine.PARAM_MIME_TYPE, MimetypeMap.MIMETYPE_PDF);
      renditionDef.setParameterValue(
          RenditionService.PARAM_ORPHAN_EXISTING_RENDITION, true);
      // Persist the changes.
      renditionService.saveRenditionDefinition(renditionDef);
      // Run the Rendition
      ChildAssociationRef assoc = renditionService.render(
          sourceNode, renditionDef);
  • Renditions from JavaScript
      // Crop the image and place in a specified location
      var renditionDef = renditionService.createRenditionDefinition(
          "cm:cropResize", "imageRenderingEngine");
      renditionDef.parameters["destination-path-template"] =
          "/Company Home/Cropped Images/${name}.jpg";
      renditionDef.parameters["isAbsolute"] = true;
      renditionDef.parameters["xsize"] = 50;
      renditionDef.parameters["ysize"] = 50;
      renditionService.render(nodeRef, renditionDef);
      var renditions = renditionService.getRenditions(nodeRef);
  • Rendition Service – More Ways to Call: Actions, Rules, CMIS
      • Renditions are Actions, but by default hidden
      • Don't show up in Share when defining Rules
      • Don't show up in Explorer for Run Custom Action
      • They are available from Java and JS
      • Solution – create JS script to call the Rendition, then run that script from your Rule / from Explorer
      • No dedicated REST API is available
      • Renditions show up in CMIS
      • Or you can use standard Action and Node APIs
  • Custom Rendition Engines: For when a composite just isn't enough...
      • Rendition Engines are just a special kind of Action Executor, within the Action Framework
      • If you know how to write Custom Actions, you can write your own Rendering Engine!
      • org.alfresco.repo.rendition.executor.AbstractRenderingEngine provides a helpful superclass, with handy methods
      • See the Actions talk for more on Custom Actions and Custom Action Executors!
  • Demo: Crop and Resize an Image (Using Share Rules)
      var renditionDef = renditionService.createRenditionDefinition(
          "cm:cropResize", "imageRenderingEngine");
      renditionDef.parameters["destination-path-template"] =
          "/Company Home/Cropped Images/${name}.jpg";
      renditionDef.parameters["isAbsolute"] = true;
      renditionDef.parameters["xsize"] = 50;
      renditionDef.parameters["ysize"] = 50;
      renditionDef.parameters["percent_crop"] = true;
      renditionDef.parameters["crop-width"] = 75;
      renditionDef.parameters["crop-height"] = 60;
      renditionDef.parameters["crop_x"] = 20;
      renditionDef.parameters["crop_y"] = 150;
      renditionDef.execute(document);
  • Demo: Video Thumbnailing and Rendition
  • Demo: Word .docx → HTML & Images (Uses Web Quick Start)
  • Metadata Extraction, Content Transformations and Renditions
  • Any Questions?
  • Learn More
      • http://wiki.alfresco.com/wiki/Metadata_Extraction
      • http://wiki.alfresco.com/wiki/Content_Transformations
      • http://wiki.alfresco.com/wiki/Content_Transformation_and_Metadata_Extraction_with_Apache_Tika
      • http://wiki.alfresco.com/wiki/Rendition_Service
      • http://blogs.alfresco.com/wp/nickb/
      • twitter: @Alfresco, @Gagravarr