Bulk Export Utility
Richard McKnight
Solution Architect
Alfresco Professional Services
https://github.com/rmknightstar
Learn. Connect. Collaborate.
What is the Bulk Exporter?
The Bulk Exporter dumps all of the information associated with each node in a tree as line-delimited JSON.
How is it invoked?
Programmatically
• Java class provided
• Root-scoped object for general JS code
REST Call
• Web Script provided
Share UI
• JS Console
• Script Actions
What does it Generate?
• Line-delimited JSON (JSONL)
• 1 line per object
• Everything that you would see in the node browser about each object
Item           Description
nodeRef        The unique identifier of the object.
path           A slash-delimited path generated from the QNAMEs in the primary parent associations.
type           The QNAME of the type of the object.
aspects        An array of the aspect QNAMEs.
properties     A map of the properties indexed by the QNAME of each property. Multi-valued properties will be arrays.
childAssocs    An array of the associations with child nodes. Each entry is a pipe-delimited list of parent nodeRef, child nodeRef, association type QNAME, QNAME of the association name and a primary parent flag.
parentAssocs   An array of associations with the parent(s) of the node.
targetAssocs   An array of target associations. Each entry is a pipe-delimited list of source nodeRef, target nodeRef and association type QNAME.
sourceAssocs   An array of source associations.
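As an illustration of the childAssocs layout described in the table, here is a minimal sketch that splits one entry into its five fields. The class name ChildAssocEntry is hypothetical and not part of the utility, and the sketch assumes the primary parent flag is written as "true" or "false", which the slides do not confirm.

```java
/** Hypothetical holder for one childAssocs entry; not part of the Bulk Exporter itself. */
final class ChildAssocEntry {
    final String parentRef;
    final String childRef;
    final String assocTypeQName;
    final String assocNameQName;
    final boolean primary;

    private ChildAssocEntry(String[] fields) {
        parentRef = fields[0];
        childRef = fields[1];
        assocTypeQName = fields[2];
        assocNameQName = fields[3];
        // Assumption: the flag is serialized as "true"/"false".
        primary = Boolean.parseBoolean(fields[4]);
    }

    /** Splits a pipe-delimited entry in the field order given in the table. */
    static ChildAssocEntry parse(String entry) {
        // A limit of -1 keeps trailing empty fields, so a short entry is rejected, not truncated.
        String[] fields = entry.split("\\|", -1);
        if (fields.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields but found " + fields.length);
        }
        return new ChildAssocEntry(fields);
    }
}
```

A targetAssocs entry could be handled the same way with three fields instead of five.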
What does the output Look Like?
JSONL
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
{"path":"/{http://www.alfresco.org/...
Single Line, Formatted
{
"aspects": [
"{http://www.alfresco.org/model/recordsmanagement...
"{http://www.alfresco.org/model/content/1.0}ownab...
],
"targetAssocs": [],
"type": "{http://www.alfresco.org/model/recordsmana...
"parentAssocs": [...
"workspace://SpacesStore/3a2c8324-7b38-4dcc-9ea8-...
],
"properties": {
"{http://www.alfresco.org/model/system/1.0}node-d...
"{http://www.alfresco.org/model/content/1.0}owner...
"{http://www.alfresco.org/model/content/1.0}modif...
}, ...
"path": "/{http://www.alfresco.org/model/applicatio...
"sourceAssocs": [],
"nodeRef": "workspace://SpacesStore/06582f50-e6b7-4...
"childAssocs": [
"workspace://SpacesStore/06582f50-e6b7-438b-9eee-...
]
}
Why was it developed?
A large customer wanted to do an Alfresco-to-Alfresco migration:
1. Export full node information to an export file
2. Copy content to the new Content Store
3. Generate the Bulk File System Importer property files
4. Run the Bulk File System Importer
Other Potential Use Cases
• Alfresco to Alfresco Bulk Transfers
• Pushing curated content and metadata to other systems
• Content Archival
Bulk Exporter Design
• Collection Provider
– Determines which nodes are to be processed
– Calls the Emitter for each node
• Emitter
– Creates a string representation for each node
– Sends that string to the export file via the Writer
Bulk Exporter Design: Collection Provider
• Provides a collection of objects to be exported
• Current implementation uses a Tree Walker to:
– Visit every file and folder in a top-level directory
– Send the node to the Emitter for processing
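The tree-walking idea can be sketched roughly as follows. This is not the utility's FileFolder TreeWalker: SketchNode and SketchTreeWalker are hypothetical stand-ins that use an in-memory tree instead of Alfresco nodes, and the callback plays the role of the Emitter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical in-memory stand-in for a repository node. */
final class SketchNode {
    final String id;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String id) {
        this.id = id;
    }

    /** Adds a child and returns this node so that trees can be built fluently. */
    SketchNode add(SketchNode child) {
        children.add(child);
        return this;
    }
}

/** Hypothetical depth-first walker; the real walker works against the Alfresco services. */
final class SketchTreeWalker {
    /** Visits every node below top, and top itself when includeTop is true. */
    static void walk(SketchNode top, boolean includeTop, Consumer<SketchNode> emitter) {
        if (includeTop) {
            emitter.accept(top);
        }
        for (SketchNode child : top.children) {
            // Descendants are always emitted; only the top node is optional.
            walk(child, true, emitter);
        }
    }
}
```

Recursion keeps the sketch short. For very deep trees an explicit stack may be the safer choice.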
Bulk Exporter Design: Emitter
Decided on a line-delimited JSON (JSONL) implementation
• JSON is the “lingua franca” of data between systems
• Line-delimited JSON lines can be handled independently of each other, reducing memory requirements
The underlying implementation is straightforward
• The Tree Walker passes each node that it visits to the Emitter.
• The Emitter creates a JSON representation of the node and then writes the JSON string as a single line to the Writer that was passed to it upon construction.
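To illustrate the memory point on the consuming side, here is a minimal sketch of a reader that hands one line at a time to a handler, so only the current record is held in memory. JsonlLineReader is a hypothetical helper, not part of the utility, and the choice of JSON parser for each line is left to the handler.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

/** Hypothetical helper that streams a JSONL export one record at a time. */
final class JsonlLineReader {
    /** Passes each non-blank line to the handler and returns how many lines were handled. */
    static long forEachLine(Reader source, Consumer<String> handler) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // tolerate blank lines, for example a trailing newline
                }
                handler.accept(line); // each line is expected to be one complete JSON object
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

In practice the Reader would typically wrap the export file, and the handler would parse each line with whatever JSON library the consuming system already uses.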
Different Formats
• Two Formats in the Sample Code
– Full Export (Line Delimited JSON)
– Manifest for Bulk Export (work in progress)
• Each format implements the BulkExportEmitter interface shown below:
public interface BulkExportEmitter {
void setServiceRegistry(ServiceRegistry serviceRegistry);
void setTopNode(NodeRef nodeRef);
String separatorString();
String nodeToString(NodeRef nodeRef);
boolean includeTopNode();
String startString();
String endString();
}
Java API
public interface BulkExporter {
void export(
NodeRef topDir,
Writer manifest,
BulkExportEmitter emitter
);
}
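One plausible way an exporter could drive an emitter is sketched below. The call order (start string, then each node with a separator in between, then end string) is inferred from the method names and is an assumption, not the confirmed implementation. SketchEmitter is a cut-down, hypothetical stand-in for BulkExportEmitter: it uses plain String ids instead of NodeRef and omits the ServiceRegistry, setTopNode and includeTopNode hooks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

/** Hypothetical cut-down stand-in for BulkExportEmitter (String ids replace NodeRef). */
interface SketchEmitter {
    String startString();
    String separatorString();
    String nodeToString(String nodeId);
    String endString();
}

/** Emits one JSON object per line; a JSON array emitter could instead use "[", "," and "]". */
final class JsonlSketchEmitter implements SketchEmitter {
    public String startString() { return ""; }
    public String separatorString() { return "\n"; }
    public String nodeToString(String nodeId) {
        // No JSON escaping here; a real emitter would build the object with a JSON library.
        return "{\"nodeRef\":\"" + nodeId + "\"}";
    }
    public String endString() { return "\n"; }
}

/** Hypothetical driver showing the assumed call order. */
final class SketchExporter {
    static void export(List<String> nodeIds, Writer out, SketchEmitter emitter) {
        try {
            out.write(emitter.startString());
            boolean first = true;
            for (String id : nodeIds) {
                if (!first) {
                    out.write(emitter.separatorString()); // separator goes between nodes only
                }
                out.write(emitter.nodeToString(id));
                first = false;
            }
            out.write(emitter.endString());
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the start, separator and end strings on the emitter is what would let the same driver produce either JSONL or a single enclosing structure without changing the walking code.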
JavaScript API
public class BulkExporterScriptApi {
void exportFullToFile(
ScriptNode topDir,
String manifest
);
void exportManifestToFile(
ScriptNode topDir,
String manifest
);
String exportFullToString(
ScriptNode topDir
);
String exportManifestToString(
ScriptNode topDir
);
}
REST API
Full Export:
/alfresco/s/devcon2019/fullexport/{uuid}
Manifest:
/alfresco/s/devcon2019/manifest/{uuid}
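A small sketch of building these two URLs from a base address follows. ExportUrls is a hypothetical helper. The base URL and the idea that {uuid} is the id portion of the top node's nodeRef are assumptions, and the HTTP method and authentication are not stated on the slide, so the sketch stops at constructing the URI.

```java
import java.net.URI;

/** Hypothetical helper that fills in the two web script URL templates. */
final class ExportUrls {
    private static final String FULL_EXPORT = "/alfresco/s/devcon2019/fullexport/";
    private static final String MANIFEST = "/alfresco/s/devcon2019/manifest/";

    /** URL of the full export web script for the given node id. */
    static URI fullExport(String baseUrl, String uuid) {
        return URI.create(stripTrailingSlash(baseUrl) + FULL_EXPORT + uuid);
    }

    /** URL of the manifest web script for the given node id. */
    static URI manifest(String baseUrl, String uuid) {
        return URI.create(stripTrailingSlash(baseUrl) + MANIFEST + uuid);
    }

    // Avoids a double slash when the base URL is written with a trailing "/".
    private static String stripTrailingSlash(String baseUrl) {
        return baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
    }
}
```

The resulting URI could then be passed to any HTTP client with the credentials the repository requires.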
Integration with Bulk Object Mapper
1. Bulk Object Mapper Manifest Generator, a variant of the Bulk Exporter, generates the Manifest.
2. Selected Content Copier copies content into the Content Store based upon the content URLs in the Manifest.
3. Bulk Object Mapper processes the Manifest.
Bulk Transfer Use Case
• Bulk Transmitter is a variant of the Bulk Exporter.
• Bulk Receiver is a variant of the Bulk Object Mapper.
• The Bulk Receiver and Transmitter will understand the same manifest format.
• The Bulk Receiver will create the objects with missing content.
• The Bulk Receiver will launch a process to copy the content in the background.
• The Bulk Receiver will update the objects as their content is copied.
• The Bulk Receiver would need to handle updates to existing objects.
Pluggable Tree Walkers
• Currently delivered with a FileFolder TreeWalker
• Could add a full TreeWalker (would pick up avatars, rules, actions, etc.)
Alternate Collection Providers
• Alternatives
– Search
– Tracker
– Filtering/Combo (Search, Tracker or Tree)
• Would need a different structure passed in
• Warning when trying to replicate the structure (parents might be missing)
• Good for publishing of content
Running in the Background
• Good for potentially long-running exports
• Would need to provide a status method to see the progress of the export
• Need a place for temporary storage of the output
Key Takeaways
• Any administrator can take an existing tree of objects and export the metadata (including Content Data)
• Built on the Alfresco Public API, so it should be stable across upgrades
• There are a number of easy-to-build enhancements that could be added to the current implementation
Support, Licensing and Availability
• Open source core available on GitHub
– Free to use and extend
– No bells and whistles
– No support
– https://github.com/rmknightstar/devcon2019
• Consulting maintains an internal version as an accelerator
– Available as part of Professional Services engagements
• Versions → 5.0+ (recompile for earlier versions)
Bulk Export Utility
Thank you!

Bulk Export Tool for Alfresco


Editor's Notes

  • #3 This dumps everything, including: associations, content URLs, all metadata
  • #5 Talk about efficiency
  • #6 This is the full manifest output. As part of refactoring, I added the ability to create alternative outputs.
  • #7 We stopped at the bulk exporter and decided not to generate the old-style Bulk File System Importer files. A "generate copy script" step would produce a shell script. Alternatively, a Python script could read the export file and copy the files (to file systems or S3 buckets). Files must be in place before the Bulk Object Mapper is run.
  • #11 A JSON Array is a large single entity. Manifest output is sent out as an
  • #12 Start Output; End Output; Pass the Top Node to the Emitter; Separator; Node to String
  • #17 Since this needs to happen in real time, performance is an issue. Making the objects available while they are waiting on the content lets users get early awareness of the content. Transferring the content in the background does bypass the normal APIs, but it allows for better data transfer performance.