
Background

On a recent project I spent some time investigating and implementing a content migration from a legacy content system into Alfresco.  After researching numerous alternatives, I found that combining two popular tools with a small extension produced a high-performance migration solution.  The project had a relatively straightforward requirement:

Inject content into Alfresco from a source content management system.  Take into account performance impacts on the existing and target system, as well as the overall cost of the migration.  The source content management system has millions of content documents to migrate.  Migration may be batched based on business requirements.

This article is focused on the ingestion of content into Alfresco.

Technologies

Various technology options were available to perform the content import.  I spent some time scouring the web looking for migration tools and technologies as well as considering direct access to the source and target system.  It quickly became obvious (partly due to Peter Monks’ blog post on one approach) that there were a handful of options available:

  1. Direct API calls to Alfresco, using CMIS, REST, SOAP, etc.
  2. Alfresco’s JLAN Server.
  3. Alfresco Content Import using ACP Files.
  4. Open Migrate (from Technology Services Group)
  5. Bulk Filesystem Import (from Peter Monks)

After carefully considering the options and understanding the client’s needs, I narrowed the list to three:  Direct API calls, Open Migrate or Bulk Filesystem Import.  Direct API calls would require custom migration code, which seemed like a fairly high cost for the value added.  We were also concerned about the performance of remote API calls of any kind, since file contents would travel over the wire.  That left us with the two tools.  It is worth explaining how these two tools work:

Open Migrate is a configurable, extensible framework for content migrations.  The basic flow is to retrieve content from the source system, map it to the target representation and deliver the content to the target system.  The mapping step is configurable and allows you to map and modify content as appropriate during a migration.  The tool has the potential to be a complete end-to-end solution for many content migrations.  The out-of-the-box implementation for an Alfresco target calls the Alfresco remote APIs and transfers content over the wire during import.  On the surface this isn’t a huge deal, but for a large-volume migration it could be a performance barrier.

The Bulk Filesystem Import tool is a highly specialized tool designed to import a set of folders and files on a local filesystem into Alfresco.  The tool runs in process (it is deployed with Alfresco) and therefore has no over-the-wire performance cost.  The tool is simply pointed at a folder hierarchy, and the folders and content within it are replicated in Alfresco.  An optional properties file accompanying each item can specify the content’s type, aspects and metadata property values.  The metadata format required by the tool is a simple properties file which, while perhaps somewhat limited, would meet our needs based on an analysis of the content in the source system.  Peter claims a ~20x performance improvement over a CIFS approach to migration.  This got my attention and helped validate my concern about remote API calls, specifically around sending file contents over the wire.
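To make the metadata file idea concrete, here is a minimal sketch of building one with java.util.Properties. The `type` and `aspects` key names, the `cm:` property names and the sample values are assumptions for illustration; check the Bulk Filesystem Import documentation for the keys your version expects. Note that `Properties.store` escapes colons (`cm\:title`), and `Properties.load` reads them back unchanged.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

class MetadataPropertiesSketch {

    // Builds the metadata for one document. The "type" and "aspects" keys and
    // the cm:* property names are assumptions used for illustration only.
    static Properties buildMetadata(String title, String description) {
        Properties props = new Properties();
        props.setProperty("type", "cm:content");
        props.setProperty("aspects", "cm:versionable,cm:titled");
        props.setProperty("cm:title", title);
        props.setProperty("cm:description", description);
        return props;
    }

    // Serializes the properties to the standard java.util.Properties text format.
    static String toFileContents(Properties props) {
        StringWriter out = new StringWriter();
        try {
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Parses the text back, the way any Properties-based reader would.
    static Properties parse(String contents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(contents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String contents = toFileContents(buildMetadata("Q3 Report", "Quarterly results"));
        System.out.println(contents);
        System.out.println(parse(contents).getProperty("cm:title"));
    }
}
```

The text produced here would be saved next to the content file under the same name with the metadata suffix appended, for example report.pdf.metadata.properties beside report.pdf.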

Decision Time

Open Migrate was the first choice from a cost perspective.  We hoped to configure the tool with little to no effort and execute a simple migration.  Using the Bulk Filesystem Import tool was a reasonable alternative but would require translating content from the source system into the format understood by the import tool.  The source system’s content mapped reasonably easily to Alfresco, so this didn’t seem like too large a hurdle.  I was therefore intrigued by the idea of combining a good end-to-end framework with the high performance of Bulk Filesystem Import, and I explored how I could produce a Bulk Filesystem Import folder structure as a target in Open Migrate.

Implementation

I configured Open Migrate with a “simple migration target” which is a container that allows extension by writing listeners.

    <bean id="MigrationTarget"
        class="com.tsgrp.migration.target.SimpleMigrationTarget"
        scope="prototype">
        <property name="eventListeners">
            <list>
                <ref bean="AlfrescoBulkFileSystemImportWriter" />
            </list>
        </property>
    </bean>

Then I configured a target listener to perform the actual writing to disk. Note, targetDir is a property specified in the Open Migrate properties configuration and is the top level output directory from the Open Migrate process – where you’ll ultimately point the Bulk Filesystem Import tool.

    <bean id="AlfrescoBulkFileSystemImportWriter"
        class="com.ziaconsulting.migration.event.target.AlfrescoBulkFileSystemImportWriter"
        scope="prototype">
        <property name="targetDir" value="${targetDir}" />
    </bean>

The implementation of the listener is fairly straightforward. Each target migration node in Open Migrate has already been properly populated with the desired attributes (metadata properties). The files need to be laid into the directories and the metadata properties need to be written to a format prescribed by the Bulk Filesystem Import tool (e.g. filename.metadata.properties).  It should be noted that due to issue 19 in the Bulk Filesystem Import tool, dates are handled specially.  Also see ISO-8601.  In this migration I’ve only handled single-valued dates as noted below, and String properties.  The properties writing is accomplished as follows:

    private void createNodeProperties(MigrationNode node) {
        Properties props = new Properties();

        for (String attr : node.getAttributeNames(false)) {
            if (attr.startsWith("migration_info_node_")) {
                // Skip this attribute, it's an open-migrate migration detail,
                // not represented in Alfresco.
                continue;
            }
            NodeAttribute nodeAttr = node.getAttribute(attr);

            String value = EMPTY_STRING;

            // Guard against a missing attribute before inspecting its data type
            if (nodeAttr != null && nodeAttr.getDataType().getJavaTypeName().equals(Date.class.getName())) {
                // TODO Doesn't handle multi-valued date properties
                if (nodeAttr.getFirst() != null) {
                    // Only store date fields which have a value
                    Date date = (Date) nodeAttr.getFirst();
                    props.setProperty(attr, ISO8601DateFormat.format(date));
                }
            } else {
                if (nodeAttr != null && nodeAttr.getFirst() != null) {
                    value = node.getAttribute(attr).valuesToString(DELIM);
                }

                props.setProperty(attr, value);
            }
        }

        if (props.size() == 0) {
            // If a node has no properties, don't write the file.
            return;
        }

        // Helper method to get the file path based on targetDir and the node's folder.
        String targetFullFilePath = PathHelper.getContentNodeFilePath(getTargetDir(), node) + ".metadata.properties";
        logger.debug("Target will create properties file " + targetFullFilePath);

        // Get the file object
        File targetFile = new File(targetFullFilePath);

        try {
            // createNewFile() returns false if the file already exists; in that
            // case the existing metadata file is left untouched.
            if (targetFile.createNewFile()) {
                // try-with-resources closes the writer even if store() fails
                try (FileWriter writer = new FileWriter(targetFile)) {
                    props.store(writer, null);
                }
            }
        } catch (IOException e) {
            MigrationException.throwException(ExceptionType.TARGET_NODE_EXCEPTION, "I/O Exception on File Folder Migration Target", e);
        }
    }
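
The date branch in the listener above relies on an ISO8601DateFormat helper. For readers without that class on the classpath, the following is a rough stand-in using java.time that formats a java.util.Date as an ISO-8601 timestamp in UTC. The exact pattern (milliseconds, `Z` suffix) is an assumption; confirm which ISO-8601 variant your version of the import tool accepts.

```java
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;

class Iso8601Sketch {

    // Pattern is an assumption: date, time with milliseconds, and a zone offset
    // that prints as "Z" when the offset is zero.
    private static final DateTimeFormatter ISO_UTC =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
                             .withZone(ZoneOffset.UTC);

    // Converts a legacy java.util.Date to an ISO-8601 string in UTC.
    static String format(Date date) {
        return ISO_UTC.format(date.toInstant());
    }

    public static void main(String[] args) {
        // Epoch start, prints 1970-01-01T00:00:00.000Z
        System.out.println(format(new Date(0L)));
    }
}
```

Formatting in UTC keeps the output independent of the timezone of the machine running the migration.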

As for laying out the binary files, the exercise is largely left to the reader. In our case, the files were accessible on disk and we performed a copy from the source system to the target location for importing. For each target migration node the code writes out the metadata properties and the associated binary content file.
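
As a starting point, a copy step along these lines may be enough when the source files are reachable on disk. This is a sketch under assumptions, not the code used in the project: the class, method and folder names are hypothetical, and the only detail taken from the article is the .metadata.properties suffix.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ContentCopySketch {

    // Copies one source file into targetDir/relativeFolder, creating the
    // folder hierarchy as needed, and returns the path of the copy.
    static Path copyContent(Path sourceFile, Path targetDir, String relativeFolder) {
        try {
            Path folder = targetDir.resolve(relativeFolder);
            Files.createDirectories(folder);
            Path target = folder.resolve(sourceFile.getFileName());
            return Files.copy(sourceFile, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Path of the metadata file that sits beside a given content file.
    static Path metadataFileFor(Path contentFile) {
        return contentFile.resolveSibling(contentFile.getFileName() + ".metadata.properties");
    }

    public static void main(String[] args) throws IOException {
        Path sourceDir = Files.createTempDirectory("source");
        Path targetDir = Files.createTempDirectory("target");
        Path source = Files.write(sourceDir.resolve("report.txt"), "hello".getBytes());

        Path copied = copyContent(source, targetDir, "finance/2011");
        System.out.println("Content:  " + copied);
        System.out.println("Metadata: " + metadataFileFor(copied));
    }
}
```

Keeping the content file and its metadata file in the same folder with matching names is what lets the import tool pair them up.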

Running the Bulk Filesystem Import utility is as simple as pointing to the target directory and watching your documents import (quickly!) into Alfresco.
