Extracting Metadata in Alfresco

by Jeff Rosler, Solutions Architect at Zia

When files are imported into Alfresco, each is stored with additional information such as title, description, and text. Out of the box, Alfresco uses Apache Tika to extract metadata from the content and sets the properties that have been mapped. The TikaAutoMetadataExtracter class loads the supported mime types, so all you have to do is create a bean that references that class and set the properties you want extracted.

The following are some simple samples showing how metadata can be pulled from different mime types and set on Alfresco properties. Since Alfresco uses Apache Tika as its basic metadata extractor, you can extract metadata for all the mime types that Tika supports. The version of Tika that Alfresco uses (for Alfresco 5.0.2.5 and 5.1) is basically Tika 1.6; the Tika documentation lists the file types it supports. The TikaAutoMetadataExtracter class loads all the mime types that the embedded version of Tika supports, so all you need to do is create a Spring bean that references that class and map the embedded properties to the Alfresco properties you’d like to have set. You don’t have to write any custom code.

Example 0 – Set logging to see what metadata can be extracted

Before defining your metadata extraction, it’s good to set your logging level for metadata extraction to DEBUG. When you do this, the extracted metadata for a file is shown in the log. This lets you correctly choose the embedded metadata property names to configure. You can set this by going to your log4j.properties file for the repo (alfresco) and adding the following line.

log4j.logger.org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter=DEBUG

Restart Alfresco and import a file. You should see something like this in the log. You can see properties with namespaces such as dc:title (the dc stands for Dublin Core, a metadata standard) as well as other properties that don’t have a namespace. You can map these embedded properties to standard or custom Alfresco properties.

2016-02-03 10:03:49,474 DEBUG [content.metadata.AbstractMappingMetadataExtracter]
 [http-bio-8080-exec-10] Extracted Metadata from ContentAccessor[ 
 contentUrl=store://2016/2/3/10/3/068b7c2b-1f7f-4b12-aa90-e78794eb8e77.bin, 
 mimetype=application/vnd.openxmlformats-officedocument.wordprocessingml.document,
 size=286436, encoding=UTF-8, locale=en_US]
 Found: {date=2016-01-22T18:59:00Z, Total-Time=1, extended-properties:AppVersion=14.0000,
 meta:paragraph-count=12, subject=beer, ipsum, meta:print-date=2016-01-22T18:59:00Z,
 Word-Count=405, meta:line-count=45, Manager=null, Template=Normal.dotm, Paragraph-Count=12,
 meta:character-count-with-spaces=2246, dc:title=Tom's Ipsum Beer, modified=2016-01-22T18:59:00Z,
 meta:author=Jeff Rosler, meta:creation-date=2015-12-31T15:49:00Z,
 Last-Printed=2016-01-22T18:59:00Z, extended-properties:Application=Microsoft Macintosh Word,
 author=Jeff Rosler, created=2015-12-31T15:49:00Z, Creation-Date=2015-12-31T15:49:00Z,
 Character-Count-With-Spaces=2246, Last-Author=Jeff Rosler, Character Count=1853, Page-Count=2,
 Application-Version=14.0000, extended-properties:Template=Normal.dotm, Author=Jeff Rosler,
 publisher=Zia Consulting, meta:page-count=2, cp:revision=4,
 Keywords=beer, ipsum, meta:word-count=405,
 dc:creator=Jeff Rosler, extended-properties:Company=Zia Consulting,
 description=beer, ipsum, dcterms:created=2015-12-31T15:49:00Z,
 Last-Modified=2016-01-22T18:59:00Z, dcterms:modified=2016-01-22T18:59:00Z,
 title=Tom's Ipsum Beer, Last-Save-Date=2016-01-22T18:59:00Z, meta:character-count=1853,
 Line-Count=45, meta:save-date=2016-01-22T18:59:00Z, Application-Name=Microsoft Macintosh Word,
 extended-properties:TotalTime=1, extended-properties:Manager=null,
 Content-Type=application/vnd.openxmlformats-officedocument.wordprocessingml.document,
 creator=Jeff Rosler, comments=null, dc:subject=beer, ipsum, meta:last-author=Jeff Rosler,
 xmpTPg:NPages=2, Revision-Number=4, meta:keyword=beer, ipsum, dc:publisher=Zia Consulting}

Example 1 – Set author, title, description

Specify your Spring bean. You can give it any id you want (as long as it is a legitimate XML id) and point it to the TikaAutoMetadataExtracter class (yes, I know that isn’t how you spell Extractor, but the code spells it with an “e” instead of an “o”). In the code block below, we override the default mapping and point to a separate properties file. The properties could have been listed inline here, but pointing to a properties file allows for easier editing.

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">

   <bean id="extractor.auto" class="org.alfresco.repo.content.metadata.TikaAutoMetadataExtracter" parent="baseMetadataExtracter">
      <constructor-arg>
         <ref bean="tikaConfig"/>
      </constructor-arg>
      <property name="inheritDefaultMapping">
         <value>false</value>
      </property>
      <property name="mappingProperties">
         <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
            <property name="location">
               <value>classpath:alfresco/extension/TikaAutoMetadataExtracter.properties</value>
            </property>
         </bean>
      </property>
   </bean>

</beans>

After specifying a Spring bean that points to a properties file (e.g. TikaAutoMetadataExtracter.properties), declare in that file any Alfresco namespaces used by your content model properties, followed by each property to be mapped. If you map to properties that are defined on aspects, those aspects are applied to the content node automatically during extraction. Put the embedded metadata property name on the left of the equals sign and the Alfresco property on the right. If the embedded property has a namespace prefix (e.g. dc:title), remember to escape the colon with a backslash (e.g. dc\:title). You only need to do that in the property key, not in the value.

# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

# Mappings
author=cm:author
dc\:title=cm:title
description=cm:description
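
To see why the backslash matters, here is a small standalone sketch using plain java.util.Properties. It assumes the mapping file is read with the standard Java properties rules; the class name MappingEscapeDemo and its helper methods are made up for this illustration and are not part of Alfresco.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical demo class, not part of Alfresco.
class MappingEscapeDemo {

    // Parse properties-file text with the standard java.util.Properties rules.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory string
        }
        return props;
    }

    // Look up one key in the parsed text; returns null when the key is absent.
    static String valueFor(String text, String key) {
        return parse(text).getProperty(key);
    }

    public static void main(String[] args) {
        // "\\:" in Java source is the two characters "\:" in the properties file.
        String escaped = "dc\\:title=cm:title\n";
        String unescaped = "dc:title=cm:title\n";

        // Escaped colon: the key is the full embedded property name.
        System.out.println(valueFor(escaped, "dc:title"));   // cm:title

        // Unescaped colon: the first ':' ends the key, so the key is just "dc".
        System.out.println(valueFor(unescaped, "dc:title")); // null
        System.out.println(valueFor(unescaped, "dc"));       // title=cm:title
    }
}
```

The colon in the value (cm:title) needs no escaping because the key has already ended by the time the parser reaches it.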

Example 2 – Setting multiple Alfresco properties

Embedded metadata can be mapped to multiple Alfresco properties by specifying those properties as comma-separated values. The example below sets the embedded author value on both cm:author and cm:description.

# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
 
# Mappings
author=cm:author,cm:description
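
If it helps to picture how the namespace declaration and a comma-separated mapping fit together, here is a rough sketch in plain Java. This is not Alfresco's code: the MappingSketch class, the resolve method, and the {uri}localName output format are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Illustrative sketch only; this is NOT Alfresco's implementation.
class MappingSketch {

    static final String NAMESPACE_KEY_PREFIX = "namespace.prefix.";

    // Returns: embedded property name -> fully qualified target property names.
    static Map<String, List<String>> resolve(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory string
        }

        // Pass 1: collect the declared namespace prefixes (e.g. cm -> URI).
        Map<String, String> namespaces = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(NAMESPACE_KEY_PREFIX)) {
                namespaces.put(key.substring(NAMESPACE_KEY_PREFIX.length()),
                               props.getProperty(key).trim());
            }
        }

        // Pass 2: split each mapping value on commas and expand the prefixes.
        Map<String, List<String>> mappings = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(NAMESPACE_KEY_PREFIX)) {
                continue; // namespace declarations are not mappings
            }
            List<String> targets = new ArrayList<>();
            for (String target : props.getProperty(key).split(",")) {
                String name = target.trim();
                int colon = name.indexOf(':');
                String uri = colon < 0 ? null : namespaces.get(name.substring(0, colon));
                if (uri == null) {
                    throw new IllegalArgumentException("No namespace declared for: " + name);
                }
                targets.add("{" + uri + "}" + name.substring(colon + 1));
            }
            mappings.put(key, targets);
        }
        return mappings;
    }

    public static void main(String[] args) {
        String text = "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0\n"
                    + "author=cm:author,cm:description\n";
        System.out.println(resolve(text));
    }
}
```

Running main prints a single author entry with two targets, one for each property listed after the equals sign.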

Example 3 – Specifying when properties are extracted

The metadata extracter has a setting called OverwritePolicy, which specifies when an Alfresco property is overwritten. For example, you might not want your extracter to overwrite values every time a new version of a file is stored, as this would overwrite any mapped property values that were updated manually via Share or automatically through actions, workflows, or other processes. Therefore, Alfresco defaults the OverwritePolicy to PRAGMATIC. With this policy, a value is set only if the extracted property is not null and the Alfresco property is not set or is empty.

However, if you want extraction to happen every time (e.g. whenever content is updated), set the OverwritePolicy to EAGER. You do this by passing it as a property of your extracter bean, as can be seen below.

<bean id="extractor.auto" class="org.alfresco.repo.content.metadata.TikaAutoMetadataExtracter" parent="baseMetadataExtracter">
   <constructor-arg>
      <ref bean="tikaConfig"/>
   </constructor-arg>
   <property name="inheritDefaultMapping">
      <value>false</value>
   </property>
   <property name="overwritePolicy">
      <value>EAGER</value>
   </property>
   <property name="mappingProperties">
      <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
         <property name="location">
            <value>classpath:alfresco/extension/TikaAutoMetadataExtracter.properties</value>
         </property>
      </bean>
   </property>
</bean>
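
The difference between the two policies described above can be sketched as a simple decision function. This is a simplified model written for this post, not Alfresco's actual code: the OverwritePolicySketch class and shouldOverwrite method are hypothetical, only the two policies discussed here are modeled, and it assumes that a null extracted value never overwrites anything under either policy.

```java
// Simplified model of the two policies discussed above; NOT Alfresco's code.
class OverwritePolicySketch {

    enum Policy { EAGER, PRAGMATIC }

    // Decide whether an extracted value should replace the current Alfresco value.
    static boolean shouldOverwrite(Policy policy, Object extracted, Object current) {
        if (extracted == null) {
            return false; // assumption: nothing extracted means nothing to set
        }
        switch (policy) {
            case EAGER:
                // Always take the extracted value, even over manual edits.
                return true;
            case PRAGMATIC:
                // Only fill in properties that are missing or empty.
                return current == null || current.toString().isEmpty();
            default:
                throw new IllegalStateException("Unhandled policy: " + policy);
        }
    }

    public static void main(String[] args) {
        // A title that was edited by hand in Share survives under PRAGMATIC...
        System.out.println(shouldOverwrite(Policy.PRAGMATIC, "Tom's Ipsum Beer", "Edited in Share")); // false
        // ...but is replaced under EAGER.
        System.out.println(shouldOverwrite(Policy.EAGER, "Tom's Ipsum Beer", "Edited in Share"));     // true
        // An empty property is filled in under PRAGMATIC.
        System.out.println(shouldOverwrite(Policy.PRAGMATIC, "Tom's Ipsum Beer", ""));                // true
    }
}
```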

Example 4 – Setting tags

Support for mapping tags was added in Alfresco 4.2.c. Details are mentioned in this blog post. You can easily add that to your extraction mapping. It just needs to be enabled in the extract-metadata bean and then the mapping set within your properties file.

NOTE: When setting tags, don’t do this while running from the Alfresco SDK using springloaded. Tagging won’t work, and as soon as you try to import content with tags (after you’ve made the updates below), your content will fail to load.

ALSO NOTE: I noticed in Alfresco 5.0 that the embedded keywords were being concatenated into a single comma-separated tag. This has been identified as a bug, and a JIRA issue (MNT-15497) was created to fix it. The fix was put in 5.0.4 and 5.1.1.

The following code block can be added to your Spring bean XML config file to enable tagging.

<!--
    Override metadata extraction bean from action-services-context.xml to turn on the taggingService and enableStringTagging
    This will allow keywords to get mapped to tags.
 -->
<bean id="extract-metadata" class="org.alfresco.repo.action.executer.ContentMetadataExtracter" parent="action-executer">
  <property name="nodeService">
    <ref bean="NodeService" />
  </property>
  <property name="contentService">
    <ref bean="ContentService" />
  </property>
  <property name="dictionaryService">
    <ref bean="dictionaryService" />
  </property>
  <property name="taggingService">
    <ref bean="TaggingService" />
  </property>
  <property name="metadataExtracterRegistry">
    <ref bean="metadataExtracterRegistry" />
  </property>
  <property name="applicableTypes">
    <list>
      <value>{http://www.alfresco.org/model/content/1.0}content</value>
    </list>
  </property>
  <property name="carryAspectProperties">
    <value>true</value>
  </property>
  <property name="enableStringTagging">
    <value>true</value>
  </property>
</bean>

After tagging is enabled, update your properties file to map the appropriate embedded keyword property to cm:taggable. The example below uses the embedded Keywords property.

# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
 
# Mappings
Keywords=cm:taggable

Jeff Rosler has more than 15 years’ experience architecting and developing enterprise content management solutions for customers across multiple verticals to help solve different business challenges. These solutions include digital asset management, component content management using XML, business process management, and web content management utilizing Alfresco and related standards, technologies, and products.
