Using Google AutoML Translation with Magnolia CMS

Joaquín Alfaro
7 min read · Jun 28, 2019


Google AutoML Translation is a service provided by Google for creating custom translation models, useful when the generic Google Translation API is not enough, for example because your content uses domain-specific vocabulary.

This service is particularly interesting for a CMS (Content Management System). Why? Because the CMS already stores translations of content related to your business, and you can use those translations to build your own translation model.

In this article I will show how to integrate Google AutoML Translation into Magnolia CMS in order to translate content with a model based on existing translations.

How Google AutoML Translation works

AutoML Translation is really simple to use: you create a dataset from your custom translations, and Google builds a custom model on top of the generic Translation API model, adding a layer that helps the model pick the right translation based on your dataset.

To start using AutoML Translation it is necessary to create a project in Google Cloud and activate the Cloud AutoML API.

Visit https://cloud.google.com/translate/automl/docs/ for a detailed explanation of how to activate the AutoML API.

The AutoML API provides a console to upload translations and train the model.

Creating translation model using AutoML console

1. Create the Dataset

From the AutoML console, click the NEW DATASET button at the top of the console.

A form to create the dataset opens, where you can specify the name of the dataset, the source/target languages and the files containing the translations.

Creation of new dataset

The files can be uploaded from your computer or from an existing bucket in Google Cloud Storage

Though you can upload just one file, AutoML needs three sets of translations: training, validation and testing. You can visit https://cloud.google.com/translate/automl/docs/prepare for a deeper explanation.

The sentence pairs must be in Tab-separated values (.tsv) or Translation Memory eXchange (.tmx) format

Example of .tsv file

You must provide at least three sentence pairs for training. The minimum number of sentence pairs used for VALIDATION or TEST is 100 each. The maximum number of pairs used for VALIDATION or TEST is 10,000; AutoML Translation uses the first 10,000 pairs and ignores the remaining pairs.
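As an illustration, a minimal English-to-Spanish .tsv file might look like this (the sentence pairs are invented for the example; the separator between the two columns is a tab character):

```tsv
Welcome to our site	Bienvenido a nuestro sitio
Contact us	Contáctenos
Read more	Leer más
```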

2. Train the model

To train a model based on the dataset created in the previous step, open the dataset, select the TRAIN tab and click the TRAIN NEW MODEL button.

The price of training a model depends on the training time, which in turn depends on the number of sentence pairs in the dataset, so watch the number of rows before launching the training. You can see the pricing at https://cloud.google.com/translate/automl/pricing

3. Predict translation using the model

To predict translations from the console, open the model and go to the PREDICT tab.

Using AutoML Translation API Client

The purpose of integrating a CMS with AutoML Translation is to execute these actions directly from the CMS instead of opening the console to create datasets or train models.

Google provides a REST API for AutoML, as well as clients in Python, Java and Node.js that allow almost the same actions as the console.

Samples of the AutoML client in Java: https://github.com/GoogleCloudPlatform/java-docs-samples/tree/master/translate/automl

In order to use the API, it is necessary to create a service account with the role AutoML Editor, which grants permissions to create datasets, train models and make predictions. You can find an explanation here: https://cloud.google.com/translate/automl/docs/before-you-begin

As with other Google APIs, the credentials must be stored in a JSON file. This file can be downloaded from the Google Cloud console and must be kept in a safe place. The client uses the environment variable GOOGLE_APPLICATION_CREDENTIALS to locate the credentials file.
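For example, on Linux/macOS the variable can be set before starting the application server (the path below is just an illustration):

```shell
export GOOGLE_APPLICATION_CREDENTIALS="/etc/secrets/automl-service-account.json"
```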

Integration of a CMS with AutoML translation

For the integration of a CMS with AutoML you can use one of the implementations of the AutoML client, Python, Java or Node.js, depending on the language of your CMS. Magnolia is developed in Java.

The following state diagram represents the interaction between the CMS and Google Cloud services.

The Cloud Storage API is used to create the bucket that will store the translation files.

Magnolia CMS

Magnolia is an open source CMS developed in Java. It is a widely used and strongly growing platform, suitable for both simple sites and enterprise solutions.

I have chosen Magnolia CMS because it is open source, developed in Java, and really easy to extend with new features.

Creating the feature AutoML translation in Magnolia

Magnolia is developed in Java and uses Maven as its build tool. To create new features, Magnolia provides a Maven archetype that generates the module structure and dependencies.

The Magnolia documentation explains how to create new modules using the Maven archetype: https://documentation.magnolia-cms.com/display/DOCS60/Module+structure#Modulestructure-CreatingaMagnoliaMavenmodulewithMavenarchetypes

The Magnolia module created for this article will include the following features:

  • Integration with Google AutoML using the java client API
  • Creation of Dataset in Magnolia from contents in Magnolia
  • Training of AutoML model for the dataset created with Magnolia
  • Translation of contents in Magnolia using the custom model

Creating Dataset in AutoML with contents in Magnolia

The steps to create a dataset in AutoML are:

  • Generate the set of translations from contents in Magnolia
  • Upload the translation set to Google Cloud Storage
  • Create the Dataset in AutoML using the Magnolia translations

Remember that to use the AutoML client it is necessary to set the environment variable GOOGLE_APPLICATION_CREDENTIALS with the path to the JSON credentials file.

1. Generating set of translations from Magnolia contents

Magnolia stores contents in a hierarchical content store that implements the Java Content Repository (JCR) specification.

To generate the set of translations, the process navigates the tree of nodes starting at a specific element and retrieves the translation of each property of type Text.

The process receives the source and target languages. If the source language is empty, the default language of the site is used; if the target language is empty, a file is generated for every language available for the site.

Below is an example of the file generated from Magnolia contents

Contents in Magnolia
File generated to train the model in Google AutoML

Magnolia allows rich text that includes HTML code, so it is necessary to clean the text. To do this I use the jsoup library.
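A minimal sketch of the cleaning step with jsoup (the class and method names are mine, not Magnolia's):

```java
import org.jsoup.Jsoup;

public class RichTextCleaner {

    // jsoup parses the rich-text HTML and returns only the visible text,
    // which is what we want in the sentence pairs sent to AutoML.
    public static String clean(String richText) {
        return Jsoup.parse(richText).text();
    }

    public static void main(String[] args) {
        String html = "<p>Welcome to <strong>our</strong> site</p>";
        System.out.println(clean(html)); // prints "Welcome to our site"
    }
}
```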

Note that this step is specific to each CMS; it will be different for, say, Adobe Experience Manager or Liferay.

Generates list of translations from contents in Magnolia
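The JCR traversal itself is Magnolia-specific, but the output side can be sketched with plain Java: each source/target pair becomes one tab-separated line of the training file (class and file names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TsvWriter {

    // Writes sentence pairs in the tab-separated format AutoML expects:
    // one "source<TAB>target" pair per line.
    public static void writeTsv(Path file, Map<String, String> pairs) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        Files.write(file, lines);
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> pairs = new LinkedHashMap<>();
        pairs.put("Welcome to our site", "Bienvenido a nuestro sitio");
        pairs.put("Contact us", "Contáctenos");
        Path out = Files.createTempFile("training", ".tsv");
        writeTsv(out, pairs);
        System.out.println(Files.readAllLines(out).get(0));
    }
}
```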

2. Creating Dataset in AutoML Translation

To create the dataset in AutoML, the Java library of the AutoML client can be used. This library provides methods to create datasets and import the files of sentence pairs that will be used to train the model.
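A sketch of the dataset creation with the Java client (package com.google.cloud.automl.v1; the project ID, location and display name are placeholders, and the call itself requires valid credentials):

```java
import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.Dataset;
import com.google.cloud.automl.v1.LocationName;
import com.google.cloud.automl.v1.TranslationDatasetMetadata;

public class CreateDatasetExample {

    public static void main(String[] args) throws Exception {
        // AutoML Translation runs in the us-central1 location.
        LocationName parent = LocationName.of("my-gcp-project", "us-central1");

        Dataset dataset = Dataset.newBuilder()
                .setDisplayName("magnolia_translations")
                .setTranslationDatasetMetadata(TranslationDatasetMetadata.newBuilder()
                        .setSourceLanguageCode("en")
                        .setTargetLanguageCode("es")
                        .build())
                .build();

        // The client reads credentials from GOOGLE_APPLICATION_CREDENTIALS.
        try (AutoMlClient client = AutoMlClient.create()) {
            Dataset created = client.createDatasetAsync(parent, dataset).get();
            System.out.println("Created dataset: " + created.getName());
        }
    }
}
```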

AutoML does not allow uploading files directly to a dataset, so the translations must first be uploaded to Google Cloud Storage and then imported into the dataset.

To upload files to Google Cloud Storage I use its client API to create the bucket that will contain the sentence files.
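A sketch of the upload with the google-cloud-storage client (the bucket and file names are placeholders; the calls require valid credentials):

```java
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.BucketInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.file.Files;
import java.nio.file.Paths;

public class UploadTranslationsExample {

    public static void main(String[] args) throws Exception {
        // Credentials are resolved from GOOGLE_APPLICATION_CREDENTIALS.
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // Create the bucket that will hold the training files
        // (bucket-already-exists handling is omitted in this sketch).
        storage.create(BucketInfo.of("magnolia-automl-translations"));

        // Upload the .tsv file generated from the Magnolia contents.
        byte[] content = Files.readAllBytes(Paths.get("training.tsv"));
        BlobInfo blob = BlobInfo.newBuilder("magnolia-automl-translations", "training.tsv").build();
        storage.create(blob, content);

        System.out.println("Uploaded gs://magnolia-automl-translations/training.tsv");
    }
}
```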

Training a model from Magnolia

Once the dataset of translations from Magnolia has been created, the custom model can be trained. From Magnolia, this can be done with the Java library of the AutoML client.

To execute the training I have created a Magnolia command that receives the training dataset and the name of the model to be created.

Trains model from a dataset created using Magnolia translations
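The training call itself can be sketched as follows (the project and dataset IDs are placeholders; training is asynchronous and billed, and the operation can take hours):

```java
import com.google.cloud.automl.v1.AutoMlClient;
import com.google.cloud.automl.v1.LocationName;
import com.google.cloud.automl.v1.Model;
import com.google.cloud.automl.v1.TranslationModelMetadata;

public class TrainModelExample {

    public static void main(String[] args) throws Exception {
        LocationName parent = LocationName.of("my-gcp-project", "us-central1");

        Model model = Model.newBuilder()
                .setDisplayName("magnolia_custom_model")
                .setDatasetId("TRL1234567890") // id of the dataset created earlier
                .setTranslationModelMetadata(TranslationModelMetadata.getDefaultInstance())
                .build();

        try (AutoMlClient client = AutoMlClient.create()) {
            // createModelAsync returns a long-running operation;
            // get() blocks until training finishes.
            Model trained = client.createModelAsync(parent, model).get();
            System.out.println("Trained model: " + trained.getName());
        }
    }
}
```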

Translating contents in Magnolia using custom model

To translate contents with the custom model, the Java library of the AutoML client can be used again.

To translate contents I have created a Magnolia command that receives the model and the text to be translated.
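A sketch of the prediction call with the Java client (the model ID and input text are placeholders; the call requires valid credentials):

```java
import com.google.cloud.automl.v1.ExamplePayload;
import com.google.cloud.automl.v1.ModelName;
import com.google.cloud.automl.v1.PredictRequest;
import com.google.cloud.automl.v1.PredictResponse;
import com.google.cloud.automl.v1.PredictionServiceClient;
import com.google.cloud.automl.v1.TextSnippet;

public class TranslateExample {

    public static void main(String[] args) throws Exception {
        ModelName model = ModelName.of("my-gcp-project", "us-central1", "TRL0987654321");

        // The text to translate is wrapped in a TextSnippet payload.
        PredictRequest request = PredictRequest.newBuilder()
                .setName(model.toString())
                .setPayload(ExamplePayload.newBuilder()
                        .setTextSnippet(TextSnippet.newBuilder()
                                .setContent("Welcome to our site")
                                .build())
                        .build())
                .build();

        try (PredictionServiceClient client = PredictionServiceClient.create()) {
            PredictResponse response = client.predict(request);
            String translated = response.getPayload(0)
                    .getTranslation()
                    .getTranslatedContent()
                    .getContent();
            System.out.println(translated);
        }
    }
}
```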

This is just an example, but it is possible to create a command that translates contents automatically, or even to customize the Magnolia backoffice so that it proposes translations to the editors.

Conclusions

We have created a prediction model that fits translations to the domain of our site.

It is possible to implement the same feature for any CMS; I chose Magnolia because it allows integrating all the steps inside the product.

Repository on GitHub: https://github.com/joaquin-alfaro/magnolia-content-translation-ml
