SUMO localization experiment - Report

Overview

The Open Innovation Team ran a two-week experiment with SUMO localizers to understand whether introducing machine translation could improve the current workflow and contributor experience, and make it easier to get top-priority articles covered with less effort.

The results were very satisfactory. By the end of the experiment, 62 of the 72 articles were fully translated, we attracted 2 new contributors, and volunteers reported an overall satisfaction score of 4.75/7. While there is a lot more to learn and experiment with, these results suggest that adding mechanisms like machine translation to traditional workflows can save localizers a lot of time and help cover many pending articles. Keep reading to learn more!

Intro

The demand for new and updated localization on SUMO has been increasing in recent years. Mozilla is bigger, and we are creating and testing new concepts and products faster than ever (and this will keep increasing in the future). This translates into a lot of new and updated documentation for these products, which is as important for the user experience as the products themselves.

We also care deeply about volunteers’ time: contributors devote their free time to help make Mozilla and its products better and to drive the mission in their languages.

We want contributing to Mozilla to be fun and rewarding, and we want to help communities grow by engaging existing and new people with a low-effort activity that does not require fluency in English, so that locales are fully localized and content stays easy to update.

Our assumption is that if we can improve the localization experience, we will save contributors a lot of time, they will enjoy the experience more, and we will be able to do more with less.

Machine translation technology has evolved a lot in recent years, and today it can produce highly accurate localization drafts that take a human only a few minutes to proofread and validate. Based on our external research and conversations with other volunteer-based organizations, this can save a lot of time and help get an article ready in a different language quickly.

To validate these assumptions, we wanted to run a test with a few locales and see if this was true. We weren’t trying to decide whether or not to implement machine translation; we wanted to test, with contributors’ help, whether we were moving in the right direction.

This work will also inform a wider localization strategy for SUMO moving forward.

Project framing

From our internal research we knew that the locales that represent 90% of Firefox Desktop monthly active users (MAU) were mainly maintained by 1-2 core contributors each, doing 85-95% of the localizations.

We also identified this as a risk, and it became a reality in some locales where there were no active contributors anymore, exposing users to a bad experience: missing support articles in their language and outdated information. This directly impacted SUMO team goals around consumer satisfaction.

During Q2 2019 we ran an experiment in some locales at risk (less than 100% localization coverage of the top 50 articles for a priority product).

Project goals:

  1. Selected locales have 100% coverage on Top 50 articles for the product selected.
  2. Community satisfaction with the workflow is at least 3/5 (60%).
  3. We are able to engage at least 2 new people in each selected locale.

Locales identified for experiment:

  • Dutch → All products
  • Korean → Firefox iOS
  • Thai → Firefox Lite
  • Chinese (Taiwan) → Firefox Lite

What happened

Preparation

After identifying the locales, we engaged with the local communities to let them know about this experiment and asked them to help review the machine-translated articles: proofreading them directly on Kitsune, editing where needed and approving.

A communication and engagement plan was developed to make sure we managed the relationship and expectations with these localizers and framed the experiment positively, applying past experience with Mozilla communities and learnings from conversations with external projects like Wikimedia, which have already dealt with similar machine translation efforts.

For each locale we secured at least 1-2 people who committed to help during the two weeks of the experiment.

We developed a workflow to export the articles that needed localization, apply machine translation to them, and import them back into the SUMO platform as “pending community review”.

This process was very manual and was supported by our technical writers, who ended up exporting 72 articles. We also developed a script to connect to the Google Cloud Translate API and run the translation, then we manually imported the results back into kitsune and marked them for review by the community.
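For reference, this is roughly what the translation step of that script could look like. It’s a minimal sketch, assuming the articles were exported as one HTML file per article into a local folder; the folder layout, locale list and function name are illustrative, not the actual code we ran:

```python
# Minimal sketch of the machine translation step, assuming exported articles
# live as HTML files under kb_exports/en-US/. Paths, locale codes and the
# function name are illustrative assumptions, not the production script.
from pathlib import Path

from google.cloud import translate_v2 as translate  # pip install google-cloud-translate

# Locales from the experiment, as Google Translate language codes.
TARGET_LOCALES = ["nl", "ko", "th", "zh-TW"]


def translate_exported_articles(export_dir: str, out_dir: str) -> None:
    """Translate every exported article into each target locale."""
    client = translate.Client()  # credentials come from GOOGLE_APPLICATION_CREDENTIALS
    for article in sorted(Path(export_dir).glob("*.html")):
        source = article.read_text(encoding="utf-8")
        for code in TARGET_LOCALES:
            result = client.translate(
                source,
                source_language="en",
                target_language=code,
                format_="html",  # preserve markup as far as the API allows
            )
            target = Path(out_dir) / code / article.name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(result["translatedText"], encoding="utf-8")


if __name__ == "__main__":
    translate_exported_articles("kb_exports/en-US", "kb_exports/translated")
```

In practice each locale only needed the articles for its selected product, and the translated files still had to be imported into Kitsune and marked for community review by hand.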

Community work

Between May 13th and May 24th localizers checked the full list of URLs we provided.

Eight localizers engaged in this activity.

Thank you so much for your time and contributions! You were fundamental in shaping this experiment and in refining our thinking about how to improve localization at SUMO.

Outcomes and insights

In general, the experiment was very satisfactory and we met two of the three success indicators.

With the help of machine translation, localizers finished articles two to three times faster than it would have taken them to create a localization from scratch.

Of the 72 articles we imported, 62 were covered by the time the campaign ended. When we asked localizers how likely they would be to recommend a similar experiment to other localizers, the average score was 4.75/7, or 67% satisfaction.

Goal: Selected locales have 100% coverage on Top 50 articles for the product selected (72 selected articles).
Outcome: 86% of the selected articles were fully translated by the end of the campaign (62/72; the 10 missing articles were for Chinese (Taiwan)).

Goal: Community satisfaction with the workflow is at least 3/5 (60% satisfaction).
Outcome: 100%. The community reported a satisfaction rating of 4.75/7, or 67% (above the 60% target).

Goal: We are able to engage at least 2 new people in each selected locale.
Outcome: 30% of the contributors who participated in the campaign were new (2/7).

These are some detailed insights:

The export/import process took a lot of time

Exporting/importing the articles and marking them for review took a lot of manual staff work (2-3 days); it’s not scalable.

Recommendation: Invest in creating an export/import mechanism for kitsune.

Machine translation was good, but it can improve

When contributors were asked about the quality and accuracy of the machine translations and whether they “sounded natural”, localizers said the quality was good (2.75/4). Some comments expressed positive surprise compared with the quality they had expected. The only exception was Chinese (Taiwan), which rated the quality as low.

Additionally, localizers had to edit on average 5-15% of the machine translation output in order to adapt it and make it sound more natural.

This work took on average between 5 and 15 minutes per article, depending on article length. We also asked localizers how long it would have taken them to create a translation from scratch; on average they reported at least 20 to 30+ minutes.

It is clear that machine translation can save localizers a lot of time.

Recommendation: Invest in the development of the machine translation code.

Machine translation provider quality didn’t work for all locales

The provider we used (Google Cloud Translate) was not really effective for Chinese (Taiwan). Translators rated the quality as extremely low, making the output not useful for localizers, who had to redo most of the sentences.

The Chinese (Taiwan) locale struggled to get its articles reviewed in time because the machine translation was reported to be unnatural and to contain too many Chinese (China) expressions, meaning localizers had to redo most paragraphs.

A single provider won’t work for every locale.

Recommendation: Do additional research and testing to understand MT providers limitations.

SUMO review system is not very friendly

The review system at SUMO is not very straightforward for newcomers. You need someone with review rights to approve revisions (a potential bottleneck), and it’s a multi-step process that is not as agile as some might expect; detailed instructions had to be provided.

Recommendation: Invest in understanding kitsune UX limitations and identify quick wins.

The community engagement plan was important

The engagement and communication plan really paid off: reactions to the experiment were very positive, and communities were open to participating in something new that has historically been a “tricky topic” among the Mozilla localization communities.

Scalability will be an issue

We know Wikipedia has taken a per-language approach to make sure the system is not abused by localizers and to ensure high quality. We currently have technical limitations around scalability.

Recommendation: Iterate the experiment with additional locales to understand scalability.

Recommendations Summary

  1. Invest in creating an export/import mechanism for kitsune.
  2. Invest in the development of the machine translation code.
  3. Do additional research and testing to understand MT providers limitations.
  4. Invest in understanding kitsune UX limitations and identify quick wins.
  5. Iterate the experiment with additional locales to understand scalability.

1. Invest in creating an export/import mechanism for kitsune

Importance: High
Effort: Low (2 days of dev time)
People needs: Dev time, CM time

Not only will we need to automate the export/import process for further experiments around machine translation, but it’s also a clearly identified need for other internal processes dealing with content.

2. Invest in the development of the machine translation code

Importance: Medium
Effort: Low-medium (depending on how much we want to improve)
People needs: Dev time, CM time

If we want to expand our tests around machine translation, there are certain improvements we can apply to the code generating these translations. At a minimum we’ll need minor changes for markup handling (P1) and multi-provider support (P2). Formal/informal handling, translation memories and terminology support would be a bonus (P3), but they are not blockers.
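To illustrate the markup handling point (P1): one common approach is to replace wiki markup with placeholder tokens before sending the text to the provider and restore them afterwards, so the engine can’t mangle things like {for} blocks or [[links]]. This is only a sketch under those assumptions; the regex pattern and the machine_translate callable are hypothetical, not the actual SUMO code:

```python
# Hedged sketch of placeholder-based markup protection for machine translation.
# The markup pattern and the machine_translate() callable are illustrative only.
import re

# Roughly matches {for ...}-style directives and [[wiki links]].
MARKUP_PATTERN = re.compile(r"\{[^{}]*\}|\[\[[^\]]*\]\]")


def protect_markup(text: str) -> tuple[str, dict[str, str]]:
    """Swap markup for stable tokens the MT engine should leave alone."""
    placeholders: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        token = f"__MT{len(placeholders)}__"
        placeholders[token] = match.group(0)
        return token

    return MARKUP_PATTERN.sub(repl, text), placeholders


def restore_markup(text: str, placeholders: dict[str, str]) -> str:
    """Put the original markup back after translation."""
    for token, original in placeholders.items():
        text = text.replace(token, original)
    return text


def translate_with_protection(source: str, locale: str, machine_translate) -> str:
    protected, placeholders = protect_markup(source)
    translated = machine_translate(protected, locale)  # any provider
    return restore_markup(translated, placeholders)
```

Multi-provider support (P2) would then mostly mean passing a different machine_translate callable per locale, which ties into recommendation 3 below.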

3. Do additional research and testing to understand MT providers limitations.

Importance: High
Effort: Low (depending on how many locales we need to understand)
People needs: CM time

We should definitely understand how the current provider handles our larger locales (including both Desktop and Mobile products).

Running a sample test for feedback with these communities will help us understand our needs for additional providers, which would influence recommendation 2.
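As a rough illustration, such a sample test could be as simple as running the same short excerpt through each candidate provider and collecting the outputs in a spreadsheet for community rating. The sketch below only wires up the Google provider we already used; the extra provider slots and sample sentences are placeholders to be filled in once candidates are chosen:

```python
# Hedged sketch of a provider comparison test: translate the same sample
# excerpt with each candidate provider and write a CSV for community rating.
# Only the Google provider is real here; others would be added later.
import csv

from google.cloud import translate_v2 as translate

_google = translate.Client()


def google_mt(text: str, locale: str) -> str:
    result = _google.translate(text, source_language="en", target_language=locale)
    return result["translatedText"]


# Provider name -> callable(text, locale); add candidate providers here.
PROVIDERS = {"google": google_mt}

# Illustrative samples: locale code -> short KB-style sentence to translate.
SAMPLES = {
    "zh-TW": "Tap the menu button and select Settings to change your preferences.",
    "th": "Tap the menu button and select Settings to change your preferences.",
}

with open("mt_provider_samples.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["locale", "provider", "source", "output", "community_rating"])
    for locale, excerpt in SAMPLES.items():
        for name, mt in PROVIDERS.items():
            writer.writerow([locale, name, excerpt, mt(excerpt, locale), ""])
```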

4. Invest in understanding kitsune UX limitations and identify quick wins.

Importance: Medium
Effort: Low-medium (depending on how deep we want to go)
People needs: Program manager time, dev time

We should work with our communities to understand the main pain points about using kitsune and prioritize the ones that are clearly draining people’s time. I suspect there are a few quick wins we can just do by tweaking the frontend.

5. Iterate the experiment with additional locales to understand scalability.

Importance: High
Effort: Medium
People needs: CM time, team support

We need to understand if this experience would work at scale. This would be a combination of (and is blocked by) recommendations 1, 2 (if possible) and 3.

Once the technical limitations are solved and we understand how to provide the best MT for each locale, we should test with all our larger locales to see how it scales.


I’m dreaming of a “help with robot” button on Kitsune, so people can trigger machine translation when they need it.


I just spent 3 days at the TAUS conference listening to how other enterprise companies are testing and implementing neural MT in their processes, and have a few questions:

  1. Was raw MT output what was used for this?
  2. Did you train the engine on your data or did you use the generic Google Translate engine?
  3. If you trained on data, did you also create a glossary of terminology the engine could use at the time of translation to improve the translation’s accuracy (a new feature in the customizable Google Translate)?
  4. My impression from the project goals is that an underlying goal was to determine community tolerance for MT output. What role did post-editing play in the process and how did that factor into the results? Were you able to provide MT post-editing training to those communities beforehand?
  5. What tangible outcomes did having these pages translated through this method provide (i.e., increased traffic, longer time spent on the article, more engagement from the target audience, etc.)?
  1. Raw MT output was provided to communities for review and proofreading.
  2. No, although possible, we didn’t have the time to develop this into our script for this first iteration.
  3. General guidance was provided to communities so they knew what to expect from MT and what to look out for when reviewing. No major issues were flagged (except for Chinese (Taiwan)); most concerns were about the current tools (which we also use for regular localization) rather than the MT output itself.
  4. It’s still early to say; depending on the metrics, we will have something to share in the coming months.

Thanks for the quick reply! Was the review purely linguistic, or were there questions around utility and purpose-fulfillment for the target audience? Very interesting to hear that the major concerns were more about the tools than the inclusion of MT itself. It’s cool to see us beginning to experiment and embrace MT as a superpowered resource for the community. Do you plan to follow up on this report in the coming months with the outcomes?

We focused on linguistic review, as well as adaptation to the local audience. The questions were similar, if not the same, to those asked for other manual localizations.

A proposal on next steps will be delivered to the SUMO team and then there will be prioritization to decide next steps.

Hopefully, the recommendations in this report can be accommodated within the team’s resources as part of the second half of the year’s work. Each one might result in one or more additional experiments to run.

Cheers.

Thanks. As I mentioned, I’ve been at the TAUS conference all week talking about these topics with others who have come before us. I’m very happy to share any knowledge or answer any questions.


Yes, thanks, let’s definitely connect next week 🙂