Gazing at the Future of Machine Translation Quality with MT-Telescope

Throughout the history of machine translation, one of the greatest challenges that researchers and engineers have come up against is the complex task of assessing the accuracy of translations. When putting different translation systems head to head to determine which is better, what makes one system superior to another in terms of translation quality?

The enigma of machine translation quality is something that our research team has been studying closely for many years now. From our award-winning and open-source OpenKiwi quality estimation tool to our Crosslingual Optimized Metric for Evaluation of Translation (COMET) framework, we are proud to have helped the machine translation community make great strides in assessing machine translation quality performance. 

Building on the momentum and knowledge we’ve gained since releasing COMET at the end of last year, we are very excited to announce the latest development in understanding the quality performance of machine translation systems: MT-Telescope.

Let’s take a look at why MT-Telescope is a leap forward for machine translation quality analysis, how it works, and what it means for our customers and the MT community at large. 

Why is MT-Telescope a breakthrough in machine translation quality?


“Our research shows that one of the biggest needs in applying machine translation is insight into its usability, an area where current methods fall short. Guidance-focused evaluation that focuses on how well MT suits particular use cases will help extend the technology to new areas and increase acceptance of machine translation-based workflows.”

— Dr. Arle Lommel, senior analyst at CSA Research

Automated measurement metrics that generate quality scores, such as COMET, are very useful because they reduce the need for human inspection, enabling fast and cost-effective prediction of translation quality. But the automated metric itself is only one piece of a much larger puzzle. Another important factor in effective MT evaluation is the selection and curation of meaningful and representative sets of test data.

However, even with well-curated data and much experience in using and applying automated metrics, we came to realize that the performance scores generated by these metrics only tell part of the story. It’s important for our MT engineers to not only determine that one model scored better than another, but to also understand why. 

Let’s say that a team of MT engineers is examining the quality performance of two machine translation systems on a test set using an automated measurement metric, and one gets a score of 78 while the other receives a score of 80. Until now, many engineers in this situation would presume the system that scored an 80 is the “winner,” so they might drop the other system and carry on with their work.

In that scenario, it’s hard to know what factors caused the second system to earn a higher score on that particular test data. What if the lower-scoring system actually produced more accurate translations for specific terms that resonate better with a native speaker of a given language? Manually digging into the translations generated by the two systems to find answers is extremely time-intensive, which nullifies the efficiency gains that an automated metric like COMET provides.
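To make that 78-versus-80 scenario concrete, here is a minimal sketch, using made-up segment scores rather than output from any real metric, of how two systems with nearly identical aggregate scores can behave very differently segment by segment:

```python
from statistics import mean

# Hypothetical segment-level quality scores (stand-ins for metric output,
# e.g. from COMET) for the same six source segments, one list per system.
system_a = [0.92, 0.40, 0.95, 0.88, 0.40, 0.93]
system_b = [0.75, 0.76, 0.74, 0.77, 0.75, 0.77]

# The aggregate scores are almost indistinguishable...
print(f"Corpus-level A: {mean(system_a):.2f}")
print(f"Corpus-level B: {mean(system_b):.2f}")

# ...but the segment-level picture is not: system A is excellent on most
# segments and fails badly on a few, while system B is uniformly mediocre.
a_wins = sum(a > b for a, b in zip(system_a, system_b))
print(f"System A wins {a_wins} of {len(system_a)} segments")
```

An aggregate score alone hides exactly this kind of difference; depending on whether consistency or peak quality matters more for the use case, either system could be the right choice.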

This conundrum is why we developed MT-Telescope, which surfaces the data underlying quality performance and provides a granular comparison between two systems. MT-Telescope essentially blows the machine translation models’ performance on test data wide open, allowing engineers to make much more nuanced and informed decisions about why they would choose one system over another.

Machine translation is a rapidly evolving field: As one recently published paper made clear, an increasing number of MT researchers and developers have failed to follow rigorous state-of-the-art evaluation methods in recent years, relying solely on reporting aggregate scores based on outdated automated metrics, without any additional validation. That’s why MT-Telescope was designed to enable the seamless adoption of best practices in MT evaluation and quality analysis as new advancements are made.

How does MT-Telescope work? 

The process is fairly simple. A user uploads a test suite consisting of data extracted from a meaningful and representative collection of test documents. For a given test suite, the data files should include:

  1. The source segments (untranslated content in the source language)

  2. The reference content (perfect, human-generated translations) for the source segments

  3. Translations from two different MT systems for the source segments
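As a rough illustration, loading such a test suite might look like the sketch below, assuming one segment per line in four parallel plain-text files. The function and file layout here are hypothetical, not MT-Telescope’s actual interface:

```python
# Illustrative test-suite loader: the three ingredients listed above (source,
# reference, and each system's translations) must line up segment for segment.

def load_test_suite(source_path, reference_path, sys_a_path, sys_b_path):
    """Read four parallel files and check they align segment-for-segment."""
    def read_segments(path):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    suite = {
        "source": read_segments(source_path),
        "reference": read_segments(reference_path),
        "system_a": read_segments(sys_a_path),
        "system_b": read_segments(sys_b_path),
    }
    lengths = {name: len(segs) for name, segs in suite.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Files are not parallel: {lengths}")
    return suite
```

The real tool’s input format may differ in its details; the essential requirement is simply that all four files describe the same segments in the same order.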

At this point, MT-Telescope runs the analysis with the underlying automated metric and visualizes the results in an intuitive browser interface, allowing engineers to execute a rigorous and detailed evaluation effortlessly. Instead of a complex set of raw outputs, MT-Telescope’s UI makes it simple to compare and contrast the scored translations generated by the two machine translation systems.

The visualizations include graphical representations of the difference in quality scores for specific subsets of the test data (such as segments containing named entities and terminology), a side-by-side error analysis of each system, and a high-level view of the distribution of quality scores across the two systems.

When a user is viewing the contrast between systems with MT-Telescope, they can unlock deeper, more granular insights by applying filters based on specific features of the data, such as the presence of keywords and phrases (named entities and terminology). These keywords could include job titles, company names, product types, geographic locations, and so on. 

Users can also filter by the length of the translation (segment length) to compare the performance of the two MT systems on short vs. long segments. This is especially helpful when comparing a communication channel that is usually brief, like chat, with one that is typically more long-form, like email. 
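A minimal sketch of these two filters, keyword presence and segment length, might look like the following. The data and function names are hypothetical and purely illustrative, not MT-Telescope’s actual implementation:

```python
from statistics import mean

def filter_indices(sources, terms=None, max_words=None):
    """Return indices of source segments matching the optional filters."""
    keep = []
    for i, seg in enumerate(sources):
        # Keyword filter: keep segments mentioning at least one term.
        if terms is not None and not any(t.lower() in seg.lower() for t in terms):
            continue
        # Length filter: keep only short segments (e.g. chat-like content).
        if max_words is not None and len(seg.split()) > max_words:
            continue
        keep.append(i)
    return keep

def subset_mean(scores, indices):
    """Average a system's segment scores over a filtered subset."""
    return mean(scores[i] for i in indices)

# Hypothetical sources plus segment-level scores for two systems.
sources = ["Contact Unbabel support", "The invoice is attached to this email",
           "Hi there", "Unbabel ships a new release"]
scores_a = [0.90, 0.70, 0.95, 0.60]
scores_b = [0.80, 0.85, 0.90, 0.88]

idx = filter_indices(sources, terms=["Unbabel"])
print(subset_mean(scores_a, idx), subset_mean(scores_b, idx))
```

Comparing subset averages like this is what lets an engineer say not just "system B scored higher overall" but "system B is stronger specifically on segments containing our product names."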

The MT-Telescope tool is modular enough that users can easily perform the analysis based on other quality measurement metrics in addition to our own COMET, such as Google’s BLEURT or Johns Hopkins’ Prism.

Combined, all of these features enable engineers to perform richer, faster analyses between MT systems so they can make more confident decisions about the best system to deploy. 

What does the introduction of MT-Telescope mean for you?

Unbabel is currently using MT-Telescope to help our LangOps specialists and MT developers evaluate and make deployment decisions for our own machine translation systems. That means our customers and future customers will continue to benefit from a powerful Language Operations platform that will only get better with time. 

MT-Telescope allows us to make smarter choices about which system delivers the best possible quality performance and thus the best possible customer experience. As we continue to push boundaries on new frameworks and tools at the forefront of machine translation and MT quality evaluation, our own solutions will only improve. That brings us closer to the goal of seamless multilingual conversations that are indistinguishable from communications written by a native speaker.

Our COMET metric is already being embraced by leading technology companies on the front lines of machine translation innovation. Our release of MT-Telescope as a complementary open-source MT evaluation tool is intended to serve as an extension of COMET, empowering researchers and engineers to develop their own cutting-edge machine translation models. 

This aligns with a common thread in Unbabel’s research: the academic spirit of sharing knowledge. In keeping with this “rising tide lifts all ships” mentality, we will continue to contribute our tools and best practices back to the machine translation community so that our advancements are able to benefit everyone who wants to help extend the limits of what this technology can do. 

A major advancement for multilingual communications 

MT-Telescope is a key milestone in our ability to assess the quality of machine translation systems, and we’re still in the early stages of determining the scope of everything it will help us achieve. We hope that our customers are just as excited as we are about taking machine translation quality and efficiency to the next level.

Head over here to learn more about how machine translation technology powers an organization's Language Operations strategy for a better multilingual customer experience and accelerated international growth.

About the Author

Alon Lavie is the Vice President of Language Technologies at Unbabel. He leads and manages Unbabel’s U.S. artificial intelligence lab based in Pittsburgh, and provides strategic leadership for AI R&D teams company-wide. Previously, Alon was a senior manager at Amazon, where he led the Amazon Machine Translation R&D group.
