Comparing Machine Translation to Native Language Analysis

May 10, 2019

As our world becomes increasingly global, so does our data. Analysts at companies of almost any size now often face the challenge of receiving text data that contains multiple languages. So what do you do?

Essentially, there are two options we may consider: machine translation or native language analysis.

With machine translation, we actually create a new dataset where the text has all been translated into a single language before we do the analysis. This makes the subsequent analysis much easier, as we only need to use a single language grammar module for the analysis.
Native language analysis means that we keep documents in their original languages and perform a separate analysis for each language with the corresponding grammar module.

To demonstrate how these options work, let’s imagine we are working with a dataset that has records (or rows) of textual responses that are either in English or Spanish. Regardless of whether we want to use machine translation or to process individual documents in their native languages, we first need to split the dataset up by language, so that all the English records are together in one dataset and all the Spanish records are in another. We then use the software PolyAnalyst for Text™ to perform native language text analysis by applying the corresponding language modules. Currently, this software can process text in 16 different languages, and it can integrate with third-party translation services such as Microsoft, SDL, and Google.

Splitting the data can be easily accomplished by using a combination of the Language Detection and Filter Rows nodes in PolyAnalyst. The Language Detection node automatically samples the text of each record and determines what language is being used (as seen below). Once each record is tagged with its language, we can use these tags to filter the data by language.

PolyAnalyst workflow for detecting and filtering text data by language.

Translation services like Google and Microsoft are fully capable of identifying the language of texts on a record-by-record basis, but they will charge you a fee even if you ask them to translate English to English. Therefore, it is best to do the split beforehand and send them only the texts that really need translation. Some of you may even go so far as to split individual texts into multiple records when that text contains multiple languages. This will minimize translation costs.
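Outside of PolyAnalyst, the same detect-then-filter idea can be sketched in a few lines of Python. This is only an illustration of the splitting step, not the way the Language Detection node works: the function `split_by_language` and the keyword-based `toy_detect` are hypothetical names made up for this example, and a real project would plug in a proper language detector in place of the toy one.

```python
from collections import defaultdict


def split_by_language(records, detect_language):
    """Group text records by the language code the detector returns."""
    groups = defaultdict(list)
    for record in records:
        groups[detect_language(record)].append(record)
    return dict(groups)


# Toy stand-in for a real language detector (an assumption for this sketch):
# a record is tagged "es" if it contains any of a few common Spanish words.
SPANISH_HINTS = {"el", "la", "los", "que", "es", "muy", "y"}


def toy_detect(text):
    words = set(text.lower().split())
    return "es" if words & SPANISH_HINTS else "en"


records = [
    "The product is great",
    "El producto es muy bueno",
    "Shipping was slow",
]

groups = split_by_language(records, toy_detect)

# Only the non-English records would be sent to a paid translation service.
to_translate = [
    text
    for lang, texts in groups.items()
    if lang != "en"
    for text in texts
]
```

Keeping the detector as a parameter means the grouping logic stays the same no matter which detection tool is used.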

Once we’ve separated our data based on the language, it’s time to decide what approach to use: Machine Translation (MT) or Native Language Analysis (NLA). Let’s review some pros and cons of each of these approaches.

Machine Translation Pros & Cons

PROS

Relatively Cheap
Machine translation is relatively cheap, even with a fairly large dataset. These services tend to charge by the character, so the cost will vary based on how many documents you have and how long (a.k.a. wordy) they are. For reference, at Microsoft’s lowest (and least cost-effective) pricing tier, the cost is $10 per 1 million characters. A page in Microsoft Word is about 3000 characters, so you can translate about 333 pages for $10. If you have a huge dataset, the total might sound like a lot. However, compared to hiring multiple analysts with fluency in different languages, it may still be the more affordable option.
Simple and Accessible Results
The data is more widely accessible and ends up consolidated into a single language. This means that even if the data was originally in 10 different languages, end users such as company or team managers, who may only speak English, can review the text of each supporting record for a proposed insight and see what is being said and why that record was processed that way.
Low Maintenance
Typically, for ongoing analysis, you will want to update your analysis workflow periodically to account for new issues and trends. Because MT lets you build a single analysis scenario for a single language, maintenance becomes simpler. If you want to use NLA and have 10 languages present in the data, you will have to build and maintain a separate analysis scenario for each language.
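To make the per-character pricing under Relatively Cheap concrete, here is a rough Python sketch of the arithmetic. The $10 per 1 million characters rate and the 3000 characters per page figure come from the text above; the function names are made up for this example, and actual services may count characters differently than Python's `len` does, so treat the result as an estimate only.

```python
CHARS_PER_PAGE = 3000  # rough size of a Microsoft Word page, per the article


def translation_cost(texts, price_per_million_chars=10.0):
    """Estimate the cost in dollars of translating a list of texts."""
    total_chars = sum(len(text) for text in texts)
    return total_chars * price_per_million_chars / 1_000_000


def pages_for_budget(budget, price_per_million_chars=10.0,
                     chars_per_page=CHARS_PER_PAGE):
    """Estimate how many whole pages a given budget can cover."""
    chars = budget / price_per_million_chars * 1_000_000
    return int(chars // chars_per_page)


# 333 pages of 3000 characters each, at the example rate
example_cost = translation_cost(["x" * CHARS_PER_PAGE] * 333)
example_pages = pages_for_budget(10)
```

Running the same estimate on only the records that actually need translation shows how much the language split saves.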

CONS

Low Accuracy
The accuracy of machine translation is still relatively low compared to manual translation. And even with manual translation, some things like sarcasm and figures of speech may not translate well. For example, if you translated “break a leg” into another language, the meaning of “I wish you good luck” is likely to be lost. Additionally, different languages may be more or less difficult to translate to or from. Going from Spanish to English will likely result in a reasonable translation, but most translation services that have been tested by our analysts at Megaputer performed relatively poorly when working with languages like Japanese, Chinese, and Korean. Anyone looking to use machine translation will need to run some tests to see if the accuracy is at a level that can meet the output goals.

Here is an example of poor translation from Google Translate. As you may know, Japanese is a highly contextual language, which makes machine translation difficult.

Original Text: 生懸命指でまぶたを広げて目薬を差しました。
Google Translate: I spread my eyelids with my fingers and put on my eyes.
Manual Translation: With great effort I held his eyelids open with my fingers and dropped in the eye medicine.

Garbage In, Garbage Out
The accuracy of the analysis will depend in part on the accuracy of the translation. A low-accuracy translation will make the results of the analysis less trustworthy.

Native Language Analysis Pros and Cons

PROS

Better Accuracy
Native language analysis generally produces much more accurate results. This is, of course, dependent on the skill of the analyst.
Traceability and Transparency
There is a 1-to-1 account of what parts of the original text match the search query, and this will be visible in the final results.

CONS

More Expensive
Native language analysis tends to be more expensive. There will be costs for hiring additional analysts to cover different languages. Those analysts will need to not only have the skills of an analyst, but also the skills of a polyglot linguist.
More Maintenance
In the future when models and algorithms need to be adjusted, the work will be multiplied by the number of languages being worked with.
Less Accessibility
When consumers of the results review the analysis, they may not be able to independently read all the records to understand the supporting information for suggested insights.

Other Options and Common Questions

There is actually a third option that is the most expensive choice. You can use native language analysis for the model building and analysis to ensure high accuracy of the results, but then also use machine translation so that end users reading the report can get a general idea of what each record says. However, it may not be 100% clear to them why it was processed as it was since the analysis was done on the untranslated text.

As for which option you should choose, there is no way to know except to do a small-scale test analysis. Try some MT analysis, and if the accuracy is acceptable, then that might be the better choice for you. How much accuracy are you willing to sacrifice for cost? The answer to that will vary from company to company and task to task.

Another common question is, “Which translation service should I use?” Here is a suggestion on how to decide when your goals involve a categorization task. Suppose you have a dataset of customer complaints and you want to categorize each text based on what complaints were made, thus allowing you to get the count for each complaint type. It is recommended that you do a small-scale analysis for each MT service you are considering, and compare this to the results you achieve using NLA. Make sure that the analyst-driven portion of the NLA is rock solid, then compare the results of each MT service to your NLA results. To make this comparison, treat the NLA as 100% correct, then calculate the precision, recall, and F score of the post-MT analysis. This means you will count how many categorizations were made incorrectly, and how many categorizations the post-MT analysis failed to make that it should have made.

For Example

Your NLA made 100 categorizations.
You use translation service “A”.
When you run your algorithm on the machine translated data from “A”, it makes 70 categorizations, 10 of which were not made by the NLA (and therefore are incorrect), and 60 of which are identical to the NLA results.

Precision is the number of correct results divided by the number of all returned results.
In this case, we had 60 correct results out of 70 returned categorizations.
P = 60/70
P = .857

Recall is the number of correct results divided by the number of results that should have been returned.
In this case, we had 60 correct results found out of the possible 100 correct results.
R = 60/100
R = .6

F score is the harmonic mean of recall and precision. The harmonic mean is the preferred method for averaging ratios. The F score is a good measure of how “correct” the categorization results are.
F = ((2)(Precision)(Recall))/(Precision + Recall)
F = ((2)( .857)( .6))/( .857 + .6)
F = .706
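The same calculation can be sketched in Python for anyone who wants to automate the comparison. This is a minimal illustration that treats the NLA output as the reference, as described above. The function name `score_against_reference` and the representation of a categorization as a `(record_id, category)` pair are assumptions made for this example, and the sample data is synthetic, built only to reproduce the counts in the worked example (100 NLA categorizations, 70 post-MT categorizations, 60 in common).

```python
def score_against_reference(reference, candidate):
    """Return (precision, recall, F score) of candidate vs. reference.

    Each argument is a collection of (record_id, category) pairs.
    A candidate pair counts as correct only if the reference has it too.
    """
    reference, candidate = set(reference), set(candidate)
    correct = len(reference & candidate)
    precision = correct / len(candidate) if candidate else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Synthetic data matching the worked example:
# the NLA made 100 categorizations.
nla = {(i, "complaint") for i in range(100)}
# Service "A" output: 60 match the NLA, 10 do not.
service_a = {(i, "complaint") for i in range(60)}
service_a |= {(i, "other") for i in range(100, 110)}

precision, recall, f_score = score_against_reference(nla, service_a)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
# prints: 0.857 0.6 0.706
```

Running the same function on the output from each candidate service gives directly comparable F scores.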

Then you try translation service “B”, and perform a similar calculation of precision, recall, and F score for the corresponding categorization results. Let’s assume the F score on the machine translated data from “B” is .75. Since service “B” yielded a higher F score (.75 > .706), we conclude that service “B” provides a more accurate machine translation than service “A” for this task. Ceteris paribus, you should go with service “B”.

All that being said, there is at least one other factor to consider. If your data is highly sensitive, remember that services like Microsoft and Google have rules about keeping a sample of your data, which they can use for improving their algorithms. SDL, on the other hand, does not keep any of your data. Unfortunately for those with highly sensitive data, it appears that generally Microsoft and Google have put to good use the additional data they are receiving. In the tests Megaputer staff have run, these services tend to outperform SDL in terms of accuracy.
