Sometimes you need to hack it (MT and memoQ)
memoQ is a great product, and I say so a lot. But some features need a bit of polishing, and one of them is support for Machine Translation (MT). It’s done through a series of plugins. I’ve used just two so far. The first is pseudo-translation, which is fine but limited, although they’ve enhanced it in the Adriatic version. The second plugin is for Google Translate. There’s not much to configure: you set the API key, you can specify a regex whose matches will be ignored in the MT process, and you can enable an option to put the tags from the source at the end of the translation. And this is the pain I’ve had to hack.
A word of advice here. If the plugin is enabled and set as preferred in your client, MT results show up alongside Translation Memory (TM) hits in the Translation results pane. So if you spend a lot of time there clicking through segments, it’ll cost you some money. Make sure to uncheck Offer machine-translated results while working in the translation grid, or adjust it to your actual needs.
Tags in memoQ, as in other localization tools, serve as placeholders for source file elements that wrap, or sit between, the text to be translated. If you translate Word documents, your tags will mainly be formatting. So if these tags are placed at the end of the translation, the formatting will be broken. That’s maybe not a big thing if you’re using MT as support for human translation, because your translator will place the tags where they belong, although they probably won’t be very happy about it. But I wanted to use MT as a kind of self-service for teams that don’t require quality translations and just want to understand the text.
Luckily, memoQ offers export/import of bilingual files, and its native format is mqxliff, which is XML. So I started digging. Each segment is wrapped in a trans-unit tag. A source segment that contains tags looks like this.
<source xml:space="preserve" mq:segpart="8">You can <bpt id="1" ctype="underlined">{}</bpt>
<bpt id="2"><hlnk id="rId8" history="1" fileName="document.xml" href="@07c1b597-2ac9-446f-b816-78a8a151172a"></bpt>
<bpt id="3"><rpr id="0"></bpt>download it here for free<ept id="1">{}</ept>
<ept id="3"></rpr id="0" transform="close"></ept><ept id="2"></hlnk></ept>.</source>
We need to work at the XML level, as we want these tags encoded back exactly as they are now. First we need to get rid of the source tags. I don’t like regex in my code, but this time it’s necessary: </?source(.+?)?>. Now we have this.
You can <bpt id="1" ctype="underlined">{}</bpt>
<bpt id="2"><hlnk id="rId8" history="1" fileName="document.xml" href="@07c1b597-2ac9-446f-b816-78a8a151172a"></bpt>
<bpt id="3"><rpr id="0"></bpt>download it here for free<ept id="1">{}</ept>
<ept id="3"></rpr id="0" transform="close"></ept><ept id="2"></hlnk></ept>.
Not bad, but you still don’t want to translate the tags, so we’ll split this with another regex, (<.+?>.+?</.+?>), to get a nice array of strings. Then, if an item in the array starts with <, we add it as-is to the target tag; otherwise we MT it first.
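The two regex steps above can be sketched in Python roughly as follows. This is only an illustration, not the original implementation: `fake_translate` and `build_target` are hypothetical names, and the real Google Translate call is replaced by a stand-in that upper-cases the text so the example runs offline.

```python
import re

# The two patterns quoted in the text.
SOURCE_TAG = re.compile(r"</?source(.+?)?>")
INLINE_TAG = re.compile(r"(<.+?>.+?</.+?>)")


def fake_translate(text: str) -> str:
    """Hypothetical stand-in for a real MT call; it only upper-cases the text."""
    return text.upper()


def build_target(source_xml: str, translate=fake_translate) -> str:
    """Strip the <source> wrapper, keep inline tags as-is, translate the rest."""
    inner = SOURCE_TAG.sub("", source_xml)
    # re.split keeps the captured tag pairs in the result; drop the empty
    # strings that appear between two adjacent tags.
    parts = [p for p in INLINE_TAG.split(inner) if p]
    return "".join(p if p.startswith("<") else translate(p) for p in parts)


# Shortened, single-line version of the sample segment
# ("." in these patterns does not match newlines).
segment = (
    '<source xml:space="preserve" mq:segpart="8">You can '
    '<bpt id="1" ctype="underlined">{}</bpt>download it here for free'
    '<ept id="1">{}</ept>.</source>'
)
print(build_target(segment))
```

A real MT service may trim leading or trailing spaces from pieces such as "You can ", so those may need to be restored around each translated piece.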
Of course we could pass it as-is to Google Translate and receive back a translation with tags intact, but then, when adding it to the original mqxliff, all tags would be escaped, like &lt;bpt id="1" ctype="underlined"&gt;. After import they’d appear as regular text in memoQ and would be exported to DOCX (or whatever the original format was) as such. So not only would the formatting be broken, you’d also have garbage text within your translated text.
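The escaping effect is easy to reproduce. The sketch below uses Python’s `xml.etree.ElementTree`, which is my choice for illustration; the post doesn’t say which XML library was actually used, but most behave the same way when a string containing markup is assigned as element text.

```python
import xml.etree.ElementTree as ET

translated = 'You can <bpt id="1" ctype="underlined">{}</bpt>download it'

# Assigning the string as text: the serializer escapes the markup,
# so the tag ends up as plain characters inside <target>.
as_text = ET.Element("target")
as_text.text = translated
escaped = ET.tostring(as_text, encoding="unicode")

# Parsing the same string as an XML fragment keeps <bpt> as a real child element.
as_xml = ET.fromstring(f"<target>{translated}</target>")
preserved = ET.tostring(as_xml, encoding="unicode")

print(escaped)
print(preserved)
```

The first form is what ends up showing as literal text after import; the second is what the tag-by-tag approach is meant to preserve.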
One last, but very important, thing. We need to set the mq:status attribute of each changed trans-unit tag to "MachineTranslated". Otherwise, even though we have a translation, memoQ treats the segment as empty, and on export it’ll either not be exported or be reverted to the source text, depending on your settings.
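A minimal sketch of the status update, again with `xml.etree.ElementTree`. Several things here are assumptions: the `MQXliff` namespace URI for the `mq` prefix (check the `xmlns:mq` declaration in your own export), the `NotStarted` placeholder value in the sample, and the helper name `mark_machine_translated`. Only the `"MachineTranslated"` value comes from the text above.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
MQ_NS = "MQXliff"  # assumption: verify against xmlns:mq in your own file

# Keep readable prefixes if the tree is written back to disk.
ET.register_namespace("", XLIFF_NS)
ET.register_namespace("mq", MQ_NS)

# Tiny made-up document, just enough structure for the example.
SAMPLE = (
    f'<xliff version="1.2" xmlns="{XLIFF_NS}" xmlns:mq="{MQ_NS}"><file><body>'
    '<trans-unit id="1" mq:status="NotStarted">'
    '<source>Hello</source><target>Ahoj</target></trans-unit>'
    '<trans-unit id="2" mq:status="NotStarted">'
    '<source>World</source><target/></trans-unit>'
    '</body></file></xliff>'
)


def mark_machine_translated(root, changed_ids):
    """Set mq:status on every trans-unit whose id is in changed_ids."""
    changed_ids = set(changed_ids)
    count = 0
    for unit in root.iter(f"{{{XLIFF_NS}}}trans-unit"):
        if unit.get("id") in changed_ids:
            unit.set(f"{{{MQ_NS}}}status", "MachineTranslated")
            count += 1
    return count


root = ET.fromstring(SAMPLE)
updated = mark_machine_translated(root, {"1"})
print(updated)
```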
While digging through mqxliff I found another interesting thing. If a segment is locked, it has the following attributes: translate="no" mq:locked="locked". Apparently it’s enough to remove them from the trans-unit tag, and after import the segment will be unlocked in the memoQ project. That’s very useful, as you currently need a PM license to lock/unlock segments, which is expensive. Plus, it’s a tedious task if you need to do it manually.
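Removing the lock attributes could look something like this. As before, the `MQXliff` namespace URI and the helper name `unlock_segments` are assumptions for illustration, and the sample document is made up; only the two attribute names and values come from the text.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
MQ_NS = "MQXliff"  # assumption: verify against xmlns:mq in your own file
TRANS_UNIT = f"{{{XLIFF_NS}}}trans-unit"
LOCKED = f"{{{MQ_NS}}}locked"

# Tiny made-up document with one locked and one unlocked segment.
SAMPLE = (
    f'<xliff version="1.2" xmlns="{XLIFF_NS}" xmlns:mq="{MQ_NS}"><file><body>'
    '<trans-unit id="1" translate="no" mq:locked="locked">'
    '<source>Hello</source><target>Ahoj</target></trans-unit>'
    '<trans-unit id="2"><source>World</source><target/></trans-unit>'
    '</body></file></xliff>'
)


def unlock_segments(root):
    """Drop translate="no" and mq:locked="locked" from locked trans-units."""
    unlocked = 0
    for unit in root.iter(TRANS_UNIT):
        if unit.get(LOCKED) == "locked":
            unit.attrib.pop(LOCKED)
            unit.attrib.pop("translate", None)
            unlocked += 1
    return unlocked


root = ET.fromstring(SAMPLE)
count = unlock_segments(root)
print(count)
```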