When overwhelmed, stream it
Recently we had a huge Translation Memory (TM) to clean up. It had been built by copying segments from two other TMs, so it was a mess. It had over a million segments and we needed to shrink it, so the obvious thing to do was to get rid of duplicates. Unfortunately our CAT tool isn’t very good at that, or rather not very fast. But we can export and import a TM in TMX format, which is basically XML. And you can do anything with XML.
At work I mainly program in C#, so my first step was to use XmlDocument. But it’s very slow and uses a lot of RAM, because it loads the whole document into memory. So much, in fact, that it never managed to finish, because the system ran out of memory. OK, dead end.
My second try was good old XSLT. It was designed to transform XML and had so far been pretty robust in my projects. But it failed me this time, for exactly the same reason as above.
I tried to give up, especially since in the meantime we had archived the TM. However, my brain doesn’t give up, so I kept thinking about it. I knew I needed an iterator, but didn’t exactly know how to write one and, more importantly, how to consume it. But I kept digging and found XStreamingElement. After I figured out how it works, I wrote a solution to my problem and…
It downsized the TMX to 47% of its original size (~1.2 GB) in less than 20 seconds. That’s right, seconds, not minutes. And the memory footprint? Less than 250 MB. Wow! I had read about iterators before, and I had even used them in one of my hobby projects. So I knew how powerful they are, but now I was simply astonished. You can find my lousy implementation in this gist, in case you face a similar problem. It only checks whether the source segment has been duplicated, but it can easily be extended to check the target (translation) as well, or the target only.
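To give a rough idea of the approach, here is a minimal sketch. It is not the code from the gist. The names TmxDeduplicator, ReadUnits, DistinctBySource and Clean are made up for illustration, and it assumes that the first tuv element of every tu holds the source segment and that the file uses no XML namespace. It also skips the TMX header element to stay short, so a real tool would have to copy that over too. An XmlReader feeds tu elements one by one into an iterator, and XStreamingElement pulls from that iterator only while it is saving, so the whole document never sits in memory at once.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class TmxDeduplicator
{
    // Lazily yields one <tu> element at a time instead of loading the whole file.
    public static IEnumerable<XElement> ReadUnits(XmlReader reader)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "tu")
            {
                // XNode.ReadFrom consumes the element and leaves the reader
                // on the node that follows it, so no extra Read() here.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }

    // Passes through only the first unit seen for each source text.
    public static IEnumerable<XElement> DistinctBySource(IEnumerable<XElement> units)
    {
        var seen = new HashSet<string>();
        foreach (var tu in units)
        {
            // Assumption: the first <tuv> holds the source segment.
            var source = tu.Element("tuv")?.Element("seg")?.Value ?? string.Empty;
            if (seen.Add(source))
            {
                yield return tu;
            }
        }
    }

    public static void Clean(TextReader input, TextWriter output)
    {
        // TMX files often carry a DOCTYPE; ignore it instead of failing on it.
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore };
        using (var reader = XmlReader.Create(input, settings))
        {
            // Simplification: the <header> element is not copied in this sketch.
            var root = new XStreamingElement("tmx",
                new XAttribute("version", "1.4"),
                new XStreamingElement("body", DistinctBySource(ReadUnits(reader))));

            // The iterators above only start running during Save.
            root.Save(output);
        }
    }
}
```

The only thing that grows with the input here is the HashSet of source texts that have already been seen. Each tu element can be collected as soon as it has been written out.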
The problem with iterators is that most books give you an example like this:
public static System.Collections.Generic.IEnumerable<int>
    EvenSequence(int firstNumber, int lastNumber)
{
    // Yield even numbers in the range.
    for (int number = firstNumber; number <= lastNumber; number++)
    {
        if (number % 2 == 0)
        {
            yield return number;
        }
    }
}
And then you have to figure out how to incorporate it into your project. For the hobby project mentioned above it was easy, because I just combined two streams together, so I wasn’t actually producing anything. But the TMX cleanup was different and I learnt a lot. Mostly that I should use iterators more, and maybe start with them when looking for a solution ;) They’re not so scary after all.
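For what it’s worth, the consuming side that the book examples tend to skip can be sketched like this. The names IteratorDemo, Describe and PrintAll are hypothetical and only there for illustration. The point is that one iterator can be stacked on top of another, and nothing runs until a foreach at the very end starts pulling items through the chain.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class IteratorDemo
{
    // The producer: same idea as the book example above.
    public static IEnumerable<int> EvenSequence(int firstNumber, int lastNumber)
    {
        for (int number = firstNumber; number <= lastNumber; number++)
        {
            if (number % 2 == 0)
            {
                yield return number;
            }
        }
    }

    // A second iterator stacked on the first: it transforms each item
    // as it flows through, without buffering the whole sequence.
    public static IEnumerable<string> Describe(IEnumerable<int> numbers)
    {
        foreach (var number in numbers)
        {
            yield return "even: " + number;
        }
    }

    // The consumer: this foreach is what actually drives both iterators.
    public static void PrintAll(TextWriter output)
    {
        foreach (var line in Describe(EvenSequence(1, 10)))
        {
            output.WriteLine(line);
        }
    }
}
```

Calling IteratorDemo.PrintAll(Console.Out) writes one line for each even number from 2 to 10.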
Not to brag, but I now have two confirmed people my blog has helped;) It’s always nice to know that what you’re writing isn’t for nothing. Thanks!