(Part 2) Machine Translation with Deep Neural Networks
The first part of this article discussed how words are pre-processed and transformed into vectors. In the second part, we’ll explore how AI subsequently processes this data.
Encoding the input sequence
As we’ve seen, a sentence can be transposed into a sequence of vectors using embedding. The meaning of a word, however, often depends on context. “Address”, for instance, can mean either a place to deliver mail or a speech. To translate a word into another language, say German, the system therefore has to consider the surrounding words: at least those in the same sentence, but possibly words much further away as well. If the sentence starts with “The mailman knows my…”, it makes more sense to translate “address” as “Adresse”.
The next step is transposing the input sequence into an internal representation that takes context into account. In its simplest form, this internal representation is just another vector, which now stands for the meaning of a sentence or a unit of meaning.
As sentences can be of arbitrary length, this is not a trivial task. Usually, so-called Recurrent Neural Networks (RNNs) are used for this type of sequential data. They resemble the neural networks discussed in the post on machine vision [LINK MISSING], though they don’t only map an input to an output but additionally maintain a so-called internal state. An RNN is fed with sequence data step by step, in our case word by word. At each step, the RNN calculates its hidden state, which is fed back into the network together with the next word in the following step. This mechanism allows the network to build up a short-term memory, which it uses to retain information about previous input.
The actual output of the net is ignored until the end of the sentence. Thus, the network needs to store the relevant facts about the input sequence in its hidden state. The output of the network in the last step is then the desired representation of the sentence.
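To make this mechanism concrete, here is a minimal sketch of such an encoder in plain Python. The weights are randomly initialised stand-ins for a trained network and the word vectors are made-up toy values; a real system would use learned parameters and far larger dimensions.

```python
import math
import random

random.seed(0)

HIDDEN = 4   # size of the hidden state
EMBED = 3    # size of each word vector

# Randomly initialised weights stand in for a trained encoder.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(EMBED)] for _ in range(HIDDEN)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]

def rnn_step(word_vec, hidden):
    """One encoder step: combine the new word with the previous hidden state."""
    new_hidden = []
    for i in range(HIDDEN):
        s = sum(W_in[i][j] * word_vec[j] for j in range(EMBED))
        s += sum(W_h[i][j] * hidden[j] for j in range(HIDDEN))
        new_hidden.append(math.tanh(s))  # squashing non-linearity
    return new_hidden

def encode(sentence_vectors):
    """Feed the sentence in word by word; the final hidden state is its representation."""
    hidden = [0.0] * HIDDEN             # start with an empty memory
    for vec in sentence_vectors:
        hidden = rnn_step(vec, hidden)  # hidden state is fed back in at the next step
    return hidden

# Three toy word vectors, e.g. for "The mailman knows"
sentence = [[0.1, 0.9, 0.0], [0.7, 0.2, 0.5], [0.3, 0.3, 0.8]]
representation = encode(sentence)
print(representation)  # one vector summarising the whole sequence
```

However long the input sentence is, the result is always a single fixed-size vector, which is exactly what makes this approach work for sequences of arbitrary length.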
Decoding into the target language
The last part of the translation architecture is the decoder network, which transposes this representation into a word sequence in the target language. It, too, uses a recurrent neural network. In the simplest case, the sentence representation calculated in the last encoder step is fed into the network unchanged at every step. Through its hidden state, the network can keep track of which part of the sentence it has already translated.
In each step, the output of the network is a probability distribution over all possible symbols. Let’s assume the system is translating the sentence “I’m happy!” into German. If the decoder RNN has already produced the words “Ich bin” (I am), the network will not immediately output “froh” (happy); it merely assigns this word a high probability, while words such as “traurig” (sad), along with many others that make no grammatical or semantic sense in this position, receive very small probabilities. This allows for multiple solutions to a given translation problem; in the end, the network puts out the sentence with the highest overall probability.
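A sketch of how one decoder step turns raw scores into such a probability distribution, using the standard softmax function. The four-word vocabulary and the scores are invented for illustration; a real decoder scores tens of thousands of symbols.

```python
import math

# Hypothetical raw scores the decoder might produce after
# having already output "Ich bin":
scores = {"froh": 3.2, "traurig": 0.4, "Haus": -2.0, "läuft": -1.5}

def softmax(raw):
    """Turn raw scores into a probability distribution over the vocabulary."""
    exps = {word: math.exp(s) for word, s in raw.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

probs = softmax(scores)
for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {p:.3f}")

# "froh" receives by far the highest probability; grammatically or
# semantically implausible words end up near zero.
best = max(probs, key=probs.get)
```

In practice, the single most probable word isn’t simply taken at each step; typically a beam search keeps several candidate sentences in parallel and finally returns the one with the highest overall probability.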
Since all system parameters – the entries of the embedding matrix and the network weights of the sentence encoder and decoder – are initially random, an untrained system would return translations made up of random strings of words. That’s why neural systems first need many translation examples to train on. For each pair of corresponding sentences in the source and target language, the deviation between the output of the network and the desired output is calculated step by step. The goal of the complete system is, after all, to minimise this deviation and to predict the desired output as well as possible. As all the components described are differentiable, they can be optimised with the backpropagation algorithm.
A system trained and designed in such a way can “learn” the statistical characteristics of a language from many millions of sentences. The rules of a language determine which sequences show up in the training examples, after all. You can think of it as an intricate pencil drawing where each line may be a weak grey, but the aggregate multitude of lines represents what the artist sees. In a similar fashion, valid constructs in a given language become apparent through their statistical accumulation. The translation results of this relatively simple method are impressive and usually far outperform traditional approaches.
Google recently demonstrated that neural translation networks can translate texts from one language into another even when the training data doesn’t contain that particular language pair. To do so, the network architecture was expanded so that the system can handle multiple input and output languages. Once the encoder network receives the additional information about which language to translate into, the model can be trained with pairs of any desired languages. It’s especially impressive that the sentence representation the encoder learns acts as a sort of lingua franca or “bridge language” that emerges on its own during training: semantically similar sentences from different languages are mapped onto nearly identical representations (see Googleblog).
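The extra information is supplied in a strikingly simple way: an artificial token naming the target language is prepended to the source sentence, and the architecture is otherwise left unchanged. A minimal sketch – the “&lt;2xx&gt;” token spelling follows the convention described by Google, while the helper function name is ours:

```python
def prepare_input(sentence_tokens, target_lang):
    """Prepend an artificial token telling the system which language
    to translate into; the rest of the architecture stays unchanged."""
    return [f"<2{target_lang}>"] + sentence_tokens

print(prepare_input(["I'm", "happy", "!"], "de"))
# → ['<2de>', "I'm", 'happy', '!']
```

The same sentence can thus be sent to any target language the model has seen, simply by swapping the leading token.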
This raises the question of whether these systems can be considered intelligent. Only a few people are capable of translating between dozens of languages. Yet even if neural networks can pull off this impressive feat, they are basically nothing more than statistical learning procedures crunching data correlations. That makes them very data-hungry by nature: they need to be trained on millions of data points to produce work of satisfactory quality. Humans, on the other hand, can transfer learned concepts to a new problem and solve it using innate logic, even with just a few hints.
A good example is the decipherment of the Egyptian hieroglyphs. Until the discovery of the Rosetta Stone in 1799, there was no known connection to other languages. With the help of a text engraved in Demotic, Greek, and hieroglyphs, scholars managed to deduce the meaning of the Egyptian symbols. To accomplish this, they had to integrate and interpret their existing knowledge, go down quite a few blind alleys, and stay creative throughout – all things far beyond an AI’s current capabilities. Deciphering the hieroglyphs took almost a quarter of a century – let’s see how long it takes AI to match humans.