New research into the carbon footprint created by machine learning translation models indicates that German may be the most carbon-intensive popular language to train, though it is not entirely clear why. The new report is intended to open up additional avenues of research into more carbon-efficient AI training methods, in the context of growing awareness of the extent to which machine learning systems consume electricity.
The preprint paper is titled Curb Your Carbon Emissions: Benchmarking Carbon Emissions in Machine Translation, and comes from researchers at India's Manipal Institute of Technology.
The authors measured training times and calculated carbon-emission values for a range of inter-language translation models, and found ‘a notable disparity' between the training times of the three most carbon-intensive language pairings and those of the three most carbon-economical ones.
The paper found that the most ‘ecological' language pairings to train are English>French, French>English and, paradoxically, German>English, while German features in all three of the highest-consuming pairs: French>German, English>German and German>French.
The findings suggest that lexical diversity ‘is directly proportional to training time to achieve an adequate level of performance', and note that German has the highest lexical diversity score of the three tested languages, as estimated by its Type-Token Ratio (TTR) – the ratio of unique words (types) to the total number of words (tokens) in a text.
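As an illustration of the metric (a minimal sketch, not the paper's exact tokenization, which is unspecified here), TTR can be computed by splitting a text into tokens and dividing the count of distinct tokens by the total:

```python
def type_token_ratio(text: str) -> float:
    """Type-Token Ratio: unique words (types) divided by total words (tokens).

    A higher TTR suggests greater lexical diversity. Whitespace tokenization
    and lowercasing are simplifying assumptions for illustration.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("the cat sat on the mat"))  # 5 types / 6 tokens ≈ 0.833
```

Note that TTR is sensitive to text length, which is why comparisons are normally made over samples of similar size.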
The increased demands of processing German in translation models are not reflected in the source data used for the experiment. In fact, German yielded fewer derived tokens (299,445) than English (320,108), and far fewer than French (335,917).
The challenge, from a Natural Language Processing (NLP) standpoint, is to decompose compound German words into their constituent parts, a process known as compound splitting or decompounding. NLP systems often have to accomplish this for German without the pre-‘split' surrounding grammar or contextual clues found in languages with lower TTR scores, such as English.
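A toy version of decompounding can be sketched with a greedy dictionary lookup. The lexicon below is hypothetical, and real decompounders additionally score candidate splits by corpus frequency and handle German linking elements such as ‘-s-':

```python
# Hypothetical mini-lexicon of known German word stems (lowercased).
LEXICON = {"rind", "fleisch", "etikettierung", "haus", "tür", "schlüssel"}

def split_compound(word: str, lexicon=LEXICON):
    """Recursively split a compound into lexicon entries, longest prefix first.

    Returns a list of parts, or None if no full decomposition exists.
    """
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 0, -1):  # try longest prefixes first
        prefix, rest = word[:i], word[i:]
        if prefix in lexicon:
            tail = split_compound(rest, lexicon)
            if tail:
                return [prefix] + tail
    return None

print(split_compound("Rindfleischetikettierung"))
# ['rind', 'fleisch', 'etikettierung']
```

A production system would disambiguate between competing decompositions (e.g. preferring splits whose parts are frequent in a corpus), which this sketch does not attempt.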
The German language has some of the longest individual words in the world, though in 2013 it lost official recognition of its 65-character former record-breaker, which is long enough to require its own line in this article:
The word refers to a law delegating beef label monitoring, but fell out of existence due to a change in European regulations that year, conceding the place to other popular stalwarts, such as ‘widow of a Danube steamboat company captain' (49 characters):
In general, German's syntactic structure requires a departure from the word-order assumptions that underpin NLP practices for many western languages, with the popular (Berlin-based) spaCy NLP framework only adding support for its own native language in 2016.
Data and Testing
For source data, the researchers used the Multi30k dataset, containing 30,000 samples across the French, German and English languages.
The first of the two models used by the researchers was Facebook AI's 2017 Convolutional Sequence to Sequence (ConvSeq), a neural network that contains convolutional layers but which lacks recurrent units, and instead uses filters to derive features from text. This allows all operations to take place in a computationally efficient parallel manner.
The second approach used Google's influential Transformer architecture, also from 2017, which relies on linear layers, attention mechanisms and normalization routines. The originally released model has come under criticism for carbon inefficiency, and claims of subsequent improvements have been contested.
The experiments were carried out on Google Colab, uniformly on a Tesla K80 GPU. The languages were compared using the BLEU (Bilingual Evaluation Understudy) metric, with emissions estimated via the CodeCarbon Machine Learning Emissions Calculator. The models were trained for 10 epochs.
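For readers unfamiliar with the metric, the following is a simplified BLEU sketch (single reference, n-grams up to 2, no smoothing). It is not the paper's exact scoring setup, which is unspecified here, but it shows the core idea: clipped n-gram precision combined with a brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty for short candidates."""
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(c, n)), Counter(ngrams(r, n))
        total = max(len(c) - n + 1, 0)
        if total == 0:
            return 0.0
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        if overlap == 0:
            return 0.0  # a zero precision zeroes the geometric mean
        precisions.append(overlap / total)
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / len(c))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

Production evaluations typically use n-grams up to 4, multiple references and smoothing, as in standard BLEU implementations.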
The researchers found that it was the extended training duration of German-related language pairs that tipped the balance toward higher carbon consumption. Though some other pairs, such as English>French and French>English, at points exhibited even higher rates of carbon consumption, they trained more quickly and converged more easily, with these brief spurts of consumption characterized by the researchers as ‘relatively insignificant' in relation to the total consumption of pairings that include German.
The researchers conclude:
‘Our findings provide clear indication that some language pairs are more carbon intense to train than others, a trend that carries over different architectures as well.'
‘However, there remain unanswered questions regarding why there are such stark differences in training models for a particular language pair over another, and whether different architectures might be more suited for these carbon-intense language pairs, and why this would be the case if true.'
The paper emphasizes that the reasons for the disparity in carbon consumption across training models are not entirely clear. The researchers anticipate extending this line of study to non-Latin-based languages.