Curbing the Growing Power Needs of Machine Learning
In light of growing concern about the energy requirements of large machine learning models, a recent study from MIT Lincoln Laboratory and Northeastern University has investigated the savings that can be made by power-capping GPUs employed in model training and inference, as well as several other techniques and methods of cutting down AI energy usage.
The new work also calls for new AI papers to conclude with an ‘Energy Statement’ (similar to the recent trend for ‘ethical implication’ statements in papers from the machine learning research sector).
The chief suggestion from the work is that power-capping (limiting the available power to the GPU that’s training the model) offers worthwhile energy-saving benefits, particularly for Masked Language Modeling (MLM), and frameworks such as BERT and its derivatives.
For larger-scale models, which have captured attention in recent years due to hyperscale datasets and new models with billions or trillions of parameters, similar savings can be obtained as a trade-off between training time and energy usage.
For these higher-scale deployments, the researchers found that a 150W cap on power utilization yielded an average 13.7% reduction in energy usage compared to the default 250W maximum, at the cost of a relatively small 6.8% increase in training time.
Additionally, the researchers note that, despite the headlines that the cost of model training has garnered over the last few years, the energy costs of actually using the trained models are far higher*.
‘For language modeling with BERT, energy gains through power-capping are noticeably greater when performing inference than for training. If this is consistent for other AI applications, this could have significant ramifications in terms of energy consumption for large-scale or cloud computing platforms serving inference applications for research and industry.’
Further, and perhaps most controversially, the paper suggests that major training of machine learning models be relegated to the colder months of the year, and to night-time, to save on cooling costs.
The authors state:
‘Evidently, heavy NLP workloads are typically much less efficient in the summer than those executed during winter. Given the large seasonal variation, if there are computationally expensive experiments that can be timed to cooler months, this timing can significantly reduce the carbon footprint.’
The paper also acknowledges the emerging energy-saving possibilities offered by pruning and optimization of model architecture and workflows – though the authors leave further development of this avenue to other initiatives.
Finally, the authors suggest that new scientific papers from the machine learning sector be encouraged, or perhaps constrained, to close with a statement declaring the energy usage of the work conducted in the research, and the potential energy implications of adopting initiatives suggested in the work.
The paper is titled Great Power, Great Responsibility: Recommendations for Reducing Energy for Training Language Models, and comes from six researchers across MIT Lincoln and Northeastern.
Machine Learning’s Looming Energy Grab
As the computational demands of machine learning models have increased in tandem with the usefulness of the results, current ML culture equates energy expenditure with improved performance – in spite of some notable campaigners, such as Andrew Ng, suggesting that data curation may be a more important factor.
In one key MIT collaboration from 2020, it was estimated that a tenfold improvement in model performance entails a 10,000-fold increase in computational requirements, along with a corresponding amount of energy.
Consequently, research into effective but less power-intensive ML training has increased over the last few years. The new paper, the authors claim, is the first to take a deep look at the effect of power caps on machine learning training and inference, with an emphasis on NLP frameworks (such as the GPT series).
Since quality of inference is a paramount concern, the authors state of their findings at the outset:
‘[This] method does not affect the predictions of trained models or consequently their performance accuracy on tasks. That is, if two networks with the same structure, initial values and batched data are trained for the same number of batches under different power-caps, their resulting parameters will be identical and only the energy required to produce them may differ.’
Cutting Down the Power for NLP
To assess the impact of power-caps on training and inference, the authors used the nvidia-smi (System Management Interface) command-line utility, together with an MLM library from HuggingFace.
The authors trained Natural Language Processing models BERT, DistilBERT and Big Bird over MLM, and monitored their power consumption in training and deployment.
The models were trained against DeepAI’s WikiText-103 dataset for four epochs in batches of eight, on 16 V100 GPUs, with four different power caps: 100W, 150W, 200W, and 250W (the default, or baseline, for an NVIDIA V100 GPU). The models featured scratch-trained parameters and random init values, to ensure comparable training evaluations.
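As a rough illustration of what setting such a cap involves, the sketch below builds the nvidia-smi call for one GPU. It is not the authors' script: the helper names `power_cap_command` and `apply_power_cap` are hypothetical, and the only parts taken from nvidia-smi itself are the `-i` (GPU index) and `-pl` (power limit in watts) options.

```python
import shutil
import subprocess

# Caps tested in the study; 250 W is the V100 default.
POWER_CAPS_WATTS = (100, 150, 200, 250)


def power_cap_command(gpu_index: int, watts: int) -> list[str]:
    """Build the nvidia-smi call that sets a power limit on one GPU."""
    if watts <= 0:
        raise ValueError("power limit must be a positive number of watts")
    # -i selects the GPU by index, -pl sets its power limit in watts.
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_cap(gpu_index: int, watts: int) -> bool:
    """Try to apply the cap; return True only if nvidia-smi reports success."""
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA driver tools on this machine
    result = subprocess.run(power_cap_command(gpu_index, watts), check=False)
    return result.returncode == 0
```

Changing a power limit normally needs administrator privileges, so on a shared cluster a call like `apply_power_cap(0, 150)` would usually be made by the scheduler or an operator rather than by the training job itself.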
As seen in the first image above, the results show substantial energy savings in exchange for comparatively small, non-linear increases in training time. The authors state:
‘Our experiments indicate that implementing power caps can significantly reduce energy usage at the cost of training time.’
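The energy figure behind a comparison like this is power integrated over time. The sketch below shows one way to get it from periodic power readings; it assumes the readings are (seconds, watts) pairs, such as those that could be collected with `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits`, and it is not necessarily how the authors did their accounting. The function name is hypothetical.

```python
def energy_watt_hours(samples):
    """Integrate (seconds, watts) samples into watt-hours.

    Uses the trapezoidal rule, so samples need not be evenly spaced,
    but they must be ordered by time.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be ordered by time")
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3600.0  # 1 Wh = 3600 J


# One hour at a steady 250 W, sampled every 30 minutes.
steady = [(0, 250.0), (1800, 250.0), (3600, 250.0)]
one_hour_wh = energy_watt_hours(steady)
```

Because a capped run draws less power but lasts longer, comparing runs by energy rather than by wattage or by duration alone is what makes the trade-off visible.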
Slimming Down ‘Big NLP’
Next the authors applied the same method to a more demanding scenario: training BERT with MLM on distributed configurations across multiple GPUs – a more typical use case for well-funded and well-publicized FAANG NLP models.
The main difference in this experiment was that a model might use anywhere between 2 and 400 GPUs per training instance. The same constraints for power usage were applied, and the same task used (WikiText-103). See the second image above for graphs of the results.
The paper states:
‘Averaging across each choice of configuration, a 150W bound on power utilization led to an average 13.7% decrease in energy usage and 6.8% increase in training time compared to the default maximum. [The] 100W setting has significantly longer training times (31.4% longer on average). A 200W limit corresponds with almost the same training time as a 250W limit but more modest energy savings than a 150W limit.’
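To make the quoted averages concrete, the sketch below applies the 150W figures (13.7% less energy, 6.8% more time) to a baseline run. The baseline of 1,000 kWh over 100 hours is a made-up number for illustration only, and the function name is hypothetical.

```python
def capped_run(baseline_kwh, baseline_hours, energy_change, time_change):
    """Scale a baseline run by fractional changes in energy and time.

    Returns (energy in kWh, duration in hours, mean power in kW).
    """
    energy = baseline_kwh * (1.0 + energy_change)
    hours = baseline_hours * (1.0 + time_change)
    return energy, hours, energy / hours


# Hypothetical baseline at the default 250 W cap.
BASELINE_KWH, BASELINE_HOURS = 1000.0, 100.0

# Average changes reported for the 150 W cap.
energy_150, hours_150, mean_kw_150 = capped_run(
    BASELINE_KWH, BASELINE_HOURS, energy_change=-0.137, time_change=0.068
)

# Mean power relative to the baseline's 10 kW.
power_ratio = mean_kw_150 / (BASELINE_KWH / BASELINE_HOURS)
```

On these assumptions the capped run uses 863 kWh over 106.8 hours, which implies mean power draw about 19% below the baseline. The extra hours cost less energy than the lower draw saves, which is why total energy still falls.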
The authors suggest that these results support power-capping at 150W for GPU architectures and the applications that run on them. They also note that the energy savings obtained translate across hardware platforms, and ran the tests again to compare the outcomes for NVIDIA K80, T4 and A100 GPUs.
Inference, Not Training, Eats Power
The paper cites several prior studies demonstrating that, despite the headlines, it’s inference (the use of a finished model, such as an NLP model) and not training that draws the greatest amount of power, suggesting that as popular models are commodified and enter the mainstream, power usage could become a bigger issue than it currently is at this more nascent stage of NLP development.
Thus the researchers measured the impact of power caps on inference, finding that they have a notable effect on inference latency:
‘Compared to 250W, a 100W setting required double the inference time (a 114% increase) and consumed 11.0% less energy, 150W required 22.7% more time and saved 24.2% the energy, and 200W required 8.2% more time with 12.0% less energy.’
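The quoted figures show that the lowest cap is not the most frugal one for inference. The sketch below tabulates them and picks the cap that saves the most energy within a given latency budget. The selection rule and the name `best_cap` are illustrative assumptions, not a procedure from the paper.

```python
# cap in watts -> (extra inference time, energy saved), both as fractions
# relative to the 250 W default, taken from the figures quoted above.
INFERENCE_TRADEOFFS = {
    100: (1.14, 0.110),
    150: (0.227, 0.242),
    200: (0.082, 0.120),
}

DEFAULT_CAP_WATTS = 250


def best_cap(max_slowdown):
    """Return the cap with the largest energy saving within the slowdown budget.

    Falls back to the default cap if no lower cap fits the budget.
    """
    allowed = {
        cap: saved
        for cap, (slowdown, saved) in INFERENCE_TRADEOFFS.items()
        if slowdown <= max_slowdown
    }
    if not allowed:
        return DEFAULT_CAP_WATTS
    return max(allowed, key=allowed.get)
```

With a latency budget of 25% extra time this selects 150W, and with a 10% budget it selects 200W. Even with an unlimited budget it never selects 100W, since that setting more than doubles inference time while saving less energy than 150W.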
The paper suggests that training (if not inference, for obvious reasons) could be scheduled at times when the data center’s Power Usage Effectiveness (PUE) is at its best, which for PUE means at its lowest – effectively, that’s in the winter, and at night.
‘Significant energy savings can be obtained if workloads can be scheduled at times when a lower PUE is expected. For example, moving a short-running job from daytime to nighttime may provide a roughly 10% reduction, and moving a longer, expensive job (e.g. a language model taking weeks to complete) from summer to winter may see a 33% reduction.
‘While it is difficult to predict the savings that an individual researcher may achieve, the information presented here highlights the importance of environmental factors affecting the overall energy consumed by their workloads.’
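PUE is the ratio of the total energy a facility draws to the energy its computing equipment uses, so a workload's full footprint is its IT energy multiplied by the PUE at the time it runs. The sketch below works through that arithmetic. The PUE values of 1.50 and 1.35 are illustrative placeholders rather than measurements from the paper, and the function names are hypothetical.

```python
def facility_energy_kwh(it_energy_kwh, pue):
    """Total facility energy for a job: IT energy scaled by PUE."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_energy_kwh * pue


def relative_saving(pue_before, pue_after):
    """Fractional drop in facility energy when the same job runs at a lower PUE."""
    return 1.0 - pue_after / pue_before


# Illustrative values only: the same 1,000 kWh job by day and by night.
IT_ENERGY_KWH = 1000.0
day_kwh = facility_energy_kwh(IT_ENERGY_KWH, pue=1.50)
night_kwh = facility_energy_kwh(IT_ENERGY_KWH, pue=1.35)
saving = relative_saving(1.50, 1.35)
```

With these placeholder values the night-time run draws 1,350 kWh at the facility level instead of 1,500 kWh, a 10% reduction, even though the GPUs do exactly the same work.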
Keep it Cloudy
Finally, the paper observes that homegrown processing resources are unlikely to have implemented the same efficiency measures as major data centers and high-level cloud compute players, and that environmental benefits could be gained by transferring workloads to locations that have invested heavily in achieving a low PUE.
‘While there is convenience in having private computing resources that are accessible, this convenience comes at a cost. Generally speaking energy savings and impact is more easily obtained at larger scales. Datacenters and cloud computing providers make significant investments in the efficiency of their facilities.’
* Pertinent links given by the paper.