Scott Chamberlin

1/29/25

2 min read

The claims of reduced compute and energy demands for DeepSeek-style models may be overblown.

We began evaluating DeepSeek last week (like everyone else in the world) to assess its energy impact and optimization capabilities for our platform. Below are the results from our real-world inference test suite.
I chose to compare Llama 3.3 70B to DeepSeek R1 70B here, as I can effectively stress a single H100 with each without extensive setup. There is an apples-to-oranges element: while DeepSeek R1 70B is distilled from Llama 3.3, it is designed to produce a chain of thought and generate a much longer output window, whereas Llama 3.3 is tuned for chat and some reasoning. I am primarily interested in the energy consumption of each, and I will normalize the results (both have the same number of parameters, which helps control for that).
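The efficiency metric used below, tokens/Ws, comes from integrating sampled GPU power over the duration of a run and dividing the generated tokens by that energy. A minimal sketch of that calculation (the sample values and function names are illustrative, not our actual instrumentation):

```python
def energy_ws(samples):
    """Integrate (timestamp_s, power_w) samples into watt-seconds (trapezoidal rule)."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (t1 - t0) * (p0 + p1) / 2
    return total

def tokens_per_ws(tokens, samples):
    """Energy efficiency: output tokens per watt-second of GPU energy."""
    return tokens / energy_ws(samples)

# Hypothetical run: 10 s at a steady 700 W draw, 1750 tokens generated.
samples = [(t, 700.0) for t in range(11)]  # one power sample per second
print(tokens_per_ws(1750, samples))  # 0.25 tokens/Ws
```

In practice the power samples would come from polling the GPU (e.g. via NVML) while the inference workload runs.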

The net outcome is that the two models have approximately the same energy efficiency (tokens/Ws), with DeepSeek R1 70B running slightly slower (about 4.8% fewer tokens/s). However, R1 has roughly twice the runtime (because it generates so many more output tokens), resulting in 87% more total energy consumption than Llama 3.3 on the same set of prompts. Even if the token output were constrained, the per-token energy efficiency would be similar, meaning the total energy impact compared to more commonly deployed models is likely the same (output constrained) or worse (longer outputs).
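The key point is that at equal per-token efficiency, total energy scales directly with output length. A quick arithmetic check with hypothetical numbers (the token counts and efficiency below are illustrative, not our measured values):

```python
def total_energy_ws(tokens: int, tokens_per_ws: float) -> float:
    """Total energy (watt-seconds) = output tokens / per-token efficiency."""
    return tokens / tokens_per_ws

EFFICIENCY = 0.25        # tokens per watt-second, assumed equal for both models
LLAMA_TOKENS = 100_000   # tokens generated on the prompt set (hypothetical)
R1_TOKENS = 187_000      # ~87% more output tokens (hypothetical)

llama_e = total_energy_ws(LLAMA_TOKENS, EFFICIENCY)
r1_e = total_energy_ws(R1_TOKENS, EFFICIENCY)
print(f"extra energy: {r1_e / llama_e - 1:.0%}")  # extra energy: 87%
```

With efficiency held constant, the 87% increase in total energy falls straight out of the 87% increase in generated tokens.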

While the consensus seems to be that training DeepSeek likely used significantly less energy than comparable models, we believe that inference will constitute the bulk of energy and computational usage in the future. Better models will be more useful, and even if we improve their energy efficiency, we will likely encounter Jevons paradox: they will be used even more widely.

Looking at the data, I can only anticipate that these innovations will keep us on the same energy-consumption growth trend as before DeepSeek V3/R1 was released.

© 2025 neuralwatt