
Chad Gibson
4/24/2025
5 minutes
Harnessing Dynamo: Unlocking Sustainable AI Performance
At Neuralwatt, we’re dedicated to making AI computation more energy-efficient. We believe AI inference already consumes a substantial amount of energy, and as models scale, efficient inference becomes crucial for AI’s continued growth. In this post, we’ll share some initial insights from our testing of NVIDIA’s new Dynamo inference framework.
What is Dynamo?
If you followed the GTC 2025 announcements, you may have heard about Dynamo, NVIDIA's new open-source inference framework. As the successor to Triton Inference Server, Dynamo builds on other open-source projects (notably vLLM) and is designed to scale inference more efficiently.
For our first series of experiments, we started with a straightforward deployment: an “aggregated hosting” setup where everything runs on a single datacenter GPU. The deployment configuration we used is available on the Dynamo GitHub page.
Datacenter GPU + Model
We work closely with Datacrunch because we value their datacenters powered by 100% renewable energy. For these tests, we used a server in their Iceland datacenter—driven by geothermal energy—to deploy a single H200 GPU with 141 GB of VRAM.
We tested the DeepSeek-R1 model (a Llama 70B-parameter distillation) in its original BF16 format. We eventually plan a deeper analysis across quantization levels to see how they affect both performance and power efficiency.
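For rough orientation, here is a back-of-the-envelope sketch (illustrative arithmetic, not a measurement from these runs) of the weight footprint at different precisions. It shows why BF16 nearly fills a single 141 GB H200 and why quantization frees meaningful headroom for caching.

```python
# Back-of-the-envelope weight footprint for a ~70B-parameter model.
# Illustrative arithmetic only, not measurements from our Dynamo runs.

PARAMS = 70e9            # ~70 billion parameters
H200_VRAM_GB = 141       # H200 memory capacity in GB

BYTES_PER_PARAM = {      # bytes per parameter at common precisions
    "BF16": 2,
    "FP8": 1,
    "INT4": 0.5,
}

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    headroom_gb = H200_VRAM_GB - weights_gb
    print(f"{precision}: ~{weights_gb:.0f} GB of weights, "
          f"~{headroom_gb:.0f} GB left for KV cache and activations")
```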

Initial Analysis: Concurrency & Tokens/Watt
When scaling large AI inference, multiple factors come into play—latency, throughput, cost, and energy consumption. One of our primary questions was: Does concurrency make the system more or less energy-efficient?
Based on our preliminary tests with Dynamo in a simple single-GPU setup, we found that increasing concurrency significantly improves both throughput and tokens-per-watt. At low concurrency, the GPU runs for much longer at nearly the same power draw, so the total energy consumed for the same batch of requests is higher. Specifically:
Single concurrency consumed over 2.5 million Joules for 100 requests—roughly the energy of running a 1250W microwave for 33 minutes.
Higher concurrency (e.g., 5 or 10 concurrent requests) greatly reduced total energy consumption while producing more tokens-per-watt.
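To make the metric concrete, here is a minimal sketch of how total energy and tokens-per-joule fall out of average GPU power, wall-clock time, and generated tokens. The run values below are illustrative placeholders, not our measured data (the measured numbers are in the raw-data appendix).

```python
# Minimal sketch: derive total energy and tokens-per-joule for a run.
# The sample values are illustrative placeholders, not our measured data.

def energy_joules(avg_power_watts: float, duration_seconds: float) -> float:
    """Energy = average power draw x wall-clock time."""
    return avg_power_watts * duration_seconds

def tokens_per_joule(total_tokens: int, joules: float) -> float:
    """Efficiency metric: generated tokens per joule of GPU energy."""
    return total_tokens / joules

# Hypothetical runs of the same 100 requests at two concurrency levels.
runs = {
    1:  {"avg_power_w": 420.0, "duration_s": 6000.0, "tokens": 80_000},
    10: {"avg_power_w": 600.0, "duration_s": 900.0,  "tokens": 80_000},
}

for concurrency, run in runs.items():
    joules = energy_joules(run["avg_power_w"], run["duration_s"])
    print(f"concurrency {concurrency:>2}: {joules / 1e6:.2f} MJ total, "
          f"{tokens_per_joule(run['tokens'], joules):.3f} tokens per joule")
```

The point of the metric is that higher concurrency finishes the same work in far less wall-clock time, so even at a higher instantaneous power draw the total joules (and joules per token) drop.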

Balancing Latency and User Experience
Many chat AI scenarios demand a certain level of responsiveness. Dynamo streams tokens back to the client as they're generated, enabling interactive chat experiences like ChatGPT or DeepSeek.
At a concurrency of 1, we saw a time-to-first-token (TTFT) of about 50ms.
Increasing concurrency to 10 raised TTFT to roughly 74ms, still fairly responsive.
While extreme concurrency might add queuing delays, in our experiments Dynamo effectively balanced throughput and acceptable latency.
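For reference, here is a minimal client-side sketch of how TTFT can be measured against a streaming endpoint. It assumes the deployment exposes an OpenAI-compatible chat-completions API; the base URL, port, and model id below are placeholders, not our actual configuration.

```python
# Minimal client-side TTFT measurement sketch.
# Assumes an OpenAI-compatible streaming endpoint; the base URL, port,
# and model id are placeholders, not our actual configuration.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def measure_ttft(prompt: str) -> float:
    """Return seconds from request submission to the first streamed token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="deepseek-r1-distill-llama-70b",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=64,
    )
    for chunk in stream:
        # The first chunk that carries content marks time-to-first-token.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without any content")

ttft_ms = measure_ttft("Explain geothermal energy in one sentence.") * 1000
print(f"TTFT: {ttft_ms:.0f} ms")
```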
Next Steps
Our initial exploration only scratched the surface of Dynamo’s potential for more efficient large-scale AI inference. Here are the immediate follow-up tests on our roadmap:
Quantization: Quantization can drastically reduce GPU memory requirements, making more VRAM available for caching (both KV caching and prefill caching); the sketch after this list puts rough numbers on that headroom. We want to quantify how reduced precision affects power consumption alongside model accuracy.
Disaggregated Model Serving: Dynamo supports separating the “AI Factory” into different components, such as prefill and decode operations. Combining this with robust caching should further boost power efficiency.
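To put rough numbers on the caching headroom mentioned in the quantization item above, here is a sketch of the KV-cache footprint per token. It assumes a Llama-3-style 70B architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128); those architectural values are our assumption, not something we verified against the exact distilled checkpoint.

```python
# Rough KV-cache footprint per token for a Llama-3-style 70B model.
# Architecture values (80 layers, 8 KV heads, head dim 128) are assumed,
# not verified against the exact DeepSeek-R1 distilled checkpoint.

LAYERS = 80
KV_HEADS = 8     # grouped-query attention
HEAD_DIM = 128

def kv_bytes_per_token(bytes_per_element: float) -> float:
    # A key vector and a value vector are cached for every layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element

for label, nbytes in [("BF16 KV cache", 2), ("FP8 KV cache", 1)]:
    per_token = kv_bytes_per_token(nbytes)
    tokens_per_gb = 1e9 / per_token
    print(f"{label}: ~{per_token / 1024:.0f} KiB per token, "
          f"~{tokens_per_gb:,.0f} tokens per GB of free VRAM")
```

Every gigabyte of VRAM freed by quantizing weights therefore buys room for a few thousand additional cached tokens, which is where the prefill- and KV-caching gains would come from.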
Our Mission: Optimizing AI Power
All of these power-focused studies align with our mission to make AI computation more sustainable. By finding ways to serve larger inference loads using limited power, we’re also reducing carbon emissions and enabling more cost-effective deployments. If you’re looking to optimize power usage for your own AI or compute needs, let us know—we’d love to help.
Appendix: Raw Data
Single concurrent prompts
Concurrency of 5 prompts
Concurrency of 10 prompts
Appendix: Dynamo Configuration
The configuration file we used ships with the Dynamo deployment examples on GitHub.