Chad Gibson

4/24/2025

5 minutes

Harnessing Dynamo: Unlocking Sustainable AI Performance

At Neuralwatt, we’re dedicated to making AI computation more energy-efficient. We believe AI inference already consumes a substantial amount of energy, and as models scale, efficient inference becomes crucial for AI’s continued growth. In this post, we’ll share some initial insights from our testing of NVIDIA’s new Dynamo inference framework.

What is Dynamo?

If you followed the GTC 2025 announcements, you may have heard about Dynamo, NVIDIA’s new open-source inference framework. As the successor to Triton Inference Server and a project that builds on other open-source solutions (notably vLLM), Dynamo is designed to scale inference more efficiently.

For our first series of experiments, we started with a straightforward deployment: an “aggregated hosting” setup where everything runs on a single datacenter GPU. The deployment configuration we used is available on the Dynamo GitHub page.

Datacenter GPU + Model

We work closely with Datacrunch because we value their datacenters powered by 100% renewable energy. For these tests, we used a server in their Iceland datacenter—driven by geothermal energy—to deploy a single H200 GPU with 141 GB of VRAM.

We tested DeepSeek-R1-Distill-Llama-70B (a 70B-parameter Llama distillation of DeepSeek-R1) in its original BF16 format. We eventually plan a deeper analysis of various quantization levels to see how they affect both performance and power efficiency.

Initial Analysis: Concurrency & Tokens/Watt

When scaling large AI inference, multiple factors come into play—latency, throughput, cost, and energy consumption. One of our primary questions was: Does concurrency make the system more or less energy-efficient?

Based on our preliminary tests with Dynamo in a simple single-GPU setup, we found that increasing concurrency significantly improves both throughput and tokens-per-watt. At low concurrency, the GPU draws nearly the same power but runs far longer, so total energy consumption ends up much higher. Specifically:

  • Single concurrency consumed about 2.6 million Joules for 100 requests, roughly the energy of running a 1250 W microwave for 35 minutes.

  • Higher concurrency (5 or 10 concurrent requests) cut total energy to roughly 739 kJ and 456 kJ respectively, while lifting tokens-per-watt from 0.05 to 0.16 and 0.26; the short sketch below reproduces this arithmetic from the appendix numbers.
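
To make the comparison concrete, here is a small Python sketch (not part of our benchmark harness) that reproduces the energy bookkeeping directly from the numbers in the raw data appendix. The 1250 W microwave is only an illustrative baseline, and the script does not touch the GPU.

# Plain arithmetic on the figures reported in the raw data appendix below.
RUNS = {
    # concurrency: (total energy in Joules, input tokens, generated tokens, duration in s)
    1:  (2_622_786.02, 9_110, 109_335, 4588.16),
    5:  (738_824.44, 9_110, 110_037, 1286.96),
    10: (455_546.68, 9_110, 107_062, 783.04),
}

MICROWAVE_WATTS = 1250  # illustrative reference appliance

for concurrency, (joules, tok_in, tok_out, seconds) in RUNS.items():
    avg_power_w = joules / seconds                  # J / s = W
    tokens_per_joule = (tok_in + tok_out) / joules  # lines up with the "Mean Tokens/Watt" rows
    microwave_minutes = joules / MICROWAVE_WATTS / 60
    print(
        f"concurrency {concurrency:>2}: {avg_power_w:6.1f} W average, "
        f"{tokens_per_joule:.3f} tokens/J, ~{microwave_minutes:.0f} microwave-minutes"
    )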


Balancing Latency and User Experience

Many AI chat scenarios demand a certain level of responsiveness. Dynamo streams tokens back to the client as they are generated, enabling interactive chat experiences like ChatGPT or DeepSeek.

  • At a concurrency of 1, we saw a time-to-first-token (TTFT) of about 50ms.

  • Increasing concurrency to 10 raised TTFT to roughly 74ms, still fairly responsive.

While extreme concurrency might add queuing delays, in our experiments Dynamo effectively balanced throughput and acceptable latency.
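
For readers who want to sanity-check these latency numbers themselves, below is a minimal sketch of how a streaming client can measure TTFT. It assumes the Dynamo frontend from the appendix configuration is reachable on localhost:8000 and exposes an OpenAI-style streaming chat endpoint; the exact path and payload shape are assumptions for illustration, not taken from the Dynamo documentation, and the first streamed chunk is treated as the first token (an approximation).

# Minimal TTFT probe against an assumed OpenAI-style streaming endpoint.
import time

import requests

URL = "http://localhost:8000/v1/chat/completions"   # assumed frontend address
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"

def measure_ttft(prompt: str) -> float:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,        # ask the server to stream tokens as they are generated
        "max_tokens": 128,
    }
    start = time.perf_counter()
    with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Server-sent events: each non-empty "data:" line carries one streamed chunk.
            if line.startswith(b"data:") and not line.endswith(b"[DONE]"):
                return time.perf_counter() - start   # first chunk has arrived
    raise RuntimeError("stream ended before any token arrived")

if __name__ == "__main__":
    ttft = measure_ttft("Explain geothermal power in one paragraph.")
    print(f"TTFT: {ttft * 1000:.1f} ms")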

Next Steps

Our initial exploration only scratched the surface of Dynamo’s potential for more efficient large-scale AI inference. Here are the immediate follow-up tests on our roadmap:

  1. Quantization: Quantization can drastically reduce GPU memory requirements, making more VRAM available for caching (both KV caching and prefix caching); the rough arithmetic after this list illustrates the headroom at different precisions. We want to quantify how reduced precision affects power consumption alongside model accuracy.

  2. Disaggregated Model Serving: Dynamo supports separating the “AI Factory” into different components, such as prefill and decode operations. Combining this with robust caching should further boost power efficiency.
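
To put the quantization item in perspective on this hardware, here is a back-of-the-envelope sketch (assumptions on our part, not measured data) of how much of the H200's 141 GB remains for KV and prefix caches at different weight precisions; FP8 and INT4 are example formats, not necessarily the ones we will test.

# Rough VRAM arithmetic for a ~70B-parameter model on a single 141 GB H200.
# Byte-per-parameter figures are the standard storage sizes for each format;
# activation, cache, and framework overheads are deliberately ignored.
PARAMS = 70e9          # approximate parameter count of the distilled 70B model
H200_VRAM_GB = 141

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

for fmt, bytes_per in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * bytes_per / 1e9
    free_gb = max(H200_VRAM_GB - weights_gb, 0)
    print(f"{fmt:>4}: weights ~{weights_gb:3.0f} GB, ~{free_gb:3.0f} GB left for KV/prefix cache")

At BF16 the weights alone occupy roughly 140 GB, which is why the single-GPU aggregated setup above leaves so little room for caching in the first place.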

Our Mission: Optimizing AI Power

All of these power-focused studies align with our mission to make AI computation more sustainable. By finding ways to serve larger inference loads using limited power, we’re also reducing carbon emissions and enabling more cost-effective deployments. If you’re looking to optimize power usage for your own AI or compute needs, let us know—we’d love to help.

Appendix: Raw Data

Single concurrent prompts

======== Inference Load Generator Results ========
Successful requests:                     100       
Benchmark duration (s):                  4588.16   
Total input tokens:                      9110      
Total generated tokens:                  109335    
Request throughput (req/s):              0.02      
Average completion time (s):             45.88     
Output token throughput (tok/s):         23.83     
Total Token throughput (tok/s):          25.82     
-------------- Time to First Token ---------------
Mean TTFT (ms):                          49.58     
Median TTFT (ms):                        47.72     
P50 TTFT (ms):                           47.72     
P90 TTFT (ms):                           49.13     
P95 TTFT (ms):                           50.10     
P99 TTFT (ms):                           109.05    
---- Time per Output Token (excl. 1st token) -----
Mean TPOT (ms):                          41.20     
Median TPOT (ms):                        41.21     
P50 TPOT (ms):                           41.21     
P90 TPOT (ms):                           41.45     
P95 TPOT (ms):                           41.56     
P99 TPOT (ms):                           41.64     
-------------- Inter-token Latency ---------------
Mean ITL (ms):                           41.30     
Median ITL (ms):                         41.18     
P50 ITL (ms):                            41.18     
P90 ITL (ms):                            42.08     
P95 ITL (ms):                            42.26     
P99 ITL (ms):                            42.59     
----------- Power Consumption Metrics ------------
Total Power (J):                         2622786.02
Peak Power (W):                          583.59    
Average Power (W):                       571.64    
Mean Tokens/Watt:                        0.05

Concurrency of 5 prompts

======== Inference Load Generator Results ========
Successful requests:                     100       
Benchmark duration (s):                  1286.96   
Total input tokens:                      9110      
Total generated tokens:                  110037    
Request throughput (req/s):              0.08      
Average completion time (s):             46.79     
Output token throughput (tok/s):         85.50     
Total Token throughput (tok/s):          92.58     
-------------- Time to First Token ---------------
Mean TTFT (ms):                          70.17     
Median TTFT (ms):                        65.87     
P50 TTFT (ms):                           65.87     
P90 TTFT (ms):                           87.40     
P95 TTFT (ms):                           89.31     
P99 TTFT (ms):                           147.42    
---- Time per Output Token (excl. 1st token) -----
Mean TPOT (ms):                          41.62     
Median TPOT (ms):                        41.61     
P50 TPOT (ms):                           41.61     
P90 TPOT (ms):                           42.36     
P95 TPOT (ms):                           42.39     
P99 TPOT (ms):                           42.60     
-------------- Inter-token Latency ---------------
Mean ITL (ms):                           41.76     
Median ITL (ms):                         41.70     
P50 ITL (ms):                            41.70     
P90 ITL (ms):                            42.92     
P95 ITL (ms):                            43.27     
P99 ITL (ms):                            44.04     
----------- Power Consumption Metrics ------------
Total Power (J):                         738824.44 
Peak Power (W):                          603.10    
Average Power (W):                       574.08    
Mean Tokens/Watt:                        0.16

Concurrency of 10 prompts

======== Inference Load Generator Results ========
Successful requests:                     100       
Benchmark duration (s):                  783.04    
Total input tokens:                      9110      
Total generated tokens:                  107062    
Request throughput (req/s):              0.13      
Average completion time (s):             45.83     
Output token throughput (tok/s):         136.73    
Total Token throughput (tok/s):          148.36    
-------------- Time to First Token ---------------
Mean TTFT (ms):                          74.71     
Median TTFT (ms):                        71.35     
P50 TTFT (ms):                           71.35     
P90 TTFT (ms):                           87.57     
P95 TTFT (ms):                           139.69    
P99 TTFT (ms):                           166.76    
---- Time per Output Token (excl. 1st token) -----
Mean TPOT (ms):                          41.87     
Median TPOT (ms):                        41.83     
P50 TPOT (ms):                           41.83     
P90 TPOT (ms):                           42.13     
P95 TPOT (ms):                           42.26     
P99 TPOT (ms):                           44.99     
-------------- Inter-token Latency ---------------
Mean ITL (ms):                           41.87     
Median ITL (ms):                         41.84     
P50 ITL (ms):                            41.84     
P90 ITL (ms):                            43.30     
P95 ITL (ms):                            43.78     
P99 ITL (ms):                            44.96     
----------- Power Consumption Metrics ------------
Total Power (J):                         455546.68 
Peak Power (W):                          613.87    
Average Power (W):                       581.77    
Mean Tokens/Watt:                        0.26

Appendix: Dynamo Configuration

The actual file is part of the Dynamo deployment examples.

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Frontend:
  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  endpoint: dynamo.Processor.chat/completions
  port: 8000

Processor:
  model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  block-size: 64
  max-model-len: 16384
  router: round-robin

VllmWorker:
  model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  enforce-eager: true
  block-size: 64
  max-model-len: 16384
  max-num-batched-tokens: 16384
  enable-prefix-caching: true
  router: random
  tensor-parallel-size: 1
  ServiceArgs:
    workers: 1
    resources:
      gpu: 1

© 2025 neuralwatt