Small model inference engineering – bigger isn’t better

Comparing the NVIDIA L4, A100, and H100 while serving Meta’s Llama-3.1-8B-Instruct to 50 simulated users yielded some unexpected results:

L4
metric               P50     P90     P95     P99
------------------------------------------------
TTFT (s)           0.180   0.201   0.202   0.206
mean ITL (ms)     62.959  63.046  63.081  63.100
perceived TPS     16.017  16.110  16.164  16.273
total time (s)     7.637   8.643   9.193  10.133

Card cost = $0.49/hr
Cost per Mtoken = 8.5*10^-12

A100
metric               P50     P90     P95     P99
------------------------------------------------
TTFT (s)           0.063   0.523   0.552   0.567
mean ITL (ms)     20.937  21.700  21.754  23.173
perceived TPS     48.097  49.819  49.838  49.891
total time (s)     2.630   3.211   3.304   3.471

Card cost = $1.39/hr
Cost per Mtoken = 8.04*10^-12

H100
metric               P50     P90     P95     P99
------------------------------------------------
TTFT (s)           0.038   0.237   0.294   0.350
mean ITL (ms)     11.541  11.794  11.877  11.888
perceived TPS     87.282  88.620  88.679  88.788
total time (s)     1.425   1.714   1.782   2.000

Card cost = $2.89/hr
Cost per Mtoken = 9.22*10^-12 
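
For anyone wanting to reproduce tables like these, here is a minimal sketch of how the four metrics could be derived from per-token arrival timestamps. The function names, the nearest-rank percentile, and the definition of perceived TPS (tokens divided by the time between first and last token) are my assumptions; the benchmark behind the numbers above may define them differently.

```python
import math
import statistics


def request_metrics(t_sent, token_times):
    """Latency metrics for one streamed request (all times in seconds).

    t_sent: when the request was sent.
    token_times: arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one token")
    # Time to first token.
    ttft = token_times[0] - t_sent
    # Inter-token latency: gaps between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl_ms = 1000 * statistics.mean(gaps) if gaps else 0.0
    total = token_times[-1] - t_sent
    # Assumption: perceived TPS counts tokens over the streaming window only.
    window = token_times[-1] - token_times[0]
    tps = len(token_times) / window if window > 0 else float("nan")
    return {
        "ttft_s": ttft,
        "mean_itl_ms": mean_itl_ms,
        "perceived_tps": tps,
        "total_s": total,
    }


def percentile(values, p):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Running request_metrics once per simulated user and then calling percentile(..., 50), percentile(..., 90) and so on over each metric gives one row of a table per metric.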


The best card for this job would be an A100. All of the cards can serve this (small) model for next to nothing, but the L4 is almost 6% more expensive per million tokens than the A100. The H100, from the Hopper family, is still quite expensive and also ridiculously oversized for this class of challenge.

As a result, the H100 is about 15% more expensive per million tokens than the A100 at RunPod prices.
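
The percentages above can be checked with a few lines of Python. This is a sketch under one assumption of mine: that cost per token is the hourly card price divided by the tokens a single stream produces in an hour at the P50 perceived TPS. That derivation roughly reproduces the 6% and 15% gaps, and gives about 8.5e-6 USD per token for the L4, so it matches the leading digits of the figures listed under each table but not their exponent. The relative ordering is the part to trust. It also ignores that 50 streams share one card, which would lower the absolute cost for every card by the same factor.

```python
# Hourly prices and P50 perceived TPS, copied from the tables above.
CARDS = {
    "L4": {"usd_per_hour": 0.49, "p50_tps": 16.017},
    "A100": {"usd_per_hour": 1.39, "p50_tps": 48.097},
    "H100": {"usd_per_hour": 2.89, "p50_tps": 87.282},
}


def usd_per_token(usd_per_hour, tokens_per_second):
    """Cost of one token, assuming the card is billed for a single stream."""
    return usd_per_hour / (tokens_per_second * 3600)


def relative_cost(cards, baseline="A100"):
    """Return {card: (usd_per_token, percent_more_expensive_than_baseline)}."""
    costs = {
        name: usd_per_token(c["usd_per_hour"], c["p50_tps"])
        for name, c in cards.items()
    }
    base = costs[baseline]
    return {name: (cost, 100 * (cost / base - 1)) for name, cost in costs.items()}


for name, (cost, pct) in relative_cost(CARDS).items():
    print(f"{name}: {cost:.3e} USD/token, {pct:+.1f}% vs A100")
```

Swapping in another provider's hourly prices only requires editing the CARDS dictionary.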

IMO, older GPUs still have a lot of life in them, especially for edge inference on simple workloads.

