{"id":464,"date":"2026-06-10T15:29:11","date_gmt":"2026-06-10T13:29:11","guid":{"rendered":"https:\/\/dubidu.io\/?p=464"},"modified":"2026-06-10T15:29:32","modified_gmt":"2026-06-10T13:29:32","slug":"small-model-inference-engineering-bigger-isnt-better","status":"publish","type":"post","link":"https:\/\/dubidu.io\/?p=464","title":{"rendered":"Small model inference engineering &#8211; bigger isn&#8217;t better"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Comparing NVIDIA L4, A100 and H100 when serving Meta&#8217;s Llama-3.1-8B-Instruct across 50 simulated users yielded some unexpected results:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">L4<br>metric                   P50       P90       P95       P99<br>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<br>TTFT (s)               0.180     0.201     0.202     0.206<br>mean ITL (ms)         62.959    63.046    63.081    63.100<br>perceived TPS         16.017    16.110    16.164    16.273<br>total time (s)         7.637     8.643     9.193    10.133<br><br>Card Cost = 0.49$\/hr<br>Cost per Mtoken = 8.5*10^-12<br><br>A100<br>metric                   P50       P90       P95       P99<br>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<br>TTFT (s)               0.063     0.523     0.552     0.567<br>mean ITL (ms)         20.937    21.700    21.754    23.173<br>perceived TPS         48.097    49.819    49.838    49.891<br>total time (s)         2.630     3.211     3.304     3.471<br><br>Card Cost = 1.39$\/hr<br>Cost per Mtoken = 8.04*10^-12<br><br>H100<br>metric                   P50       P90       P95       P99<br>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<br>TTFT (s)               0.038     0.237     0.294     0.350<br>mean ITL (ms)         
11.541    11.794    11.877    11.888<br>perceived TPS         87.282    88.620    88.679    88.788<br>total time (s)         1.425     1.714     1.782     2.000<br><br>Card Cost = 2.89$\/hr<br>Cost per Mtoken = 9.22*10^-12<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The best card for this job is the A100. All three cards can serve this (small) model for next to nothing, but the L4 is almost 6% more expensive per million tokens than the A100. The H100, from the Hopper family, is still quite expensive and ridiculously oversized for this class of workload.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As a result, the H100 is about 15% more expensive per million tokens than the A100 at Runpod prices.<br><br>IMO, older GPUs still have a lot of life in them, especially for edge inference on simple workloads.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Comparing NVIDIA L4, A100 and H100 when serving Meta&#8217;s Llama-3.1-8B-Instruct across 50 simulated users yielded some unexpected results: L4metric P50 P90 P95 P99&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-TTFT (s) 0.180 0.201 0.202 0.206mean ITL (ms) 62.959 63.046 63.081 63.100perceived TPS 16.017 16.110 16.164 16.273total time (s) 7.637 8.643 9.193 10.133 Card Cost = 0.49$\/hrCost per Mtoken = 8.5*10^-12 A100metric P50 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-464","post","type-post","status-publish","format-standard","hentry","category-tech"],"_links":{"self":[{"href":"https:\/\/dubidu.io\/index.php?rest_route=\/wp\/v2\/posts\/464","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dubidu.io\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dubidu.io\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dubidu.io\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dubidu.io\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=464"}],"version-history":[{"count":2,"href":"https:\/\/dubidu.io\/index.php?rest_route=\/wp\/v2\/posts\/464\/revisions"}],"predecessor-version":[{"id":466,"href":"https:\/\/dubidu.io\/index.php?rest_route=\/wp\/v2\/posts\/464\/revisions\/466"}],"wp:attachment":[{"href":"https:\/\/dubidu.io\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=464"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dubidu.io\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=464"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dubidu.io\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=464"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}