I recently did some testing using Dolphin-Llama3 across various (inexpensive-ish) AWS instances to compare performance. The results are in line with what one might expect.
Testing was done using default settings with Ollama. For each test I spun up a new instance on Ubuntu, installed Ollama, and ran Dolphin-Llama3 with the --verbose flag.
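For anyone who wants to collect the same numbers without copying them by hand, here is a minimal sketch of pulling the two rates out of the verbose summary. The sample text is an assumption about the shape of that summary (lines like "eval rate: 41.71 tokens/s"), filled in with the g4dn.xlarge figures from this post; the names `SAMPLE`, `RATE_LINE`, and `parse_rates` are made up for illustration.

```python
import re

# Assumed shape of the rate lines in the --verbose summary. The real
# summary has more lines (durations, token counts); they are ignored here.
SAMPLE = """\
prompt eval rate:     222.23 tokens/s
eval rate:            41.71 tokens/s
"""

# Match only the two rate lines, anchored at the start of a line so that
# "eval rate" is not picked up a second time inside "prompt eval rate".
RATE_LINE = re.compile(
    r"^[ \t]*(prompt eval rate|eval rate):\s+([\d.]+)\s+tokens/s",
    re.MULTILINE,
)


def parse_rates(text):
    """Return a dict mapping rate name to tokens/s for the lines found."""
    return {name: float(value) for name, value in RATE_LINE.findall(text)}


if __name__ == "__main__":
    print(parse_rates(SAMPLE))
```

If your Ollama version prints the summary differently, the regular expression would need adjusting.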
Key Takeaways:
-Fastest Prompt Eval Rate: AWS g5 (fastest AWS instance tested)
-Fastest Eval Rate: Home PC w/RTX 3080
-Best Cost-Performance Balance: AWS g4dn.xlarge offers a good balance of performance and cost, at $0.58/hr.
-GPU speed is the key differentiator. Within the same instance family (g4dn or g5), the eval rates stay consistent across instance sizes. If the model fits in GPU memory, there is no need for more cores/memory.
-I did notice that the more system memory was available, the more tokens the model tended to produce in its output.
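One rough way to check the cost-performance takeaway is to convert each eval rate and hourly price into generated tokens per dollar. This is only a sketch: it uses the eval rates and prices from the test results in this post, assumes the instance generates at that rate for the whole hour, and ignores prompt processing, idle time, and storage. The names `RESULTS`, `tokens_per_dollar`, and `rank` are made up for illustration.

```python
# (instance, eval rate in tokens/s, price in $/hr), taken from this post.
RESULTS = [
    ("c7g.8xlarge", 25.07, 1.27),
    ("r6g.4xlarge", 8.29, 0.88),
    ("g4dn.xlarge", 41.71, 0.58),
    ("g4dn.2xlarge", 41.74, 0.84),
    ("g5.xlarge", 68.08, 1.12),
    ("g5.2xlarge", 66.67, 1.35),
    ("Oracle Ampere 16 core", 9.01, 0.1276),
    ("Oracle Ampere 32 core", 14.11, 0.2796),
]


def tokens_per_dollar(eval_rate, price_per_hour):
    """Tokens generated per dollar, assuming a fully busy hour."""
    return eval_rate * 3600 / price_per_hour


def rank(results):
    """Return (instance, tokens per dollar) pairs, best value first."""
    scored = [(name, tokens_per_dollar(rate, price)) for name, rate, price in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, value in rank(RESULTS):
        print(f"{name:<24}{value:>12,.0f} tokens per dollar")
```

By this measure the g4dn.xlarge comes out on top, with the 16 core Oracle Ampere close behind on cost alone, although at a much lower eval rate.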
Test Results
AWS Instances
c7g.8xlarge (Compute Instance)
•32 cores, 64GB RAM
•Prompt Eval Rate: 38.38 tokens/s
•Eval Rate: 25.07 tokens/s
•Price: $1.27/hr, $941.16/mo

r6g.4xlarge (Memory Instance)
•16 cores, 128GB RAM
•Prompt Eval Rate: 10.15 tokens/s
•Eval Rate: 8.29 tokens/s
•Price: $0.88/hr, $657.10/mo

g4dn.xlarge (GPU Instance)
•4 cores, 16GB RAM, 16GB GPU
•Prompt Eval Rate: 222.23 tokens/s
•Eval Rate: 41.71 tokens/s
•Price: $0.58/hr, $434.50/mo

g4dn.2xlarge (GPU Instance)
•8 cores, 32GB RAM, 32GB GPU
•Prompt Eval Rate: 214.25 tokens/s
•Eval Rate: 41.74 tokens/s
•Price: $0.84/hr, $621.24/mo

g5.xlarge (GPU Instance)
•4 cores, 16GB RAM, 24GB GPU
•Prompt Eval Rate: 624.29 tokens/s
•Eval Rate: 68.08 tokens/s
•Price: $1.12/hr, $831.05/mo

g5.2xlarge (GPU Instance)
•8 cores, 32GB RAM, 24GB GPU
•Prompt Eval Rate: 624.48 tokens/s
•Eval Rate: 66.67 tokens/s
•Price: $1.35/hr, $1,000.96/mo
Local Machines
M2 MacMini
•M2, 8GB RAM, <8GB GPU
•Prompt Eval Rate: 66.38 tokens/s
•Eval Rate: 18.33 tokens/s

M1 MacBook Air
•M1, 16GB RAM, <16GB GPU
•Prompt Eval Rate: 71.58 tokens/s
•Eval Rate: 11.46 tokens/s

Home PC w/RTX 3080
•Intel i5, 64GB RAM, 10GB GPU
•Prompt Eval Rate: 185.67 tokens/s
•Eval Rate: 83.79 tokens/s
Oracle Ampere
Ampere 16 Core, 32GB RAM
•Prompt Eval Rate: 11.96 tokens/s (Duration: 1m34.955180835s)
•Eval Rate: 9.01 tokens/s (Duration: 1m28.461256s)
•Price: $0.1276/hr, $95/mo

Ampere 32 Core, 32GB RAM
•Prompt Eval Rate: 22.54 tokens/s (Duration: 47.93207936s)
•Eval Rate: 14.11 tokens/s (Duration: 44.423782s)
•Price: $0.2796/hr, $208/mo
Here's the data formatted as a table for easier viewing, courtesy of u/sergeant113: https://www.reddit.com/r/LocalLLaMA/comments/1dclmwt/comment/l7zrgzm/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button