🌐
Reddit
reddit.com › r/localllm › aws gpu instances that can handle micro llms
r/LocalLLM on Reddit: AWS GPU instances that can handle micro LLMs
December 17, 2023 -

What is the best AWS server instance with a local dedicated GPU capable of running a GPT-3.5-equivalent micro LLM adequate for embedding and summarization? This would be for inference only, no training.

🌐
AWS
aws.amazon.com › amazon ec2 › instance types › p5 instances
Amazon EC2 P5 Instances – AWS
2 days ago - GPU-based EC2 instances, and reduce cost to train ML models by up to 40%. These instances help you iterate on your solutions at a faster pace and get to market more quickly. You can use P5, P5e, and P5en instances for training and deploying complex large language models (LLMs) and diffusion ...
🌐
Chariot Solutions
chariotsolutions.com › home › getting started with llm in the cloud with amazon dlami ec2 instances
Getting started with LLM in the Cloud with Amazon DLAMI EC2 Instances — Chariot Solutions
April 3, 2024 - Still, $6k for a workstation that can delay your need for cloud GPUs in development is not a bad investment. Managed cloud services – Amazon has a wide variety of cloud services available, including Amazon Kendra, which costs $810 / month / developer for the license (yes, there is a free tier of 750 hours available to start), Amazon Bedrock, a serverless pay-as-you-go access platform for LLMs, and their many other ML APIs. EC2 instances tuned for GPU work – if you don't yet want to dig deeply into a managed solution, or you plan to use more than Amazon's own APIs and platforms, and you have a simple workflow to try out on an accelerated platform, try Amazon's Deep Learning AMIs on EC2.
🌐
CodiLime
codilime.com › blog › data › data science › hosting llms on aws
Hosting LLMs on AWS
We explored the differences between ... LLMs on your own. The tutorial covered setting up AWS EC2 instances, particularly G5 instances with NVIDIA GPUs, ideal for demanding machine learning tasks....
🌐
DEV Community
dev.to › aws-builders › deploy-your-llm-on-aws-ec2-2ig3
Deploy Your LLM on AWS EC2 - DEV Community
September 14, 2024 - AWS GPU instance families such as G4, G5, P3, and P4 provide some of the highest performance in Amazon EC2 for deep learning and high-performance computing (HPC).
🌐
Mamezou
developer.mamezou-tech.com › en › blogs › 2025 › 08 › 21 › ec2-gpu-demo
Build Your Own LLM Environment on AWS! A Hands-On Guide to Running AI with EC2 GPU Instances and Ollama | Mamezou Developer Portal
With Ollama, you can run LLMs on the CPU alone, but having a GPU with sufficient VRAM (video memory) offers speed-up benefits. This time, since I plan to run gpt-oss, which OpenAI recently released and is now available via Ollama, I'll choose ...
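As a quick illustration of the setup this article describes, here is a minimal sketch of querying an Ollama server from Python once the EC2 GPU instance is running. The endpoint is Ollama's default; the model tag is an assumption and should match whatever you actually pulled with "ollama pull".

import json
import urllib.request

# Ollama's default local endpoint; adjust if you expose it differently on the instance.
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "gpt-oss:20b",  # assumed tag; use whatever model you pulled
    "prompt": "Summarize why GPU VRAM matters for local LLM inference.",
    "stream": False,         # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
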
🌐
Medium
medium.com › @thomasjay200 › run-your-own-llm-ollama-on-aws-with-nvidia-gpu-dab7dc008bfe
Run your own LLM — Ollama on AWS with Nvidia GPU | by Tom Jay | Medium
February 29, 2024 - You will need an AWS account, and you will also need access to GPU-based instances; this is not granted by default, so there is a "Service Request" page where you will need to request the instance type. For what we want, we will request access to ...
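That quota-request step can also be done programmatically. Below is a hedged sketch using boto3's Service Quotas API; the quota name filter and desired vCPU value are assumptions you should verify against your own account.

import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")

# Find the EC2 vCPU quota that governs the GPU family you want (G/VT instances here).
target = None
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "G and VT instances" in quota["QuotaName"]:  # assumed quota name; verify in your account
            target = quota

if target is not None:
    print(f'Current limit for "{target["QuotaName"]}": {target["Value"]} vCPUs')
    # Request enough vCPUs for, e.g., one g5.xlarge (4 vCPUs).
    resp = sq.request_service_quota_increase(
        ServiceCode="ec2",
        QuotaCode=target["QuotaCode"],
        DesiredValue=4,
    )
    print("Request status:", resp["RequestedQuota"]["Status"])
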
🌐
Stack Overflow
stackoverflow.com › questions › 79381462 › is-it-possible-to-train-llms-on-ec2-gpus-using-lambda-for-on-demand-instance-act
amazon ec2 - Is it possible to train LLMs on EC2 GPUs using Lambda for on-demand instance activation? - Stack Overflow
Use a GPU-capable EC2 instance to host the LLM model, but by default, leave the instance switched off. When training is necessary, programmatically launch the instance using an AWS Lambda function.
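A minimal sketch of that pattern, assuming boto3 inside the Lambda runtime and a placeholder instance ID (in practice you would pass it via an environment variable or the invocation event):

import boto3

ec2 = boto3.client("ec2")

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder: the stopped GPU training instance

def lambda_handler(event, context):
    # Start the instance; the call is a no-op if it is already running.
    ec2.start_instances(InstanceIds=[INSTANCE_ID])

    # Optionally block until the instance reports "running" before triggering training.
    ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

    return {"status": "started", "instance": INSTANCE_ID}
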
🌐
Medium
medium.com › @chinmayd49 › self-host-llm-with-ec2-vllm-langchain-fastapi-llm-cache-and-huggingface-model-7a2efa2dcdab
Self host LLM with EC2, vLLM, Langchain, FastAPI, LLM cache and huggingFace model | by Chinmay Deshpande | Medium
November 22, 2023 - Let's start the technical discussion ... serve LLMs faster and more efficiently to customers. I would recommend going through AWS G5 and P4 instances, which provide good performance for ML and LLM workloads specifically....
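A rough sketch of the stack named in the title (vLLM behind FastAPI on a GPU instance); the model name is an assumption, and the LangChain and LLM-cache pieces are omitted for brevity.

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# Load the model once at startup; vLLM keeps the weights resident on the GPU.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # assumed model; any HF model that fits in VRAM

class Prompt(BaseModel):
    text: str
    max_tokens: int = 256

@app.post("/generate")
def generate(prompt: Prompt):
    params = SamplingParams(temperature=0.7, max_tokens=prompt.max_tokens)
    outputs = llm.generate([prompt.text], params)
    return {"completion": outputs[0].outputs[0].text}
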
🌐
AWS
aws.amazon.com › blogs › machine-learning › serving-llms-using-vllm-and-amazon-ec2-instances-with-aws-ai-chips
Serving LLMs using vLLM and Amazon EC2 instances with AWS AI chips | Artificial Intelligence
November 26, 2024 - Using vLLM on AWS Trainium and Inferentia makes it possible to host LLMs for high performance inference and scalability. In this post, we will walk you through how you can quickly deploy Meta’s latest Llama models, using vLLM on an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance.
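Once vLLM is serving the model on the Inf2 instance, any OpenAI-compatible client can call it. A hedged client-side sketch follows; the host, port, and model name are assumptions and must match what the server actually loaded.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM does not require a real key by default
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model name
    messages=[{"role": "user", "content": "Give me one sentence on Inferentia2."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
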
🌐
AWS
aws.amazon.com › blogs › publicsector › deploy-llms-in-aws-govcloud-us-regions-using-hugging-face-inference-containers
Deploy LLMs in AWS GovCloud (US) Regions using Hugging Face Inference Containers | AWS Public Sector Blog
June 19, 2024 - Another way this can be achieved is through Hugging Face Inference containers. We’ll utilize Amazon EC2 GPU instances and the Hugging Face Inference Container to host and serve custom LLMs in the AWS GovCloud (US) Regions.
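For illustration, a hedged sketch of calling a Hugging Face Text Generation Inference (TGI) container once it is running on the GPU instance; host, port, and generation parameters are assumptions based on TGI's documented /generate endpoint.

import json
import urllib.request

TGI_URL = "http://localhost:8080/generate"  # assumed port mapping for the TGI container

payload = {
    "inputs": "Summarize the benefits of hosting LLMs in AWS GovCloud.",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

req = urllib.request.Request(
    TGI_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["generated_text"])
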
🌐
Brandonharris
brandonharris.io › Local-LLMs-Getting-Started-with-LLaMa-and-AWS
Cloud LLaMa - Local LLM's and Getting Started with LLaMa on AWS EC2 – brandonharris.io
Choose the p3.2xlarge instance type to start. There are a variety of GPU instances that AWS and others offer, but typically the main constraint will be GPU memory. The p3.2xlarge offers a GPU with 16GB of GPU memory, which is on the low end but sufficient for our needs here.
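A hedged sketch of launching that p3.2xlarge with boto3; the AMI ID and key pair are placeholders, and you would normally look up the current Deep Learning AMI ID for your region first.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: a Deep Learning AMI in your region
    InstanceType="p3.2xlarge",        # 1x NVIDIA V100 with 16 GB of GPU memory
    KeyName="my-keypair",             # placeholder SSH key pair
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 200, "VolumeType": "gp3"},  # room for model weights
    }],
)
print("Launched:", resp["Instances"][0]["InstanceId"])
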
🌐
AWS
aws.amazon.com › blogs › machine-learning › optimize-price-performance-of-llm-inference-on-nvidia-gpus-using-the-amazon-sagemaker-integration-with-nvidia-nim-microservices
Optimize price-performance of LLM inference on NVIDIA GPUs using the Amazon SageMaker integration with NVIDIA NIM Microservices | Artificial Intelligence
March 18, 2024 - NIM, part of the NVIDIA AI Enterprise software platform listed on AWS marketplace, is a set of inference microservices that bring the power of state-of-the-art LLMs to your applications, providing natural language processing (NLP) and understanding capabilities, whether you’re developing chatbots, summarizing documents, or implementing other NLP-powered applications. You can use pre-built NVIDIA containers to host popular LLMs that are optimized for specific NVIDIA GPUs for quick deployment or use NIM tools to create your own containers.
🌐
Reddit
reddit.com › r/aws › aws sagemaker or aws ec2 for llm model training
r/aws on Reddit: AWS Sagemaker or AWS EC2 for llm model training
April 21, 2024 -

Hi! I have a question for ML practitioners who are familiar with AWS products.

In my workplace, we are assessing two options: using Amazon SageMaker or having an EC2 instance with a GPU.

We mainly need the computing power (GPU) and nothing more. We are about to train an open-source LLM with our own dataset. We haven't considered cloud GPU services such as RunPod and Vast; the current PoC focus is on AWS products, as our ecosystem is mostly in AWS itself.

Which is better suited for our case cost-wise and from an ease-of-use point of view?

Thank you in advance for your help.

🌐
AWS re:Post
repost.aws › questions › QU5GO1pICeTrWIvHAQK8W_Zw › what-are-the-cost-effective-options-for-on-demand-api-of-fine-tuned-llm-with-gpu
What are the cost effective options for on-demand API of fine tuned llm with gpu | AWS re:Post
September 23, 2024 - EC2 Spot Instances with GPUs are a strong option for cost efficiency, especially if you can automate starting and stopping the instance. Hugging Face offers a more convenient API-based option with usage-based billing.
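A small sketch of the automation that answer alludes to: stopping the GPU instance when it is idle so you only pay while serving or fine-tuning. The instance ID is a placeholder; this could run from a scheduled Lambda or a cron job.

import boto3

ec2 = boto3.client("ec2")

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder: the fine-tuned-LLM serving instance

def stop_if_running(instance_id: str) -> None:
    # Look up the instance's current state before acting on it.
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    state = reservations[0]["Instances"][0]["State"]["Name"]
    if state == "running":
        ec2.stop_instances(InstanceIds=[instance_id])
        print(f"Stopping {instance_id} to avoid idle GPU charges")
    else:
        print(f"{instance_id} is already {state}")

stop_if_running(INSTANCE_ID)
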
🌐
AWS
aws.amazon.com › blogs › hpc › scaling-your-llm-inference-workloads-multi-node-deployment-with-tensorrt-llm-and-triton-on-amazon-eks
Scaling your LLM inference workloads: multi-node deployment with TensorRT-LLM and Triton on Amazon EKS | AWS HPC Blog
December 2, 2024 - Feel free to edit the TensorRT-LLM-specific parameters, like batch size, depending on your workload.
# Replace <PATH_TO_AWSOME_INFERENCE_GITHUB> with path to where you cloned the GitHub repo
bash <PATH_TO_AWSOME_INFERENCE_GITHUB>/2.projects/multinode-triton-trtllm-inference/update_triton_configs.sh
You can find the example_values.yaml file that we use for deploying our application here. The relevant sections of this deployment manifest are: …
gpu: NVIDIA-H100-80GB-HBM3
gpuPerNode: 8
persistentVolumeClaim: efs-claim
tensorrtLLM:
  parallelism:
    tensor: 8
    pipeline: 2
triton:
  image:
    name: ${ACCOU
🌐
Medium
medium.com › @mr.sean.ryan › deploying-a-high-performance-llm-with-user-interface-on-aws-ec2-with-gpu-part-1-of-a-series-cc99a98e3185
Deploying a high performance LLM with user interface on AWS EC2 with GPU [part 1 of a series] | by Sean Ryan | Medium
April 14, 2024 - This series presents step-by-step directions to host an LLM (Large Language Model) with a basic user interface, on Amazon’s AWS cloud. There are various articles and documentation already available…