Hello,
I have a FastAPI application running on Cloud Run with some endpoints doing fairly complex computations. On Cloud Run those endpoints take about 3x longer than when I run them locally (on my M1 MacBook). My guess is that the CPU provided by Cloud Run is just slower? Does anyone know which CPUs are attached by default, and whether there's a solution for that?
Cheers
Cloud Run, ideal vCPU and memory amount per instance? - Stack Overflow
Does the cpu flag in gcloud run deploy control the number of CPUs per VM or the total amount of CPU? - Stack Overflow
Google Cloud Run: interpreting CPU utilization metrics for concurrency - Stack Overflow
Cloud run with CPU always allocated is cheaper than only allocated during request processing. How? - Stack Overflow
There isn't a good answer to that question. You have to know the limits:
- The max number of requests that a single instance can handle concurrently with 4 vCPUs and/or 32 GB of memory (up to 1,000 concurrent requests)
- The max number of instances on Cloud Run (1,000)
Then it's a matter of trade-offs, and it's highly dependent on your use case.
- Bigger instances reduce the number of cold starts (and thus the latency spikes when your service scales up). But if you only have 1 request at a time, you will be paying for a BIG instance to do a small amount of processing.
- Smaller instances let you optimize cost and add only a small slice of resources to your cluster, but you will have to spawn new instances more often and you will have more cold starts to endure.
Optimize for what you prefer and find the right balance. No magic formula!!
You can simulate a load of requests against your current settings using k6.io, check the memory and CPU percentage of your container, and adjust them up or down to see if you can get more RPS out of a single container.
Once you are satisfied with a single container instance's throughput, let's say 100 RPS per instance, you can then use gcloud to specify the --min-instances and --max-instances flags, depending of course on the --concurrency flag, which in this example would be set to 100.
Also note that --concurrency defaults to 80 and can go up to 1000.
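To make the sizing arithmetic concrete, here is a minimal sketch. The 100 RPS per instance and the 2,500 RPS target are made-up numbers standing in for whatever your own k6 load test measures:

```python
import math

def instances_needed(target_rps: float, rps_per_instance: float) -> int:
    """Estimate how many Cloud Run instances are needed to serve a target
    request rate, given the per-instance throughput measured under load."""
    return math.ceil(target_rps / rps_per_instance)

# Hypothetical numbers: one container sustains ~100 RPS in the k6 test,
# and we want to cover a 2,500 RPS peak.
print(instances_needed(2500, 100))  # -> 25
```

That result would then inform your --max-instances setting (with some headroom on top, since instances that are still cold-starting can't absorb traffic yet).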
More info about this can be read in the links below:
https://cloud.google.com/run/docs/about-concurrency
https://cloud.google.com/sdk/gcloud/reference/run/deploy
I would also recommend investigating whether you need to pass the --cpu-throttling or the --no-cpu-throttling flag, depending on how you want to adjust for cold starts.
The legend is statistics. The 50% line is the median, and the 95% and 99% lines are percentiles. Meaning 50% of the measurements taken are below 0.67% CPU, 95% of the measurements taken are below 17.8%, and 99% of the measurements taken are below 17.96%. Your CPU is not being used all that much.
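To illustrate what those percentile lines mean, here is a small sketch using Python's standard library. The CPU samples are invented numbers, not your actual monitoring data:

```python
import statistics

# Made-up per-minute CPU utilization samples (percent), standing in for
# the data points behind a Cloud Run utilization chart.
cpu_samples = [0.4, 0.5, 0.6, 0.7, 0.8, 1.0, 2.0, 5.0, 12.0, 18.0]

# quantiles(n=100) returns the 1st..99th percentiles of the data;
# index 49 is the median (p50), index 94 is p95, index 98 is p99.
q = statistics.quantiles(cpu_samples, n=100, method="inclusive")
p50, p95, p99 = q[49], q[94], q[98]
print(f"p50={p50:.2f}%  p95={p95:.2f}%  p99={p99:.2f}%")
# prints: p50=0.90%  p95=15.30%  p99=17.46%
```

Half the samples sit below the p50 value even though the p95/p99 values are much higher, which is exactly the shape the chart legend is describing.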
To answer your questions:
- Yes, the graph shows that your requests are using about 20% of your CPU. The legend below means that 95% of the time, your CPU usage is at or below roughly 20%.
- Yes, you can increase your concurrency up to a maximum of 1000. You can check the documentation on concurrency values and setting maximum concurrency (services). The default value for concurrency is 80.
- I haven't tried this, as it would depend on the load of the request. There are instances where there are single requests with lighter or heavier loads.
- Setting your minimum number of instances to 1 will reduce the number of cold starts, as an idle instance will already be running and ready to serve incoming requests. The downside is that this incurs charges while the service sits idle. Google recommends purchasing a committed use discount, as these charges are very predictable. Full documentation regarding minimum instances can be found through this link.
Cloud Run is serverless by default: you pay as you use. When a request comes in, an instance is created and started (that startup is the cold start) and your request is processed. The timer starts. When your web server sends the answer, the timer stops.
You pay for the memory and the CPU used during request processing, rounded up to the next 100ms. The instance continues to live for about 15 minutes (by default; this behavior can change at any moment) so it's ready to process another request without needing to start a new one (and wait for the cold start again).
As you can see, the instance continues to live EVEN IF YOU NO LONGER PAY FOR IT, because you only pay while a request is being processed.
When you set the CPU to always allocated, you pay for the full time the instance runs, whether it is handling a request or not. Google no longer has to absorb the cost of instances that are running but unused, waiting for a request, as in the pay-per-use model. You pay for that idle time instead, and in exchange you pay a lower rate.
It's like a Compute Engine VM running full time. And as with Compute Engine, you can get something similar to a sustained use discount. That's why it's cheaper.
In general it depends on how you use Cloud Run. Google gives some hints here: https://cloud.google.com/run/docs/configuring/cpu-allocation
To summarize the biggest pricing differences:
CPU only allocated during request processing, you pay for:
- every request, on a per-request basis
- CPU and memory allocation time during the request
CPU always allocated, you pay for:
- CPU and memory allocation at a cheaper rate, for the time the instance is active
Compare the pricing here: https://cloud.google.com/run/pricing
So if you have a lot of requests which do not use a lot of resources, and there is not much variance in the load, then "always allocated" might be a lot cheaper.
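The trade-off above can be sketched as a toy cost comparison. The rates below are placeholders, NOT Google's actual prices (those are on the pricing page linked above); the point is only the shape of the two billing models:

```python
import math

# Placeholder per-vCPU-second rates -- invented for illustration. The
# "always allocated" rate is lower per second but applies to the whole
# instance lifetime; the request-based rate is higher but applies only
# while requests are in flight, rounded up to the next 100 ms.
RATE_REQUEST_BASED = 0.000024   # $/vCPU-second, billed during requests
RATE_ALWAYS_ON     = 0.000018   # $/vCPU-second, billed while instance runs

def request_based_cost(request_durations_s, vcpu=1.0):
    # Each request's billed time is rounded up to the next 0.1 s.
    billed = sum(math.ceil(d * 10) / 10 for d in request_durations_s)
    return billed * vcpu * RATE_REQUEST_BASED

def always_on_cost(instance_uptime_s, vcpu=1.0):
    return instance_uptime_s * vcpu * RATE_ALWAYS_ON

# Steady traffic keeping one instance busy for an hour: 12,000 requests
# of 0.25 s each, billed as 0.3 s apiece -> 3,600 s of billed time.
busy = [0.25] * 12000
print(request_based_cost(busy) > always_on_cost(3600))  # -> True
```

With steady, resource-light traffic the rounded-up per-request billing exceeds the cheaper always-on rate, matching the "always allocated might be a lot cheaper" conclusion; with rare, bursty traffic the comparison flips.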