TL;DR: Google Cloud Run Jobs fail silently without any logs and also restart even with `maxRetries: 0`
Today my boss pinged me that something weird was happening with our script that runs every 15 minutes to collect data from different sources. I'm the one who developed it and supports it. I was very curious why it failed, since it's really simple and the whole body of the script is wrapped in a try {} catch {} block. Every error the script produces is forwarded to Rollbar, so I should be the first to receive the error, before my boss.
When I opened Rollbar I didn't find any errors; however, in the GCP console I found several failed runs. See the image below.
When I tried to see the logs, they were empty even in Logs Explorer: only the default message `Execution JOB_NAME has failed to complete, 0/1 tasks were a success.` But based on the records in the database, the script was running, and it ran twice (so it was relaunched, ignoring the fact that I set `maxRetries: 0` for the task).
This all sounds very bad to me, because I want to be able to trust GCP with all my production services. However, I found that I'm not the only one with this kind of issue -> https://serverfault.com/questions/1113755/gcp-cloud-run-job-fails-without-a-reason
I'd be very happy if someone could point me in the right direction on this issue. I don't want to migrate to another cloud provider because of it.
[Update]
Here is what I see in Logs Explorer. My script emits tracing logs, but there is nothing here at all, just the default error message -> `Execution JOB_NAME has failed to complete, 0/1 tasks were a success.`
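For anyone who wants to reproduce this: below is the kind of Logs Explorer filter I'd expect to surface any application output from a job execution (the job name and region are placeholders; swap in your own). With this filter I still see only the default failure message.

```
resource.type = "cloud_run_job"
resource.labels.job_name = "JOB_NAME"
resource.labels.location = "us-central1"
```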
[Update 2]
Here are the metrics for the Cloud Run Job. I highlighted with a red box the time when the error happened. As you can see, memory is OK, but there is a peak in received bytes.
[Update 3]
Today we had a call with one of the Googlers. We found that it seems to be a general issue for all Cloud Run Jobs in the us-central1 region. It started on Dec 6, 2023 (1pm - 5pm PST). If you see the same issue on your Google Cloud Run Job, post the relevant info to this thread. We want to figure out what happened.
Right now I am evaluating the use of Cloud Tasks or the newly introduced Cloud Run Jobs. I am having a hard time understanding which one to choose, and some questions arose during my research. If anybody could chime in, I would be very thankful!
Since Cloud Tasks can execute Cloud Run containers too, what are the benefits of using Cloud Run Jobs?
One thing I noticed: a job task can only be retried 10 times, not indefinitely like in Cloud Tasks. This seems like an odd limitation; why can't a task be retried indefinitely?
Also, it seems it is not possible to create or execute a job through the Google Cloud Go library. Calling the API endpoint and attaching the access token myself seems rather tedious. Is it bad practice to create a Cloud Run Job from inside an HTTP handler?
On the other side, there is no way to batch-enqueue Cloud Tasks, and it is suggested to create a separate injector queue. This seems elegantly solved with Cloud Run Jobs, where I can spawn up to 1000 tasks in one job.
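For what it's worth, the fan-out across up to 1000 tasks works through two environment variables that Cloud Run Jobs injects into each task: `CLOUD_RUN_TASK_INDEX` and `CLOUD_RUN_TASK_COUNT`. A minimal Python sketch of splitting a work list across tasks (the `shard` helper and the work items are made up for illustration):

```python
import os

def shard(items):
    """Return the slice of `items` this task is responsible for.

    Cloud Run Jobs sets CLOUD_RUN_TASK_INDEX (0-based) and
    CLOUD_RUN_TASK_COUNT on every task; the defaults below let the
    same script run unchanged outside of Cloud Run.
    """
    index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
    count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", "1"))
    # Take every count-th item, starting at this task's index.
    return items[index::count]

if __name__ == "__main__":
    work = [f"item-{i}" for i in range(10)]
    for item in shard(work):
        print("processing", item)
```

So instead of enqueuing 1000 individual Cloud Tasks, one job execution fans out and each task picks its own slice.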
I have the feeling the two services overlap in many places, which makes it confusing to choose. A comparison page similar to the PubSub comparison would be very helpful.
I'm running Cloud Run Jobs for geospatial processing tasks and seeing 15-25 second cold starts between when I execute the job and when the job is running. I've instrumented everything to figure out where the time goes, and the math isn't adding up:
What I've measured:
- Container startup latency: 9.9ms (99th percentile from GCP metrics - essentially instant)
- Python imports: 1.4s (firestore 0.6s, geopandas 0.5s, osmnx 0.1s, etc.)
- Image size: 400MB compressed (already optimized from 600MB with a multi-stage build)
- Execution creation → container start: 2-10 seconds (from execution metadata, varies per execution)
So ~1.4 seconds is Python after the container starts. But my actual logs show:
PENDING (5s) → PENDING (10s) → PENDING (15s) → PENDING (20s) → PENDING (25s) → RUNNING (30s)
So there's 20+ seconds unaccounted for somewhere between job submission and container start.
Config:
- python:3.12-slim base + 50 packages (geopandas, osmnx, pandas, numpy, google-cloud-*)
- Multi-stage Dockerfile: builder stage installs deps, runtime stage copies the venv only
- Aggressive cleanup: removed test dirs and docs, stripped .so files, pre-compiled bytecode
- Gen2 execution environment
- 1 vCPU, 2GB RAM (I have other, higher-resource services that exhibit the same behavior)
What I've tried:
- Reduced image 600MB → 400MB (multi-stage build, cleanup)
- Pre-compiled Python bytecode
- Verified region matching (us-west1 for both)
- Stripped binaries with `strip --strip-unneeded`
- Removed all test/doc files
Key question: The execution metadata shows a 20-second gap from job creation to container start. Is this all image pull time? If so, why is 400MB taking 20-25 seconds to pull within the same GCP region?
Or is there other Cloud Run Jobs overhead I'm not accounting for (worker allocation, image verification, etc)?
Should I accept this as normal for Cloud Run Jobs and migrate to Cloud Run Service + job queue instead?
This is now possible using the REST API for Cloud Run Jobs.
- The REST API allows passing in a request body
- The request body accepts an override definition
- To make the call, you'll need to get the token of the current executing identity
A full working repo of how to do this is here: https://github.com/CharlieDigital/gcr-invoke-job-overrides
A more in depth writeup here: https://medium.com/@chrlschn/programmatically-invoke-cloud-run-jobs-with-runtime-overrides-4de96cbd158c
The sample is in C#/.NET, but it's easy enough to translate to any other language.
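For anyone who just wants the shape of the call without cloning the repo: the v2 `jobs.run` endpoint accepts an `overrides` object with per-container `args` and `env`. A stdlib-only Python sketch that builds (but doesn't send) the request; the project/region/job values are placeholders, and obtaining the bearer token (e.g. via the metadata server or `google-auth`) is left out:

```python
import json
import urllib.request

def build_run_request(project, region, job, token, args=None, env=None):
    """Build a jobs.run POST with per-execution container overrides."""
    url = (f"https://run.googleapis.com/v2/projects/{project}"
           f"/locations/{region}/jobs/{job}:run")
    body = {
        "overrides": {
            "containerOverrides": [{
                # Per-execution entry point args and env vars.
                "args": args or [],
                "env": [{"name": k, "value": v}
                        for k, v in (env or {}).items()],
            }]
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it is then just:
#   urllib.request.urlopen(build_run_request(...))
```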
Here's the log output of the environment variables and entry point args:

With Cloud Run Jobs, you can't provide per-execution parameters (entry point, args, or env vars). I discussed that point with the Cloud Run Jobs PM to get something implemented, and, obviously, the other alpha testers had the same issue, so something will happen :).
My current solution is to wrap the batch job in a web server, package that web server in a container, and deploy it on Cloud Run. Then provide the correct body with your parameters, parse it, and invoke your batch with those inputs.
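A minimal stdlib-only sketch of that wrapper, where `run_batch` is a stand-in for the real batch job (Cloud Run passes the listen port in `$PORT`):

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_batch(params):
    """Stand-in for the real batch job; receives the parsed parameters."""
    return {"processed": params.get("source", "all")}

class BatchHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the JSON body and hand its parameters to the batch job.
        length = int(self.headers.get("Content-Length", 0))
        params = json.loads(self.rfile.read(length) or b"{}")
        payload = json.dumps(run_batch(params)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("", port), BatchHandler).serve_forever()
```

Each "run" is then just a POST with the parameters in the body, and the usual Cloud Run service auth/IAM applies.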