Tags: Google Cloud Run · GCP · Cost Optimisation · Cloud Architecture · DevOps · Australian SMB Cloud

Cloud Run min-instances=1: The Silent $150/Month Budget Trap

By Ash Ganda | 15 April 2026 | 8 min read


When you build something on Google Cloud Run, the pitch is compelling: serverless, scalable, and you only pay for what you use. That last part is true — with one asterisk that cost us three billing cycles before we caught it.

We were building a travel content tool backed by a RAG (Retrieval-Augmented Generation) pipeline. Cloud Run was a natural home for it: containerised, easy to deploy, and auto-scaling. We shipped it, it worked, and we moved on to the next feature. Then the GCP bill arrived.

$150/month. For a service handling near-zero traffic.


The Project: A RAG Travel Blog Tool

The architecture was straightforward:

User Query
     │
     ▼
Cloud Run Service (FastAPI)
     ├─► Vector Database (Pinecone)
     └─► LLM API (Gemini)
     │
     ▼
Generated Content

During the ingestion phase — loading and vectorising thousands of travel articles — we sized the Cloud Run service generously:

  • Memory: 4 GB
  • CPU: 2 vCPUs
  • Min instances: 1 (to avoid cold starts during demos)

The ingestion finished. We moved to the serving phase. We forgot to resize.
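In gcloud terms, the ingestion-era deployment looked something like this. The service name, project, image, and region below are illustrative placeholders, not our actual values:

```shell
# Ingestion-era deployment: generously sized, kept warm for demos.
# Names and region are placeholders for illustration only.
gcloud run deploy rag-travel-tool \
  --image gcr.io/my-project/rag-travel-tool:latest \
  --region australia-southeast1 \
  --memory 4Gi \
  --cpu 2 \
  --min-instances 1
```

That `--min-instances 1` flag is the line that quietly carried over into production.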


The Root Cause: What min-instances=1 Actually Means

min-instances=1 tells Cloud Run: “Keep at least one container running at all times.” That sounds reasonable. The catch is what “running at all times” actually bills you for.

Cloud Run charges for:

  • CPU allocation while the container is active
  • Memory allocation while the container is active
  • Instance time, whether requests are arriving or not

With min-instances=1, your container never idles to zero. The billing clock never stops.

Visual breakdown of the cost:

Without min-instances (scale-to-zero):
─────────────────────────────────────
Requests: ░░░▓▓▓░░░░░░░░▓░░░░░░░░░░
Billing:  ░░░▓▓▓░░░░░░░░▓░░░░░░░░░░
           (pay only when serving)

With min-instances=1 + 4 GB memory:
─────────────────────────────────────
Requests: ░░░▓▓▓░░░░░░░░▓░░░░░░░░░░
Billing:  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
           (pay 24/7 regardless)

At 4 GB memory and 2 vCPUs, that continuous billing comes to roughly $150/month — before handling a single request.
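The arithmetic behind that estimate can be sketched in a few lines. The per-second rates below are illustrative placeholders, not authoritative figures; check the current Cloud Run pricing page for your region and tier before relying on them:

```python
# Sketch: estimate the monthly cost of one always-allocated Cloud Run instance.
# The rates below are ASSUMED placeholders, not official GCP pricing.

SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # ~2.59M seconds in a 30-day month

CPU_RATE_PER_VCPU_SECOND = 0.000024    # assumed USD rate per vCPU-second
MEM_RATE_PER_GB_SECOND = 0.0000025     # assumed USD rate per GiB-second

def idle_monthly_cost(vcpus: float, memory_gb: float) -> float:
    """Cost of keeping one instance allocated 24/7, before any requests."""
    cpu = vcpus * CPU_RATE_PER_VCPU_SECOND * SECONDS_PER_MONTH
    mem = memory_gb * MEM_RATE_PER_GB_SECOND * SECONDS_PER_MONTH
    return cpu + mem

print(f"Ingestion sizing (2 vCPU, 4 GB):  ~${idle_monthly_cost(2, 4):.0f}/month")
print(f"Serving sizing if kept warm:      ~${idle_monthly_cost(1, 0.5):.0f}/month")
```

The point of the sketch is the shape of the formula, not the exact rates: idle cost scales linearly with both vCPUs and memory, so a container sized 8x too large costs roughly 8x too much every hour it sits warm.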


Why We Missed It

A few things made this easy to overlook:

1. The setting made sense at the time. During development and demos, cold starts on Cloud Run can take 5–15 seconds. min-instances=1 eliminates that. It was a sensible choice for the ingestion phase.

2. We never revisited it. The setting lived in a deployment script. Once the ingestion job finished, we deployed the serving version — same script, same settings, different workload.

3. GCP billing is not real-time. By the time the invoice landed and we traced the line item, two full billing cycles had passed with this configuration and a third was underway.


The Fix: Scale-to-Zero + Right-Sizing

The serving workload was completely different from the ingestion workload. The serving layer only needs to respond to API queries — lightweight, fast, low memory.

Before (ingestion sizing, never changed):

Deployment config — ingestion era:
  min-instances: 1
  memory:        4 GB
  cpu:           2 vCPUs
  Monthly cost:  ~$150 (idle)

After (right-sized for serving):

Deployment config — serving era:
  min-instances: 0  ← scale to zero
  memory:        512 MB
  cpu:           1 vCPU
  concurrency:   80
  Monthly cost:  ~$5 (low traffic)

The serving API doesn’t load a 3 GB model. It sends a query to Pinecone, calls the Gemini API, and returns a response. 512 MB is generous for that workload.
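The right-sized serving deployment, again with placeholder names and region, is the same command with different numbers:

```shell
# Serving-era deployment: scale to zero, right-sized for query/response.
# Names and region are placeholders for illustration only.
gcloud run deploy rag-travel-tool \
  --image gcr.io/my-project/rag-travel-tool:latest \
  --region australia-southeast1 \
  --memory 512Mi \
  --cpu 1 \
  --min-instances 0 \
  --concurrency 80
```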


The Lesson: Separate Ingestion and Serving

This is the deeper pattern we extracted from this incident:

Ingestion job (run once or weekly):
┌─────────────────────────────────────┐
│  Resource-hungry, short-lived       │
│  Large memory for batch processing  │
│  Run as Cloud Run Job, not Service  │
│  Billed only while running          │
└─────────────────────────────────────┘

Serving API (always available):
┌─────────────────────────────────────┐
│  Lightweight, latency-sensitive     │
│  Small memory for query/response    │
│  Cloud Run Service, scale-to-zero   │
│  Billed only when serving requests  │
└─────────────────────────────────────┘

They are different workloads and should have different deployment configurations. If you must keep a warm instance for latency reasons, use min-instances=1 with appropriately sized memory — not the size you needed for ingestion.
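As a sketch of the split (job and image names are hypothetical), the ingestion side becomes a Cloud Run Job, created once and executed on demand or on a schedule:

```shell
# Ingestion as a Cloud Run Job: sized generously, but billed only while running.
gcloud run jobs create ingest-articles \
  --image gcr.io/my-project/rag-ingest:latest \
  --region australia-southeast1 \
  --memory 4Gi \
  --cpu 2 \
  --task-timeout 3600

# Trigger a run manually (or wire this up to Cloud Scheduler):
gcloud run jobs execute ingest-articles --region australia-southeast1
```

The serving API stays a Cloud Run Service with `min-instances: 0`, so neither workload pays for the other's sizing.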


When min-instances=1 Is Justified

Not every use case should scale to zero. Here’s a simple decision framework:

Does this service need sub-2-second response?

        ├─ YES ─► Is traffic predictable / always-on?
        │              │
        │              ├─ YES ─► min-instances=1 is justified
        │              │         BUT right-size the container
        │              │
        │              └─ NO  ─► Use min-instances=1 for business hours
        │                        Set to 0 overnight via Cloud Scheduler

        └─ NO  ─► Use min-instances=0 (scale-to-zero)
                  Accept 5–15 second cold start
                  Save 90–97% on idle compute

For most Australian SMB workloads — internal tools, low-traffic APIs, scheduled jobs — scale-to-zero is the right default. Cold starts are a latency problem for consumer apps at scale. For a B2B tool or internal service, they are usually acceptable.


Catching This Before It Hits Your Bill

Three things we now do on every Cloud Run deployment:

Pre-deploy checklist item:

□ Is min-instances > 0?
    If YES → Is this intentional?
           → Is memory right-sized for serving (not ingestion)?
           → What is the monthly idle cost at this configuration?

GCP Budget Alert: Set a budget alert at 120% of your expected monthly Cloud Run spend. When the alert fires, something changed — find it before the next invoice.
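One way to set that alert from the CLI is with a billing budget; the billing account ID and amount below are placeholders, and the threshold syntax may vary by gcloud version:

```shell
# Budget alert firing at 120% of expected monthly spend.
# Billing account ID and budget amount are placeholders.
gcloud billing budgets create \
  --billing-account=000000-AAAAAA-BBBBBB \
  --display-name="cloud-run-expected-spend" \
  --budget-amount=50AUD \
  --threshold-rule=percent=1.2
```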

Workload tagging: Tag your Cloud Run services with their purpose (ingestion, serving, scheduled-job). When reviewing bills, the tag makes it obvious if an ingestion-sized container is running 24/7 in production.


What We Saved

Switching from the original configuration to scale-to-zero with right-sized resources:

Before:  $150/month (idle, 24/7, 4 GB)
After:    $5/month  (low traffic, scale-to-zero, 512 MB)

Saving:  $145/month — $1,740/year

For a bootstrapped product at zero revenue, that saving matters.


The Broader Pattern for Australian Cloud Users

Cloud cost overruns on small GCP deployments follow predictable patterns. The three we see most often:

  1. min-instances set for development, never reset for production (this post)
  2. Always-on managed Redis with no traffic (Cloud Memorystore at $35+/month — covered in our next post)
  3. Cloud SQL left running after a project pivot (billed by uptime, not queries)

All three share the same root cause: a setting that made sense in one context, carried silently into another context where it no longer applies.

The fix for all three is the same: workload-specific deployment configs, reviewed at each phase change.

If you are running Cloud Run services in Australia and want a quick review of your current deployment configuration, our Cloud Geeks team offers a free 30-minute GCP cost audit for Sydney-based businesses. We have done this enough times to know exactly where the money leaks.


Related reading: Stop Paying for Idle Redis: How We Cut Our GCP Bill by $35/Month — the second post in this GCP cost series.

If your team needs on-the-go access to cloud-managed systems, Awesome Apps builds practical mobile solutions for Australian businesses.

Cloud Geeks operates under the Ganda Tech Services umbrella, delivering end-to-end technology solutions for Australian businesses.

Ready to transform your business?

Let's discuss how AI and cloud solutions can drive your digital transformation. Our team specialises in helping Australian SMBs implement cost-effective technology solutions.

Bella Vista, Sydney