Serverless: Hidden Cost Drivers That Increase Your Cloud Bill

Serverless can sometimes feel like trying to buy 1.5 kilograms of sugar when it is only sold in one-kilogram packages.

A study conducted at the University of British Columbia leans on this analogy, set against the backdrop of a trend toward resource overprovisioning across the major market offerings, as its price comparison shows:

  • Lambda function (1 vCPU, 1769 MB RAM, 512 MB disk): 2.3034 × 10⁻⁵ $/second
  • EC2 c6g.medium instance (1 vCPU, 2 GB RAM, 1 GB disk): 9.4753 × 10⁻⁶ $/second, i.e. 41.1% of the Lambda function’s price
  • Fargate container (same resources as the VM): 1.1003 × 10⁻⁵ $/second, i.e. 47.8% of the Lambda function’s price

This calculation includes neither Lambda invocation fees (2 × 10⁻⁷ $ per request) nor the “economy” burst options typically available for scalable capacity.
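The gap is easy to reproduce. A quick check of the ratios from the per-second prices above (figures from the study, not current list prices):

```python
# Per-second prices quoted in the study (USD/s); illustrative, not current rates.
LAMBDA  = 2.3034e-5   # 1 vCPU, 1769 MB RAM, 512 MB disk
EC2     = 9.4753e-6   # c6g.medium: 1 vCPU, 2 GB RAM, 1 GB disk
FARGATE = 1.1003e-5   # container with the same resources as the VM

# Express each alternative as a fraction of the Lambda price
for name, price in [("EC2", EC2), ("Fargate", FARGATE)]:
    print(f"{name}: {price / LAMBDA:.1%} of the Lambda price")
```

Running this reproduces the 41.1% and 47.8% figures cited above.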

RAM, CPU, or both? A restricted parameterization

Serverless billing rests on four main dimensions:

  • Execution duration
  • Allocated resources and actual consumption during the billable window
  • Billing granularity and/or minimum thresholds
  • Fixed invocation charges

Some platforms allocate CPU proportionally to memory (Lambda, Vercel Functions, and Azure Functions, for example), sometimes with only a limited set of configurations (Huawei Function Compute and Oracle Functions are cited).

On these platforms, which expose and charge largely memory-based controls, CPU consumption is implicitly included in the price, which remains at least broadly comparable to that of offerings allowing separate CPU provisioning.

Nonetheless, these platforms impose granularity limits. For instance, Alibaba Cloud requires a ratio between 1:1 and 1:4, with thresholds of 0.05 vCPU or 64 MB. This likely reflects the challenges a severe imbalance could pose for infrastructure management.
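As an illustration, a hypothetical validator for Alibaba-style constraints (the 1:1 to 1:4 vCPU:GB ratio and the 0.05 vCPU / 64 MB granularity cited above); the function name and interface are inventions for this sketch:

```python
def validate(vcpu: float, mem_mb: int) -> bool:
    """Check a CPU/memory request against Alibaba-style constraints:
    vCPU:GB ratio between 1:1 and 1:4, steps of 0.05 vCPU and 64 MB."""
    mem_gb = mem_mb / 1024
    ratio_ok = vcpu <= mem_gb <= 4 * vcpu          # 1:1 up to 1:4
    step_ok = (round(vcpu / 0.05, 6).is_integer()  # 0.05 vCPU granularity
               and mem_mb % 64 == 0)               # 64 MB granularity
    return ratio_ok and step_ok

print(validate(1.0, 2048))   # 1 vCPU, 2 GB -> ratio 1:2, accepted
print(validate(0.05, 8192))  # 8 GB on 0.05 vCPU -> ratio too skewed, rejected
```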

From consumed to billed, clear gaps

On Google Cloud Functions, as well as AWS Fargate and IBM Cloud Code Engine, the ratio between the unit prices of CPU (in CPU-seconds) and RAM (in GB-seconds) sits between 9 and 10, signaling a shared sense of relative value.

There isn’t the same uniformity in billing. The ratio between actually consumed CPU and billed CPU ranges from 1.02 (Cloudflare) to 3.99 (GCP). For RAM, it ranges from 1.95 (Azure) to 5.49 (GCP).
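A back-of-the-envelope sketch of what those ratios mean for a bill. The unit prices below are placeholders chosen only to respect the roughly 9-to-10 CPU-to-RAM price ratio cited above; the consumed-to-billed ratios are the study’s worst-case (GCP) figures:

```python
# Placeholder unit prices with a ~9.6 CPU:RAM price ratio (not real rates)
CPU_PRICE, RAM_PRICE = 2.4e-5, 2.5e-6   # $/CPU-second, $/GB-second

def billed_cost(cpu_ratio, ram_ratio, cpu_s=1.0, gb_s=1.0):
    """Cost of one invocation that consumes cpu_s CPU-seconds and
    gb_s GB-seconds, inflated by the billed/consumed ratios."""
    return cpu_ratio * cpu_s * CPU_PRICE + ram_ratio * gb_s * RAM_PRICE

ideal = billed_cost(1.0, 1.0)          # billed exactly what is consumed
gcp   = billed_cost(3.99, 5.49)        # worst CPU and RAM ratios cited
print(f"Billing inflates the cost by a factor of {gcp / ideal:.1f}")
```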

The execution duration emerges as a significant driver of cost inflation. Even with Lambda’s millisecond-level granularity, the ratio reaches 2.62 for CPU and 3.62 for RAM.

Cloudflare Workers stands out as the only platform that actually bills for real CPU time, but its ceilings (128 MB RAM, 10 MB of code) constrain the workloads that can be run.

It has become commonplace to bill for both the execution of functions and their initialization. This startup phase carries fixed fees: depending on the provider, from 1.5 × 10⁻⁷ to 6 × 10⁻⁷ $/request.

With functions that execute quickly and/or consume few resources, costs can rise quickly. On Lambda, the fixed fees (2 × 10⁻⁷ $/request) equate to about 96 ms of execution time for a function with 128 MB of RAM.
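The 96 ms figure can be reproduced from Lambda’s per-GB-second duration price; the rate below is the commonly cited one and should be checked against current pricing:

```python
GB_SECOND_PRICE = 1.6667e-5   # $/GB-second (commonly cited Lambda rate)
INVOCATION_FEE  = 2e-7        # fixed fee, $/request

mem_gb = 128 / 1024                        # 128 MB expressed in GB
per_second = GB_SECOND_PRICE * mem_gb      # duration cost at 128 MB
breakeven_ms = 1000 * INVOCATION_FEE / per_second

print(f"Fixed fee equals ~{breakeven_ms:.0f} ms of execution at 128 MB")
```

Below that duration, the fixed invocation fee dominates the bill.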

Concurrency management, a cost multiplier

The underlying infrastructure entails costs that are often hidden. Concurrency management is a prime example.

On AWS and platforms with a similar model, each request gets its own sandbox (new or recycled).

With players such as Google, IBM, and Microsoft, the same sandbox can handle several parallel requests if the code permits. Users typically have control over concurrency settings and the scaling strategies those controls drive. However, a misconfiguration can increase execution time and resource usage.

The study explored this with a Python function allocated 1 vCPU. Under normal conditions, each request takes about 160 ms. AWS, by its architecture, is relatively insensitive to traffic spikes. Google Cloud is noticeably more sensitive: with default concurrency (80 requests) and scaling settings (60% CPU usage), average execution time climbs up to 10x once requests exceed six per second, a corollary of the delay involved in collecting the metrics that trigger scaling.
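As a rough illustration only, a single-server queueing approximation (not the study’s model) shows how quickly latency degrades while autoscaling lags behind load; the 160 ms service time comes from the experiment above:

```python
def avg_latency(req_per_s: float, service_ms: float = 160) -> float:
    """M/M/1-style average latency on a single pre-scale sandbox.
    Returns infinity once the sandbox is saturated."""
    rho = req_per_s * (service_ms / 1000)   # utilization of the sandbox
    if rho >= 1:
        return float("inf")                 # queue grows without bound
    return service_ms / (1 - rho)           # waiting + service time

for rps in (2, 5, 6):
    print(rps, "req/s ->", round(avg_latency(rps)), "ms")
```

Even before outright saturation, latency multiplies well beyond the 160 ms baseline, which is consistent in spirit with the degradation observed while scaling metrics catch up.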

The HTTP architecture, a source of latency

The platforms studied exhibit three architectures for processing requests:

  • Polling API (AWS model)
    The user provides a management method (non-HTTP) or an executable to handle requests. A runtime—typically supplied by the provider—executes in the sandbox. It polls an endpoint in a loop to receive requests, and uses the same channel to post responses.
  • HTTP server (IBM, Google Cloud, Microsoft)
    The function runs an HTTP server. The queue, usually hosted in a sidecar, acts as a reverse proxy. The user’s logic is encapsulated in an HTTP handler.
  • Code/binary execution (Cloudflare)
    The user uploads a block of code or a Wasm module. For each request, the runtime compiles (or loads) the payload, executes it, captures the output, and returns the response.

In an experiment with a minimalist function that returns a string and an empty status, the HTTP architecture induced the most latency (5.93 ms on average on Google Cloud, versus 1.17 ms for AWS and 0.01 ms for Cloudflare). Underprovisioned resources can amplify this effect, given the CPU-heavy operations in the request–response cycle (parsing headers and messages, encoding, serialization, etc.).
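For orientation, the polling model can be sketched as follows. The endpoint paths mirror the shape of the AWS Lambda Runtime API; the handler logic and local address are placeholders for this sketch:

```python
import json
import urllib.request

# Local runtime endpoint (placeholder address; path shape follows the
# AWS Lambda Runtime API)
API = "http://127.0.0.1:9001/2018-06-01/runtime/invocation"

def handler(event):
    """Stand-in for the user's function logic."""
    return {"ok": True, "echo": event}

def poll_loop():
    """Runtime loop: long-poll for the next invocation, run the handler,
    post the result back on the same channel."""
    while True:
        with urllib.request.urlopen(f"{API}/next") as r:
            req_id = r.headers["Lambda-Runtime-Aws-Request-Id"]
            event = json.load(r)
        body = json.dumps(handler(event)).encode()
        urllib.request.urlopen(f"{API}/{req_id}/response", data=body)
```

In the HTTP-server model, by contrast, the handler sits behind a reverse proxy and the platform pushes requests in, rather than the runtime pulling them.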

Keep-alive less generous

To limit cold starts, sandboxes can be kept warm for a certain period after a request completes, whether by adding a scale-down delay (Azure, GCP, IBM), caching the code (Cloudflare), freezing the runtime (AWS), or taking a snapshot.

Azure Functions keeps the sandbox warm for between 120 and 360 seconds; Lambda for 300 to 360; Google Cloud Run up to 900. If one trusts the 27 minutes AWS disclosed in 2018, keep-alive has likely shrunk, possibly due to cost-saving measures.

Azure Functions is among the services that do not modify resource allocations during keep-alive. That may help explain its shorter keep-alive duration, especially since it is not billed during that period.

Half the CPU does not necessarily mean half the speed

Scheduling mechanisms also shape outcomes.

Since version 6.8, the Linux kernel uses EEVDF (Earliest Eligible Virtual Deadline First), replacing CFS (Completely Fair Scheduler). Both appear insufficiently granular for serverless needs. Broadly speaking, tasks can queue far longer than their execution time. At the same time, workloads shorter than the bandwidth control interval can produce CPU overprovisioning.

The deployment of a single-thread PyAES function on AWS and GCP revealed another facet: there is no linear relationship between allocated resources and execution time. The overall trend is roughly proportional, but with abrupt variations that become less frequent as more resources are allocated.

On Lambda, there are stepped changes at 1400 MB of RAM, then 700 MB, 470 MB, 350 MB, 280 MB, and so on. The study’s authors read this as a harmonic sequence of the form 1400/n MB (n = 1, 2, 3, 4, 5, …) and suggest a quantization effect at work. In this regime, a function may receive more than it needs (CPU allocation being a function of memory allocation). Hence the sugary analogy…
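The harmonic reading is easy to check numerically:

```python
# First five terms of the 1400/n MB sequence, rounded to the nearest 10 MB,
# reproduce the step thresholds observed on Lambda.
steps = [round(1400 / n, -1) for n in range(1, 6)]
print(steps)   # [1400.0, 700.0, 470.0, 350.0, 280.0]
```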

Dawn Liphardt

I'm Dawn Liphardt, the founder and lead writer of this publication. With a background in philosophy and a deep interest in the social impact of technology, I started this platform to explore how innovation shapes — and sometimes disrupts — the world we live in. My work focuses on critical, human-centered storytelling at the frontier of artificial intelligence and emerging tech.