When Logging Became a Scaling Problem

At small scale, logging feels almost free.

Add a few lines. Ship the service. Search the logs when something breaks.

That works until the system gets large enough that every log line carries a cost. Each line has to be stored, indexed, queried, and paid for. In our case, that cost showed up as slow searches, engineer time spent waiting on queries, BigQuery slot consumption, a small amount of storage overhead, and, above all, the price of streaming logs into BigQuery.

In a high-volume supply-chain environment, some services processed around 500M application transactions per day and over 1B at peak traffic. At that scale, logging stopped being a background detail. It became infrastructure.

The cloud bill made the problem visible, but cost was only the first symptom. Logs were noisy. Important events sat beside low-value request and response payloads, the large bodies of data attached to some log events. Different services described similar concepts in different ways. Most importantly, the platform treated too many logs as if they all needed the same expensive, immediate path.

Streaming meant logs were searchable almost immediately. Batch meant logs were collected first and loaded later at lower cost. The architectural shift was to draw a clear operational line: keep urgent logs fast, move delay-tolerant volume to the cheaper batch path, and standardize the data so both paths stayed useful.

The goal was not simply to log less.

The goal was to keep the signal, reduce the waste, and turn logging into a shared system teams could reason about instead of a pile of service-by-service habits.

The Problem Was Bigger Than The Bill

Cloud logging costs had become large enough to deserve architectural attention. Streaming every log directly into analytics storage was simple, but it treated all logs as equally urgent and equally valuable.

They were not.

A failure event that helps an engineer diagnose a live incident is worth paying to see immediately. A high-volume success-path payload that rarely gets queried has a different cost profile. Route both through the same expensive path by default, and the platform loses the ability to make that distinction.

Teams were not being careless. Most were making reasonable local decisions: add a field here, preserve a payload there, keep the default route because it works. At enterprise scale, those local decisions compounded into a centralized cost and observability problem.

Every log line needed to answer a harder question:

Does this log earn the cost of preserving and querying it this way?

The Default Model Was Too Blunt

The default model was straightforward: applications emitted logs, the cloud logging platform collected them, and BigQuery stored them for search and analysis.

That model worked, but it could not distinguish between logs that needed immediate visibility and logs that could wait. It also failed to create a consistent language across services. One team might use one field name, a second team might use another, and a third might bury the same identifier inside a text message.

During incidents, those inconsistencies matter. Engineers need to filter by common fields, follow a request across services, and separate failures from routine traffic quickly. They should not have to remember each service's naming habits before they can debug.

Cost avoidance and operational clarity turned out to be the same problem from different angles. To control cost, we had to understand which logs mattered. To make logs useful, we needed structure, routing, and shared conventions.

What I Built

I architected and implemented the shared logging approach for Sourcing services: the common JSON logging shape, the stream-vs-batch decision model, and the cloud logging pipeline that supported both ingestion paths.

A shared logging library gave Java services a consistent way to emit structured logs. Instead of inventing a new shape in every application, teams could emit a standard set of operational fields: service, event type, request context, correlation identifiers that let engineers follow one request across services, timing, status, structured payloads, and ingestion method.

That common shape did two jobs at once: it made logs searchable by the same fields across services, and it gave the pipeline enough information to route each event through the right path.
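
To make the idea of a common shape concrete, here is a minimal sketch of what such an event might look like in Java. It is not the actual library: the class name LogEventSketch, the field names, and the hand-rolled JSON rendering are all assumptions made for illustration, and a real implementation would more likely rely on an established JSON library. The sketch also assumes every field is non-null.

```java
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical sketch of a shared log event shape; all names are illustrative. */
final class LogEventSketch {

    /** Ingestion method carried on every event so the pipeline can route it. */
    enum Ingestion { STREAM, BATCH }

    /** Common operational fields each service would emit (assumed non-null). */
    record LogEvent(
            String service,
            String eventType,
            String requestType,
            String correlationId,
            String status,
            long durationMillis,
            Instant timestamp,
            Ingestion ingestion,
            Map<String, String> payload) {

        /** Renders the event as one JSON object with a fixed field order. */
        String toJson() {
            // A TreeMap copy gives the service-specific payload a deterministic key order.
            String payloadJson = new TreeMap<String, String>(payload).entrySet().stream()
                    .map(e -> quote(e.getKey()) + ":" + quote(e.getValue()))
                    .collect(Collectors.joining(",", "{", "}"));
            return "{"
                    + "\"service\":" + quote(service)
                    + ",\"eventType\":" + quote(eventType)
                    + ",\"requestType\":" + quote(requestType)
                    + ",\"correlationId\":" + quote(correlationId)
                    + ",\"status\":" + quote(status)
                    + ",\"durationMillis\":" + durationMillis
                    + ",\"timestamp\":" + quote(timestamp.toString())
                    + ",\"ingestion\":" + quote(ingestion.name())
                    + ",\"payload\":" + payloadJson
                    + "}";
        }
    }

    /** Minimal JSON string escaping: quotes, backslashes, and control characters. */
    static String quote(String value) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : value.toCharArray()) {
            switch (c) {
                case '"' -> out.append("\\\"");
                case '\\' -> out.append("\\\\");
                case '\n' -> out.append("\\n");
                case '\r' -> out.append("\\r");
                case '\t' -> out.append("\\t");
                default -> {
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
                }
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // Example values are invented for the sketch.
        LogEvent event = new LogEvent(
                "sourcing-api", "REQUEST_COMPLETED", "availability-check",
                "c0ffee-1234", "SUCCESS", 42L,
                Instant.parse("2024-01-01T00:00:00Z"),
                Ingestion.BATCH, Map.of("region", "us-east"));
        System.out.println(event.toJson());
    }
}
```

Keeping the shared fields at the top level and the service-specific data under a single payload key is one way to get both properties described above: every service can be filtered by the same columns, and the ingestion field travels with the event so the pipeline does not have to guess.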

The pipeline then carried those events through streaming or batch ingestion.

Batch-loaded logs moved through regional cloud storage and scheduled BigQuery transfer jobs before landing in the consolidated analytics table, usually about 1 to 1.5 hours after ingestion.

That delay was not a hidden downside. It was an explicit tradeoff.

The Key Architectural Decision

The decision was not simply to batch logs. It was to decide where the line belonged.

Batch too much, and incident response goes blind. Stream too much, and the cost problem comes back.

So we split the platform into two paths:

Immediate streaming for logs engineers need during incidents, deployments, and production validation.
Batch loading for delay-tolerant, high-volume logs where lower cost mattered more than instant availability.

That choice changed the logging conversation. Teams had to decide which request types, services, and event categories needed immediate visibility. For example, we streamed error, exception, and failure logs because incident responders needed them as quickly as possible. Happy-path request and response logs could batch because teams could usually wait about 1 to 1.5 hours for those records. Individual events could still stream when a real operational need required it.

That gave teams room to make service-specific calls without inventing a new logging model every time.
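
The routing rule described above can be sketched as a small decision function. This is an illustrative assumption, not the production logic: the class name IngestionRouterSketch, the event type strings, the per-team set of always-streamed request types, and the forceStream override are all hypothetical. Inputs are assumed to be non-null.

```java
import java.util.Set;

/** Hypothetical stream-vs-batch router; names and rules are illustrative assumptions. */
final class IngestionRouterSketch {

    enum Ingestion { STREAM, BATCH }

    /** Event categories treated as urgent: errors, exceptions, and failures. */
    private static final Set<String> URGENT_EVENT_TYPES =
            Set.of("ERROR", "EXCEPTION", "FAILURE");

    /** Request types a team has chosen to keep on the fast path (assumed configuration). */
    private final Set<String> streamedRequestTypes;

    IngestionRouterSketch(Set<String> streamedRequestTypes) {
        this.streamedRequestTypes = Set.copyOf(streamedRequestTypes);
    }

    /**
     * Picks an ingestion path for one event.
     * forceStream models an individual event that must stream for an operational reason.
     */
    Ingestion route(String eventType, String requestType, boolean forceStream) {
        if (forceStream) {
            return Ingestion.STREAM;
        }
        if (URGENT_EVENT_TYPES.contains(eventType)) {
            return Ingestion.STREAM;
        }
        if (streamedRequestTypes.contains(requestType)) {
            return Ingestion.STREAM;
        }
        // Everything else is delay-tolerant volume and takes the cheaper path.
        return Ingestion.BATCH;
    }

    public static void main(String[] args) {
        // "payment-capture" is an invented request type used only for the example.
        IngestionRouterSketch router = new IngestionRouterSketch(Set.of("payment-capture"));
        System.out.println(router.route("FAILURE", "availability-check", false));
        System.out.println(router.route("REQUEST_COMPLETED", "availability-check", false));
        System.out.println(router.route("REQUEST_COMPLETED", "payment-capture", false));
    }
}
```

Defaulting to batch and listing the exceptions explicitly keeps the expensive path opt-in, which matches the intent of the split: a team has to state why a category of logs needs to be fast.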

Standardization Made The Split Work

The split only worked because the logs had a common shape.

We needed logs to be easier to query across services. That meant aligning on common fields and a common destination model. The consolidated BigQuery table reflected the dimensions engineers use when investigating production behavior: workload, log type, service, correlation ID, request type, severity, timing, and payload context.

That structure helped engineers ask consistent questions:

Which service handled this request?
What request type was involved?
Was this a normal functional event or a failure?
Did related logs share a correlation ID?
Was the event streamed immediately or loaded through batch processing?

That common model turned the routing decision from theory into operations.

The Tradeoffs Were The Work

The hard part was not choosing the cheaper path. The hard part was deciding when the cheaper path was safe.

Cost vs. immediacy. Streaming gave engineers fast access, but at a higher cost. Batch loading lowered cost, but introduced a predictable delay. The design had to make that distinction visible and intentional.

Standardization vs. team autonomy. A shared schema made logs easier to query, but teams still needed room to describe their own domains. The library handled the common structure while preserving service-specific payloads and event choices.

Operational safety vs. cost pressure. The cheapest system would have batched too much. The safest system would have streamed everything. The useful system kept urgent signals immediate and moved delay-tolerant volume to the lower-cost path.

Platform design vs. application ownership. The infrastructure provided the routes, tables, transfers, and shared library. Teams still had to decide which logs were operationally critical for their services.

The bigger shift was giving teams a cost-aware way to classify logging behavior, so that urgent signals stayed fast and delay-tolerant volume moved to the cheaper path.

What Changed

Based on internal cloud cost analysis, the result was more than $1M in annual cloud cost avoidance.

That number mattered because it showed the work had direct business impact.

The better long-term outcome was that logging became more intentional. Services had a consistent JSON logging shape. Engineers could query common fields across applications. The platform could route logs through streaming or batch ingestion based on operational need. High-volume logs no longer had to default to the most expensive path.

The system also made future logging decisions easier. When a team added or changed logging, the conversation could move beyond "should we log this?" and toward better questions:

Who uses this log?
How quickly do they need it?
What fields make it queryable?
Does it belong on the streaming path or the batch path?
What cost does this event create at production volume?

Those questions made teams design logging deliberately instead of accumulating it by default.
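
The last question in that list lends itself to a back-of-the-envelope calculation. The sketch below is only that: the class LogCostSketch is hypothetical, and the bytes-per-event figure and both per-GB prices are placeholders rather than real BigQuery pricing or our actual numbers. Only the 500M events per day figure echoes the volume mentioned earlier.

```java
/** Back-of-the-envelope ingestion cost estimate; sizes and prices are placeholders. */
final class LogCostSketch {

    private static final double BYTES_PER_GB = 1_000_000_000.0;
    private static final int DAYS_PER_YEAR = 365;

    /** Annual cost of sending every event through a single path. */
    static double annualCost(long eventsPerDay, int bytesPerEvent, double pricePerGb) {
        double gbPerDay = eventsPerDay * (double) bytesPerEvent / BYTES_PER_GB;
        return gbPerDay * pricePerGb * DAYS_PER_YEAR;
    }

    /** Annual cost when a fraction of events moves to the batch path. */
    static double annualCostWithSplit(
            long eventsPerDay,
            int bytesPerEvent,
            double batchFraction,
            double streamPricePerGb,
            double batchPricePerGb) {
        if (batchFraction < 0.0 || batchFraction > 1.0) {
            throw new IllegalArgumentException("batchFraction must be between 0 and 1");
        }
        long batched = Math.round(eventsPerDay * batchFraction);
        long streamed = eventsPerDay - batched;
        return annualCost(streamed, bytesPerEvent, streamPricePerGb)
                + annualCost(batched, bytesPerEvent, batchPricePerGb);
    }

    public static void main(String[] args) {
        long eventsPerDay = 500_000_000L;  // volume figure from the text
        int bytesPerEvent = 2_000;         // placeholder average event size
        double streamPrice = 0.05;         // placeholder price per GB, not a real rate
        double batchPrice = 0.01;          // placeholder price per GB, not a real rate

        double allStreamed = annualCost(eventsPerDay, bytesPerEvent, streamPrice);
        double split = annualCostWithSplit(
                eventsPerDay, bytesPerEvent, 0.8, streamPrice, batchPrice);

        System.out.printf("All streamed: %.2f per year%n", allStreamed);
        System.out.printf("80%% batched:  %.2f per year%n", split);
    }
}
```

The exact output depends entirely on the placeholder inputs. The useful part is the shape of the calculation: event volume, event size, and the price of the chosen path multiply together, so a small payload added to a high-volume event is a decision about cost at production volume.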

What I Learned

The biggest lesson was simple: observability has a cost model.

Ignore that cost model, and the bill eventually becomes the architecture.

Good logging is not about emitting the most data. It is about preserving the right information, in the right shape, on the right path, for the right amount of time.

At enterprise scale, useful logs are designed. They are not accumulated.

That is the lesson I would carry into any high-volume system now. Treat logging like infrastructure early. Give teams a shared schema. Make routing decisions explicit. Keep urgent operational signals fast. Move delay-tolerant volume to cheaper paths.

The right answer is not "log everything." It is not "log less" either.

The right answer is to make logs earn their place.