
Signal vs Noise: Lessons in Logging from the Deep End
Not so long ago, I was working on a voice-based OTP delivery system. Users could opt to get their one-time passcodes through a phone call instead of SMS. Think of it like this:
- Whilst registering or logging in, they could select “Call me with the code” as an alternative to SMS for 2-factor authentication
- We generate an OTP
- Send it to Amazon Polly for text-to-speech
- Initiate a phone call
- Polly reads: “Your verification code is six, four, nine, two…”
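In code, the happy path looked roughly like the sketch below. This is a simplified illustration: the Polly call uses the real AWS SDK, but the telephony client is a placeholder passed in by the caller, since the actual call API we used isn't the point here.

const { PollyClient, SynthesizeSpeechCommand } = require("@aws-sdk/client-polly");

const polly = new PollyClient({ region: "us-east-1" });

// callApi is a stand-in for whatever telephony provider actually places the call
async function deliverOtpByCall(callApi, phoneNumber, otp) {
  // Space the digits out so Polly reads them one at a time
  const spokenDigits = otp.split("").join(", ");

  // Text-to-speech via Polly; SSML gives control over pacing
  const { AudioStream } = await polly.send(new SynthesizeSpeechCommand({
    Text: `<speak>Your verification code is ${spokenDigits}</speak>`,
    TextType: "ssml",
    OutputFormat: "mp3",
    VoiceId: "Joanna",
  }));

  // Hand the synthesized audio to the telephony provider and start the call
  return callApi.placeCall({ to: phoneNumber, audio: AudioStream });
}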
Pretty straightforward. Except it wasn’t.
I Logged Everything
When I say “everything,” I mean:
- When the OTP was generated
- The full text sent to Polly (SSML and all)
- How long Polly took to respond
- Whether Polly used standard or neural voice
- The audio stream conversion result
- The telephony API’s response
- The retry logic
- Even the payloads of those requests
At first, it felt right. “Let me log everything so I can catch anything that goes wrong.”
That’s what you do when you’re new: log first, ask questions later.
But as traffic grew, especially once we started delegating parts of the stack to external services (Polly, cloud logging, call APIs), the cracks started to show.
Reality Check: The Noise Was Deafening
Here’s what started happening:
- Debugging time increased: When a user reported “I didn’t get a call,” I had to scroll through hundreds of successful call logs, retry attempts, and Polly confirmations to maybe, just maybe, find a timeout or a real error.
- Logs were too verbose to be useful: Everything looked important, but nothing stood out. Even when filtering for [DEBUG], [ERROR], or [INFO], it wasn’t obvious which failure mattered.
- Cloud logging costs spiked: We started paying per GB of log ingestion. Yeah.
- Latency issues showed up: Some services were logging synchronously, blocking event loops (minor, but at scale, that adds up). # skill issue btw
- Async complexity made debugging harder: When we made the system asynchronous, tracing issues became a nightmare. A single user flow could span multiple logs across different timestamps, making it nearly impossible to follow the sequence of events.
So… What Could I Have Done Differently?
I think this is the part that separates someone early in their career from someone who’s done it wrong a few times and learned:
1. Log for Understanding, Not for Everything
Just because you can log it doesn’t mean you should. Now, I ask:
If this log line appeared in an incident, would it actually help me fix the problem?
If not, maybe that’s a DEBUG, or maybe it’s not needed at all.
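In practice, that question usually splits one noisy log into two: the chatty detail drops to debug, and only the line that answers “what happened and how slow was it” stays at info. A rough sketch (using pino purely for illustration; the field names are made up for this example):

const pino = require("pino");
const logger = pino({ level: process.env.LOG_LEVEL || "info" });

function logTtsResult(callId, ttsLatencyMs, pollyResponse) {
  // Chatty detail: handy in development, pure noise during an incident
  logger.debug({ call_id: callId, polly: pollyResponse }, "Polly raw response");

  // The line that would actually help in an incident
  logger.info({ call_id: callId, tts_latency_ms: ttsLatencyMs }, "otp_call TTS complete");
}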
2. Aggregate, Don’t Repeat
Rather than logging every retry individually, I’d log a summary like:
Polly retry: 3 attempts, succeeded on 2nd, total latency 540ms
This saves space and actually tells me what I need to know.
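Here’s a sketch of how I’d do that now (my own illustration, not the original production code): wrap the retry loop and emit one summary line when it finishes, instead of one line per attempt.

const pino = require("pino");
const logger = pino();

// Retry an async operation, then log a single summary instead of one line per attempt
async function withRetries(name, fn, maxAttempts = 3) {
  const start = Date.now();
  let lastError;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = await fn();
      logger.info(
        { op: name, attempts: attempt, total_latency_ms: Date.now() - start },
        "retry summary: succeeded"
      );
      return result;
    } catch (err) {
      lastError = err;
    }
  }

  logger.error(
    { op: name, attempts: maxAttempts, total_latency_ms: Date.now() - start, error: lastError.message },
    "retry summary: exhausted all attempts"
  );
  throw lastError;
}

Something like withRetries("polly_tts", () => polly.send(command)) then produces exactly that one-line summary, whether it succeeded on the second attempt or gave up after three.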
3. Structure Logs
I didn’t think structured logs mattered until I had to grep through 200 lines for a single call_id.
In the case of the OTP system, I structured logs as JSON:
{
  "event": "otp_call",
  "status": "success",
  "call_id": "123456",
  "tts_latency_ms": 320,
  "user_completed_flow": true
}
But let’s be real, not all logs look this clean. Here’s what you would often see in the wild:
[ERROR] 2024-03-20T15:42:13.456Z Failed to connect to DB
[ERROR] 2024-03-20T15:42:13.457Z Error in user registration
[ERROR] 2024-03-20T15:42:13.458Z Transaction failed
Or even worse:
[ERROR] Connection refused
[ERROR] null
[ERROR] undefined
The difference is night and day. The first example (JSON) gives me everything I need to understand and fix the issue. The second? I’m left with more questions than answers. What user? What transaction? Why did it fail?
This kind of contrast really drove home why structured logging matters. It’s not just about making logs pretty — it’s about making them useful when it matters most.
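To get logs shaped like that JSON example without hand-assembling objects everywhere, a structured logger does the heavy lifting. A minimal sketch with pino (my pick for illustration; any structured logger works): a child logger carries the call_id, so every line a single flow emits, even across async steps, stays greppable.

const pino = require("pino");
const logger = pino();

async function handleOtpCall(callId, userId) {
  // Every line logged through this child automatically carries call_id and user_id
  const log = logger.child({ call_id: callId, user_id: userId });

  log.info({ event: "otp_call", status: "started" });

  const ttsStart = Date.now();
  // ... synthesize speech, place the call ...
  log.info({ event: "otp_call", status: "success", tts_latency_ms: Date.now() - ttsStart });
}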
4. Sampling in High-Traffic Systems
One thing I learned the hard way: when dealing with high-traffic systems, logging everything isn’t just expensive, it’s often useless. Instead, I’ve started using sampling:
// Log roughly 10% of requests to spot patterns without drowning in volume
if (Math.random() < 0.1) {
  logger.info("Request processed", {
    path: req.path,
    duration: endTime - startTime,
    status: res.statusCode // in Express, res.status() sets the code; res.statusCode holds it
  });
}
This gives me enough data to spot patterns without drowning in noise.
5. The Human Element
I’ve come to realize that logs aren’t just for machines, they’re for humans too. When I write logs now, I try to think about who might read them:
- Is it a fellow developer debugging at some odd hour?
- Is it an ops engineer trying to understand system behavior?
- Is it me, six months from now, trying to remember why I made certain decisions?
This perspective has completely changed how I approach logging.
So, what did I learn? The funny thing about logging is that it’s like debugging itself: you don’t really understand it until you’ve done it wrong a few times. I’m not saying I’m an expert now, but I’ve definitely learned to be more intentional about what I log and why.
Maybe that’s the real lesson here: sometimes you need to make the mistakes to truly understand why the best practices exist in the first place. And that’s okay, as long as you learn from it.
If you’re interested in diving deeper into logging best practices, I found these resources particularly helpful in my journey:
- The 10 Commandments of Logging by Masterzen
- Logging Wisdom: How to Log by Emil Stenqvist