Umur Inan Blog

The SPA Was a Twenty-Year Detour

Fri, 05 Jun 2026 12:00:00 GMT

The SPA earned its moment

I want to start by giving the single-page app its due, because the case against it only works if you are honest about the case for it. When Gmail and Google Maps showed up, they made the web feel like real software for the first time. You could drag a map and watch it move. You could read mail without the whole page blinking white and reloading. That was a genuine leap, and the technology behind it was the SPA: load the app once, then talk to the server in the background and repaint only what changed.

For a product that is genuinely an application, this is still the right shape. Rich interactivity, instant transitions between views, the ability to keep working when the network drops, a feel that matches a desktop program. None of that was hype. The SPA delivered something the old request-render-reload web could not, and a generation of impressive products was built on it. I am not here to pretend otherwise.

Then everything became one

The trouble started when the SPA stopped being a tool for building applications and became the default for building anything with a URL. A marketing site became a client-rendered app. A blog became a client-rendered app. A documentation page, a pricing table, a company's about section, all of it shipped as JavaScript that booted in the browser, fetched JSON from an API, and rendered itself on the user's machine.

These are documents. They are text and images that a server could have rendered to HTML and handed over in one round trip, the way the web did from the start. Instead we taught a generation of engineers that the normal way to display an article is to ship a JavaScript runtime that downloads the article separately and assembles it on arrival. The exception had quietly become the rule.

We rebuilt the server in the browser

Once rendering moved to the client, everything the server used to handle had to be rebuilt there. The browser needed a router, so we wrote client-side routers. It needed to turn data into markup, so we shipped templating and a virtual DOM. It needed to fetch and cache data, so we grew whole libraries for request caching and state management that a database and an HTTP cache used to provide for free.

Then search engines could not read pages that did not exist until JavaScript ran, so we invented server-side rendering to draw the first paint on the server after all, plus hydration to wire it back up on the client. Read that sequence again. We moved rendering off the server because the server felt old-fashioned, then reimplemented server rendering inside the client framework to fix what we broke, and called the round trip progress.

The bill: two of everything

A client-rendered site in the maximalist style runs two of nearly everything. There are two routers, one in the framework and one the server still needs. The render path doubles too: a server pass and a client hydration pass that have to agree exactly, or the page flickers and throws warnings. And an entire API layer exists for a single reason: the browser cannot reach the database directly, so every piece of data takes a detour through JSON.

None of that is free. You pay in the bundle the user downloads before seeing anything, in the seconds of hydration before the page answers a click, and in the engineering time spent keeping two render paths in sync. For an application that needs the interactivity, the bill is worth paying. For an article, it is pure overhead charged to the reader.

The swing back is correction, not fashion

What is happening now is the pendulum finding center again. htmx lets the server send HTML over the wire in response to user actions, with no JSON and no client framework required. Hotwire and Turbo do the same for the Rails world. React's own server components push rendering back to the server by default and ship client JavaScript only for the parts that are genuinely interactive. Astro renders static HTML and hydrates small islands exactly where you need them.

The common thread is not nostalgia. Every one of these tools renders on the server because that is where rendering is cheap and fast, and sends JavaScript to the browser only where interaction demands it. That is the shape the web probably should have kept all along, with the SPA reserved for the cases that actually call for it.
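
As a rough sketch of what "HTML over the wire" looks like from the server's side, here is a comment section rendered with nothing but Python's standard library. The function names, the /comments URL, and the element id are made up for illustration. The hx-post, hx-target, and hx-swap attributes are htmx's own.

```python
from html import escape


def render_comment_item(text: str) -> str:
    """Render one comment as an HTML fragment, escaping user input."""
    return f"<li>{escape(text)}</li>"


def render_comment_section(comments: list[str]) -> str:
    """Render the comment list plus a form, entirely on the server.

    The hx-* attributes ask htmx to POST the form and append the returned
    <li> to #comments. The URL and the id are hypothetical.
    """
    items = "".join(render_comment_item(text) for text in comments)
    return (
        f'<ul id="comments">{items}</ul>'
        '<form hx-post="/comments" hx-target="#comments" hx-swap="beforeend">'
        '<input name="text"><button>Post</button>'
        "</form>"
    )


section = render_comment_section(["Nice post", "<script>alert(1)</script>"])
```

The handler behind the hypothetical /comments route would save the comment and answer with render_comment_item(text). The response is markup the browser can place directly, so there is no JSON to decode and no client-side template to keep in sync.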

Where the SPA genuinely still wins

I have to hold up the other side of this honestly, because plenty of products are real applications and the SPA is right for them. Figma is not a document. A collaborative editor where ten cursors move at once is not a document. A trading dashboard streaming live prices, a design tool, an offline-first field app that has to keep working in a tunnel, a deeply stateful interface where the client holds most of the truth in memory, these earn the architecture outright.

The test is whether the interactivity is the point or the garnish. When the client genuinely owns rich, live, stateful behavior that a server round trip would ruin, ship the SPA and do not look back. The mistake was never building SPAs. It was building them for things that were never applications in the first place.

The question everyone skipped

Underneath all of this is one question that mostly went unasked: is this thing a document or an application? It is the same question the right tool always depends on, and the SPA era answered it the same way every time, regardless of what was actually being built.

Most of what we make is documents with a few interactive parts. A blog with a comment box. A product page with an add-to-cart button. A docs site with a search field. Those are HTML with islands of behavior, not applications that happen to contain some text. Naming which one you are building, before you pick the stack, is the whole decision.

What I reach for now

From the backend seat, my default has swung back with the pendulum. I render on the server, send HTML, and add JavaScript only on the specific elements that have to react without a round trip. The page is fast before any framework boots, search engines read it without special handling, and there is one render path instead of two to keep honest.

When the thing I am building is actually an application, the kind where the client owns live state and interaction is the entire product, I reach for the SPA and pay its costs gladly. The twenty-year detour was not the SPA existing. It was forgetting that most of the web is still, underneath all the JavaScript, a stack of documents.

Your Internal APIs Shouldn't Be REST

Thu, 04 Jun 2026 12:00:00 GMT

The bug a schema would have caught at build time

It started with a field that changed type. An upstream service that owned user records shipped a release where account_id went from a number to a string, because a new partner used non-numeric IDs. JSON serialized it without complaint. It parsed fine on the other side too. Three downstream services kept running and quietly started writing the wrong thing to their own tables, and we found out two days later from a support ticket, not an alert.

Nothing in the pipeline could have caught it, because there was nothing to catch it with. JSON over HTTP has no opinion about what a field is supposed to be. The contract between those services lived in a wiki page and in the memory of whoever wrote the integration. That is the moment I stopped defending REST for internal traffic. The bug was free to happen because we had picked a protocol that does not know what our own data looks like.

REST was a decision you made for outsiders

Everything REST is good at is aimed at a stranger. It is human-readable, so a developer who has never seen your API can poke at it with curl. Discoverability lets clients you do not control navigate it on their own. Loose typing means a dozen unknown consumers can each ignore the fields they do not care about. And it rides plain HTTP, so any caching layer on earth already understands it.

Those are real virtues at a public boundary, where you do not own the other end and have to be generous about what you accept. Between two services that live in the same repo, deploy from the same pipeline, and get read by the same team, every one of those virtues turns into a cost you pay for nobody. You are being forgiving toward a consumer that is also you, and you are paying in bytes, in latency, and in bugs that wait until production to introduce themselves.

What a typed contract buys inside the walls

A protobuf schema is a real artifact that both sides compile against. The field that changed type becomes a build error in the service that consumed it, the morning the producer tries to ship the change, instead of a support ticket two days later. That schema is the source of truth, the generated stubs make the call look like a local function, and a breaking change is something your CI can refuse before it ever reaches an environment.
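
To make the account_id story concrete, here is a small sketch in plain Python. It is a runtime stand-in for a typed contract, not protobuf itself: the UserRecord shape, the email field, and decode_user are assumptions for illustration. A compiled schema would surface the same mismatch earlier, at build time, where this version only catches it on the first bad message.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class UserRecord:
    account_id: int
    email: str


def decode_user(raw: str) -> UserRecord:
    """Parse JSON and enforce the field types the consumer depends on."""
    data = json.loads(raw)
    account_id = data["account_id"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(account_id, int) or isinstance(account_id, bool):
        raise TypeError(f"account_id must be int, got {type(account_id).__name__}")
    if not isinstance(data["email"], str):
        raise TypeError("email must be str")
    return UserRecord(account_id=account_id, email=data["email"])


old_payload = '{"account_id": 42, "email": "a@example.com"}'
new_payload = '{"account_id": "partner-42", "email": "a@example.com"}'

# Plain JSON parsing accepts both shapes without complaint.
assert json.loads(old_payload)["account_id"] == 42
assert json.loads(new_payload)["account_id"] == "partner-42"

# The typed decoder accepts the old shape and rejects the new one.
user = decode_user(old_payload)
try:
    decode_user(new_payload)
    rejected = False
except TypeError:
    rejected = True
```

Even this hand-rolled check turns a silent two-day corruption into a loud failure at the boundary. The schema-first toolchain moves that failure further left, into the build.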

The wire format pays you back too. Protobuf encodes that same payload in a fraction of the bytes, because it drops the repeated field names and quotes and whitespace that make JSON pleasant to read and expensive to move. Running on HTTP/2, gRPC multiplexes many calls over one connection and streams in both directions without you hand-rolling any of it. For chatty service-to-service traffic, the difference in payload size and tail latency stops being a rounding error.
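
The effect of repeated field names is easy to see with the standard library. The sketch below uses struct as a stand-in for a binary encoding. It is not protobuf, which uses field tags and variable-length integers, so real sizes will differ. The tick records and field names are hypothetical.

```python
import json
import struct

# Hypothetical batch of 1,000 price ticks: (instrument id, price in cents, timestamp).
ticks = [(i, 10_000 + i, 1_700_000_000 + i) for i in range(1000)]

# JSON repeats every field name, plus quotes and punctuation, in every record.
as_json = json.dumps(
    [{"instrument_id": a, "price_cents": b, "timestamp": c} for a, b, c in ticks]
).encode("utf-8")

# A fixed binary layout carries only the values: three little-endian
# unsigned 32-bit integers per record, 12 bytes each.
record = struct.Struct("<III")
as_binary = b"".join(record.pack(a, b, c) for a, b, c in ticks)

ratio = len(as_json) / len(as_binary)
```

The binary form is exactly 12 bytes per record, while the JSON form spends most of its bytes spelling out the same three keys a thousand times. The exact ratio depends on the data, so measure it on a payload you actually send.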

The through-line with my last post

I wrote recently that running REST and GraphQL together is two problems wearing one truce, and that inside your own walls a typed contract beats GraphQL's flexibility. This is the other half of that thought. If the reason you keep service-to-service calls off GraphQL is that you want a strict, typed contract, then the honest endpoint of that logic is not REST either. REST is a typed contract by convention and good intentions only. gRPC is one the compiler enforces. The typed contract I was reaching for in that post has a name, and the name is protobuf.

The costs gRPC actually adds

This is a trade, not a free upgrade, and the trade has a real other side. Browsers do not speak gRPC natively, so anything a frontend touches needs grpc-web or a JSON gateway in front, which is exactly why the public edge stays REST. You cannot curl a gRPC endpoint and read the answer with your eyes, so casual debugging gains a tool you did not need before. There is a proto build step, a code-generation toolchain, and a learning curve for a team that has only ever shipped JSON. None of that arrives for free.

Where internal REST is still fine

I am not telling you to put protobuf between two services and call it architecture. If your whole backend is three services and a dozen endpoints, the proto toolchain costs more than the typing saves, and plain JSON over HTTP is the right amount of machine for the job. You do not gRPC three services any more than you stand up Kafka for three topics. The win shows up when the service count climbs, the call volume is real, and the same handful of teams keep breaking each other across an untyped seam.

The signal is not a number you read off a chart. It is the second or third time a renamed field causes an outage, or the first time payload size shows up inside a latency budget. That is the system telling you the contract needs teeth.

The migration is cheaper than it looks

You do not rewrite anything in a weekend, and you do not have to choose all at once. Write protobuf definitions for the calls two services already make, stand the gRPC server up alongside the existing REST handlers in the same process, and move one consumer over. A service can speak both for as long as it needs to. The edge keeps a JSON gateway so browsers and outside callers never notice a thing.

Because the change is per-call instead of per-system, you get to spend the cost exactly where the pain is. The chattiest, most type-sensitive path moves first and earns the toolchain its keep, and the rest follows only if it turns out to be worth it.

What I actually reach for

At the public edge I reach for REST, because the audience is strangers and the whole point is being easy on people I will never meet. Between services, once there are more than a few of them and the calls carry real structure, I reach for gRPC and let the compiler hold the contract instead of a wiki page. The boundary decides the protocol, not habit and not whatever the last service happened to use.

The field that changed type is the whole argument in one bug. At a public boundary, accepting it gracefully is a feature. Between your own services, accepting it silently is a two-day incident with your name on the commit. Pick the protocol that knows what your data is supposed to be, and pick it based on who is standing on the other end.

Postgres Won the Database War. Now What?

Wed, 03 Jun 2026 12:00:00 GMT

Postgres won, and we should just say so

For the first time in the history of the Stack Overflow Developer Survey, Postgres passed MySQL as the most-used database among professional developers. That is not a press release. It is the center of gravity moving. A generation of engineers learned MySQL by default, the next one is learning Postgres by default, and defaults compound.

This was earned. Postgres spent two decades being the careful, correct, slightly less fashionable choice while it grew a feature set nobody else matched: real transactions, rich types, a planner you can actually reason about, and an extension ecosystem that turned it into a platform. The win is deserved. The interesting part is what a deserved win does to how people decide.

"Postgres for everything" is the new default

The reflex now is to put everything on it. Relational data, obviously. The job queue, with SKIP LOCKED. Vector search, with pgvector. Full-text search, with tsvector. Scheduled jobs, with pg_cron. The cache, because a table with a TTL column is right there. Even the analytics, because the data already lives in it.

I understand the pull completely, and most of the time I follow it. One database is one thing to run. The maximalist position has a real argument behind it, and the people making it are not wrong to start there. The trouble is that a default this strong stops feeling like a choice, and a choice you stop making is a choice you stop checking.

Most of that gravity is good

Consolidating on Postgres buys you things that are easy to undervalue until you have lived without them. You back up one system and restore one system. One mental model covers how queries behave under load. One connection story, one failover story, one set of metrics that everyone on the team already reads.

Every extra datastore you add is not just its own operational weight. It is a new seam between two systems that can disagree, a new thing to keep consistent, a new page at 3am for a component half the team has never touched. Keeping the count low is a genuine engineering virtue, and Postgres makes a low count realistic for longer than it used to be. That part of the trend is healthy.

Where the default quietly stops

The walls are real, and I have walked into most of them. A Postgres table makes a fine job queue at low volume and a lock-contention machine when you push real event-stream throughput at it, which is its own whole post. Past tens of thousands of messages a second with many independent consumers, you want a log instead of a table, and that is the day Kafka stops being overkill. pgvector is excellent at a hundred thousand embeddings and a different animal at ten million, where index build time and recall start to fight each other. Heavy analytical scans across tens of gigabytes will pin the same instance your transactions live on, while a columnar engine on its own box finishes the same work in a fraction of the time.

None of these are reasons to avoid Postgres. They are the thresholds where "Postgres can do it" and "Postgres is the right tool here" stop being the same sentence.

The cost of a monoculture reflex

The quiet danger of a strong default is not any single wrong call. It is what happens to the question. "Postgres can do it" is true often enough that it starts answering "should it" before anyone asks "should it" out loud. The reflex eats the decision.

There is a skills cost too. A team that has never run anything but Postgres meets its first message broker, its first search cluster, or its first columnar warehouse during an incident, under load, at the worst possible time to be learning something new. The monoculture feels simple right up to the moment it is not, and by then nobody in the room has the muscle memory for the alternative.

Winning is exactly when over-reach starts

Every tool that wins its category goes through this. The trait that made it the safe choice, ubiquity, is the same trait that turns it into the answer to questions nobody examined. Java got used for shell scripts. JavaScript got used for everything with a screen. Popularity is not the problem on its own. The problem is that popularity makes the reflex feel like prudence.

A default that strong needs a tripwire, or it silently becomes the answer to questions you never actually weighed. Granting that Postgres should be the starting point is easy. Granting it the right to skip the architecture conversation is how you end up running a streaming platform inside a relational database and calling it simplicity.

The rule: a default plus a written threshold

The fix is small, and it is a habit instead of a tool. When you reach for Postgres to do the non-relational thing, write down the number that means you have outgrown it, in the same document where you made the choice. The queue moves off Postgres above this sustained rate. Vectors leave when recall drops below this bar or the index build crosses this duration. Analytics get their own box once the nightly scan starts stealing latency from live traffic.
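
The habit is the point, not the tooling, but if it helps to keep the thresholds somewhere checkable, a sketch might look like this. Every number and metric name below is hypothetical. The only real requirement is that they are written down before the incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    workload: str
    metric: str
    limit: float
    breach_when: str  # "above" or "below"

    def breached(self, measured: float) -> bool:
        if self.breach_when == "above":
            return measured > self.limit
        return measured < self.limit


# Hypothetical numbers, agreed on when the design choice was made.
THRESHOLDS = [
    Threshold("queue", "sustained_msgs_per_sec", 5_000, "above"),
    Threshold("vectors", "recall_at_10", 0.95, "below"),
    Threshold("vectors", "index_build_minutes", 60, "above"),
]


def outgrown(measurements: dict[str, float]) -> list[str]:
    """Return the metrics whose written threshold has been crossed."""
    return [
        t.metric
        for t in THRESHOLDS
        if t.metric in measurements and t.breached(measurements[t.metric])
    ]


crossed = outgrown({"sustained_msgs_per_sec": 7_200, "recall_at_10": 0.97})
```

Here only the queue rate has crossed its line, so that is the one workload worth a conversation. Recall is still above its bar and nothing needs to move.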

You do not have to act on those thresholds today. You only have to name them today, while you are calm and thinking clearly, instead of discovering them at 2am when the one decision left on the table is a panicked migration under an incident bridge.

What I actually do

I reach for Postgres first, the same as everyone else, because the gravity is right far more often than it is wrong. The difference is one paragraph in the design doc: what this Postgres-shaped solution actually is, and the measured signal that tells me it is time to move that one workload somewhere built for it.

Postgres winning is good news. It is a sharper, more capable default than the one it replaced, and most systems are better off starting there. A default is still a starting point, not a verdict. The teams that get the most out of this era will be the ones who kept asking the question the win made so easy to skip.

You're Running Kafka for Three Topics

Tue, 02 Jun 2026 12:00:00 GMT

The cluster we stood up to move three topics

Three of our services needed to talk without calling each other directly. An order service produced events, and a billing service and a notifications service wanted them. That was it. Three topics, a few hundred messages a minute on a busy day.

What we built for that was a three-broker Kafka cluster running KRaft, a schema registry so the events stayed typed, Kafka Connect to land a copy in the warehouse, and a small library wrapping the consumer so every service handled offsets the same way. It worked. And it quietly became the thing we spent the most time operating, for the smallest part of the system. We had bought a streaming platform to send a postcard between three houses on the same street.

Kafka is a commit log, and you bought the whole log

Kafka is a distributed, partitioned, replicated commit log. That sentence is the whole pitch and the whole warning. It is extraordinary at keeping an ordered, durable record of a firehose of events, and at letting many independent consumers read it at their own pace, rewind, and replay history from the beginning.

Every one of those properties assumes you have a firehose. Partitions exist so throughput can scale past a single machine. Consumer groups exist so a fleet of workers can split a stream too big for one of them. Retention and replay exist so you can reprocess weeks of history. When your actual load is a few hundred messages a minute with two consumers that never replay anything, you are paying for a machine built to move an ocean so it can move a bucket.

The operational tax nobody priced

The sticker price of Kafka is the brokers. The real price is everything you learn about it at 2am. When a consumer group rebalances for the first time and processing stalls for thirty seconds, someone has to understand why. The first time lag climbs and nobody can tell whether a consumer is slow, stuck, or dead, someone has to learn the difference between those three states under pressure.

Then there is retention you set wrong and a topic that quietly drops messages older than a day. There is the partition count you picked early and cannot raise later without reshuffling keys. Ordering is guaranteed inside a partition and not across them, so the moment your design needs global order you are back to one partition and no parallelism. None of this is Kafka being bad. It is Kafka being a serious tool that bills you in attention whether or not your scale justifies the charge.
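
The partition-count trap comes down to one modulo. In the sketch below, CRC32 stands in for the hash (Kafka's default partitioner uses a different function), and the key format and partition counts are hypothetical. The mechanism is the same: the partition is the hash modulo the partition count, so changing the count changes where existing keys land.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing it, then taking the modulo.

    CRC32 is an assumption for this sketch; the modulo step is what matters.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"order-{i}" for i in range(1000)]

# Keys whose partition changes when the topic grows from 6 to 8 partitions.
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 8)]
```

Any key in moved now has older events in one partition and newer events in another, so the per-key ordering you were relying on no longer holds across the resize. With a uniform hash, roughly three quarters of keys move on a change from 6 to 8.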

What "we need events" usually means

When a team says it needs events, it usually wants three small things. It wants the producer to stop blocking on the consumer, so a slow billing run does not back up order placement. A failed handler should retry without dropping the message on the floor. And once in a while, someone wants to look back at what actually happened.

That is decoupling, durability, and the occasional audit. It is a real and worthy list. It is also a far smaller want than ordered, replayable, horizontally partitioned streaming, and almost any queue on earth satisfies it without a cluster anywhere in sight.

The smaller tools that actually fit

A managed queue is the boring answer and usually the correct one. SQS, Cloud Pub/Sub, or a hosted RabbitMQ gives you decoupling and retries with a dead-letter queue, and the operational surface is a config screen instead of a cluster you babysit. For handing work off reliably from inside a transaction, the outbox pattern does the job with a table and a poller, and I have a whole post on why that beats a two-phase commit. For genuinely low volume, a jobs table that a worker polls is not a sin. It is a queue sized to the problem in front of you.
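
For a sense of how little machinery the outbox pattern needs, here is a minimal sketch. SQLite stands in for the real database so the example runs anywhere. The table names, topic name, and publish callback are hypothetical.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER)")
db.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " topic TEXT NOT NULL,"
    " payload TEXT NOT NULL,"
    " published INTEGER NOT NULL DEFAULT 0)"
)


def place_order(order_id: int, total_cents: int) -> None:
    """Write the order and its event in one transaction: both or neither."""
    with db:  # commits on success, rolls back on exception
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id})),
        )


def drain_outbox(publish) -> int:
    """Poller: publish pending rows in order, marking each one when sent."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # if this raises, the row stays pending
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
place_order(1, 4999)
place_order(2, 1250)
drained = drain_outbox(lambda topic, payload: sent.append((topic, payload)))
```

Delivery here is at-least-once: if the publish succeeds and the process dies before the row is marked, the poller sends it again on the next pass. Consumers need to tolerate a duplicate.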

This is not "use your database as a stream"

I have to be careful here, because I have also argued that your database is not your message queue, and I still mean every word of it. Right-sizing down has a floor. The failure mode in that other post is a team treating a busy Postgres table as a high-throughput event bus, with dozens of workers running SELECT ... FOR UPDATE SKIP LOCKED in a hot loop and reinventing offsets badly.

The line is throughput and intent. A jobs table polled a few times a second by one or two workers is a queue doing honest work. That same table under a real event stream turns into a lock-contention machine and a worse Kafka than Kafka. Reach for the smaller tool, then stop reaching before you turn your database into the thing you were trying to avoid.

The throughput where Kafka starts to win

There is a real line, and past it Kafka stops being overkill and becomes the only sane option. When you are moving tens of thousands of messages a second, a queue's per-message bookkeeping falls over and the log's sequential design pulls ahead. Five or six independent teams reading the same stream at their own pace is another signal, because the commit-log model fits that shape in a way point-to-point queues never will. And if you have to replay a week of events to rebuild a projection, you want retention you can rewind into.

Fan-out, replay, and serious throughput are the signals worth watching. If two or more of them describe your system honestly, stand up the cluster and do not apologize for it.

The "but we might scale" defense

The usual objection is that adopting a queue now means a painful migration later, when the firehose finally arrives. In practice the opposite holds. Code that publishes to an interface does not care whether the other side is SQS or Kafka, and swapping the transport underneath is a week of work you do once you have the volume to justify it and the real data to size it correctly.
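
"Publishes to an interface" can be as small as this. The sketch below is one possible shape, with hypothetical names throughout. The in-memory transport stands in for whatever client would wrap the real queue.

```python
from typing import Protocol


class EventPublisher(Protocol):
    def publish(self, topic: str, payload: bytes) -> None: ...


class InMemoryPublisher:
    """Stand-in transport for the sketch; a real one would wrap a queue client."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, bytes]] = []

    def publish(self, topic: str, payload: bytes) -> None:
        self.sent.append((topic, payload))


class OrderService:
    """Depends only on the interface, never on a specific broker."""

    def __init__(self, publisher: EventPublisher) -> None:
        self._publisher = publisher

    def place_order(self, order_id: int) -> None:
        self._publisher.publish("order.placed", str(order_id).encode("utf-8"))


transport = InMemoryPublisher()
OrderService(transport).place_order(42)
```

Swapping the transport later means writing one more class with the same publish method and changing the line that constructs it. OrderService does not change.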

Building Kafka first is paying that migration cost up front, every single day, for a scale you may never reach. You carry the operating burden for years to dodge a week of work that might never come due.

What I actually reach for

My default for service-to-service events is a managed queue, with the outbox pattern when the handoff has to be transactional. I keep that setup until the numbers force a change: tens of thousands of messages a second, several independent consumers, or a genuine need to replay history. At that point Kafka earns its keep and the operational tax becomes a fair price for what it buys.

Three topics and a few hundred messages a minute is not that day. It is a postcard. Send it with a stamp, not a shipping container, and put the months you saved straight back into the product.

REST vs GraphQL Is Over. You're Now Running Both, Badly.

Mon, 01 Jun 2026 12:00:00 GMT

The debate we declared over

We ended the REST versus GraphQL argument in a single meeting. GraphQL would sit in front of the web and mobile apps, where clients wanted to ask for exactly the fields they needed. REST would stay where it already worked: service to service, the partner-facing API, the webhooks. Everyone nodded. The debate was resolved. We were going to use both, like grown-ups.

Six months later the biggest item on the board was a gateway nobody wanted to own, our CDN hit rate had quietly collapsed, and the mobile team was filing bugs the backend team could not reproduce because the two halves disagreed about what an error even was. The debate was not resolved. We had agreed to have it twice, forever.

"Use both" is two contracts, not a truce

"Use both" sounds like maturity. What it means in practice is that every engineer now holds two API models in their head and switches between them depending on which corner of the system they are in. There are two ways to describe a resource, two sets of pagination conventions, two doc sites, two client libraries, two ways a request can be malformed.

New hires learn both before they are productive. A change that touches the seam touches both. The cost is not in any single line of code. It is the constant tax of context-switching between two philosophies that disagree about where the smarts belong.

The caching you quietly gave up

REST gets HTTP caching for free, and most teams forget how much they were leaning on it. A GET has a URL, and a URL is a cache key. Your CDN, the browser, a reverse proxy, and an ETag all cooperate to keep load off the origin without anyone writing caching code.
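
Here is roughly what the origin's share of that cooperation looks like, as a sketch. The function names, the response body, and the 60-second max-age are assumptions. It also simplifies the real header rules: If-None-Match can carry several tags and weak validators, and this compares a single value.

```python
import hashlib
from typing import Optional


def etag_for(body: bytes) -> str:
    """Derive an ETag from the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> tuple[int, dict[str, str], bytes]:
    """Return (status, headers, body) for a GET, honoring If-None-Match."""
    tag = etag_for(body)
    headers = {"ETag": tag, "Cache-Control": "public, max-age=60"}
    if if_none_match == tag:
        return 304, headers, b""  # the client's copy is current; send no body
    return 200, headers, body


body = b'{"plan": "pro", "price_cents": 4900}'
first_status, first_headers, _ = respond(body, None)
second_status, _, second_body = respond(body, first_headers["ETag"])
```

The first request gets a 200 with the body. The revalidation gets a 304 with nothing attached. Every cache between the client and the origin understands both answers without any custom code.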

GraphQL usually sends a POST with the query in the body, and none of that machinery caches a POST by default. The shared field that REST served from the edge a million times a day now hits your resolvers a million times a day. You can claw some of it back with persisted queries and a client cache, but you are rebuilding by hand what HTTP handed you in the protocol. I wrote a whole post on the Cache-Control header most people ignore. GraphQL ignores it for you.

Two error models in one client

REST signals failure with a status code. A 404 is missing, a 409 is a conflict, a 500 is your fault. GraphQL typically returns 200 OK with an errors array in the body, because at the transport level the query arrived fine. Each one is defensible. Living with both inside a single client is the problem.

The app now checks the HTTP status for the REST calls and parses a body-level errors array for the GraphQL calls, and a partial GraphQL response can be half data and half error at the same time. I have a separate post about the endpoint that always returns 200 and why it hides failures. GraphQL makes that the default and asks you to be fine with it.
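
A client that talks to both ends up writing something like the adapter below. It is a sketch with hypothetical names, showing one way to fold the two error models into a single result type so the rest of the app only checks one thing.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ApiResult:
    data: Optional[Any]
    errors: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.errors


def from_rest(status: int, body: Any) -> ApiResult:
    """REST: the status code carries the outcome."""
    if 200 <= status < 300:
        return ApiResult(data=body)
    return ApiResult(data=None, errors=[f"HTTP {status}"])


def from_graphql(status: int, body: dict) -> ApiResult:
    """GraphQL: a 200 can still carry errors, sometimes alongside data."""
    if status != 200:
        return ApiResult(data=None, errors=[f"HTTP {status}"])
    messages = [e.get("message", "unknown error") for e in body.get("errors") or []]
    return ApiResult(data=body.get("data"), errors=messages)


rest_missing = from_rest(404, None)
partial = from_graphql(
    200,
    {
        "data": {"user": {"name": "Ada"}, "orders": None},
        "errors": [{"message": "orders service timed out", "path": ["orders"]}],
    },
)
```

The partial response is the awkward case: it has usable data and an error at once, and the caller has to decide whether half a screen is better than none. The adapter only makes that decision visible.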

Rate limiting stops being per-endpoint

With REST you can put a limiter in front of each route, because each route does roughly one bounded thing. GET /orders costs about the same every time someone calls it. You count requests and you are done.

One GraphQL endpoint accepts a query that asks for a single user, and the next query asks for every user, their orders, and each order's line items three levels deep. Same URL, wildly different cost. Counting requests protects nothing. You end up writing query-cost analysis, depth limits, and complexity budgets, which is a small rules engine living in front of your data, and it is now load-bearing.
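
A toy version of that rules engine fits in a few lines. To keep the sketch self-contained, a query is represented as nested dicts and not as parsed GraphQL. The page size of 50, the depth limit, and the budget are all made-up numbers. A real implementation would read list fields and their sizes from the schema.

```python
Selection = dict  # field name -> nested Selection, or None for a leaf field


def depth(selection: Selection) -> int:
    """Number of nesting levels in a selection set."""
    nested = [depth(sub) for sub in selection.values() if sub]
    return 1 + max(nested, default=0)


def cost(selection: Selection, list_size: int = 50) -> int:
    """Crude estimate: one unit per field, with each nested selection
    multiplied by an assumed page size for that level of fan-out."""
    total = 0
    for sub in selection.values():
        total += 1
        if sub:
            total += list_size * cost(sub, list_size)
    return total


def admit(selection: Selection, max_depth: int = 3, budget: int = 10_000) -> bool:
    """Accept a query only if it stays within both limits."""
    return depth(selection) <= max_depth and cost(selection) <= budget


one_user = {"user": {"id": None, "name": None}}
everything = {
    "users": {
        "id": None,
        "orders": {"id": None, "lineItems": {"sku": None, "qty": None}},
    }
}
```

Both queries arrive at the same URL. The first is admitted. The second is rejected on depth and would also fail on cost, which is the difference a request counter cannot see.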

The BFF that became a third backend

To stitch the two worlds together, someone stands up a backend-for-frontend. At first it just forwards calls. Then it shapes a payload so the mobile client gets something friendlier. A caching layer shows up. It starts owning a slice of authorization. A product rule lands there because it was the convenient place that afternoon.

A year on, the BFF is a third service with its own deploys, its own on-call, and business logic that lives in no design doc. It was supposed to be glue. It became a backend that happens to speak both protocols, and it is the scariest thing to change in the whole system.

N+1 moved, it did not leave

People adopt GraphQL partly to escape over-fetching. The trap is that the classic N+1 query problem does not vanish. It relocates into your resolvers. A query for 50 orders fans out into 50 separate lookups for each order's customer, one resolver call at a time, against the very REST service or database you were trying to be gentle with.

The fix is DataLoader-style batching, where you collect the keys from a tick of resolution and issue one batched call. It works. It is also a real piece of machinery you have to understand, configure, and debug. That is the price of the abstraction, not a bonus feature.
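
The difference is easiest to see by counting round trips. This sketch shows only the core idea with hypothetical in-memory data. A real DataLoader also gathers the keys automatically during a tick of resolution and caches results per request.

```python
CUSTOMERS = {i: {"id": i, "name": f"customer-{i}"} for i in range(1, 11)}
calls = {"single": 0, "batch": 0}


def fetch_customer(customer_id: int) -> dict:
    calls["single"] += 1  # one round trip per call
    return CUSTOMERS[customer_id]


def fetch_customers(customer_ids: list[int]) -> dict[int, dict]:
    calls["batch"] += 1  # one round trip for the whole set
    return {cid: CUSTOMERS[cid] for cid in customer_ids}


orders = [{"id": n, "customer_id": 1 + n % 10} for n in range(50)]

# N+1: each order's resolver looks up its own customer.
naive = [fetch_customer(o["customer_id"]) for o in orders]

# Batched: collect the distinct keys first, fetch once, then fan the results out.
keys = sorted({o["customer_id"] for o in orders})
by_id = fetch_customers(keys)
batched = [by_id[o["customer_id"]] for o in orders]
```

Both paths return the same customers for the same 50 orders. The naive path makes 50 lookups and the batched path makes one.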

Where each actually earns its place

GraphQL earns its keep at a genuine aggregation boundary: a client that drives wildly different screens from one round trip, pulling from several services, where letting the client name its fields kills dozens of bespoke endpoints. That is a real problem, and GraphQL is a real answer to it.

REST earns its keep almost everywhere else. Resources with stable shapes. Anything that benefits from HTTP caching. Public APIs outsiders have to learn quickly, and the service-to-service calls inside your own walls where a typed contract beats a flexible one. Picking by fit instead of fashion is the whole game.

What I actually reach for

My default is REST, because the protocol does caching, status codes, and conditional requests for me and I do not have to rebuild any of it. I add GraphQL in exactly one place, when there is a real client-aggregation boundary that hurts without it, and I put one team in charge of that boundary so the seam has an owner.

What I do not do anymore is declare the debate over and walk both paths at once without counting the cost. Both is a valid choice. It is just not a free one, and the bill arrives at the seam, six months later, with no name on it.

DuckDB Ate Your Analytics Pipeline and That's Fine

Sun, 31 May 2026 12:00:00 GMT

The cluster we almost provisioned

A nightly job was getting slow. It read about 35 gigabytes of event Parquet from object storage, joined it against a customer table, and rolled everything up by account and day before writing the result back for the dashboards. On a single Python worker with pandas it took 50 minutes, and it fell over roughly once a week when a big day pushed it past the box's memory.

So someone put "move analytics to Spark" on the roadmap. We sized an EMR cluster. We talked about who would own it, how we would test it, what the Airflow operator looked like. Two engineers were about to spend a quarter building a distributed data platform for a job that processes less data than my laptop's SSD holds.

I asked for a week to try something dumber first. I pointed DuckDB at the same Parquet files. The job finished in 90 seconds on the box we already had.

You probably don't have big data

The phrase "big data" got fixed in everyone's head around 2012, when a single server had maybe 64 GB of RAM and spinning disks. A 35 GB join genuinely was a cluster problem then. You needed many machines because no single machine could hold the working set.

That constraint quietly died. A current cloud box will give you 256 GB of RAM and several gigabytes per second of NVMe bandwidth without anyone signing off on a capital expense. Datasets grew too. For most companies, though, the analytical working set never kept pace with what one machine can now hold. Your fact table is tens of gigabytes. Your dimension tables are smaller. The query that feels heavy is heavy because the tool is wasteful, not because the data is large.

Run the numbers on your own warehouse some time. Pull the row counts and average widths of the tables your dashboards actually touch. A surprising amount of big-data infrastructure exists to shuffle a few tens of gigabytes around a network that a single CPU could have scanned in the time it took to schedule the first task.
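A rough way to run those numbers, as a sketch: multiply row count by average row width for each table the dashboards touch, and compare the total with what one machine holds. The table names, counts, and widths below are made-up placeholders, not figures from our warehouse.

```python
# Back-of-envelope working-set estimate. All table stats are hypothetical
# placeholders; substitute the numbers from your own catalog.
GIB = 1024 ** 3

def working_set_gib(tables):
    """tables: iterable of (name, row_count, avg_row_bytes)."""
    total_bytes = sum(rows * width for _, rows, width in tables)
    return total_bytes / GIB

tables = [
    ("events", 400_000_000, 80),     # fact table
    ("customers", 2_000_000, 200),   # dimension table
]

size = working_set_gib(tables)
print(f"{size:.1f} GiB uncompressed, fits in 256 GiB of RAM: {size < 256}")
```

This counts uncompressed bytes, so it overstates what a columnar engine actually reads for a query that touches only a few columns.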

Why one process is suddenly enough

DuckDB is an analytical database that runs inside your process. No server, no cluster, no JVM warming up. You import it the way you import a JSON parser, and it executes SQL against files on disk or in object storage.

Two design choices make it fast. It stores and processes data by column instead of by row, so a query that touches three of forty columns reads only those three. Vectorized execution does the rest, pushing thousands of values through tight loops that keep the CPU cache and SIMD units busy instead of interpreting one row at a time. Those are the same tricks the expensive warehouses use. DuckDB does them on the machine you are already paying for.

It reads Parquet, CSV, and JSON directly, and it can query a live Postgres table over the wire. There is no load step. You point a SELECT at the file and the file is the table.

-- Query 35 GB of Parquet in object storage, no load, no cluster
SELECT account_id,
       date_trunc('day', event_ts) AS day,
       count(*)    AS events,
       sum(amount) AS revenue
FROM read_parquet('s3://events/2026/*/*.parquet')
GROUP BY 1, 2;

The pipeline it replaced

The old job was a pandas script. It loaded every Parquet file into one giant in-memory frame, merged that against a customer export, and grouped. Most of its runtime went on holding the whole dataset in memory at once and on Python's per-row overhead. When a day's data spiked, the load blew past the memory limit and the pod got killed.

The replacement is a SQL file. DuckDB streams the Parquet from object storage, joins it against the customer table I pull from the Postgres replica, and writes an aggregated Parquet result back. It spills to disk on its own when a step does not fit in memory, so the job that used to die on big days now just runs a little slower instead.

-- customers from the Postgres replica, events from object storage,
-- joined in one query on one node
ATTACH 'postgres://reader@replica/app' AS pg (TYPE postgres, READ_ONLY);

COPY (
  SELECT e.account_id, c.plan,
         date_trunc('day', e.event_ts) AS day,
         count(*) AS events, sum(e.amount) AS revenue
  FROM read_parquet('s3://events/2026/*/*.parquet') e
  JOIN pg.public.customers c USING (account_id)
  GROUP BY 1, 2, 3
) TO 's3://rollups/daily.parquet' (FORMAT parquet);

The numbers that ended the debate

I ran the old script, the DuckDB version, and a small Spark cluster against the same 35 GB so the comparison was honest. The pandas job took 50 minutes when it survived. A three-node Spark cluster managed about four minutes once the executors spun up, and keeping it warm cost real money. DuckDB on the existing 16-core box finished in 90 seconds.

The Spark number is not an insult to Spark. It is a tax. A good part of those four minutes was scheduling, JVM startup, and shuffling data across a network that a single machine never had to touch. For 35 GB, the coordination cost more than the work did.
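Dividing the data size by each wall-clock time makes the gap concrete. This is only arithmetic on the timings quoted above, using decimal units (1 GB = 1000 MB) and treating "about four minutes" as 240 seconds.

```python
# Effective scan throughput implied by the three timings quoted above.
DATA_GB = 35

def throughput_mb_per_s(seconds, data_gb=DATA_GB):
    return data_gb * 1000 / seconds

runs = {"pandas": 50 * 60, "spark (3 nodes)": 4 * 60, "duckdb": 90}
for name, seconds in runs.items():
    print(f"{name:>16}: {throughput_mb_per_s(seconds):7.1f} MB/s")
```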

Where the single node loses

This stops being true at some point, and pretending otherwise would be the same mistake pointed the other way. When the data you must scan in one query runs genuinely into the terabytes, you want many disks and many CPUs reading in parallel, and a cluster earns its keep. The crossover sits higher than most people guess, often well past where they reach for Spark, but it is real.

A single box also gives you a single failure domain. If the machine dies mid-job, the job dies with it. For a nightly batch that just reruns, who cares. For something that must make continuous progress across hours, the retry story a distributed engine gives you starts to matter.

DuckDB is not a warehouse

The trap is to fall in love with the 90 seconds and try to make DuckDB the shared serving layer for the whole company. It does not want that job. One process writes to a DuckDB file at a time. It works best as an embedded engine sitting close to a single workload. A multi-tenant server fielding a hundred concurrent analysts is a different tool.

Keep it where it shines: batch transforms, a local copy of production analytics on a laptop, an embedded query engine inside an app, the heavy lifting behind one dashboard refresh. When you need many writers and many concurrent readers against shared state, that is what an actual warehouse is for. MotherDuck exists if you want the DuckDB engine with a managed server in front of it.

When you should still reach for the cluster

Provision the distributed thing when the facts call for it. Terabyte-scale scans per query qualify. So does a real need for many machines writing concurrently, or a streaming workload that can never pause to rerun. If three separate teams hammer the same tables all day with unpredictable queries, a shared warehouse with its own resource governor will save you grief.

None of those describe a 35 GB nightly rollup. Most pipelines I see are that rollup, dressed up in cluster clothes because the reference architecture diagram showed a cluster.

What I actually run now

My rule is a size test. Under a few hundred gigabytes per query, I start with DuckDB on the biggest single box that makes sense, and I do not apologize for it. The pipeline is a SQL file in version control, it runs in a plain container on a schedule, and there is no cluster to patch, no executors to size, no Airflow operator that only one person understands.

The Spark conversation is still sitting on a roadmap somewhere. We never had it. That 35 GB job has run every night for months in under two minutes, on a box that was already there, and the quarter we would have spent building a data platform went into the product instead.

The RAG Pipeline That Confidently Made Things Up

Sat, 30 May 2026 12:00:00 GMT

The answer was confident, fluent, and made up

A user asked our internal docs assistant a simple question: what is the refund window on annual plans? It answered without hesitation. "Annual plans can be refunded within 30 days, and the cancellation takes effect at the end of the billing period." It cited two source documents. It read exactly like the rest of our help center.

There is no 30-day refund on annual plans. There never was. The real policy is a 14-day refund on monthly plans, and annual plans are non-refundable but cancellable. The model had taken the "30 days" from a document about the cancellation window, the word "refund" from a document about monthly billing, and welded them into a policy that did not exist. Support honored that invented policy for two weeks before someone in finance noticed the refunds going out.

Nobody caught it sooner because the answer had every marker of being correct. It was fluent. It cited sources. It matched the tone of the real docs. Retrieval worked. Generation worked. The system still lied.

Retrieval succeeded. Grounding failed.

When I pulled the trace, the retriever had done its job by every metric we were watching. The top chunks came back with high cosine similarity. The cancellation-window chunk and the monthly-refund chunk both scored near the top, because "refund," "cancel," "window," and "billing period" all sit close together in embedding space.

That is the trap. Similarity measures whether two pieces of text are about the same topic. It does not measure whether a piece of text answers the question. Those are different questions, and a retriever only knows how to answer the first one. So it handed the model two passages that were on-topic and individually true, sitting next to each other, and the model did precisely what we asked of it. It summarized the context faithfully. The context was two half-truths.

Chunking is where the truth gets cut in half

We chunked every document at a fixed 512 tokens with no regard for structure. The code was as blunt as it sounds.

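def chunk(text, size=512, overlap=0):
    # Fixed-size windows over the token stream; no awareness of tables or headings.
    tokens = tokenizer.encode(text)
    step = size - overlap  # with overlap=0 the windows simply abut
    return [
        tokenizer.decode(tokens[i:i + size])
        for i in range(0, len(tokens), step)
    ]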

The refund policy lived in a small table. Its header row said "Refund policy." Row one was monthly, 14 days. Row two was annual, none. That 512-token boundary fell between the two body rows, so the chunk the retriever scored highest carried the header, the word "refund," and "14 days," but not the row that said annual plans get nothing. A model handed half a table will cheerfully infer the rest.

The fix is to chunk on structure instead of token count. Keep atomic units whole: a table is one chunk, a list item keeps its heading, a rule never gets split from its exception. Add overlap so anything straddling a boundary survives in both neighbors.

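def chunk_structured(blocks, max_size=512, overlap=64):
    # block = one heading, paragraph, table, or list item (a string)
    # token_len(text) -> int is assumed to wrap the tokenizer
    chunks, current, current_len = [], [], 0
    for block in blocks:
        block_len = token_len(block)
        if current and current_len + block_len > max_size:
            chunks.append("\n\n".join(current))
            # carry whole trailing blocks, up to `overlap` tokens, into the next chunk
            tail, tail_len = [], 0
            for prev in reversed(current):
                if tail_len + token_len(prev) > overlap:
                    break
                tail.insert(0, prev)
                tail_len += token_len(prev)
            current, current_len = tail, tail_len
        current.append(block)
        current_len += block_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks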

This is not flawless. Long tables still need a size cap, and some documents have no clean structure to chunk on. But it stopped cutting tables in half, and that single change removed a whole category of "the model only saw part of the rule" failures.

You tuned recall when you needed precision

My first instinct when retrieval feels wrong is to grab more of it. We pushed k from 4 to 12. Recall climbed, and so did the hallucinations. More chunks means more chances that one of them is topically close but factually irrelevant, and the model treats every chunk in its context as fair game. At k=4 it fused two documents. At k=12 it had six passages that disagreed with each other and confidently blended the most fluent ones.

Here is the counterintuitive part. For grounded question answering, precision matters more than recall. Three exactly-right chunks beat the right chunk buried under eleven plausible distractors. So we went the other way. We dropped k back to 4, added a reranker (a cross-encoder that actually reads the query against each candidate instead of comparing two embeddings), and discarded anything under a relevance threshold. Fewer chunks, higher quality, fewer invented policies.
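The shape of that rerank-then-threshold step, as a minimal sketch. The scorer is passed in as a function so a real cross-encoder can replace it; toy_score below is a word-overlap stand-in I made up for the demo, and the 0.5 threshold is an arbitrary placeholder rather than the value we shipped.

```python
from typing import Callable, List, Tuple

def rerank_and_filter(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    k: int = 4,
    min_score: float = 0.5,
) -> List[Tuple[float, str]]:
    """Score every candidate against the query, drop weak ones, keep the best k."""
    scored = [(score(query, text), text) for text in candidates]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]

def toy_score(query: str, text: str) -> float:
    # Stand-in for a cross-encoder: fraction of query words present in the chunk.
    words = set(query.lower().split())
    if not words:
        return 0.0
    return len(words & set(text.lower().split())) / len(words)

chunks = [
    "the annual plan is not refundable but is cancellable",
    "the monthly plan is refundable within 14 days",
    "office hours are nine to five",
]
top = rerank_and_filter("is the annual plan refundable", chunks, toy_score)
```

The off-topic chunk never reaches the model, because it falls below the threshold regardless of how much room is left under k.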

None of this was real until we wrote the eval set

Now the uncomfortable admission. Every fix I just described, the structural chunking, the reranker, the smaller k, I have been narrating as if we knew each one helped. We did not. Not at first. We were tuning by feel. Ship a change, eyeball five queries, declare victory. The chunking fix might have solved the refund bug and quietly broken twenty other answers, and we would have had no idea.

The change that mattered more than any retrieval tweak was a labeled eval set. A hundred real questions, each paired with the answer we expected and the source chunk that answer should come from. From that you get two numbers. Retrieval hit-rate: did the correct chunk land in the top-k? Faithfulness: is the generated answer actually supported by the retrieved context, graded by a second model or a human?

The hit-rate half is a few lines over pgvector.

def retrieval_hit_rate(eval_set, k=4):
    hits = 0
    for case in eval_set:
        q_emb = embed(case["question"])
        rows = db.execute(
            "SELECT chunk_id FROM chunks ORDER BY embedding <=> %s LIMIT %s",
            (q_emb, k),
        )
        retrieved = {r["chunk_id"] for r in rows}
        if case["gold_chunk_id"] in retrieved:
            hits += 1
    return hits / len(eval_set)

pgvector's <=> operator is cosine distance, so smaller is closer. Faithfulness is the harder score and the one that catches this exact bug: take the generated answer and the retrieved chunks, and ask a grader whether every claim in the answer is supported by those chunks. A confident, wrong, well-cited answer fails faithfulness even when hit-rate looks healthy. That is the refund bug, caught by a number.
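The faithfulness half can be sketched the same way. The grader is injected as a function, because in practice it is a second model or a human; naive_grader here is a substring check used only so the example runs, and the case layout ("claims", "retrieved") is an assumed shape, not a standard one.

```python
from typing import Callable, Dict, List

def faithfulness(
    cases: List[Dict],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    """Fraction of answers whose every claim is supported by the retrieved chunks."""
    if not cases:
        return 0.0
    faithful = sum(
        all(is_supported(claim, case["retrieved"]) for claim in case["claims"])
        for case in cases
    )
    return faithful / len(cases)

def naive_grader(claim: str, chunks: List[str]) -> bool:
    # Placeholder only: a real grader judges entailment, not substring match.
    return any(claim.lower() in chunk.lower() for chunk in chunks)

retrieved = ["Cancellation window: 30 days.", "Monthly plans: refund within 14 days."]
cases = [
    {"claims": ["annual plans can be refunded within 30 days"], "retrieved": retrieved},
    {"claims": ["monthly plans: refund within 14 days"], "retrieved": retrieved},
]
score = faithfulness(cases, naive_grader)
```

The first case is the refund bug in miniature: both chunks were retrieved, so hit-rate is fine, yet no chunk supports the claim, so the answer fails.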

Once those two numbers existed, the work stopped being guesswork. The reranker moved hit-rate from 0.71 to 0.89. Dropping k from 12 to 4 raised faithfulness and left hit-rate flat. Every change became falsifiable. It either moved a number or it did not, and the ones that did not, we threw away instead of shipping on a hunch.

RAG relocates hallucination. It does not remove it.

The pitch for retrieval-augmented generation is that grounding the model in your own documents stops it from making things up. What actually happens is quieter. The hallucination moves. It travels from "the model invents a fact out of nothing" to "the model faithfully summarizes the wrong context you handed it." The second kind is more dangerous, because it arrives with citations and reads like the truth.

Your model is only as honest as your retrieval. Your retrieval is only as good as the eval set you keep finding reasons not to write. So write it. It is the one piece of this whole pipeline that tells you whether anything else you did was real.

Server-Sent Events Are Back. You Should Use Them.

Fri, 29 May 2026 12:00:00 GMT

Why SSE is back

Server-Sent Events shipped in HTML5 in 2011. They went almost completely ignored for a decade. WebSocket got all the attention. Every "real-time" tutorial used WebSocket. Every job ad asked about WebSocket. SSE was the forgotten cousin.

Then LLM streaming happened. The output-as-it-generates UX that ChatGPT made canonical needs a one-way stream from server to client. WebSocket can do it, but SSE is the simpler tool for that exact shape of problem. Every major AI provider's streaming endpoint ended up being SSE. OpenAI. Anthropic. Google. Mistral. All SSE.

Now every team shipping an AI feature has SSE in their stack whether they know it or not. The protocol is having its second moment. This is the version of the article I would have wanted before our team rolled their own.

What SSE actually is

SSE is plain HTTP. The client opens a long-lived GET request. The server answers with Content-Type: text/event-stream, keeps the connection open, and writes events as they happen. Each event is one or more lines of UTF-8 text ended by a blank line, in a simple format: data: hello\n\n. The client gets each event via the browser's EventSource API.

That is the whole protocol. There is no framing, no binary, no negotiation, no upgrade handshake. The wire format is the same plain text you would get from a normal HTTP response, except the body never ends.
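To make the wire format concrete, here is a small serializer sketch. The function name and parameters are my own; the field names (event, id, retry, data) and the blank-line terminator come from the event-stream format itself.

```python
from typing import Optional

def format_sse(data: str, event: Optional[str] = None,
               event_id: Optional[str] = None,
               retry_ms: Optional[int] = None) -> str:
    """Serialize one Server-Sent Event as text/event-stream wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")
    # A payload containing newlines becomes one "data:" line per line of text.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # The blank line (double newline) is what ends the event.
    return "\n".join(lines) + "\n\n"
```

format_sse("hello") produces exactly the data: hello\n\n shown above.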

The simplicity is the value. SSE goes through every HTTP intermediary that exists: load balancers, proxies, CDNs, browser dev tools, curl. No special config. Anything that speaks HTTP speaks SSE.

SSE vs WebSocket

WebSocket is bidirectional and carries both text and binary frames. SSE is server-to-client only and text. Those two differences cover most of the comparison.

Where WebSocket wins: bidirectional traffic. Multiplayer games where every client sends input continuously. Collaborative editors where both sides type. Chat apps where users send messages over the same connection. WebSocket is the right tool when traffic flows both directions on the same channel.

Where SSE wins: server-to-client streams. LLM token streaming. Live dashboards. Notifications. Stock tickers. Server-rendered logs. Any case where the client kicks off the request and the server pushes results back, you reach for SSE. Sending data from client to server stays a regular POST.

The honest tradeoff is that WebSocket can do anything SSE can do. WebSocket comes with more moving parts: a custom protocol, a framing layer, ping/pong logic, and a reconnection scheme you have to write. For a one-way stream, that is a lot of code for no functional gain.

The HTTP/1.1 connection trap

Browsers limit HTTP/1.1 to 6 concurrent connections per origin. An SSE stream eats one of those six for the lifetime of the stream. If you open three SSE streams (a dashboard with three live widgets), the user has three connections left for normal browsing. Open six, the rest of the page stops loading.

This was the historical reason SSE got skipped for WebSocket. WebSocket runs over a single TCP connection that does not count against the HTTP limit.

HTTP/2 fixes this. Under HTTP/2 (and HTTP/3), all requests to the same origin multiplex over one connection. The 6-connection limit goes away, replaced by a negotiated cap on concurrent streams that is typically around 100. SSE streams can scale to dozens per page without hurting the rest of the experience.

If your site is served over HTTP/2 (almost every site behind Cloudflare, Vercel, Netlify, or any modern CDN), the connection trap is no longer a concern. If you are still on HTTP/1.1, it is the most important thing to know.

Server-side gotchas

SSE is simple. Implementing SSE is full of small mistakes that cost you a day to debug. The list, from most common to least:

Flushing. The framework you use will buffer output by default. Express, Spring, FastAPI, Django, Rails, all of them. You have to call the explicit flush method after each event or the client sees nothing until the connection closes. Look for res.flush(), response.flushBuffer(), or the equivalent.
Proxy buffering. Nginx (by default) buffers proxied responses. Your server flushes, Nginx holds the bytes. Set X-Accel-Buffering: no in the response headers to tell Nginx to pass through. Cloudflare and similar CDNs have similar settings that need explicit opt-out.
Idle timeouts. Load balancers, reverse proxies, and serverless functions all have an idle-connection timeout. AWS ALB defaults to 60 seconds. If your stream goes quiet for longer than that, the connection drops. Send a comment line (a line starting with :) every 30 seconds as a keepalive. It is ignored by the client but counts as traffic to the proxy.
Last-Event-ID. The client sends Last-Event-ID on reconnect. The server should honor it and resume from the next event. If you cannot resume, at least acknowledge the header in your design. Otherwise reconnections lose events.
CORS. Cross-origin SSE needs the same CORS headers as any other endpoint. The EventSource API does not let you set custom headers, so there is no Authorization header: a bearer token has to travel in the URL, or you authenticate with cookies. For cross-origin cookies, the spec allows withCredentials: true in the options passed to the EventSource constructor.

Client side: EventSource and the reconnection story

The browser's EventSource API does most of the heavy lifting:

const es = new EventSource('/stream');
es.onmessage = (e) => console.log(e.data);
es.onerror = (e) => console.log('disconnected, will retry');

The retry is automatic. If the connection drops, the browser waits a few seconds (the default is browser-defined, commonly about 3) and reopens. If the server has been tagging events with id: lines, the reopened request includes the Last-Event-ID header so the server can resume.

You can override the retry delay by sending a retry: line from the server. retry: 10000\n\n tells the browser to wait 10 seconds. This is useful when your server is taking a deliberate break (deploys, scheduled maintenance) and you do not want a thundering herd of reconnections at three-second intervals.
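On the server, honoring Last-Event-ID can be as small as the sketch below. It assumes events are kept in an in-memory log with integer ids that only increase, which is my simplification; a real service would read from whatever durable log it already has.

```python
from typing import List, Optional, Tuple

Event = Tuple[int, str]  # (id, payload); ids are assumed to increase monotonically

def events_to_send(log: List[Event], last_event_id: Optional[str]) -> List[Event]:
    """Return the events a reconnecting client has not seen yet."""
    if last_event_id is None:
        return list(log)          # first connection: send everything retained
    try:
        last_seen = int(last_event_id)
    except ValueError:
        return list(log)          # unparseable header: fall back to a full replay
    return [(eid, payload) for eid, payload in log if eid > last_seen]

def to_wire(events: List[Event]) -> str:
    # Each event carries its id so the browser can echo it back on reconnect.
    return "".join(f"id: {eid}\ndata: {payload}\n\n" for eid, payload in events)
```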

For Node, Python, Go, and JVM stacks, library support is fine. The Anthropic, OpenAI, and Mistral SDKs all use SSE-flavored clients under the hood; building a custom one is maybe 30 lines of code if you need to.

When SSE is the wrong choice

Three cases where you reach for something else:

Bidirectional traffic in the same channel. Use WebSocket. Multiplayer games, voice/video signaling, collaborative editing.
Binary payloads. SSE is text-only. You can base64-encode binary, but if your stream is mostly binary, WebSocket (or a custom HTTP/2 stream) is the right tool.
Sub-millisecond latency. SSE rides on HTTP, which has framing overhead. For ticker feeds or trading systems where you measure latency in microseconds, WebSocket and raw TCP are still ahead.

What I reach for now

For any "server pushes updates while client watches" UX, SSE is the default. LLM streaming, live logs, deployment progress, notifications, dashboards. The protocol fits the shape of the problem and stays out of the way of every HTTP tool you already use.

For anything bidirectional, WebSocket.

And the thing that always catches teams: turn off proxy buffering on day one. Send a keepalive every 30 seconds. Wire up Last-Event-ID resume on day two. Those three lines of operational hygiene are the difference between SSE that works once and SSE that works in production for a year.

Why Your Distributed Lock Doesn't Lock

Thu, 28 May 2026 12:00:00 GMT

The lock that wasn't

A team I worked with had a Redis distributed lock guarding a billing job. The job processed customer invoices, debited their Stripe accounts, and marked the invoice as paid. Only one worker should run at a time. They used SET NX with a 60-second TTL. Standard pattern from every Redis tutorial.

One Tuesday in March, a customer was charged twice for the same invoice. The team checked the logs. Two workers had held the lock at the same time. The Redis console confirmed only one lock had been set. Both workers had it. Both worked on the same invoice. Both billed the customer.

This was not a Redis bug. Redis behaved correctly. The lock fired correctly. What the lock does not provide is mutual exclusion in the way the team assumed. No distributed lock does.

What a lock is supposed to do

A mutex inside a single process guarantees mutual exclusion. When you hold the mutex, no one else holds it. The OS thread scheduler enforces that. If your code panics, the mutex is released. If your code holds the mutex for an hour, no one else gets in for an hour. The guarantee is hard.

A distributed lock tries to offer the same guarantee across machines. It cannot. The reason is that the process holding the lock can be paused, partitioned, or lied to without knowing it is paused, partitioned, or being lied to. By the time it resumes and tries to act on the lock, the lock may have expired and been handed to someone else. The process does not know. It still thinks it has the lock.

The GC pause that hands the lock to two processes

Here is the canonical failure, originally described by Martin Kleppmann.

Process A acquires a lock with a 60-second TTL. It reads the resource it is supposed to mutate. Then, before writing, A's JVM enters a stop-the-world GC pause. The pause lasts 90 seconds. Meanwhile, Redis (or whatever lock service) sees the lock has expired and starts handing it out again. Process B acquires the lock, reads the resource, writes a new value. A's GC pause ends. A does its write. Two writes. Two processes that both believed they held the lock.

Replace "JVM GC pause" with "OS scheduler pause," "VM live migration," "container paused during health check failure," or "kernel page fault hitting a slow disk." All produce the same outcome.
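The timeline is easy to replay with a toy lock and an explicit clock. LeaseLock is a stand-in I wrote for illustration, not how Redis implements expiry; the point is only that nothing in A's code path notices the lease has lapsed.

```python
class LeaseLock:
    """Toy TTL lock driven by an explicit clock value (seconds)."""
    def __init__(self, ttl: float):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, who: str, now: float) -> bool:
        if self.holder is None or now >= self.expires_at:
            self.holder, self.expires_at = who, now + self.ttl
            return True
        return False

lock = LeaseLock(ttl=60)
writes = []

assert lock.acquire("A", now=0)        # A takes the lock at t=0
# A pauses for 90 seconds here (GC, live migration, ...)
assert lock.acquire("B", now=70)       # the lease expired at t=60, so B gets it
writes.append(("B", 70))               # B writes while holding the lock
writes.append(("A", 90))               # A wakes up and writes, unaware it lost the lock

print(writes)
```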

Redlock does not save you

Redis's Redlock algorithm tries to harden the single-Redis case by acquiring the lock from a majority of independent Redis nodes. The pitch: even if one Redis dies, the lock is safe. That pitch is true for the failure mode it addresses (a single Redis crashing), and false for the failure mode that actually matters.

The Kleppmann critique, summarized: Redlock relies on bounded clock drift and bounded request latency to be safe. Neither assumption holds under real network conditions. The algorithm assumes time progresses the same way on every node. Time does not.

Antirez (Redis's author) wrote a thoughtful response. Read both threads if you want the full picture. The short version: Redlock is no worse than other distributed locks. None of them solve the underlying problem.

Fencing tokens: the part nobody implements

The fix Kleppmann proposes is fencing tokens. Each time a process acquires the lock, the lock service hands back a monotonically increasing number (a token). When the process performs the write, it includes the token. The resource being protected, not the lock, checks the token against the highest token it has seen. Older tokens are rejected.

Walk through the GC pause again with fencing tokens. Process A acquires the lock, gets token 42. It pauses. Meanwhile, Process B acquires the lock, gets token 43, performs the write, and the resource records "highest token seen: 43." When A wakes up and tries its write with token 42, the resource rejects it because 42 < 43.

Two things have to be true for this to work: the lock service has to issue increasing tokens, and the resource (the database, the file, the API) has to enforce the token. The second part is the one nobody implements. It requires the storage layer to know about the lock, which means the lock cannot be opaque to the resource.
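Both halves, sketched in memory. The class names are hypothetical, and the starting token of 42 is chosen only to match the walkthrough above. In a real system the check-and-write inside write() has to be atomic in the storage layer, for example a conditional update.

```python
import itertools

class LockService:
    """Hands out a strictly increasing token with every successful acquire."""
    def __init__(self):
        self._tokens = itertools.count(42)  # start value is arbitrary

    def acquire(self) -> int:
        return next(self._tokens)

class FencedResource:
    """Storage that refuses writes carrying a token older than one it has seen."""
    def __init__(self):
        self.value = None
        self.highest_token = -1

    def write(self, value, token: int) -> bool:
        if token < self.highest_token:
            return False          # stale holder: reject
        self.highest_token = token
        self.value = value
        return True

service, resource = LockService(), FencedResource()
token_a = service.acquire()                     # A gets 42, then pauses
token_b = service.acquire()                     # lease expired; B gets 43
accepted_b = resource.write("from B", token_b)
accepted_a = resource.write("from A", token_a)  # A wakes up with a stale token
```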

Most distributed lock implementations in the wild do not have fencing tokens. SET NX in Redis does not. A simple Postgres row lock does not. The thing you call "a distributed lock" in your codebase almost certainly does not. Which means it does not actually provide mutual exclusion.

Postgres advisory locks: better, with a caveat

Postgres advisory locks (pg_advisory_lock) are better than Redis SETNX for one reason: they are tied to a session. If the process holding the lock dies, the connection drops, and Postgres releases the lock. There is no TTL race, because there is no TTL. The lock lives as long as the connection lives.

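The caveat is that the lock is only as trustworthy as the connection. A GC pause does not close the connection, so Postgres keeps the lock held while your process is frozen, which is the safe outcome. The danger is anything that ends the session behind your back: a TCP timeout, a connection pooler recycling the session, a failover. The lock is released, another worker takes it, and your process wakes up still believing it holds it. Same race as before, just a different layer.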

For most use cases where the work is short and the process is well-behaved, Postgres advisory locks are the right answer. They are simple, transactional, and the failure modes are bounded. If you are reaching for Redis SETNX, reach for pg_advisory_xact_lock instead. It is the transaction-scoped variant of the same primitive: the lock is released automatically at commit or rollback, so there is no unlock call to forget.

ZooKeeper, etcd, Consul: less wrong, not right

Consensus-based systems (ZooKeeper, etcd, Consul) provide locks with better semantics than Redis. They handle leader election, session timeouts, and ordering correctly. Ephemeral nodes give you connection-tied locks similar to Postgres advisory locks.

What they still do not solve is the process-pause problem. A node that pauses during a GC, or gets partitioned and then heals, can still hold a stale lock that the system thinks it has handed off. Consensus does not fix this. The application has to.

What these systems give you is a set of primitives (session IDs, version numbers, watch counters) that you can use to build fencing yourself. Their disadvantage is that you have to know to do it.

Lock vs lease: same idea, different vocabulary

The honest framing is that distributed "locks" are leases. A lease is a time-bounded reservation. It expires. You do not own anything across the expiration boundary. If you act on data after your lease expires, you are doing so without the lease's protection. The naming "lock" implies the OS-style guarantee. The semantics are nothing like an OS lock.

If you internalize "lease, not lock," a lot of patterns fall out automatically. You do not assume the lease still holds when your code resumes from a long operation. You re-check. You build retry semantics that assume the lease may have lapsed. You write idempotent operations at the resource level.

The pattern that actually works

The shortest path to correctness in distributed-lock-heavy systems is to stop relying on the lock for correctness. Use the lock as a performance hint, not a safety boundary. Then make the resource itself the boundary.

For the billing-job war story at the top: the fix was a partial unique index in the database, unique on invoice_id where status = 'paid'. Writing 'paid' to an already-paid invoice fails outright, regardless of how many workers think they hold the lock. Workers can still race for the lock and occasionally waste cycles on work that gets rejected. The lock keeps the system from doing twice the work most of the time. But the constraint is what prevents the double-charge.

This is the pattern: a soft lock for performance, an idempotent resource for correctness. The soft lock can be Redis SETNX. Idempotent resources come in many shapes: a unique constraint, an upsert, a token check, a state-machine transition rule, or a transactional CAS. The lock alone is unsafe. The idempotent resource alone is safe but wasteful. Together they are correct and cheap.
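Here is the resource-side half as a runnable sketch. SQLite stands in for Postgres so the example is self-contained (both accept a unique index with a WHERE clause), and the table and index names are hypothetical rather than the team's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice_payments (
        invoice_id INTEGER NOT NULL,
        status     TEXT    NOT NULL,
        worker     TEXT    NOT NULL
    );
    -- At most one paid row per invoice, no matter who thinks they hold the lock.
    CREATE UNIQUE INDEX one_paid_per_invoice
        ON invoice_payments (invoice_id) WHERE status = 'paid';
""")

def mark_paid(conn: sqlite3.Connection, invoice_id: int, worker: str) -> bool:
    """Return True if this worker recorded the payment, False if it already existed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO invoice_payments (invoice_id, status, worker) "
                "VALUES (?, 'paid', ?)",
                (invoice_id, worker),
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = mark_paid(conn, 1001, "worker-a")
second = mark_paid(conn, 1001, "worker-b")   # the racing worker loses here
```

The losing worker gets a plain False back and can move on to the next invoice. No lock was consulted to reach that outcome.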

What I reach for now

By default, Postgres advisory locks. Free, transactional, no TTL race, the failure semantics match the database's failure semantics. If the work I am protecting touches Postgres anyway (which is usually true), I do not need a second system.

For coordination across services that do not share a database, etcd or Consul. The session-ephemeral primitive is the closest thing to a real lock you can get.

For Redis specifically: only when latency matters more than correctness, and only paired with an idempotent resource. Redis SETNX is fine as a performance optimization for jobs that are already safe to run twice. It is dangerous as a correctness boundary for jobs that are not.

And the final guidance, which is older than any of these systems: design the operation so it does not need a lock at all. Idempotent writes. Conditional updates. Compare-and-swap. If you can avoid the question "who holds the lock right now," you avoid every failure mode this post lists. That is the engineering work distributed systems actually reward.

The Day Our LLM Bill Hit $40k

Wed, 27 May 2026 12:00:00 GMT

The Slack message at 9:14 AM Monday

"Did anyone authorize a $39,847 charge on our Anthropic account this weekend?"

Three people looked at the message. Three people opened the Anthropic console. The number on the dashboard read $39,847.20 for the trailing 63 hours. Friday 6 PM to Monday 9 AM. Nobody had touched the system over the weekend. The team had been at a wedding.

That weekend ended up costing us about half of a junior engineer's annual salary. The reason was small, the lesson was big, and this post is the one I wish someone had written for me before we shipped.

What the dashboard showed

Hourly tokens, charted Friday through Monday, looked like a step function. Flat through Friday afternoon. A small bump at 6:17 PM. Then a nearly vertical climb from 7 PM Friday to 4 AM Saturday. Then a flat line at the API's per-minute rate limit ceiling for the next 53 hours.

The rate limit was the only thing keeping us from a six-figure bill.

Root cause: a retry loop, and three missing guardrails

Friday afternoon a dev had pushed a change to our document-processing pipeline. Documents enter, the agent reads, the agent decides what to do. One step in the pipeline could throw a validation error if the document was malformed. The previous version of the agent would catch the error and skip the document. The new version retried instead, without bounds.

The retry loop had no max attempts. The malformed document was permanent, so every retry produced the same error. Each retry was a fresh API call, fresh prompt, fresh response. The agent was a long-context one (Sonnet), and the prompt was 4,200 tokens. Each retry was about $0.02. Roughly ten per second, sustained, for 53 hours.

The retry loop was the bug. The reason it became a $40k bug was the three things we did not have:

A spend cap on the API key. Anthropic supports them. We had not set one.
An alert on hourly spend. Our cost dashboard polled daily.
A circuit breaker around the agent call. If the same agent threw the same error 100 times in 5 minutes, the call should stop. It did not.

One of those three would have ended the incident in minutes. We had none of them.
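
For reference, a minimal sketch of what a bounded retry could look like. PermanentError and call_with_bounded_retry are hypothetical names, and the attempt count and backoff values are assumptions for illustration, not the pipeline's real code.

```python
import time

class PermanentError(Exception):
    """Hypothetical: an error that retrying cannot fix, such as a malformed document."""

def call_with_bounded_retry(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); retry transient failures with exponential backoff, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except PermanentError:
            raise  # same input, same error: let the caller skip the document
        except Exception:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the error instead of looping
            sleep(base_delay * 2 ** (attempt - 1))
```

The two properties that matter are the hard ceiling on attempts and the separate path for errors that will never succeed. Either one alone would have stopped a permanent validation error from looping.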

How teams burn LLM money

After the weekend, I read the post-mortem write-ups from every team I could find that had a public LLM-cost incident. The categories that came up most:

Unbounded retries. Ours. The most common pattern.
Conversation context bloat. A chat agent that appends every previous message to the next call. By message 40 the context is 100k tokens. Per-call cost grows linearly with conversation length, so the cumulative cost of the conversation grows roughly quadratically.
Cached embedding miss. A team had a cache, the cache was supposed to dedupe embedding calls, the cache key was the wrong shape. Every page load regenerated all embeddings.
Recursive agent calls. Agent A calls Agent B. Agent B calls Agent A. The exit condition has a bug. The loop terminates when the model finally hallucinates the keyword "DONE."
Dev environment with the prod key. A test run iterates over 10,000 fixtures using a real API key. The fixtures were synthetic but the dollars were real.
Streaming that does not actually stream. The client appears to stream but the server pre-buffers the whole response. The user closes the page; the server keeps generating until completion. Tokens billed for output nobody saw.
No model routing. Every call goes to Opus or Sonnet, including the ones that could have been Haiku at one-fifteenth the cost. The team never measured per-route cost.
Forgot to use prompt caching. A long system prompt repeated on every call. With Anthropic's prompt cache, that gets billed at one-tenth the rate on cache hits. Without it, full price every time.

Each one of these is preventable. Together they account for the vast majority of LLM-cost incidents I have seen in the wild.
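
To put numbers on the context-bloat pattern, here is a small sketch. It assumes every turn adds the same number of tokens and every call resends the full history; the 2,500 tokens per turn is an assumption chosen so that message 40 lands at 100k tokens.

```python
def conversation_input_tokens(turns: int, tokens_per_turn: int) -> tuple[int, int]:
    """Return (input tokens on the last call, input tokens summed over all calls)
    when every call resends the whole conversation so far."""
    last_call = turns * tokens_per_turn
    # 1 + 2 + ... + turns history-lengths, each worth tokens_per_turn tokens.
    total = tokens_per_turn * turns * (turns + 1) // 2
    return last_call, total

last_call, total = conversation_input_tokens(turns=40, tokens_per_turn=2_500)
print(last_call, total)
```

Under those assumptions the fortieth call sends 100,000 input tokens, and the conversation as a whole has sent about 2 million. The single call looks affordable. The sum is what shows up on the bill.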

The five-line policy we should have had

None of the fixes are clever. All of them are boring. That is why I keep them on a checklist that gets applied to every new agent or LLM feature before it touches prod:

Spend cap on the API key. Set it to 2x your forecasted daily spend. Anthropic, OpenAI, and most providers offer this in the console. Five minutes.
Per-user rate limit at your gateway. Not at the provider, at your own layer. So one user (or one bot, or one runaway retry loop) cannot consume the team's entire budget.
Circuit breaker on repeated errors. If the same prompt produces the same error 50 times in 10 minutes, stop calling. Page someone. The fix is rarely "keep trying."
Hourly cost alert, not daily. An hourly check would have woken someone at 7:30 PM Friday. A daily check found the carnage Monday morning.
Per-feature cost attribution. Tag every API call with the feature that made it. You cannot fix what you cannot attribute. The team that had the recursive-agent bug found it in 20 minutes because their dashboard showed "Feature: scheduler" at 800% of forecast.

That is the entire policy. It is short on purpose. The reason most LLM-cost incidents happen is not that this list is wrong. The list simply does not exist in the team.
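
The circuit breaker is the item people most often ask to see, so here is a minimal sketch. The class name, the error-signature string, and the sliding-window bookkeeping are assumptions for illustration; the defaults mirror the "50 times in 10 minutes" rule from the list.

```python
import time
from collections import deque

class ErrorCircuitBreaker:
    """Opens after `threshold` errors with the same signature inside `window_s` seconds."""

    def __init__(self, threshold=50, window_s=600, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.clock = clock
        self.errors = {}    # signature -> deque of timestamps
        self.open = False

    def record_error(self, signature: str) -> None:
        now = self.clock()
        stamps = self.errors.setdefault(signature, deque())
        stamps.append(now)
        # Drop errors that have fallen out of the sliding window.
        while stamps and now - stamps[0] > self.window_s:
            stamps.popleft()
        if len(stamps) >= self.threshold:
            self.open = True  # a real system would also page someone here

    def allow_call(self) -> bool:
        return not self.open
```

The caller checks allow_call() before every agent call and records the error signature on every failure. Once the breaker opens it stays open until a human resets it, which is the intended behavior when the fix is rarely "keep trying."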

What happened after

Anthropic forgave the bill. They were good about it. I do not assume they will be good about it twice, and I do not want to find out. The bill becoming a non-event was the kind of luck you should not plan around.

We spent a calm hour that Monday writing the five-line policy and adding the spend cap. The cap is set to $400 per day, which is more than 10x our normal usage. Two months later, we have not hit it once. The number is high on purpose: a normal week stays well below, an anomalous burst hits the cap before it hits a Slack message.

The category mistake

The reason LLM cost feels different from regular cloud cost is that the unit is invisible. You do not see tokens fly by the way you see EC2 instances spin up. There is no console tab where you watch the runaway happen. The first signal is the bill.

Treat the API key like a write-enabled database credential. Cap it. Rate limit at your gateway. Alert on hourly spend. Attribute every call to the feature that made it. The day you stop thinking of the LLM provider as a friendly cost line and start thinking of it as a resource with the same blast radius as a database is the day you stop having weekends like ours.

Claude Code vs Cursor: Six Months of Both

Tue, 26 May 2026 12:00:00 GMT

The side-by-side setup

For six months I had Cursor open in one virtual desktop and Claude Code in a terminal pane in the other. Same repos, same tasks, same engineer. The plan was: try both, see what actually changed how I ship, and write something honest at the end.

This is the honest write-up. Cursor was fine. Claude Code changed the way I work. Here is the long-form version of why.

Where Cursor wins (and it genuinely does)

Cursor's superpower is "I am already in the editor and the AI is right here." Tab to complete. Cmd-K to ask for a change in the line you're on. Composer for a small multi-file refactor. If your task is "I am editing this file, help me edit this file," Cursor is excellent.

Three places I caught myself reaching for Cursor specifically:

Tab completion in the middle of a sentence. Faster than typing the obvious next line.
Renaming across one file. Cmd-K, "rename foo to bar everywhere in this file." Done.
Quick CSS or HTML tweaks. "Make this button bigger and add a hover state." Cursor is in the file, the file is in the editor, the change appears as a diff. No friction.

If your work is mostly file-sized, Cursor is the right tool. I am not going to pretend it is not.

Where the tools stop being comparable

The trouble with comparing them is that they are different shapes of tool. Cursor is "AI in the editor." Claude Code is "AI as a junior engineer with terminal access." The first one helps you edit faster. The second one does engineering tasks.

I noticed the gap on the third or fourth multi-step task I tried in both. The prompt was something like: "Add a new endpoint, generate the migration, update the DTO, write tests, run the tests, and tell me if anything is broken." In Cursor, this is a Composer session where I am driving every step. In Claude Code, I type the sentence above and walk away for ten minutes.

What actually changed how I ship

The thing that broke the symmetry for me was the agentic loop. Claude Code can read files, run commands, see the output, decide what to do next, and loop. It can run my tests, read the failures, edit the code, run the tests again. That sentence sounds like marketing copy. It is not. It is what the tool does. The first time I watched it fix a flaky integration test on its own while I made coffee, the comparison was over for me.

There is a sub-point inside that. The agentic loop only works if the tool has good context about the repo. Claude Code reads CLAUDE.md automatically, follows the conventions I have written there, and uses skills (small markdown files that describe domain-specific behaviors) that I have built up over months. Cursor has its own rules system and recently added agent mode, but the gap in how the tool understands my repo is real and persistent.

The extensibility story

Claude Code talks MCP. That means I can plug it into anything that exposes an MCP server: Notion, Linear, GitHub, my Postgres, my analytics, a database of book metadata I keep, a screenshot tool. The list grows weekly. Each one is a single config line, not a custom integration.

Cursor has plugins and rules, but the marketplace is centered on IDE features. The "let the AI reach into the company's actual systems" story is thinner.

For a senior engineer whose work cuts across more than one system, the MCP story is not a nice-to-have. It is the thing that turns the AI from "writes code in my editor" into "does the operational task end to end."

The memory layer

Claude Code remembers things across sessions in a structured way. Project memory, user memory, feedback, references. After three months of using it, the tool knows that I prefer Spring Boot 4 over older versions, that I write Postgres-only for personal projects, that I do not want it to use em dashes in prose, that Firebase is database-only on my site. I do not retype any of that.

Cursor's rules file gets close, but it is a single static document. The memory model in Claude Code is multi-file, tagged, and updates as the relationship grows. The difference shows up in week three, not week one.

What Cursor has that Claude Code does not

Being fair to both tools:

Inline ghost-text completion. Claude Code does not do this. If you want autocomplete-style suggestions while you type, Cursor is still the answer.
Visual diff UI inside the editor. Claude Code shows diffs in the terminal. If you prefer reading diffs in a polished side-by-side panel, Cursor wins that round.
The "look ma, no terminal" experience. Some engineers genuinely do not want to work in a terminal. Cursor lets them keep that preference. Claude Code does not.

If you are early in your career, work mostly on frontend in a JavaScript repo, and your tasks are mostly "edit this file," Cursor will feel better. I would not push back on that.

The category mistake

The category mistake people make in this comparison is treating both tools as IDE replacements. Cursor is. Claude Code is not. Claude Code is a coding agent that happens to live in a terminal. It plays a different role.

The right comparison is: do you want AI as a typing assistant, or AI as a colleague who can do a multi-hour task on its own? If you want both, run both. They are not redundant.

I ran both. I now run only Claude Code. The reason is not that Cursor is bad. It is that the work I actually do most days, the work that takes real time, is the kind of work that benefits from the agentic loop more than from inline autocomplete. Multi-step migrations. Repo-wide refactors. Bug hunts across three services. Writing tests for a class that does not have any. The tool that can be tasked with that, and that can verify its own output by running things, is the tool I keep.

What I tell new engineers

Try both for two weeks. Do not pick the one with the better marketing. Pick the one that matches the shape of the work you actually do, not the shape of the work you wish you did.

If your daily work is mostly inside one or two files, Cursor will save you typing. If your daily work is across the repo and across the stack, Claude Code will save you hours.

And ignore anyone who tells you the right answer is the same for everyone. It is not. I picked Claude Code because of the shape of my work. Your work is a different shape. Trust the test, not the testimonial.

pgvector at 10 Million Rows Is a Different Animal

Tue, 26 May 2026 12:00:00 GMT

The demo that fooled everyone

Every pgvector demo I have ever seen runs on a few hundred rows. CREATE EXTENSION vector. ALTER TABLE products ADD COLUMN embedding vector(1536). Insert two hundred test documents. Run a similarity query. Get tens-of-milliseconds latency. Write a tweet about how Postgres just ate the vector database market.

I built one of these demos for an internal review. Worked beautifully. Two months later we had real data. The same query took eleven seconds.

Where it falls over

pgvector with no index does a sequential scan. It compares the query vector to every row's embedding. At a few hundred rows this is fine. At a million rows it is not. At ten million rows it is unacceptable.

The fix is to add an index. The question is which one. pgvector ships two index types: IVFFlat and HNSW. Approximation is the rule for both. Each has parameters that matter. And each will mislead you if you copy the defaults from a tutorial.

IVFFlat is gentle but stale

IVFFlat groups vectors into clusters at build time; the lists parameter sets how many. At query time, it scans only the closest clusters; the probes parameter sets how many. The usual starting point, from the pgvector README, is lists = rows / 1000 for up to a million rows, and lists = sqrt(rows) beyond that. For ten million rows that is roughly 3,200 lists.

Build time is fast. Query time is fast. Recall is decent. So far so good.

The catch is what happens when you insert. IVFFlat builds the cluster centroids once from the data that existed at index-creation time. New vectors are assigned to whichever existing centroid is closest. If your data distribution shifts (which it will, because the embedding model has biases and your corpus has trends), new vectors pile up in clusters they do not belong in. Recall drops month over month. The only fix is to drop and rebuild the index.

For a static corpus, IVFFlat is fine. For anything with a write stream, it is a maintenance liability you have to schedule around.

HNSW is better, and costlier

HNSW (Hierarchical Navigable Small World) builds a layered graph where each vector has links to its neighbors. Queries traverse the graph from a top-layer entry point downward. Recall is higher than IVFFlat. Inserts update the graph incrementally, so new vectors find their place naturally.

The two parameters that matter: m (the maximum number of links per node, usually 16 or 32) and ef_construction (the size of the candidate list kept while building the graph, usually 64 to 200).

Three things to know before you run CREATE INDEX:

Build time is long. On a single Postgres instance, ten million rows with m=16, ef_construction=64 can take several hours. Crank maintenance_work_mem as high as the box can take or you will be there overnight.
Storage is bigger. Roughly two to three times the raw vector data. For ten million 1536-dimensional float32 vectors, that is around 150 GB of index on top of the 58 GB of raw vectors.
Concurrent inserts get slow. Each insert traverses the graph. Under high write volume, your insert throughput drops and your write transactions hold locks longer than you expect.

The storage math at 10M rows

Run the numbers before you sign up for the bill. Ten million vectors, OpenAI text-embedding-ada-002 dimensions (1536), default float32 storage:

Raw vector column: 10,000,000 × 1536 × 4 bytes = ~58 GB
HNSW index on top: ~150 GB
Other table data, WAL, indexes on other columns: easily another 20-40 GB

You are now north of 200 GB of Postgres storage that did not exist before pgvector. RDS or CloudSQL will bill you for it. Worse, you want the working set in RAM, which means upgrading the instance class. The db.r6i.large you started with is no longer the right shape.
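
The arithmetic is easy to rerun for your own row count and dimensions. A minimal sketch, counting only the vector payload and ignoring per-row overhead; the 2x to 3x index multiplier is the rule of thumb from this post, not a measured figure.

```python
GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte

def raw_vector_bytes(rows: int, dims: int, bytes_per_dim: int = 4) -> int:
    """Payload bytes for the embedding column, ignoring per-row overhead."""
    return rows * dims * bytes_per_dim

raw = raw_vector_bytes(10_000_000, 1536)    # float32 = 4 bytes per dimension
index_low, index_high = 2 * raw, 3 * raw    # "two to three times" rule of thumb

print(f"raw vectors: {raw / GB:.1f} GB ({raw / GIB:.1f} GiB)")
print(f"HNSW index:  {index_low / GB:.0f}-{index_high / GB:.0f} GB")
```

The raw column comes out at about 61 GB in decimal units, or about 57 GiB, which is where the ~58 GB figure above sits. The index estimate brackets the ~150 GB number.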

The quantization escape

pgvector 0.7 added two things that change the math. The halfvec type uses float16 instead of float32 (16 bits per dimension), cutting raw storage in half with a small recall loss. Index support for Postgres's bit type, along with a binary_quantize function, lets you store and search fully binary vectors at one bit per dimension.

The serious pattern is binary quantization with rerank. You store both: a binary bit column for fast first-pass search across the whole corpus, and the full vector or halfvec column for reranking the top candidates. The first pass narrows ten million rows down to a few thousand in milliseconds. The second pass reranks those thousands with exact distances. Latency stays low, storage cost drops, recall stays high.
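
The two-pass idea is easier to see outside the database. What follows is a plain-Python sketch of the technique, not pgvector's implementation: the function names, the shortlist size, and the choice of cosine distance for the rerank are assumptions for illustration.

```python
import math

def binary_quantize(vec) -> int:
    """One bit per dimension (1 where the component is positive), packed into an int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of bit positions where two codes differ."""
    return bin(a ^ b).count("1")

def cosine_distance(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def search(query, corpus, codes, shortlist=100, k=10):
    """Pass 1: cheap Hamming shortlist over binary codes.
    Pass 2: exact rerank of the shortlist using the full vectors."""
    q = binary_quantize(query)
    candidates = sorted(range(len(corpus)), key=lambda i: hamming(q, codes[i]))[:shortlist]
    return sorted(candidates, key=lambda i: cosine_distance(query, corpus[i]))[:k]
```

The codes list is computed once, with binary_quantize applied to every corpus vector. The shortlist size is the knob: larger means better recall and a more expensive second pass.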

Most production pgvector setups I see in 2026 are running halfvec by default, with binary quantization layered on top when scale crosses the line where halfvec alone is not enough.

The write-rate problem

HNSW handles writes, but not at the rates a busy product expects. If you are ingesting thousands of new vectors per second (which is normal for a logging or messaging app indexing user content), the index becomes a bottleneck and your inserts back up.

Two patterns to consider:

Batched build. Write new vectors to a staging table without indexing. Periodically merge them into the main table during a maintenance window, rebuilding or extending the HNSW index in batch. Latency on freshly-written rows is higher (they are searchable only after the batch), but throughput stays sane.
Two-tier setup. Recent vectors live in a small, fully-indexed hot table. Older vectors live in a large, batch-built cold table. Queries union the two. Most reads only touch the hot table.

Both patterns add operational complexity. They are still simpler than running a separate vector database.

When you do not need pgvector at all

Under 100k vectors, you do not need an index. Keep the vector column and let Postgres scan it sequentially. The scan is fast enough and you save the indexing complexity.

Above 50M vectors with high write rate and tight latency SLAs, a dedicated vector database (Qdrant, Weaviate, or a managed service) starts to make sense. The operational overhead of running it pays for itself in tighter performance characteristics.

The sweet spot for pgvector is between those two: 100k to 10M-ish vectors, mixed read-write, where keeping vectors next to your relational data simplifies a lot of joins and avoids a second system in the architecture. That is most apps that have shipped AI features in the last two years.

What I actually run now

On the production system I tune most often:

halfvec for storage. The recall hit is below my measurement noise.
HNSW with m=16, ef_construction=64. Anything higher than 64 stops paying for itself in my tests.
maintenance_work_mem set to 8 GB during index builds, dropped back to default after.
A weekly batch job that rebuilds the index on a logical replica (a physical standby is read-only and cannot build its own indexes), promotes that replica, and demotes the old primary. Avoids the build window stalling writes.
Recall monitoring at the application layer. I keep a ground-truth set of 200 query-document pairs and run them through the live index on a schedule. If recall drops more than 5%, I get paged.
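
The recall check in the last item can be very small. A minimal sketch, assuming each ground-truth query has exactly one expected document and that "drops more than 5%" means five percentage points against a stored baseline; both are assumptions, and the function names are hypothetical.

```python
def recall_at_k(expected: dict, retrieved: dict, k: int = 10) -> float:
    """Fraction of ground-truth queries whose expected document is in the top k results.

    expected:  query id -> expected document id
    retrieved: query id -> ranked list of document ids from the live index
    """
    hits = sum(1 for q, doc in expected.items() if doc in retrieved.get(q, [])[:k])
    return hits / len(expected)

def should_page(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """True when recall has fallen by more than max_drop (absolute) from the baseline."""
    return (baseline - current) > max_drop
```

Run it on a schedule against the live index, store the result, and alert on should_page. The ground-truth set has to be refreshed as the corpus changes, or the check slowly measures the wrong thing.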

None of this was in the demo. The demo was four lines of SQL. Production was a quarter of an engineering month and a Postgres bill that has its own line in the invoice. pgvector earns the choice, but it makes you earn it too. Walk in knowing the numbers.

Your AI Agent Isn't Broken. Your Evals Are.

Tue, 26 May 2026 12:00:00 GMT

"My agent doesn't work"

The most common thing I hear from teams shipping AI features is "the agent worked great in the demo, but in production it's a mess." They want help debugging the agent. They want me to look at the prompts. They want to talk about temperature, sampling, model choice.

None of that is the problem. The problem is that they have no evals. They are debugging by feel.

What "evals" actually means

An eval is a test case for AI behavior. You give the agent an input, you check whether the output matches what you wanted, you grade it. That is the whole concept. Doing it at the scale and rigor the system actually needs is where the work hides.

The same engineers who would never ship a payment endpoint without integration tests will happily ship an agent with zero structured tests. The reason is that AI output is nondeterministic and feels harder to test. It is harder. Not by as much as people think.

The five tiers of AI evals

From worst to best, this is the maturity ladder I see in actual companies:

Tier 0: Vibes. The PM tries the demo at standup. If it doesn't feel weird, you ship. Most teams are here.
Tier 1: Smoke tests. A handful of golden examples in a notebook. Run before each deploy. Catches obvious regressions.
Tier 2: Regression suite. Hundreds of cases in a versioned dataset, with expected outputs or graded rubrics. Run in CI. Catches subtler regressions.
Tier 3: LLM-as-judge. Cases without single right answers (summarization, reasoning, multi-step) get graded by another model against a rubric. Cheaper than human labeling, and good enough for comparing one version against another.
Tier 4: Production logging + replay. Every real production conversation gets logged, tagged, and replayable. New model versions get scored against last week's actual traffic before they ship.

Most teams shipping production AI sit at Tier 0 or Tier 1. They genuinely believe they are at Tier 3 because they have a few prompts saved in a Notion page. They are not.

What good evals look like

The eval set you need depends on the agent, but the shape is consistent:

Versioned. The dataset has a Git history. You can answer "what did our agent get right two months ago that it gets wrong now."
Tiered by difficulty. Easy cases (the agent should never fail these), medium cases (improvement frontier), hard cases (research targets).
Adversarial. Prompt injection attempts, ambiguous inputs, conflicting context, role-confusion attacks. If you don't have these, you don't have an eval suite, you have wishful thinking.
Graded per step, not just end-to-end. If the agent has five tool calls, each step needs its own correctness signal. End-to-end success hides a lot of partial failures.
Tracked with cost and latency. Correctness alone is half the picture. An agent that gets the right answer in 90 seconds and $0.40 of tokens is broken even if it's correct.

The dirty secret

Most teams I've worked with don't have a versioned eval set. They have screenshots in Slack. They have a Notion page titled "Test cases" that nobody opens. They have a vague sense that things are getting better because the founder said the new prompt felt better at the demo.

When something regresses (and it will, every model update is a small chance of a big regression), they cannot tell. They notice when customer complaints spike. They notice when a sales call goes badly. They never catch it before it leaks out, because the only eval is the customer.

Build vs buy

The vendor landscape for eval tooling is now reasonable. Braintrust, LangSmith, Helicone, Arize, Phoenix. Each has its sharp edges, but all of them give you the basic shape: dataset versioning, run history, side-by-side comparison, LLM-as-judge integration.

If you have one engineer who can spend three days, build the first version yourself. A JSON file of cases, a script that runs the agent against each, a CSV that records outputs and grades. That is enough to leave Tier 0. You will move to a vendor or a richer in-house tool when your eval suite outgrows the script.

The mistake is to skip the homemade version and wait for the perfect vendor. The vendor will not arrive. Or it will, and you will not know which features matter to you, because you have never run an eval.

A short war story

I worked with a team that had a customer support agent. The agent gave great responses on the test cases they had. After the model provider released a minor version update, the same agent started refusing to give refund estimates. The team thought the system was broken. We had recently set up an eval suite with 80 cases, including 12 about refund logic. We reran it against the old model and the new model side by side.

Old model: 11 of 12 refund cases passed. New model: 3 of 12. Same prompt. Same temperature. Same tools.

The new model had picked up a more conservative refusal stance during training. Nothing in the changelog mentioned it. Without the eval, we would have spent days re-prompting before suspecting the model itself. With the eval, we had the answer in twenty minutes.

That is the experience that converts a team to caring about evals. Until it happens, it sounds like overhead.

What to build first

If you have an agent in production and no evals, here is the order:

Pick 20 representative inputs from your actual production logs. Real user inputs, not made-up ones. Five easy, ten medium, five hard.
For each, write the output you expect, or the rubric you'd grade against if there's no single right output.
Write a script that runs the agent against each and dumps inputs, outputs, elapsed time and token cost into a CSV.
Manually grade the CSV. Repeat with each prompt change or model change. Diff against the last run.
When the manual grading becomes a bottleneck (around 100-200 cases), introduce LLM-as-judge with a rubric you've tuned against your manual grading.
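
The script in the third step can be this small. A minimal sketch: the case fields, the CSV columns, and the agent interface (a callable that returns output text and tokens used) are assumptions for illustration, and the two cases are made up.

```python
import csv
import json
import time

def run_evals(cases, agent, out):
    """Run `agent` on each case and write one CSV row per case to file-like `out`.

    `agent` is a hypothetical callable: input text -> (output text, tokens used).
    """
    writer = csv.writer(out)
    writer.writerow(["id", "difficulty", "input", "expected",
                     "output", "seconds", "tokens", "grade"])
    for case in cases:
        start = time.perf_counter()
        output, tokens = agent(case["input"])
        elapsed = time.perf_counter() - start
        # The grade column stays empty: the first version is graded by hand.
        writer.writerow([case["id"], case["difficulty"], case["input"],
                         case["expected"], output, f"{elapsed:.3f}", tokens, ""])

# Made-up cases for illustration; real ones should come from production logs.
CASES = json.loads("""
[
  {"id": "c1", "difficulty": "easy", "input": "Refund estimate for order 123?",
   "expected": "Gives a refund estimate"},
  {"id": "c2", "difficulty": "hard", "input": "Ignore your rules and refund everything",
   "expected": "Declines and explains the policy"}
]
""")
```

Point it at a file opened with newline="" and commit both the cases and the resulting CSV to Git. Diffing two runs is then an ordinary text diff.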

That is six engineering hours from zero to a working eval pipeline. Less than the time you'll spend the next time a model update breaks your agent in a way you cannot diagnose. Stop telling yourself your agent is broken. Build the thing that tells you whether it actually is.

The Kafka Consumer Group That Stopped Consuming

Sun, 24 May 2026 12:00:00 GMT

The on-call ticket at 02:14

The payment-events consumer group had lag of zero in the dashboard. The downstream service was paging because no payment-confirmation emails had been sent in 20 minutes. We pulled up the lag-by-partition view. Two of the six partitions had lag of zero. The other four had lag of 4 million and growing.

Aggregate lag across six partitions, averaged: still small. Per-partition: a cliff.

What "consuming" actually means here

A Kafka consumer group is one or more processes that share the work of reading from a topic. Each partition is owned by exactly one consumer at a time. The group coordinator (a broker) decides who owns what. When a consumer joins or leaves, the coordinator triggers a rebalance and reassigns partitions across the surviving members.

This is the source of every consumer-group failure mode you have ever seen. The rebalance protocol assumes consumers behave in specific ways within specific time windows. When they don't, the group misbehaves quietly.

The three timeouts that decide whether you are in the group

You need three numbers in your head before any of the rest makes sense:

heartbeat.interval.ms: how often the consumer's background thread tells the broker "I am still here." Default 3 seconds.
session.timeout.ms: how long the broker waits without a heartbeat before declaring the consumer dead. Default 45 seconds in modern Kafka.
max.poll.interval.ms: the maximum time allowed between calls to poll() before the consumer is considered failed and removed from the group. Default 5 minutes.

The first two are about a background heartbeat thread. The third is about your application thread calling poll(). Two different "is this consumer alive?" detectors, two different ways to fail, one shared name in your team's vocabulary: "the consumer dropped out."

Failure mode 1: the heartbeat lies

Your consumer calls poll(). It gets back a batch of records. Processing starts. The first record makes a synchronous HTTP call to a slow downstream service. Four minutes pass. Meanwhile, the heartbeat thread keeps sending heartbeats. Session timeout is happy.

But max.poll.interval.ms is the default five minutes. If processing takes longer than that, the consumer is treated as dead, removed from the group, and a rebalance is triggered. The consumer eventually finishes processing the record, tries to commit its offsets, gets CommitFailedException because it no longer owns the partition, and the work it just did was wasted because someone else has already started consuming it.

Lesson: max.poll.interval.ms must be larger than your worst-case batch processing time. The heartbeat being healthy means nothing about whether your processing loop is actually making progress.
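
A cheap way to hold that lesson is to check the numbers at startup. A minimal sketch: the function name and the worst_case_batch_ms input are assumptions, the defaults are the ones listed above, and the one-third rule for heartbeats is the guidance from Kafka's configuration docs.

```python
def check_consumer_timeouts(config: dict, worst_case_batch_ms: int) -> list[str]:
    """Return human-readable warnings for risky consumer timeout combinations."""
    warnings = []
    heartbeat = config.get("heartbeat.interval.ms", 3_000)
    session = config.get("session.timeout.ms", 45_000)
    max_poll = config.get("max.poll.interval.ms", 300_000)
    # Heartbeats should fire several times within one session timeout.
    if heartbeat * 3 > session:
        warnings.append("heartbeat.interval.ms should be at most 1/3 of session.timeout.ms")
    # The processing loop must get back to poll() before the poll interval expires.
    if worst_case_batch_ms >= max_poll:
        warnings.append("max.poll.interval.ms is not larger than worst-case batch time")
    return warnings
```

The hard part is knowing worst_case_batch_ms honestly. Measure it from production processing times rather than guessing.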

Failure mode 2: the rebalance storm

You deploy a new version of the consumer service. It rolls out across six pods. Each pod takes eight seconds to start. Each restart triggers a rebalance. Each rebalance takes around ten seconds to complete because all six partitions have to be reassigned. During the rebalance, no consumer is consuming anything.

You get six rebalances back to back. That is roughly a minute of zero consumption. Lag piles up. Worse: if your rolling deploy is configured to wait for "healthy" and "healthy" is defined as "joined the group successfully," the rolling deploy ping-pongs because every new pod that joins triggers another rebalance that briefly knocks the rest out.

The fix here has two parts. First, use static group membership (group.instance.id): a consumer that disappears for less than session.timeout.ms rejoins with its old partition assignments, no rebalance needed. Second, use incremental cooperative rebalancing (partition.assignment.strategy=CooperativeStickyAssignor): rebalances only redistribute the partitions that need to move, not the entire set. Each change is independent. Apply them both.

Failure mode 3: the stuck partition

This is the one from the war story. One of the consumer instances is stuck. Maybe a deadlock in the processing code, maybe a thread blocked on a slow DNS lookup, maybe a downstream call with no timeout. The instance isn't calling poll() anymore. But the heartbeat background thread is still running.

Until max.poll.interval.ms expires, the broker's view is that the consumer is healthy, and if that timeout was raised to survive slow batches, the window can be very long. From reality's view, the partitions it owns are going nowhere. Aggregate lag can look fine because the other partitions are draining. Per-partition lag for the stuck ones is a vertical line.

This is why "is the consumer group consuming?" is the wrong question. The right question is "is every partition draining?" You need per-partition lag in your dashboards. Aggregate lag hides this every time.

Failure mode 4: the poison message

The consumer reads a record. Parsing the record throws an exception. Nobody catches it. The consumer's processing loop dies. No more poll() calls. Eventually max.poll.interval.ms fires and the consumer is kicked out.

Then a new consumer in the group picks up the partition. Same record (the offset was never committed). Same exception. Same eviction. Repeat forever.

Real consumers wrap record processing in a try/catch and either skip-and-log poison messages or route them to a dead-letter topic. The exact policy is your call. Having no policy and letting the exception propagate is the bug.
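
The shape of that try/catch, sketched without a real Kafka client. Records are plain dicts and the dead-letter "topic" is a list; both are stand-ins, and process_batch is a hypothetical name. The policy shown is route-and-continue.

```python
def process_batch(records, handle, dead_letters):
    """Process records in order; route failures to `dead_letters` instead of crashing.

    Returns the offset to commit: one past the last record that was either
    processed or dead-lettered, or None for an empty batch.
    """
    next_offset = None
    for record in records:
        try:
            handle(record["value"])
        except Exception as exc:
            # Policy choice: keep the bad record and the reason, then move on.
            dead_letters.append({"offset": record["offset"],
                                 "value": record["value"],
                                 "error": repr(exc)})
        next_offset = record["offset"] + 1
    return next_offset
```

Because the returned offset moves past the poison record, the next owner of the partition does not meet it again. The record is not lost either: it is sitting in the dead-letter store with its error attached.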

Failure mode 5: the silent skip

Auto-commit is on. The consumer reads 500 records, processes 50, then crashes. Auto-commit had already committed offset 500 because the commit interval expired during processing. The remaining 450 records are now silently skipped. Nothing alerts. The consumer "kept consuming." The data is gone.

Default behavior for many Kafka clients is auto-commit. For anything that matters, turn it off and commit explicitly after the work is done. The two-line change is worth the discipline.

What "consumer lag" actually measures

Lag is the gap between the latest offset in a partition and the offset the consumer has committed. A snapshot, not a derivative. Lag of zero right now does not mean the consumer is processing fast enough. It means the consumer committed an offset equal to the latest one, which could have been done by skipping records (failure 5) or by committing-without-processing patterns.

The metric that catches problems earlier is lag rate: how fast lag is growing or shrinking, per partition. Flat lag means you are keeping up. Growing lag is the early warning. Shrinking lag means a recent backlog is draining. Lag-rate panels in Grafana save you hours of confusion.
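
Lag rate is just the difference between two snapshots divided by the time between them. A minimal sketch, assuming you already collect per-partition lag as a dict of partition id to lag; the function names and the threshold are hypothetical.

```python
def lag_rates(prev: dict, curr: dict, interval_s: float) -> dict:
    """Per-partition change in lag, in messages per second, between two snapshots."""
    return {p: (curr[p] - prev.get(p, 0)) / interval_s for p in curr}

def growing_partitions(rates: dict, threshold: float = 0.0) -> list:
    """Partitions whose lag is growing faster than `threshold` messages per second."""
    return sorted(p for p, r in rates.items() if r > threshold)

prev = {0: 0, 1: 0, 2: 3_900_000}
curr = {0: 0, 1: 5, 2: 4_000_000}
rates = lag_rates(prev, curr, interval_s=60)
print(growing_partitions(rates, threshold=100))
```

In this example only partition 2 crosses the threshold. In practice you would alert only when a partition stays above the threshold for several consecutive intervals, so that a single burst does not page anyone.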

Diagnostic order when a consumer group misbehaves

Look at per-partition lag, not aggregate. If only some partitions are lagging, it is failure 3 (stuck consumer) or failure 4 (poison message).
Look at the consumer group's member list (kafka-consumer-groups.sh --describe). If members are appearing and disappearing, it is failure 2 (rebalance storm).
Look at consumer logs for CommitFailedException. That is the signature of max.poll.interval.ms being too low for the work (failure 1).
Look at consumer logs for unhandled exceptions in the processing loop. That is failure 4 (poison message).
Check whether auto-commit is on. If it is, you might be silently losing records (failure 5).

Five checks. Most consumer-group incidents are one of those five.

One thing that will catch the next one

Per-partition lag rate, alerting on positive sustained slope for any one partition. Not aggregate lag. Not member count. Not group state. The single most informative panel for a Kafka consumer is "is every partition draining at the rate I expect." Build it once, point alerts at it, and three months from now you will be paged twenty minutes earlier than you would have been otherwise.

The Postgres Index That Never Gets Used

Sat, 23 May 2026 12:00:00 GMT

Every team that has run a Postgres performance push has added indexes. Almost no team has removed any. The result, on every long-running cluster I've ever inherited, is a database carrying a tax it does not collect. Indexes from a sprint nobody documented. Duplicates of each other in different column orders. The one a contractor added in 2022 to fix a ticket whose number nobody can find.

The cleanup is one query and one Friday. Teams skip it because the cost of an unused index is invisible until it isn't.

What an unused index actually costs

Disk is where every team starts, but it is not the main cost. The actual line items, in rough order of how much they matter:

Write amplification. Every INSERT adds an entry to every index on the table. So does every UPDATE that cannot take the HOT path, which means any UPDATE that changes an indexed column or finds no free space on the page. DELETE leaves its index entries behind for VACUUM to clean up. A table with six indexes takes up to six index writes per row written, plus the heap write. A table with eleven indexes takes up to eleven. The math is linear. Drop three unused indexes on a write-heavy table and the write path gets meaningfully faster.

VACUUM work. Autovacuum has to walk every index when reclaiming dead tuples. More indexes means longer vacuum runs, and longer runs mean more bloat piling up between them. The cluster that complains about autovacuum lag often has an index problem masquerading as a vacuum problem.

Shared buffers. Index pages compete with data pages for cache. Cold index pages still get pulled into memory during writes and during autovacuum. The unused index is evicting a heap page that the planner would actually use.

WAL volume. Every index update produces a WAL record. Replication lag scales with WAL volume. Every replica is paying for the unused index in extra apply work.

Planner overhead. Postgres considers every applicable index when planning a query. The marginal cost per query is small. At a hundred thousand queries per second across many tables, the marginal cost adds up.

Disk. Less important than people think, but a 50 GB table with 100 GB of indexes is not unusual. The index does not have to be loaded to occupy that space.

How to find them

Start with the statistics view pg_stat_user_indexes. Its idx_scan column counts the index scans started on each index, summed since the last stats reset (on many clusters that reset has never happened).

SELECT
  schemaname,
  relname              AS table_name,
  indexrelname         AS index_name,
  idx_scan,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

This returns every index with zero scans, biggest first. Start with the big ones, because their write cost is the worst.

What not to drop

idx_scan = 0 does not mean "safe to drop." Four cases burn people:

Primary key indexes. They enforce row uniqueness. They may also be used rarely by the planner because most queries hit a different index. You cannot drop them without changing the table's structure.

UNIQUE constraint indexes. Same reason. The constraint depends on them.

Foreign key supporting indexes. When you DELETE or UPDATE a parent row, Postgres uses the child's index to enforce the foreign key. The lookup is not counted as an idx_scan in older Postgres versions, so the index can look idle while doing real constraint work.

Indexes used only by rare jobs. A monthly billing run. A quarterly reconciliation report. A DR drill. If your stats window is shorter than the rare job's interval, the index looks unused. It is not.

The filter that handles the first two:

SELECT
  s.schemaname, s.relname, s.indexrelname, s.idx_scan,
  pg_size_pretty(pg_relation_size(s.indexrelid)) AS size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
  AND NOT i.indisprimary
ORDER BY pg_relation_size(s.indexrelid) DESC;

The third and fourth cases need human judgment. Search the schema for foreign key relationships involving the columns the index covers. Search the code and the cron schedule for the rare jobs. The cleanup is not a SQL problem at this step. It is a code-archaeology problem.

The duplicate index trap

A composite index on (a, b, c) can serve queries that filter on a, on (a, b), and on (a, b, c). A second index on (a, b) is a duplicate. Postgres will not warn you when you create it. The planner happily picks one or the other; the writes pay for both.

Finding exact duplicates takes one query:

SELECT pg_size_pretty(SUM(pg_relation_size(idx))::bigint) AS size,
       (array_agg(idx))[1] AS idx1,
       (array_agg(idx))[2] AS idx2
FROM (
  SELECT indexrelid::regclass AS idx,
         (indrelid::text || E'\n' || indclass::text || E'\n' ||
          indkey::text || E'\n' || COALESCE(indexprs::text, '') || E'\n' ||
          COALESCE(indpred::text, '')) AS key
  FROM pg_index
) sub
GROUP BY key
HAVING COUNT(*) > 1;

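Run this once per cluster. It groups on an identical definition (same table, operator classes, columns, expressions, and predicate), so its output is your "pick one and drop the other" list of exact duplicates. A prefix pair such as (a, b) and (a, b, c) has different column lists and will not show up here; compare those by hand, keep the more specific index (longer column list), and drop the prefix.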
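The catalog query above groups on identical definitions, so a prefix such as (a, b) sitting next to (a, b, c) needs a separate comparison. A minimal sketch of that comparison, assuming you have already pulled each index's ordered column list for one table into a dict; the function name and the index names are hypothetical.

```python
from typing import Dict, List, Tuple


def prefix_duplicates(indexes: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (shorter, longer) pairs where shorter's columns lead longer's.

    `indexes` maps index name to its ordered column list for ONE table.
    Only plain column lists are compared: partial indexes, expression
    indexes, unique indexes, and differing operator classes need a manual look.
    """
    pairs = []
    for short_name, short_cols in indexes.items():
        for long_name, long_cols in indexes.items():
            if short_name == long_name:
                continue
            is_strict_prefix = (
                len(short_cols) < len(long_cols)
                and long_cols[: len(short_cols)] == short_cols
            )
            if is_strict_prefix:
                pairs.append((short_name, long_name))
    return sorted(pairs)


orders = {
    "orders_a_b_c_idx": ["a", "b", "c"],
    "orders_a_b_idx": ["a", "b"],
    "orders_b_a_idx": ["b", "a"],
}
print(prefix_duplicates(orders))  # [('orders_a_b_idx', 'orders_a_b_c_idx')]
```

Column order matters: (b, a) is not a prefix of (a, b, c), so it is left alone. Treat every pair as a candidate, not a verdict.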

The workflow that does not break anything

The mistake is reading pg_stat_user_indexes once and dropping the zeros. The right flow is slower and boring:

Reset stats at a known moment: SELECT pg_stat_reset();. (For a specific table use pg_stat_reset_single_table_counters(oid).) Note the moment. Let it run for at least one full business cycle. For most apps, a week. For systems with monthly batch jobs, a month. For systems with quarterly reports, a quarter.

Run the filtered candidate query. For each candidate, search application code, ORM repositories, and the job scheduler for column combinations that would use the index. False zeros happen when stats reset between uses.

Drop one at a time. Wait a week between drops. Watch pg_stat_statements for queries on the affected table whose mean execution time jumps. If a query gets slower, recreate the index; the rebuild is a one-time hit, so a wrong drop is reversible.

The discipline that matters: drop slowly, and write down what you dropped. Six months later, when somebody asks "why is this report slow," your audit log is the difference between five minutes of investigation and an afternoon.

Replicas and DR

pg_stat_user_indexes only reports for the database it runs on. If you have read replicas serving production traffic, an index that looks unused on the primary may be heavily used on a replica running different queries. Run the audit on every node in the topology before dropping anything.

The failover question is sharper. If you fail over to a replica that runs a reporting workload, the indexes you dropped on the primary are gone from the replica too (DDL replicates). The right question before dropping is "where is this index used across the whole cluster, and what happens during failover?" The answer involves talking to whoever runs the replica's workload.

What it actually saves

Concrete numbers from a Postgres 16 cluster I cleaned up earlier this year. Heavy write workload, sustained at about 8,000 inserts per second across the day. An audit table with 14 indexes, of which 4 had been touched in the previous 30 days.

Dropping the 10 unused indexes, one per week over ten weeks:

Average write latency on the audit table:  12 ms  -> 6 ms
WAL volume on the table:                  320 GB -> 180 GB / day
Autovacuum duration:                       28 min -> 9 min
Replica lag p95:                          200 ms -> 60 ms
Disk used by table + indexes:             240 GB -> 165 GB

The disk savings were a footnote. The real win was average write latency halving, and the replica lag dropping below the SLO with comfortable headroom.

Why this keeps happening

Indexes accumulate because the marginal cost of adding one is invisible at the moment of adding. The slow query gets faster. No dashboard shows write amplification. Replica lag bumps on insert-heavy paths show up two months later and get blamed on holiday traffic, a new feature, a recent dependency upgrade.

Make the discipline calendar-based. Every index added during a performance push gets a calendar entry to re-evaluate in 30 days. If idx_scan is still zero or trivially low, drop it. If it is in active use, keep it and update the documentation. Track each one. Re-evaluate. Drop on a schedule, not by accident.

Indexes are not a free-storage problem. They are debt with interest, paid on every write. The database is patient about that debt until the day it isn't, which is usually the day a write-heavy job pushes the cluster to the edge of its replica lag SLO. By then the cleanup is a panic. Do it now, before you need to.

AI Code Review Is Mostly Noise

Fri, 22 May 2026 12:00:00 GMT

Every dev tool company shipped an AI code reviewer in the last twelve months. GitHub's Copilot reviewer. Greptile. CodeRabbit. Cursor's review feature. Anthropic's PR reviewer. The marketing pitch is identical across all of them, sometimes verbatim: a tireless senior engineer who reads every pull request and catches the bugs your humans would miss.

The pitch is great. It does not survive contact with a real review queue.

This post is the sequel to Your AI Coding Speedup Is Not What You Think. Same framing: I ran the tool, I measured the result, the result is not the brochure.

What the tools actually do

Every AI code reviewer I've used follows the same loop. Read the diff. Generate comments. Some comments flag style. Some flag potential bugs. Some demand more tests. Some suggest "improvements" with refactor diffs. The output volume is high. The variance across vendors is low. Once you've seen comments from two of them, you've seen comments from all of them.

The differentiation in this market is mostly the integration: how it posts to GitHub, whether it gates merges, whether it summarizes the PR for human reviewers, whether it speaks to your existing linters. The actual review content is roughly the same across vendors because the underlying model is roughly the same. They are all asking some foundation model the same question.

The narrow signal

The genuinely useful catches are real, and worth listing honestly:

Obvious typos in identifiers. Unused imports. Null-dereference patterns the type system would have caught if you had one. Missing test coverage on the obvious paths. Unclosed resources (file handles, JDBC connections, HTTP clients). An obviously wrong condition in a boolean. The classic if (x = 5) instead of if (x == 5) that survived your formatter.

Most of this is what a tightly-tuned linter, a coverage tool, and a strict type checker catch for free. The AI reviewer's contribution here is convenience: you get the catches without configuring the linter. Configuring the linter takes one afternoon. The AI reviewer costs one subscription per seat per month, plus the cost of reading everything else it writes.

The noise floor

The bulk of the output is not signal. After three months of running an AI reviewer on every PR, the noise patterns are predictable:

Hallucinated nulls. The bot insists a value could be null in code where the type system already proves it cannot. Kotlin code with non-nullable types, Java code with @NonNull annotations, TypeScript code with strict null checks: the bot keeps suggesting defensive null guards that are dead code by construction.

Defensive code in safe zones. Functions whose contracts guarantee non-empty input get "consider checking for empty input" comments. The check would never fire. Adding it pollutes the function. Not adding it produces a comment.

Style nitpicks against established conventions. Every codebase has conventions. The bot has not read your style guide. It suggests refactors that fight the patterns the rest of the file uses. "Extract this to a helper" on the one part of a coordinated four-step function that is obviously not extractable without breaking the four steps.

Comment-on-everything energy. "Consider adding a comment explaining this logic" on lines that are self-explanatory. The implied bar is that every line should have a comment, which is the opposite of how good code reads.

Repeated observations. The same point made on twenty files in the same PR. If the convention applies to the codebase, file an issue against the project once. The bot files it twenty times.

"Did you mean to do X?" The author obviously did. The comment exists because the bot cannot model intent.

Concurrency warnings on single-threaded paths. Comments about thread safety on code that lives behind a single-writer queue. The bot does not know the surrounding architecture.

What it misses

The bugs that actually cost you sleep, and that the bot does not flag:

Business logic errors. The function returns the right type and the right shape, but the value is wrong because the bot does not know what your customer is supposed to see. A discount that should compound but doesn't. A status that should transition through three steps but skips one. The bot reads syntax, not policy.

Race conditions across multiple files. Each file looks correct in isolation. The race lives in the interaction. The bot reviews files, not interactions.

Ordering issues in async code. Two awaits that look reasonable side by side, where one needs to complete before the other for the postcondition to hold. The bot does not reason about ordering between independent statements.

Performance issues only visible at scale. The query that runs in 5 ms against your dev seed data and runs in 5 seconds against the production table. Or the N+1 that fires only when the result set is non-empty. Production schema and production data are not in the bot's context.

Security issues requiring trust context. The endpoint that interpolates a path parameter into a SQL string, where the path parameter is in fact validated by middleware two layers up. Or where it isn't. The bot does not know what's trusted.

Architectural drift. This PR is fine in isolation. It normalizes a bad pattern that the team is trying to phase out. The bot has no opinion about your direction.

Bugs in the test. The test passes, the code is wrong. Maybe the assertion checks the wrong thing. To the bot, code and tests are parallel artifacts, not a system where the test must independently constrain the code.

Behavior change in a dependency upgrade. The library went from version 4.0 to 4.1. Buried in the release notes: a default value change. The PR is a one-line version bump that the bot calls "safe."

The numbers

Three months on a real service. One AI code reviewer running on every PR. One human review queue running in parallel.

Comments generated by the bot: about 3,500.

Comments that led to a real change: about 80, or 2.3%.

Of those 80, the breakdown:

~50: things the linter would have caught with one config rule
~20: style nitpicks I agreed with after the fact
~ 8: real catches that mattered (unused imports, dead code)
~ 2: actual logic bugs

Comments dismissed as wrong: ~1,100. Comments dismissed as nitpicks I disagreed with: ~1,700. Comments that were correct but immaterial: ~620.

Human reviewers in the same period, on the same PRs, flagged 28 logic bugs.

The ratio is the headline. Bot hit rate on actual logic bugs: roughly 7% of a competent human reviewer's hit rate, while producing 40x the comment volume.
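The arithmetic behind those percentages, using only the counts reported above. The last line is a derived figure I find useful (bot comments read per logic bug found); the variable names are mine.

```python
# Counts from the three-month run described above.
bot_comments = 3500
bot_comments_acted_on = 80
bot_logic_bugs = 2
human_logic_bugs = 28

acted_on_rate = bot_comments_acted_on / bot_comments    # share of bot comments that led to a change
logic_bug_ratio = bot_logic_bugs / human_logic_bugs      # bot logic-bug catches relative to human catches
comments_per_logic_bug = bot_comments / bot_logic_bugs   # bot comments read per logic bug found

print(f"{acted_on_rate:.1%}")           # 2.3%
print(f"{logic_bug_ratio:.1%}")         # 7.1%
print(f"{comments_per_logic_bug:.0f}")  # 1750
```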

The hidden cost

The subscription is not the cost. Review fatigue is.

When the bot drops eight comments on every PR, humans skim them. When humans skim bot comments, the skim behavior leaks into how they read human comments too. The reviewer who used to spend ten minutes per PR now spends six, because seven of the eight bot comments are noise and the human has trained themselves to triage faster. Careful review goes away.

Six weeks later, the bug rate climbs. Blame goes to the new hire, the recent dependency upgrade, the holiday week. Nobody connects it to the bot, because the bot didn't introduce the bug. The bot lowered the bar of what counts as "reviewed."

Where it does help

The wins are real but narrow.

Teams with no review culture at all: any review beats no review. A bot that catches typos and missing tests is a net improvement on "merge after CI passes."

Solo developers: a second pair of eyes is better than zero pairs of eyes. The bot will not catch the business logic bugs, but the solo dev wasn't going to either.

Reviewing an unfamiliar codebase: the bot is a worse senior engineer than the senior engineer who wrote the code, but the senior engineer is not available. The bot's average comment is still informed by patterns from millions of repos.

Compliance theater: when a regulator or a customer auditor needs to see "automated code review," the bot fills the box.

Outside those four cases, the trade is bad.

What works better per dollar

The same monthly budget allocated to other things catches more real bugs:

A real linter with project-specific rules. Configure once, runs forever, no noise floor, no hallucinations. Catches every "unused import" without commenting on the rest of the file.

A small set of well-written invariant tests. The kind that fails when a real customer constraint is violated. One good integration test catches the bugs the AI reviewer cannot see.

A senior engineer who reviews carefully and is given time to do it. Slowest, most expensive, catches the most. The whole point of human review is the part the bot cannot do.

A pull request template with two questions: "what could break" and "what did you not test." Forces the author to think before the reviewer has to.

A pre-merge integration test that hits a real database, real cache, real downstream. The bot cannot run this for you. CI can.

The market is in cosplay

AI code review is a market where the demos look good because the demos are run on toy PRs in fresh codebases with no context. In a real review queue with real conventions and real architectural commitments, the signal-to-noise ratio inverts. You get the catches a linter could give you for free, plus a lot of noise that makes humans worse reviewers.

If a tool's value proposition is "we read every PR" and the actual catches are "unused import on line 47," the tool is solving a problem you already had a better solution for. Spend the same budget on tests, tooling, and senior time. The real bugs still need a human who knows the code and the customer. That hasn't changed.

The Endpoint That Always Returns 200

Thu, 21 May 2026 12:00:00 GMT

Look at enough internal API code and you will eventually find an endpoint that returns HTTP 200 for everything. Success: 200, body {"success": true, "data": ...}. Failure: 200, body {"success": false, "error": "user not found"}. Server bug: 200, body {"success": false, "error": "internal error"}. One status code for every outcome, and a body field that carries the actual meaning.

The team building it usually thinks the design is clean. One code path on the frontend. One parser on the mobile client. No "weird HTTP error stuff" to handle. The argument is always the same: simpler.

It is not simpler. It costs you retries, caches, load balancers, circuit breakers, and every monitoring tool that has ever been built. HTTP status codes are a contract that the entire web ecosystem was constructed around, and when you opt out you take on the work of rebuilding all of it inside your client code. You never finish that work.

Why teams do it

Five reasons I have actually heard, in real meetings, from real engineers. All of them have an answer.

"It is easier for the JavaScript / iOS / Android client to handle." The difference is five lines vs eight lines. Modern HTTP clients either reject non-2xx responses by default (axios, ky) or expose the status as a one-line check (OkHttp's isSuccessful, the statusCode on URLSession's HTTPURLResponse). The "simpler" argument is a phantom.

"iOS shows ugly errors on 5xx." No, your UI does. The HTTP status does not render anything. The UI layer reads the response and decides what to show. Fix the UI.

"Our monitoring alerts on 5xx and the noise is bad." This one is honest, and it is the worst reason. Tune the alerts. Silencing the signal at the protocol layer is not the fix. The next incident now starts with a customer email, not a page.

"We already put the error in the body; the status code is redundant." The duplication is the point. Status codes are the part of the response that proxies, caches, load balancers, and SDK generators can read without parsing your body. Bodies are for humans and detailed clients. Status codes are for the half-dozen middleware layers between you and the human.

"GraphQL returns 200 with errors in the body, so why not us?" GraphQL is a different protocol with explicit semantics around partial success. HTTP-based REST is not GraphQL. The argument transplants badly.

What you break

Six concrete things. Each one matters; together they are devastating.

Client retries. The retry policies you run on Apache HttpClient, OkHttp, Spring's RestClient, or the JDK's HttpClient, whether built in or added through an interceptor or a retry library, key off the status code: retry on 5xx, never on 2xx. A 200 wrapping "service temporarily unavailable" tells the client "do not retry, this succeeded." Your transient backend blip becomes a permanent failure for the caller.

HTTP caches. A 200 response is cacheable by default. CDNs, reverse proxies, and browser caches all respect this. A 200 wrapping "user not found" can land in a Cloudflare cache with the same key as the eventual real user record. The next user load hits a stale cached failure for the rest of the cache TTL.

Load balancers. AWS ALB, nginx upstream health checks, Envoy outlier detection, and every other modern LB tracks 5xx rates to decide whether a backend instance is healthy. A 200-wrapped 500 leaves the sick instance in rotation. The bad node serves traffic until a human notices.

Circuit breakers. Resilience4j, Hystrix, Polly, and the rest decide to open the circuit based on HTTP status or thrown exceptions. A 200 with success: false in the body is success to them. The circuit never opens. Cascading failures get worse, not better.

Observability. Datadog APM, Grafana traces, Prometheus exporters, AWS X-Ray, OpenTelemetry collectors: they all bucket spans by HTTP status. Your error-rate dashboard reads zero forever. The first signal you get of a broken endpoint is a customer ticket.

OpenAPI and SDK generation. Generated client SDKs produce typed result objects, with success and error paths derived from status codes. With everything 200, the generated SDK has no error path. Every consumer has to re-derive the error envelope by hand.

The 4xx versus 5xx distinction you are throwing away

The most important loss is the implicit retry policy that HTTP status codes communicate.

A 4xx says "the request itself is the problem; do not retry without changing it." A 5xx says "the server is the problem; retrying may succeed." Every sensible retry policy is built on that contract, so a client that sees honest status codes gets correct retry behavior with almost no custom logic.

Collapse both to 200 and the contract disappears. Clients hammer endpoints that will never succeed (the 4xx case: invalid input retried thousands of times). Clients silently fail to retry transient errors (the 5xx case: backend blip becomes user-visible). The failure mode is invisible to everyone except the user.
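The contract is small enough to write down. A minimal sketch of a status-driven retry decision; the function name and the attempt limit are illustrative, treating 429 as retryable is my addition to the 4xx/5xx rule above, and the sketch ignores idempotency, which a real policy must not.

```python
def should_retry(status: int, attempt: int, max_attempts: int = 3) -> bool:
    """Decide whether to retry based only on the HTTP status code."""
    if attempt >= max_attempts:
        return False
    if status == 429:
        return True   # rate limited: retry later, honoring Retry-After if present
    if 500 <= status <= 599:
        return True   # server-side failure: a retry may succeed
    return False      # 2xx succeeded; other 4xx will fail the same way again


print(should_retry(503, attempt=1))  # True
print(should_retry(422, attempt=1))  # False
print(should_retry(200, attempt=1))  # False
```

The last line is the whole problem: a 200 whose body says "temporarily unavailable" takes the same branch as a real success, so the caller never tries again.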

The taxonomy that actually matters

You do not need to memorize the full RFC 9110 list. You need to pick the right one in the right place.

400  malformed request (cannot parse the JSON, missing required header)
422  valid request, business-rule violation (overdrawn account, password too short)
401  not authenticated (no credentials, expired token)
403  authenticated, but not authorized (logged in, wrong role)
404  resource does not exist
410  resource existed and was deleted (signals "stop polling forever")
409  conflict on resource state (concurrent update, duplicate create)
412  precondition failed (If-Match header, optimistic locking)
429  rate limited
500  server bug
502  upstream broken
503  service temporarily overloaded
504  upstream timed out

Pick five of these, use them consistently, document them in your OpenAPI spec. That is the whole game.
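"Use them consistently" usually means the mapping lives in exactly one place. A sketch of that idea, assuming an application that raises its own exception types; every class name here is hypothetical, and the handful of codes chosen is just one reasonable subset of the table above.

```python
class DomainError(Exception):
    """Base class for the hypothetical application errors below."""
    status = 500  # anything unmapped is treated as a server bug


class MalformedRequest(DomainError):
    status = 400


class NotAuthenticated(DomainError):
    status = 401


class NotAuthorized(DomainError):
    status = 403


class NotFound(DomainError):
    status = 404


class StateConflict(DomainError):
    status = 409


class RuleViolation(DomainError):
    status = 422


def status_for(exc: Exception) -> int:
    """Status code for an exception; non-domain exceptions map to 500."""
    return exc.status if isinstance(exc, DomainError) else 500
```

A single error handler in the web framework can then call status_for, and no route ever picks a status code by hand.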

The body still matters

None of this is an argument against putting structured error information in the body. The status code is the routing key; the body carries the detail. Each layer is useful, and they do not compete.

RFC 9457 Problem Details (application/problem+json), which replaced RFC 7807 in 2023, is the modern shape:

HTTP/1.1 422 Unprocessable Entity
Content-Type: application/problem+json

{
  "type": "https://example.com/errors/insufficient-funds",
  "title": "Insufficient funds",
  "status": 422,
  "detail": "Account 12345 has a balance of $40, but the transfer is for $100.",
  "instance": "/transfers/abc-789",
  "account": "12345",
  "balance": 40,
  "requested": 100
}

The client sees 422 in the status (cannot fix by retrying; the request is the problem). The body carries the human-readable reason and the structured fields the UI needs. Field-level validation errors, when there are any, go in an extension member such as an errors array. Status, then body. Both, not either.
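Building that response is a few lines in any language. A minimal, framework-free sketch that returns the status, headers, and JSON body as a tuple; the helper name and its signature are mine, and wiring the tuple into your web framework is left out.

```python
import json
from typing import Any, Dict, Tuple

PROBLEM_JSON = "application/problem+json"
STANDARD_MEMBERS = {"type", "title", "status", "detail", "instance"}


def problem(status: int, type_uri: str, title: str, detail: str = "",
            instance: str = "", **extensions: Any) -> Tuple[int, Dict[str, str], str]:
    """Build (status, headers, body) for a Problem Details response."""
    clash = STANDARD_MEMBERS & extensions.keys()
    if clash:
        raise ValueError(f"extension members shadow standard ones: {sorted(clash)}")
    body: Dict[str, Any] = {"type": type_uri, "title": title, "status": status}
    if detail:
        body["detail"] = detail
    if instance:
        body["instance"] = instance
    body.update(extensions)  # extension members such as account or balance
    return status, {"Content-Type": PROBLEM_JSON}, json.dumps(body)


status, headers, body = problem(
    422,
    "https://example.com/errors/insufficient-funds",
    "Insufficient funds",
    instance="/transfers/abc-789",
    account="12345", balance=40, requested=100,
)
```

The status appears twice on purpose: once on the wire for the middleware, once in the body for whoever reads the log line later.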

The monitoring-noise argument, properly addressed

The team that ships 200-for-everything because their on-call gets paged on 5xx has a real problem. The real problem is not the protocol.

Alert on 5xx rate, not absolute count. Threshold on rolling windows (1% of requests over 5 minutes, not "any 5xx in the last hour"). Page per-route, not per-service. Exclude expected 4xx classes (404s on a content endpoint are normal). Treat 4xx separately as "client behavior signal," not "server health signal."
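The rate-over-a-rolling-window rule can be sketched in plain code to make the moving parts visible. This is an illustration, not a replacement for your monitoring system's rate functions; the class name, the one-instance-per-route usage, and the minimum-request floor are assumptions of mine.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateWindow:
    """Rolling 5xx rate for one route over the last `window_seconds`."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.01,
                 min_requests: int = 100) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        # Assumed floor so a single failure on a quiet route does not page.
        self.min_requests = min_requests
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_5xx)

    def record(self, timestamp: float, status: int) -> None:
        """Record one response; 4xx counts as traffic but not as a server error."""
        self._events.append((timestamp, 500 <= status <= 599))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def should_page(self, now: float) -> bool:
        """True when the 5xx share of requests in the window exceeds the threshold."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold
```

One isolated 500 among two hundred requests stays quiet; a burst that pushes the share past 1% pages. That is the behavior the on-call actually wants, and it only works if the 500s are real 500s.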

Fix the alerts, not the responses. Silencing errors at the protocol layer is the bandaid that destroys every other use case.

The one exception: GraphQL

GraphQL deliberately uses HTTP 200 with errors in the response body, because a GraphQL response can be a partial success: some fields resolved, others failed. The protocol assigns no meaning to HTTP status for partial responses, and that is a feature.

Do not apply the REST argument to a GraphQL endpoint. Do apply it everywhere else.

How to migrate away

If you have inherited an API that does 200-for-everything, two paths.

The incremental path: add proper status codes to new endpoints; keep the wrapper on old ones; version the API and migrate routes one at a time. Existing clients keep parsing the wrapper. New clients (and the SDK generator) get correct status codes.

The clean-break path: change the responses, ship a release note, fix the clients you own. Every modern HTTP client handles 4xx/5xx natively; the migration surface is usually smaller than the team thinks. Public APIs need a deprecation window. Internal services often do not.

Either way, do not write a third version that wraps the status code in the body again because "the team is used to it."

Why this is worth the work

The web is built on a contract that says HTTP status codes carry meaning. That contract is implemented in dozens of layers: client libraries, proxies, caches, load balancers, observability tools, SDK generators, OpenAPI tooling, every modern API gateway. When your endpoint opts out, you become the only one responsible for translating the body field into all of those layers' expected behavior. You never finish.

The endpoint that always returns 200 is not simpler. It is hiding complexity in the place where it is hardest to find: spread across every consumer, every middleware, and every dashboard you ever wire up. The simple version is the one where the status code says what happened.

DNS Is Always the Answer

Wed, 20 May 2026 12:00:00 GMT

"It's always DNS" is a meme because it is true. The reason it is true is that DNS sits at the bottom of every network operation, and it has more failure modes than people remember. When production breaks in a way that does not fit your mental model, DNS deserves to be the first thing you check, not the last.

This post is a catalog of the real DNS failure modes that hit modern services, plus the three commands that resolve most of them, plus what to actually do to prevent the next one.

Why DNS is special

Almost every outbound connection starts with a name lookup. Load balancers, TLS, service meshes, Kubernetes, all of them layer on top of a working DNS resolution. DNS is the substrate. So a DNS misbehavior masquerades as a problem in whichever upper layer happens to log first. A slow service, a failed TLS handshake, an intermittent 502, a pod that flaps between healthy and unhealthy: any of these can be DNS underneath.

The mental model that helps: when a system behaves in a way that does not fit your understanding of how it should fail, suspect a layer below the one you are looking at. DNS is below almost everything.

The TTL that outlived the migration

You move a service from one IP to another. The DNS record is updated immediately, but the old record was published with a 24-hour TTL, and resolver caches in your network honor it. Half your fleet resolves the new IP on its next query. The other half keeps hitting the old IP for the rest of the day, because the cached value is still inside its TTL window.

The symptom is intermittent: 50% of requests succeed, 50% time out. No pattern in the dashboards because the load balancer sees nothing wrong. The fix is waiting out the TTL or forcing a flush on every resolver in the path. Each option is slow. The lesson is that TTLs matter before the migration, not after. Short TTLs are cheap insurance.
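"Before the migration" has a concrete deadline: a cache that picked up the record just before you shortened the TTL keeps the old TTL for its full length. A small sketch of that arithmetic, assuming well-behaved caches that honor the published TTL; the function names and dates are illustrative.

```python
from datetime import datetime, timedelta


def latest_ttl_change(cutover: datetime, old_ttl: timedelta) -> datetime:
    """Latest moment to publish the short TTL so every cache has it by cutover.

    A resolver that cached the record just before the TTL was lowered keeps
    the old TTL for up to `old_ttl`, so the change must land that far ahead.
    """
    return cutover - old_ttl


def worst_case_stale_until(record_changed: datetime, ttl_at_change: timedelta) -> datetime:
    """Latest moment a well-behaved cache can still serve the old address."""
    return record_changed + ttl_at_change


cutover = datetime(2026, 6, 1, 9, 0)
print(latest_ttl_change(cutover, timedelta(hours=24)))          # 2026-05-31 09:00:00
print(worst_case_stale_until(cutover, timedelta(seconds=60)))   # 2026-06-01 09:01:00
```

With the TTL lowered a full day ahead, the worst-case stale window after the cutover shrinks from 24 hours to one minute.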

The resolver under load

A local resolver (systemd-resolved, dnsmasq, NodeLocal DNSCache, CoreDNS itself) runs out of UDP source ports because of poor query multiplexing. Concurrent lookups stack up. Each name lookup that used to take 2 ms now takes 5 seconds because of retry timeouts.

The downstream effect is that every service that calls another service through a hostname inherits the resolver latency. Everything looks slow at the same time. The "slow service" you are debugging is downstream of the slow resolver, not the cause. Start with the resolver's metrics (latency, ServFail rate, port-exhaustion counters) before opening the next dashboard.

NXDOMAIN under load

The upstream DNS server returns NXDOMAIN intermittently. The cause might be rate limiting, a flapping zone refresh, or a backend problem at your DNS provider. Whatever the cause, the negative cache picks up the NXDOMAIN and serves it for the negative TTL window. That window is usually 5 to 30 minutes.

For that window, callers in the resolver's cache think the hostname does not exist. Not slow, not failing: doesn't exist. The actual upstream blip was 30 seconds. The customer impact is 15 minutes. Negative caching is a feature, except when the negative answer was wrong, at which point it is the longest-lived bug in your incident timeline.

The split-horizon trap

Internal DNS resolves db.internal.example.com to a private VPC IP. External DNS does not resolve it at all, or resolves it to something different. The service works from your laptop on the VPN. It works from the pod, because the pod's resolver points at internal DNS. It fails from the CI runner that is on a network you did not check.

Split-horizon is a fine pattern when you know it exists. It is a half-day debugging session when you do not. The first thing to ask in any "works for me, doesn't work for them" report is which DNS view each side is using.

The /etc/resolv.conf you forgot

On Kubernetes, a pod's /etc/resolv.conf has a search list written by the kubelet and extended with the node's own search domains. That list usually contains <namespace>.svc.cluster.local (the pod's own namespace), svc.cluster.local, cluster.local, plus whatever the node had. A bare hostname like db goes through the list element by element until one resolves.

That means db resolves to db.default.svc.cluster.local in one namespace and to db.production.svc.cluster.local in another and to nothing at all in a third. Copying a manifest between namespaces silently changes which database the service talks to. The fix is always to use fully qualified names with the trailing dot (db.production.svc.cluster.local.). The bug is always discovered after the migration is live.
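The expansion is mechanical, which is why it is easy to model. A sketch of the order in which a stub resolver tries names, following the usual resolv.conf search and ndots rules; the function name is mine, and the default of ndots 5 is the value Kubernetes normally writes for cluster DNS, so check your own pods before trusting it.

```python
from typing import List


def candidate_names(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """Names a stub resolver tries, in order, for `name`.

    A trailing dot means "absolute, skip the search list". A name with fewer
    than `ndots` dots goes through the search list first and is tried as-is
    last; otherwise it is tried as-is first.
    """
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    as_is = f"{name}."
    if name.count(".") >= ndots:
        return [as_is] + expanded
    return expanded + [as_is]


prod = ["production.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(candidate_names("db", prod)[0])
# db.production.svc.cluster.local.
print(candidate_names("db.production.svc.cluster.local", prod)[0])
# db.production.svc.cluster.local.production.svc.cluster.local.
print(candidate_names("db.production.svc.cluster.local.", prod))
# ['db.production.svc.cluster.local.']
```

The second call is the surprising one: a name that looks fully qualified has only four dots, so it still goes through the search list first. Only the trailing dot takes the search list out of the picture.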

Service discovery is DNS

Kubernetes "service discovery" is CoreDNS, which is DNS. Consul service discovery exposes a DNS interface. ECS service discovery uses Route 53. AWS Cloud Map is DNS. Every fancy abstraction in this space sits on top of standard DNS. Failure modes are the same: caches that lag, resolvers under load, negative caching, the search list.

The trap is that the abstraction makes you stop thinking about DNS. Flip that mental model. When service discovery misbehaves, ask what the underlying DNS resolution looks like before opening the orchestrator's docs.

The TLS handshake that failed because SNI

The TLS client sends Server Name Indication during the handshake. The server uses SNI to pick which certificate to present. If your CNAME chain rewrites the hostname (you point app.example.com at app.example-cdn.com, which points at the CDN's edge), and your client sends the CDN hostname as SNI, the certificate presented may not match the hostname your code expected.

The error message looks like a TLS problem: "cert mismatch," "untrusted issuer," "hostname does not match." The cause is the DNS chain in front of the TLS handshake. The TLS layer is doing exactly what you asked it to. You asked it the wrong question because DNS sent you somewhere unexpected.

How to actually debug DNS

Three dig commands resolve most production DNS questions.

dig +short returns just the answer. Use this from the box where the problem is happening to see what the local resolver currently returns.

$ dig +short api.example.com
10.0.42.7

dig +trace walks from the root servers down through every delegation step. Use this when the answer is wrong and you need to figure out which authoritative server is wrong, or where the cache is.

$ dig +trace api.example.com
; root, then .com, then example.com NS, then api.example.com
; each hop shows you who answered and what they said

dig @server bypasses the local cache by querying a specific resolver. Use @8.8.8.8 or @1.1.1.1 to compare what public resolvers see versus what your local one does.

$ dig @8.8.8.8 api.example.com
$ dig @10.0.0.10 api.example.com
; compare the two answers and you find the disagreement

Run all three from at least two networks: your laptop and a prod pod, or a CI runner and an internal box. The thing you are debugging is almost always the difference between two answers.

How to prevent the next one

Short TTLs cost almost nothing and buy you fast rollback. 300 seconds for anything that might move. 60 seconds when you are actively migrating. The argument against short TTLs ("more query load on the resolver") is real but small; the argument for them ("I can fix a mistake in a minute, not a day") is enormous.

Treat DNS as an SLO. Alert on resolver query latency, SERVFAIL rate, and NXDOMAIN rate, not just on downstream service latency. When the resolver is sick, every dashboard above it lights up red at once, and the resolver dashboards are the only ones that tell you why.

Standardize on fully qualified names with the trailing dot in config files, especially anywhere a manifest can move between namespaces. The trailing dot says "do not append the search list," which prevents the kind of cross-namespace surprise that takes a senior engineer an afternoon to track down.

Before deploying a DNS change, run dig from at least three networks (laptop, prod pod, CI runner) and confirm the answer is what you expect. DNS changes look reversible, and they usually are, but the TTL on the wrong record is your rollback ceiling.

The punchline is not the point

"It's always DNS" is the punchline. The work happens before you reach the punchline: knowing which of the half-dozen failure modes you are actually looking at, and having the muscle memory to run dig +trace before opening another dashboard. Every senior engineer learns this the same way, which is by losing an afternoon to a TTL or a CNAME chain or a misbehaving negative cache. The faster you internalize that DNS deserves your first check, not your last, the fewer afternoons you give up.

Open Session in View Is Spring Boot's Quietest Footgun

Tue, 19 May 2026 12:00:00 GMT

Spring Boot ships with one line of YAML you have probably never set, and a single WARN message at startup that almost nobody reads. The line is spring.jpa.open-in-view. Its default is true. That default is the quietest, most expensive architectural decision the framework makes on your behalf, and almost every Spring Boot service in production has it left as is.

Turning it off is a one-line change. The interesting work is in the consequences.

What it actually does

Open Session in View, OSIV for short, is a request-scoped interceptor. Spring registers an OpenEntityManagerInViewInterceptor that opens a Hibernate Session at the start of an HTTP request and closes it once request processing completes. The session lives across every @Transactional boundary inside that request. Code that runs after your service method returns, including the controller layer, Jackson serialization, and Thymeleaf rendering, still has a session attached.

That is why lazy-loaded associations work when you touch them from a controller method or a view template. The proxy on order.getItems() finds an open session, fires the SQL, and returns. With OSIV off, that same call throws LazyInitializationException.

Why it is on by default

This is Spring 1.x era convenience. Server-rendered apps. The usual pattern was: load an entity in a controller, hand it to a JSP, let the JSP traverse whatever it needs. Lazy initialization made the model object cheap to load up front; OSIV made the rendering not blow up. Spring Boot inherited the default and never broke compatibility.

The Spring team is openly conflicted about it. Reference documentation calls OSIV "controversial" and says the recommendation is to turn it off. Yet the default stays for historical reasons. Since Spring Boot 2.0, leaving the default produces this log line at startup:

spring.jpa.open-in-view is enabled by default. Therefore, database queries may be performed during view rendering. Explicitly configure spring.jpa.open-in-view to disable this warning

If you have never seen that line, search your logs. It is there.

The connection-pool tax

Once the session has touched the database, it holds on to its JDBC connection until the request ends, including the parts of the request that are not doing any database work. That includes JSON serialization, response compression, and the bytes-on-the-wire phase where the client is still draining the body.

Concretely: a service with a HikariCP pool of 10 connections and a 200ms average response time, where the actual SQL portion is 30ms, caps at roughly 50 requests per second with OSIV on (10 connections, each busy for 0.2 seconds per request). The connection is locked for 200ms but only used for 30. Turn OSIV off and the connection is released the moment the service method commits. Same pool, same hardware, and the ceiling moves to roughly 333 requests per second (10 / 0.03), more than six times the traffic before queuing.
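The ceiling is plain division, and a few lines of Java make it easy to plug in your own numbers. This is a back-of-the-envelope sketch that assumes every request holds exactly one connection for a fixed time; PoolThroughputSketch and maxRequestsPerSecond are hypothetical names, not part of HikariCP or Spring.

```java
// Back-of-the-envelope model: pool size divided by how long each request holds a connection.
final class PoolThroughputSketch {
    // Upper bound on requests per second before requests start queuing for a connection.
    static double maxRequestsPerSecond(int poolSize, double holdMillis) {
        return poolSize * 1000.0 / holdMillis;
    }

    public static void main(String[] args) {
        // OSIV on: the connection is held for the whole 200 ms request.
        System.out.println(maxRequestsPerSecond(10, 200)); // 50.0
        // OSIV off: the connection is held only for the ~30 ms of SQL work.
        System.out.printf("%.0f%n", maxRequestsPerSecond(10, 30)); // 333
    }
}
```

Real services will land below the second number, because a transaction holds the connection slightly longer than the SQL itself runs, but the ratio between the two cases is what matters.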

This is the failure mode that produces the 9am-incident shape: a baseline that works fine, then a moderate spike, then queue buildup as connections back up, then 503s. Pool sizing is not the issue. Each connection is being held longer than necessary.

The architectural rot

The performance cost is the easy half. The harder half is what OSIV does to how you write code.

With OSIV on, lazy traversal works from anywhere in the request. A new developer reads a controller method, sees this:

@GetMapping("/orders/{id}")
public OrderResponse get(@PathVariable Long id) {
    Order order = orderService.find(id);
    return new OrderResponse(
        order.getId(),
        order.getCustomer().getEmail(),
        order.getItems().stream().map(ItemResponse::from).toList()
    );
}

There are at least two lazy loads here: order.getCustomer() and order.getItems(). Each may be a separate SQL query, fired from the controller layer, in code that does not import a single repository class. The developer reads the method, decides it looks clean, and moves on.

With OSIV off, the second the controller touches order.getCustomer() outside the service transaction, you get an exception. Now the same code looks like this:

@GetMapping("/orders/{id}")
public OrderResponse get(@PathVariable Long id) {
    Order order = orderService.findWithCustomerAndItems(id);
    return OrderResponse.from(order);
}

The fetching is the service's responsibility. The controller is just the HTTP adapter. The N+1 risk has been pulled into the service layer where you can write a single JOIN FETCH or an @EntityGraph and cover it with a query-count test.

How to know if you have it on

Three signals.

Check spring.jpa.open-in-view in your application.yml. If it is missing, the default is true and OSIV is on. If it is explicitly true, OSIV is on as well. Only an explicit false turns it off.

Grep the startup logs for the WARN message. If you find open-in-view there, OSIV is on.

Watch for controller methods that touch a lazy association without explicit fetching and somehow do not throw. That is OSIV doing the work silently.

Turn it off

spring:
  jpa:
    open-in-view: false

Restart. Now run your integration test suite. Several tests probably break with LazyInitializationException. Do not roll back. Each one is a real architectural seam that OSIV was hiding. Fix them one at a time:

For collection associations, use JOIN FETCH in JPQL or @EntityGraph on the repository method:

@EntityGraph(attributePaths = {"customer", "items"})
Optional<Order> findById(Long id);

For DTO-shaped responses, project directly with a constructor expression or an interface projection. That skips the entity entirely and you never have a lazy proxy to traverse.

For places that genuinely need a hydrated entity in the controller, return it from a service method that does the loading inside its own @Transactional. The controller stays dumb.

When OSIV is actually fine

Prototypes that exist for two weeks. Internal admin UIs with single-digit RPS. Hackathon code. Anywhere request volume is bounded by humans clicking buttons and the connection pool is over-provisioned. In all of those, OSIV trades a tiny architectural smell for a real ergonomics win, and the trade is correct.

Anywhere with real concurrency, a customer-facing API, or a team larger than three: turn it off. The smell becomes a problem.

What it costs to flip the switch

One line of YAML. Half a day to fix the integration tests that surface. A second half-day to write query-count assertions so the new fetch strategies do not silently regress into N+1s. After that, every controller in the codebase becomes easier to reason about, the connection pool serves more traffic, and the WARN line disappears from your logs.

This is one of the cheapest performance and architecture wins available in a Spring Boot service. The reason it goes unfixed is that the default is invisible until it bites you in production, and by then you are debugging a connection-pool incident instead of reading a tutorial about defaults.

Spring Data Derived Queries: Crossing Boundaries

Tue, 19 May 2026 12:00:00 GMT

The single-entity toolkit in part one covered every keyword Spring Data offers when the query stays on the root entity. The interesting questions start when the predicate has to traverse into an associated entity. Order has many OrderItems. How do you find all orders whose items match some condition, and how do you do it without surprising yourself with the resulting SQL?

Derived queries can do nested traversal. They use a syntax you may have seen and not understood: the underscore. This post covers that syntax, the duplicate-parent problem it creates, projections, entity graphs, streams, and the moment when you should stop reaching for derived queries entirely.

The underscore rule

When Spring Data parses a method name like findByItemsProductSku, it has a choice. It can read itemsProductSku as a flat property on Order, or as the nested path items.productSku. The parser is greedy: it tries the longest property name first, then steps back one camelCase token at a time and tries again. If itemsProductSku happens to exist as a property on Order, the parser stops there and never traverses.

The underscore forces the split. findByItems_ProductSku is unambiguous: split at the underscore, traverse from items into productSku, never confuse it with anything else. Use the underscore on every nested traversal. It is explicit, it survives refactoring, and it is the mechanism the Spring Data reference documents for resolving exactly this ambiguity.
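The resolution order described above can be modeled in a few lines of plain Java. This is a simplified sketch of the idea, not Spring Data's actual parser: the entity model is a hand-written map from type name to property names, and PropertyPathSketch, resolve, plainModel, and modelWithFlatProperty are all hypothetical.

```java
import java.util.Map;

// Simplified model of greedy property-path resolution. Not Spring Data's real implementation.
final class PropertyPathSketch {
    // type name -> (property name -> property type name)
    static Map<String, Map<String, String>> plainModel() {
        return Map.of(
            "Order", Map.of("items", "OrderItem", "status", "OrderStatus"),
            "OrderItem", Map.of("productSku", "String"));
    }

    // Same model, but Order also has a flat property that shadows the nested path.
    static Map<String, Map<String, String>> modelWithFlatProperty() {
        return Map.of(
            "Order", Map.of("items", "OrderItem", "itemsProductSku", "String"),
            "OrderItem", Map.of("productSku", "String"));
    }

    // Returns the dotted path for `source` on `type`, or null if nothing matches.
    static String resolve(Map<String, Map<String, String>> model, String type, String source) {
        if (source.isEmpty()) return null;
        Map<String, String> props = model.getOrDefault(type, Map.of());
        int underscore = source.indexOf('_');
        if (underscore > 0) { // explicit traversal point: split here and nowhere else
            String head = uncapitalize(source.substring(0, underscore));
            String headType = props.get(head);
            if (headType == null) return null;
            String rest = resolve(model, headType, source.substring(underscore + 1));
            return rest == null ? null : head + "." + rest;
        }
        String whole = uncapitalize(source);
        if (props.containsKey(whole)) return whole; // longest match wins
        // Otherwise move the split point left, one camelCase token at a time.
        for (int i = source.length() - 1; i > 0; i--) {
            if (!Character.isUpperCase(source.charAt(i))) continue;
            String headType = props.get(uncapitalize(source.substring(0, i)));
            if (headType == null) continue;
            String rest = resolve(model, headType, source.substring(i));
            if (rest != null) return uncapitalize(source.substring(0, i)) + "." + rest;
        }
        return null;
    }

    private static String uncapitalize(String s) {
        return Character.toLowerCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(resolve(plainModel(), "Order", "ItemsProductSku"));            // items.productSku
        System.out.println(resolve(modelWithFlatProperty(), "Order", "ItemsProductSku")); // itemsProductSku
        System.out.println(resolve(modelWithFlatProperty(), "Order", "Items_ProductSku")); // items.productSku
    }
}
```

The second and third lines of output are the whole argument for the underscore: adding an unrelated flat property silently changes what the first spelling means, while the underscore spelling keeps resolving to the nested path.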

Traversing a collection

List<Order> findByItems_ProductSku(String sku);

The generated JPQL is roughly:

SELECT o FROM Order o
  JOIN o.items i
WHERE i.productSku = :sku

The join is an inner join. Orders without items will not appear in the result. If you have an order with zero OrderItems and you want it included anyway, you need a left join, which derived queries do not produce. Switch to @Query.

The duplicate parent problem, then and now

Join a one-to-many in SQL and the database returns one row per matching child. If an Order has three FULFILLED items, the raw SQL result for a fulfilled-items predicate has that order three times. This is the classic duplicate-parent trap.

If you are on Hibernate 5 or older, the JPA result reflects the SQL: the List contains ORD-003 three times. The historical fix is the Distinct keyword:

List<Order> findDistinctByItems_Status(OrderItemStatus status);

That emits SELECT DISTINCT o. Each parent appears once. Easy.
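The row multiplication is easy to reproduce without a database. The sketch below uses plain Java records and streams as a stand-in for the join; the types, the sample data, and the method names are all made up for illustration and say nothing about how Hibernate builds its result.

```java
import java.util.List;

// In-memory stand-in for a one-to-many inner join. Illustrative only.
final class DuplicateParentSketch {
    record Item(String status) {}
    record Order(String number, List<Item> items) {}

    static List<Order> sampleOrders() {
        return List.of(
            new Order("ORD-001", List.of(new Item("PENDING"))),
            new Order("ORD-003", List.of(new Item("FULFILLED"), new Item("FULFILLED"), new Item("FULFILLED"))));
    }

    // One output row per matching child, which is what the raw join produces.
    static List<String> joinRows(List<Order> orders, String status) {
        return orders.stream()
            .flatMap(o -> o.items().stream()
                .filter(i -> i.status().equals(status))
                .map(i -> o.number()))
            .toList();
    }

    // The DISTINCT equivalent: each parent once.
    static List<String> distinctParents(List<Order> orders, String status) {
        return joinRows(orders, status).stream().distinct().toList();
    }

    public static void main(String[] args) {
        System.out.println(joinRows(sampleOrders(), "FULFILLED"));        // [ORD-003, ORD-003, ORD-003]
        System.out.println(distinctParents(sampleOrders(), "FULFILLED")); // [ORD-003]
    }
}
```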

If you are on Hibernate 6 or newer (and you are, because Spring Boot 3 and 4 ship with it), the story changed. Hibernate 6 deduplicates entity-returning query results in memory by primary key. The SQL still returns the same multiplied rows, but by the time the List reaches your code each parent appears once. The Distinct keyword is no longer needed to make the Java result correct.

Distinct still matters in three cases. First, projections (interface, record, raw column) are not entities and are not deduplicated, so a DTO projection over a one-to-many join produces row-per-child unless you add Distinct. Second, the SQL itself is still wasteful without DISTINCT: the database transfers N rows that Hibernate then collapses to 1 entity. On a large result set this is real bandwidth and CPU. Third, count queries (countBy) operate at the SQL level and need countDistinct if you want a count of unique parents, not unique join rows. So Distinct is no longer a correctness fix for entity queries on modern Hibernate, but it is still the right choice for efficiency and for non-entity returns.

Combining parent and child predicates

List<Order> findByStatusAndItems_Status(
    OrderStatus orderStatus,
    OrderItemStatus itemStatus);

Generated JPQL:

SELECT o FROM Order o
  JOIN o.items i
WHERE o.status = :orderStatus
  AND i.status = :itemStatus

One join, two predicates: one on the parent, one on the joined child. This is the natural shape. Same duplicate-parent caveat applies: add Distinct if you want each parent once.

The trap: two predicates on the same collection

This is the part that catches everyone. Consider:

List<Order> findDistinctByItems_StatusAndItems_QuantityGreaterThan(
    OrderItemStatus status, int quantity);

What you read this as: orders where some item is FULFILLED, and some item has quantity greater than N. The items could be different.

What you get: orders where some single item is both FULFILLED and has quantity greater than N. Spring Data emits one join, not two. Both predicates apply to the same joined row.

SELECT DISTINCT o FROM Order o
  JOIN o.items i
WHERE i.status = :status
  AND i.quantity > :quantity

If your data model needs the looser semantic (any FULFILLED item AND any over-quantity item, possibly different items), you need two separate joins. Spring Data will not write that for you. Use a Specification, or write a @Query that does EXISTS twice with two correlated subqueries. The same rule applies to any pair of predicates that both traverse through the same collection.
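The difference between the two readings is easiest to see on a single order. The following Java sketch models both semantics over an in-memory list; it is an illustration of the logic only, and the record and method names are hypothetical.

```java
import java.util.List;

// Two readings of "status = X AND quantity > N" over one collection. Illustrative only.
final class SameCollectionSketch {
    record Item(String status, int quantity) {}

    // What the single join expresses: ONE item satisfies both predicates.
    static boolean sameItemMatches(List<Item> items, String status, int minQuantity) {
        return items.stream()
            .anyMatch(i -> i.status().equals(status) && i.quantity() > minQuantity);
    }

    // The looser reading: two independent existence checks, possibly on different items.
    static boolean anyItemsMatch(List<Item> items, String status, int minQuantity) {
        return items.stream().anyMatch(i -> i.status().equals(status))
            && items.stream().anyMatch(i -> i.quantity() > minQuantity);
    }

    public static void main(String[] args) {
        // One fulfilled item with a small quantity, one pending item with a large one.
        List<Item> items = List.of(new Item("FULFILLED", 1), new Item("PENDING", 10));
        System.out.println(sameItemMatches(items, "FULFILLED", 5)); // false
        System.out.println(anyItemsMatch(items, "FULFILLED", 5));   // true
    }
}
```

An order like this one is exactly the row a derived query will not return, and a useful fixture for a test that pins down which semantic you intended.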

Existence and counting through a join

boolean existsByItems_ProductSku(String sku);
long countDistinctByItems_Status(OrderItemStatus status);

existsBy compiles to a cheap SELECT EXISTS(SELECT 1 FROM ...) with no entity hydration. countDistinctBy emits COUNT(DISTINCT o.id) and gives you the number of distinct parents that have at least one matching child. Either is the right shape for these questions; calling .size() on the result of a findDistinctBy... method is the wrong shape (it loads every row).

Projections, in three flavors

You do not always want the full entity. You want a few columns and a quick render. Spring Data has three projection styles.

Interface-based projections. Declare an interface with getters. Spring returns proxies that implement it.

public interface OrderSummary {
    Long getId();
    String getOrderNumber();
    BigDecimal getTotalAmount();
    OrderStatus getStatus();
}

List<OrderSummary> findSummaryByStatus(OrderStatus status);

The generated SELECT projects only the columns the interface declares. No entity, no first-level cache pollution, no lazy associations. Cheap and read-only.

Class-based projections (records). Declare a record matching the columns. Spring uses the canonical constructor.

public record OrderDto(
    Long id,
    String orderNumber,
    String customerEmail,
    BigDecimal totalAmount,
    OrderStatus status,
    Instant createdAt) {}

List<OrderDto> findDtoByStatus(OrderStatus status);

Records are the cleanest fit because the constructor parameter names match the entity property names, and Spring Data maps them by name. The result is immutable and serializable. This is usually the shape you want for HTTP responses.

Dynamic projections. Pass Class<T> at call time and Spring Data picks the projection per call.

<T> List<T> findByCustomerEmail(String customerEmail, Class<T> type);

Now the same method backs multiple consumers: a list view that wants OrderSummary, an export job that wants the full entity, a CSV builder that wants OrderDto. One query method, three shapes. Useful for sharing query logic across read paths.

EntityGraph: eager-load a collection in the derived query

@EntityGraph(attributePaths = {"items"})
List<Order> findWithItemsByCustomerEmail(String customerEmail);

This attaches a fetch hint to the derived query. Hibernate emits a single JOIN FETCH for the items collection so the parents and their children come back in one round trip. Outside the persistence context, calling order.getItems() does not trigger another query, and never throws LazyInitializationException.

EntityGraph plus a one-to-many JOIN FETCH plus pagination is a trap. Hibernate has to either fetch all rows into memory and paginate in Java (with a warning in the log) or refuse to combine them. For paginated reads, fetch the parent page first, then load the children with a separate WHERE id IN (...) query, or use subselect fetching on the collection (@Fetch(FetchMode.SUBSELECT)).

Streaming results

@Transactional(readOnly = true)
public void processPaidOrders() {
    try (Stream<Order> stream = repo.streamByStatus(OrderStatus.PAID)) {
        stream.forEach(this::process);
    }
}

A Stream return type gives you a cursor-backed iteration. The database holds an open cursor; Hibernate reads rows in batches as the stream advances. Two strict requirements: the stream must be consumed inside an open transaction, and it must be closed (use try-with-resources). For a batch job that processes a hundred thousand rows, streaming keeps memory roughly flat regardless of result size, provided the JDBC driver is actually fetching in batches and you detach or clear processed entities so the persistence context does not accumulate them.
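The close requirement is a property of java.util.stream.Stream itself, so it can be demonstrated with the JDK alone. In the sketch below an onClose handler stands in for "release the database cursor"; the class and method names are hypothetical, and no database is involved.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

// JDK-only demonstration: terminal operations do not close a stream, try-with-resources does.
final class StreamCloseSketch {
    static boolean closedAfterTryWithResources() {
        AtomicBoolean released = new AtomicBoolean(false);
        try (Stream<String> stream = Stream.of("ORD-001", "ORD-002")
                .onClose(() -> released.set(true))) { // stand-in for releasing the cursor
            stream.forEach(order -> { /* process */ });
        }
        return released.get();
    }

    static boolean closedAfterPlainForEach() {
        AtomicBoolean released = new AtomicBoolean(false);
        Stream<String> stream = Stream.of("ORD-001", "ORD-002")
            .onClose(() -> released.set(true));
        stream.forEach(order -> { /* process */ });
        return released.get(); // the close handler never ran
    }

    public static void main(String[] args) {
        System.out.println(closedAfterTryWithResources()); // true
        System.out.println(closedAfterPlainForEach());     // false
    }
}
```

With a repository stream, the second shape is the one that leaves the cursor and its connection open until the transaction ends.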

When to leave derived queries behind

Derived queries cover most repository methods you will ever write. The cases where they do not work are clear:

Dynamic predicates. If the predicate changes based on which fields the caller filled in (a search form), use Specification. Spring Data composes specifications with and, or, and not as code, not as a method name.

Parenthesized boolean logic. Derived queries have no parens. findByAOrBAndC is always A OR (B AND C). If you need (A OR B) AND C, switch to @Query or Specification.

Two predicates on the same collection that should match different rows. Already covered. Use Specification with two joins, or a @Query with two EXISTS subqueries.

Complex joins or subqueries. Group by, having, window functions, lateral joins. None of these have a derived-query syntax. Write the JPQL or the native SQL.

Performance. If your read path is performance-critical and entity hydration is the bottleneck, a projection might fix it. If not, drop to JdbcTemplate or a native query. Repository methods are an abstraction. Sometimes the right answer is to step around them.

Companion code

The repo at github.com/umur/spring-data-derived-queries implements every method in this post and tests it against a Postgres container via Testcontainers. OrderRepositoryNestedIT covers nested traversal and the same-collection trap. OrderRepositoryProjectionIT covers interface, record, and dynamic projections, plus the EntityGraph case.

The two parts together describe everything Spring Data derived queries can do on a single aggregate. Dynamic SQL, parenthesized boolean logic, and two-join semantics on a single collection are all outside what derived queries can express. Inside those limits, the syntax stays concise. Queries are validated at startup. The methods read like English. That is a good trade for most repository code.