“Everyone wants the spotlight, but only a few can stand the heat.”

Code is getting cheap, so my job is becoming “picking the right bets”

A few years ago, if you wanted to ship something new, the hard part was writing the code. Today, the hard part is deciding what to build before the code writes itself.

Technology · March 14, 2026
By Jimmy Nguyen
6 min read

The day code stopped being the bottleneck

We now have controlled evidence that AI coding assistants can materially speed up implementation for certain classes of tasks. In one well-cited controlled experiment (developers building the same small server), the group with an AI pair-programmer finished about 55.8% faster than the group without it.

That’s not “10% nicer velocity”. That’s “the bottleneck moved”.

And it’s not just lab-style tasks. Field evidence is messy (real companies are messy), but it still points in the same direction: in a real-world field experiment involving developers at Microsoft and Accenture, researchers found suggestive increases in output (e.g., more pull requests per week) after adopting an AI coding assistant—often reported in the low double-digit to ~20% range depending on setting and method.

Even more bluntly: a field experiment written up by the Bank for International Settlements (using a coding assistant trial at Ant Group) reported a ~55% increase in code output (measured as lines of code produced) for the group using the tool, with the biggest statistically significant gains showing up among junior staff.

So yes: code is not literally free. But “write the code” is becoming less scarce than “decide what’s worth coding.”

That shift changes what becomes valuable.

The market has been screaming this for years

If you want to know what the world values, look at what it pays for.

📈 Quant Trading

The industry has been paying eye-watering money to people early in their careers (who, importantly, haven’t spent decades “building a portfolio”). Intern pay at Jane Street annualises to around $250,000 for some roles, even though the firm explicitly says it doesn’t expect a background in finance.

And there are credible reports of new-graduate offers far above typical software salaries: one widely reported example is a ~$508k starting package for a new graduate hired into a trading role in Hong Kong.

Quant firms also show up at the top of entry-level compensation lists for engineers: Hudson River Trading at roughly $400k on average, with Jane Street around $350k.

🧠 Frontier AI

In 2025, reporting said Meta’s CEO Mark Zuckerberg was personally leading a recruiting push for a “superintelligence lab”, with offers described as reaching up to $300 million over four years for some top-tier research talent (and an unusually front-loaded first year).

The same reporting includes pushback from Meta, with a spokesperson calling the numbers misrepresented and internal leadership arguing that the "$100 million offers" narrative was exaggerated and applied only to a small number of roles.

A separate report by Reuters described Meta reorganising its AI efforts under “Meta Superintelligence Labs”, led by Alexandr Wang and co-led by Nat Friedman, and also noted public claims about extremely large recruitment bonuses.

The bottleneck is not "someone who can implement CRUD screens."
It’s "someone who can make the right high-uncertainty bets when the chips are insanely expensive."

Why research feels different from engineering

Engineering usually has a target: a spec is met, a test passes, the latency drops, the customer stops shouting. Research is the opposite.

A lot of the time, you don’t even know if the thing you’re trying to do is possible. That uncertainty isn’t a vibe. It’s structural.

In computer science, there are formal results showing that no universal procedure can decide whether an arbitrary program halts. Alan Turing’s work underpins the modern statement of that impossibility, and modern treatments lay out the self-referential style of argument: assume a universal decider exists, then construct a program that breaks it.

I’m not saying “research = halting problem” in a strict mathematical sense. I’m saying the shape is similar:

  • You can spend a year on a question.
  • You may learn a lot.
  • The answer can still be "no".
  • And you often can’t know that upfront.

That’s why, when code gets cheaper, the comparative advantage shifts to people who can reason well under uncertainty and cut losing paths fast.

The real moat is taste: choosing which problems deserve a week of your life

Imagine I’m standing in front of a wall of a million buttons.

🔘 Most buttons: do absolutely nothing.

🔥 Some buttons: set $10 on fire immediately.

💎 A very small number: print $1,000,000.

My job isn’t "press buttons quickly." Any motivated person (or AI agent) can press buttons quickly.

My job is to build an instinct for which buttons are even worth pressing, before I waste the week.

That instinct—call it taste, judgement, problem selection, intuition under uncertainty—seems to transfer across domains more than people expect. There’s a long-running pattern of physics and maths graduates migrating into finance because the underlying skill is modelling, uncertainty, and inference—not memorising a specific industry playbook.

A concrete example: Document Automation

Say I’m running a team building document automation for banks (loan packs, covenants, KYC—pick your favourite PDF nightmare). A stakeholder asks: "Can we extract this new document type too?"

Old world: I’d estimate engineering effort, build a prototype, iterate, ship, pray.
New world (where code is cheaper): The real question is not “can I code this?” It’s:

  • Is there enough signal in the document layout/content to extract reliably?
  • Do we have access to enough real samples to generalise?
  • What’s the failure cost (compliance, money, reputational risk)?
  • Can we define a measurable pass/fail that maps to value?

In practice, the highest-value work is often a scrappy research loop:

1. Collect samples: gather 30–100 realistic samples (not perfect "marketing PDFs").
2. Define a rubric: write a scoring rubric tied to business risk.
3. Baseline: run a quick baseline with a couple of approaches.
4. Decide: scale it, redesign the problem, or kill it.
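The decision step of that loop can be sketched in a few lines. Everything here is illustrative: the thresholds, the `BaselineResult` fields, and the rubric scores are made-up assumptions, not a real system.

```python
from dataclasses import dataclass

# Hypothetical thresholds -- tune these to the business risk, not to the demo.
PASS_SCORE = 0.95          # rubric score needed before scaling engineering
MAX_CRITICAL_FAILURES = 0  # compliance-relevant misses we tolerate

@dataclass
class BaselineResult:
    approach: str
    score: float            # fraction of rubric checks passed across samples
    critical_failures: int  # failures on fields tied to compliance or money

def decide(results: list[BaselineResult]) -> str:
    """Step 4 of the loop: scale it, redesign the problem, or kill it."""
    best = max(results, key=lambda r: r.score)
    if best.score >= PASS_SCORE and best.critical_failures <= MAX_CRITICAL_FAILURES:
        return f"scale: {best.approach}"
    if best.score >= 0.7:  # there is signal, but not enough: reshape the problem
        return f"redesign: {best.approach}"
    return "kill"

results = [
    BaselineResult("layout-rules", 0.62, 3),
    BaselineResult("llm-extraction", 0.91, 1),
]
print(decide(results))  # -> redesign: llm-extraction
```

The point is not the specific numbers; it is that the kill/redesign/scale decision is written down before the baseline runs, so the week of engineering only happens if the bet survives.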

That’s “research.” And it is increasingly the thing that decides whether the next month of engineering is a win or a waste.

AI can flip more coins now, but it still struggles to pick the right coins

There’s a reason AI coding agents are improving fast: software tasks often come with a built-in scoring function.

A good illustration is SWE-bench, an evaluation framework built from real GitHub issues and pull requests. In SWE-bench, you can measure success using tests and reference fixes (a kind of “ground truth” for many tasks). This enables the entire ecosystem of “agent writes patch → run tests → keep/iterate.”

People are already trying to push reinforcement learning into this space. For example, one RL approach (SWE-RL) explicitly uses rule-based rewards such as similarity to a ground-truth patch, and reports results on SWE-bench Verified.

But even here, you can see the cracks: once a benchmark becomes famous, it risks becoming training data. In February 2026, OpenAI publicly said SWE-bench Verified was becoming contaminated and that many remaining tasks were flawed (e.g., tests that reject correct solutions), making the benchmark a less meaningful measure of “real” frontier coding capability.

Now compare that to research in the wild.

Often there is no unit test for “is this research direction the right one?” The reward function is fuzzy, delayed, and contested. That’s why most “automated research” demos currently look like very fast local search inside a narrow space.

A perfect recent example: Andrej Karpathy shared an “autoresearch” style loop where an agent modifies training code, runs short experiments, keeps changes that improve a metric, and repeats—reportedly completing 126 experiments in an overnight run.

That’s genuinely cool. It’s also revealing:

  • The loop can flip many more coins per hour than a human.
  • The “coins” are mostly local knobs (hyperparameters, small code tweaks).
  • The system still relies on a human-chosen objective, dataset, and search boundary.

It gets you faster flipping. It doesn’t automatically give you better taste.
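A toy version of that kind of loop makes the limitation concrete. Here the fake metric secretly peaks at `lr = 0.1`; the loop can only perturb a knob and keep improvements. All names and numbers are illustrative, and note that the objective, the search space, and the experiment budget are all chosen by a human before the loop starts.

```python
import random

def short_experiment(lr: float) -> float:
    """Stand-in for 'train briefly and read a metric'. The hypothetical
    metric peaks at lr = 0.1; the loop does not know that."""
    return -((lr - 0.1) ** 2)

def autoresearch_loop(n_experiments: int = 126, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    best_lr, best_metric = 1.0, short_experiment(1.0)
    for _ in range(n_experiments):
        # local knob-twiddling: perturb the current best, keep improvements
        candidate = max(1e-4, best_lr * rng.uniform(0.5, 1.5))
        metric = short_experiment(candidate)
        if metric > best_metric:
            best_lr, best_metric = candidate, metric
    return best_lr, best_metric

lr, metric = autoresearch_loop()
print(lr, metric)  # converges toward the peak the human's objective defined
```

Run it and the loop climbs reliably, which is exactly the observation above: fast flipping inside a human-drawn boundary, not an autonomous choice of which boundary is worth drawing.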

Why taste is hard to scrape from the internet

If I had to bet on one reason “research taste” is hard to automate, it’s this: the most valuable training data for it is missing.

"You can scrape arXiv for the polished path that worked. You can’t scrape the 97 things a good researcher killed quickly because they smelled wrong."

Public research is biased toward successes (or at least publishable narratives). Entire fields acknowledge that “null results” and “non-findings” are disproportionately not published, which distorts the public evidence base. Even in domains actively trying to fix it, you’ll see explicit discussion of “file drawers” of well-designed experiments that never made it into the literature—wasting resources and causing others to repeat dead ends.

And in the most commercially valuable contexts—quant trading, competitive AI labs, internal product R&D—there are further incentives not to publish the real loop (including the failures). Even mainstream reporting describes top trading firms as secretive and highlights how far their compensation can go.

So the models end up learning from the “highlight reel”, while human researchers learn from the full messy tape: half-formed hypotheses, dead ends, and tacit heuristics for when to stop.

That doesn’t mean the gap is permanent. It means the bottleneck, right now, is still the chooser.

💡 What I’m taking from this as an engineering leader

When code was expensive, the best teams squeezed waste out of implementation. As code gets cheaper, the best teams squeeze waste out of decisions.

The Defensible Edge

  • Treat product and technical decisions as bets, not tasks.
  • If the bet is small: ship fast, learn fast.
  • If the bet is big: invest in research cycles that de-risk the bet before scaling engineering.

This also makes me bullish on tools that make research legible: not just “we tried stuff,” but “here’s what we tried, what failed, why we stopped, and what we learned.”

Translating this into product ideas

  • An internal “research ledger” that logs hypotheses, experiments, outcomes, and “kill reasons” (basically: make the invisible negative data visible).
  • A lightweight system that forces every initiative to define: success metric, failure mode, stop rule, and expected value.
  • For a project like Feedsion, a “prediction-to-outcome” loop where the system tracks what it recommended, what happened later, and how that should update the “taste model” (a structured feedback loop, not vibes). The AI Index’s observation that industry dominates frontier model production is a reminder that the winning orgs will be the ones that operationalise these loops at scale.

When everyone can write code quickly, the scarce resource is judgement.
