This post is a critical analysis of How AI Impacts Skill Formation by Judy Hanwen Shen and Alex Tamkin at Anthropic. For a summary of the study’s findings, see my companion post AI Makes You Faster But Dumber?.

Why This Study Matters

Let me start with what’s genuinely praiseworthy: Anthropic, an AI company that directly profits from increased AI usage, published a rigorous study showing a meaningful downside of using their own technology. That takes institutional integrity. The study itself, a pre-registered, randomized controlled trial with screen recordings and careful evaluation design, is methodologically stronger than most work in this space, which tends toward observational surveys or anecdotal evidence.

The core finding (17% lower quiz scores for AI-assisted developers, p=0.01) is a real signal worth taking seriously. But as with any study, the design choices shape what you can and can’t conclude. Let me dig into both the strengths and the limitations.

What the Study Gets Right

1. The RCT Design

This is the right methodology for this question. Observational studies of AI usage suffer from massive selection effects; maybe developers who use AI more were already less inclined to learn deeply. By randomly assigning participants to AI vs. no-AI conditions, the researchers can make causal claims about the effect of AI access on skill acquisition. That’s a meaningful step up from “we surveyed developers and here’s what they said.”

2. Measuring What Matters

The evaluation design drew from established CS education research to test four types of coding knowledge: debugging, code reading, code writing, and conceptual understanding. They deliberately de-emphasized code writing on the quiz (correctly reasoning that syntax recall is less important in an AI-augmented world) and focused most heavily on debugging, reading, and conceptual understanding, the skills most important for overseeing AI-generated code. This is thoughtful, forward-looking evaluation design.

3. The Qualitative Depth

Most studies would stop at the headline stat. This one goes further by watching screen recordings of every participant and building a taxonomy of interaction patterns. The six-pattern framework (from AI Delegation to Conceptual Inquiry) is easily the most valuable contribution of the paper. It moves the conversation from “AI is good/bad for learning” to “which ways of using AI help or hurt learning”, a far more actionable question. (I break down all six patterns with their outcomes in my summary of the study.)

4. Iterating Through Pilots

The paper describes four pilot studies where they discovered and fixed real problems: non-compliance (participants in the control group secretly using AI), local item dependence in quiz questions, and non-Trio syntax barriers confounding their results. This kind of honest methodological iteration builds confidence in the final results.

Where the Study Falls Short

1. The Sample Size Problem

With n=52 (26 per group), this study is underpowered for the richness of conclusions it tries to draw. The main effect (quiz score difference) is adequately powered given the medium-to-large effect size (d=0.738). But the qualitative analysis subdivides the AI group into six behavioral clusters with as few as 2-4 participants each. You cannot draw robust conclusions from n=2 (the “Generation-Then-Comprehension” group) or n=4 (multiple clusters). The patterns are interesting hypotheses, not findings.

To their credit, the authors acknowledge this: “Our qualitative analysis does not draw a causal link between interaction patterns and learning outcomes.” But the blog post and the paper’s discussion section lean heavily on these patterns as if they are more than suggestive.
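
As a rough sanity check on the power question, here’s a back-of-envelope calculation. It is not the authors’ analysis: it simply assumes the reported effect size (d=0.738), a two-sided two-sample t-test at alpha=0.05, and equal variances, and compares the 26-per-arm main comparison against a hypothetical four-person cluster.

```python
# Back-of-envelope power check (not the authors' analysis).
# Assumes the reported d = 0.738 and a two-sided two-sample t-test at alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Main comparison: 26 participants per arm.
main_power = analysis.power(effect_size=0.738, nobs1=26, alpha=0.05, ratio=1.0)

# A four-person behavioral cluster compared against the 26-person control arm,
# optimistically assuming the same underlying effect size.
cluster_power = analysis.power(effect_size=0.738, nobs1=4, alpha=0.05, ratio=26 / 4)

print(f"Power, 26 vs 26: {main_power:.2f}")
print(f"Power,  4 vs 26: {cluster_power:.2f}")
```

Run as written, the main comparison lands in the rough vicinity of conventional power thresholds, while the small-cluster comparison falls far below them, which is exactly the asymmetry described above.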

2. Short-Term Quiz ≠ Skill Formation

The title says “skill formation.” What the study actually measures is short-term retention on a quiz taken immediately after a 35-minute task. That is closer to a test of immediate recall than a measure of genuine skill development.

Real skill formation happens over weeks to months through repeated practice, spaced repetition, and application to varied problems. Someone who scored lower on an immediate quiz might retain the same amount (or more) a week later if the AI interaction prompted deeper encoding of certain concepts. Or they might retain less. We simply don’t know, because the study doesn’t measure it.

The pilot study is revealing here: it showed an even larger effect (Cohen’s d=1.7 for quiz scores), which the researchers wisely treated as potentially inflated. But the leap from “immediate quiz after 35 minutes of coding” to “skill formation” is a significant one that deserves more skepticism.
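
To see why a small pilot can legitimately be treated as inflated, consider how noisy Cohen’s d is at small sample sizes. The pilot’s group sizes aren’t reported here, so the sketch below uses a hypothetical eight-per-group pilot and assumes the main study’s d=0.738 as the true effect; it simply shows how widely the estimated d swings under those assumptions.

```python
# Illustration only: sampling variability of Cohen's d in a small pilot.
# n_pilot = 8 per group is a hypothetical choice (the actual pilot size
# isn't reported here); the true effect is set to the main study's d = 0.738.
import numpy as np

rng = np.random.default_rng(0)
true_d, n_pilot, n_sims = 0.738, 8, 10_000

d_hats = np.empty(n_sims)
for i in range(n_sims):
    control = rng.normal(0.0, 1.0, n_pilot)
    treated = rng.normal(true_d, 1.0, n_pilot)
    pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
    d_hats[i] = (treated.mean() - control.mean()) / pooled_sd

print(f"5th-95th percentile of estimated d: "
      f"{np.percentile(d_hats, 5):.2f} to {np.percentile(d_hats, 95):.2f}")
print(f"Share of simulated pilots with d >= 1.7: {(d_hats >= 1.7).mean():.1%}")
```

Even with a true effect the size of the main study’s, the spread is wide enough that an eye-catching pilot estimate is a weak guide to what you’ll see at full scale.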

3. The Trio Proxy Problem

The study uses a single, relatively obscure Python library (Trio) as a proxy for “learning new skills on the job.” But learning Trio in a controlled 35-minute exercise is fundamentally different from learning a technology you’ll use daily for months. Several important differences:

  • No stakes: Participants were paid a flat fee regardless of performance. In a real job, you need to understand the code you write because you’ll be maintaining it, debugging it at 2 AM, and explaining it in code reviews.
  • No iteration: Real learning involves using a concept, forgetting it, looking it up again, making mistakes in production, and gradually internalizing it. A single 35-minute exposure followed by a quiz captures none of this.
  • No social context: Real-world learning involves asking colleagues, attending team discussions, reviewing others’ code, and teaching concepts to junior developers. These social reinforcement mechanisms are entirely absent.

4. Chat-Based AI Is Already Outdated

The study used a chat-based AI assistant (GPT-4o) where participants had to explicitly compose queries. But the AI coding landscape has already shifted dramatically toward agentic and autocomplete paradigms: tools like GitHub Copilot, Cursor, and Claude Code that generate code proactively, sometimes without even being asked.

The authors acknowledge this in a footnote: “This setup is different from agentic coding products like Claude Code; we expect that the impacts of such programs on skill development are likely to be more pronounced.” This is probably correct, but it means the study’s findings are a lower bound on a phenomenon that’s already being overtaken by more aggressive forms of AI integration.

5. Missing the Human Assistance Counterfactual

The study compares AI assistance to no assistance. But in real workplaces, the counterfactual to AI help isn’t “struggling alone”; it’s “asking a colleague.” A junior developer stuck on an async error would ask a senior engineer, and that interaction might involve exactly the kind of conceptual explanation that the study’s high-scoring AI users sought.

How does AI-assisted learning compare to human-mentored learning? The paper lists this as future work, but it’s arguably the more relevant comparison for workplace policy decisions.

The Uncomfortable Meta-Question

There’s a deeper tension in this research that I think deserves more attention. The study frames the problem as: “AI might prevent skill formation.” But it implicitly assumes that the current skills junior developers need to form are the right skills to measure.

Consider an analogy: when calculators became ubiquitous, the ability to do long division by hand became less important, while the ability to set up problems correctly and interpret results critically became more important. The skill landscape shifted, but we wouldn’t say calculators “prevented math skill formation”; they changed which math skills matter.

Is it possible that the “skills” being tested here (debugging Trio syntax errors, reading async boilerplate) are precisely the skills that become less important as AI gets better? And that other skills (knowing when to be suspicious of AI output, composing effective prompts, making architectural decisions AI can’t) are the emerging competencies we should be measuring?

The study’s evaluation design, while rigorous within its scope, doesn’t capture these higher-order skills. It measures mastery of library-specific concepts. In a world where AI handles implementation details, mastery of implementation details may be the wrong metric.

I don’t think this invalidates the study; the ability to read and debug code remains genuinely important, and the study demonstrates a real mechanism by which AI can erode it. But it does suggest that the framing of “skill formation” might be narrower than the title implies.

What I’d Want to See Next

If I could design the follow-up studies, here’s what I’d prioritize:

  1. Longitudinal measurement: Same study, but test retention at 1 day, 1 week, and 1 month. Does the AI group catch up through spaced review? Or does the gap widen?

  2. Agentic AI condition: Add a third condition using an autocomplete/agentic tool like Copilot. If chat-based AI produced a 17% drop, what does fully agentic AI produce?

  3. Guided AI interaction: A condition where the assistant is designed to promote learning by refusing to give full code, asking Socratic questions, and requiring the user to attempt a solution first. Does intentional design close the gap? (A sketch of what such an assistant’s instructions might look like follows this list.)

  4. Real-world replication: Run a similar study within actual engineering teams over weeks, measuring not just quiz performance but code quality, debugging success on novel problems, and architectural decision-making.

  5. Human mentorship comparison: Add a condition where participants can ask a human expert instead of AI. This is the comparison that matters most for workplace policy.
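
To make the third condition concrete, here is one hypothetical way a “learning-mode” assistant could be configured. This is my own sketch using the Anthropic Messages API, not anything from the study; the system prompt wording and the model name are assumptions.

```python
# Hypothetical sketch of a "guided" AI condition: a system prompt that
# pushes the assistant toward Socratic tutoring instead of code delegation.
# The prompt wording and model name are assumptions for illustration.
import anthropic

LEARNING_MODE_PROMPT = """You are a coding tutor, not a code generator.
Rules:
- Never provide a complete working solution.
- Respond to questions with guiding questions or small hints first.
- Ask the user to show their own attempt before discussing fixes.
- Explain the underlying concept (e.g., how Trio nurseries schedule tasks)
  rather than just the fix."""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=512,
    system=LEARNING_MODE_PROMPT,
    messages=[{"role": "user", "content": "My trio.run() call hangs. Can you just fix it?"}],
)
print(response.content[0].text)
```

The interesting experimental question is whether instructions like these hold up against participants who just want the task done, and whether the extra friction preserves learning without tanking productivity.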

The Bottom Line

This is a good study that asks an important question and provides a credible, if preliminary, answer. The main finding, that AI assistance reduces immediate recall of newly learned concepts, is robust enough to take seriously. The interaction pattern taxonomy is a genuinely useful framework for thinking about AI usage.

But the leap from “lower scores on an immediate quiz after 35 minutes” to “AI stunts skill formation” is larger than the evidence supports. The study measures a real phenomenon, but the real-world implications depend on dynamics (longitudinal learning, workplace incentives, evolving skill requirements) that this study doesn’t capture.

Use this as a data point, not a verdict. The most useful takeaway isn’t “AI is bad for learning”; it’s that cognitive engagement is what drives learning, and AI makes it easy to disengage. If you’re looking for concrete ways to stay engaged while still using AI productively, I’ve put together a practical guide for developers and managers based on these findings.

Read the full paper for all the methodological details.