Software development is changing fast. Agents write code, engineers review and orchestrate. Here's what agentic engineering looks like from the inside: the tools, the costs, and the new skills that matter.
Remember the post below by Andrej Karpathy about "vibe coding" last year? That casual idea where you just let the LLM handle stuff and go with the flow?
I chuckled at it back then. It felt like a fun experiment for side projects - cool demos, weekend hacks, nothing serious.
A year later, I'm eating my words.
My workflow has completely flipped. I went from writing code with occasional AI suggestions to describing what I want and occasionally stepping in when things derail. It happened faster than I expected, and honestly, it still catches me off guard some days.
Let me walk you through what I've actually experienced.
Here's the thing nobody tells you about working with AI agents: it doesn't feel like coding anymore. It feels like managing.
A typical session for me or any software developer using AI tools now looks something like this:
On good days, it's magic. On bad days, you're debugging AI-generated spaghetti.
Put another way: I'm not writing software anymore. I'm commissioning it. Read the post below by Karpathy:
Karpathy himself said he went from 80% manual coding to 80% agent coding in just a few weeks.
Boris Cherny at Anthropic mentioned that 100% of their code is now written by Claude - he shipped 22 PRs one day, 27 the next. All AI-generated.
These aren't press releases. These are real people sharing their actual workflows.
That diagram above? That's the developer's new daily life. The loop between "Agent Writes Code" and "Agent Runs Tests" happens without them touching anything.
Let me tell you about something I built recently. I've been working on BuildTrack - a construction project management platform. One of the features was an enterprise-grade Equipment/Asset Tracking system. You can find this project on my GitHub.
Here's what that meant: database models for Equipment, EquipmentAssignment, EquipmentMaintenance, EquipmentDocument. API routes with Zod validation for CRUD operations, check-in/check-out flows, maintenance scheduling, utilization analytics. Plus Terraform infrastructure, Docker containerization, AWS Secrets Manager integration, the whole stack.
A year ago, this would've been a two-week sprint minimum. With AI agents? Most of it shipped in days.
I started with the Prisma schema. Described the data model I wanted - equipment items, who they're assigned to, maintenance records, related documents. The agent generated the schema, but here's where it got interesting: it also fixed relations I hadn't specified. When I said "Equipment belongs to a Tenant and can be assigned to Users," it inferred the back-relations I'd forgotten on the User model.
Then came the API routes. I described the endpoints I needed. The agent wrote them with proper Zod validation, error handling, pagination - stuff I would've added but might have been lazy about in a first pass.
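To make this concrete, here's a simplified sketch of the validation layer one of those generated route handlers had. The names (`CheckOutRequest`, the field names) are illustrative, not the actual BuildTrack code, and I've inlined a minimal hand-rolled validator in place of Zod so the example is self-contained:

```typescript
// Illustrative sketch of a check-out endpoint's input validation.
// The real routes use Zod schemas; this inlines the equivalent checks.
interface CheckOutRequest {
  equipmentId: string;
  assignedToUserId: string;
  expectedReturnDate?: string; // ISO date string
}

function parseCheckOutRequest(body: unknown): CheckOutRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be an object");
  }
  const b = body as Record<string, unknown>;
  if (typeof b.equipmentId !== "string" || b.equipmentId.length === 0) {
    throw new Error("equipmentId is required");
  }
  if (typeof b.assignedToUserId !== "string" || b.assignedToUserId.length === 0) {
    throw new Error("assignedToUserId is required");
  }
  if (b.expectedReturnDate !== undefined && typeof b.expectedReturnDate !== "string") {
    throw new Error("expectedReturnDate must be an ISO date string");
  }
  return {
    equipmentId: b.equipmentId,
    assignedToUserId: b.assignedToUserId,
    expectedReturnDate: b.expectedReturnDate as string | undefined,
  };
}
```

The point is that the agent produced this kind of boundary checking unprompted; the lazy first-pass version most of us write would have trusted the request body.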
Not everything was smooth:
The Prisma client issue. After regenerating the schema, TypeScript started throwing errors. The agent had added models but the Prisma client wasn't regenerated. It took me a few minutes of confusion before I realized the agent hadn't run npx prisma generate. Simple fix, but it shows you can't fully check out.
Overcomplicated the maintenance scheduler. First version had like 400 lines of code for something that could've been 80. I had to say "simplify this, we don't need support for recurring schedules yet" and it immediately cut it down.
Docker networking assumptions. The agent assumed I wanted a specific MongoDB setup. It worked, but the credentials were hardcoded. I had to explicitly ask for environment variable configuration.
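The fix was mundane but worth showing. This is a sketch of the fail-fast pattern I asked for, with a hypothetical variable name standing in for the real config:

```typescript
// Fail fast at startup if a required secret is missing, instead of
// shipping a hardcoded fallback. The variable name is illustrative.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// const mongoUri = requireEnv("MONGODB_URI"); // throws immediately if unset
```

Crashing at startup beats silently connecting with baked-in credentials, and it's exactly the kind of thing an agent won't do unless you spell it out.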
| Task | Estimated (Manual) | Actual (AI-Assisted) |
|---|---|---|
| Database schema + migrations | 3-4 hours | 15 minutes |
| API routes (6 endpoints) | 1-2 days | 1 hour |
| Terraform infrastructure | 1 day | 1 hour |
| Docker + CI/CD setup | Half day | 1 hour |
| Debugging AI mistakes | N/A | 2 hours |
Total: ~2 days (including time spent steering the AI and debugging its mistakes) instead of ~2 weeks. Even with the debugging overhead, it's still a massive win.
The point isn't that AI did everything perfectly. It didn't. The point is that the bottleneck shifted from writing code to reviewing code - and reviewing is faster.
Let's clear something up. What we're doing now isn't vibe coding anymore.
Vibe coding was casual - you'd ask the AI something vague and hope for the best. It worked for throwaway projects. It almost worked for real stuff.
What's happening now is different. People are calling it agentic engineering, and the name matters:
It's not magic. It's a different kind of problem-solving.
The difference is oversight. I'm still responsible for what ships. I just changed how I get there.
Everyone asks about tools. Here's my honest setup:
| Tool | What I Use It For | Why It Works |
|---|---|---|
| Claude Code | Complex refactors, terminal workflows | Deep reasoning across files |
| Cursor | When I need precise control | Great VS Code integration |
| Copilot Workspace | High-level planning & PR management | Task-to-plan workflows |
| Antigravity IDE | Full task delegation & E2E features | Agent Orchestration |
| OpenAI Codex | Heavy-duty Agentic workflows | Autonomous Command Center |
No single tool dominates. I switch between them depending on the task. Multi-tool workflow is the reality.
Three days ago, Anthropic dropped Opus 4.6. I haven't had a chance to put it through its paces yet, but I'm seeing a lot of noise around it: what people are building with it, and how big a context window it provides (at higher rates, of course).
Here's my take as someone who's been building with these models daily: this release changes the math on what's practical for agentic workflows.
The headline features aren't just marketing:
1M token context window (in beta). That's not a typo. You can now load an entire medium-sized codebase into context. Previously, I had to carefully curate which files to include. Now I can feed it the whole src/ directory and let it figure out what's relevant. The catch? Premium pricing kicks in above 200K tokens ($10/$37.50 per million input/output vs the standard $5/$25).
Agent teams. Claude Code now lets you spin up multiple agents that work in parallel. You can test this on a codebase review: three agents running simultaneously, each analyzing different parts of the code. They coordinate autonomously. You can jump into any subagent with Shift+Up/Down. It feels weird at first, like managing a small team instead of using a tool. See the post below by Lydia Hallie (@AnthropicAI), in which she explains how Claude Code can spin up multiple agents in parallel.
Context compaction. This is subtle but huge for long-running tasks. When the conversation approaches the context limit, Claude automatically summarizes older context and keeps going. No more "let me start a new chat because we hit the limit."
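To make the idea concrete, here's a toy sketch of what compaction does conceptually. This is not Anthropic's implementation - the token estimate is crude and the summarizer is a stub where a real system would call the model:

```typescript
interface Message { role: string; content: string; }

// Rough token estimate: ~4 characters per token.
const estimateTokens = (m: Message) => Math.ceil(m.content.length / 4);

// When history exceeds the budget, replace the oldest messages with a
// single summary message so the conversation can keep going.
function compact(history: Message[], tokenBudget: number): Message[] {
  const kept = [...history];
  let total = kept.reduce((n, m) => n + estimateTokens(m), 0);
  let dropped = 0;
  while (total > tokenBudget && kept.length > 1) {
    const oldest = kept.shift()!;
    total -= estimateTokens(oldest);
    dropped++;
  }
  if (dropped === 0) return kept;
  // Stub: a real implementation would summarize the dropped turns.
  const summary: Message = {
    role: "system",
    content: `[summary of ${dropped} earlier turns]`,
  };
  return [summary, ...kept];
}
```

The design choice that matters is that the summary stays in the window, so the model keeps a compressed memory of the early conversation instead of losing it outright.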
Adaptive thinking. The model now decides when to think deeper. Previous versions were binary: extended thinking on or off. Opus 4.6 reads the room. For complex architecture decisions, it thinks longer. For simple refactors, it moves fast.
I'm not usually a benchmarks person, but some numbers here caught my attention:
That last one is the practical differentiator. Context rot was the reason agentic workflows would break down after 30 minutes. The model would "forget" something critical you mentioned earlier. Opus 4.6 holds context better than anything I've used.
Let me break down what this actually costs in practice:
| Scenario | Context Size | Input Cost | Output Cost |
|---|---|---|---|
| Standard request | < 200K tokens | $5/MTok | $25/MTok |
| Long context | > 200K tokens | $10/MTok | $37.50/MTok |
| US-only inference | Any | 1.1x standard | 1.1x standard |
| Fast mode (beta) | Any | 6x standard | 6x standard |
| Batch processing | Any | 50% discount | 50% discount |
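To make the tiers concrete, here's a rough per-request cost estimator. One assumption I'm making from my reading of the pricing: crossing 200K input tokens moves the entire request to the long-context rates, not just the tokens above the threshold - verify against current pricing before relying on it:

```typescript
// Estimate USD cost of one standard-tier request.
// Assumes the long-context rate applies to the whole request once the
// input exceeds 200K tokens (my reading of the pricing; verify it).
function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  const longContext = inputTokens > 200_000;
  const inputRate = longContext ? 10 : 5;      // $ per million input tokens
  const outputRate = longContext ? 37.5 : 25;  // $ per million output tokens
  return (inputTokens / 1e6) * inputRate + (outputTokens / 1e6) * outputRate;
}

// estimateCostUSD(100_000, 10_000) -> 0.75  (standard tier)
// estimateCostUSD(300_000, 10_000) -> 3.375 (long-context tier)
```

Notice the jump: tripling the input here more than quadruples the cost, which is why "load the whole repo" is a deliberate decision rather than a default.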
The batch processing discount is interesting for background agents. If you're running overnight code reviews or test generation, that 50% cut adds up.
My cost optimization strategy:
Bigger context, fewer sessions. Instead of breaking work into multiple conversations, you can load everything once and let it rip.
Agent teams for reviews. You can spin up parallel agents for PR reviews; one for logic, one for security, one for style. They catch things a single pass would miss.
Let it think. Stop micromanaging the /effort setting. The adaptive thinking is good enough now that I trust it to calibrate.
Compaction for long sessions. You no longer need to restart conversations when context fills up. Just let compaction handle it.
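The "agent teams for reviews" pattern above is easy to sketch. Here the reviewers are stand-in async functions with toy heuristics; in practice each would be a Claude Code subagent with its own focus-specific prompt:

```typescript
type Review = { focus: string; findings: string[] };
type Reviewer = (diff: string) => Promise<Review>;

// Stand-ins for real subagents. The string checks are toy heuristics;
// a real reviewer would send the diff to the model with a focused prompt.
const makeReviewer = (focus: string, check: (d: string) => string[]): Reviewer =>
  async (diff) => ({ focus, findings: check(diff) });

const reviewers: Reviewer[] = [
  makeReviewer("logic", (d) => (d.includes("TODO") ? ["unresolved TODO"] : [])),
  makeReviewer("security", (d) => (d.includes("password=") ? ["hardcoded credential"] : [])),
  makeReviewer("style", () => []),
];

// Fan out: run every reviewer on the same diff in parallel.
async function reviewInParallel(diff: string): Promise<Review[]> {
  return Promise.all(reviewers.map((r) => r(diff)));
}
```

The fan-out is the whole trick: each reviewer sees the same diff with a narrow mandate, so a security-focused pass isn't distracted by style nits and vice versa.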
The release notes mention that Anthropic builds Claude with Claude. Their engineers use Claude Code daily. Opus 4.6 is what they've been testing internally. That shows.
I want to be real here. AI agents are not perfect. Not even close.
The mistakes aren't simple syntax errors anymore. They're subtle conceptual mistakes - the kind a hasty junior dev might make:
Karpathy put it perfectly:
"They will implement an inefficient, bloated construction over 1000 lines and it's up to you to be like 'couldn't you just do this instead?' and they'll say 'of course!' and cut it down to 100 lines."
So yeah, oversight still matters. Drawing from the discussions above, the pie chart below reflects how developers coding with AI tools split their time today.
Here's something that really hit me. As an OpenAI co-founder recently suggested, we are in a "step function" transition. For top-tier engineers, the tool of first resort is no longer the editor; it's the Agent.
The AI isn't replacing the engineer; it is automating the "slop" (the boilerplate and the manual wiring) so the engineer can focus on the system architecture. But here’s the catch: You cannot automate what you do not understand.
To bridge this gap, the modern workflow has shifted from "writing code" to "curating trajectories":
The Verdict: AI multiplies your existing knowledge. If your knowledge is zero, 0×100 is still zero. But if you know how to architect a system, AI turns your 2-week roadmap into a 2-day sprint.
The AI multiplies your existing knowledge. It doesn't replace it.
Nobody talks about the cost side enough. Running agentic workflows isn't cheap.
Andrew Pignanelli from General Intelligence Company shared that his company spent around $4000 per engineer per month on Opus tokens in January 2026. That's a real line item in the budget now.
But here's the flip side: his engineers shipped an average of 20 PRs a day. Sometimes hundreds of commits daily. 20% more spend for 3-4x output? That math works out.
The scary part is runaway agents. One developer asked Claude to remind him to check his kid's homework. Token usage? 3.2 million.
What happened? The agent thought it needed to check the homework itself. It scanned directories, sent every image to a multimodal model, searched websites, found nothing, and finally sent the simple reminder it should've started with.
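A hard token budget is the simplest guardrail against this failure mode. Here's a sketch - the step interface is hypothetical, standing in for whatever executes one agent action in your harness:

```typescript
interface StepResult { tokensUsed: number; done: boolean; }

// Wrap an agent loop with a hard cap on cumulative token spend.
// runStep is a stand-in for whatever executes one agent action.
function runWithBudget(
  runStep: () => StepResult,
  maxTokens: number,
): { totalTokens: number; aborted: boolean } {
  let totalTokens = 0;
  while (true) {
    const step = runStep();
    totalTokens += step.tokensUsed;
    if (step.done) return { totalTokens, aborted: false };
    if (totalTokens >= maxTokens) return { totalTokens, aborted: true };
  }
}
```

A budget like this wouldn't have stopped the agent from trying to grade the homework, but it would have killed the spiral after a few cents instead of 3.2 million tokens.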
What I do to manage costs:
Here's the uncomfortable truth: we're generating code faster than we can audit it.
Some predictions say 30% of new security vulnerabilities by 2027 will stem from AI-generated logic. Not because AI writes malicious code - but because we trust AI output without proper review.
Now look at the post below:
Simon Willison has been warning about something else. Systems that give agents access to email, browsers, and external services create what he calls the "lethal trifecta": access to private data, exposure to untrusted content, and the ability to communicate externally. Any two of those are manageable; all three together mean a prompt injection can exfiltrate your data.
He calls it his "most likely to result in a Challenger disaster" scenario.
What I've added to my workflow:
Andrew Pignanelli, CEO of General Intelligence Company, and his team learned this the hard way. They sped up engineering significantly in December. By January, they were way behind on design and UX.
Makes sense when you think about it. Engineers can now spin up features at the speed of thought. But those features still need to look good and feel intuitive.
That traditional 1:20 designer-to-engineer ratio? Probably too low now.
If you're only focused on building, you might end up with functional but clunky products.
I think about this one a lot. Heavy AI reliance means you code less. Code less means your manual skills atrophy.
Karpathy noticed it in himself - his ability to write code manually is already starting to fade.
Here's the problem: if you can't write code, how do you know when the AI output is subtly wrong?
Generation (writing code) and discrimination (reviewing code) are different cognitive skills. You can lose one while keeping the other, but it's risky.
> [!TIP]
> What we should do: Rotate through "AI-off" development sprints quarterly. It feels weird, almost nostalgic. But it's risk mitigation. You need people who can debug when the AI fails.
A few predictions people are throwing around for end of 2026:
Longer term speculation:
I don't know which of these will prove accurate. But they're worth thinking about.
My LinkedIn still says "Software Engineer." The job looks nothing like it did 18 months ago.
I'm not primarily a writer of code anymore. I'm a specifier, reviewer, writer, prompter, and orchestrator.
This isn't the death of engineering skill; it's a redistribution. The premium now is on architectural judgment, quality assessment, and the meta-skill of working effectively with AI.
Something shifted around December 2025. LLM agent capabilities crossed some threshold of coherence and caused a phase shift in how we build software.
Despite all the rough edges, programming feels more fun now. The drudgery is removed. What remains is the creative part. Less feeling blocked, more courage to try things.
But there's a split coming. Engineers who primarily liked coding will have a different experience than those who primarily liked building.
I liked building.
Build accordingly.
This post incorporates perspectives from Andrej Karpathy, Simon Willison, Boris Cherny, and others at the frontier. Views are my own. I'm still figuring this out like everyone else.
If this post was useful, consider supporting my open source work and independent writing.