Building competent AI SWE Agents through determinism

Sedky Haider

AI SWE Agents are capable today of completing end-to-end SWE tasks of simple to moderate complexity, consistently.

There's a significant knowledge gap between what's possible today in this realm and what developers, PMs, and engineering managers know about. This blog addresses that gap.

What's our goal?

The claim that I am here to defend: "AI SWE Agents TODAY can pick up low-hanging fruit and complete it END-TO-END."

But -- how do we do this such that:

  1. The workflow is completely autonomous. No babysitting needed.
  2. It doesn't need to run on my laptop. Anyone on the team can kick off the workflow at any time.
  3. It meets the quality and standards that WE SET.

The last point is by far the most important.

We've all seen AI slop. It's everywhere.

If we don't have confidence that the pull requests generated by the AI workflow meet our quality standards, the tool is useless; there's no value to be had.

The Workflow

Let's look at the workflow at a high level before we get into the details:

AI Workflow

Seems pretty straightforward, and that's the point. We are replicating what a developer would normally do when delivering a scope of work.

Except we aren't relying on a developer -- we are relying on a completely non-deterministic tool.

So how do we make sure the quality is high?

Deterministic workflows as safeguards

The secret sauce is wrapping the generative, non-deterministic behaviour (the LLM), in a loop that ensures all our quality checks are met.

This is probably the same steering you (the developer) are doing locally with Claude Code, Codex, or similar: give the tool a prompt, tell it to write tests, run your formatting, type checking, build checks, and so on.

First you start with the prompt (the scope of work). That's typically using a CLI, chat interface, or IDE integration.

Then, you come back in a bit, review the code it's written, make sure tests make sense, manual test, etc. When everything looks good, you raise a pull request.

This is the key point: our first prompt doesn't end with "and then create a pull request against main and push to remote".

Our SWE Agentic Workflow will follow the same pattern, except we're going to automate all the bits around the LLM. We are wrapping the magic sauce in guardrails and boundaries in order to ensure quality and consistency.
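The wrapper described above can be sketched as a small loop: the LLM step is non-deterministic, but the checks around it are plain shell commands with exit codes. This is a minimal sketch under assumptions; `generate_patch` and the check commands are illustrative placeholders, not a real API.

```python
import subprocess

def run_checks(commands):
    """Run each quality gate in order; return the first failure's output, or None."""
    for cmd in commands:
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            return f"`{cmd}` failed:\n{result.stdout}{result.stderr}"
    return None  # all checks passed

def run_with_guardrails(generate_patch, check_commands, max_attempts=5):
    """Loop the non-deterministic generator until the deterministic checks pass."""
    feedback = None
    for _ in range(max_attempts):
        generate_patch(feedback)                # LLM writes or revises code
        feedback = run_checks(check_commands)   # lint, type check, tests, build...
        if feedback is None:
            return True                         # every quality gate is green
    return False                                # give up; flag for a human
```

In practice `check_commands` would be your repo's own gates ("npm run lint", "npm test", "npm run build"), and the check output is fed back into the next prompt so the model can fix its own failures.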

The end result: We as a development team can asynchronously offload low-hanging fruit to our SWE LLM workflow, and continue to focus on the bigger tasks.

The AI workflow generates high quality pull requests. Our CI re-runs all our checks, ensuring that the AI Agent generated high quality code:

AI Workflow

We can batch time to review pull requests created by our workflows, expecting them to be of high quality and little slop.

OK great, but how do we actually build this?

Off the shelf, or build yourself?

Let's start with reviewing off-the-shelf solutions that solve this problem today.

There are a number of tools that exist today, offering a "SWE" autonomous background software engineer.

Cursor Background Agents

Cursor Background Agents was one of the first to offer an asynchronous AI SWE Agent. And they make the install dead-simple.

  • 1-click install to pull into Slack
  • Then, authorize it to read our Github repositories via OAuth.

That's it.

To kick off the workflow, you simply tag the bot in Slack:

"@Cursor can you fix this timezone bug"

And it then reads the thread you tagged it in where a support agent tried to re-schedule a customer's delivery dropoff to yesterday because of a timezone bug.

Cursor Trigger

Credit where credit is due: The absolute fastest time to "aha".

Another pro? Cursor doesn't even charge for this. I assume it uses your existing Cursor subscription if you have one. This is a "stay sticky" strategy.

How about a con? And it's a BIG one.

Cursor Background Agents aren't very useful.

Why?

In the workflow, Cursor will write some code. And it will create a pull request.

But, that's all it does. And it does it The Dirty LLM way.

It didn't use our repo's coding style rules.
It didn't check lint.
It can't even run commands such as "npm test, build, check, cook, clean"

It cannot verify its work.

It doesn't uphold any of our team's standards.

We're back to where we started: S L O P.

It will likely write decent solutions, much of the time. But that's not good enough. Not even close.

We need a solution that will consistently generate high quality work within the quality standards we expect.

Let's move on to another option.

Devin

Finding little value in Cursor Background Agents, my search continued for an off-the-shelf solution, and I discovered a tool called Devin (thanks to a good friend of mine who works there and suggested it).

Devin has a concept of "Devin Machines", which allow us to configure our VM environments including our repos, set up our guard rails as described above, and more.

Devin Machine

This is exactly what we're looking for.

Devin provides us a full virtual sandbox machine. The setup wizard helps us install our git repos, dependencies, and everything we need for coding, testing, etc, including a web browser. Think of it as AMI-style templating: snapshotting a full dev environment.

You can then provide it with commands the agent will run before doing any work ("git pull, npm i").

Devin Machine 1

You can provide it secrets, give it your lint commands, test commands, and everything else.

This is exactly what we need to guarantee high quality work will be generated by the workflow.

Triggering the workflows

How do we trigger the AI workflows? However we want. Devin has an API to invoke Sessions programmatically, meaning we can trigger the workflow natively in our tooling: Jira, Github, Slack, a text message, or whatever other way you want.
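As a sketch of what that programmatic trigger looks like: the snippet below assumes a sessions endpoint at `api.devin.ai/v1/sessions` with a Bearer token and a `prompt` field. Treat the exact URL, headers, and payload shape as assumptions; check Devin's API docs before relying on them.

```python
import json
import os
import urllib.request

# Assumed endpoint; verify against Devin's API documentation.
API_URL = "https://api.devin.ai/v1/sessions"

def build_session_request(prompt):
    """Build the JSON payload asking Devin to pick up a scope of work."""
    return {"prompt": prompt}

def create_session(prompt, api_key):
    """POST a new Devin session and return the parsed JSON response."""
    payload = json.dumps(build_session_request(prompt)).encode()
    req = urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    create_session(
        "Fix the timezone bug in delivery rescheduling",
        os.environ["DEVIN_API_KEY"],
    )
```

Because it's just an HTTP call, the same trigger can live behind a Jira automation, a GitHub Action, a Slack slash command, or anything else that can reach the API.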

As of writing, we've been able to shift hundreds of story points worth of low-hanging fruit out of our backlog and into the Devin workflow. This has freed up developers from context switching.

Devin Session

The catch is that Devin is relatively expensive compared to Claude Code or similar. We budget between $5 and $10 per "small" sized task.

For a hobbyist or a tinkerer, that might be unappealing. For any professional team, this is a rounding error.

Devin charges in terms of "ACU", which is likely a combination of compute time and token usage.

Build It Yourself

We should never build something we can reasonably buy. However, tinkering is fun.

So how would we build these workflows ourselves?

Open Hands

Open Hands

OpenHands is the open source equivalent of Devin. 65k Stars and counting. Impressive community engagement.

The readme contains all the information you need. It's pretty impressive. This is the most likely starting spot for a quick PoC.

E2B or Cloudflare Sandbox

E2B.dev - https://e2b.dev/

A compute platform for AI Agents. E2B and CF Sandbox provide a sandbox environment for building your containerized and ephemeral LLM environments.

Think Devin, without any of the niceties that the quick-start gives you. None of the opinionated approaches that cater to software development workflows.

Instead, the approach they've taken is quite low-level in comparison. This is a feature, not a bug.

# pip install openai e2b-code-interpreter
from openai import OpenAI
from e2b_code_interpreter import Sandbox

# Create OpenAI client
client = OpenAI()
system = "You are a helpful assistant that can execute python code in a Jupyter notebook. Only respond with the code to be executed and nothing else. Strip backticks in code blocks."
prompt = "Calculate how many r's are in the word 'strawberry'"

# Send messages to OpenAI API
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": prompt}
    ]
)

# Extract the code from the response
code = response.choices[0].message.content

# Execute code in E2B Sandbox
if code:
    with Sandbox.create() as sandbox:
        execution = sandbox.run_code(code)
        result = execution.text

    print(result)

The reason for this design choice is that these players are offering a platform for building AI agents of any flavour. As opposed to Devin/OpenHands/Cursor which are focussed on SWE AI workflows.

You get an empty compute environment and low-level primitives to fully customize the workflow.

Basically: you gotta build everything yourself.
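Concretely, "building everything yourself" means scripting the environment prep and the quality gates that Devin's wizard hands you for free. The sketch below mirrors the shape of a sandbox SDK (`sandbox.commands.run` echoes E2B's API, but treat it as an assumption), and the repo URL, paths, and npm commands are illustrative.

```python
def setup_commands(repo_url, branch="main"):
    """Deterministic environment prep: the part Devin's setup wizard does for you."""
    return [
        f"git clone --branch {branch} {repo_url} /workspace/repo",
        "cd /workspace/repo && npm install",
    ]

# Your repo's own quality gates, run inside the sandbox after every agent step.
QUALITY_GATES = [
    "cd /workspace/repo && npm run lint",
    "cd /workspace/repo && npm test",
    "cd /workspace/repo && npm run build",
]

def run_session(sandbox, repo_url, agent_step, max_attempts=5):
    """Prep the sandbox, then loop the agent until every gate passes."""
    for cmd in setup_commands(repo_url):
        sandbox.commands.run(cmd)
    for _ in range(max_attempts):
        agent_step(sandbox)  # the LLM edits files inside the sandbox
        if all(sandbox.commands.run(cmd).exit_code == 0 for cmd in QUALITY_GATES):
            return True      # gates green: safe to push a branch and open a PR
    return False             # flag for a human instead of shipping slop
```

Everything Devin gives you out of the box (snapshots, secrets, the browser, the PR plumbing) becomes your code to write on top of this skeleton.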

This is analogous to Fly.io vs AWS. One is an opinionated platform for doing one thing really well (serverless deploy), the other is a suite of tools for virtually every use case.

More flexibility, and much more complexity.

The pricing will be immensely favourable on a per-session level. Of course, the build-and-maintain overhead is much higher than with an opinionated solution.

Do you want to see a more detailed example of hand-building a SWE AI Agent using a BYO or OSS solution? Let me know on LinkedIn or email.