Adventures in Vibe Coding
Recently, I’ve been playing around with GenAI tools. Yes, I know I’m late at this point, but these days coding is not as big a part of my day-to-day as it used to be (though I still do it regularly), and I have never been too interested in autocomplete tools, which is how most GenAI for programming has been packaged.
The integration with my regular workflow is also not as straightforward. Most of the tools are built around an IDE; most of what I’ve seen are derivatives of VS Code, probably the most popular IDE at the moment. But, as I’ve discussed in this blog before, I’m firmly into the “Vim text editor / Command line as IDE” kind of flow. There are options there, but not as extensive as the ones for the IDEs. Of course, part of that is my language of choice, Python, whose dynamic typing makes autocomplete and the other niceties of an IDE less of a requisite than for options like C or Java.
Anyway, given that the basic interaction with GenAI models is text, it’s actually quite flexible, and there’s a lot of activity around integrating them into different workflows.
Which brings us back to the Agentic flow, which is the one getting the most interest at the moment. In essence, instead of having a “super powerful autocomplete” that integrates with your code-writing flow and presents suggestions, it works more as an extra prompt where you ask the agent to generate or change your code in certain ways.
This is, as far as I know, not available as an integration with my workflow1, but rather through certain new IDEs like Cursor or Windsurf. I tried both and they look quite similar to me in the way they operate. They present a screen with your code and a prompt where you can enter questions or actions. The prompt, which acts as a chat window, will show the response from the agent, and any change to the code will be labelled so it can be reviewed and approved.
For example, when starting a script from scratch, you can request something like:
Create a script that analyses a git repo, and presents a graph showing the amount of commits between tags, and the number of commits that contain the word “merge” in their commit messages
The agent will go and write a pretty capable script for you following your requirements, which you can then review and execute. You can iterate on it, making more comments as it generates the code. For example, you may need to ask it to add a requirements.txt file, as it will tend to pull in extra libraries, such as matplotlib if you ask for a graph.
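To give an idea of the shape of the result, a script for that prompt would look roughly along these lines (a simplified, hypothetical sketch, not the exact code the agent produced; it shells out to git and uses matplotlib for the graph):

    # Hypothetical sketch: count commits (and "merge" commits) between consecutive tags.
    # Assumes it runs inside a git repository and that matplotlib is installed.
    import subprocess

    import matplotlib.pyplot as plt


    def commits_between_tags():
        """Return (tag, total commits, "merge" commits) for each consecutive tag range."""
        tags = subprocess.run(
            ["git", "tag", "--sort=creatordate"],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        results = []
        for start, end in zip(tags, tags[1:]):
            # Commit subjects in the range start..end
            messages = subprocess.run(
                ["git", "log", "--format=%s", f"{start}..{end}"],
                capture_output=True, text=True, check=True,
            ).stdout.splitlines()
            merges = sum("merge" in message.lower() for message in messages)
            results.append((end, len(messages), merges))
        return results


    def plot(results):
        labels = [tag for tag, _, _ in results]
        plt.bar(labels, [total for _, total, _ in results], label="total commits")
        plt.bar(labels, [merges for _, _, merges in results], label='"merge" commits')
        plt.xticks(rotation=45)
        plt.legend()
        plt.tight_layout()
        plt.show()


    if __name__ == "__main__":
        plot(commits_between_tags())

Running it inside a repository with a few tags shows a bar per tag range, which is pretty much what I asked for in the prompt.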

The interesting part of all of this is the fact that the agent will produce a lot of text describing its actions and what it is doing, so you can verify what it’s doing.
Tools
I’ve been trying mostly Cursor. I’ve also taken a look at Windsurf, but they operate in a similar way. Both appear to be based on VS Code, with some add-ons to integrate the AI tools.
The operation of both in Agentic mode is very similar. You describe the action in the chat window, and the agent presents the changes, labelled as such, along with a response describing the changes and options it has produced.
In both cases, it feels like you are talking with someone and overseeing the results.
Models
With all the hype about the different models, I think that, at a certain level, they are on their way to becoming somewhat irrelevant. Sure, there are differences, and the most powerful models are actually pretty expensive and capable.
But they are on their way to becoming a commodity, to a certain degree. As all of them improve and the tools present them in a detached way, I think very soon they’ll become something in the background that doesn’t have a big impact. It’s similar to how using a particular CPU can produce better results, but most people are not worried about it. They just have one that is good enough for their usage, and never think about it.
I think that models, outside of very specific use cases, are going to be quite difficult to differentiate, with a bunch of them probably being “good enough” for most purposes.
It’s also very difficult to test the exact same action across different models to see the differences. It’s more a case of “this appears to be producing better results”. And what you mainly notice is that six months ago they were less capable than they are now.
Both Cursor and Windsurf allow you to configure which models to use. Some can be quite expensive!
Three kinds of software
I tend to write three differentiated kinds of code, with different expectations and usages2
- One-off code. This is code generated for a very specific case that will run only once. It could be converting a file from one format to another, solving a particular query of some sort, etc. The important part is the result of the task, not so much the code that generates it, as it’s not going to be run again.
- A recurring internal task. These tasks are normally run manually, for some ad-hoc purpose. It could be to prepare some report, let’s say semi-regular growth info. They evolve over time, and typically require some parameters. Because this code is adjusted from time to time and run in different situations, maintainability is more important than before. But given the manual nature of these tasks, there’s still no need to enforce “best practices”. Convenience trumps everything. When this gets out of hand, this code evolves and becomes so complex and critical that everyone hates it.
- Production code. This code is intended, from the beginning, to run many times with little to no supervision, and should meet a high bar to ensure that it doesn’t break (and/or can recover from failure), is performant, maintainable, readable, compliant, and that multiple people can check it out and expand it. Testing is important, as are metrics, logs, etc. So is the fit within the architecture of the system, both the software one and the organisational one. This is the “software engineering” part, as opposed to just “coding”.
While software developers will always talk about the importance of standards, the fact is that these three levels of code can get away with very different coding approaches. Trying to be super strict with something at level 1 is missing the point.
Code can, and sometimes does, move up this scale. A task can start as “there could be something here” (level 1), move to “let’s do a small POC” (level 2), and end up as “let’s ship it” (level 3). But it doesn’t go down the scale. The level of effort grows exponentially with each level, though. It is not too difficult to move code from level 1 to level 2, but moving it from level 2 to 3 is a lot of work. This is the kind of process that takes a company from a small pilot to a fully grown operation. You can probably argue that there’s an intermediate 2.5 level, typical of new companies, where the product is still young, the codebase is small, and moving fast is more critical than stability, but things are somewhat planned. If the company is successful, all that “tech debt” will have to be addressed and the code moved to a proper level 3.
My experience is that Agentic code development excels at levels 1 and 2. As we have seen above, it is very easy to describe a particular problem and “guide” the agent to a solution that’s good enough. It allows you to generate POCs and initial code super quickly and easily. It picks good defaults and can implement something useful very fast.
But moving to the next level is more difficult. It can help productionise code, adding logs, metrics, etc. But operating on an existing code base that works is not trivial, and requires more attention to each of the actions it takes. It can help, though, and can suggest interesting code that can be integrated, with the right supervision. But at level 3, the “write code” part is not the only thing. Standards are important, but so is the relationship with many other moving parts, including teams, permissions, etc.
Testing
The best way to work in those situations is actually by using TDD. Define the tests that the code should pass, and then work backwards, generating the code that fulfils the use cases.
I’ve seen a lot of recommendations to use GenAI to generate tests, but other than helping with the boilerplate, I don’t think that’s the right approach. It should be the other way around. Generate the test cases yourself, paying attention to what you’re trying to do, and then ask the Agent to change the code to accommodate them.
Testing your code is an activity that should be approached with the right mindset. The idea is that you double-check that the code is performing as it should, doing what it is supposed to do (and not doing what it is not supposed to do). A certain detachment is useful, because it’s very easy to fall into the trap of creating tests that verify what the code is doing, and not what it should do. The key element in TDD (writing tests before writing the code) is to increase this detachment and force the developer to define the result before the operation.
For level 3 code, if we want to use Agentic mode, I think it is important to achieve this decoupling between tests and code. And, for that, the easiest way is to create the tests independently. GenAI can still help with the boilerplate! But define the test cases yourself.
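As a minimal sketch of what I mean (the module and function names here are hypothetical, just to illustrate the flow): write the test by hand first, then ask the Agent to write or change the implementation until it passes.

    # test_report.py -- written by hand, before any implementation exists.
    # "report" and "commits_per_tag" are hypothetical names for this example.
    from report import commits_per_tag


    def test_commits_per_tag_counts_merges():
        log = [
            ("v1.1", "Merge branch 'feature-x'"),
            ("v1.1", "Fix typo in docs"),
            ("v1.2", "Add new endpoint"),
        ]
        result = commits_per_tag(log)
        assert result["v1.1"] == {"total": 2, "merges": 1}
        assert result["v1.2"] == {"total": 1, "merges": 0}

Running pytest will fail until the implementation exists and behaves as described, which is exactly the point: the expected result is fixed before the Agent touches the code.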
Experiences
In terms of more specific cases where I’ve tried it, I have been using Agentic mode, among other usages, for these tasks:
Creating a few scripts between levels 1 and 2 to create graphs and compile data for analysis and presentations
This was incredibly productive and positive. I wanted to compile some statistics about a couple of git repos to show the evolution of the development process. I was able to create pretty capable scripts that presented the data very fast. There were some problems with how the data was presented, and some stuff had to be adjusted, but if I had written these scripts myself, I would have had the same problems. In general it was a big time saver.
Updating the version of ffind
I developed and maintain a command line tool to search files by name called ffind. I use it day-to-day. It is pretty stable by now, and I try to keep it up to date with new versions of Python, etc. It has a good suite of tests, CI and good coverage. So I thought “why don’t I try to update it using an Agent? Normally it’s just changing a few numbers in the files and pushing the new version”. So I did.
The process was a bit more complicated than anticipated, because the use of the setup.py script has been deprecated in the latest versions of Python. So I had to change how the project was built and installed. The Agent made the change to hatch, which is a tool that helps with building and installing the package. This was not without its difficulties.
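For context, the change essentially boils down to replacing setup.py with a pyproject.toml that declares hatchling as the build backend. A generic sketch (not ffind’s actual file; the version and entry point are placeholders) looks like this:

    # Generic hatchling-based pyproject.toml sketch; values are placeholders.
    [build-system]
    requires = ["hatchling"]
    build-backend = "hatchling.build"

    [project]
    name = "ffind"
    version = "0.0.0"          # placeholder version
    description = "Command line tool to search files by name"

    [project.scripts]
    ffind = "ffind.ffind:run"  # placeholder entry point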
The Agent tended to hallucinate quite often, presenting methods and calls that were not real, and mixing up the interface for hatch. Some hallucinations were easy to spot; for example, it tended to change my name in the package, for some reason, to another Spanish-sounding name. I’m talking about the copyright in the license and things like that. That was strange.
Harder to spot were the non-existent methods in the hatch module, and the changes that left dead code or mixed things up. Given that I have pretty good test coverage, they were easy to detect, but it was pretty disconcerting. Another interesting detail was that the Agent didn’t really learn. If I told it “you’re changing my name there”, it would apologise, but three questions later it would try to do the same thing.
The process was, though, relatively quick. It was quicker than if I had done it on my own. I would have fixed it, but it would probably have taken me a bit longer.
Commoditisation
A surprising outcome to me is that I can clearly see that AI is pretty much on its way to being commoditised. The different IDEs are not that different, and the abstractions and opinions they impose are similar. Working through code in an Agentic mode is, at its rawest level, having a conversation and reviewing results3. There is no secret sauce or moat other than “Model X is better”.
But the differences between the models are also fuzzy, and difficult to pin down. There are a bunch of very capable models already, and they don’t change how you interact with them. You can start using another model without changing anything about your interaction; just configure it to use model M+1. Perhaps there’s the possibility of specialised models (e.g. a model that’s tailored to generate Java code, or embedded code), but so far the industry seems to be going for generic coding models. And using one model or another is just a matter of changing a config file. It’s not like changing a programming language or even a text editor, which are far more opinionated. It’s just using a chat box. I wonder if, at some point, models will have their own opinions as a way of differentiating them.
In essence, it’s like going back to the early 2000s, but instead of having to choose between Altavista and Google, you have three different Googles that produce similar results and evolve very quickly. The window of opportunity for a winner-takes-all outcome is likely not there.
Embracing the vibes
The interesting part, though, is the fact that it all devolves into vibe coding very quickly. This term is not 100% clear at the moment, as sometimes it seems to mean just working with an Agent. But what I really mean is that you start delegating a lot to the Agent. You are just interested in the end result, and less in the underlying code that generates it.
Yeah, whatever, just show me the result
And I think that’s both the incredibly powerful potential of the tool and the danger inherent in it. For code at levels 1 and 2, the output really is all you need. The process to get there is less important.
But for level 3, what the code is doing becomes critical. It needs to be analysed critically, in the same way you would with a coworker’s code. But we all know that good, thorough code reviews are difficult. There’s a level of implicit trust in your coworkers that allows for a quick “👍 LGTM”, especially for long PRs, and that is not ideal here. This can be a big shift in the work, from focusing on writing code to reviewing code, from very early in one’s career.
This is probably a key reason why a lot of business people and leadership are so hyped about it. They can use it for the kind of tasks that they perform! Creating a new website from scratch really is super easy, as is producing a POC of something to test it out. But I think that scaling these into a “real business”, where performance, stability and consistency are critical, can present a challenge, at least for now. A great deal of the work in a stabilised company is not about adding more code, but about first finding out subtle things that are going on, and defining and aligning features with the existing code, which is not trivial to do. Not that GenAI cannot help there, but it’s less magical. There’s no silver bullet!
The other caveat is that, for a developer who likes their job, it can suck the joy out of it. I like to write code and to think about it. It’s true that you can generate code faster, but at the same time, writing it is sort of the most enjoyable part of my day-to-day. I’d prefer it if it could handle my JIRA tickets and status report meetings instead, to be honest. I hope that we start working on tools more focused on those tasks, which is really what I’d prefer to automate.
Writing well is thinking well
At a fundamental level, writing is structuring thoughts around ideas and confronting them with the real world, even if it’s just so they can stand on their own. While proof-checking and help with grammar, etc. are greatly advantageous (and even more so for a non-native English speaker), relying too much on generated code can produce laziness of thought. Some subtle details are only discovered while you are thinking hard about them. It’s not unusual to realise there’s a bug by talking with the person who wrote the code about some condition and then connecting the dots. Generating the code automatically removes this intimate, instinctual knowledge of the code.
- I’m putting this on the Internet, so I expect a lot of people telling me how incredibly wrong I am ↩︎
- Please don’t take these levels too seriously. They are very broad ways of talking about code, and different code bases have different expectations and practices. ↩︎
- And I have hope for a more “command-line-centric Agentic mode” that I can integrate better into my workflow. ↩︎

