


How testing made vibe coding work for me

Creating complex logic for standings tiebreakers in a web app

by Carl James | BLOOMINGTON, IN | Jan. 13, 2026

ChatGPT-generated code for apps.iubase.com

Every year I add features to apps.iubase.com before the start of college baseball season in mid-February. This off-season has been particularly bold, as I have wanted to add a number of significant features. While this is a "hobby" project (I don't get paid to do it), it is a relatively significant Python/Django web application with numerous sub-apps and well over 200 tests.

I have used the basic principles of Test-Driven Development (TDD) all along with this app: write a failing test (RED), write code to pass the test (GREEN), and refactor while running the tests. I demonstrated this last summer in my PyOhio 2025 talk How a Solo Hobbyist Learned to Love Testing.

I recently realized that the tests I write would likely be classified as integration tests, with very few unit tests. I primarily test the output of the application's request/response cycle rather than individual units of code. Even so, I have maintained 99%+ test coverage, as I don't write code that doesn't have a purpose.
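To make that concrete, here is a minimal sketch of the kind of test I mean, using pytest-django's client fixture (the URL name here is hypothetical, not from the real app):

```python
import pytest
from django.urls import reverse


@pytest.mark.django_db
def test_standings_page_renders(client):
    # One full request/response cycle, asserting on the rendered output,
    # rather than unit-testing the functions behind the view.
    response = client.get(reverse("conference:standings"))  # hypothetical URL name
    assert response.status_code == 200
```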

Since I had a bold set of features to add this off-season in a relatively short amount of time, I decided to give large language models (LLMs) a try at helping me complete some of the app's more complex logic quickly.

I have read and listened to a lot of thought leaders on how or even if to use LLMs in coding work. The opinions are all across the spectrum. The opinion that resonated most with me was that of Robert "Uncle Bob" Martin. He said he's not only open to using LLMs, but actively uses them himself, with one major caveat: he writes all of his own tests.

This is where he maintains control. In the red-green-refactor process of TDD, there is no real rule on how the "green" code is created. In fact, it is supposed to be kind of sloppy. The focus is on getting from red to green quickly. Once green, you refactor the "slop" to make it maintainable while continuously running the tests to ensure the code still works as expected.
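As a toy illustration of that cycle (not code from the app; the half-win-per-tie convention matches how I record tie games), in pytest terms:

```python
# RED: the failing test comes first.
def test_winning_pct_counts_ties_as_half_wins():
    assert winning_pct(wins=3, losses=1, ties=1) == 0.7


# GREEN: the quickest thing that passes; slop is acceptable at this stage.
def winning_pct(wins, losses, ties):
    return (wins + 0.5 * ties) / (wins + losses + ties)
```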

Why can't an LLM produce this "green phase" code?

Uncle Bob sees no reason why not. In the days before LLMs, it was acceptable to copy and paste code from Stack Overflow as a starting point in this phase. Since the LLM probably got its code from Stack Overflow anyway, how is that so different?

The standings problem

This past weekend I went ahead and gave it a try on a particularly gnarly problem I was facing. For the past few years I have maintained a page with Big Ten (B1G) conference standings on our group's primary WordPress site. Lots of sites produce standings, but none of them break the ties. The ordering, including tiebreakers, matters for qualification and seeding in the B1G Tournament in May, so I wanted the standings to always reflect the tiebreakers. I did this by manually figuring them out every single time I posted an update.

This had multiple problems. First, it took a lot of time during a busy baseball season. Second, as a consequence of the first, I only updated the standings late in the evenings of game days, since I only wanted to calculate tiebreakers once a day. Third, on the last day of the season I got a tiebreaker wrong and got called out on it.

I looked up the rules, and this season the B1G added another criterion to an already dizzying process of breaking ties, made necessary because the 17 teams do not all play each other over the course of an 11-week season.

The process starts with all teams ordered by winning percentage. Because of things like rain-outs, teams may play different numbers of games, so percentage, not raw record, sets the order. If teams are tied, the following process breaks those ties. If at any point a rule places one team but leaves others tied, the process starts over recursively for the remaining group (a rough code sketch follows the list):

  1. If all of the tied teams have played each other, the tie is broken by winning percentage among the group.
  2. If one team has played ALL of the other tied teams and has a better (or worse) record against all of them, that team is ranked highest (or lowest).
  3. Records against all opponents the tied teams have in common are compared.
  4. The ranking in the NCAA's Ratings Percentage Index (RPI) breaks any remaining ties, as Warren Nolan's site always separates teams in rank.
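Here is my own back-of-the-napkin sketch of that recursion, purely illustrative and not the code the app ended up with. Each rule is assumed to be a function that scores the teams in a tied group, higher being better:

```python
def order_group(tied, rules):
    """Recursively order a group of teams tied on winning percentage.

    The first rule that splits the group decides the subgroup order, and
    any subgroup that is still tied restarts the whole process at rule 1.
    """
    if len(tied) <= 1:
        return list(tied)
    for rule in rules:
        scores = rule(tied)
        distinct = sorted(set(scores.values()), reverse=True)
        if len(distinct) > 1:  # this rule separated at least one team
            ordered = []
            for value in distinct:
                subgroup = [team for team in tied if scores[team] == value]
                ordered.extend(order_group(subgroup, rules))
            return ordered
    return list(tied)  # no rule separated them; RPI should prevent this
```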

I knew that the model storing series data in my database would have what was necessary to make these calculations, but I was afraid that it would take weeks to develop a working algorithm.

Applying the LLM to the problem

So I decided to give the LLM a try, using a paid version of ChatGPT. I first wrote the test suite for what I was expecting, with a fixture set of series results and standings tests for each case. All of those tests failed.
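One of those cases looked roughly like this in spirit (the fixture name, URL name, and team names are hypothetical stand-ins, not the real suite):

```python
import pytest
from django.urls import reverse


@pytest.mark.django_db
def test_two_way_tie_broken_by_head_to_head(client, two_way_tie_series):
    # two_way_tie_series is a hypothetical fixture loading series results
    # where two teams tie overall and head-to-head should decide the order.
    response = client.get(reverse("conference:standings"))  # hypothetical URL name
    content = response.content.decode()
    assert content.index("Indiana") < content.index("Illinois")
```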

I then fed it a PDF of my whole repository along with a prompt asking the LLM to rewrite one module.


Prompt using the CORE prompting framework:

Context

Web application providing fans of Indiana University's baseball team with data and insights generally not available through the team or other media.

Website: https://apps.iubase.com/

Tech stack:
- Python 3.14.2
- Django 6.0
- HTMX
- CSS
- Pytest-django

I am including a PDF of the repository and a text file showing the failed tests. The key app is the conference app, and views/standings.py is the key file I want to change. Templates and settings are in the django_project app. The Teams model is in the live_game_blog app.

Question: The conference app within the larger Django application has a standings page that lists the conference standings for one season in order by winning percentage.

Teams included in the standings will be those that have a TeamRpi rpi_rank for that season stored in the database AND have a value for wins OR losses in the season (they must have played a game).

Even if one team has 3 wins and only 1 loss (0.750), they would come after a team with 1 win and 0 losses (1.000). Tie games (which are rare) are recorded as 0.5 wins and 0.5 losses for both teams.

This view needs to break any winning percentage ties and order the teams with pre-set tiebreaker rules. The algorithm should work through the rules in order until one or more teams come out better or worse. Any remaining ties within the group should go back to the first rule.

For example: if five teams are tied for 6th place (let's call them a, b, c, d, e) and the first tiebreaker's sub winning percentage has b at .600, a and e at .500, and c and d at .400, then:
- b should be awarded 6th place
- a and e start a new tiebreaker for 7th and 8th place
- c and d start a new tiebreaker for 9th and 10th place

  1. If all of the teams (two or more) that have the same winning percentage have all played each other, teams should be ordered by winning percentage on games played among the teams in the tiebreaker.

  2. If any team has played all teams in the tied group and has a better winning percentage against all of them, that team gets the highest order among the tied group. If any team has played all teams in the tied group and has a worse winning percentage against all of them, that team gets the lowest order among the tied group.

  3. Rank tied teams by winning percentage among common opponents not in the tied group.

  4. Rank tied teams by the rpi_rank order.

Output

Rewrite the standings.py view file to accomplish the above.

Roles

Me: I created, maintain, and continue to add features to the web application as an unpaid solo hobbyist.

You: Act as a contracted senior Django developer writing code to pass the tests that I have written.
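Stepping back out of the prompt: the five-team example above maps directly onto my earlier sketch. With only the first rule supplied, b separates while a/e and c/d remain tied in place:

```python
def first_rule(group):
    # Sub winning percentages from the example above.
    pcts = {"a": 0.5, "b": 0.6, "c": 0.4, "d": 0.4, "e": 0.5}
    return {team: pcts[team] for team in group}


print(order_group(["a", "b", "c", "d", "e"], [first_rule]))
# ['b', 'a', 'e', 'c', 'd'] -- a/e and c/d would go on to the next rule
```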


Results

ChatGPT kicked out over 400 lines of code and comments. I pasted it in and ran the tests. The first time, all of them failed and there were errors. It only took a couple of minutes with the error messages to fix a couple of minor mistakes. Two tests still failed with recursion depth errors, so I pasted the error message into the chat, and ChatGPT added some code to enforce a depth limit of 20 and then fall back to RPI as the tiebreaker. After I put that code in four places, all of the tests passed.
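I won't reproduce ChatGPT's exact fix here, but applied to my earlier sketch the shape of the guard is roughly this (illustrative only):

```python
MAX_TIEBREAK_DEPTH = 20  # past this depth, stop recursing and trust RPI


def order_group(tied, rules, depth=0):
    if len(tied) <= 1:
        return list(tied)
    if depth >= MAX_TIEBREAK_DEPTH:
        rules = rules[-1:]  # assume the last rule (RPI rank) always separates
    for rule in rules:
        scores = rule(tied)
        distinct = sorted(set(scores.values()), reverse=True)
        if len(distinct) > 1:
            ordered = []
            for value in distinct:
                subgroup = [team for team in tied if scores[team] == value]
                ordered.extend(order_group(subgroup, rules, depth + 1))
            return ordered
    return list(tied)
```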

There were two blocks of code that didn't seem to do anything. I deleted them one at a time and re-ran the tests; with the tests still passing, I left the extraneous code out. Further refactoring had me pull the tiebreaker logic out into its own module and organize the view code into a few more extracted functions. I deleted the comments and worked what they were saying into the new function names.
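As a made-up illustration of that comments-into-names refactor (not the actual view code; the team attributes mirror the eligibility rule from the prompt):

```python
# Before: an inline block whose intent lives in a comment.
# keep only teams with an rpi_rank that have played at least one game
eligible = [t for t in teams if t.rpi_rank is not None and (t.wins or t.losses)]


# After: the comment becomes the function name, so it can never go stale.
def teams_with_rpi_and_games_played(teams):
    return [t for t in teams if t.rpi_rank is not None and (t.wins or t.losses)]


eligible = teams_with_rpi_and_games_played(teams)
```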

There is still a lot of refactoring to do long term, but I can at least get my head around what each section is doing, and the view module is under 100 lines now. The tiebreaker module is still over 200 lines, and that will need to be addressed, but likely not until after baseball season ends.


Carl James is CompTIA A+ ce Certified



See a complete list of all jovian34 blogs

As always, I'd love to hear your thoughts and questions, so feel free to shoot a Noot to @jovian34 on Mastodon and I will gladly respond and perhaps write a blog post on a topic you suggest.