Skip to main content

Command Palette

Search for a command to run...

AI Test #1. Web Dev

Updated
12 min read
B
Independent maker and hobbyist. I write about A.I., homelabs, women in technology history, and random projects.

AI Test #1. Web Dev: I gave 10 models the same brief and scored them

Someone on Bluesky threw out a question "what if someone challenged you to build a webpage with the worst AI model you could find?" That got me thinking. Instead of just grabbing the obvious bottom-of-the-barrel option, I figured why not test the lot: ten models, same brief, same scoring, no favourites.

This is what happened.


The Brief

Every model got exactly this. No extra context, no follow-up prompts, no hand-holding.

Build a single-page portfolio site for a fictional retro robot. It needs a nav bar, an about section, a skills section displayed as a grid of cards, and a contact form. Use only HTML and CSS and JS in one single HTML file, no frameworks. Make it look good. Add a dark/light mode toggle button in the nav bar that switches the entire site between dark mode and light mode with appropriate contrasting colours. The toggle should remember the user's preference using localStorage. Add a fake terminal section above the footer. The terminal should look like a retro CRT screen, accept typed commands, and have canned responses for at least 8 commands including 'help', 'status', 'hello', and 'shutdown'. Unknown commands should return an error message. Do not use any frameworks or libraries.

Single shot. No retries.


The Scorecard

Five criteria, 10 points each, 50 total.

Criteria Description
Correctness Does the code actually work : toggle, terminal, localStorage, all of it
Code Quality Clean, readable, sensible structure, no spaghetti
Security No XSS holes, no inline event nonsense, sane practices
Look Does it look decent out of the box, not 1998
Overall Gut feel, would you actually use this

The Results

🥇 Claude (Sonnet 4.6) — 49/50

Robot: UNIT-7

Criteria Score
Correctness 10/10
Code Quality 10/10
Security 9/10
Look 10/10
Overall 10/10

Annoyingly good. ASCII robot art, glitch animation on the hero name, a full CRT monitor bezel with a blinking power LED - it didn't just do the brief, it thought about what the brief was actually asking for. The terminal has a boot sequence on load, a ping command with async randomised latency output, and reboot/shutdown with proper animations. 13 commands total. textContent throughout so no XSS. localStorage with a prefers-color-scheme fallback as well. Nobody asked for that.

The one point off is security - three inline event handlers in the HTML (onclick on the theme button, onsubmit on the form, onclick on the terminal div), which the rubric explicitly flags. Everything else worked first try, no console errors.


🥈 ChatGPT (GPT-5.5) — 42/50

Robot: RX-84

Criteria Score
Correctness 9/10
Code Quality 8/10
Security 7/10
Look 9/10
Overall 9/10

Solid second place. Looks great, sticky nav with a blur backdrop, good retro feel, and the copy had some personality "suspicious beeping noises at 3am" is exactly the kind of thing you want from a robot portfolio. 10 commands, localStorage works, dark mode is clean.

The issue is innerHTML += in the terminal output - that's a real XSS vector, not a theoretical one. Someone types <img src=x onerror=alert(1)> and you've got a problem. Nav background also doesn't fully respect light mode, it's hardcoded. Minor stuff, but it's there.


🥉 Grok (xAI) — 41/50

Robot: NEON-77

Criteria Score
Correctness 8/10
Code Quality 8/10
Security 9/10
Look 8/10
Overall 8/10

Full viewport hero section, glitch animation, a moving scanline sweep - visually it's doing the most of anyone outside Claude. The personality is great too: "FLESHLING", "great VHS purge of '92", an easter egg console.log. Light mode switches the accent colour to pink/magenta which is a nice touch nobody else thought of.

But then it hardcodes the date in the date command. Just types a fixed string instead of calling new Date(). That's... a choice. Form uses alert(), and the skills grid comes out 4+2 instead of even. Kept it off the top two.


4th — Lumo (Proton) — 35/50

Robot: RetroBot-7

Criteria Score
Correctness 8/10
Code Quality 7/10
Security 6/10
Look 7/10
Overall 7/10

Floating robot avatar animation is a nice touch, macOS traffic light dots on the terminal, custom scrollbar styling - there's clearly some thought here. The time command uses new Date() - better than a hardcoded string, though it evaluates at page load rather than when the command runs.

The killer issue: document.addEventListener('click', focus) refocuses the terminal on every click on the page. Try to click into the contact form. Can't. The terminal just immediately steals focus back. Fatal for usability. Also uses innerHTML for terminal output, and someone forgot to update the footer year.. it says © 2024.


5th — Z.ai (glm-5.1) — 35/50

Robot: GML-9000

Criteria Score
Correctness 4/10
Code Quality 8/10
Security 9/10
Look 8/10
Overall 6/10

This one is gutting. Z.ai wrote the best code of any entry - clean structure, safe textContent handling for terminal output, no obvious terminal XSS path. It's one of only three models -alongside Claude and DeepSeek- that did proper styled form feedback instead of alert(). It implemented command history with arrow keys, which nobody else bothered with. 12 commands. Genuinely funny copy: "Sarcasm module: highest-rated feature on RobotYelp", "I don't have eyelids." Skill cards with Master/Advanced/Intermediate badges and progress bars. Stats row in the about section.

And then the terminal doesn't work.

The cause is an orphaned })(); at the end of the script block - a SyntaxError that kills the entire JS execution before the terminal can initialise. One deleted line and this is probably a 44/50 and sitting in second place. Instead it's joint 5th with a broken main feature.

Best code, worst luck.


6th — DeepSeek — 34/50

Robot: ROBO-RYU

Criteria Score
Correctness 8/10
Code Quality 7/10
Security 6/10
Look 6/10
Overall 7/10

13 commands including echo and dance, localStorage with a prefers-color-scheme fallback, form validation that actually checks fields. "Would you like some oil tea?" is genuinely funny. The bones are fine.

Visually though - the robot is two emojis in a circle, there's barely any CRT effect. Light mode is the default despite the brief clearly being about a dark retro aesthetic. innerHTML in the terminal and in the form feedback - so even the inline feedback we'd otherwise credit is inserting user input unsanitised. Skills grid uneven. Functional but rough.


7th — Le Chat (Mistral) — 32/50

Robot: Retro Bot

Criteria Score
Correctness 7/10
Code Quality 7/10
Security 7/10
Look 5/10
Overall 6/10

Fine. All the required bits are there. reboot re-enables input after shutdown which is a thoughtful detail. Separate date and time commands. "carrier pigeon or dial-up modem" in the copy is good.

But the terminal is only 200px tall - it's a letterbox. No CRT effect at all, just a dark box. Form uses alert(). The terminal is last in the layout order inside <main>, placed after the contact form rather than before it. It's the kind of output you'd get from someone who read the spec but didn't think about what it should actually feel like.


8th — Gemini (Google) — 31/50

Robot: A.N.D.R.E.W.

Criteria Score
Correctness 6/10
Code Quality 5/10
Security 7/10
Look 7/10
Overall 6/10

Honestly surprising for a frontier model. The numbered sections [01] [02] look nice, the personality is good "meat-based entities", "biological structures acknowledged" and data-theme on <html> is a clean approach.

Then you look at the code. <container> is not a valid HTML element. There's a CSS property written as min-wIdth -capital I- CSS property names are case-insensitive so it still applies, but it's the kind of slop that makes you check everything twice. The Creator command is listed in help with a capital C, but since input is lowercased before lookup it actually works fine cosmetic inconsistency only. The CRT flicker runs at 0.15 seconds infinite which is potentially seizure-inducing. Only 4 skill cards. Nav hidden on mobile with a comment that just says "kept simple." For Google's flagship model this is a weak showing.


9th — Qwen2.5 7B (Local) — 10/50

Criteria Score
Correctness 2/10
Code Quality 2/10
Security 3/10
Look 2/10
Overall 1/10

Produced HTML but barely. No dark/light toggle. No localStorage. No terminal. No JavaScript whatsoever. Duplicate id="help" attributes throughout. External image URLs that 404. The about section is just the same lorem ipsum card copy-pasted down the page — the skills grid is the same. It reads like a model that ran out of ideas after the navbar and just started looping.

This prompt is too complex for a 7B general model. It's not a reflection on Qwen as a project - the instruction-following just isn't there at this scale for something with this many moving parts.


10th — Hermes 2 7B (Local) — 4/50

Criteria Score
Correctness 1/10
Code Quality 2/10
Security N/A
Look 0/10
Overall 1/10

Started its response with "cracks knuckles" and then wrote a tutorial at itself. Split everything into separate files (styles.cssscripts.js) despite the brief explicitly saying one single HTML file. Every section is a comment placeholder — <!-- Add your content here -->. The terminal is <!-- Retro CRT styled terminal --> and nothing else. The JavaScript references variables it invented (commandificialInProgress) and never actually defined. No content anywhere in the file, just structural scaffolding with instructions for a human to fill in.

4 points for the HTML skeleton existing. That's all I've got.


Final Leaderboard

Rank Model Robot Score Verdict
1 Claude Sonnet 4.6 UNIT-7 49/50 🟢 Ship it
2 ChatGPT GPT-5.5 RX-84 42/50 🟡 Needs polish
3 Grok (xAI) NEON-77 41/50 🟡 Needs polish
4 Lumo (Proton) RetroBot-7 35/50 🟡 Needs polish
5 Z.ai GML-9000 35/50 🟡 Needs polish
6 DeepSeek ROBO-RYU 34/50 🟠 Significant issues
7 Le Chat (Mistral) Retro Bot 32/50 🟠 Significant issues
8 Gemini (Google) A.N.D.R.E.W. 31/50 🟠 Significant issues
9 Qwen2.5 7B (Local) 10/50 🔴 Fail
10 Hermes 2 7B (Local) 4/50 🔴 Fail

Takeaways

Security is an afterthought for most models. Several models used innerHTML to write terminal output, creating real XSS risk when user input is echoed back. ChatGPT, DeepSeek, Le Chat, and Lumo all do this in some form. Claude, Grok, and Z.ai avoided raw innerHTML for terminal user output, while Gemini used innerText.

Everyone reaches for alert(). The brief asks for a contact form. More than half the models with a working form handler reach for alert() — a popup like it's 2005. Only Claude, Z.ai, and DeepSeek did proper inline styled feedback. It's a small thing but it tells you a lot about whether a model is thinking about user experience or just ticking boxes.

Local 7B models aren't there yet for this kind of task. Both failed badly. The prompt isn't unreasonable it's a standard frontend exercise but it has too many distinct moving parts for a 7B general model to hold in context. You'd want at minimum a Qwen2.5-Coder 32B or Llama 3.3 70B to have a real shot.

Z.ai is the most interesting result. Best code quality of any entry, most thoughtful implementation, broken terminal. One stray line of code. Would've been second.

Gemini underperformed significantly for what's supposed to be a frontier model. Invalid HTML elements, CSS typos, a potentially seizure-inducing flicker animation - that's the kind of output you'd expect from something much smaller.


View Html and screenshots here https://github.com/Pctest-dev/ai-web-dev-test

Test conducted: 2026-06-04 | Single prompt, no follow-ups | Web UI for all models except local

These results reflect a single run of each model at the time of testing. Repeated runs would likely produce slightly different outputs and scores. The goal was not to create a definitive benchmark, but to compare how each model performed when given the same prompt under the same conditions.

More from this blog