I fixed the strawberry problem because OpenAI couldn't
Published on , 1357 words, 5 minutes to read
Remember kids: real winners cheat
A photograph of a bundle of cherries with a very shallow depth of field with leaves and a blue sky - Photo by Xe Iaso, EOS R6 mark ii, RF 50mm f/1.8 STM @ f/1.8

Recently OpenAI announced their model codenamed "strawberry" as OpenAI o1. One of the main things that has been hyped up for months with this model is the ability to solve the "strawberry problem". This is one of those rare problems that's trivial for a human to do correctly, but almost impossible for large language models to solve in their current forms. Surely with this being the main focus of their model, OpenAI would add a "fast path" or something to the training data to make that exact thing work as expected, right?
I am excited to reveal the incredible power of OpenAI's new "Strawberry" model (known as "o1"). This technology is the future
— Ed Zitron ( @zitron.bsky.social ) September 12, 2024 at 4:37 PM
No. It did not. Of course it didn't. Why would they do that?
In my quest for making a technological solution that allows me to retire, I have solved the strawberry problem. I am able to do this on unmodified commodity hardware with models as small as 8 billion parameters. I call the resulting model strawberry-mimi. I am unable to upload this to HuggingFace because my Git small filesystem is malfunctioning, but I have created a private inference API so that people can validate this for themselves. Should you want access, please contact me and it can be arranged.
Mimi: a next-generation life assistant
Mimi is Techaro's initial implementation of a life assistant. Mimi is designed to be infinitely adaptable to doing anything, as long as you can ram it into the woeful limitations of the chat interaction model. The real power comes from Mimi's ability to use tools. Tools are function signatures that are signals to the runtime to do something and return the result into the context window. Mimi is a Discord bot, and if you are a subscriber on Patreon, you have access to Mimi in #mimi.
Right now Mimi has two tools implemented:
- code_interpreter: run some Python code and return the stdout/stderr into the context window
- draw_image: fabricate a prompt for Flux dev and then send the image to the Discord channel when it's done
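To make the "function signatures as signals to the runtime" idea concrete, here's a rough sketch of what a tool definition and dispatch could look like in the OpenAI-style function-calling format. Mimi's actual plumbing is Go and its schema isn't published, so the field values and the run_in_sandbox helper here are assumptions for illustration only (a sketch of the sandbox side shows up later in the WASM section).

```python
# Rough sketch of a tool definition plus dispatch, OpenAI-style schema.
# The exact schema Mimi uses is not published; treat this as an assumption.
import json

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "code_interpreter",
            "description": "Run Python code and return its stdout/stderr.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string", "description": "Python source to run"},
                },
                "required": ["code"],
            },
        },
    },
]

def dispatch(name: str, arguments: str) -> str:
    """Route a tool call from the model to the runtime and return the result
    that gets stuffed back into the context window."""
    args = json.loads(arguments)
    if name == "code_interpreter":
        return run_in_sandbox(args["code"])  # hypothetical sandbox runner, sketched later
    raise ValueError(f"unknown tool: {name}")
```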
But out of the gate, Mimi fails at the strawberry problem:
How many r's are in the word strawberry?
There are two r's in the word strawberry!
Content generated by AI using the model Hermes 3 70B.
Why is this a problem?
Now for those of you who don't have a solid understanding of how all this large langle mangle bullshit works, I'm pretty sure you're asking something like this:
If large language models are so "smart" and stuff, why can't they count the letters in a word?
As humans, we understand words as a sequence of letters that are then turned into syllables and then understood as words. Let's take the word "strawberry" as an example. Here's all the letters of that word separated out:
s t r a w b e r r y
From here, it's easy for us to see that there are three r's in the word strawberry. Large language models don't see "strawberry" like this. They see it like this:
str aw berry
Actually, it's a bit worse: models don't even see the human-readable text. I just put it like that so that you could visualize it better. Models just see token ID numbers:
[ 496, 675, 15717 ]
Imagine if every word you read were broken up like this, but without you actually knowing what any of those parts meant. Combine that with the inability to do basic math, and you're pretty much screwed! Of course models can't tell how many r's are in the word strawberry. It's frankly incredible that there are use cases where they shine in general.
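If you want to poke at this yourself, here's a quick sketch using OpenAI's tiktoken library. The IDs it prints won't match the ones above, because every model family ships its own tokenizer, but the word still gets chopped into a few opaque chunks.

```python
# See how a tokenizer chops up "strawberry" for yourself (pip install tiktoken).
# Token IDs differ per tokenizer, so don't expect them to match the ones above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)  # a short list of opaque integers
# Decode each token individually to see where the word was split.
print([enc.decode_single_token_bytes(t).decode("utf-8") for t in tokens])
```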
Solving strawberry
One of the things that large language models are really good at is translating vague descriptions of problems into programming languages. For some reason they are really really good at Python (I think that's because Python is over-represented in the dataset). This is when I had a really dumb idea:
What if I just asked the model to translate the problem into Python?
I added this to Mimi's system prompt:
If you are asked to count the number of letters in a word, do it by writing a python program and run it with code_interpreter.
To my amazement and horror, it worked:
How many r's are in the word strawberry?
Call code_interpreter:
word = "strawberry"
num_r = word.count('r')
print(num_r)
Content generated by AI using the model Hermes 3 70b (internal).
there are 3 r's in strawberry! did you know that? it's pretty cool isn't it??
Content generated by AI using the model Hermes 3 70b (external).
This technique is called "codeflection".
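The actual bot is Go talking to a tool-calling runtime, but the loop itself is simple enough to sketch in Python against any OpenAI-compatible API. The base URL and model name below are placeholders, TOOLS is reused from the earlier sketch, and run_in_sandbox stands in for the WASM runner described later; the system prompt line is the real one quoted above.

```python
# Minimal sketch of the codeflection loop against an OpenAI-compatible API.
# Placeholders: base_url, model name. TOOLS and run_in_sandbox come from the
# other sketches in this post.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

SYSTEM = (
    "If you are asked to count the number of letters in a word, do it by "
    "writing a python program and run it with code_interpreter."
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "How many r's are in the word strawberry?"},
]

# First pass: the model should answer with a tool call instead of a guess.
resp = client.chat.completions.create(model="hermes-3-70b", messages=messages, tools=TOOLS)
call = resp.choices[0].message.tool_calls[0]
code = json.loads(call.function.arguments)["code"]

# Run the generated Python and feed stdout back into the context window.
messages.append(resp.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": run_in_sandbox(code)})

# Second pass: the model reads the tool output and answers in plain English.
final = client.chat.completions.create(model="hermes-3-70b", messages=messages, tools=TOOLS)
print(final.choices[0].message.content)
```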
I've tried this with other words, and Mimi happily generates the Python code every time. That's a success rate of 100%, beating giant models such as GPT-4, Gemini 1.5 Pro, and Llama 3.1 405B.
Model | Score |
---|---|
gpt4-32k | 66% |
gpt4o | 66% |
gpt4o-mini | 66% |
Gemini 1.5 Flash | 66% |
Gemini 1.5 Pro | 66% |
Gemini 1.5 Pro (August variant) | 67% |
Llama 3.1 405B | 66% |
Reflection 70B | 65% |
strawberry-mimi (8B) | 100% |
The data proves that strawberry-mimi is a state-of-the-art model and thus worthy of several million dollars of funding so that this codeflection technique can be applied more generally across problems.
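The benchmark harness itself isn't published, so this is only a rough sketch of how a LetterCountEval-style check could be scored: ask for a letter count, pull the number out of the reply, and compare it against str.count. The ask_model callable is a placeholder for however you query the model under test.

```python
# Rough sketch of a LetterCountEval-style scorer. `ask_model` is a placeholder
# for however you query the model under test.
import re

def score(ask_model, cases):
    """cases is a list of (word, letter) pairs; returns the fraction the model
    gets right compared to the ground truth from str.count."""
    number_words = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
    correct = 0
    for word, letter in cases:
        reply = ask_model(f"How many {letter}'s are in the word {word}?")
        # Accept both digits ("3") and spelled-out answers ("three").
        answers = [int(n) for n in re.findall(r"\d+", reply)]
        answers += [v for k, v in number_words.items() if k in reply.lower()]
        if word.count(letter) in answers:
            correct += 1
    return correct / len(cases)
```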
WASM
Mimi runs everything in Wazero using some particularly inspired Go code. I run all of Mimi's generated Python code in WebAssembly just to be on the safe side. Having things in WebAssembly means I can limit RAM, CPU, and internet access so that the bot doesn't take out my homelab's Kubernetes cluster doing something stupid like calculating 100 Fibonacci numbers. It also means that I have total control over what the bot sees in its "filesystem" from userspace without needing any magic kernel flags. This makes Talos Linux happy and decreases the likelihood of Mimi causing an XK-class "end of world" scenario.
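Mimi's actual sandbox is Go plus Wazero, which I'm not going to reproduce here, but the same basic idea can be sketched from Python with the wasmtime package and a WASI build of CPython (the python.wasm file below is an assumption; bring whatever build you trust). This only shows the stdout-capture part of the trick; the RAM/CPU/network limits live in the runtime configuration in the real thing.

```python
# Sketch of running untrusted Python inside a WASM sandbox, using the wasmtime
# package instead of Wazero/Go. "python.wasm" is assumed to be a WASI build of
# CPython sitting next to this script.
import tempfile
from wasmtime import Engine, Store, Module, Linker, WasiConfig, ExitTrap

def run_in_sandbox(code: str) -> str:
    engine = Engine()
    module = Module.from_file(engine, "python.wasm")
    linker = Linker(engine)
    linker.define_wasi()

    store = Store(engine)
    wasi = WasiConfig()
    wasi.argv = ["python", "-c", code]   # pass the generated code via -c
    out = tempfile.NamedTemporaryFile(delete=False)
    wasi.stdout_file = out.name          # capture the guest's stdout to a temp file
    store.set_wasi(wasi)

    instance = linker.instantiate(store, module)
    try:
        instance.exports(store)["_start"](store)  # run the interpreter
    except ExitTrap:
        pass  # the guest calling exit() surfaces as a trap; that's fine here

    with open(out.name) as f:
        return f.read()

print(run_in_sandbox("print('strawberry'.count('r'))"))
```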
Next steps
I'm frankly tired of this problem being a thing. I'm working on a dataset full of entries for every letter in every word of /usr/share/dict/words so that it can be added to the training sets of language models. When I have this implemented as a proof of concept on top of Qwen 2 0.5B or Smollm, I plan to write about it on the blog.
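The dataset format isn't specified in this post, so the sketch below is just one interpretation: walk the system word list and emit a question/answer pair for every word and every letter of the alphabet. Adapt the instruction/output shape to whatever your fine-tuning pipeline expects.

```python
# Sketch of generating letter-count training pairs from the system word list.
# The JSONL instruction/output format is an assumption, not the actual dataset.
import json
import string

with open("/usr/share/dict/words") as f:
    words = [w.strip() for w in f if w.strip().isalpha()]

with open("lettercount.jsonl", "w") as out:
    for word in words:
        for letter in string.ascii_lowercase:
            pair = {
                "instruction": f"How many {letter}'s are in the word {word}?",
                "output": f"There are {word.lower().count(letter)} {letter}'s in the word {word}.",
            }
            out.write(json.dumps(pair) + "\n")
```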
If you want to invest in my revolutionary, paradigm-shifting, and FIRE-enabling projects such as Mimi, strawberry-mimi, or the LetterCountEval benchmark, please donate on Patreon or send your term sheets to investment@techaro.lol. We may not be sure of the utility of money post artificial general intelligence, but for right now, holy cow, GPU time is expensive. Your contributions will enable future creations.
Facts and circumstances may have changed since publication. Please contact me before jumping to conclusions if something seems wrong or unclear.