Wordle golf challenge: wrap-up and retrospective


(This is the sixth article in the series for Golf Horse: Wordle.)

I’ve made a “sanitized” version of my submission available on GitHub. I’ve changed the parameters and the compressed data, so that any future competitors still have some work to do! I also included an annotated version of the source code, so you don’t have to decipher all the single-letter variables.

My final submission was 10982 bytes, and I felt I gave a good effort in every area: packing bits, choosing context models, designing and optimizing a context mixing formula, and golfing down the actual JavaScript code.

However, I still have a number of ideas for further improvements. There’s a good chance I’ll pursue some of these myself, but until then, I want to share them for anyone else interested in trying the competition.

  • Look at the data. I spent very little time analyzing the data coming from the compression algorithm. It could be useful to find out where the models were mispredicting, which contexts were outperforming in what circumstances, and what words were costing the most bits. Perhaps there are patterns that could be exploited for more gains.
  • Better context mixing formula. My context mixing formula was quite unprincipled; I’m sure there must be a better way.
  • Experiment more with adaptive weights. That is, adjust the context mixing weights on-the-fly. I had bad results early on when I tried adaptive weights, but I’m pretty sure there’s a better way to do it. Perhaps some combination of ahead-of-time optimized parameters and on-the-fly adaptive parameters could improve the mixer’s performance (a rough sketch of adaptive mixing follows this list).
  • Find more useful “letter categories”. My submission benefitted a lot from including vowel-consonant patterns in the context models. I wonder if there are other ways of grouping letters that could improve compression. I tried a few simple ideas, and they weren’t beneficial, but I wonder if a more systematic search would uncover useful categories (a small sketch of this idea appears after this list).
  • Utilize more information from sibling nodes (that is, words with the same prefix). My context models only included one piece of sibling information: the number of “true” bits seen so far in the current node. Although I tried several times to find a way to include siblings in the context models, I was never able to get enough useful information from them to be worthwhile. I suspect that the large number of sibling contexts effectively “confuses” the context mixer: basically, there’s a bad signal-to-noise ratio among the sibling contexts. Maybe a smarter context mixing formula could extract more useful information.
  • Try a directed acyclic graph representation. My submission represented the wordlist as a prefix tree, which automatically shares prefixes of words. A DAG would allow sharing of suffixes, too. This feels like a promising idea, since there are lots of shared suffixes, but it also seems like it would lead to a lot of complications and would require a very different approach to context modeling, among other things.
  • Try splitting up into multiple wordlists. English words have roots from different languages, each with its own spelling patterns and therefore its own optimal compression parameters. I wonder if it’d be worthwhile to split the wordlist into multiple pieces and compress each separately.
  • Try using bigrams. That is, treat some two-letter combinations as a single unit. For example, “th”, “sh”, and “ck” usually behave like single letters in English spelling. I didn’t pursue this because it means the words are no longer all five “letters” long, which has a significant impact on how the program works. I hoped that including long enough context models would capture most of the bigram behavior. (A simple bigram tokenization is included in one of the sketches after this list.)
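
To make the adaptive-weights idea above a bit more concrete, here is a minimal sketch of the standard “logistic mixing” approach used by PAQ-style compressors: each model’s probability is stretched into the logit domain, combined with weights, and the weights are nudged after every bit. This is not what my submission does (it uses fixed, ahead-of-time optimized weights), and the names and the learning rate below are illustrative, not taken from my code.

```js
// Sketch of adaptive logistic mixing (PAQ-style); names and learning rate are illustrative.
const stretch = p => Math.log(p / (1 - p));   // probability -> logit
const squash  = x => 1 / (1 + Math.exp(-x));  // logit -> probability

function makeMixer(numModels, learningRate = 0.02) {
  const w = new Array(numModels).fill(0);     // mixing weights, adapted online
  let st = [];                                // last stretched inputs
  let p = 0.5;                                // last mixed prediction
  return {
    // Combine each model's probability of "next bit is 1" into one prediction.
    predict(probs) {
      st = probs.map(stretch);
      p = squash(st.reduce((sum, s, i) => sum + w[i] * s, 0));
      return p;
    },
    // Once the actual bit is known, nudge each weight in the direction that
    // would have reduced the coding cost: prediction error times that model's input.
    update(bit) {
      const err = bit - p;
      for (let i = 0; i < w.length; i++) w[i] += learningRate * err * st[i];
    },
  };
}

// Usage: one predict/update pair per encoded bit.
// const mixer = makeMixer(3);
// const prob = mixer.predict([0.9, 0.6, 0.7]); // feed prob to the arithmetic coder
// mixer.update(1);                             // then learn from the actual bit
```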
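
The letter-category and bigram ideas can be sketched the same way. The category map and the bigram list below are made-up examples for illustration; my submission only distinguished vowels from consonants, and did not use bigrams at all.

```js
// Letter categories: map each letter to a class, so "crane" -> "CCVCV".
// Other groupings (separating rare letters, say) are the kind of thing to test.
const category = c => 'aeiou'.includes(c) ? 'V' : 'C';
const pattern = word => [...word].map(category).join('');

// Bigrams: treat certain two-letter combinations as single units, so words
// are no longer always five symbols long. The bigram set here is hypothetical.
const BIGRAMS = ['th', 'sh', 'ch', 'ck'];
function tokenize(word) {
  const tokens = [];
  for (let i = 0; i < word.length; ) {
    const pair = word.slice(i, i + 2);
    if (BIGRAMS.includes(pair)) { tokens.push(pair); i += 2; }
    else { tokens.push(word[i]); i += 1; }
  }
  return tokens;
}

// pattern("crane")  -> "CCVCV"
// tokenize("shack") -> ["sh", "a", "ck"]  (3 symbols instead of 5)
```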

Finally, I’d like to share a few ideas I tried, but didn’t make the cut:

  • Neural networks. I wrote my differentiation and gradient descent routines because I wanted to try neural networks for making bit predictions. However, after a bit of experimentation, it seemed like I’d need a lot of parameters before the networks could do anything useful. That raised the question: how do I compress the network weights? Basically, neural networks didn’t seem like a compact way to encode information. I also worried about performance: there’s a 100-second limit on run time, and it might not be enough for a big model.
  • Adaptive weights. As mentioned earlier, I tried adaptive weights for context mixing early in my development process. They underperformed optimized fixed weights, so they didn’t seem useful. But I might not have been doing it quite right, which is why I also list this as an idea for improvement!
  • “Sparse grid inspired” context modeling. I’ll probably write an article about sparse grids in the future, as they’re one of my favorite numerical techniques. I was hoping I could use the idea to combine overlapping smaller contexts into a larger one. To give an example, consider the following four contexts: (1) the current letter, (2) current + previous letter, (3) current + 2nd previous letter, and (4) current + previous + 2nd previous letter. The idea would be to estimate (4) as (2) + (3) – (1). This expression is accurate if things are linear, but it turns out 5-letter words are very nonlinear, so I had to give up on this approach. (A small sketch of the combination follows this list.)
  • Multi-base encodings. I thought I might be able to pack bits more efficiently if I could use multiple bases in the arithmetic decoder. Although I still believe this is theoretically sound, I couldn’t make a working encoder. It’s a bit hard to describe the problem without a whole essay, but basically, due to accumulated rounding errors, I lost monotonicity of the mapping from encoded data to bits, which was essential to the implementation of my encoder.
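
To illustrate the sparse-grid item above: one way to make the (2) + (3) – (1) combination concrete is to apply it to stretched (logit) probabilities; that choice of quantity is just one reasonable option for illustration. This is a sketch of the kind of estimate involved, not code from my submission, which abandoned the approach.

```js
const stretch = p => Math.log(p / (1 - p));   // probability -> logit
const squash  = x => 1 / (1 + Math.exp(-x));  // logit -> probability

// p1: P(bit | current letter)
// p2: P(bit | current + previous letter)
// p3: P(bit | current + 2nd-previous letter)
// Combination-technique estimate of P(bit | all three letters):
const combined = (p1, p2, p3) =>
  squash(stretch(p2) + stretch(p3) - stretch(p1));

// Example: if both two-letter contexts push the prediction up relative to the
// one-letter context, the estimate pushes it up further:
//   combined(0.5, 0.7, 0.7) ≈ 0.84
// The catch: this is only exact when the contexts combine (log-)linearly,
// and for 5-letter words they mostly don't.
```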

Perhaps in the future I’ll try some of these improvements myself, or try another one of the Golf Horse challenges. Until then, that wraps up this series on Golf Horse. I hope you enjoyed these articles going into depth about my submission!
