Survey of the current state of code spell checking


Update: I went ahead and made one. Check out Codebook!

There doesn’t seem to be a universal spell checking solution for code. The code editor Zed has an open issue about it, which led me to dig into the problem. I figured there must be a general solution out there, but the deeper I dug, the clearer it became that this is a difficult problem to solve in the general case.

What is a code spell checker?

People expect a lot from a modern spell checker. Many challenges emerge from just the Zed open issue:

Scope: Should a spell checker also fix grammar issues?
Locality: Is it acceptable to call a remote service?
Languages: Which programming languages to support? What about different written languages?
Native APIs: Instead should the user’s native OS checker be used?
Existing libraries: Alternatively, are there existing libraries to leverage?
Configuration: How are global, project, and per file configurations handled?
UI: How do corrections show up in the interface?

I wanted to explore each issue, specifically within the context of Zed.

Scope

Let’s start with the scope of a code spell checker.

My personal feeling is that a code spell checker should not try to fix grammar issues. While a good code spell checker should be usable in longer prose, I think a separate, heavier tool should handle grammar, either instead of or in addition to the spell checker. Code requires special considerations because of its heavy jargon, which prose tools can generally ignore. For example, a code spell checker needs to evaluate the words inside function and variable names, and sending those to a grammar checker would be a waste of resources.
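
To make the “words inside function and variable names” point concrete, here is a minimal sketch of splitting identifiers into checkable words. It is written in Python purely for illustration; the function name `split_identifier` is my own, and the pattern only handles ASCII letters, so treat it as a starting point rather than how any existing tool does it.

```python
import re

# Alternatives, tried in order: an acronym run that is followed by a
# capitalized word ("HTTP" in "HTTPServer"), a capitalized or lowercase
# word, or a remaining run of capitals. Digits and underscores are skipped.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+")


def split_identifier(identifier: str) -> list:
    """Split a snake_case, camelCase, or PascalCase identifier into words."""
    return _WORD.findall(identifier)


print(split_identifier("parseHTTPResponse"))  # ['parse', 'HTTP', 'Response']
print(split_identifier("user_naem_count"))    # ['user', 'naem', 'count']
```

Each resulting word can then be looked up on its own, so the misspelled naem is caught even though it sits inside a larger identifier.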

Locality

Services like Grammarly, while helpful, require sending data to a third party for evaluation. Code often contains secrets that could be expensive if leaked. Think OpenAI API keys. My preference is that a code spell checker be 100% local. I personally would not use a spell checker that sent data outside of my environment.

Another consideration is speed and availability. A remote spell checker would not be fast and would not work offline. Either one would be a dealbreaker for many people.

For Zed specifically, many people use it for the responsiveness and expect Zed features to optimize for performance. I think a code spell checker must be local. Likewise, I think any solution using LLMs (local or not) is not viable for performance reasons.

Languages

Language support is complicated. It’s already difficult for spell checkers to support multiple written languages, and adding programming languages as another dimension makes it harder still.

As an English speaker, I think focusing on programming language support is more important; however, any solution that cannot support multiple written languages is likely a dead end.

Native vs Library

Every OS comes with a decent built-in spell checker. I’m writing this in Obsidian, which uses native text views to leverage macOS’ spell checker. It works well and is fast. However, code spell checking requires a level of configuration that native spell check may not be able to handle.

For example, fn is a keyword in Rust. It would be annoying if the spell checker flagged fn every time it came up. A code spell checker will need a dictionary for every programming language it supports.

Additional dictionaries will be needed for a user’s global, project, and file needs. These dictionaries will need to be a combination of predefined and user-defined word lists, and maybe even lists shared among a group of people in a company.
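
One way to picture these layers is as an ordered stack of word lists that are consulted in turn. The sketch below is only an illustration in Python: the class name, the layer names, and the example words are all made up, and a real checker would also sit on top of a full written-language dictionary (for example a Hunspell one).

```python
from typing import Optional


class LayeredDictionary:
    """Word lists consulted in order, from most general to most specific."""

    def __init__(self, layers):
        # Each layer is a (name, words) pair; words are stored lowercase.
        self.layers = layers

    def find_layer(self, word: str) -> Optional[str]:
        """Return the name of the first layer that accepts the word, if any."""
        lowered = word.lower()
        for name, words in self.layers:
            if lowered in words:
                return name
        return None

    def is_known(self, word: str) -> bool:
        return self.find_layer(word) is not None


# Hypothetical layers: a programming-language list plus user-managed lists.
dictionary = LayeredDictionary([
    ("rust", {"fn", "impl", "struct"}),
    ("global", {"codebook"}),
    ("project", {"zed"}),
    ("file", {"wolrd"}),
])

print(dictionary.find_layer("fn"))   # rust
print(dictionary.is_known("teh"))    # False
```

Keeping the layers separate is what makes actions like “add to project dictionary” possible later, since each layer can be backed by its own file.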

On top of all that, users will likely want to add comment directives in code to enable or disable aspects of the code spell checker. See how CSpell does this.
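
As a rough sketch of what directive handling involves, the Python below skips regions between a disable and an enable comment. The `spellcheck:disable` and `spellcheck:enable` markers are hypothetical names I picked for illustration; they are not CSpell’s actual syntax.

```python
import re

# Hypothetical directive syntax, loosely inspired by the idea of in-code
# comment directives. This is not CSpell's grammar.
_DIRECTIVE = re.compile(r"spellcheck:(enable|disable)\b")


def checked_lines(source: str) -> list:
    """Return (line_number, text) pairs for lines where checking is enabled."""
    enabled = True
    result = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = _DIRECTIVE.search(line)
        if match:
            enabled = match.group(1) == "enable"
            continue  # the directive line itself is never checked
        if enabled:
            result.append((number, line))
    return result


sample = "\n".join([
    "let a = 1;",
    "// spellcheck:disable",
    "let wolrd = 2;",
    "// spellcheck:enable",
    "let b = 3;",
])
print(checked_lines(sample))  # [(1, 'let a = 1;'), (5, 'let b = 3;')]
```

Because the sketch only looks for the marker text, it works with any comment style, at the cost of also matching the marker inside string literals.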

It seems to me only the library approach could possibly handle this level of complexity.

UI

Finally, how should users interact with a code spell checker? Luckily, VSCode has a popular code spell checker (based on CSpell) that we can learn from, and it has a relatively simple interface.

Misspelled words can get a ‘hint’ squiggle, like how linters show issues.
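
For an editor that speaks the Language Server Protocol, a hint squiggle would typically arrive as a diagnostic with the Hint severity (the value 4 in the protocol). Here is a minimal Python sketch of building such diagnostics for one line. The function name, the `source` label, and the message text are my own assumptions, and positions are only correct for ASCII text, since LSP counts UTF-16 code units by default.

```python
import re

HINT = 4  # DiagnosticSeverity.Hint in the Language Server Protocol


def diagnostics_for_line(line_number: int, text: str, known: set) -> list:
    """Build LSP-shaped diagnostics for unknown words on one line.

    Line and character positions are zero-based, as in LSP.
    """
    found = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group()
        if word.lower() in known:
            continue
        found.append({
            "range": {
                "start": {"line": line_number, "character": match.start()},
                "end": {"line": line_number, "character": match.end()},
            },
            "severity": HINT,
            "source": "spellcheck",  # hypothetical source name
            "message": f"Possible spelling issue: {word}",
        })
    return found


for diagnostic in diagnostics_for_line(0, "let wolrd = 1;", {"let"}):
    print(diagnostic["range"], diagnostic["message"])
```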

When a misspelled word does occur, users will want to take one of a few actions:

Select one of the suggested words. The suggested words need to maintain the capitalization of the original word. A suggestion for the word Wolrd should be World, and not world.
Add the ‘misspelled’ word to a global dictionary.
Add the ‘misspelled’ word to a project dictionary.
Add the ‘misspelled’ word to a file dictionary. (Optional)
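
The capitalization rule from the first action can be sketched in a few lines of Python. The function name `match_case` is made up for this example, and it only covers the all-caps and leading-capital patterns, not mixed cases like camelCase.

```python
def match_case(original: str, suggestion: str) -> str:
    """Apply the capitalization pattern of `original` to `suggestion`."""
    if len(original) > 1 and original.isupper():
        return suggestion.upper()  # WOLRD -> WORLD
    if original[:1].isupper():
        return suggestion[:1].upper() + suggestion[1:]  # Wolrd -> World
    return suggestion  # wolrd -> world


print(match_case("Wolrd", "world"))  # World
print(match_case("WOLRD", "world"))  # WORLD
```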

Current Solutions

With all this in mind, I looked for all the current solutions in this space. Sadly, I don’t think any solution quite meets all the needs laid out above, but I’ll put them here and note why. I’m going to filter out any solution that requires a remote connection or uses an LLM for the same reasons stated above.

Code Spell Checker for VSCode

This is currently the closest to meeting all the criteria. So, why not just use it? The main issue I see with Code Spell Checker is that it doesn’t have a language server. Instead, it’s implemented directly as a VSCode extension in TypeScript. Without a language server, other editors, like Zed, cannot easily use it. There are attempts to make a language server for it, but all the examples I found seem to be unfinished.

While I think making a language server for this could be a good way forward, I think we can do better without much extra work. The biggest issue is that the core codebase is written in TypeScript, where performance can be a problem. Looking through the code, there is a lot happening just to keep it performant. If there were a solution in a faster language, the implementation would likely be much simpler.

Harper

Harper is a project made by Automattic for prose. It’s written in Rust and has an extensive language server. There’s even already a Zed extension for it. The main downside is that it’s specifically for prose and explicitly English only. The goals of Harper just don’t align with the needs of a code spell checker.

Language Tool

In a similar vein, there’s also a Zed extension for Language Tool. This is also a full grammar checker. The other main downside is that it’s written in Java, which is fast, but it’s a heavy install and uses a lot of memory.

Vale

Vale is similar to Harper, but is “markup aware”. Vale is written in Go, so it’s likely fast enough. It has a language server, and a Zed extension already exists.

I think Vale gets closer than Harper, but after trying it, I found a few issues as well. First, Vale is more of a style linter than a spell checker. Since it focuses on style issues (like avoiding the word “We”), it expects a config to get started. The config, in my opinion, is not easy to understand. This makes the barrier to entry too high. Overall, it’s just more complicated than what we need for a code spell checker.

Typos

The current front-runner for Zed seems to be Typos. Typos is written in Rust and is specifically a source code spell checker. There’s a language server and a Zed extension for it. I’m currently using Typos, but it (self-admittedly) has a ton of false negatives, so it catches only a few errors.

Typos intentionally does not catch many issues because it’s meant to be run in CI. Their design doc goes over the trade-offs. It specifically does not use a Hunspell-like dictionary approach and instead uses a hardcoded ‘typos’ word list. I don’t think this is a good trade-off for a local code spell checker due to the maintenance burden around supporting multiple written languages.
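
The trade-off is easier to see side by side. The Python sketch below contrasts a known-misspellings list with a dictionary lookup; the word lists are tiny and made up, and it is meant to illustrate the two approaches in general, not how Typos is implemented internally.

```python
# Hypothetical, tiny word lists for illustration only.
KNOWN_TYPOS = {"teh": "the", "recieve": "receive"}
DICTIONARY = {"the", "receive", "world", "hello"}


def flag_with_typo_list(words: list) -> list:
    """Flag only words that appear in a curated list of known misspellings."""
    return [word for word in words if word.lower() in KNOWN_TYPOS]


def flag_with_dictionary(words: list) -> list:
    """Flag every word that is missing from the dictionary."""
    return [word for word in words if word.lower() not in DICTIONARY]


words = ["teh", "wolrd", "fn"]
print(flag_with_typo_list(words))   # ['teh']
print(flag_with_dictionary(words))  # ['teh', 'wolrd', 'fn']
```

The typo list never produces a false positive on fn, but it misses wolrd because nobody added it. The dictionary catches wolrd, but it also flags fn until a Rust word list is layered in, which is exactly the extra work a local checker has to take on.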

Spellbook

Spellbook itself is not a complete solution, but I’m including it because I think it could be a powerful building block. Spellbook, written by the Helix editor project in Rust, is a Hunspell-compatible spell checking library.

Spellbook is not ‘code-aware’, but has great written language compatibility due to the Hunspell support. There are already many open-source dictionaries available. However, the author currently has no plans to implement a language server for it.

Recommendations

After all this, I’ve come to the conclusion that there are currently no portable code spell checkers that would work for most people. All the current solutions are either editor-specific, do too much, or do too little.

In my opinion, there are two promising paths. First, a language server for CSpell could be developed. This would probably be the fastest way to get to where we want, but it comes with the JavaScript performance overhead.

The second path, and I think the optimal one, is to build a code spell checker around Spellbook. This would involve a bit more work, since a language server needs to be written and the complexities of parsing code into words to feed into Spellbook need to be worked out.
