Emacs Tree-sitter custom highlighting

amitp.blogspot.com

123 points by ibobev 4 months ago

kleiba 4 months ago

Classic Emacs syntax highlighting is based on regular expressions ("font-lock-mode"). Of course, the grammars of programming languages are usually not regular languages but higher up in the language class hierarchy (hi, C++!). But you can get a surprising amount of things right just through the context in which a token appears.

For instance, the example of this article (`type` as a keyword vs. `type` as a function) would probably have worked with font-lock-mode as well because you could distinguish the two cases from whether or not a left parenthesis follows the token. But, of course, without proper parsing, there's always the possibility of edge cases that you cannot resolve correctly.
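
A minimal sketch of that context heuristic, written in Python rather than as a font-lock rule for brevity. The name `classify_type_tokens` and the labels "keyword" and "function" are made up for illustration. It only checks whether a left parenthesis follows the token, so it misclassifies `type(` inside a string or comment, which is exactly the sort of edge case mentioned above.

```python
import re

# Heuristic only: a standalone `type` directly followed by "(" is treated
# as a function call, any other standalone `type` as a keyword.
# No real parsing happens here.
TYPE_TOKEN = re.compile(r"\btype\b(\s*\()?")

def classify_type_tokens(source):
    """Return (offset, kind) pairs for each standalone `type` token."""
    return [
        (m.start(), "function" if m.group(1) else "keyword")
        for m in TYPE_TOKEN.finditer(source)
    ]

if __name__ == "__main__":
    sample = "type Point = tuple[int, int]\nprint(type(p))"
    for offset, kind in classify_type_tokens(sample):
        print(offset, kind)
```

Identifiers that merely contain the word, such as `my_type`, are skipped because of the `\b` word boundaries.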

The interesting cases arise anyway when whatever you have in your buffer does not adhere to the grammar, i.e. you have a syntax error: how does then your syntax highlighter cope with that?

amitp 4 months ago

(author here) I agree, the `type` example could be done with regular expressions. In part 2 I'm planning to describe the real reason I was using tree-sitter here. I wanted to highlight certain combinations of operations based on the naming conventions I use in one of my projects. In particular, I want to catch a function call where a function named "x_to_y" has an argument with a name that does not appear to be an "x". However, while writing part 1 I realized that I could probably do that with a regular expression…
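
A rough sketch of how a regular-expression version of that check might look, in Python for brevity. This is not the author's implementation: the function names `hex_to_pixel` and `pixel_to_hex` in the usage note are hypothetical, and "appears to be an x" is approximated as "the argument name contains x". It only handles calls with a single plain identifier as the argument.

```python
import re

# Matches calls such as `hex_to_pixel(arg)` whose only argument is a
# plain identifier. Nested calls and multiple arguments are ignored.
CALL = re.compile(
    r"\b([a-z]+)_to_([a-z]+)\(\s*([A-Za-z_][A-Za-z0-9_]*)\s*\)"
)

def suspicious_calls(source):
    """Return (function, argument) pairs where the argument name
    does not contain the source kind named in `x_to_y`."""
    hits = []
    for m in CALL.finditer(source):
        src_kind, dst_kind, arg = m.groups()
        if src_kind not in arg.lower():
            hits.append((f"{src_kind}_to_{dst_kind}", arg))
    return hits
```

With input such as `hex_to_pixel(pixel)` the call is reported, while `hex_to_pixel(hex)` and `pixel_to_hex(start_pixel)` are not.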
- kleiba 4 months ago
  
  Sounds interesting, looking forward to part 2 then!
neilv 4 months ago

In addition to leaning mostly on regexps (used in a few ways), the ancient Emacs `font-lock` highlighting also uses "syntax classes" of characters to help tokenize/lex and structure (e.g., is this character an identifier constituent, does it start a string literal, does it start a structural grouping like a parenthesis, etc.). There are also some ways to insert arbitrary code to do some things that are harder, like non-regexp lookahead. You can also annotate pieces of text as you go through it, to cache information.
The rules for indenting are actually implemented differently, even though they also involve some kind of parse. And it's not unusual to have to cache context information about the current line, for performance, so that you don't have to look back at preceding lines until you're satisfied you have enough context to indent the current line. The functions to indent multiple lines at once of course might represent this context without having to annotate the buffer.
> you have a syntax error: how does then your syntax highlighter cope with that?
I wrote (but didn't release) an all-new language-specific incremental fast parser for Emacs that recovered from some syntax errors. My general approach was to pick a region of text that included the obvious syntax error, visually highlight it in red, annotate it so that a mouseover would hover an explanation bubble of what's wrong with it, and then continue the parse assuming some reasonable context. You can see screenshots at:
https://www.neilvandyke.org/quack/#meow
For example, for an unterminated string literal, it would error-highlight the opening quote and subsequent characters up to the first whitespace. For another example, a string literal with an invalid escape sequence would error-highlight the entire string literal up through the closing quote. Another example shown is detecting a character that can't occur in that context (a close-paren immediately after a comment-the-following-s-expression).
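
A minimal sketch of those two string-literal recovery rules, in Python for brevity and not taken from the unreleased parser. The function name `scan_strings`, the span labels, and the tiny set of valid escapes are all assumptions for illustration. It covers only double-quoted strings.

```python
VALID_ESCAPES = set('nt\\"')  # assumption: a tiny escape set for illustration

def scan_strings(text):
    """Return (start, end, kind) spans; kind is 'string' or 'error'."""
    spans = []
    i, n = 0, len(text)
    while i < n:
        if text[i] != '"':
            i += 1
            continue
        start, j = i, i + 1
        bad_escape = closed = False
        while j < n:
            ch = text[j]
            if ch == "\\":
                # Remember an invalid escape but keep scanning for the close.
                if j + 1 < n and text[j + 1] not in VALID_ESCAPES:
                    bad_escape = True
                j += 2
            elif ch == '"':
                closed = True
                j += 1
                break
            else:
                j += 1
        if closed:
            # Invalid escape: flag the whole literal through the closing quote.
            spans.append((start, j, "error" if bad_escape else "string"))
            i = j
        else:
            # Unterminated: flag from the opening quote up to the first
            # whitespace, then resume scanning right after that point.
            k = start
            while k < n and not text[k].isspace():
                k += 1
            spans.append((start, k, "error"))
            i = k
    return spans
```

The point of the sketch is that the scanner never gives up: each error is confined to a small span and scanning resumes after it, so the rest of the buffer can still be highlighted.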
- ssivark 4 months ago
  
  Very excited to see parsing for ill-defined states! I like your naming scheme of using animal sounds, but just wanted to bring to your attention that Emacs already has a popular package named meow (for modal editing)
  https://github.com/meow-edit/meow
  
  neilv 4 months ago
  
  Thanks for the heads-up on the name collision!
  I just updated my page to acknowledge that there's a different project with that name, and I will rename my unreleased project.
  (I'd mentioned Meow online several times, years ago, but understandable that they wouldn't have been aware of it, and I have no claim to the name, anyway. Not only was my project never released, but the community where I mostly mentioned it had/has a problem with many posts from our Google Group no longer showing up in Google search hits.)
  > I like your naming scheme of using animal sounds,
  It originally wasn't. :) The developers of the Scheme implementation family that's now called Racket developed a bespoke IDE for students, called DrScheme (as in doctor), which did some fancy things. For my much less fancy Emacs kludges, I named it "Quack", as in a fake doctor. The animal sounds only came when I needed a name for the successor to Quack.
krupan 4 months ago

Hopefully it copes very poorly so you see the syntax error quickly and fix it :-)
Only half joking

tptacek 4 months ago

I'm genuinely psyched about this. One of the few bits of Elisp I've ever written and used consistently was some goop to drive `hi-lock-mode`, which allows you to highlight arbitrary regexps --- I used it exclusively to highlight tokens. It was unreasonably effective for source code audits, being able to click a variable and then sweep through the code spotting everything that used it. But hi-lock is an afterthought of a package, and Tree-sitter isn't. Neat!

vzaliva 4 months ago

When I first read about the integration of tree-sitter into Emacs, I was very excited. I work with a DSL, for which I maintain a tree-sitter grammar and highlighting rules. I can view source files with highlighting from the command line, and I was hoping I could now easily re-use this grammar in Emacs to edit files in my DSL with proper highlighting.

Unfortunately, it wasn't as straightforward as I hoped. You need to create a custom major mode for your language and manually integrate the tree-sitter highlighting.

What I'd really like to see one day is an Emacs mode that allows you to automatically plug in any tree-sitter grammar with just a couple of lines of configuration in your .emacs, and instantly get syntax highlighting. Is that too much to ask?

toomim 4 months ago

That should be easy to build. In 2002 I built harmonia-mode, which did that for the harmonia research project that inspired tree-sitter.
- toomim 4 months ago
  
  The best way forward is similar to how you describe. Instead of making one mode per language -- just make a generic "tree-sitter" mode, and attach that mode to all the filetypes you want it to load for, via regexp patterns in `auto-mode-alist`.
  Then when the file goes into tree-sitter-mode, you can check the filetype again, and map that into the language to load into tree-sitter. Keep a buffer-local variable to remember the current language, so that you can use it for any additional language-specific customization that you want as well.
  Keep in mind that there's nothing about a major mode in Emacs that has to be specific to a programming language. It's totally cool to have a major mode that works for multiple programming languages!
  
  vzaliva 4 months ago
  
  I think the main problem is that the highlighting framework used by tree-sitter (https://github.com/tree-sitter/tree-sitter/tree/master/highl...) is not easily pluggable into emacs font-lock-mode.
  
  toomim 4 months ago
  
  I wrote similar code in emacs for harmonia-mode in the past. Let me find and share the source code. We don't need font-lock. We can just borrow the styles it uses.
  *Update:* I found the code. Instead of using font-lock, we simply draw our own overlays over the text and color them, which is what font-lock does. Font-lock IIRC is specifically designed to use regexps to parse the text. We don't need that. So throw font-lock away. Tree-Sitter itself knows how the parse tree maps to text regions. Just use that information to draw the overlays directly. It's way simpler that way.
  Lemme know if anyone wants the code.
  
  vzaliva 4 months ago
  
  It's starting to sound like it's time to create a GitHub repository and start putting it all together.
  My elisp skills are basic, so I could help more with testing, coordination, publishing packages, or documentation.
  Shall I create the repo?
  
  vzaliva 4 months ago
  
  repo created. I've sent you a request on github.
- pama 4 months ago
  
  I would contribute to a repo that works towards this goal.
  
  toomim 4 months ago
  
  I'd be happy to help as well! I could probably even get the basic framework started.
  I'm toomim@gmail.com.
  
  vzaliva 4 months ago
  
  Here's the repo: https://github.com/vzaliva/emacs-tree-sitter-highlight
lambda_foo 4 months ago

Sure, you will need to write a major mode, but honestly it's not that hard to get syntax highlighting working. Following something like https://www.masteringemacs.org/article/lets-write-a-treesitt... and using M-x treesit-explore-mode, it took me 1 day to get treesitter support for OCaml, going from knowing nothing about how Treesitter works with Emacs to a semi-decent highlighting and indentation setup. The indentation was harder to get working, but all up it came to 300 lines of elisp including whitespace and comments.
amitp 4 months ago

With the older tree-sitter package[1], I was able to use it with the existing major modes. The new built-in emacs tree-sitter seems to be more ambitious, involving new major modes.
[1] https://emacs-tree-sitter.github.io/

tsuru 4 months ago

Wow, authored by the person who created Solar Realms Elite... A blast from my past crossing into my present.

(A bit reductionist of his many accomplishments in between, I know, it's just a thing that's hit me in the moment)

timewizard 4 months ago

The BBS era will always be the favorite era of my life. Thanks for pointing that out.

wglb 4 months ago

This reminds me of an editor that Datalogics produced back in the 1990s that edited SGML-based documents. The formatting could be directed by the context the element was in--the enclosing tags at various levels.