Index 1,600,000,000 Keys with Automata and Rust

It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly.

In this article, I will teach you about finite state machines as a data structure for representing ordered sets and maps. This includes introducing an implementation written in Rust called the fst crate. It comes with complete API documentation. I will also show you how to build them using a simple command line tool. Finally, I will discuss a few experiments culminating in indexing over 1,600,000,000 URLs (134 GB) from the July 2015 Common Crawl Archive.

The technique presented in this article is also how Lucene represents a part of its inverted index.

Along the way, we will talk about memory maps, automaton intersection with regular expressions, fuzzy searching with Levenshtein distance and streaming set operations.

Target audience: Some familiarity with programming and fundamental data structures. No experience with automata theory or Rust is required.

Error Handling in Rust

NOTE: This blog post has been merged into the Rust Book.

Like most programming languages, Rust encourages the programmer to handle errors in a particular way. Generally speaking, error handling is divided into two broad categories: exceptions and return values. Rust opts for return values.

In this article, I intend to provide a comprehensive treatment of how to deal with errors in Rust. More than that, I will attempt to introduce error handling one piece at a time so that you’ll come away with a solid working knowledge of how everything fits together.

When done naively, error handling in Rust can be verbose and annoying. This article will explore those stumbling blocks and demonstrate how to use the standard library to make error handling concise and ergonomic.

Target audience: Those new to Rust that don’t know its error handling idioms yet. Some familiarity with Rust is helpful. (This article makes heavy use of some standard traits and some very light use of closures and macros.)

Syntax extensions and regular expressions for Rust

A few weeks ago, I set out to add regular expressions to the Rust distribution with an implementation and feature set heavily inspired by Russ Cox’s RE2. It was just recently added to the Rust distribution.

This regex crate includes a syntax extension that compiles a regular expression to native Rust code when a Rust program is compiled. This can be thought of as “ahead of time” compilation or something similar to compile time function execution. These special natively compiled regexes have the same exact API as regular expressions compiled at runtime.

In this article, I will outline my implementation strategy—including code samples on how to write a Rust syntax extension—and describe how I was able to achieve an identical API between regexes compiled at compile time and regexes compiled at runtime.

Writing type parametric functions in Go

Go’s only method of compile time safe polymorphism is structural subtyping, and this article will do nothing to change that. Instead, I’m going to present a package ty with facilities to write type parametric functions in Go that maintain run time type safety, while also being convenient for the caller to use.

Introducing NFLGame: Programmatic access to live NFL game statistics

As a programmer and a fantasy football addict, I am embarassed by the means through which we must expend ourselves to get data in a machine readable form. This lack of open source software cripples the community with sub-standard tools, and most importantly, detracts from some really cool and fun things that could be done with easily available statistics. Many tools are either out-dated or broken, or if they work, they are closed source and often cost money.

Yesterday I started work on a new library package that I hope will start to improve this sorry state of affairs.

Running Archlinux on the Lenovo Thinkpad T430

In sum, Archlinux is working beautifully. What follows is a rough run down of my notes while installing, configuring, tuning and using Archlinux on the Lenovo Thinkpad T430.

Daemonizing Go Programs (with a BSD-style rc.d example)

Go, by its very nature, is multithreaded. This makes a traditional approach of daemonizing Go programs by forking a bit difficult.

To get around this, you could try something as simple as backgrounding your Go program and instructing it to ignore the HUP signal:

Adding Thread Safety to the X Go Binding

The X Go Binding (XGB) is a low level library that provides an API to interact with running X servers. One can only communicate with an X server by sending data over a network connection; protocol requests, replies and errors need to be perfectly constructed down to the last byte. Xlib did precisely this, and then some. As a result, Xlib became huge and difficult to maintain.

In recent years, the XCB project made things a bit more civilized by generating C code from XML files describing the X client protocol using Python. While the Python to generate said code is no walk in the park, it is typically preferred to the alternative: keeping the X core protocol up to date along with any number of extensions that exist as well. (There are other benefits to XCB, like easier asynchronicity, but that is beyond the scope of this post.)

XGB proceeds in a similar vain; Python is used to generate Go code that provides an API to interact with the X protocol. Unlike its sister project XCB, it is not thread safe. In particular, if X requests are made in parallel, the best case scenario is a jumbled request or reply and the worst case scenario is an X server crash. Parallel requests can be particularly useful when images are being sent to the X server to be painted on the screen; other useful work could be done in the interim.

Dynamic desktop workspaces

Do you have dynamic workspaces in your window manager?

You might be wondering: what in the world are dynamic workspaces? A dynamic workspace model allows one to add, remove or rename workspaces on the fly. Comparatively, in a typical window manager (or desktop environment) configuration, you tell the window manager to have x number of workspaces. When you start your window manager, you’ll have x workspaces, and you can typically cycle between them using some variation of “next workspace” or “previous workspace” commands. The disadvantage with this model is that it’s difficult to have a large number of workspaces—else you might forget which window is on each workspace.