GitHub’s Linguist Library Is Now Open Source
GitHub is one of the largest hosting services for software development. In April 2011, it was announced that GitHub hosted a total of 2 million repositories. With so many repositories, it is inevitable that developers use a very large number of languages.
GitHub uses the Linguist library to work out which language is used in a particular file. Once the language has been detected, other features, such as syntax highlighting and repository statistics graphs, can be activated.
Joshua Peek of GitHub announced that the company is releasing the Linguist library as open source, so that others can add features such as support for currently unsupported languages. Peek wrote in the announcement:
From time to time we get requests asking us to add support for new highlighting lexers, recognize additional extensions as certain languages, or ignore a directory from a repo's stats graph.
The code for these concerns was scattered around the app. I decided to unify and package them all up into a single library. Now it's open source.
So if you notice an unrecognized extension or you're really into some obscure language that isn't supported yet, now is your chance to help contribute back.
Now that the code is available to anyone, it is interesting to see how the Linguist library works. When it sees a new file, Linguist detects the language from the file extension. For extension-less files, Linguist uses what GitHub calls “deep content inspection” to detect the language.
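To make the two-step idea concrete, here is a minimal sketch in Python. It is not Linguist's actual code or API: the function name, the lookup tables, and the choice of a shebang-line check as a stand-in for “deep content inspection” are all assumptions made for illustration, since the announcement does not describe what that inspection involves.

```python
import os
import re

# Hypothetical extension table for illustration; a real one would be far larger.
EXTENSION_LANGUAGES = {
    ".rb": "Ruby",
    ".py": "Python",
    ".js": "JavaScript",
    ".c": "C",
}

# Hypothetical mapping from interpreter names (as seen in shebang lines) to languages.
INTERPRETER_LANGUAGES = {
    "ruby": "Ruby",
    "python": "Python",
    "python3": "Python",
    "node": "JavaScript",
    "sh": "Shell",
    "bash": "Shell",
}

# Matches "#!/path/to/program" with an optional first argument.
SHEBANG = re.compile(r"^#!\s*(\S+)(?:\s+(\S+))?")


def detect_language(filename, content=""):
    """Guess a language name, or return None if nothing matches."""
    # Step 1: use the file extension when there is one.
    ext = os.path.splitext(filename)[1].lower()
    if ext:
        return EXTENSION_LANGUAGES.get(ext)

    # Step 2: no extension, so look inside the file.
    # Assumption: only the shebang line is inspected here.
    first_line = content.split("\n", 1)[0]
    match = SHEBANG.match(first_line)
    if not match:
        return None

    program = os.path.basename(match.group(1))
    # "#!/usr/bin/env python" names the interpreter as the first argument.
    if program == "env" and match.group(2):
        program = match.group(2)
    return INTERPRETER_LANGUAGES.get(program)
```

With this sketch, `detect_language("app.rb")` is resolved by the extension alone, while `detect_language("deploy", "#!/usr/bin/env python\n")` falls through to the content check. A production detector would need many more signals than a shebang line, which is presumably why the real library treats content inspection as its own concern.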
Once the language has been detected, it is passed to Albino, a Pygments wrapper, which does the actual syntax highlighting.
If you are interested, Linguist is available on GitHub.