Quick start

How to build your first corpus in no time:


Despite certain obvious drawbacks (e.g. lack of control, sampling and documentation), there is no doubt that the World Wide Web is a mine of language data of unprecedented richness and ease of access.

It is also the only viable source of "disposable" corpora built ad hoc for a specific purpose (e.g. a translation or interpreting task, the compilation of a terminological database, domain-specific machine learning tasks). These corpora are essential resources for language professionals who routinely work with specialized languages, often in areas where neologisms and new terms are introduced at a fast pace and where standard reference corpora have to be complemented by easy-to-construct, focused, up-to-date text collections.

While it is possible to construct a web-based corpus through manual queries and downloads, this process is extremely time-consuming. The time investment is particularly unjustified if the final result is meant to be a single-use corpus.

The front-end

The BootCaT front-end is a graphical interface for the BootCaT toolkit: a wizard that guides you through the process of creating a simple web corpus. The front-end does not yet support all the features available in the command-line scripts, so advanced users who are comfortable with text-based interfaces should consider using the scripts instead.

If you want to see a few screenshots of the program, take a look at the tutorial.

What the toolkit does

The command-line scripts included in the BootCaT toolkit implement an iterative procedure to bootstrap specialized corpora and terms from the web, requiring only a list of "seeds" (terms that are expected to be typical of the domain of interest) as input.
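
To make the idea of an iterative, seed-driven procedure more concrete, here is a minimal Python sketch. It is not the toolkit's code and does not mirror the actual scripts: the function names (build_tuples, search, extract_terms, bootstrap) are hypothetical, the web search is replaced by a lookup in a small in-memory list of documents, and the term-scoring formula (corpus frequency divided by reference frequency plus one) is a simplifying assumption chosen for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def build_tuples(seeds, size=2, count=10, rng=None):
    """Pick up to `count` random combinations of `size` seeds to use as queries."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    all_tuples = list(combinations(sorted(seeds), size))
    return rng.sample(all_tuples, min(count, len(all_tuples)))


def search(query_terms, documents):
    """Stand-in for a web search (assumption): documents containing every query term."""
    return [
        doc for doc in documents
        if all(term in doc.lower().split() for term in query_terms)
    ]


def extract_terms(corpus, reference_counts, top_n=5):
    """Rank words by corpus frequency relative to a reference frequency list."""
    counts = Counter(word for doc in corpus for word in doc.lower().split())
    scored = {w: c / (reference_counts.get(w, 0) + 1) for w, c in counts.items()}
    ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def bootstrap(seeds, documents, reference_counts, iterations=2):
    """Alternate between collecting documents and extracting new candidate terms."""
    corpus, terms = [], set(seeds)
    for _ in range(iterations):
        for query in build_tuples(terms):
            for doc in search(query, documents):
                if doc not in corpus:  # skip documents already collected
                    corpus.append(doc)
        terms |= set(extract_terms(corpus, reference_counts))
    return corpus, sorted(terms)


# Toy data standing in for the web and for a general-language frequency list.
documents = [
    "the web corpus contains crawled pages",
    "a corpus built from the web needs cleaning",
    "the cat sat on the mat",
]
reference_counts = {"the": 100, "a": 100, "on": 50, "from": 50}

corpus, terms = bootstrap(["corpus", "web"], documents, reference_counts)
```

In this toy run the two seeds retrieve the two on-topic documents, and the words extracted from them join the term list used to build the next round of queries. The off-topic document is never retrieved because it shares no terms with the seeds.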

In implementing the algorithm, we followed the old UNIX adage that each program should do only one thing, but do it well. Thus, we developed a small, independent tool for each separate subtask of the algorithm.

As a result, BootCaT is extremely modular: one can easily run a subset of the programs, look at intermediate output files, add new tools to the suite, or change one program without having to worry about the others.

More information is available in the readme file included in the toolkit archive.

Updated November 23, 2016