Quick start

How to build your first corpus in no time:

Introduction

Despite certain obvious drawbacks (e.g. lack of control, sampling and documentation), there is no doubt that the World Wide Web is a mine of language data of unprecedented richness and ease of access.

It is also the only viable source of "disposable" corpora built ad hoc for a specific purpose (e.g. a translation or interpreting task, the compilation of a terminological database, domain-specific machine learning tasks). These corpora are essential resources for language professionals who routinely work with specialized languages, often in areas where neologisms and new terms are introduced at a fast pace and where standard reference corpora have to be complemented by easy-to-construct, focused, up-to-date text collections.

While it is possible to construct a web-based corpus through manual queries and downloads, this process is extremely time-consuming. The time investment is particularly unjustified if the final result is meant to be a single-use corpus.

What BootCaT does

BootCaT automates the process of finding reference texts on the web and collating them in a single corpus.

The pipeline allows varying levels of control. In the first step, users provide a list of single- or multi-word terms to be used as seeds for text collection. These are then combined into "tuples" of varying length and sent as queries to a search engine, which returns a list of potentially relevant URLs. At this point the user can inspect the URLs and trim the list. The actual web pages are then retrieved, converted to plain text and saved in "txt" format, so the resulting corpus can be interrogated using most concordancers.
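The seed-to-query step described above can be illustrated with a minimal Python sketch. It is not BootCaT's own code: the function names (make_tuples, to_query), the default tuple length and tuple count, and the choice to wrap multi-word seeds in double quotes are all assumptions made for illustration.

```python
import itertools
import random


def make_tuples(seeds, tuple_len=3, n_tuples=10, rng=None):
    """Return up to n_tuples distinct random combinations of seed terms.

    Hypothetical helper: the defaults are illustrative, not BootCaT's.
    """
    rng = rng or random.Random()
    # Enumerate every unordered combination of tuple_len distinct seeds.
    combos = list(itertools.combinations(seeds, tuple_len))
    # Shuffle so that the selection is random, then keep the first n_tuples.
    rng.shuffle(combos)
    return combos[:n_tuples]


def to_query(tup):
    """Join a tuple into one query string, quoting multi-word terms (assumption)."""
    return " ".join(f'"{term}"' if " " in term else term for term in tup)


if __name__ == "__main__":
    seeds = ["corpus", "concordancer", "terminology",
             "machine translation", "seed term"]
    for tup in make_tuples(seeds, tuple_len=3, n_tuples=5):
        print(to_query(tup))
```

Each printed line is one query string that could be submitted to a search engine. Longer tuples tend to make queries more specific, while a larger number of tuples widens the pool of candidate URLs.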

Using BootCaT, one can build a relatively large quick-and-dirty corpus (typically of about 80 texts, with default parameters and no manual quality checks) in less than half an hour. This flexible approach makes BootCaT a very useful tool for translators and translation students. It has been used in the translation and terminology classroom to build small DIY corpora of varying size and specialization.

If you want to see a few screenshots of the program, take a look at the tutorial.

Real-time usage data

Since the launch of version 1.0 (March 15, 2018)


 
Updated April 19, 2023