DReaM: The Dictionary/Grammar Reading Machine: Computational Tools for Accessing the World's Linguistic Heritage

Project facts

Duration: 01-01-2018 - 31-12-2020

Project coordinator: Uppsala Universitet, Uppsala, Sweden

Project consortium: Uppsala Universitet, Uppsala (SWEDEN); Universiteit Leiden, Leiden (THE NETHERLANDS); LLACAN, CNRS, Paris (FRANCE).

Funding bodies: JPI CH

Subject areas: Archives, Conservation, Digital Heritage, Digitization, Heritage values - Identity, History, Intangible Heritage, Methods - Procedures, Tangible Heritage, Technologies - Scientific processes

Budget: 299 378 €

Presentation

The diversity of the world’s 6,500 languages embodies a wealth of information on human cognition and the history of populations. As languages go extinct, the linguistic heritage of humankind increasingly resides in grammars and dictionaries, which are rapidly accumulating. Accessing this heritage requires both that the descriptions be available and that someone read them. Availability is a problem because publications are often difficult to obtain.

In this project, we aimed to enhance access to the world’s linguistic heritage by placing an existing collection of more than 9,000 PDF documents that are no longer protected by copyright in a stable archive, enriched with added metadata and with computational tools developed to search for information within the texts.

A number of dictionaries have been converted to apps for mobile devices that can be distributed to speakers of minority languages, handing back to these speakers some of their linguistic heritage.

Another aim of the project was to develop information-extraction tools specifically tailored to language descriptions. Using cutting-edge methods from machine learning and natural language processing, we intended to build a system that can extract millions of snippets of information and link them to construct individual language profiles from a variety of sources. A further aim was to output comparative databases for typological and historical linguistics.
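To make the idea of snippet extraction and language profiles concrete, here is a minimal sketch in Python. It is not the project's system: it swaps in plain regular-expression matching for the machine-learning methods mentioned above, and the names FEATURE_PATTERNS, extract_snippets and build_profiles, the two feature labels and the sample language are all hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical feature patterns (an assumption for illustration only).
# A real system would likely use trained models rather than keywords.
FEATURE_PATTERNS = {
    "basic_word_order": re.compile(r"\b(SOV|SVO|VSO|VOS|OVS|OSV)\b"),
    "tone": re.compile(r"\b(tonal|tones?)\b", re.IGNORECASE),
}


def split_sentences(text):
    """Naively split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_snippets(language, source, text):
    """Return one record per (sentence, feature) pair whose pattern matches."""
    snippets = []
    for sentence in split_sentences(text):
        for feature, pattern in FEATURE_PATTERNS.items():
            if pattern.search(sentence):
                snippets.append({
                    "language": language,
                    "source": source,
                    "feature": feature,
                    "snippet": sentence,
                })
    return snippets


def build_profiles(documents):
    """Link snippets from several sources into one profile per language.

    `documents` is an iterable of (language, source, text) tuples.
    The result maps language -> feature -> list of (source, snippet).
    """
    profiles = defaultdict(lambda: defaultdict(list))
    for language, source, text in documents:
        for item in extract_snippets(language, source, text):
            profiles[item["language"]][item["feature"]].append(
                (item["source"], item["snippet"])
            )
    return profiles
```

Calling build_profiles with several (language, source, text) tuples for the same language gathers the matching sentences from all sources under a single profile, which is the linking step described above in its simplest form.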

Impacts & Results

The archive of non-copyrighted linguistic material has improved access to information on the world’s languages for researchers, students, policy-makers, various organizations and the general public.

The project has reinvigorated interest in minority language communities by returning digitized dictionary data to the speakers.

The project has helped to safeguard lexicons of lesser-known languages as well as older sources for widely spoken ones and has made them available to the communities.

The apps have contributed to enhancing the vitality of languages.

Banner: The language Esperanto explained in Esperanto in a dictionary