Our Korp instances

Changed structure

As of May 2025, we run one Korp instance for each language for which we have a corpus. Each instance is located at /korp/<language>, where <language> is the ISO 639-3 code of that language. See the list below.

Previously we had three Korp instances: one for the Sámi languages, one for the Finnic languages and Faroese, and one for the other Uralic languages.

The old Sámi instance is now located at /old_korp. The instance for the Finnic languages and Faroese is still at the same address, /f_korp. Likewise, the instance for the other Uralic languages is still at /u_korp.

Language        ISO 639-3 code  Size (number of tokens)
South Sámi      sma             2,030,158
North Sámi      sme             39,261,373
Lule Sámi       smj             1,809,537
Inari Sámi      smn             3,161,454
Skolt Sámi      sms             251,351
Faroese         fao             25,451,390
Meänkieli       fit             447,462
Kven            fkv             498,966
Komi Permyak    koi             241,614
Komi            kpv             963,802
Moksha          mdf             12,792,781
Eastern Mari    mhr             57,375,100
Hill Mari       mrj             6,252,302
Erzya           myv             14,050,585
Livvi-Karelian  olo             298,787
Udmurt          udm             271,895
Veps            vep             1,110,261
Võro            vro             668,088
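
The /korp/<language> address scheme can be sketched in a few lines of Python. This is only an illustration, not part of Korp itself: the names LANGUAGE_CODES, korp_path and korp_url are hypothetical, the codes are copied from the list above, and the base URL is a placeholder you would replace with the host that serves these instances. Whether the server expects a trailing slash is not stated here, so the sketch leaves it out.

```python
from urllib.parse import urljoin

# ISO 639-3 codes copied from the list above (hypothetical helper table).
LANGUAGE_CODES = {
    "South Sámi": "sma",
    "North Sámi": "sme",
    "Lule Sámi": "smj",
    "Inari Sámi": "smn",
    "Skolt Sámi": "sms",
    "Faroese": "fao",
    "Meänkieli": "fit",
    "Kven": "fkv",
    "Komi Permyak": "koi",
    "Komi": "kpv",
    "Moksha": "mdf",
    "Eastern Mari": "mhr",
    "Hill Mari": "mrj",
    "Erzya": "myv",
    "Livvi-Karelian": "olo",
    "Udmurt": "udm",
    "Veps": "vep",
    "Võro": "vro",
}


def korp_path(code: str) -> str:
    """Return the site-relative path for a listed ISO 639-3 code."""
    normalized = code.strip().lower()
    if normalized not in LANGUAGE_CODES.values():
        raise ValueError(f"no Korp instance listed for code {code!r}")
    return f"/korp/{normalized}"


def korp_url(code: str, base_url: str) -> str:
    """Join the path onto a base URL (the base URL is an assumption)."""
    return urljoin(base_url, korp_path(code))


if __name__ == "__main__":
    # "https://example.org" is a placeholder host, not the real server.
    for name, code in LANGUAGE_CODES.items():
        print(name, korp_url(code, "https://example.org"))
```

Rejecting codes that are not in the list keeps the sketch from producing addresses for languages that have no instance.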

Giellatekno · giellatekno.uit.no