Our Korp instances
Changed structure
As of May 2025 we run one instance of Korp for each language for which we have a corpus. The address is /korp/<language>, where <language> is the ISO 639-3 code for that language. See the list below.
Previously we had three instances of Korp: one for the Sámi languages, one for the Finnic languages and Faroese, and one for the other Uralic languages.
The old Sámi instance is now located at /old_korp. The instance for the Finnic languages and Faroese is still at the same address: /f_korp. Likewise, the instance for the other Uralic languages is still at the same address: /u_korp.
Language | ISO 639-3 code | Size (in number of tokens) |
---|---|---|
South Sámi | sma | 2,030,158 |
North Sámi | sme | 39,261,373 |
Lule Sámi | smj | 1,809,537 |
Inari Sámi | smn | 3,161,454 |
Skolt Sámi | sms | 251,351 |
Faroese | fao | 25,451,390 |
Meänkieli | fit | 447,462 |
Kven | fkv | 498,966 |
Komi Permyak | koi | 241,614 |
Komi | kpv | 963,802 |
Moksha | mdf | 12,792,781 |
Eastern Mari | mhr | 57,375,100 |
Hill Mari | mrj | 6,252,302 |
Erzya | myv | 14,050,585 |
Livvi-Karelian | olo | 298,787 |
Udmurt | udm | 271,895 |
Veps | vep | 1,110,261 |
Võro | vro | 668,088 |
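
The address scheme described above can be sketched in a few lines of Python. This is only an illustration: the helper name `korp_path` and the set `KORP_LANGUAGES` are hypothetical and not part of Korp. The function returns only the path, since the host name is not given here, and it rejects codes that are not in the table above.

```python
# ISO 639-3 codes copied from the table above.
KORP_LANGUAGES = {
    "sma", "sme", "smj", "smn", "sms", "fao", "fit", "fkv", "koi",
    "kpv", "mdf", "mhr", "mrj", "myv", "olo", "udm", "vep", "vro",
}


def korp_path(language: str) -> str:
    """Return the path of the per-language Korp instance (hypothetical helper).

    `language` is an ISO 639-3 code such as "sme". Surrounding whitespace
    and letter case are normalised before the lookup.
    """
    code = language.strip().lower()
    if code not in KORP_LANGUAGES:
        raise ValueError(f"no Korp instance listed for language code {code!r}")
    return f"/korp/{code}"


if __name__ == "__main__":
    print(korp_path("sme"))  # path of the North Sámi instance
```

Prepend the site's host name to the returned path to get the full address of an instance.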