Our Korp instances
Changed structure
As of May 2025, we run one Korp instance for each language that we have a corpus for. Each instance is at /korp/<language>, where <language> is the ISO 639-3 code for that language. See the list below.
Previously we had three instances of Korp: one for the Sámi languages, one for the Finnic languages and Faroese, and one for the other Uralic languages.
The old Sámi instance is now located at /old_korp. For technical reasons, the Sámi parallel corpora are still available only at this instance. The old instance for the Finnic languages and Faroese remains at its previous address, /f_korp. Likewise, the old instance for the other Uralic languages remains at /u_korp.
| Language | ISO 639-3 code | Size (number of tokens) |
|---|---|---|
| South Sámi | sma | 2,030,158 |
| North Sámi | sme | 39,261,373 |
| Lule Sámi | smj | 1,809,537 |
| Inari Sámi | smn | 3,161,454 |
| Skolt Sámi | sms | 251,351 |
| Faroese | fao | 25,451,390 |
| Meänkieli | fit | 447,462 |
| Kven | fkv | 498,966 |
| Komi Permyak | koi | 241,614 |
| Komi | kpv | 963,802 |
| Moksha | mdf | 12,792,781 |
| Eastern Mari | mhr | 57,375,100 |
| Hill Mari | mrj | 6,252,302 |
| Erzya | myv | 14,050,585 |
| Livvi-Karelian | olo | 298,787 |
| Udmurt | udm | 271,895 |
| Veps | vep | 1,110,261 |
| Võro | vro | 668,088 |
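
The address scheme described above can be sketched in a few lines of Python. This is only an illustration: the names `KORP_LANGUAGES` and `korp_path` are hypothetical, the language codes are copied from the table, and the result is a site-relative path because the host name is not given here.

```python
# ISO 639-3 codes and language names, copied from the table above.
KORP_LANGUAGES = {
    "sma": "South Sámi",
    "sme": "North Sámi",
    "smj": "Lule Sámi",
    "smn": "Inari Sámi",
    "sms": "Skolt Sámi",
    "fao": "Faroese",
    "fit": "Meänkieli",
    "fkv": "Kven",
    "koi": "Komi Permyak",
    "kpv": "Komi",
    "mdf": "Moksha",
    "mhr": "Eastern Mari",
    "mrj": "Hill Mari",
    "myv": "Erzya",
    "olo": "Livvi-Karelian",
    "udm": "Udmurt",
    "vep": "Veps",
    "vro": "Võro",
}


def korp_path(code: str) -> str:
    """Return the site-relative path of the Korp instance for an ISO 639-3 code.

    Hypothetical helper: it only applies the /korp/<language> pattern and
    does not check that the instance is actually reachable.
    """
    normalized = code.strip().lower()
    if normalized not in KORP_LANGUAGES:
        raise ValueError(f"no Korp instance listed for {code!r}")
    return f"/korp/{normalized}"
```

For example, `korp_path("sme")` gives the path for the North Sámi instance, while a code that is not in the table raises a `ValueError`.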