Feat/c-value algorithm

Sesboue Matthias requested to merge feat/c-value into olms2

Code for the C-value algorithm and its tests.

Compared to the previous version, I made the following decisions, which are open to discussion:

  • I simplified term tokenisation by basing the algorithm on a space tokeniser. This can lead to a difference between the extracted term strings and their actual form in the corpus. I made sure a warning is logged whenever this could happen.
  • I decided that tagging the extracted candidate terms in the corpus (e.g., creating linguistic realisations) should be the responsibility of the term extraction pipeline component.
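
To make the first point concrete, here is a minimal sketch of how a space-joined term string can differ from its form in the corpus, and how a warning could be logged in that case. It is not the code in this merge request: the function name `term_string_from_tokens`, its parameters, and the logger name are hypothetical and chosen only for illustration.

```python
import logging
from typing import Sequence

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("c_value_sketch")


def term_string_from_tokens(token_texts: Sequence[str], surface_form: str) -> str:
    """Build a term string by joining token texts with single spaces.

    Assumption: the corpus tokens come from some upstream tokeniser, and
    `surface_form` is the text span as it actually appears in the corpus.
    """
    term = " ".join(token_texts)
    if term != surface_form:
        # The space-joined string no longer matches the corpus text,
        # e.g. "well - known" versus "well-known".
        logger.warning(
            "Term string %r differs from its corpus form %r", term, surface_form
        )
    return term


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    # Matches the corpus form: no warning is logged.
    term_string_from_tokens(["neural", "network"], "neural network")
    # Differs from the corpus form: a warning is logged.
    term_string_from_tokens(["well", "-", "known", "method"], "well-known method")
```

In this sketch the mismatch only appears when the upstream tokens do not map one-to-one onto space-separated words (hyphens, clitics, repeated whitespace), which is the case the warning is meant to surface.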