Feat/c-value algorithm
Code for the C-value algorithm and its tests.
Compared to the previous version, I made the following decisions, which are open for discussion:
- I simplified term tokenisation by basing the algorithm on a whitespace tokeniser. This can lead to a difference between the extracted term strings and their actual form in the corpus. I made sure a warning is logged whenever this could happen.
- I decided that tagging the extracted candidate terms in the corpus (e.g., creating linguistic realisations) should be the responsibility of the term extraction pipeline component.
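
To make the first point easier to discuss, here is a minimal sketch of whitespace tokenisation combined with a simplified C-value score. It is not the code in this PR: the names `tokenise`, `is_nested`, and `c_values` are hypothetical, the input is assumed to be a flat list of candidate term strings, and each term's frequency is taken to be its count in that list. Scoring uses the commonly cited formula log2(|a|) * (f(a) - mean frequency of the longer candidates containing a), so single-word terms score 0 in this variant.

```python
import logging
import math
from collections import Counter
from typing import Dict, List, Tuple

logger = logging.getLogger(__name__)


def tokenise(term: str) -> Tuple[str, ...]:
    """Split a candidate term on whitespace (hypothetical helper)."""
    tokens = tuple(term.split())
    rebuilt = " ".join(tokens)
    if rebuilt != term:
        # The rebuilt string no longer matches the surface form in the corpus.
        logger.warning("Term %r is normalised to %r", term, rebuilt)
    return tokens


def is_nested(short: Tuple[str, ...], long: Tuple[str, ...]) -> bool:
    """Return True if `short` is a contiguous token sub-sequence of a longer term."""
    n = len(short)
    if n >= len(long):
        return False
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_values(candidates: List[str]) -> Dict[str, float]:
    """Score candidates with a simplified C-value (assumption: see lead-in)."""
    freq = Counter(tokenise(c) for c in candidates)
    scores: Dict[str, float] = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates that contain this term.
        containers = [f_b for other, f_b in freq.items() if is_nested(term, other)]
        adjusted = f - sum(containers) / len(containers) if containers else f
        scores[" ".join(term)] = math.log2(len(term)) * adjusted
    return scores


scores = c_values(["soft contact lens"] * 3 + ["contact lens"] * 5)
print(scores["contact lens"])  # 2.0
```

Because terms are rebuilt with single spaces, a candidate such as `"contact  lens"` (two spaces) is scored under `"contact lens"`, which is exactly the case in which the warning is emitted.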