RAKE :
input:
I was looking at the Wikipedia dump data, and the link notation is a nice one token in SentencePiece, so I think I can use this. I'll have to do a lot of preprocessing, though.
SentencePiece / 100 document stoplist :
output:
Link notation is sentencepiece: 81.00
wikipedia: 25.00
View dump data: 16.00
Becoming 1 token: 16.00
Various: 4.00
Pretreatment: 4.00
Also good: 4.00
Likely to be able to: 4.00
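For reference, the scores in this list are consistent with standard RAKE scoring, where each token gets degree/frequency and a phrase scores the sum of its token scores. Under that reading, a phrase of n tokens that occur nowhere else scores n × n, so 81.00 would correspond to a nine-token phrase. The following is a minimal sketch under that assumption, not the code used for this experiment; the function names, the example tokens, and the stoplist are hypothetical.

```python
from collections import defaultdict


def candidate_phrases(tokens, stoplist):
    """Split a token sequence into candidate phrases at stoplist tokens."""
    phrases, current = [], []
    for tok in tokens:
        if tok in stoplist:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)
    return phrases


def rake_scores(phrases):
    """Score each phrase as the sum of degree/frequency over its tokens."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for tok in phrase:
            freq[tok] += 1
            # Each occurrence adds the length of its phrase (itself included).
            degree[tok] += len(phrase)
    word_score = {tok: degree[tok] / freq[tok] for tok in freq}
    return {" ".join(p): sum(word_score[t] for t in p) for p in phrases}


# Hypothetical English tokens and stoplist, for illustration only.
tokens = "link notation is one token in sentencepiece".split()
stoplist = {"is", "in"}
scores = rake_scores(candidate_phrases(tokens, stoplist))
print(scores)
# {'link notation': 4.0, 'one token': 4.0, 'sentencepiece': 1.0}
```

With a small stoplist, few tokens act as delimiters, so long spans survive as single phrases and their scores grow quadratically with length.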
Mecab / 1000 document stoplist :
Not good: 7.50
Not done: 4.50
Pretreatment: 4.00
Link notation: 4.00
So here it is: 4.00
Likely to be able to: 4.00
1 token: 4.00
In the name of: 3.50
but: 2.00
Hands: 1.50
Dump data: 1.00
Things: 1.00
Good: 1.00
wikipedia: 1.00
sentencepiece: 1.00
Mecab / 1000 document stoplist / use average phrase-character-length instead of average phrase-token-length :
I shouldn't have: 15.00
sentencepiece: 13.00
Link notation: 10.00
1 token: 10.00
Not done: 9.00
wikipedia: 9.00
So here it is: 8.00
Possible: 8.00
Pretreatment: 6.00
Dump data: 6.00
NAME: 5.00
but: 4.00
Hand: 2.00
Things: 2.00
Good: 2.00
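The character-length variant can be sketched by making the phrase-length measure pluggable: a token's score is the average length of the phrases that contain it, and that length is counted either in tokens or in characters. This is an assumed reading of the heading above, not the code used here. It is at least consistent with single-token phrases scoring their own character count (sentencepiece: 13.00, wikipedia: 9.00). All names and the example phrases below are hypothetical.

```python
from collections import defaultdict


def token_length(phrase):
    """Phrase length measured in tokens (the standard RAKE setting)."""
    return len(phrase)


def char_length(phrase):
    """Phrase length measured in characters, summed over tokens."""
    return sum(len(tok) for tok in phrase)


def rake_scores_by_length(phrases, phrase_length):
    """Score phrases using the average phrase length per token."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for phrase in phrases:
        n = phrase_length(phrase)
        for tok in phrase:
            totals[tok] += n
            counts[tok] += 1
    # Token score: mean length of the phrases in which the token appears.
    word_score = {tok: totals[tok] / counts[tok] for tok in counts}
    return {" ".join(p): sum(word_score[t] for t in p) for p in phrases}


# Hypothetical, already-split candidate phrases.
phrases = [["sentencepiece"], ["wikipedia"], ["dump", "data"]]
by_tokens = rake_scores_by_length(phrases, token_length)
by_chars = rake_scores_by_length(phrases, char_length)
print(by_tokens)  # {'sentencepiece': 1.0, 'wikipedia': 1.0, 'dump data': 4.0}
print(by_chars)   # {'sentencepiece': 13.0, 'wikipedia': 9.0, 'dump data': 16.0}
```

Measuring in characters lifts long single-token words, which token-based scoring leaves at 1.00.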
This page is auto-translated from /nishio/RAKE実験1 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.