2018-10-24 Ideas for improving ppoi. I'm noting them here because starting to fix them while I'm using it would be distracting.
$ wc unknown.txt
35203 110327 2069652 unknown.txt
python
>>> lines = open("unknown.txt").readlines()
>>> lines = [line for line in lines if line != "\n"]
>>> len(lines)
25762
>>> open("unknown.txt", "w").writelines(lines)
>>> from collections import Counter
>>> c = Counter()
>>> for line in lines:
...     c.update(line)
>>> c.most_common(100)
...
>>> len(c)
2851
>>> len([1 for k in c if c[k] > 9])
1513
>>> open("chars.txt", "w").write("".join([k for (k, v) in c.most_common() if v > 9]))
1513
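The session above could be packaged as a small function, roughly as follows. This is only a sketch: the name `frequent_chars` and the `min_count` parameter are made up for illustration, and `min_count=10` corresponds to the `> 9` threshold used above.

```python
from collections import Counter


def frequent_chars(lines, min_count=10):
    """Return the characters that occur at least min_count times, most frequent first."""
    counter = Counter()
    for line in lines:
        # Skip blank lines, as in the session above
        if line != "\n":
            # Counter.update on a string counts each character, including the trailing "\n"
            counter.update(line)
    return "".join(ch for ch, n in counter.most_common() if n >= min_count)
```

The result can be written to chars.txt in the same way as in the session.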
Since the task this time is to judge OCR output:
$ cat ppoi/positive.txt
|11111'「 11
$ cat ppoi/negative.txt
Without good interviews, no matter how deftly you put them together, you will never know how to reach a good decision.
Change feature creation in user.py
python
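import numpy as np

CHARS = open("chars.txt").read()

def make_features(s):
    "take a string, return np.array"
    # count occurrences of each frequent character (np.float was removed from NumPy, so use float)
    x = np.array([s.count(c) for c in CHARS], dtype=float)
    # normalize, unless no known character occurs (avoids 0/0)
    total = x.sum()
    if total > 0:
        x = x / total
    return x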
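To see what this feature looks like, here is a self-contained sketch that takes the character set as a parameter instead of reading chars.txt. The name `char_frequency_features` and the tiny character set `"1|ae "` are assumptions for illustration, not part of ppoi.

```python
import numpy as np


def char_frequency_features(s, chars):
    """Count how often each character of chars occurs in s, scaled to sum to 1."""
    x = np.array([s.count(c) for c in chars], dtype=float)
    total = x.sum()
    # Leave the vector as all zeros when s contains none of the characters
    return x / total if total > 0 else x


# A garbled, OCR-like line puts most of its weight on "1"
garbled = char_frequency_features("|11111'「 11", "1|ae ")
# An ordinary phrase spreads its weight over letters and spaces
ordinary = char_frequency_features("a good decision", "1|ae ")
```

Because the counts are divided by their sum, long and short lines become comparable, which is presumably the point of the normalization step.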
It would be nice to have undo for wrong input during active learning.
I want to be able to add more clusters during active learning.
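For the undo idea, one possible shape is to keep the labels given during active learning in an ordered history, so the most recent one can be taken back. This is a hypothetical sketch: `LabelHistory` and its methods are invented here, and how ppoi actually stores labels (beyond the positive.txt and negative.txt files above) is not assumed.

```python
class LabelHistory:
    """Keeps labeled examples in the order they were given, so the latest can be undone."""

    def __init__(self):
        self.entries = []  # list of (text, is_positive) pairs

    def add(self, text, is_positive):
        self.entries.append((text, is_positive))

    def undo(self):
        """Remove and return the most recent label, or None if there is nothing to undo."""
        return self.entries.pop() if self.entries else None

    def positives(self):
        return [text for text, is_positive in self.entries if is_positive]

    def negatives(self):
        return [text for text, is_positive in self.entries if not is_positive]
```

With something like this, positive.txt and negative.txt could be rewritten from `positives()` and `negatives()` after each answer or undo.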
This page is auto-translated from /nishio/ppoi改善案 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to share my thoughts with non-Japanese readers.