NISHIO Hirokazu[English][日本語]

Natural Language Processing on Heroku

Notes on the process of turning a locally experimented natural language processing algorithm into an API server on Heroku (No consideration of whether it is appropriate to do this on Heroku → API server placement considerations)

I'll refer to my notes Heroku+Flask when I did something similar before.
- Create a new working directory
  - I've thought about using existing repositories, but it's so complicated that it would take a long time to find the cause of a problem and be barren, so I'll try to keep it as simple as possible.
  - $ mkdir regroup-split-server
  - $ cd regroup-split-server
- Create a virtual environment and work on VSCode
  - $ python3 -m venv venv
  - $ code .
  - View -> Terminal
  - $ source venv/bin/activate
- Create a minimal server with Flask
  - $ mkdir server
  - $ code server/__init__.py

python

from flask import Flask

app = Flask(__name__)

def create_app():
    return app

@app.route('/')
def root():
    return "OK"

    - `$ pip install --upgrade pip`
    - `$ pip install flask`
    - Set environment variables in a file
        - `$ code .env`

.env

FLASK_APP=server
FLASK_ENV=development

        - `$ pip install python-dotenv`
    - `$ flask run`
        - Verify that it runs without problems and that you get an OK when you open [http://127.0.0.1:5000/](http://127.0.0.1:5000/)
    - `$ git init`
    - `$ code .gitignore`

.gitignore

venv/     
*.pyc
__pycache__/

        - `$ git commit -m 'minimal Flask server'`
            - Actually, I'm doing Cmd+Enter in the Source Control tab of VSCode.
- Add [gunicorn](/en/gunicorn) and deploy.
    - This also serves as [HTTPS for Flask](/en/HTTPS%20for%20Flask), which is included in the minimum configuration because I think it is not possible to have only HTTP as a modern API server.
    - `$ pip install gunicorn`
    - `$ pip freeze > requirements.txt`
    - `$ code Procfile`

Procfile

web: gunicorn server:"create_app()"

    - `$ heroku create regroup-split-server`
    - `$ git commit -m "add gunicorn"`
    - `$ git push --set-upstream heroku master`
        - Build logs appear. Make sure it's not an error.
    - `$ heroku open`
        - Open the deployed one in a browser, making sure OK is displayed.

I haven't figured out yet what to do with this deployment repository and what to do with the local R&D repository to keep it clean.
- I'm sure I'll want to separate them under certain circumstances, but until I have a clearer idea of how I want to separate them, I'm going to do it in unison.
  - I've had a hard link with separate repositories, but I don't think it's a good idea.
  - I guess I'd better use a git submodule or pip to connect them.
    - Resolving Application Dependencies with Git Submodules | Heroku Dev Center
      - PIP for myself
- Keep folders separate for easy separation in the future.
  - $ mkdir server/regroup_split
- Copy files that look necessary

deploy.sh

cp rich_tokenizer.py ../regroup-split-server/server/regroup_split/
cp regroup_split.py ../regroup-split-server/server/regroup_split/
cp TAIL_TOKENS_TO_REMOVE.txt ../regroup-split-server/server/regroup_split/
cp HEAD_TOKENS_TO_REMOVE.txt ../regroup-split-server/server/regroup_split/
cp test/simplelines1.txt ../regroup-split-server/server/regroup_split/test
cp test/regression_test.json ../regroup-split-server/server/regroup_split/test

- Run unit tests and check for errors.
    - `ModuleNotFoundError: No module named 'MeCab'`
        - $ pip install mecab
            - Don't do this see [mecab on heroku](/en/mecab%20on%20heroku).
        - `$ pip install mecab-python3==0.996.5`

- If the unit test passes, call the test from server/__init__.py
    - flask run to see if the test works on the local development server
    - It's easier to read error messages on the local development server than after deployment.
    - Common Corrections
        - Relative import `from .foo import bar`.
            - I usually run it as a script and experiment with it, but it is imported from the server and run as a module, so the import behavior changes.
            - Maybe it's better to [IPython with %run -m](/en/IPython%20with%20%25run%20-m) on a regular basis.
        - path of a data file
            - If you're writing in a way that depends on the current directory at runtime, you'll get into trouble here.
                - Use `DIR = os.path.dirname(__file__)`.

Push to heroku when it works locally
- $ pip freeze > requirements.txt
  - Don't forget to ADD and COMMIT!
  - I guess I should have done that when I INSTALLED it.
- $ git push
  - Build errors mecab on heroku.
- After successful build, heroku open with 500 error
  - View runtime logs
    - $ heroku logs --tail
  - TypeError: 'dict_keys' object is not reversible
    - Python on heroku is 3.6 by default
    - By default, newly created Python apps use the python-3.6.12 runtime. --- Heroku Python Support | Heroku Dev Center
  - Align the executed version with the one at hand
    - $ echo python-3.8.7 > runtime.txt
- Test cases now work on heroku as well.
Add an interface to return processed values passed from the server to the experimental scripts that have been running on the terminal and observing the results on the standard output.
- In this case, it takes a string and returns a list of token strings.
- At this point, it is up to you to decide whether you want to return a rich object or one that can be serialized in json.
  - By itself, it depends on the application.
  - I think the process of making json serializable is something that is required everywhere, so I think it would be good to have it on the library side.
  - Proper serialization can change as internal structures change, and

python

def process_single_line(line):
    tokens = tokenize(line)
    calc_split_priority(tokens)
    return dict(
        tokens=concat_tokens(tokens, " "),
        split=[concat_tokens(ts) for ts in split(tokens)])

GET python

@app.route('/api/', methods=['GET'])
def api():
    text = request.args["q"]
    ret = regroup_split.process_single_line(text)
    return ret

- /api/?q=... Pass to GET to check operation with
- Automatically serialized in JSON

POST python

@app.route('/api/', methods=['GET', 'POST'])
def api():
    if request.method == "GET":
        text = request.args["q"]
    else:
        text = request.json["q"]
    ret = regroup_split.process_single_line(text)
    return ret

- `$ curl -X POST -H "Content-Type: application/json" -d '{"q":"test"}' localhost:5000/api/`
- operation check
- git push to make sure it works on heroku as well

Create a client side that calls this API python

import requests
import json

API_URL = "https://regroup-split-server.herokuapp.com/api/"
sample_text = "Ah, so people who are not used to the process of making lots of stickies and doing the KJ method don't have a good idea of how granular the information should be at the point of making the stickies in the first place. That's where the software needs to help."
payload = {"q": sample_text}
r = requests.post(API_URL, json=payload)
assert r.ok
for s in r.json()["split"]:
    print(s)

"""
Expected output:
Make lots of sticky notes.
People unfamiliar with the process of doing the KJ method.
How granular is the information at the point where you make a sticky note?
I can't pinpoint a good one.
Software needs to support
"""

Call from JS
- Flask-CORS
- Done ✅ longest line is ticked with one click.

Flask to HTTPS

This page is auto-translated from /nishio/Herokuで自然言語処理 using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thought to non-Japanese readers.

(C)NISHIO Hirokazu / Converted from Markdown (en)
Source: [GitHub] / [Scrapbox]