Summary

Twitcomp is an interactive web application that takes two Twitter users and a hypothetical tweet text as input and predicts which user is more likely to have tweeted it.

Deployed app

Introduction:

For any application to be “real”, it has to be available, or productized. Nowadays, productization means being deployed and accessible on a website. We data scientists are often very productive at writing original code and perfecting Machine Learning algorithms, but tend to overlook the productization of our work. Well, with the help of Flask, some basic knowledge of HTML and CSS, and some homework on data engineering, we can get things done and finally have our shiny baby on the web. Did I mention a little frustration with app environment configuration, some huffing and puffing at the quirks of the deployment platform, and other kinks that took me so long to work out? This post focuses on deploying a Flask app on Heroku, a platform-as-a-service (PaaS) provider, and hopefully will be useful for anyone who is climbing the steep part of this learning curve, as I was a week ago!

Data ETL process

Training data was pulled from Twitter through the Twitter API, using a developer account. Apply for a Twitter developer account here.

The tweets were transformed from text to vectors using spaCy’s word vectors (word2vec-style embeddings).
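One common way to turn a whole text into a single vector from word embeddings is to average the vectors of its tokens. The sketch below assumes that approach and uses a tiny hand-made embedding table (`TOY_EMBEDDINGS`, hypothetical) in place of a real spaCy model, so the original pipeline may differ in its details.

```python
def tweet_vector(text, embeddings, dim):
    """Average the embedding vectors of the known tokens in a text."""
    tokens = text.lower().split()
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        # No known token: fall back to a zero vector of the right size.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional embeddings, for illustration only.
TOY_EMBEDDINGS = {"good": [1.0, 0.0], "morning": [0.0, 1.0]}

vector = tweet_vector("Good morning world", TOY_EMBEDDINGS, dim=2)
empty = tweet_vector("unknown words only", TOY_EMBEDDINGS, dim=2)
```

Real embeddings have hundreds of dimensions, but the averaging step stays the same.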

The transformed tweet vectors are stored in Heroku’s PostgreSQL add-on database for production.

Prediction model

The prediction is based on a simple Machine Learning logistic regression model. Because this project is a demonstration, we only retrieved the latest 200 tweets from each user as training data. The model is in twitcomp/prediction.py.
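To make the idea concrete, here is a minimal sketch of a two-class logistic regression in plain Python, trained by gradient descent on toy 2-dimensional vectors. It is not the code in twitcomp/prediction.py; the function names, the toy data, and the learning settings are all assumptions for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(vectors, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim = len(vectors[0])
    weights, bias, n = [0.0] * dim, 0.0, len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_author(weights, bias, vector, user_a, user_b):
    """Return the more likely author and the probability given to user_b."""
    p_b = sigmoid(sum(w * xi for w, xi in zip(weights, vector)) + bias)
    return (user_b if p_b >= 0.5 else user_a), p_b


# Toy "tweet vectors": label 0 means user_a, label 1 means user_b.
toy_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_labels = [0, 0, 1, 1]
weights, bias = train_logistic(toy_vectors, toy_labels)
winner, prob_b = predict_author(weights, bias, [0.15, 0.85], "user_a", "user_b")
other, _ = predict_author(weights, bias, [0.85, 0.15], "user_a", "user_b")
```

In the app, the training vectors would be the stored tweet vectors of the two users, and the input vector would come from the hypothetical tweet text.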

Database management

In the development stage of the product, all data is stored in a sqlite3 database in the local project directory. In the production phase, we create a PostgreSQL database (available as a free add-on provided by Heroku).
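As a rough sketch of this switch, the helper below returns the production database URL when a `DATABASE_URL` variable is set and falls back to the local SQLite file otherwise. The function name `resolve_database_url` is hypothetical; the project itself selects the database through separate configuration classes.

```python
import os


def resolve_database_url(environ=None, default="sqlite:///db.sqlite3"):
    """Pick the database URL from the environment, or use the local default."""
    env = os.environ if environ is None else environ
    # An unset or empty DATABASE_URL means we are running locally.
    return env.get("DATABASE_URL") or default


local_url = resolve_database_url({})
prod_url = resolve_database_url({"DATABASE_URL": "postgres://example"})
```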

Environment variables

In the development stage, we use a .env file to store all credentials, e.g., Twitter API keys and secrets.

I found it much more convenient to use just one file for all the configuration. In this case, a config.py is created in the project root directory, and it defines a configuration class for each DevOps stage.

"""config.py"""
import os
from dotenv import load_dotenv

load_dotenv()  # load variables from a local .env file into os.environ

# base configuration
class Config(object):
    DEBUG = False
    TESTING = False
    SQLALCHEMY_TRACK_MODIFICATIONS = False
    
    TWITTER_API_KEY = os.environ.get("TWITTER_API_KEY")
    TWITTER_API_SECRET = os.environ.get("TWITTER_API_SECRET")
    TWITTER_ACCESS_TOKEN = os.environ.get("TWITTER_ACCESS_TOKEN")
    TWITTER_ACCESS_TOKEN_SECRET = os.environ.get("TWITTER_ACCESS_TOKEN_SECRET")
    

# product config
class ProductionConfig(Config):
    ENV = "production" # actually the default of flask
    DEBUG = False
    DATABASE_URL = os.environ.get('DATABASE_URL')

# dev config
class DevelopmentConfig(Config):
    ENV = "development"
    DEVELOPMENT = True
    SQLALCHEMY_TRACK_MODIFICATIONS = False   
    DATABASE_URL = "sqlite:///db.sqlite3"

# test config
class TestingConfig(Config):
    TESTING=True
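Because `os.environ.get` quietly returns None for a variable that is not set, a missing credential only shows up later as a failed API call. A small check like the following sketch can report missing variables at startup; the helper name `missing_credentials` is an assumption, while the variable names are the ones read in config.py.

```python
import os

REQUIRED_KEYS = (
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def missing_credentials(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Example with an explicit mapping instead of the real environment.
partial = {"TWITTER_API_KEY": "abc", "TWITTER_API_SECRET": "def"}
missing = missing_credentials(partial)
```

Calling `missing_credentials()` with no argument checks the real process environment.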

Deployment

Twitcomp uses Heroku as its deployment platform. Heroku provides a free tier for anyone interested in turning a small project into a product. Heroku also has CLI commands, which I prefer over clicking through the website for deployment. Below are the 5 steps for deploying an app to Heroku.

Step 1. In the terminal, inside the virtual environment of your local project directory “who-may-twit-this”, install gunicorn and save all package requirements.

pip install gunicorn
pip freeze > requirements.txt

Step 2. Add a Procfile to tell Heroku that this is a web process and where to find the app.

cat > Procfile    #Hit Enter
web: gunicorn twitcomp:APP -t 120
# Hit Enter
# Hit Ctrl+D to save and quit
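The `twitcomp:APP` part follows gunicorn’s `module:variable` convention: gunicorn imports the `twitcomp` package and serves the object named `APP`, which must be a WSGI callable (a Flask application object is one). As a rough illustration using only the standard library, the sketch below defines a bare-bones WSGI callable and invokes it once the way a server would. The helper `call_wsgi` and the response text are made up for the example.

```python
from wsgiref.util import setup_testing_defaults


def APP(environ, start_response):
    """Bare-bones WSGI application standing in for the Flask app."""
    body = b"Hello from a WSGI app"
    headers = [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]


def call_wsgi(app, path="/"):
    """Invoke a WSGI app once and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fill in a minimal, valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_wsgi(APP)
```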

Step 3. Log in to your Heroku account. Create a Heroku app and connect your local git repository to it. Associate a PostgreSQL database with the Heroku app (hobby-dev is free).

cd who-may-twit-this
heroku login -i

heroku create twitcomp
heroku git:remote -a twitcomp
heroku addons:create heroku-postgresql:hobby-dev --app twitcomp

Step 4. Set environment variables. The environment variables for this app include the data API credentials, the PostgreSQL database URL, and the Flask settings. We use the Heroku config:set command to set config vars.

To configure the data API credentials and the PostgreSQL database URL:

heroku config:set TWITTER_API_KEY="XXXXXXXXXXXXXXXXXX" \
  TWITTER_API_SECRET="XXXXXXXXXXXXXXXXXX" \
  TWITTER_ACCESS_TOKEN="XXXXXXXXXXXXXXXXXX" \
  TWITTER_ACCESS_TOKEN_SECRET="XXXXXXXXXXXXXXXXXX" \
  DATABASE_URL="postgres://XXXXXXXXXXXXXXXXXXXXXX"

Now, in the main app.py, where we bring everything together, we read the name of the configuration class from an environment variable: os.environ['APP_SETTINGS'].

"""app/flask_app.py"""
.
.
.
def create_app():
    
    app = Flask(__name__) # __name__: name of the current module

    env_configuration = os.environ['APP_SETTINGS']
    app.config.from_object(env_configuration)
.
.
.   

So accordingly, we configure the variable APP_SETTINGS as follows:

  • For development, in the terminal, type: export APP_SETTINGS=config.DevelopmentConfig

  • For deployment, in the terminal, use the Heroku config:set command: heroku config:set APP_SETTINGS=config.ProductionConfig
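The value of APP_SETTINGS is a dotted import path. When `app.config.from_object` receives such a string, Flask imports the named object and copies its uppercase attributes into the app configuration. The sketch below imitates that behavior with the standard library only; `load_settings` and the in-memory `demo_config` module are hypothetical stand-ins so that the example runs without Flask or the real config.py.

```python
import importlib
import sys
import types


def load_settings(dotted_path):
    """Import 'module.ClassName' and return its uppercase attributes."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    obj = getattr(importlib.import_module(module_name), attr_name)
    return {key: getattr(obj, key) for key in dir(obj) if key.isupper()}


class Config:
    DEBUG = False
    TESTING = False


class DevelopmentConfig(Config):
    ENV = "development"
    DATABASE_URL = "sqlite:///db.sqlite3"


# Register a throwaway module that plays the role of config.py.
demo_config = types.ModuleType("demo_config")
demo_config.DevelopmentConfig = DevelopmentConfig
sys.modules["demo_config"] = demo_config

settings = load_settings("demo_config.DevelopmentConfig")
```

Values defined on the base class, such as DEBUG, are picked up too, which is why the stage-specific classes only need to list what differs.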

Step 5. Now we are ready to deploy, simply by pushing to Heroku! My local branch name is main, so I push to heroku main. Use master instead if your local branch is named master.

git push heroku main

If the app builds successfully, we can open the deployed web app by typing heroku open, which opens the URL of the deployed app. For this app, we will see an application error, because the tables haven’t been created in the database yet. If we examine the routes in twitcomp/app.py, we find a simple way around the problem: by appending “/reset” to the Heroku deployment URL, all the tables are created and the app becomes fully functional.
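To illustrate what a reset step like this typically does, the sketch below drops and recreates tables in an in-memory SQLite database. The schema and the function names are hypothetical; the real tables are defined in the project, and the real route runs against the configured database rather than SQLite.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
DROP TABLE IF EXISTS tweet;
DROP TABLE IF EXISTS user;
CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tweet (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES user(id),
    text TEXT NOT NULL
);
"""


def reset_database(conn):
    """Drop any existing tables and create empty ones."""
    conn.executescript(SCHEMA)
    conn.commit()


def table_names(conn):
    """List the tables currently present in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
reset_database(conn)
tables = table_names(conn)
```

Because the tables are dropped first, running the reset again wipes any stored data, so it is a setup step rather than something to call during normal use.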

It should work; otherwise, use heroku logs --tail to help diagnose and debug. Sometimes Heroku is quirky when updating the database, so we have to click the Add User button a couple more times, or consider upgrading to a paid deployment service.

Selection of PaaS

Heroku was among the pioneers of platform as a service (PaaS), and that’s why many developers turn to it for deployment. But Heroku is becoming expensive, not to mention some of its limitations. If you pay attention to its infrastructure, e.g., the database, you will notice that it actually uses AWS (USA and Europe regions) as its infrastructure provider. So users in other regions may encounter latency if your app scales globally. Besides the quirky behavior when updating the database, Heroku also puts a free dyno (its container for running the app) to sleep after 30 minutes of inactivity. The dyno wakes automatically when the web app is accessed, but the user who “wakes up” the dyno will experience a longer delay than subsequent users.

So developers seek alternatives to Heroku, and there are many more choices now compared to 2007, when Heroku was established. This article introduces a list of the top 10 PaaS providers for developers to shop around.

Hope this post can be of help to people having deployment issues! The source code of the project can be found here.