IMDB Reviews

An example of using this package to make a model (with the tokenization layer) and train it on the IMDB Reviews Dataset.

!pip install tokenization-layer
import tokenization_layer

import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import re
import string

Getting & Preparing the Data

First, let's get the data and prepare it. I have the dataset on my Google Drive, so we can download it from there:

import requests
from io import StringIO

# Turn the Google Drive share link into a direct-download URL
orig_url = 'https://drive.google.com/file/d/1-4wZ3VawRfxvX9taPhfHWU7mAiH-gBDe/view?usp=sharing'
file_id = orig_url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id

# Download the CSV and load it into a DataFrame
csv_text = requests.get(dwn_url).text
data = pd.read_csv(StringIO(csv_text))

Then, some basic preprocessing:
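Here's a minimal sketch of what that could look like, assuming the CSV has the usual 'review' and 'sentiment' columns (adjust the names to your file):

# NOTE: the column names 'review' and 'sentiment' are assumptions about the CSV.
def clean_text(text):
    text = text.lower()
    text = re.sub(r'<br\s*/?>', ' ', text)                             # strip HTML line breaks
    text = text.translate(str.maketrans('', '', string.punctuation))   # drop punctuation
    return re.sub(r'\s+', ' ', text).strip()                           # collapse whitespace

data['review'] = data['review'].apply(clean_text)
data['sentiment'] = (data['sentiment'] == 'positive').astype(np.int32) # 1 = positive, 0 = negative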

Do a train-test-validation split:
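For example, a simple shuffled 80/10/10 split (the proportions here are an assumption, pick whatever suits you):

# Shuffle, then take 80% train, 10% validation, 10% test
data = data.sample(frac=1, random_state=42).reset_index(drop=True)
n = len(data)
train_df = data[: int(0.8 * n)]
val_df = data[int(0.8 * n): int(0.9 * n)]
test_df = data[int(0.9 * n):]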

And finally, turn it into TensorFlow datasets for the final preprocessing: splitting the text into individual characters, one-hot encoding them, and padding them so every sequence is the same length (plus batching into batches of 32):
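One way to sketch that pipeline; the character vocabulary and MAX_LEN below are assumptions, tune them to your data:

vocab = list(string.ascii_lowercase + ' ')             # assumed character set after cleaning
lookup = keras.layers.StringLookup(vocabulary=vocab)   # index 0 is reserved for OOV characters
MAX_LEN = 1024                                         # assumed fixed sequence length

def encode(texts, labels):
    chars = tf.strings.unicode_split(texts, 'UTF-8')   # split each review into characters
    ids = lookup(chars)                                # ragged tensor of character ids
    ids = ids.to_tensor(default_value=0, shape=(None, MAX_LEN))  # pad/truncate to MAX_LEN
    return tf.one_hot(ids, depth=lookup.vocabulary_size()), labels

def make_dataset(df, shuffle=False):
    ds = tf.data.Dataset.from_tensor_slices((df['review'].values, df['sentiment'].values))
    if shuffle:
        ds = ds.shuffle(10_000)
    return ds.batch(32).map(encode, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

train_ds = make_dataset(train_df, shuffle=True)
val_ds = make_dataset(val_df)
test_ds = make_dataset(test_df)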

Defining the Model

Now, let's start building the model.

To start, we need an initialization method for the patterns (tokens) of the tokenization layer. Here we'll use tokenization_layer.PatternsInitializerMaxCover.
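Here's a hedged sketch of setting it up; the constructor argument is an assumption for illustration, so check the package's documentation for the actual signature:

# Assumption: the initializer is built from a sample of the training texts,
# from which it derives patterns with maximal coverage. The real signature may differ.
patterns_initializer = tokenization_layer.PatternsInitializerMaxCover(
    train_df['review'].tolist()
)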

Then, we'll define our model (we'll use the subclassing API, but the other Keras APIs also work):
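A minimal sketch with the subclassing API. The TokenizationLayer name and its argument are assumptions about the package's API; the layers after it are just an ordinary binary-classification head:

# The TokenizationLayer name and its constructor argument are assumptions;
# substitute the package's actual layer and parameters.
class ReviewClassifier(keras.Model):
    def __init__(self, patterns_initializer, **kwargs):
        super().__init__(**kwargs)
        self.tokenize = tokenization_layer.TokenizationLayer(patterns_initializer)
        self.pool = keras.layers.GlobalAveragePooling1D()
        self.hidden = keras.layers.Dense(64, activation='relu')
        self.out = keras.layers.Dense(1, activation='sigmoid')

    def call(self, inputs, training=False):
        x = self.tokenize(inputs)   # one-hot characters -> token representations
        x = self.pool(x)
        x = self.hidden(x)
        return self.out(x)

model = ReviewClassifier(patterns_initializer)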

Making the Training Loop

For the final part of this example, we'll write a custom training loop for our model. Note that you don't have to do this; model.compile() and model.fit() also work.
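For reference, that simpler route would look like this (the epoch count is an arbitrary choice):

# The built-in alternative to the custom loop below
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds, epochs=3)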

Our training loop will save model checkpoints, as well as information about how the patterns and gradients evolve. We set those up here:
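A sketch of that setup, assuming tf.train.Checkpoint for the model weights and plain files for the logs:

optimizer = keras.optimizers.Adam()
loss_fn = keras.losses.BinaryCrossentropy()

# Checkpointing: keep the three most recent model/optimizer snapshots
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, './checkpoints', max_to_keep=3)

# Logs: decoded patterns over time, plus weight/gradient statistics
patterns_log = open('patterns_log.txt', 'w')
vals_log = open('vals_log.csv', 'w')
grads_log = open('grads_log.csv', 'w')
vals_log.write('step,variable,mean,std\n')
grads_log.write('step,variable,mean,std\n')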

Lastly, here's the actual training loop itself:
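Here's a sketch of the loop under the assumptions above. The gradient step and logging are standard TensorFlow; the pattern-decoding call is hypothetical, since the exact API for reading patterns back out of the layer depends on the package:

EPOCHS = 3  # assumed
step = 0
for epoch in range(EPOCHS):
    for x_batch, y_batch in train_ds:
        # Standard gradient step
        with tf.GradientTape() as tape:
            preds = model(x_batch, training=True)
            loss = loss_fn(y_batch, preds)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        if step % 10 == 0:
            # Record mean/std of every variable and its gradient
            for v, g in zip(model.trainable_variables, grads):
                vn = v.numpy()
                vals_log.write(f'{step},{v.name},{vn.mean():.6f},{vn.std():.6f}\n')
                if g is not None:
                    gn = tf.convert_to_tensor(g).numpy()
                    grads_log.write(f'{step},{v.name},{gn.mean():.6f},{gn.std():.6f}\n')
            # Hypothetical: decode the layer's current patterns back to text.
            # Replace decode_patterns() with whatever the package actually provides.
            # patterns_log.write(f'step {step}: {model.tokenize.decode_patterns()}\n')
            ckpt_manager.save()
        step += 1

for f in (patterns_log, vals_log, grads_log):
    f.close()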

There it is! Every 10 steps, this training loop decodes all the patterns (i.e. tokens) in the tokenization layer, saves them to patterns_log.txt, and displays the top patterns with the most non-zero and zero gradients (which is an indicator of convergence). It also saves the mean and standard deviation of the values and gradients across the whole model, in vals_log.csv and grads_log.csv respectively.
