TokenizationLayer
A tf.keras layer intended (but not required) to be the first layer of a neural network. It learns and applies a tokenization scheme to the input text.
Placed as the first layer in a network, it takes in text that has been split by character and one-hot encoded (i.e. has shape (batch_size, num_chars, text_len, 1)), and "tokenizes" it using the trainable parameter patterns.
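For illustration, here is a minimal sketch of one way to put raw text into that format; the character vocabulary chars, the helper name one_hot_chars, and the truncation behaviour are assumptions made for the sketch, not part of the layer's API.

import numpy as np

def one_hot_chars(texts, chars, text_len):
    # Map each character to its row index in the assumed vocabulary `chars`
    char_to_idx = {c: i for i, c in enumerate(chars)}
    batch = np.zeros((len(texts), len(chars), text_len, 1), dtype="float32")
    for b, text in enumerate(texts):
        # Truncate to text_len; shorter texts are left zero-padded
        for pos, ch in enumerate(text[:text_len]):
            if ch in char_to_idx:
                batch[b, char_to_idx[ch], pos, 0] = 1.0
    return batch  # shape: (batch_size, num_chars, text_len, 1)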
tokenization_layer.TokenizationLayer(
    n_neurons, initializer, pattern_lens, **kwargs
)
Parameters
n_neurons
---- int
Number of neurons in the layer.
initializer
---- keras.initializers.Initializer
Initializer for patterns.
pattern_lens
---- int
Length, in characters, of every pattern.
**kwargs
Additional keyword arguments, typically forwarded to the base keras.layers.Layer (e.g. name).
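As a minimal sketch, assuming any standard keras.initializers.Initializer is accepted for the patterns, the layer could be constructed like this (all values are illustrative only):

from tensorflow import keras
import tokenization_layer

# Illustrative values: 128 neurons, uniformly initialized patterns of length 4
layer = tokenization_layer.TokenizationLayer(
    n_neurons=128,
    initializer=keras.initializers.RandomUniform(0.0, 1.0),
    pattern_lens=4,
)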
Example
import tensorflow as tf
from tensorflow import keras
import re
import nltk
nltk.download("gutenberg")
from nltk.corpus import gutenberg
import tokenization_layer
corpus = gutenberg.raw("austen-emma.txt")
# Remove arbitrary runs of "\n"s and " "s
corpus = re.sub(r"[\n ]+", " ", corpus.lower())
# Assume `chars` was obtained when preprocessing the training data
init = tokenization_layer.PatternsInitilizerMaxCover(corpus, chars)
model = keras.Sequential([
tokenization_layer.TokenizationLayer(500, init, max(init.gram_lens)),
keras.layers.Lambda(lambda x: tf.transpose(tf.squeeze(x, 3), [0, 2, 1])),
tokenization_layer.EmbeddingLayer(1),
keras.layers.Flatten(),
keras.layers.BatchNormalization(),
keras.layers.Dense(64),
keras.layers.Dense(1, activation="sigmoid")
])
# Initialize parameters and shapes by calling on dummy inputs
_ = model(tf.zeros((32, 30, 2000, 1)))
_ = model(tf.zeros((50, 30, 2000, 1)))
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train_data, epochs=10)
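X_train_data is not defined in this example. As a rough sketch of what it could look like, assuming a binary classification task, fixed-length character windows over the corpus, and the one_hot_chars helper sketched earlier (the labels below are placeholders, not real data):

import numpy as np
import tensorflow as tf

# Slice the corpus into fixed-length character windows (illustrative only)
texts = [corpus[i:i + 2000] for i in range(0, 200000, 2000)]
# Placeholder binary labels; a real task would supply meaningful ones
labels = np.random.randint(0, 2, size=(len(texts), 1)).astype("float32")
# `one_hot_chars` is the assumed helper from the sketch above; `chars` must have 30 entries
X = one_hot_chars(texts, chars, text_len=2000)
# Bundle features and labels so model.fit can consume them together
X_train_data = tf.data.Dataset.from_tensor_slices((X, labels)).batch(32)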