PatternsInitializerMaxCover
Keras initializer that uses a corpus of text to initialize the patterns as randomly chosen grams from the corpus, weighted by how common each gram is (see the sketch below).
tokenization_layer.PatternsInitilizerMaxCover(
text_corpus, chars,
gram_lens=[5, 6, 7, 8, 9, 10, 11, 12, 13],
filter_over=1
)
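A minimal sketch of the sampling idea, assuming pure frequency-weighted sampling (the actual implementation, and whatever "max cover" weighting the name implies, may differ; sample_patterns is an illustrative name, not part of the library):
import random
from collections import Counter

def sample_patterns(text_corpus, gram_lens, num_patterns, filter_over=1):
    # Count every gram of every allowed length in the corpus.
    counts = Counter(
        text_corpus[i:i + n]
        for n in gram_lens
        for i in range(len(text_corpus) - n + 1)
    )
    # Keep only grams that occur at least `filter_over` times.
    grams = [g for g, c in counts.items() if c >= filter_over]
    weights = [counts[g] for g in grams]
    # Draw with probability proportional to corpus frequency.
    return random.choices(grams, weights=weights, k=num_patterns)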
Parameters
text_corpus
---- str
The corpus of text that will be used to generate the patterns. WARNING: You may run into memory issues if the corpus is too big.
chars
---- str
String of the one-hot encoding categories (i.e. characters) for how the text will be encoded. The index of a character in the string is that character's index in the encoding. So, for example, if chars = "abcdefghijklmnopqrstuvwxyz", then each character in a pattern will be a 26-dimensional vector (i.e. a one-hot encoding with 26 categories). You may also include "<UNK>" in your chars (it will stand for every character that can't be found in chars). If you don't, unidentified characters will be encoded as not having a category (i.e. the same as padding); a minimal encoding sketch follows the parameter list. Finally, MAKE SURE THAT THE SAME chars IS USED FOR ENCODING ALL TEXT THAT WILL BE INPUT DATA TO THE NEURAL NET.
gram_lens
---- list
, optional (default: [5, 6, 7, 8, 9, 10, 11, 12, 13])
A list of the possible lengths a pattern can be (shorter patterns are zero-padded at the end so all patterns have the same length).
filter_over
---- int
, optional (default: 1)
The minimum number of times a gram must occur in the corpus to be a possible pattern.
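To make the encoding concrete, here is a hedged sketch of how a single gram could be one-hot encoded and zero-padded under these parameters. encode_gram is an illustrative helper, not part of the library, and the treatment of a trailing "<UNK>" in the chars string is an assumption based on the description above:
import numpy as np

def encode_gram(gram, chars, max_len):
    # ASSUMPTION: a chars string ending in "<UNK>" means one extra
    # category for unknown characters.
    has_unk = chars.endswith("<UNK>")
    categories = list(chars[:-5]) if has_unk else list(chars)
    num_categories = len(categories) + (1 if has_unk else 0)
    encoding = np.zeros((num_categories, max_len))
    for pos, ch in enumerate(gram):
        if ch in categories:
            encoding[categories.index(ch), pos] = 1.0
        elif has_unk:
            encoding[num_categories - 1, pos] = 1.0  # the "<UNK>" category
        # else: leave the column all zeros (same as padding)
    # Columns past len(gram) stay zero: the padding described for gram_lens.
    return encoding

vec = encode_gram("hello", "abcdefghijklmnopqrstuvwxyz<UNK>", max_len=8)
print(vec.shape)  # (27, 8): 26 letters + the "<UNK>" category, padded to length 8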
Example
import re
import nltk
import pandas as pd
nltk.download("gutenberg")
from nltk.corpus import gutenberg
corpus = gutenberg.raw("austen-emma.txt")
# Collapse arbitrary runs of newlines and spaces into single spaces
corpus = re.sub(r"[\n ]+", " ", corpus.lower())
chars = "".join(pd.Series(list(corpus)).value_counts(sort=True).keys()) + "<UNK>"
init = tokenization_layer.PatternsInitilizerMaxCover(corpus, chars)
# Initialize patterns of shape `(num_chars, max_len, 1, num_neurons)`
# Where there are `num_neurons` patterns (one for each neuron), each
# with random length/number of characters (but padded to be `max_len`)
# and each character being a one-hot encoding with `num_chars`
# categories.
patterns = init((len(init.chars), max(init.gram_lens), 1, 200))
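A quick sanity check on the result (an illustrative inspection, assuming the returned tensor is convertible to a NumPy array):
import numpy as np

arr = np.array(patterns)
first_pattern = arr[:, :, 0, 0]  # shape: (num_chars, max_len)
# All-zero columns are padding (or characters outside `chars`).
num_active = int((first_pattern.sum(axis=0) > 0).sum())
print(f"first pattern uses {num_active} of {first_pattern.shape[1]} positions")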