Common feature processing strategies:
- User id and item id must be converted into embedded vectors
- The original text needs to be tokenized and translated into embedded text
- Numerical characteristics need to be standardized
By using TensorFlow, we can treat this preprocessing as part of the model rather than a separate preprocessing step. This is not only convenient, but also ensures that our pretreatment is exactly the same during training and service. This makes it safe and easy to deploy models that even include very complex preprocessing.
data set
import pprint import tensorflow_datasets as tfds ratings = tfds.load("movielens/100k-ratings", split="train") for x in ratings.take(1).as_numpy_iterator(): pprint.pprint(x)
There are a couple of key features here:
- Movie title is useful as a movie identifier.
- User id is useful as a user identifier.
- Timestamps will allow us to model the effect of time.
The first two are categorical features; timestamps are a continuous feature.
embedding category features
Suppose our goal is to predict which users will watch which movie. To this end, we use an embedded vector to represent each user and each movie. Initially, these embedding values will take random values, but during training, we will adjust them so that users' embedding is ultimately closer to the movie they watch.
Taking raw categorical features and turning them into embeddings is normally a two-step process:
- Firstly, we need to translate the raw values into a range of
contiguous integers, normally by building a mapping (called a
"vocabulary") that maps raw values ("Star Wars") to integers (say,
15). - Secondly, we need to take these integers and turn them into
embeddings.
Define vocabulary through Keras preprocessing layers
import numpy as np import tensorflow as tf movie_title_lookup = tf.keras.layers.StringLookup()
The layer itself doesn't have a vocabulary, but we can build it with data.
movie_title_lookup.adapt(ratings.map(lambda x: x["movie_title"])) print(f"Vocabulary: {movie_title_lookup.get_vocabulary()[:3]}")
Once we have this, we can use the layer to convert the original tag to the embedded id:
movie_title_lookup(["Star Wars (1977)", "One Flew Over the Cuckoo's Nest (1975)"])
Note: the vocabulary of this layer includes one (or more!) Unknown (or "off vocabulary", OOV) tag. This is very convenient: this means that the layer can handle category values that are not in the vocabulary.
Feature hash
In fact, the StringLookup layer allows us to configure multiple OOV indexes. If we do, any raw values that are not in the vocabulary will be deterministic hashed to one of the OOV indexes. The more such indexes, the less likely it is that two different original eigenvalues will be hashed to the same OOV index. Therefore, if we have enough such indexes, the model should be able to train as well as the model with explicit vocabulary without maintaining the token list.
We can hash dependent features without using any vocabulary at all. This is in TF keras. layers. Implemented in the hash layer.
# We set up a large number of bins to reduce the chance of hash collisions. num_hashing_bins = 200_000 movie_title_hashing = tf.keras.layers.Hashing( num_bins=num_hashing_bins )
We can look it up as before without building a vocabulary:
movie_title_hashing(["Star Wars (1977)", "One Flew Over the Cuckoo's Nest (1975)"])
Define embedding
When we have integer IDS, we can use embedded layer to convert them into embedded vectors.
When creating an embedded layer for the movie title, we will set the first value to the size of the Title Vocabulary (or the number of hash boxes). The second problem depends on us: the larger it is, the larger the capacity of the model is, but the slower the speed of adaptation and service is.
movie_title_embedding = tf.keras.layers.Embedding( # Let's use the explicit vocabulary lookup. input_dim=movie_title_lookup.vocab_size(), output_dim=32 )
We can put these two into one layer, which can get the original text and generate embedded text.
movie_title_model = tf.keras.Sequential([movie_title_lookup, movie_title_embedding])
Like this, we can directly get the embedding of the film title:
movie_title_model(["Star Wars (1977)"])
The same operation is adopted for the user side
user_id_lookup = tf.keras.layers.StringLookup() user_id_lookup.adapt(ratings.map(lambda x: x["user_id"])) user_id_embedding = tf.keras.layers.Embedding(user_id_lookup.vocab_size(), 32) user_id_model = tf.keras.Sequential([user_id_lookup, user_id_embedding])
Continuous feature standardization
Continuous features also need to be standardized. For example, the time stamp feature is too large to be directly used in the deep model. There are two commonly used processing methods: discretization and standardization
for x in ratings.take(3).as_numpy_iterator(): print(f"Timestamp: {x['timestamp']}.")
Standardization
timestamp_normalization = tf.keras.layers.Normalization( axis=None ) timestamp_normalization.adapt(ratings.map(lambda x: x["timestamp"]).batch(1024)) for x in ratings.take(3).as_numpy_iterator(): print(f"Normalized timestamp: {timestamp_normalization(x['timestamp'])}.")
Discretization
When we suspect that a feature is discontinuous, we can discretize it:
Equal width
max_timestamp = ratings.map(lambda x: x["timestamp"]).reduce( tf.cast(0, tf.int64), tf.maximum).numpy().max() min_timestamp = ratings.map(lambda x: x["timestamp"]).reduce( np.int64(1e9), tf.minimum).numpy().min() timestamp_buckets = np.linspace( min_timestamp, max_timestamp, num=1000) print(f"Buckets: {timestamp_buckets[:3]}")
Given the bucket boundary, we can convert the timestamp to embedded:
timestamp_embedding_model = tf.keras.Sequential([ tf.keras.layers.Discretization(timestamp_buckets.tolist()), tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32) ]) for timestamp in ratings.take(1).map(lambda x: x["timestamp"]).batch(1).as_numpy_iterator(): print(f"Timestamp embedding: {timestamp_embedding_model(timestamp)}.")
Processing of text features
title_text = tf.keras.layers.TextVectorization() title_text.adapt(ratings.map(lambda x: x["movie_title"]))
give an example:
for row in ratings.batch(1).map(lambda x: x["movie_title"]).take(1): print(title_text(row))
We can check the vocabulary of learning to verify that the correct tokenization is used in this layer:
title_text.get_vocabulary()[40:45]
Build pretreatment model
User side model
class UserModel(tf.keras.Model): def __init__(self): super().__init__() self.user_embedding = tf.keras.Sequential([ user_id_lookup, tf.keras.layers.Embedding(user_id_lookup.vocab_size(), 32), ]) self.timestamp_embedding = tf.keras.Sequential([ tf.keras.layers.Discretization(timestamp_buckets.tolist()), tf.keras.layers.Embedding(len(timestamp_buckets) + 2, 32) ]) self.normalized_timestamp = tf.keras.layers.Normalization( axis=None ) def call(self, inputs): # Take the input dictionary, pass it through each input layer, # and concatenate the result. return tf.concat([ self.user_embedding(inputs["user_id"]), self.timestamp_embedding(inputs["timestamp"]), tf.reshape(self.normalized_timestamp(inputs["timestamp"]), (-1, 1)) ], axis=1)
Test:
user_model = UserModel() user_model.normalized_timestamp.adapt( ratings.map(lambda x: x["timestamp"]).batch(128)) for row in ratings.batch(1).take(1): print(f"Computed representations: {user_model(row)[0, :3]}")
Film side model
class MovieModel(tf.keras.Model): def __init__(self): super().__init__() max_tokens = 10_000 self.title_embedding = tf.keras.Sequential([ movie_title_lookup, tf.keras.layers.Embedding(movie_title_lookup.vocab_size(), 32) ]) self.title_text_embedding = tf.keras.Sequential([ tf.keras.layers.TextVectorization(max_tokens=max_tokens), tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True), # We average the embedding of individual words to get one embedding vector # per title. tf.keras.layers.GlobalAveragePooling1D(), ]) def call(self, inputs): return tf.concat([ self.title_embedding(inputs["movie_title"]), self.title_text_embedding(inputs["movie_title"]), ], axis=1)
test
movie_model = MovieModel() movie_model.title_text_embedding.layers[0].adapt( ratings.map(lambda x: x["movie_title"])) for row in ratings.batch(1).take(1): print(f"Computed representations: {movie_model(row)[0, :3]}")