1, Some tools
1. Three ways to locally print the converted values of a non-sequence feature_column
Applies to TensorFlow 1.x.
```python
import tensorflow as tf
from tensorflow.python.feature_column import feature_column_v2 as fc_v2
from tensorflow.python.feature_column import feature_column as fc

# Note: only method 2 checks whether the input data conforms to the
# feature_column definition.
def numeric_column():
    column = tf.feature_column.numeric_column(
        key="feature",
        shape=(3, 2, 1,),
        default_value=100,
        dtype=tf.float32,
        normalizer_fn=lambda x: x / 2)
    features = {"feature": tf.constant(value=[
        [[1, 2], [3, 4], [5, 6]],
        [[7, 8], [9, 10], [11, 12]]
    ])}
    # feature_column processing method 1
    feature_cache = fc_v2.FeatureTransformationCache(features=features)
    rs_1 = column.get_dense_tensor(transformation_cache=feature_cache,
                                   state_manager=None)
    # feature_column processing method 2
    net = tf.feature_column.input_layer(features, column)
    # feature_column processing method 3
    builder = fc._LazyBuilder(features)
    rs_3 = column._get_dense_tensor(builder, None)
    with tf.Session() as sess:
        print(sess.run(rs_1))
        print(sess.run(net))
        print(sess.run(rs_3))

numeric_column()
```
2. Three ways to print the converted values of a sequence feature_column
Applies to TensorFlow 1.x.
2.1. The role of sequence features
Reference: *Deep Learning with TensorFlow: Engineering Project Practice*
2.2. How to use a sequence feature_column
```python
import tensorflow as tf
from tensorflow.python.feature_column import feature_column_v2 as fc_v2
from tensorflow.python.feature_column import feature_column as fc
from tensorflow.python.feature_column import feature_column_lib as fcl
from tensorflow.python.feature_column import sequence_feature_column as sqfc

def sequence_numeric_column():
    # Usage is basically the same as numeric_column
    column = tf.feature_column.sequence_numeric_column(
        key="feature",
        # shape specifies the shape of each element in the sequence.
        # The shape of the returned dense tensor is
        # [batch_size, element_count / prod(shape), shape].
        # Setting this value only affects the dense_tensor;
        # sequence_length depends only on the actual input data.
        shape=(3,),
        default_value=60,
        dtype=tf.float32,
        normalizer_fn=lambda x: x / 2)
    column2 = tf.contrib.feature_column.sequence_numeric_column(
        key="feature",
        shape=(3,),
        default_value=60,
        dtype=tf.float32,
        normalizer_fn=lambda x: x / 2)
    features = {
        # The value of a sequence feature must be a SparseTensor
        "feature": tf.SparseTensor(
            # indices must be written in order
            indices=[[0, 0, 1], [0, 1, 0], [0, 5, 0], [0, 5, 1],
                     [1, 2, 1], [1, 3, 0], [1, 3, 1]],
            values=[4, 1, 7, 9, 3, 4., 4],
            dense_shape=[2, 6, 2])
    }
    # Method 1
    feature_cache = fcl.FeatureTransformationCache(features=features)
    rs_1 = column.get_sequence_dense_tensor(
        transformation_cache=feature_cache, state_manager=None)
    # Method 2
    rs_2 = tf.contrib.feature_column.sequence_input_layer(features, column2)
    # Method 3
    builder = fc._LazyBuilder(features)
    rs_3 = column2._get_sequence_dense_tensor(builder, None)
    with tf.Session() as sess:
        print(sess.run(rs_1))
        print("111" * 20)
        print(sess.run(rs_2))
        print("222" * 20)
        print(sess.run(rs_3))

sequence_numeric_column()
```
3. Notes
Input requirements of input_layer:
All items should be instances of classes derived from `_DenseColumn` such as `numeric_column`, `embedding_column`, `bucketized_column`, `indicator_column`. If you have categorical features, you can wrap them with an `embedding_column` or `indicator_column`.
Simply put, the input to input_layer must be dense data.
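As a quick illustration (a minimal TF 1.x sketch; the feature name "color" is made up), a categorical column must be wrapped with indicator_column or embedding_column before input_layer accepts it:

```python
import tensorflow as tf

# Hypothetical feature "color": a raw categorical column is not dense,
# so wrap it with indicator_column before handing it to input_layer.
features = {"color": tf.constant([["red"], ["green"]])}
cat = tf.feature_column.categorical_column_with_vocabulary_list(
    key="color", vocabulary_list=["red", "green", "blue"])
net = tf.feature_column.input_layer(
    features, tf.feature_column.indicator_column(cat))

with tf.Session() as sess:
    sess.run(tf.tables_initializer())  # the vocabulary lookup table
    print(sess.run(net))  # [[1. 0. 0.] [0. 1. 0.]]
```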
2, feature_column introduction
1. What is feature_column
tf.feature_column is a set of tools for preprocessing input data; I generally treat it as TensorFlow's "feature engineering" layer.
2. What kinds of data processing it can do
3. It can handle fixed-length continuous real-valued (int or float) features
3.1. Examples of data that can be processed
"fea_1":0.123
"fea_2":[0.123,0.222]
"fea_3":[1,3,5]
"fea_4":10
"fea_0":[ [[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]] ]
"fea_sparse_1" : tf. Sparsetensor (# indexes should be written in order: indexes = [[0, 0, 1], [0, 1, 0], [0, 5, 0], [0, 5, 1], [1, 2, 1], [1, 3, 0], [1, 3, 1], values = [4, 1, 7, 9, 3, 4, 4], deny_shape = [2, 6, 2])
3.2. The normalizer_fn argument can preprocess the input data
```python
normalizer_fn=lambda x: x / 2
```
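A minimal sketch of the effect (TF 1.x; the key "fea_2" follows the examples above): normalizer_fn is applied to the raw values before the column's output is produced.

```python
import tensorflow as tf

# normalizer_fn halves every raw input value before it is returned
features = {"fea_2": tf.constant([[0.123, 0.222]])}
column = tf.feature_column.numeric_column(
    key="fea_2", shape=(2,), normalizer_fn=lambda x: x / 2)
net = tf.feature_column.input_layer(features, column)

with tf.Session() as sess:
    print(sess.run(net))  # [[0.0615 0.111 ]]
```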
3.3. How to write a non-sequence feature_column
```python
# numeric_column only supports int and float types
tf.feature_column.numeric_column(key="fea_1", shape=(1,), default_value=0,
                                 dtype=tf.float32, normalizer_fn=lambda x: ...)
tf.feature_column.numeric_column(key="fea_2", shape=(2,), default_value=0,
                                 dtype=tf.float32, normalizer_fn=lambda x: ...)
tf.feature_column.numeric_column(key="fea_3", shape=(3,), default_value=0,
                                 dtype=tf.int64, normalizer_fn=lambda x: ...)
```
3.4. Tips for using fixed-length real-valued features
Features such as fea_1, fea_2, fea_3, and fea_4 can be packed together into a single feature "fea_num", so that the generated TFRecord contains fewer keys and takes up less space (see the sketch below).
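A minimal sketch of the idea (TF 1.x; the key "fea_num" and the value layout are assumptions for illustration):

```python
import tensorflow as tf

# fea_1 (1 value) + fea_2 (2) + fea_3 (3) + fea_4 (1) concatenated in a
# fixed order into a single 7-element feature "fea_num"
features = {"fea_num": tf.constant(
    [[0.123, 0.123, 0.222, 1.0, 3.0, 5.0, 10.0]])}
column = tf.feature_column.numeric_column(
    key="fea_num", shape=(7,), dtype=tf.float32)
net = tf.feature_column.input_layer(features, column)

with tf.Session() as sess:
    print(sess.run(net))  # [[ 0.123  0.123  0.222  1.  3.  5.  10. ]]
```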
3.5. How to write a sequence feature_column
```python
column = tf.feature_column.sequence_numeric_column(
    key="feature",
    shape=(6,),
    default_value=60,
    dtype=tf.float32,
    normalizer_fn=lambda x: x / 2)
# For tf.contrib.feature_column.sequence_input_layer
column2 = tf.contrib.feature_column.sequence_numeric_column(
    key="feature",
    shape=(6,),
    default_value=60,
    dtype=tf.float32,
    normalizer_fn=lambda x: x / 2)

# Input: the sparse feature "fea_sparse_1"
# Result:
# TensorSequenceLengthPair(
#     dense_tensor=array(
#         [[[60. ,  2. ,  0.5, 60. , 60. , 60. ],
#           [60. , 60. , 60. , 60. ,  3.5,  4.5]],
#          [[60. , 60. , 60. , 60. , 60. ,  1.5],
#           [ 2. ,  2. , 60. , 60. , 60. , 60. ]]], dtype=float32),
#     sequence_length=array([6, 4], dtype=int64))
```
4. It can handle fixed-length discrete features: categorical_column
4.1. Examples of non-sequence and sequence categorical feature data (int and string types) that can be processed
```python
# 3 samples of 2-row x 2-column data
"fea_5": [
    [["value1", "value2"], ["value3", "value3"]],
    [["value3", "value5"], ["value4", "value4"]],
    [["value4", "value5"], ["value2", "value4"]]
]
# The following two are 1-D data
"fea_6": ["value1", "value2"]
"fea_7": [["value1"], ["value2"]]
# One 1-D sample
"fea_8": ["value1"]
# 2 samples of 1-row x 2-column data
"fea_9": [["value1", "value3"], ["value2", "value4"]]
# 2 samples of 2-row x 2-column data
"fea_10": [
    [["value1", "value2"], ["value3", "value3"]],
    [["value3", "value5"], ["value4", "value4"]]
]
# 3 dense samples of 2-row x 3-column data
"fea_11": [
    [[1, 2, 3], [4, 5, 6]],
    [[5, 6, 7], [8, 9, 10]],
    [[8, 9, 10], [11, 12, 13]]
]
# 3 dense samples of 1-row x 6-column data
"fea_12": [
    [1, 2, 3, 4, 5, 6],
    [5, 6, 7, 8, 9, 10],
    [8, 9, 10, 11, 12, 13]
]
# Weights for categorical feature values (the weights must match the
# dimensions of the corresponding data, with at most 2 dimensions)
"fea_weight_1": [
    [1.1, 2.2, 3.3, 4.4, 5.5, 6.6],
    [9.9, 8.8, 7.7, 6.6, 5.5, 4.4]
]
# Weight data: 3 samples of 1-row x 4-column data
"fea_weight_2": [
    [1.1, 2.2, 3.3, 4.4],
    [9.9, 8.8, 7.7, 6.6],
    [3.4, 8.8, 2.2, 6.6]
]
# 3 categorical feature samples of 1-row x 4-column data
"fea_13": [
    ["value1", "value2", "value3", "value3"],
    ["value3", "value5", "value4", "value4"],
    ["value4", "value5", "value2", "value4"]
]
# 2 feature samples of 3-row x 2-column data
"fea_14": [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 7], [9, 10], [11, 12]]
]
```
Int data follows the same patterns as above.
4.2. When a feature has few distinct values: categorical_column_with_vocabulary_list
There are four ways to use it:
- Represent category features and sequence category features as integer IDs
- Represent category features and sequence category features as multi_hot encodings
- Represent category features and sequence category features as weighted multi_hot encodings
- Represent category features and sequence category features as embedding vectors
Note: for dense tensor features, the input data dimensions must be consistent.
- Example processing results for non-sequence feature data:
```python
# Category values of int type
column = tf.feature_column.categorical_column_with_vocabulary_list(
    key="feature",
    vocabulary_list=[1, 2, 3, 4],
    dtype=tf.int64,
    default_value=-1,
    # Similar in purpose to default_value, but the two cannot both be in
    # effect at the same time.
    # Out-of-vocabulary values are mapped into
    # [len(vocabulary_list), len(vocabulary_list) + num_oov_buckets).
    # The default is 0.
    # When this value is non-zero, default_value must be set to -1.
    # When both default_value and num_oov_buckets take their defaults,
    # unknown values are mapped to -1.
    num_oov_buckets=4)

# Category values of string type
column = tf.feature_column.categorical_column_with_vocabulary_list(
    key="feature",
    vocabulary_list=["value1", "value2", "value3", "value4"],
    dtype=tf.string,
    default_value=-1,
    num_oov_buckets=4)

# Sparse tensor produced for input data "fea_5":
# SparseTensorValue(
#     indices=array([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
#                    [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1],
#                    [2, 0, 0], [2, 0, 1], [2, 1, 0], [2, 1, 1]], dtype=int64),
#     values=array([0, 1, 2, 2, 2, 6, 3, 3, 3, 6, 1, 3], dtype=int64),
#     dense_shape=array([3, 2, 2], dtype=int64))

# Usage 1: dense tensor of integer IDs
# [[[0 1] [2 2]]
#  [[2 6] [3 3]]
#  [[3 6] [1 3]]]

# Usage 2: multi_hot encoding
# (8 columns = len(vocabulary_list) + num_oov_buckets)
# [[[1. 1. 0. 0. 0. 0. 0. 0.] [0. 0. 2. 0. 0. 0. 0. 0.]]
#  [[0. 0. 1. 0. 0. 0. 1. 0.] [0. 0. 0. 2. 0. 0. 0. 0.]]
#  [[0. 0. 0. 1. 0. 0. 1. 0.] [0. 1. 0. 1. 0. 0. 0. 0.]]]

# Usage 3: embedding (3 columns = the embedding dimension you choose)
# [[[-0.36440656  0.1924808   0.1217252 ]   # represents ["value1", "value2"]
#   [ 0.71263236 -0.45157978 -0.3456324 ]]  # represents ["value3", "value3"]
#  [[-0.18493024 -0.20456922 -0.3947454 ]   # represents ["value3", "value5"]
#   [-0.19874108  0.6833139  -0.56441975]]  # represents ["value4", "value4"]
#  [[-0.64061695  0.3628776  -0.50413907]   # represents ["value4", "value5"]
#   [-0.28863966  0.14901578  0.16483489]]] # represents ["value2", "value4"]

# Caution:
# With input_layer, "fea_5" does not work while "fea_9" does;
# it seems dimensions that are too high are not supported.

# Weighted results for inputs "fea_weight_2" and "fea_13" (usages 4 and 5):

# Usage 4: weighted multi_hot (a plain dense integer representation is not
# useful here, but the embedding and weighted multi_hot features are)
# IdWeightPair(
#     id_tensor=SparseTensorValue(
#         indices=array([[0, 0], [0, 1], [0, 2], [0, 3],
#                        [1, 0], [1, 1], [1, 2], [1, 3],
#                        [2, 0], [2, 1], [2, 2], [2, 3]], dtype=int64),
#         values=array([0, 1, 2, 2, 2, 6, 3, 3, 3, 6, 1, 3], dtype=int64),
#         dense_shape=array([3, 4], dtype=int64)),
#     weight_tensor=SparseTensorValue(
#         indices=array([[0, 0], [0, 1], [0, 2], [0, 3],
#                        [1, 0], [1, 1], [1, 2], [1, 3],
#                        [2, 0], [2, 1], [2, 2], [2, 3]], dtype=int64),
#         values=array([1.1, 2.2, 3.3, 4.4, 9.9, 8.8, 7.7, 6.6,
#                       3.4, 8.8, 2.2, 6.6], dtype=float32),
#         dense_shape=array([3, 4], dtype=int64)))
# [[ 1.1       2.2       7.7       0.        0.        0.        0.   0. ]
#  [ 0.        0.        9.9      14.299999  0.        0.        8.8  0. ]
#  [ 0.        2.2       0.       10.        0.        0.        8.8  0. ]]

# Usage 5: weighted embedding
# [[ 0.16342753 -0.07898534 -0.33816564  0.2438156 ]
#  [ 0.04507026  0.30109608  0.08584949  0.28742552]
#  [ 0.00048126  0.315775    0.1192891   0.21302155]]
```
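For reference, a runnable sketch of usages 1-3 (TF 1.x), using the 2-D input "fea_9" since input_layer cannot handle "fea_5"; the embedding dimension 3 is an arbitrary choice:

```python
import tensorflow as tf
from tensorflow.python.feature_column import feature_column as fc

features = {"feature": tf.constant([["value1", "value3"],
                                    ["value2", "value4"]])}
column = tf.feature_column.categorical_column_with_vocabulary_list(
    key="feature", vocabulary_list=["value1", "value2", "value3", "value4"],
    dtype=tf.string, num_oov_buckets=4)

# Usage 1: integer ids via the internal _LazyBuilder
ids = column._get_sparse_tensors(fc._LazyBuilder(features)).id_tensor
# Usage 2: multi_hot via indicator_column + input_layer
multi_hot = tf.feature_column.input_layer(
    features, tf.feature_column.indicator_column(column))
# Usage 3: embedding via embedding_column + input_layer
emb = tf.feature_column.input_layer(
    features, tf.feature_column.embedding_column(column, dimension=3))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # embedding weights
    sess.run(tf.tables_initializer())            # vocabulary table
    print(sess.run(tf.sparse_tensor_to_dense(ids, -1)))
    print(sess.run(multi_hot))
    print(sess.run(emb))
```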
- Example processing results for sequence feature data:
```python
column = tf.feature_column.sequence_categorical_column_with_vocabulary_list(
    key="feature",
    vocabulary_list=["value1", "value2", "value3"],
    dtype=tf.string,
    default_value=-1,
    num_oov_buckets=2)

# Sparse tensor produced for the sequence feature "fea_10":
# SparseTensorValue(
#     indices=array([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
#                    [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=int64),
#     values=array([0, 1, 2, 2, 2, 3, 3, 3], dtype=int64),
#     dense_shape=array([2, 2, 2], dtype=int64))

# Usage 1: integer sequence representation
# [[[0 1] [2 2]]
#  [[2 3] [3 3]]]

# Usage 2: multi_hot representation
# (dimension is 5 = len(vocabulary_list) + num_oov_buckets)
# (array([[[1., 1., 0., 0., 0.],
#          [0., 0., 2., 0., 0.]],
#         [[0., 0., 1., 1., 0.],
#          [0., 0., 0., 2., 0.]]], dtype=float32),
#  array([2, 2], dtype=int64))

# Usage 3: embedding representation (dimension set to 3)
# (array([[[ 0.54921925,  0.039222  , -0.20265868],   # ["value1", "value2"]
#          [ 0.3889632 ,  0.43282962, -0.2105029 ]],  # ["value3", "value3"]
#         [[ 0.20231032, -0.11117572, -0.14481466],   # ["value3", "value5"]
#          [ 0.01565746, -0.65518105, -0.07912641]]], # ["value4", "value4"]
#        dtype=float32),
#  array([2, 2], dtype=int64))

# Note: instead of input_layer, use the following API:
#   tf.contrib.feature_column.sequence_input_layer
# It can handle features such as "fea_5" that the non-sequence
# input_layer cannot handle.
```
4.3. When a feature has a medium number of distinct values: categorical_column_with_vocabulary_file
There are four ways to use it:
- Represent category features and sequence category features as integer IDs
- Represent category features and sequence category features as multi_hot encodings
- Represent category features and sequence category features as weighted multi_hot encodings
- Represent category features and sequence category features as embedding vectors
Same as 4.2: if the input is a dense tensor, its feature dimensions must be consistent.
- Example processing results for non-sequence feature data:
Same as 4.2 categorical_column_with_vocabulary_list:
```python
column = tf.feature_column.categorical_column_with_vocabulary_file(
    key="feature",
    vocabulary_file="valuelist",
    dtype=tf.string,
    default_value=None,
    num_oov_buckets=3)

# Caution:
# input_layer cannot process multidimensional feature data such as "fea_5".
#
# The contents of the file "valuelist" are:
# value1
# value2
# value3
```
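A runnable sketch (TF 1.x) that writes the vocabulary file first; "value9" is out of vocabulary, so it lands in one of the three OOV buckets (ids 3..5):

```python
import tensorflow as tf
from tensorflow.python.feature_column import feature_column as fc

# Write the vocabulary file used by the column below
with open("valuelist", "w") as f:
    f.write("value1\nvalue2\nvalue3\n")

features = {"feature": tf.constant([["value1"], ["value9"]])}
column = tf.feature_column.categorical_column_with_vocabulary_file(
    key="feature", vocabulary_file="valuelist",
    dtype=tf.string, default_value=None, num_oov_buckets=3)
ids = column._get_sparse_tensors(fc._LazyBuilder(features)).id_tensor

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(tf.sparse_tensor_to_dense(ids, -1)))  # e.g. [[0] [4]]
```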
- Example processing results for sequence feature data:
Same as 4.2 sequence_categorical_column_with_vocabulary_list:
```python
column = tf.feature_column.sequence_categorical_column_with_vocabulary_file(
    key="feature",
    vocabulary_file="valuelist",
    dtype=tf.string,
    default_value=None,
    num_oov_buckets=3)
# Results and caveats are the same as the sequence case in 4.2
```
4.4. Using int features as categorical features: categorical_column_with_identity
There are four ways to use it:
- Represent category features and sequence category features as integer IDs
- Represent category features and sequence category features as multi_hot encodings
- Represent category features and sequence category features as weighted multi_hot encodings
- Represent category features and sequence category features as embedding vectors
Same as 4.2: if the input is a dense tensor (int type), the input feature dimensions must be consistent.
- Example processing results for non-sequence features:
```python
column = tf.feature_column.categorical_column_with_identity(
    key='feature',
    # Valid values lie in [0, num_buckets)
    num_buckets=10,
    # The value to map to when the data is outside [0, num_buckets).
    # Defaults to None, in which case out-of-range data raises an error.
    # default_value itself must lie within [0, num_buckets).
    default_value=3)
# Results and caveats are exactly the same as in 4.2
```
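A small sketch (TF 1.x): the out-of-range value 11 is mapped to default_value=3, while in-range values pass through unchanged:

```python
import tensorflow as tf
from tensorflow.python.feature_column import feature_column as fc

features = {"feature": tf.constant([[1, 11], [5, 9]], dtype=tf.int64)}
column = tf.feature_column.categorical_column_with_identity(
    key='feature', num_buckets=10, default_value=3)
ids = column._get_sparse_tensors(fc._LazyBuilder(features)).id_tensor

with tf.Session() as sess:
    print(sess.run(tf.sparse_tensor_to_dense(ids, -1)))  # [[1 3] [5 9]]
```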
- Examples of processing results for sequence features:
```python
column = tf.feature_column.sequence_categorical_column_with_identity(
    key='feature',
    num_buckets=10,
    default_value=3)
# Results and caveats are exactly the same as in 4.2
```
4.5. When string or int categorical features have too many distinct values: categorical_column_with_hash_bucket
There are four ways to use it:
- Represent category features and sequence category features as integer IDs
- Represent category features and sequence category features as multi_hot encodings
- Represent category features and sequence category features as weighted multi_hot encodings
- Represent category features and sequence category features as embedding vectors
Same as 4.2: if the input is dense tensor data, the input feature dimensions must be consistent.
- Example processing results for non-sequence features:
```python
# string type
column = tf.feature_column.categorical_column_with_hash_bucket(
    key="feature",
    # Size of the hash space
    hash_bucket_size=10,
    # Only string and integer types are supported;
    # integer values are also hash-mapped
    dtype=tf.string)

# int type
column = tf.feature_column.categorical_column_with_hash_bucket(
    key="feature",
    hash_bucket_size=10,
    dtype=tf.int64)
# Results and caveats are exactly the same as in 4.2
```
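A small sketch (TF 1.x): each string is hashed into one of hash_bucket_size buckets, so no vocabulary needs to be maintained; the exact bucket ids depend on the hash function:

```python
import tensorflow as tf
from tensorflow.python.feature_column import feature_column as fc

features = {"feature": tf.constant([["value1"], ["value2"]])}
column = tf.feature_column.categorical_column_with_hash_bucket(
    key="feature", hash_bucket_size=10, dtype=tf.string)
ids = column._get_sparse_tensors(fc._LazyBuilder(features)).id_tensor

with tf.Session() as sess:
    # Bucket ids lie in [0, 10); the exact values are hash-dependent
    print(sess.run(tf.sparse_tensor_to_dense(ids, -1)))
```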
- Examples of processing results for sequence features:
```python
# string type
column = tf.feature_column.sequence_categorical_column_with_hash_bucket(
    key="feature",
    hash_bucket_size=10,
    dtype=tf.string)

# int type
column = tf.feature_column.sequence_categorical_column_with_hash_bucket(
    key="feature",
    hash_bucket_size=10,
    dtype=tf.int64)
# Results and caveats are exactly the same as in 4.2
```
4.6. Crossing string or int features: crossed_column
There are four ways to use it:
- Represent category features and sequence category features as integer IDs
- Represent category features and sequence category features as multi_hot encodings
- Represent category features and sequence category features as weighted multi_hot encodings
- Represent category features and sequence category features as embedding vectors
Same as 4.2: if the input is dense tensor data, the input feature dimensions must be consistent.
```python
# When keys are raw input feature names:
column = tf.feature_column.crossed_column(
    # keys may also be CategoricalColumns
    # (hash-type categorical columns cannot be used)
    keys=["fea_9", "fea_12"],
    hash_bucket_size=100,
    hash_key=None)

# When keys are non-hash categorical columns:
column_voc = tf.feature_column.categorical_column_with_vocabulary_file(
    key="fea_9",
    vocabulary_file="valuelist",
    dtype=tf.string,
    default_value=None,
    num_oov_buckets=3)
column_iden = tf.feature_column.categorical_column_with_identity(
    key='fea_12',
    num_buckets=10,
    default_value=3)
column_cro = tf.feature_column.crossed_column(
    keys=[column_voc, column_iden],
    hash_bucket_size=10,
    hash_key=None)
# Results and caveats are exactly the same as the non-sequence case in 4.2
```
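A runnable sketch (TF 1.x) of crossing two raw features; the batch here is trimmed to two examples so the shapes of the "fea_9"/"fea_12" inputs match:

```python
import tensorflow as tf

features = {
    "fea_9": tf.constant([["value1", "value3"], ["value2", "value4"]]),
    "fea_12": tf.constant([[1, 2], [5, 6]], dtype=tf.int64),
}
column = tf.feature_column.crossed_column(
    keys=["fea_9", "fea_12"], hash_bucket_size=10, hash_key=None)
# A crossed column is categorical, so wrap it for input_layer
net = tf.feature_column.input_layer(
    features, tf.feature_column.indicator_column(column))

with tf.Session() as sess:
    print(sess.run(net))  # one multi_hot row of width 10 per example
```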
4.7. Bucketing int-type features into one_hot features by value boundaries: bucketized_column
How to use it:
- Represent category features and sequence category features as one_hot encodings
The input is a dense feature tensor.
```python
numeric_column = tf.feature_column.numeric_column(
    key="feature",
    shape=6,
    default_value=0,
    dtype=tf.float32)
column = tf.feature_column.bucketized_column(
    # 1-D numeric column
    source_column=numeric_column,
    # boundaries must be a list in ascending order
    boundaries=[3, 5, 7, 10])

# Input: the numeric feature "fea_14"

# Output mode 1: input_layer output
# [[1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
#  [0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1.]]

# Output mode 2: get_dense_tensor output
# [[[[1. 0. 0. 0. 0.] [1. 0. 0. 0. 0.]]
#   [[0. 1. 0. 0. 0.] [0. 1. 0. 0. 0.]]
#   [[0. 0. 1. 0. 0.] [0. 0. 1. 0. 0.]]]
#  [[[0. 0. 0. 1. 0.] [0. 0. 0. 1. 0.]]
#   [[0. 0. 0. 1. 0.] [0. 0. 0. 0. 1.]]
#   [[0. 0. 0. 0. 1.] [0. 0. 0. 0. 1.]]]]
```
5. Usage examples of multi_hot, one_hot, embedding, and shared embedding
5.1. How to make multi_hot features: indicator_column
```python
# column can be any of:
#   categorical_column_with_vocabulary_list / sequence_categorical_column_with_vocabulary_list
#   categorical_column_with_vocabulary_file / sequence_categorical_column_with_vocabulary_file
#   categorical_column_with_identity / sequence_categorical_column_with_identity
#   categorical_column_with_hash_bucket / sequence_categorical_column_with_hash_bucket
#   crossed_column
#   weighted_categorical_column
tf.feature_column.indicator_column(column)
```
5.2. How to make embedding features: embedding_column
```python
# column can be any of:
#   categorical_column_with_vocabulary_list / sequence_categorical_column_with_vocabulary_list
#   categorical_column_with_vocabulary_file / sequence_categorical_column_with_vocabulary_file
#   categorical_column_with_identity / sequence_categorical_column_with_identity
#   categorical_column_with_hash_bucket / sequence_categorical_column_with_hash_bucket
#   crossed_column
#   weighted_categorical_column
# dimension (the embedding size) is a required argument
tf.feature_column.embedding_column(column, dimension=...)
```
5.3. How to make one_hot features: bucketized_column
```python
numeric_column = tf.feature_column.numeric_column(
    key="feature",
    shape=6,
    default_value=0,
    dtype=tf.float32)
column = tf.feature_column.bucketized_column(
    # 1-D numeric column
    source_column=numeric_column,
    # boundaries must be a list in ascending order
    boundaries=[3, 5, 7, 10])
```
5.4. How to make shared embedding features: shared_embeddings
```python
# column can be any of:
#   categorical_column_with_vocabulary_list / sequence_categorical_column_with_vocabulary_list
#   categorical_column_with_vocabulary_file / sequence_categorical_column_with_vocabulary_file
#   categorical_column_with_identity / sequence_categorical_column_with_identity
#   categorical_column_with_hash_bucket / sequence_categorical_column_with_hash_bucket
#   crossed_column
#   weighted_categorical_column
# The columns are passed as a list, and dimension is required
tf.feature_column.shared_embeddings([column, column], dimension=...)
```
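A minimal sketch (TF 1.x; the keys "query_word" and "doc_word" are made up), using the equivalent stable API tf.feature_column.shared_embedding_columns, where both columns share one embedding table:

```python
import tensorflow as tf

features = {
    "query_word": tf.constant([["value1"], ["value2"]]),
    "doc_word": tf.constant([["value2"], ["value3"]]),
}
col_a = tf.feature_column.categorical_column_with_vocabulary_list(
    key="query_word", vocabulary_list=["value1", "value2", "value3"])
col_b = tf.feature_column.categorical_column_with_vocabulary_list(
    key="doc_word", vocabulary_list=["value1", "value2", "value3"])
# Both columns look up the same shared embedding table
shared = tf.feature_column.shared_embedding_columns([col_a, col_b],
                                                    dimension=3)
net = tf.feature_column.input_layer(features, shared)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    print(sess.run(net))  # shape [2, 6]: two 3-d embeddings side by side
```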
6. Some problems and explanations
In TensorFlow 1.x:
- from tensorflow.python.feature_column import feature_column as fc
- from tensorflow.python.feature_column import feature_column_v2 as fc_v2
- fc_v2.FeatureTransformationCache caches the input data (Tensor or SparseTensor)
- column.get_dense_tensor(transformation_cache=feature_cache, state_manager=None) reads the cached data from the FeatureTransformationCache; here column is a tf.feature_column.numeric_column
- fc._LazyBuilder likewise caches the input data (Tensor or SparseTensor)
- column._get_dense_tensor(builder, None) reads the cached data from the _LazyBuilder; here column is a tf.feature_column.numeric_column
- tf.feature_column.input_layer takes the input data and the feature_column directly and performs the transformation; here the feature_column is a tf.feature_column.numeric_column
- column2._get_sequence_dense_tensor(builder, None), where column2 is a tf.contrib.feature_column.sequence_numeric_column and builder is a _LazyBuilder
- column.get_sequence_dense_tensor(feature_cache, None), where column is a tf.feature_column.sequence_numeric_column and feature_cache is a FeatureTransformationCache
- tf.contrib.feature_column.sequence_input_layer(features, column2), where column2 is a tf.contrib.feature_column.sequence_numeric_column
- fc_v2._StateManagerImpl(layer=tf.keras.layers.Layer(), trainable=True) creates the weights used to generate embedding features
- weigthed_col_emb.create_state(state_manager), where state_manager is a _StateManagerImpl and weigthed_col_emb is a tf.feature_column.embedding_column
- rs_w_3 = weigthed_col_emb.get_dense_tensor(feature_weight_cache, state_manager), where feature_weight_cache is a FeatureTransformationCache and state_manager is a _StateManagerImpl
- Feature values (Tensor or SparseTensor) can be printed as follows:
```python
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    print(sess.run(rs_1))
    print(rs_1.eval())
    print(tf.sparse_tensor_to_dense(rs_2.id_tensor, -1).eval())
```
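Putting the pieces above together, a sketch (TF 1.x; the feature keys and values are made up) of building a weighted embedding with the internal _StateManagerImpl / FeatureTransformationCache machinery described in this section:

```python
import tensorflow as tf
from tensorflow.python.feature_column import feature_column_v2 as fc_v2

features = {
    "feature": tf.constant([["value1", "value2"]]),
    "weight": tf.constant([[2.0, 0.5]]),
}
cat = tf.feature_column.categorical_column_with_vocabulary_list(
    key="feature", vocabulary_list=["value1", "value2", "value3"])
weighted = tf.feature_column.weighted_categorical_column(cat, "weight")
weigthed_col_emb = tf.feature_column.embedding_column(weighted, dimension=3)

# Create the embedding weights, then resolve the dense tensor via the cache
state_manager = fc_v2._StateManagerImpl(layer=tf.keras.layers.Layer(),
                                        trainable=True)
weigthed_col_emb.create_state(state_manager)
feature_weight_cache = fc_v2.FeatureTransformationCache(features=features)
rs_w_3 = weigthed_col_emb.get_dense_tensor(feature_weight_cache, state_manager)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    print(sess.run(rs_w_3))
```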