reference resources:
- CRF++: Yet Another CRF toolkit
- CRF++ usage: Chinese translation
- Model Download
- Model installation and training
train
Mode 1:
% crf_learn template_file train_file model_file
Here template_file and train_file are input files that need to be prepared in advance (a minimal illustration of both follows); crf_learn generates the trained model and writes it to model_file.
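For orientation, here is a minimal, illustrative sketch of the two input files (the tokens and tags are made up; any consistent layout works as long as every line has the same number of whitespace-separated columns and the answer tag comes last). In train_file, sentences are separated by blank lines; in template_file, U-prefixed lines are unigram feature templates in which %x[row,col] refers to a column relative to the current token, a bare B line adds bigram features over adjacent output tags, and lines starting with # are comments.
train_file (token, POS, answer tag):
Confidence NN B
in IN O
the DT B
pound NN I
. . O

He PRP B
reckons VBZ O
...
template_file:
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[0,1]

# Bigram
B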
The training output results are as follows:
CRF++: Yet Another CRF Tool Kit
Copyright(C) 2005 Taku Kudo, All rights reserved.

reading training data: 100.. 200.. 300.. 400.. 500.. 600.. 700.. 800.. Done! 1.94 s

Number of sentences: 823
Number of features:  1075862
Number of thread(s): 1
Freq:                1
eta:                 0.00010
C:                   1.00000
shrinking size:      20
Algorithm:           CRF

iter=0 terr=0.99103 serr=1.00000 obj=54318.36623 diff=1.00000
iter=1 terr=0.35260 serr=0.98177 obj=44996.53537 diff=0.17161
iter=2 terr=0.35260 serr=0.98177 obj=21032.70195 diff=0.53257
iter=3 terr=0.23879 serr=0.94532 obj=13642.32067 diff=0.35138
iter=4 terr=0.15324 serr=0.88700 obj=8985.70071 diff=0.34134
iter=5 terr=0.11605 serr=0.80680 obj=7118.89846 diff=0.20775
iter=6 terr=0.09305 serr=0.72175 obj=5531.31015 diff=0.22301
iter=7 terr=0.08132 serr=0.68408 obj=4618.24644 diff=0.16507
iter=8 terr=0.06228 serr=0.59174 obj=3742.93171 diff=0.18953
- iter: number of iterations
- terr: error rate of tags (# of error tags / # of all tags)
- serr: error rate of sentences (# of error sentences/# of all sentences)
- obj: current object value, i.e. the value of the regularized objective being minimized (under L2 it includes a ||w||^2 penalty term). When this value converges to a fixed point, CRF++ stops the iteration
- diff: relative difference from the previous object value, i.e. (4618.24644-3742.93171) / 4618.24644 = 0.18953
There are four main parameters that control the training conditions (a combined invocation is sketched after this list):
- -a CRF-L2 or CRF-L1: change the regularization algorithm. The default is L2. Generally speaking, L2 performs slightly better than L1, while the number of non-zero features under L1 is drastically smaller than under L2
- -c float: this option changes the hyperparameter of the CRF. With a larger C value, the CRF tends to overfit the given training corpus. This parameter trades off overfitting against underfitting and significantly affects the results. You can find an optimal value using held-out data or a more general model selection method such as cross-validation
- -f NUM: this parameter sets the cut-off threshold for features. CRF++ only uses features that occur no fewer than NUM times in the given training data. The default value is 1. When you apply CRF++ to large data sets, the number of unique features can reach several million, and this option is useful in such cases
- -p NUM: NUM is the number of threads. If your PC has multiple CPUs, you can speed up training by using multithreading.
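Putting these options together, a hypothetical invocation that switches to L1 regularization, raises the feature cut-off, and trains on four threads could look like this (the flag values are illustrative, not recommendations):
% crf_learn -a CRF-L1 -f 2 -c 1.5 -p 4 template_file train_file model_file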
Mode 2:
% crf_learn -f 3 -c 1.5 template_file train_file model_file
Starting from version 0.45, CRF++ supports single-best MIRA training. MIRA training is used when the -a MIRA option is set.
% crf_learn -a MIRA template train.data model
CRF++: Yet Another CRF Tool Kit
Copyright(C) 2005 Taku Kudo, All rights reserved.

reading training data: 100.. 200.. 300.. 400.. 500.. 600.. 700.. 800.. Done! 1.92 s

Number of sentences: 823
Number of features:  1075862
Number of thread(s): 1
Freq:                1
eta:                 0.00010
C:                   1.00000
shrinking size:      20
Algorithm:           MIRA

iter=0 terr=0.11381 serr=0.74605 act=823 uact=0 obj=24.13498 kkt=28.00000
iter=1 terr=0.04710 serr=0.49818 act=823 uact=0 obj=35.42289 kkt=7.60929
iter=2 terr=0.02352 serr=0.30741 act=823 uact=0 obj=41.86775 kkt=5.74464
iter=3 terr=0.01836 serr=0.25881 act=823 uact=0 obj=47.29565 kkt=6.64895
iter=4 terr=0.01106 serr=0.17011 act=823 uact=0 obj=50.68792 kkt=3.81902
iter=5 terr=0.00610 serr=0.10085 act=823 uact=0 obj=52.58096 kkt=3.98915
Parameters:
- act: the number of active examples in the working set
- uact: the number of active examples whose dual parameters reach the upper bound of the soft margin C. A uact of 0 indicates that the given training data is linearly separable
- kkt: the maximum KKT violation value. When it reaches 0.0, MIRA training ends
There are several parameters that control MIRA training conditions (an example invocation follows this list):
- -c float: change the soft margin parameter, which is analogous to the soft margin parameter C in support vector machines. The definition is essentially the same as that of the -c option in CRF training. With a larger C value, MIRA tends to overfit the given training corpus
- -f NUM: same as in CRF training
- -H NUM: change the shrinking size. When a training sentence is not used to update the parameter vector NUM times, we can consider that the instance no longer contributes to training, and MIRA tries to remove such instances. This process is called "shrinking". With a smaller NUM, shrinking occurs at an earlier stage, which drastically reduces training time. However, too small a NUM is not recommended: after training finishes, MIRA goes through all training examples again to check whether all KKT conditions are really satisfied, and too small a NUM increases the chance of such rechecks.
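For example, a hypothetical MIRA run with a smaller shrinking size and a tighter soft margin could be invoked as follows (the values 10 and 0.5 are purely illustrative):
% crf_learn -a MIRA -H 10 -c 0.5 template train.data model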
test
% crf_test -m model_file test_files ...
Output:
% crf_test -m model test.data
Rockwell        NNP     B       B
International   NNP     I       I
Corp.           NNP     I       I
's              POS     B       B
Tulsa           NNP     I       I
unit            NN      I       I
..
The last column gives the (estimated) tag. If the third column holds the true answer tag, you can evaluate the accuracy simply by comparing the third and fourth columns, as in the sketch below.
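As a quick way to do that comparison, the following one-liner is a minimal sketch that assumes the default four-column output shown above (token, POS, gold tag, predicted tag) and counts the token lines where the last two columns agree; for chunk-level precision and recall you would use an evaluation script such as conlleval instead:
% crf_test -m model test.data | awk 'NF>0 {total++; if ($(NF-1) == $NF) correct++} END {printf "token accuracy: %.4f\n", correct/total}'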
Level of detail:
- The -v option sets the verbose level. The default value is 0. By increasing the level, you can get extra information from CRF++
- level 1:
You can also obtain marginal probabilities for each tag (a kind of confidence measure for each output tag) and the conditional probability of the whole output (a confidence measure for the entire output sequence; the line beginning with # in the examples below). A small filtering example based on these values is sketched at the end of this section.
% crf_test -v1 -m model test.data | head
# 0.478113
Rockwell        NNP     B       B/0.992465
International   NNP     I       I/0.979089
Corp.           NNP     I       I/0.954883
's              POS     B       B/0.986396
Tulsa           NNP     I       I/0.991966
...
- level 2:
% crf_test -v2 -m model test.data
# 0.478113
Rockwell        NNP     B       B/0.992465      B/0.992465      I/0.00144946    O/0.00608594
International   NNP     I       I/0.979089      B/0.0105273     I/0.979089      O/0.0103833
Corp.           NNP     I       I/0.954883      B/0.00477976    I/0.954883      O/0.040337
's              POS     B       B/0.986396      B/0.986396      I/0.00655976    O/0.00704426
Tulsa           NNP     I       I/0.991966      B/0.00787494    I/0.991966      O/0.00015949
unit            NN      I       I/0.996169      B/0.00283111    I/0.996169      O/0.000999975
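One practical use of the -v1 marginals is flagging low-confidence predictions for manual review. The sketch below assumes the -v1 format shown earlier, where each token line ends in TAG/probability: it splits lines on '/' and prints those whose final field (the marginal probability of the chosen tag) falls below 0.9 (an arbitrary threshold):
% crf_test -v1 -m model test.data | awk -F'/' 'NF>1 && $NF+0 < 0.9 {print}'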