Preface
This article looks at how to use OpenNLP for custom named entities: annotating the training data, training a model, and applying the model.
Maven
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.8.4</version>
</dependency>
Practice
Training the model
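import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import opennlp.tools.namefind.BioCodec;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

// train the name finder: one annotated sample per line
String typedEntities = "<START:organization> NATO <END>\n"
        + "<START:location> United States <END>\n"
        + "<START:organization> NATO Parliamentary Assembly <END>\n"
        + "<START:location> Edinburgh <END>\n"
        + "<START:location> Britain <END>\n"
        + "<START:person> Anders Fogh Rasmussen <END>\n"
        + "<START:location> U . S . <END>\n"
        + "<START:person> Barack Obama <END>\n"
        + "<START:location> Afghanistan <END>\n"
        + "<START:person> Rasmussen <END>\n"
        + "<START:location> Afghanistan <END>\n"
        + "<START:date> 2010 <END>";

// Read the samples from memory. OpenNLP's own unit tests use a
// MockInputStreamFactory here, which lives in the test sources rather than
// the opennlp-tools jar; any InputStreamFactory over the text works.
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
        new PlainTextByLineStream(
                () -> new ByteArrayInputStream(typedEntities.getBytes(StandardCharsets.UTF_8)),
                "UTF-8"));

TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
params.put(TrainingParameters.ITERATIONS_PARAM, 70);
params.put(TrainingParameters.CUTOFF_PARAM, 1);

// language code "eng"; type is null so the types come from the annotations
TokenNameFinderModel nameFinderModel = NameFinderME.train("eng", null, sampleStream, params,
        TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));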
OpenNLP uses <START> and <END> to mark the beginning and end of an entity. To give the entity a type, append the type name to START after a colon, for example <START:person>.
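To make the annotation format concrete, here is a minimal sketch in plain Java of how one annotated line maps to tokens plus typed token ranges. It is an illustration only and not OpenNLP's parser: the class AnnotationSketch, its nested types, the sample sentence, and the "default" label for an untyped <START> are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not OpenNLP code. All names here are hypothetical.
class AnnotationSketch {

    /** A typed entity covering tokens [start, end) of the plain sentence. */
    static final class Entity {
        final String type;
        final int start; // index of the first token, inclusive
        final int end;   // index after the last token, exclusive

        Entity(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** The plain tokens of a line plus the entities found among them. */
    static final class Parsed {
        final List<String> tokens = new ArrayList<>();
        final List<Entity> entities = new ArrayList<>();
    }

    /** Splits a whitespace-tokenized, annotated line into tokens and entities. */
    static Parsed parse(String annotated) {
        Parsed result = new Parsed();
        String openType = null; // type of the entity currently open, if any
        int openStart = -1;     // token index at which that entity began

        for (String piece : annotated.trim().split("\\s+")) {
            if (piece.isEmpty()) {
                continue; // blank input line
            }
            if (piece.startsWith("<START") && piece.endsWith(">")) {
                if (openType != null) {
                    throw new IllegalArgumentException("nested <START>");
                }
                // "<START:person>" -> "person"; a bare "<START>" gets "default"
                int colon = piece.indexOf(':');
                openType = colon < 0 ? "default"
                        : piece.substring(colon + 1, piece.length() - 1);
                openStart = result.tokens.size();
            } else if (piece.equals("<END>")) {
                if (openType == null) {
                    throw new IllegalArgumentException("<END> without <START>");
                }
                result.entities.add(new Entity(openType, openStart, result.tokens.size()));
                openType = null;
            } else {
                result.tokens.add(piece); // ordinary token
            }
        }
        if (openType != null) {
            throw new IllegalArgumentException("missing <END>");
        }
        return result;
    }

    public static void main(String[] args) {
        Parsed p = parse("<START:person> Barack Obama <END> visited "
                + "<START:location> Afghanistan <END> .");
        for (Entity e : p.entities) {
            System.out.println(e.type + ": "
                    + String.join(" ", p.tokens.subList(e.start, e.end)));
        }
    }
}
```

The markers must be separated from the surrounding tokens by whitespace, which is why the training lines above write "<START:location> U . S . <END>" with every token spaced out.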
Parameter description
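- ALGORITHM_PARAM
The training algorithm. MAXENT selects the maximum entropy trainer. As the maxent documentation puts it: "On the engineering level, using maxent is an excellent way of creating programs which perform very difficult classification tasks very well."
- ITERATIONS_PARAM
Number of training iterations. (In the command-line tools, the corresponding -iterations option is ignored if -params is used.)
- CUTOFF_PARAM
Minimum number of times a feature must be seen in the training data for it to be used by the model.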
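The effect of the cutoff can be sketched in a few lines of plain Java. This is a simplified illustration of the idea (count how often each feature occurs and keep only those seen at least cutoff times), not OpenNLP's actual indexing code; the class CutoffSketch and the feature strings are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; not OpenNLP code. All names here are hypothetical.
class CutoffSketch {

    /** Returns the features that occur at least {@code cutoff} times across all events. */
    static Set<String> keptFeatures(List<List<String>> events, int cutoff) {
        // Count every occurrence of every feature.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        // Keep only the features that reach the cutoff.
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Three training events, each described by made-up feature strings.
        List<List<String>> events = Arrays.asList(
                Arrays.asList("w=nato", "cap"),
                Arrays.asList("w=obama", "cap"),
                Arrays.asList("w=nato", "cap"));
        System.out.println(keptFeatures(events, 1)); // every feature survives
        System.out.println(keptFeatures(events, 2)); // "w=obama" is dropped
    }
}
```

With only a dozen training lines, as in the example above, most features occur once or twice, so a cutoff of 1 keeps all of them; a higher cutoff would discard most of what the small sample provides.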
Using the model
Once trained as above, the model can be used to find entities in text.
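import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.util.Span;

NameFinderME nameFinder = new NameFinderME(nameFinderModel);

// now test if it can detect the sample sentences
String[] sentence = "NATO United States Barack Obama".split("\\s+");
Span[] names = nameFinder.find(sentence);

Stream.of(names)
        .forEach(span -> {
            // a span covers the tokens from getStart() (inclusive) to getEnd() (exclusive)
            String named = IntStream.range(span.getStart(), span.getEnd())
                    .mapToObj(i -> sentence[i])
                    .collect(Collectors.joining(" "));
            System.out.println("find type: " + span.getType() + ",name: " + named);
        });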
The output is as follows:
find type: organization,name: NATO
find type: location,name: United States
find type: person,name: Barack Obama
Summary
OpenNLP's annotation format for custom named entities leaves room for customization: developers can define entity types specific to their own domain, which helps improve the accuracy of named entity recognition for those entities.