Configuration File options

Bilbo comes with a pipeline_config file (located in bilbo/config under the bilbo2 directory). There are actually two pipeline_config files available: one for annotating bibliographic references (the <bibl> tag in the TEI/XML format) and one for annotating footnotes (the <note> tag in the TEI/XML format). You can modify each of the options presented in this file. Currently, the file is an INI configuration file. In the future we expect to handle JSON or XML configuration files. Each module of Bilbo can also run on its own. Some of these options (not all) can also be set with command-line arguments when running Bilbo with Python.

PIPELINE

In this part, you specify the pipeline you want to run. Note that for training it does not make sense to add the generate step. The pipeline is one of the components to set before going any further. This section is marked by:

  • [PIPELINE]

verbose

Set to False by default.

pipeline

Chain the desired algorithms in the order they should run, for instance:

PIPELINE=shape_data,features,svm,crf,generate

Example:

[PIPELINE]
PIPELINE=shape_data,features,svm,crf,generate
outputFile=None
verbose=True
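
As a minimal sketch of how this section could be read, the snippet below parses the example with Python's standard configparser module and splits the PIPELINE value into an ordered list of steps. The helper name read_pipeline is hypothetical and is not part of Bilbo; Bilbo's own loader may work differently.

```python
import configparser

EXAMPLE = """
[PIPELINE]
PIPELINE=shape_data,features,svm,crf,generate
outputFile=None
verbose=True
"""


def read_pipeline(text):
    """Return (steps, verbose) from an INI string (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["PIPELINE"]
    # configparser lowercases option names, so PIPELINE and pipeline are equivalent.
    steps = [step.strip() for step in section["pipeline"].split(",") if step.strip()]
    # getboolean accepts True/False, yes/no, on/off and 1/0, case-insensitively.
    verbose = section.getboolean("verbose", fallback=False)
    return steps, verbose


steps, verbose = read_pipeline(EXAMPLE)
print(steps)
print(verbose)
```

For a training run, the same list would simply omit the generate step.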

SHAPER

This section is marked by:

  • [shaper]

tokenizerOption

This option currently has no effect. Upcoming developments on tokenization are expected to make it active.

tagOptions

This option maps one tag to another, so that tags can be merged or renamed:

tagsOptions = {
	"title_a": "title",
	"distributor": "publisher",
	"country": "place",
	"sponsor": "publisher"
} 
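
The effect of such a mapping can be sketched as follows. This is an illustration under assumptions, not Bilbo's implementation: the helper names load_tag_map and rename_tag are hypothetical, the value is parsed with ast.literal_eval, and the closing brace is indented so that configparser treats it as a continuation line of the multi-line value.

```python
import ast
import configparser

EXAMPLE = """
[shaper]
tagsOptions = {
    "title_a": "title",
    "distributor": "publisher",
    "country": "place",
    "sponsor": "publisher"
    }
"""


def load_tag_map(text):
    """Read tagsOptions from the [shaper] section as a dict (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # The value is parsed as a Python literal; it is never executed.
    return ast.literal_eval(parser["shaper"]["tagsOptions"])


def rename_tag(tag, tag_map):
    """Return the replacement tag, or the tag itself when it is not mapped."""
    return tag_map.get(tag, tag)


tag_map = load_tag_map(EXAMPLE)
print(rename_tag("sponsor", tag_map))
print(rename_tag("author", tag_map))
```

With this mapping, both distributor and sponsor end up as publisher, which reduces the number of distinct labels.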

FEATURES

This section is marked by:

  • [features]

listFeatures

The default value is: numbersMixed, cap, dash, biblPosition, initial

You can remove some of them or add new ones (see ../developer/modules.html).

listFeaturesRegex

You can add a list of regular expressions in the form: (name_of_regex, python_regex), (name_of_regex1, python_regex1)
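
To illustrate what a (name, regex) pair does, here is a minimal sketch that checks a token against named patterns and returns the names that match. It reuses the WEBLINK pattern from this section's example; the function name regex_features and the way the result is returned are assumptions, not Bilbo's actual feature code.

```python
import re

# (name_of_regex, python_regex) pairs; WEBLINK comes from the [features] example.
REGEX_FEATURES = [
    (
        "WEBLINK",
        r"^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?"
        r"[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$",
    ),
]

# Compile once so that each token is only matched, not re-parsed.
COMPILED = [(name, re.compile(pattern)) for name, pattern in REGEX_FEATURES]


def regex_features(token):
    """Return the names of all regex features that match the token."""
    return [name for name, regex in COMPILED if regex.match(token)]


print(regex_features("https://www.example.org/page"))
print(regex_features("Paris"))
```

Note that this pattern only accepts lowercase host names, so a token may need to be lowercased first if mixed-case links should match.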

listFeaturesExternes

(unique_list_name, path_to_external_list, list_type), …

Note that list_type is either simple (a list of single words) or multi (a list of multi-word entries, such as journal names).

listFeaturesXML

This is set to italic by default.

output

Output path, used when you run the features component. The output follows the CRF++ data format.

Example:

[features]
listFeatures = numbersMixed, cap, dash, biblPosition, initial
listFeaturesRegex = ('WEBLINK', '^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$')
listFeaturesExternes = ('place', 'resources/external/place_list.txt', 'multi'), 
		     ('possmonth', 'resources/external/month_list.txt', 'simple'), 
		     ('posseditor', 'resources/external/editor_abbr_list.txt', 'simple'), 
		     ('posspage', 'resources/external/page_abbr_list.txt', 'simple'),
		     ('journal', 'resources/external/journals_list.txt', 'multi'),
		     ('surname', 'resources/external/surname_list.txt', 'simple'),
		     ('forename', 'resources/external/forename_list.txt', 'simple')
listFeaturesXML = italic
output = bilbo/testFiles/features.output.txt
verbose = False
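
The list-valued options in this section can be turned into Python objects with the standard library alone. The sketch below is an assumption about how such values could be parsed, not a description of Bilbo's parser: listFeatures is split on commas, and the tuples of listFeaturesExternes are wrapped in brackets and read with ast.literal_eval. Only two external lists are kept to shorten the example.

```python
import ast
import configparser

EXAMPLE = """
[features]
listFeatures = numbersMixed, cap, dash, biblPosition, initial
listFeaturesExternes = ('place', 'resources/external/place_list.txt', 'multi'),
    ('possmonth', 'resources/external/month_list.txt', 'simple')
listFeaturesXML = italic
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE)
features = parser["features"]

# Plain comma-separated names.
names = [name.strip() for name in features["listFeatures"].split(",")]

# Wrapping the value in brackets lets the comma-separated tuples
# parse as a single list literal, even across continuation lines.
externals = ast.literal_eval("[" + features["listFeaturesExternes"] + "]")

for name, path, list_type in externals:
    if list_type not in ("simple", "multi"):
        raise ValueError(f"unknown list type for {name}: {list_type}")

print(names)
print(externals)
```

Continuation lines of a multi-line value must stay indented, while the next option has to start at the beginning of its line; otherwise configparser reads it as part of the previous value.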

CRF

This section is marked by:

  • [crf]

name

Name of the library used. In some cases you can switch to another CRF library (Wapiti, for instance).

algoCrf

The default value is lbfgs (see https://en.wikipedia.org/wiki/Limited-memory_BFGS). Accepted values: {‘lbfgs’, ‘l2sgd’, ‘ap’, ‘pa’, ‘arow’}

optionCrf

Many options are available; see the CRFsuite manual.

The most important are c1, the coefficient for L1 regularization (when it is non-zero, the algorithm switches to the Orthant-Wise Limited-memory Quasi-Newton method), c2, the coefficient for L2 (ridge) regularization, and max_iterations.

epsilon: the parameter that determines the convergence condition. The default value is 1e-5.

optionCrf = {
	'c2': 0.00001,
	}

patternsFile

Path to the Wapiti pattern file. By default, the pattern used is located in resources/models/bibl/wapiti_pattern_ref

modelFile

Path to the model generated by the train action or used by the tag action.

seed

This is the seed of the pseudo-random number generator. The random numbers are used only when you evaluate the CRF algorithm on its own (not the full pipeline).

Example:

[crf]
name = crfsuite
algoCrf = lbfgs
#    lbfgs for Gradient descent using the L-BFGS method,
#    l2sgd for Stochastic Gradient Descent with L2 regularization term
#    ap for Averaged Perceptron
#    pa for Passive Aggressive
#    arow for Adaptive Regularization Of Weight Vector
optionCrf = {
	'c2': 0.00001,
	'max_iterations': 2000,
}
seed = 3
patternsFile = resources/models/note/wapiti_pattern_ref
modelFile = resources/models/note/crf_OE_fr.txt

SVM

This section is marked by:

  • [svm]

name

Name of the library used. In some cases you can switch to another SVM library.

modelFile

Path to the SVM model generated by the train action or used by the tag action.

vocab

Path to the vocabulary file generated by SVM training. The vocabulary assigns an integer to each word.
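
The idea of a vocabulary that assigns an integer to each word can be sketched as follows. This is an illustration only: the function name build_vocab, the numbering starting at 1, and the first-seen ordering are assumptions, and the actual layout of Bilbo's vocab file is not described here.

```python
def build_vocab(tokens):
    """Map each distinct word to an integer ID, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1  # IDs start at 1 (assumption)
    return vocab


tokens = ["Paris", ":", "Gallimard", ",", "Paris"]
vocab = build_vocab(tokens)

# Each token is replaced by its ID; repeated words share the same ID.
ids = [vocab[token] for token in tokens]
print(vocab)
print(ids)
```

The same mapping has to be reused at tagging time so that a word seen during training receives the same integer.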

output

Not yet implemented.

Example:

[svm]
modelFile = resources/models/note/svm_OE_fr.txt
vocab = resources/models/note/inputID.txt
output = /tmp/data_SVM.txt