Configuration File options¶
Bilbo comes with a pipeline_config
file (located at the bilbo/config of the bilbo2 directory). Actually, there is two pipeline_config available, one is for annotating bibliographies references(tag <bibl>
in the TEI/XML format), one other is for annotating footnote (tag note
in the TEI/XML format). You can modified each of the options presented in this file.Currently, the file is an INI configuration file. In future we expect to handle json or XML file configuration.
As expected, each module of Bilbo can run on his own. A series (not all) of parsing options are available and can be set with arg cli python running.
PIPELINE¶
In this part, you have to specify the pipeline wanted. Note that for training it does not make sens to add generate pipeline. Pipeline is one on this components before going any further. This section is marked by:
[PIPELINE]
verbose¶
Set at False by default
pipeline¶
You have to chained the desired chained algorithm as instance:
PIPELINE=shape_data,features,svm,crf,generate
Example:
[PIPELINE]
PIPELINE=shape_data,features,svm,crf,generate
outputFile=None
verbose=True
SHAPER¶
This section is marked by:
[shaper]
tokenizerOption¶
This option is currently unnecessary. The next developments regarding tokenization will make it active soon.
tagOptions¶
This is a wrapper for reduce or rename tag to an other
tagsOptions = {
"title_a": "title",
"distributor": "publisher",
"country": "place",
"sponsor": "publisher"
}
Example:
[PIPELINE]
PIPELINE=shape_data,features,svm,crf,generate
outputFile=None
verbose=True
FEATURES¶
This section is marked by:
[features]
listFeatures¶
Default Value is set to: numbersMixed, cap, dash, biblPosition, initial
You can removed some of them or add a new one (see ../developer/modules.html) This is a wrapper for reduce or rename tag to an other
listFeaturesRegex¶
You can add a list of regex as : (name_of_regex, python_regex), (name_of_regex1, python_regex1)
listFeaturesExternes¶
(unic_named_list, path_to_external_list, List_type), …
Note type_list is simple (simple word list) or multi (multi word list as journals names for instance)
listFeaturesXML¶
This is set to italic by default.
output¶
Path output, it is handling when you use feature component. Output is fitted to CRF++ format data.
Example:
[features]
listFeatures = numbersMixed, cap, dash, biblPosition, initial
listFeaturesRegex = ('WEBLINK', '^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$')
listFeaturesExternes = ('place', 'resources/external/place_list.txt', 'multi'),
('possmonth', 'resources/external/month_list.txt', 'simple'),
('posseditor', 'resources/external/editor_abbr_list.txt', 'simple'),
('posspage', 'resources/external/page_abbr_list.txt', 'simple'),
('journal', 'resources/external/journals_list.txt', 'multi'),
('surname', 'resources/external/surname_list.txt', 'simple'),
('forename', 'resources/external/forename_list.txt', 'simple')
listFeaturesXML = italic
output = bilbo/testFiles/features.output.txt
verbose = False
CRF¶
This section is marked by:
[crf]
name¶
Name of libraries used, in some cases you can change the crf libraries (for wapiti for instance)
algoCrf¶
Default value is set to [lbfgs](for https://en.wikipedia.org/wiki/Limited-memory_BFGS) algorithm : {‘lbfgs’, ‘l2sgd’, ‘ap’, ‘pa’, ‘arow’}
optionCrf¶
Many option are avalaible . see crfsuite manual
Most important are c1 for a L1 regularisation (in this case algoritm is switch to orthant method), c2 regression ridge and and max_iterations
epsilon : The epsilon parameter that determines the condition of convergence. value set by default at 1e-5
optionCrf = {
'c2': 0.00001,
}
patternsFile¶
path to wapiti pattern. By default pattern used is located in resources/models/bibl/wapiti_pattern_ref
modelFile¶
Path to the model generated in train action or used in tag action.
seed¶
This is used to generate a pseudo-random number. This random number is used when you evaluate the crf algorithn only (not the fulle pipeline)
Example:
[crf]
name = crfsuite
algoCrf = lbfgs
# lbfgs for Gradient descent using the L-BFGS method,
# l2sgd for Stochastic Gradient Descent with L2 regularization term
# ap for Averaged Perceptron
# pa for Passive Aggressive
# arow for Adaptive Regularization Of Weight Vector
optionCrf = {
'c2': 0.00001,
'max_iterations': 2000,
}
seed = 3
patternsFile = resources/models/note/wapiti_pattern_ref
modelFile = resources/models/note/crf_OE_fr.txt
SVM¶
This section is marked by:
[svm]
name¶
Name of libraries used, in some cases you can change the svm libraries for an other.
modelFile¶
Path to the vocab model generated in train action or used in tag action.
vocab¶
Path to the vocab model generated by svm train. Vocab attribute at each word a integer.
output¶
Not already implemented
Example:
bsvm
modelFile = resources/models/note/svm_OE_fr.txt
vocab = resources/models/note/inputID.txt
output = /tmp/data_SVM.txt