Welcome to bilbo2’s documentation!¶
BILBO2 is open source software for the automatic annotation of bibliographic references. It segments and tags input strings. Its main purpose is twofold: to provide a complete development and research space for improving bibliographic reference detection, and to be a solid tool fit for production use, for example at OpenEdition. What you will find here is the user documentation, the technical documentation and the developer documentation for the Bilbo software.
This documentation is organized into a few main sections:
Purpose¶
Context¶
In academic papers, bibliographies are an essential aspect of research, yet in many cases only a few bibliographic references are identified by an information system.
We can consider three levels of detection of bibliographic references (see the sketch after this list):
- Standard bibliographic references: these are usually located at the end of the scientific article. They are marked in a TEI-XML document by the <bibl> tag.
- Footnotes: a footnote does not always contain a bibliography; the main difficulty is classifying which notes contain bibliographies. Marked by the <note> tag (TEI-XML document).
- Implicit references: in the full text, authors sometimes mention a bibliographic reference. Such a reference is partial. Marked by the <p> tag (TEI-XML document).
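For illustration, here is a minimal TEI-XML skeleton showing where each level lives (the element names come from the list above; the surrounding structure is only indicative):

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <p>As argued by Hillier (1996), space is ...</p>  <!-- implicit reference -->
    <note>See Hillier B., 1996, Space is the Machine.</note>  <!-- footnote -->
    <bibl>Hillier B., 1996, Space is the Machine, Cambridge University Press.</bibl>  <!-- standard reference -->
  </text>
</TEI>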
Philosophy¶
The algorithmic complexity of extracting a bibliography increases at each level (<bibl>, <note>, <p>).
Currently, the first level can be considered efficient; the others cannot. Bilbo2 is dedicated to research.
It has been designed as a tool to easily implement new machine learning algorithms at any level of bibliography. Everything has been done so that new algorithms can be added to the existing code without affecting what can be deployed in production.
The data structure of a document is built so that a Document can be manipulated at any level. A level (which we call a section) is a tag: the scope of a processing algorithm is the chosen section, and all sections corresponding to this tag will be processed.
Each Bilbo instance is specialized in a specific tag. To handle different tag types, you pass and process the XML document once per tag.
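For example, to handle both <bibl> and <note> you would parse and process the document once per tag. A minimal sketch based on the Python interface shown later in this documentation (paths and config names are placeholders):

from bilbo.importer import Importer
from bilbo.bilbo import Bilbo

# One pass per tag: each Bilbo instance is specialized in one section type.
for tag, cfg in [('bibl', 'pipeline_bibl.cfg'), ('note', 'pipeline_note.cfg')]:
    doc = Importer('path_to_your_xml_file').parse_xml(tag)
    bilbo = Bilbo(doc, cfg)
    bilbo.annotate('path_to_output_' + tag + '.xml', format_=None)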
Demonstration¶
There is an online demonstration of Bilbo2.
In this demonstrator, only bibliographic references and footnotes are processed. The model is trained on a French and English dataset.
Just paste your bibliographic references, each surrounded by a <bibl> or <note> tag, and click on annotate. Pre-loaded examples are provided.
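For instance, pasting a reference such as the following (borrowed from the examples used later in this documentation) will get it segmented and tagged:

<bibl>Hillier B., 1996, Space is the Machine, Cambridge University Press, Cambridge.</bibl>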
This instance is for test purposes only and should therefore not be used as a production tool. For production usage you should contact mathieu.orban@openedition.org. The processed data is not kept at all. In the future, a full REST web API is planned. This REST API should integrate the lookup of a Digital Object Identifier (DOI) for each bibliographic reference processed.
Requirements¶
Bilbo2 can be installed on many Linux distributions but has been tested only on Debian and Ubuntu with the following prerequisites:
- python3.5
- gcc and g++ (used for the LIBSVM compilation)
- git >= 1.7.10 (needed to clone from GitHub)
- pip and setuptools, necessary to run the Python installation
Libraries¶
Bilbo2 was tested on a Linux/Debian distribution (Debian stretch release). It runs on Python versions 3.5 and above.
Debian¶
Starting from a Debian image loaded in a virtual machine, with root privileges or via sudo:
apt-get install git make curl python3 gcc build-essential libxml2-dev libxslt-dev python3-dev python3-setuptools zlib1g-dev
Installation¶
Make sure you have fulfilled the requirements before going any further.
You can now run the installation of the Python module with setup.py. It will:
- Compile the LIBSVM libraries (C++) and install a Python binding interface to the library.
- Install python-crfsuite, a smart Python binding to CRFsuite.
- Install lxml, a Python library to process XML documents.
Stable version¶
To install the latest stable version on a Unix system, open a console and enter:
git clone https://github.com/openedition/bilbo2.git
cd bilbo2
git checkout `git describe --tags --abbrev=0`
python3 setup.py install --user
Development version¶
If you wish to install the development version, open a console and enter:
git clone https://github.com/openedition/bilbo2.git
cd bilbo2
python3 setup.py install --user
Uninstall bilbo2¶
To uninstall bilbo2:
pip3 uninstall bilbo2
To remove and clean your local bilbo2 repository:
cd bilbo2
rm -rvf build/
rm -rvf bilbo2.egg-info/
rm -rvf dist/
You can use the bash script clean.sh. Note that you have to set the right path to your repository in this script.
First steps¶
Keep in mind that Bilbo has already been trained on a French and English annotated XML corpus. It is trained on the <note> and <bibl> sections.
By default, bilbo2 runs (for annotation) on a specified pipeline with a default pre-trained model (French and English on the <bibl> tag).
Command Line Interface API¶
Overview Command Line Interface API¶
For an overview of the different features and CLI commands, just launch in a shell:
cd bilbo2
bash bilbo/tests/bilbo_demo.sh -v
Common use¶
A quick way to annotate your bibliographic references (marked as <bibl> in TEI) is to launch:
cd bilbo2
python3 -m bilbo.bilbo --action tag -i PATH_TO_XMLFILE -o XML_OUTPUT_TAGGED
To annotate your footnotes you first need to mention the tag to process (note) and specify the config file explicitly. Currently the config file pipeline_note.cfg is available.
cd bilbo2
python3 -m bilbo.bilbo --action tag -t note -c bilbo/config/pipeline_note.cfg -i PATH_TO_XMLFILE -o XML_OUTPUT_TAGGED
To train, just change --action tag to --action train and give an annotated XML corpus as input. Note that the output option is not necessary in this case: the trained model will be saved at the path indicated in the config file.
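For example, a training run on an annotated corpus might look like this (the input path is a placeholder):

cd bilbo2
python3 -m bilbo.bilbo --action train -c bilbo/config/pipeline_bibl.cfg -i PATH_TO_ANNOTATED_XML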
Interactive Python Interface¶
Open a terminal:
python3
In an interactive Python shell:
from bilbo.importer import Importer
from bilbo.bilbo import Bilbo
importer = Importer("path_to_your_xml_file")
doc = importer.parse_xml('bibl')  # or 'note'
bilb = Bilbo(doc, "path_config_file")
bilb.annotate("path_to_output.xml", format_=None)
Autoloaded Models¶
Two models and pre-existing external lists are shipped and can be loaded as a data package. You do not need to give the config file path; just load 'bibl' or 'note' with the class method load('bibl') / load('note'):
from bilbo.importer import Importer
from bilbo.bilbo import Bilbo
importer = Importer("path_to_your_xml_file")
doc = importer.parse_xml('note')
Bilbo.load('note')
bilbo = Bilbo(doc)
bilbo.annotate("path_to_output.xml", format_=None)
Input - Output¶
Input¶
Input must be a valid XML document. Bilbo has been trained on the TEI-XML (Text Encoding Initiative) format, but you could give a JATS input format… Actually, any valid XML document can be used with Bilbo; the annotation tagging will be done according to TEI schemas.
Scope of annotation¶
Bilbo only handles a scope inside the XML. This scope is bounded by a tag. Elements outside the scope are only kept in memory to rebuild the XML file as a result.
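For example, with the scope set to the bibl tag, only the content of <bibl> elements is processed; everything around them is restored verbatim in the output:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  Outside the scope: kept in memory, rebuilt as-is.
  <bibl>Inside the scope: tokenized and annotated.</bibl>
</TEI>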
Output¶
Output is XML. Bilbo has a personal output for research purposes: this output schema is the default XML output. Any output schema can be specified from this default schema. For this, an XSL sheet must contain the XML conversion. This XSL file should be placed in the bilbo/stylesheets/ directory.
You can specify the TEI (Text Encoding Initiative) or JATS format, or the personal research output.
XML/TEI OpenEdition Schema¶
This TEI schema version is used by the OpenEdition Books and OpenEdition Journals platforms. It is associated with the journals editorial model shipped with the Lodel software https://github.com/OpenEdition/lodel
Among the different XML encoding standards for machine-readable texts, the TEI (Text Encoding Initiative) is probably the most comprehensive and mature. The TEI Guidelines define some 500 different textual components and concepts (word, sentence, character, glyph, person, etc.). Any particular usage of the TEI supposes a customization of the TEI to its specificities, so as to adapt and constrain the richness of the TEI to a well-scoped and tuned schema. The TEI community has created a specification language called ODD ("One Document Does it all") to modify the general TEI schema. From ODD descriptions, a tool called Roma can automatically generate customized TEI schemas (xsd, relaxNG, etc.) and some documentation.
JATS¶
Personal research output¶
This is NOT A VALID XML/TEI schema. It is only set for research purposes. Each added tag is marked by an attribute bilbo="true". This can be used to analyse, within the same XML document, the differences between manual and automatic annotation.
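For instance, an automatically added tag would carry the attribute, while a manual one would not (an illustrative sketch, not verbatim output):

<bibl>
  <surname bilbo="true">Hillier</surname> <forename bilbo="true">B.</forename>, <date bilbo="true">1996</date>, ...
</bibl>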
Pipelines¶
Bilbo is founded on processing a specific data structure through a series of stages. The input and output of each stage is stable. In some cases the output is enriched with information on each section or at the level of a token.
Importer¶
The document is imported: the lxml library parses the whole document. The document is segmented according to one section; the data structure is built at the level of the section.
Available modules¶
All these modules can be found in the components directory. Each class inherits from the Component class.
Shape Data: extract XML values and tokenize¶
The shaper section is dedicated to handling XML data and tokenizing. The tokenizer is written for French and English. The structure is built at the level of the token. This module should be the first in the pipeline series. To see the CLI functionalities:
python3 -m bilbo.components.shape_data -h
To specify the shape component options, see the SHAPER section of the configuration file options.
Features¶
Features can be extracted from list or dictionary files (external features), from the local specificity of a word, or from the specificity of a section or the position of a token (global features). Word lists can be simple or multi-word. To see the CLI functionalities:
python3 -m bilbo.components.features -h
To specify the feature component options, see the FEATURES section of the configuration file options.
Conditional random field¶
This is based on python-crfsuite, a Python binding of CRFsuite. CRFsuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data. The labelling is generated from the extracted features, and for convenience a wrapping compatible with CRF++ is available. To see the CLI functionalities:
python3 -m bilbo.components.crf -h
To specify the CRF component options, see the CRF section of the configuration file options.
Support Vector Machine¶
This is based on libsvm. LIBSVM is an integrated library for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. For now, it is used to classify which footnotes contain a bibliography. To see the CLI functionalities:
python3 -m bilbo.components.svm -h
To specify the SVM component options, see the SVM section of the configuration file options.
Optional output step¶
Each document is segmented according to one section: the XML data structure is reconstructed.
Chaining algorithms¶
To chain algorithms with the action train, tag or evaluate, you need to specify parameters in a configuration file. Two default configuration files are defined in the ./bilbo/config/ directory. The order of data processing must be made explicit, and for each pipe you need to fill in the algorithm specification.
To specify the order and the algorithms used, see the pipeline options.
Each pipe launches a component. At each pass, the document object is enriched with different attributes. Attributes can be extracted (Extractor component class) or predicted (Estimator component class).
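Schematically, a pipe could drive a component like this (a hypothetical sketch of the dispatch, based on the Component / Estimator / Extractor interfaces documented in the API reference below):

def run_pipe(component, document, mode='tag'):
    # Estimators (crf, svm) take a mode ('tag', 'train' or 'evaluate')
    # and predict attributes; extractors (shape_data, features) only extract.
    if hasattr(component, 'predict'):
        return component.transform(document, mode)
    return component.transform(document)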
CLI toolkit usage¶
If you want to understand how Bilbo handles each pipeline, you can launch tests independently on each Bilbo component, either from the CLI or in interactive Python, as above. Beware of the input of each component. For example, it does not make sense to launch the CRF module if you have not extracted features beforehand (to a file or into the Bilbo data structure).
Overview¶
For an overview and a test of CLI usage, run from a terminal:
cd bilbo2
/bin/bash bilbo/tests/bilbo_demo.sh -v
You can add the -v argument to see the output.
Command Line Interface API¶
To see an exhaustive list of modules which can be used by Bilbo, launch:
python3 -m bilbo.bilbo -L
For instance, you can see which features are extracted for each token according to the default configuration file. Your features (CRF++ format) will be extracted to the output file mentioned in "bilbo/config/pipeline_bibl.cfg". If you want to see explicit output, just add -vvvv for logger output.
python3 -m bilbo.components.features -cf bilbo/config/pipeline_bibl.cfg -s "Amblard F., Bommel P., Rouchier J., 2007, « Assessment and validation of multi-agent models »...."
In order to improve your research, you may want to analyse your predictions directly from a CRF++ input format and a CRF++ pattern:
python3 -m bilbo.components.crf -cf bilbo/config/pipeline_bibl.cfg -i bilbo/testFiles/features.output.txt --tag -v
import configparser
from bilbo.importer import Importer
from bilbo.components.shape_data.shape_data import ShapeSection
from bilbo.components.features.features import FeatureHandler
from bilbo.components.crf.crf import Crf
from bilbo.bilbo import Bilbo
Bilbo in a shell¶
Construct Data Structure¶
First import your XML document. You can import a string or a file. For any action (machine learning prediction, feature extraction, setting new XML properties), you will handle this document object.
#xml_str = '<xml>Oustide<bibl><pubPlace>Marseille</pubPlace>, <sponsor>OpenEdition is "! inside </sponsor>>a bibl</bibl></xml>'
xml_str = """<TEI xmlns="http://www.tei-c.org/ns/1.0"> Outside
<bibl>Hillier B., 1996, <hi>Space is the Machine</hi>, Cambridge University Press, <pubPlace>Cambridge.</pubPlace>
</bibl></TEI>"""
imp = Importer(xml_str)
doc = imp.parse_xml('bibl', is_file = False)
Tokenize, extract and wrap XML information¶
First, load parameters.
dic = """
[shaper]
tokenizerOption = fine
tagsOptions = {
"pubPlace": "place",
"sponsor": "publisher"
}
verbose = True
"""
#Load the dict.
#There are different ways to set parameters (ini file...), see: https://docs.python.org/3/library/configparser.html#quick-start
config = configparser.ConfigParser(allow_no_value=True)
config.read_string(dic)
Use the ShapeSection class. Note that at any moment you can call help on a function to see its parameters:
help(ShapeSection.__init__)
Help on function __init__ in module bilbo.components.shape_data.shape_data:
__init__(self, cfg_file, type_config='ini', lang='fr')
Initialize self. See help(type(self)) for accurate signature.
sh = ShapeSection(config, type_config='Dict')
sh.transform(doc)
<bilbo.storage.document.Document at 0x7fc3740d7390>
To see an overview of your document:
for section in doc.sections:
for token in section.tokens:
print('Token:{0}\t\t Label:{1}'.format(token.str_value, token.label))
Token:Hillier Label:bibl
Token:B. Label:bibl
Token:, Label:c
Token:1996 Label:bibl
Token:, Label:c
Token:Space Label:hi
Token:is Label:hi
Token:the Label:hi
Token:Machine Label:hi
Token:, Label:c
Token:Cambridge Label:bibl
Token:University Label:bibl
Token:Press Label:bibl
Token:, Label:c
Token:Cambridge Label:place
Token:. Label:c
Features¶
Set the features that you need. For external features, you need to give the right path to the external lists…
dic = """
[features]
listFeatures = numbersMixed, cap, dash, biblPosition, initial
listFeaturesRegex = ('UNIVERSITY', '^Uni.*ty$')
listFeaturesExternes = ('surname', 'surname_list.txt', 'simple'),
listFeaturesXML = italic
output = output.txt
verbose = False
"""
config = configparser.ConfigParser(allow_no_value=True)
config.read_string(dic)
For convenience, features are given in CRF++ format.
feat = FeatureHandler(config, type_config='Dict')
feat.loadFonctionsFeatures()
doc = feat.transform(doc)
feat.print_features(doc)
Hillier NONUMBERS FIRSTCAP NODASH BIBL_START NOINITIAL NOUNIVERSITY SURNAME NOITALIC bibl
B. NONUMBERS ALLCAP NODASH BIBL_START INITIAL NOUNIVERSITY NOSURNAME NOITALIC bibl
, NONUMBERS NONIMPCAP NODASH BIBL_START NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC c
1996 NUMBERS NONIMPCAP NODASH BIBL_START NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC bibl
, NONUMBERS NONIMPCAP NODASH BIBL_START NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC c
Space NONUMBERS FIRSTCAP NODASH BIBL_START NOINITIAL NOUNIVERSITY NOSURNAME ITALIC hi
is NONUMBERS ALLSMALL NODASH BIBL_IN NOINITIAL NOUNIVERSITY NOSURNAME ITALIC hi
the NONUMBERS ALLSMALL NODASH BIBL_IN NOINITIAL NOUNIVERSITY NOSURNAME ITALIC hi
Machine NONUMBERS FIRSTCAP NODASH BIBL_IN NOINITIAL NOUNIVERSITY NOSURNAME ITALIC hi
, NONUMBERS NONIMPCAP NODASH BIBL_IN NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC c
Cambridge NONUMBERS FIRSTCAP NODASH BIBL_IN NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC bibl
University NONUMBERS FIRSTCAP NODASH BIBL_END NOINITIAL UNIVERSITY NOSURNAME NOITALIC bibl
Press NONUMBERS FIRSTCAP NODASH BIBL_END NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC bibl
, NONUMBERS NONIMPCAP NODASH BIBL_END NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC c
Cambridge NONUMBERS FIRSTCAP NODASH BIBL_END NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC place
. NONUMBERS NONIMPCAP NODASH BIBL_END NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC c
Make predictions¶
First, we build a Document storage object that makes sense (unlike above, which was just for demonstration purposes), loading the right parameters from pipeline_bibl.cfg:
# This part is a quick recap of the TOKENIZER and FEATURES steps explained above.
# They are run again with the appropriate parameters (path to pipeline_bibl.cfg).
imp = Importer(xml_str)
doc = imp.parse_xml('bibl', is_file = False)
bbo = Bilbo(doc, 'pipeline_bibl.cfg')
bbo.shape_data(doc)
bbo.features(doc)
<bilbo.storage.document.Document at 0x7fc3740ac828>
We now have a Document storage object which contains all the needed information.
# Start to make predictions
tagger = Crf(bbo.config, type_config='Dict')
labels = tagger.predict(doc)
for label in labels:
for l in label:
print(l[0], l[1])
Hillier surname
B. forename
, c
1996 date
, c
Space title
is title
the title
Machine title
, c
Cambridge publisher
University publisher
Press publisher
, c
Cambridge pubPlace
. c
Add predictions to the data structure¶
Always use the transform() function to add predictions to the Document storage object. Note that for an estimator component, three options are available: 'tag', 'train', 'evaluate'.
tagger.transform(doc, 'tag')
for section in doc.sections:
for token in section.tokens:
print('Token:{0}\t\t Label:{1}'.format(token.str_value, token.predict_label))
Token:Hillier Label:surname
Token:B. Label:forename
Token:, Label:c
Token:1996 Label:date
Token:, Label:c
Token:Space Label:title
Token:is Label:title
Token:the Label:title
Token:Machine Label:title
Token:, Label:c
Token:Cambridge Label:publisher
Token:University Label:publisher
Token:Press Label:publisher
Token:, Label:c
Token:Cambridge Label:pubPlace
Token:. Label:c
Annotator bilbo usage¶
For bibliography (Standard tagging)¶
imp = Importer('resources/corpus/bibl/test_bibl.xml')
doc = imp.parse_xml('bibl')
Bilbo.load('bibl')
bilbo = Bilbo(doc)
bilbo.run_pipeline('tag', '/tmp/output.xml', format_= None)
For bibliography (With Lang Detection tagging)¶
imp = Importer('resources/corpus/bibl/test_bibl.xml')
doc = imp.parse_xml('bibl')
Bilbo.load('bibl_lang')
bilbo = Bilbo(doc)
bilbo.run_pipeline('tag', '/tmp/output.xml', format_= None)
For note¶
imp = Importer('resources/corpus/note/test_note.xml')
doc = imp.parse_xml('note')
bilbo = Bilbo(doc, 'pipeline_note.cfg')
bilbo.run_pipeline('tag', '/tmp/output.xml', format_= None)
Train¶
Just change the 'tag' parameter to 'train'! Note: the output can be trained binary models; their paths must be specified in pipeline_bibl.cfg, not as parameters of the run_pipeline() function.
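A sketch of a training run through the Python interface, assuming the same run_pipeline() signature as above (the corpus path is a placeholder):

imp = Importer('resources/corpus/bibl/train.xml')
doc = imp.parse_xml('bibl')
bilbo = Bilbo(doc, 'pipeline_bibl.cfg')
bilbo.run_pipeline('train', None, None)  # model output paths come from pipeline_bibl.cfg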
Evaluation (end to end)¶
To evaluate the models, just launch Bilbo on your annotated test data as follows:
imp = Importer('resources/corpus/bibl/data_test.xml')
doc = imp.parse_xml('bibl')
bilbo = Bilbo(doc, 'pipeline_bibl.cfg')
bilbo.run_pipeline('evaluate', None, None)
-----------------------------------------------------------
label            precision    recall    f-measure    occurrences
-----------------------------------------------------------
abbr 0.874 0.765 0.816 452
biblScope 0.887 0.571 0.695 594
booktitle 0.903 0.629 0.742 89
date 0.716 0.915 0.803 614
edition 0.690 0.460 0.552 126
emph 1.000 1.000 1.000 2
extent 1.000 0.979 0.989 48
forename 0.929 0.956 0.942 942
genName 1.000 1.000 1.000 1
journal 0.823 0.732 0.774 514
nameLink 0.282 1.000 0.440 11
orgName 0.902 0.836 0.868 110
place 0.824 0.933 0.875 15
pubPlace 0.962 0.934 0.948 379
publisher 0.936 0.732 0.821 920
ref 1.000 0.071 0.133 14
surname 0.937 0.934 0.936 823
title 0.868 0.889 0.879 5740
-----------------------------------------------------------
mean 0.863 0.797 0.828 11394
weighted mean 0.877 0.852 0.864 11394
-----------------------------------------------------------
Evaluation by component¶
You can evaluate each component. In this case we use Bilbo as a toolkit. Load your annotated data: the annotated data format depends on the component used. You always have to generate this data first. Then just launch (for SVM, for instance):
svm.evaluate(input_svm_data_format)
Configuration File options¶
Bilbo comes with a pipeline_config file (located in bilbo/config of the bilbo2 directory). Currently, two pipeline_config files are available: one for annotating bibliographic references (tag <bibl> in the TEI/XML format), the other for annotating footnotes (tag note in the TEI/XML format). You can modify each of the options presented in these files. Currently, the file is an INI configuration file; in the future we expect to handle JSON or XML configuration files.
As expected, each Bilbo module can run on its own. A series (not all) of parsing options are available and can be set as CLI arguments when running Python.
PIPELINE¶
In this part, you have to specify the desired pipeline, an ordered chain of the components described above. Note that for training it does not make sense to add the generate pipe. This section is marked by:
[PIPELINE]
verbose¶
Set to False by default.
pipeline¶
Chain the desired algorithms, for instance:
PIPELINE=shape_data,features,svm,crf,generate
Example:
[PIPELINE]
PIPELINE=shape_data,features,svm,crf,generate
outputFile=None
verbose=True
SHAPER¶
This section is marked by:
[shaper]
tokenizerOption¶
The default value is fine (large is also available).
tagsOptions¶
This is a wrapper to reduce or rename a tag to another:
tagsOptions = {
"title_a": "title",
"distributor": "publisher",
"country": "place",
"sponsor": "publisher"
}
Example:
[shaper]
tokenizerOption = fine
tagsOptions = {
    "title_a": "title",
    "distributor": "publisher",
    "country": "place",
    "sponsor": "publisher"
    }
verbose = True
FEATURES¶
This section is marked by:
[features]
listFeatures¶
The default value is set to: numbersMixed, cap, dash, biblPosition, initial
You can remove some of them or add a new one (see ../developer/modules.html).
listFeaturesRegex¶
You can add a list of regexes as: (name_of_regex, python_regex), (name_of_regex1, python_regex1)
listFeaturesExternes¶
(unique_list_name, path_to_external_list, list_type), …
Note that list_type is simple (single-word list) or multi (multi-word list, such as journal names).
listFeaturesXML¶
This is set to italic by default.
output¶
Output path; it is used by the feature component. The output is fitted to the CRF++ data format.
Example:
[features]
listFeatures = numbersMixed, cap, dash, biblPosition, initial
listFeaturesRegex = ('WEBLINK', '^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$')
listFeaturesExternes = ('place', 'resources/external/place_list.txt', 'multi'),
('possmonth', 'resources/external/month_list.txt', 'simple'),
('posseditor', 'resources/external/editor_abbr_list.txt', 'simple'),
('posspage', 'resources/external/page_abbr_list.txt', 'simple'),
('journal', 'resources/external/journals_list.txt', 'multi'),
('surname', 'resources/external/surname_list.txt', 'simple'),
('forename', 'resources/external/forename_list.txt', 'simple')
listFeaturesXML = italic
output = bilbo/testFiles/features.output.txt
verbose = False
CRF¶
This section is marked by:
[crf]
name¶
Name of the library used; in some cases you can change the CRF library (to Wapiti, for instance).
algoCrf¶
The default value is set to the lbfgs algorithm (https://en.wikipedia.org/wiki/Limited-memory_BFGS). Possible values: {'lbfgs', 'l2sgd', 'ap', 'pa', 'arow'}
optionCrf¶
Many options are available, see the CRFsuite manual.
The most important are c1 for L1 regularisation (in this case the algorithm is switched to the orthant-wise method), c2 for ridge regression, and max_iterations.
epsilon: the parameter that determines the convergence condition; set by default to 1e-5.
optionCrf = {
'c2': 0.00001,
}
patternsFile¶
Path to the Wapiti pattern. By default the pattern used is located in resources/models/bibl/wapiti_pattern_ref
modelFile¶
Path to the model generated by the train action or used by the tag action.
seed¶
This is used to generate a pseudo-random number. This random number is used only when you evaluate the CRF algorithm alone (not the full pipeline).
Example:
[crf]
name = crfsuite
algoCrf = lbfgs
# lbfgs for Gradient descent using the L-BFGS method,
# l2sgd for Stochastic Gradient Descent with L2 regularization term
# ap for Averaged Perceptron
# pa for Passive Aggressive
# arow for Adaptive Regularization Of Weight Vector
optionCrf = {
'c2': 0.00001,
'max_iterations': 2000,
}
seed = 3
patternsFile = resources/models/note/wapiti_pattern_ref
modelFile = resources/models/note/crf_OE_fr.txt
SVM¶
This section is marked by:
[svm]
name¶
Name of the library used; in some cases you can change the SVM library for another.
modelFile¶
Path to the model generated by the train action or used by the tag action.
vocab¶
Path to the vocab model generated by SVM training. The vocab assigns an integer to each word.
output¶
Not yet implemented.
Example:
[svm]
modelFile = resources/models/note/svm_OE_fr.txt
vocab = resources/models/note/inputID.txt
output = /tmp/data_SVM.txt
Knowledge Base¶
The knowledge base is located in the resources/ path at the root of bilbo2. It is split in three ways:
Corpus¶
They are used to train Bilbo's automatic annotation. This is the annotated data used by the supervised machine learning algorithms. XML/TEI corpora are available in 4 languages (pt, fr, de, en) for bibliographic references. Only a mixed French and English corpus is available for footnotes.
External lists¶
Lists can be simple or multi-word; you must specify the type of list in the options (see the sketch after this list):
- Authors (fullname, surname, forename).
- Abbreviation (month, page, editor).
- Journals
- Place
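As an illustration, a simple list presumably holds one single-token entry per line, while a multi list holds one multi-word entry per line (hypothetical file contents, shown only to convey the distinction; the file names come from the configuration example above):

month_list.txt (simple):
  January
  February

journals_list.txt (multi):
  Journal of Field Archaeology
  Annales de Géographie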
Models¶
Models are split in two ways (in the bibl and note directories). They contain feature template patterns (CRF++ format), see the documentation. Note that we use CRF++ templates with CRFsuite for convenience; a script handles the conversion between the two input formats.
For note we use both a CRF and an SVM model. For the SVM model, see the installation and data format documentation in the libsvm README.
Evaluation¶
In a pipeline it is not easy to evaluate the relevance of each algorithm as well as the relevance of a series of algorithms. As Bilbo is built as a toolkit for researchers, we can evaluate each pipe and also run an end-to-end evaluation. In some cases, a component library can handle its own evaluation. In all cases we rebuild a confusion matrix.
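The confusion matrix and the derived scores go through the Evaluation class (documented in the API reference); a minimal sketch, assuming gold and predicted are flat lists of labels:

from bilbo.eval import Evaluation

gold      = ['surname', 'forename', 'date', 'title']
predicted = ['surname', 'forename', 'date', 'journal']
ev = Evaluation(gold, predicted, option='fine')
precisions, recalls, f_measures, counts, macro = ev.evaluate()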
End to End Evaluation¶
For an end-to-end evaluation, split your dataset into train and test, then train first:
python3 -m bilbo.bilbo --action train -c MY_PIPELINE.cfg -i DATA_TRAIN.xml -t tag
Then, evaluate:
python3 -m bilbo.bilbo --action evaluate -c MY_PIPELINE.cfg -i DATA_TEST.xml -t tag
Evaluation by component¶
To evaluate one component, you need to create the standard input format of your library (the feature matrix fitted to your library):
python3 -m bilbo.bilbo --action train -c MY_PIPELINE.cfg -i DATA_SET.xml -t tag
Check in MY_PIPELINE.cfg the path to the standard input format of your component.
Then evaluate:
python3 -m bilbo.components.MY_COMPONENT -cf MY_PIPELINE.cfg -i INPUT_STANDART_COMPONENT --evaluate
Results¶
Bibliography tag¶
Crf component¶
To evaluate the Conditional Random Field on the first step of the bibliography pipeline, you have to train the CRF component and write a CRF input format:
python3 -m bilbo.bilbo --action train -i resources/corpus/bibl/oe_bibl_en_fr.xml -c bilbo/config/pipeline_bibl1.cfg
(TODO: add a step that writes the features)
To evaluate, just take your input format and launch the module with the evaluate parameter. This is possible because the python-crf module offers an option with a random seed to split the dataset in two (train and test):
python3 -m bilbo.components.crf -cf bilbo/tests/pipeline.cfg -i bilbo/testFiles/features.output.txt --evaluate -vvvvv
label | precision | recall | f-measure | occurrences |
---|---|---|---|---|
abbr | 0.941 | 0.920 | 0.930 | 138 |
biblScope | 0.960 | 0.975 | 0.967 | 122 |
booktitle | 0.667 | 1.000 | 0.800 | 14 |
date | 0.972 | 1.000 | 0.986 | 175 |
edition | 0.438 | 0.500 | 0.467 | 14 |
extent | 1.000 | 1.000 | 1.000 | 12 |
forename | 0.940 | 0.926 | 0.933 | 269 |
genName | 0.000 | 0.000 | 0.000 | 1 |
journal | 0.773 | 0.829 | 0.800 | 111 |
nameLink | 1.000 | 0.667 | 0.800 | 6 |
orgName | 0.897 | 0.867 | 0.881 | 30 |
place | 1.000 | 1.000 | 1.000 | 5 |
pubPlace | 0.947 | 0.969 | 0.958 | 128 |
publisher | 0.912 | 0.921 | 0.917 | 292 |
ref | 1.000 | 0.500 | 0.667 | 2 |
surname | 0.896 | 0.928 | 0.912 | 250 |
title | 0.898 | 0.966 | 0.931 | 1563 |
title_sub | 1.000 | 1.000 | 1.000 | 8 |
mean | 0.847 | 0.832 | 0.839 | 3140 |
weighted-mean | 0.906 | 0.947 | 0.926 | 3140 |
End to End evaluation¶
In this case you need to split your dataset yourself into two parts (train.xml and test.xml). Below we randomly assign the data to two sets (one training set and one test set), a simple holdout method for validation (80% training data, 20% test data):
python3 -m bilbo.bilbo --action train -i resources/corpus/bibl/train.xml -c bilbo/config/pipeline_bibl1.cfg -vvvvv
python3 -m bilbo.bilbo --action evaluate -i resources/corpus/bibl/test.xml -c bilbo/config/pipeline_bibl1.cfg -vvvvv
label | precision | recall | f-measure | occurrences |
---|---|---|---|---|
abbr | 0.969 | 0.812 | 0.884 | 117 |
biblScope | 0.921 | 0.953 | 0.937 | 86 |
date | 0.961 | 0.879 | 0.919 | 141 |
edition | 0.750 | 0.375 | 0.500 | 8 |
emph | 0.000 | 0.000 | 0.000 | 2 |
extent | 1.000 | 1.000 | 1.000 | 9 |
forename | 0.954 | 0.959 | 0.956 | 217 |
genName | 0.000 | 0.000 | 0.000 | 1 |
journal | 0.579 | 0.440 | 0.500 | 100 |
nameLink | 1.000 | 1.000 | 1.000 | 2 |
orgName | 0.375 | 0.600 | 0.462 | 10 |
place | 0.000 | 0.000 | 0.000 | 2 |
pubPlace | 1.000 | 0.956 | 0.978 | 91 |
publisher | 0.860 | 0.877 | 0.869 | 211 |
ref | 0.000 | 0.000 | 0.000 | 5 |
surname | 0.948 | 0.926 | 0.937 | 216 |
title | 0.855 | 0.907 | 0.880 | 1106 |
mean | 0.621 | 0.594 | 0.607 | 2324 |
weighted-mean | 0.876 | 0.881 | 0.879 | 2324 |
Note tag¶
In this case, we have to evaluate both the classifier algorithm (SVM), dedicated to finding the notes which contain a bibliography, and the CRF component used to annotate the bibliographies.
Crf component evaluation¶
label | precision | recall | f-measure | occurrences |
---|---|---|---|---|
abbr | 0.943 | 0.909 | 0.926 | 308 |
biblScope | 0.910 | 0.836 | 0.871 | 365 |
booktitle | 0.800 | 0.364 | 0.500 | 33 |
date | 0.811 | 0.853 | 0.831 | 286 |
edition | 0.000 | 0.000 | 0.000 | 27 |
editor | 0.000 | 0.000 | 0.000 | 2 |
extent | 0.438 | 0.389 | 0.412 | 18 |
forename | 0.907 | 0.913 | 0.910 | 332 |
genName | 0.000 | 0.000 | 0.000 | 2 |
journal | 0.822 | 0.550 | 0.659 | 260 |
name | 0.000 | 0.000 | 0.000 | 2 |
nameLink | 1.000 | 0.250 | 0.400 | 4 |
note | 0.896 | 0.943 | 0.919 | 4986 |
num | 0.000 | 0.000 | 0.000 | 3 |
orgName | 1.000 | 0.364 | 0.533 | 44 |
place | 0.000 | 0.000 | 0.000 | 1 |
pubPlace | 0.885 | 0.911 | 0.898 | 135 |
publisher | 0.812 | 0.803 | 0.807 | 279 |
ref | 1.000 | 0.250 | 0.400 | 4 |
roleName | 0.000 | 0.000 | 0.000 | 23 |
surname | 0.892 | 0.863 | 0.877 | 344 |
title | 0.753 | 0.775 | 0.764 | 2284 |
w | 0.885 | 0.966 | 0.924 | 88 |
mean | 0.598 | 0.476 | 0.530 | 9830 |
weighted-mean | 0.852 | 0.866 | 0.859 | 9830 |
Svm component evaluation¶
python3 -m bilbo.components.svm --evaluate -c bilbo/config/pipeline_note1.cfg -i resources/models/note/data_SVM.txt
Accuracy = 93.3993% (283/303) (classification) (93.3993399339934, 0.264026402640264, 0.6858941220502061)
label | precision | recall | f-measure | occurrences |
---|---|---|---|---|
1 | 0.93 | 0.99 | 0.96 | 222 |
-1 | 0.96 | 0.79 | 0.86 | 81 |
avg-total | 0.93 | 0.93 | 0.93 | 303 |
End to End evaluation¶
In this case you need to split your dataset yourself into two parts (train.xml and test.xml). Below we randomly assign the data to two sets (one training set and one test set), a simple holdout method for validation (70% training data, 30% test data).
For notes:
python3 -m bilbo.bilbo --action train -c bilbo/config/pipeline_note1.cfg -i resources/corpus/note/train.xml -t note -vvvv
python3 -m bilbo.bilbo --action evaluate -c bilbo/config/pipeline_note1.cfg -i resources/corpus/note/test.xml -t note -vvvv
label | precision | recall | f-measure | occurrences |
---|---|---|---|---|
abbr | 0.928 | 0.921 | 0.924 | 445 |
biblScope | 0.903 | 0.833 | 0.867 | 492 |
booktitle | 0.250 | 0.214 | 0.231 | 14 |
date | 0.880 | 0.839 | 0.859 | 446 |
edition | 0.200 | 0.060 | 0.092 | 67 |
editor | 0.000 | 0.000 | 0.000 | 2 |
extent | 0.667 | 0.444 | 0.533 | 36 |
forename | 0.918 | 0.861 | 0.888 | 495 |
genName | 0.000 | 0.000 | 0.000 | 4 |
journal | 0.839 | 0.709 | 0.768 | 381 |
nameLink | 0.333 | 0.200 | 0.250 | 5 |
note | 0.784 | 0.965 | 0.865 | 7393 |
num | 0.000 | 0.000 | 0.000 | 1 |
orgName | 1.000 | 0.242 | 0.390 | 66 |
place | 0.000 | 0.000 | 0.000 | 2 |
pubPlace | 0.923 | 0.919 | 0.921 | 248 |
publisher | 0.816 | 0.752 | 0.783 | 573 |
ref | 0.000 | 0.000 | 0.000 | 4 |
roleName | 1.000 | 0.111 | 0.200 | 9 |
surname | 0.922 | 0.840 | 0.879 | 524 |
title | 0.842 | 0.797 | 0.819 | 3775 |
w | 0.945 | 0.902 | 0.923 | 133 |
mean | 0.598 | 0.482 | 0.534 | 15115 |
weighted-mean | 0.822 | 0.879 | 0.850 | 15115 |
bilbo¶
bilbo package¶
Subpackages¶
bilbo.components package¶
Subpackages¶
CRF module (train / tag / evaluate)
decorator class
class bilbo.components.features.decorator_feature.PositionDecorator(extractor)¶
    Bases: object
    PositionDecorator Class

class bilbo.components.features.decorator_feature.SectionDecorator(extractor)¶
    Bases: object
    SectionDecorator Class

class bilbo.components.features.decorator_feature.WordDecorator(extractor)¶
    Bases: object
    WordDecorator class
External feature class

class bilbo.components.features.externalfeatures.DictionnaryFeature(name, filename)¶
    Bases: bilbo.components.features.externalfeatures.ExternalFeature
    Get features from dictionaries.

    create_list(sequence)¶
        Create a list of tokens from a sequence.
        Parameters: sequence – list of tokens with their associated labels, i.e. [["token", "label"], ["token", "label"]]

class bilbo.components.features.externalfeatures.ExternalFeature¶
    Bases: object
    The ExternalFeature class generates features from external resources.

    classmethod factory(typeft, name, list_name)¶
        Choose between single- or multiple-token features.
        Parameters:
            typeft – simple or multiple
            name – the name of the feature
            list_name – list file
        Returns: the right function to call

class bilbo.components.features.externalfeatures.ListFeature(name, list_name)¶
    Bases: bilbo.components.features.externalfeatures.ExternalFeature
    ListFeature Class
Features

class bilbo.components.features.features.FeatureHandler(cfg_file, type_config='ini')¶
    Bases: bilbo.components.component.Component
    Feature handler.

    format_to_list(doc)¶

    loadFonctionsFeatures()¶
        Load the functions for the features.

    print_features(doc)¶

    save_features(doc)¶
        Write the features for each token to the output file specified in the CLI.
        Parameters: doc – document object

    transform(document)¶
        Generate the features and push them into the section.
Local features

class bilbo.components.features.localfeatures.LocalFeature¶
    Bases: object
    Local features class.

    biblPosition = <bilbo.components.features.decorator_feature.PositionDecorator object>¶
    cap = <bilbo.components.features.decorator_feature.WordDecorator object>¶
    dash = <bilbo.components.features.decorator_feature.WordDecorator object>¶
    initial = <bilbo.components.features.decorator_feature.WordDecorator object>¶
    numbersMixed = <bilbo.components.features.decorator_feature.WordDecorator object>¶
Regular expression features

class bilbo.components.features.regexfeatures.RegexFeature(name, pattern)¶
    Bases: object
    Generate features based on regular expressions.

XML features

class bilbo.components.features.xmlfeatures.XmlFeature¶
    Bases: object
    Generate features based on XML data.

    global_boolean = <bilbo.components.features.decorator_feature.SectionDecorator object>¶
    italic = <bilbo.components.features.decorator_feature.SectionDecorator object>¶
    punc_counter = <bilbo.components.features.decorator_feature.SectionDecorator object>¶

Features generation module
Turns a document into a manipulable data structure.
SVM

class bilbo.components.svm.svm.Svm(cfg_file, type_config='ini')¶
    Bases: bilbo.components.component.Estimator
    SVM class.

    evaluate(document)¶
        Evaluate the model on the given data. The data are split 80/20% for the training / testing process.
        Parameters: document – document object

    extract_xy(data)¶

    fit(document)¶

    generate_vocab_dict(document)¶

    get_svm_data(document)¶
        Shape the SVM feature data for the SVM.
        Parameters: document – document object
        Returns: SVM-shaped data

    predict(document)¶
        Tag the new data based on a given model.
        Parameters: document – document object
        Returns: list of predictions

    train(document)¶
        Train the SVM model.
        Parameters: document – document object

    transform(document, mode)¶

    word_count(section)¶

    words_iterator(section)¶

    write_data_svm(data)¶

SVM module (train / tag)
Submodules¶
bilbo.components.component module¶
Component

class bilbo.components.component.Component(cfg_file, type_config)¶
    Bases: object
    Component abstract class.

    fit(document)¶

    classmethod get_module_name()¶

    classmethod get_parser_name()¶

    transform(document)¶

class bilbo.components.component.Estimator(cfg_file, type_config)¶
    Bases: bilbo.components.component.Component
    Estimator class.

    evaluate()¶

    predict(document)¶

    train(document)¶

    transform(document, mode)¶

class bilbo.components.component.Extractor(cfg_file, type_config)¶
    Bases: bilbo.components.component.Component
    Extractor class.

    extract_from_section(*args)¶

    fit()¶
bilbo.libs package¶
Submodules¶
bilbo.libs.opts module¶
Option handling

class bilbo.libs.opts.BilboParser¶
    Bases: object

    classmethod factory(type_args, section_args)¶

    classmethod getArgs(args, opt, type_opt, pipe)¶

class bilbo.libs.opts.DictParser¶
    Bases: bilbo.libs.opts.BilboParser

    classmethod getArgs(args, opt, type_opt=None, pipe=None)¶

class bilbo.libs.opts.IniParser¶
    Bases: bilbo.libs.opts.BilboParser

    classmethod getArgs(cfg_file, opt, type_opt=None, pipe=None)¶
        Get arguments from a config file.
        Parameters:
            cfg_file – config file
            opt – specify the option in the file

Module contents¶
bilbo.storage package¶
Submodules¶
bilbo.storage.document module¶
Document
bilbo.storage.section module¶
Section

class bilbo.storage.section.Section(str_value, section_naked, section_xml, tokens=None, token_str_lst=None, bibl_status=True, keys=None)¶
    Bases: object
    Describes the stored section.
        str_value: string value with tags
        section_naked: string value without tags
        section_xml: lxml object
        tokens: list of token objects (cf. token)
        token_str_lst: list of string tokens
        bibl_status: True if the section contains a bibl tag, False otherwise

    check_constraint(constraint)¶

    print_tokens()¶
        Print the section.

bilbo.storage.token module¶
Token
bilbo.storage.trie module¶
Trie
Module contents¶
init storage
bilbo.tests package¶
Submodules¶
bilbo.tests.test_feature module¶

class bilbo.tests.test_feature.TestDictionnayFeature(methodName='runTest')¶
    Bases: unittest.case.TestCase

    setUp()¶
        Hook method for setting up the test fixture before exercising it.

    test_dict()¶

class bilbo.tests.test_feature.TestListFeature(methodName='runTest')¶
    Bases: unittest.case.TestCase

    setUp()¶
        Hook method for setting up the test fixture before exercising it.

    test_dict()¶

class bilbo.tests.test_feature.TestLocalFeature(methodName='runTest')¶
    Bases: unittest.case.TestCase

    test_biblposition()¶
    test_cap()¶
    test_dash()¶
    test_initial()¶
    test_numbersMixed()¶

class bilbo.tests.test_feature.TestRegexFeature(methodName='runTest')¶
    Bases: unittest.case.TestCase

    setUp()¶
        Hook method for setting up the test fixture before exercising it.

    test_match_regex()¶
    test_nomatch_regex()¶

bilbo.tests.test_feature.load_list_predict(section, function)¶

bilbo.tests.test_feature.load_section(data)¶

bilbo.tests.test_importer module¶
bilbo.tests.test_shapesection module¶
bilbo.tests.tests module¶
Module contents¶
bilbo.tokenizers package¶
Submodules¶
bilbo.tokenizers.en module¶

class bilbo.tokenizers.en.EnglishTokenizer¶
    Bases: bilbo.tokenizers.tokenizers.DefaultTokenizer

    tokenize(option)¶

bilbo.tokenizers.fr module¶

class bilbo.tokenizers.fr.FrenchTokenizer¶
    Bases: bilbo.tokenizers.tokenizers.DefaultTokenizer

    tokenize(text)¶
        Tokenize the sentence given as parameter and return a list of tokens. This is a two-step process: 1. tokenize the text using punctuation marks, 2. merge over-tokenized units using the lexicon or a regex (for compounds, '^[A-Z][a-z]+-[A-Z][a-z]+$').

bilbo.tokenizers.tokenizers module¶
Tokenizer module

class bilbo.tokenizers.tokenizers.DefaultTokenizer¶
    Bases: object

    lexicon = None¶
        The dictionary containing the lexicon.

    loadlist(path)¶
        Load a resource list and generate the corresponding regexp part.

    regexp = None¶
        Loads the default lexicon (path is /resources/abbrs.list).

    resources = None¶
        The path of the resources folder.

    tokenize(text)¶

class bilbo.tokenizers.tokenizers.Tokenizer¶
    Bases: object
    Tokenizer class: tokenizes a given string.

Module contents¶
Tokenizers modules
bilbo.utils package¶
Submodules¶
bilbo.utils.crf_datas module¶
CRF data

bilbo.utils.crf_datas.apply_patterns(sections_xyseq, patterns, empty_features=False)¶
    Transform a list of features given patterns.
    Parameters: sections_xyseq – iterable: a generator over a list of section feature lists and labels
    Returns: a generator that yields a new list of features given the patterns

bilbo.utils.crf_datas.extract_y(sections, nfeatures=None)¶
    Parameters:
        sections – iterable: a sections generator (as returned by fd2sections())
        nfeatures – None|int: if None, the last feature of the first line of the first section is expected to be a label. Else nfeatures indicates the number of features, and sections[x][nfeatures] is the line's label.
    Returns: a generator that yields one tuple(xseq, yseq) per section

bilbo.utils.crf_datas.fd2patterns(patterns_fd)¶
    Read a Wapiti pattern file.
    Parameters: patterns_fd – iterable: a line generator
    Returns: an array of tuple(name, row, col)

bilbo.utils.crf_datas.fd2sections(datas_fd, sep=None)¶
    Generator that yields sections of features from BIOS-formatted content coming from a line generator.
    Parameters:
        datas_fd – iterable: a line generator (as returned by open())
        sep – None|str: if None, yields single strings containing BIOS-formatted features; else splits lines and features on sep
    Returns: depends on bios

bilbo.utils.crf_datas.sections2evaluate(sections, prop=0.8, seed=None)¶
    Split sections into a training and an evaluation part.
    Parameters:
        sections – iterable: items are sections
        prop – float: split proportion
        seed – int|None: random seed
    Returns: split sections for train / test purposes

bilbo.utils.crf_datas.trainer_opts(name, options)¶
    Return a dict of options for the trainer.
    Parameters:
        name – str: can be wapiti | crfsuite
        options – str (dict) with the crfsuite options
    Returns: a dict

bilbo.utils.dictionaries module¶
Dictionaries

bilbo.utils.dictionaries.compile_multiword(infile)¶
    Parameters: infile – str

bilbo.utils.dictionaries.generatePickle(dic, infile)¶
    Generate the pickle file.
    Parameters:
        dic – dictionary
        infile – str
    Returns: pickle file

bilbo.utils.svm_datas module¶

bilbo.utils.svm_datas.fd2features(datas_fd, to_dict=False)¶
    Process an SVM data file.
    Parameters: to_dict – bool: if true, yielded values are dicts, else strings
    Returns: a generator

bilbo.utils.svm_datas.fd2labeled_evaluation(datas_fd, to_dict=False, prop=0.8, seed=None)¶
    Return two iterators, one on training and one on evaluation data (same generator as fd2labeled_features).
    Parameters: to_dict – bool: if true return a dict, else a string
    Returns: tuple(train_datas, validation_datas)

bilbo.utils.svm_datas.fd2labeled_features(datas_fd, to_dict=False)¶
    Generator comparable to fd2features but that yields a tuple (label, features).
    Parameters: to_dict – bool: if true the features are returned as a dict, else a string is yielded
    Returns: a generator that yields tuples

bilbo.utils.svm_datas.svmRepport(y_test, y_pred)¶
    Print the evaluation report given the test and prediction data.
    Parameters:
        y_test – list of test labels (oracle)
        y_pred – list of predicted labels (same range as test)

bilbo.utils.svm_datas.svm_opts()¶
    Return kwargs and args for model training given argparse parsed arguments.
    Parameters: args – Namespace: as returned by ArgumentParser.parse_argument()
    Returns: a tuple(args, kwargs)

bilbo.utils.timer module¶
Timer class

class bilbo.utils.timer.Timer(name='', autostart=True)¶
    Bases: object
    Simple timer class.

    last¶

    mean()¶
        Returns: the average of the recorded timers

    name¶

    reset(name=None)¶
        Reset the timer and store the elapsed time.
        Parameters: name – str: new timer name. If given, stored data are erased.

    start()¶
        Start the timer.

    t()¶
        Returns: elapsed seconds since the last start() call

Module contents¶
utils init
Submodules¶
bilbo.bilbo module¶
bilbo.eval module¶

class bilbo.eval.Evaluation(gold, predicted, option='fine')¶
    Bases: object
    Evaluation class.

    evaluate()¶
        Compute all the precisions, recalls, f-measures and counts for the confusion matrix.
        Returns: dict(label, precision), dict(label, recall), dict(label, f_measures), dict(label, count), dict(macro)

    get_col_sum(label)¶
        Return the sum of a given column.

    get_confusion_matrix()¶
        Generate the confusion matrix; populate the matrix and the imap.

    get_count_for_label(label)¶
        Parameters: label – a given label
        Returns: the number of occurrences for the given label

    get_count_for_labels()¶
        Returns: a dict with the number of occurrences for each label: dict(label, count)

    get_f_measure_for_labels(beta: float = 1)¶
        Return the F-measure for all labels. See http://en.wikipedia.org/wiki/F1_score
        Parameters: beta – a beta higher than 1 favours recall, lower than 1 favours precision
        Returns: dict(label, F1)
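        For reference, the general F-measure is the weighted harmonic mean of precision and recall:
        F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}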
    get_macro_f_measure()¶
        Returns: the mean f-measure for the whole document

    get_macro_f_measure_weighted()¶
        Returns: the weighted mean f-measure for the whole document

    get_macro_precision()¶

    get_macro_precision_weighted()¶

    get_macro_recall()¶

    get_macro_recall_weighted()¶

    get_precision_for_label(label)¶
        Parameters: label – a given label
        Returns: the precision for the given label

    get_precision_for_labels()¶
        Returns: a dict with the precision of each label: dict(label, precision)

    get_recall_for_label(label)¶
        Parameters: label – a given label
        Returns: the recall for the given label

    get_recall_for_labels()¶
        Returns: a dict with the recall of each label: dict(label, recall)

    get_row_sum(label)¶
        Return the sum of a given row.

    get_true_positive(label)¶
        Return the true positives from the matrix.

    get_unique_label()¶
        Return a list of unique labels from the gold and predicted lists.

    print_csv(precisions, recalls, f_measures, counts, macro, csvfile)¶

    print_std(precisions, recalls, f_measures, counts, macro)¶