Bilbo processes a specific data structure through a series of components. The input and output format of each component is stable. In some cases a component enriches the output with information attached to a section or to an individual token.
A document is first imported: the lxml library parses the whole document. Each document is then segmented into sections, and the data structure is built at the section level.
All of these modules can be found in the components directory. Each class inherits from the Component class.
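The idea of components applied in series can be pictured with a minimal sketch. Everything in it is hypothetical: the `Component` base class, its `transform` method, the `Token` and `Section` records and the `run_pipeline` helper are illustrative names, not Bilbo's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical token record; real Bilbo tokens may differ."""
    text: str
    features: list = field(default_factory=list)
    label: str = ""


@dataclass
class Section:
    """Hypothetical section record holding the tokens of one section."""
    tokens: list = field(default_factory=list)


class Component:
    """Hypothetical base class: each step takes sections and returns sections."""

    def transform(self, sections):
        raise NotImplementedError


class LowercaseFeature(Component):
    """Toy component that enriches every token with one feature."""

    def transform(self, sections):
        for section in sections:
            for token in section.tokens:
                token.features.append("lower=" + token.text.lower())
        return sections


def run_pipeline(components, sections):
    """Apply the components in series; the output of one feeds the next."""
    for component in components:
        sections = component.transform(sections)
    return sections


sections = [Section(tokens=[Token("Paris"), Token("1998")])]
result = run_pipeline([LowercaseFeature()], sections)
print(result[0].tokens[0].features)
```

Because every step receives and returns the same structure, components can in principle be reordered or dropped without changing the others, which is what a stable input and output format buys.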
Shape Data: extract XML values and tokenize
The shaper component handles XML data and tokenization. The tokenizer is written for French and English. The structure is built at the token level. This module should be the first in the pipeline. To see the command-line options (CLI):
python3 -m bilbo.components.shape_data -h
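As a rough illustration of what shaping involves, the sketch below reads section text from XML and splits it into tokens. It is not Bilbo's tokenizer: the `<bibl>` tag, the `shape` function and the regular expression are assumptions made for the example, and the standard library's `xml.etree.ElementTree` stands in for lxml so that the snippet has no dependencies.

```python
import re
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml in this sketch

# Hypothetical input: the <bibl> tag is an assumption for illustration.
XML = (
    "<doc>"
    "<bibl>Dupont, J. (1998). L'histoire.</bibl>"
    "<bibl>Smith, A. 2001.</bibl>"
    "</doc>"
)

# Runs of word characters, or single punctuation marks. This is a
# simplification of what a French/English tokenizer really needs.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def shape(xml_text, section_tag="bibl"):
    """Return one list of tokens per section found in the document."""
    root = ET.fromstring(xml_text)
    sections = []
    for element in root.iter(section_tag):
        text = "".join(element.itertext())
        sections.append(TOKEN_RE.findall(text))
    return sections


sections = shape(XML)
print(sections[1])
```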
To specify shape component options
Features can be extracted from list or dictionary files (external features), from the local properties of a word (local features), or from the properties of a section or the position of a token (global features). Word lists can be simple or multiple. To see the command-line options (CLI):
python3 -m bilbo.components.features -h
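The three kinds of features can be sketched as follows. The function names, the feature names and the `PLACE_NAMES` list are all hypothetical and chosen only for illustration; Bilbo's real list and dictionary files are not reproduced here.

```python
# Hypothetical external word list standing in for a list file.
PLACE_NAMES = {"paris", "london"}


def local_features(word):
    """Features derived from the form of the word itself."""
    features = []
    if word[:1].isupper():
        features.append("capitalized")
    if word.isdigit():
        features.append("numeric")
    if not any(ch.isalnum() for ch in word):
        features.append("punctuation")
    return features


def external_features(word):
    """Features derived from a lookup in an external word list."""
    return ["place"] if word.lower() in PLACE_NAMES else []


def global_features(index, length):
    """Features derived from the position of the token in its section."""
    if index == 0:
        return ["first"]
    if index == length - 1:
        return ["last"]
    return []


def extract(tokens):
    """Return (token, feature list) pairs for one section."""
    return [
        (word,
         local_features(word)
         + external_features(word)
         + global_features(i, len(tokens)))
        for i, word in enumerate(tokens)
    ]


for word, features in extract(["Paris", ":", "Seuil", ",", "1998"]):
    print(word, features)
```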
To specify feature component options
Conditional random field
This component is based on python-crfsuite, a Python binding of CRFsuite. CRFsuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data. The labeling is generated from the extracted features, and for convenience a wrapper for CRF++ is also available. To see the command-line options (CLI):
python3 -m bilbo.components.crf -h
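To make the link between extracted features and labeling concrete, here is a small sketch of how one section could be handed to python-crfsuite. It uses the binding's `Trainer` and `Tagger` classes directly and is not taken from Bilbo's code: the helper names, the label set and the model file name are assumptions. The `pycrfsuite` import is placed inside the function so that the conversion step can be read and run without the package installed.

```python
def to_items(tokens_with_features):
    """Convert (token, features) pairs into the list-of-attribute-lists
    form that python-crfsuite accepts as an item sequence."""
    return [
        ["word=" + word.lower()] + list(features)
        for word, features in tokens_with_features
    ]


def train_and_tag(train_x, train_y, test_x, model_path="sketch.crfsuite"):
    """Train a tiny model and tag one sequence (needs python-crfsuite)."""
    import pycrfsuite  # lazy import: only required when this is called

    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.append(train_x, train_y)  # one sequence of items and its labels
    trainer.train(model_path)         # writes the model file to disk

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger.tag(test_x)         # one label per item


section = [
    ("Dupont", ["capitalized"]),
    (",", ["punctuation"]),
    ("1998", ["numeric"]),
]
labels = ["surname", "c", "date"]  # hypothetical label set
items = to_items(section)
print(items[0])
```

With python-crfsuite installed, `train_and_tag(items, labels, items)` trains on this single sequence and returns one label per token. A real model would of course need many annotated sections.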
To specify CRF component options
Support Vector Machine
This component is based on LIBSVM, an integrated library for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. For now, it is used to classify footnotes according to whether they contain bibliographic references. To see the command-line options (CLI):
python3 -m bilbo.components.svm -h
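The footnote classification can be illustrated by turning each note into a numeric vector written in LIBSVM's sparse text format (`label index:value ...`). The three features below, and the helper names, are assumptions chosen for the example; they are not the features Bilbo actually uses.

```python
import re

YEAR_RE = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")


def note_features(note):
    """Hypothetical numeric features describing one footnote."""
    tokens = note.split()
    return {
        1: sum(1 for t in tokens if t[:1].isupper()),  # capitalized tokens
        2: len(YEAR_RE.findall(note)),                 # four-digit years
        3: note.count(","),                            # commas
    }


def to_libsvm_line(label, features):
    """Format one example in LIBSVM's sparse text format, skipping zeros."""
    pairs = " ".join(
        f"{index}:{value}"
        for index, value in sorted(features.items())
        if value != 0
    )
    return f"{label} {pairs}".rstrip()


notes = [
    (1, "J. Dupont, Histoire de Paris, Seuil, 1998, p. 12."),
    (-1, "this point is discussed further in the next chapter."),
]
for label, note in notes:
    print(to_libsvm_line(label, note_features(note)))
```

A file made of such lines is the kind of input that LIBSVM's `svm-train` tool reads. Here the label 1 marks a note that contains a reference and -1 a note that does not.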
To specify SVM component options
Optional output step
Each document was segmented into sections; in this step the XML data structure is reconstructed from them.
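A minimal sketch of this reconstruction, assuming each section is available as a list of (token, label) pairs: every token is wrapped in an element named after its label, and the sections are gathered under a single root. The tag names and both helper functions are hypothetical, and the standard library's `xml.etree.ElementTree` is used here in place of lxml.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml in this sketch


def rebuild_section(labeled_tokens, section_tag="bibl"):
    """Wrap each labeled token in an element named after its label."""
    section = ET.Element(section_tag)
    for text, label in labeled_tokens:
        child = ET.SubElement(section, label)
        child.text = text
    return section


def rebuild_document(sections, root_tag="doc"):
    """Gather the rebuilt sections under one root and serialize them."""
    root = ET.Element(root_tag)
    for labeled_tokens in sections:
        root.append(rebuild_section(labeled_tokens))
    return ET.tostring(root, encoding="unicode")


labeled = [[("Dupont", "surname"), ("1998", "date")]]
print(rebuild_document(labeled))
```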