Pipelines

Bilbo processes a specific data structure through a series of components. The input and output format of each component is stable. In some cases the output is enriched with information attached to each section or to each token.

Importer

The document is imported: the lxml library parses the whole document. Each document is segmented into sections, and the data structure is built at the section level.
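The import step can be sketched as follows. This is a minimal illustration, not Bilbo's importer: it uses the standard library's `xml.etree.ElementTree` in place of lxml (whose `lxml.etree` module exposes a largely compatible API), and the `note` tag, the sample document and the `segment_sections` helper are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the <note> tag stands in for a "section".
SAMPLE = """<doc>
  <note>Smith, J. <i>A Title</i>. Paris, 2010.</note>
  <note>See chapter 2.</note>
</doc>"""


def segment_sections(xml_text, section_tag="note"):
    """Parse the whole document, then build one record per section."""
    root = ET.fromstring(xml_text)
    sections = []
    for index, element in enumerate(root.iter(section_tag)):
        sections.append({
            "index": index,                       # position of the section in the document
            "element": element,                   # parsed XML node, kept for later steps
            "text": "".join(element.itertext()),  # flattened text content
        })
    return sections


sections = segment_sections(SAMPLE)
```

Keeping the parsed node next to the flattened text is one possible design choice: later steps can work on plain text while the original markup stays available.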

Available modules

All these modules can be found in the components directory. Each class inherits from the Component class.
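The idea of chaining components that share a stable input and output can be sketched as below. The names used here (`Component.transform`, `Lowercase`, `CountTokens`, `run_pipeline`) are hypothetical and do not reflect Bilbo's actual Component class; the sketch only shows the pattern.

```python
class Component:
    """Hypothetical base class: every component takes and returns the same structure."""

    def transform(self, document):
        raise NotImplementedError


class Lowercase(Component):
    """Toy component that rewrites the text of each section."""

    def transform(self, document):
        for section in document["sections"]:
            section["text"] = section["text"].lower()
        return document


class CountTokens(Component):
    """Toy component that enriches each section with extra information."""

    def transform(self, document):
        for section in document["sections"]:
            section["n_tokens"] = len(section["text"].split())
        return document


def run_pipeline(document, components):
    """Apply the components in order; each one receives the previous output."""
    for component in components:
        document = component.transform(document)
    return document


doc = {"sections": [{"text": "Smith J. A Title"}, {"text": "Ibid."}]}
result = run_pipeline(doc, [Lowercase(), CountTokens()])
```

Because every component accepts and returns the same structure, components can be reordered or left out without changing the others.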

Shape Data: extract XML values and tokenize

The shaper component handles XML data and tokenization. The tokenizer is written for French and English. The data structure is built at the token level. This module should be the first in the pipeline. To see the command-line (CLI) options:

python3 -m bilbo.components.shape_data -h

To specify shape component options
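To illustrate what a token-level structure can look like, here is a minimal regex-based sketch. It is not Bilbo's tokenizer: the pattern, the `tokenize` helper and the record fields are assumptions, and real French or English tokenization needs more rules (elisions, abbreviations, hyphenated words).

```python
import re

# Words (Unicode-aware, so accented French letters stay together)
# or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return one record per token with its character offsets."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end()}
        for m in TOKEN_PATTERN.finditer(text)
    ]


tokens = tokenize("Éd. Dupont, 2010.")
```

Storing character offsets with each token makes it possible to map labels back onto the original text later.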

Features

Features can be extracted from list or dictionary files (external features), from the local specificity of a word, or from the specificity of the section or the position of a token (global features). Word lists can be simple or multiple. To see the command-line (CLI) options:

python3 -m bilbo.components.features -h

To specify feature component options
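The three kinds of features described in this section can be sketched as follows. The feature names, the `extract_features` helper and the `PLACES` word list are hypothetical; they only illustrate the distinction between external, local and global features and are not the features Bilbo computes.

```python
def extract_features(tokens, lexicon):
    """Attach external, local and global features to each token."""
    total = len(tokens)
    featured = []
    for position, token in enumerate(tokens):
        featured.append({
            "token": token,
            # External feature: membership in a word list loaded from a file.
            "in_lexicon": token.lower() in lexicon,
            # Local features: properties of the word itself.
            "is_capitalized": token[:1].isupper(),
            "is_numeric": token.isdigit(),
            # Global features: position of the token within its section.
            "is_first": position == 0,
            "relative_position": position / total,
        })
    return featured


# Hypothetical word list; in practice it would be read from a list or dictionary file.
PLACES = {"paris", "london"}
features = extract_features(["Dupont", ",", "Paris", ",", "2010"], PLACES)
```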

Conditional random field

This component is based on python-crfsuite, a Python binding of CRFsuite. CRFsuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data. The labeling relies on the extracted features; to make this easier, a wrapper around CRF++ is available. To see the command-line (CLI) options:

python3 -m bilbo.components.crf -h

To specify CRF component options
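CRF++ describes features with templates that refer to neighbouring tokens by relative position. The sketch below imitates that idea in plain Python, assuming a simple window of offsets; the `window_features` helper, the feature naming and the labels are hypothetical and are not Bilbo's wrapper. The resulting lists of strings are in a form that python-crfsuite accepts as items, for example through `Trainer.append(xseq, yseq)`.

```python
def window_features(tokens, offsets=(-1, 0, 1)):
    """Describe each token by the words found at the given relative offsets."""
    sequence = []
    for i in range(len(tokens)):
        item = []
        for offset in offsets:
            j = i + offset
            if j < 0:
                value = "<BOS>"  # before the beginning of the sequence
            elif j >= len(tokens):
                value = "<EOS>"  # past the end of the sequence
            else:
                value = tokens[j].lower()
            item.append(f"w[{offset}]={value}")
        sequence.append(item)
    return sequence


xseq = window_features(["Dupont", ",", "2010"])
yseq = ["surname", "c", "date"]  # hypothetical labels, one per token
```

Each token is described together with its neighbours, which is what lets a sequence model use context when choosing a label.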

Support Vector Machine

This component is based on libsvm. LIBSVM is an integrated library for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. For now, it is used to classify footnotes that contain bibliographic references. To see the command-line (CLI) options:

python3 -m bilbo.components.svm -h

To specify SVM component options
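As an illustration of how a footnote could be prepared for an SVM, the sketch below turns a note into a small numeric vector and writes it in LIBSVM's sparse text format (`<label> <index>:<value> ...`, with indices starting at 1). The four features and both helper names are assumptions made for the example, not the features Bilbo uses.

```python
import re

# Rough pattern for a four-digit year between 1500 and 2099 (an assumption).
YEAR_PATTERN = re.compile(r"\b(1[5-9]|20)\d{2}\b")


def footnote_vector(note):
    """Map a footnote to a small numeric feature vector (hypothetical features)."""
    words = note.split()
    return [
        len(words),                           # 1: length in words
        sum(ch.isdigit() for ch in note),     # 2: number of digits
        sum(w[:1].isupper() for w in words),  # 3: number of capitalized words
        1 if YEAR_PATTERN.search(note) else 0,  # 4: contains something like a year
    ]


def to_libsvm_line(label, vector):
    """Format one example as '<label> <index>:<value> ...', omitting zeros."""
    pairs = [f"{i}:{v}" for i, v in enumerate(vector, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


# +1: note containing a bibliographic reference, -1: other note.
line = to_libsvm_line(+1, footnote_vector("Dupont, Histoire, Paris, 2010."))
other = to_libsvm_line(-1, footnote_vector("See above."))
```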

Optional output step

Each document was segmented into sections; in this step the XML data structure is reconstructed.
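A minimal sketch of such a reconstruction is given below, assuming each token carries a predicted label that becomes the name of its XML element. The `rebuild_section` helper and the tag names (`bibl`, `surname`, `place`, `date`) are hypothetical, and the standard library's `xml.etree.ElementTree` is used here instead of lxml.

```python
import xml.etree.ElementTree as ET


def rebuild_section(labelled_tokens, section_tag="bibl"):
    """Wrap each labelled token in an element named after its label."""
    section = ET.Element(section_tag)
    for text, label in labelled_tokens:
        child = ET.SubElement(section, label)
        child.text = text
    return section


section = rebuild_section([("Dupont", "surname"), ("Paris", "place"), ("2010", "date")])
xml_string = ET.tostring(section, encoding="unicode")
```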