import configparser
from bilbo.importer import Importer
from bilbo.components.shape_data.shape_data import ShapeSection
from bilbo.components.features.features import FeatureHandler
from bilbo.components.crf.crf import Crf
from bilbo.bilbo import Bilbo

Bilbo in a shell

Construct Data Structure

First, import your XML document, either from a string or from a file. Every later action (machine learning prediction, feature extraction, setting new XML properties) operates on the resulting document object.

#xml_str = '<xml>Oustide<bibl><pubPlace>Marseille</pubPlace>, <sponsor>OpenEdition is "! inside </sponsor>>a bibl</bibl></xml>'
xml_str = """<TEI xmlns="http://www.tei-c.org/ns/1.0"> Outside 
<bibl>Hillier B., 1996, <hi>Space is the Machine</hi>, Cambridge University Press, <pubPlace>Cambridge.</pubPlace>
</bibl></TEI>"""
imp = Importer(xml_str)
doc = imp.parse_xml('bibl', is_file = False)
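
To make concrete what selecting the 'bibl' sections of the input means, here is a minimal sketch that relies only on the standard library's xml.etree.ElementTree. It is not how Importer is implemented (its internals are not shown in this notebook); it only illustrates that the TEI namespace has to be taken into account and that text outside <bibl> is left aside.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

xml_str = """<TEI xmlns="http://www.tei-c.org/ns/1.0"> Outside
<bibl>Hillier B., 1996, <hi>Space is the Machine</hi>, Cambridge University Press, <pubPlace>Cambridge.</pubPlace>
</bibl></TEI>"""

root = ET.fromstring(xml_str)

# Elements live in the TEI namespace, so the tag must be qualified.
bibls = root.findall(".//{%s}bibl" % TEI_NS)

# Names of the child elements of the first <bibl>, without the namespace part.
child_tags = [child.tag.split("}")[1] for child in bibls[0]]

# Plain text of the reference, with inner markup flattened.
text = "".join(bibls[0].itertext()).strip()

print(len(bibls), child_tags)
print(text)
```

The string " Outside" belongs to the root element, not to <bibl>, so it does not appear in the extracted text.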

Tokenize, extract and wrap XML information

First, load parameters.

dic = """                                                      
[shaper]                        
tokenizerOption = fine          
tagsOptions = {                                                                                 
    "pubPlace": "place",
    "sponsor": "publisher"
    } 
verbose = True
"""
# Load the parameters.
# There are different ways to set parameters (ini file, ...); see: https://docs.python.org/3/library/configparser.html#quick-start
config = configparser.ConfigParser(allow_no_value=True) 
config.read_string(dic)
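
The parameter block above mixes plain values with a multi-line, JSON-like value (tagsOptions). The sketch below shows how configparser itself reads such a block: continuation lines must be indented, including the closing brace. Decoding tagsOptions with json.loads is an assumption made for illustration; this notebook does not show how Bilbo decodes it.

```python
import configparser
import json

ini_text = """
[shaper]
tokenizerOption = fine
tagsOptions = {
    "pubPlace": "place",
    "sponsor": "publisher"
    }
verbose = True
"""

config = configparser.ConfigParser(allow_no_value=True)
config.read_string(ini_text)

shaper = config["shaper"]

# Option names are matched case-insensitively by default.
tokenizer_option = shaper["tokenizerOption"]

# The multi-line value is returned as one string; here it happens to be valid JSON.
tags_options = json.loads(shaper["tagsOptions"])

# getboolean() converts "True"/"False" (and yes/no, on/off, 1/0) to a bool.
verbose = shaper.getboolean("verbose")

print(tokenizer_option, tags_options, verbose)
```

If the closing brace were placed in the first column, configparser would no longer treat it as a continuation of tagsOptions.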

Use the ShapeSection class. At any moment you can call help() to see a function's parameters:

help(ShapeSection.__init__)
Help on function __init__ in module bilbo.components.shape_data.shape_data:

__init__(self, cfg_file, type_config='ini', lang='fr')
    Initialize self.  See help(type(self)) for accurate signature.
sh = ShapeSection(config, type_config='Dict')
sh.transform(doc)
<bilbo.storage.document.Document at 0x7fc3740d7390>

To see an overview of your document:

for section in doc.sections:
    for token in section.tokens:
        print('Token:{0}\t\t Label:{1}'.format(token.str_value, token.label))
Token:Hillier		 Label:bibl
Token:B.		 Label:bibl
Token:,		 Label:c
Token:1996		 Label:bibl
Token:,		 Label:c
Token:Space		 Label:hi
Token:is		 Label:hi
Token:the		 Label:hi
Token:Machine		 Label:hi
Token:,		 Label:c
Token:Cambridge		 Label:bibl
Token:University		 Label:bibl
Token:Press		 Label:bibl
Token:,		 Label:c
Token:Cambridge		 Label:place
Token:.		 Label:c
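
The overview shows three behaviours: a token takes the name of its enclosing element as its label, tagsOptions renames some of those labels (pubPlace becomes place), and punctuation is labelled c. The following sketch reproduces these behaviours with the standard library only. The regular expression and the helper names (tokenize, label_tokens, local_name) are hypothetical and are not Bilbo's tokenizer.

```python
import re
import xml.etree.ElementTree as ET

XML_STR = """<TEI xmlns="http://www.tei-c.org/ns/1.0"> Outside
<bibl>Hillier B., 1996, <hi>Space is the Machine</hi>, Cambridge University Press, <pubPlace>Cambridge.</pubPlace>
</bibl></TEI>"""

TAG_MAP = {"pubPlace": "place", "sponsor": "publisher"}  # same role as tagsOptions

# Assumed rule: an initial such as "B." stays whole, other punctuation is split off.
TOKEN_RE = re.compile(r"[A-Z]\.|\w+|[^\w\s]")


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def tokenize(text, label):
    """Split text into (token, label) pairs; lone punctuation gets the label 'c'."""
    pairs = []
    for tok in TOKEN_RE.findall(text or ""):
        is_punct = len(tok) == 1 and not tok.isalnum()
        pairs.append((tok, "c" if is_punct else label))
    return pairs


def label_tokens(elem):
    """Walk an element in document order and label tokens by enclosing tag."""
    name = local_name(elem.tag)
    label = TAG_MAP.get(name, name)
    pairs = tokenize(elem.text, label)
    for child in elem:
        pairs += label_tokens(child)
        pairs += tokenize(child.tail, label)  # text after the child belongs to elem
    return pairs


root = ET.fromstring(XML_STR)
bibl = root.find("{http://www.tei-c.org/ns/1.0}bibl")
pairs = label_tokens(bibl)
for tok, label in pairs:
    print("Token:{0}\t\t Label:{1}".format(tok, label))
```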

Features

Set the features you need. For external features, you must give the correct path to the external lists.

dic = """                                                      
[features]
listFeatures = numbersMixed, cap, dash, biblPosition, initial
listFeaturesRegex = ('UNIVERSITY', '^Uni.*ty$')
listFeaturesExternes = ('surname', 'surname_list.txt', 'simple'),
listFeaturesXML = italic
output = output.txt 
verbose = False 
"""
config = configparser.ConfigParser(allow_no_value=True) 
config.read_string(dic)

For convenience, features are printed in CRF++ format.

feat = FeatureHandler(config, type_config='Dict')
feat.loadFonctionsFeatures()
doc = feat.transform(doc)
feat.print_features(doc)
Hillier NONUMBERS FIRSTCAP NODASH BIBL_START NOINITIAL NOUNIVERSITY SURNAME NOITALIC bibl

B. NONUMBERS ALLCAP NODASH BIBL_START INITIAL NOUNIVERSITY NOSURNAME NOITALIC bibl

, NONUMBERS NONIMPCAP NODASH BIBL_START NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC c

1996 NUMBERS NONIMPCAP NODASH BIBL_START NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC bibl

, NONUMBERS NONIMPCAP NODASH BIBL_START NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC c

Space NONUMBERS FIRSTCAP NODASH BIBL_START NOINITIAL NOUNIVERSITY NOSURNAME ITALIC hi

is NONUMBERS ALLSMALL NODASH BIBL_IN NOINITIAL NOUNIVERSITY NOSURNAME ITALIC hi

the NONUMBERS ALLSMALL NODASH BIBL_IN NOINITIAL NOUNIVERSITY NOSURNAME ITALIC hi

Machine NONUMBERS FIRSTCAP NODASH BIBL_IN NOINITIAL NOUNIVERSITY NOSURNAME ITALIC hi

, NONUMBERS NONIMPCAP NODASH BIBL_IN NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC c

Cambridge NONUMBERS FIRSTCAP NODASH BIBL_IN NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC bibl

University NONUMBERS FIRSTCAP NODASH BIBL_END NOINITIAL UNIVERSITY NOSURNAME NOITALIC bibl

Press NONUMBERS FIRSTCAP NODASH BIBL_END NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC bibl

, NONUMBERS NONIMPCAP NODASH BIBL_END NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC c

Cambridge NONUMBERS FIRSTCAP NODASH BIBL_END NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC place

. NONUMBERS NONIMPCAP NODASH BIBL_END NOINITIAL NOUNIVERSITY NOSURNAME NOITALIC c
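
In CRF++ format each token is one line of whitespace-separated columns, with the label in the last column. To show how such rows can be produced, here is a minimal sketch of three of the columns above. The rules are inferred from the printed output only and are assumptions, not Bilbo's code: cap_feature, regex_feature and position_feature are hypothetical names, the fallback for mixed-case tokens is a guess, and the position rule (thirds of the reference) simply happens to match the 16 tokens shown.

```python
import re


def cap_feature(token):
    """Capitalisation column, inferred from the output above."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return "NONIMPCAP"  # punctuation and numbers
    if all(ch.isupper() for ch in letters):
        return "ALLCAP"
    if all(ch.islower() for ch in letters):
        return "ALLSMALL"
    if token[0].isupper():
        return "FIRSTCAP"
    return "NONIMPCAP"  # assumption: other mixed-case tokens are not modelled


def regex_feature(token, name="UNIVERSITY", pattern=r"^Uni.*ty$"):
    """Regex column: NAME when the pattern matches, NONAME otherwise."""
    return name if re.search(pattern, token) else "NO" + name


def position_feature(index, total):
    """Position column, assuming the reference is cut into thirds."""
    if index < total / 3:
        return "BIBL_START"
    if index < 2 * total / 3:
        return "BIBL_IN"
    return "BIBL_END"


tokens = [
    ("Hillier", "bibl"), ("B.", "bibl"), (",", "c"), ("1996", "bibl"), (",", "c"),
    ("Space", "hi"), ("is", "hi"), ("the", "hi"), ("Machine", "hi"), (",", "c"),
    ("Cambridge", "bibl"), ("University", "bibl"), ("Press", "bibl"), (",", "c"),
    ("Cambridge", "place"), (".", "c"),
]

rows = [
    " ".join([tok, cap_feature(tok), position_feature(i, len(tokens)),
              regex_feature(tok), label])
    for i, (tok, label) in enumerate(tokens)
]
print("\n".join(rows))
```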

Make predictions

First, we need a Document storage object that makes sense (the one above was just for demonstration). We load the right parameters with the path to pipeline_bibl.cfg:

# This part is a quick recap of the tokenizer and feature steps explained above.
# They are run again with the appropriate parameters (path to pipeline_bibl.cfg).
imp = Importer(xml_str)
doc = imp.parse_xml('bibl', is_file = False)
bbo = Bilbo(doc, 'pipeline_bibl.cfg')
bbo.shape_data(doc)
bbo.features(doc)
<bilbo.storage.document.Document at 0x7fc3740ac828>

We now have a Document storage object that contains all the needed information.

# Start to make predictions
tagger = Crf(bbo.config, type_config='Dict')
labels = tagger.predict(doc)

for label in labels:
    for l in label:
        print(l[0], l[1])
Hillier surname
B. forename
, c
1996 date
, c
Space title
is title
the title
Machine title
, c
Cambridge publisher
University publisher
Press publisher
, c
Cambridge pubPlace
. c
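
Token-level labels are often easier to read once consecutive tokens with the same label are merged into fields. The sketch below does this with itertools.groupby. It assumes that predict() returns a list of sequences of (token, label) pairs, as the loop above suggests; the data is copied from the printed output, and group_fields is a hypothetical helper, not part of Bilbo.

```python
from itertools import groupby

# One sequence per bibliographic reference, copied from the output above.
labels = [[
    ("Hillier", "surname"), ("B.", "forename"), (",", "c"), ("1996", "date"),
    (",", "c"), ("Space", "title"), ("is", "title"), ("the", "title"),
    ("Machine", "title"), (",", "c"), ("Cambridge", "publisher"),
    ("University", "publisher"), ("Press", "publisher"), (",", "c"),
    ("Cambridge", "pubPlace"), (".", "c"),
]]


def group_fields(sequence):
    """Merge runs of identical labels into (label, text) fields, skipping 'c'."""
    fields = []
    for label, group in groupby(sequence, key=lambda pair: pair[1]):
        if label == "c":
            continue  # punctuation separates fields but is not a field itself
        fields.append((label, " ".join(tok for tok, _ in group)))
    return fields


fields = group_fields(labels[0])
for label, text in fields:
    print("{0}: {1}".format(label, text))
```

Joining tokens with a single space is a simplification: it does not restore the original spacing around punctuation.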

Add predictions to the data structure

Always use the transform() function to add predictions to the Document storage object. Note that for an estimator component, three options are available: 'tag', 'train', 'evaluate'.

tagger.transform(doc, 'tag')
for section in doc.sections:
    for token in section.tokens:
        print('Token:{0}\t\t Label:{1}'.format(token.str_value, token.predict_label))
Token:Hillier		 Label:surname
Token:B.		 Label:forename
Token:,		 Label:c
Token:1996		 Label:date
Token:,		 Label:c
Token:Space		 Label:title
Token:is		 Label:title
Token:the		 Label:title
Token:Machine		 Label:title
Token:,		 Label:c
Token:Cambridge		 Label:publisher
Token:University		 Label:publisher
Token:Press		 Label:publisher
Token:,		 Label:c
Token:Cambridge		 Label:pubPlace
Token:.		 Label:c