bilbo.tokenizers package


bilbo.tokenizers.en module

class bilbo.tokenizers.en.EnglishTokenizer

Bases: bilbo.tokenizers.tokenizers.DefaultTokenizer

tokenize(option)

Tokenize the sentence given as a parameter and return a list of tokens. This is a two-step process: 1. tokenize the text on punctuation marks; 2. merge over-tokenized units using the lexicon or, for compounds, a regex ('^[A-Z][a-z]+-[A-Z][a-z]+$').
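The two-step process above can be sketched as follows. This is a minimal illustration, not Bilbo's actual implementation: the function name `tokenize`, the punctuation split, and the merge loop are assumptions; only the compound regex comes from the documentation.

```python
import re

# Compound regex taken from the documentation; everything else is a sketch.
COMPOUND_RE = re.compile(r'^[A-Z][a-z]+-[A-Z][a-z]+$')

def tokenize(sentence, lexicon=()):
    """Two-step tokenization: split on punctuation, then re-merge."""
    # Step 1: tokenize on punctuation marks, keeping each mark as a token.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)

    # Step 2: merge over-tokenized units (e.g. 'Levi' '-' 'Strauss')
    # when the merged form is in the lexicon or matches the compound regex.
    merged = []
    i = 0
    while i < len(tokens):
        if i + 2 < len(tokens) and tokens[i + 1] == '-':
            candidate = tokens[i] + '-' + tokens[i + 2]
            if candidate in lexicon or COMPOUND_RE.match(candidate):
                merged.append(candidate)
                i += 3
                continue
        merged.append(tokens[i])
        i += 1
    return merged
```

With this sketch, `tokenize("Levi-Strauss wrote, in 1949.")` keeps the compound together because it matches the regex, while a lowercase pair such as `well-known` stays split unless it appears in the lexicon.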

bilbo.tokenizers.tokenizers module


class bilbo.tokenizers.tokenizers.DefaultTokenizer

Bases: object

lexicon = None

The dictionary containing the lexicon; the default lexicon is loaded from /resources/abbrs.list.

regexp = None

The regexp part generated from a loaded resource list.

resources = None

The path of the resources folder.
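The interaction between these attributes can be sketched as below. The attribute names follow the documentation, but the method name `load_resource` and the loading logic are assumptions for illustration only.

```python
import re

class DefaultTokenizer:
    """Hypothetical sketch of a DefaultTokenizer-style class."""
    resources = "/resources"   # the path of the resources folder
    lexicon = None             # the dictionary containing the lexicon
    regexp = None              # regexp part generated from a resource list

    def load_resource(self, entries):
        """Load a resource list and generate the corresponding regexp part."""
        # Store the entries in the lexicon dictionary...
        self.lexicon = {entry: True for entry in entries}
        # ...and turn them into one escaped alternation that can be
        # embedded in a larger tokenization regexp.
        self.regexp = "|".join(re.escape(entry) for entry in entries)

tok = DefaultTokenizer()
tok.load_resource(["e.g.", "i.e.", "etc."])
```

After loading, `tok.regexp` can be dropped into a larger pattern so that known abbreviations are matched as single tokens instead of being split on their periods.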

class bilbo.tokenizers.tokenizers.Tokenizer

Bases: object

Tokenizer class: tokenizes a given string.

Module contents

Tokenizers modules