bilbo.tokenizers package

Submodules

bilbo.tokenizers.en module

class bilbo.tokenizers.en.EnglishTokenizer

Bases: bilbo.tokenizers.tokenizers.DefaultTokenizer

tokenize(option)

bilbo.tokenizers.fr module

class bilbo.tokenizers.fr.FrenchTokenizer

Bases: bilbo.tokenizers.tokenizers.DefaultTokenizer

tokenize(text)

Tokenize the sentence passed as a parameter and return a list of tokens. This is a two-step process: (1) tokenize the text on punctuation marks; (2) merge over-tokenized units using the lexicon or a regex for compounds ('^[A-Z][a-z]+-[A-Z][a-z]+$').
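
A minimal sketch of this two-step process, for illustration only; the helper names, the merging loop, and the sample lexicon below are hypothetical and are not part of the bilbo API:

    import re

    # Illustration only: none of these names belong to the bilbo API.
    COMPOUND_RE = re.compile(r"^[A-Z][a-z]+-[A-Z][a-z]+$")

    def split_on_punctuation(text):
        # Step 1: split the text on punctuation marks, keeping each
        # mark as its own token.
        return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

    def merge_compounds(tokens, lexicon):
        # Step 2: merge over-tokenized units when the merged form is
        # in the lexicon or matches the compound regex.
        merged, i = [], 0
        while i < len(tokens):
            if i + 2 < len(tokens) and tokens[i + 1] == "-":
                candidate = "".join(tokens[i:i + 3])
                if candidate in lexicon or COMPOUND_RE.match(candidate):
                    merged.append(candidate)
                    i += 3
                    continue
            merged.append(tokens[i])
            i += 1
        return merged

    tokens = split_on_punctuation("Jean-Pierre arrive demain.")
    print(merge_compounds(tokens, {"Jean-Pierre"}))
    # -> ['Jean-Pierre', 'arrive', 'demain', '.']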

bilbo.tokenizers.tokenizers module


class bilbo.tokenizers.tokenizers.DefaultTokenizer

Bases: object

lexicon = None

The dictionary holding the lexicon entries used when merging over-tokenized units.

loadlist(path)

Load a resource list from path and generate the corresponding regular-expression fragment.
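
A minimal sketch of what such a helper might do, assuming one entry per line in the resource file; the file name and the implementation details are hypothetical:

    import re

    def loadlist(path):
        # Hypothetical sketch: read one entry per line and build an
        # alternation fragment for embedding in a larger regexp.
        with open(path, encoding="utf-8") as handle:
            entries = [line.strip() for line in handle if line.strip()]
        # Escape entries so punctuation inside them matches literally.
        return "|".join(re.escape(entry) for entry in entries)

    # Usage (hypothetical file name):
    # pattern = re.compile("(?:%s)" % loadlist("resources/abbreviations.txt"))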

resources = None

The path to the resources folder.

tokenize(text)

class bilbo.tokenizers.tokenizers.Tokenizer

Bases: object

Tokenizer class that tokenizes a given string.
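
A minimal usage sketch, assuming the import paths shown above; the exact tokens returned depend on the loaded lexicon and resources:

    from bilbo.tokenizers.fr import FrenchTokenizer

    tokenizer = FrenchTokenizer()
    tokens = tokenizer.tokenize("Jean-Pierre arrive demain.")
    # Expected: a list of tokens with the compound 'Jean-Pierre' kept whole.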

Module contents

Tokenizer modules.