bilbo.tokenizers package

Submodules

bilbo.tokenizers.en module

class bilbo.tokenizers.en.EnglishTokenizer

Bases: bilbo.tokenizers.tokenizers.DefaultTokenizer

tokenize(option)

bilbo.tokenizers.fr module

class bilbo.tokenizers.fr.FrenchTokenizer

Bases: bilbo.tokenizers.tokenizers.DefaultTokenizer

tokenize(text): Tokenize the given sentence and return a list of tokens. This is a two-step process:

1. Tokenize the text on punctuation marks.
2. Merge over-tokenized units using the lexicon or a regex (for compounds, ‘^[A-Z][a-z]+-[A-Z][a-z]+$’).
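The two-step process can be illustrated with a minimal standalone sketch. This is not bilbo's implementation: the function name tokenize_sketch, the sample LEXICON entries, and the max_span limit are all assumptions made for illustration. Only the compound regex is taken from the description of FrenchTokenizer.tokenize.

```python
import re

# Compound pattern quoted in the FrenchTokenizer.tokenize description.
COMPOUND_RE = re.compile(r"^[A-Z][a-z]+-[A-Z][a-z]+$")

# Hypothetical lexicon entries; the real lexicon is loaded from resource files.
LEXICON = {"etc.", "cf."}


def tokenize_sketch(text, lexicon=LEXICON, max_span=5):
    """Illustrative two-step tokenizer (assumed logic, not bilbo's code)."""
    # Step 1: split on punctuation, keeping each token's character offsets.
    spans = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

    # Step 2: merge over-tokenized units, trying the longest candidate first.
    tokens, i = [], 0
    while i < len(spans):
        merged = False
        for j in range(min(len(spans), i + max_span), i + 1, -1):
            # Only merge pieces that were adjacent (no whitespace) in the input.
            if any(spans[k][1] != spans[k + 1][0] for k in range(i, j - 1)):
                continue
            candidate = text[spans[i][0]:spans[j - 1][1]]
            if candidate in lexicon or COMPOUND_RE.match(candidate):
                tokens.append(candidate)
                i = j
                merged = True
                break
        if not merged:
            tokens.append(text[spans[i][0]:spans[i][1]])
            i += 1
    return tokens


print(tokenize_sketch("Il habite Saint-Denis, etc."))
```

In this sketch, step 1 splits "Saint-Denis" and "etc." into smaller pieces, and step 2 joins them back because the first matches the compound regex and the second is in the sample lexicon. Note that the quoted regex only covers unaccented ASCII letters.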

bilbo.tokenizers.tokenizers module

tokenizer module

class bilbo.tokenizers.tokenizers.DefaultTokenizer

Bases: object

lexicon = None: The dictionary containing the lexicon.

loadlist(path): Load a resource list and generate the corresponding regexp part.
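As a rough illustration of what turning a resource list into a "regexp part" could look like, here is a hedged sketch. It assumes a file with one entry per line and builds an alternation of escaped entries. The helper name load_list_as_regexp_part and the file format are assumptions, not the documented behaviour of loadlist.

```python
import re


def load_list_as_regexp_part(path):
    """Hypothetical helper: read one entry per line, return an alternation."""
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines and surrounding whitespace.
        entries = [line.strip() for line in handle if line.strip()]
    # Put longer entries first so they are tried before any shorter prefix.
    entries.sort(key=len, reverse=True)
    # Escape each entry so characters such as "." are matched literally.
    return "|".join(re.escape(entry) for entry in entries)
```

The returned string can then be embedded in a larger pattern, for example inside a non-capturing group such as "(?:" + part + ")".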

regexp = None: Loads the default lexicon (path is /resources/abbrs.list).

resources = None: The path of the resources folder.

tokenize(text)

class bilbo.tokenizers.tokenizers.Tokenizer

Bases: object

Tokenizer class: tokenizes a given string.

Module contents

Tokenizers modules
