tokenizer

Tokenization wrapper

This module provides an abstraction layer around model tokenizers. Its purpose is to enable token-based chunking for vectorization and map-reduce operations.

Classes

BaseTokenizer

Bases: ABC

An abstraction layer to support different types of tokenizers from within the Eleanor Framework.

Functions

chunk_text
chunk_text(text: str, length: int) -> List[str]
decode abstractmethod
decode(tokens: List[int]) -> str
encode abstractmethod
encode(text: str) -> List[int]
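
The base class can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: it assumes `chunk_text` is built on top of the abstract `encode`/`decode` pair by slicing the token stream into fixed-size windows, and the `WhitespaceTokenizer` subclass is a hypothetical toy used only to demonstrate the contract.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseTokenizer(ABC):
    """Sketch of the tokenizer abstraction (illustrative, not the real code)."""

    @abstractmethod
    def encode(self, text: str) -> List[int]:
        """Convert text to a list of token ids."""

    @abstractmethod
    def decode(self, tokens: List[int]) -> str:
        """Convert a list of token ids back to text."""

    def chunk_text(self, text: str, length: int) -> List[str]:
        # Encode once, slice the token stream into windows of at most
        # `length` tokens, then decode each window back into text.
        tokens = self.encode(text)
        return [
            self.decode(tokens[i : i + length])
            for i in range(0, len(tokens), length)
        ]


class WhitespaceTokenizer(BaseTokenizer):
    """Hypothetical toy tokenizer: one token per whitespace-separated word."""

    def __init__(self) -> None:
        self._id_to_word: List[str] = []
        self._word_to_id: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self._word_to_id:
                self._word_to_id[word] = len(self._id_to_word)
                self._id_to_word.append(word)
            ids.append(self._word_to_id[word])
        return ids

    def decode(self, tokens: List[int]) -> str:
        return " ".join(self._id_to_word[t] for t in tokens)
```

For example, `WhitespaceTokenizer().chunk_text("a b c d e", 2)` yields `["a b", "c d", "e"]`: chunk boundaries fall on token counts, not character counts, which is what makes this useful for staying under a model's context window.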

HuggingFaceTokenizer

HuggingFaceTokenizer(tokenizer_path: str)

Bases: BaseTokenizer

A wrapper around HuggingFace tokenizers.

Functions

decode
decode(tokens: List[int]) -> str
encode
encode(text: str) -> List[int]
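
A plausible shape for this wrapper, assuming it delegates to the `transformers` library's `AutoTokenizer` (the exact loading logic in the framework may differ):

```python
from typing import List


class HuggingFaceTokenizer:
    """Hypothetical sketch of a wrapper around a HuggingFace tokenizer."""

    def __init__(self, tokenizer_path: str):
        # Deferred import so this module can be loaded even when the
        # transformers package is not installed.
        from transformers import AutoTokenizer

        self._tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    def encode(self, text: str) -> List[int]:
        # add_special_tokens=False keeps BOS/EOS markers out of the token
        # stream, so chunk boundaries contain only the input text.
        return self._tokenizer.encode(text, add_special_tokens=False)

    def decode(self, tokens: List[int]) -> str:
        return self._tokenizer.decode(tokens)
```

Usage would look like `HuggingFaceTokenizer("meta-llama/Llama-2-7b-hf").encode("hello")`, where the model path is an example value, not one prescribed by the framework.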

TikTokenTokenizer

TikTokenTokenizer(encoding_name: str)

Bases: BaseTokenizer

A wrapper around TikToken tokenizers.

Warning

Since I’ve focused on open-source models up to this point, the TikTokenTokenizer has not been thoroughly tested.

Functions

decode
decode(tokens: List[int]) -> str
encode
encode(text: str) -> List[int]
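
By analogy with the HuggingFace wrapper, this class presumably delegates to the `tiktoken` library's `get_encoding` API. A minimal sketch under that assumption:

```python
from typing import List


class TikTokenTokenizer:
    """Hypothetical sketch of a wrapper around a tiktoken encoding."""

    def __init__(self, encoding_name: str):
        # Deferred import so this module can be loaded even when the
        # tiktoken package is not installed.
        import tiktoken

        # e.g. encoding_name="cl100k_base" for GPT-4-era models.
        self._encoding = tiktoken.get_encoding(encoding_name)

    def encode(self, text: str) -> List[int]:
        return self._encoding.encode(text)

    def decode(self, tokens: List[int]) -> str:
        return self._encoding.decode(tokens)
```

Because both wrappers expose the same `encode`/`decode` pair, callers such as `chunk_text` can treat them interchangeably through the `BaseTokenizer` interface.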