Welcome to konoha’s documentation!

Quick Start

Welcome to Konoha! Konoha is a library for text processing in Japanese. In Japanese, a sentence must first be split into a sequence of words, a step called tokenization. Many tools are available for tokenizing sentences, but each of them is used differently.

Konoha provides a unified interface to these tools. You can also try Konoha simply by running docker run on your computer.
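
For example, the same WordTokenizer interface can drive different backends. The snippet below is a minimal sketch that assumes the MeCab and Sudachi extras are installed; the sample sentence is only illustrative.

from konoha import WordTokenizer

sentence = "自然言語処理を勉強しています"

# Switch backends by name; the calling code stays the same.
mecab = WordTokenizer("MeCab")
print(mecab.tokenize(sentence))

sudachi = WordTokenizer("Sudachi", mode="A")
print(sudachi.tokenize(sentence))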

Installation

The latest Konoha supports Python 3.8 or newer.

We recommend installing via pip:

$ pip install konoha[all]

To install Konoha with specific tokenizers only, run:

$ pip install konoha[janome,kytea,mecab,sentencepiece,sudachi,nagisa]  # specify one or more of them

If you run pip install konoha, Konoha is installed with only the sentence splitter.

You can also install the development version of Konoha from the main branch of the Git repository:

$ pip install git+https://github.com/himkt/konoha.git

API Reference

Word Level Tokenizer Interface

class konoha.word_tokenizer.WordTokenizer(tokenizer: str = 'MeCab', user_dictionary_path: str | None = None, system_dictionary_path: str | None = None, model_path: str | None = None, mode: str | None = None, dictionary_format: str | None = None, endpoint: str | None = None, ssl: bool | None = None, port: int | None = None)
batch_tokenize(texts: List[str]) → List[List[Token]]

Tokenize input texts

tokenize(text: str) → List[Token]

Tokenize input text
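
A minimal sketch of both methods, assuming the default MeCab backend is installed; the sample sentences are illustrative.

from konoha import WordTokenizer

tokenizer = WordTokenizer("MeCab")

# A single text yields a list of Token objects.
tokens = tokenizer.tokenize("吾輩は猫である")

# A list of texts yields one token list per input text.
batches = tokenizer.batch_tokenize(["吾輩は猫である", "名前はまだ無い"])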

Sentence Level Tokenizer Interface

class konoha.sentence_tokenizer.SentenceTokenizer(period: str | None = None, patterns: List[Pattern[str]] | None = None)
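
A minimal usage sketch; the sample document and the alternative period symbol below are illustrative only.

from konoha import SentenceTokenizer

document = "私は猫である。名前はまだない。どこで生れたかとんと見当がつかぬ。"

# With default settings, sentences are split on the Japanese full stop.
splitter = SentenceTokenizer()
sentences = splitter.tokenize(document)

# The sentence-ending symbol can be overridden via the period argument.
splitter_fullwidth = SentenceTokenizer(period="．")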

Word Tokenizer Implementations

Base Word Tokenizer

class konoha.word_tokenizers.tokenizer.BaseTokenizer(name: str)

Base class for word level tokenizer

property name: str

Return name of tokenizer

abstract tokenize(text: str) → List[Token]

Abstract method for tokenization
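
A sketch of a custom tokenizer built on this base class; the class name and the hyphen-splitting rule are purely illustrative.

from typing import List

from konoha.data.token import Token
from konoha.word_tokenizers.tokenizer import BaseTokenizer


class HyphenTokenizer(BaseTokenizer):
    # Hypothetical example: split the input text on hyphens.
    def __init__(self) -> None:
        super().__init__(name="hyphen")

    def tokenize(self, text: str) -> List[Token]:
        return [Token(surface=surface) for surface in text.split("-")]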

Character Tokenizer

class konoha.word_tokenizers.character_tokenizer.CharacterTokenizer
tokenize(text: str)

Abstract method for tokenization

MeCab Tokenizer

class konoha.word_tokenizers.mecab_tokenizer.MeCabTokenizer(user_dictionary_path: str | None = None, system_dictionary_path: str | None = None, dictionary_format: str | None = None)
tokenize(text: str) → List[Token]

Abstract method for tokenization

KyTea Tokenizer

class konoha.word_tokenizers.kytea_tokenizer.KyTeaTokenizer(model_path: str | None = None)
tokenize(text: str) → List[Token]

Abstract method for tokenization

Sentencepiece Tokenizer

class konoha.word_tokenizers.sentencepiece_tokenizer.SentencepieceTokenizer(model_path: str)
tokenize(text: str) → List[Token]

Abstract method for tokenization
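
Model-based backends such as Sentencepiece are typically constructed through WordTokenizer with a model path; in the sketch below the path is a hypothetical placeholder.

from konoha import WordTokenizer

# "data/model.spm" is a placeholder path to a trained SentencePiece model.
tokenizer = WordTokenizer("Sentencepiece", model_path="data/model.spm")
tokens = tokenizer.tokenize("自然言語処理")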

Sudachi Tokenizer

class konoha.word_tokenizers.sudachi_tokenizer.SudachiTokenizer(mode: str)
tokenize(text: str) → List[Token]

Abstract method for tokenization

Janome Tokenizer

class konoha.word_tokenizers.janome_tokenizer.JanomeTokenizer(user_dictionary_path: str | None = None)
tokenize(text: str) → List[Token]

Abstract method for tokenization

nagisa Tokenizer

class konoha.word_tokenizers.nagisa_tokenizer.NagisaTokenizer
tokenize(text: str) → List[Token]

Abstract method for tokenization

Whitespace Tokenizer

class konoha.word_tokenizers.whitespace_tokenizer.WhitespaceTokenizer

Simple rule-based word tokenizer.

tokenize(text: str) → List[Token]

Abstract method for tokenization

Token

(Deprecated)

Data classes

Token

class konoha.data.token.Token(surface: str, postag: str | None = None, postag2: str | None = None, postag3: str | None = None, postag4: str | None = None, inflection: str | None = None, conjugation: str | None = None, base_form: str | None = None, yomi: str | None = None, pron: str | None = None, normalized_form: str | None = None)

Token class for konoha.
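
A sketch of reading token attributes; which fields are populated depends on the backend tokenizer, so the output below is illustrative only.

from konoha import WordTokenizer

tokenizer = WordTokenizer("MeCab")

for token in tokenizer.tokenize("猫が好き"):
    # surface is always set; part-of-speech fields may be None
    # depending on the backend tokenizer.
    print(token.surface, token.postag)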

Resource

class konoha.data.resource.Resource(path: str | None)
download_from_s3(path: str) → str

Download file(s) from Amazon S3.

Server

TBD
