Welcome to konoha’s documentation!
Quick Start
Welcome to Konoha!
Konoha is a Python library for Japanese text processing.
Japanese is written without spaces between words, so a sentence must first be split into a sequence of words; this step is called tokenization.
There are many tools available for tokenizing Japanese sentences, and their usage differs from tool to tool.
Konoha provides a unified interface to use these tools.
You can try Konoha simply by running docker run on your computer.
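For example, the following starts the Konoha server in a container (a minimal sketch: the himkt/konoha image name and the 8000 port mapping are assumptions, so check the repository for the exact command):
$ docker run -p 8000:8000 himkt/konoha  # image name and port mapping are assumptions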
Installation
The latest Konoha supports Python 3.8 or newer.
We recommend installing it via pip:
$ pip install konoha[all]
To install Konoha with only specific tokenizers, run:
$ pip install konoha[janome,kytea,mecab,sentencepiece,sudachi,nagisa] # specify one or more of them
If you run pip install konoha, Konoha is installed with only the sentence splitter.
You can also install the development version of Konoha from the main branch of the Git repository:
$ pip install git+https://github.com/himkt/konoha.git
API Reference
Word Level Tokenizer Interface
- class konoha.word_tokenizer.WordTokenizer(tokenizer: str = 'MeCab', user_dictionary_path: str | None = None, system_dictionary_path: str | None = None, model_path: str | None = None, mode: str | None = None, dictionary_format: str | None = None, endpoint: str | None = None, ssl: bool | None = None, port: int | None = None)
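A minimal usage sketch (the printed tokens assume the MeCab backend with its default dictionary; your output may differ):
>>> from konoha import WordTokenizer
>>> tokenizer = WordTokenizer("MeCab")  # requires the MeCab backend to be installed
>>> tokenizer.tokenize("自然言語処理を勉強しています")
[自然, 言語, 処理, を, 勉強, し, て, い, ます]
Each element of the returned list is a Token (see the data classes below).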
Sentence Level Tokenizer Interface
- class konoha.sentence_tokenizer.SentenceTokenizer(period: str | None = None, patterns: List[Pattern[str]] | None = None)
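A minimal sketch of sentence splitting, which needs no tokenizer backend:
>>> from konoha import SentenceTokenizer
>>> tokenizer = SentenceTokenizer()
>>> tokenizer.tokenize("私は猫だ。名前なんてものはない。")
['私は猫だ。', '名前なんてものはない。']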
Word Tokenizer Implementations
Token
(Deprecated)
Data classes
Token
- class konoha.data.token.Token(surface: str, postag: str | None = None, postag2: str | None = None, postag3: str | None = None, postag4: str | None = None, inflection: str | None = None, conjugation: str | None = None, base_form: str | None = None, yomi: str | None = None, pron: str | None = None, normalized_form: str | None = None)
Token class for konoha.
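For illustration, a Token can also be constructed directly (a sketch: per the signature above, only surface is required and every other field defaults to None):
>>> from konoha.data.token import Token
>>> token = Token(surface="猫", postag="名詞")  # the postag value here is an illustrative example
>>> token.surface
'猫'
>>> token.postag
'名詞'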
Resource
Server
TBD