MTL Toolbox

MTL Toolbox (https://learntaiwanese.org/MTLtoolbox/about.html) is software and data to help people use written Taiwanese in Modern Literal Taiwanese (MLT) and other Latin-script writing systems.

Features

seven Taiwanese dictionaries spanning from the Japanese era to the present day
full-text search engine that accepts written Taiwanese, English, and Harnji
audio from government-compiled dictionary: Dictionary of Frequently-Used Taiwanese Taigi (DFT)
basic text segmentation (including "unjoining" into syllables, or syllabification) and "bag-of-syllables" search
Seven Tones of Taiwanese Soundboard: table of all MLT finals with examples

How to perform a full-text search

Steps to do a full-text search on our set of Taiwanese dictionaries:

Visit "Taiwanese–English dictionaries full-text search" [1].
Select or deselect which dictionaries to search using the checkboxes. By default, all seven dictionaries are included, as well as "DFT_lk", which is the database table for DFT examples.
Input search terms to define your search. Typical inputs include English terms, MLT-style syllables (citation tone, before tone sandhi), and number of syllables. Feel free to try any other terms that would help narrow down your search. For example:
- You searched for apples but didn't get many results. Instead, search for apple* to get results containing any word that starts with "apple".
- If you want only results starting with the MLT consonant "ph", add ph* to the input: apple* ph*
In some cases, it is better to specify a column for a term, especially if the term could match in multiple columns. To specify the column to search against, type the column name, then a ":" character, then the term. For example:
- Suppose your search term is "too", which is a valid English word but is also a valid MLT syllable. If you meant the English word, type en:too. If you meant the MLT syllable, type u:too ("u" stands for "unjoined").
- Suppose you want results that contain the very common MLT syllable "lie", which is also a valid word in English. Again, specify the "u" column: u:lie. However, the number of results is still large. If you want only four-syllable results, add ns:4. The entire search input is then u:lie ns:4
- See #Technical notes for more details.
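The search inputs above can also be assembled programmatically. The following is a minimal sketch, not part of MTL Toolbox: the function name build_search_input and its parameters are hypothetical, and only the column prefixes en:, u: and ns: come from the steps above.

```python
def build_search_input(terms=(), en=(), u=(), ns=None):
    """Compose a search string in the column:term form described above.

    terms: tokens matched against any column (may end in '*' for a prefix)
    en:    tokens restricted to the English column
    u:     tokens restricted to the "unjoined" MLT syllable column
    ns:    number of syllables, or None to leave it unconstrained
    """
    parts = list(terms)
    parts += [f"en:{t}" for t in en]
    parts += [f"u:{t}" for t in u]
    if ns is not None:
        parts.append(f"ns:{ns}")
    return " ".join(parts)


print(build_search_input(terms=["apple*", "ph*"]))  # apple* ph*
print(build_search_input(en=["too"]))               # en:too
print(build_search_input(u=["lie"], ns=4))          # u:lie ns:4
```

The resulting string is what you would type into the search box; the helper only joins tokens with spaces and does no validation.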

How to use the MLT segmenter

This section describes how to use "Taiwanese–English dictionaries: MLT segmenter & full-text search" [2]. You might want to use this tool if you want to search the dictionaries using MLT but don't want to figure out the citation-tone syllables yourself. As most Taiwanese words written in MLT do not have a hyphen or other character to delimit syllable boundaries, computing the citation-tone syllables is a non-trivial process. The MLT segmenter lowers the barrier between MLT and the dictionaries, which have not been completely converted to MLT.

To get started, enter your MLT word, then press "Go". The segmenter will process your input. If successful, it will also run a search. Results from HTB and DFT are displayed, and results from other dictionaries are summarized and linked to.

Typical usage

Input: Taiwanese word (typically disyllable: two syllables joined by tone sandhi)
- Example: køefcie (if ø is not convenient, use 0 (zero) instead: k0efcie)
Press return or tap "Go"
- your input is "unjoined" (original syllables found by database lookup)
  - in this example, the original syllables are: køea and cie
- search is performed using the unordered collection of syllables (we refer to this as "bag-of-syllables" (BOS); see bag-of-words model)
- confirm that the results are basically the same (except for HTB which is not unjoined) as if you had run a full-text search for either: køea cie, or: cie køea
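The unjoin-then-search flow above can be sketched as follows. This is an illustration under stated assumptions, not the segmenter's actual code: UNJOIN_TABLE is a hypothetical stand-in for the database lookup and holds only the køefcie example from the text, and all function names are invented for this sketch.

```python
from collections import Counter

# Hypothetical stand-in for the segmenter's database lookup:
# joined MLT word -> citation-tone syllables. Only the example
# from the text is filled in.
UNJOIN_TABLE = {"køefcie": ("køea", "cie")}


def normalize(word):
    """Treat 0 (zero) as a typing substitute for ø."""
    return word.replace("0", "ø")


def bag_of_syllables(word):
    """Return the citation-tone syllables as an unordered multiset."""
    return Counter(UNJOIN_TABLE[normalize(word)])


def bos_search_input(word):
    """Build a full-text search string; syllable order is irrelevant."""
    return " ".join(sorted(bag_of_syllables(word).elements()))


print(bos_search_input("k0efcie"))  # cie køea
print(bag_of_syllables("køefcie") == Counter(["cie", "køea"]))  # True
```

Because the bag is unordered, "køea cie" and "cie køea" describe the same search, which is why the two full-text searches mentioned above return basically the same results.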

Try more examples:
- chviafmng
- tøsia
- Taioaan

Monosyllable

For a monosyllable, exact matches are displayed by default. Here are some examples from Practical Taiwanese Conversation:
- goar
- lie
- ee

If the syllable is a DFT monosyllable, a navigation bar displays adjacent DFT monosyllables in alphabetical order.
Due to the high number of matches, "monosyllable mode" displays only monosyllable search results. To see all matching results, click on 🔍.

Other columns

The "en:" button will prepend "en:" to your search, causing the following token to be matched against the English column.
We haven't implemented Chinese text segmentation, which is non-trivial (see Chinese word-segmented writing).
DFT now supports using individual Harnji as tokens. For example, 對頭. The h_j column is the Harnji (hj) column with spaces between characters.
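To illustrate why a space-separated Harnji column is useful, here is a small sketch using Python's built-in sqlite3 module. It assumes an SQLite build with FTS4 enabled; the table name toy and the helper space_out are hypothetical, and the table holds only the two columns discussed here rather than the real DFT schema.

```python
import sqlite3


def space_out(hj):
    """Insert a space between characters, e.g. '對頭' -> '對 頭'."""
    return " ".join(hj)


conn = sqlite3.connect(":memory:")
# Toy table with only the two columns discussed; not the real schema.
conn.execute("CREATE VIRTUAL TABLE toy USING fts4(hj, h_j)")
conn.execute(
    "INSERT INTO toy (hj, h_j) VALUES (?, ?)",
    ("對頭", space_out("對頭")),
)


def count(query):
    """Number of rows matching an FTS query string."""
    sql = "SELECT count(*) FROM toy WHERE toy MATCH ?"
    return conn.execute(sql, (query,)).fetchone()[0]


print(count("hj:頭"))   # 0: '對頭' is a single token in hj
print(count("h_j:頭"))  # 1: '頭' is its own token in h_j
```

The default tokenizer treats a run of Harnji as one token, so a single character is only searchable on its own once the characters are separated by spaces.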

Data

Local copies of:

HTB: Hiexntai-buun Dictionary
DFT: Dictionary of Frequently-Used Taiwanese Taigi (in TL. We added MLT annotations and annotated over 5800 definitions in English for monosyllables)
MK: Maryknoll Taiwanese–English Dictionary (in POJ. We added MLT annotations)
EDUTECH: Liim Keahioong (2001-2003) EDUTECH: Taiwanese-English Dictionary Searched with Concise Atonal Spelling (in MLT with unified spellings (øe))
Embree, Bernard L. M. (1973). A Dictionary of Southern Min: based on current usage in Taiwan and checked against the earlier works of Carstairs Douglas, Thomas Barclay, and Ernest Tipson. Hong Kong: Hong Kong Language Institute. (in POJ. We added MLT annotations)
TDJ: Tai-Nichi Daijiten (original 1931 & 1932, in Taiwanese kana. Lim08 version: definitions translated into Taiwanese (Han-Romanization mixed script - POJ). We added MLT annotations)

Note: The M-fields of DFT and MK are largely machine-generated ("auto-joined") and do not attempt to indicate prescriptive spellings. In some cases, the common or recommended spelling may be different from what is shown. Often, the difference is about apostrophes and/or hyphens.

We also support searching other websites with conversion to POJ/TL:

Lim (2019): updated version of TDJ-Lim08 above
Taiwanese - Chinese Dictionary (currently not open to public)

Technical notes

Our full-text search is provided by the SQLite FTS4 extension. We currently use the Standard Query Syntax. One of the three basic query types supported by FTS tables is "token or token prefix queries":

Specify a token prefix by appending an asterisk ('*') to the prefix. (While this resembles the wildcard character in operating systems, general wildcard search is not currently supported by FTS.)
- Example: Taioa*, 臺*
Specify a column-name followed by a colon (':')
- Example: hj:頭* (returns entries where Taiwanese written with Harnji begins with character for thaau)
Prefix the token with a caret ('^') to require the token to be the very first token in its column
- Example: ^thaau
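The three query forms above can be tried on a toy FTS4 table with Python's built-in sqlite3 module. This is a sketch under assumptions: it needs an SQLite build with FTS4 enabled, the table name toy and its three-column layout are hypothetical, and the rows are illustrative toy data (the third row's u value is an invented placeholder, not a dictionary entry).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: u = unjoined MLT, hj = Harnji, en = English.
conn.execute("CREATE VIRTUAL TABLE toy USING fts4(u, hj, en)")
conn.executemany(
    "INSERT INTO toy (u, hj, en) VALUES (?, ?, ?)",
    [
        ("Taioaan", "臺灣", "Taiwan"),
        ("thaau", "頭", "head"),
        ("placeholder thaau", "對頭", "toy row"),  # invented placeholder
    ],
)


def search(query):
    """Return the u column of every row matching the FTS query."""
    sql = "SELECT u FROM toy WHERE toy MATCH ? ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, (query,))]


print(search("Taioa*"))   # ['Taioaan']  (token prefix)
print(search("hj:頭*"))   # ['thaau']  (prefix within the hj column)
print(search("thaau"))    # ['thaau', 'placeholder thaau']
print(search("^thaau"))   # ['thaau']  (must be first token in its column)
```

Note that hj:頭* does not return the third row, because its hj value is the single token 對頭, which begins with a different character.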

Tokenizer: the default tokenizer ("simple") is used. Because it only does case folding of ASCII characters, Ø and ø do not match each other. Some words starting with Ø include Ørciw, Ørmngg, and Ørtøexli.
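The effect of ASCII-only case folding can be imitated in a few lines of Python. This is only an imitation for illustration: ascii_fold and same_token are hypothetical helpers, not SQLite functions, and they model just the case-folding behavior described above.

```python
def ascii_fold(text):
    """Lowercase A-Z only, imitating ASCII-only case folding."""
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text
    )


def same_token(a, b):
    """True if two terms are equal after ASCII-only folding."""
    return ascii_fold(a) == ascii_fold(b)


print(same_token("Taioaan", "taioaan"))  # True
print(same_token("Ørciw", "ørciw"))      # False
print("Ørciw".lower() == "ørciw")        # True with full Unicode lowercasing
```

In practice this means a word stored with a capital Ø has to be typed with a capital Ø to match.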

Acknowledgements

The MTL Toolbox uses data from the Maryknoll Taiwanese–English Dictionary, which was generously released to the public under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.
