MTL Toolbox

From Taioaan Wiki

MTL Toolbox (https://learntaiwanese.org/MTLtoolbox/about.html): Modern Taiwanese Language Toolbox. Software and data to help people use written Taiwanese in Modern Literal Taiwanese (MLT) and other Latin-script writing systems.

Features

  • six Taiwanese dictionaries spanning from the Japanese era to the present day
  • full-text search engine that accepts written Taiwanese, English, and Harnji
  • audio from government-compiled dictionary: DFT
  • basic text segmentation (including "unjoining" into syllables) and "bag-of-syllables" search
  • Seven Tones soundboard: table of all MLT finals with examples

How to search using the segmenter

We describe how to use "Taiwanese–English dictionaries: MLT segmenter & full-text search" [1]. This interface is mainly for Taiwanese words written in MLT, which we refer to as "M-style" written Taiwanese. After entering M-input, press "Zhøe" to run the segmenter and search. Results from HTB and DFT are displayed, and results from other dictionaries are summarized and linked to.

Typical usage

  • Input: a Taiwanese word (typically a disyllable: two syllables joined by tone sandhi)
    • Example: køefcie (copy and paste, or substitute 0 for ø: k0efcie)
  • Press return or tap "Zhøe" (means "search")
    • your input is "unjoined" (original syllables found by database lookup)
      • in this example, the original syllables are: køea and cie
    • search is performed using the unordered collection of syllables, which we refer to as a "bag-of-syllables" (BOS); see bag-of-words model
    • confirm the results are the same as for input: (except for HTB which is not unjoined)
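
The unjoin and bag-of-syllables steps above can be sketched in Python. This is only an illustration, not the toolbox's actual code: the lookup table, the function names, and the rule that an entry matches when it contains every query syllable are all assumptions. Only the køefcie example comes from the text above.

```python
from collections import Counter

# Hypothetical lookup table: joined (tone-sandhi) spelling -> original syllables.
# Only the køefcie entry is taken from the example above.
UNJOIN_TABLE = {"køefcie": ["køea", "cie"]}


def normalize(text):
    """Accept 0 as a keyboard-friendly stand-in for ø."""
    return text.strip().replace("0", "ø")


def unjoin(word):
    """Return the original syllables, or the word itself if it is not in the table."""
    word = normalize(word)
    return UNJOIN_TABLE.get(word, [word])


def bos_match(query_syllables, entry_syllables):
    """Assumed matching rule: the entry holds every query syllable, in any order."""
    query_bag = Counter(query_syllables)
    entry_bag = Counter(entry_syllables)
    return all(entry_bag[syl] >= n for syl, n in query_bag.items())


syllables = unjoin("k0efcie")
print(syllables)
print(bos_match(syllables, ["cie", "køea"]))
```

Because the comparison uses bags rather than sequences, an entry listing the syllables in the opposite order still matches.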

Monosyllable

  • if the syllable is a DFT monosyllable, a navigation bar displays adjacent DFT monosyllables in alphabetical order
  • due to the high number of matches, "monosyllable mode" returns only monosyllable search results. To see all matching results, click "Khahzøe"

Other fields

  • The "en" button directs the search to the English field (en). Harnji (hj) can also be input, which can be useful with DFT. Beyond this, we do not attempt Chinese text segmentation, which is non-trivial (see Chinese word-segmented writing).

How to search the dictionary set (without segmenter)

How to search our set of Taiwanese dictionaries using "Taiwanese-English dictionaries full-text search" [2]:

  • You may select which dictionaries to search using the checkboxes. By default, all seven dictionaries are included, as well as "DFT_lk", which are examples for DFT entries.
  • Input search terms to define your search. Typical inputs include English terms, M-style syllables (original without tone sandhi), and the number of syllables. Feel free to try any other terms that would help narrow down your search.
  • In some cases, it is better to specify a column for a term, especially if the term could match in multiple columns. To specify the column to search against, write the column name, then a ":" character, then the term.
    • For example, if you want only monosyllable results, use ns:1. Likewise, if you know your result should be three syllables, use ns:3
    • Suppose your search term is "too", which is a valid English word but is also a valid MLT syllable. If you want to match only the English column, type en:too. If you want to match only M-style syllables, type u:too ("u" stands for "unjoined").
    • See #Technical notes for more details.
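
The column-qualified searches above can be tried against a small in-memory FTS4 table using Python's built-in sqlite3 module. This is a minimal sketch: the table name `entries`, the three-column layout, and the two rows are invented placeholders, not the toolbox's real schema or data. It also assumes the local SQLite build includes the FTS4 extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: English gloss, unjoined M-style syllables, syllable count.
conn.execute("CREATE VIRTUAL TABLE entries USING fts4(en, u, ns)")
conn.executemany(
    "INSERT INTO entries (en, u, ns) VALUES (?, ?, ?)",
    [
        ("too much", "xx yy", "2"),   # placeholder row: "too" appears in the English column
        ("placeholder", "too", "1"),  # placeholder row: "too" appears as a syllable
    ],
)


def search(query):
    """Return the rowids of entries matching an FTS query string."""
    sql = "SELECT rowid FROM entries WHERE entries MATCH ? ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, (query,))]


print(search("too"))     # term may match in any column
print(search("en:too"))  # English column only
print(search("u:too"))   # unjoined syllables only
print(search("ns:1"))    # monosyllable entries only
```

Terms separated by spaces are combined with an implicit AND in the Standard Query Syntax, so a query such as `ns:1 u:too` narrows the result to rows satisfying both conditions.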

Data

Local copies of:

Note: The M-fields of DFT and MK are largely machine-generated ("auto-joined") and do not attempt to indicate prescriptive spellings. In some cases, the common or recommended spelling may be different from what is shown.

We also support searching other websites with conversion to POJ/TL:

Technical notes

Our full-text search is provided by the SQLite FTS4 extension. We currently use the Standard Query Syntax. One of the three basic query types supported by FTS tables is "token or token prefix queries":

  • Specify a token prefix by appending an asterisk ('*') to the prefix. (Although this resembles the wildcard character in operating systems, general wildcard search is not currently supported by FTS.)
  • Restrict a token to one column by preceding it with the column name and a colon (':')
    • Example: hj:頭* (returns entries where Taiwanese written with Harnji begins with character for thaau)
  • Prefix the token with a caret ('^') to require it to be the very first token in its column
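
The prefix and first-token queries above can be sketched as follows. The table name `entries`, its two columns, and the three rows are illustrative assumptions rather than the toolbox's real schema or data, and the example assumes a SQLite build with FTS4 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical two-column table: Harnji headword and English gloss.
conn.execute("CREATE VIRTUAL TABLE entries USING fts4(hj, en)")
conn.executemany(
    "INSERT INTO entries (hj, en) VALUES (?, ?)",
    [("頭家", "boss"), ("石頭", "a stone"), ("頭殼", "head")],
)


def search(query):
    """Return the rowids of entries matching an FTS query string."""
    sql = "SELECT rowid FROM entries WHERE entries MATCH ? ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, (query,))]


print(search("hj:頭*"))  # Harnji tokens that begin with 頭
print(search("stone"))   # "stone" anywhere in any column
print(search("^stone"))  # "stone" only when it is the first token of a column
print(search("^boss"))
```

With the default tokenizer, an unspaced run of Harnji forms a single token, which is why 石頭 is not returned by the prefix query even though it contains 頭.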

Tokenizer: the default tokenizer ("simple") is used. Because it only does case folding of ASCII characters, Ø and ø do not match each other. Some words starting with Ø include Ørciw, Ørmngg, and Ørtøexli.
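
The case-folding behavior can be checked with a short experiment. This sketch assumes a SQLite build with FTS4 enabled; the table name `words` and column name `m` are made up for the demonstration, while the two sample words are taken from this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Single-column table that names the default "simple" tokenizer explicitly.
conn.execute("CREATE VIRTUAL TABLE words USING fts4(m, tokenize=simple)")
conn.executemany("INSERT INTO words (m) VALUES (?)", [("Ørciw",), ("Zhøe",)])


def count_matches(query):
    """Count the rows whose text matches the FTS query string."""
    sql = "SELECT count(*) FROM words WHERE words MATCH ?"
    return conn.execute(sql, (query,)).fetchone()[0]


print(count_matches("zhøe"))   # ASCII Z is folded to z, so this matches "Zhøe"
print(count_matches("Ørciw"))  # exact spelling matches
print(count_matches("ørciw"))  # ø is not treated as the lowercase of Ø: no match
```

In practice this means a word that begins with Ø has to be typed with the capital letter to be found.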

See also

Acknowledgements