m (→Data) |
m (il) |
||
(15 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
'''MTL Toolbox''' (https://learntaiwanese.org/MTLtoolbox/about.html): Modern Taiwanese Language Toolbox. Software and data to help people use written Taiwanese in [[Modern Literal Taiwanese]] (MLT) and other Latin-script writing systems. | '''MTL Toolbox''' (https://learntaiwanese.org/MTLtoolbox/about.html): Modern Taiwanese Language Toolbox. Software and data to help people use [[written Taiwanese]] in [[Modern Literal Taiwanese]] (MLT) and other [[Latin script|Latin-script]] writing systems. | ||
== Features == | == Features == | ||
Line 9: | Line 9: | ||
== How to search using the segmenter == | == How to search using the segmenter == | ||
We describe how to use "Taiwanese–English dictionaries: MLT segmenter & full-text search". This interface is mainly for Taiwanese words written in MLT, which we refer to as "M-style" written Taiwanese. After entering M-input, press "Zhøe" to run the segmenter and search. Results from HTB and DFT are displayed, and results from other dictionaries are summarized and linked to. | We describe how to use "Taiwanese–English dictionaries: MLT segmenter & full-text search" {{x|}}. This interface is mainly for Taiwanese words written in MLT, which we refer to as "M-style" written Taiwanese. After entering M-input, press "Zhøe" to run the segmenter and search. Results from HTB and DFT are displayed, and results from other dictionaries are summarized and linked to. | ||
=== Typical usage === | === Typical usage === | ||
* Input: Taiwanese word (typically disyllable: two syllables joined by [[tone sandhi]]) | * Input: Taiwanese word (typically disyllable: two syllables joined by [[tone sandhi]]) | ||
** Example: køefcie (copy and paste | ** Example: køefcie (copy and paste, or substitute 0 for [[ø]]: {{x|k0efcie}}) | ||
* Press return or tap "Zhøe" (means "search") | * Press return or tap "Zhøe" (means "search") | ||
** your input is "unjoined" (original syllables found by database lookup) | ** your input is "unjoined" (original syllables found by database lookup) | ||
*** in this example, the original syllables are: {{x|køea}} and {{x|cie}} | *** in this example, the original syllables are: {{x|køea}} and {{x|cie}} | ||
** search is | ** search is performed using the unordered collection of syllables, which we refer to as a "bag of syllables" (BOS); see {{w|bag-of-words model}} | ||
** confirm the results are the same as for input: (except for HTB which is not unjoined) | ** confirm the results are the same as for input: (except for HTB which is not unjoined) | ||
Line 38: | Line 38: | ||
== How to search the dictionary set (without segmenter) == | == How to search the dictionary set (without segmenter) == | ||
How to search our set of Taiwanese dictionaries using "Taiwanese-English dictionaries full-text search" {{TE|}}: | |||
* You may select which dictionaries to search using the checkboxes. By default, all seven dictionaries are included, as well as "DFT_lk", which contains examples for DFT entries. | |||
* Input search terms to define your search. Typical inputs include English terms, M-style syllables (original without tone sandhi), and the number of syllables. Feel free to try any other terms that would help narrow down your search. | |||
* In some cases, it is better to specify a column for a term, especially if the term could match in multiple columns. To specify the column to search against, write the column name, then a ":" character, then the term. | |||
** For example, if you want only monosyllable results, use <code>ns:1</code>. Likewise, if you know your result should be three syllables, use <code>ns:3</code>. | |||
** Suppose your search term is "too", which is a valid English word but is also a valid MLT syllable. If you want to match only the English column, type <code>en:too</code>. If you want to match only M-style syllables, type <code>u:too</code> ("u" stands for "unjoined"). | |||
** See [[#Technical notes]] for more details. | |||
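The column-qualified terms above correspond to SQLite full-text column filters. The following is a minimal sketch, not the site's actual setup: the table name <code>entries</code>, its schema, and the rows are hypothetical placeholders, and only the column names <code>en</code>, <code>u</code>, and <code>ns</code> are taken from the examples above. It assumes a Python build whose SQLite includes the FTS4 extension.

```python
import sqlite3

# Hypothetical in-memory table; the real database schema is not shown here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE entries USING fts4(en, u, ns)")

# Placeholder rows (not real dictionary data): "too" appears once as an
# English word and twice as an unjoined M-style syllable.
rows = [
    ("me too", "aa bb", "2"),
    ("placeholder gloss", "too", "1"),
    ("another gloss", "too cc dd", "3"),
]
conn.executemany("INSERT INTO entries(en, u, ns) VALUES (?, ?, ?)", rows)


def search(query):
    """Return the rowids of all entries matching an FTS query string."""
    cur = conn.execute(
        "SELECT rowid FROM entries WHERE entries MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in cur]


print(search("too"))         # unqualified: may match in any column
print(search("en:too"))      # English column only
print(search("u:too"))       # unjoined-syllable column only
print(search("u:too ns:1"))  # both terms must match (implicit AND)
```

In this sketch the unqualified term matches all three rows, while each column-qualified term narrows the result to the rows where that column contains the token.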
== Data == | == Data == | ||
Line 49: | Line 56: | ||
* TDJ: ''[[Tai-Nichi Daijiten]]'' (original 1931 & 1932, in [[Taioaan-guo kana|Taiwanese kana]]. Lim08 version: definitions translated into Taiwanese (Han-Romanization mixed script - POJ). We added MLT annotations) | * TDJ: ''[[Tai-Nichi Daijiten]]'' (original 1931 & 1932, in [[Taioaan-guo kana|Taiwanese kana]]. Lim08 version: definitions translated into Taiwanese (Han-Romanization mixed script - POJ). We added MLT annotations) | ||
The M-fields | Note: The M-fields of DFT and MK are largely machine-generated ("auto-joined") and do not attempt to indicate prescriptive spellings. In some cases, the common or recommended spelling may be different from what is shown. Often, the difference is about [[apostrophe]]s and/or [[hyphen]]s. | ||
We also support searching other websites with conversion to POJ/TL: | We also support searching other websites with conversion to POJ/TL: | ||
Line 56: | Line 63: | ||
== Technical notes == | == Technical notes == | ||
Our full-text search is provided by the [[SQLite]] [https://sqlite.org/fts3.html FTS4] extension. We currently use the Standard Query Syntax. One of the three basic query types supported by FTS tables is "token or token prefix queries": | |||
* | * Specify a token prefix by appending an asterisk ('*') to the prefix. (Although this resembles a {{w|wildcard character}} in [[zokgiap hexthorng|operating systems]], wildcard search is not currently supported by FTS.) | ||
** Example: {{TE|Taioa*}}, {{TE|臺*}} | ** Example: {{TE|Taioa*}}, {{TE|臺*}} | ||
* Specify a column-name followed by a colon (':') | * Specify a column-name followed by a colon (':') | ||
** Example: {{TE|hj:頭*}} (returns entries where Taiwanese written with Harnji begins with character for [[thaau]]) | ** Example: {{TE|hj:頭*}} (returns entries where Taiwanese written with Harnji begins with character for [[thaau]]) | ||
* | * Prefix the token with a caret ('^') to require the token to be the very first token in its column | ||
** Example: {{TE|^thaau}} | ** Example: {{TE|^thaau}} | ||
Tokenizer: the default tokenizer ("simple") is used. Because it only does case folding of [[ASCII]] characters, [[Ø]] and ø do not match each other. Some words starting with Ø include {{TE|Ørciw}}, {{TE|Ørmngg}}, and {{TE|Ørtøexli}}. | |||
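The query forms and the tokenizer behavior described above can be tried against a small throwaway table. This is a sketch under stated assumptions: the table name <code>entries</code> and the column name <code>m</code> are hypothetical (only <code>hj</code> appears in the examples above), the rows are illustrative rather than taken from the dictionaries, and the Python build is assumed to include the FTS4 extension.

```python
import sqlite3

# Hypothetical FTS4 table using the default "simple" tokenizer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE entries USING fts4(m, hj)")

# Illustrative rows only; "aa" is a placeholder syllable.
rows = [
    ("thaau", "頭"),
    ("Taioaan", "臺灣"),
    ("Ørciw", ""),
    ("aa thaau", ""),
]
conn.executemany("INSERT INTO entries(m, hj) VALUES (?, ?)", rows)


def search(query):
    """Return the rowids of all entries matching an FTS query string."""
    cur = conn.execute(
        "SELECT rowid FROM entries WHERE entries MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in cur]


print(search("Taioa*"))  # token prefix query
print(search("臺*"))      # prefix query on a non-ASCII token
print(search("hj:頭*"))   # prefix query restricted to the hj column
print(search("thaau"))   # token anywhere in any column
print(search("^thaau"))  # token must be the first token in its column
print(search("Ørciw"))   # exact case of the non-ASCII letter matches
print(search("ørciw"))   # lowercase ø is not folded to Ø, so no match
```

The last two queries illustrate the case-folding limitation: ASCII letters are folded, so <code>TAIOA*</code> would behave like <code>Taioa*</code>, but Ø and ø are compared as distinct characters.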
== See also == | == See also == |