MTL Toolbox: Difference between revisions

1,410 bytes added ,  Yesterday at 08:37
m (il)
 
(14 intermediate revisions by the same user not shown)
Line 1: Line 1:
'''MTL Toolbox''' (https://learntaiwanese.org/MTLtoolbox/about.html): Modern Taiwanese Language Toolbox. Software and data to help people use written Taiwanese in [[Modern Literal Taiwanese]] (MLT) and other Latin-script writing systems.
'''MTL Toolbox''' (https://learntaiwanese.org/MTLtoolbox/about.html): Modern Taiwanese Language Toolbox. Software and data to help people use [[written Taiwanese]] in [[Modern Literal Taiwanese]] (MLT) and other [[Latin script|Latin-script]] writing systems.


== Features ==
Line 9: Line 9:


== How to search using the segmenter ==
We describe how to use "Taiwanese–English dictionaries: MLT segmenter & full-text search". This interface is mainly for Taiwanese words written in MLT, which we refer to as "M-style" written Taiwanese. After entering M-input, press "Zhøe" to run the segmenter and search. Results from HTB and DFT are displayed, and results from other dictionaries are summarized and linked to.  
We describe how to use "Taiwanese–English dictionaries: MLT segmenter & full-text search" {{x|}}. This interface is mainly for Taiwanese words written in MLT, which we refer to as "M-style" written Taiwanese. After entering M-input, press "Zhøe" to run the segmenter and search. Results from HTB and DFT are displayed, and results from other dictionaries are summarized and linked to.  


=== Typical usage ===
* Input: Taiwanese word (typically disyllable: two syllables joined by [[tone sandhi]])
** Example: køefcie (copy and paste into this link {{x|}}; you may substitute 0 for [[ø]]: {{x|k0efcie}})
** Example: køefcie (copy and paste, or substitute 0 for [[ø]]: {{x|k0efcie}})
* Press return or tap "Zhøe" (means "search")
** your input is "unjoined" (original syllables found by database lookup)
*** in this example, the original syllables are: {{x|køea}} and {{x|cie}}
** search is done using original syllables (unordered collection of syllables, or "bag-of-syllables" (BOS))
** search is performed using the unordered collection of syllables (we refer to this as "bag-of-syllables" (BOS); see {{w|bag-of-words model}})
** confirm the results are the same as for input:  (except for HTB which is not unjoined)
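
The unjoin-then-search steps above can be sketched in Python. This is only an illustration under stated assumptions: <code>UNJOIN_TABLE</code> is a hypothetical stand-in for the database lookup (only the køefcie example comes from this page), and the matching rule (an entry matches when it contains every query syllable, in any order) is an assumed reading of "bag-of-syllables", not the site's actual implementation.

```python
from collections import Counter

# Hypothetical stand-in for the database lookup that "unjoins" a word.
# Only the køefcie example is taken from this page.
UNJOIN_TABLE = {"køefcie": ["køea", "cie"]}


def normalize(text):
    """Accept 0 as a keyboard-friendly substitute for ø.

    Assumption: the input contains no digits that should stay digits.
    """
    return text.replace("0", "ø")


def unjoin(word):
    """Return the original syllables of a joined word, if the table knows it."""
    word = normalize(word)
    return UNJOIN_TABLE.get(word, [word])


def bos_match(query_syllables, entry_syllables):
    """True when the entry contains every query syllable, ignoring order."""
    missing = Counter(query_syllables) - Counter(entry_syllables)
    return not missing


syllables = unjoin("k0efcie")
print(syllables)                               # ['køea', 'cie']
print(bos_match(syllables, ["cie", "køea"]))   # True: order is ignored
print(bos_match(syllables, ["køea"]))          # False: "cie" is missing
```

Because the comparison ignores order, entering the original syllables directly should give the same matches as entering the joined word.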


Line 38: Line 38:


== How to search the dictionary set (without segmenter) ==
Simply go to {{TE|}} and input the original syllables (M-style but without tone sandhi), and any other terms that would help narrow down your search. You may select or deselect dictionaries, or construct more detailed searches using specified columns. See [[#Technical notes]] for more details.
How to search our set of Taiwanese dictionaries using "Taiwanese-English dictionaries full-text search" {{TE|}}:
 
* You may select which dictionaries to search using the checkboxes. By default, all seven dictionaries are included, as well as "DFT_lk", which contains examples for DFT entries.
* Input search terms to define your search. Typical inputs include English terms, M-style syllables (original without tone sandhi), and the number of syllables. Feel free to try any other terms that would help narrow down your search.  
* In some cases, it is better to specify a column for a term, especially if it could match in multiple columns. To specify the column to search against, write the column name, then a colon (':'), then the term.
** For example, if you want only monosyllable results, use <code>ns:1</code>. Likewise, if you know your result should be three syllables, use <code>ns:3</code>.
** Suppose your search term is "too", which is a valid English word but is also a valid MLT syllable. If you want to match only the English column, type <code>en:too</code>. If you want to match only M-style syllables, type <code>u:too</code> ("u" stands for "unjoined").
** See [[#Technical notes]] for more details.
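
The column-qualified searches above can be tried locally with Python's built-in <code>sqlite3</code> module, assuming the local SQLite build includes the FTS4 extension. The table name <code>entries</code> and the rows are made up for illustration; only the column names <code>en</code>, <code>u</code>, and <code>ns</code> come from this page, and the real database layout may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: only the column names en, u, and ns come from this page.
conn.execute("CREATE VIRTUAL TABLE entries USING fts4(en, u, ns)")
# Made-up rows for illustration; they are not real dictionary entries.
conn.executemany(
    "INSERT INTO entries (en, u, ns) VALUES (?, ?, ?)",
    [
        ("too much", "aa bb", "2"),        # "too" appears in the English column
        ("placeholder", "too", "1"),       # "too" appears as an M-style syllable
        ("placeholder", "køea cie", "2"),
    ],
)


def search(query):
    """Return the rowids of entries matching an FTS4 query string."""
    rows = conn.execute(
        "SELECT rowid FROM entries WHERE entries MATCH ? ORDER BY rowid",
        (query,),
    )
    return [rowid for (rowid,) in rows]


print(search("too"))         # [1, 2]: matches in any column
print(search("en:too"))      # [1]: English column only
print(search("u:too"))       # [2]: unjoined syllables only
print(search("ns:1"))        # [2]: monosyllable entries only
print(search("u:cie ns:2"))  # [3]: terms are combined with an implicit AND
```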


== Data ==
Line 49: Line 56:
* TDJ: ''[[Tai-Nichi Daijiten]]'' (original 1931 & 1932, in [[Taioaan-guo kana|Taiwanese kana]]. Lim08 version: definitions translated into Taiwanese (Han-Romanization mixed script - POJ). We added MLT annotations)


The M-fields we present in DFT and MK may be machine-generated ("auto-joined") and may not represent the common or recommended spelling.
Note: The M-fields of DFT and MK are largely machine-generated ("auto-joined") and do not attempt to indicate prescriptive spellings. In some cases, the common or recommended spelling may be different from what is shown. Often, the difference is about [[apostrophe]]s and/or [[hyphen]]s.


We also support searching other websites with conversion to POJ/TL:
Line 56: Line 63:


== Technical notes ==
* Our full-text search is provided by the [[SQLite]]: [https://sqlite.org/fts3.html FTS4] extension. We currently use the Standard Query Syntax.
Our full-text search is provided by the [[SQLite]] [https://sqlite.org/fts3.html FTS4] extension. We currently use the Standard Query Syntax. One of the three basic query types supported by FTS tables is "token or token prefix queries":
* Token prefix queries: use the asterisk ('*') at the end. Similar to {{w|wildcard character}} in [[zokgiap hexthorng|operating systems]] (normal wildcard search not currently supported by FTS)
* Specify a token prefix by appending an asterisk ('*') to the prefix. (While similar to {{w|wildcard character}} in [[zokgiap hexthorng|operating systems]], wildcard search is not currently supported by FTS)
** Example: {{TE|Taioa*}}, {{TE|臺*}}
* Specify a column-name followed by a colon (':')
** Example: {{TE|hj:頭*}} (returns entries where Taiwanese written with Harnji begins with character for [[thaau]])
* Add a caret (^) before a token to require it to be the very first token in its column
* Prefix the token with a caret ('^') to require it to be the very first token in its column
** Example: {{TE|^thaau}}
* [[Ø]] is not folded to lower case by the tokenizer
 
Tokenizer: the default tokenizer ("simple") is used. Because it only does case folding of [[ASCII]] characters, [[Ø]] and ø do not match each other. Some words starting with Ø include {{TE|Ørciw}}, {{TE|Ørmngg}}, and {{TE|Ørtøexli}}.
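
The prefix, caret, and case-folding behaviour described in this section can be reproduced with a small in-memory FTS4 table, assuming the local SQLite build includes FTS4 (the caret syntax needs an FTS4 table and SQLite 3.7.9 or later). The table name <code>entries</code> and the rows are hypothetical, assembled from words mentioned on this page; they are not real dictionary entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-column table using the default "simple" tokenizer.
conn.execute("CREATE VIRTUAL TABLE entries USING fts4(u)")
# Made-up rows assembled from words mentioned on this page.
conn.executemany(
    "INSERT INTO entries (u) VALUES (?)",
    [("thaau Taioaan",), ("Taioaan thaau",), ("Ørciw",)],
)


def search(query):
    """Return the rowids of entries matching an FTS4 query string."""
    rows = conn.execute(
        "SELECT rowid FROM entries WHERE entries MATCH ? ORDER BY rowid",
        (query,),
    )
    return [rowid for (rowid,) in rows]


print(search("Taioa*"))  # [1, 2]: token prefix query, ASCII case is folded
print(search("^thaau"))  # [1]: thaau must be the first token in its column
print(search("Ørciw"))   # [3]: exact non-ASCII character matches
print(search("ørciw"))   # []: Ø and ø are not folded to each other
```

A workaround, if one is wanted, would be to normalize Ø and ø to a single form before both indexing and querying; whether MTL Toolbox does this is not stated here.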


== See also ==