Clustering Wearable Nouns in Cantonese and Vietnamese:

Evidence from Verb Semantics and Psychological Groupings

Department of Computer Science
University of British Columbia

Introduction

This thesis is a preliminary study of two major problems in computational linguistics: the structure of multilingual lexicons and word sense disambiguation (WSD). Multilingual lexicons are needed in applications that deal with text documents from multiple languages. Such applications include machine translation (MT), cross language information retrieval (CLIR), and bilingual generation and alignment. WSD is a problem that arises not only in a multilingual context, but frequently in a monolingual context as well. Applications that face the problem of WSD are those that have natural language processing or generation capabilities. Possible domains cover intelligent tutoring systems (ITS), speech recognition and synthesis, automatic text summarization, information extraction (IE), and so forth. In practice, research on these problems will improve the ``intelligence'' of computational linguistic systems. In theory, research will allow us to learn about properties of specific languages as well as language universals and typologies.

A novel approach is taken in this thesis to examine the issues of multilingual lexicon structure and WSD across languages. The main goal of this thesis is to find linguistically and psychologically motivated groupings for wearable nouns in Cantonese and Vietnamese. The author will do so by clustering data collected from two experiments. The second goal of this thesis is to identify the set of `wear' verbs in Cantonese and Vietnamese by analyzing the semantics of all the verbs used in these experiments. The surface forms of `wear' identified in this way will indicate how many different senses of `wear' are distinguished in the languages studied.

The selection of the `wear' verb is motivated by previous research in machine translation (MT) that dealt with translational ambiguity between Japanese and English. In English, one could use `wear' to describe what someone has on their upper body, or their head, or their feet, regardless of whether the article is a clothing item or an accessory. In many other languages, this is not the case. Consider the following Japanese data, taken from Hutchins & Somers (1992), which list the verbs corresponding to `wear' together with the items each verb selects:

  1. kiru: generic
  2. haoru: coat, jacket
  3. haku: shoes, trousers
  4. kaburu: hat
  5. hameru: ring, gloves
  6. shimeru: belt, tie, scarf
  7. tsukeru: brooch, clip
  8. kakeru: glasses, necklace

Depending on the theme the agent takes, a different verb is used. For example, `Yetta is wearing a sweater' would be translated with `kiru' as `Yetta wa seta wo kite imasu'; `Phil is wearing a hat' would be translated with `kaburu' as `Phil wa boshi wo kabutte imasu'; and `I am wearing black hair clips' would be translated with `tsukeru' as `Watashi wa kuroi hea kuripu wo tsukete imasu'. These eight senses of `wear' are indistinguishable in English.
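
The dependence of the verb on its theme can be sketched as a simple lookup. The following is only an illustration built from the Hutchins & Somers (1992) data listed above; the function name `choose_wear_verb` is hypothetical, and falling back to `kiru' for any noun not in the list is an assumption based on its `generic' label, not a claim about Japanese usage in general.

```python
# Japanese `wear' verbs keyed to the items they select,
# copied from the Hutchins & Somers (1992) list above.
WEAR_VERBS = {
    "haoru": {"coat", "jacket"},
    "haku": {"shoes", "trousers"},
    "kaburu": {"hat"},
    "hameru": {"ring", "gloves"},
    "shimeru": {"belt", "tie", "scarf"},
    "tsukeru": {"brooch", "clip"},
    "kakeru": {"glasses", "necklace"},
}

# `kiru' is labelled "generic" in the data; using it as the
# fallback for unlisted nouns is an assumption of this sketch.
GENERIC_VERB = "kiru"


def choose_wear_verb(theme: str) -> str:
    """Return the verb whose item set contains `theme`, else the generic verb."""
    for verb, themes in WEAR_VERBS.items():
        if theme in themes:
            return verb
    return GENERIC_VERB
```

Under these assumptions, `choose_wear_verb("hat")` selects `kaburu' and `choose_wear_verb("sweater")` falls back to `kiru', matching the example sentences above. A realistic lexicon would need richer semantic features than a flat list of nouns.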

These translations illustrate the relevance of selecting the correct word sense in a translation task. In the domain of MT, two linguistically oriented frameworks are common. One is the transfer model, where the source language (SL) is translated into the target language (TL) with the help of syntactic and semantic rules. In this framework, rules in the transfer module are coded with specific knowledge of the SL and the TL. Their lexicons are also designed in this manner. Both the rules and the information in the lexicon are unidirectional. This limitation means that a system that translates bidirectionally between language A and language B will need separate rules and separate lexicons for A->B and B->A. A system that translates bidirectionally for three languages will need six separate sets of rules and lexicons, and so forth. For this reason, most transfer systems are designed for one pair of languages only.
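
The growth in the number of rule sets can be made concrete by enumerating every ordered (source, target) pair. This is a minimal sketch; the function name `transfer_directions` and the language labels are placeholders introduced for illustration.

```python
from itertools import permutations


def transfer_directions(languages):
    """List every ordered (source, target) pair of distinct languages.

    In a transfer system, each such pair needs its own rules and lexicon.
    """
    return list(permutations(languages, 2))


# Three languages give six unidirectional rule sets.
pairs = transfer_directions(["A", "B", "C"])
```

For n languages this enumeration has n(n-1) entries, which is why the cost of adding a language to a transfer system rises quickly.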

The second model is the interlingua model, in which all the source languages and target languages communicate via an intermediate level of representation. The representation in the interlingua purports to be universal, so it is independent of the languages that are involved. This feature makes interlingua the most attractive model theoretically, because it is the closest approach to a universal solution. Although this model is designed for theoretical goals and multilingual translation, the details have not been fully worked out and interlingua is still just an ideal.

The MT framework used in this thesis is something between transfer and interlingua. The architecture would resemble that of interlingua, but the intermediate level of representation is not universal for every language. The intermediate level should encompass all the universal properties that the system needs to deal with, while additional language-specific details may appear in the lexicon. In this thesis, the languages of interest are Cantonese and Vietnamese because the translational ambiguity problem with `wear' occurs in these languages as well. The goals, then, are to study the decompositional semantics of `wear'-related verbs in Cantonese and Vietnamese and to provide meaningful clusters for wearable nouns. These goals are set in the hope that the results will shed light on the organization of machine lexicons and word sense disambiguation in the context of MT.

A brief introduction to the languages under study is merited. Chinese consists of five major languages: Mandarin, Wu, Min, Yue, and Hakka. Cantonese is a dialect of Yue. Cantonese is spoken primarily in Guangdong, Guangxi, Hong Kong, and Macau. Hong Kong Cantonese speakers refer to their language as Guangdong Wa (Guangdong is the name of the province in China, and Wa means `speech/language'). Varieties of Cantonese can also be found in other countries such as Singapore, Malaysia, Canada, the United States, and Australia. Due to the influence of British rule and North American media in Hong Kong, the Cantonese language incorporates many English words. In addition, Cantonese is primarily a spoken language, and no standardized orthography has been adopted. The romanization system used in this thesis is that of Wong Sek Ling (1991) with the omission of tones. The variety of Cantonese under study in this thesis is that of Hong Kong-Canadian speakers.

The Vietnamese language is known to its native speakers as tie^'ng Vie^.t (tie^'ng means `language/voice'), and the Vietnamese people refer to themselves as ngu+o+`i Vie^.t or ngu+o+`i Kinh (Kinh means `prayer/Bible'). The Vietnamese language came under Chinese and French influence during ten centuries of Chinese political domination and over eighty years in which French was the official language of Vietnam. One of the writing systems used is the Han script, which consists of Chinese characters and is mainly used in literary texts. The orthography more commonly known today is the national Roman script called chu+~ quo^'c ngu+~, developed by Catholic missionaries in the seventeenth century. This is the orthography adopted in this thesis.

The linguistics of these languages is not reviewed in this thesis, although the semantics of `wear'-related Cantonese and Vietnamese verbs is discussed in Appendices B and C. For a comparative overview of Chinese languages, the reader is referred to Li (1990). A comprehensive study of Cantonese can be found in Matthews & Yip (1994), which emphasizes Cantonese morphology and syntax. An overview of the Vietnamese language and its syntax can be found in Nguyen (1990).

The next section provides an overview of the problems at hand. In particular, we begin by reviewing the kind of information that goes into lexicons and the organization of this information. Then we study briefly the impact that WSD has on MT tasks and ways of using linguistic information to help resolve ambiguity. We also turn to the task of lexical choice, which is the main focus of our study. After the overview, Section 3 describes two pilot experiments that were carried out with Cantonese and Vietnamese speakers. The first is a multiple choice task in which participants ranked the appropriateness of given `wear' verbs. The second is a grouping task (more commonly referred to as a sorting task in psychology) where participants grouped nominal garments into different piles. The section also details these experiments and their results. With these results, suggestions to improve lexical structure and WSD tasks in MT are provided in Section 4, with indications for future work.
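
To give a sense of how data from a grouping task can feed a clustering procedure, the sketch below counts how often two nouns are placed in the same pile across participants. This is one common way of turning sorting data into a similarity measure and is not necessarily the procedure used in this thesis. The function name `cooccurrence` and the toy data (English glosses, two imaginary participants) are hypothetical and are not experimental results.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(sorts):
    """Count, for each noun pair, how many participants put both in one pile.

    `sorts` holds one entry per participant; each entry is a list of piles,
    and each pile is a set of nouns.
    """
    counts = Counter()
    for piles in sorts:
        for pile in piles:
            # Sort the pile so each pair is always stored in the same order.
            for a, b in combinations(sorted(pile), 2):
                counts[(a, b)] += 1
    return counts


# Hypothetical toy data, for illustration only.
toy_sorts = [
    [{"hat", "cap"}, {"shoes", "socks"}],
    [{"hat", "cap", "socks"}, {"shoes"}],
]
counts = cooccurrence(toy_sorts)
```

Pairs with high counts, such as ("cap", "hat") in the toy data, were grouped together by more participants and would be natural candidates for the same cluster.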