Professor Emeritus of Computational Linguistics

University of Toronto, Department of Computer Science

Research

Lexical nuances of style and meaning

The nuances of denotation and connotation that are a part of everyday language are a serious problem in many applications of computational linguistics. For example, each word in the output of a machine translation system should be the closest possible match in meaning and connotation to the corresponding word in the input; but often the choice must be made from a set of near-synonyms, none of which precisely matches the input. A forest, for instance, differs from a woods along several fuzzy dimensions of size and ‘wildness’, and the distinctions are not quite the same as those between the nearest German translations, Wald and Gehölz (DiMarco, Hirst, and Stede 1993; Hirst 1995). Formalisms that are conventionally used in machine translation (and artificial intelligence in general) simply cannot support the kind of fine-grained representation that this task requires. Researchers working on lexical choice in natural language generation and machine translation have assumed extremely simplistic models of synonymy, concentrating instead on important but orthogonal issues such as filling out verb frames.

Philip Edmonds (1999; Edmonds and Hirst 2002) developed a new method of representation, supplementary to conventional formalisms, that permits the very fine-grained distinctions that near-synonyms require, both within and across languages. In this method, a group of near-synonyms (possibly drawn from more than one language) is represented by a single concept in the ontology, and its members are then differentiated from one another at the sub-conceptual level. The approach permits the representation of the lexical connotations, relative emphases, and nuances of meaning of the members of each group of near-synonyms.
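
A minimal sketch of what such a clustered representation might look like is given below. This is an illustration only, not Edmonds's actual formalism: the class names, dimension names, and numeric values are all invented for the forest/woods/Wald/Gehölz example.

```python
# Illustrative sketch of a near-synonym cluster: one coarse-grained concept
# shared by a group of near-synonyms (possibly across languages), with each
# member differentiated at the sub-conceptual level.  All names and values
# here are invented; this is not Edmonds's actual representation.

from dataclasses import dataclass, field

@dataclass
class NearSynonym:
    lemma: str
    language: str
    # Fuzzy denotational dimensions (e.g. size, 'wildness') on a 0-1 scale.
    denotations: dict[str, float] = field(default_factory=dict)
    # Stylistic attributes and connotations (e.g. formality, attitude).
    style: dict[str, str] = field(default_factory=dict)

@dataclass
class NearSynonymCluster:
    concept: str                      # a single node in the coarse ontology
    members: list[NearSynonym] = field(default_factory=list)

tract_of_trees = NearSynonymCluster(
    concept="TRACT_OF_TREES",
    members=[
        NearSynonym("forest", "en", {"size": 0.8, "wildness": 0.8}),
        NearSynonym("woods",  "en", {"size": 0.4, "wildness": 0.5}),
        NearSynonym("Wald",   "de", {"size": 0.7, "wildness": 0.7}),
        NearSynonym("Gehölz", "de", {"size": 0.2, "wildness": 0.3}),
    ],
)
```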

However, actually using this kind of representation requires a new type of lexical resource: a lexical knowledge-base giving information about each near-synonym group in the language, and mappings between near-synonym groups across languages. Diana Inkpen (2003; Inkpen and Hirst 2006) developed a method for automatically acquiring a knowledge-base of near-synonym differences, using an unsupervised decision-list algorithm that learns extraction patterns from a special dictionary of synonym differences. The patterns are then used to extract knowledge from the text of the dictionary. The initial knowledge-base is later enriched with information from other machine-readable dictionaries, and information about the collocational behavior of the near-synonyms is acquired from free text. The knowledge-base is used by Xenon, a natural language generation system that shows how the new lexical resource can be used to choose the best near-synonym in specific situations.
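
The following toy example illustrates the general idea of pattern-based extraction of near-synonym differences from dictionary-style prose. It is not Inkpen's decision-list algorithm, and the patterns and sample entry text are invented.

```python
# Toy illustration of pattern-based extraction of near-synonym differences
# from dictionary-style text.  The patterns and the sample entry are invented;
# the real method learns its extraction patterns rather than hand-coding them.

import re

PATTERNS = [
    # (regular expression, kind of distinction asserted)
    (re.compile(r"\b([A-Z][a-z]+) suggests ([a-z ]+)"), "suggestion"),
    (re.compile(r"\b([A-Z][a-z]+) implies ([a-z ]+)"), "implication"),
    (re.compile(r"\b([A-Z][a-z]+) is more formal than (\w+)"), "formality"),
]

def extract_differences(entry_text):
    """Return (word, distinction type, content) triples found in an entry."""
    triples = []
    for pattern, kind in PATTERNS:
        for match in pattern.finditer(entry_text):
            triples.append((match.group(1).lower(), kind, match.group(2)))
    return triples

entry = ("Error is the generic term. Blunder implies stupidity or carelessness. "
         "Slip suggests a minor and pardonable mistake.")
print(extract_differences(entry))
# [('slip', 'suggestion', 'a minor and pardonable mistake'),
#  ('blunder', 'implication', 'stupidity or carelessness')]
```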

However, this method requires a special dictionary of near-synonyms; such dictionaries exist for only a few languages and tend to be very incomplete. Being able to use a regular dictionary would be preferable. Tong Wang (Wang and Hirst 2009, 2012) proposed three novel methods, two rule-based and one based on machine learning, to identify synonyms from definition texts in a regular machine-readable dictionary. The extracted synonyms were evaluated in two extrinsic experiments and one intrinsic experiment; the results show that the pattern-based approach achieves the best performance in one of the experiments and satisfactory results in the other, comparable to corpus-based state-of-the-art results. Wang subsequently worked on methods for near-synonym choice based on latent semantic analysis (LSA) (Wang and Hirst 2010). The model was built on lexical-level co-occurrence, which proved effective in providing higher-dimensional information about the subtle differences among near-synonyms. By applying supervised learning to the latent features, the system achieved an accuracy of 74.5% on a fill-in-the-blank task, a statistically significant improvement over the earlier state of the art.
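
As a rough sketch of how latent semantic analysis can be applied to this choice task, the code below builds a word-by-word co-occurrence matrix from a toy corpus, reduces it with a truncated SVD, and ranks candidate near-synonyms by the cosine similarity of their latent vectors to the context of the gap. The published system additionally trains a supervised model on the latent features; the corpus and dimensions here are invented for illustration.

```python
# Sketch of unsupervised LSA-based near-synonym choice for a fill-in-the-blank
# task.  The toy corpus and dimensionality are invented; the published system
# adds supervised learning on the latent features.

import numpy as np

corpus = [
    "the dense forest stretched for miles of wild terrain",
    "a small woods behind the house with a walking path",
    "hikers got lost deep in the wild forest",
    "children played in the woods near the village",
]
docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Word-by-word co-occurrence counts within each sentence.
cooc = np.zeros((len(vocab), len(vocab)))
for doc in docs:
    for w in doc:
        for v in doc:
            if w != v:
                cooc[index[w], index[v]] += 1

# Truncated SVD yields a low-dimensional latent vector for each word.
k = 4
U, S, _ = np.linalg.svd(cooc)
latent = U[:, :k] * S[:k]

def choose(candidates, context_words):
    """Pick the candidate whose latent vector best matches the gap's context."""
    ctx = sum(latent[index[w]] for w in context_words if w in index)
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max(candidates, key=lambda c: cosine(latent[index[c]], ctx))

# Fill-in-the-blank: "hikers wandered through the wild ____"
print(choose(["forest", "woods"], ["hikers", "wild", "terrain"]))
```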

References

DiMarco, Chrysanne; Hirst, Graeme; and Stede, Manfred. “The semantic and stylistic differentiation of synonyms and near-synonyms.” AAAI Spring Symposium on Building Lexicons for Machine Translation, Stanford, CA, March 1993, 114–121. [PDF]

Edmonds, Philip. Semantic Representations of Near-Synonyms for Automatic Lexical Choice. Ph.D. Thesis. Department of Computer Science, University of Toronto. September 1999. [PDF]

Edmonds, Philip and Hirst, Graeme. “Near-synonymy and lexical choice.” Computational Linguistics, 28(2), June 2002, 105–144. [PDF]

Hirst, Graeme. “Near-synonymy and the structure of lexical knowledge.” AAAI Symposium on Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, Stanford University, March 1995, 51–56. [PDF]

Inkpen, Diana. Building a Lexical Knowledge-Base of Near-Synonym Differences. Ph.D. Thesis. Department of Computer Science, University of Toronto. October 2003. [PDF]

Inkpen, Diana and Hirst, Graeme. “Building and using a lexical knowledge-base of near-synonym differences.” Computational Linguistics, 32(2), June 2006, 223–262. [PDF]

Wang, Tong and Hirst, Graeme. “Extracting synonyms from dictionary definitions.” Proceedings, Conference on Recent Advances in Natural Language Processing, Borovets, Bulgaria, September 2009, 470–476. [PDF]

Wang, Tong and Hirst, Graeme. “Near-synonym lexical choice in latent semantic space.” Proceedings, 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, August 2010, 1182–1190. [PDF]

Wang, Tong and Hirst, Graeme. “Refining the notions of depth and density in WordNet-based semantic similarity measures.” Proceedings, 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, July 2011, 1003–1011. [PDF]

Wang, Tong and Hirst, Graeme. “Exploring patterns in dictionary definitions for synonym extraction.” Natural Language Engineering, 18(3), July 2012, 313–342. [PDF]