Goldsmith creates computer program Linguistica to map morphology of European languages
By Arthur Fournier
Achieving something many of his colleagues might have argued was impossible, John Goldsmith, the Edward Carson Waller Distinguished Service Professor in Linguistics, has developed a computer program called Linguistica that automatically maps out the morphological structure of European languages, based purely on the input of raw text.
During a recent visit to his office, Goldsmith demonstrated the program by selecting a file in English (in this instance, Abraham Lincoln's Gettysburg Address) and typing a series of commands that instructed his computer to provide a report of the major suffixes found in the language. Within seconds, Linguistica provided Goldsmith with a list that identified -s, -ed and -ing as among the word endings that were most likely to meet the criteria of his request.
Using a simple command-line interface, Linguistica allows its users to select a raw text file, typically in the range of 5,000 to 1 million words, he explained. Then, based purely on the information it can gather from the words in that file, without the aid of dictionaries or morphological rules specific to the language, the program produces a partial morphological analysis of most of the words in the given corpus.
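The article does not show Linguistica's actual code, but the kind of suffix discovery it describes can be sketched in a few lines: count word-final substrings whose remaining stem also occurs as a free-standing word in the same corpus. The function name, length thresholds, and toy word list below are illustrative assumptions, not Linguistica's implementation.

```python
from collections import Counter

def candidate_suffixes(words, max_len=4, min_stem=3):
    """Count word-final substrings (up to max_len letters) whose
    remaining stem also appears as a whole word in the corpus.
    A crude stand-in for Linguistica's heuristic suffix discovery."""
    wordset = set(words)
    counts = Counter()
    for w in wordset:
        for k in range(1, max_len + 1):
            stem, suffix = w[:-k], w[-k:]
            if len(stem) >= min_stem and stem in wordset:
                counts[suffix] += 1
    return counts

# Toy corpus standing in for a raw text file such as the Gettysburg Address.
words = ("jump jumps jumped jumping walk walks walked walking "
         "talk talks talked talking nation nations").split()

top = sorted(candidate_suffixes(words).items(), key=lambda kv: (-kv[1], kv[0]))
print(top[:3])  # → [('s', 4), ('ed', 3), ('ing', 3)]
```

On real text this naive count also surfaces spurious endings, which is precisely why the program needs an evaluative criterion such as MDL to separate genuine morphology from accidental string overlap.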
Many European languages are still in need of good morphologies, but Goldsmith said that is not what sparked his interest in the overlap between computation and linguistics. In the mid-1990s, while visiting Microsoft, where his wife works as a computational linguist, Goldsmith became interested in the work software engineers were doing to improve natural-language voice emulators. Goldsmith said that although he worked as a consultant on some projects, for the most part the Microsoft engineers were not that interested in theoretical linguistics.
"As you can probably imagine, it was incredibly frustrating for me at first, but really it had a salutary effect," he said. "It forced me to go out and learn about the ideas they were working with at the time. I decided to read everything I could about the conceptual tools used by computer scientists interested in machine-learning theory."
Goldsmith's research eventually led him to a closer encounter with both machine-learning theory and the history of different problem-solving techniques linguists and computer scientists use to approach issues related to human languages. Benefiting equally from those two perspectives, Linguistica bears some of the fruit of that period in Goldsmith's career.
The goal of the program is to demonstrate that computer-based algorithms are able to produce an output that closely matches the analysis a human morphologist would provide. In order to meet that challenge, Goldsmith implemented a principle known in machine-learning theory as Minimum Description Length (MDL). With a novel interpretation of MDL as the basis for its code, Linguistica is able to equip computers with a kind of innate critical knowledge. This, Goldsmith believes, is strikingly similar to the aesthetic of elegance that drives human linguists as they look at data and try to determine the best analysis.
At Linguistica's core is a series of algorithms that Goldsmith has designed to recognize and report on the possible morphologies of words in a given text. Linguistica then uses MDL as the criterion to discern which of the possibilities is most likely to be in line with contemporary linguistic theory.
"The heuristic tools are the creative part of the analysis, while the MDL is the evaluative intelligence of the program. MDL is one of the really important ideas that has developed in machine-learning theory. It's very simple, but also very deep," explained Goldsmith.
"Although it can be summarized in a sentence or two, there isn't an obvious way of breaking it down into a series of operations. For me, part of the creative work of the project has been trying to translate some of the knowledge represented by MDL into computer code."
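The MDL idea itself can be illustrated with a toy comparison, though the accounting below is a deliberate simplification: MDL is properly measured in bits of a probabilistic code, while this sketch uses raw letter counts plus one unit per pointer as a crude proxy. The word list and the cost function are illustrative assumptions, not Goldsmith's formula.

```python
def description_length(stems, suffixes, links):
    """Crude proxy for MDL: letters needed to spell out the stem list
    and the suffix list, plus one unit per stem-suffix link."""
    return (sum(len(s) for s in stems)
            + sum(len(x) for x in suffixes)
            + links)

words = ["jump", "jumps", "jumped", "jumping",
         "walk", "walks", "walked", "walking",
         "talk", "talks", "talked", "talking"]

# Analysis A: no morphology -- every word is listed whole, unanalyzed.
dl_a = description_length(words, [], 0)

# Analysis B: three stems share one signature of four suffixes.
dl_b = description_length(["jump", "walk", "talk"],
                          ["", "s", "ed", "ing"],
                          links=3 + 4)  # 3 stem pointers + 4 suffix slots

print(dl_a, dl_b)  # → 66 25
```

MDL prefers analysis B because factoring the vocabulary into stems and suffixes compresses the description, which is the sense in which the shorter grammar is also the more elegant one.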
Goldsmith is most excited that the program demonstrates the extent to which linguistics stands to benefit from the use of the conceptual tools made available through recent advances in the computer sciences. "I've shown it can be done in morphology, and I believe it can be done in syntax as well. The amazing part of all of this (I still haven't completely gotten over it) is how sometimes you have to go all the way out to the fourth or fifth decimal place for the MDL to correctly figure out which of two analyses is more elegant," he continued. "But if you model it correctly, it really does represent the system in exactly the same way a conscious linguist would."
Goldsmith acknowledges that his embrace of statistical modeling tools may seem controversial. Since the mid-1950s, when Noam Chomsky argued that the kinds of operations human language-learners employ could never be coded as algorithms, most linguists have disdained research that deals with machine-learning theory. But as Goldsmith points out, over the past few decades, both conceptual and technological aspects of computational abilities have become tremendously more sophisticated.
"With Linguistica, I'm really trying to show that the tools of information theory and probability theory give us the power to reconstruct traditional notions in linguistics," he continued. "I'm not trying to show that linguistics is wrong; what I'm trying to show is that we actually can write out a formula that encapsulates all of the ideas that linguists characterize as the criteria for analytical elegance."
Goldsmith's work raises questions for researchers in the field of language acquisition. "As linguists, we work with models that assume language acquisition in human children operates according to a series of rules related to analogous classes of linguistic objects," explained Salikoko Mufwene, Professor and Chairman of Linguistics. "But there is no compelling reason why children should be assumed to operate like linguists, and it's difficult for us to imagine how the very young develop those rule systems."
"John's program, on the other hand, presents a model that suggests how language acquisition might operate probabilistically," Mufwene continued. "My hope is that his research will tell us something about how such a model might apply to human abilities."
Although its present interface is primarily designed to serve linguists, Goldsmith also sees the program as a stepping-off point for a wide variety of computer applications.
On the practical side, he believes Linguistica represents a small but very important step toward natural-language interactions between machines and humans. "It would be foolish to say or think that what I'm doing here alone would push us any closer toward the task of getting computers to understand natural language, but my research is part of that general process," he said.
"One of the most direct applications of my work with Linguistica is in the development of speech-recognition software," he continued. "In order for speech recognition to work, we need to establish a process for phonological guesswork on the part of the computer; that's the way humans understand language anyway, and that's also how we want to train computers to work."
Since Linguistica functions by computing probabilities, it is in line with a new generation of software that tries to model common sense by determining what is the most likely understanding of a situation in light of all of the evidence that is available, Goldsmith explained. Such software eventually may enable humans and computers to interact with each other, using the natural, spoken language of day-to-day conversation.
"Since I'm basically optimistic, not only about what's going on in the field now but also about the future of computational linguistics over the next 10, 15, 20 years, I'm certainly comfortable with the idea that we're moving in that direction," he said.
More information on Linguistica is available at Goldsmith's Web site at http://humanities.uchicago.edu/faculty/goldsmith/.