Grammars are valuable resources for natural language processing. We divide the process of grammar development into three tasks: selecting a formalism, defining the prototypes, and building a grammar for a particular human language. After a brief discussion of the first two tasks, we focus on the third. Traditionally, grammars are built by hand, and this approach has many problems. To address these problems, we built two systems that automatically generate grammars. The first system (LexOrg) solves two major problems in grammar development: the redundancy caused by the reuse of structures in a grammar, and the lack of explicit generalizations over those structures. LexOrg takes several types of specification as input and combines them to automatically generate a grammar. The second system (LexTract) extracts Lexicalized Tree Adjoining Grammars (LTAGs) and Context-Free Grammars (CFGs) from Treebanks, and builds derivation trees that can be used to train statistical LTAG parsers directly. In addition to creating Treebank grammars and producing training materials for parsers, LexTract is also used to evaluate the coverage of existing hand-crafted grammars, to compare grammars for different languages, to detect annotation errors in Treebanks, and to test certain linguistic hypotheses. LexOrg and LexTract provide two different perspectives on grammars. In LexOrg, elementary trees in an LTAG grammar are the result of combining language specifications such as tree descriptions. In LexTract, elementary trees are the building blocks of syntactic structures in a Treebank. LexOrg makes explicit the language specifications that form elementary trees, whereas LexTract makes explicit the elementary trees that form syntactic structures. Together, the systems provide a rich set of tools for language description and comparison that greatly enhances our ability to build and maintain grammars and Treebanks effectively.
Supervisors: Martha Palmer; Aravind Joshi. Thesis (Ph.D. in Computer and Information Science) -- University of Pennsylvania, 2001. Includes bibliographical references.