Simple Parser for Natural Language? [long]

Bob Ingria ingria at world.std.com
Sun Jul 25 20:55:45 UTC 1999


Excuse me for the belated reply, but I've been busy (and had keyboard problems on my mail-reading computer, too).  I'm going to address some of the sections of your original message out of order, to make my commentary hang together somewhat better.

At 09:49 AM 7/16/99 -0800, you wrote:
>Folks -
>
>Ted Kaehler and I want to write a Squeak program capable of superficially understanding natural language.  (Of course you are all invited to play, too ;-).

....

>The idea is to then point it at a newspaper, the web, or the Squeak archives, and see if we can get it to make any interesting statements, even if they are wrong, and especially if they are funny.

I'm not quite sure what it is you want to build.  You talk about having an NL understanding system, that you can 'point at' various texts.  So far, so good; this is the typical information extraction task (see the MUC/Tipster proceedings for more detail than you probably want on the various approaches to extraction).  Presumably this can either be batch (e.g. you spider the web, intranets, local archives, etc.) or interactive (i.e. you let the system loose on some interesting text you just came across).  Again, so far, so good.

Where I start to have trouble trying to address the question of what resources (engines, knowledge bases, research fields to look at) you will need is when you say: 'see if we can get it to make any interesting statements'.  Do you actually mean new statements (utterances or complete sentences) in natural language (presumably English)?  If so, you not only have to worry about developing/acquiring a natural language parsing/understanding program, but you also have to develop/acquire a natural language generation program.  And this, in turn, raises the question of keeping the two systems (understanding and generation) in sync.  While there may be some systems that do both, systems typically do either analysis or generation.  In that case, you'll need to develop your own mechanism for keeping them in sync; otherwise you'll start running into habitability issues (e.g. vocabulary items or syntactic constructions that the generator uses but the analyzer can't handle).
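
To make the habitability point concrete, the cheapest insurance is a check somewhere that the generator never emits vocabulary the analyzer doesn't know.  A toy Squeak workspace sketch of such a check (the five-word lexicon and the candidate sentence are obviously made up):

  | analyzerLexicon candidate unknown |
  analyzerLexicon := #('the' 'company' 'acquired' 'a' 'parser') asSet.
  candidate := 'the company acquired a conglomerate'.   "what the generator wants to say"
  unknown := (candidate findTokens: ' ') reject: [:w | analyzerLexicon includes: w].
  unknown isEmpty
      ifFalse: [Transcript show: 'habitability gap: ', unknown printString; cr].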

Also, will the system make these statements spontaneously, via a menu-driven interface, or will users elicit them by querying the system in natural language?  Historically, natural language analysis systems have either done continuous text processing (i.e. information extraction from pre-existing texts) or interactive understanding (i.e. processing of ad hoc queries/commands, either in typed or spoken language, either to retrieve information from an existing database or to issue instructions to another program).  For some reason, researchers have not used the same engine both to populate a database and to retrieve information from the database so constructed.  Some of this is probably accidental, but there are also substantive issues; e.g. verbs of saying or showing (e.g. 'tell', 'show') have different extensional interpretations in running text vs. in ad hoc queries/commands.  So either one has to go to the trouble of adding in all the syntactic, semantic, and pragmatic mechanisms that provide the correct interpretations in the two contexts, or else one special-cases the command instances, detecting them and giving them a separate treatment.  If you do intend for users to query the system in NL, you will also have to address this issue.  I don't know of any existing off-the-shelf system, either commercial or freeware, that does both.  [Full disclosure Note #1: I'm in the process of working on a commercial system that does both, but it's just in pre-beta right now.]
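
Just to show the shape of the special-casing I have in mind, here is a deliberately crude Squeak sketch; the verb list and the #query/#runningText distinction are invented for illustration, not taken from any real system:

  | utterance mode commandVerbs firstWord treatAsCommand |
  utterance := 'show me all companies based in Allston'.
  mode := #query.   "vs. #runningText for pre-existing documents"
  commandVerbs := #('show' 'tell' 'list' 'display') asSet.
  firstWord := (utterance findTokens: ' ') first asLowercase.
  treatAsCommand := (mode = #query) and: [commandVerbs includes: firstWord].
  Transcript
      show: (treatAsCommand
          ifTrue: ['treat as a command: retrieve from the database']
          ifFalse: ['treat as a report: somebody showed/told somebody something']);
      cr.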

There is also the question of how you will persist the information you get out of text.  While this is not strictly an NL issue, it does affect how you will make use of the semantic representations your analyzer produces.
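
Even something as simple as relation triples with a backpointer to the source sentence goes a long way here.  A toy sketch of what I mean, using nothing but stock Squeak collections (the fact, the file name, and 'Swipe Inc.' itself are all invented for illustration):

  | facts |
  facts := OrderedCollection new.
  "each extracted fact: subject, relation, object, plus a backpointer to its source"
  facts add: (Dictionary new
      at: #subject  put: 'Swipe Inc.';
      at: #relation put: #basedIn;
      at: #object   put: 'Allston';
      at: #source   put: 'bizwire-1999-07-12.txt, sentence 3';
      yourself).
  "later: answer 'where is Swipe Inc. based?' and cite the evidence"
  (facts select: [:f | ((f at: #subject) = 'Swipe Inc.') and: [(f at: #relation) = #basedIn]])
      do: [:f | Transcript show: (f at: #object), '   [from ', (f at: #source), ']'; cr].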

....

>By superficial understanding I mean, that it could successfully parse most sentences and could build up a body of valid knowledge structures based on the content.  Of course this does not constitute real understanding, since the relationships may be ambiguous, conflicting or lacking necessary context or metainformation.
>
>However, even at this superficial stage, it could be very useful and probably a lot of fun.  With backpointers to its source material, it could certainly facilitate inquiries about the content.  And with a bit more work, we might actually learn a thing or two about real understanding.
>
>So, here's the question:  Do any of you know of any simple parsers in Smalltalk (or even other languages) that are capable of parsing most english sentences correctly?  

I'm a bit disturbed that, even though your goal is understanding, you keep talking about 'parsers' and 'parsing' to describe the engine you want.  Most (if not all) NL researchers use the term 'parser' to describe an engine that just brings back one or more syntactic structures associated with an input text.  To get to understanding, you need an interpreter, something that assigns a semantic representation to its input.  There are various parsers out there that simply parse, and you can probably obtain any one of them very easily.  Interpreters, especially freeware interpreters, are harder to find.  So, if you do get a parser, you'll probably have to develop a compositional semantics for it.  And if you get an interpreter, you'll have to make sure that you're satisfied with the kinds of semantic representations it produces.
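
To make the parser/interpreter distinction concrete: for a made-up sentence like 'Swipe acquired Foo Corp.', a parser stops at something like the first structure below, while an interpreter has to get you all the way to something like the second.  Both shapes are cartoons of my own; real systems differ wildly in their representations:

  | parse interpretation |
  "what a parser gives you: labelled syntactic structure, nothing more"
  parse := #(#S
              (#NP 'Swipe')
              (#VP (#V 'acquired') (#NP 'Foo Corp.'))).
  "what an interpreter has to get you to: something you can store and reason with"
  interpretation := Dictionary new
      at: #predicate put: #acquire;
      at: #agent     put: 'Swipe';
      at: #theme     put: 'Foo Corp.';
      at: #tense     put: #past;
      yourself.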

But I think there is a deeper disconnect from current NLP going on here.  You asked: 'Do any of you know of any simple parsers in Smalltalk (or even other languages) that are capable of parsing most english sentences correctly?'  This strikes me as being the wrong question to ask.  The literal answer is: No.  Back when I was keeping track of such things, the figures I heard quoted were that the top-of-the-line NLP parsers for English produced the correct parse (against a test corpus of hand-parsed utterances, such as the UPenn Treebank or IBM's Birmingham corpus, if I haven't gotten that name wrong) 60% of the time.  However, given current NLP techniques, percentage of complete parses is not all that relevant to the task of understanding.

Back in the days of ATN parsers, an input utterance was either in or out: either the parser assigned it a parse or it didn't.  Parsing (even just qua parsing) has come a long way since then.  The buzz term that came into the (D)ARPA research community in the late '80s was 'robust parsing', which meant a system that, unlike an ATN, did not have a simple binary result for an input sentence.  Parsers were expected to try to come up with interpretations even for utterances that didn't get complete parses.  Various fallback and post-parsing mechanisms were developed.  And systems using chart parsers developed techniques to extract interpretations from the fragments in the chart even for those utterances that did not result in complete parses.  So, these days, any NL analysis system worth its salt will attempt to come up with a useful interpretation for every input utterance, no matter how incomplete the parse is.  What your project really needs, therefore, is not the system with the best parse coverage per se, but one that is adept at extracting useful information, by any means necessary.
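
Here is a very hand-wavy sketch of the fragment-extraction flavor: pretend the chart handed you a few edges (a category plus the span of words it covers) but no S covering the whole utterance; you can still greedily keep the widest non-overlapping constituents and interpret those.  The edges below are invented, and real chart extraction is a good deal cleverer than this:

  | edges covered fragments |
  "pretend chart edges for a ten-word utterance that never got a spanning S;
   each edge is category -> (start to: end)"
  edges := OrderedCollection new.
  edges add: #NP -> (1 to: 3).
  edges add: #VP -> (4 to: 7).
  edges add: #PP -> (6 to: 7).    "subsumed by the wider VP edge"
  edges add: #NP -> (8 to: 10).
  covered := Set new.
  fragments := OrderedCollection new.
  "take the widest edges first, keeping only those that don't overlap anything already kept"
  (edges asSortedCollection: [:a :b | a value size >= b value size]) do: [:edge |
      (edge value inject: false into: [:clash :i | clash or: [covered includes: i]])
          ifFalse: [
              fragments add: edge.
              covered addAll: edge value]].
  "fragments ends up with the two NPs and the VP (the subsumed PP loses out);
   hand each one to the interpreter"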

Another area where sentential parsing, even extending the term to include creating semantic interpretations, is not the major issue is what the field calls 'entity merging'.  Typically, a given entity will be referred to in an article in multiple ways; e.g. 'Swipe Inc.', 'it', 'they', 'the company', 'this Allston-based conglomerate'.  An important part of doing the kind of information extraction you're talking about is recognizing that these are alternate descriptions of the same entity.  Again, this goes beyond simple sentential parsing/interpretation.
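
A cartoon of the easy half of entity merging, in a Squeak workspace; the hard half, of course, is the pronouns and definite descriptions, which need real discourse modelling rather than the 'grab the most recent entity' dodge below:

  | mentions entities resolve |
  mentions := #('Swipe Inc.' 'the company' 'Swipe' 'it' 'this Allston-based conglomerate').
  entities := OrderedCollection new.   "canonical names seen so far, most recent last"
  resolve := [:mention | | anaphoric |
      anaphoric := (#('it' 'they') includes: mention asLowercase)
          or: [(mention asLowercase beginsWith: 'the ')
          or: [mention asLowercase beginsWith: 'this ']].
      anaphoric
          ifTrue: ["pronoun or definite description: guess the most recent entity"
              entities isEmpty ifTrue: [nil] ifFalse: [entities last]]
          ifFalse: ["a name: merge with a known entity it abbreviates, or start a new one"
              entities
                  detect: [:known | known asLowercase beginsWith: mention asLowercase]
                  ifNone: [entities add: mention]]].
  mentions do: [:m | Transcript show: m, '  ->  ', (resolve value: m) printString; cr].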

As for parsers in Smalltalk in particular, I don't know of any, aside from the one I'm currently working on.  I've come across references to a few 'object-oriented parsers' out there, but when I've been able to read the cited papers, the systems have just used objects as data repositories, such as 'object-oriented Prolog' (!) parsers that kept the fundamental backtracking algorithm of Prolog, rather than letting the objects duke it out among themselves.  I *have* come across two OO parsers that seem not just to be implemented in OO languages but to actually have OO architectures.  There's Parsetalk:

http://www.coling.uni-freiburg.de/~neuhaus/linguistics/restricted/restricted.html
http://www.coling.uni-freiburg.de/~neuhaus/manuals/draft/draft.html

which is a dependency grammar parser.

There's also the Power parser:

will.nlp.info.eng.niigata-u.ac.jp/nlp/power/power.html

All the documentation for this is in Japanese, aside from some snippets of code and rules.  But the small amount of information that those fragments reveal does seem to indicate that Power does have a real OO architecture.

> Presumably this also requires a lexicon, so it is important that the associated lexicon be in the public domain as well.

Please note that, unless you plan to use some off-the-shelf system and not change it or extend it in any way, you will wind up (1) adding to and editing the lexicon you start out with; and (2) importing lexical information from other lexical sources.  I've heard it said that working chemists must, of necessity, know how to blow their own glassware.  My own experience is that every computational lexicologist is periodically faced with the task of importing lexical information from a new source, either automatically, semi-automatically, or manually.  I'd have to sit down for a good long time to try to come up with even a rough estimate of how many times I've done this over the past 15+ years.
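
For what it's worth, the glassware-blowing usually amounts to something like the following: you get a word list in somebody else's format, fold it into your own lexicon, and set aside the disagreements for a human to adjudicate.  A toy version (the two-entry lexicon and the incoming pairs are invented; real sources are far messier):

  | lexicon newEntries conflicts |
  lexicon := Dictionary new.
  lexicon at: 'book' put: #noun.
  lexicon at: 'run'  put: #verb.
  "pretend we just read these word/category pairs out of somebody else's word list"
  newEntries := {'book' -> #verb. 'colossal' -> #adjective. 'tiny' -> #adjective}.
  conflicts := OrderedCollection new.
  newEntries do: [:entry |
      (lexicon includesKey: entry key)
          ifFalse: [lexicon at: entry key put: entry value]
          ifTrue: [(lexicon at: entry key) = entry value
              ifFalse: [conflicts add: entry]]].   "leave disagreements for a human"
  "lexicon now has 'colossal' and 'tiny'; conflicts holds 'book' -> #verb"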

Note also that any interpreter you get should not assume a closed vocabulary; i.e. it should not assume that every word it encounters in the text will be in a pre-existing lexicon.  Particularly given the types of texts you mentioned ('a newspaper, the web, or the Squeak archives'), any interpreter that is up to the task must be able to handle out-of-vocabulary items, assign these newly encountered words a default semantics and, in the best case, use known words in the surrounding context to change this default semantics into something more interesting.
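
Concretely, the lexical lookup ends up looking less like at: and more like at:ifAbsent:, with the absent case manufacturing a default entry and, with luck, sharpening it from the surrounding words.  The guessing rule in this sketch is deliberately dumb, just to show the shape:

  | lexicon lookup entry |
  lexicon := Dictionary new.
  lexicon at: 'the' put: #determiner.
  lexicon at: 'runs' put: #verb.
  "a lookup that never fails: unknown words get a default entry, nudged by the word before them"
  lookup := [:word :previous |
      lexicon at: word ifAbsent: [
          ((lexicon at: previous ifAbsent: [nil]) = #determiner)
              ifTrue: [#noun]            "right after a determiner, guess noun"
              ifFalse: [#unknownThing]]].
  entry := lookup value: 'morphic' value: 'the'.   "-> #noun, though 'morphic' is nowhere in the lexicon"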

Also, you need to be sensitive to the issue of whether the interpreter you get expects raw, unsegmented (i.e. not broken up into sentences) text, or whether it expects a single sentence at a time.  Given that you've mentioned web-based texts, you will also want something that can separate out the text from HTML/XML/etc. tags.  If the interpreter doesn't handle this itself, you'll need a tokenizer, too.
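
The tag-stripping half is trivial; it's sentence segmentation that drags you into the usual abbreviation-and-decimal-point swamp.  A sketch of the trivial half (which, note, cheerfully punts on everything interesting):

  | html in out text sentences |
  html := '<p>Swipe Inc. bought Foo Corp.</p> <p>The deal closed Friday!</p>'.
  in := ReadStream on: html.
  out := WriteStream on: String new.
  "throw away everything between < and >"
  [in atEnd] whileFalse: [
      in peek = $<
          ifTrue: [in skipTo: $>]
          ifFalse: [out nextPut: in next]].
  text := out contents.
  "naive sentence split on . ! ? -- note that it cheerfully breaks after 'Inc.' and 'Corp.' too"
  sentences := text findTokens: '.!?'.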

>Obviously, the next topic of interest is meta-information in the lexicon (like the relationship between infinitessimal, tiny, small, little, average, big, large, enormous, collossal), so if you have any leads onto (again, simple) work along these lines, that is also of interest to us.

I'm not sure why you refer to this sort of information as 'meta-information'.  Unless your lexical semantics is bad old capital letter/prime semantics (e.g. where the semantics of the English word 'book' is the 'semantic interpretation' BOOK or book' or BOOK'), the compositional semantic definition of a concept should provide the hooks that link related concepts together, implicitly if not explicitly.  See James Pustejovsky's _The Generative Lexicon_ from MIT Press ('and sources cited therein', as the old phrase has it) for one approach to lexical semantics that does this.  [Full disclosure Note #2: I actually helped edit this book and James is a friend.  Still, independently of all that, this strikes me as being the most descriptively and explanatorily adequate approach to lexical semantics I know of at present.]  [Full disclosure Note #3: (This one's more of a historical aside.)  I'm not sure if this comes across in the book, but the GL approach to lexical semantics is heavily influenced by OO concepts.  It's the semantics in use in the system I'm working on.  When the type system designer (architect?) was redesigning (refactoring?) the top (i.e. domain independent) part of the type system, I introduced her to CRC cards.  The next day, we wound up sitting on the floor with a stack of 3 x 5 cards, working out the structure by laying out the cards in different configurations.]
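
Crudely put, the difference is between an entry for 'book' that says BOOK and one that looks more like the sketch below, where the telic and agentive roles are exactly the hooks that get you from 'book' to 'read' and 'write' without any extra 'meta' layer.  This is a cartoon of a GL-style entry rendered in Squeak terms, not the book's actual notation:

  | book |
  "a cartoon of a qualia-structure-style entry: the roles are the hooks to related concepts"
  book := Dictionary new
      at: #formal       put: #(physicalObject information);   "what kind of thing it is"
      at: #constitutive put: #(pages chapters);               "what it's made of"
      at: #telic        put: #read;                           "what it's for"
      at: #agentive     put: #write;                          "how it comes into being"
      yourself.
  "so 'a long book' or 'begin a book' can be interpreted relative to (book at: #telic)"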

You might also want to check out _Relational models of the lexicon: representing knowledge in semantic networks_, edited by Martha Evens, from Cambridge University Press, for various more recent approaches to lexical semantic representation.

>Please don't mock us for simple thoughts about complicated topics.  After all, that's how we got Squeak.

Well, I hope you don't construe my comments as mocking.  They were meant to point out the issues that face you if you do want to pursue this project seriously.  My own opinion is that you either have to find some off-the-shelf system(s) that satisfy your needs or else get into building the systems yourself, which is going to be a tremendous amount of work.  Unless you're doing something really simple-minded like a Doctor/Eliza program, even superficial and 'trivial' natural language understanding is a lot of work.

Good luck and feel free to ask me followup questions.


-30-
Bob Ingria
As always, at a slight angle to the universe




