Hi There!

I'm Dan Schlegel, an Associate Professor in the Computer Science Department at SUNY Oswego.

Assignment 3

In this assignment you will explore issues in extracting commonsense knowledge from fairy tales through the use of pattern matching over dependency parses. Choose a fairy tale that is freely available on Project Gutenberg or another source to serve as your source text. You will extract roughly two pages of text, preferably mostly prose, from the fairy tale. The text can all come from the same part of the fairy tale, or be a mixture of sentences and paragraphs from different parts. The first page will serve as your training set, and the second as your test set. Set the second page aside and don’t look at it until we’re ready for it.

Read through the training set carefully, thinking about what kinds of “common sense” can be learned from the text. Annotate the training text with the commonsense knowledge you’d like to extract. For instance, consider the following example from Schubert & Tong (2003):

Sentence: Rilly or Glendora had entered her room while she slept, bringing back her washed clothes.
Extracted common sense:
A NAMED-ENTITY MAY ENTER A ROOM.
A FEMALE-INDIVIDUAL MAY HAVE A ROOM.
A FEMALE-INDIVIDUAL MAY SLEEP.
A FEMALE-INDIVIDUAL MAY HAVE CLOTHES.
CLOTHES CAN BE WASHED.

Now, take your training text and run it through the Stanford Dependency Parser (if that page is down try this one). Examine the paths through the dependency graphs in the places where you have identified commonsense knowledge. Develop at least 10 rules which could recognize that knowledge in the graph. Make your rules as general as possible, so that they apply to more than just the examples in your training data. But be sure your rules don’t match things you don’t intend them to! Develop a notation for writing the rules that is expressive enough to capture what you mean. Some example syntax might be:

meaning a verb of some kind with “she” as a dependent in the nsubj relation. This rule could be used to recognize the “A FEMALE-INDIVIDUAL MAY SLEEP” item in the above example, but it also applies to many other sentences. If you’re feeling ambitious, you might look into the Semgrex and Tregex syntax usually used for this purpose. Don’t worry too much about tense, since that only makes the project harder. Be sure a few of your rules use more than just a single dependency link.
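If it helps to see the idea concretely, here is a minimal sketch in Python of one way a rule could be written down and checked against a dependency graph. Everything in it is an assumption made for illustration: the names Token, Node, and match_rule are hypothetical, the notation is not Semgrex, and the tokens and edges are a hand-written approximation of a parse for part of the Schubert & Tong sentence, not actual parser output. The first rule uses a single dependency link; the second chains two links.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    word: str
    pos: str  # part-of-speech tag, e.g. "VBD" or "NN"


class Node(NamedTuple):
    """A constraint on one token; None means "anything"."""
    pos_prefix: Optional[str] = None
    word: Optional[str] = None


def node_matches(spec, token):
    """True if the token satisfies every constraint in the spec."""
    if spec.pos_prefix is not None and not token.pos.startswith(spec.pos_prefix):
        return False
    if spec.word is not None and token.word.lower() != spec.word:
        return False
    return True


def match_rule(rule, tokens, edges):
    """Return every path of token indices that satisfies the rule.

    A rule is (start, steps): a Node for the first token, then a list of
    (relation, Node) pairs, each followed from head to dependent.
    """
    start, steps = rule
    paths = [(i,) for i, tok in tokens.items() if node_matches(start, tok)]
    for relation, spec in steps:
        paths = [
            path + (dep,)
            for path in paths
            for (rel, head, dep) in edges
            if rel == relation and head == path[-1]
            and node_matches(spec, tokens[dep])
        ]
    return paths


# Hand-written, illustrative parse of:
# "Rilly or Glendora had entered her room while she slept"
TOKENS = {
    1: Token("Rilly", "NNP"), 2: Token("or", "CC"), 3: Token("Glendora", "NNP"),
    4: Token("had", "VBD"), 5: Token("entered", "VBN"), 6: Token("her", "PRP$"),
    7: Token("room", "NN"), 8: Token("while", "IN"), 9: Token("she", "PRP"),
    10: Token("slept", "VBD"),
}
# Each edge is (relation, head index, dependent index).
EDGES = [
    ("nsubj", 5, 1), ("cc", 1, 2), ("conj", 1, 3), ("aux", 5, 4),
    ("dobj", 5, 7), ("nmod:poss", 7, 6), ("mark", 10, 8),
    ("nsubj", 10, 9), ("advcl", 5, 10),
]

# One link: any verb with "she" as its nsubj dependent.
SHE_VERB = (Node("VB"), [("nsubj", Node(word="she"))])
# Two links: any verb whose direct object is a noun possessed by "her".
HER_NOUN = (Node("VB"), [("dobj", Node("NN")), ("nmod:poss", Node(word="her"))])

for rule in (SHE_VERB, HER_NOUN):
    paths = match_rule(rule, TOKENS, EDGES)
    print([tuple(TOKENS[i].word for i in path) for path in paths])
```

On this hand-written graph the first rule matches "slept" with "she" and the second matches "entered", "room", and "her". Matching on a tag prefix such as "VB" rather than a full tag is one way to avoid worrying about tense, but it also makes the rule broader, so check what else it picks up.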

Once you’ve created your rule set, set aside your training set and move to your test set. Read through your test set and note all of the places where you hope your rules will extract knowledge. You may print out the test set to do this, or highlight it in a word processor. Run the test set through the dependency parser and print out the result. Without making any changes to your rules, try to apply them to the test set. Mark on the paper everywhere your rules apply. If a rule applies but should not have, note it. Compare the results with your document listing all the knowledge you hoped would be extracted, and note the places your rules missed.

Create a document in which you will do the following for each rule you have proposed:
– List the rule
– Explain the rationale for the rule
– Provide an example of the rule working from your training set
– Provide a table with:
— The number of times the rule was matched in your test set
— The number of times you hoped the rule would be matched in your test set
— The number of times the rule matched in your test set, but it should not have
– Provide a post-mortem on the rule: Did it work? What would you change about it?
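If you keep your marks in a file rather than on paper, the counts for the table can be tallied with a few lines of Python. This is only a sketch under some assumptions: the function name tally is hypothetical, each match is identified by a (sentence number, word) pair that you choose yourself, and "should not have matched" is approximated as "matched but not on your hoped-for list", which is only right if that list is complete. The example data is made up for illustration.

```python
def tally(matched, hoped):
    """Count the table entries for one rule from two collections of match IDs."""
    matched, hoped = set(matched), set(hoped)
    return {
        "matched": len(matched),
        "hoped": len(hoped),
        # Matches that are not on the hoped-for list.
        "matched_but_should_not_have": len(matched - hoped),
        # Hoped-for items the rule never matched.
        "missed": len(hoped - matched),
    }


# Made-up example: IDs are (sentence number, word) pairs.
matched = [(1, "slept"), (3, "ran"), (4, "said")]
hoped = [(1, "slept"), (3, "ran"), (5, "wept")]
print(tally(matched, hoped))
```

The "missed" count is not one of the required table rows, but it is useful when writing the post-mortem.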

Submit your document along with your training set, test set, and any materials created along the way (annotated text, etc.).

References of Interest

Len Schubert’s KNEXT Project

Van Durme, Benjamin, Ting Qian, and Lenhart Schubert. “Class-driven attribute extraction.” Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1. Association for Computational Linguistics, 2008.