Hi There!

I'm Dan Schlegel, an Associate Professor in the Computer Science Department at SUNY Oswego.

Assignment 2

Compare corpora using n-grams

In this assignment you will compare three corpora by examining their n-grams. You will be looking for similarities and differences in word usage/frequency, phrase usage/frequency, structure, content, etc.

Three corpora have been cleaned (mostly stripped of irrelevant tags) and uploaded to Blackboard:

  1. Santa Barbara Corpus of Spoken American English (sbcsae_terminals.txt)
  2. Corpus of Contemporary American English 2017 Update – Fiction Sample (ccae_2017_fic.txt)
  3. Corpus of Contemporary American English 2017 Update – News Sample (ccae_2017_news.txt)

A word of warning: some of these corpora contain text that is quite offensive. One effect of looking at real data is that you see how real humans act.

Begin by exploring the n-grams of the corpora. Then hypothesize about the similarities and differences between them: what do you think explains them? Write at least 10 of your observations in a document, each with a hypothesis about why the observation might be true. For each observation/hypothesis, create a table containing the output from the Python program that supports it. Be sure your hypotheses explore language and its usage; just saying that the difference exists because the corpora are different isn’t sufficient.
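To make the idea of "exploring the n-grams" concrete, here is a minimal sketch of counting n-grams and listing the most frequent ones. It is not the course program (ngram_observations.py), whose interface is not described here; the function names tokenize, ngram_counts, and top_ngrams, the simple regular-expression tokenizer, and the per-million rate column are all assumptions made for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens.

    This is a deliberately simple tokenizer (an assumption for this sketch);
    the course program may tokenize differently.
    """
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(tokens, n, k=10):
    """Return the k most frequent n-grams as (n-gram, count, rate per million)."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    return [
        (" ".join(gram), count, 1_000_000 * count / total)
        for gram, count in counts.most_common(k)
    ]


# Tiny inline sample so the sketch runs without the corpus files.
sample = "The cat sat on the mat and the cat slept"
for gram, count, per_million in top_ngrams(tokenize(sample), n=2, k=3):
    print(f"{gram:<12} {count:>5} {per_million:>12.1f}")
```

To try this on a real corpus, you could read one of the files into a string first, for example with open("ccae_2017_news.txt", encoding="utf-8") (the encoding is an assumption), and pass its contents to tokenize. Reporting a rate alongside the raw count matters because the three corpora are unlikely to be the same size, so raw counts alone are not directly comparable.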

For this assignment, form a team of two people. It will be best if at least one team member is a linguist. You will submit your report by creating a Scalar page and linking to it from both of your pages.

Extra Credit

Each team member should create a corpus of text they have written and compare it (a) with the above corpora and (b) with the other team member’s corpus. To create a corpus of your own text, open Notepad and paste in as much text that you have written as you can. Save the file and load it into the Python program as you did the others. What are the similarities and differences when compared with the other corpora? Add at least 4 more observations to your document.
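One way to look for differences between your own corpus and another is to ask which n-grams are relatively much more frequent in one than in the other. The sketch below is only an illustration under stated assumptions, not part of the course program: the names relative_freqs and distinctive_ngrams are hypothetical, tokens are assumed to be already split into lists of words, and the small smoothing constant (added so that n-grams missing from the second corpus do not cause a division by zero) is an arbitrary choice.

```python
from collections import Counter


def relative_freqs(tokens, n=1):
    """Map each n-gram to its share of all n-grams in the token list."""
    counts = Counter(zip(*(tokens[i:] for i in range(n))))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def distinctive_ngrams(tokens_a, tokens_b, n=1, k=10, smoothing=1e-6):
    """Return the k n-grams whose relative frequency in A most exceeds that in B.

    Each result is (n-gram, ratio), where ratio is the smoothed relative
    frequency in A divided by the smoothed relative frequency in B.
    """
    freq_a = relative_freqs(tokens_a, n)
    freq_b = relative_freqs(tokens_b, n)
    ratios = {
        gram: (freq + smoothing) / (freq_b.get(gram, 0.0) + smoothing)
        for gram, freq in freq_a.items()
    }
    ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
    return [(" ".join(gram), round(ratio, 2)) for gram, ratio in ranked[:k]]


# Tiny inline samples standing in for "my corpus" and "another corpus".
mine = "i really really like tea".split()
other = "i like coffee and i like tea".split()
for gram, ratio in distinctive_ngrams(mine, other, n=1, k=3):
    print(f"{gram:<10} {ratio}")
```

Swapping the two arguments gives the n-grams that are distinctive of the other corpus instead. With a small personal corpus, very rare n-grams can produce large ratios by chance, so it is worth checking the raw counts before building a hypothesis on one of them.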

Python Program (ngram_observations.py)