Hi There!

I'm Dan Schlegel, an Associate Professor in the Computer Science Department at SUNY Oswego.

Assignment 2

Compare corpora using n-grams

In this assignment you will compare three corpora by examining their n-grams. You will be looking for similarities and differences in word usage/frequency, phrase usage/frequency, structure, content, etc.

Three corpora have been cleaned (mostly stripped of irrelevant tags) and uploaded to Blackboard:

  1. Santa Barbara Corpus of Spoken American English (sbcsae_terminals.txt)
  2. Corpus of Contemporary American English 2017 Update – Fiction Sample (ccae_2017_fic.txt)
  3. Corpus of Contemporary American English 2017 Update – News Sample (ccae_2017_news.txt)

A word of warning: some of these corpora contain text that is quite offensive. One effect of looking at real data is seeing how real humans act.

Begin by exploring the n-grams of the corpora using the provided Python program (ngram_observations.py). Hypothesize about similarities and differences between the corpora. Then write at least 10 of your observations in a document, along with your hypothesis about why each observation might be true. Include tables in your document containing the output of the Python program for the different corpora to support your observations and hypotheses.
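To make "exploring the n-grams" concrete, here is a minimal sketch of counting the most frequent n-grams in a piece of text. It is not the provided ngram_observations.py, whose behavior may differ. The function names `ngrams` and `top_ngrams` are hypothetical, and the sketch assumes that lowercasing and splitting on whitespace is an adequate tokenizer for a first look.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(text, n, k=10):
    """Return the k most frequent n-grams in text with their counts.

    Assumption: lowercasing and splitting on whitespace is good enough
    for a first look; punctuation stays attached to words.
    """
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n)).most_common(k)


sample = "the cat sat on the mat and the cat slept"
print(top_ngrams(sample, 2, k=3))
```

To try this on one of the corpora, you could read the file into a string first, for example with `open("sbcsae_terminals.txt", encoding="utf-8").read()`. The encoding is an assumption and may need adjusting for the actual files.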

For this assignment, form a team of 2 people. It will be best if at least one of your team members is a linguist. You need only submit one report per team.

Extra Credit

Each team member should create a corpus of text they have written, and compare it a) with the above corpora, and b) with the other team member’s corpus. To create a corpus of your own text, open Notepad (or any plain-text editor) and copy/paste into the file as much text as you can that you have written. Save it and load it into the Python program as you did the others. What are the similarities and differences when compared with the other corpora? Add at least 4 more observations to your report.
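A personal corpus will likely be much smaller than the provided ones, so raw n-gram counts are not directly comparable. One common way around this is to normalize counts to a rate, such as occurrences per million n-grams, before comparing. The following is a minimal sketch of that idea under the same whitespace-tokenization assumption. The names `relative_freqs` and `biggest_differences` are hypothetical and are not part of ngram_observations.py.

```python
from collections import Counter


def relative_freqs(tokens, n=1):
    """Map each n-gram to its frequency per million n-grams."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {g: c * 1_000_000 / total for g, c in counts.items()}


def biggest_differences(tokens_a, tokens_b, n=1, k=5):
    """Return the k n-grams whose normalized frequency differs most.

    A positive difference means the n-gram is relatively more frequent
    in corpus A; a negative one means it is more frequent in corpus B.
    """
    fa = relative_freqs(tokens_a, n)
    fb = relative_freqs(tokens_b, n)
    diffs = {g: fa.get(g, 0.0) - fb.get(g, 0.0) for g in set(fa) | set(fb)}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)[:k]


mine = "i think i know".split()
other = "we know we know".split()
for gram, diff in biggest_differences(mine, other, n=1, k=4):
    print(gram, diff)
```

With very small corpora the per-million figures are noisy, so large differences on rare n-grams should be treated with caution when you write up your observations.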

Python Program (ngram_observations.py)