Assignment 2

Compare corpora using n-grams

In this assignment you will compare three corpora by examining their n-grams. You will be looking for similarities and differences in word usage/frequency, phrase usage/frequency, structure, content, etc.

Three corpora have been cleaned (mostly stripped of irrelevant tags) and uploaded to Blackboard:

Santa Barbara Corpus of Spoken American English (sbcsae_terminals.txt)
Corpus of Contemporary American English 2017 Update – Fiction Sample (ccae_2017_fic.txt)
Corpus of Contemporary American English 2017 Update – News Sample (ccae_2017_news.txt)

A word of warning: Some of these corpora contain some text which is quite offensive. One of the effects of looking at real data is to see how real humans act.

Begin by exploring the n-grams of the corpora. Then hypothesize about the similarities and differences between them: what do you think explains them? Write at least 10 of your observations in a document, along with your hypotheses about why each observation might be true. For each observation/hypothesis, create a table containing the output from the Python program that supports it. Be sure your hypotheses explore language and its usage; just saying that the difference is because the corpora are different isn’t sufficient.
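
If the corpora differ in size, raw counts from one file are not directly comparable with raw counts from another. One possible way to handle this, sketched below and not part of the provided program, is to convert counts into a rate per 1,000 n-grams. The function name `rate_per_thousand` and the two toy sentences are made up for illustration, and plain `split()` stands in for the NLTK tokenizer so the example runs on its own.

```python
import collections

def rate_per_thousand(tokens, n):
    """Map each n-gram (as a string) to its frequency per 1,000 n-grams."""
    ngrams = [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = collections.Counter(ngrams)
    total = len(ngrams)
    if total == 0:
        return {}
    return {gram: 1000 * count / total for gram, count in counts.items()}

# Toy stand-ins for two corpora (hypothetical text, not from the real files).
spoken = "you know I mean you know it was like you know".split()
news = "officials said the report was released on Monday".split()

spoken_rates = rate_per_thousand(spoken, 2)
news_rates = rate_per_thousand(news, 2)

# Compare the same bigram across both corpora; a missing bigram has rate 0.
for gram in ["you know", "officials said"]:
    print(gram, round(spoken_rates.get(gram, 0), 1), round(news_rates.get(gram, 0), 1))
```

A rate like this only describes how often something occurs; the hypothesis about why it occurs more in one corpus than another still has to come from you.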

For this assignment, form a team of 2 people. It will be best if at least one of your team members is a linguist. You will submit your report by creating a Scalar page and linking it from both of your pages.

Extra Credit

Each team member should create a corpus of text they have written, and compare it a) with the above corpora, and b) with the other team member’s corpus. To create a corpus of your own text, open Notepad and copy/paste into the file as much text that you have written as you can. Save it and load it into the Python program as you did the others. What are the similarities and differences when compared with the other corpora? Add at least 4 more observations to your report.
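
One way to start comparing two personal corpora is to check which of the most frequent n-grams the two share and which appear in only one. The sketch below is an illustration rather than part of the provided program: `top_ngrams` is a hypothetical helper, the two short texts are invented, and lowercasing plus `split()` replaces the NLTK tokenizer to keep the example self-contained.

```python
import collections

def top_ngrams(tokens, n, k):
    """Return the set of the k most frequent n-grams (joined into strings)."""
    grams = [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return {gram for gram, _ in collections.Counter(grams).most_common(k)}

# Invented stand-ins for two team members' writing.
mine = ("i think so but i think not and i think maybe "
        "we should go or we should stay").lower().split()
theirs = ("we should ask and we should try so we should go "
          "because you know them and you know us").lower().split()

my_top = top_ngrams(mine, 2, 2)
their_top = top_ngrams(theirs, 2, 2)

print("shared:", sorted(my_top & their_top))
print("only mine:", sorted(my_top - their_top))
print("only theirs:", sorted(their_top - my_top))
```

With real corpora you would probably use a larger `k`, and n-grams tied in frequency at the cutoff may be included or left out arbitrarily.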

Python Program (ngram_observations.py)

# ngram_observations.py
# A set of functions for examining characteristics of ngrams.
# Author: Daniel R. Schlegel
# Modified: 2/27/19

r'''
How to use this program:

First, if you are using Thonny and haven't yet installed nltk, from the Tools menu choose Manage Packages. In the window that appears type
nltk and hit the Search button. Then once it is found, click the Install button. You may now close that window. Then, in the console at the bottom of Thonny, type: 

import nltk
nltk.download('punkt') 

and hit enter. It will take a minute to download some required files.

Then:

1) At the bottom of this file, change the file location to the location of the input file you want to use.
2) Load the program into Python. It will take a few seconds to build unigrams through 5-grams.
3) Run commands like the following:

print_ngram_freq_list(unigrams, top=100, pattern=None)

This will get the top 100 unigrams along with their counts from the corpus.

You can:
- Change unigrams to bigrams, trigrams, quadrigrams, or pentagrams to look at the top results for those.
- Change top=100 to a different value (top=10, top=50, ...) to get a different number of results.
- Add a regular expression pattern to only get ngrams which match that pattern. 

More example usages:

print_ngram_freq_list(bigrams, top=5, pattern='ing')

This gets the top 5 bigrams that contain 'ing'.

print_ngram_freq_list(trigrams, top=5, pattern=r'^\w+ings')

This gets the top 5 trigrams where the first word contains 'ings' after at least one other character (for example, 'things' or 'sings').

More advanced usage:
- You can create ngrams of any length by following the form used at the bottom of the file to create the unigrams etc. 
'''

import random
import nltk
import re
import collections
from nltk import word_tokenize

def tokenize_file(fname):
    # Read the whole file (ignoring undecodable bytes) and split it into tokens.
    with open(fname, errors='ignore') as file:
        return nltk.word_tokenize(file.read())

# This function gets the raw ngrams, without figuring out counts, duplicates, etc.
# Optionally takes a regular expression pattern as an argument, then only returns
# ngrams which match that pattern.
def get_ngrams(all_tokens, length, pattern = None):
    ngrams = []
    p = None
    if(pattern):
        p = re.compile(pattern)
    for x in range(0, len(all_tokens)-length+1):
        if p:
            if p.search(' '.join(all_tokens[x:x+length])):
                ngrams.append(all_tokens[x:x+length])
        else:
            ngrams.append(all_tokens[x:x+length])
    return ngrams

# Prints ngrams with their counts, most frequent first.
# Optionally limits output to the top results and/or to ngrams which match
# a regular expression pattern.
def print_ngram_freq_list(ngrams, top=None, pattern=None):
    ngram_strs = [' '.join(x) for x in ngrams]
    freqs = collections.Counter(ngram_strs)
    if pattern:
        p = re.compile(pattern)
        outcounts = {k: v for k, v in freqs.items() if p.search(k)}
    else:
        outcounts = freqs
    sorted_outcounts = [(k, outcounts[k]) for k in sorted(outcounts, key=outcounts.get, reverse=True)]
    if top is None:
        top = len(sorted_outcounts)
    for i in range(min(top, len(sorted_outcounts))):
        (k, v) = sorted_outcounts[i]
        print(v, k)
    if len(sorted_outcounts) == 0:
        print("No Matches.")
    return None

# Be sure to use forward slashes in your file path!
tokens = tokenize_file("C:/Users/digit/Dropbox/Teaching/COG376/Corpora/Cleaned/ccae_2017_fic.txt")
unigrams = get_ngrams(tokens, 1)
bigrams = get_ngrams(tokens, 2)
trigrams = get_ngrams(tokens, 3)
quadrigrams = get_ngrams(tokens, 4)
pentagrams = get_ngrams(tokens, 5)
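
The usage notes say that n-grams of any length can be built by following the same form as above, and the code shows that a pattern is searched against each n-gram joined by single spaces. The sketch below illustrates both points on a toy token list. It is self-contained for illustration only: `make_ngrams` is a simplified stand-in for `get_ngrams`, the sentence is invented, and `split()` replaces the NLTK tokenizer. With the real program loaded, the first step would look something like `six_grams = get_ngrams(tokens, 6)`, where the variable name is an arbitrary choice.

```python
import re

def make_ngrams(all_tokens, length):
    """Simplified stand-in for get_ngrams: every run of `length` tokens."""
    return [all_tokens[x:x + length] for x in range(len(all_tokens) - length + 1)]

tokens = "she sings about things and is singing loudly".split()

# Same form as unigrams through pentagrams, just with a larger length.
six_grams = make_ngrams(tokens, 6)
print(len(six_grams))

# Patterns are searched against the n-gram joined by single spaces.
bigram_strs = [' '.join(g) for g in make_ngrams(tokens, 2)]

# 'ing' may match anywhere in the n-gram, in either word.
contains_ing = [s for s in bigram_strs if re.search('ing', s)]

# '^\w+ings' is anchored at the start, so only the first word can match.
first_word_ings = [s for s in bigram_strs if re.search(r'^\w+ings', s)]

print(contains_ing)
print(first_word_ings)
```

Writing patterns as raw strings (the `r` prefix) keeps Python from interpreting the backslash in `\w` before the regular expression engine sees it.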
