Compare corpora using n-grams
In this assignment you will compare three corpora by examining their n-grams. You will be looking for similarities and differences in word usage/frequency, phrase usage/frequency, structure, content, etc.
Three corpora have been cleaned (mostly stripped of irrelevant tags) and uploaded to Blackboard:
- Santa Barbara Corpus of Spoken American English (sbcsae_terminals.txt)
- Corpus of Contemporary American English 2017 Update – Fiction Sample (ccae_2017_fic.txt)
- Corpus of Contemporary American English 2017 Update – News Sample (ccae_2017_news.txt)
A word of warning: some of these corpora contain text that is quite offensive. One consequence of working with real data is seeing how real humans actually speak and write.
Begin by exploring the n-grams of the corpora. Hypothesize about similarities and differences between them. Then write at least 10 of your observations in a document, along with your hypotheses about why each observation might be true. To support your observations and hypotheses, create tables in your document containing the output of the Python program for the different corpora.
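One simple way to organize such a comparison is to take the top n-grams from each corpus and look at their overlap. The sketch below illustrates the idea on toy token lists; the `top_ngrams` helper and the sample tokens are illustrative stand-ins, not part of the assignment program.

```python
import collections

def top_ngrams(tokens, n, k):
    """Return the k most frequent n-grams (as space-joined strings)."""
    counts = collections.Counter(
        ' '.join(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
    return [gram for gram, _ in counts.most_common(k)]

# Toy stand-ins for two corpora; with the real data, use the token
# lists produced by tokenize_file instead.
spoken = "you know I mean you know like I said you know".split()
news = "the president said on Tuesday the president announced".split()

top_spoken = set(top_ngrams(spoken, 2, 5))
top_news = set(top_ngrams(news, 2, 5))

print("shared:", top_spoken & top_news)
print("spoken only:", top_spoken - top_news)
```

Tables of "shared" versus "corpus-only" top n-grams are one convenient format for presenting observations in your report.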
For this assignment, form a team of two people. It will be best if at least one of your group members is a linguist. You need only submit one report per team.
Extra Credit
Each team member should create a corpus of text they have written, and compare it (a) with the above corpora and (b) with the other team member’s corpus. To create a corpus of your own text, open Notepad and paste in as much text as you can that you yourself have written. Save the file and load it into the Python program as you did the others. What are the similarities and differences when compared with the other corpora? Add at least 4 more observations to your report.
Python Program (ngram_observations.py)
# ngram_observations.py
# A set of functions for examining characteristics of n-grams.
# Author: Daniel R. Schlegel
# Modified: 2/14/18

'''
How to use this program:

First, if you are using Thonny, from the Tools menu choose Manage Packages.
In the window that appears, type nltk and hit the Search button. Once it is
found, click the Install button. You may now close that window.

Then, in the console at the bottom of Thonny, type:

import nltk
nltk.download('punkt')

and hit enter. It will take a minute to download some required files.

Then:
1) At the bottom of this file, change the file location to the location of
   the input file you want to use.
2) Load the program into Python. It will take a second to build unigrams
   through pentagrams.
3) Run commands like the following:

print_ngram_freq_list(unigrams, top=100, pattern=None)

This will get the top 100 unigrams along with their counts from the corpus.

You can:
- Change unigrams to bigrams, trigrams, quadrigrams, or pentagrams to look
  at the top results for those.
- Change top=100 to a different value (top=10, top=50, ...) to get a
  different number of results.
- Add a regular expression pattern to only get n-grams which match that
  pattern.

More example usages:

print_ngram_freq_list(bigrams, top=5, pattern='ing')

This gets the top 5 bigrams that contain 'ing'.

print_ngram_freq_list(trigrams, top=5, pattern=r'^\w+ing\s')

This gets the top 5 trigrams whose first word ends in 'ing'.

More advanced usage:
- You can create n-grams of any length by following the form used at the
  bottom of the file to create the unigrams etc.
'''

import re
import collections

import nltk


def tokenize_file(fname):
    """Read a file and split its contents into a list of word tokens."""
    with open(fname, errors='ignore') as file:
        return nltk.word_tokenize(file.read())


# This function gets the raw n-grams, without figuring out counts, duplicates,
# etc. Optionally takes a regular expression pattern as an argument, then only
# returns n-grams which match that pattern.
def get_ngrams(all_tokens, length, pattern=None):
    ngrams = []
    p = re.compile(pattern) if pattern else None
    for x in range(len(all_tokens) - length + 1):
        if p:
            if p.search(' '.join(all_tokens[x:x+length])):
                ngrams.append(all_tokens[x:x+length])
        else:
            ngrams.append(all_tokens[x:x+length])
    return ngrams


def print_ngram_freq_list(ngrams, top=None, pattern=None):
    """Print the n-grams in descending order of frequency, preceded by their
    counts. Optionally limit output to the top n results, and/or to n-grams
    matching a regular expression pattern."""
    ngram_strs = [' '.join(x) for x in ngrams]
    freqs = collections.Counter(ngram_strs)
    if pattern:
        p = re.compile(pattern)
        outcounts = {k: v for k, v in freqs.items() if p.search(k)}
    else:
        outcounts = freqs
    sorted_outcounts = sorted(outcounts.items(), key=lambda kv: kv[1],
                              reverse=True)
    if top is None:
        top = len(sorted_outcounts)
    for k, v in sorted_outcounts[:top]:
        print(v, k)
    if len(sorted_outcounts) == 0:
        print("No Matches.")


# Be sure to use forward slashes in your file path!
tokens = tokenize_file("C:/Users/digit/Dropbox/Teaching/COG376/18S/Corpora/Cleaned/ccae_2017_fic.txt")

unigrams = get_ngrams(tokens, 1)
bigrams = get_ngrams(tokens, 2)
trigrams = get_ngrams(tokens, 3)
quadrigrams = get_ngrams(tokens, 4)
pentagrams = get_ngrams(tokens, 5)
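To see what the pattern argument is doing, here is a minimal standalone sketch of the same filtering idea on a toy sentence: each n-gram is joined into a single string, and only strings the regex matches are kept and counted. The token list is an illustrative stand-in for a real corpus.

```python
import re
import collections

tokens = "running fast and walking slowly and running far".split()

# Build all bigrams as space-joined strings.
bigrams = [' '.join(tokens[i:i+2]) for i in range(len(tokens) - 1)]

# Keep only bigrams whose first word ends in 'ing', as in the
# pattern=r'^\w+ing\s' example above.
p = re.compile(r'^\w+ing\s')
matching = [b for b in bigrams if p.search(b)]

counts = collections.Counter(matching)
print(counts.most_common())
```

This mirrors how get_ngrams and print_ngram_freq_list apply their pattern: the regex is searched against the joined n-gram string, so anchors like `^` refer to the start of the whole n-gram, not of each word.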