Many students have asked for a "study guide". Below I lay out the
topics that you should be familiar with for the second exam. There
will be no programming questions on the exam. All of the questions
will be conceptual questions, largely about the lecture material but
also about the topics covered in the labs.
I will not provide sample questions for you to answer. You can expect
that there will be short-answer questions and extended questions
requiring one or two paragraphs. There will be some questions that
require you to be knowledgeable about the mathematical formulas that
we've used to describe various NLP algorithms.
This list of topics is a superset of the topics that will be covered
on the exam. Since the exam takes place in a two-hour block, not
everything can be covered.
The second exam covers only material from the second half of the
class (i.e., everything since the first exam).
Exam 2 Topics
There's really nothing special about this list; it's just a slightly
more extended version of the syllabus. Some of the subtopics below
really go under multiple topics, so don't read too much into the way
that I've organized this.
- Lexical Semantics
- WordNet
- Word sense disambiguation
- Lexical sample task vs all-words task
- Minimally-supervised bootstrapping (Yarowsky algorithm)
- McCarthy et al. (Learning most frequent sense)
- Lesk algorithm
- Sentiment Analysis
- Pang et al. (Movie review classification)
- O'Connor et al. (Correlating tweets with polling data)
- Basilisk and Metaboot (Learning subjective nouns)
- Classification and Information Retrieval
- Baselines in classification (e.g. random, most-frequent sense)
- Decision lists and log-likelihood
- Naive Bayes
- Feature selection
- Vector representations of documents
- Cosine similarity
- Term frequency and Inverse Document Frequency (see the tf-idf
sketch at the end of this list)
- Parsing
- Constituency parsing vs dependency parsing
- Treebanks
- Context-free grammars (CFGs) and Chomsky normal form
- Recursive descent parser
- Chart parsing / CKY algorithm (see the CKY sketch at the end of this list)
- Probabilistic CFG (PCFG) and the Viterbi algorithm
- Here's an exercise you can practice with. The first sentence is short but somewhat annoying to parse. The second sentence is longer and more interesting to parse.
- Clustering
- k-means clustering (see the k-means sketch at the end of this list)
- hierarchical agglomerative clustering; single-, complete-, and
average-link
- dendrograms
- cluster evaluation: purity, mutual information, residual sum of
squares (RSS)
- Machine Translation
- IBM Models 1, 2, and 3: big idea and basic understanding of the formulation
- Training process / EM
- Knight workbook (main ideas). You already know most of the
formulas through the end of section 24. There's no need to memorize
the formulas in and after section 25, but you should understand them
at a high level.
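
To make a few of these topics concrete, here are three small sketches
you can use to check your understanding. They are not the code from
lecture or the labs, and the corpora, grammars, and data points in
them are made up for illustration. First, tf-idf weighting and cosine
similarity between documents, over a toy corpus:

```python
# Minimal tf-idf + cosine similarity sketch on a hypothetical toy corpus.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are animals".split(),
]

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, weighted by tf * idf."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # shares "the", "sat", "on": > 0
print(cosine(vectors[0], vectors[2]))  # no shared terms: 0.0
```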
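
Next, a minimal CKY recognizer for a CFG in Chomsky normal form. The
toy grammar and sentences are again made up; a full parser would also
keep backpointers to recover trees, and rule probabilities for the
PCFG/Viterbi version:

```python
# Minimal CKY recognition sketch over a hypothetical CNF grammar.
from collections import defaultdict

# Binary rules A -> B C, and lexical rules A -> word
binary_rules = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexical_rules = {
    "the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"},
}

def cky_recognize(words, start="S"):
    n = len(words)
    chart = defaultdict(set)  # chart[(i, j)] = nonterminals spanning words[i:j]
    for i, w in enumerate(words):
        chart[(i, i + 1)] |= lexical_rules.get(w, set())
    for width in range(2, n + 1):      # span width
        for i in range(n - width + 1): # left edge of the span
            j = i + width
            for k in range(i + 1, j):  # split point
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        chart[(i, j)] |= binary_rules.get((b, c), set())
    return start in chart[(0, n)]

print(cky_recognize("the dog saw the cat".split()))  # True
print(cky_recognize("saw the dog".split()))          # False
```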
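
Finally, a bare-bones version of the k-means assign/update loop on
made-up 2-D points. It leaves out the convergence test and random
restarts you would want in practice:

```python
# Bare-bones k-means sketch on hypothetical 2-D points.
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)  # initialize with k random points
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(coords) / len(members)
                                     for coords in zip(*members))
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
centroids, clusters = kmeans(points, k=2)
print(centroids)  # roughly one centroid near (1, 1) and one near (5, 5)
```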