Many students have asked for a "study guide". Below I lay out the
topics that you should be familiar with for the second exam. There
will be no programming questions on the exam. All of the questions
will be conceptual questions, largely about the lecture material but
also about the topics covered in the labs.
I will not provide sample questions for you to answer. You can expect
that there will be short-answer questions and extended questions
requiring one or two paragraphs. There will be some questions that
require you to be knowledgeable about the mathematical formulas that
we've used to describe various NLP algorithms.
This list of topics is a superset of the topics that will be covered
on the exam. Since the exam takes place in a two-hour block, not
everything can be covered.
The second exam covers only material from the second half of the
class (i.e., everything since the first exam).
Exam 2 Topics
There's really nothing special about this list; it's just a slightly
more extended version of the syllabus. Some of the subtopics below
really go under multiple topics, so don't read too much into the way
that I've organized this.
- Lexical Semantics
- WordNet
- Word sense disambiguation
- Lexical sample task vs all-words task
- Minimally-supervised bootstrapping (Yarowsky algorithm)
- McCarthy et al. (Learning most frequent sense)
- Lesk algorithm
- Sentiment Analysis
- Pang et al. (Movie review classification)
- O'Connor et al. (Correlating tweets with polling data)
- Basilisk and Metaboot (Learning subjective nouns)
- Classification and Information Retrieval
- Baselines in classification (e.g. random, most-frequent sense)
- Decision lists and log-likelihood
- Naive Bayes
- Feature selection
- Vector representations of documents
- Cosine similarity
- Term frequency and Inverse Document Frequency (see the tf-idf
sketch at the end of this list)
- Parsing
- Constituency parsing vs dependency parsing
- Treebanks
- Context-free grammars (CFGs) and Chomsky normal form
- Recursive descent parser
- Chart parsing / CKY algorithm (see the CKY sketch at the end of this list)
- Probabilistic CFG (PCFG) and the Viterbi algorithm
- Here's an exercise you can practice with. The first sentence is short but somewhat annoying to parse. The second sentence is longer and more interesting to parse.
- Clustering
- k-means clustering (see the k-means sketch at the end of this list)
- hierarchical agglomerative clustering; single-, complete-, and
average-link
- dendrograms
- cluster evaluation: purity, mutual information, residual sum of
squares (RSS)
- Machine Translation
- IBM Models 1, 2, and 3: big idea and basic understanding of the formulation
- Training process / EM
- Knight workbook (main ideas). You already know most of the
formulas through the end of section 24. There's no need to memorize
the formulas in and after section 25, but you should understand them
at a high level.
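
To make a few of these topics concrete, here are three small sketches
you can use to check your understanding. They are not the code from
lecture or the labs, and the corpora, grammars, and data points in
them are made up for illustration. First, tf-idf weighting and cosine
similarity between documents, over a toy corpus:

```python
# Minimal tf-idf + cosine similarity sketch on a hypothetical toy corpus.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are animals".split(),
]

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, weighted by tf * idf."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # shares "the", "sat", "on": > 0
print(cosine(vectors[0], vectors[2]))  # no shared terms: 0.0
```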
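
Next, a minimal CKY recognizer for a CFG in Chomsky normal form. The
toy grammar and sentences are again made up; a full parser would also
keep backpointers to recover trees, and rule probabilities for the
PCFG/Viterbi version:

```python
# Minimal CKY recognition sketch over a hypothetical CNF grammar.
from collections import defaultdict

# Binary rules A -> B C, and lexical rules A -> word
binary_rules = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexical_rules = {
    "the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"},
}

def cky_recognize(words, start="S"):
    n = len(words)
    chart = defaultdict(set)  # chart[(i, j)] = nonterminals spanning words[i:j]
    for i, w in enumerate(words):
        chart[(i, i + 1)] |= lexical_rules.get(w, set())
    for width in range(2, n + 1):      # span width
        for i in range(n - width + 1): # left edge of the span
            j = i + width
            for k in range(i + 1, j):  # split point
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        chart[(i, j)] |= binary_rules.get((b, c), set())
    return start in chart[(0, n)]

print(cky_recognize("the dog saw the cat".split()))  # True
print(cky_recognize("saw the dog".split()))          # False
```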
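
Finally, a bare-bones version of the k-means assign/update loop on
made-up 2-D points. It leaves out the convergence test and random
restarts you would want in practice:

```python
# Bare-bones k-means sketch on hypothetical 2-D points.
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)  # initialize with k random points
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(coords) / len(members)
                                     for coords in zip(*members))
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
centroids, clusters = kmeans(points, k=2)
print(centroids)  # roughly one centroid near (1, 1) and one near (5, 5)
```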