Some topics we have covered in class and/or lab:
Segmentation
- tokenization
- lemma(tization)
- stem(ming)
- sentence segmentation
- word type
- word token
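
A quick sketch of the type/token distinction, in case it helps. The tokenizer here is a deliberately crude assumption (lowercase everything, keep runs of letters, digits, and apostrophes), not the one from lab; the sentence is made up.

```python
import re

def tokenize(text):
    # Crude tokenizer (an assumption for illustration):
    # lowercase, then keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("The cat sat on the mat. The mat was flat.")
types = set(tokens)          # distinct word forms

num_tokens = len(tokens)     # every occurrence counts
num_types = len(types)       # each distinct form counts once
print(num_tokens, num_types)
```

Tokens count every running word; types count each distinct word once, so "the" adds three tokens but only one type.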
Regular Expressions
- range
- Kleene *
- Kleene +
- anchor
- disjunction
- (non-)greedy matching
- capturing group
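
The regex terms above can be tried out with Python's re module. This is only a sketch: the example strings (including the email address) are made up for illustration.

```python
import re

# Range + Kleene plus + disjunction: runs of digits, or "cat"/"cats".
found = re.findall(r"[0-9]+|cats?", "3 cats and 12 dogs, 1 cat")

# Kleene star allows zero repetitions; Kleene plus needs at least one.
star_ok = re.fullmatch(r"ba*", "b") is not None
plus_ok = re.fullmatch(r"ba+", "b") is not None

# Anchors (^ and $) pin the match to the whole string;
# parentheses make capturing groups.
m = re.search(r"^(\w+)@(\w+)\.edu$", "student@college.edu")
user, domain = m.groups()

# Greedy .+ grabs as much as it can; non-greedy .+? stops at the first ">".
tagged = "<b>bold</b> and <i>italic</i>"
greedy = re.findall(r"<.+>", tagged)
lazy = re.findall(r"<.+?>", tagged)

print(found, star_ok, plus_ok, user, domain, greedy, lazy)
```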
Probability / Language models
- basic facts
- conditional probability
- joint probability
- chain rule
- noisy channel model
- maximum likelihood estimate
- Markov assumption
- n-gram models
- log probability
- perplexity (we didn’t actually talk about this, but it’s cool)
- zero counts
- unknown words
- smoothing
- discounting
- backoff
- interpolation
- Laplace smoothing
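
A minimal sketch tying together several of the language-model terms: maximum likelihood estimates for a bigram model (Markov assumption: condition on one previous word), zero counts, Laplace (add-one) smoothing, and log probabilities. The two-sentence corpus and the function names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (an assumption), with sentence boundary markers.
corpus = [
    ["<s>", "i", "am", "sam", "</s>"],
    ["<s>", "sam", "i", "am", "</s>"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size, counting <s> and </s>

def p_mle(prev, word):
    # Maximum likelihood estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]

def p_laplace(prev, word):
    # Add-one smoothing: unseen bigrams no longer get probability zero.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def sentence_logprob(sent):
    # Chain rule + Markov assumption, summed in log space to avoid underflow.
    return sum(math.log(p_laplace(a, b)) for a, b in zip(sent, sent[1:]))

print(p_mle("i", "am"), p_mle("am", "i"), p_laplace("am", "i"))
print(sentence_logprob(["<s>", "i", "am", "sam", "</s>"]))
```

The unseen bigram ("am", "i") has an MLE of zero, which would zero out any sentence containing it; with add-one smoothing it gets a small nonzero probability instead.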
Encodings
Evaluation
- corpus
- false positives
- false negatives
- training set
- development set
- test set
- cross-validation
- precision
- recall
- F-measure
- intrinsic and extrinsic evaluation (also cool, but we didn’t talk about them)
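
Precision, recall, and F-measure can be sketched for a binary task as below. The gold and predicted labels are invented, and the F-measure shown is the balanced F1 (harmonic mean of precision and recall), which is only one member of the F-measure family.

```python
def precision_recall_f1(gold, predicted):
    # gold and predicted are parallel lists of 0/1 labels (1 = positive class).
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, f1)
```

Here there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.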
Edit Distance
- minimum (weighted) edit distance
- alignment
- backtrace
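
A sketch of minimum edit distance by dynamic programming. Insertions and deletions cost 1 here; the substitution cost is a parameter (sub_cost, a name chosen for this example), so the same function covers the plain and the weighted version. It returns only the distance. Recovering the alignment would additionally require storing a backpointer per cell and following the backtrace from the last cell.

```python
def min_edit_distance(source, target, sub_cost=1):
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i              # delete everything
    for j in range(1, m + 1):
        dist[0][j] = j              # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                               # deletion
                dist[i][j - 1] + 1,                               # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),   # substitution or match
            )
    return dist[n][m]

print(min_edit_distance("intention", "execution"))
print(min_edit_distance("intention", "execution", sub_cost=2))
```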
Vector Semantics
- embedding
- term-document matrix
- vector space model
- term-term matrix / word-word matrix / term-context matrix
- cosine similarity
- tf-idf
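
A small sketch of a term-document matrix, cosine similarity, and tf-idf. The three documents are made up. The tf-idf weighting shown (log10 of count plus one, times log10 of N over document frequency) is one common variant and an assumption here; other definitions differ in the details.

```python
import math
from collections import Counter

# Toy documents (an assumption), already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
counts = [Counter(d) for d in docs]
vocab = sorted({w for d in docs for w in d})

# Term-document matrix: one row per term, one column per document.
term_doc = {w: [c[w] for c in counts] for w in vocab}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def tf_idf(word, doc_index):
    # Only defined here for words in vocab (document frequency >= 1).
    tf = math.log10(counts[doc_index][word] + 1)
    df = sum(1 for c in counts if c[word] > 0)
    idf = math.log10(len(docs) / df)
    return tf * idf

print(term_doc["cat"], term_doc["dog"])
print(cosine(term_doc["cat"], term_doc["dog"]))
print(tf_idf("the", 0), tf_idf("mat", 0))
```

A word like "the" that occurs in every document gets an idf of zero, so its tf-idf weight vanishes no matter how often it appears.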
Lexical Semantics
- unsupervised learning
- supervised learning