Introduction to NLP
Course Description
This course provides an introduction to natural language processing (NLP), a subfield of artificial intelligence concerned with the interaction between computers and human language. The course will cover the fundamental concepts and techniques of NLP, including text pre-processing, morphology, syntax, semantics, text classification, topic modeling, and word embeddings.
Learning Goals
By the end of this course, students will be able to:
- Understand key concepts, models, and challenges in natural language processing
- Implement and apply fundamental algorithms in NLP
- Evaluate and use software systems for various NLP tasks
- Understand current approaches, datasets, and systems for various NLP tasks
- Build NLP models for various applications
Textbook
- Jurafsky, D., & Martin, J. H. Speech and Language Processing (3rd ed. draft).
- Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.
Lecture Notes
The lecture notes for the course are available at the following link:
https://lecture.halla.ai/lectures/nlp_intro
Grading
- Participation: 10%
- Midterm: 30%
- Term Project: 60%
Plagiarism
Presenting someone else’s ideas as your own, either verbatim or recast in your own words, is a serious academic offense with severe consequences. Please familiarize yourself with the discussion of plagiarism in our campus policies.
Term Project
For the term project, students will choose a real-world dataset and build an NLP system that can perform some task on the dataset. The project will be presented in the form of a research paper and a poster at the end of the semester.
Course Outline
Week 1: Introduction
In the first week, we will provide an overview of the course and discuss the basics of natural language processing. We will cover what NLP is, how it works, and why it is a challenging field. We will also discuss the history and current state of the field.
Week 2: Morphology
In the second week, we will cover the basics of morphology, which is the study of the internal structure of words. We will cover types of morphemes, rules of thumb for identifying morphemes, and WordNet, a lexical database for English.
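For a first taste of WordNet in code, here is a minimal sketch using NLTK (the library from the Bird, Klein & Loper textbook); the example words are illustrative, and the WordNet data must be downloaded once before use:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

# Look up the senses (synsets) of an example verb
for synset in wn.synsets("run", pos=wn.VERB)[:3]:
    print(synset.name(), "-", synset.definition())

# WordNet's built-in morphological analyzer strips inflectional morphemes
print(wn.morphy("denied", wn.VERB))    # -> deny
print(wn.morphy("churches", wn.NOUN))  # -> church
```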
Week 3: Syntax
In the third week, we will cover the basics of syntax, which is the study of the structure of sentences. We will cover part-of-speech tagging, the process of assigning a part of speech to each word in a sentence. We will also cover chunking and shallow parsing, dependency parsing, and constituency parsing.
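As a preview, the sketch below tags an example sentence and chunks noun phrases with NLTK; the sentence and the tiny NP grammar are illustrative choices, not part of the syllabus:

```python
import nltk

# One-time downloads for the tokenizer and tagger models
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
tagged = nltk.pos_tag(tokens)  # [('The', 'DT'), ('quick', 'JJ'), ...]
print(tagged)

# Shallow parsing: chunk noun phrases with a small handwritten grammar
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))
```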
Week 4: Semantics I
In the fourth week, we will cover the basics of semantics, which is the study of the meaning of words and sentences. We will cover word sense disambiguation, the process of determining the correct meaning of a word in context. We will also cover semantic role labeling, the process of identifying the semantic roles, such as agent and patient, that phrases play in a sentence.
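NLTK ships a simplified Lesk algorithm for word sense disambiguation; the sketch below is a minimal illustration (the sentence is made up, and Lesk is a classic baseline rather than a state-of-the-art method):

```python
import nltk
from nltk import word_tokenize
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

# Disambiguate "bank" in context with the simplified Lesk algorithm
context = word_tokenize("I went to the bank to deposit my paycheck")
sense = lesk(context, "bank", pos="n")
print(sense, "-", sense.definition())
```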
Week 5: Text Pre-processing
In the fifth week, we will cover the basics of text pre-processing, the process of cleaning and preparing text for analysis. We will cover case conversion, which converts text to lowercase or uppercase; normalization; stemming and lemmatization, which reduce words to a base form; and named entity recognition, which identifies named entities such as people, organizations, and locations.
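The sketch below runs a few of these steps with NLTK; the example words are illustrative, and the stemmer-versus-lemmatizer contrast is the point:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

text = "The Studies showed Better Results"
print(text.lower())  # case conversion

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("studies"))                   # crude suffix stripping -> 'studi'
print(lemmatizer.lemmatize("studies", pos="n"))  # dictionary-based -> 'study'
print(lemmatizer.lemmatize("better", pos="a"))   # -> 'good'
```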
Week 6: Tokenization
In the sixth week, we will cover the basics of tokenization, which is the process of breaking text into individual words or tokens. We will cover rule-based tokenization, which uses handwritten rules to split text into tokens, and statistical tokenization, which uses machine learning algorithms to do so.
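To contrast the two approaches, the sketch below tokenizes the same sentence with a handwritten regular expression and with NLTK's tokenizer, which builds on the trained Punkt model; the example sentence is made up:

```python
import re
import nltk

nltk.download("punkt", quiet=True)

text = "Dr. Smith doesn't work in N.Y.C., does she?"

# Rule-based: a simple regex (note how it breaks "doesn't" and "N.Y.C.")
print(re.findall(r"\w+|[^\w\s]", text))

# Statistical: NLTK's word_tokenize handles such cases more gracefully
print(nltk.word_tokenize(text))
```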
Week 7: Vector Semantics
In the seventh week, we will cover the basics of vector semantics, which is the process of representing words and documents as vectors in a high-dimensional space. We will cover word vector models, which represent words as vectors, and document vector models, which represent documents as vectors. We will also cover the evaluation of vector semantics models.
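As a preview of document vectors, the sketch below embeds three toy documents with TF-IDF and compares them with cosine similarity using scikit-learn; the documents are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock markets fell sharply today",
]

X = TfidfVectorizer().fit_transform(docs)  # one sparse vector per document

# The two cat documents share terms, so they score higher than the finance one
print(cosine_similarity(X[0], X[1]))
print(cosine_similarity(X[0], X[2]))
```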
Week 8: Midterm Exam
Week 9: Text Classification I
In the ninth week, we will cover the basics of text classification, which is the process of assigning predefined categories to text. We will cover supervised text classification, which uses labeled examples to learn to classify text. We will also cover the evaluation of text classifiers and text classification with scikit-learn, a popular Python library for machine learning.
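A minimal supervised example with scikit-learn, on a made-up toy dataset (course assignments will use larger corpora):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled corpus: 1 = sports, 0 = politics (illustrative only)
texts = [
    "the team won the game last night",
    "the striker scored a late goal",
    "parliament passed the new budget bill",
    "the senator proposed new legislation",
]
labels = [1, 1, 0, 0]

# Bag-of-words features + naive Bayes, chained in one pipeline
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["the coach praised the goal"]))  # expected: [1]
```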
Week 10: Text Classification II
In the tenth week, we will cover unsupervised and semi-supervised text classification. We will also cover active learning, a machine learning technique that selects the most informative samples to label for training.
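A minimal uncertainty-sampling sketch, assuming a probabilistic scikit-learn classifier and a small pool of unlabeled texts (all data here is made up):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled = ["great movie", "awful film", "loved it", "hated it"]
labels = [1, 0, 1, 0]
pool = ["what a fantastic story", "terrible acting", "it was a film"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(labeled, labels)

# Uncertainty sampling: query the pool example the model is least sure about
proba = clf.predict_proba(pool)
uncertainty = 1.0 - proba.max(axis=1)
query = int(np.argmax(uncertainty))
print("ask an annotator to label:", pool[query])
```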
Week 11: Topic Modeling I
In the eleventh week, we will cover the basics of topic modeling, which is the process of discovering hidden thematic structure in text collections. We will cover Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm. We will also cover the evaluation of topic models and topic modeling with scikit-learn.
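A small LDA example with scikit-learn; the corpus and the number of topics are toy choices:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat chased the mouse around the house",
    "dogs and cats make wonderful pets",
    "the stock market rallied as shares rose",
    "investors bought shares in the tech company",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top words for each discovered topic
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {top}")
```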
Week 12: Topic Modeling II
In the twelfth week, we will cover additional topic modeling techniques, including Probabilistic Latent Semantic Analysis (PLSA), Latent Semantic Indexing (LSI), and Non-negative Matrix Factorization (NMF).
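NMF (and LSI, via truncated SVD) are also available in scikit-learn; here is a minimal NMF sketch on a toy corpus, reusing the TF-IDF representation from Week 7:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat chased the mouse",
    "cats and dogs are pets",
    "the market rallied and shares rose",
    "investors bought shares today",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Factor the document-term matrix into document-topic and topic-term parts
nmf = NMF(n_components=2, random_state=0)
doc_topics = nmf.fit_transform(X)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(nmf.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")
```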
Week 13: Word Embeddings I
In the thirteenth week, we will cover the basics of word embeddings, which are dense vector representations of words that capture their semantic meaning. We will cover Word2vec and GloVe, two popular algorithms for learning word embeddings.
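A tiny Word2vec example using Gensim (the library choice is an assumption, as the syllabus does not name one, and a corpus this small yields noisy neighbors):

```python
from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
    ["stocks", "and", "bonds", "are", "investments"],
]

# Skip-gram model with small vectors; real training needs millions of tokens
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

print(model.wv["cat"][:5])           # first dimensions of the learned vector
print(model.wv.most_similar("cat"))  # nearest neighbors (noisy on a toy corpus)
```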
Week 14: Word Embeddings II
In the fourteenth week, we will cover more advanced embedding techniques, including FastText, which incorporates subword information, and the contextual embedding models ELMo, BERT, and XLNet.
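For a taste of contextual embeddings, the sketch below extracts BERT token vectors with the Hugging Face transformers library (the library choice is an assumption; the course may use different tooling):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Unlike Word2vec, the vector for "bank" depends on its sentence context
inputs = tokenizer("I deposited cash at the bank.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per wordpiece token
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])
```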
Week 15: Final Presentations
In the final week, students will present their term projects in the form of a poster, demonstrating an NLP system that performs a task on a real-world dataset.