In this course, you will learn techniques for extracting useful information from text and processing it into a format suitable for machine learning models. More specifically, you will learn about POS tagging, named entity recognition, readability scores, and the n-gram and tf-idf models, and how to implement them using scikit-learn and spaCy. You will also learn to compute how similar two documents are to each other. Along the way, you will predict the sentiment of movie reviews and build movie and TED Talk recommenders. After completing the course, you will be able to engineer informative features from text and apply them to challenging data science problems.
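
To give a feel for the tf-idf and document-similarity ideas mentioned above, here is a minimal from-scratch sketch using only the Python standard library. The helper names (`tfidf_vectors`, `cosine_similarity`), the whitespace tokenizer, the `tf * log(N / df)` weighting, and the toy documents are all assumptions made for illustration; they are not taken from the course, and scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant of this weighting.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weighted as tf * log(N / df).

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy corpus (made up for this sketch).
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # documents sharing several terms
sim_02 = cosine_similarity(vecs[0], vecs[2])  # documents sharing no terms
print(sim_01 > sim_02)
```

The first two documents share terms and so score higher than the unrelated pair, which shares none and scores zero. In practice, scikit-learn's `TfidfVectorizer` together with `sklearn.metrics.pairwise.cosine_similarity` covers the same ground with better tokenization and sparse-matrix support.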

Basic features and readability scores

Introduction to NLP feature engineering

Basic feature extraction

Readability tests

Text preprocessing, POS tagging and NER

Tokenization and lemmatization

Text cleaning

Part-of-speech tagging

Named entity recognition

N-gram models

Building a bag of words model

Building a BoW Naive Bayes classifier

Building n-gram models

TF-IDF and similarity scores

Building tf-idf document vectors

Cosine similarity

Building a plot-line-based recommender

Beyond n-grams: word embeddings