Every day you read about breakthroughs in how the newest applications of machine learning are changing the world. This reporting often glosses over the large amount of data munging and feature engineering that must be done before any of these models can be used. In this course, you will learn how to do just that. You will work with the Stack Overflow Developer Survey and historic US presidential inauguration addresses to understand how best to preprocess and engineer features from categorical, continuous, and unstructured data. This course will give you hands-on experience preparing data for your own machine learning models.


Creating Features

Why generate features?

Dealing with categorical features

Numeric variables


Dealing with Messy Data

Why do missing values exist?

Dealing with missing values (I)

Dealing with missing values (II)

Dealing with other data issues


Conforming to Statistical Assumptions

Data distributions

Scaling and transformations

Removing outliers

Scaling and transforming new data


Dealing with Text Data

Encoding text

Word counts

Term frequency-inverse document frequency

N-grams