Introduction

IMDB has a dataset of 900,000 dirty movie plot descriptions tagged with genres. The task is to train a classifier that predicts whether a movie is a "Comedy" or a "Horror" movie from its plot description.
Example of dirty data (the plot reads like a horror film, but the record is tagged "Comedy"):
|Bloodrage (1979)||A psychotic killer stalks the streets of New York City, preying on beautiful girls who live alone.....||Unrated||Comedy|
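To make the task concrete, here is a minimal sketch of how a plot description could be turned into classifier input. It assumes a simple bag-of-words representation and a 0/1 label encoding; the helper names (`tokenize`, `build_vocab`, `featurize`, `LABELS`) are hypothetical and not taken from the project. Note that the dirty genre tag is passed straight through as the training label, which is exactly how a mislabeled record ends up misleading the model.

```python
import re
from collections import Counter

# Hypothetical label encoding for the two genres of interest.
LABELS = {"Comedy": 0, "Horror": 1}

def tokenize(text):
    # Lowercase the text and keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def build_vocab(plots, min_count=1):
    # Map every sufficiently frequent word to a feature index.
    counts = Counter(tok for plot in plots for tok in tokenize(plot))
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}

def featurize(plot, vocab):
    # Bag-of-words vector: one count per vocabulary word.
    vec = [0] * len(vocab)
    for tok in tokenize(plot):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

# The record from the example above, with the genre tag as it appears
# in the dirty data.
plot = ("A psychotic killer stalks the streets of New York City, "
        "preying on beautiful girls who live alone.")
tag = "Comedy"

vocab = build_vocab([plot])
features = featurize(plot, vocab)
label = LABELS[tag]  # the dirty tag becomes the training label unchanged
print(len(vocab), label)
```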
Getting Your Money's Worth

Data cleaning is expensive and time-consuming. Suppose you have a budget to clean k records. What is the best way to train the model?
- Combining dirty and clean data
- Using only the clean data
- ActiveClean: Iterative Update
[Figure: test error (10% to 50%) plotted over the range 1000 to 10000.]
Improvement over combining:
Improvement over only clean data: