Why Is Data Preprocessing Important?

When I built a machine learning model for the first time, its accuracy was approximately 0.2–0.3. That was very low, and at the time I was relying only on the artificial intelligence course I had taken the previous semester. I had picked up some machine learning keywords but had never had the chance to immerse myself in the theory. So I just built the model from some tutorials I found on the internet and trained it on some datasets.

Only later did I learn that I had to do something with the data too. At that time, my tutor suggested cleaning the data before training the model on it. He recommended the Python NLTK library for removing stop words. Stop words are common words that occur throughout the data, and removing them can reduce processing time and storage space. After using this library, the model's accuracy increased to roughly 0.4–0.6.

According to a machine learning course on Coursera, the guiding principle is "garbage in, garbage out": we do not want the output to be garbage. There are several causes of messy data:

  • duplicated/unnecessary data
  • inconsistent data
  • missing data
  • outliers
  • data sourcing issues, such as obtained from multiple systems, different database types, etc.
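
To make these causes concrete, here is a small sketch using pandas on a toy dataset I made up for illustration; each line addresses one of the issues above:

```python
import pandas as pd

# A hypothetical raw dataset with duplicates, inconsistent casing,
# a missing value, and an outlier.
df = pd.DataFrame({
    "age":  [25, 25, None, 31, 200],
    "city": ["NY", "NY", "LA", "la", "SF"],
})

df = df.drop_duplicates()                          # duplicated data
df["city"] = df["city"].str.upper()                # inconsistent data
df["age"] = df["age"].fillna(df["age"].median())   # missing data
df = df[df["age"] < 120]                           # a crude outlier filter
print(df)
```

Real-world cleaning is rarely this mechanical (outlier thresholds, for instance, depend on the domain), but the pattern is the same.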

After doing some more research and taking an online course, I learned that data preprocessing includes some other steps:

  1. data cleaning
  2. data transformation
  3. data reduction
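
Steps 2 and 3 can also be sketched briefly. Assuming a numeric feature matrix (my own toy example), transformation might mean standardizing each feature, and reduction might mean projecting onto the first principal component:

```python
import numpy as np

# Hypothetical numeric features after cleaning.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# 2. Data transformation: standardize to zero mean, unit variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 3. Data reduction: a simple PCA via SVD, keeping the top component.
U, s, Vt = np.linalg.svd(X_std, full_matrices=False)
X_reduced = X_std @ Vt[:1].T   # project onto the first principal direction

print(X_reduced.shape)
```

Libraries such as scikit-learn wrap both steps (`StandardScaler`, `PCA`), but the idea is no more than this.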

What I did before was just one of several steps within data cleaning. What I learned here is that building a machine learning model means considering not only the model but also the data we process. There may be several other aspects to consider, but let's wrap it up for now.
