For any machine learning or data mining purpose, the first job is to pre-process the data so that we can us the data for the original purpose. In lots of cases we have the raw data in *csv format, which we need to import and preprocess using the language we are using for the particular job. Python is one of the most popular language for this purpose. For this article I will use Python and one very popular library named pandas to show how we can use pandas for read, import and preprocess a *.csv file.
We have a *csv file which we want to pre-process. This is a file with a large number of columns, so it is not a good idea to display it here. I am showing a part of it.
The first thing we need to do is to import pandas and load the csv file.
import pandas as pd full_data = pd.read_csv('preprocessed_datasets.csv')
Now we can check the number of columns in the file like this.
full_data.columns # The output is like this: Index(['id', 'PlayerName', 'DraftAge', 'Country', 'country_group', 'Height', 'Weight', 'Position', 'DraftYear', 'Overall', 'CSS_rank', 'rs_GP', 'rs_G', 'rs_A', 'rs_P', 'rs_PIM', 'rs_PlusMinus', 'po_GP', 'po_G', 'po_A', 'po_P', 'po_PIM', 'sum_7yr_GP', 'sum_7yr_TOI', 'GP_greater_than_0'], dtype='object')
The first process of the pre-processing is we want to get rid of a few columns, which we consider unnecessary. This decisions are taken by domain expert and is not in the scope of this particular article. Lets, say, we want to do the following tasks:
Columns to drop:
For a lots of machine learning applications we need train data and test data. Say, for this dataset we want to use the rows from DraftYear=1998, 1999, 2000 as our training set data and DraftYear=2001 as our test set data.
So we can extract training data and test data using the DraftYear column name from the previously loaded full_data.
train_data = full_data[(full_data.DraftYear == 1998) | (full_data.DraftYear == 1999) | (full_data.DraftYear == 2000)] test_data = full_data[full_data.DraftYear == 2001]
Now we can drop the unnecessary columns from our train_data and test_data.
train_data = train_data \ .drop('id', 1) \ .drop('PlayerName', 1) \ .drop('rs_PlusMinus', 1) \ .drop('DraftYear', 1) test_data = test_data \ .drop('id', 1) \ .drop('PlayerName', 1) \ .drop('rs_PlusMinus', 1) \ .drop('DraftYear', 1)