*.csv File Preprocessing Using Pandas

For any machine learning or data mining purpose, the first job is to pre-process the data so that we can us the data for the original purpose. In lots of cases we have the raw data in *csv format, which we need to import and preprocess using the language we are using for the particular job. Python is one of the most popular language for this purpose. For this article I will use Python and one very popular library named pandas to show how we can use pandas for read, import and preprocess a *.csv file.

We have a *csv file which we want to pre-process. This is a file with a large number of columns, so it is not a good idea to display it here. I am showing a part of it.

The first thing we need to do is to import pandas and load the csv file.

import pandas as pd
full_data = pd.read_csv('preprocessed_datasets.csv')

Now we can check the number of columns in the file like this.

# The output is like this:
Index(['id', 'PlayerName', 'DraftAge', 'Country', 'country_group', 'Height',
       'Weight', 'Position', 'DraftYear', 'Overall', 'CSS_rank', 'rs_GP',
       'rs_G', 'rs_A', 'rs_P', 'rs_PIM', 'rs_PlusMinus', 'po_GP', 'po_G',
       'po_A', 'po_P', 'po_PIM', 'sum_7yr_GP', 'sum_7yr_TOI',

The first process of the pre-processing is we want to get rid of a few columns, which we consider unnecessary. This decisions are taken by domain expert and is not in the scope of this particular article. Lets, say, we want to do the following tasks:

Columns to drop:

  • id, Playername. Usually they don’t provide any information and works as system noise.
  • Sum_7yr_GP, sum_7yr_TOI.
  • DraftYear. We split the data into training and test set. So you cannot generalize from training to test data and should not use this feature in learning.
  • country. Because there are too many countries, we grouped the European countries together, as shown in the readme and in the country_group column.
  • For a lots of machine learning applications we need train data and test data. Say, for this dataset we want to use the rows from DraftYear=1998, 1999, 2000 as our training set data and DraftYear=2001 as our test set data.

    So we can extract training data and test data using the DraftYear column name from the previously loaded full_data.

    train_data = full_data[(full_data.DraftYear == 1998) | (full_data.DraftYear == 1999) | (full_data.DraftYear == 2000)]
    test_data = full_data[full_data.DraftYear == 2001]

    Now we can drop the unnecessary columns from our train_data and test_data.

    train_data = train_data    \
                    .drop('id', 1)    \
                    .drop('PlayerName', 1)    \
                    .drop('rs_PlusMinus', 1)    \
                    .drop('DraftYear', 1)
    test_data = test_data    \
                    .drop('id', 1)    \
                    .drop('PlayerName', 1)    \
                    .drop('rs_PlusMinus', 1)    \
                    .drop('DraftYear', 1)

    Leave a Reply

    Fill in your details below or click an icon to log in:

    WordPress.com Logo

    You are commenting using your WordPress.com account. Log Out /  Change )

    Google photo

    You are commenting using your Google account. Log Out /  Change )

    Twitter picture

    You are commenting using your Twitter account. Log Out /  Change )

    Facebook photo

    You are commenting using your Facebook account. Log Out /  Change )

    Connecting to %s