Find Similarity Using Jaccard Similarity

We read the file using Pandas.

import pandas as pd
import numpy as np
rawData = pd.read_csv('data-Assignment2.txt', sep=",", header=None)

We need to find the signature matrix. For that we need to make a permutation of the rows of the whole matrix. We can do that using pandas like this.

permuteData = rawData.sample(frac=1)

Just as a note we can use frac less than one if we want to do a random subsample. We can also shuffle in-place and use this.

df = df.sample(frac=1).reset_index(drop=True)  # in place shuffle, drop index column

We can test if it works by using a random matrix created by Pandas.

# create a random matrix with 0 and 1, like our example matrix
df = pd.DataFrame(np.random.randint(0,2,size=(100, 4)), columns=list('ABCD'))
# now we can do a shuffle like this
df = df.sample(frac=1)

The before and after is shown by the following figure:

a = []
b = []
for k in range(3):
    for j in range(4):
    a = []
# OUTPUT: [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]

*.csv File Preprocessing Using Pandas

For any machine learning or data mining purpose, the first job is to pre-process the data so that we can us the data for the original purpose. In lots of cases we have the raw data in *csv format, which we need to import and preprocess using the language we are using for the particular job. Python is one of the most popular language for this purpose. For this article I will use Python and one very popular library named pandas to show how we can use pandas for read, import and preprocess a *.csv file.

We have a *csv file which we want to pre-process. This is a file with a large number of columns, so it is not a good idea to display it here. I am showing a part of it.

Continue reading “*.csv File Preprocessing Using Pandas”