Find Similarity Using Jaccard Similarity

We read the file using Pandas.

import pandas as pd
import numpy as np
rawData = pd.read_csv('data-Assignment2.txt', sep=",", header=None)

To build the signature matrix, we first need a random permutation of the rows of the whole matrix. pandas can produce one with sample, like this.

permuteData = rawData.sample(frac=1)
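To connect the permutation to the signature matrix, here is a minimal sketch. It assumes the usual min-hashing convention, where the signature value of a column is the position of its first 1 after the rows are shuffled. The names minhash_row, jaccard, and toy are hypothetical and only for illustration; they are not from the assignment data.

```python
import numpy as np
import pandas as pd

def jaccard(a, b) -> float:
    """Exact Jaccard similarity of two 0/1 columns: |A and B| / |A or B|."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def minhash_row(data: pd.DataFrame, seed: int) -> np.ndarray:
    """One signature row: per column, the position of the first 1 after shuffling."""
    permuted = data.sample(frac=1, random_state=seed).to_numpy()
    first_one = permuted.argmax(axis=0)   # first index of the maximum, i.e. the first 1
    has_one = permuted.any(axis=0)        # columns that contain at least one 1
    # Assumption: an all-zero column gets the sentinel value len(permuted).
    return np.where(has_one, first_one, len(permuted))

# Hypothetical toy matrix: rows are elements, columns are sets.
toy = pd.DataFrame({"A": [1, 0, 0, 1], "B": [0, 1, 0, 1], "C": [0, 0, 0, 0]})

# One signature row per permutation (here 5 permutations, one per seed).
signature = np.array([minhash_row(toy, seed) for seed in range(5)])

exact = jaccard(toy["A"], toy["B"])  # 1 shared row out of 3 rows with any 1
estimate = float((signature[:, 0] == signature[:, 1]).mean())
```

The fraction of signature rows on which two columns agree estimates their Jaccard similarity. With only five permutations the estimate is crude, and it should get closer to the exact value as more permutations are used.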

As a note, a frac less than one gives a random subsample instead. The shuffled frame also keeps its original index labels; to renumber the rows from 0, reassign the result and reset the index.

df = df.sample(frac=1).reset_index(drop=True)  # shuffle, then discard the old index labels
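A small sketch of the two variations, using a hypothetical five-row frame and fixed random_state values (both are assumptions made so the run is repeatable):

```python
import pandas as pd

# Hypothetical 0/1 frame, not the assignment data.
df = pd.DataFrame({"A": [1, 0, 1, 0, 1], "B": [0, 0, 1, 1, 1]})

# frac < 1 draws a random subsample without replacement: 60% of 5 rows is 3 rows.
subsample = df.sample(frac=0.6, random_state=0)

# frac=1 returns every row in random order, still carrying the old index labels.
shuffled = df.sample(frac=1, random_state=0)

# reset_index(drop=True) renumbers the rows 0..n-1 and does not keep the old
# labels as an extra column.
renumbered = shuffled.reset_index(drop=True)
```

Keeping the old labels (as in shuffled) is useful when you need to map a permuted row back to its original position; renumbered is more convenient when only the new order matters.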

We can test if it works by using a random matrix created by Pandas.

# create a random matrix with 0 and 1, like our example matrix
df = pd.DataFrame(np.random.randint(0,2,size=(100, 4)), columns=list('ABCD'))
# now we can do a shuffle like this
df = df.sample(frac=1)
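One way to check the shuffle programmatically, rather than by eye, is to confirm that it only reorders rows. This is a sketch with assumed seeds; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the check is repeatable
df = pd.DataFrame(rng.integers(0, 2, size=(100, 4)), columns=list("ABCD"))
shuffled = df.sample(frac=1, random_state=1)

# Every original index label appears exactly once after the shuffle.
same_labels = sorted(shuffled.index) == list(df.index)

# Sorting by index undoes the shuffle and recovers the original matrix.
restored = np.array_equal(shuffled.sort_index().to_numpy(), df.to_numpy())

# Column totals are unchanged, because no row was dropped or duplicated.
same_sums = bool((shuffled.sum() == df.sum()).all())
```

If all three flags are True, the result is a true permutation of the rows.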

The before and after are shown in the following figure:

a = []
for k in range(3):
    b = []  # start a new inner list for each k
    for j in range(4):
        b.append(j)
    a.append(b)
print(a)
# OUTPUT: [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
