Using basemap for plotting geographical coordinates

shanto@shanto:~$ conda install basemap

This was giving an error, so I looked for a solution and found that installing from the conda-forge channel works:

shanto@shanto:~$ conda install -c conda-forge basemap
Solving environment: done

## Package Plan ##

  environment location: /home/shanto/anaconda3

  added / updated specs: 
    - basemap


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2017.11.5          |           py36_0         195 KB  conda-forge
    openssl-1.0.2n             |                0         3.5 MB  conda-forge
    ca-certificates-2017.11.5  |                0         145 KB  conda-forge
    basemap-1.1.0              |           py36_3        15.4 MB  conda-forge
    pyproj-1.9.5.1             |           py36_0         3.4 MB  conda-forge
    geos-3.6.2                 |                1        19.9 MB  conda-forge
    pyshp-1.2.12               |             py_0          22 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        42.5 MB

The following NEW packages will be INSTALLED:

    pyproj:          1.9.5.1-py36_0             conda-forge
    pyshp:           1.2.12-py_0                conda-forge

The following packages will be UPDATED:

    basemap:         1.0.7-np113py36_0                      --> 1.1.0-py36_3     conda-forge
    ca-certificates: 2017.08.26-h1d4fec5_0                  --> 2017.11.5-0      conda-forge
    certifi:         2017.7.27.1-py36h8b7b77e_0             --> 2017.11.5-py36_0 conda-forge
    geos:            3.5.0-0                                --> 3.6.2-1          conda-forge
    openssl:         1.0.2l-h9d1a558_3                      --> 1.0.2n-0         conda-forge

Proceed ([y]/n)? y


Downloading and Extracting Packages
certifi 2017.11.5: ##################################################### | 100% 
openssl 1.0.2n: ######################################################## | 100% 
ca-certificates 2017.11.5: ############################################# | 100% 
basemap 1.1.0: ######################################################### | 100% 
pyproj 1.9.5.1: ######################################################## | 100% 
geos 3.6.2: ############################################################ | 100% 
pyshp 1.2.12: ########################################################## | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
shanto@shanto:~$ 
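To check that the install works, here is a minimal plotting sketch. The coordinates are just an example (roughly Dhaka); basemap's defaults draw the whole globe in a cylindrical projection.

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

m = Basemap()            # default: cylindrical projection, whole globe
m.drawcoastlines()
x, y = m(90.4, 23.8)     # convert example longitude/latitude to map coordinates
m.plot(x, y, 'ro', markersize=6)
plt.show()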

Copy File from Cloud HDFS to Local Computer

While I work with big data technologies like Spark and large datasets, I like to work on the university cloud, where everything is faster. However, for various reasons I sometimes have to move to my local computer (my laptop). This time the reason is that I need to use a package of Python matplotlib, named basemap, which is not installed on the cloud. However, the data I need to work on is on the cloud HDFS. Therefore, I need to copy the data from HDFS to my local laptop. This can be done in two simple steps (example commands below):

Step 1: copy data from HDFS to remote local (not HDFS)
Step 2: copy data from remote local to local (my laptop)
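As a minimal sketch of the two steps (the host name and file paths here are hypothetical):

# Step 1: on the cluster, copy the file out of HDFS onto the node's local disk
hdfs dfs -get /user/shanto/mydata.csv ~/mydata.csv

# Step 2: from my laptop, pull the file over ssh
scp shanto@cluster.example.edu:~/mydata.csv .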
Continue reading “Copy File from Cloud HDFS to Local Computer”

Data Science Interview Questions

In this post I am going to make a compilation of interview questions for the data science role. Many of them are questions I faced during my own interviews; I have also gathered questions from different websites that I found interesting. So, let's get started.

What do you know about bias-variance/bias-variance tradeoff?

In statistics and machine learning, the bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set [Wikipedia]:

  • The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). Bias refers to the simplifying assumptions made by a model to make the target function easier to learn. Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines. Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression [2].
  • The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting). Variance is the amount that the estimate of the target function would change if different training data were used. Low variance suggests small changes to the estimate of the target function with changes to the training dataset; high variance suggests large changes. Generally, nonparametric machine learning algorithms that have a lot of flexibility have high variance. For example, decision trees have high variance, which is even higher if the trees are not pruned before use. Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression. Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines [2]. A small simulation illustrating the tradeoff is sketched below.
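To make the tradeoff concrete, here is a small illustrative simulation (my own sketch, not from the quoted reference): a high-bias linear regression and a high-variance unpruned decision tree are refit on many noisy training sets drawn from y = sin(2πx), and the bias and variance of their predictions at a single test point are compared.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def predictions_at(model, x0=0.25, n_trials=200, n=50):
    # Refit the model on many fresh training sets and record its
    # prediction at the single test point x0.
    preds = []
    for _ in range(n_trials):
        X = rng.uniform(0, 1, size=(n, 1))
        y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=n)
        preds.append(model.fit(X, y).predict([[x0]])[0])
    return np.array(preds)

true_value = np.sin(2 * np.pi * 0.25)  # the noise-free value, 1.0
for name, model in [("linear regression", LinearRegression()),
                    ("decision tree", DecisionTreeRegressor())]:
    p = predictions_at(model)
    print(f"{name:17}  bias: {p.mean() - true_value:+.2f}  variance: {p.var():.3f}")

The linear model shows a large, stable bias (a straight line cannot follow the sine), while the unpruned tree is nearly unbiased but its prediction swings with every new training set.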

Continue reading “Data Science Interview Questions”

Printing Jupyter Notebook to other File Format

As a data scientist, I frequently use Jupyter notebooks. When writing a report, one might need to print out (on paper) the full notebook. There is a print preview option in the current version of Jupyter notebook, but no print option.


I tried the CTRL + P command on the print preview page, but the output was horrible (like when we try to print a webpage). I googled and found a better way of doing it.

I am running Jupyter notebook on Ubuntu 16.04. The steps are very simple:

(1) Open terminal
(2) Change directory (where the notebook is located)
(3) Use the command: jupyter nbconvert --to pdf A1.ipynb (A1.ipynb is my notebook)

shanto@shanto:~$ cd ~/Desktop/BigData/706/Assignments/
shanto@shanto:~/Desktop/BigData/706/Assignments$ ls
A1.ipynb
shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to pdf A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to pdf
[NbConvertApp] Writing 25564 bytes to notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', 'notebook.tex']
[NbConvertApp] Running bibtex 1 time: ['bibtex', 'notebook']
[NbConvertApp] WARNING | bibtex had problems, most likely because there were no citations
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 23494 bytes to A1.pdf
shanto@shanto:~/Desktop/BigData/706/Assignments$ 

The generated *.pdf file is reasonably neat, with good formatting.

If we change the --to pdf part to --to whateverFormat, the same command can be used to convert the notebook to other formats. Conversion to a few other formats is shown below.

shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to script A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to script
[NbConvertApp] Writing 2077 bytes to A1.py
shanto@shanto:~/Desktop/BigData/706/Assignments$ # convert to latex
shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to latex A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to latex
[NbConvertApp] Writing 25564 bytes to A1.tex
shanto@shanto:~/Desktop/BigData/706/Assignments$ 

Running Spark on Local Machine

Apache Spark is a fast and general-purpose cluster computing system. To get the maximum potential out of it, Spark should run on a distributed computing system. However, one might not have access to a distributed system all the time. Especially for learning purposes, one might want to run Spark on his/her own computer. This is actually a very easy task, and there are a handful of ways to do it. I will show what I have done to run Spark on my laptop.
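As a minimal sketch of where this ends up, assuming PySpark (and a Java runtime) has been installed, e.g. with pip install pyspark: a local session uses the local[*] master instead of a cluster.

from pyspark.sql import SparkSession

# master("local[*]") runs Spark inside this single machine, using all cores
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-test")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
spark.stop()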
Continue reading “Running Spark on Local Machine”

Find Similarity Using Jaccard Similarity

We read the file using Pandas.

import pandas as pd
import numpy as np
rawData = pd.read_csv('data-Assignment2.txt', sep=",", header=None)

We need to find the signature matrix. For that we need to permute the rows of the whole matrix. We can do that with pandas like this.

permuteData = rawData.sample(frac=1)

Just as a note, we can use frac less than one if we want a random subsample. We can also shuffle and then reset the index, like this.

df = df.sample(frac=1).reset_index(drop=True)  # shuffle, then discard the old index

We can test if it works by using a random matrix created by Pandas.

# create a random matrix with 0 and 1, like our example matrix
df = pd.DataFrame(np.random.randint(0,2,size=(100, 4)), columns=list('ABCD'))
# now we can do a shuffle like this
df = df.sample(frac=1)
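Connecting this back to the signature matrix: with a permuted 0/1 matrix, one row of the MinHash signature is, for each column, the position of the first 1 in the permuted order. A small sketch of that idea (my own illustration, assuming every column contains at least one 1):

# Shuffle the rows, then reset the index so positions run 0..n-1
permuted = df.sample(frac=1).reset_index(drop=True)
# idxmax returns, per column, the first index where the maximum (a 1) occurs,
# i.e. the position of the first 1 -- one row of the signature matrix
signature_row = permuted.idxmax()
print(signature_row)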

Printing the DataFrame before and after confirms that the same rows now appear in a new random order.

As a side note on Python itself, here is how to build a list of lists by appending and resetting an inner list, a pattern that is handy when building a matrix row by row:

a = []  # inner list, rebuilt for every outer iteration
b = []  # outer list of lists
for k in range(3):
    for j in range(4):
        a.append(j)
    b.append(a)
    a = []  # rebind a to a fresh list; b still holds the old one
print(b)
# OUTPUT: [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
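Finally, for reference, the Jaccard similarity that the signatures approximate can be computed directly on two binary columns. A minimal sketch, using the random DataFrame from above:

def jaccard(u, v):
    # |intersection| / |union| over the rows where either column has a 1
    both = ((u == 1) & (v == 1)).sum()
    either = ((u == 1) | (v == 1)).sum()
    return both / either if either else 0.0

print(jaccard(df['A'], df['B']))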

*.csv File Preprocessing Using Pandas

For any machine learning or data mining purpose, the first job is to pre-process the data so that we can use it for the original purpose. In lots of cases we have the raw data in *.csv format, which we need to import and preprocess using the language we are using for the particular job. Python is one of the most popular languages for this purpose. In this article I will use Python and one very popular library, pandas, to show how we can read, import, and preprocess a *.csv file.

We have a *.csv file which we want to pre-process. It is a file with a large number of columns, so it is not a good idea to display it all here; I am showing only a part of it.
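As a minimal sketch of the first step (the file name here is hypothetical):

import pandas as pd

# Read the raw file and take a first look
df = pd.read_csv('rawdata.csv')
print(df.shape)   # number of rows and columns
print(df.head())  # first few rows
print(df.dtypes)  # the column types pandas inferred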

Continue reading “*.csv File Preprocessing Using Pandas”

Understanding MapReduce in My Way : Starting with Word Count

The Word Count problem is known as the ‘Hello World’ of MapReduce. In this article I will explain how I understand the different bits of MapReduce, in my own way. The code provided in this article is trivial and is available in lots of places, including the official MapReduce documentation. My concern is to focus on how it really works.
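As a toy sketch of the idea in plain Python (not actual Hadoop code): the mapper emits (word, 1) pairs, the shuffle groups them by key, and the reducer sums each group.

from collections import defaultdict

def mapper(line):
    # map: emit (word, 1) for every word in the line
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # reduce: sum all the 1s emitted for this word
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog"]
grouped = defaultdict(list)
for line in lines:
    for word, one in mapper(line):
        grouped[word].append(one)  # shuffle: group values by key
print(dict(reducer(w, c) for w, c in grouped.items()))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}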
Continue reading “Understanding MapReduce in My Way : Starting with Word Count”