Using Basemap for Plotting Geographical Coordinates

Basemap is a great tool for creating maps with Python in a simple way. It is a matplotlib extension, so it inherits all of matplotlib's features for creating data visualizations, and it adds geographical projections and some built-in datasets so that coastlines, country borders, and so on can be plotted directly from the library [1].
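
As a minimal sketch of what plotting coordinates with Basemap can look like (the sample coordinates, projection, and map bounds below are placeholders I chose, not values from the original post):

# plot a few (lat, lon) points on a world map with Basemap
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

lats = [40.71, 51.51, 35.68]    # sample points: New York, London, Tokyo
lons = [-74.01, -0.13, 139.69]

m = Basemap(projection='mill', llcrnrlat=-60, urcrnrlat=80,
            llcrnrlon=-180, urcrnrlon=180, resolution='c')
m.drawcoastlines()
m.drawcountries()

x, y = m(lons, lats)            # project lon/lat into map coordinates
m.scatter(x, y, color='red', marker='o', zorder=5)
plt.show()
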
Continue reading “Using Basemap for Plotting Geographical Coordinates”

Copy File from Cloud HDFS to Local Computer

When I work with big data technologies like Spark on a large dataset, I like to work on the university cloud, where everything is faster. However, for various reasons I sometimes have to move to my local computer (my laptop). This time the reason is that I need a matplotlib toolkit named Basemap, which is not installed on the cloud, while the data I need to work on sits on the cloud HDFS. Therefore, I need to copy the data from HDFS to my local laptop. This can be done in two simple steps, sketched below:

Step 1: copy the data from HDFS to the remote machine's local filesystem (not HDFS)
Step 2: copy the data from the remote machine to my local computer (my laptop)
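
A rough sketch of the two commands (the usernames, hostname, and paths here are placeholders, not the actual cluster details):

shanto@cloud:~$ # step 1: HDFS to the remote machine's local disk
shanto@cloud:~$ hdfs dfs -get /user/shanto/mydata.csv ~/mydata.csv
shanto@shanto:~$ # step 2: remote machine to my laptop, over ssh
shanto@shanto:~$ scp shanto@cloud.example.edu:~/mydata.csv ~/Desktop/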
Continue reading “Copy File from Cloud HDFS to Local Computer”

Data Science Interview Questions

In this post I am going to compile interview questions for data science roles. Many of them are questions I faced during my own interviews; I have also gathered questions from different websites that I found interesting. So, let's get started.

What do you know about bias, variance, and the bias–variance tradeoff?

In statistics and machine learning, the bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set [Wikipedia]:

  • The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). Bias reflects the simplifying assumptions made by a model to make the target function easier to learn. Examples of low-bias machine learning algorithms include Decision Trees, k-Nearest Neighbors, and Support Vector Machines; examples of high-bias algorithms include Linear Regression, Linear Discriminant Analysis, and Logistic Regression [2].
  • The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting). Variance is the amount by which the estimate of the target function would change if different training data were used: low variance suggests small changes to the estimate with changes to the training dataset, while high variance suggests large changes. Generally, nonparametric machine learning algorithms with a lot of flexibility have high variance. For example, decision trees have high variance, which is even higher if the trees are not pruned before use. Examples of low-variance machine learning algorithms include Linear Regression, Linear Discriminant Analysis, and Logistic Regression; examples of high-variance algorithms include Decision Trees, k-Nearest Neighbors, and Support Vector Machines [2].
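
For squared loss, these two errors combine in the standard textbook decomposition of a model's expected prediction error at a point x (writing f for the true function, f̂ for the learned model, and σ² for the noise in y):

E[(y − f̂(x))²] = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ²
               = Bias² + Variance + irreducible error

where the expectations are taken over different training sets; no amount of model tuning can remove the σ² term.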

Continue reading “Data Science Interview Questions”

Printing a Jupyter Notebook to Other File Formats

As a data scientist, I frequently use Jupyter notebooks. When writing a report, one might need to print out the full notebook on paper. There is a print preview option in the current version of Jupyter notebook, but no print option.


I tried using CTRL + P on the print preview page, but the output was horrible (as it usually is when printing a webpage). I googled and found a better way of doing it.

I am running Jupyter notebook on Ubuntu 16.04. The steps are very simple:

(1) Open a terminal
(2) Change to the directory where the notebook is located
(3) Run the command: jupyter nbconvert --to pdf A1.ipynb (A1.ipynb is my notebook)

shanto@shanto:~$ cd ~/Desktop/BigData/706/Assignments/
shanto@shanto:~/Desktop/BigData/706/Assignments$ ls
A1.ipynb
shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to pdf A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to pdf
[NbConvertApp] Writing 25564 bytes to notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', 'notebook.tex']
[NbConvertApp] Running bibtex 1 time: ['bibtex', 'notebook']
[NbConvertApp] WARNING | bibtex had problems, most likely because there were no citations
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 23494 bytes to A1.pdf
shanto@shanto:~/Desktop/BigData/706/Assignments$ 

The figure shows a snapshot of the generated *.pdf file. The file is reasonably neat, with good formatting.

If we change the --to pdf part to --to someOtherFormat, the same command can be used to convert the notebook to other formats. Conversion to a few other formats is shown below.

shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to script A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to script
[NbConvertApp] Writing 2077 bytes to A1.py
shanto@shanto:~/Desktop/BigData/706/Assignments$ # convert to latex
shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to latex A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to latex
[NbConvertApp] Writing 25564 bytes to A1.tex
shanto@shanto:~/Desktop/BigData/706/Assignments$ 
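
Besides pdf, script, and latex, nbconvert supports several other targets, including html, markdown, and slides, e.g.:

jupyter nbconvert --to html A1.ipynb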

Running Spark on Local Machine

Apache Spark is a fast, general-purpose cluster computing system. To get the most out of it, Spark should run on a distributed computing system. However, one might not always have access to a distributed system; especially for learning purposes, one might want to run Spark on his/her own computer. This is actually a very easy task, and there are a handful of ways to do it. Here I will show what I did to run Spark on my laptop.
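
The full post walks through the setup; as a minimal, hypothetical sketch of one common route (assuming pyspark is installed, e.g. via pip install pyspark, which is not necessarily the route the post describes), Spark can run in local mode using local threads instead of a cluster:

# start Spark in local mode and run a quick sanity check
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")     # use all local cores; no cluster required
         .appName("local-test")
         .getOrCreate())

print(spark.range(10).count())   # should print 10
spark.stop()
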
Continue reading “Running Spark on Local Machine”