In this article I will demonstrate a few examples of web scraping. According to the Wikipedia article, web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. There are lots of tutorials on web scraping. In this post I will demonstrate web scraping while solving a few problems. I will use Python 3 and a few libraries for this purpose. Let's get into the problem.
Continue reading “Web Scraping Using lxml”
General-purpose computing on graphics processing units (GPGPU, rarely GPGP) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).
Continue reading “GPGPU Programming with CUDA for Color Space Conversion”
One of my friends on Facebook, who happens to be a data scientist, shared some very exciting news. I do not click all the links that my friends share on Facebook, but this time I had to. The title was enough for any technology enthusiast to at least click for the details.
I was trying to set up my Jupyter notebook to work on some deep learning problems (image classification on the MNIST and ImageNet datasets) on my laptop (Ubuntu 16.04 LTS). Previously I had used a little bit of Keras (which runs on top of Tensorflow) on a small dataset, but not with Jupyter. For that purpose I installed Tensorflow and Keras independently and used them in a Python script. However, they were not working from my Jupyter notebook. I googled for a solution, but found nothing concrete. I tried to activate the tensorflow environment and run jupyter notebook from there, but in vain. I guess the reason is that I downloaded different packages at different times, which may have caused some compatibility issues. Therefore, I decided to create a BRAND NEW conda environment for my deep learning endeavor. This is how it goes:
Continue reading “Setting up Jupyter notebook with Tensorflow, Keras and Pytorch for Deep Learning”
Basemap is a great tool for creating maps using Python in a simple way. It is a matplotlib extension, so it has all of matplotlib's features for creating data visualizations, and it adds geographical projections and some datasets so that coastlines, countries, and so on can be plotted directly from the library.
Continue reading “Using baseplot for Ploting Geographical Coordinates”
When I work with big data technologies like Spark and a large dataset, I like to work on the university cloud, where everything is faster. However, for various reasons I sometimes have to move to my local computer (my laptop). This time the reason is that I need to use a matplotlib toolkit for Python, named Basemap, which is not installed on the cloud. However, the data I need to work on is in the cloud HDFS. Therefore, I need to copy the data from HDFS to my laptop. This can be done in two simple steps:
Step 1: copy data from HDFS to remote local (not HDFS)
Step 2: copy data from remote local to local (my laptop)
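The two steps above can be sketched as a pair of commands. This is only a sketch under assumptions: it presumes the standard `hdfs dfs -get` and `scp` tools are available, and the paths, user name, and host name below are hypothetical placeholders, not the ones used in the post. The helper only builds the commands so they can be inspected before being run.

```python
import shlex

def build_copy_commands(hdfs_path, remote_dir, remote_host, local_dir):
    """Build the two commands: HDFS -> remote disk, then remote disk -> laptop."""
    name = hdfs_path.rstrip("/").split("/")[-1]
    # Step 1 (run on the cloud machine): copy out of HDFS onto its ordinary filesystem.
    step1 = ["hdfs", "dfs", "-get", hdfs_path, remote_dir]
    # Step 2 (run on the laptop): pull the copy over SSH; -r also handles directories.
    step2 = ["scp", "-r", f"{remote_host}:{remote_dir}/{name}", local_dir]
    return step1, step2

# All four values are hypothetical placeholders.
step1, step2 = build_copy_commands(
    hdfs_path="/user/me/data/input.csv",
    remote_dir="/home/me/staging",
    remote_host="me@cluster.example.edu",
    local_dir=".",
)
print(shlex.join(step1))
print(shlex.join(step2))
```

Each printed line can be pasted into the matching terminal: the first on the cloud machine, the second on the laptop.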
Continue reading “Copy File from Cloud HDFS to Local Computer”
In this post I am going to compile interview questions for data science roles. A big part of them are questions that I faced during my own interviews. I have also gathered questions that I found interesting from different websites. So, let's get started.
What do you know about bias, variance, and the bias-variance tradeoff?
In statistics and machine learning, the bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set [Wikipedia]:
- The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). Bias comes from the simplifying assumptions a model makes so that the target function is easier to learn. Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines. Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
- The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting). Variance is the amount by which the estimate of the target function would change if different training data were used. Low variance suggests small changes to the estimate of the target function with changes to the training dataset. High variance suggests large changes to the estimate of the target function with changes to the training dataset. Generally, nonparametric machine learning algorithms that have a lot of flexibility have a high variance. For example, decision trees have a high variance, which is even higher if the trees are not pruned before use. Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression. Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.
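The two error sources can be estimated empirically. The following is a minimal sketch, not part of the original answer: it assumes a sine target function, Gaussian noise, and polynomial models of degree 1 (rigid) and degree 7 (flexible), all chosen only for illustration. It refits each model on many noisy training sets and measures how far the average prediction is from the truth (squared bias) and how much predictions scatter across training sets (variance).

```python
import numpy as np

def bias_variance(degree, n_trials=200, n_train=20, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.0, 1.0, 50)
    true_test = np.sin(2 * np.pi * x_test)  # assumed target function
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # Draw a fresh noisy training set from the same underlying function.
        y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, noise_sd, n_train)
        model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
        preds[t] = model(x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_test) ** 2))  # error of the average model
    variance = float(np.mean(preds.var(axis=0)))            # spread across training sets
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 7)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

In this setup the straight line shows the larger squared bias and the degree-7 polynomial the larger variance, which is the tradeoff described above.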
Continue reading “Data Science Interview Questions”
As a data scientist, I frequently use Jupyter notebook. When writing a report, one might need to print the full notebook on paper. There is a print preview option in the current version of Jupyter notebook, but no print option.
I tried the CTRL + P command on the print preview page, but the output was horrible (like when we try to print a webpage). I googled and found a better way of doing it.
I am running Jupyter notebook on Ubuntu 16.04. The steps are very simple:
(1) Open terminal
(2) Change directory (where the notebook is located)
(3) Use command: jupyter nbconvert --to pdf A1.ipynb (A1.ipynb is my notebook)
shanto@shanto:~$ cd ~/Desktop/BigData/706/Assignments/
shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to pdf A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to pdf
[NbConvertApp] Writing 25564 bytes to notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', 'notebook.tex']
[NbConvertApp] Running bibtex 1 time: ['bibtex', 'notebook']
[NbConvertApp] WARNING | bibtex had problems, most likely because there were no citations
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 23494 bytes to A1.pdf
The figure shows a snapshot of the generated *.pdf file. The file is reasonably neat, with good formatting.
If we change the --to pdf part to --to whateverFormat, the same command can be used to convert the notebook to other formats. Conversion to a few other formats is shown below.
shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to script A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to script
[NbConvertApp] Writing 2077 bytes to A1.py
shanto@shanto:~/Desktop/BigData/706/Assignments$ # convert to latex
shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to latex A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to latex
[NbConvertApp] Writing 25564 bytes to A1.tex
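The same conversions could also be driven from Python. This is a small sketch under assumptions: the helper name `nbconvert_command` is made up for illustration, and it only wraps the `jupyter nbconvert --to FORMAT NOTEBOOK` command line shown above for the three formats used in this post.

```python
import shlex
import subprocess

def nbconvert_command(notebook, fmt):
    """Build the jupyter nbconvert command line for one output format."""
    return ["jupyter", "nbconvert", "--to", fmt, notebook]

def convert(notebook, formats, run=False):
    """Return the commands for all formats; execute them only when run=True."""
    commands = [nbconvert_command(notebook, fmt) for fmt in formats]
    if run:
        for cmd in commands:
            # check=True raises CalledProcessError if a conversion fails.
            subprocess.run(cmd, check=True)
    return commands

commands = convert("A1.ipynb", ["pdf", "script", "latex"])
for cmd in commands:
    print(shlex.join(cmd))
```

Calling `convert("A1.ipynb", ["pdf"], run=True)` from the notebook's directory would actually perform the conversion, provided Jupyter and a LaTeX installation are available.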
Apache Spark is a fast and general-purpose cluster computing system. To get its maximum potential, Spark should run on a distributed computing system. However, one might not have access to a distributed system all the time. Especially for learning purposes, one might want to run Spark on one's own computer. This is actually very easy to do, and there are a handful of ways to do it. I will show what I did to run Spark on my laptop.
Continue reading “Running Spark on Local Machine”
We read the file using Pandas.
import pandas as pd
import numpy as np
rawData = pd.read_csv('data-Assignment2.txt', sep=",", header=None)
We need to find the signature matrix. For that we need to make a permutation of the rows of the whole matrix. We can do that using pandas like this.
permuteData = rawData.sample(frac=1)
As a side note, we can use a frac of less than one if we want a random subsample. We can also shuffle and reset the index, so that the rows are renumbered from zero and the old index is discarded.
df = df.sample(frac=1).reset_index(drop=True) # shuffle rows, then renumber them and drop the old index
We can test if it works by using a random matrix created by Pandas.
# create a random matrix with 0 and 1, like our example matrix
df = pd.DataFrame(np.random.randint(0,2,size=(100, 4)), columns=list('ABCD'))
# now we can do a shuffle like this
df = df.sample(frac=1)
The before and after is shown by the following figure:
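To connect the shuffle to the signature matrix, here is a minimal sketch. It assumes the signature matrix meant here is a MinHash signature: for each random row permutation, record for every column the position of the first row that contains a 1. The function name, the small example matrix, and the number of permutations are illustrative assumptions, and every column is assumed to contain at least one 1.

```python
import numpy as np
import pandas as pd

def minhash_signature(df, n_permutations, seed=0):
    """One signature row per permutation: position of the first 1 in each column."""
    rows = []
    for i in range(n_permutations):
        # Shuffle the rows, then renumber them so labels equal positions.
        shuffled = df.sample(frac=1, random_state=seed + i).reset_index(drop=True)
        # idxmax gives the first label where each column reaches its maximum (the first 1).
        rows.append(shuffled.idxmax())
    return pd.DataFrame(rows).reset_index(drop=True)

# Small hypothetical 0/1 matrix: A and B are identical, A and C share no rows.
df = pd.DataFrame({
    "A": [1, 0, 0, 1, 0],
    "B": [1, 0, 0, 1, 0],
    "C": [0, 1, 1, 0, 0],
    "D": [0, 0, 1, 1, 1],
})
signature = minhash_signature(df, n_permutations=10)
print(signature)
# Fraction of matching signature rows estimates the Jaccard similarity of two columns.
print((signature["A"] == signature["D"]).mean())
```

Identical columns always receive identical signatures, and columns with no rows in common never agree, so the agreement rate between two signature columns behaves like a similarity estimate.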
a = []
for k in range(3):
    b = []  # start a fresh inner list for every row
    for j in range(4):
        b.append(j)
    a.append(b)
print(a)
# OUTPUT: [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]