In this article I will demonstrate a few examples of web scraping. According to the Wikipedia article, web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. There…
I was trying to set up my Jupyter notebook to work on some deep learning problems (image classification on the MNIST and ImageNet datasets) on my laptop (Ubuntu 16.04 LTS). Previously I had used a little bit of Keras (which runs on top of TensorFlow) on a small dataset, but not with Jupyter. For that purpose I had installed TensorFlow and Keras independently and used them in a Python script. However, they did not work from my Jupyter notebook. I googled for a solution, but found nothing concrete. I tried to activate the tensorflow environment and run Jupyter notebook from there, but in vain. I guess the reason is that I downloaded different packages at different times, which might have caused compatibility issues. Therefore, I decided to create a BRAND NEW conda environment for my deep learning endeavor. This is how it goes:
Continue reading “Setting up Jupyter notebook with Tensorflow, Keras and Pytorch for Deep Learning”
Basemap is a great tool for creating maps in Python in a simple way. It is a matplotlib extension, so it has all of matplotlib's features for creating data visualizations, and it adds geographical projections and some datasets so that coastlines, countries, and so on can be plotted directly from the library.
Continue reading “Using baseplot for Ploting Geographical Coordinates”
When I work with big data technologies like Spark and a large dataset, I like to work on the university cloud, where everything is faster. However, for various reasons I sometimes have to move to my local computer (my laptop). This time the reason is that I need to use a matplotlib extension for Python, named Basemap, which is not installed on the cloud. However, the data I need to work on is in the cloud HDFS. Therefore, I need to copy the data from HDFS to my laptop. This can be done in two simple steps:
Step 1: copy the data from HDFS to the local file system of the remote machine (not HDFS)
Step 2: copy the data from the remote machine's local file system to my laptop
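The two steps can be sketched as small Python helpers that only build the commands, without running them. This is a minimal sketch, not the post's own code: the user name, host name and paths are hypothetical placeholders, and it assumes the cluster exposes the standard hdfs dfs -get command and that the remote machine is reachable with scp.

```python
import shlex

def hdfs_to_remote_local(hdfs_path, remote_dir):
    """Step 1 (run on the remote machine): copy from HDFS to its local disk."""
    return ["hdfs", "dfs", "-get", hdfs_path, remote_dir]

def remote_local_to_laptop(user, host, remote_path, local_dir):
    """Step 2 (run on the laptop): copy from the remote machine over SSH."""
    return ["scp", "-r", f"{user}@{host}:{remote_path}", local_dir]

# Hypothetical user, host and paths, for illustration only.
step1 = hdfs_to_remote_local("/user/alice/data", "/home/alice/data")
step2 = remote_local_to_laptop("alice", "cloud.example.edu", "/home/alice/data", ".")

# Print the commands so they can be checked before running them by hand.
print(shlex.join(step1))
print(shlex.join(step2))
```

Each list can be passed to subprocess.run on the machine where that step belongs.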
Continue reading “Copy File from Cloud HDFS to Local Computer”
In this post I am going to compile interview questions for data science roles. A big part of them are questions that I faced during my own interviews. I have also gathered questions from different websites that I found interesting. So, let's get started.
What do you know about bias, variance, and the bias-variance tradeoff?
In statistics and machine learning, the bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set [Wikipedia]:
- The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). Bias refers to the simplifying assumptions made by a model to make the target function easier to learn. Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines. Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
- The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting). Variance is the amount by which the estimate of the target function would change if different training data were used. Low variance suggests small changes to the estimate of the target function with changes to the training dataset. High variance suggests large changes to the estimate of the target function with changes to the training dataset. Generally, nonparametric machine learning algorithms that have a lot of flexibility have a high variance. For example, decision trees have a high variance, which is even higher if the trees are not pruned before use. Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression. Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.
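The tradeoff can be illustrated numerically. The following is a minimal sketch under assumptions of my own choosing (a sine target function, Gaussian noise, and polynomial fits of degree 1 and 7); it refits each model on many resampled training sets and estimates the squared bias and the variance of the predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    """Assumed target function for this illustration."""
    return np.sin(2 * np.pi * x)

x = np.linspace(0.0, 1.0, 20)  # fixed training inputs
noise_sd = 0.3                 # standard deviation of the label noise
n_trials = 200                 # number of resampled training sets

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_trials, x.size))
    for t in range(n_trials):
        # Draw a fresh noisy training set and refit the model.
        y = true_fn(x) + rng.normal(0.0, noise_sd, size=x.size)
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x)) ** 2)  # error of the average model
    variance = np.mean(preds.var(axis=0))             # spread across training sets
    return bias_sq, variance

bias_lin, var_lin = bias_variance(1)    # rigid model
bias_poly, var_poly = bias_variance(7)  # flexible model

print(f"degree 1: bias^2={bias_lin:.4f}, variance={var_lin:.4f}")
print(f"degree 7: bias^2={bias_poly:.4f}, variance={var_poly:.4f}")
```

In this setup the straight line should show the larger squared bias and the degree-7 polynomial the larger variance, which is the pattern described above.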
As a data scientist, I frequently use Jupyter notebook. When writing a report, one might need to print out (on paper) the full notebook. There is a print preview option in the current version of Jupyter notebook, but no print option.
I tried the CTRL + P command on the print preview page, but the output was horrible (as when we try to print a web page). I googled and found a better way of doing it.
I am running Jupyter notebook on Ubuntu 16.04. The steps are very simple:
(1) Open terminal
(2) Change directory (where the notebook is located)
(3) Use command: jupyter nbconvert --to pdf A1.ipynb (A1.ipynb is my notebook)
shanto@shanto:~$ cd ~/Desktop/BigData/706/Assignments/
shanto@shanto:~/Desktop/BigData/706/Assignments$ ls
A1.ipynb
shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to pdf A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to pdf
[NbConvertApp] Writing 25564 bytes to notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', 'notebook.tex']
[NbConvertApp] Running bibtex 1 time: ['bibtex', 'notebook']
[NbConvertApp] WARNING | bibtex had problems, most likely because there were no citations
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 23494 bytes to A1.pdf
shanto@shanto:~/Desktop/BigData/706/Assignments$
The figure shows a snapshot of the generated *.pdf file. The file is reasonably neat, with good formatting.
If we change the --to pdf part to --to whateverFormat, the same command can be used to convert the notebook to other formats. Conversion to a few other formats is shown below.
shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to script A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to script
[NbConvertApp] Writing 2077 bytes to A1.py
shanto@shanto:~/Desktop/BigData/706/Assignments$ # convert to latex
shanto@shanto:~/Desktop/BigData/706/Assignments$ jupyter nbconvert --to latex A1.ipynb
[NbConvertApp] Converting notebook A1.ipynb to latex
[NbConvertApp] Writing 25564 bytes to A1.tex
shanto@shanto:~/Desktop/BigData/706/Assignments$
Apache Spark is a fast and general-purpose cluster computing system. To get its maximum potential, Spark should run on a distributed computing system. However, one might not have access to a distributed system all the time. Especially for learning purposes, one might want to run Spark on one's own computer. This is actually very easy to do, and there are a handful of ways to do it. I will show what I have done to run Spark on my laptop.
Continue reading “Running Spark on Local Machine”
We read the file using Pandas.
import pandas as pd
import numpy as np
rawData = pd.read_csv('data-Assignment2.txt', sep=",", header=None)
We need to find the signature matrix. For that we need to make a permutation of the rows of the whole matrix. We can do that using pandas like this.
permuteData = rawData.sample(frac=1)
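How a row permutation feeds into the signature matrix can be sketched as follows. This is only an illustration and rests on an assumption: that the signature matrix meant here is the minhash signature, where each permutation contributes one row holding, for every column, the position of the first 1 in the permuted order. The small matrix and the function name minhash_row are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 matrix: rows are elements, columns are sets.
df = pd.DataFrame(
    [[1, 0, 0, 1],
     [0, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 0, 1, 0]],
    columns=list("ABCD"),
)

def minhash_row(matrix, seed):
    """One signature row: per column, the first permuted position holding a 1."""
    permuted = matrix.sample(frac=1, random_state=seed).reset_index(drop=True)
    # argmax returns the first index of the maximum, i.e. the first 1,
    # as long as every column contains at least one 1 (true for this matrix).
    return permuted.to_numpy().argmax(axis=0)

# Three permutations give a signature matrix with three rows.
signature = np.array([minhash_row(df, seed) for seed in range(3)])
print(signature)
```

Using a fixed random_state per permutation keeps the result reproducible; dropping it gives a fresh permutation on every call.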
As a side note, we can use frac less than one if we want a random subsample. We can also shuffle and reset the index in one step, assigning the result back to the same variable:
df = df.sample(frac=1).reset_index(drop=True) # shuffle rows and discard the old index
We can test if it works by using a random matrix created by Pandas.
# create a random matrix with 0 and 1, like our example matrix
df = pd.DataFrame(np.random.randint(0,2,size=(100, 4)), columns=list('ABCD'))
# now we can do a shuffle like this
df = df.sample(frac=1)
The data before and after the shuffle is shown in the following figure:
a = []
b = []
for k in range(3):
    for j in range(4):
        a.append(j)
    b.append(a)
    a = []  # rebind a to a fresh list so the rows already in b stay independent
print(b)
# OUTPUT: [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
The first job is to download VirtualBox. We can find the download page by googling VirtualBox. This is where it should take us.
For any machine learning or data mining purpose, the first job is to pre-process the data so that we can use it for the original purpose. In many cases we have the raw data in *.csv format, which we need to import and preprocess using the language chosen for the particular job. Python is one of the most popular languages for this purpose. For this article I will use Python and one very popular library named pandas to show how to read, import and preprocess a *.csv file.
We have a *.csv file which we want to pre-process. This is a file with a large number of columns, so it is not a good idea to display all of it here. I am showing a part of it.
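As a rough illustration of looking at only part of a file and cleaning it, here is a minimal sketch. The data is a hypothetical stand-in, since the real file is not shown here, and the two preprocessing steps (counting missing values and dropping incomplete rows) are common choices rather than the ones used in the full article.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real *.csv file.
csv_text = """id,age,city,income
1,34,Halifax,52000
2,,Toronto,61000
3,29,,48000
4,41,Montreal,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Show only a part of the data: a subset of columns and the first rows.
preview = df.loc[:, ["id", "age", "city"]].head(3)
print(preview)

# Count missing values per column, then drop rows with any missing value.
missing_per_column = df.isna().sum()
clean = df.dropna().reset_index(drop=True)
print(missing_per_column)
print(clean)
```

For a file on disk, the io.StringIO wrapper is replaced by the file path passed to pd.read_csv.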
Continue reading “*.csv File Preprocessing Using Pandas”
The word count problem is known as the ‘Hello World’ of MapReduce. In this article I will explain, in my own way, how I understand the different parts of MapReduce. The code provided in this article is trivial and is available in lots of places, including the official MapReduce website. My focus is on how it really works.
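To make the moving parts concrete, here is a minimal single-process sketch of word count in plain Python. It is not the article's code and does not use Hadoop; the function names map_phase, shuffle_phase and reduce_phase are hypothetical and only mimic the roles of the mapper, the framework's grouping step, and the reducer.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

lines = ["hello world", "hello MapReduce"]

# Run the mapper on every line and collect all emitted pairs.
mapped = [pair for line in lines for pair in map_phase(line)]

# Group by word, then reduce each group to a single count.
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts)
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data between them; here everything happens in one process to keep the data flow visible.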
Continue reading “Understanding MapReduce in My Way : Starting with Word Count”