Standalone Hadoop Installation and Running MapReduce

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures [1]. Hadoop MapReduce, in turn, is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster [2]. Ideally, a MapReduce job will run on a cluster of Hadoop nodes. However, for learning purposes we can run a standalone version of Hadoop MapReduce on a single computer. In this article, I will go through all the steps necessary, from installing Hadoop to running a MapReduce job on a single standalone computer. All the procedures discussed in this article were performed on Ubuntu 18.04 LTS.

Install Hadoop

Step 1: Update and Install Java

Before starting the installation, we should update the operating system using sudo apt update. The next step is to install Java (if it is not already installed). We can check the currently installed version of Java using the following command.

shant@shanto:~$ java -version
openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)
shant@shanto:~$

In my case, as we can see, Java is already installed. If it is not, we can install it using the command sudo apt install default-jdk.

Step 2: Download and Run Hadoop

Now we should download a stable release from the official Hadoop website, which lists all releases. I am going with the latest version (3.1.1). As shown in the following figure, we should remember to click on the binary link. Following along to the next page, we should be able to download the release as a .tar.gz file. At this point we can also verify the integrity of the file by following the instructions on this link. However, I have downloaded and worked with this same file in the past, so I am not going through the verification process again.

Now, we should go to the download location and use the command tar -xzvf hadoop-3.1.1.tar.gz to extract the file (or we can extract it by right-clicking) in the current directory as shown below.

shant@shanto:~$ cd ~/Desktop
shant@shanto:~/Desktop$ tar -xzvf hadoop-3.1.1.tar.gz
shant@shanto:~$

Now I will move all the subfolders of the extracted archive to /usr/local/hadoop. This step is not strictly necessary, but we do not want to pile everything up on the Desktop. Since /usr/local/hadoop may not exist yet, we should first create it with sudo mkdir -p /usr/local/hadoop; otherwise the mv command will fail. We can then move the folders into place using the following command.

shant@shanto:~/Desktop$ sudo mv hadoop-3.1.1/* /usr/local/hadoop
[sudo] password for shant:
shant@shanto:~/Desktop$
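One detail worth remembering: mv with several sources fails when the target directory does not exist, which is why the target must be created first. A minimal sketch of the pattern, using scratch directories so it runs without sudo (the paths here are stand-ins, not the real Hadoop layout):

```shell
# mv "$src"/* "$target" fails if "$target" is missing, so create it first
# (the real command in this article is: sudo mkdir -p /usr/local/hadoop).
src=$(mktemp -d)/hadoop-3.1.1
target=$(mktemp -d)/hadoop
mkdir -p "$src/bin" "$src/etc"   # stand-ins for the extracted subfolders
mkdir -p "$target"               # the step that must come before mv
mv "$src"/* "$target"
ls "$target"                     # should now list bin and etc
```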

Hadoop requires the path to Java, either as an environment variable or in the Hadoop configuration file. We can use the following command to get the correct Java path.

shant@shanto:~$ readlink -f /usr/bin/java | sed "s:bin/java::"
/usr/lib/jvm/java-11-openjdk-amd64/

Now we will open the file hadoop-env.sh and add the line export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/ or export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::") as shown in the following figure. We can open the file using nano using the command sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh.
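The sed expression in that command simply strips the trailing bin/java from the resolved path, leaving the JVM root directory. A quick sketch of what it does, using the path from this article as sample input:

```shell
# What readlink -f /usr/bin/java returned on this machine (sample input;
# the path will differ on other setups):
java_bin="/usr/lib/jvm/java-11-openjdk-amd64/bin/java"
# sed "s:bin/java::" deletes the "bin/java" suffix, leaving the JVM root
JAVA_HOME=$(echo "$java_bin" | sed "s:bin/java::")
echo "$JAVA_HOME"
# /usr/lib/jvm/java-11-openjdk-amd64/
```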

Now, we should be able to run Hadoop by typing bin/hadoop in the terminal (assuming we are already in the Hadoop directory, /usr/local/hadoop in our case). This is what happens when we run the command.

shant@shanto:/usr/local/hadoop$ bin/hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]

... ... ...
... ... ...
 
  SUBCOMMAND is one of:
 
 
Admin Commands:
 
daemonlog     get/set the log level for each daemon
 
Client Commands:
 
archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
... ... ...
... ... ...

trace         view and modify Hadoop tracing settings
version       print the version
 
Daemon Commands:
 
kms           run KMS, the Key Management Server
 
SUBCOMMAND may print help when invoked w/o parameters or with -h.
shant@shanto:/usr/local/hadoop$

Running MapReduce

Running Example MapReduce

The Hadoop folder ships with some example files for MapReduce. We can look at them in the folder /usr/local/hadoop/share/hadoop/mapreduce as shown below.

shant@shanto:/usr/local/hadoop/share/hadoop/mapreduce$ ls
hadoop-mapreduce-client-app-3.1.1.jar         hadoop-mapreduce-client-jobclient-3.1.1.jar        hadoop-mapreduce-examples-3.1.1.jar
hadoop-mapreduce-client-common-3.1.1.jar      hadoop-mapreduce-client-jobclient-3.1.1-tests.jar  jdiff
hadoop-mapreduce-client-core-3.1.1.jar        hadoop-mapreduce-client-nativetask-3.1.1.jar       lib
hadoop-mapreduce-client-hs-3.1.1.jar          hadoop-mapreduce-client-shuffle-3.1.1.jar          lib-examples
hadoop-mapreduce-client-hs-plugins-3.1.1.jar  hadoop-mapreduce-client-uploader-3.1.1.jar         sources

Now we can run hadoop-mapreduce-examples-3.1.1.jar. To run this example we need to create an input folder and copy all the XML files from /usr/local/hadoop/etc/hadoop/ into it.

shant@shanto:~$ mkdir ~/input
shant@shanto:~$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input

Now we can run the example using the following command.

shant@shanto:~$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar grep ~/input ~/grep_example 'allowed[.]*' 
2018-09-06 02:29:33,577 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2018-09-06 02:29:33,736 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2018-09-06 02:29:33,736 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2018-09-06 02:29:33,926 INFO input.FileInputFormat: Total input files to process : 9

... ... ...
... ... ...

2018-09-06 02:29:36,577 INFO mapreduce.Job: Job job_local1928501997_0002 completed successfully
2018-09-06 02:29:36,584 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=1333090
		FILE: Number of bytes written=3252451
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=2
		Map output records=2
		... ... ...
		... ... ...
		GC time elapsed (ms)=7
		Total committed heap usage (bytes)=601882624
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=147
	File Output Format Counters 
		Bytes Written=34
shant@shanto:~$ 

This will create an output folder named ~/grep_example, which we can inspect as shown below.

shant@shanto:~$ cat ~/grep_example/*
19	allowed.
1	allowed
shant@shanto:~$ 
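These counts follow from the regular expression itself: allowed[.]* matches the word allowed followed by zero or more literal dots ([.] is an escaped dot), which is why both "allowed." and a bare "allowed" are counted. A small sketch with grep on a throwaway file:

```shell
# 'allowed[.]*' matches "allowed" plus zero or more literal dots,
# so a bare "allowed" matches too.
demo=$(mktemp)
printf 'is allowed.\nnot allowed\nallowed...\n' > "$demo"
matches=$(grep -o 'allowed[.]*' "$demo")
echo "$matches"
# allowed.
# allowed
# allowed...
rm "$demo"
```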

Running MapReduce from File

Now we will try to run a MapReduce program from a source file. The source code and dataset can be found in this Git repository. After cloning the repository, we should open the folder in a terminal and export two variables, HADOOP_HOME and JAVA_HOME, as shown below.

shant@shanto:~$ export HADOOP_HOME=/usr/local/hadoop
shant@shanto:~$ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

Now we have to compile WordCount.java and package it into a JAR file, which we will use to run the MapReduce job. Before creating the JAR file, we have WordCount.java and a folder wordcount-1 which contains the text data. We can see this using the ls command.

shant@shanto:~/Desktop/blog/word_count$ ls
README.md  wordcount-1  WordCount.java

Now we will compile as shown below.

shant@shanto:~/Desktop/blog/word_count$ ${JAVA_HOME}/bin/javac -classpath `${HADOOP_HOME}/bin/hadoop classpath` WordCount.java
shant@shanto:~/Desktop/blog/word_count$ ls
 README.md     WordCount.class                  WordCount.java
 wordcount-1  'WordCount$IntSumReducer.class'  'WordCount$TokenizerMapper.class'
shant@shanto:~/Desktop/blog/word_count$ 
shant@shanto:~/Desktop/blog/word_count$ ${JAVA_HOME}/bin/jar cf wordcount.jar WordCount*.class
shant@shanto:~/Desktop/blog/word_count$ ls
 README.md     WordCount.class                  wordcount.jar   'WordCount$TokenizerMapper.class'
 wordcount-1  'WordCount$IntSumReducer.class'   WordCount.java
shant@shanto:~/Desktop/blog/word_count$ 

We can see that a file named wordcount.jar has been created in the current folder. Now we can run the job using YARN. The command takes an input path and an output path; in the following command, wordcount-1 is the input path and output-1 is the output path. If the command works as expected, we will see a folder named output-1 in the working directory. We should remember that if we want to run the job again with the same output directory, we must delete that directory first; otherwise the command will generate an error.
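That "existing output directory" error can be avoided with a small guard before re-running; a sketch (output-1 is the output path used in this section):

```shell
# Hadoop refuses to write into an existing output directory, so clear
# any leftover one before re-running the job.
out_dir="output-1"
mkdir -p "$out_dir"            # simulate a leftover previous run
if [ -d "$out_dir" ]; then
    rm -r "$out_dir"           # remove it so the job can run again
fi
echo "ready: $out_dir has been cleared"
```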

shant@shanto:~/Desktop/blog/word_count$ ${HADOOP_HOME}/bin/yarn jar wordcount.jar WordCount wordcount-1 output-1
2018-09-07 15:30:21,194 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2018-09-07 15:30:21,362 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2018-09-07 15:30:21,362 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2018-09-07 15:30:21,525 INFO input.FileInputFormat: Total input files to process : 3
... ... ...
... ... ...

		Total committed heap usage (bytes)=715128832
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=469948
	File Output Format Counters 
		Bytes Written=237607
shant@shanto:~/Desktop/blog/word_count$ ls
 output-1    wordcount-1      'WordCount$IntSumReducer.class'   WordCount.java
 README.md   WordCount.class   wordcount.jar                   'WordCount$TokenizerMapper.class'
shant@shanto:~/Desktop/blog/word_count$ 

We can see that a folder named output-1 has been generated. Now we can look at the output of the MapReduce job.

shant@shanto:~/Desktop/blog/word_count$ cd output-1/
shant@shanto:~/Desktop/blog/word_count/output-1$ ls
part-r-00000  _SUCCESS
shant@shanto:~/Desktop/blog/word_count/output-1$ cd ..
shant@shanto:~/Desktop/blog/word_count$ 
shant@shanto:~/Desktop/blog/word_count$ cat output-1/part-* | head -n 10
"'TIS	1
"'Tis	1
"'Twill	1
"--SAID	1
"A	19
"About	3
"Add	1
"Ah!	7
"Ah!"	3
"Ah!--no,--have	1

We are done! Now we can add the two variables permanently to our ~/.bashrc so that we do not have to export them every time we want to run Hadoop MapReduce. Let's open the file using nano ~/.bashrc and append the export lines for HADOOP_HOME and JAVA_HOME.
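For reference, these are the two export lines (the same paths used earlier in this article; JAVA_HOME may differ on other machines):

```shell
# Lines to append to ~/.bashrc; adjust JAVA_HOME if the JDK lives elsewhere.
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
```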

Now we do not need to export the variables every time we run a MapReduce job.

Install LAMP Stack on Ubuntu 18.04 LTS

LAMP is an archetypal model of web service stacks, named as an acronym of its original four open-source components: the GNU/Linux operating system, the Apache HTTP Server, the MySQL relational database management system (RDBMS), and the PHP programming language [1]. I am currently using Ubuntu 18.04 LTS. In this article, I will document the process of installing all the packages necessary to set up a LAMP development environment. All commands are run as a non-root sudo-enabled user account behind a basic firewall.

Step 1 – Update and Install Apache 2

The first step is to update the OS and install Apache2. Both require sudo, and the terminal will prompt us for the password.

shant@shanto:~$ # Updating OS
shant@shanto:~$ sudo apt update
[sudo] password for shant:
Ign:1 http://dl.google.com/linux/chrome/deb stable InRelease
Ign:2 http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.2 InRelease
... ...
... ...
All packages are up to date.
shant@shanto:~$ # Install Apache2
shant@shanto:~$ sudo apt install apache2
Reading package lists... Done
... ...
... ...

Step 2 – Adjust the Firewall

In this tutorial I am installing the LAMP stack on my personal computer, which I will use for experimenting and debugging during development. However, if we do this on a server machine, we need to make sure the firewall allows HTTP and HTTPS traffic. We can use the following series of commands.

shant@shanto:~$ # check available applications
shant@shanto:~$ sudo ufw app list
Available applications:
  Apache
  Apache Full
  Apache Secure
  CUPS
shant@shanto:~$ # check app information
shant@shanto:~$ sudo ufw app info "Apache Full"
Profile: Apache Full
Title: Web Server (HTTP,HTTPS)
Description: Apache v2 is the next generation of the omnipresent Apache web
server.

Ports:
  80,443/tcp
shant@shanto:~$ # Allow incoming HTTP and HTTPS traffic
shant@shanto:~$ sudo ufw allow in "Apache Full"
Rules updated
Rules updated (v6)

If everything has worked properly so far, we should be able to type localhost into our browser (or the public IP of the server, if this is done on a server) and see the default Apache page as shown below.

Step 3 – Installing MySQL

In this step, we will install mysql-server using sudo apt install mysql-server. After the installation is done, we should run the mysql_secure_installation command, which configures some basic security settings. I have decided to go with the settings shown below.

shant@shanto:~$ # Installing MySQL
shant@shanto:~$ sudo apt install mysql-server
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libevent-core-2.1-6 mysql-client-5.7 mysql-client-core-5.7 mysql-common mysql-server-5.7
  mysql-server-core-5.7
... ...
... ...

Setting up mysql-server (5.7.21-1ubuntu1) ...
Processing triggers for ureadahead (0.100.0-20) ...
Processing triggers for systemd (237-3ubuntu10) ...
shant@shanto:~$ sudo mysql_secure_installation

Securing the MySQL server deployment.

... ...
secure enough. Would you like to setup VALIDATE PASSWORD plugin?

Press y|Y for Yes, any other key for No: y

There are three levels of password validation policy:

LOW    Length >= 8
MEDIUM Length >= 8, numeric, mixed case, and special characters
STRONG Length >= 8, numeric, mixed case, special characters and dictionary file

Please enter 0 = LOW, 1 = MEDIUM and 2 = STRONG: 0
Please set the password for root here.

New password: 

Re-enter new password: 

Estimated strength of the password: 50 
Do you wish to continue with the password provided?(Press y|Y for Yes, any other key for No) : y
... ...

Remove anonymous users? (Press y|Y for Yes, any other key for No) : n

 ... skipping.


... ...

Disallow root login remotely? (Press y|Y for Yes, any other key for No) : n

... ...


Remove test database and access to it? (Press y|Y for Yes, any other key for No) : n

... ...

Reload privilege tables now? (Press y|Y for Yes, any other key for No) : n

 ... skipping.
All done! 
shant@shanto:~$

Now we can test our MySQL installation by typing sudo mysql in the terminal. It may prompt for the sudo password.

shant@shanto:~$ sudo mysql
[sudo] password for shant: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 5.7.21-1ubuntu1 (Ubuntu)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
4 rows in set (0.01 sec)

mysql> 

Adding a New User

Now, to test our MySQL installation further, we can create a new user with a password, so that we do not need to log in as root all the time.

mysql> # Add User
mysql> CREATE USER 'shanto'@'localhost' IDENTIFIED BY 'password';
Query OK, 0 rows affected (0.00 sec)

mysql> # Grant full privilege 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'shanto'@'localhost';
Query OK, 0 rows affected (0.00 sec)

mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.00 sec)

mysql> # Set password for user
mysql> SET PASSWORD FOR 'shanto'@'localhost' = 'PUT ANY PASSWORD';
Query OK, 0 rows affected (0.00 sec)

Now we should be able to log in to the MySQL command line using mysql -u shanto -p in the terminal.

shant@shanto:~$ mysql -u shanto -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 12
Server version: 5.7.21-1ubuntu1 (Ubuntu)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Step 4 – Installing PHP

Now we will install PHP. After the installation is done, we can check the installation by writing a file named info.php into the /var/www/html/ folder.

shant@shanto:~$ # Install PHP
shant@shanto:~$ sudo apt install php libapache2-mod-php php-mysql
[sudo] password for shant: 
Reading package lists... Done
... ...
... ...

shant@shanto:~$ # Open a new file for testing PHP
shant@shanto:~$ sudo nano /var/www/html/info.php

This will open a nano editor, where we can copy and paste the following text and save the file using CTRL-X followed by ‘y’.

<?php phpinfo(); ?>

Now if we type localhost/info.php into our browser, we should be able to see the following page.

Now we are ready to work with our LAMP stack system!

Creating a Conda Environment from an Existing Environment

In this article, I discussed how I created a Python environment for working on machine learning and AI problems using Conda. For a new project, I need to work with Django to create a web application. However, I still need the machine learning back-end. So I decided to start by building a conda environment identical to the one I have already created, with all the necessary machine learning packages installed. Then I will install any extra packages (e.g. Django) that I might need along the road.

In my machine learning environment (which I named ml36), I installed some packages using conda and others using pip. So I have to use both the conda list and pip freeze commands to build the requirements files, from which I will be able to recreate the environment. Let's do this step by step.

Step 1 – Activate Environment and Create Requirement Files

shant@shanto:~$ source activate ml36
(ml36) shant@shanto:~$ cd ~/Desktop/
(ml36) shant@shanto:~/Desktop$ conda list --explicit > ml36.txt
(ml36) shant@shanto:~/Desktop$ pip freeze > ml36requirements.txt

This will generate two files on the Desktop named ml36.txt and ml36requirements.txt.

Step 2 – Create New Environment

Now we will create a new environment using ml36.txt. First, I will deactivate the ml36 environment and use the file ml36.txt to create a new environment named webdev with the command conda create --name webdev --file ml36.txt as shown below. This will create the webdev environment and install all the packages that were originally installed via conda.

(ml36) shant@shanto:~/Desktop$ source deactivate
shant@shanto:~/Desktop$ conda create --name webdev --file ml36.txt
Preparing transaction: done
Verifying transaction: done
Executing transaction: \ + /home/shant/miniconda3/envs/webdev/bin/python -m nb_conda_kernels.install --enable --prefix=/home/shant/miniconda3/envs/webdev
Enabling nb_conda_kernels...
Enabled nb_conda_kernels

\ + /home/shant/miniconda3/envs/webdev/bin/jupyter-nbextension enable nb_conda --py --sys-prefix
Enabling notebook extension nb_conda/main...
      - Validating: OK
Enabling tree extension nb_conda/tree...
      - Validating: OK
+ /home/shant/miniconda3/envs/webdev/bin/jupyter-serverextension enable nb_conda --py --sys-prefix
Enabling: nb_conda
- Writing config: /home/shant/miniconda3/envs/webdev/etc/jupyter
    - Validating...
      nb_conda 2.2.1 OK

done
shant@shanto:~/Desktop$ source activate webdev
(webdev) shant@shanto:~/Desktop$

Step 3 – Install pip Packages

Now we need to install all the packages that were previously installed with pip, using the file ml36requirements.txt. We will use the command pip install -r ml36requirements.txt for this purpose as shown below.

(webdev) shant@shanto:~/Desktop$ pip install -r ml36requirements.txt 
Requirement already satisfied: absl-py==0.4.0 in /home/shant/miniconda3/envs/webdev/lib/python3.6/site-packages (from -r ml36requirements.txt (line 1)) (0.4.0)
Requirement already satisfied: alabaster==0.7.11 in /home/shant/miniconda3/envs/webdev/lib/python3.6/site-packages (from -r ml36requirements.txt (line 2)) (0.7.11)
... ...
... ...
Successfully installed Keras-2.2.2 Keras-Applications-1.0.4 Keras-Preprocessing-1.0.2 PyHamcrest-1.9.0 cycler-0.10.0 cython-0.28.5 kiwisolver-1.0.1 matplotlib-2.2.3 numpy-1.14.5 seaborn-0.9.0 tensorflow-1.10.0
You are using pip version 10.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(webdev) shant@shanto:~/Desktop$ 

We are done recreating an environment from an existing one. Now we can install Django using conda or pip, or directly from GitHub, into our freshly baked webdev environment.

(webdev) shant@shanto:~/Desktop$ conda install -c anaconda django 
Solving environment: done

## Package Plan ##

  environment location: /home/shant/miniconda3/envs/webdev

  added / updated specs: 
    - django


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    django-2.1                 |           py36_0         4.8 MB  anaconda

The following NEW packages will be INSTALLED:

    django: 2.1-py36_0 anaconda

Proceed ([y]/n)? y


Downloading and Extracting Packages
django-2.1           | 4.8 MB    | ######################################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(webdev) shant@shanto:~/Desktop$

Installing Anaconda to Setup a Machine Learning Environment

I am trying to set up a machine learning environment on my laptop. I am primarily a Python user, so I will need all the Python libraries. I will also set everything up inside a Python environment, so that I can keep my workspace neat. Creating a Python environment is necessary especially when different projects have different version requirements, for instance when working on a legacy project, which I plan to do. I am running Ubuntu 18.04 LTS. I am documenting this for my own reference, so that I know what I have done previously.

Step 1 – Download

I start by googling and going to the Anaconda page, then clicking on Linux under the regular installation section. As shown on the page, we have two options for installing Anaconda: Miniconda (needs around 400 MB of disk space) and the full Anaconda (needs around 3 GB of disk space). Anaconda ships with all the common packages preinstalled, while Miniconda installs only the basics so that we can add everything later, without a lot of unnecessary packages. I decided to install Miniconda, so I go to this link to download it. As shown in the following figure, I download the 64-bit (bash installer) for Python 3.6 to my Desktop.

Step 2 – Install

Now we need to install Miniconda. Let's open a terminal and change directory to the Desktop (or the location of the file).

shant@shanto:~$ cd ~/Desktop/
shant@shanto:~/Desktop$ ls
Miniconda3-latest-Linux-x86_64.sh
shant@shanto:~/Desktop$ clear

shant@shanto:~/Desktop$ bash Miniconda3-latest-Linux-x86_64.sh 

Welcome to Miniconda3 4.5.4

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>> 

Pressing ENTER lets the installation process continue as shown below. I will continue with all the default options, as highlighted in the following.

===================================
Miniconda End User License Agreement
===================================

Copyright 2015, Anaconda, Inc.

All rights reserved under the 3-clause BSD License:
... ...
... ...
for client/server applications by using secret-key cryptography.

cryptography
    A Python library which exposes cryptographic recipes and primitives.


Do you accept the license terms? [yes|no]
>>> yes
Miniconda3 will now be installed into this location:
/home/shant/miniconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/shant/miniconda3] >>> 
PREFIX=/home/shant/miniconda3
installing: python-3.6.5-hc3d631a_2 ...
Python 3.6.5 :: Anaconda, Inc.
installing: ca-certificates-2018.03.07-0 ...
... ...
... ...
installing: requests-2.18.4-py36he2e5f8d_1 ...
installing: conda-4.5.4-py36_0 ...
installation finished.
Do you wish the installer to prepend the Miniconda3 install location
to PATH in your /home/shant/.bashrc ? [yes|no]
[no] >>> yes

Appending source /home/shant/miniconda3/bin/activate to /home/shant/.bashrc
A backup will be made to: /home/shant/.bashrc-miniconda3.bak


For this change to become active, you have to open a new terminal.

Thank you for installing Miniconda3!
shant@shanto:~/Desktop$ 

Miniconda should now be installed. We can check the installation by typing any conda command as shown below. We then update conda to the latest version.

shant@shanto:~$ conda --version
conda 4.5.4
shant@shanto:~$ conda update conda
Solving environment: done

## Package Plan ##

  environment location: /home/shant/miniconda3

  added / updated specs: 
    - conda


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ... ...
    ------------------------------------------------------------
                                           Total:         4.6 MB

The following packages will be UPDATED:

    ... ...

Proceed ([y]/n)? y


Downloading and Extracting Packages
conda-4.5.10         |  1.0 MB | ############################################################################################### | 100% 
openssl-1.0.2p       |  3.5 MB | ############################################################################################### | 100% 
certifi-2018.8.13    |  138 KB | ############################################################################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
shant@shanto:~$ 

Step 3 – Setting up Environment

Now we can create and manage environments using Anaconda. It is always better to use a different environment for each purpose and install the necessary packages per environment, so that environments stay orthogonal and do not interfere with each other. We can list the existing environments using the command conda info --envs. As we have just installed Anaconda, there should be only one ‘base’ environment, and we can activate it using source activate base, as shown below (the activated environment is shown in parentheses at the start of the prompt). In the activated environment we can type which python to check that Python is running from our Miniconda installation.

shant@shanto:~$ conda info --envs
# conda environments:
#
base                  *  /home/shant/miniconda3

shant@shanto:~$ source activate base
(base) shant@shanto:~$ which python
/home/shant/miniconda3/bin/python
(base) shant@shanto:~$ 

Now we can add a new environment. First let's deactivate the current ‘base’ environment by typing source deactivate. When creating a new environment with conda, we can proceed with the original version of our Python installation (3.6 in this case) or specify any other version (say 3.5 or 2.7). We will create two environments, ml36 for Python 3.6 and ml27 for Python 2.7. The primary purpose of these environments is machine learning. After creation we can check that they really exist using the command that lists the environment names. Then we can switch between environments and check that each has the right version of Python. The commands for all these tasks are highlighted below.

(base) shant@shanto:~$ source deactivate
shant@shanto:~$ conda create --name ml36 python=3.6
Solving environment: done

## Package Plan ##

... ...
... ...
# To deactivate an active environment, use
#
#     $ conda deactivate

shant@shanto:~$ conda create --name ml27 python=2.7
Solving environment: done

## Package Plan ##

  environment location: /home/shant/miniconda3/envs/ml27

  added / updated specs: 
    - python=2.7

... ...
... ...
shant@shanto:~$ conda info --envs
# conda environments:
#
base                  *  /home/shant/miniconda3
ml27                     /home/shant/miniconda3/envs/ml27
ml36                     /home/shant/miniconda3/envs/ml36

shant@shanto:~$ source activate ml36
(ml36) shant@shanto:~$ python --version
Python 3.6.6 :: Anaconda, Inc.
(ml36) shant@shanto:~$ source deactivate
shant@shanto:~$ source activate ml27
(ml27) shant@shanto:~$ python --version
Python 2.7.15 :: Anaconda, Inc.
(ml27) shant@shanto:~$  

Step 4 – Installing Packages

Now we can install the necessary packages into any environment depending on our needs. Let's activate ml36 and install some packages for machine learning. First we will install Jupyter Notebook, an essential tool for interactive data analysis and experimentation.

The command conda list provides a list of the already installed packages. After installing Jupyter Notebook, we will install pandas, spyder, numpy, scikit-learn, tensorflow, keras, pyyaml, h5py, matplotlib, seaborn, argparse, and pytorch.

shant@shanto:~$ source activate ml36
(ml36) shant@shanto:~$ conda list
# packages in environment at /home/shant/miniconda3/envs/ml36:
#
# Name                    Version                   Build  Channel
ca-certificates           2018.03.07                    0  
certifi                   2018.8.13                py36_0  
... ...
... ...

xz                        5.2.4                h14c3975_4  
zlib                      1.2.11               ha838bed_2  
(ml36) shant@shanto:~$ conda install jupyter
Solving environment: done

## Package Plan ##

  environment location: /home/shant/miniconda3/envs/ml36
... ...
... ...

Verifying transaction: done
Executing transaction: done
(ml36) shant@shanto:~$ jupyter-notebook
[I 02:27:50.100 NotebookApp] Writing notebook server cookie secret to /run/user/1002/jupyter/notebook_cookie_secret
[I 02:27:50.332 NotebookApp] Serving notebooks from local directory: /home/shant
[I 02:27:50.332 NotebookApp] The Jupyter Notebook is running at:

At this moment the Jupyter Notebook installation is done, and it should open in a tab in our default browser after the command jupyter-notebook, as shown in the following figure.

However, if we click on the down arrow of the New button in the top right corner, we do not see the new environment (ml36) there. To solve this we need to install nb_conda using the command conda install nb_conda in the terminal. Now if we open Jupyter by typing jupyter-notebook in the terminal, we see our environment listed in Jupyter Notebook as shown below.

Now let's install the packages using the following commands in the terminal while the conda environment is active.

(ml36) shant@shanto:~$ conda install scipy
... ...
(ml36) shant@shanto:~$ conda install pandas
... ...
(ml36) shant@shanto:~$ conda install spyder
... ...
(ml36) shant@shanto:~$ conda install -c conda-forge tensorflow
... ...
(ml36) shant@shanto:~$ conda install -c conda-forge keras
... ...
(ml36) shant@shanto:~$ pip install matplotlib seaborn argparse
... ...
(ml36) shant@shanto:~$ conda install scikit-learn
... ...
(ml36) shant@shanto:~$ conda install -c anaconda xlrd
... ...
(ml36) shant@shanto:~$ conda install -c anaconda beautifulsoup4
... ...
(ml36) shant@shanto:~$ conda install -c bokeh bokeh
... ...
(ml36) shant@shanto:~$ conda install -c bokeh/label/dev bokeh
... ...
(ml36) shant@shanto:~$ conda install -c conda-forge ipywidgets
... ...
(ml36) shant@shanto:~$ conda install pytorch-cpu torchvision-cpu -c pytorch
... ...

I just went with all the default options while installing the packages. As shown in the gist below, all of them work perfectly. We need to remember that, depending on the environment (with or without a GPU), we need to install the right version of PyTorch. As I am installing everything on my laptop, which does not have a GPU, I selected the CPU version of PyTorch. The necessary command for a given environment can be generated from the PyTorch official website.
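As a quick sanity check that the environment is usable, a script like the following can be run inside the activated environment. This is just an illustrative sketch, not part of the original setup; note that the import names sklearn and torch differ from the conda package names scikit-learn and pytorch.

```python
import importlib.util

# Import names for the packages installed above (sklearn and torch are the
# import names for scikit-learn and pytorch respectively).
packages = ["scipy", "pandas", "numpy", "sklearn", "matplotlib",
            "seaborn", "tensorflow", "keras", "torch", "bokeh"]

for name in packages:
    # find_spec returns None for a missing top-level package without importing it
    found = importlib.util.find_spec(name) is not None
    print(f"{name}: {'OK' if found else 'NOT FOUND'}")
```

Using find_spec avoids actually importing heavyweight libraries like tensorflow, so the check runs in a second or two.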

Install Avro for Ubuntu 18.04 LTS

Avro is a phonetic keyboard for Unicode Bangla typing. For a Windows machine, Avro has its own executable files and the installation process is pretty straightforward. However, the installation process on a Linux machine is not as simple (at least compared to the Windows installation process). I am an Ubuntu user. As I am upgrading from 16.04 LTS to 18.04 LTS, I need to reinstall Avro, and it seems the installation process is a little different (as always). In this article, I am documenting the whole process for future reference (mostly for myself).

Step 1

From the search option in Ubuntu, we need to search for Language Support as shown in the following figure.



Web Scraping Using lxml

In this article I will demonstrate a few examples of web scraping. According to the Wikipedia article, web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. There are lots of tutorials on web scraping. In this post I will demonstrate web scraping while solving a few problems. I will use Python 3 and a few libraries for this purpose. Let's get into the problem.
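To give a flavor of what such scraping looks like, here is a minimal sketch using lxml's XPath support on a hypothetical HTML snippet (a real page would first be fetched, e.g. with urllib; the class names here are made up for illustration):

```python
from lxml import html

# Hypothetical page markup standing in for a fetched document.
page = """
<html><body>
  <div class="post"><h2>First Post</h2></div>
  <div class="post"><h2>Second Post</h2></div>
</body></html>
"""

tree = html.fromstring(page)
# XPath: take the text of every <h2> inside a <div class="post">
titles = tree.xpath('//div[@class="post"]/h2/text()')
print(titles)  # ['First Post', 'Second Post']
```

The same xpath() call works unchanged on a tree built from a live page, which is what makes lxml convenient for this kind of extraction.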

GPGPU Programming with CUDA for Color Space Conversion

General-purpose computing on graphics processing units (GPGPU, rarely GPGP) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU) [1].
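As a concrete example of the kind of per-pixel computation such a GPU kernel parallelizes, here is a CPU reference in NumPy for an RGB to YCbCr conversion using BT.601 full-range coefficients. This is an illustrative sketch only; the CUDA version of the article may target a different color space or coefficient set.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """CPU reference for the per-pixel RGB -> YCbCr conversion a GPU
    kernel would apply (ITU-R BT.601, full-range)."""
    img = img.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return np.stack([y, cb, cr], axis=-1)

# Mid-gray maps to Y = 128 with neutral chroma (Cb = Cr = 128).
gray = np.full((2, 2, 3), 128, dtype=np.uint8)
print(rgb_to_ycbcr(gray)[0, 0])
```

On a GPU, each pixel's conversion is independent, so one CUDA thread per pixel is the natural mapping, which is exactly what makes this workload a good fit for GPGPU.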

What I Just Read : Assessing Cardiovascular Risk Factors with Computer Vision

One of my friends on Facebook, who happens to be a data scientist, shared some very exciting news. I do not click all the links that are shared by my friends on Facebook. But this time I had to. The title was enough for any technology enthusiast to at least click for the details.

https://research.googleblog.com/
https://www.nature.com/articles/s41551-018-0195-0.pdf

Setting up Jupyter notebook with Tensorflow, Keras and Pytorch for Deep Learning

I was trying to set up my Jupyter notebook to work on some deep learning problems (some image classification on the MNIST and ImageNet datasets) on my laptop (Ubuntu 16.04 LTS). Previously I had used a little bit of Keras (which runs on top of Tensorflow) on a small dataset, but I did not use it with Jupyter. For that purpose I installed Tensorflow and Keras independently and used them in a Python script. However, it was not working from my Jupyter notebook. I googled for the solution, but found nothing concrete. I tried to activate the tensorflow environment and run jupyter notebook from there, but in vain. I guess the reason is that I downloaded different packages at different times, which might have caused some compatibility issues. Therefore, I decided to create a BRAND NEW conda environment for my deep learning endeavor. This is how it goes:

Using Basemap for Plotting Geographical Coordinates

Basemap is a great tool for creating maps using Python in a simple way. It's a matplotlib extension, so it has all of matplotlib's features for creating data visualizations, and it adds geographical projections and some datasets to plot coastlines, countries, and so on directly from the library [1].