Standalone Hadoop Installation and Running MapReduce

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures [1]. Hadoop MapReduce, in turn, is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster [2]. Ideally, a MapReduce job runs on a cluster of Hadoop nodes. However, for learning purposes we can run a standalone version of Hadoop MapReduce on a single computer. In this article, I will go through all the steps necessary, from installing Hadoop to running a MapReduce job, on a single standalone computer. All the procedures discussed in this article were performed on Ubuntu 18.04 LTS.

Install Hadoop

Step 1: Update and Install Java

Before starting the installation, we should update the package lists using sudo apt update. The next step is installing Java (if it is not already installed). We can check the currently installed version of Java using the following command.

shant@shanto:~$ java -version
openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)
shant@shanto:~$

In my case, as we can see, Java is already installed. If it is not, we can install it using the command sudo apt install default-jdk.
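
Putting the update and install steps together, the sequence looks roughly like this (a sketch; default-jdk pulls in whatever OpenJDK version the Ubuntu release ships):

sudo apt update
sudo apt install default-jdk
java -version    # confirm the installation succeeded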

Step 2: Download and Run Hadoop

Now we should download a stable release from the official Hadoop website, which lists all available releases. I am going with the latest version (3.1.1). As shown in the following figure, we should remember to click on the binary link; the next page lets us download the release as a .tar.gz file. At this point we can also verify the integrity of the file by following the instructions on this link. However, I have downloaded and worked with the same file in the past, so I am not going through the verification process again.
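
Alternatively, the same archive can be fetched from the command line. The URL below points at the Apache release archive and is my assumption based on the standard archive layout, so it may need adjusting:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz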

Now we should go to the download location and use the command tar -xzvf hadoop-3.1.1.tar.gz to extract the archive (or we can extract it by right-clicking) into the current directory, as shown below.

shant@shanto:~$ cd ~/Desktop
shant@shanto:~/Desktop$ tar -xzvf hadoop-3.1.1.tar.gz
shant@shanto:~/Desktop$

Now I will move the contents of the extracted folder to /usr/local/hadoop. This step is not mandatory, but we do not want to pile everything up on the desktop. If /usr/local/hadoop does not exist yet, we should create it first with sudo mkdir -p /usr/local/hadoop. Then we can move everything into place with the following command.

shant@shanto:~/Desktop$ sudo mv hadoop-3.1.1/* /usr/local/hadoop
[sudo] password for shant:
shant@shanto:~/Desktop$

Hadoop requires the path to Java, either as an environment variable or in the Hadoop configuration file. We can use the following command to get the correct Java path.

shant@shanto:~$ readlink -f /usr/bin/java | sed "s:bin/java::"
/usr/lib/jvm/java-11-openjdk-amd64/

Now we will open the file hadoop-env.sh with the command sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh and add either the line export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/ (the static path) or export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::") (the dynamic variant), as shown in the following figure.
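
After the edit, the relevant part of hadoop-env.sh should look roughly like this (only one of the two lines is needed):

# in /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
# or, equivalently, resolve the path dynamically:
# export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")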

Now we should be able to run Hadoop by typing bin/hadoop in the terminal (assuming we are already in the Hadoop installation directory, /usr/local/hadoop in our case). This is what happens when we run the command.

shant@shanto:/usr/local/hadoop$ bin/hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]

... ... ...
... ... ...
 
  SUBCOMMAND is one of:
 
 
Admin Commands:
 
daemonlog     get/set the log level for each daemon
 
Client Commands:
 
archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
... ... ...
... ... ...

trace         view and modify Hadoop tracing settings
version       print the version
 
Daemon Commands:
 
kms           run KMS, the Key Management Server
 
SUBCOMMAND may print help when invoked w/o parameters or with -h.
shant@shanto:/usr/local/hadoop$
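
As a quick sanity check, we can also print the installed version using the version subcommand listed in the usage output above:

bin/hadoop version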

Running MapReduce

Running Example MapReduce

The Hadoop installation ships with some example MapReduce programs. We can look at them in the folder /usr/local/hadoop/share/hadoop/mapreduce, as shown below.

shant@shanto:/usr/local/hadoop/share/hadoop/mapreduce$ ls
hadoop-mapreduce-client-app-3.1.1.jar         hadoop-mapreduce-client-jobclient-3.1.1.jar        hadoop-mapreduce-examples-3.1.1.jar
hadoop-mapreduce-client-common-3.1.1.jar      hadoop-mapreduce-client-jobclient-3.1.1-tests.jar  jdiff
hadoop-mapreduce-client-core-3.1.1.jar        hadoop-mapreduce-client-nativetask-3.1.1.jar       lib
hadoop-mapreduce-client-hs-3.1.1.jar          hadoop-mapreduce-client-shuffle-3.1.1.jar          lib-examples
hadoop-mapreduce-client-hs-plugins-3.1.1.jar  hadoop-mapreduce-client-uploader-3.1.1.jar         sources

Now we can run hadoop-mapreduce-examples-3.1.1.jar. To run this example we need to create an input folder and copy all the XML files from /usr/local/hadoop/etc/hadoop/ into it.

shant@shanto:~$ mkdir ~/input
shant@shanto:~$ cp /usr/local/hadoop/etc/hadoop/*.xml ~/input

Now we can run the grep example with the following command. It searches every file in the input folder for strings matching the regular expression 'allowed[.]*' and counts how many times each match occurs.

shant@shanto:~$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar grep ~/input ~/grep_example 'allowed[.]*' 
2018-09-06 02:29:33,577 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2018-09-06 02:29:33,736 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2018-09-06 02:29:33,736 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2018-09-06 02:29:33,926 INFO input.FileInputFormat: Total input files to process : 9

... ... ...
... ... ...

2018-09-06 02:29:36,577 INFO mapreduce.Job: Job job_local1928501997_0002 completed successfully
2018-09-06 02:29:36,584 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=1333090
		FILE: Number of bytes written=3252451
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=2
		Map output records=2
		... ... ...
		... ... ...
		GC time elapsed (ms)=7
		Total committed heap usage (bytes)=601882624
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=147
	File Output Format Counters 
		Bytes Written=34
shant@shanto:~$ 

This creates an output folder named ~/grep_example, whose contents we can inspect as shown below.

shant@shanto:~$ cat ~/grep_example/*
19	allowed.
1	allowed
shant@shanto:~$ 
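
To experiment a bit more, we can run the same example again with a different regular expression and a new output directory (the pattern and directory name below are only illustrative; the output directory must not already exist):

/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar grep ~/input ~/grep_example2 'dfs[a-z.]+'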

Running MapReduce from File

Now we will try to run a MapReduce program from source. The source code and dataset can be found in this Git repository. After cloning the repository, we should open the folder in a terminal and export two environment variables, HADOOP_HOME and JAVA_HOME, as shown below.

shant@shanto:~$ export HADOOP_HOME=/usr/local/hadoop
shant@shanto:~$ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

Now we have to compile WordCount.java and package it into a JAR file, which we will use to run the MapReduce job. Before creating the JAR file, the folder contains WordCount.java and a subfolder wordcount-1 which holds the input data, as we can see with the ls command.

shant@shanto:~/Desktop/blog/word_count$ ls
README.md  wordcount-1  WordCount.java

Now we will compile and create the JAR as shown below.

shant@shanto:~/Desktop/blog/word_count$ ${JAVA_HOME}/bin/javac -classpath `${HADOOP_HOME}/bin/hadoop classpath` WordCount.java
shant@shanto:~/Desktop/blog/word_count$ ls
 README.md     WordCount.class                  WordCount.java
 wordcount-1  'WordCount$IntSumReducer.class'  'WordCount$TokenizerMapper.class'
shant@shanto:~/Desktop/blog/word_count$ 
shant@shanto:~/Desktop/blog/word_count$ ${JAVA_HOME}/bin/jar cf wordcount.jar WordCount*.class
shant@shanto:~/Desktop/blog/word_count$ ls
 README.md     WordCount.class                  wordcount.jar   'WordCount$TokenizerMapper.class'
 wordcount-1  'WordCount$IntSumReducer.class'   WordCount.java
shant@shanto:~/Desktop/blog/word_count$ 

We can see that a file named wordcount.jar has been created in the current folder. Now we can run the job using YARN. The command takes an input path and an output path; in the following command wordcount-1 is the input path and output-1 is the output path. If the command works as expected, we will see a folder named output-1 in the working directory. We should remember that if we want to run the job again with the same output directory, we must delete that directory first; otherwise the command will fail with an error.

shant@shanto:~/Desktop/blog/word_count$ ${HADOOP_HOME}/bin/yarn jar wordcount.jar WordCount wordcount-1 output-1
2018-09-07 15:30:21,194 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2018-09-07 15:30:21,362 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2018-09-07 15:30:21,362 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2018-09-07 15:30:21,525 INFO input.FileInputFormat: Total input files to process : 3
... ... ...
... ... ...

		Total committed heap usage (bytes)=715128832
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=469948
	File Output Format Counters 
		Bytes Written=237607
shant@shanto:~/Desktop/blog/word_count$ ls
 output-1    wordcount-1      'WordCount$IntSumReducer.class'   WordCount.java
 README.md   WordCount.class   wordcount.jar                   'WordCount$TokenizerMapper.class'
shant@shanto:~/Desktop/blog/word_count$ 

We can see that a folder named output-1 has been generated. Now we can observe the generated output of the MapReduce job.

shant@shanto:~/Desktop/blog/word_count$ cd output-1/
shant@shanto:~/Desktop/blog/word_count/output-1$ ls
part-r-00000  _SUCCESS
shant@shanto:~/Desktop/blog/word_count/output-1$ cd ..
shant@shanto:~/Desktop/blog/word_count$ 
shant@shanto:~/Desktop/blog/word_count$ cat output-1/part-* | head -n 10
"'TIS	1
"'Tis	1
"'Twill	1
"--SAID	1
"A	19
"About	3
"Add	1
"Ah!	7
"Ah!"	3
"Ah!--no,--have	1

We are done! Now we can add the variables permanently to our ~/.bashrc profile so that we do not need to export them every time we run Hadoop MapReduce. Let's open ~/.bashrc using nano ~/.bashrc and add the lines shown below.
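
These are the same two exports we used earlier; the PATH line is optional and only there for convenience (an addition on my part, not required by the steps above), so it can be left out:

# appended to ~/.bashrc
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
# optional: put the Hadoop binaries on the PATH
export PATH=$PATH:$HADOOP_HOME/bin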

After saving the file and opening a new terminal (or running source ~/.bashrc), we no longer need to export the variables every time we run a MapReduce job.
