
Hadoop on Debian Wheezy

Hadoop

We at the ButtonFactory care deeply about Data. And the latest hype seems to be that the Data must be Big. So let’s get our hands dirty and take a plunge into Hadoop.

This is how you can install Hadoop on a Debian Wheezy virtual machine in VirtualBox.
My laptop is running Windows 8 Enterprise.

Java Development Kit

Install Debian Wheezy from here.
When choosing packages, select only base system and SSH server.
We need to install Java 6 (see wiki).

The .bin file is a self-extracting archive, so make it executable and run it:

$ chmod u+x jdk-6u38-linux-x64.bin
$ ./jdk-6u38-linux-x64.bin

We will move the JDK to /opt like this:

$ mkdir /opt/jvm
$ mv jdk1.6.0_38/ /opt/jvm/jdk1.6.0_38/

$ update-alternatives --install /usr/bin/java java /opt/jvm/jdk1.6.0_38/jre/bin/java 3
$ update-alternatives --config java
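
If you also want the Java compiler managed the same way (not needed for just running Hadoop, but it doesn’t hurt), you can register javac too:

$ update-alternatives --install /usr/bin/javac javac /opt/jvm/jdk1.6.0_38/bin/javac 3
$ update-alternatives --config javac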

Now we can check the version:

$ java -version
java version "1.6.0_38"
Java(TM) SE Runtime Environment (build 1.6.0_38-b05)
Java HotSpot(TM) 64-Bit Server VM (build 20.13-b02, mixed mode)

The Hadoop user

I followed Michael Noll’s howto for Ubuntu.

We need to create a hadoop group and a user called hduser, and we’ll put hduser in the hadoop group:

addgroup hadoop
adduser --ingroup hadoop hduser
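
Hadoop’s start and stop scripts log in to localhost over SSH, so hduser needs key-based SSH access to its own account (this step is also in Michael Noll’s howto; the empty passphrase is convenient here, if not terribly secure):

su - hduser
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost

Accept the host key once when you ssh to localhost, so the start scripts won’t hang on that prompt later.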

We need to disable IPv6, so let’s add the following lines to the end of /etc/sysctl.conf:

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
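
You can activate these settings without a reboot and verify them like this (a value of 1 means IPv6 is disabled):

sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6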

Next we need to add Java and Hadoop to the path.

Add Java to the path

We need to set the JAVA_HOME variable and add Java to our PATH.
Edit ~/.bashrc:

# Set Hadoop-related environment variables
export HADOOP_HOME=/opt/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/opt/jvm/jdk1.6.0_38

# Add Hadoop bin/ and Java bin/ directories to PATH
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$JAVA_HOME/bin
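
Reload the file so the new variables take effect in the current shell, and verify:

source ~/.bashrc
echo $JAVA_HOME
which java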

Next we can install Hadoop, also in /opt. I prefer /opt, but feel free to use /usr or something else.

Installing Hadoop

I’m using Hadoop 1.0.4, because it is the stable release; we might want to test 2.0 later on. With the hadoop-1.0.4.tar.gz tarball downloaded to /opt:

cd /opt
tar xzf hadoop-1.0.4.tar.gz
mv hadoop-1.0.4 hadoop
chown -R hduser:hadoop hadoop

We need to edit some of Hadoop’s config files now.

hadoop-env.sh

Open /opt/hadoop/conf/hadoop-env.sh and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory. Change

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to

# The java implementation to use.  Required.
export JAVA_HOME=/opt/jvm/jdk1.6.0_38

core-site.xml

This is where we tell Hadoop where to store its data and which default filesystem to use. Edit /opt/hadoop/conf/core-site.xml:

<configuration>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system.  A URI whose
    scheme and authority determine the FileSystem implementation.  The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class.  The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>

</configuration>

We need to create this directory and set ownership correctly:

mkdir -p /app/hadoop/tmp
chown hduser:hadoop /app/hadoop/tmp
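
You may also want to tighten the permissions on this directory, as Michael Noll’s howto does; otherwise other users on the box can read the HDFS data:

chmod 750 /app/hadoop/tmp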

mapred-site.xml

vim /opt/hadoop/conf/mapred-site.xml

<configuration>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at.  If "local", then jobs are run in-process as a single map
    and reduce task.</description>
  </property>

</configuration>

hdfs-site.xml

vim /opt/hadoop/conf/hdfs-site.xml

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.</description>
  </property>

</configuration>

Starting Hadoop

su to hduser, because Hadoop runs as that user.

First format the namenode:

/opt/hadoop/bin/hadoop namenode -format


Now start Hadoop:

/opt/hadoop/bin/start-all.sh

You can check if everything is running with the jps command that ships with the JDK:

hduser@wheezy:$ jps
2764 JobTracker
3374 Jps
2554 DataNode
2667 SecondaryNameNode
2879 TaskTracker
2449 NameNode
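
You can also check whether Hadoop is listening on the configured ports (54310 and 54311 should show up):

netstat -plten | grep java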

If something went wrong and you don’t see the DataNode running, stop Hadoop, remove all the files in /app/hadoop/tmp, format the namenode and start again.

bin/stop-all.sh 
cd /app/hadoop/tmp/
rm -rf *
cd /opt/hadoop/
bin/hadoop namenode -format
bin/start-all.sh

Now you should be able to browse to:
http://localhost:50070/ – web UI of the NameNode daemon
http://localhost:50030/ – web UI of the JobTracker daemon
http://localhost:50060/ – web UI of the TaskTracker daemon

If you browse from the host machine, replace ‘localhost’ with the IP address of your virtual Debian machine.
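
You can look up that IP on the guest like this (assuming the usual eth0 interface; note that with VirtualBox’s default NAT networking the host cannot reach the guest directly, so use a bridged or host-only adapter, or set up port forwarding):

/sbin/ifconfig eth0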

Run a MapReduce Task

You can run one of the examples like this:

hduser@wheezy:/opt/hadoop$ bin/hadoop jar hadoop-examples-1.0.4.jar pi 10 1000000

When all goes well the output should be something like this:

SNIP
12/12/16 16:48:57 INFO mapred.JobClient:     Combine output records=0
12/12/16 16:48:57 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1655828480
12/12/16 16:48:57 INFO mapred.JobClient:     Reduce output records=0
12/12/16 16:48:57 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5379981312
12/12/16 16:48:57 INFO mapred.JobClient:     Map output records=20
Job Finished in 71.721 seconds
Estimated value of Pi is 3.14158440000000000000
SNIP
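
Another quick smoke test is the classic wordcount example. A minimal sketch, assuming you have some plain-text file at /tmp/sample.txt (the filename is just an example):

hduser@wheezy:/opt/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/sample.txt /user/hduser/input
hduser@wheezy:/opt/hadoop$ bin/hadoop jar hadoop-examples-1.0.4.jar wordcount /user/hduser/input /user/hduser/output
hduser@wheezy:/opt/hadoop$ bin/hadoop dfs -cat /user/hduser/output/part-r-00000

The last command prints every word with its count.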

But chances are it doesn’t work right away. I had to double-check my file permissions, because I once ran Hadoop as root, which made root the owner of the log directory, so hduser was not allowed to write to it.

So check for errors like:

10/01/18 10:52:48 WARN mapred.JobClient: Error reading task outputhttp://wheezy:50060/tasklog?plaintext=true&taskid=attempt_201001181020_0002_m_000014_0&filter=stdout
10/01/18 10:52:48 WARN mapred.JobClient: Error reading task outputhttp://wheezy:50060/tasklog?plaintext=true&taskid=attempt_201001181020_0002_m_000014_0&filter=stderr

And take ownership of /opt/hadoop/logs:

chown -R hduser:hadoop /opt/hadoop/logs
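
After fixing the ownership, restart Hadoop and try the job again:

cd /opt/hadoop
bin/stop-all.sh
bin/start-all.sh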