This homework will make you more familiar with the Hadoop Distributed File System (HDFS) and the Hadoop mapreduce framework. You will be provided a disk image with everything pre-installed, so hopefully the setup phase for this homework will go smoothly.

If you have an hour to spare, I suggest that you watch this lecture from UC Berkeley. It will give you a great overview of what mapreduce is and what it is good at.

Overview

In this homework you will “do big data” on some weather data from the 18th century. The data is given as many relatively small files (one for each year), so the tech-savvy among you could easily write a small Python program that does what we ask (perhaps a nice way to double-check your results before you hand them in ;). But since that is not very “big data”, we will instead use Hadoop mapreduce.

The input data will be one .csv file for each year in the period 1763 to 1799. Each of these files contains one row per measurement. To make things a little tricky, there was only one weather station recording data in 1763, but at least two weather stations in 1799. Furthermore, each station records at least two measurements per day: one for the maximum temperature (TMAX) and one for the minimum temperature (TMIN), and sometimes one for the precipitation (PRCP).

The content of 1789.csv looks similar to this:

ITE00100554,17890101,TMAX,-63,,,E,
ITE00100554,17890101,TMIN,-90,,,E,
GM000010962,17890101,PRCP,4,,,E,
EZE00100082,17890101,TMAX,-103,,,E,
EZE00100082,17890101,TMIN,-184,,,E,
ITE00100554,17890102,TMAX,-16,,,E,
ITE00100554,17890102,TMIN,-66,,,E,
GM000010962,17890102,PRCP,15,,,E,
EZE00100082,17890102,TMAX,-98,,,E,
EZE00100082,17890102,TMIN,-170,,,E,
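The columns are: station ID, date (YYYYMMDD), element (TMAX/TMIN/PRCP), and the value; the station IDs and layout appear to follow the GHCN-Daily convention, in which temperatures are stored in tenths of a degree Celsius (so -63 means -6.3 °C). A minimal Python sketch of parsing one such row (the function and field names are my own, not part of the assignment):

```python
def parse_row(line):
    """Split one CSV row into the fields we care about.

    Field names are assumptions based on the sample data:
    station id, date (YYYYMMDD), element (TMAX/TMIN/PRCP), value.
    """
    fields = line.strip().split(",")
    return fields[0], fields[1], fields[2], int(fields[3])

# Example with the first row from the sample above:
row = parse_row("ITE00100554,17890101,TMAX,-63,,,E,")
# row == ("ITE00100554", "17890101", "TMAX", -63)
```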

The Goal

The goal of this assignment is to create a sorted list of the most common temperatures (with each temperature rounded to the nearest integer) in the period 1763 to 1799. The list should be sorted from the least common temperature to the most common temperature, and it should state how many times each temperature occurred.
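As hinted in the overview, the whole computation is small enough to double-check in plain Python before you trust your Hadoop output. A sketch, assuming rows look like the sample above and that values are tenths of a degree (the GHCN-Daily convention):

```python
from collections import Counter

def temperature_counts(rows):
    """Count how often each rounded temperature occurs in TMAX/TMIN rows.

    Assumes values are tenths of a degree Celsius.
    Returns (temperature, count) pairs sorted from least to most common.
    """
    counts = Counter()
    for line in rows:
        fields = line.split(",")
        if fields[2] in ("TMAX", "TMIN"):
            counts[round(int(fields[3]) / 10)] += 1
    return sorted(counts.items(), key=lambda kv: kv[1])

rows = [
    "ITE00100554,17890101,TMAX,-63,,,E,",
    "ITE00100554,17890101,TMIN,-90,,,E,",
    "GM000010962,17890101,PRCP,4,,,E,",
    "EZE00100082,17890101,TMAX,-103,,,E,",
]
result = temperature_counts(rows)  # PRCP rows are ignored
```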

You solve this very important problem in two steps: Task 1 and Task 2.

Outline

The outline of the homework is as follows:

  1. Download and install the virtual disk image (homework5.ova)
  2. Intro to the Hadoop File System (i.e. how to add/remove directories and upload/download files)
  3. Intro to Hadoop Mapreduce (i.e. how to compile, run, and check the log files of a mapreduce program)
  4. Task 1 - Count how many times each temperature occurs
  5. Task 2 - Sort the temperatures to see which is least/most common

Deadline and Deliverables

The deadline for this homework is May 12th. You will hand in your source code and all the result files.

Installing the necessary software

For this assignment everything is provided in the homework5.ova virtual disk image.

  1. Download the image

Get to know Hadoop File System (HDFS)

It’s now time to start playing around a little bit with HDFS! You will notice that it is almost exactly like navigating in a regular UNIX environment - just a zillion times slower…

First things first, you might have to start up HDFS and YARN:

Some basic HDFS commands:

hduser@ubuntu1410adminuser:~$ hdfs dfs -ls /
hduser@ubuntu1410adminuser:~$ hdfs dfs -mkdir /testing
hduser@ubuntu1410adminuser:~$ hdfs dfs -mkdir /testing2
hduser@ubuntu1410adminuser:~$ hdfs dfs -ls /
hduser@ubuntu1410adminuser:~$ hdfs dfs -ls /testing

Create a text file sample.txt and add some content to it, then continue:

hduser@ubuntu1410adminuser:~$ hdfs dfs -put sample.txt /testing/
hduser@ubuntu1410adminuser:~$ hdfs dfs -ls /testing/
hduser@ubuntu1410adminuser:~$ hdfs dfs -cat /testing/sample.txt
hduser@ubuntu1410adminuser:~$ hdfs dfs -get /testing/sample.txt downloadedSample.txt
hduser@ubuntu1410adminuser:~$ hdfs dfs -rm /testing/sample.txt
hduser@ubuntu1410adminuser:~$ hdfs dfs -ls /testing/
hduser@ubuntu1410adminuser:~$ hdfs dfs -rm -r /testing2
hduser@ubuntu1410adminuser:~$ hdfs dfs -ls /

A complete list of all the commands can be found in the Hadoop documentation!

Get to know Mapreduce - our demo program

We will try to introduce mapreduce a bit through a demo, using the same weather data as you will use in Task 1 and Task 2. If you need to brush up on what mapreduce is, I suggest that you take a look at the Wikipedia page as well as the Hadoop MapReduce Tutorial.

Overview

The goal for this task is to take all the original weather data for the period 1763 to 1799 and run your own mapreduce program on it. The output of the mapreduce job should be a file for each weather station. Within these files you should list the days and temperatures that this weather station recorded. An overview can be seen below:
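The per-station grouping the demo performs can be sketched in plain Python as a sanity check (this is not a Hadoop job; the real demo uses the Java classes described below):

```python
from collections import defaultdict

def group_by_station(rows):
    """Group (date, element, value) records under their station id,
    mimicking what the demo's mapper (key = station) and reducer
    (collect all values for one key) do together."""
    stations = defaultdict(list)
    for line in rows:
        station, date, element, value = line.split(",")[:4]
        stations[station].append((date, element, value))
    return stations

rows = [
    "ITE00100554,17890101,TMAX,-63,,,E,",
    "EZE00100082,17890101,TMAX,-103,,,E,",
    "ITE00100554,17890102,TMAX,-16,,,E,",
]
by_station = group_by_station(rows)
# by_station["ITE00100554"] holds two records, by_station["EZE00100082"] one
```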

WeatherJob.java

This is the main class of the mapreduce program. In it you specify which mapper and reducer will be used, as well as the input and output formats for the mapper and reducer.

You will not have to modify any of the WeatherJob.java-files that we’ve given you!

Documentation for the mapreduce Job class.

WeatherMapper.java

Documentation for the mapreduce Mapper class

WeatherReducer.java

The specification of the reducer is as follows:

Documentation for the mapreduce Reducer Class

Compile and run the program

Follow this guide to compile and run the mapreduce demo:

  1. Make sure that HDFS and YARN are up and running, otherwise start them
  2. Upload some new input data (1779.csv and 1780.csv):

    hduser@ubuntu1410adminuser:~$ cd weather/data
    hduser@ubuntu1410adminuser:~/weather/data$ hdfs dfs -put 1779.csv /weather/input
    hduser@ubuntu1410adminuser:~/weather/data$ hdfs dfs -put 1780.csv /weather/input
    hduser@ubuntu1410adminuser:~/weather/data$ hdfs dfs -ls /weather/input/
    
  3. Compile the mapreduce-program (make sure you’re in the correct directory)

    hduser@ubuntu1410adminuser:~/weather/data$ cd ../demo
    hduser@ubuntu1410adminuser:~/weather/demo$ hadoop com.sun.tools.javac.Main *.java
    
  4. Turn it into a runnable .jar-file (named wmr.jar):

    hduser@ubuntu1410adminuser:~/weather/demo$ jar cf wmr.jar *.class
    
  5. Run it!

  6. Look at what it prints:

  7. Download and view the results:

    hduser@ubuntu1410adminuser:~/weather/demo$ hdfs dfs -get /weather/result
    hduser@ubuntu1410adminuser:~/weather/demo$ ls result/
    

Task 1

The goal for this task is to take all the original weather data for the period 1763 to 1799 and run your own mapreduce program on it. The output of the mapreduce job should be one file for each degree of temperature that occurred in this period. Within these files you should list the days on which, and the weather stations where, this temperature was recorded. Furthermore, at the end of each file there should be a total count of how many times this temperature occurred. An overview can be seen below:
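A hedged sketch of the map/reduce logic for this task in plain Python (the actual assignment uses the Java Mapper/Reducer skeletons; the key/value choices here are one reasonable reading of the specification, and assume temperature values in tenths of a degree):

```python
from collections import defaultdict

def map_row(line):
    """Mapper sketch: emit (rounded temperature, 'station date')
    for every TMAX/TMIN row, skipping PRCP."""
    station, date, element, value = line.split(",")[:4]
    if element in ("TMAX", "TMIN"):
        yield round(int(value) / 10), f"{station} {date}"

def reduce_temps(pairs):
    """Reducer sketch: for each temperature, list all occurrences and
    append the total, mirroring 'one file per degree' with a count at
    the end of each file."""
    grouped = defaultdict(list)
    for temp, occurrence in pairs:
        grouped[temp].append(occurrence)
    return {temp: occs + [f"total: {len(occs)}"] for temp, occs in grouped.items()}

pairs = [p for line in [
    "ITE00100554,17890101,TMAX,-63,,,E,",   # -6.3 rounds to -6
    "ITE00100554,17890102,TMAX,-57,,,E,",   # -5.7 rounds to -6
] for p in map_row(line)]
result = reduce_temps(pairs)
# result[-6] lists both occurrences followed by "total: 2"
```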

Mapper - specification

The specification of the mapper is as follows:

Reducer - specification

The specification of the reducer is as follows:

How to run it?

Example of how you can compile and run the program:

  1. Make sure that HDFS and YARN are up and running, otherwise start them
  2. Compile the mapreduce-program (make sure you’re in the correct directory)

    hduser@ubuntu1410adminuser:~/weather/task1$ hadoop com.sun.tools.javac.Main *.java
    
  3. Turn it into a runnable .jar-file (named task1.jar):

    hduser@ubuntu1410adminuser:~/weather/task1$ jar cf task1.jar *.class
    
  4. Run it!

Task 2

The goal for this task is to take the output from the previous task and use it as input for this task. The output of this task should be just one file, containing a sorted list that shows which temperature was most common and how many times each temperature occurred. An overview can be seen below:
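The sort itself can be sketched in plain Python (again only a sanity-check sketch; in a Hadoop job one common approach is to emit the count as the map output key, since MapReduce sorts by key between the map and reduce phases):

```python
def sort_by_count(temp_counts):
    """Given (temperature, count) pairs from Task 1, return them sorted
    from least common to most common, as the final list should be."""
    return sorted(temp_counts, key=lambda kv: kv[1])

# Hypothetical counts, just to illustrate the ordering:
counts = [(-6, 120), (0, 340), (15, 80)]
print(sort_by_count(counts))  # [(15, 80), (-6, 120), (0, 340)]
```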

Mapper - specification

The specification of the mapper is as follows:

Reducer - specification

The specification of the reducer is as follows:

How to run it?

Example of how you can compile and run the program:

  1. Make sure that HDFS and YARN are up and running, otherwise start them
  2. Compile the mapreduce-program (make sure you’re in the correct directory)

    hduser@ubuntu1410adminuser:~/weather/task2$ hadoop com.sun.tools.javac.Main *.java
    
  3. Turn it into a runnable .jar-file (named task2.jar):

    hduser@ubuntu1410adminuser:~/weather/task2$ jar cf task2.jar *.class
    
  4. Run it!