In the previous article, we practiced using the wordcount example program and learned how to operate HDFS. This next example adds a twist: we will analyze an Apache2 web server log file and count the number of accesses per hour.

The following uses Python; if you would rather use Java, see the referenced article.


Implementation analysis

First, we need to understand the format of an Apache2 web server log file. You can consult the official documentation, or look at the example below.

An Apache2 web server log entry looks like this:

64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291

It contains the source IP address, the timestamp, and the HTTP request information.

Since we only want the number of accesses per hour, the IP address and request information can be dropped, leaving just the time, like this:

2004-03-07 T 16:00:00.000

So the plan is: extract the bracketed timestamp string, zero out the minutes and seconds, then feed the result into Hadoop as a wordcount job and wait for the result.


Map-reduce example code

This was adapted from a Python example found online; if you are interested, see the referenced tutorial.

mapper.py

#!/usr/bin/env python
import sys
import datetime

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # keep only the timestamp between '[' and the timezone offset
    words = line.split('[')
    words = words[1].split(' -0800')
    # parse the timestamp, then emit the hour bucket with a count of 1
    dt = datetime.datetime.strptime(words[0], "%d/%b/%Y:%H:%M:%S")
    print(dt.strftime('%Y-%m-%d T %H:00:00.000') + "\t1")

reducer.py

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
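The if-switch in the reducer only works because the input arrives grouped by key. As a local sanity check on the arithmetic, the same aggregation can be sketched with collections.Counter, which does not care about input order. The sample keys below are made up for illustration; this is not what Hadoop streaming runs:

```python
import collections

# Hypothetical local check (not the streaming reducer): sum the counts
# with a Counter, which works regardless of input ordering.
mapper_output = [
    "2004-03-07 T 16:00:00.000\t1",
    "2004-03-07 T 16:00:00.000\t1",
    "2004-03-07 T 17:00:00.000\t1",
]
counts = collections.Counter()
for entry in mapper_output:
    word, count = entry.split('\t', 1)
    counts[word] += int(count)
for word, total in sorted(counts.items()):
    print('%s\t%s' % (word, total))
```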

Sample log file for download: access.log


Code walkthrough

Extract the bracketed timestamp string and strip off the -0800 timezone offset:

line = line.strip()
words = line.split('[')
words = words[1].split(' -0800')

This yields the following output:

07/Mar/2004:16:10:02
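One caveat: split(' -0800') hardcodes the log's timezone offset. If your log uses a different offset, a regular expression is a safer way to pull the timestamp out. This is a hypothetical variation on the mapper's parsing, not part of the original code:

```python
import re

# Hypothetical helper: capture everything between '[' and the
# timezone offset, whatever the offset happens to be.
TS_PATTERN = re.compile(r'\[([^ \]]+) [+-]\d{4}\]')

line = ('64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] '
        '"GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291')
match = TS_PATTERN.search(line)
print(match.group(1))  # 07/Mar/2004:16:10:02
```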

Then use Python's datetime module to convert the format:

dt = datetime.datetime.strptime(words[0], "%d/%b/%Y:%H:%M:%S")
print(dt.strftime('%Y-%m-%d T %H:00:00.000') + "\t1")

A few of the format directives used here, excerpted from the documentation:

Directive  Meaning                                                Example
%d         Day of the month as a zero-padded decimal number.      01, 02, ..., 31
%b         Month as locale's abbreviated name.                    Jan, Feb, ..., Dec (en_US); Jan, Feb, ..., Dez (de_DE)
%m         Month as a zero-padded decimal number.                 01, 02, ..., 12
%Y         Year with century as a decimal number.                 1970, 1988, 2001, 2013
%H         Hour (24-hour clock) as a zero-padded decimal number.  00, 01, ..., 23
%M         Minute as a zero-padded decimal number.                00, 01, ..., 59
%S         Second as a zero-padded decimal number.                00, 01, ..., 59

For map-reduce to count the occurrences, every entry within the same hour has to be normalized to an identical string.

Here the output is truncated to the hour, with the minutes and seconds zeroed out, and a 1 appended, giving:

2004-03-07 T 16:00:00.000    1

PS. Since we are only counting occurrences, printing the zeroed minutes and seconds is not strictly necessary; it is done here to match the output of the Java version found online.
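As a quick check that this normalization really collapses everything within the same hour onto one key, here are two sample timestamps (made up for illustration) run through the same strptime/strftime round trip:

```python
import datetime

# Two different timestamps inside the same hour should map to the
# same hourly bucket string.
buckets = [
    datetime.datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S")
            .strftime('%Y-%m-%d T %H:00:00.000')
    for ts in ("07/Mar/2004:16:10:02", "07/Mar/2004:16:45:59")
]
print(buckets[0])  # 2004-03-07 T 16:00:00.000
print(buckets[0] == buckets[1])  # True
```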

Before running the Hadoop map-reduce job, let's first check that the Python programs behave correctly. Note that Hadoop sorts the mapper output by key before it reaches the reducer, so the local test adds a sort in the middle to mimic that:

$ cat access.log | python mapper.py | sort -k1,1 | python reducer.py

Running the computation on Hadoop

First, start up Hadoop:

 # clear tmp, hdfs file
 $ rm -r tmp hdfs
 # format hdfs system
 $ hdfs namenode -format
 # starting Hadoop service
 $ /opt/hadoop/sbin/start-all.sh

Run the map-reduce job

I wrote a script for this, with comments included; if anything is still unclear, see the Hadoop streaming documentation.

#!/bin/bash
# Hadoop stream jar
STREAMJAR=/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar
# input file
INPUT=access.log
# input directory
INPUT_DIR=/input
# output file
OUTPUT=result.dat
# output directory
OUTPUT_DIR=/output
# mapper file
MAPPER=./mapper.py
# reducer file
REDUCER=./reducer.py
# create input directory on hdfs
hdfs dfs -mkdir $INPUT_DIR
# upload input file to input directory
hdfs dfs -put $INPUT $INPUT_DIR
# remove old output directory
hdfs dfs -rm -r -f $OUTPUT_DIR
# execute map-reduce with Hadoop stream jar
hadoop jar $STREAMJAR -files $MAPPER,$REDUCER -mapper $MAPPER -reducer $REDUCER -input $INPUT_DIR -output $OUTPUT_DIR
# download the output file from hdfs
hdfs dfs -cat $OUTPUT_DIR/part* > $OUTPUT

The output files are stored under the output directory on HDFS.
[screenshot: HDFS output directory listing]


Execution results

Open the result.dat file and you can see the final counts.
[screenshot: result.dat contents]


References

Applied Big Data Analysis in the Real World with MapReduce and Hadoop

Python datetime