How to Write a Simple MapReduce Program for Hadoop Using Python

How to write a MapReduce program with Eclipse

Steps for writing a MapReduce program in Eclipse:
1. Install the Hadoop plugin for Eclipse. Note: the plugin version must match your Hadoop version.
Download hadoop-eclipse-plugin-2.5.2.jar:
http://download.csdn.net/detail/tondayong1981/8186269
Put hadoop-eclipse-plugin-2.5.2.jar into ECLIPSE_HOME/plugins and restart Eclipse.
2. Create a new Hadoop project in Eclipse:
File > New > Other > Map/Reduce Project > Next > enter a project name > Finish

Switch to the Map/Reduce perspective in the upper-right corner.
You can now see the Map/Reduce Locations view; click the purple elephant icon on the right to add a location.

Because we are running on YARN, the Map/Reduce(V2) Master entry does not need to be configured.
The IP and port for DFS Master are the value of fs.defaultFS in hadoop/etc/hadoop/core-site.xml.
You can now browse the remote Hadoop HDFS.

3. Run Map/Reduce in Eclipse, using Hadoop's bundled WordCount as the example.
First download the Hadoop source code:
http://mirrors.cnnic.cn/apache/hadoop/common/hadoop-2.5.2/
Find WordCount.java under hadoop-2.5.2-src/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples.
Copy WordCount.java into the src folder of the myhadoop project you just created.

Create the data source file word.txt.

Create an input directory on HDFS, then upload word.txt to it:
$hadoop fs -mkdir /tmp/input
$hadoop fs -copyFromLocal /home/hadoop/word.txt /tmp/input/word.txt
Back in Eclipse, refresh DFS Locations and you will see the uploaded file.

Run WordCount.java:
(1) In the new Hadoop project, right-click WordCount.java --> Run As --> Run Configurations.
(2) In the Run Configurations dialog, right-click Java Application --> New; this creates an application named WordCount.
(3) Configure the run parameters: on the Arguments tab, enter in Program arguments the input path to pass to the program and the folder where it should save its results, for example:
hdfs://10.6.9.226:9000/tmp/input/word.txt hdfs://10.6.9.226:9000/tmp/output
Click Run; when the job finishes, refresh DFS Locations and you will see the output folder.

In this example, I will show you how to write a simple MapReduce program for Hadoop in Python.
Although the Hadoop framework is written in Java, Hadoop programs do not have to be written in Java; they can also be developed in languages such as C++ or Python. The example on the Hadoop website is written in Jython and packaged into a jar file, which is clearly inconvenient, but it does not have to be done that way: we can write the program in plain Python and hook it up to Hadoop. Take a look at the example at /src/examples/python/WordCount.py and you will see what I mean.

What do we want to do?

We will write a simple MapReduce program in CPython, rather than a Jython program packaged into a jar file.
Our example will mimic WordCount, implemented in Python: it reads text files and counts how often each word occurs. The result is also written as text, one line per word, containing the word and its count separated by a tab character.

Prerequisites

Before writing this program you need to have a Hadoop cluster up and running, so you don't get stuck later on. If you haven't set one up yet, the following short tutorials show how to build one on Ubuntu Linux (they apply equally to other Linux/Unix distributions):

How to set up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux

How to set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux


The MapReduce code in Python

The trick to writing MapReduce code in Python is to use Hadoop Streaming, which passes data between the Map and Reduce phases through STDIN (standard input) and STDOUT (standard output). We simply read the input from Python's sys.stdin and write our output to sys.stdout; Hadoop Streaming takes care of everything else. It really does, believe it or not!


Map: mapper.py

Save the following code in /home/hadoop/mapper.py. It reads data from STDIN, splits each line into words, and emits one line per word mapping the word to an occurrence count:
Note: make sure this script is executable (chmod +x /home/hadoop/mapper.py).

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

This script does not compute the total number of times each word occurs; it immediately emits "<word> 1", even though <word> may appear many times in the input. Summing the counts is left to the later Reduce step (or program). Of course you can change the coding style; that is entirely up to you.
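As an optional variation (not part of the original tutorial), the mapper can pre-aggregate counts before emitting them, a so-called in-mapper combiner: each mapper then emits one partially summed "<word> count" pair per distinct word instead of a "<word> 1" pair per occurrence. A minimal sketch, assuming a hypothetical file name mapper_combining.py; it uses sys.stdout.write so it runs under both Python 2 and 3:

#!/usr/bin/env python
# mapper_combining.py -- hypothetical in-mapper-combining variant of mapper.py
import sys

# per-mapper word counts, summed before anything is emitted
counts = {}

for line in sys.stdin:
    for word in line.strip().split():
        counts[word] = counts.get(word, 0) + 1

# emit one partially aggregated "<word>\t<count>" pair per distinct word;
# the reducer still sums these partial counts across all mappers
for word, count in counts.items():
    sys.stdout.write('%s\t%d\n' % (word, count))

The reducer described next does not need to change, because it simply sums whatever integer counts it receives.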


Reduce: reducer.py

Save the following code in /home/hadoop/reducer.py. This script reads the results of mapper.py from STDIN, sums up how many times each word occurred, and writes the totals to STDOUT.
Again, make sure the script is executable: chmod +x /home/hadoop/reducer.py

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
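Because Hadoop Streaming (and the sort step in the manual test below) hands the reducer its input sorted by key, an alternative reducer can sum one run of identical words at a time instead of keeping a dictionary of every word in memory. This is only a sketch of that variant, not part of the original tutorial; the file name reducer_groupby.py is hypothetical and the script only works on key-sorted input:

#!/usr/bin/env python
# reducer_groupby.py -- hypothetical streaming reducer that relies on the
# input already being sorted by word (as Hadoop's shuffle/sort guarantees)
from itertools import groupby
from operator import itemgetter
import sys

def parse(lines):
    # each input line is "<word>\t<count>"
    for line in lines:
        word, _, count = line.strip().partition('\t')
        if count.isdigit():
            yield word, int(count)

# group consecutive lines that share the same word and sum their counts
for word, group in groupby(parse(sys.stdin), key=itemgetter(0)):
    total = sum(count for _, count in group)
    sys.stdout.write('%s\t%d\n' % (word, total))

Either version can be installed as /home/hadoop/reducer.py; the rest of the tutorial stays the same.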
Test your code (cat data | map | sort | reduce)


I recommend testing your mapper.py and reducer.py scripts by hand before running them as a MapReduce job, otherwise the job may finish without producing any results.
Here are some suggestions on how to test the functionality of your Map and Reduce scripts (a pure-Python version of this pipeline is sketched after the sample session below):
\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014
# very basic test
hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014
hadoop@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort | /home/hadoop/reducer.py
bar 1
foo 3
labs 1
\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014

# using one of the ebooks as example input
# (see below on where to get the ebooks)
hadoop@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hadoop/mapper.py
The 1
Project 1
Gutenberg 1
EBook 1
of 1
[...]
(you get the idea)

quux 2

quux 1

\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014\u2014
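If you prefer to drive the same check from Python instead of the shell, here is a minimal sketch that reproduces the cat data | map | sort | reduce pipeline with the subprocess module. The file name pipeline_test.py is hypothetical and not part of the original tutorial; it assumes the script paths used above:

#!/usr/bin/env python
# pipeline_test.py -- run the sample text through mapper.py, an in-Python
# sort standing in for Hadoop's shuffle phase, and then reducer.py
import subprocess

def run(cmd, text):
    # feed `text` to `cmd` on stdin and return whatever it writes to stdout
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, universal_newlines=True)
    out, _ = proc.communicate(text)
    return out

sample = "foo foo quux labs foo bar quux\n"

mapped = run(["/home/hadoop/mapper.py"], sample)
# emulate the shuffle/sort phase with a plain lexicographic sort
shuffled = "".join(sorted(mapped.splitlines(True)))
print(run(["/home/hadoop/reducer.py"], shuffled))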

Running the Python scripts on the Hadoop platform

For this example we will need three ebooks:


The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
The Notebooks of Leonardo Da Vinci
Ulysses by James Joyce
Download them, store the uncompressed files in us-ascii encoding, and save them in a temporary directory such as /tmp/gutenberg.

hadoop@ubuntu:~$ ls -l /tmp/gutenberg/
total 3592
-rw-r--r-- 1 hadoop hadoop 674425 2007-01-22 12:56 20417-8.txt
-rw-r--r-- 1 hadoop hadoop 1423808 2006-08-03 16:36 7ldvc10.txt
-rw-r--r-- 1 hadoop hadoop 1561677 2004-11-26 09:48 ulyss12.txt
hadoop@ubuntu:~$

Copying the local data to HDFS

Before we can run the MapReduce job, we need to copy the local files into HDFS:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
/user/hadoop/gutenberg <dir>
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg
Found 3 items
/user/hadoop/gutenberg/20417-8.txt <r 1> 674425
/user/hadoop/gutenberg/7ldvc10.txt <r 1> 1423808
/user/hadoop/gutenberg/ulyss12.txt <r 1> 1561677

Running the MapReduce job

Now that everything is ready, we can run our Python MapReduce job on the Hadoop cluster. As I said above, we rely on
Hadoop Streaming to pass the data between the Map and Reduce phases via STDIN and STDOUT.


hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
-mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py -input gutenberg/*
-output gutenberg-output
If you want to change some Hadoop settings for the run, such as increasing the number of Reduce tasks, you can use the "-jobconf" option:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
-jobconf mapred.reduce.tasks=16 -mapper ...

One important note: Hadoop does not honor mapred.map.tasks.
This job will read all the files in the HDFS directory gutenberg, process them, and store the results in separate part files in the HDFS directory
gutenberg-output.
The output of a previous run looked like this:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar
-mapper /home/hadoop/mapper.py -reducer /home/hadoop/reducer.py -input gutenberg/*
-output gutenberg-output

additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/usr/local/hadoop-datastore/hadoop-hadoop/hadoop-unjar54543/]
[] /tmp/streamjob54544.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process : 7
[...] INFO streaming.StreamJob: getLocalDirs(): [/usr/local/hadoop-datastore/hadoop-hadoop/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_200803031615_0021
[...]
[...] INFO streaming.StreamJob: map 0% reduce 0%
[...] INFO streaming.StreamJob: map 43% reduce 0%
[...] INFO streaming.StreamJob: map 86% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 33%
[...] INFO streaming.StreamJob: map 100% reduce 70%
[...] INFO streaming.StreamJob: map 100% reduce 77%
[...] INFO streaming.StreamJob: map 100% reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_200803031615_0021


[...] INFO streaming.StreamJob: Output: gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$


As you can see from the output above, Hadoop also provides a basic web interface for job statistics and status information.
While the Hadoop cluster is running the job, you can point a browser at http://localhost:50030/ to watch it.




Check whether the result was written to the HDFS directory gutenberg-output:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg-output
Found 1 items
/user/hadoop/gutenberg-output/part-00000 903193 2007-09-21 13:00
hadoop@ubuntu:/usr/local/hadoop$

You can inspect the contents of the output file with the dfs -cat command:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat gutenberg-output/part-00000
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 2
"A_ 1
"Absoluti 1
[...]
hadoop@ubuntu:/usr/local/hadoop$

Note that in this output the quote characters (") enclosing some of the words were not inserted by Hadoop; they come from the source text itself.
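If you would rather not count punctuation-laden tokens like those above, an optional tweak (not part of the original tutorial) is to have the mapper extract only word-like runs of characters before emitting them. A minimal sketch, with a hypothetical file name mapper_clean.py and a deliberately crude word pattern:

#!/usr/bin/env python
# mapper_clean.py -- hypothetical mapper variant that strips punctuation
# (e.g. the quote characters above) and lowercases words before counting
import re
import sys

WORD = re.compile(r"[A-Za-z0-9']+")  # crude notion of a "word"; adjust as needed

for line in sys.stdin:
    for word in WORD.findall(line):
        sys.stdout.write('%s\t1\n' % word.lower())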


Reprinted for reference only; copyright belongs to the original author.


http://www.cnblogs.com/kaituorensheng/p/3826114.html
