[Mahout] Running the K-Means Example

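This post follows the book 『빅데이터 처리와 분석을 위한 하둡 맵리듀스 프로그래밍』 (Hadoop MapReduce Programming for Big Data Processing and Analysis).
The steps below assume that Hadoop and Mahout are already installed.

It covers running the K-Means example once Mahout is installed.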

Download the data to cluster with K-Means from the following URL:

http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
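Before uploading the file, it can be worth checking that the download is intact. The following is a minimal sketch of such a check, not part of Mahout or Hadoop. It assumes the file is plain text with one time series per line and whitespace-separated numeric values; the helper names `load_series` and `check_rectangular` are hypothetical.

```python
from typing import List


def load_series(text: str) -> List[List[float]]:
    """Parse whitespace-separated numeric rows; blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            rows.append([float(field) for field in fields])
    return rows


def check_rectangular(rows: List[List[float]]) -> int:
    """Return the common row length, or raise if rows differ in length."""
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent row lengths: {sorted(lengths)}")
    return lengths.pop()


# Tiny inline sample standing in for the real file contents (illustrative only).
sample = "28.78 34.46 31.33\n24.89 25.74 27.55\n"
rows = load_series(sample)
print(len(rows), "rows of", check_rectangular(rows), "values")
```

To check the real download, read the file from wherever it was saved (for example `/tmp/synthetic_control.data`, the local path used with `hadoop fs -put` in this post) and pass its contents to `load_series`.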

Next, create a testdata directory in the Hadoop file system (HDFS) and copy the data file into it.

hadoop@hdmaster:~$ hadoop fs -mkdir /testdata
hadoop@hdmaster:~$ hadoop fs -put /tmp/synthetic_control.data /testdata/synthetic_control.data

The following is a summary of the program's usage, which you can print with the -h or --help option.

Usage:                                                                          
 [--input <input> --output <output> --distanceMeasure <distanceMeasure>         
--numClusters <k> --t1 <t1> --t2 <t2> --convergenceDelta <convergenceDelta>     
--maxIter <maxIter> --overwrite --help --tempDir <tempDir> --startPhase         
<startPhase> --endPhase <endPhase>]                                             
Job-Specific Options:                                                           
  --input (-i) input                           Path to job input directory.     
  --output (-o) output                         The directory pathname for       
                                               output.                          
  --distanceMeasure (-dm) distanceMeasure      The classname of the             
                                               DistanceMeasure. Default is      
                                               SquaredEuclidean                 
  --numClusters (-k) k                         The number of clusters to create 
  --t1 (-t1) t1                                T1 threshold value               
  --t2 (-t2) t2                                T2 threshold value               
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value.     
                                               Default is 0.5                   
  --maxIter (-x) maxIter                       The maximum number of            
                                               iterations.                      
  --overwrite (-ow)                            If present, overwrite the output 
                                               directory before running job     
  --help (-h)                                  Print out help                   
  --tempDir tempDir                            Intermediate output directory    
  --startPhase startPhase                      First phase to run               
  --endPhase endPhase                          Last phase to run
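To make the --numClusters, --maxIter, --convergenceDelta and --distanceMeasure options more concrete, here is a minimal in-memory sketch of K-Means. It is an illustration only, not Mahout's MapReduce implementation. The function names, the random choice of initial centers, and the stopping rule (stop once no center has moved by more than the delta, measured with the same distance) are assumptions made for this example.

```python
import random
from typing import List, Sequence, Tuple


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance, the default measure named in the usage text."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Sequence[float], centers: List[List[float]]) -> int:
    """Index of the center closest to `point` (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda c: squared_euclidean(point, centers[c]))


def kmeans(
    points: Sequence[Sequence[float]],
    k: int,
    max_iter: int,
    convergence_delta: float = 0.5,
    seed: int = 0,
) -> Tuple[List[List[float]], List[int]]:
    """Return (centers, labels) after at most `max_iter` iterations."""
    rng = random.Random(seed)
    # Assumption: start from k distinct input points chosen at random.
    centers = [list(p) for p in rng.sample(list(points), k)]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        moved = max(squared_euclidean(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if moved <= convergence_delta:
            break
    # Label the points against the final centers.
    return centers, [nearest(p, centers) for p in points]
```

With rows parsed from the data file, a call such as `kmeans(rows, k=5, max_iter=20)` would mirror the `-k 5 -x 20` settings used in this post, although the resulting clusters need not match Mahout's.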

Run the program with options such as the following.

hadoop@hdmaster:~/mahout$ bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job --input /testdata/synthetic_control.data --output /testdata/output.data -t1 20 -t2 50 -k 5 -x 20 -ow

Running it produces output like the following.

Running on hadoop, using /home/hadoop/hadoop/bin/hadoop and HADOOP_CONF_DIR=

MAHOUT-JOB: /home/hadoop/mahout/mahout-examples-0.7-job.jar

13/06/28 01:03:08 WARN driver.MahoutDriver: No org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.props found on classpath, will use command-line arguments only

13/06/28 01:03:08 INFO kmeans.Job: Running with only user-supplied arguments

13/06/28 01:03:08 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0.5], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/testdata/synthetic_control.data], --maxIter=[20], --numClusters=[5], --output=[/testdata/output.data], --overwrite=null, --startPhase=[0], --t1=[20], --t2=[50], --tempDir=[temp]}

13/06/28 01:03:09 INFO common.HadoopUtil: Deleting /testdata/output.data

13/06/28 01:03:09 INFO kmeans.Job: Preparing Input

13/06/28 01:03:09 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

13/06/28 01:03:10 INFO input.FileInputFormat: Total input paths to process : 1

13/06/28 01:03:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library

13/06/28 01:03:10 WARN snappy.LoadSnappy: Snappy native library not loaded

13/06/28 01:03:10 INFO mapred.JobClient: Running job: job_201306272314_0031

13/06/28 01:03:11 INFO mapred.JobClient:  map 0% reduce 0%



(... omitted ...)

    1.0: [24.414, 33.038, 34.179, 29.754, 28.521, 26.006, 26.926, 29.606, 30.506, 27.485, 24.180, 32.983, 34.323, 24.235, 27.686, 34.259, 27.943, 24.728, 28.680, 27.841, 33.559, 24.370, 29.469, 35.175, 28.066, 20.058, 13.843, 21.060, 15.618, 19.504, 14.817, 20.791, 18.734, 20.550, 12.769, 11.089, 12.766, 18.481, 9.255, 19.045, 19.711, 16.527, 9.536, 14.096, 13.148, 9.097, 14.224, 12.818, 14.469, 13.448, 10.000, 18.012, 12.411, 17.807, 18.465, 12.053, 20.603, 20.798, 15.460, 13.911]

    1.0: [34.335, 30.938, 31.953, 31.146, 24.519, 24.393, 27.696, 29.874, 26.767, 33.089, 31.371, 26.233, 26.383, 35.661, 32.663, 27.685, 29.277, 31.761, 34.650, 24.940, 33.434, 26.849, 28.714, 26.581, 34.825, 34.026, 8.823, 12.634, 12.694, 6.279, 13.644, 16.651, 18.078, 7.975, 9.274, 9.208, 12.879, 12.729, 6.976, 17.832, 13.330, 6.326, 12.131, 11.842, 16.716, 10.425, 9.445, 14.400, 15.696, 11.028, 10.608, 15.190, 9.076, 17.909, 9.846, 15.013, 13.913, 11.743, 11.699, 10.152]

13/06/28 01:08:06 INFO clustering.ClusterDumper: Wrote 10 clusters

13/06/28 01:08:06 INFO driver.MahoutDriver: Program took 297404 ms (Minutes: 4.956733333333333)
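The indented lines of the form `1.0: [ ... ]` in the output each list one clustered point. If the console output is saved to a file, those lines can be pulled back into numbers for further inspection. The sketch below assumes the leading number is the point's weight and that the vector is printed as a dense, comma-separated list, as in the lines shown here; `parse_point_line` is a hypothetical helper, not a Mahout API.

```python
import re
from typing import List, Optional, Tuple

# Matches lines such as "    1.0: [24.414, 33.038, ...]".
POINT_LINE = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?):\s*\[(.*)\]\s*$")


def parse_point_line(line: str) -> Optional[Tuple[float, List[float]]]:
    """Return (weight, values) for a point line, or None for any other line."""
    match = POINT_LINE.match(line)
    if match is None:
        return None
    weight = float(match.group(1))
    values = [float(v) for v in match.group(2).split(",") if v.strip()]
    return weight, values


# Shortened example line (the real lines carry many more values).
print(parse_point_line("    1.0: [24.414, 33.038, 34.179]"))
```

Timestamped log lines such as the `ClusterDumper` and `MahoutDriver` messages do not match the pattern and yield `None`, so the whole output can be fed through the function line by line.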

After the run, the Hadoop file system looks like this.

hadoop@hdmaster:~$ hadoop fs -ls /testdata
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2013-06-28 01:07 /testdata/output.data
-rw-r--r--   2 hadoop supergroup     288374 2013-06-27 23:21 /testdata/synthetic_control.data

References

Srinath Perera and one co-author, 『빅데이터 처리와 분석을 위한 하둡 맵리듀스 프로그래밍』, translated by 안건국 and one other, Acorn (에이콘), 2013.
