1. Download and Install JVM
sudo apt install openjdk-17-jdk
java -version
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=$CLASSPATH:$JAVA_HOME/bin/ext:$JAVA_HOME/lib/tools.jar
(Note: tools.jar and the ext mechanism were removed in JDK 9+, so the CLASSPATH line can be omitted on JDK 17.)
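To double-check the actual install path before setting JAVA_HOME, a quick sketch (assuming Ubuntu/Debian, where java is managed by update-alternatives):
readlink -f "$(which java)"   # e.g. /usr/lib/jvm/java-17-openjdk-amd64/bin/java
java -version                 # should report OpenJDK 17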
2. Download and Install Spark
Download Spark:
https://spark.apache.org/downloads.html
1. Choose a Spark release
2. Choose a package type
3. Click the download link
-> e.g., https://www.apache.org/dyn/closer.lua/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
4. You can see the HTTP link for downloading Spark:
We suggest the following location for your download: https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
Alternate download locations are suggested below. It is essential that you verify the integrity of the downloaded file using the PGP signature.
5. Copy the link
6. Get the file using wget
-> e.g., wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
7. Extract the archive using tar xvf spark-3.x.x-bin-hadoop3.tgz
8. Move the files into /opt/spark/ (use sudo if /opt is not writable by your user)
mv spark-3.x.x-bin-hadoop3/ /opt/spark
9. Add path to .bashrc
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
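To pick up the new variables and confirm the installation, something like the following should work (assuming the exports above were added to ~/.bashrc):
source ~/.bashrc
echo $SPARK_HOME        # should print /opt/spark
spark-shell --version   # prints the Spark version banner and exits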
3. Execute Spark
Execute Spark with Script
- pyspark: for Python
- sparkR: for R
- spark-shell: for Scala
- spark-sql: for SparkSQL
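For example, the interactive Python shell can be started against a local master (a minimal sketch; local[2] simply means two local worker threads, so adjust the master URL for a real cluster):
pyspark --master "local[2]"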
Submit Spark Application
spark-submit
Using this command, you can send a JAR, Python script, or R script file to Spark and execute it.
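For example, the SparkPi example bundled with the distribution can be submitted like this (the exact jar name under examples/jars depends on the Spark/Scala build, hence the wildcard):
spark-submit --master "local[2]" --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_*.jar 10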
4. Run a job on Spark
WebUI
Spark provides a WebUI by default. When running standalone, you can check it by connecting to
localhost:4040
When Spark is launched with one of the scripts above, the console prints a line like the following after the Welcome to Spark (logo) banner:
Spark context Web UI available at http://xxxxx:4040
When multiple applications are running, the port number is said to increase from 4040 upward (4041, 4042, ...).
History Server
1. Set the Spark conf file (conf/spark-defaults.conf)
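If conf/spark-defaults.conf does not exist yet, it can be created from the template shipped with Spark and then edited:
cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf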
spark.eventLog.enabled            true
spark.eventLog.dir                file:/tmp/spark-events
spark.history.fs.logDirectory     file:/tmp/spark-events
spark.eventLog.dir
– Local: file:/tmp/spark-events
-> or use “file:/opt/spark-events”
– HDFS: hdfs://namenode/shared/spark-logs
Make sure the path set in spark.eventLog.dir already exists.
If it does not, an error like the following occurs:
failed to launch: nice -n 0 /opt/spark/sbin/spark-class org.apache.spark.deploy.history.HistoryServer
  at org.apache.spark.deploy.history.FsHistoryProvider.start(FsHistoryProvider.scala:388)
  at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:311)
  at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
  at org.apache.spark.deploy.history.FsHistoryProvider.startPolling(FsHistoryProvider.scala:260)
  ... 4 more
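To avoid this, create the event log directory before starting the History Server (matching the file:/tmp/spark-events setting above):
mkdir -p /tmp/spark-events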
2. Start History Server
./sbin/start-history-server.sh
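By default the History Server UI listens on port 18080, so it can be opened at localhost:18080; it can be stopped with ./sbin/stop-history-server.sh.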