Apache Spark Configuration Guide: Optimize Spark for Performance
*Figure: The Spark UI Environment tab showing active configurations.*
Apache Spark provides multiple ways to configure its system to optimize performance, logging, and resource management. This guide covers the three primary configuration methods:

- Spark properties, set per application via `SparkConf`, `spark-submit`, or `spark-defaults.conf`
- Environment variables, set per node in `conf/spark-env.sh`
- Logging, configured via `log4j2.properties`
Spark properties are application-specific and can be set using a `SparkConf` object or Java system properties:
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Run locally with 2 worker threads and name the application
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")
val sc = new SparkContext(conf)
```
- `local[2]`: Runs Spark locally with 2 threads for parallelism.
- Duration and size properties accept unit suffixes such as `ms`, `s`, `mb`, and `gb` for durations and byte sizes.

Avoid hardcoding configurations by passing them at runtime:
```bash
./bin/spark-submit --name "MyApp" --master local[4] --conf spark.eventLog.enabled=false
```
Settings passed this way can come from `--conf` flags or from the `conf/spark-defaults.conf` file. Properties set directly on `SparkConf` take the highest precedence, followed by flags passed to `spark-submit`, then values in `spark-defaults.conf`.
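As a minimal sketch of the defaults file (the values below are placeholders, not recommendations), each line is a property key and value separated by whitespace:

```properties
# conf/spark-defaults.conf -- example values only
spark.master              local[4]
spark.eventLog.enabled    false
spark.executor.memory     2g
spark.serializer          org.apache.spark.serializer.KryoSerializer
```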
Configure node-specific settings (e.g., IP, ports) via `conf/spark-env.sh`. Key variables:
| Variable | Purpose |
|---|---|
| `JAVA_HOME` | Java installation path. |
| `SPARK_LOCAL_IP` | Binds Spark to a specific IP. |
| `SPARK_PUBLIC_DNS` | Hostname advertised to the cluster. |
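For illustration, a minimal `conf/spark-env.sh` might look like the following; the Java path, IP address, and hostname are placeholders for your environment:

```bash
#!/usr/bin/env bash
# conf/spark-env.sh -- sourced on each node when Spark processes start (placeholder values)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk         # Java installation path
export SPARK_LOCAL_IP=192.168.1.10                    # bind Spark to this node's IP
export SPARK_PUBLIC_DNS=spark-node-01.example.com     # hostname advertised to the cluster
```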
Customize logging via `log4j2.properties`:

```bash
cp conf/log4j2.properties.template conf/log4j2.properties
```
Adjust log levels (e.g., `INFO`, `ERROR`) and appenders for better debugging and monitoring.
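A sketch of the kind of edits that file typically receives, assuming Spark's standard Log4j 2 properties layout (the quieted package below is only an example):

```properties
# conf/log4j2.properties (sketch)
# Raise the root level so routine INFO messages are suppressed
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = console

# Console appender and its layout
appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Keep a noisy package at ERROR only (example)
logger.jetty.name = org.sparkproject.jetty
logger.jetty.level = error
```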
| Property | Default | Description |
|---|---|---|
| `spark.app.name` | (none) | Application name (visible in UI/logs). |
| `spark.driver.memory` | 1g | Memory for the driver process. |
| `spark.executor.memory` | 1g | Memory per executor. |
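As a sketch, the application-level properties above map onto `SparkConf` calls like the following (example values only; note that in client mode `spark.driver.memory` must be supplied through `spark-submit` or `spark-defaults.conf`, since the driver JVM is already running when application code executes):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Example sizing only -- tune for your workload and cluster
val conf = new SparkConf()
  .setAppName("MyApp")                 // spark.app.name
  .set("spark.executor.memory", "4g")  // memory per executor
val sc = new SparkContext(conf)
```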
| Property | Default | Description |
|---|---|---|
| `spark.default.parallelism` | Varies | Default number of partitions. |
| `spark.sql.shuffle.partitions` | 200 | Partitions for shuffles in SQL. |
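For example (a sketch with arbitrary partition counts), the shuffle partition count can be set when the session is built and adjusted later at runtime:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParallelismDemo")                     // example application name
  .master("local[4]")
  .config("spark.sql.shuffle.partitions", "64")   // partitions used by SQL/DataFrame shuffles
  .getOrCreate()

// spark.sql.shuffle.partitions is runtime-configurable; later shuffles pick up the new value
spark.conf.set("spark.sql.shuffle.partitions", "128")
```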
| Property | Default | Description |
|---|---|---|
| `spark.dynamicAllocation.enabled` | false | Enables dynamic executor scaling. |
| `spark.dynamicAllocation.minExecutors` | 0 | Minimum executors to retain. |
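A sketch of enabling dynamic allocation at submit time; the executor bounds and the application JAR name are placeholders, and on most cluster managers either the external shuffle service or shuffle tracking (shown here) is also required:

```bash
# Placeholder bounds and JAR name -- adjust for your application
./bin/spark-submit --name "MyApp" --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  my-app.jar
```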
Check active settings in the Spark UI under the Environment tab: `http://<driver>:4040`.
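The same information is available programmatically; a small sketch using the `sc` created earlier:

```scala
// Print every explicitly set configuration key/value pair for this application
sc.getConf.getAll.sorted.foreach { case (key, value) =>
  println(s"$key = $value")
}
```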
Properly configuring Spark ensures optimal resource utilization and performance. Use:

- Spark properties (`SparkConf`, `--conf`, `spark-defaults.conf`) for application-level settings
- `conf/spark-env.sh` for per-node environment settings
- `log4j2.properties` for logging
For advanced tuning, refer to the Spark documentation.