Apache Spark Configuration Guide: Optimize Spark Shell for Performance (2025)

Introduction

Apache Spark provides multiple ways to configure its system to optimize performance, logging, and resource management. This guide covers the three primary configuration methods:

  • Spark Properties – Control application parameters via SparkConf or system properties.
  • Environment Variables – Set per-machine settings (e.g., IP addresses) via conf/spark-env.sh.
  • Logging – Configure logging behavior using log4j2.properties.
[Figure: Apache Spark UI Environment tab showing active configurations]

Spark Properties

Spark properties are application-specific and can be set using a SparkConf object or Java system properties.

Basic Configuration Example

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")
val sc = new SparkContext(conf)

  • local[2]: Runs Spark locally with two threads for parallelism.
  • Time/Size Units: Use suffixes like ms, s, mb, gb for durations and byte sizes (see the sketch below).
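A minimal sketch of suffixed values; the property names are real Spark settings, while the specific values are only illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")    // byte-size suffix: 4 gibibytes
  .set("spark.network.timeout", "120s")  // duration suffix: 120 seconds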

Dynamic Loading

Avoid hardcoding configurations by passing them at runtime:

./bin/spark-submit --name "MyApp" --master local[4] --conf spark.eventLog.enabled=false  
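When configurations are supplied this way, the application can construct an empty SparkConf, which then picks up whatever spark-submit passed in at launch:

import org.apache.spark.{SparkConf, SparkContext}

// An empty SparkConf receives the --master, --name, and --conf values
// injected by spark-submit, so nothing is hardcoded in the application.
val sc = new SparkContext(new SparkConf())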

Configuration Precedence

  1. SparkConf settings (highest priority)
  2. Command-line --conf flags
  3. spark-defaults.conf file
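The order can be verified from inside the application. In this sketch (the app name is hypothetical), the SparkConf value wins even if spark-submit passed --conf spark.eventLog.enabled=false:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("PrecedenceDemo")
  .set("spark.eventLog.enabled", "true")
val sc = new SparkContext(conf)
// Prints "true": the SparkConf setting overrides a --conf flag
// and anything in spark-defaults.conf.
println(sc.getConf.get("spark.eventLog.enabled"))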

Environment Variables

Configure node-specific settings (e.g., IP, ports) via conf/spark-env.sh. Key variables:

Variable          Purpose
JAVA_HOME         Java installation path.
SPARK_LOCAL_IP    Binds Spark to a specific IP address.
SPARK_PUBLIC_DNS  Hostname advertised to the cluster.
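An illustrative conf/spark-env.sh excerpt; the paths and addresses below are placeholders, not recommendations:

# conf/spark-env.sh -- sourced when Spark scripts and daemons start
export JAVA_HOME=/usr/lib/jvm/java-11        # placeholder path
export SPARK_LOCAL_IP=10.0.0.5               # placeholder address
export SPARK_PUBLIC_DNS=spark.example.com    # placeholder hostname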

Logging Configuration

Customize logging via log4j2.properties:

Copy the Template

cp conf/log4j2.properties.template conf/log4j2.properties  

Adjust log levels (e.g., INFO, ERROR) and appenders for better debugging and monitoring.
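For instance, to quiet routine INFO chatter in the shell, a small edit to the copied file is enough (the template ships with the root logger at info):

# conf/log4j2.properties -- raise the root log level to warn
rootLogger.level = warn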

Key Configuration Parameters

Application Properties

Property               Default  Description
spark.app.name         (none)   Application name (visible in UI/logs).
spark.driver.memory    1g       Memory for the driver process.
spark.executor.memory  1g       Memory per executor.
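A sketch of setting these programmatically. Note that in client mode spark.driver.memory only takes effect if set before the driver JVM starts (e.g., via spark-submit or spark-defaults.conf), not through SparkConf:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "2g")  // per-executor heap; value is illustrative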

Execution Behavior

Property                      Default  Description
spark.default.parallelism     Varies   Default number of partitions.
spark.sql.shuffle.partitions  200      Partitions for shuffles in SQL.
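Unlike most properties, spark.sql.shuffle.partitions can be changed at runtime. This sketch assumes a SparkSession named spark is in scope, as it is in spark-shell:

// Lower the shuffle partition count for a small dataset; 100 is illustrative.
spark.conf.set("spark.sql.shuffle.partitions", "100")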

Dynamic Allocation

Property                              Default  Description
spark.dynamicAllocation.enabled       false    Enables dynamic executor scaling.
spark.dynamicAllocation.minExecutors  0        Minimum executors to retain.
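An illustrative combination; dynamic allocation also requires either an external shuffle service or shuffle tracking, enabled here:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")  // illustrative value
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")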

Viewing Configurations

Check active settings in the Spark UI under the Environment tab at http://<driver>:4040. Only explicitly set values appear there; unset properties fall back to their defaults.
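The same information is available programmatically, assuming an active SparkContext named sc:

// Dump every explicitly set configuration key/value pair.
sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }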

Conclusion

Properly configuring Spark helps ensure efficient resource utilization and predictable performance. Use:

  • SparkConf for application-specific settings.
  • spark-env.sh for cluster-wide machine configurations.
  • log4j2.properties for fine-grained logging control.

For advanced tuning, refer to the Spark documentation.
