Apache Spark Configuration Guide: Optimize Spark for Performance
*Figure: The Spark UI Environment tab showing active configurations.*
Apache Spark provides multiple ways to configure its system to optimize performance, logging, and resource management. This guide covers the three primary configuration methods:

- Spark properties, set per application via `SparkConf`, `spark-submit`, or `spark-defaults.conf`
- Environment variables, set per node in `conf/spark-env.sh`
- Logging, configured via `log4j2.properties`
Spark properties are application-specific and can be set using a `SparkConf` object or Java system properties:
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Run locally with 2 worker threads and name the application
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")
val sc = new SparkContext(conf)
```
- `local[2]`: Runs Spark locally with 2 threads for parallelism.
- Duration and size properties accept unit suffixes such as `ms`, `s`, `mb`, and `gb` for durations and byte sizes.

Avoid hardcoding configurations by passing them at runtime:
```bash
./bin/spark-submit --name "MyApp" --master local[4] --conf spark.eventLog.enabled=false
```
Settings passed this way can come from `--conf` flags or from the `conf/spark-defaults.conf` file. Properties set directly on `SparkConf` take the highest precedence, followed by flags passed to `spark-submit`, then values in `spark-defaults.conf`.
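As a minimal sketch of the defaults file (the values below are placeholders, not recommendations), each line is a property key and value separated by whitespace:

```properties
# conf/spark-defaults.conf -- example values only
spark.master              local[4]
spark.eventLog.enabled    false
spark.executor.memory     2g
spark.serializer          org.apache.spark.serializer.KryoSerializer
```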
Configure node-specific settings (e.g., IP, ports) via `conf/spark-env.sh`. Key variables:
| Variable | Purpose |
|---|---|
| `JAVA_HOME` | Java installation path. |
| `SPARK_LOCAL_IP` | Binds Spark to a specific IP. |
| `SPARK_PUBLIC_DNS` | Hostname advertised to the cluster. |
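For illustration, a minimal `conf/spark-env.sh` might look like the following; the Java path, IP address, and hostname are placeholders for your environment:

```bash
#!/usr/bin/env bash
# conf/spark-env.sh -- sourced on each node when Spark processes start (placeholder values)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk         # Java installation path
export SPARK_LOCAL_IP=192.168.1.10                    # bind Spark to this node's IP
export SPARK_PUBLIC_DNS=spark-node-01.example.com     # hostname advertised to the cluster
```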
Customize logging via `log4j2.properties`:

```bash
cp conf/log4j2.properties.template conf/log4j2.properties
```
Adjust log levels (e.g., `INFO`, `ERROR`) and appenders for better debugging and monitoring.
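A sketch of the kind of edits that file typically receives, assuming Spark's standard Log4j 2 properties layout (the quieted package below is only an example):

```properties
# conf/log4j2.properties (sketch)
# Raise the root level so routine INFO messages are suppressed
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = console

# Console appender and its layout
appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Keep a noisy package at ERROR only (example)
logger.jetty.name = org.sparkproject.jetty
logger.jetty.level = error
```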
| Property | Default | Description |
|---|---|---|
| `spark.app.name` | (none) | Application name (visible in UI/logs). |
| `spark.driver.memory` | 1g | Memory for the driver process. |
| `spark.executor.memory` | 1g | Memory per executor. |
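As a sketch, the application-level properties above map onto `SparkConf` calls like the following (example values only; note that in client mode `spark.driver.memory` must be supplied through `spark-submit` or `spark-defaults.conf`, since the driver JVM is already running when application code executes):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Example sizing only -- tune for your workload and cluster
val conf = new SparkConf()
  .setAppName("MyApp")                 // spark.app.name
  .set("spark.executor.memory", "4g")  // memory per executor
val sc = new SparkContext(conf)
```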
| Property | Default | Description |
|---|---|---|
| `spark.default.parallelism` | Varies | Default number of partitions. |
| `spark.sql.shuffle.partitions` | 200 | Partitions for shuffles in SQL. |
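For example (a sketch with arbitrary partition counts), the shuffle partition count can be set when the session is built and adjusted later at runtime:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParallelismDemo")                     // example application name
  .master("local[4]")
  .config("spark.sql.shuffle.partitions", "64")   // partitions used by SQL/DataFrame shuffles
  .getOrCreate()

// spark.sql.shuffle.partitions is runtime-configurable; later shuffles pick up the new value
spark.conf.set("spark.sql.shuffle.partitions", "128")
```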
| Property | Default | Description |
|---|---|---|
| `spark.dynamicAllocation.enabled` | false | Enables dynamic executor scaling. |
| `spark.dynamicAllocation.minExecutors` | 0 | Minimum executors to retain. |
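A sketch of enabling dynamic allocation at submit time; the executor bounds and the application JAR name are placeholders, and on most cluster managers either the external shuffle service or shuffle tracking (shown here) is also required:

```bash
# Placeholder bounds and JAR name -- adjust for your application
./bin/spark-submit --name "MyApp" --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  my-app.jar
```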
Check active settings in the Spark UI under the Environment tab: `http://<driver>:4040`.
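The same information is available programmatically; a small sketch using the `sc` created earlier:

```scala
// Print every explicitly set configuration key/value pair for this application
sc.getConf.getAll.sorted.foreach { case (key, value) =>
  println(s"$key = $value")
}
```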
Properly configuring Spark ensures optimal resource utilization and performance. Use:

- Spark properties (`SparkConf`, `--conf`, `spark-defaults.conf`) for application-level settings
- `conf/spark-env.sh` for per-node environment settings
- `log4j2.properties` for logging
For advanced tuning, refer to the Spark documentation.