PySpark filter: apply a filter to a PySpark DataFrame
Filtering a PySpark DataFrame
Updated on 03 March 2025
In PySpark, the filter transformation selects elements from an RDD (Resilient Distributed Dataset) or DataFrame that satisfy a given condition. Because it returns a new RDD or DataFrame without changing the original, filter is a transformation, much like Python's built-in filter function.
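As a minimal sketch of the RDD form (assuming an active SparkSession named spark), filter keeps only the elements for which the predicate returns True:
numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])

# filter() is lazy and returns a new RDD; `numbers` itself is unchanged.
evens = numbers.filter(lambda x: x % 2 == 0)

print(evens.collect())  # [2, 4, 6]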
We will build a filter on top of a DataFrame. Here, we want rows where the salary field is greater than 5000 and the gender field is 'M' (male).
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|Michael  |Rose      |        |40288|M     |4000  |
|Shubham  |          |Williams|42114|M     |6000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |30001|F     |2000  |
+---------+----------+--------+-----+------+------+
df.where("gender = 'M' AND salary > 5000").show()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|Shubham  |          |Williams|42114|M     |6000  |
+---------+----------+--------+-----+------+------+
To begin, import the necessary modules from PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
A SparkSession is required to work with DataFrames in PySpark. If you haven’t already started a session, create one as follows:
spark = SparkSession.builder \
.appName("PySpark Filter Example") \
.getOrCreate()
We will now create a DataFrame with sample data.
data = [
("Michael", "Rose", "", 40288, "M", 4000),
("Shubham", "", "Williams", 42114, "M", 6000),
("Maria", "Anne", "Jones", 39192, "F", 4000),
("Jen", "Mary", "Brown", 30001, "F", 2000)
]
columns = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
df = spark.createDataFrame(data, columns)
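To confirm what was created, printSchema() shows the column types Spark infers from this Python data (integers become long, strings become string):
df.printSchema()
# root
#  |-- firstname: string (nullable = true)
#  |-- middlename: string (nullable = true)
#  |-- lastname: string (nullable = true)
#  |-- id: long (nullable = true)
#  |-- gender: string (nullable = true)
#  |-- salary: long (nullable = true)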
We will filter rows where:
- the gender column has the value 'M' (male), and
- the salary column has a value greater than 5000.

filtered_df = df.filter((col("gender") == "M") & (col("salary") > 5000))
Alternatively, you can use SQL-style expressions:
filtered_df = df.where("gender = 'M' AND salary > 5000")
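If you prefer plain SQL end to end, the same predicate can also run through spark.sql after registering the DataFrame as a temporary view; the view name people below is just an illustrative choice:
# Register a temp view (the name "people" is arbitrary for this sketch).
df.createOrReplaceTempView("people")
filtered_df = spark.sql("SELECT * FROM people WHERE gender = 'M' AND salary > 5000")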
Use the show() action to display the filtered DataFrame:
filtered_df.show()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|Shubham  |          |Williams|42114|M     |6000  |
+---------+----------+--------+-----+------+------+
In this example, we applied a filter transformation to a PySpark DataFrame, extracting the rows that met the condition gender = 'M' AND salary > 5000. Since filter() is a transformation, it does not modify the original DataFrame but creates a new one containing the filtered results.
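A quick sanity check of that immutability (not part of the original example) is to count both DataFrames; count() is an action, so it triggers the actual computation:
print(df.count())           # 4 -- the original DataFrame still has all rows
print(filtered_df.count())  # 1 -- only Shubham matches the filter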
By leveraging filtering in PySpark, you can efficiently extract meaningful insights from large datasets.