What Is The Difference Between Persist() And Cache()?

12/1/2021

#sparkrdd dataframe cache persist

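When a resilient distributed dataset (RDD) is created from a text file or a collection (or from another RDD), do we need to call cache() or persist() explicitly to store the RDD data in memory? Or is the RDD data stored in memory, in a distributed way, by default?

It is not stored by default. Transformations are lazy, and unless an RDD is cached or persisted, Spark recomputes it from its lineage every time an action runs on it. The line below, for example, only records how to read the file; nothing is loaded yet: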

val textFile = sc.textFile("/tmp/user.txt")
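
One way to check whether Spark keeps an RDD around between actions is to count how often a transformation actually runs. The sketch below is only an illustration under a few assumptions: Spark is on the classpath, the job runs with a local master, and a small in-memory collection stands in for /tmp/user.txt. The names RecomputeSketch and mapCalls are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

object RecomputeSketch {
  // Runs two count() actions on a mapped RDD and returns how many times
  // the map function was invoked in total.
  def mapCalls(spark: SparkSession, useCache: Boolean): Long = {
    val sc = spark.sparkContext
    // Accumulator that is incremented once per map invocation.
    val calls = sc.longAccumulator("mapCalls")
    // Assumption: a tiny in-memory collection instead of a real text file.
    val lines = sc.parallelize(Seq("good day", "bad day", "good night"))
    val upper = lines.map { line => calls.add(1L); line.toUpperCase }
    if (useCache) upper.cache()
    upper.count() // first action: always computes the map
    upper.count() // second action: recomputes unless the RDD is cached
    val total: Long = calls.value
    upper.unpersist()
    total
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master, for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("recompute-sketch")
      .getOrCreate()
    try {
      println(s"map calls without cache: ${mapCalls(spark, useCache = false)}")
      println(s"map calls with cache:    ${mapCalls(spark, useCache = true)}")
    } finally spark.stop()
  }
}
```

In a local run with three input lines, the uncached variant should invoke the map function for every element on both actions, while the cached variant should invoke it only during the first action.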

cache()

cache() is useful when the lineage of the RDD branches out, that is, when several actions reuse the same intermediate RDD. Say you want to count the positive and the negative words in the file from the previous example. You could do it like this:

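val textFile = sc.textFile("/tmp/user.txt")
val wordsRDD = textFile.flatMap(line => line.split(" "))
// Mark wordsRDD for caching; it is materialized by the first action below.
wordsRDD.cache()
// isPositive and isNegative are user-defined predicates (String => Boolean), not shown here.
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
// The second count reuses the cached words instead of re-reading the file.
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()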

cache() doesn’t take any parameters.

cache() on an RDD persists the objects in memory only (StorageLevel.MEMORY_ONLY). On a DataFrame or Dataset, the default level is MEMORY_AND_DISK.
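
The storage level that cache() assigns can be inspected directly. The following is a minimal sketch, assuming Spark 2.1 or later (for Dataset.storageLevel) and a local master; the object name DefaultLevelSketch and the sample data are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object DefaultLevelSketch {
  // Returns the storage levels that cache() assigns to an RDD and to a DataFrame.
  def defaultLevels(spark: SparkSession): (StorageLevel, StorageLevel) = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(1 to 100).cache()
    val df = (1 to 100).toDF("n").cache()
    // getStorageLevel (RDD) and storageLevel (Dataset) report the assigned level,
    // even before any action has materialized the data.
    val levels = (rdd.getStorageLevel, df.storageLevel)
    rdd.unpersist()
    df.unpersist()
    levels
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master, for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("default-level-sketch")
      .getOrCreate()
    try {
      val (rddLevel, dfLevel) = defaultLevels(spark)
      println(s"RDD cache() is MEMORY_ONLY: ${rddLevel == StorageLevel.MEMORY_ONLY}")
      println(s"DataFrame cache() is MEMORY_AND_DISK: ${dfLevel == StorageLevel.MEMORY_AND_DISK}")
    } finally spark.stop()
  }
}
```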

 

persist()

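There are two flavours of the persist() function:

persist() – without an argument. It uses the default storage level: MEMORY_ONLY for an RDD, MEMORY_AND_DISK for a DataFrame or Dataset. cache() is simply shorthand for this call (internally, cache() calls persist(), not the other way round).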

RDD

rdd.persist()

persist(StorageLevel) – with a StorageLevel as argument (StorageLevel is imported from org.apache.spark.storage)

RDD

rdd.persist(StorageLevel.MEMORY_ONLY_SER)

DataFrame

df.persist(StorageLevel.DISK_ONLY)

Dataset

ds.persist(StorageLevel.MEMORY_AND_DISK)

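The StorageLevel decides where and how the persisted partitions are kept. MEMORY_ONLY_SER stores them in memory as serialized bytes, which saves space but costs CPU time on every read. DISK_ONLY writes them to disk only. MEMORY_AND_DISK keeps them in memory and spills the partitions that do not fit to disk.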
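
A storage level stays attached to an RDD until it is released. The sketch below illustrates this under stated assumptions (local master, invented object name ExplicitLevelSketch, synthetic data): it persists an RDD with an explicit level, shows that a different level is rejected while one is already assigned, and then assigns a new level after unpersist().

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import scala.util.Try

object ExplicitLevelSketch {
  // Returns: (level while persisted, whether a level change was rejected,
  //           level after unpersist, level after persisting again).
  def run(spark: SparkSession): (StorageLevel, Boolean, StorageLevel, StorageLevel) = {
    val rdd = spark.sparkContext.parallelize(1 to 1000)

    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count() // the first action materializes the persisted partitions
    val whilePersisted = rdd.getStorageLevel

    // An RDD that already has a level refuses a different one.
    val changeRejected = Try(rdd.persist(StorageLevel.DISK_ONLY)).isFailure

    // unpersist() drops the stored blocks and resets the level to NONE ...
    rdd.unpersist()
    val afterUnpersist = rdd.getStorageLevel

    // ... after which a new level can be assigned.
    rdd.persist(StorageLevel.DISK_ONLY)
    val afterRepersist = rdd.getStorageLevel
    rdd.unpersist()

    (whilePersisted, changeRejected, afterUnpersist, afterRepersist)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master, for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("explicit-level-sketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Calling unpersist() once the data is no longer needed frees memory and disk space for other cached data, which matters most for the memory-based levels.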

Conclusion

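cache() is shorthand for persist() with the default storage level, while persist(StorageLevel) lets you choose where and how the data is kept. Both are available on RDDs, DataFrames and Datasets, and both pay off when the same data is reused by more than one action.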

This solution is provided by Shubham Mishra.

This article is contributed by Developer Indian team. Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above.
