Q.1 What is the use of coalesce in Spark?
Ans: Spark's coalesce method reduces the number of partitions in a DataFrame. Suppose you read data from a CSV file into a DataFrame with four partitions; coalesce can merge them into fewer partitions without triggering a full shuffle.
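A minimal sketch of this in Scala, assuming a local session and a placeholder file "data.csv" (the app name and partition counts are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CoalesceSketch").master("local[4]").getOrCreate()

val df = spark.read.option("header", "true").csv("data.csv")
println(df.rdd.getNumPartitions)          // e.g. 4, depending on the input splits

// coalesce merges existing partitions without a full shuffle
val coalesced = df.coalesce(2)
println(coalesced.rdd.getNumPartitions)   // 2
```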
Q.2 What is the significance of Resilient Distributed Datasets in Spark?
Ans: Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
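A short sketch of both creation paths in Scala (the HDFS path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RddSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// 1. Parallelize an existing collection in the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4))

// 2. Reference a dataset in external storage ("hdfs:///path/to/data.txt" is a placeholder)
val fromFile = sc.textFile("hdfs:///path/to/data.txt")
```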
Q.3 Which languages does Apache Spark support?
Ans: Apache Spark is written in Scala, and many developers use Scala with it. Spark also provides APIs in Java, Python, and R.
Q.4 How can Spark be run in a Hadoop cluster?
Ans: There are three methods to run Spark in a Hadoop cluster:
1. Standalone deployment
2. Hadoop YARN deployment
3. Spark in MapReduce (SIMR)
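As a rough illustration of how the deployment mode surfaces in an application, the master URL distinguishes the modes ("master-host" is a placeholder; a YARN deployment typically supplies the master externally via spark-submit rather than hard-coding it):

```scala
import org.apache.spark.sql.SparkSession

// Standalone deployment: point at the standalone cluster's master URL.
// For YARN, the master is usually passed as `spark-submit --master yarn` instead.
val spark = SparkSession.builder()
  .appName("DeploymentSketch")
  .master("spark://master-host:7077")
  .getOrCreate()
```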
Q.5 What is SparkSession in Apache Spark?
Ans: SparkSession has been the unified entry point of a Spark application since Spark 2.0. It provides a way to interact with Spark's various functionality using a smaller number of constructs. Instead of having a separate SparkContext, HiveContext, and SQLContext, all of them are now encapsulated in a SparkSession.
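A minimal sketch of creating a SparkSession in Scala (the app name and master are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SessionSketch")
  .master("local[*]")
  .enableHiveSupport()          // optional: covers what HiveContext used to do; requires Hive dependencies on the classpath
  .getOrCreate()

// The older constructs are reachable through the session:
val sc = spark.sparkContext     // SparkContext
spark.sql("SELECT 1")           // what SQLContext used to provide
```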
Q.6 What operations does RDD support?
Ans: RDDs support two types of operations: transformations and actions.
Transformations: Transformations create a new RDD from an existing RDD, for example map, reduceByKey, and filter. Transformations are executed on demand, meaning they are computed lazily.
Actions: Actions return the final results of RDD computations. An action triggers execution using the lineage graph: Spark loads the data into the original RDD, carries out all intermediate transformations, and returns the final result to the driver program or writes it out to the file system.
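A small sketch in Scala showing lazy transformations followed by an action (the data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OpsSketch").master("local[*]").getOrCreate()
val words = spark.sparkContext.parallelize(Seq("a", "b", "a"))

// Transformations: lazily build up a lineage graph; nothing executes yet
val pairs    = words.map(w => (w, 1))
val counts   = pairs.reduceByKey(_ + _)
val nonEmpty = counts.filter(_._2 > 0)

// Action: triggers execution of the whole lineage and returns results to the driver
val result = nonEmpty.collect()   // Array((a,2), (b,1)) -- order may vary
```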