Spark defines an RDD interface with the properties that each type of RDD must implement. These properties include the RDD's dependencies and information about data locality that are needed for the execution engine to compute that RDD. Since RDDs are statically typed and immutable, calling a transformation on one RDD will not modify the original RDD but rather return a new RDD object with a new definition of the RDD's properties.
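As a quick illustration of that immutability, here is a minimal sketch (the local master setting and app name are arbitrary choices for this example): a transformation such as map returns a new RDD and leaves the original untouched.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal local setup, purely for illustration.
    val conf = new SparkConf().setAppName("rdd-immutability").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 5)   // original RDD
    val doubled = numbers.map(_ * 2)       // a new RDD; `numbers` is left unchanged

    println(numbers.collect().mkString(", "))   // 1, 2, 3, 4, 5
    println(doubled.collect().mkString(", "))   // 2, 4, 6, 8, 10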
RDDs can be created in three ways:
(1) by transforming an existing RDD;
(2) from a SparkContext, which is the API's gateway to Spark for your application; and
(3) by converting a DataFrame or Dataset.
The SparkContext represents the connection between a Spark cluster and one running Spark application. The SparkContext can be used to create an RDD from a local Scala object (using the makeRDD or parallelize methods) or by reading from stable storage (text files, binary files, a Hadoop Context, or a Hadoop file). DataFrames and Datasets can be read using the Spark SQL equivalent to a SparkContext, the SparkSession.
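The sketch below shows all three creation paths; it is only an illustration, and the SparkSession settings and the file path data/input.txt are placeholders chosen for this example.

    import org.apache.spark.sql.SparkSession

    // Illustrative setup: a local SparkSession; master and app name are arbitrary.
    val spark = SparkSession.builder()
      .appName("rdd-creation-example")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // (2) From the SparkContext: a local Scala collection or stable storage.
    val fromCollection = sc.parallelize(Seq("a", "b", "c"))
    val fromTextFile   = sc.textFile("data/input.txt")   // hypothetical path

    // (1) By transforming an existing RDD.
    val upperCased = fromCollection.map(_.toUpperCase)

    // (3) By converting a DataFrame or Dataset with its .rdd method.
    val df = spark.range(10).toDF("id")
    val fromDataFrame = df.rdd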
Internally, Apache Spark uses five main properties to represent an RDD. The three required properties are the list of partition objects that make up the RDD, a function for computing an iterator of each partition, and a list of dependencies on other RDDs. Optionally, RDDs also include a partitioner (for RDDs of rows of key/value pairs represented as Scala tuples) and a list of preferred locations (e.g., for an RDD backed by an HDFS file). As an end user, you will rarely need these five properties and are more likely to use predefined RDD transformations. However, it is helpful to understand the properties and know how to access them for debugging and for a better conceptual understanding; a short sketch after the following definitions shows one way to inspect them. These five properties correspond to the following five methods available to the end user.
partitions():
Returns an array of the partition objects that make up the parts of the distributed dataset. In the case of an RDD with a partitioner, the value of the index of each partition will correspond to the value of the getPartition function for each key in the data associated with that partition.
iterator(p, parentIters):
Computes the elements of partition p given iterators for each of its parent partitions. This function is called in order to compute each of the partitions in this RDD. It is not intended to be called directly by the user; rather, Spark uses it when computing actions. Still, referencing the implementation of this function can be useful in determining how each partition of an RDD transformation is evaluated.
dependencies():
Returns a sequence of dependency objects. The dependencies let the scheduler know how this RDD depends on other RDDs. There are two kinds of dependencies: narrow dependencies (NarrowDependency objects), which represent partitions that depend on one or a small subset of partitions in the parent, and wide dependencies (ShuffleDependency objects), which are used when a partition can only be computed by rearranging all the data in the parent.
partitioner():
Returns a Scala option type of a partitioner object if the RDD has a function between element and partition associated with it, such as a HashPartitioner. This function returns None for all RDDs that are not of type tuple (do not represent key/value data). An RDD that represents an HDFS file (implemented in NewHadoopRDD.scala) has a partition for each block of the file.
preferredLocations(p):
Returns information about the data locality of a partition, p. Specifically, this function returns a sequence of strings representing some information about each of the nodes where the split p is stored. In an RDD representing an HDFS file, each string in the result of preferredLocations is the Hadoop name of the node where that partition is stored.
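As a concrete (and hedged) illustration of the definitions above, the sketch below reuses the sc from the earlier creation example and builds a small hash-partitioned pair RDD so that each of these properties has something to show; the keys and partition count are arbitrary.

    import org.apache.spark.HashPartitioner

    // A small key/value RDD, hash-partitioned so that partitioner() is defined.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
      .partitionBy(new HashPartitioner(2))

    println(pairs.partitions.length)   // the partition objects backing the RDD (2 here)
    println(pairs.partitioner)         // Some(org.apache.spark.HashPartitioner@...)
    println(pairs.dependencies)        // a ShuffleDependency, since partitionBy shuffles
    pairs.partitions.foreach { p =>
      // Locality hints per partition; typically empty when running locally.
      println(s"partition ${p.index}: ${pairs.preferredLocations(p)}")
    }
    // iterator(p, context) is invoked by Spark itself when an action runs,
    // so it is not normally called from user code.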