What is Spark and Spark RDD along with its Operations

What is Apache Spark?

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilises in-memory caching and optimised query execution for fast queries against data of any size. Simply put, Spark is a fast, general-purpose engine for large-scale data processing.

The secret to its speed is that Spark keeps data in memory (RAM), which makes processing much faster than working from disk drives.

It can be used for multiple things like running distributed SQL, creating data pipelines, ingesting data into a database, running Machine Learning algorithms, working with graphs or data streams, and much more.

Features of Spark

1.  Fast Processing: Spark is built around the Resilient Distributed Dataset (RDD), which saves time on read and write operations, allowing it to run roughly ten to one hundred times faster than Hadoop MapReduce.

2.  Flexibility: Apache Spark supports multiple languages, allowing developers to write applications in Java, Scala, R, or Python.

3.  In-memory computing: Spark stores data in the RAM of the servers, which allows quick access and in turn accelerates the speed of analytics.

4.  Real-time processing: Spark can process real-time streaming data, unlike MapReduce, which processes only stored data.

5.  Better analytics: Apache Spark provides a rich set of SQL queries, machine learning algorithms, complex analytics, and more.

Components of Spark


1.  Apache Spark Core: Spark Core is the underlying general execution engine for the Spark platform, on which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.

2.  Spark SQL: Spark SQL is Apache Spark’s module for working with structured data. The interfaces it offers give Spark more information about the structure of both the data and the computation being performed.

3.  Spark Streaming: This component allows Spark to process real-time streaming data. Data can be ingested from many sources, such as Kafka, Flume, and HDFS, then processed using complex algorithms and pushed out to file systems, databases, and live dashboards.

4.  MLlib (Machine Learning Library): This library contains a wide array of machine learning algorithms, along with tools for constructing, evaluating, and tuning ML pipelines. All of these functionalities scale out across a cluster.

5.  GraphX: Spark also comes with GraphX, a library for manipulating graphs and performing graph computations. GraphX unifies the ETL (Extract, Transform, and Load) process, exploratory analysis, and iterative graph computation within a single system.

What is Spark RDD?

Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. RDDs are immutable, distributed collections of objects of any type, created through deterministic operations either on data in stable storage or on other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

There are several ways to create RDDs: parallelizing an existing collection in your driver program, referencing a dataset in an external storage system (such as a shared file system or HDFS), or using a direct stream approach with Kafka.
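Conceptually, parallelizing a collection just means slicing it into partitions that can be processed on different nodes. A plain-Python sketch of that idea (no Spark required; the function name and slicing arithmetic are illustrative stand-ins, not Spark's internals):

```python
def parallelize(data, num_slices):
    """Rough stand-in for sc.parallelize: split a local
    collection into num_slices partitions."""
    n = len(data)
    return [data[n * i // num_slices : n * (i + 1) // num_slices]
            for i in range(num_slices)]

partitions = parallelize([1, 2, 3, 4, 5, 6, 7], 3)
print(partitions)  # [[1, 2], [3, 4], [5, 6, 7]]
```

Each inner list plays the role of one partition; every element lands in exactly one partition.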

In Java, RDDs come in two different types: JavaRDD and JavaPairRDD.
JavaRDD is used for operations that don’t require an explicit key field; these are generic operations on arbitrary element types.
JavaPairRDD declares the contract to the developer that a key and a value are required.

Take a look at their Javadocs to see the functions that are available for each:
JavaRDD
JavaPairRDD

Spark RDD Operations with some examples

Apache Spark RDD supports two types of Operations-

Transformations: A transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and generates one or more RDDs as output. The types of transformations are:

1.  Narrow Transformations: All the elements required to compute the records in a single output partition live in a single partition of the parent RDD. Here, a limited subset of the partitions is used to calculate the result. Examples include map() and filter().

2.  Wide Transformations: The elements required to compute the records in a single output partition may live in many partitions of the parent RDD. Examples include groupByKey() and reduceByKey().

=> map()- The map() transformation takes in a function and applies it to every element of the RDD.
For example, applying “rdd.map(x => x + 2)” to the RDD {1, 2, 3, 4, 5} gives the result {3, 4, 5, 6, 7}.
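The same semantics in plain Python, with a local list standing in for the RDD:

```python
rdd = [1, 2, 3, 4, 5]          # local stand-in for the RDD
mapped = [x + 2 for x in rdd]  # what rdd.map(x => x + 2) computes per element
print(mapped)  # [3, 4, 5, 6, 7]
```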

=> filter()- Spark RDD’s filter() function returns a new RDD containing only the elements that satisfy a predicate.
For example, filter() can keep only the records in which the ”bot” field is ”false”, producing a new RDD of those records.
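A minimal plain-Python sketch of that predicate (the record layout here is made up for illustration; a local list stands in for the RDD):

```python
records = [
    {"page": "Home", "bot": "false"},
    {"page": "Search", "bot": "true"},   # a crawler hit
    {"page": "About", "bot": "false"},
]

# What rdd.filter(r => r["bot"] == "false") keeps: only the human traffic.
humans = [r for r in records if r["bot"] == "false"]
print(len(humans))  # 2
```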

=> groupByKey()- The groupByKey() transformation groups the values in a pair RDD that share the same key.
For example, grouping (letter, count) pairs by letter and summing each group yields alphabet-wise counts.
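In plain Python, grouping (letter, 1) pairs by key and then summing each group gives the alphabet-wise counts; a dict of lists stands in for the grouped RDD:

```python
pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]

# groupByKey(): gather every value that shares a key.
grouped = {}
for key, value in pairs:
    grouped.setdefault(key, []).append(value)
# grouped == {"a": [1, 1, 1], "b": [1], "c": [1]}

counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'a': 3, 'b': 1, 'c': 1}
```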

=> reduceByKey()- When we use reduceByKey() on a dataset of (K, V) pairs, values with the same key are combined on each partition before the data is shuffled.
For example, we can specify the combining function, such as addition, that reduceByKey() should perform.
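A plain-Python sketch of that per-partition combine step, with addition chosen as the combining function (the helper name is illustrative, not a Spark API):

```python
from operator import add

def combine(pairs, func):
    """Combine values that share a key, as reduceByKey does
    on each partition before the shuffle."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out

partition = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
print(combine(partition, add))  # {'a': 3, 'b': 1}
```

Because each partition is pre-combined like this, far less data crosses the network than with groupByKey().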

NOTE: the Spark Programming Guide (listed under Resources) covers more transformation operations.

Actions: Actions are Spark RDD operations that return non-RDD values. The result of an action is returned to the driver program or written to an external storage system.

=> count()- The count() action returns the number of elements in the RDD.

=> collect()- The collect() action is the most common and simplest operation; it returns the entire content of the RDD to the driver program.
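With a partitioned local list standing in for the RDD, count() and collect() amount to the following plain-Python sketch:

```python
partitions = [[1, 2], [3, 4], [5, 6, 7]]   # stand-in for an RDD's partitions

count = sum(len(p) for p in partitions)          # count(): number of elements
collected = [x for p in partitions for x in p]   # collect(): all elements to the driver

print(count)      # 7
print(collected)  # [1, 2, 3, 4, 5, 6, 7]
```

This is also why collect() is dangerous on large datasets: everything has to fit in the driver's memory at once.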

=> take(n)- The take(n) action returns n elements from the RDD. It tries to minimise the number of partitions it accesses, so it may return a biased collection; we cannot presume the order of the elements.
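That scan-as-few-partitions-as-possible behaviour can be sketched in plain Python; note how the result reflects partition order rather than any global ordering:

```python
def take(partitions, n):
    """Return n elements, scanning as few partitions as
    possible, like RDD.take(n)."""
    out = []
    for part in partitions:
        for x in part:
            if len(out) == n:
                return out
            out.append(x)
    return out

partitions = [[5, 6], [1, 2], [9]]
print(take(partitions, 3))  # [5, 6, 1] -- drawn from the first partitions scanned
```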

=> foreach()- When we want to apply an operation to each element of the RDD without returning a value to the driver, the foreach() action is useful.
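A plain-Python sketch: the function runs purely for its side effect (here, writing to a list that stands in for an external sink such as a database), and nothing comes back as a result:

```python
def foreach(data, func):
    """Apply func to every element for its side effects only;
    return nothing, like RDD.foreach."""
    for x in data:
        func(x)

sink = []                                        # stands in for an external sink
foreach([1, 2, 3], lambda x: sink.append(x * 10))
print(sink)  # [10, 20, 30]
```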

NOTE: the Spark Programming Guide (listed under Resources) covers more action operations.

Limitations of RDD

1.  No static and runtime type safety: RDD does not provide static or runtime type safety in all cases, so the user cannot catch such errors before the job actually runs.

2.  The problem of overflow: RDD performance degrades when there is not enough memory available to keep it in-memory. Partitions that overflow from RAM can be stored on disk, but they will not provide the same level of performance. We need to increase the RAM and disk size to overcome this problem.

3.  No schema view of data: RDD has trouble handling structured data, because it does not provide a schema view of the data and has no provision for one.

Conclusion

Hadoop MapReduce had a number of shortcomings, and Spark RDD was introduced to overcome them: it offers in-memory processing, immutability, and the other features mentioned above, giving users a better option. But RDDs have limitations of their own, which keep Spark from being even more versatile.

Applying a transformation to an RDD creates another RDD; because RDDs are immutable, the original is never modified. The result is computed only when an action is invoked on an RDD. This lazy evaluation decreases the overhead of computation and makes the system more efficient.

Resources

1.  Apache Spark Documentation (latest)
2.  Spark Programming Guide

Further Readings

1.  Spark 101: What Is It, What It Does, and Why It Matters
2.  Top Apache Spark Use Cases
