In this tutorial, we will focus on Spark, the Spark framework, its architecture, how it works, Resilient Distributed Datasets (RDDs), RDD operations, Spark programming languages, and a comparison of Spark with MapReduce.
Spark is a fast cluster computing system that is compatible with Hadoop. It can work with any Hadoop-supported storage system, such as HDFS or Amazon S3. Spark uses in-memory computing to improve efficiency: intermediate results are kept in memory rather than saved to disk. Spark also uses caching to speed up repetitive queries. For some workloads, Spark is up to 100x faster than Hadoop. Spark is written in Scala.
Spark is another Big Data framework, and it supports in-memory processing. Hadoop reads and writes data directly from disk, wasting a significant amount of time on disk I/O. Spark tackles this by storing intermediate results in memory, reducing disk I/O and increasing processing speed.
Spark also uses a master-slave architecture with two main entities: the driver and the executors. The driver is a central coordinator that communicates with multiple distributed executors.
In Spark, an application starts with the initialization of a SparkContext instance. The driver program then requests resources from the cluster manager, which launches the executors. The driver sends operations, that is, transformations and actions over RDDs, to the executors. The executors perform the tasks and save the final results.
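To make this flow concrete, here is a minimal PySpark sketch of that life cycle. The app name and the local[2] master URL are illustrative choices for running on a single machine, not part of the tutorial's setup:

```python
# A minimal sketch of the Spark application life cycle described above.
from pyspark import SparkConf, SparkContext

# Initializing SparkContext starts the driver, which requests resources
# from the cluster manager; the cluster manager launches the executors.
conf = SparkConf().setAppName("LifecycleDemo").setMaster("local[2]")
sc = SparkContext(conf=conf)

# The driver ships operations (transformations and actions) to the executors.
numbers = sc.parallelize([1, 2, 3, 4, 5])   # distribute data as an RDD
squared = numbers.map(lambda x: x * x)      # transformation (lazy)
print(squared.collect())                    # action: executors compute and return results

sc.stop()  # terminate the executors and release cluster resources
```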
If an executor crashes or fails, its tasks are reassigned to other executors. Spark can also deal with slow machines: whenever a node crashes or slows down, Spark launches a speculative copy of the task on another executor or node in the cluster. We can stop the application with the SparkContext.stop() method, which terminates all the executors and releases the cluster resources.
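As a hedged sketch, speculative execution can be enabled through the spark.speculation configuration property; the app name and local master below are illustrative:

```python
# Enabling speculative execution so Spark re-launches suspiciously slow tasks.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("SpeculationDemo")      # illustrative app name
    .setMaster("local[2]")              # illustrative local master
    .set("spark.speculation", "true")   # launch speculative copies of slow tasks
)
sc = SparkContext(conf=conf)

# ... run jobs here ...

sc.stop()  # terminates all executors and releases cluster resources
```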
Spark uses one important data structure to distribute data over the executors in the cluster: the RDD (Resilient Distributed Dataset). An RDD is an immutable data structure that can be distributed across the cluster for parallel computation. RDDs can also be cached and persisted in memory.
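Below is a small illustrative PySpark sketch of creating, caching, and persisting an RDD; the master URL, app name, and sample data are assumptions for a local run:

```python
# Creating, caching, and persisting RDDs.
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[2]", "RDDDemo")  # illustrative master and app name

lines = sc.parallelize(["spark", "hadoop", "spark", "rdd"])

# cache() keeps the RDD in memory after it is first computed,
# so repeated queries skip recomputation.
lines.cache()
print(lines.count())   # first action materializes and caches the RDD
print(lines.count())   # second action is served from memory

# persist() lets you choose a storage level, e.g. memory with disk spill.
pairs = lines.map(lambda w: (w, 1))
pairs.persist(StorageLevel.MEMORY_AND_DISK)
print(pairs.countByKey())

sc.stop()
```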
In MapReduce, data sharing among the nodes is slow because of data replication, serialization, and disk I/O. Hadoop spends more than 90% of its time on HDFS read-write operations. To address this problem, researchers came up with the key idea of Resilient Distributed Datasets (RDDs), which support in-memory computation: data is stored as objects across the job, and all operations are performed in RAM. This in-memory approach made data transfer operations 10 to 100 times faster.
We can perform two types of basic operations on an RDD:
- Transformations: create a new RDD from an existing one, such as map and filter.
- Actions: trigger computation and return a result to the driver, such as count and collect.
“Transformations are lazy; they don’t compute right away. A transformation is only computed when an action is performed.”
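The following minimal sketch shows this laziness; the local master and sample data are assumptions for illustration. The transformations only build a lineage, and nothing runs until the collect() action:

```python
# Lazy transformations versus eager actions.
from pyspark import SparkContext

sc = SparkContext("local[2]", "LazyDemo")  # illustrative master and app name

nums = sc.parallelize(range(1, 6))

# Transformations build up a lineage but compute nothing yet.
evens = nums.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# Only the action below triggers actual computation on the executors.
print(doubled.collect())  # [4, 8]

sc.stop()
```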
You can write Spark tasks in Scala, Java, Python, or R. Spark works faster with Scala than with the other languages because Spark itself is written in Scala. Most data scientists prefer Python for their tasks, but before using Python you should understand the difference between Python and Scala in Spark.
One of the major drawbacks of MapReduce is that it permanently stores the whole dataset on HDFS after executing each task. This operation is very expensive due to data replication. Spark improves on MapReduce by not writing data permanently to disk after each operation; instead, it uses in-memory computation for faster operations. This idea was inspired by Microsoft's Dryad paper.
The main advantage of Spark is that it launches tasks faster than MapReduce. MapReduce launches a new JVM for each task, while Spark keeps a JVM running on each executor, so launching a task takes much less time.
Congratulations, you have made it to the end of this tutorial!
In this tutorial, we covered Spark, the Spark framework, its architecture, how it works, Resilient Distributed Datasets, RDD operations, Spark programming languages, and a comparison of Spark with MapReduce.
I look forward to hearing any feedback or questions. You can ask a question by leaving a comment, and I will try my best to answer it.