Figure 1: MapReduce based iterative programming
This challenge arises because Hadoop is designed around the cluster's disk storage: every MapReduce job writes its output to HDFS, so an iterative program has to write its data to disk and read it back in each iteration. This large volume of HDFS reads and writes significantly limits performance.
Spark Programming
The Spark system implements the Resilient Distributed Dataset (RDD) to make the most of the memory available in the cluster. With RDDs, most operations are performed in memory. To develop a K-Means algorithm in Spark, you only need to transform the RDD from the previous iteration into a new one for the next iteration.
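The iteration pattern described above can be sketched in plain Python. This is a minimal illustration and not the lab solution: it runs without Spark and uses ordinary lists, and the helper names (closest, kmeans_step, kmeans) are hypothetical. The comments note which step would roughly correspond to an RDD map and which to a reduceByKey when the same logic is ported to Spark.

```python
def closest(point, centroids):
    """Return the index of the centroid nearest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_step(points, centroids):
    """One iteration: assign every point to a centroid, then average each cluster."""
    # "map" step (an RDD.map in Spark): emit (cluster index, (point, 1)).
    pairs = [(closest(p, centroids), (p, 1)) for p in points]

    # "reduce by key" step (an RDD.reduceByKey in Spark):
    # sum the coordinates and the point counts of each cluster.
    sums = {}
    for key, (p, n) in pairs:
        if key in sums:
            acc, count = sums[key]
            sums[key] = (tuple(a + b for a, b in zip(acc, p)), count + n)
        else:
            sums[key] = (p, n)

    # New centroid = coordinate sum / count; an empty cluster keeps its old centroid.
    return [
        tuple(x / sums[i][1] for x in sums[i][0]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat kmeans_step until the centroids stop moving or max_iter is reached."""
    for _ in range(max_iter):
        new_centroids = kmeans_step(points, centroids)
        shift = sum(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

In a Spark version, the list of points would be an RDD kept in memory across iterations, and only the small list of centroids would be collected back to the driver after each step.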
Programming in Lab 2
In this lab, implement the K-Means algorithm based on your previous code. You may use any Spark-related library package.
1. Part 1: Redo Project 1 Part 1 Question 1 with parallelism levels of 2, 3, 4, and 5. To change the parallelism level, add one option to the spark-submit command in test.sh; for example, adding --conf spark.default.parallelism=2 after spark-submit sets the parallelism level to 2.
2. Part 2: Please redo Project 1 Part 2 Question 2.
3. Part 3: Please redo Project 1 Bonus Question (K-Means in Spark).
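For Part 1, the spark-submit command for each parallelism level can be generated rather than edited by hand. The sketch below only builds and prints the command lines; it does not launch Spark. The function name build_submit_command and the application path your_app.py are placeholders, not part of the lab materials, so substitute whatever test.sh actually submits.

```python
def build_submit_command(parallelism, app_path, extra_args=()):
    """Return the spark-submit argument list for one parallelism level.

    `app_path` is a placeholder for the application that test.sh submits.
    """
    return [
        "spark-submit",
        "--conf",
        f"spark.default.parallelism={parallelism}",
        app_path,
        *extra_args,
    ]


if __name__ == "__main__":
    # Print one command per parallelism level requested in Part 1.
    for level in (2, 3, 4, 5):
        print(" ".join(build_submit_command(level, "your_app.py")))
```

Each printed line can be pasted into test.sh, or the argument list can be passed to subprocess.run to time the runs from a single script.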
Installing the Spark cluster GitHub Link.
Grading Rubric
Up to 2 students in a group.
(50%) Part 1
(20% each) Parts 2 and 3
(10%) Report
Figure 2: Hadoop vs. Spark
[Figure 2 panels. Hadoop: first iteration, second iteration, HDFS. Spark: first iteration, second iteration, RAM, HDFS.]