Question


Figure 1: MapReduce based iterative programming

This challenge arises because Hadoop is designed around the cluster's disk storage: every MapReduce job must write its output to disk. In an iterative program, each iteration therefore reads its input from HDFS and writes its result back to HDFS, and this volume of HDFS reads and writes significantly limits performance.

Spark Programming

Spark implements the Resilient Distributed Dataset (RDD) to make the most of the memory available in the cluster. With RDDs, most operations are performed in memory. To develop a K-Means algorithm in Spark, you only need to transform the previous RDD into a new one for the next iteration.
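The iteration pattern described above can be sketched in plain Python, without Spark, to show the shape of the computation. This is only an illustrative stand-in, not the expected lab solution: the function names (closest_centroid, kmeans_step, kmeans) are hypothetical, and the two phases are written as ordinary loops that play the role an RDD map followed by reduceByKey would play in Spark.

```python
from math import dist  # Euclidean distance, Python 3.8+


def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to point."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans_step(points, centroids):
    """One iteration: assign each point to a centroid, then average each cluster."""
    # "map" phase: emit (cluster index, (point, 1)) for every point.
    assigned = [(closest_centroid(p, centroids), (p, 1)) for p in points]

    # "reduce by key" phase: sum coordinates and counts per cluster index.
    sums = {}
    for idx, (p, n) in assigned:
        if idx in sums:
            total, count = sums[idx]
            sums[idx] = (tuple(a + b for a, b in zip(total, p)), count + n)
        else:
            sums[idx] = (tuple(p), n)

    # New centroid = mean of its points; an empty cluster keeps its old centroid.
    return [
        tuple(c / sums[i][1] for c in sums[i][0]) if i in sums else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-6):
    """Repeat kmeans_step until the centroids move less than tol or max_iter is hit."""
    for _ in range(max_iter):
        updated = kmeans_step(points, centroids)
        shift = max(dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < tol:
            break
    return centroids
```

Each call to kmeans_step takes the previous centroids and produces new ones, which mirrors deriving a new RDD from the previous one on every iteration. The defaults for max_iter and tol are arbitrary choices made for this sketch.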

Programming in Lab 2

In this lab, build on your previous code to implement the K-Means algorithm. You may use any Spark-related library package.

1. Part 1: Redo Project 1 Part 1 Question 1 with different levels of parallelism: 2, 3, 4, and 5. To change the parallelism level, add --conf spark.default.parallelism=2 after spark-submit in test.sh; this example sets the level to 2.
2. Part 2: Redo Project 1 Part 2 Question 2.
3. Part 3: Redo Project 1 Bonus Question (K-Means in Spark).
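For Part 1, one possible way to repeat the run at each parallelism level is a small Python driver that builds the spark-submit command with the --conf option quoted above. This is a sketch under assumptions: the script name your_job.py, the helper names build_command and time_runs, and the idea of timing each run are all hypothetical and not part of the handout; editing test.sh by hand, as the instructions describe, works just as well.

```python
import subprocess
import time


def build_command(parallelism, app="your_job.py", app_args=()):
    """Build the spark-submit argument list for one parallelism level.

    app and app_args are placeholders; substitute whatever test.sh actually runs.
    """
    return [
        "spark-submit",
        "--conf", f"spark.default.parallelism={parallelism}",
        app,
        *app_args,
    ]


def time_runs(levels=(2, 3, 4, 5), runner=subprocess.run, **kwargs):
    """Run the job once per level and return {level: elapsed seconds}."""
    timings = {}
    for level in levels:
        start = time.perf_counter()
        # check=True raises an error if spark-submit exits with a failure code.
        runner(build_command(level, **kwargs), check=True)
        timings[level] = time.perf_counter() - start
    return timings
```

Calling time_runs() on a machine where spark-submit is on the PATH would launch the job four times, once per level. The runner parameter exists only so the loop can be exercised without Spark installed.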

Installing the Spark cluster GitHub Link.

Grading Rubric

Up to 2 students in a group.
(50%) Part 1
(20% * 2) Parts 2 and 3
(10%) Report

Figure 2: Hadoop vs. Spark

[Diagram with two panels. Hadoop: first iteration, second iteration, HDFS. Spark: first iteration, second iteration, RAM, HDFS.]