A bank is trying to implement a fraud detection algorithm in order to identify fraudulent transactions. A typical flowchart of the fraud detection process is given below:
The figure shows 6 major steps comprising a total of 700 sub-steps (not shown). Of these sub-steps, 525 can be executed in parallel on the bank's existing machines, while the remaining 175 can only be executed sequentially.
Given this information, please answer the following questions:
1. Calculate the speedup and efficiency of executing the fraud detection process on a cluster with n = 256 nodes, assuming a fixed workload (the workload is not scaled up with the number of nodes) and regular parallel processing. [10 marks]
2. What are your comments on the efficiency? Is it low or high?
3. What could be the possible reason for the low/high efficiency from a maximum Speedup point of view? [5 marks]
4. If the fraud detection process is not run on a distributed setup, the bank is considering two options: virtual machines (VMs) or containers. Recommend one of the two options for the bank and justify your choice. [5 marks]
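
As a rough check on the arithmetic behind questions 1 to 3, the calculation can be sketched in a few lines of Python. This is a minimal sketch, not a model answer. It assumes that the fixed-workload condition calls for Amdahl's law, and that every sub-step takes the same amount of time, so the parallel fraction is simply 525/700. The function name `amdahl_speedup` and the variable names are illustrative and do not come from the question.

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Fixed-workload (Amdahl) speedup on n nodes.

    Assumption: the serial part runs on one node and the parallel
    part is divided evenly across all n nodes with no overhead.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)


# Figures taken from the question.
total_substeps = 700
parallel_substeps = 525
n_nodes = 256

# Assumption: all sub-steps cost the same, so fractions of sub-steps
# equal fractions of execution time.
p = parallel_substeps / total_substeps      # parallel fraction
speedup = amdahl_speedup(p, n_nodes)        # S(n)
efficiency = speedup / n_nodes              # E(n) = S(n) / n
max_speedup = 1.0 / (1.0 - p)               # limit of S(n) as n grows

print(f"parallel fraction p = {p:.2f}")
print(f"speedup S({n_nodes})      = {speedup:.3f}")
print(f"efficiency E({n_nodes})   = {efficiency:.2%}")
print(f"maximum speedup     = {max_speedup:.2f}")
```

Under these assumptions the speedup comes out just below 4 and the efficiency at roughly 1.5%, because the 175 sequential sub-steps cap the speedup at 1 / 0.25 = 4 no matter how many nodes are added.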
[Figure: "Fraud Detection Process" flowchart. Recoverable box labels: Train/Test Dataset; SVM Classifier; Applying SVM on test data; K-Means Clustering; Fraud (1) or not Fraud (0); Feature; KNN.]