HW 3 HDFS—Lecture 5
Name:
ID:
Consider a small cluster with 20 machines: 19 DataNodes and 1 NameNode. Each node in the cluster has 2 Terabytes of hard disk space and 2 Gigabytes of main memory available. The cluster uses a block size of 64 Megabytes (MB) and a replication factor of 3. The NameNode (master) maintains 100 bytes of metadata for each 64 MB block.
(a) Let’s upload the file wiki_dump.xml (with a size of 600 Megabytes) to HDFS. Explain what effect this upload has on the number of occupied HDFS blocks.
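The block arithmetic behind part (a) can be sketched as follows. This is only an illustration under stated assumptions: sizes are treated as whole megabytes, the function name hdfs_blocks is hypothetical, and the final partial block is assumed to occupy only as much disk space as it actually holds.

```python
import math

def hdfs_blocks(file_size_mb, block_size_mb=64, replication=3):
    """Estimate block usage for one file (hypothetical helper, sizes in MB)."""
    if file_size_mb <= 0:
        raise ValueError("file size must be positive")
    # A file is cut into fixed-size blocks; the last one may be partial.
    logical_blocks = math.ceil(file_size_mb / block_size_mb)
    last_block_mb = file_size_mb - (logical_blocks - 1) * block_size_mb
    return {
        "logical_blocks": logical_blocks,
        # Every logical block is stored `replication` times across DataNodes.
        "block_replicas": logical_blocks * replication,
        "last_block_mb": last_block_mb,
        # Assumption: a partial block only consumes its actual size on disk.
        "raw_disk_mb": file_size_mb * replication,
    }

result = hdfs_blocks(600)
print(result)
```

Under these assumptions the NameNode tracks the logical blocks, while the replicas are what consume disk space on the DataNodes.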
(b) Figure 1 shows an excerpt of wiki_dump.xml’s structure. Explain the relationship between an HDFS block, an InputSplit and a record based on this excerpt.
<dump time="1483027930">
<page id="EN3234">
...
...
...
</page> } 80.2 MB
<page id="DE5434">
...
...
...
</page> } 0.6 MB
...
</dump>
Figure 1: Excerpt of wiki_dump.xml. Each Wikipedia page is stored within a <page> element. The <page> element with id EN3234 contains 80.2 Megabytes of textual content.
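To make the block/record relationship in Figure 1 concrete, the sketch below computes which 64 MB blocks a record's byte range touches. It rests on assumptions that are not in the figure: the first page is taken to start at byte offset 0 (the enclosing dump tag is ignored), the second page follows it directly, "MB" is read as 2^20 bytes, and the helper blocks_spanned is hypothetical.

```python
MIB = 2**20                # assumption: 1 MB = 2^20 bytes
BLOCK_SIZE = 64 * MIB      # block size from the exercise

def blocks_spanned(start_byte, length_bytes, block_size=BLOCK_SIZE):
    """Return the indices of all blocks touched by a byte range."""
    end_byte = start_byte + length_bytes - 1   # last byte of the record
    first = start_byte // block_size
    last = end_byte // block_size
    return list(range(first, last + 1))

# Assumed layout: page EN3234 at offset 0, page DE5434 right after it.
en_len = round(80.2 * MIB)
de_len = round(0.6 * MIB)

en_blocks = blocks_spanned(0, en_len)
de_blocks = blocks_spanned(en_len, de_len)
print("EN3234 spans blocks", en_blocks)
print("DE5434 spans blocks", de_blocks)
```

Under this layout the 80.2 MB record crosses a block boundary, which is why a logical unit such as an InputSplit or a record cannot simply be equated with a physical block.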
(c) You are the only user of the cluster and write a Hadoop job to extract information from wiki_dump.xml. You want to speed up the job by testing different block size configurations: besides the existing 64 MB configuration, you also consider block sizes of 32 MB and 128 MB. Which configuration do you think will lead to the fastest job execution? Explain why.
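One quantity worth comparing for part (c) is how many map tasks each block size would produce for the 600 MB file. The sketch below assumes one InputSplit per block and one map task per split, which is a common default but not something the exercise states; map_tasks is a hypothetical helper, and the sketch does not model per-task startup overhead or data locality.

```python
import math

FILE_SIZE_MB = 600   # size of wiki_dump.xml from part (a)
DATANODES = 19       # worker machines in the cluster

def map_tasks(file_size_mb, block_size_mb):
    """Number of map tasks, assuming one split (and one task) per block."""
    return math.ceil(file_size_mb / block_size_mb)

tasks = {bs: map_tasks(FILE_SIZE_MB, bs) for bs in (32, 64, 128)}
for block_size_mb, n in tasks.items():
    # Compare the task count with the number of machines that can run them.
    print(f"{block_size_mb:>3} MB blocks -> {n} map tasks "
          f"for {DATANODES} DataNodes")
```

The task count alone does not settle the question; it has to be weighed against how many DataNodes can work in parallel and how much fixed overhead each task carries.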
(d) Let us assume no files are currently stored on HDFS. You are given 100 million files, each one with a size of 100 Kilobytes. How many of those can you upload successfully to the cluster, considering the storage restrictions (memory/disk) on the NameNode and the DataNodes? Explain your answer.
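The two limits in part (d) can be compared with a short calculation. This is a sketch under explicit assumptions: decimal units (1 KB = 10^3 bytes, 1 GB = 10^9, 1 TB = 10^12), the NameNode's entire 2 GB of memory is available for block metadata, each 100 KB file occupies exactly one block entry (100 bytes of metadata), and a small file consumes only its actual size on disk per replica. All constant names are hypothetical.

```python
# Assumption: decimal units throughout.
NAMENODE_MEMORY_BYTES = 2 * 10**9      # 2 GB main memory on the NameNode
METADATA_BYTES_PER_BLOCK = 100         # from the exercise
DATANODES = 19
DISK_BYTES_PER_NODE = 2 * 10**12       # 2 TB per DataNode
REPLICATION = 3
FILE_BYTES = 100 * 10**3               # 100 KB per file
FILES_OFFERED = 100_000_000

# Limit 1: each file needs one block entry in NameNode memory.
limit_by_memory = NAMENODE_MEMORY_BYTES // METADATA_BYTES_PER_BLOCK

# Limit 2: each file is stored REPLICATION times on DataNode disks.
total_disk = DATANODES * DISK_BYTES_PER_NODE
limit_by_disk = total_disk // (FILE_BYTES * REPLICATION)

uploadable = min(limit_by_memory, limit_by_disk, FILES_OFFERED)
print("limit by NameNode memory:", limit_by_memory)
print("limit by DataNode disk:  ", limit_by_disk)
print("files that fit:          ", uploadable)
```

With binary units or with part of the NameNode memory reserved for other purposes the exact numbers shift, but the comparison between the two limits is carried out the same way.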