You are member of a data science team which wants to analyse the Opal Card transport data. This dataset contains millions of tap-on / tap-off events where transport users were swiping on and off from trains, buses and ferries. The format is:
CardEvents (date, card, mode, time_on, time_off )
Your colleague suggests to store this data partitioned over multiple computers to be able to parallelise the processing. He suggests to use horizontal hash partitioning on the date attribute.
Your task in the project is to identify the average length of journeys, which consist of multiple trips across different transport modes (bus, train, etc) each within one hour of the previous trip.
How well does your colleague's suggestion to use hash partitioning for the data help you with this?