Recently at the PODS Colloquium on Theory and Challenges of Big Data a member of the audience asked: "What is the right model of computation to analyze parallel algorithms?" Indeed, there is a large number of systems in practice: MapReduce, Hadoop, Pregel, Giraph, Dryad, the list goes on. There are also a number of different models in theory, from classical ones like the PRAM (with its different variants: CREW, EREW, CRCW) and BSP, to more modern ones like MRC and CGP, among others. And then there are GPUs...
At a high level, one can classify these systems into synchronous and asynchronous. Most of the synchronous systems (and models) adopt Valiant's Bulk Synchronous Parallel model of computation, albeit with different assumptions on which operations are cheap and which are expensive.
The key tradeoff is how much computation each individual compute node should do before syncing and sharing state with the others. If the amount of computation is small, then each node only needs a small memory, but the network bandwidth needs to be high enough to support the frequent communication. If, on the other hand, the amount of computation done before syncing is relatively large, then each node needs a relatively large memory, but communication is slightly less of an issue.
For an algorithm designer this means that the primary objective for the algorithm changes. The total running time, the number of synchronization rounds, and the degree to which the system is load-balanced (e.g. maximum input size to each compute node) are all important objectives, and we strive to minimize all of them; however, different ones take precedence in the various instantiations.
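To make the superstep structure concrete, here is a toy, single-process sketch of a BSP-style loop (the names `bsp_run` and `local_compute` are invented for this post, not taken from any particular system): each round does purely local work on every worker, then pays for a barrier and a message exchange.

```python
# Toy, single-process sketch of a BSP-style superstep loop. The names are
# invented for illustration; a real system would replace the in-memory
# message exchange with network communication at the barrier.

def bsp_run(states, local_compute, num_supersteps):
    """states: one local state per worker.
    local_compute(worker_id, state, inbox) -> (new_state, [(dest_worker, message), ...])"""
    inboxes = [[] for _ in states]
    for _ in range(num_supersteps):
        outgoing = []
        # Phase 1: purely local work on each worker's own state and inbox.
        for worker_id, state in enumerate(states):
            new_state, messages = local_compute(worker_id, state, inboxes[worker_id])
            states[worker_id] = new_state
            outgoing.extend(messages)
        # Phase 2: barrier + exchange. Doing more work in phase 1 means
        # fewer of these synchronization points, but larger local state.
        inboxes = [[] for _ in states]
        for dest, message in outgoing:
            inboxes[dest].append(message)
    return states
```

Each of the settings below is, in effect, a different set of constraints on how large each worker's state may be and how much data a single exchange may move.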
In the sequential world with no parallelism:
- All of the input is read by a single machine, which should have large enough memory (~TBs)
- Network latency and bandwidth are not important.
- Goal: optimize the overall running time.
In a MapReduce world:
- All of the input is partitioned across a small (100s or 1000s) number of compute tasks.
- The per-node memory is relatively large (GBs), but probably not enough to read the full input
- Rounds are relatively slow, since each round requires reshuffling most of the input
- Primary goal is to minimize the number of rounds (a toy sketch of a single round follows this list)
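Since the list above describes MapReduce in terms of rounds and reshuffles, here is a toy, in-memory simulation of a single round, with word count as the running example; the helper names are mine for the sketch, not any real framework's API. The point is the shuffle: every round regroups the mappers' entire output by key, which is why rounds are expensive and why the round count is the quantity to minimize.

```python
from collections import defaultdict

# Toy, in-memory simulation of one MapReduce round (illustrative only).

def map_phase(partitions, mapper):
    return [kv for part in partitions for record in part for kv in mapper(record)]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:        # in a real system this is a cluster-wide resort
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the mapper emits (word, 1), the reducer sums the counts.
partitions = [["the cat sat"], ["the dog sat"]]
pairs = map_phase(partitions, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle(pairs), lambda w, vs: sum(vs))
print(counts)   # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```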
In a Pregel world:
- The input is partitioned across a large swath (millions) of compute tasks.
- The per-node memory is relatively small (MBs)
- Rounds are relatively quick, as long as each node communicates with few other nodes.
- Primary goal is to load balance the communication
In a PRAM world with unlimited parallelism:
- The input is in shared memory
- The per-node memory is extremely small (bytes to KBs)
- Rounds are very fast, but require a strong network infrastructure to support them
- Primary goal is to minimize the number of rounds
Of course this is not to say that load balancing isn't critical in MapReduce (it is!), or that the number of rounds doesn't matter in Pregel (it sure does!). But more often than not the "obvious" MapReduce algorithm will take too many rounds, and the "obvious" Pregel algorithm will deal poorly with very high-degree nodes.
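To see both failure modes concretely, here is a minimal vertex-centric sketch in a made-up, Pregel-flavored API (not the actual Pregel/Giraph interface): connected components by label propagation, where every vertex pushes the smallest label it has seen to all of its neighbors. Expressed as MapReduce rounds, this idea needs a number of rounds roughly proportional to the graph diameter; expressed as a vertex program, a vertex with millions of neighbors emits millions of messages in a single superstep, which is exactly the load-balancing problem above.

```python
# Made-up vertex-centric API, sketched for illustration only (not the real
# Pregel/Giraph interface). Connected components via label propagation.

class ComponentVertex:
    def __init__(self, vertex_id, neighbors):
        self.label = vertex_id          # start with your own id as the label
        self.neighbors = neighbors      # list of neighbor ids

    def compute(self, superstep, messages, send):
        smallest = min([self.label] + list(messages))
        if superstep == 0 or smallest < self.label:
            self.label = smallest
            # A vertex of degree d sends d messages here: one very
            # high-degree vertex can dominate the communication of a round.
            for nbr in self.neighbors:
                send(nbr, self.label)

# Tiny driver: a path 0-1-2 plus a hub vertex 3 connected to everyone.
graph = {0: [1, 3], 1: [0, 2, 3], 2: [1, 3], 3: [0, 1, 2]}
vertices = {v: ComponentVertex(v, nbrs) for v, nbrs in graph.items()}
inboxes = {v: [] for v in graph}
for superstep in range(4):
    outboxes = {v: [] for v in graph}
    for v, vertex in vertices.items():
        vertex.compute(superstep, inboxes[v],
                       lambda dest, value: outboxes[dest].append(value))
    inboxes = outboxes       # messages sent now are delivered next superstep
print({v: vtx.label for v, vtx in vertices.items()})   # all vertices end at label 0
```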
Although I lumped Pregel/Giraph and GPUs together above, they are very different beasts and expose a different axis of complexity, namely the amount of hierarchical thinking that the algorithm designer must worry about. On one extreme is BSP, which defines exact tradeoffs between the speeds of computation and communication among CPUs, caches, nearby machines, faraway machines, etc. On the other is MapReduce, where all of these (while important) are hidden from the programmer. GPUs lie somewhere in between: there are not many constraints on the amount of concurrency and shared memory, but those that are there must be rigorously followed, lest the algorithm become unimplementable.