patternMinor
Problem Solving in MapReduce Framework
Viewed 0 times
problemsolvingframeworkmapreduce
Problem
I am looking for good resources which have classified and solved typical large scale data processing in MapReduce framework (like graph algorithms, statistics, numerical algorithms ...). Any help is appreciated!
Solution
Mapreduce does not seem to have received significant academic attention and/or widespread enthusiasm nor does it seem to have yet been integrated into much of the large previously-existing literature and theory on parallelism. This is partly due to it being relatively new, but there may be other reasons.
One possible angle: this is apparently not widely acknowledged or recognized but MapReduce seems mainly oriented towards & optimized for what are known as embarrassingly parallel problems where the problem space can easily be partitioned into independent parts requiring virtually no intercommunication, i.e. the "map" step. (this is partly the basis of some criticism of the simplicity of the model as in the Wikipedia article.)
The "reduce" step is a communication-heavy step of merging/combining/aggregating the separate results into a combined structure. [1] are slides on the background & structure of typical embarrassingly parallel problems, which classic examples are:
Another angle is that MapReduce has some similarities to the "master-slave" design pattern where the "master" distributes jobs and also handles collecting the results, and some surveys on "master-slave" decomposition of parallel problems may be available.
Embarrassingly parallel problems are also naturally "finely granular" and Mapreduce seems especially suited to "fine-grained" parallelism problems (such that the "fine grains" can be highly distributed over many computing machines). "Fine-grained" parallelism is possibly not studied as much from an academic point of view because its intrinsically theoretically less problematic.
Another angle: it's similar to "Bulk Synchronous Parallel" [4]. The reduce step is similar to "barrier synchronization".
A survey from 2006 by academic parallelism experts [2] that was meant to be very comprehensive and forward-looking at the time identified 7 widespread "dwarves" or basic parallelism structures found in algorithms (Section 3.2). Conspicuously and surprisingly, none have a MapReduce-like structure; i.e. MapReduce type problems are absent from the survey (but note Monte Carlo methods are one of the "dwarves" and is again regarded as embarrassingly parallel).
This would also tend to support the view that MapReduce is a kind of "toy" parallelism model. However! that point is not meant to detract from its widespread usefulness in many applications. Particularly with parallelism there can be great value in simpler models that avoid complexity and thereby also various pitfalls.
[1] Embarrassingly parallel overview
[2] The landscape of parallel computing: the view from Berkeley
[3] Current parallel models for computation tcs.se
[4] Is the mapreduce framework a type of BSP tcs.se
One possible angle: this is apparently not widely acknowledged or recognized but MapReduce seems mainly oriented towards & optimized for what are known as embarrassingly parallel problems where the problem space can easily be partitioned into independent parts requiring virtually no intercommunication, i.e. the "map" step. (this is partly the basis of some criticism of the simplicity of the model as in the Wikipedia article.)
The "reduce" step is a communication-heavy step of merging/combining/aggregating the separate results into a combined structure. [1] are slides on the background & structure of typical embarrassingly parallel problems, which classic examples are:
- image processing,
- fractal calculations e.g. Mandelbrot set, and
- Monte Carlo calculations.
Another angle is that MapReduce has some similarities to the "master-slave" design pattern where the "master" distributes jobs and also handles collecting the results, and some surveys on "master-slave" decomposition of parallel problems may be available.
Embarrassingly parallel problems are also naturally "finely granular" and Mapreduce seems especially suited to "fine-grained" parallelism problems (such that the "fine grains" can be highly distributed over many computing machines). "Fine-grained" parallelism is possibly not studied as much from an academic point of view because its intrinsically theoretically less problematic.
Another angle: it's similar to "Bulk Synchronous Parallel" [4]. The reduce step is similar to "barrier synchronization".
A survey from 2006 by academic parallelism experts [2] that was meant to be very comprehensive and forward-looking at the time identified 7 widespread "dwarves" or basic parallelism structures found in algorithms (Section 3.2). Conspicuously and surprisingly, none have a MapReduce-like structure; i.e. MapReduce type problems are absent from the survey (but note Monte Carlo methods are one of the "dwarves" and is again regarded as embarrassingly parallel).
This would also tend to support the view that MapReduce is a kind of "toy" parallelism model. However! that point is not meant to detract from its widespread usefulness in many applications. Particularly with parallelism there can be great value in simpler models that avoid complexity and thereby also various pitfalls.
[1] Embarrassingly parallel overview
[2] The landscape of parallel computing: the view from Berkeley
[3] Current parallel models for computation tcs.se
[4] Is the mapreduce framework a type of BSP tcs.se
Context
StackExchange Computer Science Q#10453, answer score: 5
Revisions (0)
No revisions yet.