
In distributed ensemble model-building algorithms, the performance and statistical validity of the models
depend on the sizes of the input data partitions as well as on the distribution of records among the partitions.
Failure to select and preprocess the data correctly often results in models that are unstable and perform
poorly. This article introduces an optimized approach to building ensemble models for very
large data sets in distributed map-reduce environments using the Pass-Stream-Merge (PSM) algorithm. To
ensure model correctness, the input data is randomly distributed using the facilities built into map-reduce
frameworks.
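
The random distribution of records among partitions can be illustrated with a small single-process sketch. It is not the PSM algorithm itself and does not use any real map-reduce framework: the function names `map_assign_random_partition` and `shuffle_by_key`, the partition count, and the seed are all hypothetical choices made for illustration. The idea shown is only that a map step tags each record with a random key, and the framework's shuffle step then groups records by that key, so each partition receives an approximately uniform random sample of the input.

```python
import random
from collections import defaultdict


def map_assign_random_partition(records, num_partitions, seed=0):
    """Map step (illustrative): emit (random_key, record) pairs.

    The random key stands in for the partition a real framework would
    route the record to during its shuffle phase.
    """
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    for record in records:
        yield rng.randrange(num_partitions), record


def shuffle_by_key(keyed_records):
    """Shuffle step (illustrative): group records by their assigned key."""
    partitions = defaultdict(list)
    for key, record in keyed_records:
        partitions[key].append(record)
    return dict(partitions)


# Hypothetical input: 1000 records that arrive in sorted order.
records = list(range(1000))
partitions = shuffle_by_key(
    map_assign_random_partition(records, num_partitions=4)
)
sizes = {key: len(part) for key, part in sorted(partitions.items())}
print(sizes)
```

Because the keys are drawn independently of record order, a sorted or otherwise skewed input does not end up concentrated in one partition, which is the property the abstract relies on when it ties model validity to the distribution of records.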