MapReduce Explained: Why We Need It and How It Works
1. MapReduce
Why do we need it?
What is it?
My initial interaction with it
Joe Duimstra
Aug 6, 2015
2. Data Growth
There is an ENORMOUS amount of data out there and
it's growing exponentially!!
Example:
20+ billion web pages x 20KB = 400+ TB
1 computer reads 30-35 MB/sec from disk
~4 months to read the web
~1,000 hard drives to store the web
It takes even more to do something useful with the data!
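The back-of-envelope numbers above can be checked in a few lines. This sketch assumes ~400 GB per drive (implied by "~1,000 hard drives to store the web") and the higher 35 MB/sec read speed:

```python
# Back-of-envelope check of the slide's numbers (assumed values, not measurements)
pages = 20e9                 # 20+ billion web pages
page_size = 20e3             # 20 KB each, in bytes
total_bytes = pages * page_size          # 4e14 bytes = 400 TB

read_speed = 35e6            # one disk reads ~35 MB/sec
seconds = total_bytes / read_speed
months = seconds / (3600 * 24 * 30)      # roughly 4.4 months to read it all

drive_size = 400e9           # assuming ~400 GB drives
drives = total_bytes / drive_size        # ~1,000 drives to store it

print(total_bytes / 1e12, months, drives)
```

So a single machine needs months just to scan the data once, which is the motivation for spreading the work across many machines.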
3. What to do?
Option 1: Make a bigger, custom machine
–Custom = expensive
–Eventually even this machine will be too small
–Note: could be useful if cheap enough to be used with option 2
5. Parallel computing
● Use cheap, off-the-shelf servers and network equipment
● Moving data is expensive, so do the computation near the data
MapReduce (from Google) is one paradigm for parallel computing
● Use a distributed file system
–Hardware will fail
–Provide redundancy by replicating chunks across machines
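The map/shuffle/reduce phases can be sketched in a few lines of single-machine Python, using word count (the canonical MapReduce example). This is an illustration of the paradigm only, not Hadoop code; in a real cluster the framework runs mappers and reducers on many machines and performs the shuffle over the network:

```python
from collections import defaultdict

def map_phase(doc):
    # Mapper: emit (key, value) pairs -- here (word, 1) for each word
    for word in doc.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (the framework does this between phases)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: combine all values for one key -- here, sum the counts
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Because each mapper only sees its own input split and each reducer only sees one key's values, both phases parallelize naturally across cheap machines.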
7. My impressions
● I have virtually no experience with Java, so that's an initial barrier
● Hadoop seems pretty low-level
– Running multiple jobs is kind of a pain
– Seems like you need knowledge of the partitioning, sorting, and grouping implementations to optimize performance
– I believe abstractions (e.g. Cascading) exist?
● Spark:
– Should be faster than Hadoop since it works in memory and has lazy execution
– Seems like a more cohesive set of tools for parallel data processing
● Hadoop:
– Requires you to 'roll up your sleeves'