MK99 – Big Data 1
MOOC lectures Pr. Clement Levallois
Focus on “Hadoop”
• Frequently mentioned in relation to big data
• Available definitions are often vague, and talk about it is often inflated
• This short video will clarify it.
• Note on the terminology:
– “computers” are called “servers” when they are just used
for computing / processing / storing data
– They have no screen, no mouse and no keyboard because
those are not needed.
– But they are basically computers!
• Created by Yahoo! engineers in ~ 2005. Named after the toy elephant of one of the
engineers' children.
• Made open source and now developed under the Apache Software Foundation, a major
open source developer community. So you will sometimes see it called “Apache Hadoop”.
• In simple words:
– Hadoop is a free, open source software.
– It serves to connect several servers, so that a single task can be accomplished in parallel on
all of them.
– So, with Hadoop and 5 servers, a data crunching task can finish up to about 5 times sooner
than if you had used just one server.
– That’s it!
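To make the idea concrete, here is a minimal sketch in Python of one task divided among several workers and then combined. It is not Hadoop code and does not use Hadoop's API: the function names (split_into_chunks, crunch, run_on_servers) and the purchase data are made up for illustration, and the "servers" are simulated with threads on a single computer.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous parts of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def crunch(chunk):
    """The work one simulated 'server' does on its share of the data (a sum)."""
    return sum(chunk)


def run_on_servers(items, n_servers):
    """Give each simulated server one chunk, then combine the partial results."""
    chunks = split_into_chunks(items, n_servers)
    with ThreadPoolExecutor(max_workers=n_servers) as pool:
        partial_results = list(pool.map(crunch, chunks))
    return sum(partial_results)


# Hypothetical data: 100 purchase amounts, shared among 5 simulated servers.
purchases = list(range(1, 101))
total = run_on_servers(purchases, n_servers=5)
print(total)
```

Each simulated server only handles 20 of the 100 values, which is where the time saving comes from on a real cluster. Threads on one computer will not show that speed-up for a sum this small; the sketch only shows the "divide, process, combine" pattern.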
Why are Hadoop, cloud computing and big data
often discussed together?
– Imagine that you are Walmart and want to compute something on your CRM: say, which
clients are the most profitable for each store, based on their purchase history.
– You will need many servers to store the data, and many servers to do the computations.
– Instead of purchasing a farm of servers for this (expensive! time consuming!), you can pay for
a service of cloud computing (such as Amazon AWS EC2) to rent servers just for this task,
– And install Hadoop on these servers to divide the task among all servers and get it to run in
parallel, speeding up computation times.
– You will get the results in minutes or hours, instead of days.
– “Map/reduce” is also an expression often discussed in relation with cloud computing and
big data.
– This is a principle of programming perfected by engineers at Google around 2004, and made
public in a research paper.
– It is a principle that solves this problem: when I have data spread on 500 different servers,
how do I search for some data on all the servers? Checking all servers one by one (sequential
search) would take a very long time. MapReduce dispatches the search to all servers at once,
so in the ideal case it is up to about 500 times quicker than a sequential search.
– Any software can use this principle of programming. MapReduce is at the heart of Hadoop,
which is one of the most popular pieces of software using it.
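The principle can be sketched in a few lines of Python. This is an illustration only, not Hadoop's or Google's actual code: the client names, the amounts and the function names (map_phase, shuffle, reduce_phase) are assumptions, and the three "servers" are just three lists on one computer. The example reuses the Walmart case: total purchases per client, when the purchase records are spread over several servers.

```python
from collections import defaultdict

# Hypothetical purchase records (client, amount), spread over three servers.
servers = [
    [("alice", 30.0), ("bob", 12.5)],
    [("alice", 20.0), ("carol", 7.0)],
    [("bob", 2.5), ("carol", 3.0)],
]


def map_phase(records):
    """Runs on each server: emit one (key, value) pair per record."""
    return [(client, amount) for client, amount in records]


def shuffle(mapped_outputs):
    """Group the emitted values by key, across all servers."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values of one key into a single result."""
    return key, sum(values)


# On a real cluster the map step runs on all servers at the same time;
# here it runs one list after the other, for simplicity.
mapped = [map_phase(records) for records in servers]
grouped = shuffle(mapped)
totals = dict(reduce_phase(key, values) for key, values in grouped.items())
print(totals)
```

The point of the design is that map_phase only ever looks at the data of one server, so all servers can work at once; only the small grouped results need to travel between servers before the reduce step.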
What is the business relevance?
• Hadoop made it possible to process large amounts of data quickly, using free
software.
• It enables business models where intensive data crunching is necessary to create
value:
– Amazon computing book recommendations for you,
– Walmart offering personalized coupons,
– NYT showing personalized display ads,
– Waze (driving app) showing the state of traffic on your road in real time,
– your electricity utility company computing how much electricity should be generated at peak
times.