2. Outline
• Big Data
• Hadoop
• Hadoop Cluster
• Hadoop Ecosystem
• HDFS
• MapReduce
• Demo
3. Big Data
• There’s no single definition of ‘big data’; it’s a very subjective term.
4. Big Data
• Most people would consider a data set of terabytes or more to be ‘big data’,
but many people use Hadoop with great success on smaller data sets than
that.
• One reasonable definition is that it’s data which can’t comfortably be
processed on a single machine.
5. The 3 V’s of Big Data
• Volume refers to the size of the data you’re dealing with.
• Variety refers to the fact that the data often comes from many different
sources and in many different formats.
• Velocity refers to the speed at which the data is being generated.
6. Hadoop
• The name and logo come from a toy elephant belonging to Doug Cutting’s son.
• Started as part of Nutch, a search engine project begun in 2003 by Doug
Cutting and Mike Cafarella.
• Implemented ideas from Google’s white papers on the Google File System
and MapReduce.
• Backed by Yahoo from 2006, when Hadoop became its own open-source project.
• Also in 2006, Hadoop 0.1.0 was released.
7. Hadoop Cluster
The core Hadoop project consists of a way to store data, known as the
Hadoop Distributed File System, or HDFS, and a way to process the data,
called MapReduce. The key concept is that we split the data up and store it
across a collection of machines, known as a cluster. Then, when we want to
process the data, we process it where it’s actually stored: rather than
retrieving the data from a central server, we process it in place on the
cluster.
[Diagram: Store in HDFS → Process with MapReduce]