Hands On Big Data: 
Getting Started With 
NoSQL And Hadoop 
Mario Cartia 
mario@big-data.ninja
Big Data Facts 
• Google processes about 20 PB (petabytes, 
10^15 bytes) of data each day 
• About 5 EB (exabytes, 10^18 bytes) of data 
in the world. 90% was generated over the last 
2 years 
• Wearable computing and IoT…
Big Data: 3V Model 
• Big Data is not only about volume 
–Volume 
>= Petabytes, not Gigabytes 
–Variety 
Structured and unstructured data 
–Velocity 
Real-time or near real-time
Big Data 
Risk
Big Data 
Opportunity
Big Data Facts
Big Data Success Stories 
Amazon.com, a pioneer of targeted 
advertising, became a big data user when Greg 
Linden, one of its software engineers, realized 
the potential of computer-generated book 
recommendations over the average results 
of the company's in-house review project 
When Amazon compared the sales driven by the 
computer-generated recommendations against those 
driven by the in-house reviews, the results were much 
better for the data-derived 
material, and this revolutionized e-commerce
Big Data Success Stories 
Google Flu Trends is a web service 
operated by Google. It provides 
estimates of influenza activity for more 
than 25 countries. By aggregating 
Google search queries, it attempts to 
make accurate predictions about flu 
activity 
During the 2009 flu pandemic, Google Flu 
Trends tracked information about flu in 
the United States. In February 2010, the 
CDC identified influenza cases spiking in 
the mid-Atlantic region of the United 
States. However, Google’s data on 
search queries about flu symptoms 
showed that same spike two weeks 
before the CDC report was released
Big Data Success Stories 
reCAPTCHA is a user-dialogue system originally 
developed by Luis von Ahn, Ben Maurer, Colin 
McMillen, David Abraham and Manuel Blum at 
Carnegie Mellon University's main Pittsburgh 
campus, and acquired by Google in September 
2009 
The reCAPTCHA service supplies subscribing 
websites with images of words that optical 
character recognition (OCR) software has been 
unable to read. The subscribing websites present 
these images for humans to decipher as 
CAPTCHA words, as part of their normal validation 
procedures. They then return the results to the 
reCAPTCHA service, which sends the results to 
the digitization projects 
Secondary data usage
Big Data Techniques 
•Data Warehouse 
•Data Visualization 
•Statistics 
•Data Mining 
•Business Intelligence 
•Prediction 
•Machine Learning 
•Advanced Analytics 
•Correlation Analysis
The Traditional Approach 
ETL: Extract, Transform, Load 
•Extracts data from outside sources 
•Transforms it to fit operational needs, which 
can include quality levels 
•Loads it into the end target (database, 
operational data store, data mart or data 
warehouse) 
Does it fit “big data” needs?
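To make the three phases concrete, here is a toy single-JVM sketch (class name, records, and targets are invented for illustration; a real ETL job reads from external databases and files):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy ETL pipeline: all names and data here are invented for illustration.
public class ToyEtl {
    public static void main(String[] args) {
        // Extract: read raw records from an outside source (here, a hard-coded list)
        List<String> raw = Arrays.asList(" alice,IT ", "bob,HR", " carol,IT");

        Map<String, String> warehouse = new HashMap<>();
        for (String record : raw) {
            // Transform: trim whitespace and normalize case (a stand-in for quality rules)
            String[] fields = record.trim().split(",");
            // Load: write the cleaned record into the end target (here, a map)
            warehouse.put(fields[0].trim().toLowerCase(), fields[1].trim().toUpperCase());
        }
        System.out.println(warehouse.get("alice")); // prints IT
    }
}
```

The pattern is the same at any scale; the open question the slide poses is whether a single transform-and-load stage like this keeps up once the source is petabytes rather than three strings.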
Hadoop Basics 
Apache Hadoop is an open-source 
software framework for distributed 
storage and distributed processing of 
Big Data on clusters of 
commodity hardware
Hadoop Basics 
Hadoop was created by Doug 
Cutting and Mike Cafarella in 2005. 
Cutting, who was working at 
Yahoo! at the time, named it after 
his son's toy elephant
Hadoop 1 vs. Hadoop 2
Hadoop Distributions
Hadoop Market
Hadoop vs. RDBMS
From RDBMS to NoSQL 
A NoSQL (often interpreted as Not 
Only SQL) database provides a 
mechanism for storage and 
retrieval of data that is modeled by 
means other than the tabular 
relations used in relational 
databases
From RDBMS to NoSQL 
Motivations for this approach include 
simplicity of design, horizontal 
scaling and finer control over 
availability. The data structure (e.g. 
key-value, graph, or document) differs 
from the RDBMS, and therefore some 
operations are faster in NoSQL and 
some in RDBMS
NoSQL Approaches 
Most popular NoSQL database types 
•Document (MongoDB, CouchDB, Clusterpoint, 
Couchbase, MarkLogic, etc.) 
•Key-value (Redis, MemcacheDB, Dynamo, 
FoundationDB, Riak, FairCom c-treeACE, 
Aerospike, etc.) 
•Column (Accumulo, Cassandra, Druid, HBase, 
Vertica, etc.) 
•Graph (Allegro, Neo4J, InfiniteGraph, OrientDB, 
Virtuoso, Stardog, etc.)
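To make the key-value model concrete, here is a minimal in-memory sketch in Java (illustrative only, not a client for Redis or any other listed product). Lookup by key is fast; there is no query language over the values:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal key-value store sketch; the key naming scheme is invented for illustration.
public class ToyKeyValueStore {
    private final Map<String, String> store = new HashMap<>();

    // The whole API surface of the key-value model: put and get by key
    public void put(String key, String value) { store.put(key, value); }
    public String get(String key) { return store.get(key); }

    public static void main(String[] args) {
        ToyKeyValueStore kv = new ToyKeyValueStore();
        kv.put("user:42:name", "Mario"); // composite keys stand in for table rows
        System.out.println(kv.get("user:42:name")); // prints Mario
    }
}
```

Document, column, and graph stores refine this idea by giving the value internal structure the database can index and traverse.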
NoSQL Approaches
CAP theorem (Brewer) 
NoSQL: How To Choose
Hadoop Architecture 
Overview
Hadoop Core Components
MapReduce Model 
• MapReduce is a programming model, and an 
associated implementation, for processing and 
generating large data sets with a parallel, 
distributed algorithm on a cluster 
• The model is inspired by the map and reduce 
functions commonly used in functional 
programming, although their purpose in the 
MapReduce framework is not the same as in 
their original forms
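Those functional originals are easy to show in plain Java on a single JVM (no Hadoop involved): map transforms each element independently, reduce folds the results into one value:

```java
import java.util.Arrays;
import java.util.List;

// map and reduce as used in functional programming; no cluster, just streams.
public class FunctionalOrigins {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("big", "data", "hadoop");

        // map: each word -> its length, every element processed independently
        // reduce: fold the lengths into a single sum
        int totalChars = words.stream()
                              .map(String::length)
                              .reduce(0, Integer::sum);
        System.out.println(totalChars); // 3 + 4 + 6 = 13
    }
}
```

In the MapReduce framework the two functions additionally work on (key, value) pairs and run on many machines, which is why their purpose differs from these originals.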
MapReduce Paper
MapReduce Overview 
• Map step: Each worker node applies the map() 
function to the local data, and writes the output to a 
temporary storage. A master node orchestrates that 
for redundant copies of input data, only one is 
processed 
• Shuffle step: Worker nodes redistribute data based 
on the output keys (produced by the map() function), 
such that all data belonging to one key is located on 
the same worker node 
• Reduce step: Worker nodes now process each 
group of output data, per key, in parallel
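Of the three steps, the shuffle is the one with no analogue in ordinary programs. On a single JVM it reduces to a group-by on the map output keys; the pairs below are invented for illustration:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;

// Shuffle as a group-by: all pairs with the same key end up together,
// the way Hadoop routes them to the same worker node.
public class ShuffleSketch {
    public static void main(String[] args) {
        // Output of the map step: (key, value) pairs from different workers
        List<Entry<String, Integer>> mapOutput = List.of(
            new SimpleEntry<>("hello", 1),
            new SimpleEntry<>("big", 1),
            new SimpleEntry<>("hello", 1));

        // Shuffle: regroup by key so each reducer sees one key's values
        Map<String, List<Integer>> shuffled = mapOutput.stream()
            .collect(Collectors.groupingBy(Entry::getKey,
                     Collectors.mapping(Entry::getValue, Collectors.toList())));
        System.out.println(shuffled.get("hello")); // [1, 1]
    }
}
```

On a cluster this regrouping moves data across the network, which is why shuffle volume is usually the expensive part of a real job.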
Map Reduce: A really simple 
introduction 
Dear <Your Name>, 
As you know we are building the blogging platform 
blogger2.com, and I need some statistics. I need to find out, 
across all blogs ever written on blogger.com, how many times 
1-character words occur (like 'a', 'I'), how many times 
two-character words occur (like 'be', 'is')… and so on, up to 
how many times ten-character words occur. 
I know it's a really big job. So, I will assign all 50,000 
employees working in our company to work with you on this for 
a week. I am going on a vacation for a week, and it's really 
important that I have this when I return. Good luck. 
regds, 
The CEO 
(src: http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/)
Map Reduce: A really 
simple introduction 
The next day, you stand with a mic on the dais 
before the 50,000 and proclaim: for a week, you will all be 
divided into several groups: 
•The Mappers (tens of thousands of people will be in 
this group) 
•The Grouper (assume just one guy for now) 
•The Reducers (around 10 of them), and… 
•The Master (that's you)
Map Reduce: A really 
simple introduction 
• Each mapper will get a set of 50 blog URLs and a really 
big sheet of paper. Each of you needs to go to 
each of those URLs and, for each word in those blogs, 
write one line on the paper. The format of that line 
should be the number of characters in the word, then a 
comma, and then the actual word 
• For example, if you find the word “a”, you write “1,a” on 
a new line in your paper, since the word “a” has only 1 
character. If you find the word “hello”, you write 
“5,hello” on a new line
Map Reduce: A really 
simple introduction 
Each of you takes 4 days. So, after 4 days, your sheet might 
look like this: 
•“1,a” 
•“5,hello” 
•“2,if” 
•… and a million more lines 
At the end of the 4th day, each one of you will give 
your completely filled sheet to the Grouper
Map Reduce: A really 
simple introduction 
• I will give you 10 sheets. The first sheet will be marked 
1, the second sheet will be marked 2, and so on, up to 10 
• You collect the output from the mappers and, for each line in 
a mapper's sheet, if it starts with “1,”, you write the word on 
sheet 1; if it starts with “2,”, you write it on sheet 2 
• For example, if the first line of a mapper's sheet says 
“1,a”, you write “a” on sheet 1. If it says “2,if”, you write 
“if” on sheet 2. If it says “5,hello”, you write “hello” on 
sheet 5
Map Reduce: A really 
simple introduction 
So at the end of your work, the 10 sheets you have might look like 
this: 
•Sheet 1: a, a, a, I, I, i, a, i, i, i… millions more 
•Sheet 2: if, of, it, of, of, if, at, im, is, is, of, of… millions more 
•Sheet 3: the, the, and, for, met, bet, the, the, and… millions more 
•… 
•Sheet 10: … 
Once you are done, you distribute each sheet to one reducer. For 
example sheet 1 goes to reducer 1, sheet 2 goes to reducer 2, and 
so on.
Map Reduce: A really 
simple introduction 
• Each one of you gets one sheet from the grouper. For each 
sheet you count the number of words written on it and write it in 
big bold letters on the back side of the paper. 
• For example, if you are reducer 2, you get sheet 2 from the grouper; it 
looks like this: 
“Sheet 2: if, of, it, of, of, if, at, im, 
is, is, of, of…” 
• You count the number of words on that sheet; say the number of 
words is 28838380044. You write it on the back side of the paper, 
in big bold letters, and give it to the Master
Map Reduce: A really 
simple introduction 
You essentially did MapReduce. The greatest advantage 
of your approach was this: 
•The mappers can work independently 
•The reducers can work independently 
•The grouper can work really fast, because he didn't have 
to do any counting of words; all he had to do was look 
at the first number and put that word on the appropriate 
sheet 
The process can be easily applied to other kinds 
of problems
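The whole story fits in a short program: a single JVM plays all the mappers, the grouper, and the ten reducers. The blog texts below are invented; the structure (tag each word with its length, group by length, count per group) is exactly the one described above:

```java
import java.util.HashMap;
import java.util.Map;

// The mapper/grouper/reducer story in one JVM; the input texts are invented.
public class WordLengthStory {
    public static void main(String[] args) {
        String[] blogs = { "I have a hello", "a big hello to all" };

        // Mappers: emit (length, word); Grouper: one "sheet" per length;
        // Reducers: count the entries on each sheet. merge() does all three here.
        Map<Integer, Integer> sheetCounts = new HashMap<>();
        for (String blog : blogs) {
            for (String word : blog.split("\\s+")) {
                sheetCounts.merge(word.length(), 1, Integer::sum);
            }
        }
        System.out.println(sheetCounts.get(1)); // 'I', 'a', 'a' -> 3
        System.out.println(sheetCounts.get(5)); // 'hello', 'hello' -> 2
    }
}
```

The distributed version splits only the outer loop across machines; the per-word logic is unchanged, which is why the mappers never need to talk to each other.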
Map Reduce: formal 
definition 
The Map and Reduce functions of 
MapReduce are both defined with respect to 
data structured in (key, value) pairs. 
Map takes one pair of data with a type in 
one data domain, and returns a list of pairs 
in a different domain: 
•Map(k1, v1) → list(k2, v2)
Map Reduce: formal 
definition 
The Map function is applied in parallel to every 
pair in the input dataset 
This produces a list of pairs for each call 
After that, the MapReduce framework collects 
all pairs with the same key from all lists and 
groups them together, creating one group for 
each key
Map Reduce: formal 
definition 
The Reduce function is then applied in parallel to 
each group, which in turn produces a collection of 
values in the same domain: 
•Reduce(k2, list(v2)) → list(v3)
Each Reduce call typically produces either one value 
v3 or an empty return, though one call is allowed to 
return more than one value. The returns of all calls 
are collected as the desired result list
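Instantiated for word count, the signatures read: k1 is a line offset, v1 a line of text, k2 a word, v2 and v3 counts. A single-machine sketch of the two functions (plain Java, no Hadoop types) might look like:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// Word count written as the two formal functions; single-JVM sketch only.
public class FormalWordCount {
    // Map(k1, v1) -> list(k2, v2): (line offset, line) -> list of (word, 1)
    static List<Entry<String, Integer>> map(long offset, String line) {
        List<Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce(k2, list(v2)) -> list(v3): (word, counts) -> one total per word
    static List<Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return List.of(sum);
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "big data big")); // [big=1, data=1, big=1]
        System.out.println(reduce("big", List.of(1, 1))); // [2]
    }
}
```

The Hadoop job on the next slides implements exactly these two functions against the framework's Writable types.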
MapReduce job example 
package org.myorg; 
import java.io.IOException; 
… 
public class WordCount { 
public static class Map extends MapReduceBase implements Mapper<LongWritable, 
Text, Text, IntWritable> { 
private final static IntWritable one = new IntWritable(1); 
private Text word = new Text(); 
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, 
Reporter reporter) throws IOException { 
String line = value.toString(); 
StringTokenizer tokenizer = new StringTokenizer(line); 
while (tokenizer.hasMoreTokens()) { 
word.set(tokenizer.nextToken()); 
output.collect(word, one); // emit the pair (word, 1) for every token 
} 
} 
}
MapReduce job example 
public static class Reduce extends MapReduceBase implements Reducer<Text, 
IntWritable, Text, IntWritable> { 
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, 
IntWritable> output, Reporter reporter) throws IOException { 
int sum = 0; 
while (values.hasNext()) { 
sum += values.next().get(); // add up the 1s emitted for this word 
} 
output.collect(key, new IntWritable(sum)); 
} 
}
MapReduce job example 
public static void main(String[] args) throws Exception { 
JobConf conf = new JobConf(WordCount.class); 
conf.setJobName("wordcount"); 
conf.setOutputKeyClass(Text.class); 
conf.setOutputValueClass(IntWritable.class); 
conf.setMapperClass(Map.class); 
conf.setCombinerClass(Reduce.class); // pre-aggregates map output locally 
conf.setReducerClass(Reduce.class); 
conf.setInputFormat(TextInputFormat.class); 
conf.setOutputFormat(TextOutputFormat.class); 
FileInputFormat.setInputPaths(conf, new Path(args[0])); 
FileOutputFormat.setOutputPath(conf, new Path(args[1])); 
JobClient.runJob(conf); 
} 
}
Machine Learning 
Machine learning is a scientific 
discipline that deals with the construction 
and study of algorithms that can learn 
from data. Such algorithms operate by 
building a model based on inputs and 
using that to make predictions or 
decisions, rather than following only 
explicitly programmed instructions
Machine Learning 
Machine learning can be 
considered a subfield of computer 
science and statistics. It has strong 
ties to artificial intelligence and 
optimization, which deliver 
methods, theory and application 
domains to the field
Machine Learning 
Example applications include 
spam filtering, optical character 
recognition (OCR), search engines 
and computer vision. Machine 
learning is sometimes conflated 
with data mining
Machine Learning 
Examples
Machine Learning 
Examples
Machine Learning Tools 
Apache Mahout is a project of the 
Apache Software Foundation to produce 
free implementations of distributed or 
otherwise scalable machine learning 
algorithms focused primarily in the areas 
of collaborative filtering, clustering and 
classification
Machine Learning Tools
Data Visualization 
Often-cited studies claim the brain 
processes images as much as 60,000x 
faster than text. The final 
step in your big data 
analytics workflow, the 
visualization is a visual 
representation of the 
insights gained from 
your analysis
Data Visualization Tools
Data Visualization Tools

Big Data, a space adventure - Mario Cartia - Codemotion Milan 2014
