This document provides an introduction to big data concepts. It discusses the limitations of traditional architectures when data grows by petabytes each week and arrives from diverse sources. Apache Hadoop is presented as a solution: a clustered architecture that is scalable, flexible, and cost-efficient. Key aspects of Hadoop covered include its use of commodity hardware, storage of data across clusters of nodes, and benchmarks for sorting large datasets efficiently.
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |... - Edureka!
This Edureka "What is Hadoop" Tutorial (check our hadoop blog series here: https://goo.gl/lQKjL8) will help you understand all the basics of Hadoop. Learn about the differences in traditional and hadoop way of storing and processing data in detail. Below are the topics covered in this tutorial:
1) Traditional Way of Processing - SEARS
2) Big Data Growth Drivers
3) Problem Associated with Big Data
4) Hadoop: Solution to Big Data Problem
5) What is Hadoop?
6) HDFS
7) MapReduce
8) Hadoop Ecosystem
9) Demo: Hadoop Case Study - Orbitz
Subscribe to our channel to get updates.
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ... - Simplilearn
This video on Hadoop interview questions part-1 will take you through the general Hadoop questions and questions on HDFS, MapReduce and YARN, which are very likely to be asked in any Hadoop interview. It covers all the topics on the major components of Hadoop. This Hadoop tutorial will give you an idea about the different scenario-based questions you could face and some multiple-choice questions as well. Now, let us dive into this Hadoop interview questions video and gear up for your next Hadoop interview.
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
This document provides an introduction and overview of HDFS and MapReduce in Hadoop. It describes HDFS as a distributed file system that stores large datasets across commodity servers. It also explains that MapReduce is a framework for processing large datasets in parallel by distributing work across clusters. The document gives examples of how HDFS stores data in blocks across data nodes and how MapReduce utilizes mappers and reducers to analyze datasets.
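As a concrete illustration of the mapper and reducer roles just described, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the class names and the word-count task itself are illustrative assumptions, not code taken from the document being summarized.

```java
// WordCount.java - illustrative mapper and reducer for a classic word-count job.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: each call receives one line of an input split (stored as HDFS blocks)
// and emits an intermediate (word, 1) pair for every token in the line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: receives all intermediate values emitted for the same word
// (grouped and shuffled by the framework) and sums them into a final count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```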
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations - William Yetman
This was one of my first presentations on Big Data at Ancestry.com. The audience was split between Family Historians interested in the Technology and Developers interested in our Big Data Story. So the presentation is a mix. I think there is plenty for someone with an interest in technology and enough meat for a "technologist".
Keep this in mind as you look at this presentation.
Thanks,
-Bill-
The document outlines the steps to set up a Hadoop cluster and run a MapReduce job across the cluster. It describes cloning Hadoop from the master node to two slave nodes, configuring settings like the hosts file and SSH keys for access. The document then details formatting the HDFS, starting services on all nodes, importing data and running a sample MapReduce word count job on the cluster. Finally, it discusses stopping the Hadoop services on all nodes to shut down the cluster.
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData... - Mahantesh Angadi
This document provides an introduction to big data and the installation of a single-node Apache Hadoop cluster. It defines key terms like big data, Hadoop, and MapReduce. It discusses traditional approaches to handling big data like storage area networks and their limitations. It then introduces Hadoop as an open-source framework for storing and processing vast amounts of data in a distributed fashion using the Hadoop Distributed File System (HDFS) and MapReduce programming model. The document outlines Hadoop's architecture and components, provides an example of how MapReduce works, and discusses advantages and limitations of the Hadoop framework.
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec... - Ashok Royal
Big Data Hadoop, its components, and a Hadoop project are described in detail.
Visit http://hadoop-beginners.blogspot.com to see Hadoop Tutorials.
Thanks for the visit. :)
This presentation provides an overview of Hadoop, including:
- A brief history of data and the rise of big data from various sources.
- An introduction to Hadoop as an open source framework used for distributed processing and storage of large datasets across clusters of computers.
- Descriptions of the key components of Hadoop - HDFS for storage, and MapReduce for processing - and how they work together in the Hadoop architecture.
- An explanation of how Hadoop can be installed and configured in standalone, pseudo-distributed and fully distributed modes.
- Examples of major companies that use Hadoop like Amazon, Facebook, Google and Yahoo to handle their large-scale data and analytics needs.
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo... - Edureka!
This Hadoop Tutorial on Hadoop Interview Questions and Answers ( Hadoop Interview Blog series: https://goo.gl/ndqlss ) will help you to prepare yourself for Big Data and Hadoop interviews. Learn about the most important Hadoop interview questions and answers and know what will set you apart in the interview process. Below are the topics covered in this Hadoop Interview Questions and Answers Tutorial:
Hadoop Interview Questions on:
1) Big Data & Hadoop
2) HDFS
3) MapReduce
4) Apache Hive
5) Apache Pig
6) Apache HBase and Sqoop
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
#HadoopInterviewQuestions #BigDataInterviewQuestions #HadoopInterview
2014 feb 24_big_datacongress_hadoopsession1_hadoop101 - Adam Muise
This document provides an introduction to Hadoop using the Hortonworks Sandbox virtual machine. It discusses how Hadoop was created to address the limitations of traditional data architectures for handling large datasets. It then describes the key components of Hadoop like HDFS, MapReduce, YARN and Hadoop distributions like Hortonworks Data Platform. The document concludes by explaining how to get started with the Hortonworks Sandbox VM which contains a single node Hadoop cluster within a virtual machine, avoiding the need to install Hadoop locally.
Hadoop Hands-on Lab: Installing Hadoop 2 - IMC Institute
This document is the agenda for a hands-on workshop on Big Data using Hadoop. It includes an introduction to Big Data concepts, the Hadoop ecosystem, and instructions for installing Hadoop on an Amazon EC2 virtual server in pseudo-distributed mode. The workshop agenda covers launching an EC2 instance, installing Java, downloading and extracting Hadoop, configuring Hadoop, formatting the namenode, and starting the Hadoop processes.
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs - IMC Institute
This document provides an overview of a hands-on workshop on running Hadoop on Amazon Elastic MapReduce. It discusses setting up an AWS account, signing up for necessary services like S3 and EC2, creating an S3 bucket, generating access keys, creating a new EMR job flow, and viewing results from the S3 bucket. It also covers installing and running Hadoop locally, importing and reviewing data in HDFS, and the MapReduce programming model.
10 Popular Hadoop Technical Interview Questions - ZaranTech LLC
The document discusses 10 common questions that may be asked in a Hadoop technical interview. It provides definitions for big data and the four V's of big data (volume, variety, veracity, velocity). It also discusses how businesses use big data analytics to increase revenue, examples of companies that use Hadoop, the difference between structured and unstructured data, the concepts that Hadoop works on (HDFS and MapReduce), core Hadoop components, hardware requirements for running Hadoop, common input formats, and some common Hadoop tools. Overall, the document outlines essential information about big data and Hadoop that may be helpful to review for a technical interview.
This document provides an overview of 4 solutions for processing big data using Hadoop and compares them. Solution 1 involves using core Hadoop processing without data staging or movement. Solution 2 uses BI tools to analyze Hadoop data after a single CSV transformation. Solution 3 creates a data warehouse in Hadoop after a single transformation. Solution 4 implements a traditional data warehouse. The solutions are then compared based on benefits like cloud readiness, parallel processing, and investment required. The document also includes steps for installing a Hadoop cluster and running sample MapReduce jobs and Excel processing.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
This document provides an overview of a masterclass on big data presented by Prof.dr.ir. Arjen P. de Vries. It discusses defining properties of big data, challenges in big data analytics including capturing, aligning, transforming, modeling and understanding large datasets. It also briefly introduces map-reduce and streaming data analysis. Examples of large datasets that could be analyzed are provided, such as the sizes of datasets from Facebook, Google and other organizations.
This document discusses big data and Hadoop. It notes that traditional technologies are not well-suited to handle the volume of data generated today. Hadoop was created by companies like Google and Yahoo to address this challenge through its distributed file system HDFS and processing framework MapReduce. The document promotes Hadoop and the Hortonworks Data Platform for storing, processing, and analyzing large volumes of diverse data in a cost-effective manner.
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce - Mahantesh Angadi
The document summarizes a technical seminar presentation on scheduling methods in the Hadoop MapReduce framework. The presentation covers the motivation for Hadoop and MapReduce, provides an introduction to big data and Hadoop, and describes HDFS and the MapReduce programming model. It then discusses challenges in MapReduce scheduling and surveys the literature on existing scheduling methods. The presentation surveys five papers on proposed MapReduce scheduling methods, summarizing the key points of each. It concludes that improving data locality can enhance performance and that future work could consider scheduling algorithms for heterogeneous clusters.
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi... - Edureka!
This Edureka Hadoop Administration Training tutorial will help you understand the functions of all the Hadoop daemons and the configuration parameters involved with them. It will also take you through a step-by-step multi-node Hadoop installation and will discuss all the configuration files in detail. Below are the topics covered in this tutorial:
1) What is Big Data?
2) Hadoop Ecosystem
3) Hadoop Core Components: HDFS & YARN
4) Hadoop Core Configuration Files
5) Multi Node Hadoop Installation
6) Tuning Hadoop using Configuration Files
7) Commissioning and Decommissioning the DataNode
8) Hadoop Web UI Components
9) Hadoop Job Responsibilities
This document provides an overview of installing and configuring Apache Hadoop. It begins with background on big data and Hadoop, including definitions of big data, the Hadoop ecosystem, and differences between Hadoop 1.0 and 2.0. It then discusses installing Hadoop, describing the steps to set up a Cloudera cluster on Amazon Web Services and requirements for installing Cloudera Manager. The document concludes with mentioning a lab to set up a Cloudera cluster on AWS.
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop... - Simplilearn
This presentation about Hadoop will help you understand what Big Data is, what Hadoop is, how Hadoop came into existence, the various components of Hadoop, and a Hadoop use case. A massive amount of data is generated every day, and it cannot be stored, processed, and analyzed using traditional approaches. That is why Hadoop came into existence as a solution for Big Data. Hadoop is a framework that manages Big Data storage in a distributed way and processes it in parallel. Now, let us get started and understand the importance of Hadoop and why we actually need it.
Below topics are explained in this Hadoop presentation:
1. The rise of Big Data
2. What is Big Data?
3. Big Data and its challenges
4. Hadoop as a solution
5. What is Hadoop?
6. Components of Hadoop
7. Use case of Hadoop
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses a simple programming model called MapReduce that automatically parallelizes and distributes work across nodes. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and MapReduce execution engine for processing. HDFS stores data as blocks replicated across nodes for fault tolerance. MapReduce jobs are split into map and reduce tasks that process key-value pairs in parallel. Hadoop is well-suited for large-scale data analytics as it scales to petabytes of data and thousands of machines with commodity hardware.
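To make the storage side of this description concrete, the sketch below writes a small file into HDFS through Hadoop's Java FileSystem API and requests an explicit replication factor and block size; the path, the 3-way replication, and the 128 MB block size are illustrative assumptions rather than values taken from the document.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath; fs.defaultFS
        // would normally point at the NameNode, e.g. hdfs://namenode:8020.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");  // hypothetical HDFS path

        // create(path, overwrite, bufferSize, replication, blockSize):
        // ask for 3 replicas per block and 128 MB blocks (common defaults).
        try (OutputStream out =
                fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // The NameNode records which DataNodes hold each replica of each block.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication=" + status.getReplication()
                + " blockSize=" + status.getBlockSize());
    }
}
```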
Big Data Programming Using Hadoop Workshop - IMC Institute
The document describes instructions for running a hands-on workshop on Hadoop and MapReduce. It includes steps for starting a Cloudera VM, importing and exporting data from HDFS, and writing a basic word count MapReduce program. The MapReduce process involves a map step that processes input records and outputs key-value pairs, and a reduce step that combines all intermediate values associated with the same key.
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals - Skillspeed
This Hadoop MapReduce tutorial will unravel MapReduce Programming, MapReduce Commands, MapReduce Fundamentals, Driver Class, Mapper Class, Reducer Class, Job Tracker & Task Tracker.
At the end, you'll have a strong grasp of Hadoop MapReduce basics.
PPT Agenda:
✓ Introduction to BIG Data & Hadoop
✓ What is MapReduce?
✓ MapReduce Data Flows
✓ MapReduce Programming
----------
What is MapReduce?
MapReduce is a programming framework for distributed processing of large datasets on commodity computing clusters. It is based on the principle of parallel data processing, wherein data is broken into smaller blocks rather than processed as a single block. This makes for a faster, more scalable solution. MapReduce programs are typically written in Java.
----------
What are MapReduce Components?
It has the following components (a minimal driver sketch that wires them together follows this list):
1. Combiner: The combiner collates all the data from the sample set based on your desired filters. For example, you can collate data based on day, week, month and year. After this, the data is prepared and sent for parallel processing.
2. Job Tracker: This allocates the data across multiple servers.
3. Task Tracker: This executes the program across various servers.
4. Reducer: It will isolate the desired output from across the multiple servers.
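As referenced above, here is a minimal driver sketch that wires these pieces into a runnable job, reusing the WordCountMapper and WordCountReducer classes from the earlier word-count example; registering the reducer as the combiner is a common word-count convention assumed here, not something prescribed by this tutorial. In classic MRv1 the Job Tracker and Task Trackers schedule and execute the resulting map and reduce tasks (YARN takes over that role in Hadoop 2.x).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: describes the job that the cluster (JobTracker/TaskTrackers in
// MRv1, YARN in MRv2) schedules as map and reduce tasks across the nodes.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // map phase
        job.setCombinerClass(WordCountReducer.class);  // optional local pre-aggregation
        job.setReducerClass(WordCountReducer.class);   // reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job built this way is typically packaged into a jar and submitted with a command along the lines of hadoop jar wordcount.jar WordCountDriver /input /output.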
----------
Applications of MapReduce
1. Data Mining
2. Document Indexing
3. Business Intelligence
4. Predictive Modelling
5. Hypothesis Testing
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor-led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
The document is a presentation on big data and Hadoop. It introduces the speaker, Adam Muise, and discusses the challenges of dealing with large and diverse datasets. Traditional approaches of separating data into silos are no longer sufficient. The presentation argues that a distributed system like Hadoop is needed to bring all data together and enable it to be analyzed as a whole.
Hadoop at the Center: The Next Generation of Hadoop - Adam Muise
This document discusses Hortonworks' approach to addressing challenges around managing large volumes of diverse data. It presents the Hortonworks Data Platform (HDP) as a solution for consolidating siloed data into a central data lake on a single cluster. This allows different data types and workloads like batch, interactive, and real-time processing to leverage shared services for security, governance and operations while preserving existing tools. The HDP also enables new use cases for analytics like real-time personalization and segmentation using diverse data sources.
This document discusses big data and Hadoop. It defines big data as large amounts of unstructured data that would be too costly to store and analyze in a traditional database. It then describes how Hadoop provides a solution to this challenge through distributed and parallel processing across clusters of commodity hardware. Key aspects of Hadoop covered include HDFS for reliable storage, MapReduce for distributed computing, and how together they allow scalable analysis of very large datasets. Popular users of Hadoop like Amazon, Yahoo and Facebook are also mentioned.
This document defines key terms related to big data such as structured data, unstructured data, and semi-structured data. It discusses how data is generated from various sources and factors like sensors, social networks, and online shopping. It explains that big data refers to data that is too large to process using traditional methods due to its volume, velocity, and variety. Hadoop is introduced as an open source framework that uses HDFS for distributed storage and MapReduce for distributed processing of large data sets across computer clusters.
This document provides an overview of Hadoop, a tool for processing large datasets across clusters of computers. It discusses why big data has become so large, including exponential growth in data from the internet and machines. It describes how Hadoop uses HDFS for reliable storage across nodes and MapReduce for parallel processing. The document traces the history of Hadoop from its origins in Google's file system GFS and MapReduce framework. It provides brief explanations of how HDFS and MapReduce work at a high level.
Big data refers to large volumes of data that are diverse in type and are produced rapidly. It is characterized by the V's: volume, velocity, variety, veracity, and value. Hadoop is an open-source software framework for distributed storage and processing of big data across clusters of commodity servers. It has two main components: HDFS for storage and MapReduce for processing. Hadoop allows for the distributed processing of large data sets across clusters in a reliable, fault-tolerant manner. The Hadoop ecosystem includes additional tools like HBase, Hive, Pig and Zookeeper that help access and manage data. Understanding Hadoop is a valuable skill as many companies now rely on big data and Hadoop technologies.
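As a small illustration of how ecosystem tools such as Hive expose Hadoop data, the sketch below runs a SQL query against HiveServer2 over JDBC; the connection URL, credentials, and the words table are hypothetical, and the org.apache.hive:hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, database and table are hypothetical.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "demo", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
            while (rs.next()) {
                // Hive compiles this query down to jobs on the cluster and
                // streams the aggregated results back over JDBC.
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```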
The document provides an introduction to big data and Hadoop. It defines big data as large datasets that cannot be processed using traditional computing techniques due to the volume, variety, velocity, and other characteristics of the data. It discusses traditional data processing versus big data and introduces Hadoop as an open-source framework for storing, processing, and analyzing large datasets in a distributed environment. The document outlines the key components of Hadoop including HDFS, MapReduce, YARN, and Hadoop distributions from vendors like Cloudera and Hortonworks.
This Big Data Analytics & Trends presentation discusses what big data is, why it is important, definitions of big data, the data types and landscape, and characteristics of big data such as volume, velocity, and variety. It covers data generation points, big data analytics, example scenarios, challenges of big data such as storage and processing speed, and Hadoop as a framework to solve these challenges. The presentation also differentiates between big data and data science, discusses salary trends in Hadoop/big data, and looks at the future growth of the big data market.
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014 - Josh Patterson
Josh Patterson is a principal solution architect who has worked with Hadoop at Cloudera and Tennessee Valley Authority. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. It allows for consolidating mixed data types at low cost while keeping raw data always available. Hadoop uses commodity hardware and scales to petabytes without changes. Its distributed file system provides fault tolerance and replication while its processing engine handles all data types and scales processing.
This presentation simplifies the concepts of big data, NoSQL databases, and Hadoop components.
The Original Source:
http://zohararad.github.io/presentations/big-data-introduction/
Hadoop and the Data Warehouse: Point/Counter Point - Inside Analysis
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
Learn About Big Data and Hadoop The Most Significant Resource - Assignment Help
Because of the digital revolution, data is now one of the most significant resources for businesses all around the world. The ability to gather, organize, process, and evaluate huge volumes of data has altered the way businesses function and arrive at informed decisions. Managing and gleaning insight from these ever-expanding oceans of information is impossible without Big Data and Hadoop, both of which are at the vanguard of this data revolution.
If you have chosen a programming language and are having difficulty writing a strong assignment, assessment help experts can assist you in learning more about it. In this blog, we will look at the basics of Big Data and Hadoop and how they work. We will also explore the nature of Big Data, its defining features, and the difficulties it presents, and take a look at how Hadoop, an open-source platform, has become a frontrunner in the race to solve the challenges posed by Big Data. To fully appreciate the transformative potential of Big Data and Hadoop for businesses across a wide range of sectors, it is necessary first to grasp the central role they play in today's data-driven decision-making.
Apache Big Data Europa - How to make money with your own data - Jorge Lopez-Malla
This document discusses how Stratio used big data technologies like Apache Spark to help a middle eastern telecommunications company with data challenges. It describes Stratio as the first Spark-based big data platform and discusses how they helped the telco process over 9.5 million daily events from 9.2 million customers. Specifically, Stratio used Spark and its machine learning library MLLib to build models from millions of data points to recognize patterns and improve network coverage, gather customer insights, and monetize data.
Yahoo is the largest corporate contributor, tester, and user of Hadoop. They have 4000+ node clusters and contribute all their Hadoop development work back to Apache as open source. They use Hadoop for large-scale data processing and analytics across petabytes of data to power services like search and ads optimization. Some challenges of using Hadoop at Yahoo's scale include unpredictable user behavior, distributed systems issues, and the difficulties of collaboration in open source projects.
Alexander Aldev - Co-founder and CTO of MammothDB, currently focused on the architecture of the distributed database engine. Notable achievements in the past include managing the launch of the first triple-play cable service in Bulgaria and designing the architecture and interfaces from legacy systems of DHL Global Forwarding's data warehouse. Has lectured on Hadoop at AUBG and MTel.
"The future of Big Data tooling" will briefly review the architectural concepts of current Big Data tools like Hadoop and Spark. It will make the argument, from the perspective of both technology and economics, that the future of Big Data tools is in optimizing local storage and compute efficiency.
1. The document discusses the future of data science and big data technologies. It describes the roles of data scientists and their typical skills, salaries, and job outlook.
2. It discusses technologies like Hadoop, Spark, and distributed computing that are used to handle big data. While Hadoop is good for batch processing, Spark can perform both batch and real-time processing 100x faster.
3. Going forward, data science will shift from descriptive to predictive analytics using machine learning to improve customer experience and business outcomes across industries like internet search and digital advertising.
The need to process huge volumes of data is increasing day by day, and doing so involves compute, network, and storage. In terms of Big Data, what does it take to innovate, and what does innovation ultimately look like? This talk provides high-level details on the need for big data and the capabilities of the MapR Converged Data Platform.
Speaker: Vijaya Saradhi Uppaluri, Technical Director at MapR Technologies
Big Data with IOT approach and trends with case study - Sharjeel Imtiaz
Big data and IoT technologies are increasingly being used together for new applications. The document discusses using big data and IoT for tourism recommendations in Oman. It outlines a case study approach involving collecting hotel review data from TripAdvisor, analyzing the data using sentiment analysis and topic modeling, and developing a recommendation system. The system would integrate IoT devices in hotel rooms to gather additional guest feedback and preferences on amenities like lighting, music, and more. This combined big data and IoT approach aims to provide more personalized recommendations to improve the Omani tourism experience.
Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop supports the processing of structured, unstructured and semi-structured data and is able to reliably store and process petabytes of data. Some key applications of Hadoop include web search indexing, data mining, machine learning, scientific data analysis, and business intelligence.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data - Kiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working with unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Learn SQL from basic queries to Advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
End-to-end pipeline agility - Berlin Buzzwords 2024 - Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and the worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder
1. On Which side of the Cloud are you ?
An Introduction to Big Data
Denis Rothman
Copyright 2014 Denis Rothman
2. Big Data - Introduction
□ This course is not meant to make Big
Data experts out of you in a few
hours but is designed to help you
grasp the main concepts.
□ We’ll be discussing Apache Hadoop,
MapReduce, MongoDB, Pig and
several other names and concepts
that will be familiar to you by the end
of the course !
Copyright 2014 Denis Rothman
3. Big Data - Introduction
□ We’re going to talk about Apache
« Hadoop » and « MapReduce »
because the following companies use
this technology, at least in parent or
derived versions : Google, Yahoo!,
Facebook, Amazon, IBM, eBay
and many more key players on the
market.
Copyright 2014 Denis Rothman
4. Big Data - Introduction
□ All the figures, software and brands
mentioned in this document are simple
examples. All of this is going to expand and
change through the years !
□ The main goal here is for you to grasp
enough concepts to be able to create Big
Data architectures with today’s but also
tomorrow’s technology and ideas !
□ So focus on the concepts and the way you
can solve problems with Big Data
technology.
Copyright 2014 Denis Rothman
5. Big Data – What is big data ?
Learn more : http://en.wikipedia.org/wiki/Big_data
Let’s say that starting at around 10 TB for a dataset (a collection of data) we’re
talking Big Data, and starting at one petabyte we really need the technology !
The world has jumped from talking petabytes to exabytes in a year; soon we’ll
probably be talking zettabytes.
1 EB = 1 000 000 000 000 000 000 B = 10^18 bytes = 1 000 petabytes = 1 million terabytes = 1 billion gigabytes.
Copyright 2014 Denis Rothman
6. Big Data – What is big data ?
For the Universe, the galaxies
are our small representative
volumes, and there are
something like 10^11 to
10^12 stars in our Galaxy
(The Milky Way)
• The number of bits on a
terabyte-capacity computer hard
disk is typically about 10^13
(1 000 GB).
To compare the amount of data we now store we have to go
down to atom-level quantities in our universe !
Copyright 2014 Denis Rothman
7. Big Data – Can you represent the
Volume ?
Learn more : http://www.seagate.com/about/newsroom/press-releases/Terascale-Enterprise-HDD-pr-master/
Tell us how and where you
would store a 1PB dataset for a
given company without Big
Data technology.
How many average-size 4 TB
hard disks would it take to
simply store the data ? (See the
quick sketch after this slide.)
High-Capacity— highest capacity HDD (4TB) available in a 3.5-
inch enterprise-class SATA(Serial Advanced Technology Attachment)
HDD enabling scalable, high-capacity storage in 24×7
environments.
?
Copyright 2014 Denis Rothman
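As a quick back-of-the-envelope check (a minimal sketch added here for clarity, not part of the original slide; it uses decimal units, 1 PB = 1 000 TB), the raw disk count is simple arithmetic:

# Back-of-the-envelope: how many 4 TB disks to hold 1 PB of raw data?
# (Ignores file-system overhead, RAID parity and so on.)
dataset_tb = 1000          # 1 PB expressed in terabytes
disk_tb = 4                # one enterprise SATA disk

disks_raw = dataset_tb / disk_tb
disks_replicated = disks_raw * 3   # Hadoop-style 3x replication, discussed later

print(disks_raw)           # 250.0 disks just to store the bytes once
print(disks_replicated)    # 750.0 disks with 3 copies of every block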
8. Big Data – Can you represent a fast
way to access(Velocity) 1PB of data
with Big Data technology?
Let’s say we’re talking
about the data related to
all bank accounts of the
BNP over the past 5 years
that had a balance of
more than $1 000 at a given
time and that need to be
accessed for a financial
analysis.
How would you do it now,
without Big Data
Technology ?
Copyright 2014 Denis Rothman
9. Big Data – Can you represent access to
additional documents in a great Variety of
data ?
Now we need to retrieve
other documents to
analyse these BNP
accounts : text
documents (signed
contracts, for example).
How would you do it now,
without Big Data
Technology ?
Copyright 2014 Denis Rothman
10. Big Data – Do you think you can manage
10PB without Big Data ?
If we now try to solve the 3 V problem with a 10PB dataset to
manage, how could we do it even with Oracle Big Files ?
A bigfile tablespace contains only
one datafile or tempfile, which
can contain up to approximately
4 billion (2^32) blocks. The
maximum size of the single
datafile or tempfile is 128
terabytes (TB) for a tablespace
with 32 K blocks and 32 TB for
a tablespace with 8 K blocks.
(From Oracle’s database limits table : « Bigfile Tablespaces – Number of blocks ».)
Learn more : http://docs.oracle.com/cd/B28359_01/server.111/b28320/limits002.htm#i2879
Copyright 2014 Denis Rothman
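To see where Oracle’s 128 TB and 32 TB figures come from, here is a quick check (a sketch added for clarity, not from the original deck) : the maximum datafile size is simply the block count times the block size.

# Maximum bigfile tablespace datafile size = number of blocks x block size
max_blocks = 2 ** 32                     # ~4 billion blocks per datafile

size_32k = max_blocks * 32 * 1024        # 32 KB blocks
size_8k = max_blocks * 8 * 1024          # 8 KB blocks

print(size_32k / 2 ** 40)   # 128.0 TB (binary terabytes)
print(size_8k / 2 ** 40)    # 32.0 TB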
11. Big Data – Volume, Velocity, Variety that is
beyond non Big Data solutions
We’ve seen the limits of non
Big Data technology.
How would you solve the
problem ?
Even if you already know
how Big Data works, do
you think it will solve the
increasing size and
variety of datasets ?
How will it help with
sensors ?
Copyright 2014 Denis Rothman
12. Big Data – Apache Hadoop
There are several
solutions on the
market. Let’s use
Apache Hadoop as a
way to understand
how Big Data storage
works to solve the 3V
problem.
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Copyright 2014 Denis Rothman
13. Big Data – Apache Hadoop
□ There are many ways
to try to understand a
subject. This part of
the course is designed
for you to see that
the core ideas of
Apache Hadoop are
simple !
Copyright 2014 Denis Rothman
14. Big Data – Apache Hadoop
□ First of all, what
does « Hadoop »
mean ? It means
nothing !
□ Doug Cutting just
named it after his
son’s toy elephant.
So that’s one
mystery solved.
Copyright 2014 Denis Rothman
15. Big Data – Apache Hadoop
□ The first thing
we need to do is
understand
cluster
architectures.
□ Cluster
architectures are
spreading at a
wild speed as a
framework for
the analysis of big
data.
New Exabytes of data appear
each…week…
Learn more : http://www.ovh.com/fr/serveurs_dedies/big-data/
16. Big Data – Apache Hadoop
□ Cluster architectures are the
best choice because they
offer Cloud performance :
extensible, flexible and
cost-efficient.
Copyright 2014 Denis Rothman
17. Big Data – Apache Hadoop
□ So what ? So
what’s the
difference between
a traditional
enterprise
architecture and a
cloud-cluster
architecture ?
Copyright 2014 Denis Rothman
18. Big Data – Apache Hadoop
□ A traditional
architecture is
built on
server technology
that is expensive
and thus has to be
used as much as
possible.
Copyright 2014 Denis Rothman
19. Big Data – Apache Hadoop
□ A traditional
architecture is
also built on
storage capacity
of different sizes
and types : SSD
to SATA.
Copyright 2014 Denis Rothman
20. Big Data – Apache Hadoop
□ A traditional
architecture is
finally built on
storage area
networks
(SAN) to
connect a set
of servers to a
set of storage
units
Copyright 2014 Denis Rothman
21. Big Data – Apache Hadoop
□ The big quality of traditional
architecture is that the servers and
storage units can be managed (size,
number) separately with SAN
(Storage Area Network) connecting
them.
Copyright 2014 Denis Rothman
22. Big Data – Apache Hadoop
□ The big drawback of traditional
architecture is that it must be
extremely reliable and any failure
must be dealt with very quickly.
□ This brings the price up.
Copyright 2014 Denis Rothman
23. Big Data – Apache Hadoop
□ Traditional architectures were
designed for intensive applications
focusing on one part of the data. The
servers process the information and
then the results are transferred to
storage.
Copyright 2014 Denis Rothman
24. Big Data – Apache Hadoop
□ So in essence a traditional architecture is
designed for a specific need (intense
computing, a standard data warehouse).
Fine.
□ How would you now solve a problem
involving a tremendous weekly increase in
data (PB) ? Not knowing what you’re
looking for in advance : sorting by order,
by timestamp or retrieving certain values.
Copyright 2014 Denis Rothman
25. Big Data – Apache Hadoop
□ Even a few years ago, Google was
facing an increase of data of
20 PB…per day.
□ For a special operation, let’s say user
mail history (number and size of
mails over a five year period), we
need to parse the entire dataset not
just a subset.
Copyright 2014 Denis Rothman
26. Big Data – Apache Hadoop
□ Why sort that data ?
□ To make searching, merging and
analyzing easier.
□ So how can you sort n x 20PB of
data?
□ With cluster architecture !
Copyright 2014 Denis Rothman
27. Big Data – Apache Hadoop
Let’s now study 3 classic sort
benchmarks used in cluster computing :
-Pennysort
-Minutesort
-Graysort
Copyright 2014 Denis Rothman
28. Big Data – Apache Hadoop
□ Sorting being a major function of Big
Data, it’s important to have
benchmark references.
Learn more : http://sortbenchmark.org/
GraySort
Metric: Sort rate (TBs / minute) achieved while sorting a very large
amount of data (currently 100 TB minimum).
PennySort
Metric: Amount of data that can be sorted for a penny's worth of system
time.
Originally defined in AlphaSort paper.
MinuteSort
Metric: Amount of data that can be sorted in 60.00 seconds or less.
Originally defined in AlphaSort paper.
Copyright 2014 Denis Rothman
29. Big Data – Apache Hadoop
Learn more : http://sortbenchmark.org/
2013, 1.42 TB/min
Hadoop
102.5 TB in 4,328 seconds
2100 nodes x
(2 x 2.3 GHz hexcore Xeon E5-2630, 64
GB memory, 12x3TB disks)
Thomas Graves
Yahoo! Inc.
Gray
Copyright 2014 Denis Rothman
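The 1.42 TB/min figure follows directly from the record’s own numbers; a quick sanity check (a sketch added here, not in the original slide) :

# GraySort 2013 Hadoop record: 102.5 TB sorted in 4,328 seconds
terabytes = 102.5
seconds = 4328

rate_tb_per_min = terabytes / (seconds / 60)
print(round(rate_tb_per_min, 2))   # ~1.42 TB per minute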
30. Big Data – Apache Hadoop
2011, 286 GB
psort
2.7 Ghz AMD Sempron, 4 GB RAM,
5x320 GB 7200 RPM Samsung SpinPoint F4
HD332GJ, Linux
Paolo Bertasi, Federica Bogo, Marco Bressan
and Enoch Peserico
Univ. Padova, Italy
Penny
Copyright 2014 Denis Rothman
31. Big Data – Apache Hadoop
2012, 1,401 GB
Flat Datacenter Storage
256 heterogeneous nodes, 1033 disks
Johnson Apacible, Rich Draves, Jeremy Elson,
Jinliang Fan, Owen Hofmann, Jon Howell, Ed
Nightingale, Reuben Olinksy, Yutaka Suzue
Microsoft Research
Minute
Copyright 2014 Denis Rothman
32. Big Data – Apache Hadoop
□ Getting down to a cluster.
A cluster breaks down to its basic
component : a NODE
A node is made up of cores, memory
and disks that can be assembled in
the thousands, the tens of thousands,
the hundreds of thousands.
Copyright 2014 Denis Rothman
33. Big Data – Apache Hadoop
□ The NODES
are then
grouped in
RACKS
□ The RACKS
are then
grouped into
CLUSTERS
The CLUSTERS ARE CONNECTED TO A NETWORK WITH A CISCO
SWITCH, for example
Copyright 2014 Denis Rothman
34. Big Data – Apache Hadoop
□ The first property of a cluster is to be
MODULAR and SCALABLE (handles
growing amount of elements)
□ This means that it’s cheap to just add
more and more nodes at the best
price and it doesn’t need to be that
reliable as we will see further.
Copyright 2014 Denis Rothman
35. Big Data – Apache Hadoop
□ The second property of a cluster is
DATA LOCALITY. This means you’re not
going through a sequence but directly
to the physical location. No more
bottlenecks...
□ This leads to PARALLELIZATION which
means you access several locations
simultaneously.
Learn more : http://en.wikipedia.org/wiki/Locality_of_reference
Copyright 2014 Denis Rothman
36. Big Data – Apache Hadoop
□ With data locality and parallelization
MASSIVE PARALLEL PROCESSING
becomes a reality.
□ The main function, sorting, can now
be done within each node on a subset
of data.
□ Please bear in mind that these nodes
are cheaper than traditional
architectures.
Copyright 2014 Denis Rothman
37. Big Data – Apache Hadoop
□ This is just an example that goes
back to 2011 but makes the point.
A typical SSD drive system would
process data at about $1.2 a gigabyte
at 30K IOPS and a SATA at about
$0.05 but only at 250 IOPS
IOPS (input/output operations per
second) .
Let’s take a simple cluster…
Copyright 2014 Denis Rothman
38. Big Data – Apache Hadoop
□ In a simple cluster, 30 000 IOPS
are delivered in parallel with around
120 nodes (around 250 IOPS) at the
same time BUT for the IOP price of
SATA.
□ We’re talking about cheaper and
more expendable equipment.
Copyright 2014 Denis Rothman
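A quick sketch of the arithmetic behind this comparison (the 2011 prices used here are the deck’s own illustrative figures, not current ones) :

# Aggregate IOPS from many cheap SATA nodes vs one SSD system
ssd_iops, ssd_cost_per_gb = 30_000, 1.2      # illustrative 2011 SSD figures
sata_iops_per_node, sata_cost_per_gb = 250, 0.05   # illustrative 2011 SATA figures

nodes_needed = ssd_iops / sata_iops_per_node
print(nodes_needed)                                 # 120.0 nodes reach the same 30 000 IOPS
print(round(ssd_cost_per_gb / sata_cost_per_gb))    # ~24x cheaper per GB on SATA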
39. Big Data – Map Reduce
□ This means that in a cluster
architecture failures will be more
frequent with cheaper equipment.
Copyright 2014 Denis Rothman
40. □ Failures with cheaper equipment ?
Who cares ? Don’t get ripped off purchasing
expensive, highly reliable hardware ; buy expendable
material to be cost-efficient.
We just need to find a way to detect failures and
respond quickly to deal with this
complexity.
We’ll need to replicate the data up to three
times in three different locations.
Let’s see how to solve these problems with
Apache Hadoop.
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
41. Hadoop is about clusters built with
commodity hardware, not high-end
hardware :
• widely available
• interchangeable
• plug and play
• breaks down more often
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
42. □ Before we go on, what’s
the purpose of all this ?
WHY ?
It all started with Google, which
had to index pages every
day and quickly reached huge
amounts of data. Hadoop
reaches back to the
Google File System (GFS)
and Google MapReduce. In
the early days, Yahoo! and
Apache got involved in the
process.
Around 2004, Google started
publishing all this…
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
43. □ Let’s take Facebook. You know all the
information that’s in there for you. But with
over 1 000 000 000 users + the 450 000 000
WhatsApp users, we’re talking about a
massive chunk of the world population
increasing the size of Facebook every day.
We’re talking about data increasing by exabytes
in this case. How are you going to run a
search over that one dataset spread over
hundreds of thousands of nodes ?
With Apache Hadoop !
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
44. Big Data – Apache Hadoop
Apache Hadoop was designed for DISTRIBUTED DATA OVER THE CLUSTERS
Apache Hadoop was designed with the concept of DATA LOCALITY
Hadoop Distributed File
System (HDFS)
Hadoop Map Reduce
Copyright 2014 Denis Rothman
45. □ HDFS has 3 main functions : split,
scatter and replicate.
Big Data – Apache Hadoop
1. SPLITTING. In Hadoop
each FILE BLOCK has the
SAME size (64 MB, for
example) in a STORAGE
BLOCK
2. SCATTERING. These FILE
BLOCKS are generally
on different datanodes
3.REPLICATION : There are
multiple copies of these
blocks in different
locations.
Copyright 2014 Denis Rothman
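A minimal, purely illustrative sketch of the split / scatter / replicate idea (this is toy Python added for the course, not the real HDFS implementation; node names and file size are invented) :

import itertools

BLOCK_SIZE = 64 * 1024 * 1024          # 64 MB, the classic HDFS block size
DATANODES = ["node1", "node2", "node3", "node4", "node5"]
REPLICATION = 3

def split(data: bytes):
    """SPLIT the file into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place(blocks):
    """SCATTER each block and REPLICATE it on 3 different datanodes."""
    placement = {}
    ring = itertools.cycle(DATANODES)
    for idx, _ in enumerate(blocks):
        placement[idx] = [next(ring) for _ in range(REPLICATION)]
    return placement

blocks = split(b"x" * (200 * 1024 * 1024))   # a fake 200 MB file
print(len(blocks))                           # 4 blocks (3 full + 1 partial)
print(place(blocks))                         # e.g. {0: ['node1', 'node2', 'node3'], ...}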
46. Big Data – Apache Hadoop
Architecture
Learn more : http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Blocks
One main
node
Generally
3 copies in
the
replication
process so
nodes can
fail !
Copyright 2014 Denis Rothman
47. □ The NameNode is the centerpiece
of an HDFS file system. It keeps
the directory tree of all files in the
file system, and tracks where
across the cluster the file data is
kept. It does not store the data of
these files itself.
□ Client applications talk to the
NameNode whenever they wish to
locate a file, or when they want to
add/copy/move/delete a file. The
NameNode responds to
successful requests by returning a
list of relevant DataNode servers
where the data lives (addresses).
Big Data – Apache Hadoop
Learn more : http://wiki.apache.org/hadoop/NameNode
Works fine for
failures on
commodity
equipment !
Copyright 2014 Denis Rothman
48. □ So what happens when the
NameNode fails ?
□ Hadoop has copies of the data and
as long as the same IP address is
reassigned, a new NameNode will be
designated and that’s it !
Big Data – Apache Hadoop
Learn more : http://wiki.apache.org/hadoop/NameNode
Copyright 2014 Denis Rothman
49. Once the HDFS is set up, MAP REDUCE
is there to retrieve information in a
simple way.
First a MAPPER is used, then the
information is REDUCED.
Let’s see how this happens.
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
50. The MAPPER function relies on the fact
that the data is EVENLY
DISTRIBUTED. This means that
Massive Parallel Processing is
possible.
The MAPPER uses the LOCALITY (hence
« MAP ») features of HADOOP to
optimize its search.
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
51. □ If the file blocks were not of equal size,
the processing time would be driven by the
largest block.
□ But since in Hadoop the file blocks have the
same size, processing is tremendously
enhanced for MPP.
□ A little caveat could be unequal internet
connections, but most organizations
have solved this and there are replications
everywhere…
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
52. Big Data – Apache Hadoop
Suppose you need to analyse
the number of times the phrase
« Happy New Year » appears in
Google searches at midnight on
December 31st in each
timezone.
Let’s say we’re concentrating
on France only and that the
nodes containing this data are
Nodes 1, 2, 3 (at their
addresses).
Copyright 2014 Denis Rothman
53. □ Now we run a <key,value> pair with
the mapping functions. The key here
is « Happy New Year » and the value
will be the number of times it
appears.
□ In Node 1: <Happy New Year,
1000000>, Node 2 : <Happy New
Year, 4000000>, Node 3: <Happy
New Year, 2000000>
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
54. Big Data – Apache Hadoop
□ Let’s get a look and feel of Hadoop
command line functions, among
others.
□ https://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
Copyright 2014 Denis Rothman
55. Big Data – Map Reduce
□ The Node 1: <Happy New Year, 1000000>, Node 2 :
<Happy New Year, 4000000>, Node 3: <Happy New
Year, 2000000> data is sent to a reduce node to run
the REDUCE function, which will give the following
output :
<Happy New Year, 1000000,4000000,2000000> to be
summed up, for example, to <Happy New Year,
7000000>
Mapping and reducing are thus 2 simple but powerful
functions.
If various keys are sent, they are SORTED through a
shuffling process.
Copyright 2014 Denis Rothman
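A tiny in-memory sketch of the map / shuffle / reduce flow described above (illustrative Python using the per-node counts from the slide; this is not the real Hadoop API) :

from collections import defaultdict

# MAP: each node emits <key, value> pairs for its local data slice
node_outputs = [
    [("Happy New Year", 1_000_000)],   # Node 1
    [("Happy New Year", 4_000_000)],   # Node 2
    [("Happy New Year", 2_000_000)],   # Node 3
]

# SHUFFLE: group all values by key (this is where sorting by key happens in Hadoop)
shuffled = defaultdict(list)
for pairs in node_outputs:
    for key, value in pairs:
        shuffled[key].append(value)

# REDUCE: collapse each key's list of values into a single result
reduced = {key: sum(values) for key, values in shuffled.items()}
print(reduced)   # {'Happy New Year': 7000000}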
56. Big Data – Map Reduce
□ The Mapper functions and Reduce functions
are TASKS and together they form a JOB.
□ Map Reduce’s framework has a JOB
TRACKER that schedules the tasks.
□ A JOB TRACKER will reroute tasks if a node
fails, it organizes the activities.
□ Just like HDFS has a name node, Map
Reduce has a special node assigned to the
JOB TRACKER.
Copyright 2014 Denis Rothman
57. Big Data – Map Reduce
□ Now the programmer will provide
MapReduce with a list of file blocks and the
map and reduce jobs.
□ The output is a set of keys and values.
□ All of this can be done in a tremendous MPP
run.
□ By 2015, it’s estimated that 50% of all data
will be processed with Hadoop…
Copyright 2014 Denis Rothman
58. Big Data – High level software
□ Now the programmer will provide
MapReduce with a list of file blocks and the
map and reduce jobs.
□ The output is a set of keys and values.
□ All of this can be done in a tremendous MPP
run.
□ By 2015, it’s estimated that 50% of all data
will be processed with Hadoop type
technology…
Copyright 2014 Denis Rothman
59. Getting Started with Hadoop
MapReduce
Now let’s get Hadoop
MapReduce into the equation
Learn more: http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html
http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Pre-requisites
Let’s get a look and feel of MapReduce functions :
http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Example%3A+WordCount+v1.0
Just bear in mind that you’re looking at developing
<key,value> sets, both mapping them and reducing them.
Copyright 2014 Denis Rothman
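The WordCount v1.0 example linked above is written in Java. As an alternative look and feel, the same idea can be sketched as two small Python scripts in the Hadoop Streaming style (a hedged sketch; file names are illustrative, and the exact streaming-jar invocation depends on your distribution) :

#!/usr/bin/env python
# mapper.py - reads text lines from stdin, emits "<word> TAB 1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so counts can be summed per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

These scripts would typically be handed to the hadoop-streaming jar with the -mapper and -reducer options; Hadoop takes care of splitting the input, shuffling the keys and running the reducers in parallel.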
60. MapReduce
More look-and-feel
approaches :
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Mapper.html
Copyright 2014 Denis Rothman
61. Apache Hadoop MapReduce
Architecture
□ Let’s take five here and see what
we’ve got up to here. Ok, we have
Hadoop and MapReduce.
□ Let’s see how this fits together and
how we can access data at a higher
level.
□ We’re going to take a look at how
Google explains this…
Copyright 2014 Denis Rothman
62. Apache Hadoop MapReduce
Architecture
Google explains the
concept with a
physical retrieval
analogy :
1. Standard software
query : 1 person
2. MapReduce :
several persons
Let’s work on this
physical file
system
Learn more : https://cloud.google.com/developers/articles/apache-hadoop-hive-and-pig-on-google-compute-engine#appendix-b
Copyright 2014 Denis Rothman
63. Getting Started with PIG
All the tools are there,
just use them !
You’re going to have to choose a
platform or just rent one as
explained further in the
document.
Copyright 2014 Denis Rothman
64. PIG
Let’s have some fun
with high level
programming !
« Pig is a high-level platform for
creating MapReduce programs used
with Hadoop. »
Learn more : http://en.wikipedia.org/wiki/Pig_(programming_tool)
What does a pig do ? It « grunts ».
You can use Grunt to run Pig, you can
use Pig to run Python code, you can
use Pig for the MapReduce
framework.
Just stop thinking « categories », be
creative and have fun !
Copyright 2014 Denis Rothman
65. PIG
Learn more : http://en.wikipedia.org/wiki/Pig_(programming_tool)
http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions
Let’s have a look at
some of the PIG
functions to get the feel
of it.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-pig-udf.html
Copyright 2014 Denis Rothman
66. What if I don’t want to use Pig ?
There are a lot of languages you can use that
integrate the Hadoop & MapReduce framework !
Java : http://www.javacodegeeks.com/2013/08/writing-a-hadoop-mapreduce-task-in-java.html
PHP : http://stackoverflow.com/questions/10978975/need-a-map-reduce-function-in-mongo-using-php
C++ : http://cxwangyi.blogspot.fr/2010/01/writing-hadoop-programs-using-c.html
Python : https://developers.google.com/appengine/docs/python/dataprocessing/helloworld
Copyright 2014 Denis Rothman
67. Big Data or Standard Databases ?
□ File Systems or
databases ?
□ So now what ?
SQL solutions ?
No SQL solutions ?
□ Both ?
Let’s take a few minutes and find some examples in
which one philosophy or another is best for a company.
SQL ?
No SQL ?
Copyright 2014 Denis Rothman
68. Big Data – NOSQL
Learn more : http://en.wikipedia.org/wiki/NoSQL
□ First let’s get rid of a simple and old concept : SQL
□ When you want to explore exabytes of data, SQL is
useless.
□ « The term NoSQL (Not Only SQL) was used in 1998
to name a lightweight, open source database that did
not expose the standard SQL interface. Strozzi
suggests that, as the current NoSQL movement
"departs from the relational model altogether", it
should therefore have been called more appropriately
'NoREL'. »
□ In some cases the volume of data and its nature
(documents, texts) can’t be accessed through SQL.
Copyright 2014 Denis Rothman
69. Big Data – NOSQL
□ « Some notable implementations of NoSQL
are Facebook's Cassandra database,
Google's BigTable and Amazon's SimpleDB
and Dynamo. »
□ Let’s approach NOSQL with one of its core
concepts. In an RDBMS
(relational database management system)
several users can’t modify exactly the same
record at the same time. The system is
based on read-write-relational functions.
Copyright 2014 Denis Rothman
70. Big Data – NOSQL
In an RDBMS, the last user that writes to
exactly the same record will override
previous writes. Of course you can
append a record per user but then
you have multiple records for the
same data index.
So generally you lock the record
while it’s in use or use a LIFO (Last In,
First Out) approach.
Copyright 2014 Denis Rothman
71. Big Data – NOSQL
Learn more : http://www.techopedia.com/definition/27689/nosql-database
The fundamental difference in NOSQL is
that the relations don’t matter
anymore, so unique keys don’t
matter either.
You’re not worried about read and write
rules, relations, inner joins, size
constraints, time constraints.
Copyright 2014 Denis Rothman
72. Big Data – NOSQL
Learn more : http://en.wikipedia.org/wiki/NoSQL
With NOSQL you can
scatter your data
everywhere, on
various servers at
the same time
and write multiple
records with
multiple
simultaneous
users with millions
of same type
entries !
Copyright 2014 Denis Rothman
73. Big Data – SQL, Data Warehouse
and perspective
Let’s make NOSQL concepts clear :
- Hive is a language that is SQL-related and used with Big Data
- Pig is a NoSQL language
- You can use both in a project !
http://gcn.com/blogs/reality-check/2014/01/hadoop-vs-data-warehousing.aspx
A traditional data warehouse feeds data into a relational database.
What about a Hadoop data warehouse ? Why not ?
Perspective : Stop thinking of a data flow from a client
to server, start thinking about a universe of scattered
data ! Think from the point of view of the crowd not
the individual. Stop thinking about a single solution,
just use everything you can to reach your goal !
Copyright 2014 Denis Rothman
74. MongoDB
Learn more : http://www.mongodb.org/
Whereas Apache Hadoop is based on HDFS, MongoDB is a
NOSQL document database.
- Document-Oriented Storage with JSON-style documents
-Index support
-Querying
-Map/Reduce
Copyright 2014 Denis Rothman
75. MongoDB
http://docs.mongodb.org/manual/core/map-reduce/
Let’s get the feel of Mongodb and MapReduce functions
So, again, stop thinking « I’m into relational databases and
this is a non-relational database, which one do I have to choose ? »
You don’t have to choose !
At one point Facebook, and this might still be true, gathered data in
MySQL, sent it out to Hadoop and then retrieved it with MapReduce :
mapping it, shuffling it, reducing it and making sense of it back …in
MySQL for its users !!!
Copyright 2014 Denis Rothman
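A hedged look-and-feel sketch using the pymongo driver (it assumes a local MongoDB instance and an invented « searches » collection; the JavaScript map and reduce functions follow the pattern in the MongoDB manual linked above, and note that map_reduce exists in older pymongo versions but has since been deprecated in favour of the aggregation pipeline) :

from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017/")
db = client["demo"]

# Document-oriented storage: just insert JSON-style documents, no schema needed
db.searches.insert_one({"phrase": "Happy New Year", "node": "node1", "count": 1_000_000})

# Map/Reduce on the server: emit <phrase, count> then sum the counts per phrase
mapper = Code("function () { emit(this.phrase, this.count); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

result = db.searches.map_reduce(mapper, reducer, "phrase_totals")
for doc in result.find():
    print(doc)   # e.g. {'_id': 'Happy New Year', 'value': ...}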
76. Purchasing and managing your
« Hadoop-MapReduce-MongoDB,
PIG » Architecture
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
□ First you need to set up or choose a
type of physical Cloud architecture.
□ You need to make a financial and
technical decision.
□ If your company is not big enough to
build its own cluster, then you need
to choose among cloud offers.
Copyright 2014 Denis Rothman
77. Getting Started with Hadoop
Learn more : http://www.ovh.com/fr/serveurs_dedies/big-data/
Copyright 2014 Denis Rothman
78. Getting Started with Hadoop
□ Just a concept to bear in mind but you don’t have to do it on your
own as explained previously. Cloud services provide this.
□ "You have 10 machines connected in LAN and i need to create
Name Node in one system and Data Nodes in remaining 9
machines .
□ For example you have ( 1.. 10 ) machines , where machine1
is Server and from machine(2..9) are slaves[Data Nodes] so
do i need to install Hadoop on all 10 machines ?
□ You need Hadoop installed in every node and each node should
have the services started as for appropriate for its role. Also the
configuration files, present on each node, have to coherently
describe the topology of the cluster, including location/name/port
for various common used resources (eg. namenode). Doing this
manually, from scratch, is error prone, specially if you never did
this before and you don't know exactly what you're trying to do.
Also would be good to decide on a specific distribution of Hadoop
(HortonWorks, Cloudera, HDInsight, Intel, etc) »
Copyright 2014 Denis Rothman
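For a look and feel of what « coherently describe the topology » means in practice, here is a minimal Hadoop 1.x-style configuration sketch (the machine names come from the quote above; a real installation has many more settings and modern distributions generate these files for you) :

<!-- conf/core-site.xml on every node: where the NameNode lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://machine1:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: the replication factor discussed earlier -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

# conf/slaves: one DataNode host per line
machine2
machine3
...
machine10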
79. Getting Started with Hadoop
Do you have an Amazon account ?
What do you know about what’s beyond your account ?
Does Amazon have Big Data Technology ?
How far does Amazon go in this field ?
Let’s see…
Copyright 2014 Denis Rothman
80. Getting Started with Hadoop
Learn more : http://aws.amazon.com/big-data/
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html
Copyright 2014 Denis Rothman
81. Getting Started with your Big Data
Architecture
Let’s have a look at a real Big Data account
and resource management interface.
http://aws.amazon.com/s3/pricing/
https://console.aws.amazon.com/console/home?region=eu-west-1#
https://console.aws.amazon.com/elasticmapreduce/vnext/home?region=eu-west-1#getting-started:
Copyright 2014 Denis Rothman
82. Big Data – Ebay
□ eBay has a nice way of summing it up
before we get down to analyzing.
http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/#.UxncJbV5Gx4
Copyright 2014 Denis Rothman
83. Analyst
The analysts are here
Let’s find out what they do and what you could do in the future !
Copyright 2014 Denis Rothman
84. Big Data – Analyst
First you need to forget about
consumption (sales, marketing) and all the
clichés you hear around you.
Why ? Because the first step is to set highly
creative goals, then to map, reduce and
transform them into useful data. Useful
data can be for medical research, police
departments, astronomy and many other
areas.
Copyright 2014 Denis Rothman
85. Big Data – Analyst
At Planilog, we created a powerful Advanced
Planning System that deals with the 3 Vs
(Volume, Velocity and Variety). Our APS
can optimize any field of data.
Without going into the detail of our APS
program, the following slides are going to
provide you with tools to begin analyzing.
Of course, you can analyze anything any
way you want. This is just a guideline
we used that helped us solve hundreds of
problems.
Copyright 2014 Denis Rothman
86. Big Data – Analyst
Planilog’s first conceptual approach starts
with Cognitive Science and
Linguistics.
Human activity can be broken down into
two great categories :
passive and active.
Copyright 2014 Denis Rothman
87. Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Let’s take some passive activities using
just one or two senses. You can easily
guess the others after.
Eyes :
• Watching (movies, events, any other)
• Reading
• Listening to music
Copyright 2014 Denis Rothman
88. Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Let’s take some active activities using
some senses. You can easily guess
the others after.
- Writing documents, chats, mails
• Talking over the phone
• Combining video and sound : Skype
Copyright 2014 Denis Rothman
89. Big Data – Analyst
Now that you have an idea of active and
passive activities, let’s see what they can
apply to and what we can get out of them :
Thought process -> analyzing how someone
thinks (« Sentiment analysis »)
Feeling -> Sentiment analysis
Body -> Movement analysis (GPS, for
example).
Copyright 2014 Denis Rothman
90. Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Finally there are only two ways to measure passive/active
activities applied to the thinking-feeling-body process.
It just boils down to this :
qualitative properties and quantitative analysis.
Once we know what we’re analysing and how much, we
can pretty much make a model of the whole universe !
We could sum it up with brackets :
<property or key, quantity> or if we simplify :
<key, value>
Sounds familiar ?
See the power ? See why you need to analyse what you’re
going to do before you analyse the data.
Copyright 2014 Denis Rothman
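To make the <key, value> idea concrete, here is a tiny illustrative sketch (the sample data is invented) that models observed passive/active activities as key/value counts, exactly the shape a mapper would emit :

from collections import Counter

# Invented sample of observed activities (the "what" we decided to analyse)
events = ["reading", "watching", "reading", "writing", "reading", "talking"]

# <key, value>: the property is the key, the quantity is the value
pairs = Counter(events)
print(pairs)   # Counter({'reading': 3, 'watching': 1, 'writing': 1, 'talking': 1})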
91. Analyst
The Hadoop tools available don’t need to be isolated in
terms of concepts but simply must be interoperable.
You can Sqoop data from a relational database and collect
event data with Flume. You have HDFS (distributed
file system) that you can access in a non-relational
way with Pig or even use in a data warehouse with
Hive. With MapReduce you can run parallel
computation. If you need more resources, you can
use Whirr to deploy more clusters and ZooKeeper to
configure, manage and coordinate all of this !
So there is no relational, non relational opposition, there
is no « standard » approach. There is simply a goal to
attain with the best means possible.
Copyright 2014 Denis Rothman
92. Big Data – Privacy
Everything you touch is
stored, replicated,
mapped, reduced and
processed.
Just focus on the legal
aspect not on ethics.
Do you think all of this is
legal ?
What’s legal ? In which
country ? Where ? How ?
Can this be prevented ?
Copyright 2014 Denis Rothman
93. Big Data – On which side of the
Cloud are you ?
Now let’s forget about
the legal aspect.
How do you feel about
Clouds and Big Data ?
Do you feel threatened
?
Do you think it’s the
end of your freedom ?
Copyright 2014 Denis Rothman
94. Big Data – On which side of the
Cloud are you ?
Now if you feel it’s progress
with some drawbacks,
you’re ready to be a Big
Data analyst !
Do you agree with this or
not ?
Progress
Copyright 2014 Denis Rothman
95. Analyst : for those who agreed on
using Big Data ! ☺, the others can
leave ☺
Let’s sum it up before we begin analyzing real
projects and cases.
Conceptually if you use an active/passive
matrix applied to thought-feeling-physical
body, you can understand a great number
of models.
With Pig -> MapReduce -> Hadoop, and maybe
MongoDB added or not, you’re going to map,
reduce and transform DATA into useful
INFORMATION for decision-making
processes. You’re exploring time and space.
Copyright 2014 Denis Rothman
96. Analyst : Can you imagine the data
to be mapped and retrieved in
various fields ?
You need to
think
differently.
Forget
everything
you
learned and
be open to
new, very
new ideas.
Let’s hit the
road now !
Copyright 2014 Denis Rothman
97. Oh, you think this is
theory for the future ?
□ Ok,well you can stop laughing. Let’s have a
look at sentiment analysis tools :
http://blog.mashape.com/post/48757031167/list-of-20-sentiment-analysis-apis
How do you feel about that ? Remember, if you Tweet about this
page, it will be analyzed, so be careful of what you’re thinking
and writing !
https://www.mashape.com/
How is your mind shaped ?
How many applications are there out there ?
Think as a global data analyst and not an individual. Express your
thoughts.
99. Let’s carry out a little experiment
What do you think about Sentiment
analysis if you were Tweeting your
impression. Let’s analyze the
audience :
Key <positive , value>
Key <negative , value>
More difficult. Explain why.
Key <objective, value>
Key <subjective, value>
Copyright 2014 Denis Rothman
100. Big Data – Life saving
Try to find some ideas to save lives when there is a fire, or
to protect women from violence, or any other idea that
comes to your mind.
Think social networks, think drones, swarms of robots, think from
the point of view of the swarm command, like in SC2, but to help
people, not for the comfort of an individual.
Copyright 2014 Denis Rothman
101. Big Data – Life saving
https://www.cmu.edu/silicon-valley/news-events/news/2011/stamberger-interviewed.html
102. Big Data – Insurance
How can you optimize the price of the
premiums in real time worldwide with
Hadoop Mapreduce ?
Start with a major disaster and see how
you’re going to pay and forecast future
disasters.
Hadoop can be used for predictive functions.
Copyright 2014 Denis Rothman
103. Big Data – Insurance also needs
human resources.
How can you optimize
part time jobs in a
huge quantitative
environment in which
you have 100 000
employees to manage
?
http://www.optimaldecisionsllc.com/Welcome.html
104. Big Data – Amazon
Think of the passive-active matrix
and the related activities (thought,
feeling, physical) and tell me how
you would use Big Data.
How could you find a way to get
sentiment analysis out of the
reader ?
Copyright 2014 Denis Rothman
105. Big Data – Amazon-Kindle
What <key, value> pairs would you be looking for ?
Copyright 2014 Denis Rothman
106. Big Data – Twitter
Try to find a great many
positive applications for
Twitter.
We’ve seen the APIs.
Do you have ideas ?
Life Saving
Science and research
Other ?
We’re not going to talk about
the negative ones. You need
to think of how to go forward,
not slow down !
Copyright 2014 Denis Rothman
107. Big Data – Design Facebook
□ Describe the data
that Facebook
collects.
□ How can it legally
access 50% more
data than it gathered
in the first
place….?
Copyright 2014 Denis Rothman
108. Big Data – Design Facebook
□ WhatsApp ! 450 million new users !
How would you Map, Shuffle and
Reduce this data to fit into your
Facebook strategy ?
Advertising is a cliché.
What else can you do ?
Do you know what Stealth
Marketing is ?
http://en.wikipedia.org/wiki/Undercover_marketing
How can you analyze and detect it
automatically if you were a
government consumer protection
agency ? Why wouldn’t
governments map, shuffle, and
reduce illegal behaviours ?
109. Big Data – Design Sony Smartband
http://www.expansys.fr/sony-smartband-swr10-with-2-black-wristbands-sl-257855/?utm_source=google&utm_medium=shopping&utm_campaign=base&mkwid=svHjLhmZB&kword=adwords_productfeed&gclid=CI36mJSLgb0CFWjpwgod1wMANA
110. Smartwatches
□ Samsung has one too that measures
your heartbeat.
□ http://venturebeat.com/2013/09/01/this-is-samsungs-galaxy-gear-smartwatch-a-blocky-health-tracker-with-a-camera/
They want you to think about what you can do with the watch
while they’re thinking of what to do with the global data they’re
gathering as well. What could you analyse with Big Data
tools ?
Copyright 2014 Denis Rothman
113. Trucks : Tracking and Sensors
□ http://online.wsj.com/news/articles/SB10001424127887324178904578340071261396666
Sensors, robots, drones. Surveillance to optimize !
Give some of your ideas…
Copyright 2014 Denis Rothman
114. Big Data – Design Sony Smartband
Now you can pick up the pulse rate and all activity. How
can you relate this to all the other data on a group of
people and not just yourself ? Think of a concert and
sentiment analysis, for example.
http://www.sonymobile.com/us/products/accessories/smartwatch/#tabs
https://play.google.com/store/apps/details?id=com.sonyericsson.extras.smartwatch&hl=fr
Copyright 2014 Denis Rothman
115. Governments and government
agencies
What can a government collect
that corporations can’t ?
Can governments reach the level of
private corporations to protect
you ?
How can it be done ?
With what budget ?
116. Governments and government
agencies
Phone companies gather data.
Google, Microsoft and others gather
data
In fact everybody gathers data !
So could you gather as much data as
the government, in the end ?
Why or why not ?
117. Big Data – Can you trust yourself
to drive your car ?
□ Google, like all others, has you focus
on your individual needs.
□ In the meantime, your personal data
has gone global.
□ Think global and tell me how you
would analyze the data.
In this case we’re dealing with Big Data streaming data like in
online gaming. So when you’re parsing the data, NoSQL or SQL
is not the issue, getting the right information straight is the vital
goal !
Copyright 2014 Denis Rothman
118. Big Data – Can you trust a human
to drive your car ?
A Google Car gathers on average about a
gigabyte per second, which could add up to
over 80 TB a day. And you ?
□ http://www.isn.ethz.ch/Digital-Library/Articles/Detail/?lng=en&id=173004
Google explains how it collects all types of data.
http://googlepolicyeurope.blogspot.fr/2010/04/data-collected-by-google-cars.html
Where is all the data going : Big Data ?
http://www.hostdime.com/blog/google-self-driving-car-news/
119. Now take a step back and imagine all
of the data gathered and accessed by
a single group of analysts…
…And now go out, imagine and conquer the world of Big Data !
120. Big Data – A New Data Paradigm :
no limits
You can ask your questions now or
contact me at
Denis.Rothman76@gmail.com
Copyright 2014 Denis Rothman