This document discusses big data, including definitions, forecasts about talent shortages, and some key concepts. It notes that by 2018 the US could face shortages of 140,000-190,000 people with deep analytics skills and 1.5 million managers able to make decisions using big data analysis. It defines big data, describes Hadoop and MapReduce frameworks, discusses NoSQL databases, and mentions tools like AWS, Google BigQuery, and on-premises solutions from Oracle and Autonomy for working with big data.
2. Why should I care?
McKinsey:
•$250 billion annual savings in EU alone by enhancing public sector
•$600 billion annual consumer surplus from using personal location data globally
•Annual growth of data is remarkable
•Data is the most valuable thing most companies have
•Data is massively underutilized
Eufris 2012
3. Forecast
There will be a shortage of talent necessary for
organizations to take advantage of big data. By 2018, the
United States alone could face a shortage of 140,000 to
190,000 people with deep analytical skills as well as 1.5
million managers and analysts with the know-how to use
the analysis of big data to make effective decisions.
4. What is Big Data?
"Big data technologies describe a new generation of technologies and architectures, designed to
economically extract value from very large volumes of a wide variety of data, by enabling high-velocity
capture, discovery, and/or analysis"
IDC
"Big Data is a technology that helps extract value from the digital universe."
IDC
"Techniques and technologies that make handling data at extreme scale economical."
Forrester
5. ABC of Big Data
Analytics
•making sense of your data, in real time, in an easy way
Bandwidth
•ingesting, processing and delivering large amounts of data
Content
•storing, managing and retaining large amounts of data
www.netapp.com
6. 3 V’s of Big Data
Variety
• Big Data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more
Velocity
• Often time-sensitive, Big Data must be used as it is streaming in to the enterprise in order to maximize its value to the business
Volume
• Big Data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information
8. Hadoop
•The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
•Three subprojects:
  •Hadoop Common
  •Hadoop Distributed Filesystem (HDFS)
  •Hadoop MapReduce
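The "simple programming model" mentioned above can be illustrated with a word count, the usual introductory MapReduce example. The sketch below is plain single-process Python, not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and the shuffle step that a real framework performs across the cluster is simulated here with a dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key (done by the framework in practice)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents: List[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is valuable"]))
```

Because each map call sees only one record and each reduce call sees only one key, both phases could in principle run in parallel on different machines, which is the property a framework like Hadoop MapReduce builds on.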
10. MapReduce on App Engine
• MapReduce is an experimental, innovative, and rapidly changing new feature for App Engine
11. NoSQL
•Definition 1
“Next Generation Databases mostly addressing some of the points: being
non-relational, distributed, open-source and horizontally scalable. The
original intention has been modern web-scale databases. The movement
began early 2009 and is growing rapidly. Often more characteristics apply as:
schema-free, easy replication support, simple API, eventually consistent, a
huge data amount, and more.”
nosql-database.org
12. NoSQL
•Definition 2
“In computing, NoSQL (sometimes expanded to "not only SQL") is a broad
class of database management systems that differ from the classic model of
the relational database management system (RDBMS) in some significant
ways. These data stores may not require fixed table schemas, usually avoid
join operations, and typically scale horizontally.”
Wikipedia
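Two properties from the definitions, "schema-free" and "avoid join operations", can be shown with a minimal sketch. It uses a plain Python dictionary as a stand-in for a document store. The helper names (put, total_items) and the sample records are made up for illustration and do not correspond to any particular NoSQL product.

```python
# Hypothetical in-memory "collection": document id -> document (a dict).
users = {}


def put(collection, doc_id, document):
    """Store a document as-is; no table schema is checked or enforced."""
    collection[doc_id] = document


# Two documents in the same collection with different fields (schema-free).
put(users, "u1", {"name": "Ada", "email": "ada@example.com"})
put(users, "u2", {"name": "Linus", "orders": [{"item": "disk", "qty": 2}]})


def total_items(collection, doc_id):
    """Sum quantities from orders embedded in the user document.

    The orders live inside the document, so no join against a
    separate orders table is needed.
    """
    orders = collection[doc_id].get("orders", [])
    return sum(order["qty"] for order in orders)


if __name__ == "__main__":
    print(total_items(users, "u1"), total_items(users, "u2"))
```

Embedding related data in one document is a common way to avoid joins, at the cost of duplicating data that a relational design would normalize.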
13. From ACID to BASE
ACID: Atomicity, Consistency, Isolation, Durability
BASE: Basically available, Soft state, Eventually consistent
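The BASE terms can be made concrete with a small simulation. This is a sketch under stated assumptions: the Replica class is hypothetical, timestamps are supplied by the caller, and conflicts are resolved with last-write-wins, which is only one of several possible reconciliation strategies and is not taken from the slides.

```python
class Replica:
    """One hypothetical node holding key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, timestamp):
        # Accept the write locally without coordinating with peers
        # ("basically available"); keep it only if it is the newest seen.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        # Pull the peer's entries and keep the newest version of each key.
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], timestamp=1)
b.write("cart", ["book", "pen"], timestamp=2)

stale = a.read("cart")  # replicas disagree for a while ("soft state")

a.sync_from(b)
b.sync_from(a)
converged = a.read("cart") == b.read("cart")  # "eventually consistent"
```

An ACID system would instead make both replicas agree before acknowledging the write, trading availability for immediate consistency.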
18. Google BigQuery
Features
• Speed - Analyze billions of rows(!) in seconds
• Scale - Terabytes of data, trillions of records
• Simplicity - SQL-like query language, hosted on
Google infrastructure
• Sharing - Powerful group- and user-based permissions
using Google accounts
• Security - Secure SSL access
• Multiple access methods - Accessible through a REST
API, a command-line tool, a browser-based graphical
interface, and Google Apps Script
21. Oracle Big Data Appliance
About $500,000
18 Oracle Sun Servers
• 864 GB main memory;
• 216 CPU cores;
• 648 TB of raw disk storage;
• 40 Gb/s InfiniBand connectivity between nodes and engineered systems;
• 10 Gb/s Ethernet connectivity.
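The totals above can be broken down per server. The arithmetic below assumes the memory, cores and disks are spread evenly across the 18 servers, which the slide does not state, and it uses the approximate price from the slide, so the cost figure is only a rough estimate.

```python
# Figures taken from the slide; even distribution across servers is an assumption.
SERVERS = 18
TOTAL_MEMORY_GB = 864
TOTAL_CPU_CORES = 216
TOTAL_RAW_DISK_TB = 648
APPROX_PRICE_USD = 500_000  # "about" figure from the slide

memory_per_server_gb = TOTAL_MEMORY_GB / SERVERS
cores_per_server = TOTAL_CPU_CORES / SERVERS
disk_per_server_tb = TOTAL_RAW_DISK_TB / SERVERS

# Rough hardware cost per terabyte of raw (not usable) disk storage.
price_per_raw_tb_usd = APPROX_PRICE_USD / TOTAL_RAW_DISK_TB

if __name__ == "__main__":
    print(memory_per_server_gb, cores_per_server, disk_per_server_tb)
    print(round(price_per_raw_tb_usd))
```

Under these assumptions each server has 48 GB of memory, 12 cores and 36 TB of raw disk, and the appliance costs roughly $770 per raw terabyte.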
22. Autonomy IDOL 10
"For far too long, organizations have confined structured data to
relational databases and unstructured data to simplistic keyword
matching technologies..."
“IDOL 10 brings these worlds together, allowing organizations to
automatically process, understand, and act on 100 percent of
their data, in real-time. The results will be dramatic, as
businesses can develop entirely new applications that explore
the richness and color of Human Information that live in
unstructured, semi-structured, and structured forms.”
Price?