3. OUTLINE
¡ About me
¡ What is Big Data?
¡ Evolution of Business Intelligence
¡ Big Data Opportunities
¡ Big Data challenges
¡ Conclusion
24/10/2014
4. About me
¡ Associate professor in Computer Science – LISITE-RDI
¡ Research interests: data stream mining, scalability and resource optimization in distributed architectures (e.g. cloud architectures), recommender systems
¡ Research field: Large scale data management
1. Real-time and distributed processing of various data sources
2. Use semantic technologies to add a semantic layer
3. Recommender systems and collaborative data mining
4. Optimizing resources in large scale systems
5. Modeling and validation of complex systems
[Figure labels: heterogeneous and static data; heterogeneous and dynamic data streams; sensors]
10. So, what is Big Data?
Big Data Elements: Volume, Variety, Velocity + Veracity (IBM) - information uncertainty
Volume of data created worldwide
§ Dawn of time to 2003: 5 EB
§ …
§ 2012: 2.7 ZB
§ 2015: 10 ZB (E)
§ 1 YB = 10^24 Bytes
§ 1 ZB = 10^21 Bytes
§ 1 EB = 10^18 Bytes
§ 1 PB = 10^15 Bytes
§ 1 TB = 10^12 Bytes
§ 1 GB = 10^9 Bytes
Variety of data
§ Radio
§ TV
§ News
§ E-Mails
§ Facebook Posts
§ Tweets
§ Blogs
§ Photos
§ Videos (user and paid)
§ RSS feeds
§ Wikipedia
§ GPS data
§ RFID
§ POS Scanners
§ …
Velocity of data
§ Walmart handles 1M transactions per hour
§ Google processes 24 PB of data per day
§ AT&T transfers 30 PB of data per day
§ 90 trillion emails are sent per year
§ World of Warcraft uses 1.3 PB of storage
§ Facebook, when it had a user base of 900 M users, had 25 PB of compressed data
§ 400M tweets per day in June ’12
§ 72 hours of video is uploaded to YouTube every minute
Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla
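To get a feel for these orders of magnitude, the unit table above can be turned into a few lines of arithmetic. This is only an illustrative sketch: the helper names (`to_bytes`, `convert`) are made up for the example, and the per-second figure is a plain daily average, not a measured rate.

```python
# Decimal (SI) byte units, as listed on the slide.
UNITS = {
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
    "YB": 10**24,
}


def to_bytes(value, unit):
    """Convert a quantity expressed in `unit` to bytes."""
    return value * UNITS[unit]


def convert(value, src, dst):
    """Convert a quantity from unit `src` to unit `dst`."""
    return to_bytes(value, src) / UNITS[dst]


# 2.7 ZB (2012) expressed in EB, then compared with the 5 EB figure for 2003.
volume_2012_in_eb = convert(2.7, "ZB", "EB")
growth_factor = volume_2012_in_eb / 5

# Average throughput implied by "24 PB of data per day".
SECONDS_PER_DAY = 24 * 60 * 60
bytes_per_second = to_bytes(24, "PB") / SECONDS_PER_DAY

if __name__ == "__main__":
    print(f"2.7 ZB is about {volume_2012_in_eb:.0f} EB")
    print(f"Growth factor versus 5 EB: about {growth_factor:.0f}x")
    print(f"24 PB/day is about {bytes_per_second / UNITS['GB']:.0f} GB/s on average")
```

Working in a single base unit (bytes) and converting only for display avoids mixing prefixes by mistake.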
11. Key factors
¡ Cheap storage
¡ Recording everything is not expensive anymore
¡ Cloud computing
¡ Cheap, on-demand computing resources from anywhere in the world and for everyone
¡ Business reasons
¡ New insights arise that give competitive advantage
¡ Data in various forms everywhere: IoT and IoE, Social Networks, Open Data
¡ The way we interact with each other and with data / information
¡ …
12. Transforming our daily lives
Then: One size fits all
Now: Personalization & Targeted Selling
Source: Big Data Trends by David Feinleib
13. Fitness
Then: Manual tracking
Now: Focus on the goal
Source: Big Data Trends by David Feinleib
14. Customer service
Then: Reactive Customer Service
Now: Proactive Customer Service
Source: Big Data Trends by David Feinleib
15. Customer service: 360-degree view of the customer
Questions: Why? What? Who? When/How? Where?
Data types: Operational data, Behavioral data, Descriptive data, Interaction data, Contextual data
16. Big Data opportunities
Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
17. Opportunities: big data use cases
360° view of the customer
• Integration of data from social networks, CRM, transactional data, etc.
• Example: T-Mobile, telecom operator -> Reduction of customer churn by 50% in a quarter
E-reputation
• Sentiment analysis, proactive monitoring of social networks
• Example: Nestlé, food group -> Gain of 4 places in the Reputation Institute’s Index due to 24/7 interaction
Optimisation
• Predictive analysis for anomaly detection, process optimization using sensors and operational data
• Example: Union Pacific Railroad -> reduced train derailments, increased train shipments, carbon emission reduction
Public security
• Monitoring social networks, integration of spatial data and sensors
• Example: Serious Request 2012 -> monitoring of crowd movements with Twitter and sensors, localization of public force, integration with GIS
19.
[Figure: architecture layers (Output, User Interaction, Store, Gathering Information, Data sources) shown for three kinds of data; components listed from output down to data sources]
Static Data: Static report; Ad-hoc queries, Analytics; Data Warehouse; ETL/Batch processing; databases
Semantic Data: Visual analytics; Flexible queries / SPARQL; Triple Store; Semantic ETL/Batch processing; Structured/unstructured data
Stream (Big) Data: Real time visual-analytics, Retro-action; Real-time analytics; Databases/Triplestores, Knowledge enrichment; Continuous queries/Business rules; Semantic ETL stream processing, Load shedding; sensors, Data stream, Static data
23. Big Data workflow
1. Capture
2. Store
3. Analyze
4. Visualize
Challenges arise in all these steps
24. Challenges: Data Collection
¡ Heterogeneity of sources
¡ Company databases => Silos
¡ Sensor networks, Intelligent objects
¡ Data streams: Social Networks, financial information, etc.
¡ Data Velocity
¡ Data provenance and quality
25. Type of data used in Big Data initiatives
[Chart categories: Internal data, Traditional sources, “New data”]
Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
26. Challenges: Data Collection
Velocity
¡ Website logs
¡ Network monitoring
¡ Financial services
¡ eCommerce
¡ Traffic control
¡ Weather forecasting
¡ Power consumption
27. What is a data stream?
¡ Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.”
¡ Massive volumes of data, items arrive at a high rate.
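Because a stream cannot be stored in its entirety, statistics have to be maintained incrementally, one item at a time. The sketch below illustrates this with Welford's single-pass method for mean and variance, which needs constant memory. The generator `sensor_stream` and its readings are hypothetical stand-ins for a real, unbounded source.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's method) in O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one arriving item into the statistics, then forget the item."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the items seen so far."""
        return self._m2 / self.count if self.count else 0.0


def sensor_stream():
    """Hypothetical stream: a generator stands in for an unbounded source."""
    for reading in [20.0, 21.0, 19.0, 22.0, 18.0]:
        yield reading


stats = RunningStats()
for item in sensor_stream():
    stats.update(item)  # each item is seen exactly once

if __name__ == "__main__":
    print(stats.count, stats.mean, stats.variance)
```

Each item is processed once and then discarded, so memory use does not grow with the length of the stream.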
28. Data Stream Management Systems
DBMS vs. DSMS
Data model
  DBMS: Permanent updatable relations
  DSMS: Streams and permanent updatable relations
Storage
  DBMS: Data is stored on disk
  DSMS: Permanent relations are stored on disk; streams are processed on the fly
Query
  DBMS: SQL language (creating structures; inserting/updating/deleting data; retrieving data with one-time queries)
  DSMS: SQL-like query language (standard SQL on permanent relations; extended SQL on streams with windowing; continuous queries)
Performance
  DBMS: Large volumes of data
  DSMS: Optimization of computer resources to deal with several streams and several queries; ability to face variations in arrival rates without crashing
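The idea of a continuous query over a window can be sketched in plain Python. The class below is a hypothetical illustration, not the API of any particular DSMS: it keeps only the items of the last `width` seconds and returns an updated average each time an item arrives. It assumes items arrive in timestamp order.

```python
from collections import deque


class SlidingWindowAverage:
    """Continuous query: average of the values seen in the last `width` seconds."""

    def __init__(self, width):
        self.width = width
        self.window = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def on_item(self, timestamp, value):
        """Process one arriving item and return the updated window average."""
        self.window.append((timestamp, value))
        self.total += value
        # Expire items that have fallen out of the time window.
        while self.window[0][0] <= timestamp - self.width:
            _, old_value = self.window.popleft()
            self.total -= old_value
        return self.total / len(self.window)


query = SlidingWindowAverage(width=10)
readings = [(0, 10.0), (4, 20.0), (8, 30.0), (12, 40.0), (25, 50.0)]
results = [query.on_item(ts, v) for ts, v in readings]
# results == [10.0, 15.0, 20.0, 30.0, 50.0]

if __name__ == "__main__":
    print(results)
```

Unlike a one-time query, the result is re-evaluated on every arrival, and memory is bounded by the window rather than by the stream.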
29. Challenges: Data Collection
Data provenance and quality
¡ Data provenance: Provenance refers to the information that describes data in sufficient detail to facilitate reproduction and enable validation of results.
¡ Data quality: Validity and consistency of the data. Is it up to date and fit for the targeted use case?
Source: Patrick McDaniel, Kevin Butler, Steve McLaughlin, Radu Sion, Erez Zadok, and Marianne Winslett, Towards a secure and efficient system for end-to-end provenance, 2010.
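As a rough illustration of what such checks can look like at collection time, the sketch below flags records with missing provenance, invalid or out-of-range values, and stale timestamps. The record layout (`source`, `value`, `timestamp`), the value range and the one-hour freshness limit are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def quality_issues(record, now, max_age=timedelta(hours=1), valid_range=(-50.0, 60.0)):
    """Return a list of provenance and quality problems found in one record."""
    issues = []

    # Provenance: we need to know where the record came from.
    if not record.get("source"):
        issues.append("no provenance: source is missing")

    # Validity and consistency of the measured value.
    value = record.get("value")
    if not isinstance(value, (int, float)):
        issues.append("invalid: value is missing or not numeric")
    elif not valid_range[0] <= value <= valid_range[1]:
        issues.append("inconsistent: value outside the expected range")

    # Freshness: is the record up to date for the targeted use case?
    timestamp = record.get("timestamp")
    if timestamp is None:
        issues.append("invalid: timestamp is missing")
    elif now - timestamp > max_age:
        issues.append("stale: record is older than max_age")

    return issues


if __name__ == "__main__":
    now = datetime(2014, 10, 24, 12, 0)
    record = {"source": "sensor-1", "value": 21.5, "timestamp": now - timedelta(minutes=5)}
    print(quality_issues(record, now))
```

What counts as "fit for use" depends on the application, which is why the thresholds are parameters rather than constants.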
30. Challenges in data storage
¡ Large amounts of data
¡ Need to use a highly distributed architecture
¡ Massive queries
¡ Avoid joins since they are very time consuming
¡ Evolutionary schema
¡ Flexibility and scalability
¡ Predictable and low latency
¡ High availability
¡ Elasticity: Horizontal extensibility (Scale out)
¡ No need for: transactions, strong consistency, complex queries
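One common way to spread keys over a cluster so that it can scale out is consistent hashing. The sketch below is a minimal illustration under stated assumptions (node names, number of virtual nodes and the use of MD5 are arbitrary choices), not the scheme of any specific store. It shows that when a node is added, only the keys taken over by the new node move.

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string to a position on the ring (stable across runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=50):
        self.vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place `vnodes` virtual copies of the node on the ring."""
        for i in range(self.vnodes):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def node_for(self, key):
        """Return the owner of `key`: the first ring position after its hash."""
        idx = bisect.bisect(self._positions, ring_position(key)) % len(self._positions)
        return self._owner[self._positions[idx]]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out by one node
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

if __name__ == "__main__":
    print(f"{len(moved)} of {len(keys)} keys moved, all to node-d")
```

With naive `hash(key) % number_of_nodes` sharding, most keys would change owner when the node count changes; here the existing nodes keep the keys they do not hand over.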
31. Limitation of RDBMS
“If the only tool you have is a hammer, you tend to see every problem as a nail.”
Abraham Maslow
33. NoSQL
• NoSQL => Not Only SQL
• SQL must not die, but other storage solutions should be considered for specific applications
Exact name: non-relational DB
34. CAP theorem (E. Brewer, N. Lynch, 2000)
“CAP Theorem”: C-A-P: choose two.
Claim: every distributed system is on one side of the triangle.
[Triangle with vertices C (Consistency), A (Availability), P (Partition-Tolerance)]
CA: available and consistent, unless there is a partition.
CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others.
AP: a reachable replica provides service even in a partition, but may be inconsistent.
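The CP and AP sides of the triangle can be made concrete with a toy model. The sketch below is a deliberately simplified illustration, not a real replication protocol: two replicas hold one value, and the hypothetical `mode` flag decides what a replica does when its peer is unreachable.

```python
class Replica:
    """One copy of a single value in a two-replica system (toy model)."""

    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = None
        self.peer = None
        self.partitioned = False  # True when the peer is unreachable

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # Deny service rather than diverge from the unreachable peer.
                raise RuntimeError("unavailable: cannot reach peer")
            self.value = value    # AP: accept locally, replicas may diverge
            return
        self.value = value
        self.peer.value = value   # synchronous replication while connected


def make_pair(mode):
    """Create two connected replicas running in the given mode."""
    a, b = Replica(mode), Replica(mode)
    a.peer, b.peer = b, a
    return a, b


def partition(a, b):
    """Simulate a network partition between the two replicas."""
    a.partitioned = b.partitioned = True


# CP: the write is refused during the partition, so the copies stay equal.
cp_a, cp_b = make_pair("CP")
cp_a.write(1)
partition(cp_a, cp_b)
try:
    cp_a.write(2)
    cp_refused = False
except RuntimeError:
    cp_refused = True

# AP: the write is accepted during the partition, so the copies diverge.
ap_a, ap_b = make_pair("AP")
ap_a.write(1)
partition(ap_a, ap_b)
ap_a.write(2)
ap_diverged = ap_a.value != ap_b.value

if __name__ == "__main__":
    print("CP refused write:", cp_refused, "| AP replicas diverged:", ap_diverged)
```

A real AP system would also need a way to reconcile the diverged copies once the partition heals, which this sketch leaves out.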
36. Challenges in Data Analytics
¡ Problems in large scale analytics
¡ Distributed computation efficiency
¡ Evaluate performance gains from distribution
¡ Bringing data to the processor
¡ Efficient parallel algorithms (statistics, summaries)
¡ Speed analytics
¡ Streaming computations
¡ Load balancing
¡ Load Shedding
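Load shedding can be sketched as random sampling applied when a burst exceeds what the system can process. The example below is one simple strategy among several, written under stated assumptions: the batch size, the capacity and the fixed random seed are arbitrary values chosen for illustration.

```python
import random


def shed_load(items, capacity, rng):
    """Randomly drop items so that about `capacity` remain on average.

    Returns the kept items and the sampling rate, so that aggregates
    computed on the kept items can be scaled back up.
    """
    if len(items) <= capacity:
        return list(items), 1.0
    rate = capacity / len(items)
    kept = [x for x in items if rng.random() < rate]
    return kept, rate


rng = random.Random(42)   # fixed seed so the sketch is repeatable
burst = [1] * 10_000      # hypothetical burst: 10,000 events in one tick
kept, rate = shed_load(burst, capacity=1_000, rng=rng)

# Scale the sampled count back up to estimate the true count.
estimated_count = len(kept) / rate

if __name__ == "__main__":
    print(f"kept {len(kept)} of {len(burst)} items, estimated count {estimated_count:.0f}")
```

The trade-off is explicit: the system stays responsive during the burst, at the cost of approximate rather than exact results.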
37. Challenges in Data Access and Visualization
¡ The main goal of data visualization is to communicate information clearly and effectively through graphical means
¡ Provide the results of the analytics workflow to faster systems such as real-time query interfaces
“Visualization is a form of knowledge compression”
- David McCandless
38. Big Data: Technological challenges
¡ Data infrastructure tools and platforms: data centers, cloud infrastructures, NoSQL databases, in-memory databases, Hadoop/MapReduce ecosystem
¡ New generation of front-end tools for BI and analytic systems: data visualization and visual analytics, self-service BI, mobile BI
¡ Data processing: supercomputers, distributed or massively parallel computing
40. Conclusion: Big Data challenges
¡ Semantic Information aggregation
¡ Information aggregation: “too much data to assimilate but not enough knowledge to act”
¡ Distributed and real-time processing
¡ Design of real-time and distributed algorithms for stream processing and information aggregation
¡ Distribution and parallelization of data mining algorithms
¡ Optimizing resources
¡ Visual analytics and user modeling
¡ Dynamic user model
¡ Novel visualizations for very large datasets
¡ Data protection
41. IEEE Metro Area Smart Tech Workshop on Distributed Data Streaming, Dec 5, 2014, Paris
¡ 08h00: Registration - Breakfast
¡ 08h50: Room L012 - Welcome
¡ 09h00: Room L012 - Introduction to Distributed Data Streaming - Speaker: Raja Chiky (ISEP)
¡ 10h15: Coffee break
¡ 10h45: Room L012 - Real World Issues in Supervised Classification for Data Streams - Speaker: Vincent Lemaire (Orange Labs)
¡ 11h30: Room L012 - Use Case 1 - Finance - Speaker: Antoine Chambille (Quartet FS)
¡ 12h00: Room L012 - Use Case 2 - Smart metering - Speaker: Marie-Luce Picard (EDF R&D)
¡ 12h30: Lunch offsite
¡ 14h00: Rooms L305-L306 - 2 parallel lab sessions: Real-Time Data processing with open source DSMS - Speakers: Raja Chiky and Sylvain Lefebvre - 1st part
¡ 15h30: Coffee break
¡ 16h00: Rooms L305-L306 - 2 parallel lab sessions: Real-Time Data processing with open source DSMS - Speakers: Raja Chiky and Sylvain Lefebvre - 2nd part
¡ 17h30: Reception onsite
43. Big Data / Linked Data
Big Data: Volume, Variety, Velocity, Veracity, … Value
Linked Data: Web of data, Semantic Web
- A set of principles and good practices for linking, publishing and searching web data
- Structure and semantically enrich RDF data, with very high scalability -> Big Linked Data
Integrate, aggregate, analyze, visualize large data sets, whatever their type, provenance, or the speed of their flow …
Big Linked Data / Linked Big Data
Semantic Technologies Living Lab
Linked & Big Data Academic Chair
Our value proposition
– Semantic aggregation from textual and non-textual streams
– Manage semantic heterogeneity, real-time and distributed processing
– Ensure data quality and veracity
– Visual analytics