Internet Infrastructures for Big Data
Talk given at Verisign's Distinguished Speaker Series, 2014
Prof. Philippe Cudre-Mauroux
eXascale Infolab
http://exascale.info/
Big data describes massive volumes of structured and unstructured data that are so large they are difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big, moves too fast, or exceeds current processing capacity. The term is believed to have originated with Web search companies that had to query very large distributed aggregations of loosely structured data.
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
1. Internet Infrastructures for Big Data
Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
VeriSign EMEA, June 26, 2014
2. eXascale Infolab
• New lab @ U. of Fribourg, Switzerland
• Financed by Swiss Federal State / companies / private foundations
• Big (non-relational) data management (Volume, Velocity, Variety) (… mostly)
3. On the Menu Today
• Big Data!
  – Big Data Buzz
  – 3 Big Data projects w/ XI & Verisign
5. Big Data “Central Theorem”
Data + Technology → Actionable Insight → $$
Reporting, Monitoring, Root Cause Analysis, (User) Modelization, Prediction
6. Big Data Buzz
Between now and 2015, the firm expects big data to create some 4.4 million IT jobs globally; of those, 1.9 million will be in the U.S. Applying an economic multiplier to that estimate, Gartner expects each new big-data-related IT job to create work for three more people outside the tech industry, for a total of almost 6 million more U.S. jobs.
Growth in the Asia Pacific Big Data market is expected to accelerate rapidly in two to three years' time, from a mere US$258.5 million last year to in excess of US$1.76 billion in 2016, with the highest growth in the storage segment.
7. Big Data Everywhere!
• The Age of Big Data (NYTimes, Feb. 11, 2012)
  http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html
“Welcome to the Age of Big Data. The new megarich of Silicon Valley, first at Google and now Facebook, are masters at harnessing the data of the Web — online searches, posts and messages — with Internet advertising. At the World Economic Forum last month in Davos, Switzerland, Big Data was a marquee topic. A report by the forum, “Big Data, Big Impact,” declared data a new class of economic asset, like currency or gold.”
10. The 3-Vs of Big Data
• Volume
  – amount of data
• Velocity
  – speed of data in and out
• Variety
  – range of data types and sources
• [Gartner 2012] "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization"
Coming up: 3 examples from XI
11. Volume: Fixing the Hadoop Distributed File System
• Hadoop (YARN): “cluster Operating System”
• Often synonymous with Big Data
• Used everywhere (… even in CH)
12. HDFS Block Placement Strategy
[Diagram: replica placement across Rack 1 and Rack 2]
• 1st replica on the local node or a random node
• 2nd replica on a different node in a different rack
• 3rd replica on a different node in the same rack as the 2nd replica
➡ Not hardware-aware
➡ Operates at the block level rather than the file level
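The default policy above is easy to express in code. Below is a minimal Python sketch of that three-replica rule; HDFS itself implements this in Java inside the NameNode, so the data structures here are simplified assumptions for illustration only.

```python
import random

def choose_replica_nodes(racks, writer=None):
    """Default HDFS-style 3-replica placement (simplified sketch):
    1st replica on the writer's node (or a random node), 2nd on a node
    in a different rack, 3rd on another node in the 2nd replica's rack.
    `racks` maps rack id -> list of node names; assumes >= 2 racks and
    >= 2 nodes per rack so every choice below is non-empty."""
    all_nodes = [(rack, node) for rack, nodes in racks.items() for node in nodes]
    first = writer if writer is not None else random.choice(all_nodes)
    # 2nd replica: any node in a different rack (off-rack fault tolerance)
    off_rack = [(r, n) for (r, n) in all_nodes if r != first[0]]
    second = random.choice(off_rack)
    # 3rd replica: a different node in the 2nd replica's rack (cheaper write)
    same_rack = [(r, n) for (r, n) in off_rack if r == second[0] and n != second[1]]
    third = random.choice(same_rack)
    return [first, second, third]

racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(choose_replica_nodes(racks, writer=("rack1", "n1")))
```

Note that nothing in this rule looks at disk generation, CPU speed, or utilization, which is exactly the "not hardware-aware" limitation the next slide addresses.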
13. Solution: Hadaps File Placement
• Assigns weights to DataNodes
  – I/O-bound jobs finish earlier on new media
  – CPU-bound jobs finish earlier on new CPUs
• Uses lower-utilization servers first
• Moves more blocks to newer generations
• Operates at the file level
Up to 300% performance improvement by activating all nodes
[Diagram: DataNodes A–F holding blocks 1–9, with per-node weights; higher-weight (newer) nodes receive more blocks]
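The slide does not spell out Hadaps' placement algorithm, so the following is only a hypothetical Python sketch of the general idea: nodes carry weights, blocks of a whole file are assigned preferentially to higher-weight (newer) nodes, and ties break toward lower current utilization. The scoring function is an assumption, not the actual Hadaps heuristic.

```python
def place_file_blocks(nodes, num_blocks):
    """Hypothetical weighted, file-level placement in the spirit of Hadaps.
    `nodes`: dict name -> {"weight": float, "used": int} where `weight`
    reflects hardware generation and `used` the blocks already placed."""
    placement = {}
    for block in range(num_blocks):
        # Higher weight and lower utilization make a node more attractive
        # for the next block of this file.
        best = max(nodes, key=lambda n: nodes[n]["weight"] / (1 + nodes[n]["used"]))
        placement[block] = best
        nodes[best]["used"] += 1
    return placement

nodes = {"A": {"weight": 1, "used": 0}, "E": {"weight": 3, "used": 0}}
print(place_file_blocks(nodes, 6))  # most blocks land on the newer node E
```

Because the whole file is placed at once, the scheduler can balance its blocks across hardware generations, which block-by-block placement cannot do.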
16. Data at each Vertex!
• Spatial + temporal statistical processing (mini-Lisas)
• Stream processing (Storm) + Array processing (SciDB)
[Diagram: base stations 17, 29, 42 and sensors 1053/1054 connected through a Peer Information Management overlay on top of an Array Data Management System (OLTP / HYRISE / OLAP at each node); the stream-processing flow detects missing data (Data Gap Event), checks for anomalies (Anomaly Event / Alert), computes sliding-window averages and mini-Lisa statistics, and publishes delta-compressed values on fluctuation (Publish Value Event), otherwise emitting Alive Events]
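To make the flow in the diagram concrete, here is a self-contained toy sketch of the per-sensor logic: data-gap detection, a sliding-window anomaly check, and delta-compressed publishing. The window size, gap timeout, and thresholds are illustrative assumptions; in the architecture above this logic would run inside a Storm topology rather than standalone Python.

```python
from collections import deque

class SensorStream:
    """Toy version of the per-vertex stream flow sketched above."""
    def __init__(self, window=10, gap_timeout=5.0, anomaly_factor=3.0, delta=0.1):
        self.values = deque(maxlen=window)   # sliding window of recent readings
        self.last_ts = None
        self.last_published = None
        self.gap_timeout, self.anomaly_factor, self.delta = gap_timeout, anomaly_factor, delta

    def on_reading(self, ts, value):
        events = []
        # Missing data? -> Data Gap Event
        if self.last_ts is not None and ts - self.last_ts > self.gap_timeout:
            events.append("data_gap_event")
        self.last_ts = ts
        # Anomaly detected? Compare against the sliding-window average.
        avg = sum(self.values) / len(self.values) if self.values else value
        if self.values and abs(value - avg) > self.anomaly_factor * max(abs(avg), 1e-9):
            events.append("anomaly_event")   # would trigger an Alert downstream
        self.values.append(value)
        # Delta compression: publish only on sufficient fluctuation.
        if self.last_published is None or abs(value - self.last_published) > self.delta:
            self.last_published = value
            events.append(("publish_value_event", value))
        else:
            events.append("alive_event")
        return events

s = SensorStream()
for ts, v in [(0, 1.0), (1, 1.01), (2, 9.0), (10, 1.0)]:
    print(ts, s.on_reading(ts, v))
```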
18. Variety: Sharing Data Locally & Globally
• 70+% of the world’s population has no or very limited access to the Web [Ahmed Shams 2013]
19. Our Solution: ERS, the Entity Registry System
• Three-tier solution to deploy data-powered apps
  – Flexible
    • Seamlessly reconcile entities in local / ad-hoc / global modes
  – Collaborative
    • Transactional consistency, data versioning
  – Scalable
    • Bridges, scale-out servers, tunable consistency
  – Open-source
    • https://github.com/ers-devs
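The slide lists only ERS's high-level properties. As a purely hypothetical illustration of what "reconciling entities across local / ad-hoc / global modes" could look like, the sketch below models entities as sets of versioned statements that can be merged between registries; the class, method names, and semantics are assumptions for exposition, not ERS's actual API.

```python
class EntityRegistry:
    """Hypothetical toy registry in the spirit of ERS: entities are sets of
    versioned (property, value) statements that can be merged (reconciled)
    across local, ad-hoc, and global registries."""
    def __init__(self, scope):
        self.scope = scope        # "local", "ad-hoc", or "global"
        self.entities = {}        # entity id -> {property: [value versions]}

    def assert_statement(self, entity, prop, value):
        self.entities.setdefault(entity, {}).setdefault(prop, []).append(value)

    def reconcile(self, other):
        """Merge another registry's statements, keeping all value versions
        so no contributor's edits are lost."""
        for entity, props in other.entities.items():
            for prop, versions in props.items():
                mine = self.entities.setdefault(entity, {}).setdefault(prop, [])
                mine.extend(v for v in versions if v not in mine)

local = EntityRegistry("local")
local.assert_statement("urn:ers:fribourg", "population", 38000)
glob = EntityRegistry("global")
glob.assert_statement("urn:ers:fribourg", "canton", "FR")
glob.reconcile(local)   # offline/local edits flow into the global registry
print(glob.entities)
```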
20. Ongoing Deployments
• Entity-powered apps for the Sugar Learning Platform
• Ambient Assisted Living of elderly persons in tropical environments
21. Special Thanks to…
• Vincenzo Russo, Benoit Perroud, Matt Thomas, Romain Cholat and the whole Verisign Fribourg office
• Burt Kaliski and his team
• Allison Mankin, Scott Hollenbeck, Debra Anderson & the Internet Infrastructures Grant team
… for their continued support