www.bsc.es
The state of the art in Big Data & Data Science:
The data revolution
Prof. Mateo Valero
BSC Director
Big Data: from scientific research to business management
Fundación Ramón Areces
Madrid, July 3, 2014
Evolution over time of the research paradigm
• In the last millennium, science was empirical:
• description of natural phenomena
• A few centuries ago, the theoretical approach emerged:
• using models, formulas and generalizations
• In recent decades, computational science appeared:
• simulation of complex phenomena
• Now the focus is on the exploration of big data (eScience):
• unification of theory, experiment and simulation
• massive data captured by instruments or generated through simulation, and processed by computer
• knowledge and information stored in computers
• scientists analyse databases and files on data infrastructures
$$\left(\frac{\dot{a}}{a}\right)^{2} = \frac{4\pi G\rho}{3} - K\frac{c^{2}}{a^{2}}$$
Jim Gray, National Research Council, http://sites.nationalacademies.org/NRC/index.htm;
Computer Science and Telecommunications Board, http://sites.nationalacademies.org/cstb/index.htm.
Higgs and Englert’s Nobel for Physics 2013
Last year one of the most computer-intensive scientific
experiments ever undertaken confirmed Peter Higgs and
François Englert’s theory by making the Higgs boson –
the so-called “God particle” – in an $8bn atom smasher,
the Large Hadron Collider at CERN outside Geneva.
“The LHC produces 600 TB/sec… and after filtering
needs to store 25 PB/year”… 15 million sensors…
Sequencing Costs
Source: National Human Genome Research Institute (NHGRI)
http://www.genome.gov/sequencingcosts/
(1) "Cost per Megabase of DNA Sequence" — the cost of determining one megabase (Mb; a million bases) of DNA sequence of a
specified quality
(2) "Cost per Genome" - the cost of sequencing a human-sized genome. For each, a graph is provided showing the data since 2001
In both graphs, the data from 2001 through October 2007 represent the costs of generating DNA sequence using Sanger-based
chemistries and capillary-based instruments ('first generation' sequencing platforms). Beginning in January 2008, the data represent the
costs of generating DNA sequence using 'second-generation' (or 'next-generation') sequencing platforms. The change in instruments
represents the rapid evolution of DNA sequencing technologies that has occurred in recent years.
We have an issue…. Need for faster ways of processing data,
not storing everything…
World-class genomic consortiums
• European Genome-phenome Archive (EGA) repository will allow us to
explore datasets from numerous genetic studies.
• Pan-cancer will produce rich data that will provide a major opportunity to
develop an integrated picture of differences across tumor lineages.
Square Kilometer Array (SKA)
Pervasive connectivity (Tbps ···> Gbps), explosion of information, smart device expansion
In 60 seconds today (2013):
• 400,710 ad requests
• 2,000 lyrics played on Tunewiki
• 1,500 pings sent on PingMe
• 208,333 minutes of Angry Birds played
• 23,148 apps downloaded
• 98,000 tweets
By 2020:
• 30 billion devices
• 40 trillion GB of data
• 10 million mobile apps
• … for 8 billion
Sources, numbered (1) to (4) on the slide:
(1) IDC Directions 2013: Why the Datacenter of the Future Will Leverage a Converged Infrastructure, March 2013, Matt Eastwood ; (2) & (3) IDC Predictions 2012: Competing for 2020, Document 231720, December 2011, Frank Gens; (4)
http://en.wikipedia.org
A New Era of Information Technology
Current infrastructure sagging under its own weight
Internet of Things
Challenges of data generation
Source: http://www-01.ibm.com/software/data/bigdata/
The Data Deluge
• 2005: 0.1 ZB
• 2010: 1.2 ZB
• 2012: 2.8 ZB
• 2015: 8.5 ZB
• 2020: 40 ZB* (figure exceeds prior forecasts by 5 ZB)
* Source: IDC
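The growth implied by these forecast figures can be made explicit with a small calculation. The sketch below assumes that 2.8 ZB corresponds to 2012 and 40 ZB to 2020, as read from the slide; the helper name `cagr` is illustrative only.

```python
import math

def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Assumed endpoints read from the slide (IDC forecast), in zettabytes.
zb_2012, zb_2020 = 2.8, 40.0

rate = cagr(zb_2012, zb_2020, 2020 - 2012)
doubling_years = math.log(2) / math.log(1 + rate)

print(f"Implied growth: {rate:.1%} per year")
print(f"Implied doubling time: {doubling_years:.1f} years")
```

Under these assumptions the digital universe grows by a little under 40% per year, which is roughly a doubling every two years.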
How big is big?
• Megabyte (10^6)
• Gigabyte (10^9)
• Terabyte (10^12): 500 TB of new data per day are ingested in Facebook databases
• Petabyte (10^15): the CERN Large Hadron Collider generates 1 PB per second
• Exabyte (10^18): 1 EB of data is created on the internet each day = 250 million DVDs' worth of information. The proposed Square Kilometer Array telescope will generate an EB of data per day
• Zettabyte (10^21): 1.3 ZB of network traffic by 2016
• Yottabyte (10^24): this is our digital universe today = 250 trillion DVDs
• Brontobyte (10^27): this will be our digital universe tomorrow…
• Geopbyte (10^30): this will take us beyond our decimal system
• Saganbyte, Jotabyte, …
Technological Achievements
Transistor (Bell Labs, 1947)
– DEC PDP-1 (1957)
– IBM 7090 (1960)
Integrated circuit (1958)
– IBM System 360 (1965)
– DEC PDP-8 (1965)
Microprocessor (1971)
– Intel 4004
Outline
The Big Data era: data generation explosion
Storing data
Processing data
Where do we place data?
How do we use data?
Research at BSC
Universidad Veracruzana, Xalapa, October 2005, M. Valero
Magnetic Tape Memory
Invented by Eckert & Mauchly for the UNIVAC I, March 21, 1951
– Model UNISERVO
– 224 KB of data
– 1/2 inch wide, 1,200 feet long, 128 characters per inch
– Speed: 100 inches per second, equivalent to 12,800 characters per second
StorageTek T10000D, 2013
– 8.5 Terabytes (40,000,000× increase)
– EBW: 250 Mbytes/second (20,000× increase)
– Load time: 10 seconds
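The ratios quoted for tape can be checked with a few lines of arithmetic. This is a rough sketch that assumes one character equals one byte and decimal prefixes (1 KB = 10^3 bytes); all variable names are illustrative.

```python
# UNISERVO (1951), figures from the slide.
density_chars_per_inch = 128
speed_inches_per_s = 100
uniservo_rate = density_chars_per_inch * speed_inches_per_s  # characters per second
uniservo_capacity = 224e3  # 224 KB, assuming decimal prefixes

# T10000D (2013), figures from the slide.
t10k_capacity = 8.5e12  # 8.5 TB
t10k_rate = 250e6       # 250 MB/s

capacity_gain = t10k_capacity / uniservo_capacity
rate_gain = t10k_rate / uniservo_rate

# Time to stream a full tape end to end at the quoted rate.
uniservo_full_read_s = uniservo_capacity / uniservo_rate
t10k_full_read_h = t10k_capacity / t10k_rate / 3600

print(f"Transfer rate 1951: {uniservo_rate} chars/s")
print(f"Capacity gain: {capacity_gain:,.0f}x, bandwidth gain: {rate_gain:,.0f}x")
print(f"Full read: {uniservo_full_read_s:.1f} s (1951) vs {t10k_full_read_h:.1f} h (2013)")
```

Capacity has grown about 2,000 times faster than bandwidth, so reading a whole cartridge went from seconds to hours: one reason why storing everything and scanning it later becomes impractical.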
First Disk, IBM, 1956
1956, IBM 305 RAMAC
4 MB, 50 × 24” disks, 1,200 rpm, 100 bits/track
Intertrack spacing: 0.1 inches, Density: 1,000 bits/in²
100 ms access, vacuum tubes, $35k/year rent
Year 2013: 4 Terabytes (1,000,000× increase)
Average access time: a few milliseconds (40 to 1)
Areal density: doubling on average every 2/4 years, but not now
Predicted: 14 Terabytes in 2020 at a cost of $40
HDD: Hard Disk Drive
Storage Device Density Landscape
[Figure: areal density (Gbit/in², log scale from 0.1 to 10,000) versus year (1990 to 2018) for HDD products, NAND products, tape products and tape demos. Growth-rate annotations: NAND (1 bit/cell, then 2 bit/cell) 40%/yr; HDD segments at 40%/yr, 100%/yr and 20%/yr, the latest marked "20%/yr ??"; tape 40%/yr; tape demos 75%/yr; optical 20%/yr.]
Source: Decad, Fontana, Hetzler – IBM Journal
Tapes advancing fast
May 2014
IBM and Fujifilm have demoed a
154TB LTO-size tape cartridge
which could come to market in 10
years' time.
The Sony development involved a
148Gbit/in2 tape density and its
own tape design to achieve a
185TB uncompressed capacity.
148 Gbit/in2
3.1 Gbit/in2 (T10000C)
http://www.insic.org/news/2012Roadmap/12index.html
The MultiCore Era
Moore’s Law + Memory Wall + Power Wall
Chip MultiProcessors (CMPs)
UltraSPARC T2 (2007)
Intel Xeon 7100
(2006)
POWER4 (2001)
How are the Multicore architectures designed?
IBM Power4 (2001)
– 2 cores, ST
– 16 Gflops/s
– L1: 64KB+32KB/core
L2: 1.41MB /chip
L3: 32MB off-chip
– 115W TDP
– 10GB/s mem BW
IBM Power7 (2010)
– 8 cores, SMT4
– 128 Gigaflops/s
– L1: 32KB+32KB/core
L2: 256KB/core
L3: 32 MB in total
(on-chip)
– 170W TDP
– 100GB/s mem EBW
IBM Power8 (2014)
– 12 cores, SMT8
– 400 Gigaflops/s
– L1:64KB+32KB/core
L2: 512KB/core
L3:49,96 MB in total
(on-chip)
– 250W TDP
– 410GB/s mem BW
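The three generations above can be compared on energy efficiency and memory balance. The sketch below uses only the peak Gflop/s, TDP and memory bandwidth figures listed on the slide (the POWER7 bandwidth is listed as "EBW"); the derived metrics and the function name are illustrative, not vendor figures.

```python
# (peak Gflop/s, TDP in watts, memory bandwidth in GB/s), as listed on the slide.
chips = {
    "POWER4 (2001)": (16, 115, 10),
    "POWER7 (2010)": (128, 170, 100),
    "POWER8 (2014)": (400, 250, 410),
}

def derived_metrics(gflops, watts, mem_gbs):
    """Return (Gflop/s per watt, bytes of memory bandwidth per peak flop)."""
    return gflops / watts, mem_gbs / gflops

metrics = {name: derived_metrics(*spec) for name, spec in chips.items()}

for name, (gflops_per_watt, bytes_per_flop) in metrics.items():
    print(f"{name}: {gflops_per_watt:.2f} Gflop/s per W, {bytes_per_flop:.2f} B/flop")
```

On these numbers, peak efficiency improves by roughly an order of magnitude between POWER4 and POWER8, while the memory bandwidth per flop rises only modestly.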
Top10
Rank | Site | Computer | Procs (accelerator cores) | Rmax (PFlop/s) | Rpeak (PFlop/s) | Power (MW) | GFlops/Watt | Name
1 | National University of Defense Technology | TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P | 3120000 (2736000) | 33.86 | 54.90 | 17.8 | 1.90 | Tianhe-2 (MilkyWay-2)
2 | DOE/SC/Oak Ridge National Lab | Cray XK7, Opteron 6274 16C 2.20 GHz, Cray Gemini interconnect, NVIDIA K20x | 560640 (261632) | 17.59 | 27.11 | 8.21 | 2.14 | Titan
3 | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | 1572864 | 17.17 | 20.13 | 7.89 | 2.18 | Sequoia
4 | RIKEN Advanced Institute for Computational Science (AICS) | Fujitsu, K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705024 | 10.51 | 11.28 | 12.65 | 0.83 | K
5 | DOE/SC/Argonne National Laboratory | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | 786432 | 8.58 | 10.06 | 3.94 | 2.18 | Mira
6 | CSCS | Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x | 115984 (73808) | 6.27 | 7.79 | 2.32 | 2.70 | Piz Daint
7 | Texas Advanced Computing Center | PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi | 462462 (366366) | 5.17 | 8.52 | 4.51 | 1.14 | Stampede
8 | Forschungszentrum Juelich (FZJ) | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | 458752 | 5.00 | 5.87 | 2.30 | 2.18 | JUQUEEN
9 | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | 393216 | 4.29 | 5.03 | 1.97 | 2.18 | Vulcan
10 | Government | Cray XC30, Intel Xeon E5-2697v2 12C 2.7GHz, Aries | 225984 | 3.14 | 4.88 | | |
Barcelona, February, 2012
Parallel Systems
[Diagram: nodes joined by an interconnect (Myrinet, IB, GigE, 3D torus, tree, …). Each node is an SMP of multicore chips with memory and an interconnect interface (IN). Node variants: homogeneous multicore (BlueGene-Q chip); heterogeneous multicore with a general-purpose accelerator (e.g. Cell), GPU, FPGA or ASIC (e.g. Anton for MD). Inside the chip: network-on-chip (bus, ring, direct, …).]
Tianhe-2
Compute node with:
– 2 IvyBridge Xeon sockets (12 cores each) with 88 GB memory
– 3 Xeon Phi sockets with 8 GB memory
for a total of 3,120,000 cores in the system
IvyBridge socket: 8 flops/cycle per core × 12 cores/socket × 2.2 GHz = 211.2 Gflop/s peak
Xeon Phi socket: 16 flops/cycle per core × 57 cores/socket × 1.1 GHz = 1.003 Tflop/s peak
Per node: 2 IvyBridge × 0.2112 Tflop/s + 3 Phi × 1.003 Tflop/s = 3.431 Tflop/s
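The peak-performance arithmetic for a Tianhe-2 node can be reproduced as follows. The per-socket figures come from the slide; the node count is not stated there and is derived here from the total core count, so treat it as an assumption. Function and variable names are illustrative.

```python
def peak_gflops(flops_per_cycle, cores, ghz):
    """Peak Gflop/s of one socket: flops/cycle per core x cores x clock in GHz."""
    return flops_per_cycle * cores * ghz

ivy_gflops = peak_gflops(8, 12, 2.2)    # IvyBridge Xeon socket
phi_gflops = peak_gflops(16, 57, 1.1)   # Xeon Phi socket

# One node: 2 IvyBridge sockets plus 3 Xeon Phi sockets.
node_tflops = (2 * ivy_gflops + 3 * phi_gflops) / 1000
cores_per_node = 2 * 12 + 3 * 57

# Assumption: every core in the system belongs to an identical node.
total_cores = 3_120_000
nodes = total_cores // cores_per_node
system_pflops = node_tflops * nodes / 1000

print(f"IvyBridge socket: {ivy_gflops:.1f} Gflop/s, Xeon Phi socket: {phi_gflops:.1f} Gflop/s")
print(f"Node: {node_tflops:.3f} Tflop/s, {cores_per_node} cores")
print(f"System: {nodes} nodes, about {system_pflops:.1f} Pflop/s peak")
```

Without intermediate rounding the node peak comes out at 3.432 Tflop/s (the slide rounds the Xeon Phi figure to 1.003 first), and the derived system peak of about 54.9 Pflop/s is close to the Rpeak reported for Tianhe-2 in the Top500 list.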
Evolution of the computing power of Supercomputers
FLOP/second (operations on 64-bit real numbers)
1988: Cray Y-MP (8 processors)
1998: Cray T3E (1,024 processors)
2008: Cray XT5 (15,000 processors)
~2018: ? (1×10^7 processors)
Intel Xeon Phi or
Intel Many Integrated Core Architecture (MIC)
Knights Corner (2011)
– Coprocessor, 61 x86 cores,
22nm, AVX-512, 4 HTs
– 1.2TFLOPS (DP), 300W TDP,
4 GFLOPS/W
– 512KB/core L2 coherent
– Int Netw: Ring
– Mem BW: 352GB/s
Knights Landing (exp 2015)
– Coprocessor or host processor
– 72 Atom cores, 14nm, AVX512 per core, 4 HTs
– Up to 16GB of DRAM 3D stacked on-package, 384GB GDDR
– 3TFLOPS (DP), 200W TDP, 15GFLOPS/W
Accelerators: NVIDIA Kepler GK110 GPU (2014)
DP Performance:
1.43 Tflop
Mem BW (ECC off):
288 GB/s
Memory size (GDDR5):
12 GB
15 SMX units, each with:
– 192 single‐precision
CUDA cores
– 64 double‐precision units
– 32 special function units
– 32 load/store units
Six 64‐bit memory
controllers
Processor node
Thanks to S. Borkar, Intel
Nvidia: Node for the Exaflop Computer
Thanks Bill Dally
Graph 500 Project
HPC benchmarks and performance metrics do not provide useful information on
the suitability of supercomputing systems for data intensive applications.
Graph 500 establishes a set of large-scale benchmarks for these applications.
Steering Committee: 50 international HPC experts from academia and industry.
Graph 500 is developing benchmarks to address three application kernels:
– Concurrent search
– Optimization (single source shortest path)
– Edge-oriented (maximal independent set)
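The search kernel is a breadth-first search (BFS), and results are reported in traversed edges per second (TEPS). The sketch below is a minimal single-threaded illustration on a toy graph, not the benchmark's reference code: the edge-counting rule used here (every adjacency entry scanned) is a simplifying assumption, and the official specification defines both the edge count and the timing more precisely.

```python
import time
from collections import deque

def bfs(adj, root):
    """Breadth-first search returning the parent map and the number of edges scanned."""
    parent = {root: root}
    edges_scanned = 0
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            edges_scanned += 1
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent, edges_scanned

# Toy undirected graph stored as adjacency lists; vertex 4 is isolated.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2], 4: []}

start = time.perf_counter()
parent, edges_scanned = bfs(adj, 0)
elapsed = time.perf_counter() - start

teps = edges_scanned / elapsed if elapsed > 0 else float("inf")
print(f"Reached {len(parent)} vertices, scanned {edges_scanned} edges, {teps:.3g} TEPS")
```

Note that the inner loop does almost no arithmetic: each step is a pointer chase into an irregular data structure, which is why this kind of kernel stresses memory and network rather than floating-point units.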
Also they are addressing five graph-related business areas:
– Cybersecurity
– Medical Informatics
– Data Enrichment
– Social Networks
– Symbolic Networks.
• Top500 defined a benchmark (Linpack) to rank HPC machines by performance. This
benchmark does not capture the characteristics of Big Data applications.
• Green Graph 500 list:
• collects performance-per-watt metrics
• to compare the energy consumption of data-intensive computing workloads
Graph500 vs. Top500
Linpack (Top500: top500.org):
– compute-bound
– focused on floating-point operations
– Bulk-Synchronous-Parallel model: behavior based on big computation and short communication bursts
– dense data structures, highly organized and coalesced (spatial locality)
Graph500 (graph500.org):
– communication-bound
– focused on integer operations
– asynchronous, spatially uniform communication interleaved with computation
– larger sparse datasets (very low spatial and temporal locality)
Communication pattern changes
HPC
Graph500
Graph500 and Top500 Lists: June 2014
Graph500
# | Machine | Site | Cores | Problem size | GTEPS
1 | K computer (Fujitsu - Custom supercomputer) | RIKEN Advanced Institute for Computational Science (AICS) | 524,288 | 40 | 17,977.1
2 | Sequoia (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | Lawrence Livermore National Laboratory | 1,048,576 | 40 | 16,599
3 | Mira (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | Argonne National Laboratory | 786,432 | 40 | 14,328
4 | JUQUEEN (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | Forschungszentrum Juelich (FZJ) | 262,144 | 38 | 5,848
5 | Fermi (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | CINECA | 131,072 | 37 | 2,567
Top500
# | Machine | Site | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s)
1 | Tianhe-2 (TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, Intel Xeon Phi 31S1P) | National Super Computer Center in Guangzhou | 3,120,000 | 33,862.7 | 54,902.4
2 | Titan (Cray XK7, Opteron 6274 16C 2.200GHz, NVIDIA K20x) | Oak Ridge National Laboratory | 560,640 | 17,590.0 | 27,112.5
3 | Sequoia (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | Lawrence Livermore National Laboratory | 1,572,864 | 17,173.2 | 20,132.7
4 | K computer (Fujitsu - SPARC64 VIIIfx 2.0GHz) | RIKEN Advanced Institute for Computational Science (AICS) | 705,024 | 10,510.0 | 11,280.4
5 | Mira (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | Argonne National Laboratory | 786,432 | 8,586.6 | 10,066.3
Specialization is Everything (?)
ASICs
FPGAs
Source: Bob Broderson, Berkeley Wireless group
What are FPGAs?
FPGA: Field Programmable Gate Array
Generic sea of programmable logic and interconnect
Program into specialized computing & networking circuits
Can be reconfigured in as little as 100 ms
[Diagram labels: generic logic (LUTs), specialized I/O blocks (network, PCIe), DSP multiplier blocks, 36 Kb dual-port RAM]
BioInformatics, Big Data and Supercomputing
1 PB of compressed data
100,000 subjects suffering from different diseases
What is the time required to retrieve information?
If we ever had a 1 PB disk (100 MB/s), scanning 1 petabyte would take about 3,000 hours (125 days)
What if we want to process the data in 1 hour?
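The scan-time estimate, and the parallelism needed to bring it down to one hour, follow from simple division. This sketch assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes) and perfectly parallel disks with no network or software overhead, so it is a lower bound rather than a sizing guide.

```python
import math

PB = 1e15  # bytes, decimal prefix assumed
MB = 1e6

dataset_bytes = 1 * PB
disk_rate = 100 * MB  # bytes per second for one disk, from the slide

scan_seconds = dataset_bytes / disk_rate
scan_hours = scan_seconds / 3600
scan_days = scan_hours / 24

# Number of disks streaming in parallel needed to finish within one hour.
target_seconds = 3600
disks_needed = math.ceil(scan_seconds / target_seconds)
aggregate_gb_per_s = dataset_bytes / target_seconds / 1e9

print(f"One disk: {scan_hours:,.0f} hours ({scan_days:.0f} days)")
print(f"One hour: {disks_needed:,} disks, about {aggregate_gb_per_s:.0f} GB/s aggregate")
```

With these units the single-disk scan takes about 2,800 hours, the same order as the rounded 3,000 hours quoted above. Processing the data in one hour needs on the order of 2,800 disks read in parallel and close to 280 GB/s of aggregate bandwidth.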
Supercomputing is about doing things FAST…
[Diagram: InfiniBand fat-tree interconnect. Mellanox 648-port InfiniBand FDR core switches connect to 36-port FDR10 leaf switches over FDR10 links. Leaf-switch groups are labelled 18, 18, 18, 18 and 12, with 3 links to each core (2 links for the last group); six elements are labelled 560. Latency: 0.7 μs. Bandwidth: 40 Gb/s.]
Evolution of the storage architecture for Big Data
Compute network (40 Gbps node adapter)
Compute nodes
Storage network (1 Gbps node adapter)
Storage racks: 1 PB
11 hrs for 1 PB
Compute and Communication Energies
Source: Processors and sockets: What's next? Greg
Astfalk / Salishan Conference / April 25, 2013
Paradigm shift
Old
Compute-centric Model
New
Data-centric Model
Massive Parallelism
Persistent Memory
Flash Storage Class Mem
Manycore Accelerators
Source: Heiko Joerg http://www.slideshare.net/schihei/petascale-analytics-the-world-of-big-data-requires-big-analytics
Future HW for Big Data: Non-volatile memory
New technologies (STTMRAM, CBRAM, RRAM, …)
More density
Non-volatility
Good for throwing away the file abstraction and addressing non-volatile memory directly
Replacement for DRAMs
Endurance problem
Large influence on software
Database systems, file systems
Optimized System Design for Data-Centric Deep Computing
Source: IBM Corporation, 2013
Solutions for Supercomputing
IBM Netezza Appliance (>50TB)
Microsoft Catapult
Microsoft is planning to replace traditional CPUs in data
centers with field-programmable gate arrays, or FPGAs,
processors that Microsoft could modify specifically for use
with its own software. These FPGAs are already available on
the market, and Microsoft is sourcing them from a company called
Altera. The FPGAs are 40 times faster than a CPU at
processing Bing’s custom algorithms.
Doug Burger From Microsoft Research
Talks About Project Catapult Which Will
Make Bing Twice As Fast
(Jun 17, 2014)
http://microsoft-news.com/doug-burger-from-microsoft-research-talks-about-project-catapult-which-will-make-bing-twice-as-fast-video/
FE: Feature Extraction
Query: “FPGA Configuration”
NumberOfOccurrences_0 = 7
NumberOfOccurrences_1 = 4
NumberOfTuples_0_1 = 1
{Query, Document}
~4K Dynamic
Features
~2K Synthetic
Features
L2 Score
Document
Score
Bing RaaS Overview
Query compilation → FE: Feature Extraction → FFE: Free-Form Expressions → MLS: Machine learning scoring
• Input from L1: query + 4 document IDs
• Read document from disk; decompress and extract HV (hit vector)
• FE: takes the hit vector per stream and static features, produces dynamic features
• FFE: produces synthetic features
• MLS: sends ranked scores for the 4 documents back to MLA
[Figure: example free-form expression tree combining dynamic features (DF88 to DF95), NF91 and synthetic features (SF1, SF13) with operators such as +, *, /, >, if, ln and max]
Hit vector example (position, term):
S0: (5, 3), (12, 4), (99, 2), (107, 3), (109, 3)
S1: (7, 1), (42, 3), (43, 7)
NumOccurrences_1_3 = 1
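A feature such as NumOccurrences_1_3 can be read as a count over the hit vector. The sketch below is an interpretation only: it assumes the feature name encodes (stream, term), and it assumes the position/term rows split into streams S0 and S1 at the point where positions restart. It is not Bing's implementation, and all names are illustrative.

```python
from collections import Counter

# Hit vectors per stream as (position, term) pairs.
# Assumption: rows belong to S0 until the positions restart, then to S1.
hit_vectors = {
    0: [(5, 3), (12, 4), (99, 2), (107, 3), (109, 3)],
    1: [(7, 1), (42, 3), (43, 7)],
}

def num_occurrences(hit_vectors):
    """Count how often each term appears in each stream, keyed by (stream, term)."""
    counts = Counter()
    for stream, hits in hit_vectors.items():
        for _position, term in hits:
            counts[(stream, term)] += 1
    return counts

counts = num_occurrences(hit_vectors)
print("NumOccurrences_1_3 =", counts[(1, 3)])
print("NumOccurrences_0_3 =", counts[(0, 3)])
```

Under this reading the count for stream 1, term 3 is 1, which matches the value shown on the slide. Counters of this kind are simple, independent state machines, which is what makes them a natural fit for FPGA logic.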
HP unveils “The Machine”
June 11, 2014
It uses clusters of special-purpose cores, rather than a
few generalized cores; photonics link everything instead
of slow, energy-hungry copper wires; memristors give it
unified memory that's as fast as RAM yet stores data
permanently, like a flash drive.
A Machine server could address 160 petabytes of data in
250 nanoseconds; HP says its hardware should be about
six times more powerful than an existing server, even as
it consumes 80 times less energy. Ditching older
technology like copper also encourages non-traditional,
three-dimensional computing shapes, since you're not
bound by the usual distance limits.
Source: http://www.engadget.com/2014/06/11/hp-the-machine/
Relational databases are sometimes not good for scale-out
[Diagram: items 1 2 3 4 5 spread across five partitions holding the subsets {2, 3, 4}, {1, 3, 5}, {1, 3, 4}, {2, 4, 5} and {1, 2, 5}]
NoSQL for non-structured data, eventual consistency
Programming model
• To meet the challenges: MapReduce
– Programming model introduced by Google in the early 2000s to support distributed computing (special emphasis on fault tolerance)
• Ecosystem of big data processing tools
– open source, distributed, and running on commodity hardware
• The key innovation of MapReduce is
– the ability to take a query over a data set, divide it, and run it in parallel over many nodes
• Two phases
– Map phase
– Reduce phase
[Diagram: Input Data → Mappers → Reducers → Output Data]
MapReduce: example
Input:
– hamlet.txt: “to be or not”
– 12th.txt: “be not afraid of greatness”
Map (mappers emit one word, file pair per word):
– to, hamlet.txt
– be, hamlet.txt
– or, hamlet.txt
– not, hamlet.txt
– be, 12th.txt
– not, 12th.txt
– afraid, 12th.txt
– of, 12th.txt
– greatness, 12th.txt
Reduce (reducers group the files per word), Output:
– afraid, (12th.txt)
– be, (12th.txt, hamlet.txt)
– greatness, (12th.txt)
– not, (12th.txt, hamlet.txt)
– of, (12th.txt)
– or, (hamlet.txt)
– to, (hamlet.txt)
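The inverted-index example can be written as a few lines of plain Python that mimic the map, shuffle and reduce steps in a single process. This is a sketch of the programming model only, not of Hadoop or any real MapReduce API; the function names are illustrative.

```python
from collections import defaultdict

def map_phase(filename, text):
    """Mapper: emit one (word, filename) pair per word in the document."""
    return [(word, filename) for word in text.split()]

def shuffle(pairs):
    """Group mapper output by key (word), as the framework does between the phases."""
    groups = defaultdict(list)
    for word, filename in pairs:
        groups[word].append(filename)
    return groups

def reduce_phase(word, filenames):
    """Reducer: collapse the filenames for one word into a sorted, duplicate-free list."""
    return word, sorted(set(filenames))

documents = {
    "hamlet.txt": "to be or not",
    "12th.txt": "be not afraid of greatness",
}

mapped = []
for filename, text in documents.items():
    mapped.extend(map_phase(filename, text))

index = dict(reduce_phase(word, files) for word, files in sorted(shuffle(mapped).items()))

for word, files in index.items():
    print(f"{word}, ({', '.join(files)})")
```

Each call to `map_phase` depends on one document only and each call to `reduce_phase` on one word only, so a real framework can run them on different nodes and rerun any that fail.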
Limitations of MapReduce as a Programming model?
• MapReduce is great, but not everyone is a MapReduce expert
“I am a Python expert but …”
• There is a class of algorithms that cannot be efficiently implemented with the MapReduce programming model
• Different programming models deal with different challenges
– PyCOMPSs from BSC
– Spark from Berkeley
Google progressively dropping MapReduce
2004: Google introduces MapReduce
2006: The Apache Hadoop Project brings an open-source MapReduce implementation, with the team of the Apache Nutch crawler and the support of Yahoo!
2010: Google announces Pregel, used for building incremental reverse indexes instead of MapReduce. Philosophy: “think like a vertex” (event-driven)
June 25, 2014: Google announces Cloud Dataflow: write code once, run it in batch or stream mode. Cloud Dataflow is a managed service for creating data pipelines that ingest, transform and analyze data in both batch and streaming modes.
What happens if SW does not consider HW?
• Terasort contest: sorting 100 TB of data
• Number 1: Hadoop
• 2,100 nodes, 12 cores per node, 64 GB per node
• 24,000 cores
• 134 TB memory
• Time: 4,300 s
• Cost in Amazon: $8,800
• Number 2: Tritonsort
• 52 nodes, 8 cores per node, 24 GB per node
• 416 cores
• 1.2 TB memory
• Time: 8,300 s and 6,400 s
• Cost in Amazon: $294 and $226
• Hadoop is easy to program, but needs 57× more cores and 100× more memory, and only gets 2× the performance
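The comparison can be recomputed directly from the figures listed for the two entries. The sketch uses the first of the two Tritonsort results (8,300 s, $294); the core-seconds metric is an added illustration, not something stated on the slide.

```python
# Figures as listed on the slide; Tritonsort uses its first reported result.
hadoop = {"cores": 24_000, "memory_tb": 134, "seconds": 4300, "cost_usd": 8800}
tritonsort = {"cores": 416, "memory_tb": 1.2, "seconds": 8300, "cost_usd": 294}

core_ratio = hadoop["cores"] / tritonsort["cores"]
memory_ratio = hadoop["memory_tb"] / tritonsort["memory_tb"]
speedup = tritonsort["seconds"] / hadoop["seconds"]
cost_ratio = hadoop["cost_usd"] / tritonsort["cost_usd"]

# Total work spent, measured as cores multiplied by wall-clock seconds.
core_seconds_ratio = (hadoop["cores"] * hadoop["seconds"]) / (
    tritonsort["cores"] * tritonsort["seconds"]
)

print(f"Cores: {core_ratio:.1f}x, memory: {memory_ratio:.1f}x, speedup: {speedup:.2f}x")
print(f"Cost: {cost_ratio:.1f}x, core-seconds: {core_seconds_ratio:.1f}x")
```

On these numbers the Hadoop run is about twice as fast but spends roughly 30 times more core-seconds and money, which is the gap that hardware-aware software design aims to close.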
HW-SW-Network co-design: Data Appliances
BigData needs Intelligence!
Deep Learning: Building high-level abstractions
Watson: Reasoning
Deep Learning: generating high level abstractions
Toronto Face Database (TFD)
– Automatic recognition of Facial Expressions
Ranzato, M., Mnih, V., Susskind, J. and Hinton, G. E. Modeling Natural Images Using Gated MRFs. IEEE
Trans. Pattern Analysis and Machine Intelligence, to appear.
Watson: Calculating Vs Learning
Source: Courtesy of IBM
What is behind Watson?
IBM Watson -- How to replicate
Watson hardware and systems design
for your own use in your basement
By Tony Pearson (ibm.co/Pearson)
http://tinyurl.com/pt6mdfu
DeepQA: Massively Parallel Probabilistic Evidence-Based Architecture
For each question, generates and scores many hypotheses using a combination of 1000’s of Natural Language Processing, Information Retrieval, Machine Learning and Reasoning Algorithms. These gather, evaluate, weigh and balance different types of evidence to deliver the answer with the best evidence support it can find.
Software tuning key to exploit hardware!
Barcelona Supercomputing Center
Centro Nacional de Supercomputación
BSC-CNS objectives:
– R&D in Computer, Life, Earth and Engineering Sciences
– Supercomputing services and support to Spanish and
European researchers
BSC-CNS is a consortium that includes:
– Spanish Government 51%
– Catalan Government 37%
– Universitat Politècnica de Catalunya (UPC) 12%
400+ people from 45 countries
Mission of BSC Scientific Departments
EARTH SCIENCES: To develop and implement global and regional state-of-the-art models for short-term air quality forecast and long-term climate applications
LIFE SCIENCES: To understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics)
CASE: To develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations)
COMPUTER SCIENCES: To influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency
MareNostrum 3
48,896 Intel SandyBridge cores at 2.6 GHz
84 Intel Xeon Phi
Peak Performance of 1.1 Petaflops
100.8 TB of main memory
2 PB of disk storage
8.5 PB of archive storage
9th in Europe, 29th in the world (June 2013 Top500 List)
[Diagram: BSC research lines around Big Data, HPC, Cloud Computing, Sustainable Computing, real-time Analytics, and Smart Cities & IoT. Labels: BSC Programming Models, BSC Tools, ALOJA, BGAS, SDE, BSC-CATech, Citizen as a sensor, Deep Learning, NoSQL Data Management, collaboration with CASE and Life Science, Real-time Monitoring of Image Streaming, servIoTicy.com, Adaptive Scheduler]
www.bsc.es
Thank you !

More Related Content

What's hot

A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...Larry Smarr
 
Calit2-a Persistent UCSD/UCI Framework for Collaboration
Calit2-a Persistent UCSD/UCI Framework for CollaborationCalit2-a Persistent UCSD/UCI Framework for Collaboration
Calit2-a Persistent UCSD/UCI Framework for CollaborationLarry Smarr
 
Jax 2013 - Big Data and Personalised Medicine
Jax 2013 - Big Data and Personalised MedicineJax 2013 - Big Data and Personalised Medicine
Jax 2013 - Big Data and Personalised MedicineGaurav Kaul
 
The OptIPuter and Its Applications
The OptIPuter and Its ApplicationsThe OptIPuter and Its Applications
The OptIPuter and Its ApplicationsLarry Smarr
 
100G network research at UCL
100G network research at UCL100G network research at UCL
100G network research at UCLJisc
 
Using OptIPuter Innovations to Enable LambdaGrid Applications
Using OptIPuter Innovations to Enable LambdaGrid ApplicationsUsing OptIPuter Innovations to Enable LambdaGrid Applications
Using OptIPuter Innovations to Enable LambdaGrid ApplicationsLarry Smarr
 
Bergman Enabling Computation for neuro ML external
Bergman Enabling Computation for neuro ML externalBergman Enabling Computation for neuro ML external
Bergman Enabling Computation for neuro ML externalazlefty
 
Tales of AI agents saving the human race!
Tales of AI agents saving the human race!Tales of AI agents saving the human race!
Tales of AI agents saving the human race!Alison B. Lowndes
 
Talk on commercialising space data
Talk on commercialising space data Talk on commercialising space data
Talk on commercialising space data Alison B. Lowndes
 
Accelerate AI w/ Synthetic Data using GANs
Accelerate AI w/ Synthetic Data using GANsAccelerate AI w/ Synthetic Data using GANs
Accelerate AI w/ Synthetic Data using GANsRenee Yao
 
Grid Computing In Israel
Grid Computing  In IsraelGrid Computing  In Israel
Grid Computing In IsraelGuy Tel-Zur
 
Fuelling the AI Revolution with Gaming
Fuelling the AI Revolution with GamingFuelling the AI Revolution with Gaming
Fuelling the AI Revolution with GamingAlison B. Lowndes
 
System-On-Chip Market Outlook, Trends, Forecast of Top Countries 2023
System-On-Chip Market Outlook, Trends, Forecast of Top Countries 2023System-On-Chip Market Outlook, Trends, Forecast of Top Countries 2023
System-On-Chip Market Outlook, Trends, Forecast of Top Countries 2023Adhiraj Kumar
 
Positioning University of California Information Technology for the Future: S...
Positioning University of California Information Technology for the Future: S...Positioning University of California Information Technology for the Future: S...
Positioning University of California Information Technology for the Future: S...Larry Smarr
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research PlatformLarry Smarr
 
"Using Deep Learning for Video Event Detection on a Compute Budget," a Presen...
"Using Deep Learning for Video Event Detection on a Compute Budget," a Presen..."Using Deep Learning for Video Event Detection on a Compute Budget," a Presen...
"Using Deep Learning for Video Event Detection on a Compute Budget," a Presen...Edge AI and Vision Alliance
 
Enabling Artificial Intelligence - Alison B. Lowndes
Enabling Artificial Intelligence - Alison B. LowndesEnabling Artificial Intelligence - Alison B. Lowndes
Enabling Artificial Intelligence - Alison B. LowndesWithTheBest
 
SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries Lenovo Data Center
 
Global Research Platforms: Past, Present, Future
Global Research Platforms: Past, Present, FutureGlobal Research Platforms: Past, Present, Future
Global Research Platforms: Past, Present, FutureLarry Smarr
 

What's hot (20)

A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
 
Calit2-a Persistent UCSD/UCI Framework for Collaboration
Calit2-a Persistent UCSD/UCI Framework for CollaborationCalit2-a Persistent UCSD/UCI Framework for Collaboration
Calit2-a Persistent UCSD/UCI Framework for Collaboration
 
Jax 2013 - Big Data and Personalised Medicine
Jax 2013 - Big Data and Personalised MedicineJax 2013 - Big Data and Personalised Medicine
Jax 2013 - Big Data and Personalised Medicine
 
Gaurav slides
Gaurav slidesGaurav slides
Gaurav slides
 
The OptIPuter and Its Applications
The OptIPuter and Its ApplicationsThe OptIPuter and Its Applications
The OptIPuter and Its Applications
 
100G network research at UCL
100G network research at UCL100G network research at UCL
100G network research at UCL
 
Using OptIPuter Innovations to Enable LambdaGrid Applications
Using OptIPuter Innovations to Enable LambdaGrid ApplicationsUsing OptIPuter Innovations to Enable LambdaGrid Applications
Using OptIPuter Innovations to Enable LambdaGrid Applications
 
Bergman Enabling Computation for neuro ML external
Bergman Enabling Computation for neuro ML externalBergman Enabling Computation for neuro ML external
Bergman Enabling Computation for neuro ML external
 
Tales of AI agents saving the human race!
Tales of AI agents saving the human race!Tales of AI agents saving the human race!
Tales of AI agents saving the human race!
 
Talk on commercialising space data
Talk on commercialising space data Talk on commercialising space data
Talk on commercialising space data
 
Accelerate AI w/ Synthetic Data using GANs
Accelerate AI w/ Synthetic Data using GANsAccelerate AI w/ Synthetic Data using GANs
Accelerate AI w/ Synthetic Data using GANs
 
Grid Computing In Israel
Grid Computing  In IsraelGrid Computing  In Israel
Grid Computing In Israel
 
Fuelling the AI Revolution with Gaming
Fuelling the AI Revolution with GamingFuelling the AI Revolution with Gaming
Fuelling the AI Revolution with Gaming
 
System-On-Chip Market Outlook, Trends, Forecast of Top Countries 2023
System-On-Chip Market Outlook, Trends, Forecast of Top Countries 2023System-On-Chip Market Outlook, Trends, Forecast of Top Countries 2023
System-On-Chip Market Outlook, Trends, Forecast of Top Countries 2023
 
Positioning University of California Information Technology for the Future: S...
Positioning University of California Information Technology for the Future: S...Positioning University of California Information Technology for the Future: S...
Positioning University of California Information Technology for the Future: S...
 
The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research Platform
 
"Using Deep Learning for Video Event Detection on a Compute Budget," a Presen...
"Using Deep Learning for Video Event Detection on a Compute Budget," a Presen..."Using Deep Learning for Video Event Detection on a Compute Budget," a Presen...
"Using Deep Learning for Video Event Detection on a Compute Budget," a Presen...
 
Enabling Artificial Intelligence - Alison B. Lowndes
Enabling Artificial Intelligence - Alison B. LowndesEnabling Artificial Intelligence - Alison B. Lowndes
Enabling Artificial Intelligence - Alison B. Lowndes
 
SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries SciNet -- Pushing scientific boundaries
SciNet -- Pushing scientific boundaries
 
Global Research Platforms: Past, Present, Future
Global Research Platforms: Past, Present, FutureGlobal Research Platforms: Past, Present, Future
Global Research Platforms: Past, Present, Future
 

Mateo Valero - Big data: de la investigación científica a la gestión empresarial

  • 1. www.bsc.es The State of the Art of Big Data & Data Science: The Data Revolution. Prof. Mateo Valero, BSC Director. Big Data: de la investigación científica a la gestión empresarial, Fundación Ramón Areces, Madrid, 3 July 2014
  • 2. Evolution over time of the research paradigm • In the last millennium, science was empirical: • description of natural phenomena • A few centuries ago, the theoretical approach emerged: • using models, formulas and generalizations • In recent decades, computational science appeared: • simulation of complex phenomena • Now science focuses on the exploration of big data (eScience): • unification of theory, experiment and simulation • massive data captured by instruments or generated through simulation, and processed by computer • knowledge and information stored in computers • scientists analyse databases and files on data infrastructures. Example formula: $\left(\frac{\dot{a}}{a}\right)^{2} = \frac{4\pi G\rho}{3} - K\frac{c^{2}}{a^{2}}$ Source: Jim Gray, National Research Council, http://sites.nationalacademies.org/NRC/index.htm; Computer Science and Telecommunications Board, http://sites.nationalacademies.org/cstb/index.htm.
  • 3. Higgs and Englert’s Nobel for Physics 2013 Last year one of the most computer-intensive scientific experiments ever undertaken confirmed Peter Higgs and François Englert’s theory by making the Higgs boson – the so-called “God particle” – in an $8bn atom smasher, the Large Hadron Collider at CERN outside Geneva. “The LHC produces 600 TB/sec… and after filtering needs to store 25 PB/year”… 15 million sensors…
  • 4. Sequencing Costs Source: National Human Genome Research Institute (NHGRI) http://www.genome.gov/sequencingcosts/ (1) "Cost per Megabase of DNA Sequence": the cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality. (2) "Cost per Genome": the cost of sequencing a human-sized genome. For each, a graph is provided showing the data since 2001. In both graphs, the data from 2001 through October 2007 represent the costs of generating DNA sequence using Sanger-based chemistries and capillary-based instruments ('first generation' sequencing platforms). Beginning in January 2008, the data represent the costs of generating DNA sequence using 'second-generation' (or 'next-generation') sequencing platforms. The change in instruments represents the rapid evolution of DNA sequencing technologies that has occurred in recent years.
  • 5. Sequencing Costs We have an issue… Need for faster ways of processing data, not storing everything…
  • 6. World-class genomic consortiums • The European Genome-phenome Archive (EGA) repository will allow us to explore datasets from numerous genetic studies. • Pan-cancer will produce rich data that will provide a major opportunity to develop an integrated picture of differences across tumor lineages.
  • 8. Square Kilometer Array (SKA): Tbps → Gbps
  • 9. A New Era of Information Technology: current infrastructure sagging under its own weight. Pervasive Connectivity, Explosion of Information, Smart Device Expansion, Internet of Things. In 60 sec today: 400,710 ad requests; 2000 lyrics played on Tunewiki; 1,500 pings sent on PingMe; 208,333 minutes Angry Birds played; 23,148 apps downloaded; 98,000 tweets. 2013 30 Billion By 2020 40 Trillion GB … for 8 Billion 10 Million DATA (1) (2) (3) Devices Mobile Apps (4). Sources: (1) IDC Directions 2013: Why the Datacenter of the Future Will Leverage a Converged Infrastructure, March 2013, Matt Eastwood; (2) & (3) IDC Predictions 2012: Competing for 2020, Document 231720, December 2011, Frank Gens; (4) http://en.wikipedia.org
  • 10. Challenges of data generation Source: http://www-01.ibm.com/software/data/bigdata/
  • 11. The Data Deluge 2005: 0.1 ZB; 2010: 1.2 ZB; 2012: 2.8 ZB; 2015: 8.5 ZB; 2020: 40 ZB* (figure exceeds prior forecasts by 5 ZB) * Source: IDC
  • 12. How big is big? Megabyte (10^6); Gigabyte (10^9); Terabyte (10^12): 500 TB of new data per day are ingested in Facebook databases; Petabyte (10^15): the CERN Large Hadron Collider generates 1 PB per second; Exabyte (10^18): 1 EB of data is created on the internet each day = 250 million DVDs' worth of information, and the proposed Square Kilometer Array telescope will generate an EB of data per day; Zettabyte (10^21): 1.3 ZB of network traffic by 2016; Yottabyte (10^24): this is our digital universe today = 250 trillion DVDs; Brontobyte (10^27): this will be our digital universe tomorrow…; Geopbyte (10^30): this will take us beyond our decimal system. Saganbyte, Jotabyte,…
  • 13. Technological Achievements Transistor (Bell Labs, 1947) – DEC PDP-1 (1957) – IBM 7090 (1960) Integrated circuit (1958) – IBM System 360 (1965) – DEC PDP-8 (1965) Microprocessor (1971) – Intel 4004
  • 14. Outline • The Big Data era: data generation explosion • Storing data • Processing data • Where do we place data? • How do we use data? • Research at BSC
  • 15. Magnetic Tape Memory Invented by Eckert & Mauchly for the UNIVAC I, March 21, 1951 – Model UNISERVO – 224 KB of data – 1/2 inch wide, 1200 feet long, 128 characters per inch – Speed: 100 inches per second, equivalent to 12,800 characters per second. StorageTek, 2013, T10000D – 8.5 Terabytes (40,000,000× increase) – EBW: 250 Mbytes/second (20,000× increase) – Load time: 10 seconds
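The improvement factors quoted on the tape slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal units (1 KB = 10^3 bytes) and one byte per character, and the constant and function names are made up for this example.

```python
# Figures are taken from the slide; decimal units and
# one byte per character are assumptions of this sketch.
UNISERVO_CAPACITY_BYTES = 224e3   # 224 KB
UNISERVO_DENSITY_CPI = 128        # characters per inch
UNISERVO_SPEED_IPS = 100          # inches per second
T10000D_CAPACITY_BYTES = 8.5e12   # 8.5 TB
T10000D_BANDWIDTH_BPS = 250e6     # 250 MB/s


def uniservo_rate_cps():
    """Transfer rate in characters per second: density times tape speed."""
    return UNISERVO_DENSITY_CPI * UNISERVO_SPEED_IPS


capacity_gain = T10000D_CAPACITY_BYTES / UNISERVO_CAPACITY_BYTES
bandwidth_gain = T10000D_BANDWIDTH_BPS / uniservo_rate_cps()

print(f"UNISERVO rate: {uniservo_rate_cps()} characters/s")
print(f"Capacity gain: {capacity_gain:.3g}x")
print(f"Bandwidth gain: {bandwidth_gain:.3g}x")
```

Under these assumptions the capacity gain comes out at roughly 38 million and the bandwidth gain at roughly 19,500, consistent with the rounded 40,000,000 and 20,000 on the slide.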
  • 16. HDD: Hard Disk Drive. First disk, IBM, 1956: IBM 305 RAMAC, 4 MB, 50 × 24” disks, 1200 rpm, 100 bits/track, intertrack spacing 0.1 inches, density 1000 bits/in², 100 ms access, tubes, $35k/year rent. Year 2013: 4 Terabytes (1,000,000× increase), average access time of a few milliseconds (40 to 1). Areal density: doubling on average every 2 to 4 years, but not now. Predicted: 14 Terabytes in 2020 at a cost of $40
  • 17. Storage Device Density Landscape [Figure: areal density (Gbit/in², log scale from 0.1 to 10000) versus year (1990 to 2018) for HDD products, NAND products, tape products and tape demos. Annotated growth rates: HDD 40%/yr, 100%/yr, 40%/yr, 20%/yr and "20%/yr ??"; NAND 40%/yr (1 bit/cell, 2 bit/cell); tape 40%/yr; tape demo 75%/yr; optical 20%/yr.] Source: Decad, Fontana, Hetzler – IBM Journal
  • 18. Tapes advancing fast May 2014: IBM and Fujifilm have demoed a 154 TB LTO-size tape cartridge which could come to market in 10 years' time. The Sony development involved a 148 Gbit/in² tape density and its own tape design to achieve a 185 TB uncompressed capacity. 148 Gbit/in² versus 3.1 Gbit/in² (T10000C) http://www.insic.org/news/2012Roadmap/12index.html
  • 19. Outline • The Big Data era: data generation explosion • Storing data • Processing data • Where do we place data? • How do we use data? • Research at BSC
  • 20. The MultiCore Era Moore’s Law + Memory Wall + Power Wall → Chip MultiProcessors (CMPs): POWER4 (2001), Intel Xeon 7100 (2006), UltraSPARC T2 (2007)
  • 21. How are the Multicore architectures designed? IBM Power4 (2001) – 2 cores, ST – 16 Gflop/s – L1: 64KB+32KB/core, L2: 1.41MB/chip, L3: 32MB off-chip – 115W TDP – 10GB/s mem BW. IBM Power7 (2010) – 8 cores, SMT4 – 128 Gflop/s – L1: 32KB+32KB/core, L2: 256KB/core, L3: 32 MB in total (on-chip) – 170W TDP – 100GB/s mem EBW. IBM Power8 (2014) – 12 cores, SMT8 – 400 Gflop/s – L1: 64KB+32KB/core, L2: 512KB/core, L3: 49,96 MB in total (on-chip) – 250W TDP – 410GB/s mem BW
  • 22. Top10 (Rmax and Rpeak in PFlop/s, Power in MW)
      Rank | Site | Computer | Procs (total / accelerator) | Rmax | Rpeak | Power | GFlops/Watt | Name
      1 | National University of Defense Technology | TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P | 3120000 / 2736000 | 33.86 | 54.90 | 17.8 | 1.90 | Tianhe-2 (MilkyWay-2)
      2 | DOE/SC/Oak Ridge National Lab | Cray XK7, Opteron 6274 16C 2.20 GHz, Cray Gemini interconnect, NVIDIA K20x | 560640 / 261632 | 17.59 | 27.11 | 8.21 | 2.14 | Titan
      3 | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | 1572864 | 17.17 | 20.13 | 7.89 | 2.18 | Sequoia
      4 | RIKEN Advanced Institute for Computational Science (AICS) | Fujitsu, K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705024 | 10.51 | 11.28 | 12.65 | 0.83 | K
      5 | DOE/SC/Argonne National Laboratory | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | 786432 | 8.58 | 10.06 | 3.94 | 2.18 | Mira
      6 | CSCS | Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x | 115984 / 73808 | 6.27 | 7.79 | 2.32 | 2.70 | Piz Daint
      7 | Texas Advanced Computing Center | PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi | 462462 / 366366 | 5.17 | 8.52 | 4.51 | 1.14 | Stampede
      8 | Forschungszentrum Juelich (FZJ) | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | 458752 | 5.00 | 5.87 | 2.30 | 2.18 | JUQUEEN
      9 | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | 393216 | 4.29 | 5.03 | 1.97 | 2.18 | Vulcan
      10 | Government | Cray XC30, Intel Xeon E5-2697v2 12C 2.7GHz, Aries | 225984 | 3.14 | 4.88
  • 23. Barcelona, February, 2012 Parallel Systems Node Node Node Node Node Interconnect (Myrinet, IB, Ge, 3D torus, tree, …) Node* Node* Node* Node** Node** Node** SMP multicore multicore multicore multicore Memory IN homogeneous multicore (BlueGene-Q chip) heterogenous multicore general-purpose accelerator (e.g. Cell) GPU FPGA ASIC (e.g. Anton for MD) Network-on-chip (bus, ring, direct, …)
  • 24. Tianhe-2 Compute node with: – 2 IvyBridge Xeon sockets (12 cores each) with 88 GB memory – 3 Xeon Phi sockets with 8 GB memory, for a total of 3,120,000 cores. IvyBridge socket: 8 flops/cycle per core × 12 cores/socket × 2.2 GHz = 211.2 Gflop/s peak. Xeon Phi socket: 16 flops/cycle per core × 57 cores/socket × 1.1 GHz = 1.003 Tflop/s peak. On a node there are 2 IvyBridge × 0.2112 Tflop/s + 3 Phi × 1.003 Tflop/s ≈ 3.431 Tflop/s per node
  • 25. Evolution of the computing power of supercomputers, in FLOP/second (operations on 64-bit real numbers): 1988 Cray Y-MP (8 processors); 1998 Cray T3E (1024 processors); 2008 Cray XT5 (15000 processors); ~2018 ? (1×10^7 processors)
  • 26. Intel Xeon Phi or Intel Many Integrated Core Architecture (MIC) Knights Corner (2011) – Coprocessor, 61 x86 cores, 22nm, AVX-512, 4 HTs – 1.2 TFLOPS (DP), 300W TDP, 4 GFLOPS/W – 512KB/core L2 coherent – Int Netw: Ring – Mem BW: 352GB/s. Knights Landing (exp 2015) – Coprocessor or host processor – 72 Atom cores, 14nm, AVX512 per core, 4 HTs – Up to 16GB of DRAM 3D stacked on-package, 384GB GDDR – 3 TFLOPS (DP), 200W TDP, 15 GFLOPS/W
  • 27. Accelerators: NVIDIA Kepler GK110 GPU (2014) DP Performance: 1.43 Tflop/s. Mem BW (ECC off): 288 GB/s. Memory size (GDDR5): 12 GB. 15 SMX units, each with – 192 single-precision CUDA cores – 64 double-precision units – 32 special function units – 32 load/store units. Six 64-bit memory controllers
  • 28. Processor node Thanks to S. Borkar, Intel
  • 29. Nvidia: Node for the Exaflop Computer Thanks Bill Dally
  • 31. Graph 500 Project HPC benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. Graph 500 establishes a set of large-scale benchmarks for these applications. Steering Committee: 50 international HPC experts from academia and industry. Graph 500 is developing benchmarks to address three application kernels: – Concurrent search – Optimization (single source shortest path) – Edge-oriented (maximal independent set) Also they are addressing five graph-related business areas: – Cybersecurity – Medical Informatics – Data Enrichment – Social Networks – Symbolic Networks.
  • 32. Graph500 vs. Top500 • Top500 defined a benchmark (Linpack) to rate HPC machines on performance. This benchmark is not suited to the characteristics of Big Data applications. • Linpack: – compute-bound – focused on floating-point operations – Bulk-Synchronous-Parallel model: behavior based on big computation and short communication bursts – dense data structures, highly organized and coalesced (spatial locality) • Graph500: – communication-bound – focused on integer operations – asynchronous, spatially uniform communication interleaved with computation – larger sparse datasets (very low spatial and temporal locality) • Green Graph 500 list: – collects performance-per-watt metrics – to compare the energy consumption of data-intensive computing workloads. Graph500: graph500.org Top500: top500.org
  • 34. Graph500 and Top500 Lists: June 2014
      Graph500
      # | Machine | Site | Cores | Problem size | GTEPS
      1 | K computer (Fujitsu - Custom supercomputer) | RIKEN Advanced Institute for Computational Science (AICS) | 524,288 | 40 | 17,977.1
      2 | Sequoia (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | Lawrence Livermore National Laboratory | 1,048,576 | 40 | 16,599
      3 | Mira (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | Argonne National Laboratory | 786,432 | 40 | 14,328
      4 | JUQUEEN (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | Forschungszentrum Juelich (FZJ) | 262,144 | 38 | 5,848
      5 | Fermi (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | CINECA | 131,072 | 37 | 2,567
      Top500
      # | Machine | Site | Cores | Rmax (TFlop/s) | Rpeak (TFlop/s)
      1 | Tianhe-2 (TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, Intel Xeon Phi 31S1P) | National Super Computer Center in Guangzhou | 3,120,000 | 33,862.7 | 54,902.4
      2 | Titan (Cray XK7, Opteron 6274 16C 2.200GHz, NVIDIA K20x) | Oak Ridge National Laboratory | 560,640 | 17,590.0 | 27,112.5
      3 | Sequoia (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | Lawrence Livermore National Laboratory | 1,572,864 | 17,173.2 | 20,132.7
      4 | K computer (Fujitsu - SPARC64 VIIIfx 2.0GHz) | RIKEN Advanced Institute for Computational Science (AICS) | 705,024 | 10,510.0 | 11,280.4
      5 | Mira (IBM - BlueGene/Q, Power BQC 16C 1.60 GHz) | Argonne National Laboratory | 786,432 | 8,586.6 | 10,066.3
  • 35. Specialization is Everything (?) ASICs, FPGAs Source: Bob Broderson, Berkeley Wireless group
  • 36. What are FPGAs? FPGA: Field Programmable Gate Array • Generic sea of programmable logic and interconnect • Program into specialized computing & networking circuits • Can be reconfigured in as little as 100 ms [Figure: FPGA floorplan with generic logic (LUTs), specialized I/O blocks (network, PCIe), DSP multiplier blocks and 36 Kb dual-port RAM]
  • 37. Outline • The Big Data era: data generation explosion • Storing data • Processing data • Where do we place data? • How do we use data? • Research at BSC
  • 38. BioInformatics, Big Data and Supercomputing 1 PB of compressed data, 100,000 subjects suffering different diseases
  • 39. What is the time required to retrieve information? If we ever had a 1 PB disk (100 MB/s), scanning 1 petabyte would take about 3,000 hours (125 days)
  • 40. What if we want to process the data in 1 hour ? Supercomputing is about doing things FAST…
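The scan-time estimate and the one-hour target can be put into numbers with a small sketch. It assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes) and a perfectly parallel scan in which data is striped evenly across devices with no overhead; the function names are illustrative.

```python
import math

PB = 1e15  # bytes, decimal units (assumption)
MB = 1e6   # bytes


def scan_seconds(data_bytes, bandwidth_bps, n_devices=1):
    """Time to read all data when it is striped evenly over n_devices."""
    return data_bytes / (bandwidth_bps * n_devices)


def devices_needed(data_bytes, bandwidth_bps, deadline_s):
    """Smallest number of devices that meets the deadline (ideal scaling)."""
    return math.ceil(data_bytes / (bandwidth_bps * deadline_s))


single_disk_s = scan_seconds(PB, 100 * MB)
disks_for_one_hour = devices_needed(PB, 100 * MB, 3600)

print(f"One disk: {single_disk_s / 3600:.0f} hours "
      f"({single_disk_s / 86400:.0f} days)")
print(f"Disks needed to scan 1 PB in 1 hour: {disks_for_one_hour}")
```

The exact figure for one disk is about 2,800 hours (roughly 116 days), which the slide rounds up to 3,000 hours. Meeting the one-hour target takes on the order of 2,800 such disks read in parallel, before any network or software overhead is counted.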
  • 41. Evolution of the storage architecture for Big Data [Figure: compute nodes attached to a compute network (40 Gbps node adapters; InfiniBand fabric of 36-port FDR10 leaf switches linked to Mellanox 648-port IB core switches; latency 0.7 μs, bandwidth 40 Gb/s) and storage racks attached to a storage network (1 Gbps node adapters).] 11 hrs for 1 PB
  • 42. Compute and Communication Energies Source: Processors and sockets: What's next? Greg Astfalk / Salishan Conference / April 25, 2013
  • 43. Paradigm shift Old: compute-centric model. New: data-centric model (massive parallelism, persistent memory, flash, storage class memory, manycore, accelerators). Source: Heiko Joerg http://www.slideshare.net/schihei/petascale-analytics-the-world-of-big-data-requires-big-analytics
  • 44. Future HW for Big Data: Non-volatile memory • New technologies (STT-MRAM, CBRAM, RRAM, …) • More density • Non-volatility: good for throwing away the file abstraction and addressing non-volatile memory directly • Replacement for DRAMs • Endurance problem • Large influence on software: database systems, file systems
  • 45. Optimized System Design for Data-Centric Deep Computing 45 Source: IBM Corporation, 2013
  • 48. Microsoft Catapult Microsoft is planning to replace traditional CPUs in data centers with field-programmable gate arrays, or FPGAs, processors that Microsoft could modify specifically for use with its own software. These FPGAs are already available in the market and Microsoft is sourcing them from a company called Altera. The FPGAs are 40 times faster than a CPU at processing Bing’s custom algorithms. Source: "Doug Burger From Microsoft Research Talks About Project Catapult Which Will Make Bing Twice As Fast" (Jun 17, 2014) http://microsoft-news.com/doug-burger-from-microsoft-research-talks-about-project-catapult-which-will-make-bing-twice-as-fast-video/
  • 49. FE: Feature Extraction Query: “FPGA Configuration”. {Query, Document} → ~4K dynamic features (e.g. NumberOfOccurrences_0 = 7, NumberOfOccurrences_1 = 4, NumberOfTuples_0_1 = 1) → ~2K synthetic features → L2 score → document score
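To make the dynamic features on the feature-extraction slide concrete, here is a toy sketch in Python. It is not Bing's implementation: the function `extract_features`, the sample document, and the reading of "tuple" as query term i immediately followed by query term i+1 are all assumptions made for illustration. Only the feature names follow the slide.

```python
import re


def extract_features(query, document):
    """Count per-term occurrences and adjacent-term tuples (hypothetical definition)."""
    terms = query.lower().split()
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    features = {}
    # One occurrence counter per query term, indexed by term position.
    for i, term in enumerate(terms):
        features[f"NumberOfOccurrences_{i}"] = tokens.count(term)
    # Assumed meaning of a tuple: term i directly followed by term i+1.
    for i in range(len(terms) - 1):
        adjacent = zip(tokens, tokens[1:])
        features[f"NumberOfTuples_{i}_{i + 1}"] = sum(
            1 for a, b in adjacent if a == terms[i] and b == terms[i + 1]
        )
    return features


# Made-up document text, used only to exercise the function.
sample_document = (
    "FPGA configuration is stored in flash. The FPGA loads its "
    "configuration at power-up; each FPGA has one."
)
features = extract_features("FPGA Configuration", sample_document)
print(features)
```

For this sample text the sketch counts three occurrences of the first term, two of the second, and one adjacent pair. A hardware pipeline would compute thousands of such counters in a single pass over the document stream.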
  • 50. Bing RaaS Overview Stages shown: Query compilation; From L1: query + 4 document IDs; Read document from disk; Decompress and extract HV (hit vector per stream and static features); FE: Feature Extraction (dynamic features); FFE: Free-Form Expressions (synthetic features); MLS: Machine learning scoring; Send ranked scores for 4 documents back to MLA. [Figure: example hit-vector streams S0 and S1 with position/term pairs (e.g. NumOccurrences_1_3 = 1) and an example free-form expression tree over dynamic features DF88 to DF95 producing a synthetic feature]
  • 51. HP unveils “The Machine” (June 11, 2014) It uses clusters of special-purpose cores, rather than a few generalized cores; photonics link everything instead of slow, energy-hungry copper wires; memristors give it unified memory that's as fast as RAM yet stores data permanently, like a flash drive. A Machine server could address 160 petabytes of data in 250 nanoseconds; HP says its hardware should be about six times more powerful than an existing server, even as it consumes 80 times less energy. Ditching older technology like copper also encourages non-traditional, three-dimensional computing shapes, since you're not bound by the usual distance limits. Source: http://www.engadget.com/2014/06/11/hp-the-machine/
  • 52. Outline • The Big Data era: data generation explosion • Storing data • Processing data • Where do we place data? • How do we use data? • Research at BSC
  • 53. Relational databases are sometimes not good for scale-out. NoSQL for non-structured data, eventual consistency. [Figure: items 1 to 5 distributed across nodes as 2 3 4 | 1 3 5 | 1 3 4 | 2 4 5 | 1 2 5]
  • 54. Programming model • To meet the challenges: MapReduce – Programming model introduced by Google in the early 2000s to support distributed computing (special emphasis on fault tolerance) • Ecosystem of big data processing tools – open source, distributed, and run on commodity hardware • The key innovation of MapReduce is – the ability to take a query over a data set, divide it, and run it in parallel over many nodes • Two phases – Map phase – Reduce phase [Figure: Input Data → Mappers → Reducers → Output Data]
  • 55. MapReduce: example (Input Data → Mappers → Reducers → Output Data) Input: hamlet.txt = "to be or not"; 12th.txt = "be not afraid of greatness". Map: (to, hamlet.txt) (be, hamlet.txt) (or, hamlet.txt) (not, hamlet.txt); (be, 12th.txt) (not, 12th.txt) (afraid, 12th.txt) (of, 12th.txt) (greatness, 12th.txt). Reduce output: afraid, (12th.txt); be, (12th.txt, hamlet.txt); greatness, (12th.txt); not, (12th.txt, hamlet.txt); of, (12th.txt); or, (hamlet.txt); to, (hamlet.txt)
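The inverted-index example on the slide can be sketched in a few lines of single-process Python. This only mimics the structure of the map, shuffle and reduce steps; it is not Hadoop or Google's implementation, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(filename, text):
    """Map: emit one (word, filename) pair per word in the file."""
    return [(word, filename) for word in text.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for word, filename in pairs:
        grouped[word].append(filename)
    return grouped


def reduce_phase(word, filenames):
    """Reduce: collapse the filenames for one word into a sorted, unique tuple."""
    return word, tuple(sorted(set(filenames)))


def inverted_index(files):
    pairs = [pair for name, text in files.items()
             for pair in map_phase(name, text)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, names) for word, names in sorted(grouped.items()))


files = {
    "hamlet.txt": "to be or not",
    "12th.txt": "be not afraid of greatness",
}
index = inverted_index(files)
for word, names in index.items():
    print(f"{word}, ({', '.join(names)})")
```

In a real deployment each `map_phase` call would run on the node holding that file and each `reduce_phase` call on the node responsible for that key; the shuffle is where data crosses the network.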
  • 56. Limitations of MapReduce as a programming model? • MapReduce is great, but not everyone is a MapReduce expert: “I am a Python expert but…” • There is a class of algorithms that cannot be efficiently implemented with the MapReduce programming model • Different programming models deal with different challenges – pyCOMPSs from BSC – Spark from Berkeley
  • 57. Google progressively dropping MapReduce • 2004: Google introduces MapReduce • 2006: The Apache Hadoop Project brings an open-source MapReduce implementation, with the team of the Apache Nutch crawler and the support of Yahoo! • 2010: Google announces Pregel, used for building incremental reverse indexes instead of MapReduce. Philosophy: “think like a vertex” (event-driven) • June 25, 2014: Google announces Cloud Dataflow: write code once, run it in batch or stream mode. Cloud Dataflow is a managed service for creating data pipelines that ingest, transform and analyze data in both batch and streaming modes.
  • 58. What happens if SW does not consider HW • Terasort contest: sorting 100 TB of data • Number 1: Hadoop • 2100 nodes, 12 cores per node, 64 GB per node • 24,000 cores • 134 TB memory • Time: 4300 s • Cost on Amazon: $8,800 • Number 2: Tritonsort • 52 nodes, 8 cores per node, 24 GB per node • 416 cores • 1.2 TB memory • Time: 8300 s and 6400 s • Cost on Amazon: $294 and $226 • Hadoop is easy to program, but needs 57× more cores and 100× more memory, and only gets 2× the performance
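The resource ratios in the Terasort comparison follow directly from the per-node figures. The sketch below recomputes them; it uses the slower of the two Tritonsort runs (8300 s, $294), and the dictionary layout and helper names are illustrative. Note that the slide rounds the Hadoop totals (24,000 cores, 134 TB), so the ratios computed from the raw per-node numbers come out slightly higher than the 57× and 100× quoted.

```python
# Figures from the slide; the Tritonsort entry uses the 8300 s / $294 run.
hadoop = {"nodes": 2100, "cores_per_node": 12, "mem_gb_per_node": 64,
          "time_s": 4300, "cost_usd": 8800}
tritonsort = {"nodes": 52, "cores_per_node": 8, "mem_gb_per_node": 24,
              "time_s": 8300, "cost_usd": 294}


def total_cores(system):
    return system["nodes"] * system["cores_per_node"]


def total_mem_gb(system):
    return system["nodes"] * system["mem_gb_per_node"]


core_ratio = total_cores(hadoop) / total_cores(tritonsort)
mem_ratio = total_mem_gb(hadoop) / total_mem_gb(tritonsort)
speedup = tritonsort["time_s"] / hadoop["time_s"]
cost_ratio = hadoop["cost_usd"] / tritonsort["cost_usd"]

print(f"Cores:   {core_ratio:.1f}x more for Hadoop")
print(f"Memory:  {mem_ratio:.1f}x more for Hadoop")
print(f"Speedup: {speedup:.2f}x faster for Hadoop")
print(f"Cost:    {cost_ratio:.1f}x higher for Hadoop")
```

The cost line is not on the slide as a ratio, but it follows from the quoted Amazon prices: roughly 30 times the spend for about twice the speed.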
  • 60. BigData needs Intelligence! Deep Learning: Building high-level abstractions Watson: Reasoning
  • 61. Deep Learning: generating high-level abstractions Toronto Face Database (TFD) – Automatic recognition of facial expressions. Ranzato, M., Mnih, V., Susskind, J. and Hinton, G. E. Modeling Natural Images Using Gated MRFs. IEEE Trans. Pattern Analysis and Machine Intelligence, to appear.
  • 62. Watson: Calculating vs. Learning Source: Courtesy of IBM
  • 63. What is behind Watson? IBM Watson -- How to replicate Watson hardware and systems design for your own use in your basement, by Tony Pearson (ibm.co/Pearson) http://tinyurl.com/pt6mdfu DeepQA: Massively Parallel Probabilistic Evidence-Based Architecture. For each question, it generates and scores many hypotheses using a combination of thousands of Natural Language Processing, Information Retrieval, Machine Learning and Reasoning algorithms. These gather, evaluate, weigh and balance different types of evidence to deliver the answer with the best evidence support it can find
  • 64. Software tuning key to exploit hardware!
  • 65. Outline • The Big Data era: data generation explosion • Storing data • Processing data • Where do we place data? • How do we use data? • Research at BSC
  • 66. Barcelona Supercomputing Center – Centro Nacional de Supercomputación BSC-CNS objectives: – R&D in Computer, Life, Earth and Engineering Sciences – Supercomputing services and support to Spanish and European researchers. BSC-CNS is a consortium that includes: – Spanish Government 51% – Catalan Government 37% – Universitat Politècnica de Catalunya (UPC) 12%. 400+ people, 45 countries
  • 67. Mission of BSC Scientific Departments • EARTH SCIENCES: To develop and implement global and regional state-of-the-art models for short-term air quality forecast and long-term climate applications • LIFE SCIENCES: To understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics) • CASE: To develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations) • COMPUTER SCIENCES: To influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency
  • 68. MareNostrum 3 • 48,896 Intel SandyBridge cores at 2.6 GHz • 84 Intel Xeon Phi • Peak performance of 1.1 Petaflops • 100.8 TB of main memory • 2 PB of disk storage • 8.5 PB of archive storage • 9th in Europe, 29th in the world (June 2013 Top500 List)
  • 69. Analytics (real-time) Big Data HPC Cloud Computing Sustainable Computing SDE Smart Cities & IoT BSC Programming Models ALOJA BGAS BSC Tools BSC-CATech Citizen as a sensor Deep Learning NoSQL Data Management Collab. CASE Life Science Real-time Monitoring of Image Streaming BSC Programming Models servIoTicy.com Adaptive Scheduler