This document discusses big data and opportunities related to different types of data. It covers challenges of big data including volume, velocity, variety and veracity. It also discusses value that can be extracted from data. The document outlines static versus real-time data and structured versus unstructured data. Examples of applying machine learning techniques like regression, classification, clustering and dimensionality reduction are provided. The introduction to cloud computing discusses public, private and hybrid clouds and features of cloud infrastructure.
14. Real-Time Data
Real-time VR visualization of mobility and social media data in Brussels.
[VR tool visualizing public transportation flows in Brussels; the system enables the user to see on the fly the position of STIB buses, the occupancy of Villo stations, and geolocated social media posts.]
20. Size of Video Data
My camera specs:
• 8 MP (3264×2448 pixels) image
• 640×480 pixels video
• 24 bits per pixel
• 30 frames per second
Video from Milos → 5 min
⇒ 8.3 GB for storage
Internet connection → 1 Mbps
⇒ 18 hours for uploading
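The two figures on this slide follow from simple arithmetic, sketched below. The sketch assumes the video is stored uncompressed and that GB and Mbps are decimal units (10^9 bytes, 10^6 bits per second); the constant and function names are illustrative and not part of the slides.

```python
# Raw (uncompressed) video size and upload time, using the slide's numbers.
WIDTH, HEIGHT = 640, 480      # video resolution in pixels
BITS_PER_PIXEL = 24
FPS = 30                      # frames per second
DURATION_S = 5 * 60           # 5-minute video
LINK_BPS = 1_000_000          # 1 Mbps link (assumed to mean 10^6 bits/s)


def raw_video_bits(width, height, bits_per_pixel, fps, seconds):
    """Total number of bits in an uncompressed video stream."""
    return width * height * bits_per_pixel * fps * seconds


total_bits = raw_video_bits(WIDTH, HEIGHT, BITS_PER_PIXEL, FPS, DURATION_S)
size_gb = total_bits / 8 / 1e9              # decimal gigabytes
upload_hours = total_bits / LINK_BPS / 3600

print(f"storage: {size_gb:.1f} GB")
print(f"upload:  {upload_hours:.1f} hours")
```

Under these assumptions the result is about 8.3 GB and a little over 18 hours, which matches the slide. A compressed video would be far smaller.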
24. Regression: Predicting Second-Hand Car Prices
[Figure: scatter plot of price (euros) versus mileage (in 1000s of km)]
Supervised Learning
– Learn a model based on labeled training data
Regression
– The predicted parameter is continuous
26. Regression: Matrix Completion
Predict movie ratings
Netflix: users rate movies on a 0-5 star scale ("?" marks a missing rating)
Movie                Nikos (1)  Eva (2)  Duc (3)  Tien (4)
P.S. I love you      5          5        0        0
Lord of the rings    1          0        ?        4
Interstellar         ?          ?        5        5
Spectre              0          0        4        ?
Crazy, stupid love   5          4        0        ?
27. Classification: Sepsis Mortality Probability
[Figure: outcome (survived = 0, died = 1) versus APACHE II score at baseline]
Supervised Learning
– Learn a model based on labeled training data
Classification
– The predicted parameter is discrete
28. Clustering: The Pizza Hut Problem
Unsupervised Learning
– No labeled data available
Clustering
– Group the data
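Grouping unlabeled data can be sketched with k-means, one common clustering algorithm (the slide does not say which algorithm is meant). The 2-D points and the starting centroids below are made up for illustration and do not come from the slides.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points; `centroids` holds the initial guesses."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # an empty cluster keeps its previous centroid
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters


# Hypothetical locations (not from the slides): two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centers, groups = kmeans(points, centroids=[points[0], points[3]])
print(centers)
print(groups)
```

No labels are used anywhere: the groups emerge only from the distances between points, which is what makes this unsupervised.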
30. Data Visualization
3D visualization map of the frequency of tweets in Brussels.
[Superposition of a high-resolution texture of the region, and a so-called height-map]
31. Topic Extraction on Social Media
Dominant media communities
on #Twitter in #Brussels
during June 2016 – January 2017
Visualization from 7 million tweets.
32. Twitter User Geolocation
Multiview deep learning architecture
S2 adaptive grid (Google S2 geometry library)
Geolocation accuracy: significant gain compared to the latest approaches.
T. Do Huu, D. M. Nguyen, E. Tsiligianni, B. Cornelis, N. Deligiannis, “Twitter user geolocation
using deep multiview learning”, IEEE ICASSP 2018.
35. Phrase localization in image
Example caption: A man with a goatee in a black shirt and white latex gloves is using a tattoo on someone's back
36. Learning Problem Categories
Learning
• Supervised Learning: learn a model based on labeled training data
- Regression: the predicted data is continuous
- Classification: the predicted data is discrete
• Unsupervised Learning: no labeled training data
- Clustering: cluster the data into groups
- Dimensionality Reduction: find lower dimensions of the data
37. What is the Learning Problem Category?
Google news
38. What is the Learning Problem Category?
Optical Character Recognition
39. What is the Learning Problem Category?
Predict the Total Amount of Sales in Oklahoma (OK)
State  # malls  Sales (million $)
WA     630      15.5
NC     370      7.5
CA     616      13.9
UT     700      18.7
FL     430      8.2
IL     568      13.2
TX     1200     23
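Since sales is a continuous quantity, this question can be treated as a regression problem. The sketch below fits a straight line to the table by ordinary least squares, which is only one possible model. The mall count for Oklahoma is not given on the slide, so `OK_MALLS` is a hypothetical placeholder.

```python
# (state, number of malls, sales in million $) copied from the slide's table.
DATA = [
    ("WA", 630, 15.5), ("NC", 370, 7.5), ("CA", 616, 13.9), ("UT", 700, 18.7),
    ("FL", 430, 8.2), ("IL", 568, 13.2), ("TX", 1200, 23.0),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


malls = [m for _, m, _ in DATA]
sales = [s for _, _, s in DATA]
slope, intercept = fit_line(malls, sales)

OK_MALLS = 500  # hypothetical value: the slide does not give Oklahoma's mall count
prediction = slope * OK_MALLS + intercept
print(f"slope = {slope:.4f} million $ per mall, intercept = {intercept:.2f}")
print(f"predicted sales for {OK_MALLS} malls: {prediction:.1f} million $")
```

With only seven points and one feature, the fitted line is a rough guide at best; Texas in particular pulls the slope strongly.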
40. What is the Learning Problem Category?
Spam mail detector
42. Cloud Categories
Private cloud
(accessible only to company employees)
Public cloud
(service provided to any paying customer)
• Amazon S3 (Simple Storage Service): store arbitrary datasets, pay per GB-month stored
• Amazon EC2 (Elastic Compute Cloud): upload and run arbitrary OS images, pay per CPU hour used
• Google Compute Engine: develop applications within their App Engine framework, upload data that will be imported into their format, and run
44. Features in Today’s Cloud!
• Massive scale
• On-demand access
- Pay-as-you-go, no upfront commitment
- Anyone can access it
• Data-intensive applications
- MBs have become TBs, PBs and EBs
- Daily logs, forensics, web data, etc.
• New cloud programming paradigms
- MapReduce/Hadoop, NoSQL/Cassandra/MongoDB
- High in accessibility and ease of programmability
- Lots of open-source
45. Components of a Cloud
Servers (front) Servers (back)
Servers (inside) Servers (secure)
48. On-Demand Access
• On-demand access: like renting a car when needed
- AWS Elastic Compute Cloud (EC2): a few cents to a few USD per CPU hour
- AWS Simple Storage Service (S3): a few cents to a few USD per GB-month
• HaaS: Hardware as a Service
- You get access to hardware machines and can do whatever you want with them (for example, build your own cluster)
- Security risks
• IaaS: Infrastructure as a Service
- You get access to flexible computing and storage infrastructure. Virtualization or, for example, a Linux environment are ways to achieve this
- Examples: Amazon Web Services, Eucalyptus, Microsoft Azure
49. On-Demand Access
• PaaS: Platform as a Service
- You get access to flexible computing and storage infrastructure, together with a software platform
- Example: Google App Engine (Python, Java)
• SaaS: Software as a Service
- You get access to software services when you need them. Often subsumes Service-Oriented Architectures
- Examples: Google Docs, MS Office on demand
50. Data-Intensive Applications
• Computation-intensive computing
- Example areas: MPI-based, high-performance computing, grids
- Typically run on supercomputers
- The speed of supercomputers is benchmarked in "FLOPS" (FLoating point Operations Per Second), not in "MIPS" (Million Instructions Per Second) as for general-purpose computers
• Data-intensive computing
- Typically store data at datacenters
- Use compute nodes nearby
- Compute nodes run computation services
- The focus is on I/O operations (disk and/or network), not on CPU utilization
51. New Cloud Programming Paradigms
Easy to write and run highly parallel programs in new cloud programming paradigms:
• Google
- MapReduce and Sawzall
- MapReduce indexing: a chain of 24 MapReduce jobs
- Approx. 200K jobs processing 50 PB/month (in 2006)
• Amazon
- Elastic MapReduce service (pay-as-you-go)
• Yahoo!
- Hadoop + Pig
- WebMap: a chain of 100 MapReduce jobs
- 280 TB of data, 2500 nodes
• Facebook
- Approx. 300 TB total, adding 2 TB/day (in 2008)
- 3K jobs processing 55 TB/day
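The MapReduce model behind these systems can be illustrated with the classic word-count example. The sketch below runs the map, shuffle and reduce phases sequentially in a single Python process. It does not use Hadoop or any cloud API, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce one after the other on a list of texts."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data in the cloud", "the cloud stores big data"])
print(counts)
```

Each call to `map_phase` and each call to `reduce_phase` is independent of the others, which is what lets a real framework spread them over many nodes.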
52. Cloud Categories
Private cloud
(accessible only to company employees)
Public cloud
(service provided to any paying customer)
If you are starting your own company, should you use a public cloud or purchase your own private cloud?
Private cloud: power, cooling, management costs. Public cloud: CPU usage, storage usage.
53. To Outsource Or Not
A medium-sized organization wishes to run a service for M months.
The service requires 128 servers (1024 cores) and 524 TB of storage.
• Outsource (e.g., AWS) [monthly cost]
- S3: $0.12 per GB-month
- EC2: $0.10 per CPU-hour
- Storage: $0.12 × 524 × 1000 ≈ $62,000
- Computation: $0.10 × 1024 × 24 × 30 ≈ $74,000
- Total: approx. $136,000
• Purchase [total cost]
- Storage: approx. $349,000
- Total: approx. $1,555,000 + $7,500 per month for a system administrator per 100 nodes
54. To Outsource Or Not
Breakeven analysis → duration of usage defines the
• Outsource (e.g., AWS) [monthly cost]
- Storage: $0.12 × 524 × 1000 ≈ $62,000 per month
- Total: approx. $136,000 per month
• Purchase [total cost]
- Storage: approx. $349,000
- Total: approx. $1,555,000 + $7,500 per month
Breakeven points
• Storage: $349,000 / ($0.12 × 524 × 1000) ≈ 5.55 months
• Total: $1,555,000 / ($136,000 − $7,500) ≈ 12 months
✓ Startups use clouds a lot: they do not know how long they will be in business
✓ Cloud providers benefit monetarily more from storage
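The breakeven figures can be checked with a few lines of arithmetic. The sketch below uses the prices from the slides and assumes 1 TB = 1000 GB, a 30-day month, and a single system administrator at $7,500 per month. The roughly 12-month figure is reproduced when that administrator cost is subtracted from the monthly cloud bill, which is an assumption about how the slide arrived at it.

```python
# Prices and sizes taken from the slides.
S3_PER_GB_MONTH = 0.12
EC2_PER_CPU_HOUR = 0.10
CORES, STORAGE_TB = 1024, 524
PURCHASE_STORAGE, PURCHASE_TOTAL = 349_000, 1_555_000
ADMIN_PER_MONTH = 7_500   # assumption: one administrator is budgeted

storage_month = S3_PER_GB_MONTH * STORAGE_TB * 1000    # 1 TB taken as 1000 GB
compute_month = EC2_PER_CPU_HOUR * CORES * 24 * 30     # 30-day month
cloud_month = storage_month + compute_month

# Owning pays off once the accumulated cloud bill exceeds the purchase price.
breakeven_storage = PURCHASE_STORAGE / storage_month
# For the full system, the monthly saving shrinks by the administrator's cost.
breakeven_total = PURCHASE_TOTAL / (cloud_month - ADMIN_PER_MONTH)

print(f"cloud cost per month: ${cloud_month:,.0f}")
print(f"breakeven, storage only: {breakeven_storage:.2f} months")
print(f"breakeven, whole system: {breakeven_total:.1f} months")
```

If the service is expected to run for less than the breakeven duration, outsourcing is the cheaper option under these assumptions.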