Big Data
Types of data and opportunities
Prof. Dr. Nikolaos (Nikos) Deligiannis
Email: ndeligia@vub.be
Twitter: @prof_ndeligia
Big Data: Big Challenges and Big Value
Big Data
Challenges
Volume
Veracity
Variety
Velocity
Value
Data Deluge
The Trend in the Job Market
Source: Indeed.com
Types of Data: Static vs Real-time
Part 1
Static Data
Medical Images
Road network information
Open Data
Static Data: Belgium OECD Data
Static Data
Road network information
Actual data sample (GPX data).
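A GPX file is an XML document, so a track like the sample above can be read with a standard XML parser. The sketch below is only an illustration, not tooling from the course: the two track points are made up, and the code assumes the usual GPX 1.1 layout (trk / trkseg / trkpt elements with lat and lon attributes).

```python
import xml.etree.ElementTree as ET

# Hypothetical two-point track; the coordinates are invented for illustration.
SAMPLE_GPX = """<gpx version="1.1" creator="example"
     xmlns="http://www.topografix.com/GPX/1/1">
  <trk><trkseg>
    <trkpt lat="50.8225" lon="4.3950"><ele>85.0</ele></trkpt>
    <trkpt lat="50.8231" lon="4.3962"><ele>86.5</ele></trkpt>
  </trkseg></trk>
</gpx>"""

# GPX 1.1 elements live in this XML namespace.
NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def read_track_points(gpx_text):
    """Return the (latitude, longitude) pairs of all track points."""
    root = ET.fromstring(gpx_text)
    return [
        (float(pt.get("lat")), float(pt.get("lon")))
        for pt in root.findall(".//gpx:trkpt", NS)
    ]


points = read_track_points(SAMPLE_GPX)
print(points)
```

Because the file never changes once recorded, it can be parsed once and cached, which is what makes this kind of data "static".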
Real-Time Data
Smart Mobility Smart Cities
Smart FinTech Smart Marketing
Real-Time Data
Positions of public transport vehicles
Real-Time Data
Public bicycle usage
Real-Time Data
Real-time VR visualization of mobility and social
media data in Brussels.
[VR tool visualizing public transport flows in Brussels; the
system lets the user see, on the fly, the position of STIB
buses, the occupancy of Villo stations, and geolocated social media
posts.]
Health Data Analytics
Data from epidemic web apps. [link]
Types of Data: Structured vs Unstructured
Part 2
Structured Data
Phone addresses
IBAN bank codes
Product descriptions
Unstructured Data
Images & Video
Audio
Text files (reports...)
Unstructured Data: Video
Size of Video Data
My camera specs:
• 8 MP (3264×2448 pixels) image
• 640×480 pixels video
• 24 bits per pixel
• 30 frames per second
Video from Milos → 5 min
→ approx. 8.3 GB for storage (uncompressed)
Internet connection → 1 Mbps
→ approx. 18 hours for uploading
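The storage and upload figures follow directly from the camera specs. The sketch below reproduces the arithmetic, assuming raw (uncompressed) video, decimal gigabytes, and a link that sustains exactly 1 Mbps; the variable names are only illustrative.

```python
# Raw video size and upload time from the specs on the slide.
WIDTH, HEIGHT = 640, 480   # video resolution in pixels
BITS_PER_PIXEL = 24
FPS = 30                   # frames per second
DURATION_S = 5 * 60        # 5-minute clip
LINK_BPS = 1_000_000       # 1 Mbps uplink


def raw_video_bits(width, height, bits_per_pixel, fps, seconds):
    """Number of bits in an uncompressed video stream."""
    return width * height * bits_per_pixel * fps * seconds


total_bits = raw_video_bits(WIDTH, HEIGHT, BITS_PER_PIXEL, FPS, DURATION_S)
size_gb = total_bits / 8 / 1e9              # bytes -> decimal gigabytes
upload_hours = total_bits / LINK_BPS / 3600

print(f"storage: {size_gb:.1f} GB")
print(f"upload:  {upload_hours:.1f} hours")
```

In practice video is stored compressed, so real files are far smaller; the point of the slide is the scale of the raw data.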
Types of Data: Open vs Public vs Private
Part 3
Private vs. Public
Extracting Value
Part 4
Regression: Predicting Second-Hand Car Prices
[Plot: price (euros) versus mileage (in 1000s of km)]
Supervised Learning
– Learn a model based on
labeled training data
Regression
– The predicted parameter is
continuous
Regression: Recommender Systems
Regression: Matrix Completion
predict movie ratings
Netflix: users rate movies on a 0-5 star scale
                      Nikos (1)  Eva (2)  Duc (3)  Tien (4)
P.S. I love you           5         5        0        0
Lord of the rings         1         0        ?        4
Interstellar              ?         ?        5        5
Spectre                   0         0        4        ?
Crazy, stupid love        5         4        0        ?
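One common way to fill in the "?" entries is low-rank matrix factorization: approximate the ratings matrix by the product of two small factor matrices, fitted only on the known ratings. The slides do not say which method is meant, so the sketch below is just one possible approach. The matrix layout (rows = movies, columns = users) is one reading of the slide's table, and the rank, learning rate, regularization and number of steps are illustrative assumptions.

```python
import random

# Rows = movies, columns = users (Nikos, Eva, Duc, Tien); None = unknown rating.
# The layout is an assumed reading of the slide's table.
R = [
    [5, 5, 0, 0],              # P.S. I love you
    [1, 0, None, 4],           # Lord of the rings
    [None, None, 5, 5],        # Interstellar
    [0, 0, 4, None],           # Spectre
    [5, 4, 0, None],           # Crazy, stupid love
]


def factorize(ratings, rank=2, steps=5000, lr=0.01, reg=0.02, seed=0):
    """Fit ratings ~ P * Q^T by stochastic gradient descent on known entries."""
    rng = random.Random(seed)
    n_rows, n_cols = len(ratings), len(ratings[0])
    P = [[rng.uniform(0, 1) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(0, 1) for _ in range(rank)] for _ in range(n_cols)]
    known = [(i, j, r) for i, row in enumerate(ratings)
             for j, r in enumerate(row) if r is not None]
    for _ in range(steps):
        for i, j, r in known:
            err = r - sum(P[i][k] * Q[j][k] for k in range(rank))
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q


def predict(P, Q, i, j):
    """Predicted rating of movie i by user j."""
    return sum(p * q for p, q in zip(P[i], Q[j]))


P, Q = factorize(R)
completed = [[round(predict(P, Q, i, j), 1) for j in range(len(R[0]))]
             for i in range(len(R))]
for row in completed:
    print(row)
```

The predictions for the unknown entries come from the same factors that explain the known ratings, which is why users with similar tastes end up with similar predicted scores.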
Classification: Sepsis Mortality Probability
[Plot: outcome (Survived = 0, Died = 1) versus APACHE II score at baseline]
Supervised Learning
– Learn a model based on labeled training data
Classification
– The predicted parameter is discrete
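A standard model for a 0/1 outcome driven by a single score is logistic regression, which maps the score to a probability between 0 and 1. The sketch below is not the model behind the slide's plot: the (score, outcome) pairs are invented purely for illustration, and the learning rate and number of epochs are arbitrary choices.

```python
import math

# Hypothetical (APACHE II score, outcome) pairs: 0 = survived, 1 = died.
DATA = [(8, 0), (12, 0), (15, 0), (18, 0), (20, 0), (22, 1),
        (25, 1), (28, 0), (30, 1), (35, 1), (38, 1), (42, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(data, lr=0.1, epochs=5000):
    """Fit P(died | score) with batch gradient descent; returns a predictor."""
    scores = [x for x, _ in data]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((x - mean) ** 2 for x in scores) / len(scores))
    w = b = 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            z = (x - mean) / std            # standardized score
            err = sigmoid(w * z + b) - y    # gradient of the log-loss
            grad_w += err * z
            grad_b += err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return lambda score: sigmoid(w * (score - mean) / std + b)


mortality = fit_logistic(DATA)
for score in (10, 25, 40):
    print(score, round(mortality(score), 2))
```

Thresholding the predicted probability (for example at 0.5) turns it into the discrete class label that defines a classification problem.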
Clustering: The Pizza Hut Problem
Unsupervised Learning
– No labeled data available
Clustering
– Group the data
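Grouping customers so that each group can be served by one nearby store is a natural fit for k-means clustering. The sketch below is a minimal version under stated assumptions: the customer coordinates are made up, and the first k points are used as initial centers, which is a simplification (practical implementations use better initialization).

```python
# Hypothetical customer locations (x, y), e.g. in km on a city map.
CUSTOMERS = [(1, 1), (8, 8), (1, 2), (9, 8), (2, 1), (8, 9), (2, 2), (9, 9)]


def kmeans(points, k, iterations=20):
    """Plain k-means; returns the k cluster centers (candidate store sites)."""
    centers = list(points[:k])  # simplistic initialization: first k points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2
                + (p[1] - centers[c][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers


store_sites = kmeans(CUSTOMERS, k=2)
print(store_sites)
```

No labels are needed: the algorithm only uses distances between customers, which is what makes this unsupervised learning.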
Dimensionality Reduction: Visualization
Unsupervised Learning
– No labeled data available
Dimensionality Reduction
– Find the latent dimensions of
the data
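Principal component analysis (PCA) is a widely used way to find such latent dimensions: it looks for the direction along which the data varies most and projects the data onto it. The sketch below finds the first principal component by power iteration on the covariance matrix. The five 3-D points are invented for illustration, and the slides do not state which reduction method they have in mind.

```python
import math

# Hypothetical 3-D measurements that lie close to a single line.
POINTS = [(1.0, 2.1, 1.9), (2.0, 3.9, 4.1), (3.0, 6.0, 6.1),
          (4.0, 8.1, 7.9), (5.0, 9.9, 10.0)]


def first_principal_component(points, iterations=100):
    """Return (unit direction, 1-D coordinates) of the first principal axis."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - mean[d] for d in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered point onto the principal direction.
    coords = [sum(row[d] * v[d] for d in range(dim)) for row in centered]
    return v, coords


direction, coords = first_principal_component(POINTS)
print([round(x, 3) for x in direction])
print([round(c, 2) for c in coords])
```

Each 3-D point is replaced by a single number, which is enough to plot or compare the points while keeping most of their variation.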
Data Visualization
3D visualization map of the frequency of tweets in Brussels.
[Superposition of a high-resolution texture of the region, and a so-called height-map]
Topic Extraction on Social Media
Dominant media communities
on #Twitter in #Brussels
during June 2016 – January 2017
Visualization from 7 million tweets.
Twitter User Geolocation
Multiview deep learning architecture
S2 adaptive grid (Google S2 geometry library)
Geolocation accuracy
Significant gain in
geolocation accuracy
compared to the latest
approaches.
T. Do Huu, D. M. Nguyen, E. Tsiligianni, B. Cornelis, N. Deligiannis, “Twitter user geolocation
using deep multiview learning”, IEEE ICASSP 2018.
Image Analysis
True orthophoto Predicted Pixel label
Yu Liu, D. M. Nguyen et al. 2017
Cross-Modal Image-Text Retrieval
Phrase localization in image
Example caption: A man with a goatee in a black shirt and
white latex gloves is using a tattoo on someone's back
Learning Problem Categories
Learning
• Supervised Learning: learn a model based on labeled training data
- Regression: the predicted data is continuous
- Classification: the predicted data is discrete
• Unsupervised Learning: no labeled training data
- Clustering: cluster the data into groups
- Dimensionality Reduction: find lower dimensions of the data
What is the Learning Problem Category?
Google news
What is the Learning Problem Category?
Optical Character Recognition
What is the Learning Problem Category?
Predict the Total Amount of Sales in Oklahoma (OK)
State # malls Sales (m. $)
WA 630 15.5
NC 370 7.5
CA 616 13.9
UT 700 18.7
FL 430 8.2
IL 568 13.2
TX 1200 23
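Since the quantity to predict (sales) is continuous and labeled examples are available, a simple least-squares line is one reasonable model for this table. The sketch below fits that line to the seven states listed. The number of malls in Oklahoma is not given in the transcript, so OK_MALLS is a placeholder value, not real data.

```python
# (state, number of malls, sales in million $) as listed on the slide.
TABLE = [("WA", 630, 15.5), ("NC", 370, 7.5), ("CA", 616, 13.9),
         ("UT", 700, 18.7), ("FL", 430, 8.2), ("IL", 568, 13.2),
         ("TX", 1200, 23.0)]

OK_MALLS = 500  # placeholder: the slide does not give Oklahoma's mall count


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


malls = [m for _, m, _ in TABLE]
sales = [s for _, _, s in TABLE]
slope, intercept = fit_line(malls, sales)


def predict_sales(n_malls):
    """Predicted sales (million $) for a state with n_malls malls."""
    return slope * n_malls + intercept


print(f"sales ~ {slope:.4f} * malls + {intercept:.2f}")
print(f"predicted sales for {OK_MALLS} malls: {predict_sales(OK_MALLS):.1f} m. $")
```

With a single outlier-like point (TX) and only seven rows, the fitted line should be read as a rough estimate rather than a reliable forecast.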
What is the Learning Problem Category?
Spam mail detector
Introduction to the Cloud
Part 5
Cloud Categories
Private cloud
(accessible only to company employees)
Public cloud
(service provided to any paying customer)
Amazon S3 (Simple Storage Service): store
arbitrary datasets, pay per GB-month stored
Amazon EC2 (Elastic Compute Cloud): upload
and run arbitrary OS images, pay per CPU hour
used
Google App Engine: develop applications
within the App Engine framework, upload data
that will be imported into its format, and run
Example of Cloud Architecture
Features in Today’s Cloud!
• Massive scale
• On-demand access
- Pay-as-you-go, no upfront commitment
- Anyone can access it
• Data-intensive applications
- MBs have become TBs, PBs and EBs
- Daily logs, forensics, web data, etc.
• New cloud programming paradigms
- MapReduce/Hadoop, NoSQL/Cassandra/MongoDB
- High in accessibility and ease of programmability
- Lots of open-source
Components of a Cloud
Servers (front) Servers (back)
Servers (inside) Servers (secure)
Powering a Cloud
Hydroelectric plants Thermoelectric plants
Photovoltaic plant
On-Demand Access
• On-demand access: like renting a car when needed
- AWS Elastic Compute Cloud (EC2): a few cents to a few USD
per CPU hour
- AWS Simple Storage Service (S3): a few cents to a few USD per
GB-month
• HaaS: Hardware as a Service
- You get access to hardware machines, do whatever you
want with them (example, your own cluster)
- Security risks
• IaaS: Infrastructure as a Service
- You get access to flexible computing and storage
infrastructure. Virtualization (for example, of a Linux
environment) is one way to achieve this
- Examples: Amazon Web Services, Eucalyptus, Microsoft
Azure
On-Demand Access
• PaaS: Platform as a Service
- You get access to flexible computing and storage
infrastructure, together with a software platform.
- Example: Google AppEngine (Python, Java)
• SaaS: Software as a Service
- You get access to software services, when you need
them. Often subsumes Service Oriented Architectures
- Examples: Google docs, MS Office on demand
Data-Intensive Applications
• Computation-intensive computing
- Example areas: MPI-based,
high performance computing, grids
- Typically run on supercomputers
- The speed of supercomputers is benchmarked in "FLOPS" (FLoating point
Operations Per Second) rather than in "MIPS" (Million Instructions Per
Second), which is used for general-purpose computers
• Data-intensive computing
- Typically store data at datacenters
- Use compute nodes nearby
- Compute nodes run computation services
- The focus is on I/O operations (disk and/or network) not
on CPU utilization
New Cloud Programming Paradigms
Easy to write and run highly parallel programs in new cloud
programming paradigms:
• Google
- MapReduce and Sawzall
- MapReduce indexing: a chain of 24 MapReduce jobs
- Approx. 200K jobs processing 50PB/month (in 2006)
• Amazon
- Elastic MapReduce service (pay-as-you-go)
• Yahoo!
- Hadoop + Pig
- WebMap: a chain of 100 MapReduce jobs
- 280 TB of data, 2500 nodes
• Facebook
- Approx. 300TB total, adding 2TB/day (in 2008)
- 3K jobs processing 55TB/day
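The MapReduce model behind these systems can be illustrated on one machine with the classic word-count example: a map step emits (key, value) pairs, a shuffle step groups the values by key, and a reduce step combines each group. The sketch below only mimics the programming model in plain Python. It is not the Hadoop or Google API, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["big data big value", "data in the cloud"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data over the network, but the program the user writes keeps this same shape, which is why chaining many such jobs is practical.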
Cloud Categories
Private cloud
(accessible only to company employees)
Public cloud
(service provided to any paying customer)
If you are starting your own company, should you use a
public cloud or purchase your own private cloud?
Power, cooling, management costs
CPU usage, storage usage
To Outsource Or Not
A medium-sized organization wishes to run a service for M
months
The service requires 128 servers (1024 cores) and 524
TB of storage
• Outsource (e.g., AWS) [monthly cost]
- S3: $0.12 per GB-month; EC2: $0.10 per
CPU-hour
- Storage: $0.12 × 524 × 1000 = $62,880
- Computation: $0.10 × 1024 × 24 × 30 = $73,728
- Total: approx. $136,600
• Purchase [total cost]
- Storage: approx. $349,000
- Total: approx. $1,555,000 + $7,500 per month for a
system administrator per 100 nodes
To Outsource Or Not
Breakeven analysis → duration of usage defines the
• Outsource (e.g., AWS) [monthly cost]
- Storage: $0.12 × 524 × 1000 = $62,880 per month
- Total: approx. $136,600 per month
• Purchase [total cost]
- Storage: approx. $349,000
- Total: approx. $1,555,000 + ($7,500 per month)
Breakeven points
• Storage: $349,000 / $62,880 ≈ 5.55 months
• Total: $1,555,000 / ($136,600 − $7,500) ≈ 12 months
✓ Startups use clouds a lot: they do not know how long
they will be in business
✓ Cloud providers benefit monetarily more from storage
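The breakeven points can be checked with a few lines of arithmetic. The sketch below uses the prices and purchase costs quoted on the slides, and assumes a 30-day month and a flat $7,500 monthly administration cost for the purchased system (as in the slide's formula); the variable names are only illustrative.

```python
# Figures quoted on the slides.
STORAGE_TB = 524
CORES = 1024
S3_USD_PER_GB_MONTH = 0.12
EC2_USD_PER_CPU_HOUR = 0.10
PURCHASE_STORAGE_USD = 349_000
PURCHASE_TOTAL_USD = 1_555_000
SYSADMIN_USD_PER_MONTH = 7_500   # assumed flat, as in the slide's formula

# Monthly cost of outsourcing (30-day month assumed).
cloud_storage = S3_USD_PER_GB_MONTH * STORAGE_TB * 1000
cloud_compute = EC2_USD_PER_CPU_HOUR * CORES * 24 * 30
cloud_total = cloud_storage + cloud_compute

# Owning wins once the accumulated cloud bills exceed the purchase cost.
# Storage only: purchase = cloud_storage * M
storage_breakeven = PURCHASE_STORAGE_USD / cloud_storage
# Whole system: purchase + sysadmin * M = cloud_total * M
total_breakeven = PURCHASE_TOTAL_USD / (cloud_total - SYSADMIN_USD_PER_MONTH)

print(f"cloud cost per month: ${cloud_total:,.0f}")
print(f"storage breakeven:    {storage_breakeven:.2f} months")
print(f"total breakeven:      {total_breakeven:.1f} months")
```

A service expected to run for less than about a year is cheaper to outsource under these prices, while storage alone pays for itself in under six months, which is the basis for the two conclusions above.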
