
Course 3 : Types of data and opportunities by Nikolaos Deligiannis

For more info about our Big Data courses, check out our website ➡️ https://www.betacowork.com/big-data/
---------
"Data is the new oil" - Many companies and professionals do not know how to use their data or are not aware of the added value they could gain from it.

It is in response to these problems that the project “Brussels: The Beating Heart of Big Data” was born.

This project, financed by the Region of Brussels Capital and organised by Betacowork, offers 3 training cycles of 10 courses on big data, at both beginner and advanced levels. These 3 cycles will be followed by a Hackathon weekend.

No prerequisites are required to start these courses. The aim of these courses is to familiarize participants with the principles of Big Data.



  1. 1. Big Data Types of data and opportunities Prof. Dr. Nikolaos (Nikos) Deligiannis Email: ndeligia@vub.be Twitter: @prof_ndeligia
  2. 2. Big Data: Big Challenges and Big Value. Big Data Challenges: Volume, Veracity, Variety, Velocity, Value
  3. 3. Data Deluge
  4. 4. The Trend in the Job Market Source: Indeed.com
  5. 5. Types of Data: Static vs Real-time Part 1
  6. 6. Static Data Medical Images Road network information Open Data
  7. 7. Static Data: Belgium OECD Data
  8. 8. Static Data Road network information Actual data sample (GPX data).
  9. 9. Real-Time Data Smart Mobility Smart Cities Smart FinTech Smart Marketing
  10. 10. Real-Time Data Positions of public transport vehicles
  11. 11. Real-Time Data Public bicycle usage
  12. 12. Real-Time Data [Link] Real-time VR visualization of mobility and social media data in Brussels. [VR tool visualizing public transportation flows in Brussels; the system lets the user see, on the fly, the positions of STIB buses, the occupancy of Villo stations, and geolocated social media posts.]
  13. 13. Health Data Analytics Data from epidemic web apps. [link]
  14. 14. Types of Data: Structured vs Unstructured Part 2
  15. 15. Structured Data Phone addresses IBAN bank codes Product descriptions
  16. 16. Unstructured Data Images & Video Audio Text files (reports...)
  17. 17. Unstructured Data: Video [Figure: video shown as binary digits]
  18. 18. Size of Video Data. My camera specs: 8 MP (3264×2448 pixels) image; 640×480 pixels video; 24 bits per pixel; 30 frames per second. Video from Milos → 5 min ⇒ 8.3 GB for storage. Internet connection → 1 Mbps ⇒ 18 hours for uploading.
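The two figures on this slide can be reproduced with a short calculation. This is a minimal sketch that assumes the video is stored uncompressed, that 1 GB means 10^9 bytes, and that 1 Mbps means 10^6 bits per second; the function names are illustrative and not from the slides.

```python
def raw_video_size_bytes(width, height, bits_per_pixel, fps, seconds):
    """Size of an uncompressed video in bytes."""
    total_bits = width * height * bits_per_pixel * fps * seconds
    return total_bits // 8


def upload_hours(size_bytes, link_bits_per_second):
    """Time in hours needed to push size_bytes through the link."""
    return size_bytes * 8 / link_bits_per_second / 3600


# Values from the slide: 640x480 video, 24 bits per pixel, 30 fps, 5 minutes.
size = raw_video_size_bytes(640, 480, 24, 30, 5 * 60)
hours = upload_hours(size, 1_000_000)  # 1 Mbps connection

print(f"{size / 1e9:.1f} GB, {hours:.1f} hours")
```

Real cameras compress video, so actual files are much smaller; the calculation shows the raw data volume behind the slide's numbers.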
  19. 19. Types of Data: Open vs Public vs Private Part 3
  20. 20. Private vs. Public
  21. 21. Extracting Value Part 4
  22. 22. Regression: Predicting Second-Hand Car Prices. [Plot: price (euros) versus mileage (in 1000s of km)] Supervised Learning – Learn a model based on labeled training data. Regression – The predicted parameter is continuous.
  23. 23. Regression: Recommender Systems
  24. 24. Regression: Matrix Completion. Predict movie ratings. Netflix: users rate movies using a 0-5 star rating. Movies, in order: P.S. I love you; Lord of the rings; Interstellar; Spectre; Crazy, stupid love. Ratings per user (? = not rated): Nikos (1): 5, 1, ?, 0, 5. Eva (2): 5, 0, ?, 0, 4. Duc (3): 0, ?, 5, 4, 0. Tien (4): 0, 4, 5, ?, ?.
  25. 25. Classification: Sepsis Mortality Probability. [Plot: outcome (Survived = 0, Died = 1) versus APACHE II score at baseline] Supervised Learning – Learn a model based on labeled training data. Classification – The predicted parameter is discrete.
  26. 26. Clustering: The Pizza Hut Problem Unsupervised Learning – No labeled data available Clustering – Group the data
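Clustering of this kind can be sketched with k-means, one common clustering method (the slide does not say which algorithm is meant). The customer coordinates and the initial store guesses below are hypothetical, chosen only to illustrate grouping unlabeled points.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for x, y in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each center to the mean of its cluster.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical customer locations (not from the slides).
customers = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
# Hypothetical initial guesses for two store locations.
centers, clusters = kmeans(customers, [(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

No labels are used anywhere: the groups emerge from the distances alone, which is what makes this unsupervised. The result of k-means depends on the initial centers, so in practice it is often run from several starting points.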
  27. 27. Dimensionality Reduction: Visualization Unsupervised Learning – No labeled data available Dimensionality Reduction – Find the latent dimensions of the data
  28. 28. Data Visualization. 3D visualization map of the frequency of tweets in Brussels. [Superposition of a high-resolution texture of the region, and a so-called height-map]
  29. 29. Topic Extraction on Social Media. Dominant media communities on #Twitter in #Brussels during June 2016 – January 2017. Visualization from 7 million tweets.
  30. 30. Twitter User Geolocation Multiview deep learning architecture S2 adaptive grid (Google S2 geometry library) Geolocation accuracy Significant gain in geolocation accuracy compared to the latest approaches. T. Do Huu, D. M. Nguyen, E. Tsiligianni, B. Cornelis, N. Deligiannis, “Twitter user geolocation using deep multiview learning”, IEEE ICASSP 2018.
  31. 31. Image Analysis True orthophoto Predicted Pixel label Yu Liu, D. M. Nguyen et al. 2017
  32. 32. Cross-Modal Image-Text Retrieval [Link]
  33. 33. Phrase localization in image. Example caption: A man with a goatee in a black shirt and white latex gloves is using a tattoo on someone’s back
  34. 34. Learning Problem Categories. Supervised Learning (learn a model based on labeled training data): Regression (the predicted data is continuous); Classification (the predicted data is discrete). Unsupervised Learning (no labeled training data): Clustering (cluster the data into groups); Dimensionality Reduction (find lower dimensions of the data).
  35. 35. What is the Learning Problem Category? Google news
  36. 36. What is the Learning Problem Category? Optical Character Recognition
  37. 37. What is the Learning Problem Category? Predict the total amount of sales in Oklahoma (OK). Columns: state, # malls, sales (m. $). WA: 630, 15.5; NC: 370, 7.5; CA: 616, 13.9; UT: 700, 18.7; FL: 430, 8.2; IL: 568, 13.2; TX: 1200, 23
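Because sales is a continuous quantity, this exercise can be treated as a regression problem. The sketch below fits a straight line to the seven states from the slide using ordinary least squares. The number of malls in Oklahoma is not given on the slide, so the value used for the prediction is a hypothetical placeholder.

```python
# Data from the slide: number of malls and sales (million $) per state.
malls = [630, 370, 616, 700, 430, 568, 1200]
sales = [15.5, 7.5, 13.9, 18.7, 8.2, 13.2, 23.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(malls, sales)

# Hypothetical mall count for Oklahoma (an assumption, not in the slides).
ok_malls = 500
ok_sales = slope * ok_malls + intercept
print(f"Predicted sales for {ok_malls} malls: {ok_sales:.1f} m. $")
```

A straight line is only one possible model; with seven data points it is a reasonable first attempt, but the prediction should be read as a rough estimate.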
  38. 38. What is the Learning Problem Category? Spam mail detector
  39. 39. Introduction to the Cloud Part 5
  40. 40. Cloud Categories. Private cloud (accessible only to company employees). Public cloud (service provided to any paying customer). Amazon S3 (Simple Storage Service): store arbitrary datasets, pay per GB-month stored. Amazon EC2 (Elastic Compute Cloud): upload and run arbitrary OS images, pay per CPU hour used. Google Compute Engine: develop applications within their App Engine framework, upload data that will be imported into their format, and run.
  41. 41. Example of Cloud Architecture
  42. 42. Features in Today’s Cloud! • Massive scale • On-demand access - Pay-as-you-go, no upfront commitment - Anyone can access it • Data-intensive applications - MBs have become TBs, PBs and XBs - Daily logs, forensics, web data, etc. • New cloud programming paradigms - MapReduce/Hadoop, NoSQL/Cassandra/MongoDB - High in accessibility and ease of programmability - Lots of open-source
  43. 43. Components of a Cloud Servers (front) Servers (back) Servers (inside) Servers (secure)
  44. 44. Powering a Cloud Hydroelectric plants Thermoelectric plants Photovoltaic plant
  45. 45. Features in Today’s Cloud! • Massive scale • On-demand access - Pay-as-you-go, no upfront commitment - Anyone can access it • Data-intensive applications - MBs have become TBs, PBs and XBs - Daily logs, forensics, web data, etc. • New cloud programming paradigms - MapReduce/Hadoop, NoSQL/Cassandra/MongoDB - High in accessibility and ease of programmability - Lots of open-source
  46. 46. On-Demand Access • On-demand access: like renting a car when needed - AWS Elastic Compute Cloud (EC2): a few cents to a few USD per CPU hour - AWS Simple Storage Service (S3): a few cents to a few USD per GB-month • HaaS: Hardware as a Service - You get access to hardware machines and can do whatever you want with them (for example, build your own cluster) - Security risks • IaaS: Infrastructure as a Service - You get access to flexible computing and storage infrastructure. Virtualization or, for example, a Linux environment are ways to achieve this - Examples: Amazon Web Services, Eucalyptus, Microsoft Azure
  47. 47. On-Demand Access • PaaS: Platform as a Service - You get access to flexible computing and storage infrastructure, together with a software platform. - Example: Google AppEngine (Python, Java) • SaaS: Software as a Service - You get access to software services, when you need them. Often subsumes Service Oriented Architectures - Examples: Google docs, MS Office on demand
  48. 48. Data-Intensive Applications • Computation-intensive computing - Example areas: MPI-based, high performance computing, grids - Typically run on supercomputers - the speed of supercomputers is benchmarked in "FLOPS" (FLoating point Operations Per Second), and not in terms of "MIPS" (Million Instructions Per Second), as for general-purpose computers • Data-intensive computing - Typically store data at datacenters - Use compute nodes nearby - Compute nodes run computation services - The focus is on I/O operations (disk and/or network) not on CPU utilization
  49. 49. New Cloud Programming Paradigms Easy to write and run highly parallel programs in new cloud programming paradigms: • Google - MapReduce and Sawzall - MapReduce indexing: a chain of 24 MapReduce jobs - Approx. 200K jobs processing 50PB/month (in 2006) • Amazon - Elastic MapReduce service (pay-as-you-go) • Yahoo! - Hadoop + Pig - WebMap: a chain of 100 MapReduce jobs - 280 TB of data, 2500 nodes • Facebook - Approx. 300TB total, adding 2TB/day (in 2008) - 3K jobs processing 55TB/day
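The MapReduce idea behind these systems can be illustrated with the classic word-count example. This is a single-machine sketch in plain Python that only mimics the map, shuffle, and reduce phases; it does not use the Hadoop or Google MapReduce APIs, and the function names and input documents are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big value", "data deluge"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the data, and the framework performs the shuffle over the network. The programmer writes only the map and reduce functions, which is what makes the paradigm easy to use.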
  50. 50. Cloud Categories. Private cloud (accessible only to company employees). Public cloud (service provided to any paying customer). If you are starting your own company, should you use a public cloud or purchase your own private cloud? Power, cooling, management costs. CPU usage, storage usage.
  51. 51. To Outsource Or Not. A medium-sized organization wishes to run a service for M months. The service requires 128 servers (1024 cores) and 524 TB of storage. • Outsource (e.g., AWS) [monthly cost] - S3: $0.12 per GB-month - EC2: $0.10 per CPU-hour - Storage: $0.12 × 524 × 1000 ≈ $62.000 - Computation: $0.10 × 1024 × 24 × 30 ≈ $74.000 - Total: approx. $136.000 • Purchase [total cost] - Storage: approx. $349.000 - Total: approx. $1.555.000 + $7.500 per month for a system administrator per 100 nodes
  52. 52. To Outsource Or Not Breakeven analysis → duration of usage defines the • Outsource (e.g., AWS) [monthly cost] - Storage: $0.12 × 524 × 1000 ≈ $62.000 per month - Total: approx. $136.000 per month • Purchase [total cost] - Storage: approx. $349.000 - Total: approx. $1.555.000 + ($7.500 per month) Breakeven points • Storage: $349.000/$62.000 ≈ 5.55 months • Total: $1.555.000/$136.000 ≈ 12 months ✓ Startups use clouds a lot – they do not know how long they will be in business ✓ Cloud providers benefit monetarily more from storage
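The breakeven figures can be recomputed from the prices on the slides. This sketch assumes a 30-day month, 1 TB = 1000 GB, and a flat $7,500 per month for system administration on the purchase side (the slides quote this cost per 100 nodes, so the flat figure is a simplification). The function name is illustrative.

```python
def monthly_cloud_cost(storage_tb, cores, price_gb_month=0.12, price_cpu_hour=0.10):
    """Monthly storage and compute cost when outsourcing, in USD."""
    storage = price_gb_month * storage_tb * 1000   # TB -> GB
    compute = price_cpu_hour * cores * 24 * 30     # hours in a 30-day month
    return storage, compute


storage_month, compute_month = monthly_cloud_cost(storage_tb=524, cores=1024)
total_month = storage_month + compute_month

purchase_storage = 349_000   # one-off cost, from the slide
purchase_total = 1_555_000   # one-off cost, from the slide
admin_month = 7_500          # assumed flat monthly administration cost

# Owning pays off once the accumulated cloud bill exceeds the purchase cost.
storage_breakeven = purchase_storage / storage_month
total_breakeven = purchase_total / (total_month - admin_month)

print(f"Cloud cost per month: ${total_month:,.0f}")
print(f"Storage breakeven: {storage_breakeven:.2f} months")
print(f"Total breakeven: {total_breakeven:.1f} months")
```

The total breakeven comes out at about 12 months only when the monthly administration cost is subtracted from the cloud bill; dividing $1,555,000 by $136,000 alone gives roughly 11.4 months.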
