INF2190 - Data Analytics:
Introduction, Methods and Practical
Approaches
Winter 2016 – Week 1
Dr. Attila Barta
atibarta@cs.toronto.edu
Introduction to the Course
 Instructor: Attila Barta, Ph.D. in Computer Science, UofT.
 Details of the course can be found in the syllabus (published
on the Blackboard).
 This course is based on the course first taught by Prof.
Periklis Andritsos in Winter 2014, updated to reflect
current trends in Big Data technologies.
 All material is under copyright by FI unless specified explicitly.
 Time and place: Thursday, 6:30pm-9:30pm.
Data Analytics – (old) definitions
 Analysis of data is a process of inspecting, cleaning,
transforming, and modeling data with the goal of discovering
useful information, suggesting conclusions, and supporting
decision-making.
 Data Mining is a particular data analysis technique that
focuses on modeling and knowledge discovery for predictive
rather than purely descriptive purposes.
 Business Intelligence covers data analysis that relies
heavily on aggregation, focusing on business information.
(Wikipedia, Jan 2016).
Where does Data Analytics fit in the (new) big picture?
Enterprise Data Analytics Architecture – Copyright Attila Barta
The Data Analytics world has changed significantly in the last five years with the arrival of Big Data.
Evolution of database technologies
 Before data analytics there was data, lots of it:
 Hierarchical databases (early ’70s), IBM IMS still extensively in use
 Network databases (mid ’70s), CA IDMS still in use
 Relational databases (mid ’80s), DB2, Sybase, Oracle, MS SQL Server
 Object-oriented databases (early ’90s), Poet, O2
 Data Warehouses (early ’90s)
 all started with RedBrick – the first time the database research community
had to catch up to industry
 the Inmon vs. Kimball debate starts, as well as normalized vs. de-normalized,
star vs. snowflake schema…
 Data Analytics (early ’90s), the famous beer-and-diapers story
 Graph Databases (mid ’90s), UofT a leader in web databases, semantic databases
 Semi-structured databases (late ’90s), ToX (UofT) still one of the best native XML
databases
 Data Mining (late ’90s)
 Stream databases (early 2000s), network sensors – Berkeley
 Big Data (late 2000s)
Big Data – How we got here
 In a 2001 research report[1] Gartner analyst Doug Laney defined data growth challenges and
opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity
(speed of data in and out), and variety (range of data types and sources). Gartner, and now
much of the industry, continue to use this "3Vs" model for describing Big Data[2]. (Wikipedia).
 What was happening in 2001? Three major trends:
 Sloan Digital Sky Survey began collecting astronomical data in 2000 at a rate of 200GB/night – volume
 Sensor networks (web of things) and streaming databases (Message Oriented Middleware) – velocity
 Semi-structured databases, XML native databases besides object-oriented and relational databases – variety
 What happened after 2001?
 Rise of search engines and portals - Yahoo and Google:
 Problem: how to cheaply store and query large amounts of (semi-structured) data.
 Answer: Hadoop (MapReduce) on commodity Linux farms (see the sketch after this list).
 Memory got cheaper – in-memory data grids.
 Rise of Social Media – petabytes in pictures, unstructured and semi-structured data.
 Increased computational power and large memory – visual analytics.
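To make the Hadoop answer concrete, below is a minimal, self-contained Python sketch of the MapReduce pattern that Hadoop distributes across a commodity cluster. The mapper/reducer/shuffle functions and the two-document corpus are illustrative stand-ins, not Hadoop's actual API.

```python
from collections import defaultdict

def mapper(document):
    """Map phase: emit (word, 1) for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle phase: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    """Reduce phase: aggregate the grouped values for one key."""
    return key, sum(values)

# Simulate a (tiny) corpus that would be spread over HDFS blocks.
documents = ["big data big ideas", "data about data"]
pairs = (pair for doc in documents for pair in mapper(doc))
counts = dict(reducer(k, v) for k, v in shuffle(pairs))
print(counts)  # {'big': 2, 'data': 3, 'ideas': 1, 'about': 1}
```

On a real cluster, the map and reduce calls run in parallel on the nodes holding the data blocks; only the shuffle moves data across the network.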
Big Data – Definitions and Examples
• In 2012, Gartner updated its definition as follows: “Big data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to enable enhanced decision making, insight discovery
and process optimization”[3].
• In 2012, IDC defined Big Data technologies as “a new generation of technologies and architectures designed
to extract value economically from very large volumes of a wide variety of data by enabling high-velocity
capture, discovery, and/or analysis”[4].
• In 2012, Forrester characterized Big Data as “increases in data volume, velocity, variety, and variability”[5].
• Big Data Characteristics:
1. Data Volume: data size on the order of petabytes.
• Example: on June 13, 2012 Facebook announced that they had reached 100 PB of data; on
November 8, 2012 they announced that their warehouse grows by half a PB per day.
2. Data Velocity: real-time processing of streaming data, including real-time analytics (a minimal windowed-aggregation sketch follows this list).
• Example: a jet engine generates 20 TB of data per hour that has to be processed in near real time.
3. Data Variety: structured, semi-structured, text, images, video, audio, etc.
• Example: 80% of enterprise data is unstructured; YouTube receives 500 TB of uploaded video per year.
4. Data Variability: data flows can be inconsistent, with periodic peaks.
• Example: blogs commenting on the new BlackBerry 10; stock market data that reacts to market events.
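A minimal sketch of the velocity idea: windowed aggregation over an unbounded stream, here a fixed-size sliding average in plain Python. The engine-temperature readings are invented for illustration; a real stream processor would distribute and parallelize this.

```python
from collections import deque

def sliding_average(stream, window_size):
    """Yield the rolling mean over the last `window_size` readings."""
    window = deque(maxlen=window_size)  # old readings fall off automatically
    for reading in stream:
        window.append(reading)
        yield sum(window) / len(window)

# Hypothetical engine-temperature readings arriving one at a time.
readings = [401.0, 403.5, 399.8, 512.0, 405.2]
for avg in sliding_average(readings, window_size=3):
    print(round(avg, 1))  # the spike at 512.0 shows up in the window averages
```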
Big Data – Reference Architecture
An architecture for Big Data has to address the following capabilities:
1. Real-time complex event processing (including sense and response, streaming
data).
2. Massive volumes of data (petabytes), relational and non-relational (e.g. social
media, location, RFID).
3. Parallel processing/fast loading, typically based on Hadoop/Spark.
4. High-performance query systems based on in-memory data architectures.
5. Advanced analytics, e.g. visual analytics, columnar databases (a toy columnar-scan sketch follows this list).
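A toy illustration of capabilities 4 and 5: why a columnar, in-memory layout speeds up analytical queries. Plain Python lists stand in for a real column store (e.g. HANA); the point is that an aggregate touches only the columns it needs, not whole rows.

```python
# Row store: each record is a dict; an aggregate must touch whole rows.
rows = [
    {"customer": "a", "region": "east", "amount": 120.0},
    {"customer": "b", "region": "west", "amount": 75.5},
    {"customer": "c", "region": "east", "amount": 210.0},
]

# Column store: one contiguous array per attribute.
columns = {
    "customer": ["a", "b", "c"],
    "region":   ["east", "west", "east"],
    "amount":   [120.0, 75.5, 210.0],
}

# SELECT SUM(amount) WHERE region = 'east' -- scans only two columns.
total = sum(
    amt
    for amt, reg in zip(columns["amount"], columns["region"])
    if reg == "east"
)
print(total)  # 330.0
```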
Big Data – Reference Architecture (contd.)
[Figure: layered reference architecture. Bottom: Infrastructure Services (Virtual Infrastructure, Workload Management; shared-nothing hardware, massively parallel; commodity, own or rent). Middle: Data Management (Distributed File System, relational DBMS, non-relational DBMS, In-Memory Data Grid; massive load via parallel processing), fed by a Data Stream through Stream Processing and Event Mgmt. Top: Query (SQL, non-SQL), Processing, and Advanced Analytics.]
Big Data Reference Architecture – Copyright by Attila Barta
Big Data – Sample Technology Placement
[Figure: the same reference architecture annotated with sample products: Infrastructure Services – PaaS, IaaS; Distributed File System – HDFS, Spark, Cassandra; In-Memory Data Grid – Tibco ActiveSpaces, HANA, Kafka; Event Mgmt. – Tibco BusinessEvents; Query/Processing – R, MapReduce, Spark SQL; Advanced Analytics – Tableau, SAS, Spotfire, HANA; Client Omni-Channel Interactions at the top.]
Traditional Data Analytics
 Enterprise Data Warehouse: highly normalized, usually multi-level, relational or star
schema.
 Data Marts: a simple form of a data warehouse that is focused on a
single subject (or functional area).
 Data Cubes: multi-dimensional data sets, usually specific to a certain BI
tool (e.g. Cognos, BO, MS).
 OLAP: analyze multidimensional data interactively using
consolidation (roll-up), drill-down, and slicing and dicing;
works on data cubes (MOLAP) or RDBMSs (ROLAP) – see the pivot-table sketch after this list.
 Mgmt. Inf. System: fixed, regularly scheduled (canned) reports, usually based
on decision support systems.
 Statistical Computing (R): statistical computing and modeling packages, e.g. SAS, R.
 Diagnostic: operational analytics that address the “why did it happen” question, based
on data aggregation and/or modeling.
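The OLAP operations map naturally onto a pivot table. A minimal sketch with pandas on invented sales facts; pivot_table plays the role of a small MOLAP cube built on the fly, and groupby gives the roll-up.

```python
import pandas as pd

# Invented sales facts: one row per (year, region, product) combination.
sales = pd.DataFrame({
    "year":    [2014, 2014, 2015, 2015, 2015],
    "region":  ["east", "west", "east", "west", "east"],
    "product": ["A", "A", "A", "B", "B"],
    "amount":  [100, 80, 120, 60, 90],
})

# "Cube" view: amount by region x year (a 2-D projection of the cube).
cube = pd.pivot_table(sales, values="amount",
                      index="region", columns="year", aggfunc="sum")
print(cube)

# Roll-up: consolidate away the region dimension, keep year totals.
print(sales.groupby("year")["amount"].sum())

# Slice: fix one dimension (year == 2015) and look at what remains.
print(sales[sales["year"] == 2015])
```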
Characteristics:
• Complex to deploy (a new data warehouse takes months to build); most run on specialized hardware (e.g. SAS on AIX).
• Proprietary technologies with significant up-front and running costs; difficult to migrate to a cloud solution.
• Difficult to change, both at the data source level (data warehouse) and at the analytical level (canned reports).
Big Data era Data Analytics
 Stream Processing: stream processors for sensor data, multi-media, geo-location, GIS, etc.
 In-Memory Data Grid: sense-and-response capability, in-memory data aggregation.
 NO-SQL Database: key-value pair, document, semi-structured, XML in-memory databases.
 Columnar Database: in-memory columnar databases, support for the R language.
 Data Lakes: Distributed File System (HDFS, Cassandra) based relational, non-relational, multi-media, sensor or document data.
 Analytical Appliances: specialized analytical hardware, e.g. Netezza, Oracle Exadata.
 Operational Reporting: real-time insights based on streaming data, e.g. sensor, geo-location, GIS, multi-media.
 Data Visualization: self-service data visualization tools, e.g. Tableau, Spotfire.
 Big Data Search: MapReduce real-time or batch search.
 Descriptive Analysis: what happened?
 Predictive Analysis: what will happen?
 Prescriptive Analysis: what to do about it? Decision support automation. (A small sketch contrasting these three analysis types follows this list.)
Characteristics:
• High volume and data diversity, support for new data types.
• High horizontal and vertical scalability.
• Easy to set up and change.
• Low ownership cost: mostly open source and commodity hardware; cloud solutions readily available.
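To contrast the three analysis types: a minimal sketch on invented monthly data volumes, using a least-squares trend line (numpy.polyfit) as the stand-in predictive model and a toy capacity rule as the prescriptive step.

```python
import numpy as np

# Invented monthly data volumes (TB) for months 1..6.
months = np.array([1, 2, 3, 4, 5, 6], dtype=float)
volumes = np.array([10.0, 12.5, 14.8, 17.1, 19.6, 22.0])

# Descriptive: what happened?
print("mean monthly volume:", volumes.mean())

# Predictive: what will happen? Fit a linear trend and extrapolate.
slope, intercept = np.polyfit(months, volumes, deg=1)
forecast = slope * 7 + intercept
print("forecast for month 7:", round(forecast, 1))

# Prescriptive: what to do about it? A toy decision rule on the forecast.
capacity_tb = 20.0
if forecast > capacity_tb:
    print("action: provision more storage before month 7")
```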
Objective of this course: the elusive Data Scientist…
 “Data Scientist: The Sexiest Job of the 21st Century” –
Harvard Business Review, Oct 2012
 Data scientists today are akin to the Wall Street “quants” of the
1980s and 1990s.
 The Hot Job of the Decade.
 185 Data Scientist job vacancies available in Toronto
alone, as of Jan 6, 2016, on Indeed Canada.
 How will this course qualify you?
 Foundation in Data Mining algorithms and techniques.
 Foundation on Big Data architecture and challenges.