Understanding your Data - Data Analytics Lifecycle and Machine Learning

Dr. Abzetdin ADAMOV
Director, Center for Data Analytics Research
School of IT & Engineering
ADA University
aadamov@ada.edu.az

Content
• Why now?
• Data Analytics Lifecycle
• Data Acquisition
• Data Repository
• Data Preprocessing
• Data Analytics and Machine Learning
• Data Visualization
• Data Governance

Computing Facilities at CeDSRT
Characteristics of computing cluster:
Processing Cores: 102
Memory: 1.568 TB
Storage: 136 TB

AAdamov, CeDAR, ADA University

Student Research - SDP Topics
1. Development of Lexical and Morphological Analysis System
2. Effective Installation of Multi-node Cluster based on Hadoop 3.0
3. Utilizing Artificial Intelligence (AI) to improve quality of life for people with Dementia
4. Statistical Analysis and Data Visualization of DTS Data
5. Development of N-gram Model
6. Development of Semantic Similarity System
7. Development of Sentiment Analysis System
8. Personalized Offers and Customer Retention Platform in Banking
9. Data Retrieval, Storage and Manipulation of DTS Data
10. Network Security and IDS using Machine Learning
11. Development of Spell Correction System

Where Data Comes From
Data is produced by:
• People
  • Social Media, Public Web, Smartphones, …
• Organizations (Employer)
  • OLTP, OLAP, BI, …
• Machines
  • IoT, Satellites, Vehicles, Science, …

Modern Data Sources
• Internet of Anything (IoAT)
  • Wind Turbines, Oil Rigs, Cars
  • Weather Stations, Smart Grids
  • RFID Tags, Beacons, Wearables
• User Generated Content (Web & Mobile)
  • Twitter, Facebook, Snapchat, YouTube
  • Clickstream, Ads, User Engagement
  • Payments: Paypal, Venmo

Addressing Data – Digital Universe
[Chart: Digital Universe Growth over time. Y-axis: data growth in zettabytes (0 to 35). X-axis: years, 1995 to 2018.]

Addressing Data – Hard Disk Capacity
[Chart: Hard Drive Capacity Growth over time. Y-axis: capacity in gigabytes (0 to 70,000). X-axis: years, 1991 to 2017.]

Addressing Data – Storage Cost
[Chart: Data Storage Cost per Gigabyte. Y-axis: price in $ per GB (0 to 1,400,000). X-axis: years, 1980 to 2020 in 5-year steps. Data labels in chart order: 1,200,000; 100,000; 10,000; 800; 10; 1; 0.1; 0.003; 0.002.]

Computation Power CPU and GPU
[Chart: Computation Power CPU and GPU. Y-axis: GFLOPS (0 to 4,500). X-axis: years, 2001 to 2018. Series: GPU, CPU.]

Data Growth vs. Processing Power

Addressing Data – Transfer Rate
[Chart: Hard Drive Data Transfer Rate. Y-axis: transfer speed in MB/sec (0 to 10,000). X-axis: years, 1991 to 2016.]

Data Analytics Life-Cycle
Data Acquisition → Data Repository → Data Processing → Data Analytics / ML → Data Visualization
Data Acquisition
- Web Crawling
- Data Mining
- Information Retrieval
- ….
Data Repository
- Hadoop HDFS
- Microsoft Azure
- Amazon EC2
- Warehouse
Data Processing
- ETL
- Parsing
- Indexing
- Searching
- Ranking
- NLP
- ….
Data Analytics / ML
- Statistical Analysis
- Machine Learning
- R Programming
- Python
- RapidMiner
- Weka
- ….
Big Data Management involves Data Science and Data Engineering areas for implementing Data Mining Techniques

Data Acquisition Techniques
1. Operational Systems
2. Data Warehouses and Data Marts
3. Online Analytical Processing (OLAP) / BI
4. Web Crawling
5. Data Brokers (Commercial Data)
6. Open Data Sources
7. Experimental Data Collection
8. Online Surveys

Data Acquisition Considerations
• Business Needs
• Data Standards (ISO, ITIS, FGDC, ISDM)
• Accuracy Requirements
• Currency of Data
• Time Constraints
• Format (CSV, XLS, XML, JSON, …)
• Cost

Traditional Approach in Data Management
[Diagram: a 1 TB hard drive and a 3 TB file (three 1 TB pieces of data). STORAGE holds the DATA, PROCESSING is done by the Processor: Raw Data → Processed Data.]

The “Big Data” Problem
Problem
• A single machine cannot process or even store all the data!
Solution
• Distribute data over large clusters
Difficulty
• How to split work across machines?
• Moving data over network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?

Addressing Data
• Standard hard drive data transfer speed: 60 – 100 MB/sec
• Solid-state drive (SSD): 250 – 500 MB/sec
• Hard drive capacity is growing RAPIDLY (4 – 60 TB)
• Online data volume doubles every 18 months
• Processing speed grows at roughly the same rate
• Hard drive transfer speed is relatively FLAT
Moving data IN and OUT of disk is the bottleneck
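A rough back-of-the-envelope calculation shows why disk transfer dominates. The sketch below assumes a 1 TB data set (decimal units) and uses the 100 MB/sec and 500 MB/sec figures quoted above; the function name and the ten-drive scenario are illustrative assumptions, not part of the original slides.

```python
def read_time_seconds(size_mb: float, rate_mb_per_sec: float, disks: int = 1) -> float:
    """Seconds needed to scan size_mb when the data is spread evenly
    over `disks` drives that are read in parallel."""
    return size_mb / (rate_mb_per_sec * disks)

ONE_TB_MB = 1_000_000  # 1 TB expressed in MB (decimal units)

hdd = read_time_seconds(ONE_TB_MB, 100)          # one hard drive
ssd = read_time_seconds(ONE_TB_MB, 500)          # one SSD
cluster = read_time_seconds(ONE_TB_MB, 100, 10)  # ten hard drives in parallel

print(f"HDD: {hdd / 3600:.1f} h, SSD: {ssd / 60:.0f} min, 10 HDDs: {cluster / 60:.0f} min")
```

Under these assumptions a single hard drive needs close to three hours just to read the data once, while ten drives read in parallel need under twenty minutes.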

BIG DATA and TRADITIONAL SYSTEMS

Hadoop Input/Output Model
File NEWS.txt (512 MB) is divided into 4 blocks: A, B, C, D (128 MB each)
CLIENT → blocks A, B, C, D → HDFS
Hadoop writes and reads blocks to and from HDFS sequentially, not in parallel. That is why Hadoop does not significantly affect I/O performance.
The SOLUTION is the Data Striping technique…
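The block division described for NEWS.txt can be sketched as simple arithmetic. This is only an illustration of the sizes involved, assuming a 128 MB block size; `split_into_blocks` is a hypothetical helper and not a Hadoop API.

```python
from typing import List

def split_into_blocks(file_size_mb: int, block_size_mb: int = 128) -> List[int]:
    """Return the size of each block in MB; the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

blocks = split_into_blocks(512)                           # NEWS.txt from the slide
labels = [chr(ord("A") + i) for i in range(len(blocks))]  # A, B, C, ...
print(dict(zip(labels, blocks)))
```

A file whose size is not a multiple of the block size simply ends with a shorter block, for example `split_into_blocks(300)` gives two full blocks and one 44 MB block.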

Distributed Architecture of HDFS
Where to write file ADA.txt (blocks A, B, C, D) in HDFS?
Rack 1 (Switch): DN1, DN2, DN3, DN4
Rack 2 (Switch): DN11, DN12, DN13, DN14
Rack 3 (Switch): DN21, DN22, DN23, DN24
Rack 4 (Switch): DN31, DN32, DN33, DN34
CLIENT → NAMENODE: where to store blocks A, B, C, D?
A – DN32, 11, 14
B – DN01, 22, 23
C – DN12, 02, 04
D – DN34, 12, 14
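In each placement listed for ADA.txt, the first replica sits on one rack and the other two sit on two different nodes of a second rack (assuming DN01, 02 and 04 denote Rack 1 nodes). The sketch below imitates that pattern with a random choice. It is a simplified illustration only: `place_replicas` is a hypothetical function, and a real NameNode also weighs factors such as node load and free space.

```python
import random
from typing import Dict, List

# Rack layout taken from the slide.
RACKS: Dict[str, List[str]] = {
    "Rack 1": ["DN1", "DN2", "DN3", "DN4"],
    "Rack 2": ["DN11", "DN12", "DN13", "DN14"],
    "Rack 3": ["DN21", "DN22", "DN23", "DN24"],
    "Rack 4": ["DN31", "DN32", "DN33", "DN34"],
}

def place_replicas(racks: Dict[str, List[str]], rng: random.Random) -> List[str]:
    """Pick three DataNodes: one on a first rack, two distinct nodes on a second rack."""
    first_rack, second_rack = rng.sample(sorted(racks), 2)
    first = rng.choice(racks[first_rack])
    second, third = rng.sample(racks[second_rack], 2)
    return [first, second, third]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for block in "ABCD":
    print(block, "-", ", ".join(place_replicas(RACKS, rng)))
```

Spreading replicas over two racks means a block survives the loss of a whole rack, while keeping two replicas on the same rack limits cross-rack traffic.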

Big Data and Virtualization
Traditional Architecture: HARDWARE → Operating System → App, App, App
Virtualized Architecture: HARDWARE → HYPERVISOR → OS, OS, OS → App, App, App
Distributed Architecture: HARDWARE, HARDWARE, HARDWARE → OS, OS, OS → HADOOP HDFS + YARN → App, App, App, App, App, App

BIG DOES NOT MEAN SLOW

DATA PREPROCESSING
Good decisions require Good Data

Why Data Preprocessing?
• Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • noisy: containing errors or outliers
  • inconsistent: containing discrepancies in codes or names
• No quality data, no quality Analytics results!
  • Quality decisions must be based on quality data
  • Data warehouse needs consistent integration of quality data

Multi-Dimensional Measure of Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility

Major Tasks in Data Preprocessing
• Data Cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data Integration
  • Integration of multiple databases, data cubes, or files
• Data Transformation
  • Normalization and aggregation
• Data Reduction
  • Obtains reduced representation in volume but produces the same or similar analytical results
• Data Discretization
  • Part of data reduction but with particular importance, especially for numerical data

Data Cleaning and Transformation
• Data Cleaning Tasks:
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
• Data Transformation Tasks:
  • Smoothing: remove noise from data
  • Aggregation: summarization, data cube construction
  • Normalization: scaled to fall within a small, specified range
  • Generalization: concept hierarchy climbing
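Two of the tasks above, filling in missing values and normalization, can be sketched in a few lines. This is a minimal illustration under stated assumptions: missing values are marked as `None` and replaced by the mean of the observed values (one of several possible strategies), and normalization is min-max scaling into [0, 1]. The function names and sample values are hypothetical.

```python
from statistics import mean
from typing import List, Optional

def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values: List[float], lo: float = 0.0, hi: float = 1.0) -> List[float]:
    """Linearly rescale values into the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:  # constant column: map everything to the lower bound
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

raw = [10.0, None, 30.0, 20.0, None, 40.0]  # illustrative data with gaps
cleaned = fill_missing(raw)
scaled = min_max_normalize(cleaned)
print(cleaned)
print(scaled)
```

Mean imputation keeps the column average unchanged but shrinks its variance, so it is a reasonable default only when few values are missing.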

Data Reduction Strategies
• Why data reduction?
  • Warehouse may store terabytes of data
  • Complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
  • Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
• Data reduction strategies
  • Data Compression
  • Sampling
  • Data cube aggregation
  • Dimensionality reduction
  • Numerosity reduction
  • Discretization and concept hierarchy generation
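Two of the listed strategies, sampling and discretization, are easy to sketch. The example below assumes synthetic uniformly distributed data, a 5% simple random sample, and equal-width binning; `equal_width_bins` is a hypothetical helper, and other binning schemes (for example equal-frequency) are equally valid.

```python
import random
from collections import Counter
from typing import List

def equal_width_bins(values: List[float], n_bins: int) -> List[int]:
    """Map each value to a bin index 0..n_bins-1 using equal-width intervals."""
    vmin, vmax = min(values), max(values)
    width = (vmax - vmin) / n_bins or 1.0  # fall back to 1.0 for constant data
    return [min(int((v - vmin) / width), n_bins - 1) for v in values]

rng = random.Random(42)
population = [rng.uniform(0, 100) for _ in range(10_000)]  # synthetic data

sample = rng.sample(population, 500)        # sampling: keep 5% of the rows
bins = equal_width_bins(sample, n_bins=4)   # discretization: 4 intervals

print(Counter(bins))
print(sum(population) / len(population), sum(sample) / len(sample))
```

The two printed means should be close to each other, which is the point of data reduction: a much smaller data set that gives almost the same analytical result.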

DATA ANALYTICS
You can't manage what you can't measure

Skills Requirements for Data Analytics
[Diagram: Data Analytics combines Statistics, Business Domain, and Computer Science]

Meaning of Statistics
• The word statistics is used in either of two senses:
  • Commonly used to refer to data.
  • Principles and methods of handling numerical data.
• Statistics is defined as a branch of mathematics that deals with the collection, analysis and interpretation of numerical information
• Statistics changes numbers into information
  • deciding how to collect data efficiently
  • using data to give information
  • using data to answer questions
  • using data to make decisions.
• Statistics is the science of learning from data

Kinds of Data
• Quantitative – Data that is numerical, counted, or compared
  • Demographic data
  • Answers to closed-ended survey items
  • Attendance data
  • Scores on standardized instruments
• Qualitative – Narratives, logs, experience
  • Interviews
  • Open-ended survey items
  • Categories

Statistical Measures
• Measure of central tendency
  • Mean
  • Median
  • Mode
• Measure of variation
  • Range
  • Variance and standard deviation
  • Interquartile range
• Proportion, Percentage
• Ratio, Rate
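The measures of central tendency and variation listed above can be computed with Python's standard `statistics` module. The sketch below uses a small illustrative vector; the choice of the "inclusive" quantile method for the interquartile range is an assumption, since several quantile definitions exist.

```python
import statistics

data = [5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 9]  # illustrative values

central = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
}

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
variation = {
    "range": max(data) - min(data),
    "variance": statistics.variance(data),  # sample variance (n - 1 denominator)
    "stdev": statistics.stdev(data),
    "iqr": q3 - q1,
}

print(central)
print(variation)
```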

Statistical Analytics in R
• mean(), max(), min()
• median()
• var(), sd()
• cor()
• quantile()
• summary()
• hist()
• plot()

Mean, Median, Quantile
vec <- c(5,3,6,8,9,2,11,8,3,8,10,9)
mean(vec)     # 6.83
median(vec)   # 8
vec <- c(5,3,6,8,9,2,11,8,3,8,10,1000)   # the last value is replaced by an outlier
mean(vec)     # 89.42
median(vec)   # 8
quantile(vec)
#   0%     25%     50%     75%    100%
# 2.00    4.50    8.00    9.25 1000.00
quantile(vec, probs = c(0, .75, 1))
#   0%     75%    100%
# 2.00    9.25 1000.00
quantile(vec, probs = seq(0, 1, .1))     # deciles
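The same comparison can be reproduced in Python with the standard `statistics` module, as a cross-check of the numbers above. The variable names are illustrative; `method="inclusive"` is chosen here because it reproduces the quartiles shown for this vector.

```python
import statistics

base = [5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 9]
with_outlier = base[:-1] + [1000]  # last value replaced by an outlier

for name, vec in (("base", base), ("with outlier", with_outlier)):
    print(name, round(statistics.mean(vec), 2), statistics.median(vec))

# Quartiles (25%, 50%, 75%) of the vector that contains the outlier.
quartiles = statistics.quantiles(with_outlier, n=4, method="inclusive")
print(quartiles)
```

A single extreme value moves the mean from about 6.83 to about 89.42, while the median stays at 8, which is why the median and quantiles are preferred for data with outliers.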

Correlation
cor(mtcars[, c("mpg", "cyl", "disp", "wt")])
            mpg        cyl       disp         wt
mpg   1.0000000 -0.8521620 -0.8475514 -0.8676594
cyl  -0.8521620  1.0000000  0.9020329  0.7824958
disp -0.8475514  0.9020329  1.0000000  0.8879799
wt   -0.8676594  0.7824958  0.8879799  1.0000000
First rows of mtcars[, c("mpg", "cyl", "disp", "wt")]:
                   mpg cyl  disp    wt
Mazda RX4         21.0   6 160.0 2.620
Mazda RX4 Wag     21.0   6 160.0 2.875
Datsun 710        22.8   4 108.0 2.320
Hornet 4 Drive    21.4   6 258.0 3.215
Hornet Sportabout 18.7   8 360.0 3.440
Valiant           18.1   6 225.0 3.460
Duster 360        14.3   8 360.0 3.570
Merc 240D         24.4   4 146.7 3.190
Merc 230          22.8   4 140.8 3.150
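To show what a correlation coefficient measures, the sketch below computes the Pearson coefficient by hand for mpg and wt, using only the nine mtcars rows listed above. Because it uses a subset, the result will differ from the full-data value in the matrix; `pearson` is a hypothetical helper written for illustration.

```python
from math import sqrt
from typing import Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# mpg and wt for the nine rows listed above (not the full data set)
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150]

print(round(pearson(mpg, wt), 3))
```

The sign is negative, in line with the matrix: heavier cars tend to have lower fuel economy.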

Summary Function
summary(mtcars[, c("mpg", "cyl", "disp", "wt")])
      mpg             cyl             disp             wt
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   :1.513
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.:2.581
 Median :19.20   Median :6.000   Median :196.3   Median :3.325
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :3.217
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:3.610
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :5.424

Anscombe's Quartet
I II III IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
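The point of the quartet is that four very different data sets share nearly identical summary statistics. The sketch below checks this for the table above using the standard `statistics` module; the dictionary layout and variable names are illustrative.

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x values shared by sets I to III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# For each set: mean of x, mean of y, sample variance of y (rounded).
summary = {
    name: (statistics.mean(x),
           round(statistics.mean(y), 2),
           round(statistics.variance(y), 2))
    for name, (x, y) in quartet.items()
}
for name, row in summary.items():
    print(name, row)
```

All four sets have a mean x of 9 and a mean y of about 7.5, so only a plot reveals how different they are. This is the argument for visualizing data before trusting summary numbers.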

What is Machine Learning?
• “Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge.”
• “Learning denotes changes in a system that ... enable a system to do the same task … more efficiently the next time.” - Herbert Simon
• Automating automation
• Getting computers to program themselves
• Writing software is the bottleneck
• Let the data do the work instead!
• Machine learning is primarily concerned with the accuracy and effectiveness of the computer system.

Why Machine Learning?
• No human experts
  • industrial/manufacturing control
  • mass spectrometer analysis, drug design, astronomic discovery
• Black-box human expertise
  • face/handwriting/speech recognition
  • driving a car, flying a plane
• Rapidly changing phenomena
  • credit scoring, financial modeling
  • diagnosis, fraud detection
• Need for customization/personalization
  • personalized news reader
  • movie/book recommendation

Machine Learning Algorithm Categories
• Supervised learning algorithms
  • Logistic Regression
  • Neural Networks
  • Support Vector Machines (SVMs), Naive Bayes classifiers, Random Forests
• Unsupervised learning algorithms
  • K-means, Hierarchical clustering
• Semi-supervised learning algorithms
• Reinforcement learning algorithms (e.g., self-driving cars)
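As an illustration of the unsupervised category, here is a minimal sketch of K-means on one-dimensional points. No labels are supplied; the algorithm groups the points on its own. The sample points, the initial centers, and the function name `kmeans_1d` are assumptions made for this example.

```python
from typing import List, Tuple

def kmeans_1d(points: List[float], init_centers: List[float],
              iterations: int = 10) -> Tuple[List[float], List[int]]:
    """Plain K-means (Lloyd's algorithm) on one-dimensional points."""
    centers = list(init_centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(len(centers)), key=lambda k: abs(p - centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = sum(members) / len(members)
    return centers, labels

points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]  # no labels are supplied
centers, labels = kmeans_1d(points, init_centers=[0.0, 5.0])
print(centers, labels)
```

A supervised algorithm would instead be given the correct group for each training point and would learn to predict it for new points.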

ML vs Traditional Programming
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
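The contrast can be made concrete with a toy example. In the sketch below a hand-written rule plays the role of the traditional program, and ordinary least squares plays the role of the learner that recovers the rule from data and output. The rule y = 2x + 1 and all names are illustrative assumptions.

```python
from typing import Callable, List, Tuple

def traditional_program(x: float) -> float:
    """Traditional programming: a person writes the rule."""
    return 2 * x + 1

def learn_program(xs: List[float], ys: List[float]) -> Tuple[Callable[[float], float], float, float]:
    """Machine learning: fit slope and intercept by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return (lambda x: slope * x + intercept), slope, intercept

data = [0.0, 1.0, 2.0, 3.0, 4.0]
output = [traditional_program(x) for x in data]          # data + program -> output

learned, slope, intercept = learn_program(data, output)  # data + output -> program
print(slope, intercept, learned(10.0))
```

The learned function behaves like the hand-written one on new inputs, even though nobody typed the rule into it.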

Data Governance Metrics
• Digital Culture
• Naming Standard
• Professional Terms and Abbreviations
• Data Model, Documentation and Relationship
• Data Quality Rules and Metrics
• Hierarchy of Data Artifacts / Entities
• Classify your Data:
  • Master Data, Transactional Data, Reference Data

DATA is the NEW OIL!
But do you have the capacity to refine it?

Information is the oil of the 21st century,
and Analytics is the Combustion Engine

Q & A ?
Dr. Abzetdin Adamov
Email me at: aadamov@ada.edu.az
Follow me at: @
Link to me at: www.linkedin.com/in/adamov
Visit my blog at: aadamov.wordpress.com
