Understanding your Data
Data Analytics Lifecycle and Machine Learning
Dr. Abzetdin ADAMOV
Director, Center for Data Analytics Research
School of IT & Engineering
ADA University
aadamov@ada.edu.az
Content
• Why now?
• Data Analytics Lifecycle
• Data Acquisition
• Data Repository
• Data Preprocessing
• Data Analytics and Machine Learning
• Data Visualization
• Data Governance
BIG DATA AT ADA
4th Big Data Day Baku 2018
Computing Facilities at CeDSRT
Characteristics of computing cluster:
Processing Cores: 102
Memory: 1.568 TB
Storage: 136 TB
AAdamov, CeDAR, ADA University
Student Research - SDP Topics
1. Development of Lexical and Morphological Analysis System
2. Effective Installation of Multi-node Cluster based on Hadoop 3.0
3. Utilizing Artificial Intelligence (AI) to improve quality of life for people with Dementia
4. Statistical Analysis and Data Visualization of DTS Data
5. Development of N-gram Model
6. Development of Semantic Similarity System
7. Development of Sentiment Analysis System
8. Personalized Offers and Customer Retention Platform in Banking
9. Data Retrieval, Storage and Manipulation of DTS Data
10. Network Security and IDS using Machine Learning
11. Development of Spell Correction System
WHY NOW?
Where Data Comes From
Data is produced by:
• People
• Social Media, Public Web, Smartphones, …
• Organizations (Employer)
• OLTP, OLAP, BI, …
• Machines
• IoT, Satellites, Vehicles, Science, …
Modern Data Sources
→ Internet of Anything (IoAT)
• Wind Turbines, Oil Rigs, Cars
• Weather Stations, Smart Grids
• RFID Tags, Beacons, Wearables
→ User Generated Content (Web & Mobile)
• Twitter, Facebook, Snapchat, YouTube
• Clickstream, Ads, User Engagement
• Payments: PayPal, Venmo
Addressing Data – Digital Universe
[Chart: Digital Universe Growth over time; y-axis: data growth in zettabytes (0 to 35); x-axis: years 1995 to 2018]
Addressing Data – Hard Disk Capacity
[Chart: Hard Drive Capacity Growth over time; y-axis: capacity in gigabytes (0 to 70,000); x-axis: years 1991 to 2017]
Addressing Data – Storage Cost
[Chart: Data Storage Cost per Gigabyte; y-axis: price in $ per GB; x-axis: years 1980 to 2020]
Data labels on the chart, in year order:
Year    Price ($ per GB)
1980    1,200,000
1985    100,000
1990    10,000
1995    800
2000    10
2005    1
2010    0.1
2015    0.003
2020    0.002
Computation Power CPU and GPU
[Chart: Computation Power CPU and GPU; y-axis: GFLOPS (0 to 4,500); x-axis: years 2001 to 2018; series: GPU, CPU]
Data Growth vs. Processing Power
Addressing Data – Transfer Rate
[Chart: Hard Drive Data Transfer Rate; y-axis: transfer speed in MB/sec (0 to 10,000); x-axis: years 1991 to 2016]
DATA ANALYTICS LIFECYCLE
Data Analytics Life-Cycle
Data Acquisition → Data Repository → Data Processing → Data Analytics / ML → Data Visualization
Data Acquisition:
- Web Crawling
- Data Mining
- Information Retrieval
- ….
Data Repository:
- Hadoop HDFS
- Microsoft Azure
- Amazon EC2
- Warehouse
Data Processing:
- ETL
- Parsing
- Indexing
- Searching
- Ranking
- NLP
- ….
Data Analytics / ML:
- Statistical Analysis
- Machine Learning
- R Programming
- Python
- RapidMiner
- Weka
- ….
Big Data Management involves Data Science and Data Engineering areas for
implementing Data Mining Techniques
DATA ACQUISITION
Data Acquisition Techniques
1. Operational Systems
2. Data Warehouses and Data Marts
3. Online Analytical Processing (OLAP) / BI
4. Web Crawling
5. Data Brokers (Commercial Data)
6. Open Data Sources
7. Experimental Data Collection
8. Online Surveys
Data Acquisition Considerations
• Business Needs
• Data Standards (ISO, ITIS, FGDC, ISDM)
• Accuracy Requirements
• Currency of Data
• Time Constraints
• Format (CSV, XLS, XML, JSON, …)
• Cost
DATA REPOSITORY
Traditional Approach in Data Management
[Diagram labels: 1 TB hard drive; 3 TB file; 3 × 1 TB of data; STORAGE; PROCESSING; Data Processor; Raw Data; Processed Data]
The “Big Data” Problem
Problem
→ A single machine cannot process or even store all the data!
Solution
→ Distribute data over large clusters
Difficulty
→ How to split work across machines?
→ Moving data over network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?
Addressing Data
• Standard Hard Drive data transmission speed 60 – 100 MB/sec
• Solid State Hard Drive (SSD) - 250 – 500 MB/sec
• Hard Drive capacity growing RAPIDLY (4 – 60 TB)
• Online data growth (doubles every 18 months)
• Processing speed (grows at roughly the same rate)
• Hard Drive transmission speed is relatively FLAT
Moving Data IN and OUT of disk is the Bottleneck
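A rough back-of-the-envelope sketch of why disk throughput is the bottleneck. The 100 MB/sec rate is the upper figure quoted above for a standard hard drive; the 1 TB data size and the 10-drive cluster are illustrative assumptions, and `scan_seconds` is a hypothetical helper.

```python
def scan_seconds(size_mb: float, rate_mb_per_sec: float, disks: int = 1) -> float:
    """Time to read size_mb when it is spread evenly over `disks` drives read in parallel."""
    return size_mb / (rate_mb_per_sec * disks)

ONE_TB_MB = 1_000_000  # 1 TB expressed in MB (decimal units)

single = scan_seconds(ONE_TB_MB, 100)              # one drive at 100 MB/sec
parallel = scan_seconds(ONE_TB_MB, 100, disks=10)  # assumed 10-drive cluster

print(f"1 drive:   {single / 3600:.1f} hours")
print(f"10 drives: {parallel / 60:.1f} minutes")
```

Because capacity grows much faster than transfer rate, the single-drive scan time keeps getting worse; spreading the data over many drives is what brings it back down.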
BIG DATA and TRADITIONAL SYSTEMS
Hadoop Input/Output Model
[Diagram: CLIENT sends blocks A, B, C, D (128 MB each) to HDFS one after another]
File NEWS.txt (512 MB) is divided into 4 blocks.
Hadoop writes and reads blocks sequentially, not in parallel. That is why Hadoop does not affect I/O performance significantly.
SOLUTION is Data Striping technique…
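The block split described on this slide can be sketched in a few lines. This is only an illustration of the arithmetic, not Hadoop code; `split_into_blocks` is a hypothetical helper, and the 128 MB block size is the one used in the slide's example.

```python
BLOCK_SIZE_MB = 128  # block size used in the slide's example

def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> list[int]:
    """Return the size of each block; only the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

blocks = split_into_blocks(512)  # the 512 MB file from the slide
for name, size in zip("ABCD", blocks):
    print(f"block {name}: {size} MB")
```

A file whose size is not a multiple of the block size simply ends with a shorter last block, for example `split_into_blocks(300)` gives two full blocks and one of 44 MB.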
Distributed Architecture of HDFS
Rack 1 (Switch): DN1, DN2, DN3, DN4
Rack 2 (Switch): DN11, DN12, DN13, DN14
Rack 3 (Switch): DN21, DN22, DN23, DN24
Rack 4 (Switch): DN31, DN32, DN33, DN34
[Diagram: replicas of blocks A, B, C, D placed on DataNodes across the racks]
Where to write file ADA.txt (blocks A, B, C, D) in HDFS?
CLIENT
NAMENODE
A B C D
A – DN32, 11, 14
B – DN01, 22, 23
C – DN12, 02, 04
D – DN34, 12, 14
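The assignments above follow one pattern: each block gets one replica on one rack and two replicas on two different nodes of a second rack, which matches HDFS's default rack-aware placement for three replicas. A minimal sketch of that pattern follows. `place_block` is a hypothetical function, not a Hadoop API, and picking the first node at random is a simplification (HDFS prefers the writer's own node when the client runs on a DataNode).

```python
import random

# Cluster layout as drawn on the slide: four racks, four DataNodes each.
RACKS = {
    "Rack 1": ["DN1", "DN2", "DN3", "DN4"],
    "Rack 2": ["DN11", "DN12", "DN13", "DN14"],
    "Rack 3": ["DN21", "DN22", "DN23", "DN24"],
    "Rack 4": ["DN31", "DN32", "DN33", "DN34"],
}

def place_block(rng: random.Random) -> list[str]:
    """Pick 3 DataNodes: one on a first rack, two distinct nodes on a second rack."""
    first_rack, second_rack = rng.sample(list(RACKS), 2)
    first = rng.choice(RACKS[first_rack])
    second, third = rng.sample(RACKS[second_rack], 2)
    return [first, second, third]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
placement = {block: place_block(rng) for block in "ABCD"}
for block, nodes in placement.items():
    print(block, "->", ", ".join(nodes))
```

Keeping two replicas on the same rack limits cross-rack traffic during the write, while the replica on the other rack protects against the loss of a whole rack.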
Big Data and Virtualization
Traditional Architecture: HARDWARE → Operating System → App, App, App
Virtualized Architecture: HARDWARE → HYPERVISOR → OS, OS, OS → App, App, App
Distributed Architecture: HARDWARE, HARDWARE, HARDWARE → OS, OS, OS → HADOOP HDFS + YARN → App, App, App, App, App, App
BIG DOES NOT MEAN SLOW
DATA PREPROCESSING
Good decisions require Good Data
Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality Analytics results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
Major Tasks in Data Preprocessing
• Data Cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data Integration
• Integration of multiple databases, data cubes, or files
• Data Transformation
• Normalization and aggregation
• Data Reduction
• Obtains reduced representation in volume but produces the same or similar
analytical results
• Data Discretization
• Part of data reduction but with particular importance, especially for numerical data
Data Cleaning and Transformation
• Data Cleaning Tasks:
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Data Transformation Tasks:
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Normalization: scaled to fall within a small, specified range
• Generalization: concept hierarchy climbing
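Two of the tasks listed above, filling in missing values and normalization, can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are represented as `None`, the fill strategy is the mean of the observed values (one common choice among several), and normalization is min-max scaling. The function names and the sample readings are hypothetical.

```python
from statistics import mean

def fill_missing(values: list) -> list[float]:
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]

def min_max(values: list[float], lo: float = 0.0, hi: float = 1.0) -> list[float]:
    """Rescale values linearly so that they fall within [lo, hi]."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:  # constant column: map everything to the lower bound
        return [lo] * len(values)
    return [lo + (v - v_min) * (hi - lo) / span for v in values]

raw = [20.0, None, 30.0, 40.0, None]  # made-up readings with two gaps
cleaned = fill_missing(raw)
scaled = min_max(cleaned)
print(cleaned)
print(scaled)
```

Mean imputation keeps the column average unchanged but shrinks its variance, so it is a starting point rather than a universal fix.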
Data Reduction Strategies
• Why data reduction?
• Warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time to run on the complete data
set
• Data reduction
• Obtains a reduced representation of the data set that is much smaller in volume but
yet produces the same (or almost the same) analytical results
• Data reduction strategies
• Data Compression
• Sampling
• Data cube aggregation
• Dimensionality reduction
• Numerosity reduction
• Discretization and concept hierarchy generation
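Two of the strategies above, sampling and discretization, are easy to illustrate. The sketch below assumes simple random sampling without replacement and equal-width binning; both function names and the sample ages are hypothetical.

```python
import random

def simple_random_sample(rows: list, k: int, seed: int = 0) -> list:
    """Draw k rows without replacement (seeded so the result is repeatable)."""
    return random.Random(seed).sample(rows, k)

def equal_width_bins(values: list[float], n_bins: int) -> list[int]:
    """Map each value to a bin index 0..n_bins-1 using equal-width intervals."""
    v_min, v_max = min(values), max(values)
    width = (v_max - v_min) / n_bins
    if width == 0:  # all values identical: a single bin
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - v_min) / width), n_bins - 1) for v in values]

ages = [18, 22, 25, 31, 38, 44, 52, 60, 67, 78]  # made-up numeric attribute
sample = simple_random_sample(ages, 4)
bins = equal_width_bins(ages, 3)
print(sample)
print(bins)
```

Here ten distinct ages are reduced to three interval labels, which is the sense in which discretization is a form of data reduction.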
DATA ANALYTICS
You can't manage what you can't measure
Skills Requirements for Data Analytics
[Venn diagram: Statistics, Business Domain, and Computer Science overlap in Data Analytics]
Meaning of Statistics
• The word statistics is used in either of two senses:
• Commonly used to refer to data.
• Principles and methods of handling numerical data.
• Statistics is defined as a branch of mathematics that deals with the
collection, analysis and interpretation of numerical information
• Statistics changes numbers into information
• deciding how to collect data efficiently
• using data to give information
• using data to answer questions
• using data to make decisions.
• Statistics is the science of learning from data
Kinds of Data
• Quantitative – Data that is numerical, counted, or compared
• Demographic data
• Answers to closed-ended survey items
• Attendance data
• Scores on standardized instruments
• Qualitative – Narratives, logs, experience
• Interviews
• Open-ended survey items
• Categories
Statistical Measures
• Measure of central tendency
• Mean
• Median
• Mode
• Measure of variation
• Range
• Variance and standard deviation
• Interquartile range
• Proportion, Percentage
• Ratio, Rate
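The measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on a small sample vector chosen for illustration; the interquartile range uses the "inclusive" quantile method, which is an assumption about which quartile definition is wanted.

```python
import statistics as st

data = [5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 9]  # sample values for illustration

# Measures of central tendency
mean = st.mean(data)
median = st.median(data)
mode = st.mode(data)

# Measures of variation
value_range = max(data) - min(data)
variance = st.variance(data)  # sample variance (n - 1 in the denominator)
std_dev = st.stdev(data)      # sample standard deviation
q1, _, q3 = st.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                 # interquartile range

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={value_range} variance={variance:.2f} sd={std_dev:.2f} IQR={iqr}")
```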
Statistical Analytics in R
• mean(), max(), min()
• median()
• var(), sd()
• cor()
• quantile()
• summary()
• hist()
• plot()
Mean, Median, Quantile
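vec <- c(5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 9)
mean(vec)     # 6.83
median(vec)   # 8
# Same data with the last value replaced by an outlier
vec <- c(5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 1000)
mean(vec)     # 89.42 (pulled up by the outlier)
median(vec)   # 8 (unchanged)
quantile(vec)
#   0%   25%   50%   75%    100%
# 2.00  4.50  8.00  9.25 1000.00
quantile(vec, probs = c(0, .75, 1))
#   0%   75%    100%
# 2.00  9.25 1000.00
quantile(vec, probs = seq(0, 1, .1))   # deciles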
Correlation
First rows of mtcars (columns mpg, cyl, disp, wt):
                   mpg cyl  disp    wt
Mazda RX4         21.0   6 160.0 2.620
Mazda RX4 Wag     21.0   6 160.0 2.875
Datsun 710        22.8   4 108.0 2.320
Hornet 4 Drive    21.4   6 258.0 3.215
Hornet Sportabout 18.7   8 360.0 3.440
Valiant           18.1   6 225.0 3.460
Duster 360        14.3   8 360.0 3.570
Merc 240D         24.4   4 146.7 3.190
Merc 230          22.8   4 140.8 3.150
cor(mtcars[, c("mpg", "cyl", "disp", "wt")])
            mpg        cyl       disp         wt
mpg   1.0000000 -0.8521620 -0.8475514 -0.8676594
cyl  -0.8521620  1.0000000  0.9020329  0.7824958
disp -0.8475514  0.9020329  1.0000000  0.8879799
wt   -0.8676594  0.7824958  0.8879799  1.0000000
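For readers who want to see what a correlation coefficient computes, here is a minimal sketch of the Pearson formula written out by hand. It uses only the nine mtcars rows listed on this slide, so the result will differ from the full-dataset value reported by `cor()`; `pearson` is a hypothetical helper name.

```python
from math import sqrt

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# The nine mtcars rows shown on the slide (mpg and wt columns only).
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150]

r = pearson(mpg, wt)
print(f"r(mpg, wt) on these nine rows = {r:.3f}")
```

A negative value means heavier cars tend to have lower fuel economy, in line with the sign of the mpg/wt entry in the matrix.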
Summary Function
summary(mtcars[, c("mpg", "cyl", "disp", "wt")])
mpg cyl disp wt
Min. :10.40 Min. :4.000 Min. : 71.1 Min. :1.513
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.:2.581
Median :19.20 Median :6.000 Median :196.3 Median :3.325
Mean :20.09 Mean :6.188 Mean :230.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:3.610
Max. :33.90 Max. :8.000 Max. :472.0 Max. :5.424
DATA VISUALIZATION
Anscombe Quartet
I II III IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
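The point of the quartet is that four very different-looking datasets share nearly identical summary statistics, which is why plotting matters. A minimal sketch that computes the statistics from the table above (the helper `pearson` and the variable names are illustrative):

```python
from math import sqrt
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for sets I to III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

summary = {}
for name, (x, y) in quartet.items():
    summary[name] = (mean(x), variance(x), mean(y), variance(y), pearson(x, y))
    mx, vx, my, vy, r = summary[name]
    print(f"{name:>3}: mean_x={mx:.2f} var_x={vx:.2f} mean_y={my:.2f} var_y={vy:.2f} r={r:.3f}")
```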
Visualization
MACHINE LEARNING
What is Machine Learning?
• “Machine learning refers to a system capable of the autonomous
acquisition and integration of knowledge.”
• “Learning denotes changes in a system that ... enable a system to do
the same task … more efficiently the next time.” - Herbert Simon
• Automating automation
• Getting computers to program themselves
• Writing software is the bottleneck
• Let the data do the work instead!
• Machine learning is primarily concerned with the accuracy and
effectiveness of the computer system.
Why Machine Learning?
• No human experts
• industrial/manufacturing control
• mass spectrometer analysis, drug design, astronomic discovery
• Black-box human expertise
• face/handwriting/speech recognition
• driving a car, flying a plane
• Rapidly changing phenomena
• credit scoring, financial modeling
• diagnosis, fraud detection
• Need for customization/personalization
• personalized news reader
• movie/book recommendation
Machine Learning Algorithms Categories
• Supervised Learning algorithms
• Logistic Regression
• Neural Networks
• Support Vector Machines (SVMs), Naive Bayes classifiers, Random Forests
• Unsupervised Learning algorithms
• K-means, Hierarchical clustering
• Semi-supervised Learning algorithms
• Reinforcement Learning algorithms (e.g., self-driving cars)
ML vs Traditional Programming
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
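The contrast can be made concrete with a toy example. In the traditional version a programmer writes the rule; in the learning version the rule is recovered from data and outputs. The sketch assumes a linear relationship and uses ordinary least squares as the learner; the Celsius-to-Fahrenheit task, `fit_line`, and the sample points are all illustrative choices, not something taken from the deck.

```python
from statistics import mean

# Traditional programming: data + program -> output
def to_fahrenheit(celsius: float) -> float:
    return 1.8 * celsius + 32.0  # rule written by a programmer

# Machine learning: data + output -> program (here, a fitted line)
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

celsius = [-10.0, 0.0, 10.0, 20.0, 30.0]          # data
fahrenheit = [to_fahrenheit(c) for c in celsius]  # observed outputs
slope, intercept = fit_line(celsius, fahrenheit)

def learned(c: float) -> float:
    """The 'program' produced by learning."""
    return slope * c + intercept

print(f"learned rule: F = {slope:.2f} * C + {intercept:.2f}")
```

With noise-free data the fitted line reproduces the hand-written rule; with real measurements it would only approximate it.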
Machine Learning in Python
DATA GOVERNANCE
Data Governance Metrics
• Digital Culture
• Naming Standard
• Professional Terms and Abbreviations
• Data Model, Documentation and Relationship
• Data Quality Rules and Metrics
• Hierarchy of Data Artifacts / Entities
• Classify your Data:
• Master Data, Transactional Data, Reference Data
DATA is the NEW OIL!
But do you have the capacity to refine it?
Information is the oil of the 21st century,
and Analytics is the Combustion Engine
Q & A ?
Dr. Abzetdin Adamov
Email me at: aadamov@ada.edu.az
Follow me at: @
Link to me at: www.linkedin.com/in/adamov
Visit my blog at: aadamov.wordpress.com

The Role of Business Process Management in Success of the e-Government Projec...
 
University Management Information System
University Management Information SystemUniversity Management Information System
University Management Information System
 

Understanding your Data - Data Analytics Lifecycle and Machine Learning

  • 1. Understanding your Data Data Analytics Lifecycle and Machine Learning Dr. Abzetdin ADAMOV Director, Center for Data Analytics Research School of IT & Engineering ADA University aadamov@ada.edu.az
  • 2. Content • Why now? • Data Analytics Lifecycle • Data Acquisition • Data Repository • Data Preprocessing • Data Analytics and Machine Learning • Data Visualization • Data Governance
  • 4. 4th Big Data Day Baku 2018
  • 6. Computing Facilities at CeDSRT Characteristics of computing cluster: Processing Cores: 102 Memory: 1,568 TB Storage: 136 TB
  • 7. AAdamov, CeDAR, ADA University
  • 8. Student Research - SDP Topics 1. Development of Lexical and Morphological Analysis System 2. Effective Installation of Multi-node Cluster based on Hadoop 3.0 3. Utilizing Artificial Intelligence (AI) to improve quality of life for people with Dementia 4. Statistical Analysis and Data Visualization of DTS Data 5. Development of N-gram Model 6. Development of Semantic Similarity System 7. Development of Sentiment Analysis System 8. Personalized Offers and Customer Retention Platform in Banking 9. Data Retrieval, Storage and Manipulation of DTS Data 10. Network Security and IDS using Machine Learning 11. Development of Spell Correction System
  • 10. Where Data Comes From Data is produced by: • People • Social Media, Public Web, Smartphones, … • Organizations (Employer) • OLTP, OLAP, BI, … • Machines • IoT, Satellites, Vehicles, Science, …
  • 11. Modern Data Sources → Internet of Anything (IoAT) • Wind Turbines, Oil Rigs, Cars • Weather Stations, Smart Grids • RFID Tags, Beacons, Wearables → User Generated Content (Web & Mobile) • Twitter, Facebook, Snapchat, YouTube • Clickstream, Ads, User Engagement • Payments: Paypal, Venmo
  • 12. Addressing Data – Digital Universe [Chart: Digital Universe Growth over time; data growth in zettabytes, 1995 to 2018]
  • 13. Addressing Data – Hard Disk Capacity [Chart: Hard Drive Capacity Growth over time; capacity in gigabytes, 1991 to 2017]
  • 14. Addressing Data – Storage Cost [Chart: Data Storage Cost per Gigabyte; price in $ per GB, 1980 to 2020; data labels: 1,200,000; 100,000; 10,000; 800; 10; 1; 0.1; 0.003; 0.002]
  • 15. Computation Power CPU and GPU [Chart: GFLOPS, 2001 to 2018; series: GPU, CPU]
  • 16. Data Growth vs. Processing Power
  • 17. Addressing Data – Transfer Rate [Chart: Hard Drive Data Transfer Rate; transfer speed in MB/sec, 1991 to 2016]
  • 19. Data Analytics Life-Cycle • Data Acquisition: Web Crawling, Data Mining, Information Retrieval, …. • Data Repository: Hadoop HDFS, Microsoft Azure, Amazon EC2, Warehouse • Data Processing: ETL, Parsing, Indexing, Searching, Ranking, NLP, …. • Data Analytics / ML: Statistical Analysis, Machine Learning, R Programming, Python, RapidMiner, Weka, …. • Data Visualization. Big Data Management involves Data Science and Data Engineering areas for implementing Data Mining Techniques
  • 21. Data Acquisition Techniques 1. Operational Systems 2. Data Warehouses and Data Marts 3. Online Analytical Processing (OLAP) / BI 4. Web Crawling 5. Data Brokers (Commercial Data) 6. Open Data Sources 7. Experimental Data Collection 8. Online Surveys
  • 22. Data Acquisition Considerations • Business Needs • Data Standards (ISO, ITIS, FGDC, ISDM) • Accuracy Requirements • Currency of Data • Time Constraints • Format (CSV, XLS, XML, JSON, …) • Cost
  • 24. Traditional Approach in Data Management [Diagram labels: 1 TB Hard Drive; 3 TB file; 1 TB of Data (×3); STORAGE; PROCESSING; DATA; Processor; Raw Data; Processed Data]
  • 25. The “Big Data” Problem • Problem → A single machine cannot process or even store all the data! • Solution → Distribute data over large clusters • Difficulty → How to split work across machines? → Moving data over network is expensive → Must consider data & network locality → How to deal with failures? → How to deal with slow nodes?
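The "split work across machines" idea on this slide can be illustrated on a single machine. The sketch below is only an analogy, assuming a word-count task; the function names `split_into_chunks`, `count_words`, and `merge_counts` are hypothetical, and a real cluster would add scheduling, data locality, and failure handling on top.

```python
from collections import Counter

def split_into_chunks(lines, n_workers):
    """Partition lines round-robin into n_workers chunks."""
    return [lines[i::n_workers] for i in range(n_workers)]

def count_words(chunk):
    """Work done independently on each chunk (the 'map' step)."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

def merge_counts(partials):
    """Combine the partial results (the 'reduce' step)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical input; each chunk could live on a different machine.
lines = ["big data is big", "data is the new oil", "big does not mean slow"]
partials = [count_words(c) for c in split_into_chunks(lines, n_workers=2)]
word_counts = merge_counts(partials)
print(word_counts.most_common(1))
```

Because each chunk is counted independently, only the small partial counters need to travel over the network, not the raw data.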
  • 26. Addressing Data • Standard Hard Drive data transmission speed: 60 – 100 MB/sec • Solid State Drive (SSD): 250 – 500 MB/sec • Hard Drive capacity is growing RAPIDLY (4 – 60 TB) • Online data growth (doubles every 18 months) • Processing Speed (relatively the same growth) • Hard Drive transmission speed is relatively FLAT. Moving Data IN and OUT of disk is the Bottleneck
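A quick back-of-the-envelope calculation shows why disk I/O is the bottleneck. The sketch below uses the transfer speeds quoted on the slide; the 1 TB data size, the decimal unit conversion, and the 100-drive cluster are assumptions made purely for illustration.

```python
def read_time_seconds(size_tb, speed_mb_per_s, n_disks=1):
    """Time to scan size_tb terabytes when split evenly across n_disks."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / (speed_mb_per_s * n_disks)

hdd = read_time_seconds(1, 100)                   # one hard drive at 100 MB/sec
ssd = read_time_seconds(1, 500)                   # one SSD at 500 MB/sec
cluster = read_time_seconds(1, 100, n_disks=100)  # assumed: 100 drives read in parallel

print(f"HDD: {hdd / 3600:.1f} h, SSD: {ssd / 60:.1f} min, 100 HDDs: {cluster:.0f} s")
# prints "HDD: 2.8 h, SSD: 33.3 min, 100 HDDs: 100 s"
```

Reading the same terabyte from many drives at once cuts the scan from hours to minutes, which is the motivation for distributing data across a cluster.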
  • 27. BIG DATA and TRADITIONAL SYSTEMS
  • 28. Hadoop Input/Output Model: File NEWS.txt (512 MB) is divided into 4 blocks (A, B, C, D) of 128 MB each. The CLIENT writes and reads the blocks to and from HDFS sequentially, not in parallel. That is why Hadoop does not affect I/O performance significantly. The SOLUTION is the Data Striping technique…
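The block arithmetic on this slide can be sketched as follows. This is a minimal illustration of fixed-size block splitting, not HDFS code; the function name `split_into_blocks` and the letter labels are hypothetical.

```python
BLOCK_SIZE_MB = 128  # block size shown on the slide

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return (label, size) pairs; the last block may be smaller."""
    blocks = []
    offset = 0
    while offset < file_size_mb:
        size = min(block_size_mb, file_size_mb - offset)
        label = chr(ord("A") + len(blocks))  # A, B, C, ... (illustrative labels)
        blocks.append((label, size))
        offset += size
    return blocks

blocks = split_into_blocks(512)
print(blocks)
# prints [('A', 128), ('B', 128), ('C', 128), ('D', 128)]
```

A file whose size is not a multiple of the block size simply ends with a shorter block, for example `split_into_blocks(300)` ends with a 44 MB block.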
  • 29. Distributed Architecture of HDFS. Where to write file ADA.txt (blocks A, B, C, D) in HDFS? [Diagram: CLIENT, NAMENODE, and four racks, each with its own Switch: Rack 1 (DN1 to DN4), Rack 2 (DN11 to DN14), Rack 3 (DN21 to DN24), Rack 4 (DN31 to DN34)] Block placement: A – DN32, 11, 14; B – DN01, 22, 23; C – DN12, 02, 04; D – DN34, 12, 14
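The placements on this slide follow a visible pattern: one replica on one rack and the other two on two different nodes of a second rack. The sketch below imitates that pattern under simplifying assumptions; the `RACKS` layout and `place_replicas` function are hypothetical, and a real NameNode also weighs the writer's location, node load, and free space.

```python
import random

# Hypothetical cluster layout mirroring the slide: 4 racks with 4 DataNodes each.
RACKS = {
    "rack1": ["DN1", "DN2", "DN3", "DN4"],
    "rack2": ["DN11", "DN12", "DN13", "DN14"],
    "rack3": ["DN21", "DN22", "DN23", "DN24"],
    "rack4": ["DN31", "DN32", "DN33", "DN34"],
}

def place_replicas(rng):
    """Pick 3 DataNodes: one on a first rack, two on a different rack."""
    first_rack, second_rack = rng.sample(sorted(RACKS), 2)
    first = rng.choice(RACKS[first_rack])
    second, third = rng.sample(RACKS[second_rack], 2)
    return [first, second, third]

rng = random.Random(42)  # fixed seed so the run is repeatable
placement = {block: place_replicas(rng) for block in "ABCD"}
for block, nodes in placement.items():
    print(block, nodes)
```

Spreading replicas over two racks means the loss of a whole rack (for example a failed switch) still leaves at least one copy of every block reachable.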
  • 30. Big Data and Virtualization [Diagram comparing three stacks] • Traditional Architecture: Apps on one Operating System on HARDWARE • Virtualized Architecture: Apps on guest OSes on a HYPERVISOR on HARDWARE • Distributed Architecture: Apps on HADOOP (HDFS + YARN) spanning several OS / HARDWARE nodes
  • 31. BIG DOES NOT MEAN SLOW
  • 33. Why Data Preprocessing? • Data in the real world is dirty • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • noisy: containing errors or outliers • inconsistent: containing discrepancies in codes or names • No quality data, no quality Analytics results! • Quality decisions must be based on quality data • Data warehouse needs consistent integration of quality data
  • 34. Multi-Dimensional Measure of Data Quality • Accuracy • Completeness • Consistency • Timeliness • Believability • Value added • Interpretability • Accessibility
  • 35. Major Tasks in Data Preprocessing • Data Cleaning • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data Integration • Integration of multiple databases, data cubes, or files • Data Transformation • Normalization and aggregation • Data Reduction • Obtains reduced representation in volume but produces the same or similar analytical results • Data Discretization • Part of data reduction but with particular importance, especially for numerical data
  • 36. Data Cleaning and Transformation • Data Cleaning Tasks: • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data • Data Transformation Tasks: • Smoothing: remove noise from data • Aggregation: summarization, data cube construction • Normalization: scaled to fall within a small, specified range • Generalization: concept hierarchy climbing
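Two of the tasks listed here, filling in missing values and normalization, can be sketched in a few lines. This is a minimal illustration, assuming mean imputation and min-max scaling (the slide does not prescribe a specific method); the `ages` column and both function names are hypothetical.

```python
from statistics import mean

def fill_missing(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant column: nothing to scale
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

ages = [20, None, 40, 30, None, 60]   # hypothetical column with missing values
cleaned = fill_missing(ages)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```

Mean imputation is only one option; the median is often preferred when the column contains outliers.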
  • 37. Data Reduction Strategies • Why data reduction? • Warehouse may store terabytes of data • Complex data analysis/mining may take a very long time to run on the complete data set • Data reduction • Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results • Data reduction strategies • Data Compression • Sampling • Data cube aggregation • Dimensionality reduction • Numerosity reduction • Discretization and concept hierarchy generation
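Sampling and discretization, two of the strategies named on this slide, might look like the sketch below. It assumes simple random sampling without replacement and equal-width binning; both function names and the 1..100 attribute are hypothetical.

```python
import random

def simple_random_sample(rows, fraction, seed=0):
    """Sampling: keep a random fraction of rows without replacement."""
    k = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)

def equal_width_bins(values, n_bins):
    """Discretization: map each value to a bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1  # avoid division by zero for constant data
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

values = list(range(1, 101))            # hypothetical numeric attribute, 1..100
sample = simple_random_sample(values, fraction=0.1)
bins = equal_width_bins(values, n_bins=4)
print(len(sample), sorted(set(bins)))
# prints "10 [0, 1, 2, 3]"
```

The sample keeps 10% of the rows, and binning replaces 100 distinct values with 4 categories, so both reduce volume while keeping the overall shape of the data.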
  • 38. DATA ANALYTICS You can't manage what you can't measure
  • 39. Skills Requirements for Data Analytics [Diagram labels: Statistics; Business Domain; Computer Science; Data Analytics]
  • 40. Meaning of Statistics • The word statistics is used in either of two senses: • Commonly used to refer to data. • Principles and methods of handling numerical data. • Statistics is defined as a branch of mathematics that deals with the collection, analysis and interpretation of numerical information • Statistics changes numbers into information • deciding how to collect data efficiently • using data to give information • using data to answer questions • using data to make decisions. • Statistics is the science of learning from data
  • 41. Kinds of Data • Quantitative – Data that is numerical, counted, or compared • Demographic data • Answers to closed-ended survey items • Attendance data • Scores on standardized instruments • Qualitative – Narratives, logs, experience • Interviews • Open-ended survey items • Categories
  • 42. Statistical Measures • Measure of central tendency • Mean • Median • Mode • Measure of variation • Range • Variance and standard deviation • Interquartile range • Proportion, Percentage • Ratio, Rate
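The measures listed on this slide can be computed with Python's standard `statistics` module. The sketch below uses the example vector from the Mean, Median, Quantile slide; choosing `method="inclusive"` for the quartiles is an assumption, since the slide does not say how the interquartile range should be computed.

```python
import statistics as st

data = [5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 9]  # example vector from a later slide

# Measures of central tendency
central = {
    "mean": st.mean(data),
    "median": st.median(data),
    "mode": st.mode(data),
}

# Measures of variation
q1, _, q3 = st.quantiles(data, n=4, method="inclusive")
variation = {
    "range": max(data) - min(data),
    "variance": st.variance(data),   # sample variance (n - 1 denominator)
    "stdev": st.stdev(data),
    "iqr": q3 - q1,
}
print(central)
print(variation)
```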
  • 43. Statistical Analytics in R • mean(), max(), min() • median() • var(), sd() • cor() • quantile() • summary() • hist() • plot()
  • 44. Mean, Median, Quantile • vec = 5,3,6,8,9,2,11,8,3,8,10,9: mean(vec) 6.83, median(vec) 8 • vec = 5,3,6,8,9,2,11,8,3,8,10,1000: mean(vec) 89.42, median(vec) 8 • quantile(vec): 0% 2.00, 25% 4.50, 50% 8.00, 75% 9.25, 100% 1000.00 • quantile(vec, probs = c(0, .75, 1)): 0% 2.00, 75% 9.25, 100% 1000.00 • quantile(vec, probs = seq(0, 1, .1))
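The same numbers can be reproduced in Python, which makes the point of the slide easy to check: a single outlier drags the mean far away while the median stays put. This is a sketch using the standard `statistics` module; `method="inclusive"` is chosen because it reproduces the quartile values shown on the slide.

```python
import statistics as st

vec = [5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 9]
vec_outlier = vec[:-1] + [1000]   # last value replaced by an outlier, as on the slide

print(round(st.mean(vec), 2), st.median(vec))
print(round(st.mean(vec_outlier), 2), st.median(vec_outlier))

# Quartiles by linear interpolation between the sorted values.
q25, q50, q75 = st.quantiles(vec_outlier, n=4, method="inclusive")
print(min(vec_outlier), q25, q50, q75, max(vec_outlier))
```

The mean jumps from about 6.83 to about 89.42, while the median remains 8, which is why the median and quartiles are preferred as summaries when outliers are present.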
  • 45. Correlation: cor(mtcars[, c("mpg", "cyl", "disp", "wt")])
                mpg        cyl       disp         wt
      mpg   1.0000000 -0.8521620 -0.8475514 -0.8676594
      cyl  -0.8521620  1.0000000  0.9020329  0.7824958
      disp -0.8475514  0.9020329  1.0000000  0.8879799
      wt   -0.8676594  0.7824958  0.8879799  1.0000000
    First rows of the data:
                         mpg cyl  disp    wt
      Mazda RX4         21.0   6 160.0 2.620
      Mazda RX4 Wag     21.0   6 160.0 2.875
      Datsun 710        22.8   4 108.0 2.320
      Hornet 4 Drive    21.4   6 258.0 3.215
      Hornet Sportabout 18.7   8 360.0 3.440
      Valiant           18.1   6 225.0 3.460
      Duster 360        14.3   8 360.0 3.570
      Merc 240D         24.4   4 146.7 3.190
      Merc 230          22.8   4 140.8 3.150
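For readers who want to see what `cor()` computes, the Pearson coefficient can be written out by hand. The sketch below uses only the nine mtcars rows printed on the slide, not the full 32-row dataset, so its result will differ from the matrix above; the `pearson` helper is a hypothetical name.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Only the nine rows shown on the slide (subset of mtcars).
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150]

r = pearson(mpg, wt)
print(f"cor(mpg, wt) on 9 rows: {r:.3f}")
```

Even on this small subset the coefficient is negative: heavier cars tend to have lower fuel economy.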
  • 46. Summary Function: summary(mtcars[, c("mpg", "cyl", "disp", "wt")])
           mpg             cyl             disp             wt
      Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   :1.513
      1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.:2.581
      Median :19.20   Median :6.000   Median :196.3   Median :3.325
      Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :3.217
      3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:3.610
      Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :5.424
  • 48. Anscombe Quartet
           I            II           III           IV
        x     y      x     y      x     y      x     y
      10.0  8.04   10.0  9.14   10.0  7.46    8.0  6.58
       8.0  6.95    8.0  8.14    8.0  6.77    8.0  5.76
      13.0  7.58   13.0  8.74   13.0 12.74    8.0  7.71
       9.0  8.81    9.0  8.77    9.0  7.11    8.0  8.84
      11.0  8.33   11.0  9.26   11.0  7.81    8.0  8.47
      14.0  9.96   14.0  8.10   14.0  8.84    8.0  7.04
       6.0  7.24    6.0  6.13    6.0  6.08    8.0  5.25
       4.0  4.26    4.0  3.10    4.0  5.39   19.0 12.50
      12.0 10.84   12.0  9.13   12.0  8.15    8.0  5.56
       7.0  4.82    7.0  7.26    7.0  6.42    8.0  7.91
       5.0  5.68    5.0  4.74    5.0  5.73    8.0  6.89
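The point of the quartet is that four visibly different datasets share almost identical summary statistics. The sketch below checks this on the values in the table, using the standard `statistics` module and a hand-written least-squares fit; the `fit_line` helper is a hypothetical name.

```python
import statistics as st

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared x for sets I-III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = st.mean(xs), st.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

summary = {}
for name, (xs, ys) in quartet.items():
    slope, intercept = fit_line(xs, ys)
    summary[name] = (st.mean(xs), st.variance(xs), round(st.mean(ys), 2),
                     round(slope, 2), round(intercept, 2))
    print(name, summary[name])
```

Every set reports a mean x of 9, a variance of x of 11, a mean y of about 7.5, and a fitted line close to y = 3 + 0.5x, so only a plot reveals how different the four sets are.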
  • 51. What is Machine Learning? • “Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge.” • “Learning denotes changes in a system that ... enable a system to do the same task … more efficiently the next time.” - Herbert Simon • Automating automation • Getting computers to program themselves • Writing software is the bottleneck • Let the data do the work instead! • Machine learning is primarily concerned with the accuracy and effectiveness of the computer system.
  • 52. Why Machine Learning? • No human experts • industrial/manufacturing control • mass spectrometer analysis, drug design, astronomic discovery • Black-box human expertise • face/handwriting/speech recognition • driving a car, flying a plane • Rapidly changing phenomena • credit scoring, financial modeling • diagnosis, fraud detection • Need for customization/personalization • personalized news reader • movie/book recommendation
  • 53. Machine Learning Algorithms Categories • Supervised Learning algorithms (Logistic Regression, Neural Networks, Support Vector Machines (SVMs), Naive Bayes classifiers, Random Forests) • Unsupervised Learning algorithms (K-means, Hierarchical clustering) • Semi-supervised Learning algorithms • Reinforcement Learning algorithms (e.g. self-driving cars)
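To make the unsupervised category concrete, here is a minimal sketch of K-means on one-dimensional data. It assumes fixed starting centers and a fixed number of iterations to keep the run deterministic; the `kmeans_1d` function and the sample points are hypothetical, and production code would normally use a library implementation.

```python
def kmeans_1d(points, centers, n_iter=10):
    """Plain k-means on 1-D data: assign to nearest center, then recompute means."""
    centers = list(centers)
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]   # hypothetical unlabeled data
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
# prints [1.5, 10.0] [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```

No labels are supplied: the two groups emerge only from the distances between points, which is what distinguishes unsupervised from supervised learning.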
  • 54. ML vs Traditional Programming • Traditional Programming: Data + Program → Computer → Output • Machine Learning: Data + Output → Computer → Program
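The contrast on this slide can be shown with a toy example. In the sketch below, the temperature-conversion task, the example values, and both function names are assumptions chosen for illustration: the first function is a rule written by a programmer, while the second recovers the same rule from data and known outputs using least squares.

```python
def celsius_to_fahrenheit(c):
    """Traditional programming: a human writes the rule."""
    return 1.8 * c + 32

def learn_linear_rule(inputs, outputs):
    """Machine learning (least squares): recover slope and intercept from examples."""
    n = len(inputs)
    mx, my = sum(inputs) / n, sum(outputs) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(inputs, outputs))
             / sum((x - mx) ** 2 for x in inputs))
    return slope, my - slope * mx

data = [-40, 0, 37, 100]            # inputs
output = [-40, 32, 98.6, 212]       # known outputs for those inputs
slope, intercept = learn_linear_rule(data, output)
print(round(slope, 3), round(intercept, 3))
```

The learned slope and intercept match the hand-written constants 1.8 and 32, so the "program" here is produced from data and outputs rather than typed in.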
  • 57. Data Governance Metrics • Digital Culture • Naming Standard • Professional Terms and Abbreviations • Data Model, Documentation and Relationship • Data Quality Rules and Metrics • Hierarchy of Data Artifacts / Entities • Classify your Data: • Master Data, Transactional Data, Reference Data
  • 58. DATA is the NEW OIL! But do you have the capacity to refine it?
  • 59. Information is the oil of the 21st century, and Analytics is the Combustion Engine
  • 60. Q & A ? Dr. Abzetdin Adamov Email me at: aadamov@ada.edu.az Follow me at: @ Link to me at: www.linkedin.com/in/adamov Visit my blog at: aadamov.wordpress.com