AzureDay North Poland
Gdynia 2016
Introduction to Big Data Analytics?
Łukasz Grala | Senior Architect
Łukasz Grala
• Senior architect of Data Platform, Business Intelligence & Advanced Analytics solutions at TIDK
• Creator of "Data Scientist as a Service"
• Certified Microsoft Trainer and university lecturer
• Author of advanced trainings and workshops, as well as numerous publications and webcasts
• Microsoft Data Platform MVP, awarded every year since 2010
• PhD student at Poznań University of Technology, Faculty of Computing (databases, data mining, machine learning)
• Speaker at numerous conferences in Poland and worldwide
• Holder of numerous certifications (MCT, MCSE, MCSA, MCITP, …)
• Member of the Polish Information Processing Society
• Member and leader of the Polish SQL Server User Group (PLSSUG)
• Passionate about data analysis, storage, and processing; jazz enthusiast
email: lukasz@tidk.pl
Data
• 72 hours of video are uploaded to YouTube per minute (1 terabyte every 4 minutes)
• 500 terabytes of new data per day are ingested into Facebook databases
• Sensors on a Boeing jet engine create 20 terabytes of data every hour
• The proposed Square Kilometre Array telescope will generate "a few exabytes of data per day" (single beam)
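To put the rates quoted above on a common daily scale, a quick back-of-envelope in Python (an assumption, purely illustrative: each quoted rate held constant for a full day):

# Daily volumes implied by the quoted rates (illustrative arithmetic only)
youtube_tb_per_day = 24 * 60 / 4      # 1 TB every 4 minutes -> 360 TB/day
facebook_tb_per_day = 500             # stated directly as a daily figure
boeing_tb_per_day = 20 * 24           # 20 TB/hour -> 480 TB/day (if an engine ran all day)

print(youtube_tb_per_day, facebook_tb_per_day, boeing_tb_per_day)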
Big Data
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
4V
Volume, Variety, Velocity, Veracity
Further Vs sometimes added:
• Validity
• Value
• Variability
• Venue
• Vocabulary
• Vagueness
Internet of Things
New Modern BI Solution
• Traditional: an ETL tool (SSIS, etc.) extracts original data, transforms it, and loads the transformed data into an EDW (SQL Server, Teradata, etc.), which feeds BI tools.
• Modern: original data, including streaming data arriving over time, is ingested as-is (EL) into scale-out storage & compute (HDFS, Blob Storage, etc.), forming data lake(s); transform & load then feeds data marts, dashboards, and apps.
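The "ingest first, transform later" step can be as simple as landing raw files in blob storage unchanged. A minimal sketch with the azure-storage-blob Python SDK (the connection string, container name, and file paths are placeholders, not from the slides):

import os
from azure.storage.blob import BlobServiceClient

# Assumption: connection string in the environment; "datalake" container exists
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("datalake")

# EL: upload the original file as-is; no transformation before landing
with open("events-2016-05-14.json", "rb") as f:
    container.upload_blob(name="raw/events/2016/05/14/events.json", data=f)

The point of the design is that transformation is deferred: the raw file stays queryable later by whichever engine (Hive, U-SQL, Spark) does the transform & load step.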
Big Data
• Storage
• Processing and Analytics
• Visualization
Visualization
• Reports & Mobile Reports
Storage
• Blob
• SQL Database & SQL Data Warehouse
• DocumentDB
• HDInsight
• Azure Data Lake Store
Azure Blob Storage
• Blob Storage
• Table Storage
• Queue Storage
• File Storage
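Each of these storage services is addressed through its own SDK surface. As one hedged illustration, a few lines against Queue Storage with the azure-storage-queue Python package (the connection string and queue name are placeholders):

import os
from azure.storage.queue import QueueClient

# Assumption: connection string in the environment; queue "ingest-jobs" already created
queue = QueueClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"], "ingest-jobs")
queue.send_message("process blob raw/events/2016/05/14/events.json")

for msg in queue.receive_messages():
    print(msg.content)        # handle the work item, then remove it
    queue.delete_message(msg)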
SQL Database & SQL Data Warehouse
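SQL Data Warehouse is the MPP member of this pair: table rows are distributed across compute nodes, chosen at CREATE TABLE time. A hedged sketch issuing the documented DDL from Python via pyodbc (server, database, credentials, and table names are placeholders):

import pyodbc

# Assumption: ODBC Driver for SQL Server installed; placeholder connection values
cn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydw;UID=loader;PWD=<password>")
cn.execute("""
CREATE TABLE dbo.FactSales (
    SaleId     BIGINT NOT NULL,
    CustomerId INT    NOT NULL,
    Amount     MONEY  NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerId),   -- co-locate each customer's rows on one node
      CLUSTERED COLUMNSTORE INDEX);      -- the default DW storage format
""")
cn.commit()

Hashing on a join/aggregation key such as CustomerId keeps those operations local to a node; a poor choice forces data movement between nodes at query time.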
DocumentDB
Analytics
• Azure HDInsight
• Azure Data Lake Analytics
• Azure Stream Analytics
• Azure Machine Learning
• Azure Cognitive Services
Azure Data Lake
• Store: HDFS-compatible storage, exposed through WebHDFS
• Analytics: U-SQL jobs running on YARN (Azure Data Lake Analytics)
• Analytics service: HDInsight (managed Hadoop clusters)
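Because the store speaks WebHDFS, plain REST calls work against it. A hedged sketch with requests (the account name, path, and the Azure AD bearer token are placeholders you would obtain separately):

import requests

token = "<azure-ad-bearer-token>"   # placeholder; obtained via an Azure AD OAuth flow
url = ("https://myadls.azuredatalakestore.net"
       "/webhdfs/v1/raw/events?op=LISTSTATUS")  # standard WebHDFS directory listing

resp = requests.get(url, headers={"Authorization": "Bearer " + token})
for f in resp.json()["FileStatuses"]["FileStatus"]:
    print(f["pathSuffix"], f["length"])

Any HDFS-aware tool or client can be pointed at the store the same way, which is what makes it usable by both Data Lake Analytics and HDInsight.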
Why Machine Learning
Analytics: HDInsight ("managed clusters"), Azure Data Lake Analytics
Storage: Azure Data Lake Storage
HDInsight
• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service
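Spark-flavoured HDInsight clusters expose a Livy REST endpoint for submitting batch jobs. A hedged sketch (the cluster name, credentials, and the script path in attached storage are placeholders):

import requests

# Assumption: HDInsight Spark cluster with the default Livy endpoint and basic auth
livy = "https://mycluster.azurehdinsight.net/livy/batches"
job = {"file": "wasbs://jobs@mystorage.blob.core.windows.net/wordcount.py"}

resp = requests.post(livy, json=job,
                     auth=("admin", "<cluster-password>"),
                     headers={"X-Requested-By": "admin"})  # Livy's CSRF header
print(resp.json())   # returns the batch id and its state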
HDInsight
Why Machine Learning
HDInsight & SQL Server
Query relational and non-relational data, on-premises and in Azure: apps issue a single T-SQL query to SQL Server, which transparently reaches data in Hadoop as well.
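This is the PolyBase pattern in SQL Server 2016: define an external table over Hadoop data, then join it with local tables in ordinary T-SQL. A hedged sketch driven from Python via pyodbc (server, table, and location names are placeholders; the file format is assumed to have been created beforehand with CREATE EXTERNAL FILE FORMAT, and PolyBase Hadoop connectivity configured):

import pyodbc

cn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                    "SERVER=sqlhost;DATABASE=sales;Trusted_Connection=yes")
cn.execute("""
CREATE EXTERNAL DATA SOURCE MyHadoop
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');
""")
cn.execute("""
CREATE EXTERNAL TABLE dbo.WebLogs (Url NVARCHAR(400), Hits INT)
WITH (LOCATION = '/logs/', DATA_SOURCE = MyHadoop, FILE_FORMAT = TextFileFormat);
""")
# One T-SQL statement spanning relational and Hadoop-resident data
rows = cn.execute("""
SELECT c.Name, SUM(w.Hits)
FROM dbo.Customers c JOIN dbo.WebLogs w ON w.Url = c.HomePage
GROUP BY c.Name;
""").fetchall()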
Azure Stream Analytics
Event sources span point-of-service devices: self-checkout stations, kiosks, smartphones, slates/tablets, PCs/laptops, servers, digital signs, diagnostic equipment, remote medical monitors, logic controllers, specialized devices, thin clients, handhelds, security devices, POS terminals, automation devices, vending machines, Kinect, ATMs.
Canonical Event-driven Scenario
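In the canonical event-driven scenario, devices like those listed above publish events to an ingestion point (typically Event Hubs), a Stream Analytics job runs a standing SQL-like query over the stream, and results land in storage or dashboards. A hedged producer-side sketch with the azure-eventhub Python SDK (the connection string, hub name, and event payload are placeholders):

import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-connection-string>", eventhub_name="pos-events")

# One device event, serialized as JSON, sent as part of a batch
batch = producer.create_batch()
batch.add(EventData(json.dumps({"device": "kiosk-42", "event": "checkout", "amount": 19.99})))
producer.send_batch(batch)
producer.close()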
Advanced Analytics
• Languages: R and Python
• Microsoft R Open, Microsoft R Server, R Services, CRAN R, Revolution R
• Mahout
• SparkR
• MLlib (see the sketch below)
• Azure Machine Learning
• Azure Cognitive Services models/APIs
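Of these, MLlib is the Spark-native option. A minimal PySpark sketch training a classifier (the toy data and column names are illustrative, not from the slides):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training data: (label, features) pairs
df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)), (1.0, Vectors.dense(2.0, 1.0)),
     (1.0, Vectors.dense(2.3, 0.1)), (0.0, Vectors.dense(0.1, 1.2))],
    ["label", "features"])

model = LogisticRegression(maxIter=10).fit(df)
print(model.coefficients)

The same code runs unchanged on a laptop or an HDInsight Spark cluster, which is the appeal relative to single-machine R/Python.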
Traditional Data Mining vs Big Data Analysis

Memory access
• Traditional: data is stored in centralized RAM and can be efficiently scanned several times
• Big Data: data may be stored on highly distributed data sources; for huge, continuous data streams, data is accessed in a single scan only

Computational processing and architectures
• Traditional: serial, centralized processing; a single-computer platform that scales with better hardware is sufficient
• Big Data: parallel and distributed architectures may be necessary; cluster platforms that scale to several nodes may be necessary

Data types
• Traditional: the data source is relatively homogeneous; data is static and of reasonable size
• Big Data: data comes from multiple sources, which may be heterogeneous and complex; data may be dynamic and evolving, so adapting to data changes may be necessary
Traditional Data Mining vs Big Data Analysis (continued)

Data management
• Traditional: the data format is simple and fits in a relational database or data warehouse; data access time is not critical
• Big Data: data formats are usually diverse and may not fit in a relational database; data may be highly interconnected and needs to be integrated from several nodes; special data systems that manage varied formats (NoSQL databases, Hadoop, …) are often required; data access time is critical for scalability and speed

Data quality
• Traditional: the provenance and pre-processing steps are relatively well documented; strong correction techniques were applied; data is relatively well tagged and labeled
• Big Data: the provenance and pre-processing steps may be unclear and undocumented; there is a large amount of uncertainty and imprecision in the data; only a small amount of data is tagged and labeled
Traditional Data Mining vs Big Data Analysis (continued)

Data processing
• Traditional: only batch learning is necessary; learning can be slow and offline; data fits into memory; all the data has some sort of utility
• Big Data: data may arrive in a stream and need to be processed continuously; learning needs to be fast and online (see the sketch below); the scalability of algorithms is important; data may not fit in memory
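The "fast and online" row is exactly what out-of-core learners address: train on one chunk at a time, never holding the full dataset in memory. A minimal sketch with a recent scikit-learn, using a simulated stream of chunks (the data generator is illustrative):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")   # logistic regression, trained incrementally
classes = np.array([0, 1])             # must be declared up front for partial_fit

# Simulated stream: chunks that never need to fit in memory at once
rng = np.random.default_rng(0)
for _ in range(100):
    X = rng.normal(size=(64, 5))
    y = (X[:, 0] + 0.1 * rng.normal(size=64) > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)  # one pass per chunk; single-scan friendly

print(clf.score(X, y))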
Azure Machine Learning
Cognitive Services
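Cognitive Services models are consumed as plain REST APIs rather than trained by you. A hedged sketch against the Text Analytics sentiment endpoint of that era (the region, API version, and key are placeholders; verify the current endpoint shape before use):

import requests

key = "<cognitive-services-key>"      # placeholder subscription key
url = "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"
body = {"documents": [{"id": "1", "language": "en", "text": "AzureDay was great!"}]}

resp = requests.post(url, json=body, headers={"Ocp-Apim-Subscription-Key": key})
print(resp.json())                    # per-document sentiment scores in [0, 1]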
Questions?
lukasz@tidk.pl
