AzureDay North Poland
Gdynia 2016
Introduction to Big Data Analytics?
Łukasz Grala | Senior Architect
Łukasz Grala
• Senior architect of Data Platform, Business Intelligence & Advanced Analytics solutions at TIDK
• Creator of "Data Scientist as a Service"
• Certified Microsoft Trainer and university lecturer
• Author of advanced trainings and workshops, as well as numerous publications and webcasts
• Microsoft Data Platform MVP, awarded every year since 2010
• PhD student at Poznań University of Technology, Faculty of Computing (databases, data mining, machine learning)
• Speaker at numerous conferences in Poland and worldwide
• Holder of numerous certifications (MCT, MCSE, MCSA, MCITP, …)
• Member of the Polish Information Processing Society
• Member and leader of the Polish SQL Server User Group (PLSSUG)
• Passionate about data analysis, storage, and processing; jazz enthusiast
email: lukasz@tidk.pl
Data
• 72 hours of video are uploaded to YouTube per minute (1 terabyte every 4 minutes)
• 500 terabytes of new data per day are ingested into Facebook databases
• Sensors on a Boeing jet engine create 20 terabytes of data every hour
• The proposed Square Kilometre Array telescope will generate "a few exabytes of data per day" (single beam)
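To put the rates quoted above on a common daily scale, a quick back-of-envelope in Python (an assumption, purely illustrative: each quoted rate held constant for a full day):

# Daily volumes implied by the quoted rates (illustrative arithmetic only)
youtube_tb_per_day = 24 * 60 / 4      # 1 TB every 4 minutes -> 360 TB/day
facebook_tb_per_day = 500             # stated directly as a daily figure
boeing_tb_per_day = 20 * 24           # 20 TB/hour -> 480 TB/day (if an engine ran all day)

print(youtube_tb_per_day, facebook_tb_per_day, boeing_tb_per_day)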
Big Data
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
4V
Volume, Variety, Velocity, Veracity
Further Vs sometimes added:
• Validity
• Value
• Variability
• Venue
• Vocabulary
• Vagueness
Internet of Things
New Modern BI Solution
• Traditional: an ETL tool (SSIS, etc.) extracts original data, transforms it, and loads the transformed data into an EDW (SQL Server, Teradata, etc.), which feeds BI tools.
• Modern: original data, including streaming data arriving over time, is ingested as-is (EL) into scale-out storage & compute (HDFS, Blob Storage, etc.), forming data lake(s); transform & load then feeds data marts, dashboards, and apps.
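The "ingest first, transform later" step can be as simple as landing raw files in blob storage unchanged. A minimal sketch with the azure-storage-blob Python SDK (the connection string, container name, and file paths are placeholders, not from the slides):

import os
from azure.storage.blob import BlobServiceClient

# Assumption: connection string in the environment; "datalake" container exists
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("datalake")

# EL: upload the original file as-is; no transformation before landing
with open("events-2016-05-14.json", "rb") as f:
    container.upload_blob(name="raw/events/2016/05/14/events.json", data=f)

The point of the design is that transformation is deferred: the raw file stays queryable later by whichever engine (Hive, U-SQL, Spark) does the transform & load step.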
Big Data
• Storage
• Processing and Analytics
• Visualization
Visualization
• Reports & Mobile Reports
Storage
• Blob
• SQL Database & SQL Data Warehouse
• DocumentDB
• HDInsight
• Azure Data Lake Store
Azure Blob Storage
• Blob Storage
• Table Storage
• Queue Storage
• File Storage
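Each of these storage services is addressed through its own SDK surface. As one hedged illustration, a few lines against Queue Storage with the azure-storage-queue Python package (the connection string and queue name are placeholders):

import os
from azure.storage.queue import QueueClient

# Assumption: connection string in the environment; queue "ingest-jobs" already created
queue = QueueClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"], "ingest-jobs")
queue.send_message("process blob raw/events/2016/05/14/events.json")

for msg in queue.receive_messages():
    print(msg.content)        # handle the work item, then remove it
    queue.delete_message(msg)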
SQL Database & SQL Data Warehouse
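SQL Data Warehouse is the MPP member of this pair: table rows are distributed across compute nodes, chosen at CREATE TABLE time. A hedged sketch issuing the documented DDL from Python via pyodbc (server, database, credentials, and table names are placeholders):

import pyodbc

# Assumption: ODBC Driver for SQL Server installed; placeholder connection values
cn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydw;UID=loader;PWD=<password>")
cn.execute("""
CREATE TABLE dbo.FactSales (
    SaleId     BIGINT NOT NULL,
    CustomerId INT    NOT NULL,
    Amount     MONEY  NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerId),   -- co-locate each customer's rows on one node
      CLUSTERED COLUMNSTORE INDEX);      -- the default DW storage format
""")
cn.commit()

Hashing on a join/aggregation key such as CustomerId keeps those operations local to a node; a poor choice forces data movement between nodes at query time.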
DocumentDB
Analytics
• Azure HDInsight
• Azure Data Lake Analytics
• Azure Stream Analytics
• Azure Machine Learning
• Azure Cognitive Services
Azure Data Lake
• Store: HDFS-compatible storage, exposed through WebHDFS
• Analytics: U-SQL jobs running on YARN (Azure Data Lake Analytics)
• Analytics service: HDInsight (managed Hadoop clusters)
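Because the store speaks WebHDFS, plain REST calls work against it. A hedged sketch with requests (the account name, path, and the Azure AD bearer token are placeholders you would obtain separately):

import requests

token = "<azure-ad-bearer-token>"   # placeholder; obtained via an Azure AD OAuth flow
url = ("https://myadls.azuredatalakestore.net"
       "/webhdfs/v1/raw/events?op=LISTSTATUS")  # standard WebHDFS directory listing

resp = requests.get(url, headers={"Authorization": "Bearer " + token})
for f in resp.json()["FileStatuses"]["FileStatus"]:
    print(f["pathSuffix"], f["length"])

Any HDFS-aware tool or client can be pointed at the store the same way, which is what makes it usable by both Data Lake Analytics and HDInsight.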
Why Machine Learning
Analytics: HDInsight ("managed clusters"), Azure Data Lake Analytics
Storage: Azure Data Lake Storage
HDInsight
• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service
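Spark-flavoured HDInsight clusters expose a Livy REST endpoint for submitting batch jobs. A hedged sketch (the cluster name, credentials, and the script path in attached storage are placeholders):

import requests

# Assumption: HDInsight Spark cluster with the default Livy endpoint and basic auth
livy = "https://mycluster.azurehdinsight.net/livy/batches"
job = {"file": "wasbs://jobs@mystorage.blob.core.windows.net/wordcount.py"}

resp = requests.post(livy, json=job,
                     auth=("admin", "<cluster-password>"),
                     headers={"X-Requested-By": "admin"})  # Livy's CSRF header
print(resp.json())   # returns the batch id and its state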
HDInsight
Why Machine Learning
HDInsight & SQL Server
Query relational and non-relational data, on-premises and in Azure: apps issue a single T-SQL query to SQL Server, which transparently reaches data in Hadoop as well.
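This is the PolyBase pattern in SQL Server 2016: define an external table over Hadoop data, then join it with local tables in ordinary T-SQL. A hedged sketch driven from Python via pyodbc (server, table, and location names are placeholders; the file format is assumed to have been created beforehand with CREATE EXTERNAL FILE FORMAT, and PolyBase Hadoop connectivity configured):

import pyodbc

cn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                    "SERVER=sqlhost;DATABASE=sales;Trusted_Connection=yes")
cn.execute("""
CREATE EXTERNAL DATA SOURCE MyHadoop
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');
""")
cn.execute("""
CREATE EXTERNAL TABLE dbo.WebLogs (Url NVARCHAR(400), Hits INT)
WITH (LOCATION = '/logs/', DATA_SOURCE = MyHadoop, FILE_FORMAT = TextFileFormat);
""")
# One T-SQL statement spanning relational and Hadoop-resident data
rows = cn.execute("""
SELECT c.Name, SUM(w.Hits)
FROM dbo.Customers c JOIN dbo.WebLogs w ON w.Url = c.HomePage
GROUP BY c.Name;
""").fetchall()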
Azure Stream Analytics
Event sources span point-of-service devices: self-checkout stations, kiosks, smartphones, slates/tablets, PCs/laptops, servers, digital signs, diagnostic equipment, remote medical monitors, logic controllers, specialized devices, thin clients, handhelds, security devices, POS terminals, automation devices, vending machines, Kinect, ATMs.
Canonical Event-driven Scenario
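In the canonical event-driven scenario, devices like those listed above publish events to an ingestion point (typically Event Hubs), a Stream Analytics job runs a standing SQL-like query over the stream, and results land in storage or dashboards. A hedged producer-side sketch with the azure-eventhub Python SDK (the connection string, hub name, and event payload are placeholders):

import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-connection-string>", eventhub_name="pos-events")

# One device event, serialized as JSON, sent as part of a batch
batch = producer.create_batch()
batch.add(EventData(json.dumps({"device": "kiosk-42", "event": "checkout", "amount": 19.99})))
producer.send_batch(batch)
producer.close()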
Advanced Analytics
• Languages: R and Python
• Microsoft R Open, Microsoft R Server, R Services, CRAN R, Revolution R
• Mahout
• SparkR
• MLlib (see the sketch below)
• Azure Machine Learning
• Azure Cognitive Services models/APIs
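Of these, MLlib is the Spark-native option. A minimal PySpark sketch training a classifier (the toy data and column names are illustrative, not from the slides):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training data: (label, features) pairs
df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)), (1.0, Vectors.dense(2.0, 1.0)),
     (1.0, Vectors.dense(2.3, 0.1)), (0.0, Vectors.dense(0.1, 1.2))],
    ["label", "features"])

model = LogisticRegression(maxIter=10).fit(df)
print(model.coefficients)

The same code runs unchanged on a laptop or an HDInsight Spark cluster, which is the appeal relative to single-machine R/Python.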
Traditional Data Mining vs Big Data Analysis

Memory access
• Traditional: data is stored in centralized RAM and can be efficiently scanned several times
• Big Data: data may be stored on highly distributed data sources; for huge, continuous data streams, data is accessed in a single scan only

Computational processing and architectures
• Traditional: serial, centralized processing; a single-computer platform that scales with better hardware is sufficient
• Big Data: parallel and distributed architectures may be necessary; cluster platforms that scale to several nodes may be necessary

Data types
• Traditional: the data source is relatively homogeneous; data is static and of reasonable size
• Big Data: data comes from multiple sources, which may be heterogeneous and complex; data may be dynamic and evolving, so adapting to data changes may be necessary
Traditional Data Mining vs Big Data Analysis (continued)

Data management
• Traditional: the data format is simple and fits in a relational database or data warehouse; data access time is not critical
• Big Data: data formats are usually diverse and may not fit in a relational database; data may be highly interconnected and needs to be integrated from several nodes; special data systems that manage varied formats (NoSQL databases, Hadoop, …) are often required; data access time is critical for scalability and speed

Data quality
• Traditional: the provenance and pre-processing steps are relatively well documented; strong correction techniques were applied; data is relatively well tagged and labeled
• Big Data: the provenance and pre-processing steps may be unclear and undocumented; there is a large amount of uncertainty and imprecision in the data; only a small amount of data is tagged and labeled
Traditional Data Mining vs Big Data Analysis (continued)

Data processing
• Traditional: only batch learning is necessary; learning can be slow and offline; data fits into memory; all the data has some sort of utility
• Big Data: data may arrive in a stream and need to be processed continuously; learning needs to be fast and online (see the sketch below); the scalability of algorithms is important; data may not fit in memory
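The "fast and online" row is exactly what out-of-core learners address: train on one chunk at a time, never holding the full dataset in memory. A minimal sketch with a recent scikit-learn, using a simulated stream of chunks (the data generator is illustrative):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")   # logistic regression, trained incrementally
classes = np.array([0, 1])             # must be declared up front for partial_fit

# Simulated stream: chunks that never need to fit in memory at once
rng = np.random.default_rng(0)
for _ in range(100):
    X = rng.normal(size=(64, 5))
    y = (X[:, 0] + 0.1 * rng.normal(size=64) > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)  # one pass per chunk; single-scan friendly

print(clf.score(X, y))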
Azure Machine Learning
Cognitive Services
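Cognitive Services models are consumed as plain REST APIs rather than trained by you. A hedged sketch against the Text Analytics sentiment endpoint of that era (the region, API version, and key are placeholders; verify the current endpoint shape before use):

import requests

key = "<cognitive-services-key>"      # placeholder subscription key
url = "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"
body = {"documents": [{"id": "1", "language": "en", "text": "AzureDay was great!"}]}

resp = requests.post(url, json=body, headers={"Ocp-Apim-Subscription-Key": key})
print(resp.json())                    # per-document sentiment scores in [0, 1]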
Questions?
lukasz@tidk.pl
