Vahid Amiry
Topic:
Big Data Systems Architecture
M.Sc. in Information Technology (E-Commerce)
Big data expert and consultant
VahidAmiry.ir
@vahidamiry
What to expect from this short talk
• History of Data Processing
• Data Warehouse System
• Data Lake Concept
• Big Data Pipeline
• Data Lake Architecture
What is big data and why is it valuable to the business?
An evolution in the nature and use of data in the enterprise.
Data complexity: variety and velocity. Data volume: petabytes.
Information explosion, new insights: 90% of the world's data has been created over the last two years alone.¹
Shift to cheaper, faster computing, on demand: 45% of total IT spend will be cloud-related by 2020.²
Increasingly data-savvy workforce: companies that use analytics are 5x more likely to make decisions faster than competitors.³
A transformative opportunity.
1. IDC. 2. Josh Waldo, Senior Director, Cloud Partner Strategy, Microsoft. 3. Bain & Company, The Value of Big Data: How Analytics Differentiates Winners, 2013.
Data Analytics is needed everywhere:
• Recommendation engines
• Smart meter monitoring
• Equipment monitoring
• Advertising analysis
• Life sciences research
• Fraud detection
• Healthcare outcomes
• Weather forecasting for business planning
• Oil & Gas exploration
• Social network analysis
• Churn analysis
• Traffic flow optimization
• IT infrastructure & web app optimization
• Legal discovery and document archiving
• Intelligence gathering
• Location-based tracking & services
• Pricing analysis
• Personalized insurance
Traditional Business Analytics Process
1. Start with end-user requirements to identify the desired reports and analysis
2. Define the corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract the required data and transform it to the target schema ('schema-on-write')
5. Create reports: analyze the data
The pipeline: relational sources and LOB applications → ETL tools → defined schema → queries → results.
All data not immediately required is discarded or archived.
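Step 4 above can be sketched in a few lines: a transform forces each extracted row into a fixed target schema defined before any data is loaded, and rows that do not fit are discarded, mirroring 'schema-on-write' and the note about discarding unneeded data. All field names here are illustrative, not from any real system.

```python
# Schema-on-write ETL sketch (hypothetical field names): the target schema
# is fixed up front, and rows that cannot be transformed into it are dropped.

TARGET_SCHEMA = ("customer_id", "amount")  # defined before any data arrives

def extract():
    # Stand-in for pulling rows from an LOB application or relational source.
    return [
        {"customer_id": "c1", "amount": "19.99", "note": "gift"},
        {"customer_id": "c2"},  # missing 'amount': rejected at write time
    ]

def transform(row):
    # Force the row into the target schema; fail if required fields are absent.
    return {"customer_id": row["customer_id"], "amount": float(row["amount"])}

def load(rows):
    table = []
    for row in rows:
        try:
            table.append(transform(row))
        except (KeyError, ValueError):
            pass  # data not fitting the schema is discarded (or archived)
    return table

warehouse_table = load(extract())
```

Note how the extra `note` field and the incomplete row both vanish: with schema-on-write, anything outside the predefined schema never reaches the warehouse.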
Harness the growing and changing nature of data
Need to collect any data: structured, unstructured, and streaming.
The challenge is combining transactional data stored in relational databases with less structured data.
"Get the right information to the right people at the right time in the right format."
The three V's of Big Data
New big data thinking: All data has value
Gather data from all sources → store indefinitely → analyze → see results → iterate.
• All data has potential value
• Data hoarding
• No defined schema: data is stored in its native format
• Apps and users interpret the data as they see fit
Data Lake
What is a Data Lake?
A Data Lake is a new and increasingly popular way to store and analyze massive volumes and heterogeneous types of data in a centralized repository.
Benefits of a Data Lake – Quick Ingest
Quickly ingest data without needing to force it into a pre-defined schema.
"How can I collect data quickly from various sources and store it efficiently?"
Benefits of a Data Lake – All Data in One Place
"Why is the data distributed in many locations? Where is the single source of truth?"
Store and analyze all of your data, from all of your sources, in one centralized location.
Benefits of a Data Lake – Storage vs Compute
Separating your storage and compute allows you to scale each component as required.
"How can I scale up with the volume of data being generated?"
Benefits of a Data Lake – Schema on Read
A Data Lake enables ad-hoc analysis by applying schemas on read, not on write.
"Is there a way I can apply multiple analytics and processing frameworks to the same data?"
Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
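The new way can be sketched in plain Python: raw events are stored exactly as they arrived (ingest first), and each consumer applies its own projection only when it reads the data. The event fields below are hypothetical.

```python
import json

# Schema-on-read sketch: the store keeps raw JSON lines in native format;
# structure is imposed per consumer, at read time.

raw_store = [
    '{"user": "u1", "action": "click", "ms": 120}',
    '{"user": "u2", "action": "buy", "price": 9.5}',
]

def read_with_schema(store, fields):
    # Each analysis projects only the fields it cares about; missing
    # fields simply come back as None instead of breaking ingestion.
    out = []
    for line in store:
        event = json.loads(line)
        out.append({f: event.get(f) for f in fields})
    return out

# Two different "schemas" over the same raw data:
clickstream_view = read_with_schema(raw_store, ["user", "action"])
revenue_view = read_with_schema(raw_store, ["user", "price"])
```

The same stored bytes serve both views, which is exactly why multiple analytics frameworks can share one lake.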
Data Lake Architecture
Simplified Big Data Pipeline
Data Ingestion
• Scalable and extensible, to capture both streaming and batch data
• Provides hooks for business logic, filters, validation, data-quality checks, routing, and other business requirements
• Technology stack:
• Apache Flume
• Apache Kafka
• Apache Sqoop
• Apache NiFi
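A minimal sketch of these ingestion responsibilities (validation, data-quality filtering, routing), using plain Python dictionaries in place of a real Flume/Kafka/NiFi flow; the topic names and rules are assumptions for illustration only.

```python
# Ingestion sketch: validate each record, then route it to a topic;
# records failing data-quality checks are quarantined, not silently lost.

def valid(record):
    # Data-quality rule (assumed): a record needs a non-empty 'id' and a payload.
    return bool(record.get("id")) and "payload" in record

def route(record):
    # Routing rule (assumed): streaming sensor data and batch records
    # go to different destinations.
    return "sensors" if record.get("source") == "sensor" else "batch"

def ingest(records):
    topics = {"sensors": [], "batch": [], "rejected": []}
    for r in records:
        if valid(r):
            topics[route(r)].append(r)
        else:
            topics["rejected"].append(r)  # quarantined for inspection
    return topics

topics = ingest([
    {"id": "1", "payload": 10, "source": "sensor"},
    {"id": "2", "payload": 20},
    {"id": "", "payload": 30},  # fails validation
])
```

In a real deployment each topic would be a Kafka topic or NiFi relationship; the shape of the logic is the same.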
Data Storage
• Depending on the requirements, data can be placed into a distributed file system, object storage, NoSQL databases, etc.
• Metadata management
• Policy-based data retention is provided
• Technology stack:
• HDFS / Hive
• Redis / MongoDB / HBase / Cassandra / Elasticsearch
• …
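The policy-based retention mentioned above might look like this sketch: each dataset carries a retention window, and objects older than their window are selected for expiry. The dataset names and windows are hypothetical.

```python
import datetime

# Retention-policy sketch: per-dataset windows (assumed values), applied
# to stored objects by age.

RETENTION_DAYS = {"raw_logs": 30, "curated": 365}

def expired(objects, today):
    # An object expires once its age exceeds its dataset's retention window.
    out = []
    for obj in objects:
        age = (today - obj["written"]).days
        if age > RETENTION_DAYS[obj["dataset"]]:
            out.append(obj["key"])
    return out

today = datetime.date(2024, 6, 1)
to_delete = expired(
    [
        {"key": "raw/a.log", "dataset": "raw_logs",
         "written": datetime.date(2024, 4, 1)},   # 61 days old: past 30-day window
        {"key": "cur/t1.parquet", "dataset": "curated",
         "written": datetime.date(2024, 4, 1)},   # well inside 365-day window
    ],
    today,
)
```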
Data Store: Technical Requirements
• Secure: Must be highly secure to prevent unauthorized access (especially as all data is in one place).
• Native format: Must permit data to be stored in its 'native format' to track lineage and for data provenance.
• Low latency: Must have low latency for high-frequency operations.
• Multiple analytic frameworks: Must support multiple analytic frameworks (batch, real-time, streaming, ML, etc.); no one analytic framework can work for all data and all types of analysis.
• Details: Must be able to store data with all details; aggregation may lead to loss of details.
• Throughput: Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark.
• Reliable: Must be highly available and reliable (no permanent loss of data).
• Scalable: Must be highly scalable; when storing all data indefinitely, data volumes can quickly add up.
• All sources: Must be able to ingest data from a variety of sources: LOB/ERP, logs, devices, social networks, etc.
Data Processing
• Processing is provided for batch, streaming, and near-real-time use cases
• Scale out instead of scale up
• Fault-tolerant methods
• Processing is moved to the data
• Technology stack:
• MapReduce
• Spark
• Storm
• Flink
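The scale-out, move-processing-to-data model is easiest to see in a toy MapReduce word count. This is a single-process sketch of the programming model, not a distributed implementation: map tasks run independently per input split, and a shuffle groups keys before reduce, which is why adding machines (splits) scales the job out.

```python
from collections import defaultdict

# MapReduce-style word count, sketched in one process.

def map_task(split):
    # Each map task sees only its own split; no shared state is needed.
    return [(word, 1) for word in split.split()]

def shuffle(mapped):
    # Group all values by key, as the framework's shuffle phase would.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    return key, sum(values)

splits = ["big data big", "data lake"]  # one split per (imagined) worker
mapped = [pair for s in splits for pair in map_task(s)]
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
```

Because `map_task` has no cross-split dependencies, a framework like Hadoop or Spark can run it wherever the data block lives, which is what "process is moved to the data" means in practice.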
Spark Stack
Visualization and APIs
• Dashboards and applications that provide valuable business insights
• Data can be made available to consumers via APIs, messaging queues, or direct DB access
• Technology stack:
• Qlik / Tableau / Spotfire
• REST APIs
• Kafka
• JDBC
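Serving results to consumers through a REST-style API can be sketched as a plain request handler, standing in for a real web framework; the endpoint paths and metric names below are hypothetical.

```python
import json

# REST-style serving sketch: processed results exposed as JSON resources.
# (In production this handler would sit behind a web framework or gateway.)

METRICS = {"daily_active_users": 1042, "churn_rate": 0.031}  # illustrative results

def handle(path):
    # GET /metrics        -> list available metric names
    # GET /metrics/<name> -> one metric as a JSON object
    parts = [p for p in path.split("/") if p]
    if parts == ["metrics"]:
        return 200, json.dumps(sorted(METRICS))
    if len(parts) == 2 and parts[0] == "metrics" and parts[1] in METRICS:
        return 200, json.dumps({parts[1]: METRICS[parts[1]]})
    return 404, json.dumps({"error": "not found"})

status, body = handle("/metrics/churn_rate")
```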
Big Data Architecture
Lambda Architecture
Unified Architecture
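As a rough sketch of the Lambda architecture's serving layer, with purely illustrative keys and counts: queries merge the precomputed batch view with the speed layer's recent increments, so readers get fresh answers without waiting for the next batch run.

```python
# Lambda serving-layer sketch: batch view + speed view, merged at query time.

batch_view = {"u1": 40, "u2": 7}   # computed by the batch layer, hours old
speed_view = {"u1": 2, "u3": 1}    # real-time increments since the last batch run

def query(user):
    # The serving layer hides batch latency by adding the speed layer's delta.
    return batch_view.get(user, 0) + speed_view.get(user, 0)
```

When the next batch run completes, its view absorbs the increments and the speed view is reset, which is the recomputation/compaction cycle the architecture relies on.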
Business Intelligence vs. Big Data
Analytics Example
Data Warehousing Uses a Top-Down Approach
1. Understand corporate strategy
2. Gather requirements: business requirements and technical requirements
3. Implement the data warehouse:
• Dimension modelling → physical design
• ETL design → ETL development
• Setup infrastructure → install and tune
• Reporting & analytics design → reporting & analytics development
4. Ingest from the data sources
The "Data Lake" Uses a Bottom-Up Approach
1. Ingest all data from all sources (devices, applications, etc.)
2. Store all data in its native format, without schema definition
3. Do analysis using analytic engines: interactive queries, batch queries, machine learning, real-time analytics, or feeding a data warehouse
Data Lake + Data Warehouse: Better Together
From the data sources, analytics matures through four stages:
• Descriptive analytics: What happened?
• Diagnostic analytics: Why did it happen?
• Predictive analytics: What will happen?
• Prescriptive analytics: How can we make it happen?
Summary
• We live in an increasingly data-intensive world
• Much of the data stored and analyzed today is more varied than the data stored in recent years
• More of our data arrives in near-real time
This presents a large business opportunity.
VahidAmiry.ir
@vahidamiry

Data Lake – IT Weekend, Sharif University – Vahid Amiry
