Vahid Amiry
Topic:
Big Data Systems Architecture
M.Sc. in Information Technology (E-Commerce)
Big data expert and consultant
VahidAmiry.ir
@vahidamiry
What to expect from this short talk
• History of Data Processing
• Data Warehouse System
• Data Lake Concept
• Big Data Pipeline
• Data Lake Architecture
What is big data and why is it valuable to the business?
An evolution in the nature and use of data in the enterprise.
Data complexity: variety and velocity. Data volume: petabytes.
Information explosion, new insights: 90% of the world's data has been created over the last two years alone.¹
Shift to cheaper, faster computing, on demand: 45% of total IT spend will be cloud-related by 2020.²
Increasingly data-savvy workforce: companies that use analytics are 5x more likely to make decisions faster than competitors.³
A transformative opportunity.
1. IDC. 2. Josh Waldo, Senior Director, Cloud Partner Strategy, Microsoft. 3. Bain & Company, The Value of Big Data: How Analytics Differentiates Winners, 2013.
Data Analytics is needed everywhere:
• Recommendation engines
• Smart meter monitoring
• Equipment monitoring
• Advertising analysis
• Life sciences research
• Fraud detection
• Healthcare outcomes
• Weather forecasting for business planning
• Oil & Gas exploration
• Social network analysis
• Churn analysis
• Traffic flow optimization
• IT infrastructure & web app optimization
• Legal discovery and document archiving
• Intelligence gathering
• Location-based tracking & services
• Pricing analysis
• Personalized insurance
Traditional Business Analytics Process
1. Start with end-user requirements to identify the desired reports and analysis
2. Define the corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract the required data and transform it to the target schema ('schema-on-write')
5. Create reports: analyze the data
The pipeline: relational sources and LOB applications → ETL tools → defined schema → queries → results.
All data not immediately required is discarded or archived.
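Step 4 above can be sketched in a few lines: a transform forces each extracted row into a fixed target schema defined before any data is loaded, and rows that do not fit are discarded, mirroring 'schema-on-write' and the note about discarding unneeded data. All field names here are illustrative, not from any real system.

```python
# Schema-on-write ETL sketch (hypothetical field names): the target schema
# is fixed up front, and rows that cannot be transformed into it are dropped.

TARGET_SCHEMA = ("customer_id", "amount")  # defined before any data arrives

def extract():
    # Stand-in for pulling rows from an LOB application or relational source.
    return [
        {"customer_id": "c1", "amount": "19.99", "note": "gift"},
        {"customer_id": "c2"},  # missing 'amount': rejected at write time
    ]

def transform(row):
    # Force the row into the target schema; fail if required fields are absent.
    return {"customer_id": row["customer_id"], "amount": float(row["amount"])}

def load(rows):
    table = []
    for row in rows:
        try:
            table.append(transform(row))
        except (KeyError, ValueError):
            pass  # data not fitting the schema is discarded (or archived)
    return table

warehouse_table = load(extract())
```

Note how the extra `note` field and the incomplete row both vanish: with schema-on-write, anything outside the predefined schema never reaches the warehouse.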
Harness the growing and changing nature of data
Need to collect any data: structured, unstructured, and streaming.
The challenge is combining transactional data stored in relational databases with less structured data.
"Get the right information to the right people at the right time in the right format."
The three V's of Big Data
New big data thinking: All data has value
Gather data from all sources → store indefinitely → analyze → see results → iterate.
• All data has potential value
• Data hoarding
• No defined schema: data is stored in its native format
• Apps and users interpret the data as they see fit
Data Lake
What is a Data Lake?
A Data Lake is a new and increasingly popular way to store and analyze massive volumes and heterogeneous types of data in a centralized repository.
Benefits of a Data Lake – Quick Ingest
Quickly ingest data without needing to force it into a pre-defined schema.
"How can I collect data quickly from various sources and store it efficiently?"
Benefits of a Data Lake – All Data in One Place
"Why is the data distributed in many locations? Where is the single source of truth?"
Store and analyze all of your data, from all of your sources, in one centralized location.
Benefits of a Data Lake – Storage vs Compute
Separating your storage and compute allows you to scale each component as required.
"How can I scale up with the volume of data being generated?"
Benefits of a Data Lake – Schema on Read
A Data Lake enables ad-hoc analysis by applying schemas on read, not on write.
"Is there a way I can apply multiple analytics and processing frameworks to the same data?"
Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
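The new way can be sketched in plain Python: raw events are stored exactly as they arrived (ingest first), and each consumer applies its own projection only when it reads the data. The event fields below are hypothetical.

```python
import json

# Schema-on-read sketch: the store keeps raw JSON lines in native format;
# structure is imposed per consumer, at read time.

raw_store = [
    '{"user": "u1", "action": "click", "ms": 120}',
    '{"user": "u2", "action": "buy", "price": 9.5}',
]

def read_with_schema(store, fields):
    # Each analysis projects only the fields it cares about; missing
    # fields simply come back as None instead of breaking ingestion.
    out = []
    for line in store:
        event = json.loads(line)
        out.append({f: event.get(f) for f in fields})
    return out

# Two different "schemas" over the same raw data:
clickstream_view = read_with_schema(raw_store, ["user", "action"])
revenue_view = read_with_schema(raw_store, ["user", "price"])
```

The same stored bytes serve both views, which is exactly why multiple analytics frameworks can share one lake.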
Data Lake Architecture
Simplified Big Data Pipeline
Data Ingestion
• Scalable and extensible, to capture both streaming and batch data
• Provides hooks for business logic, filters, validation, data-quality checks, routing, and other business requirements
• Technology stack:
• Apache Flume
• Apache Kafka
• Apache Sqoop
• Apache NiFi
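A minimal sketch of these ingestion responsibilities (validation, data-quality filtering, routing), using plain Python dictionaries in place of a real Flume/Kafka/NiFi flow; the topic names and rules are assumptions for illustration only.

```python
# Ingestion sketch: validate each record, then route it to a topic;
# records failing data-quality checks are quarantined, not silently lost.

def valid(record):
    # Data-quality rule (assumed): a record needs a non-empty 'id' and a payload.
    return bool(record.get("id")) and "payload" in record

def route(record):
    # Routing rule (assumed): streaming sensor data and batch records
    # go to different destinations.
    return "sensors" if record.get("source") == "sensor" else "batch"

def ingest(records):
    topics = {"sensors": [], "batch": [], "rejected": []}
    for r in records:
        if valid(r):
            topics[route(r)].append(r)
        else:
            topics["rejected"].append(r)  # quarantined for inspection
    return topics

topics = ingest([
    {"id": "1", "payload": 10, "source": "sensor"},
    {"id": "2", "payload": 20},
    {"id": "", "payload": 30},  # fails validation
])
```

In a real deployment each topic would be a Kafka topic or NiFi relationship; the shape of the logic is the same.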
Data Storage
• Depending on the requirements, data can be placed into a distributed file system, object storage, NoSQL databases, etc.
• Metadata management
• Policy-based data retention is provided
• Technology stack:
• HDFS / Hive
• Redis / MongoDB / HBase / Cassandra / Elasticsearch
• …
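The policy-based retention mentioned above might look like this sketch: each dataset carries a retention window, and objects older than their window are selected for expiry. The dataset names and windows are hypothetical.

```python
import datetime

# Retention-policy sketch: per-dataset windows (assumed values), applied
# to stored objects by age.

RETENTION_DAYS = {"raw_logs": 30, "curated": 365}

def expired(objects, today):
    # An object expires once its age exceeds its dataset's retention window.
    out = []
    for obj in objects:
        age = (today - obj["written"]).days
        if age > RETENTION_DAYS[obj["dataset"]]:
            out.append(obj["key"])
    return out

today = datetime.date(2024, 6, 1)
to_delete = expired(
    [
        {"key": "raw/a.log", "dataset": "raw_logs",
         "written": datetime.date(2024, 4, 1)},   # 61 days old: past 30-day window
        {"key": "cur/t1.parquet", "dataset": "curated",
         "written": datetime.date(2024, 4, 1)},   # well inside 365-day window
    ],
    today,
)
```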
Data Store: Technical Requirements
• Secure: Must be highly secure to prevent unauthorized access (especially as all data is in one place).
• Native format: Must permit data to be stored in its 'native format' to track lineage and for data provenance.
• Low latency: Must have low latency for high-frequency operations.
• Multiple analytic frameworks: Must support multiple analytic frameworks (batch, real-time, streaming, ML, etc.); no one analytic framework can work for all data and all types of analysis.
• Details: Must be able to store data with all details; aggregation may lead to loss of details.
• Throughput: Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark.
• Reliable: Must be highly available and reliable (no permanent loss of data).
• Scalable: Must be highly scalable; when storing all data indefinitely, data volumes can quickly add up.
• All sources: Must be able to ingest data from a variety of sources: LOB/ERP, logs, devices, social networks, etc.
Data Processing
• Processing is provided for batch, streaming, and near-real-time use cases
• Scale out instead of scale up
• Fault-tolerant methods
• Processing is moved to the data
• Technology stack:
• MapReduce
• Spark
• Storm
• Flink
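The scale-out, move-processing-to-data model is easiest to see in a toy MapReduce word count. This is a single-process sketch of the programming model, not a distributed implementation: map tasks run independently per input split, and a shuffle groups keys before reduce, which is why adding machines (splits) scales the job out.

```python
from collections import defaultdict

# MapReduce-style word count, sketched in one process.

def map_task(split):
    # Each map task sees only its own split; no shared state is needed.
    return [(word, 1) for word in split.split()]

def shuffle(mapped):
    # Group all values by key, as the framework's shuffle phase would.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    return key, sum(values)

splits = ["big data big", "data lake"]  # one split per (imagined) worker
mapped = [pair for s in splits for pair in map_task(s)]
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
```

Because `map_task` has no cross-split dependencies, a framework like Hadoop or Spark can run it wherever the data block lives, which is what "process is moved to the data" means in practice.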
Spark Stack
Visualization and APIs
• Dashboards and applications that provide valuable business insights
• Data can be made available to consumers via APIs, messaging queues, or direct DB access
• Technology stack:
• Qlik / Tableau / Spotfire
• REST APIs
• Kafka
• JDBC
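Serving results to consumers through a REST-style API can be sketched as a plain request handler, standing in for a real web framework; the endpoint paths and metric names below are hypothetical.

```python
import json

# REST-style serving sketch: processed results exposed as JSON resources.
# (In production this handler would sit behind a web framework or gateway.)

METRICS = {"daily_active_users": 1042, "churn_rate": 0.031}  # illustrative results

def handle(path):
    # GET /metrics        -> list available metric names
    # GET /metrics/<name> -> one metric as a JSON object
    parts = [p for p in path.split("/") if p]
    if parts == ["metrics"]:
        return 200, json.dumps(sorted(METRICS))
    if len(parts) == 2 and parts[0] == "metrics" and parts[1] in METRICS:
        return 200, json.dumps({parts[1]: METRICS[parts[1]]})
    return 404, json.dumps({"error": "not found"})

status, body = handle("/metrics/churn_rate")
```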
Big Data Architecture
Lambda Architecture
Unified Architecture
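As a rough sketch of the Lambda architecture's serving layer, with purely illustrative keys and counts: queries merge the precomputed batch view with the speed layer's recent increments, so readers get fresh answers without waiting for the next batch run.

```python
# Lambda serving-layer sketch: batch view + speed view, merged at query time.

batch_view = {"u1": 40, "u2": 7}   # computed by the batch layer, hours old
speed_view = {"u1": 2, "u3": 1}    # real-time increments since the last batch run

def query(user):
    # The serving layer hides batch latency by adding the speed layer's delta.
    return batch_view.get(user, 0) + speed_view.get(user, 0)
```

When the next batch run completes, its view absorbs the increments and the speed view is reset, which is the recomputation/compaction cycle the architecture relies on.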
Business Intelligence vs. Big Data
Analytics Example
Data Warehousing Uses a Top-Down Approach
1. Understand corporate strategy
2. Gather requirements: business requirements and technical requirements
3. Implement the data warehouse:
• Dimension modelling → physical design
• ETL design → ETL development
• Setup infrastructure → install and tune
• Reporting & analytics design → reporting & analytics development
4. Ingest from the data sources
The "Data Lake" Uses a Bottom-Up Approach
1. Ingest all data from all sources (devices, applications, etc.)
2. Store all data in its native format, without schema definition
3. Do analysis using analytic engines: interactive queries, batch queries, machine learning, real-time analytics, or feeding a data warehouse
Data Lake + Data Warehouse: Better Together
From the data sources, analytics matures through four stages:
• Descriptive analytics: What happened?
• Diagnostic analytics: Why did it happen?
• Predictive analytics: What will happen?
• Prescriptive analytics: How can we make it happen?
Summary
• We live in an increasingly data-intensive world
• Much of the data stored and analyzed today is more varied than the data stored in recent years
• More of our data arrives in near-real time
This presents a large business opportunity.
VahidAmiry.ir
@vahidamiry

Data Lake – IT Weekend, Sharif University – Vahid Amiry
