Architecture Styles and Deployment on Hadoop
1. Architectural Patterns and Best
Practices : #BigData #Hadoop
Srividhya Balasubramaniam @ Data and Information Management Consultant
Srividhya.logic@gmail.com
3. Agenda
• Why enterprises are rethinking their data strategy
• Modernizing Enterprise Data Warehouses
• Architectural Patterns and Design Considerations
• Best Practices
Analytics Architecture
Application Architecture
Platform Architecture
4. “Because we have been doing stuff this way for ages!…”
is not the norm. Re-Think!
5. Drivers of Change vs. What Has Not Changed
DATA QUALITY AND GOVERNANCE
INFORMATION SECURITY
METADATA MANAGEMENT
DATA SOURCES
DATA STORE
DATA ACCESS
ORCHESTRATION AND SCHEDULING
7. What is the right tool, and why?
How should I use the tool?
Is there a reference architecture?
What language and tools should I learn?
What does data modelling look like in Hadoop?
Buy or build?
8. Core Design Principles
What Business Problem is being Solved?
Define Tool Selection Criteria
Decouple processing and storage systems
Hybrid Architecture: Leverage Both Batch and Stream
Scalable, Reliable, Fit for Purpose, Secure
Available, Very Low Admin Cost
Supportable, with Operations Monitoring
The Best Design Is Cheap
9. Typical Data Pipeline
Data Source Ingest
•RDBMS
•SEARCH
•FILES/API
•MESSAGING
•IOT/STREAM
Store Raw
•DATABASE
•SEARCH DOCUMENTS
•DIST FILE STORAGE
•QUEUE
•STREAM STORE
Process for Analysis
•BATCH
•INTERACTIVE
•STREAMING
•MESSAGING
•MACHINE LEARNING
Store
•Key Value
•Graph
•Document
•Queue
•MPP
Insights
•Analytical Models
•Visualization
•Self-Service BI
Storage for Messaging and Streaming
Criteria
1. How Distributed Services are managed
2. Guaranteed Ordering
3. Data Delivery
4. Data Retention Period
5. Availability
6. Scalability
7. Throughput
8. Parallel Clients
9. Object Size
10.Stream Map Reduce
11.Cost
E.g. Apache Kafka
• Guaranteed ordering, parallel clients, and stream map-reduce
• Configurable data retention, availability, and object size
• Low cost, but more admin effort
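Kafka's ordering guarantee is per partition, not per topic: messages with the same key land in the same partition and stay in produce order. A minimal pure-Python sketch of that keyed-partitioning idea (the `MiniLog` class and `partition_for` helper are illustrative stand-ins, not Kafka APIs):

```python
from collections import defaultdict

def partition_for(key: str, num_partitions: int) -> int:
    """Kafka-style keyed partitioning: the same key always maps
    to the same partition (within one process run)."""
    return hash(key) % num_partitions

class MiniLog:
    """Toy stand-in for a partitioned log: ordering is guaranteed
    only within a partition, not across the whole topic."""
    def __init__(self, num_partitions: int = 3):
        self.partitions = defaultdict(list)
        self.num_partitions = num_partitions

    def produce(self, key, value):
        p = partition_for(key, self.num_partitions)
        self.partitions[p].append((key, value))
        return p

log = MiniLog()
for i in range(5):
    log.produce("order-42", i)   # one key -> one partition

p = partition_for("order-42", 3)
# Events for a single key arrive in produce order:
assert [v for _, v in log.partitions[p]] == [0, 1, 2, 3, 4]
```

This is why choosing the message key matters: it defines both the unit of ordering and the unit of parallelism for consumers.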
10. Typical Data Pipeline
Databases: Which DB Export Approach to Choose
1. File Size
2. Network Bandwidth
3. Partitioning
4. Bulk Loading
5. CDC and Delta Data Transfers
6. Native connectors and distribution-specific connectors, adaptors, GoldenGate, etc.
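A common lightweight alternative to full CDC for delta transfers is a watermark query: pull only rows changed since the last successful load. A hedged sketch (table and column names are made up; a real pipeline should use bind parameters rather than string formatting):

```python
def delta_query(table: str, ts_col: str, last_watermark: str) -> str:
    """Build an incremental-extract query that pulls only rows
    modified after the last recorded watermark.
    NOTE: string interpolation is for illustration only; use
    parameterized queries in production to avoid SQL injection."""
    return (f"SELECT * FROM {table} "
            f"WHERE {ts_col} > '{last_watermark}' "
            f"ORDER BY {ts_col}")

q = delta_query("orders", "updated_at", "2017-01-01 00:00:00")
assert q.startswith("SELECT * FROM orders")
assert "WHERE updated_at > '2017-01-01 00:00:00'" in q
```

After each run, the maximum `updated_at` seen becomes the next watermark; log-based CDC (e.g. GoldenGate) is the heavier option when deletes and intra-day churn must be captured.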
11. Typical Data Pipeline
Data Storage – Distributed Files Criteria
1. Average Latency
2. Typical Data Stored
3. Typical Item Size
4. Request Rate
5. Storage Cost per GB / Timeframe
6. Durability
7. Availability
8. Native support for toolsets
9. Active community and open source
Enterprise Distributions Selection
Cloudera, Hortonworks, MapR
12. Typical Data Pipeline
Data Storage Selection Criteria
Data Structure: Fixed, Key-Value, JSON
Access Patterns: Hierarchical, Structured, Search, Publish, etc.
Data Temperature: Hot, Warm, Cold
TCO: Low
Elastic Cache
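Data temperature plus access pattern is what drives store selection. A minimal illustrative routing function (the store names in the return values are examples of the mapping, not prescriptions):

```python
def route_store(temperature: str, access_pattern: str) -> str:
    """Pick a storage tier from data temperature and access pattern.
    The mappings below are illustrative defaults, not a standard."""
    if temperature == "hot" and access_pattern == "key-value":
        return "in-memory cache"        # e.g. an elastic cache layer
    if temperature == "hot" and access_pattern == "search":
        return "search index"
    if temperature == "warm":
        return "NoSQL / MPP store"
    return "distributed file storage"   # cold data: HDFS / object store

assert route_store("hot", "key-value") == "in-memory cache"
assert route_store("cold", "structured") == "distributed file storage"
```

The point of making this a function rather than tribal knowledge: the routing rules become reviewable, testable, and easy to change as TCO targets shift.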
13. Typical Data Pipeline
Data Storage Selection Criteria
Compare Cache, NoSQL, SQL, and Search stores on:
1. Average Latency (ms, sec, min, hours)
2. Typical Volume Stored (GB, TB, PB)
3. Typical Item Size (B, KB, TB, PB)
4. Query Request Rate (High to Very Low)
5. Storage and Maintenance Cost (High – Low)
6. Durability (Low – Very High)
7. Availability (High – Very High)
14. Typical Data Pipeline
Processing: Batch, Interactive, Streaming, Messaging, Machine Learning (e.g. Spark ML, EMR)
Criteria
1. Programming Language Support
2. Availability
3. Speed
4. Scale
5. Query Latency
6. Data Volume
7. Storage Support
8. SQL Support?
Temperature of Data
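Criteria lists like the one above become actionable once they are weighted and scored. A hedged sketch of a weighted-scoring comparison; the weights and the 1–5 ratings below are entirely made-up illustrations, not benchmark results:

```python
def score_engine(ratings: dict, weights: dict) -> float:
    """Weighted score across selection criteria (ratings on a 1-5
    scale). Both weights and ratings are illustrative inputs."""
    return sum(ratings[c] * weights[c] for c in weights)

weights = {"language_support": 0.2, "availability": 0.2,
           "speed": 0.3, "sql": 0.3}

# Hypothetical ratings for two candidate engines:
engine_a = {"language_support": 5, "availability": 4, "speed": 4, "sql": 5}
engine_b = {"language_support": 3, "availability": 4, "speed": 2, "sql": 2}

assert score_engine(engine_a, weights) > score_engine(engine_b, weights)
```

The value is less in the arithmetic than in forcing the team to agree, in writing, on which criteria dominate for the workload at hand.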
15. Typical Data Pipeline
Buy vs. Build: The ETL Decision
16. Typical Data Pipeline
Create Analytical Application
Make Insights Available Via API
Analysis and Visualization: Zeppelin, HUE, etc.
Publish to Queue
18. Not only ER and Dimension Models (NoERDM)
Data Storage Format
Text
Sequence
Avro
Parquet
RC/ORC
Know the strengths and weaknesses of each format in terms of:
Supporting distributions
Processing requirements – write, partial read, full read
Schema evolution
Extract requirements
Storage requirements – how big are your files?
How important is file splittability?
Does block compression matter?
Does the file format support indexing?
How easy is it to parse?
Does it support column stats?
Failure behavior of each file format
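The partial-read distinction between row-oriented formats (Text, Sequence, Avro) and columnar formats (Parquet, RC/ORC) can be sketched in a few lines of pure Python. This is a toy model of the layout difference only, not either format's actual encoding:

```python
# Row layout: each record's fields are stored together, so reading
# one column still touches every field of every record.
rows = [{"id": i, "name": f"n{i}", "amount": i * 10} for i in range(4)]

# Row-oriented partial read: scan whole records.
amounts_row = [r["amount"] for r in rows]

# Columnar layout (Parquet/ORC style): transpose once at write
# time, then a partial read slices only the column requested.
columns = {k: [r[k] for r in rows] for k in rows[0]}
amounts_col = columns["amount"]

assert amounts_row == amounts_col == [0, 10, 20, 30]
```

Columnar layouts win on partial reads and per-column compression and stats; row layouts win on whole-record writes and reads, which is why the write/partial-read/full-read question above comes first.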
19. Not only ER and Dimension Models (NoERDM)
Compression Codecs
ZLIB
LZO
LZF
Snappy
Gzip
Bzip2
Considerations
How much does the size reduce?
How fast can it compress and decompress?
Can compressed files be split? File splittability is needed to exploit parallelism.
Compression types
Uncompressed
Record compressed
Block compressed
We trade I/O load for CPU load
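The ratio-versus-speed trade-off is easy to measure directly. A small sketch using Python's stdlib bindings for two of the codec families above (zlib and bzip2) on a deliberately repetitive payload; absolute numbers will vary by machine and data:

```python
import bz2
import time
import zlib

data = b"hadoop " * 100_000  # highly repetitive sample payload

for name, codec in [("zlib", zlib), ("bz2", bz2)]:
    t0 = time.perf_counter()
    packed = codec.compress(data)
    elapsed = time.perf_counter() - t0
    ratio = len(data) / len(packed)
    print(f"{name}: ratio={ratio:.0f}x, time={elapsed * 1000:.1f} ms")

# Both codecs must round-trip losslessly:
assert zlib.decompress(zlib.compress(data)) == data
assert bz2.decompress(bz2.compress(data)) == data
```

Typically bzip2 squeezes harder but burns more CPU, which is exactly the "trade I/O load for CPU load" point: pick the codec per stage (fast codecs like Snappy/LZO for intermediate data, heavier ones for cold archives).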
20. Other Practices
1. Structure and organize your repository
a. Standard directory structure
b. Access quota controls
c. Staging area conventions
2. Location of HDFS files
a. The directory structure should simplify the assignment of permissions to be granted.
b. E.g. /user, /etl, /tmp, /data, /app, /metadata
3. Partitioning, bucketing, and denormalization
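Partitioning in HDFS is conventionally expressed as Hive-style `key=value` directory names, so query engines can prune whole directories. A small sketch of building such paths (the base directory and partition keys are illustrative, matching the example layout above):

```python
from datetime import date

def partition_path(base: str, table: str, dt: date, region: str) -> str:
    """Hive-style partitioned HDFS path: directory names encode the
    partition keys, enabling partition pruning at query time."""
    return f"{base}/{table}/dt={dt.isoformat()}/region={region}"

p = partition_path("/data", "clickstream", date(2017, 6, 1), "emea")
assert p == "/data/clickstream/dt=2017-06-01/region=emea"
```

Keeping path construction in one function also enforces the naming convention from point 1a: every job writes to the same layout, and permissions can be granted at the directory level.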
21. Data Lake / Reservoir / Refinery
Exploratory Data Analysis
Application Level Analytics
Batch and Stream Analytics – Lambda Architecture
Enterprise Data Pipeline
Broken Promise – No single version of truth
At least we can persist raw data
Obstacles with Big Data
1.1.1 Data Storage formats
Considerations
Compression considerations
Any codec can be made splittable by using a container format.
Enable compression for MapReduce intermediate outputs to improve performance.
Pay attention to how data is ordered: compression happens in chunks, so the entropy of each chunk matters. Ordering data before storing it enables better compression.
Use a compact file format with support for splittability, e.g. SequenceFile or Avro.
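The ordering point above is easy to demonstrate: the same values compress noticeably smaller when sorted, because each compression chunk then contains long runs of identical content. A stdlib-only sketch:

```python
import random
import zlib

random.seed(7)
values = [random.choice(["alpha", "beta", "gamma"]) for _ in range(10_000)]

shuffled = "".join(values).encode()          # data in arrival order
ordered = "".join(sorted(values)).encode()   # same values, sorted

# Identical bytes overall, but the sorted layout has lower
# per-chunk entropy, so it compresses smaller.
assert len(ordered) == len(shuffled)
assert len(zlib.compress(ordered)) < len(zlib.compress(shuffled))
```

In practice this means choosing a sensible sort (or cluster) key at write time, for example sorting events by a low-cardinality column before writing a block-compressed file.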