Architectural Patterns and Best Practices: #BigData #Hadoop
Srividhya Balasubramaniam @ Data and Information Management Consultant
Srividhya.logic@gmail.com
Ice Breaker
120 Sec
Shhhhh!
Agenda
• Why are enterprises rethinking their data strategy?
• Modernizing Enterprise Data Warehouses
• Architectural Patterns and Design Considerations
• Best Practices
Analytics Architecture
Application Architecture
Platform Architecture
“Because we have been doing stuff this way for ages!” is not the norm
Re-Think!
Drivers of Change
What Has Not Changed
DATA QUALITY AND GOVERNANCE
INFORMATION SECURITY
METADATA MANAGEMENT
DATA SOURCES
DATA STORE
DATA ACCESS
ORCHESTRATION AND SCHEDULING
Challenges?
Velocity, Variety and Volume
• What is the right tool? How should I use the tool?
• Reference architecture?
• What language and tool should I learn?
• Why? Why? Why? Why?
• What is data modelling like in Hadoop?
• Buy or build?
Core Design Principles
• What business problem is being solved?
• Define tool selection criteria
• Decouple processing store and systems
• Hybrid architecture: leverage batch and stream
• Scalable, reliable, fit for purpose, secure
• Available, very low admin cost
• Supportable, with operations monitoring
• Best design is cheap
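One way to make "define tool selection criteria" concrete is a weighted scoring matrix. The sketch below is only an illustration: the criterion names are taken from the principles above, while the weights, the candidate names (`tool_a`, `tool_b`), and every rating are hypothetical placeholders that a team would replace with its own.

```python
from typing import Dict

# Hypothetical weights: how much each criterion matters for the business problem.
WEIGHTS: Dict[str, float] = {
    "scalable": 0.25,
    "reliable": 0.25,
    "secure": 0.20,
    "low_admin_cost": 0.15,
    "supportable": 0.15,
}


def weighted_score(ratings: Dict[str, int], weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine 1-5 ratings per criterion into a single weighted score."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(weights[name] * ratings[name] for name in weights)


# Hypothetical candidates with made-up ratings, for illustration only.
candidates = {
    "tool_a": {"scalable": 5, "reliable": 4, "secure": 3, "low_admin_cost": 2, "supportable": 3},
    "tool_b": {"scalable": 3, "reliable": 4, "secure": 4, "low_admin_cost": 5, "supportable": 4},
}

# Rank candidates from best to worst weighted score.
ranked = sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)
print(ranked)
```

Agreeing on the weights before rating any tool keeps the business problem, rather than a favourite product, in charge of the decision.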
Typical Data Pipeline
Data Source Ingest
•RDBMS
•SEARCH
•FILES/API
•MESSAGING
•IOT/STREAM
Store Raw
•DATABASE
•SEARCH DOCUMENTS
•DIST FILE STORAGE
•QUEUE
•STREAM STORE
Process for Analysis
•BATCH
•INTERACTIVE
•STREAMING
•MESSAGING
•MACHINE LEARNING
Store
•Key Value
•Graph
•Document
•Queue
•MPP
Insights
•Analytical Models
•Visualization
•Self Service BI
Storage of Messaging and Streaming
Criteria
1. How Distributed Services are managed
2. Guaranteed Ordering
3. Data Delivery
4. Data Retention Period
5. Availability
6. Scalability
7. Throughput
8. Parallel Clients
9. Object Size
10. Stream Map Reduce
11. Cost
E.g., Apache Kafka
• Guaranteed ordering, parallel clients and stream MR
• Configurable data retention, availability, object size
• Low cost but more admin
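The ordering and parallel-client criteria can be pictured with a toy partitioned log. This is a minimal in-memory sketch, not a Kafka client: the `PartitionedLog` class and its methods are hypothetical, and CRC32 stands in for whatever partitioner a real system uses. It shows the usual trade-off in partitioned logs such as Kafka: ordering holds within a partition, so records sharing a key stay in order while different partitions can be read by parallel clients.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class PartitionedLog:
    """Toy in-memory model of a partitioned, append-only log (not a Kafka client)."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self.partitions: DefaultDict[int, List[Tuple[str, str]]] = defaultdict(list)

    def partition_for(self, key: str) -> int:
        # CRC32 is used only because it is deterministic and in the standard library.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def append(self, key: str, value: str) -> Tuple[int, int]:
        """Append a record and return its (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int = 0) -> List[Tuple[str, str]]:
        """Read one partition from the given offset, in append order."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
for i in range(5):
    log.append("sensor-1", f"reading-{i}")
    log.append("sensor-2", f"reading-{i}")

# All records for one key live in one partition, so their order is preserved.
partition = log.partition_for("sensor-1")
values = [value for key, value in log.read(partition) if key == "sensor-1"]
print(values)
```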
Databases: Which DB Export to Choose
1. File Size
2. Network Bandwidth
3. Partitioning
4. Bulk Loading
5. CDC and Delta Data Transfers
6. Native connectors and specific connectors for the distribution
Adaptors and GoldenGate, etc.
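The delta-transfer item can be sketched with a high-water-mark query. This is a simplified illustration using an in-memory SQLite table: the `customers` table, the `updated_at` column, and the `extract_delta` helper are all assumptions made for the example. It is a timestamp-based delta rather than log-based CDC, so it does not see deleted rows.

```python
import sqlite3
from typing import List, Tuple


def extract_delta(conn: sqlite3.Connection, last_seen: int) -> Tuple[List[tuple], int]:
    """Return rows changed after `last_seen` plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # Keep the old mark when nothing changed.
    new_mark = max((row[2] for row in rows), default=last_seen)
    return rows, new_mark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "alice", 100), (2, "bob", 105)],
)

# First run: the mark starts at 0, so this behaves like a full load.
first_batch, mark = extract_delta(conn, last_seen=0)

# Source changes between runs: one update and one insert.
conn.execute("UPDATE customers SET name = 'bobby', updated_at = 110 WHERE id = 2")
conn.execute("INSERT INTO customers VALUES (3, 'carol', 112)")

# Second run: only rows newer than the stored mark are transferred.
second_batch, mark = extract_delta(conn, last_seen=mark)
print(len(first_batch), len(second_batch), mark)
```

Moving only the changed rows is what keeps file size and network bandwidth, the first two criteria in the list, under control on repeated loads.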
Data Storage – Distributed Files Criteria
1. Average Latency
2. Typical Data Stored
3. Typical Item Size
4. Request Rate
5. Storage Cost per GB per timeframe
6. Durability
7. Availability
8. Native support for toolsets
9. Active community and open source
Enterprise Distributions Selection
Cloudera, Hortonworks, MapR
Data Storage Selection Criteria
Data Structure: Fixed, Key Value, JSON
Access Patterns: Hierarchical, Structured, Search, Publish, etc.
Data Temperature: Hot, Warm, Cold
TCO : Low
Elastic
Cache
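Data temperature can be turned into a simple routing rule. The sketch below is an assumption-laden illustration: the `AccessProfile` fields, the thresholds, and the tier names in `TIER` are hypothetical and would need tuning against real access logs and costs.

```python
from dataclasses import dataclass


@dataclass
class AccessProfile:
    reads_per_day: float
    days_since_last_access: int


def temperature(profile: AccessProfile) -> str:
    """Classify a dataset as hot, warm, or cold using hypothetical thresholds."""
    if profile.reads_per_day >= 1000 and profile.days_since_last_access <= 1:
        return "hot"
    if profile.reads_per_day >= 10 and profile.days_since_last_access <= 30:
        return "warm"
    return "cold"


# Hypothetical mapping from temperature to a class of store.
TIER = {
    "hot": "cache or key-value store",
    "warm": "SQL or NoSQL store",
    "cold": "distributed file storage",
}

profiles = {
    "session_state": AccessProfile(reads_per_day=50_000, days_since_last_access=0),
    "monthly_report": AccessProfile(reads_per_day=40, days_since_last_access=7),
    "2012_archive": AccessProfile(reads_per_day=0.1, days_since_last_access=400),
}

for name, profile in profiles.items():
    print(name, "->", TIER[temperature(profile)])
```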
Data Storage Selection Criteria
Cache  NoSQL SQL Search 1. Average Latency (ms, sec, min, hours)
2. Typical Volume Stored (GB, TB, PB)
3. Typical Item Size (B, KB, TB, PB)
4. Query Request Rate (High to Very Low)
5. Storage and Maintenance Cost (High – Low)
6. Durability (Low – Very High)
7. Availability (High – Very High)
BATCH INTERACTIVE STREAMING MESSAGING
Machine Learning
Spark ML
EMR etc
Criteria
1. Programming Language Support
2. Availability
3. Speed
4. Scale
5. Query Latency
6. Data Volume
7. Storage Support
8. SQL?
Temperature of Data
Buy Vs Build ETL Decision?
Create Analytical Application
Make Insights Available Via API
Analysis and Visualization
Zeppelin, Hue, etc.
Publish to Queue
Data Modelling in Hadoop & Architectural Patterns
Not only ER and Dimension Models (NoERDM)
Data Storage Format
Text
Sequence
Avro
Parquet
RC/ORC
Know the strengths and weaknesses of each format in terms of:
Supporting distributions
Processing requirements – write, partial read, full read
Schema evolution
Extract requirements
Storage requirements – how big are your files?
How important is file splittability?
Does block compression matter?
Does the file format support indexing?
How easy is it to parse?
Does it support column stats?
Failure behavior for various file formats
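The "partial read" and "column stats" questions are easier to reason about with a toy layout comparison. The sketch below uses plain Python lists, not a real file format: it only mimics the difference between a row-oriented layout (as in Avro or text files) and a columnar layout (as in Parquet or ORC). The sample records and the `to_columns` helper are made up for illustration.

```python
from typing import Dict, List

# Row-oriented layout: each record is stored together.
rows: List[Dict[str, object]] = [
    {"id": 1, "country": "IN", "amount": 120.0},
    {"id": 2, "country": "US", "amount": 80.5},
    {"id": 3, "country": "IN", "amount": 42.0},
]


def to_columns(records: List[Dict[str, object]]) -> Dict[str, list]:
    """Pivot row-oriented records into one list of values per column."""
    columns: Dict[str, list] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Partial read: the columnar layout touches only the 'amount' values,
# while the row layout has to walk every full record.
total_from_columns = sum(columns["amount"])
total_from_rows = sum(record["amount"] for record in rows)

# Column stats: min/max per column let a reader skip data that cannot match a filter.
amount_stats = {"min": min(columns["amount"]), "max": max(columns["amount"])}
print(total_from_columns, total_from_rows, amount_stats)
```

Full-record writes and full reads favour the row layout; analytical queries that touch a few columns favour the columnar one.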
Not only ER and Dimension Models (NoERDM)
Compression Codecs
ZLIB
LZO
LZF
Snappy
Gzip
Bzip2
Considerations
How much does the size reduce?
How fast can it compress and decompress?
Can I split my compressed files? File splittability is needed to make use of parallelism
Compression types
Uncompressed
Record compressed
Block compressed
We trade I/O load for CPU load
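The size-versus-speed considerations can be measured directly. This rough sketch compares only the two codecs that ship with the Python standard library (`zlib` and `bz2`); Snappy, LZO, and LZF need third-party packages and are left out. The repeated CSV-like payload is a made-up sample, so the numbers it prints are not representative of real data.

```python
import bz2
import time
import zlib
from typing import Callable, Dict


def measure(name: str,
            compress: Callable[[bytes], bytes],
            decompress: Callable[[bytes], bytes],
            payload: bytes) -> Dict[str, object]:
    """Time one compress/decompress round trip and report the size ratio."""
    start = time.perf_counter()
    packed = compress(payload)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_seconds = time.perf_counter() - start

    if restored != payload:
        raise ValueError(f"{name} round trip did not restore the payload")
    return {
        "codec": name,
        "ratio": len(payload) / len(packed),
        "compress_seconds": compress_seconds,
        "decompress_seconds": decompress_seconds,
    }


# Hypothetical, highly repetitive sample data.
payload = b"2016-01-01,store-42,sku-1001,19.99\n" * 50_000

results = [
    measure("zlib", zlib.compress, zlib.decompress, payload),
    measure("bz2", bz2.compress, bz2.decompress, payload),
]
for result in results:
    print(result)
```

A higher ratio means less I/O, and the timing columns show the CPU paid for it.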
Other Practices
1. Structure and Organize your repository
a. Standard directory structure
b. Access quota controls
c. Stage area conventions
2. Location of HDFS files
a. Directory structure should simplify the assignment of permissions to be granted.
b. E.g., /user, /etl, /tmp, /data, /app, /metadata
3. Partitioning, bucketing and denormalization
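A standard directory structure and a partitioning scheme can be captured in a couple of helper functions. This sketch assumes Hive-style `key=value` partition directories under the `/data` root from the list above. The dataset name `orders`, the date-based partition keys, and the CRC32-based bucketing are illustrative choices, not the only option.

```python
import zlib
from datetime import date


def partition_path(base: str, dataset: str, event_date: date) -> str:
    """Build a Hive-style partition directory for one day of a dataset."""
    return (
        f"{base}/{dataset}"
        f"/year={event_date.year:04d}"
        f"/month={event_date.month:02d}"
        f"/day={event_date.day:02d}"
    )


def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a key to a fixed bucket using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


path = partition_path("/data", "orders", date(2016, 3, 31))
bucket = bucket_for("customer-17", num_buckets=8)
print(path, bucket)
```

Partitioning lets a query read only the directories it needs, and bucketing spreads the rows inside a partition across a fixed number of files.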
Data Lake / Reservoir / Refinery
Exploratory Data Analysis
Application Level Analytics
Batch and Stream Analytics – Lambda Architecture
Enterprise Data Pipeline
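For the batch-and-stream (Lambda) pattern listed above, the core idea can be sketched in a few lines. A batch layer periodically recomputes a view from all historical data, a speed layer covers events that arrived since the last batch run, and the serving layer merges the two at query time. The event data, the `batch_cutoff` value, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[str, int]  # (page, timestamp)


def count_pages(events: List[Event]) -> Counter:
    """Count page views in a list of events."""
    return Counter(page for page, _ in events)


def serve(batch_view: Counter, speed_view: Counter) -> Counter:
    """Serving layer: merge the batch view with the real-time view."""
    return batch_view + speed_view


events: List[Event] = [("home", 1), ("cart", 2), ("home", 3), ("home", 4), ("cart", 5)]
batch_cutoff = 3  # hypothetical: the last batch run covered timestamps <= 3

# Batch layer: recomputed from the full history up to the cutoff.
batch_view = count_pages([event for event in events if event[1] <= batch_cutoff])

# Speed layer: only the events the batch run has not seen yet.
speed_view = count_pages([event for event in events if event[1] > batch_cutoff])

merged = serve(batch_view, speed_view)
print(merged)
```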
Thank You!
Questions?

Dubai Call Girls Starlet O525547819 Call Girls Dubai Showen DatingDubai Call Girls Starlet O525547819 Call Girls Dubai Showen Dating
Dubai Call Girls Starlet O525547819 Call Girls Dubai Showen Dating
 
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
 
VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Saharanpur Aishwarya 8250192130 Independent Escort Ser...
 
Call Girl in Low Price Delhi Punjabi Bagh 9711199012
Call Girl in Low Price Delhi Punjabi Bagh  9711199012Call Girl in Low Price Delhi Punjabi Bagh  9711199012
Call Girl in Low Price Delhi Punjabi Bagh 9711199012
 
CALL ON ➥8923113531 🔝Call Girls Husainganj Lucknow best Female service 🧳
CALL ON ➥8923113531 🔝Call Girls Husainganj Lucknow best Female service  🧳CALL ON ➥8923113531 🔝Call Girls Husainganj Lucknow best Female service  🧳
CALL ON ➥8923113531 🔝Call Girls Husainganj Lucknow best Female service 🧳
 
Résumé (2 pager - 12 ft standard syntax)
Résumé (2 pager -  12 ft standard syntax)Résumé (2 pager -  12 ft standard syntax)
Résumé (2 pager - 12 ft standard syntax)
 
Sonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call Girls
Sonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call GirlsSonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call Girls
Sonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call Girls
 
Resumes, Cover Letters, and Applying Online
Resumes, Cover Letters, and Applying OnlineResumes, Cover Letters, and Applying Online
Resumes, Cover Letters, and Applying Online
 
VIP Call Girls Service Cuttack Aishwarya 8250192130 Independent Escort Servic...
VIP Call Girls Service Cuttack Aishwarya 8250192130 Independent Escort Servic...VIP Call Girls Service Cuttack Aishwarya 8250192130 Independent Escort Servic...
VIP Call Girls Service Cuttack Aishwarya 8250192130 Independent Escort Servic...
 
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
 
Internshala Student Partner 6.0 Jadavpur University Certificate
Internshala Student Partner 6.0 Jadavpur University CertificateInternshala Student Partner 6.0 Jadavpur University Certificate
Internshala Student Partner 6.0 Jadavpur University Certificate
 
Booking open Available Pune Call Girls Ambegaon Khurd 6297143586 Call Hot In...
Booking open Available Pune Call Girls Ambegaon Khurd  6297143586 Call Hot In...Booking open Available Pune Call Girls Ambegaon Khurd  6297143586 Call Hot In...
Booking open Available Pune Call Girls Ambegaon Khurd 6297143586 Call Hot In...
 
VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...
VIP Call Girls Service Jamshedpur Aishwarya 8250192130 Independent Escort Ser...
 

Architecture styles and deployment on Hadoop

  • 1. Architectural Patterns and Best Practices : #BigData #Hadoop Srividhya Balasubramaniam @ Data and Information Management Consultant Srividhya.logic@gmail.com
  • 3. Agenda • Why are enterprises rethinking their data strategy? • Modernizing Enterprise Data Warehouses • Architectural Patterns and Design Considerations • Best Practices • Analytics Architecture • Application Architecture • Platform Architecture
  • 4. “Because we have been doing stuff this way for ages!…… ” is not the norm Re-Think!
  • 5. Drivers of Change What Has not changed DATA QUALITY AND GOVERNANCE INFORMATION SECURITY METADATA MANAGEMENT DATA SOURCES DATA STORE DATA ACCESS ORCHESTRATION AND SCHEDULING
  • 7. What is the Right Tool? How should I use the tool Reference Architecture? What Language and tool should I learn Why?Why? Why? Why? What's like data modelling in Hadoop Buy or build?
  • 8. Core Design Principles • What Business Problem is being Solved? • Define Tool Selection Criteria • Decouple processing store and systems • Hybrid Architecture: Leverage Batch and Stream • Scalable, Reliable, Fit for Purpose, Secure • Available, Very low Admin Cost • Supportable and Operations Monitoring • Best Design is cheap
  • 9. Typical Data Pipeline: Data Source Ingest •RDBMS •SEARCH •FILES/API •MESSAGING •IOT/STREAM Store Raw •DATABASE •SEARCH DOCUMENTS •DIST FILE STORAGE •QUEUE •STREAM STORE Process for Analysis •BATCH •INTERACTIVE •STREAMING •MESSAGING •MACHINE LEARNING Store •Key Value •Graph •Document •Queue •MPP Insights •Analytical Models •Visualization •Self Service BI. Storage of Messaging and Streaming: Criteria 1. How Distributed Services are managed 2. Guaranteed Ordering 3. Data Delivery 4. Data Retention Period 5. Availability 6. Scalability 7. Throughput 8. Parallel Clients 9. Object Size 10. Stream Map Reduce 11. Cost. E.g. Apache Kafka • Guaranteed Ordering, Parallel Clients and Stream MR • Configurable Data Retention, Availability, Object Size • Low cost but more admin
  • 10. Typical Data Pipeline: Databases. What DB export to choose: 1. File Size 2. Network Bandwidth 3. Partitioning 4. Bulk Loading 5. CDC and Delta Data Transfers 6. Native connectors and specific connectors for Distribution Adaptors, GoldenGate, etc.
  • 11. Typical Data Pipeline: Data Storage, Distributed Files. Criteria 1. Average Latency 2. Typical Data Stored 3. Typical Item Size 4. Request Rate 5. Storage Cost per GB / timeframe 6. Durability 7. Availability 8. Native support for toolsets 9. Active community and open source. Enterprise Distribution Selection: Cloudera, Hortonworks, MapR
  • 12. Typical Data Pipeline: Data Storage Selection Criteria. Data Structure: Fixed, Key Value, JSON. Access Patterns: Hierarchical, Structured, Search, Publish, etc. Data Temperature: Hot, Warm, Cold. TCO: Low. Elastic Cache
  • 13. Typical Data Pipeline: Data Storage Selection Criteria. Cache, NoSQL, SQL, Search. 1. Average Latency (ms, sec, min, hours) 2. Typical Volume Stored (GB, TB, PB) 3. Typical Item Size (B, KB, TB, PB) 4. Query Request Rate (High to Very Low) 5. Storage and Maintenance Cost (High – Low) 6. Durability (Low – Very High) 7. Availability (High – Very High). Data Structure: Fixed, Key Value, JSON. Access Patterns: Hierarchical, Structured, Search, Publish, etc. Data Temperature: Hot, Warm, Cold. TCO: Low
  • 14. Typical Data Pipeline: Process for Analysis. BATCH, INTERACTIVE, STREAMING, MESSAGING, Machine Learning (Spark ML, EMR, etc.). Criteria 1. Programming Language Support 2. Availability 3. Speed 4. Scale 5. Latency Query 6. Data Volume 7. Storage Support 8. SQL? Temperature of Data
  • 15. Typical Data Pipeline: Buy vs Build ETL Decision?
  • 16. Typical Data Pipeline: Insights. Create Analytical Application; Make Insights Available Via API; Analysis and Visualization (Zeppelin, Hue, etc.); Publish to Queue
  • 17. Data Modelling in Hadoop & Architectural Patterns
  • 18. Not only ER and Dimension Models (NoERDM). Data Storage Formats: Text, Sequence, Avro, Parquet, RC/ORC. Know the strengths and weaknesses of each format in terms of: Supporting Distributions; Processing requirements (write, partial read, full read); Schema Evolution; Extract Requirements; Storage Requirements. How big are your files? How important is file splittability? Does block compression matter? Does the file format support indexing? How easy is it to parse? Does it support column stats? Failure behavior for various file formats.
  • 19. Not only ER and Dimension Models (NoERDM). Compression Codecs: ZLIB, LZO, LZF, Snappy, Gzip, Bzip. Considerations: How much does the size reduce? How fast can it compress and decompress? How can I split my compressed files? File splittability makes use of parallelism. Compression types: Uncompressed, Record compressed, Block compressed. We trade I/O load for CPU load.
  • 20. Other Practices 1. Structure and organize your repository a. Standard directory structure b. Access quota controls c. Stage area conventions 2. Location of HDFS files a. Directory structure should simplify the assignment of permissions to be granted. b. E.g. /user, /etl, /tmp, /data, /app, /metadata 3. Partitioning, bucketing and denormalization.
  • 21. Data Lake / Reservoir / Refinery • Exploratory Data Analysis • Application Level Analytics • Batch and Stream Analytics: Lambda Architecture • Enterprise Data Pipeline
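
The directory-structure, partitioning and bucketing practices from slide 20 can be sketched roughly as follows. This is a minimal illustration, not the deck's own method: the table name `events`, the `year=/month=/day=` layout, the bucket count and the use of CRC32 as the bucketing hash are all assumptions made for the example (Hive, for instance, uses its own hash function for bucketed tables). Only the `/data` root comes from the slide.

```python
import zlib
from datetime import date


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to a bucket number in [0, num_buckets).

    CRC32 is used because it is stable across runs, unlike Python's
    built-in hash() for strings. It is an illustrative choice only.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def partition_path(base: str, table: str, day: date, key: str,
                   num_buckets: int = 8) -> str:
    """Build a date-partitioned, bucketed path under a standard root.

    The layout (hypothetical) is:
    <base>/<table>/year=YYYY/month=MM/day=DD/bucket_NNNNN
    """
    bucket = bucket_for(key, num_buckets)
    return (
        f"{base}/{table}"
        f"/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
        f"/bucket_{bucket:05d}"
    )


if __name__ == "__main__":
    # Records for the same key always land in the same bucket directory.
    print(partition_path("/data", "events", date(2016, 3, 7), "user-42"))
```

Keeping the date in the directory name lets a query prune whole partitions and lets permissions be assigned per directory, while the bucket number spreads keys evenly within a partition.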

Editor's Notes

  1. Broken Promise – No single version of truth. At least we can persist raw data.
  2. Obstacles with Big Data
  11. Data Storage Formats: Considerations. File Format: Know the strengths and weaknesses of each format in terms of: Supporting Distributions; Processing requirements (write, partial read, full read); Schema Evolution; Extract Requirements; Storage Requirements. How big are your files? How important is file splittability? Does block compression matter? Does the file format support indexing? How easy is it to parse? Does it support column stats? Failure behavior for various file formats.
  12. Compression considerations: Any codec can be made splittable using a container format. Enable compression for MapReduce intermediate steps to enhance performance. Pay attention to how data is ordered: compression happens in chunks, so the entropy of these chunks is important. Ordering data before storing it will enable better compression. Use a compact file format with support for splittability, e.g. SequenceFile and Avro.
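
The point in note 12 about ordering can be checked with a small experiment. The sketch below is only an approximation of what happens inside a Hadoop file format: it uses Python's standard `zlib` module as a stand-in codec, and the record layout (`country=...,user=...`), record count and seed are made up for the illustration.

```python
import random
import zlib


def compressed_size(lines):
    """Return the zlib-compressed size in bytes of newline-joined lines."""
    payload = "\n".join(lines).encode("utf-8")
    return len(zlib.compress(payload, 6))


def make_records(n=20000, seed=0):
    """Generate hypothetical low-cardinality records in random order."""
    rng = random.Random(seed)
    countries = ["IN", "US", "DE", "BR"]
    return [
        f"country={rng.choice(countries)},user={rng.randrange(1000):04d}"
        for _ in range(n)
    ]


if __name__ == "__main__":
    records = make_records()
    print("unsorted:", compressed_size(records), "bytes")
    print("sorted:  ", compressed_size(sorted(records)), "bytes")
```

Sorting places similar records next to each other, so each chunk the codec sees has lower entropy and compresses to fewer bytes. The same reasoning is why sorting or clustering on low-cardinality columns before writing tends to help in block-compressed formats.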