1Page
Building a Self Service Analytics Platform on Hadoop
Avinash Ramineni
2Page
Clairvoyant
3Page
Clairvoyant Services
4Page
Quick Poll
• Big Data Deployments in Prod
• Hadoop Distributions
• People use Ecosystems rather than tools
• Architecture was implemented on Cloudera
• Cloud Experience – AWS?
5Page
Challenges
• Data in Silos
• Data acquires perspectives as it is moved
• Data availability delays
• Legacy systems handling the Volume, Veracity and Velocity
• Extracting data from legacy systems
• Lack of Self-Service Capabilities
• Knowledge becomes tribal – instead of institutional
• Security / Compliance Requirements
6Page
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management
7Page
Without Self-Service
8Page
Self-Service at all Levels
Ingest Organize Enrich Analyze Dashboards
Ingest Organize Enrich Analyze Insights
9Page
Key Design Tenets
• Separation of Compute and Storage
• Independently scale compute and storage
• Data Democratization and Governance
• Bring your own Compute (BYOC)
• HA / DR
• Open Source Stack
10Page
Separation of Compute and Storage
• Scale storage and compute independently
• Shifts bottleneck from Disk IO to Network
• Centralized Data Storage
• Data Democratization
• No data duplication
• Easier Hardware upgrade paths
• Flexible Architecture
• DR Simplified
11Page
BYOC (Bring Your Own Cluster)
• Each department/application can bring its own Hadoop cluster
• Eliminates the need for very large clusters
• Easier to administer and maintain
• Reduces multi-tenancy issues
• Clusters can be upgraded independently
• Enables a usage-based cost model
Diagram: Marketing Cluster, Personalization Cluster, and Main Cluster sharing Centralized / Common S3 Storage
12Page
Architecture
13Page
Architecture – Data Ingestion Layer
• DB Ingestor
• Stream Ingestor
• Kafka and Spark Streaming
• File Ingestor
• FTP / SFTP / Logs
• Ingestion using Service API
14Page
Architecture – Data Processing Layer
• Storage layer carved into logical buckets
• Landing, Raw, Derived and Delivery
• Schema stored with data (no guesswork)
• Platform Jobs
• Converting text to Parquet
• Saving streaming data as Parquet
• Derivatives
• Compaction
• Standardization
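The logical buckets above can be sketched as a small path-building helper. This is only an illustration, not the platform's actual layout: the bucket name `example-data-lake`, the `zone/source/dataset/dt=...` key structure, and the function name `zone_path` are all assumptions.

```python
from datetime import date

# The four logical zones named on the slide.
ZONES = ("landing", "raw", "derived", "delivery")


def zone_path(zone: str, source: str, dataset: str, day: date,
              bucket: str = "example-data-lake") -> str:
    """Build an S3-style prefix for one dataset partition in a given zone.

    The bucket name and key layout are hypothetical, chosen for illustration.
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"s3://{bucket}/{zone}/{source}/{dataset}/dt={day.isoformat()}/"


if __name__ == "__main__":
    # Print where the same dataset would live in each zone.
    for z in ZONES:
        print(zone_path(z, "crm", "orders", date(2017, 1, 15)))
```

Keeping the layout in one function means platform jobs (text-to-Parquet conversion, compaction, standardization) can agree on where each zone lives without hard-coding paths.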
15Page
Architecture – Data Delivery Layer
• Data Delivery
• SQL - Spark Thrift Server / Impala
• Tableau, SQL IDE, Applications
• Self Service
• Derivatives
• Represented Via SQL on Delivery Layer
• Stored in Derived Storage Layer
• Metadata driven
• Derived Layer Generators
• Long running Spark Job
• Derivative Refresh
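One way to picture "metadata driven" derivatives is a record per derivative holding its SQL and refresh interval, plus a check that a long-running job could call to decide what to rebuild. The sketch below assumes that shape; the `Derivative` fields and the `due_for_refresh` function are hypothetical names, not taken from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Derivative:
    """Metadata describing one derivative (all field names are hypothetical)."""
    name: str                                  # table name exposed on the delivery layer
    sql: str                                   # SQL that defines the derivative
    refresh_every: timedelta                   # how often it should be rebuilt
    last_refreshed: Optional[datetime] = None  # None means never built


def due_for_refresh(derivatives: List[Derivative], now: datetime) -> List[Derivative]:
    """Return derivatives never built, or whose refresh interval has elapsed."""
    return [
        d for d in derivatives
        if d.last_refreshed is None or now - d.last_refreshed >= d.refresh_every
    ]
```

A generator job would run each due derivative's SQL, write the result to the derived storage layer, and then update `last_refreshed`.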
16Page
Key Takeaways - Cloud
• Hadoop cloud readiness
• Cloudera Director Limitations
• Multi-Availability zone, regions
• Storage
• Instance Storage
• EBS Volumes
• gp2 vs st1
• S3 Eventual Consistency
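With eventual consistency, a job that lists a prefix right after another job wrote to it may not yet see the new objects. One common mitigation, which is not necessarily the one used on this platform, is to poll with exponential backoff before reading. In the sketch below, `is_visible` is a caller-supplied check and every name is an assumption.

```python
import time
from typing import Callable


def wait_until_visible(is_visible: Callable[[], bool],
                       attempts: int = 5,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll is_visible() with exponential backoff; return True once it succeeds.

    is_visible is a hypothetical caller-supplied check, for example a listing
    call that looks for the object a previous job just wrote.
    """
    for attempt in range(attempts):
        if is_visible():
            return True
        # Back off before the next attempt, but not after the final one.
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.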
17Page
Key Takeaways - Spark Thrift Server
• Spark Thrift Server Support
• Performance Tuning
• Concurrency
• Partition strategy
• Cache Tables
• Compression Codec for Parquet
• Snappy vs gzip
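The tuning points above map onto ordinary Spark settings. The sketch below collects a few real Spark configuration keys and renders them as `--conf` arguments of the kind `spark-submit` accepts; the values are illustrative placeholders rather than settings reported in the talk, and the helper name `conf_args` is an assumption.

```python
from typing import Dict, List

# Real Spark configuration keys; the values are illustrative starting points,
# not recommendations from the talk.
THRIFT_SERVER_CONF: Dict[str, str] = {
    "spark.scheduler.mode": "FAIR",                   # share resources across concurrent queries
    "spark.sql.parquet.compression.codec": "snappy",  # or "gzip": smaller files, more CPU
    "spark.sql.shuffle.partitions": "200",            # tune to data volume and cluster size
}


def conf_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dict as --conf key=value arguments (spark-submit style)."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(conf_args(THRIFT_SERVER_CONF)))
```

Frequently queried tables can additionally be pinned in memory with Spark SQL's `CACHE TABLE` statement, issued through the same SQL endpoint.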
18Page
Key Takeaways - Security
• Secure by Design, Secure by Default
• Access to Data on S3
• IAM Roles
• Sentry
• Support for Spark
• Kerberos
• Spark Thrift Server
• Navigator
• Support for Spark
19Page
Key Takeaways - General
• Rapidly Changing Technology
• Feature addition
• Documentation
• Bugs
• Jar hell
• Small files
• Performance Issues
• Compaction
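Compaction addresses the small-files problem by rewriting many small files as fewer large ones. A minimal sketch of the planning step follows, assuming a simple greedy grouping by size; the 128 MiB target and the function name `plan_compaction` are assumptions, and the actual rewrite (for example a Spark job reading each batch and writing one file) is left out.

```python
from typing import List, Tuple


def plan_compaction(files: List[Tuple[str, int]],
                    target_bytes: int = 128 * 1024 * 1024) -> List[List[str]]:
    """Group small files into batches of at most roughly target_bytes each.

    files is a list of (path, size_in_bytes). Files already at or above the
    target are left alone. Each returned batch would be rewritten as one file.
    """
    small = sorted((f for f in files if f[1] < target_bytes), key=lambda f: f[1])
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in small:
        if current and current_size + size > target_bytes:
            # Rewriting a single file on its own gains nothing, so skip it.
            if len(current) > 1:
                batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if len(current) > 1:
        batches.append(current)
    return batches
```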
20Page
Key Takeaways - General
• Partition Strategy
• Parquet Files
• Balancing parallelism and throughput
• Table Partitions
• Cluster sizing, optimization and tuning
• Integrating with Corporate infrastructure
• Deployment practices
• Monitoring and Alerting
• Information Security Policies
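Balancing parallelism and throughput for Parquet largely comes down to how many files each table partition is split into. The arithmetic can be sketched as below; the 256 MiB target size, the Hive-style `key=value` directory naming, and both function names are assumptions made for illustration.

```python
import math


def output_file_count(total_bytes: int,
                      target_file_bytes: int = 256 * 1024 * 1024) -> int:
    """Number of files to write so each is close to target_file_bytes (at least 1)."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes > 0")
    return max(1, math.ceil(total_bytes / target_file_bytes))


def partition_path(table_root: str, **keys: str) -> str:
    """Hive-style partition directory, e.g. root/year=2017/month=01."""
    parts = [f"{k}={v}" for k, v in keys.items()]
    return "/".join([table_root.rstrip("/")] + parts)
```

More files per partition gives more read parallelism, while fewer and larger files reduce per-file overhead, so the target size is the knob to tune per workload.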
21Page
Data Security
22Page
Questions
• Principal @ Clairvoyant
• Email: avinash@clairvoyantsoft.com
• LinkedIn: https://www.linkedin.com/in/avinashramineni

Programming Foundation Models with DSPy - Meetup Slides
 

Building A Self Service Analytics Platform on Hadoop

  • 1. Building a Self Service Analytics Platform on Hadoop • Avinash Ramineni
  • 4. Quick Poll • Big Data Deployments in Prod • Hadoop Distributions • People use Ecosystems rather than tools • Architecture was implemented on Cloudera • Cloud Experience – AWS?
  • 5. Challenges • Data in Silos • Acquires Perspectives as data is moved • Data availability delays • Legacy Systems handling the Volume, Veracity and Velocity • Extracting data from legacy systems • Lack of Self-Service Capabilities • Knowledge becomes tribal instead of institutional • Security / Compliance Requirements
  • 6. Data Lake Attributes • Data Democratization • Data Discovery • Data Lineage • Self-Service capabilities • Metadata Management
  • 8. Self-Service at all Levels • Ingest → Organize → Enrich → Analyze → Dashboards / Insights
  • 9. Key Design Tenets • Separation of Compute and Storage • Independently scale compute and storage • Data Democratization and Governance • Bring your own Compute (BYOC) • HA / DR • Open Source Stack
  • 10. Separation of Compute and Storage • Scale storage and compute independently • Shifts bottleneck from Disk IO to Network • Centralized Data Storage • Data Democratization • No data duplication • Easier Hardware upgrade paths • Flexible Architecture • DR Simplified
  • 11. BYOC (Bring Your Own Cluster) • Each department/application can bring its own Hadoop cluster • Eliminates the need for very large clusters • Easier to administer and maintain • Reduces multi-tenancy issues • Clusters can be upgraded independently • Enables usage based cost model • Diagram: Marketing, Personalization and Main clusters sharing Centralized / Common S3 Storage
  • 13. Architecture – Data Ingestion Layer • DB Ingestor • Stream Ingestor • Kafka and Spark Streaming • File Ingestor • FTP / SFTP / Logs • Ingestion using Service API
  • 14. Architecture – Data Processing Layer • Storage layer carved into logical buckets • Landing, Raw, Derived and Delivery • Schema stored with data (no guesswork) • Platform Jobs • Converting text to Parquet • Saving streaming data as Parquet • Derivatives • Compaction • Standardization
  • 15. Architecture – Data Delivery Layer • Data Delivery • SQL - Spark Thrift Server / Impala • Tableau, SQL IDE, Applications • Self Service • Derivatives • Represented via SQL on Delivery Layer • Stored in Derived Storage Layer • Metadata driven • Derived Layer Generators • Long running Spark Job • Derivative Refresh
  • 16. Key Takeaways - Cloud • Hadoop Cloud readiness • Cloudera Director Limitations • Multi-Availability zone, regions • Storage • Instance Storage • EBS Volumes • gp2 vs st1 • S3 Eventual Consistency
  • 17. Key Takeaways - Spark Thrift Server • Spark Thrift Server Support • Performance Tuning • Concurrency • Partition strategy • Cache Tables • Compression Codec for Parquet • Snappy vs gzip
  • 18. Key Takeaways - Security • Secure by Design, Secure by Default • Access to Data on S3 • IAM Roles • Sentry • Support for Spark • Kerberos • Spark Thrift Server • Navigator • Support for Spark
  • 19. Key Takeaways - General • Rapidly Changing Technology • Feature addition • Documentation • Bugs • Jar hell • Small files • Performance Issues • Compaction
  • 20. Key Takeaways - General • Partition Strategy • Parquet Files • Balancing parallelism and throughput • Table Partitions • Cluster sizing, optimization and tuning • Integrating with Corporate infrastructure • Deployment practices • Monitoring and Alerting • Information Security Policies
  • 22. Questions • Principal @ Clairvoyant • Email: avinash@clairvoyantsoft.com • LinkedIn: https://www.linkedin.com/in/avinashramineni
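
Slides 14 and 19 list compaction as a platform job and as the answer to the small-files problem, but the deck does not show how it was done. The following is only a minimal sketch of one way to plan such a job, not the speaker's implementation. It assumes file sizes have already been listed from storage, and it groups small files into batches that each fit under a target output size. The names `DataFile`, `plan_compaction` and the 128 MiB target are hypothetical choices made for this illustration.

```python
from dataclasses import dataclass
from typing import List

# Assumed target size for a compacted output file (not stated in the slides).
TARGET_BYTES = 128 * 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    """A file in one table partition, as returned by a storage listing."""
    path: str
    size_bytes: int


def plan_compaction(files: List[DataFile],
                    target_bytes: int = TARGET_BYTES) -> List[List[DataFile]]:
    """Group small files into batches whose combined size is <= target_bytes.

    Uses first-fit decreasing bin packing. Files already at or above the
    target are skipped, and batches holding a single file are dropped
    because rewriting them would not reduce the file count.
    """
    small = sorted(
        (f for f in files if f.size_bytes < target_bytes),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    batches: List[List[DataFile]] = []
    totals: List[int] = []
    for f in small:
        for i, total in enumerate(totals):
            if total + f.size_bytes <= target_bytes:
                batches[i].append(f)
                totals[i] += f.size_bytes
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([f])
            totals.append(f.size_bytes)
    return [b for b in batches if len(b) > 1]


if __name__ == "__main__":
    mib = 1024 * 1024
    listing = [DataFile(f"part-{i:05d}.parquet", s * mib)
               for i, s in enumerate([100, 60, 50, 40, 30, 200])]
    for batch in plan_compaction(listing):
        print([f.path for f in batch])
```

Each returned batch would then be read and rewritten as a single file by whatever engine runs the job, for example a Spark job per partition. Planning by size rather than by file count keeps the output files close to a uniform size, which is one way to approach the balance between parallelism and throughput noted on slide 20.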