
Hadoop Data Modeling

Slides from the May 2018 St. Louis Big Data Innovation, Data Engineering, and Analytics (Big Data IDEA) user group meeting. The presentation focused on data modeling in Hive.

Slide Transcript

  • 1. Hadoop Data Modeling (05-02-2018)
  • 2. Agenda
    • What's the Big Data Innovation, Data Engineering, Analytics Group?
    • Data Modeling in Hadoop
    • Questions
  • 3. It started with an article
  • 4. And a name change
  • 5. Which led to some questions
    • What is the future of the Hadoop ecosystem?
    • What is the dividing line between Spark and Hadoop?
    • What are the big players doing?
    • How does the push to cloud technologies affect Hadoop usage?
    • How does streaming come into play?
  • 6. And then our answer
    • Hadoop is here to stay, but it will make the most strides as a machine learning platform.
    • Spark can perform many of the same tasks that elements of the Hadoop ecosystem can, but it is missing some existing features out of the box.
    • Cloudera, Hortonworks, and MapR are positioning themselves as data processing platforms with roots in Hadoop, but with other aspirations. For example, Cloudera is positioning itself as a machine learning platform.
    • The push to cloud means that the distributed filesystem of HDFS may be less important to cloud-based deployments, but Hadoop ecosystem projects are adapting to work with cloud sources.
    • The Hadoop ecosystem projects have proven patterns for ingesting streaming data and turning it into information.
  • 7. Introducing …
    We're now going to be the St. Louis Big Data Innovation, Data Engineering, and Analytics Group, or more simply put: St. Louis Big Data IDEA.
  • 8. So what is the STL Big Data IDEA interested in?
    • Local companies
    • Big Data: Hadoop, cloud deployments, cloud-native technologies, Spark, Kafka
    • Innovation: new Big Data projects, services, and applications
    • Data Engineering: streaming data, batch data analysis, machine learning pipelines, data governance, ETL @ scale
    • Analytics: visualization, machine learning, reporting, forecasting
  • 9. Introducing our New Board Member
    • Scott Shaw has been with Hortonworks for four years.
    • He is the author of four books, including Practical Hive and the Internet of Things and Data Analytics Handbook.
    • Scott will be helping our group find speakers in the open source community. Please help me welcome Scott to the group in his new role.
  • 10. Agenda
    • The Schema-on-Read Promise
    • File formats and compression formats
    • Schema design – data layout
    • Indexes, partitioning, and bucketing
    • Join performance
    • Hadoop SQL boost – Tez, cost-based optimizations, and LLAP
    • Summary
  • 11. Introducing our Speakers
    • Adam Doyle – Co-Organizer, St. Louis Big Data IDEA; Big Data Community Lead, Daugherty Business Solutions
    • Drew Marco – Board Member & Secretary, TDWI; Data and Analytics Line of Service Leader, Daugherty Business Solutions
  • 12. Schema On Read
    Schema on Write:
    • Schemas are typically purpose-built and hard to change
    • Generally loses the raw/atomic data as a source
    • Requires considerable modeling/implementation effort before being able to work with the data
    • If a certain type of data can't be confined in the schema, you can't effectively store or use it (if you can store it at all)
    Schema on Read:
    • Slower results
    • Preserves the raw/atomic data as a source
    • Flexibility to add, remove, and modify columns
    • Data may be riddled with missing or invalid values and duplicates
    • Suited for data exploration; not recommended for repetitive querying or high-performance use
    Real-world use of Hadoop/Hive that requires high-performing queries on large data sets requires up-front planning and data modeling.
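To make the schema-on-read side concrete, here is a minimal HiveQL sketch (not from the deck): an external table whose schema is applied only at query time, over raw CSV files already landed in HDFS. The table name and path are illustrative.

```sql
-- Schema-on-read: the files under /data/raw/employees stay untouched;
-- this schema is applied when the table is queried, not when data is loaded.
-- Path and table name are hypothetical.
CREATE EXTERNAL TABLE employees_raw (
  emp_no     INT,
  birth_date STRING,
  first_name STRING,
  last_name  STRING,
  gender     STRING,
  hire_date  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/employees';
```

Dropping an external table removes only the metadata, so the raw/atomic data is preserved as a source, which is the trade-off the slide describes.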
  • 13. Schema Design – Data Layout
    "The primary reason to avoid normalization is to minimize disk seeks, such as those typically required to navigate foreign key relations. Denormalizing data permits it to be scanned from or written to large, contiguous sections of disk drives, which optimizes I/O performance. However, you pay the penalty of denormalization, data duplication and the greater risk of inconsistent data." (Source: Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen, O'Reilly Media)
    Normalization:
    • Pros: reduces data redundancy; decreases the risk of inconsistent datasets
    • Cons: requires disk seeks to navigate foreign-key relations; less efficient scans
    Denormalization:
    • Pros: minimizes disk seeks; data is stored in large, contiguous disk drive segments
    • Cons: data duplication (less efficient storage); increased risk of inconsistent data; often requires reorganizing the source data (slower writes)
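As an illustration of the denormalization trade-off (a sketch, not a step from the deck), a flattened table can be built once with a CTAS join so later queries scan one contiguous dataset instead of re-joining; the table name is hypothetical.

```sql
-- Denormalize employees + salaries into one wide table: faster scans,
-- at the cost of duplicating employee attributes on every salary row.
CREATE TABLE employee_salaries_flat
STORED AS ORC
AS
SELECT e.emp_no,
       e.first_name,
       e.last_name,
       s.salary,
       s.from_date,
       s.to_date
FROM employees e
JOIN salaries  s ON s.emp_no = e.emp_no;
```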
  • 14. Introducing Our Use Case
    The MySQL sample employees database (https://dev.mysql.com/doc/employee/en/):
    • Departments: dept_no, name
    • Dept_emp: dept_no, emp_no, from_date, to_date
    • Employees: emp_no, birth_date, first_name, last_name, gender, hire_date
    • Dept_manager: dept_no, emp_no, from_date, to_date
    • Titles: emp_no, title, from_date, to_date
    • Salaries: emp_no, salary, from_date, to_date
  • 15. Data Storage Decisions
    • Hadoop is a file system: there is no standard data storage format in Hadoop
    • Optimal storage of data is determined by how the data will be processed
    • Typical input data is in JSON, XML, or CSV
    Major considerations: file formats and compression.
  • 16. Parquet
    • Faster access to data
    • Efficient columnar compression
    • Effective for select queries
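A hedged sketch of creating a Parquet-backed copy of a table: `STORED AS PARQUET` is native Hive syntax, and the Snappy pairing via table properties is a common choice rather than deck guidance; table names are illustrative.

```sql
-- Columnar Parquet copy of the raw table; Snappy is a typical codec pairing.
CREATE TABLE employees_parquet
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
AS SELECT * FROM employees_raw;
```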
  • 17. ORCFile
    • High performance: splittable, columnar storage file
    • Efficient reads: breaks data into large "stripes" for efficient reads
    • Fast filtering: built-in indexes, min/max values, and metadata for fast filtering of blocks; Bloom filters if desired
    • Efficient compression: decomposes complex row types into primitives, yielding massive compression and efficient comparisons for filtering
    • Precomputation: built-in aggregates per block (min, max, count, sum, etc.)
    • Proven at 300 PB scale: Facebook uses ORC for their 300 PB Hive warehouse
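A sketch tying the ORC features above to DDL: compression, stripe size, indexing, and a Bloom filter on the join key are all set through table properties. The property values here are illustrative, not tuned recommendations.

```sql
-- ORC table exercising the features listed above. emp_no gets a Bloom
-- filter because it is the frequent join key in this deck's queries.
CREATE TABLE salaries_orc (
  emp_no    INT,
  salary    INT,
  from_date STRING,
  to_date   STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'             = 'ZLIB',
  'orc.stripe.size'          = '268435456',   -- 256MB stripes
  'orc.create.index'         = 'true',
  'orc.bloom.filter.columns' = 'emp_no'
);
```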
  • 18. Avro
    • JSON-based schema
    • Cross-language file format for Hadoop
    • Schema evolution was a primary goal
    • Good for SELECT * queries
    • Schema segregated from the data
    • Row-major format
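For comparison, a minimal Avro-backed table; since Hive 0.14, `STORED AS AVRO` derives a JSON schema from the Hive columns and stores it alongside the row-major data, which is what enables the schema evolution described above. Table names are illustrative.

```sql
-- Row-major Avro copy; the Avro JSON schema is derived from these columns
-- and kept with the data, so columns can be added later with defaults.
CREATE TABLE employees_avro
STORED AS AVRO
AS SELECT * FROM employees_raw;
```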
  • 19. Comparison of file formats
    Query 1: select count(*) from employees e join salaries s on s.emp_no = e.emp_no join titles t on t.emp_no = e.emp_no;
    Query 2: select d.name, count(1), d.first_name, d.last_name from (select d.dept_no, d.dept_name as name, m.first_name as first_name, m.last_name as last_name from departments d join dept_manager dm on dm.dept_no = d.dept_no join employees m on dm.emp_no = m.emp_no where dm.to_date='9999-01-01') d join dept_emp de on de.dept_no = d.dept_no join employees e on de.emp_no = e.emp_no group by d.name, d.first_name, d.last_name;

               Text     Avro     ORC      Parquet
    Query 1    42.696   48.934   25.846   26.081
    Query 2    59.536   63.08    27.954   26.073
    Size       124M     134M     16.7M    30.5M
  • 20. Compression
    • Not just for storage (data-at-rest) but also critical for disk/network I/O (data-in-motion)
    • Splittability of the compression codec is an important consideration
    Snappy: high speed with reasonable compression; not splittable on its own, so it is only used inside a container format such as Avro
    LZO: optimized for speed as opposed to size; splittable, but requires an additional indexing step; not shipped with Hadoop
    Gzip: optimized for size; write performance is about half of Snappy's, read performance is as good as Snappy's; not splittable; smaller blocks = better performance
    bzip2: optimized for size (about 9% better than Gzip); splittable; very slow, so its primary use is archival on Hadoop
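A sketch of the session settings that put a codec to work for data-in-motion as well as data-at-rest; these are stock Hive/Hadoop properties, with Snappy chosen here purely as an example.

```sql
-- Compress final job output and the intermediate map/reduce shuffle data.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.intermediate=true;
```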
  • 21. Partitioning & Bucketing
    • Partitioning is useful for chronological columns that don't have a very high number of possible values
    • Bucketing is most useful for tables that are "most often" joined together on the same key
    • Skews are useful when one or two column values dominate the table
  • 22. Partitioning
    • Without partitions, every query reads the entire table even when processing a subset of the data (full-table scan)
    • Partitioning breaks up data horizontally by column value sets
    • When partitioning, you use one or more "virtual" columns to break up the data
    • Virtual columns cause directories to be created in HDFS
    • Static partitioning versus dynamic partitioning
    • Partitioning makes queries go fast, and it works particularly well when querying on the "virtual column"
    • If queries use various columns, it may be hard to decide which columns to partition by
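A sketch of this pattern using the deck's use case: dept_no is a low-cardinality column, so it works as the partition ("virtual") column, and dynamic partitioning fills the HDFS directories on insert. Table names are illustrative.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- dept_no becomes a "virtual" column: one HDFS directory per department.
CREATE TABLE dept_emp_part (
  emp_no    INT,
  from_date STRING,
  to_date   STRING
)
PARTITIONED BY (dept_no STRING)
STORED AS ORC;

-- Dynamic partitioning: the partition column must come last in the SELECT.
INSERT OVERWRITE TABLE dept_emp_part PARTITION (dept_no)
SELECT emp_no, from_date, to_date, dept_no
FROM dept_emp;
```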
  • 23. Bucketing
    • Used to strike a balance in file sizes within a partition
    • Breaks up data by hashed key sets
    • When bucketing, you specify the number of buckets
    • Works particularly well when many queries contain joins, especially when the two data sets are bucketed on the join key
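A sketch of bucketed DDL in the spirit of the emp_buck / dept_emp_buck tables used in the comparison on the next slide; the bucket count is illustrative, and `hive.enforce.bucketing` is only needed on Hive 1.x (Hive 2.x always enforces it).

```sql
SET hive.enforce.bucketing=true;  -- Hive 1.x only; implicit in Hive 2.x

-- Hash rows into a fixed number of buckets on the join key; joining two
-- tables bucketed the same way on emp_no enables bucketed joins.
CREATE TABLE emp_buck (
  emp_no     INT,
  birth_date STRING,
  first_name STRING,
  last_name  STRING,
  gender     STRING,
  hire_date  STRING
)
CLUSTERED BY (emp_no) INTO 32 BUCKETS
STORED AS ORC;
```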
  • 24. Comparison
    Query (against the bucketed tables): select d.name, count(1), d.first_name, d.last_name from (select d.dept_no, d.dept_name as name, m.first_name as first_name, m.last_name as last_name from departments d join dept_manager dm on dm.dept_no = d.dept_no join employees m on dm.emp_no = m.emp_no where dm.to_date='9999-01-01') d join dept_emp_buck de on de.dept_no = d.dept_no join emp_buck e on de.emp_no = e.emp_no group by d.name, d.first_name, d.last_name;

            Text     Partitioned   Bucketed
    Time    59.536   59.652        55.196
  • 25. Join Performance – Map-Side Joins
    • Good when one table is small enough to fit in RAM
    • Well suited to star schemas (e.g. joining against dimension tables)
  • 26. Reduce-Side Joins
    • The default Hive join
    • Works with data of any size
  • 27. Comparison
    Query (with a map-join hint): select /*+ MAPJOIN(d) */ d.name, count(1), d.first_name, d.last_name from (select d.dept_no, d.dept_name as name, m.first_name as first_name, m.last_name as last_name from departments d join dept_manager dm on dm.dept_no = d.dept_no join employees m on dm.emp_no = m.emp_no where dm.to_date='9999-01-01') d join dept_emp_buck de on de.dept_no = d.dept_no join emp_buck e on de.emp_no = e.emp_no group by d.name, d.first_name, d.last_name;

            Map-Side   Reduce-Side
    Time    58.227     59.652
  • 28. Considerations for SQL Performance – Tez
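The deck doesn't show the setting itself, but switching a session from MapReduce to Tez is a one-liner, assuming Tez is installed on the cluster:

```sql
-- Run subsequent queries on the Tez DAG engine instead of MapReduce.
SET hive.execution.engine=tez;
```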
  • 29. CBO – Cost-Based Optimization
    • Hive uses a cost-based optimizer to minimize the cost of running a query.
    • Calcite applies optimizations like query rewrite, join reordering, join elimination, and deriving implied predicates.
    • Calcite prunes away inefficient plans in order to produce and select the cheapest query plans.
    • Needs to be enabled: set hive.cbo.enable=true; set hive.stats.autogather=true;
    CBO process overview:
    1. Parse and validate the query
    2. Generate possible execution plans
    3. Assign a cost to each logically equivalent plan
    4. Select the plan with the lowest cost
    Optimization factors: join optimization and table size.
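Beyond the two settings on the slide, the optimizer can only cost plans when statistics exist; for tables loaded before autogather was enabled, they can be computed explicitly. A sketch against the deck's employees table:

```sql
-- Table-level stats (row counts, sizes) and column-level stats
-- (distinct values, min/max) that the CBO uses when costing join orders.
ANALYZE TABLE employees COMPUTE STATISTICS;
ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS;
```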
  • 30. LLAP
    • Consists of a long-lived daemon and a tightly integrated DAG framework
    • Handles pre-fetching, some query processing, and fine-grained column-level access control
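A hedged sketch of pointing a session at LLAP once the daemons are running (Hive 2.x on Tez); the property names are the stock ones, but defaults and availability vary by distribution, so treat this as an assumption to verify.

```sql
SET hive.execution.mode=llap;      -- run work inside the LLAP daemons
SET hive.llap.execution.mode=all;  -- use LLAP for all fragments, not just scans
```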
  • 31. (no slide text)
  • 32. Daugherty Overview
    Combining world-class capabilities with a local practice model: long-term consultant employees with deep business acumen and leadership abilities, providing more experienced consultants and leading methods/techniques/tools to accelerate results and productivity, provide greater team continuity, and offer a more sustainable, cost-effective price point.
    By the numbers:
    • Over 1,000 employees, from management consultants to developers
    • 88% of our clients are long-term, repeat/referral relationships of 10+ years
    • Demonstrated 31-year track record of delivering mission-critical initiatives enabled by emerging technologies
    • 1,000 engagements with over 75 Fortune 500 industry leaders over the past five years
    • 9 business units: Atlanta, Chicago, Dallas, Denver, Minneapolis, New York, Saint Louis (HQ), a development center, and a support & hardware center
    Engagement model:
    • Collaborative: co-staffed teams, project services, resource pools, collaborative managed services
    • Pragmatic: a co-staffed approach well suited to building internal competency while getting key project initiatives completed
    • Alternative: a strong alternative to the global consultancies
    • Flexible: flexible engagement model
  • 33. Data & Analytics – What we bring to the table
    Data & Analytics:
    • Over 40% of Daugherty's 1,000 consultants are focused on Information Management solutions.
    • Bringing the latest thought leadership in next-generation, unified architectures that integrate structured and unstructured data ("Big Data") and applied advanced analytics into cohesive solutions.
    • Strong capabilities across both existing and emerging technologies while maintaining a technology-neutral approach.
    • Leveraging the latest visual design concepts to deliver interactive, user-friendly applications that drive adoption of and satisfaction with solutions.
    • Leader in the effective application of Agile techniques to data engineering development and business analytics, with full data life cycle methods and techniques from business definition through development and ongoing support.
    • Building and supporting mission-critical platforms for many Fortune 500 companies in multi-year engagements, using a flexible support model including Collaborative Managed Services.
    Methods / Tools / Techniques:
    • 12-domain EIM blueprint/roadmap framework that manages technical complexity, accelerates initiatives, and focuses on delivering the greatest business analytics impact quickly
    • Highly accurate BI dimensional estimator that provides predictability in investments and time to market
    • Analytic strategy framework that aligns people, process, and technology components to deliver business value
    • Analytic governance reference model that mitigates risk and provides guardrails for self-service adoption
    • Business value models to calculate the value and ROI of investments in Data & Analytics initiatives
    • Reference architecture for a modern data and analytics platform
    • Dashboard design best practices that transform complex business KPIs into a rich, immersive design
    • Bi-modal Data-as-a-Service operating model that integrates Agile development with a service-oriented organization design
    Services: application development; data & analytics strategy and roadmap; building analytic solutions; analytics competency development; Big Data / next-generation architecture; business analytics and insights; program and project management (program and project planning, program and project management, business case development, PMO optimization, M&A integration)

Editor's Notes

  1. An updated version of ORC was released in HDP 2.6.3 with better support for vectorization.
  2. Although compression can greatly optimize processing performance, not all compression codecs supported on Hadoop are splittable. Since the MapReduce framework splits data for input to multiple tasks, a non-splittable compression codec is an impediment to efficient processing: if files cannot be split, the entire file must be passed to a single MapReduce task, eliminating the advantages of parallelism and data locality that Hadoop provides. For this reason, splittability is a major consideration in choosing a compression codec, as well as a file format. The compression codecs available for Hadoop, and some considerations in choosing between them:
     Snappy: a compression codec developed at Google for high compression speeds with reasonable compression. Although Snappy doesn't offer the best compression sizes, it provides a good trade-off between speed and size, and processing performance with Snappy can be significantly better than with other compression formats. An important thing to note is that Snappy is intended to be used with a container format like SequenceFiles or Avro, since it's not inherently splittable.
     LZO: similar to Snappy in that it's optimized for speed as opposed to size. Unlike Snappy, LZO-compressed files are splittable, but this requires an additional indexing step, which makes LZO a good choice for things like plain-text files that are not being stored as part of a container format. Note also that LZO's license prevents it from being distributed with Hadoop; it requires a separate install, unlike Snappy, which can be distributed with Hadoop.
     Gzip: provides very good compression performance (on average, about 2.5 times the compression offered by Snappy), but its write speed is not as good as Snappy's (on average, about half), while its read performance is almost as good. Gzip is also not splittable, so it should be used with a container format. One reason Gzip is sometimes slower than Snappy for processing is that Gzip-compressed files take up fewer blocks, so fewer tasks are required for processing the same data; when using Gzip, smaller blocks can therefore lead to better performance.
     bzip2: provides excellent compression performance but can be significantly slower than other compression codecs such as Snappy in terms of processing performance; this difference varies with different machines, but in general it's about 10x slower than Gzip. In the examples we have seen, bzip2 normally compresses around 9% better than Gzip in terms of storage space, but this extra compression comes with a significant read/write performance cost. Unlike Snappy and Gzip, bzip2 is inherently splittable. It's not an ideal codec for Hadoop storage unless the primary need is reducing the storage footprint, for example where Hadoop is being used mainly for active archival purposes.
  3. Multi-layer partitioning is possible but often not efficient: the number of partitions becomes too large and will overwhelm the Metastore. Limit the number of partitions; fewer may be better (1,000 partitions will often perform better than 10,000). Hadoop likes big files, so avoid creating partitions with mostly small files. Only use partitioning when the data is very large and there are lots of table scans, the data is queried against a particular column frequently, and that column's data has low cardinality.
  4. A map-side join can only be achieved if it is possible to join the records by key during the read of the input files, before the map phase. Additionally, for this to work the input files need to be sorted by the same join key, and both inputs need to have the same number of partitions. Meeting these strict constraints is commonly hard; the most likely scenario for a map-side join is when both input tables were created by (different) MapReduce jobs having the same number of reducers using the same (join) key. With hive.auto.convert.join = true, Hive automatically uses a broadcast join if possible, with small tables held in memory by all nodes; this is used for the star-schema-type joins common in data warehousing use cases. hive.auto.convert.join.noconditionaltask.size determines the data size for automatic conversion to a broadcast join: the default 10MB is too low (check your default); 256MB is recommended for a 4GB container.
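Consolidating the settings named in this note into a runnable snippet (the 256MB value follows the note's recommendation; tune it for your container sizes):

```sql
SET hive.auto.convert.join=true;                              -- auto-convert to broadcast (map-side) joins
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=268435456;  -- 256MB instead of the 10MB default
```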