Cloud storage filesystems and Hive transactional tables


Hortonworks

HIVE-14535: Cloud Storage


Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HIVE-14535: Cloud storage
Gopal V

Cloud “FileSystems” are Strange Beasts
“There are no directories. Only paths.”
“There are no users. Only keys.”
“There are no permissions. Only ACL rules.”
“There is consistency, but not as we know it.”

“Directories vs Paths.”
• Path storage can be thought of as a sorted hash table, not a tree
• File listings are no longer a walk over a tree, but a prefix search
• A directory doesn’t necessarily need to exist for a path below it
• Listing a single level is more complex than a full-depth traversal
• Renames can cause rebalancing and moving about of the structure
• Operations on adjacent files are sometimes more expensive than random ops
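The listing behaviour described above can be sketched in a few lines. This is a minimal illustration, assuming the store behaves like a flat, sorted list of keys; the function names and sample keys are hypothetical and do not correspond to any particular cloud API.

```python
from bisect import bisect_left

def list_prefix(sorted_keys, prefix):
    """Full-depth listing: one contiguous slice of the sorted key space."""
    start = bisect_left(sorted_keys, prefix)
    out = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so nothing further can match
        out.append(key)
    return out

def list_one_level(sorted_keys, prefix, delimiter="/"):
    """Single-level listing: the same scan, plus the extra work of
    collapsing deeper keys into their first path component."""
    files, dirs = [], []
    for key in list_prefix(sorted_keys, prefix):
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:  # the key lies more than one level below the prefix
            d = prefix + head + delimiter
            if not dirs or dirs[-1] != d:
                dirs.append(d)
        else:
            files.append(key)
    return files, dirs

# Hypothetical keys; note that no key named "t/p=1/" exists on its own.
keys = sorted([
    "t/p=1/mm_0/000021_0",
    "t/p=2/mm_0/000022_0",
    "t/p=2/mm_1/000000_0",
    "t/readme",
])
deep = list_prefix(keys, "t/p=2/")
files, dirs = list_one_level(keys, "t/")
```

The "directories" returned by the single-level listing are derived from the keys below them, which is why a directory need not exist as an object of its own.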

“Users & permissions vs keys & ACLs”
• Distinguishing the user for an accessing process has no meaning
• Access keys are often rotated and occasionally invalidated
• User identity can be mapped to a key (externally or by ID management)
• Buckets are commonly used to differentiate stores, instead of permissions
• Permissions are rarely set or applied per file, but across path patterns
• Permissions set on a directory need extra user checks to be useful (chmod +x)

“Consistency”
• Arguably the most complex issue
• Renames needn’t be consistent; creates can collide
• Reads can return old data for the same path when overwriting
• Versioned reads are complex to manage and hard to throw a “Time machine” over
• Cross-Region Replication often lags and doubles stale-read issues

Micro-Managed Hive Tables
• Support for all Hive input formats, including user-defined ones
• Avoid rename operations as much as possible
• Never collide final paths for different inserts
• Ongoing inserts should be atomic across more than one partition
• Snapshot isolation for reads of existing partitions that are being back-filled
• Stage data without accidental partial reads for bucket replication

Micro-Managed Hive Tables
CREATE TABLE `web_returns_hive_commit`(…
`wr_net_loss` float)
PARTITIONED BY (`wr_returned_date_sk` int)
STORED AS <FORMAT>
LOCATION 's3a://hwdev-hive-14535/web_returns_hive_commit'
TBLPROPERTIES
('transactional'='true',
'transactional_properties'='insert_only');

Micro-Managed Hive Tables
drwxrwxrwx - cloudbreak 0 2016-12-07 21:42 s3a://hwdev-hive-14535/web_returns_hive_commit/wr_returned_date_sk=2450820
drwxrwxrwx - cloudbreak 0 2016-12-07 21:42 s3a://hwdev-hive-14535/web_returns_hive_commit/wr_returned_date_sk=2450820/mm_0
-rw-rw-rw- 1 cloudbreak 1791 2016-12-07 00:55 s3a://hwdev-hive-14535/web_returns_hive_commit/wr_returned_date_sk=2450820/mm_0/000021_0
drwxrwxrwx - cloudbreak 0 2016-12-07 21:42 s3a://hwdev-hive-14535/web_returns_hive_commit/wr_returned_date_sk=2450821
drwxrwxrwx - cloudbreak 0 2016-12-07 21:42 s3a://hwdev-hive-14535/web_returns_hive_commit/wr_returned_date_sk=2450821/mm_0
-rw-rw-rw- 1 cloudbreak 2186 2016-12-07 00:55 s3a://hwdev-hive-14535/web_returns_hive_commit/wr_returned_date_sk=2450821/mm_0/000022_0
drwxrwxrwx - cloudbreak 0 2016-12-07 21:42 s3a://hwdev-hive-14535/web_returns_hive_commit/wr_returned_date_sk=2450822
drwxrwxrwx - cloudbreak 0 2016-12-07 21:42 s3a://hwdev-hive-14535/web_returns_hive_commit/wr_returned_date_sk=2450822/mm_0
-rw-rw-rw- 1 cloudbreak 1814 2016-12-07 00:55 s3a://hwdev-hive-14535/web_returns_hive_commit/wr_returned_date_sk=2450822/mm_0/000023_0
/web_returns_hive_commit/wr_returned_date_sk=2450820/mm_0/000021_0

“Take a number” for inserts
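The slide carries only its title, so the following is a sketch under stated assumptions rather than the branch's implementation. It assumes each insert obtains a unique, increasing number from a central allocator and writes directly under a directory named mm_<number>, as the mm_0 directories in the earlier listing suggest. TicketAllocator, take_number, final_path and the bucket name are hypothetical.

```python
import itertools
import threading

class TicketAllocator:
    """Hypothetical stand-in for a central service that hands out insert ids."""

    def __init__(self):
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def take_number(self):
        # The lock keeps concurrent callers from receiving the same number.
        with self._lock:
            return next(self._counter)

def final_path(table_root, partition, ticket, task_file):
    # Assumption: each insert writes straight into its own mm_<ticket>
    # directory, so no rename is needed and final paths never collide.
    return f"{table_root}/{partition}/mm_{ticket}/{task_file}"

alloc = TicketAllocator()
first = alloc.take_number()
second = alloc.take_number()

root = "s3a://example-bucket/web_returns"  # hypothetical location
part = "wr_returned_date_sk=2450820"
path_a = final_path(root, part, first, "000021_0")
path_b = final_path(root, part, second, "000021_0")
```

Two inserts that produce the same task file name for the same partition still end up at different final paths, because the number each one took is part of the path.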

Read: tracking committed data
• Similar to Hive-ACID (ORC)
• Committed txns disappear from the tracking data
• Each query takes the highest known txn plus the list of open/aborted txns
• Valid transactions are those < max(transaction_id) and NOT IN (open_txns)
• The transaction filtering is done at the listing level for all formats
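The read-side rule above can be sketched as a filter over a file listing. This is an illustration only: it assumes the number in each mm_<n> directory name is the transaction id, it takes the strict `<` bound literally from the slide (whether the bound is inclusive in the actual branch is not stated), and the names visible_files, is_valid and the sample paths are hypothetical.

```python
import re

# Assumed directory naming, modelled on the mm_0 directories in the listing.
MM_DIR = re.compile(r"/mm_(\d+)/")

def is_valid(txn_id, max_txn_id, excluded):
    # Rule as written on the slide: below the maximum and not open/aborted.
    return txn_id < max_txn_id and txn_id not in excluded

def visible_files(listing, max_txn_id, open_txns, aborted_txns):
    """Filter at the listing level, independent of the file format."""
    excluded = set(open_txns) | set(aborted_txns)
    out = []
    for path in listing:
        match = MM_DIR.search(path)
        if match and is_valid(int(match.group(1)), max_txn_id, excluded):
            out.append(path)
    return out

listing = [
    "tbl/part=1/mm_0/000000_0",
    "tbl/part=1/mm_1/000000_0",  # txn 1 is still open in this example
    "tbl/part=1/mm_2/000000_0",
    "tbl/part=2/mm_2/000001_0",
    "tbl/part=2/mm_3/000000_0",  # txn 3 is not below the maximum
]
snapshot = visible_files(listing, max_txn_id=3, open_txns=[1], aborted_txns=[])
```

Because the filter only looks at paths, the files themselves are never opened to decide visibility, which is what lets the same rule apply to every input format.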

Branch + Future Work
Current measurements show a 21% reduction in partition load time (+HIVE-15368)
Time taken to load dynamic partitions: 350.846 seconds -> 274.715 seconds
Work continues in the branch for HIVE-14535
Work is ongoing to take advantage of faster recursive listings
Discussions are under way on incremental refresh for cube engines for backfill
Questions?
Suggestions?
