1 © Hortonworks Inc. 2011 – 2016. All Rights
Reserved
Hadoop & Cloud Storage: Object Store Integration in Production
Rajesh Balamohan
Hadoop Summit 2016
About Us
Rajesh Balamohan, rbalamohan@hortonworks.com, Twitter: @rajeshbalamohan
– Apache Tez Committer, PMC Member
– Mainly working on performance in Tez
– Have been using Hadoop since 2009
Chris Nauroth, cnauroth@hortonworks.com, Twitter: @cnauroth
– Apache Hadoop committer, PMC member, and Apache Software Foundation member
– Working on HDFS and alternative file systems such as WASB and S3A
– Hadoop user since 2010
Steve Loughran, stevel@hortonworks.com, Twitter: @steveloughran
– Apache Hadoop committer, PMC member, and Apache Software Foundation member
– Hadoop deployment since 2008, especially Cloud integration, Filesystem Spec author.
– Working on: Apache Slider, Spark+cloud integration, Hadoop + Cloud
Agenda
⬢ Hadoop/Cloud Storage Integration Use Cases
⬢ Hadoop-compatible File System Architecture
⬢ Recent Enhancements in S3A FileSystem Connector
⬢ Hive Access Patterns
⬢ Performance Improvements and TPC-DS Benchmarks with Hive-TestBench
⬢ Next Steps for S3A and other Object Stores
⬢ Q & A
Why Hadoop in the Cloud?
Hadoop Cloud Storage Utilization Evolution
[Diagram: three layouts of Application, HDFS and cloud storage; flow labels: Input, Output, Backup, Restore, Copy, tmp]
Goal: Evolution towards cloud storage as the primary Data Lake
What is the Problem?
Cloud Object Stores designed for
⬢ Scale
⬢ Cost
⬢ Geographic Distribution
⬢ Availability
⬢ Cloud app writers often modify apps to deal with cloud storage semantics and limitations
Challenges - Hadoop apps should work on HDFS or Cloud Storage transparently
⬢ Eventual consistency
⬢ Performance - separated from compute
⬢ Cloud Storage not designed for file-like access patterns
⬢ Limitations in APIs (e.g. rename)
Goal and Approach
Goals
⬢ Integrate with unique functionality of each cloud
⬢ Optimize each cloud’s object store connector
⬢ Optimize upper layers for cloud object stores
Overall Approach
⬢ Consistency in face of eventual consistency (use a secondary metadata store)
⬢ Performance in the connector (e.g. lazy seek)
⬢ Upper layer improvements (Hive, ORC, Tez, etc.)
Hadoop-compatible File System Architecture
⬢ Applications
– File system interactions coded to file system-agnostic abstraction layer.
• FileSystem class - traditional API
• FileContext/AbstractFileSystem classes - newer API providing split between client API and provider API
– Can be retargeted to a different file system by configuration changes (not code changes).
• Caveat: Different FileSystem implementations may offer limited feature set.
• Example: Only HDFS and WASB can run HBase.
⬢ File System Abstraction Layer
– Defines interface of common file system operations: create, open, rename, etc.
– Supports additional mix-in interfaces to indicate implementation of optional features.
– Semantics of each operation documented in formal specification, derived from HDFS behavior.
⬢ File System Implementation Layer
– Each file system provides a set of concrete classes implementing the interface.
– A set of common file system contract tests execute against each implementation to prove its adherence to specified semantics.
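The three layers above can be illustrated with a small sketch. It is written in Python rather than Hadoop's Java, and every name in it (`FileSystem`, `InMemoryFS`, `LocalDirFS`, `get_file_system`, the `fs.scheme` key, `contract_test`) is hypothetical and chosen only for illustration. It shows an application-facing interface, two interchangeable implementations selected by configuration, and one shared contract test run against both.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class FileSystem(ABC):
    """Hypothetical abstraction layer: the operations applications code against."""

    @abstractmethod
    def create(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def open(self, path: str) -> bytes: ...

    @abstractmethod
    def rename(self, src: str, dst: str) -> bool: ...


class InMemoryFS(FileSystem):
    """Implementation backed by a dict."""

    def __init__(self):
        self._files = {}

    def create(self, path, data):
        self._files[path] = bytes(data)

    def open(self, path):
        if path not in self._files:
            raise FileNotFoundError(path)
        return self._files[path]

    def rename(self, src, dst):
        if src not in self._files:
            return False
        self._files[dst] = self._files.pop(src)
        return True


class LocalDirFS(FileSystem):
    """Implementation backed by a temporary local directory."""

    def __init__(self):
        self._root = Path(tempfile.mkdtemp())

    def _real(self, path):
        return self._root / path.lstrip("/")

    def create(self, path, data):
        real = self._real(path)
        real.parent.mkdir(parents=True, exist_ok=True)
        real.write_bytes(data)

    def open(self, path):
        return self._real(path).read_bytes()

    def rename(self, src, dst):
        if not self._real(src).exists():
            return False
        self._real(dst).parent.mkdir(parents=True, exist_ok=True)
        self._real(src).replace(self._real(dst))
        return True


# Retargeting is a configuration change: the scheme picks the implementation.
IMPLEMENTATIONS = {"mem": InMemoryFS, "file": LocalDirFS}


def get_file_system(conf: dict) -> FileSystem:
    return IMPLEMENTATIONS[conf["fs.scheme"]]()


def contract_test(fs: FileSystem) -> bool:
    """One shared test, executed unchanged against every implementation."""
    fs.create("/data/part-0", b"hello")
    assert fs.open("/data/part-0") == b"hello"
    assert fs.rename("/data/part-0", "/data/part-1") is True
    assert fs.open("/data/part-1") == b"hello"
    assert fs.rename("/data/missing", "/data/x") is False
    try:
        fs.open("/data/part-0")
    except FileNotFoundError:
        return True
    return False
```

Calling `contract_test(get_file_system({"fs.scheme": "mem"}))` and then the same with `"file"` exercises both implementations through the identical code path, which is the property the contract tests are meant to establish.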
Cloud Storage Connectors
⬢ Azure
– WASB
• Strongly consistent
• Good performance
• Well-tested on applications (incl. HBase)
– ADL
• Strongly consistent
• Tuned for big data analytics workloads
⬢ Amazon Web Services
– S3A
• Eventually consistent - consistency work in progress by Hortonworks
• Performance improvements in progress
• Active development in Apache
– EMRFS
• Proprietary connector used in EMR
• Optional strong consistency for a cost
⬢ Google Cloud Platform
– GCS
• Multiple configurable consistency policies
• Currently Google open source
• Good performance
• Work under way for contribution to Apache
Case Study: S3A Functionality and Performance
Authentication
⬢ Basic
– AWS Access Key ID and Secret Access Key in Hadoop Configuration Files
– Hadoop Credential Provider API to avoid using world-readable configuration files
⬢ EC2 Metadata
– Reads credentials published by AWS directly into EC2 VM instances
– More secure, because external distribution of secrets not required
⬢ AWS Environment Variables
– Less secure, but potentially easier integration for some applications
⬢ Session Credentials
– Temporary security credentials issued by Amazon Security Token Service
– Fixed lifetime reduces impact of credential leak
⬢ Anonymous Login
– Easy read-only access to public buckets for early prototyping
Encryption
⬢ S3 Server-Side Encryption
– Encryption of data at rest at S3
– Supports the SSE-S3 option: each object encrypted by a unique key using AES-256 cipher
– Now covered in S3A automated test suites
– Support for additional options under development (SSE-KMS and SSE-C)
Supportability
⬢ Documentation
– Backfill missing documentation, and include documentation in new enhancements
– To be published to hadoop.apache.org with Apache Hadoop 2.8.0 release
– Meanwhile, raw content visible on GitHub:
• https://github.com/apache/hadoop/blob/branch-2.8/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
⬢ Error Reporting
– Identify common user errors and provide more descriptive error messages
– S3 HTTP error codes examined and translated to specific error types
⬢ Instrumentation
– Internal metrics covering a wide range of metadata and data operations
– Already proven helpful in flagging a potential performance regression in a patch
Performance Improvements
⬢ Lazy Seek
– Earlier implementation
• Reopened the file on every seek call and aborted the connection on every reopen
• Positional read was expensive (seek, read, seek)
– Current implementation
• Seek is a no-op call
• Performs the real seek only when needed, on the next read
⬢ Connection Abort Problem
– Backward seeks caused connection aborts
– Recent modifications to S3AFileSystem fixed this and added support for sequential and random read policies
• fs.s3a.experimental.input.fadvise
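The difference between the earlier and the current seek behaviour can be sketched as follows. This is a toy model in Python, not the S3A code: `RemoteObject`, `EagerSeekStream`, `LazySeekStream` and `positional_read` are hypothetical names, a "connection" is represented only by its current offset, and connection aborts and the fadvise policies are not modelled. It only counts how many times the object has to be reopened.

```python
class RemoteObject:
    """Stand-in for an object in a remote store; counts connection opens."""

    def __init__(self, data: bytes):
        self.data = data
        self.opens = 0

    def open_at(self, offset: int) -> int:
        self.opens += 1
        return offset  # the "connection" is modelled by its current offset


class EagerSeekStream:
    """Earlier behaviour: every seek reopens the object at the new offset."""

    def __init__(self, obj: RemoteObject):
        self.obj = obj
        self.conn_pos = obj.open_at(0)

    def seek(self, pos: int) -> None:
        self.conn_pos = self.obj.open_at(pos)

    def read(self, n: int) -> bytes:
        chunk = self.obj.data[self.conn_pos:self.conn_pos + n]
        self.conn_pos += len(chunk)
        return chunk


class LazySeekStream:
    """Current behaviour: seek only records the target; read does the real seek."""

    def __init__(self, obj: RemoteObject):
        self.obj = obj
        self.conn_pos = None  # no connection opened yet
        self.target = 0

    def seek(self, pos: int) -> None:
        self.target = pos  # no remote call

    def read(self, n: int) -> bytes:
        if self.conn_pos != self.target:
            self.conn_pos = self.obj.open_at(self.target)
        chunk = self.obj.data[self.conn_pos:self.conn_pos + n]
        self.conn_pos += len(chunk)
        self.target = self.conn_pos
        return chunk


def positional_read(stream, current_pos: int, pos: int, n: int) -> bytes:
    """Positional read expressed as seek, read, seek back."""
    stream.seek(pos)
    data = stream.read(n)
    stream.seek(current_pos)
    return data


def run_positional_reads(stream_cls):
    obj = RemoteObject(bytes(range(100)))
    stream = stream_cls(obj)
    results = [positional_read(stream, 0, offset, 4) for offset in (10, 50, 90)]
    return obj.opens, results
```

In this model three positional reads cost seven opens with the eager stream (one at construction plus two per read) and three with the lazy stream, because the trailing seek back to the original position never touches the remote object unless a read follows it.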
Hive Access Patterns
⬢ ETL and Admin Activities
– Bringing in dataset / Creating Tables
– Cleansing / Transforming Data
– Analyze Tables, Compute Column Statistics
– MSCK to fix partition related information
⬢ Read
– Running Queries
⬢ Write
– Store Output
Hive - MSCK Improvements
⬢ MSCK helps repair metastore information for partitioned datasets
– Scans the table path to identify missing partitions (expensive in S3)
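The scan described above can be sketched as a set difference between the partitions found under the table path and those the metastore already knows. This is a simplified Python model and not Hive's implementation: `partitions_from_keys` and `missing_partitions` are hypothetical helpers, the object keys are a plain list rather than the result of remote listing calls, and partition directories are assumed to follow the `name=value` layout.

```python
def partitions_from_keys(table_prefix, keys):
    """Derive partition specs such as 'year=2016/month=01' from object keys."""
    found = set()
    for key in keys:
        if not key.startswith(table_prefix):
            continue
        relative = key[len(table_prefix):].strip("/")
        dirs = relative.split("/")[:-1]  # drop the file name
        parts = [d for d in dirs if "=" in d]
        # Keep only paths made up entirely of name=value directories.
        if parts and len(parts) == len(dirs):
            found.add("/".join(parts))
    return found


def missing_partitions(table_prefix, keys, metastore_partitions):
    """Partitions present in storage but absent from the metastore."""
    in_storage = partitions_from_keys(table_prefix, keys)
    return sorted(in_storage - set(metastore_partitions))


EXAMPLE_KEYS = [
    "warehouse/sales/year=2015/month=12/000000_0",
    "warehouse/sales/year=2016/month=01/000000_0",
    "warehouse/sales/year=2016/month=01/000001_0",
    "warehouse/sales/_tmp/junk",
    "warehouse/other/year=2016/000000_0",
]
```

With `EXAMPLE_KEYS` and a metastore that only knows `year=2015/month=12`, the helper reports `year=2016/month=01` as missing. Against a real object store every directory level turns into one or more remote listing requests, which is where the cost comes from.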
Hive - Analyze Column Statistics Improvements
⬢ Hive needs statistics to run queries efficiently
– Gathering table and column statistics can be expensive in partitioned datasets
Performance Considerations When Running Hive Queries
⬢ Splits Generation
– File formats like ORC provide a thread pool for split generation
⬢ ORC Footer Cache
– hive.orc.cache.stripe.details.size > 0
– Caches footer details, which helps reduce data reads during split generation
⬢ Reduce S3A reads on the task side
– hive.orc.splits.include.file.footer=true
– Sends ORC footer information in the splits payload
– Helps reduce the amount of data read on the task side
⬢ Tez Splits Grouping
– Hive uses Tez as its default execution engine
– Tez groups splits based on min/max group setting, location details and so on
– S3A always provides “localhost” as its block location information
– When all split lengths fall below the min group setting, Tez aggressively groups them into a single split. This causes issues with S3A, as a single task ends up doing all operations sequentially.
– Fixed in recent releases
⬢ Container Launches
– S3A always provides “localhost” for block locations.
– Good to set “yarn.scheduler.capacity.node-locality-delay=0”
Hive-TestBench Benchmark Results
⬢ Hive-TestBench contains a subset of queries from TPC-DS (https://github.com/hortonworks/hive-testbench)
⬢ m4.4xlarge instances - 5 nodes
⬢ TPC-DS @ 200 GB Scale in S3
⬢ “HDP 2.3 + S3 in cloud” vs “HDP 2.4 + S3 in cloud”
– Average speedup 2.5x
– Queries such as 15, 17, 25, 73 and 75 did not run in HDP 2.3 (they threw AWS timeout exceptions)
Hive-TestBench Benchmark Results - LLAP
⬢ LLAP DAG runtime comparison with Hive
⬢ Reduces the amount of data to be read from S3 significantly; Improves runtime.
Best Practices
⬢ Tune multipart settings
– fs.s3a.multipart.threshold (default: Integer.MAX_VALUE)
– fs.s3a.multipart.size (default: 100 MB)
– fs.s3a.connection.timeout (default: 200 seconds)
⬢ Disable node locality delay in YARN
– Set “yarn.scheduler.capacity.node-locality-delay=0” to avoid delays in container launches
⬢ Disable Storage Based authorization in Hive
– hive.security.metastore.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveMetastoreAuthorizationProvider
– hive.metastore.pre.event.listeners= (set to empty value)
⬢ Tune ORC threads for reducing split generation times
– hive.orc.compute.splits.num.threads (default 10)
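As a rough illustration of how the settings above might be written out, the sketch below renders them as Hadoop-style `<configuration>` XML using only the Python standard library. The helper names `render_site_xml` and `parse_site_xml` are hypothetical, the split across core-site.xml, capacity-scheduler.xml and hive-site.xml reflects where such properties are conventionally placed, and the values are either copied from the list above or marked as illustrative placeholders. The connection timeout is left out because its unit depends on the Hadoop version in use.

```python
import xml.etree.ElementTree as ET

# core-site.xml candidates; both values are the defaults quoted in the list above.
CORE_SITE = {
    "fs.s3a.multipart.threshold": "2147483647",  # Integer.MAX_VALUE
    "fs.s3a.multipart.size": "104857600",        # 100 MB expressed in bytes
}

# capacity-scheduler.xml candidate; the value comes from the list above.
CAPACITY_SCHEDULER = {
    "yarn.scheduler.capacity.node-locality-delay": "0",
}

# hive-site.xml candidates.
HIVE_SITE = {
    "hive.security.metastore.authorization.manager":
        "org.apache.hadoop.hive.ql.security.authorization."
        "DefaultHiveMetastoreAuthorizationProvider",
    "hive.metastore.pre.event.listeners": "",     # set to empty value
    "hive.orc.compute.splits.num.threads": "20",  # illustrative placeholder; default is 10
}


def render_site_xml(properties: dict) -> str:
    """Render name/value pairs as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def parse_site_xml(xml_text: str) -> dict:
    """Read the name/value pairs back, treating a missing value as empty."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): (prop.findtext("value") or "")
        for prop in root.findall("property")
    }
```

Reading the rendered XML back with `parse_site_xml` is a cheap check that nothing was lost before the files are distributed. Suitable values depend on the workload and should be validated on the target cluster.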
Next Steps for S3A and other Object Stores
⬢ S3A Phase III
– https://issues.apache.org/jira/browse/HADOOP-13204
⬢ Output Committers
– Logical commit operation decoupled from rename (non-atomic and costly in object stores)
⬢ Object Store Abstraction Layer
– Avoid impedance mismatch with FileSystem API
– Provide specific APIs for better integration with object stores: saving, listing, copying
⬢ Ongoing Performance Improvement
– Less chatty call pattern for object listings
– Metadata caching to mask latency of remote object store calls
⬢ Consistency
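Why rename is described above as non-atomic and costly in object stores can be seen in a small model. The Python sketch below is an assumption-laden toy, not any connector's code: `FlatObjectStore` and `rename_prefix` are hypothetical names, the store is a flat dict of keys, and a "directory rename" is emulated as one listing followed by a copy and a delete per object. The `fail_after` parameter exists only to simulate a failure partway through.

```python
class FlatObjectStore:
    """Toy object store with a flat key namespace; counts remote calls."""

    def __init__(self):
        self.objects = {}
        self.calls = 0

    def put(self, key, data):
        self.calls += 1
        self.objects[key] = data

    def list(self, prefix):
        self.calls += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.calls += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.calls += 1
        del self.objects[key]


def rename_prefix(store, src_prefix, dst_prefix, fail_after=None):
    """Emulate a directory rename: copy then delete, one object at a time."""
    moved = 0
    for key in store.list(src_prefix):
        if fail_after is not None and moved == fail_after:
            raise IOError("simulated failure mid-rename")
        store.copy(key, dst_prefix + key[len(src_prefix):])
        store.delete(key)
        moved += 1
    return moved


def make_store(n_files):
    store = FlatObjectStore()
    for i in range(n_files):
        store.put("out/_temporary/part-%d" % i, b"x")
    store.calls = 0  # count only the rename traffic
    return store
```

Renaming a prefix holding three objects costs seven calls in this model (one list, three copies, three deletes), so the cost grows with the amount of output. If the loop is interrupted, some objects sit under the destination and the rest under the source, which is the non-atomic state a commit protocol built on rename would expose.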
Summary
⬢ Evolution towards cloud storage
⬢ Hadoop-compatible File System Architecture fosters integration with cloud storage
⬢ Integration with multiple cloud providers available: Azure, AWS, Google
⬢ Recent enhancements in S3A
⬢ Hive usage and TPC-DS benchmarks show significant S3A performance improvements
⬢ More coming soon for S3A and other object stores
Q & A
Thank You!

More Related Content

What's hot

Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...DataWorks Summit
 
Best Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop EnvironmentBest Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop EnvironmentDataWorks Summit/Hadoop Summit
 
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...DataWorks Summit/Hadoop Summit
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies DataWorks Summit/Hadoop Summit
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in RealtimeDataWorks Summit
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsDataWorks Summit
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...Hortonworks
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureDataWorks Summit/Hadoop Summit
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseDataWorks Summit
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDataWorks Summit
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin DataWorks Summit/Hadoop Summit
 
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and TroubleshootingApache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and TroubleshootingDataWorks Summit/Hadoop Summit
 

What's hot (20)

Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
 
Best Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop EnvironmentBest Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop Environment
 
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in Realtime
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerations
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present Future
 
What's new in Ambari
What's new in AmbariWhat's new in Ambari
What's new in Ambari
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
 
Securing Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise ContextSecuring Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise Context
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and TroubleshootingApache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
 

Viewers also liked

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersDataWorks Summit/Hadoop Summit
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)Chris Nauroth
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Visualización de Big Data con Power View
Visualización de Big Data con Power ViewVisualización de Big Data con Power View
Visualización de Big Data con Power ViewEduardo Castro
 
Ozone: An Object Store in HDFS
Ozone: An Object Store in HDFSOzone: An Object Store in HDFS
Ozone: An Object Store in HDFSDataWorks Summit
 
Que debe saber un DBA de SQL Server sobre Hadoop
Que debe saber un DBA de SQL Server sobre HadoopQue debe saber un DBA de SQL Server sobre Hadoop
Que debe saber un DBA de SQL Server sobre HadoopEduardo Castro
 
The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop MapR Technologies
 
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...Amazon Web Services
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...Amazon Web Services
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationmattlieber
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkDatabricks
 
Big Data a traves de una implementación
Big Data a traves de una implementaciónBig Data a traves de una implementación
Big Data a traves de una implementaciónDiego Krauthamer
 
Construyendo una Infraestructura de Big Data rentable y escalable (la evoluci...
Construyendo una Infraestructura de Big Data rentable y escalable (la evoluci...Construyendo una Infraestructura de Big Data rentable y escalable (la evoluci...
Construyendo una Infraestructura de Big Data rentable y escalable (la evoluci...Socialmetrix
 
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceDataWorks Summit/Hadoop Summit
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANDataWorks Summit/Hadoop Summit
 
S3Guard: What's in your consistency model?
S3Guard: What's in your consistency model?S3Guard: What's in your consistency model?
S3Guard: What's in your consistency model?Hortonworks
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To ProductionJen Aman
 

Viewers also liked (20)

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Visualización de Big Data con Power View
Visualización de Big Data con Power ViewVisualización de Big Data con Power View
Visualización de Big Data con Power View
 
Ozone: An Object Store in HDFS
Ozone: An Object Store in HDFSOzone: An Object Store in HDFS
Ozone: An Object Store in HDFS
 
Que debe saber un DBA de SQL Server sobre Hadoop
Que debe saber un DBA de SQL Server sobre HadoopQue debe saber un DBA de SQL Server sobre Hadoop
Que debe saber un DBA de SQL Server sobre Hadoop
 
The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop
 
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluation
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
 
Big Data a traves de una implementación
Big Data a traves de una implementaciónBig Data a traves de una implementación
Big Data a traves de una implementación
 
Construyendo una Infraestructura de Big Data rentable y escalable (la evoluci...
Construyendo una Infraestructura de Big Data rentable y escalable (la evoluci...Construyendo una Infraestructura de Big Data rentable y escalable (la evoluci...
Construyendo una Infraestructura de Big Data rentable y escalable (la evoluci...
 
Modernise your EDW - Data Lake
Modernise your EDW - Data LakeModernise your EDW - Data Lake
Modernise your EDW - Data Lake
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
 
Architecting a multi-tenanted platform
Architecting a multi-tenanted platform Architecting a multi-tenanted platform
Architecting a multi-tenanted platform
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
 
S3Guard: What's in your consistency model?
S3Guard: What's in your consistency model?S3Guard: What's in your consistency model?
S3Guard: What's in your consistency model?
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 

Similar to Hadoop & Cloud Storage: Object Store Integration in Production

Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudDataWorks Summit/Hadoop Summit
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016alanfgates
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...Big Data Spain
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...DataWorks Summit
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...DataWorks Summit
 
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San JoseCloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San JoseMingliang Liu
 
Micro services vs hadoop
Micro services vs hadoopMicro services vs hadoop
Micro services vs hadoopGergely Devenyi
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDataWorks Summit
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Seetharam Venkatesh
 
Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018alanfgates
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsDataWorks Summit
 
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopHortonworks
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleHortonworks
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object StoresSteve Loughran
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveAldrin Piri
 

Hadoop & Cloud Storage: Object Store Integration in Production

  • 1. Hadoop & Cloud Storage: Object Store Integration in Production
    Rajesh Balamohan, Hadoop Summit 2016
    © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 2. About Us
    Rajesh Balamohan, rbalamohan@hortonworks.com, Twitter: @rajeshbalamohan
      – Apache Tez Committer, PMC Member
      – Mainly working on performance in Tez
      – Have been using Hadoop since 2009
    Chris Nauroth, cnauroth@hortonworks.com, Twitter: @cnauroth
      – Apache Hadoop committer, PMC member, and Apache Software Foundation member
      – Working on HDFS and alternative file systems such as WASB and S3A
      – Hadoop user since 2010
    Steve Loughran, stevel@hortonworks.com, Twitter: @steveloughran
      – Apache Hadoop committer, PMC member, and Apache Software Foundation member
      – Hadoop deployment since 2008, especially Cloud integration, Filesystem Spec author.
      – Working on: Apache Slider, Spark+cloud integration, Hadoop + Cloud
  • 3. Agenda
    ⬢ Hadoop/Cloud Storage Integration Use Cases
    ⬢ Hadoop-compatible File System Architecture
    ⬢ Recent Enhancements in S3A FileSystem Connector
    ⬢ Hive Access Patterns
    ⬢ Performance Improvements and TPC-DS Benchmarks with Hive-TestBench
    ⬢ Next Steps for S3A and other Object Stores
    ⬢ Q & A
  • 4. Why Hadoop in the Cloud?
  • 5. Hadoop Cloud Storage Utilization Evolution
    Goal: Evolution towards cloud storage as the primary Data Lake
    [Diagram labels: HDFS, Application, Input, Output, Backup, Restore, Copy, tmp]
  • 6. What is the Problem?
    Cloud Object Stores designed for
    ⬢ Scale
    ⬢ Cost
    ⬢ Geographic Distribution
    ⬢ Availability
    ⬢ Cloud app writers often modify apps to deal with cloud storage semantics and limitations
    Challenges - Hadoop apps should work on HDFS or Cloud Storage transparently
    ⬢ Eventual consistency
    ⬢ Performance - separated from compute
    ⬢ Cloud Storage not designed for file-like access patterns
    ⬢ Limitations in APIs (e.g. rename)
  • 7. Goal and Approach
    Goals
    ⬢ Integrate with unique functionality of each cloud
    ⬢ Optimize each cloud’s object store connector
    ⬢ Optimize upper layers for cloud object stores
    Overall Approach
    ⬢ Consistency in face of eventual consistency (use a secondary metadata store)
    ⬢ Performance in the connector (e.g. lazy seek)
    ⬢ Upper layer improvements (Hive, ORC, Tez, etc.)
  • 8. Hadoop-compatible File System Architecture
  • 9. Hadoop-compatible File System Architecture
    ⬢ Applications
      – File system interactions coded to file system-agnostic abstraction layer.
        • FileSystem class - traditional API
        • FileContext/AbstractFileSystem classes - newer API providing split between client API and provider API
      – Can be retargeted to a different file system by configuration changes (not code changes).
        • Caveat: Different FileSystem implementations may offer limited feature set.
        • Example: Only HDFS and WASB can run HBase.
    ⬢ File System Abstraction Layer
      – Defines interface of common file system operations: create, open, rename, etc.
      – Supports additional mix-in interfaces to indicate implementation of optional features.
      – Semantics of each operation documented in formal specification, derived from HDFS behavior.
    ⬢ File System Implementation Layer
      – Each file system provides a set of concrete classes implementing the interface.
      – A set of common file system contract tests execute against each implementation to prove its adherence to specified semantics.
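
The "retargeted by configuration changes" point can be illustrated with a small sketch. It is a toy model in Python, not Hadoop code: the `fs.<scheme>.impl` key style and the three class names follow Hadoop's conventions, while the `FS_IMPL` dictionary and the `resolve_filesystem` helper are hypothetical names introduced here for illustration.

```python
from urllib.parse import urlparse

# Scheme -> implementation class, written in the style of Hadoop's
# fs.<scheme>.impl configuration keys. The dict itself is illustrative.
FS_IMPL = {
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.wasb.impl": "org.apache.hadoop.fs.azure.NativeAzureFileSystem",
}


def resolve_filesystem(path: str, conf: dict = FS_IMPL) -> str:
    """Return the implementation class configured for the path's URI scheme."""
    scheme = urlparse(path).scheme
    key = f"fs.{scheme}.impl"
    if key not in conf:
        raise ValueError(f"No FileSystem configured for scheme: {scheme!r}")
    return conf[key]


# The application code stays the same; only the path (or default FS) changes.
print(resolve_filesystem("hdfs://namenode:8020/warehouse/t1"))
print(resolve_filesystem("s3a://bucket/warehouse/t1"))
```

The application only ever sees the abstract interface; which concrete class backs a path is decided by the URI scheme and configuration.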
  • 10. Cloud Storage Connectors
    Azure
      WASB
        ● Strongly consistent
        ● Good performance
        ● Well-tested on applications (incl. HBase)
      ADL
        ● Strongly consistent
        ● Tuned for big data analytics workloads
    Amazon Web Services
      S3A
        ● Eventually consistent - consistency work in progress by Hortonworks
        ● Performance improvements in progress
        ● Active development in Apache
      EMRFS
        ● Proprietary connector used in EMR
        ● Optional strong consistency for a cost
    Google Cloud Platform
      GCS
        ● Multiple configurable consistency policies
        ● Currently Google open source
        ● Good performance
        ● Work under way for contribution to Apache
  • 11. Case Study: S3A Functionality and Performance
  • 12. Authentication
    ⬢ Basic
      – AWS Access Key ID and Secret Access Key in Hadoop Configuration Files
      – Hadoop Credential Provider API to avoid using world-readable configuration files
    ⬢ EC2 Metadata
      – Reads credentials published by AWS directly into EC2 VM instances
      – More secure, because external distribution of secrets not required
    ⬢ AWS Environment Variables
      – Less secure, but potentially easier integration for some applications
    ⬢ Session Credentials
      – Temporary security credentials issued by Amazon Security Token Service
      – Fixed lifetime reduces impact of credential leak
    ⬢ Anonymous Login
      – Easy read-only access to public buckets for early prototyping
  • 13. Encryption
    ⬢ S3 Server-Side Encryption
      – Encryption of data at rest at S3
      – Supports the SSE-S3 option: each object encrypted by a unique key using AES-256 cipher
      – Now covered in S3A automated test suites
      – Support for additional options under development (SSE-KMS and SSE-C)
  • 14. Supportability
    ⬢ Documentation
      – Backfill missing documentation, and include documentation in new enhancements
      – To be published to hadoop.apache.org with Apache Hadoop 2.8.0 release
      – Meanwhile, raw content visible on GitHub:
        • https://github.com/apache/hadoop/blob/branch-2.8/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
    ⬢ Error Reporting
      – Identify common user errors and provide more descriptive error messages
      – S3 HTTP error codes examined and translated to specific error types
    ⬢ Instrumentation
      – Internal metrics covering a wide range of metadata and data operations
      – Already proven helpful in flagging a potential performance regression in a patch
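
As a rough illustration of translating HTTP status codes into specific error types, here is a minimal Python sketch. The mapping and the `translate_s3_error` helper are assumptions made for this example; they are not S3A's actual translation table, which lives in the Java connector.

```python
def translate_s3_error(status: int, path: str) -> Exception:
    """Map an HTTP status code to a more specific exception (illustrative only)."""
    # Hypothetical mapping: 403 -> permission problem, 404 -> missing object.
    mapping = {403: PermissionError, 404: FileNotFoundError}
    exc_type = mapping.get(status, OSError)  # anything else stays a generic I/O error
    return exc_type(f"{path}: S3 returned HTTP {status}")


err = translate_s3_error(404, "s3a://bucket/missing.orc")
print(type(err).__name__, "-", err)
```

A caller can then catch the specific type (for example, treat a missing object differently from an authorization failure) instead of parsing a generic error message.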
  • 15. Performance Improvements
    ⬢ Lazy Seek
      – Earlier implementation
        • Reopened file in every seek call; aborted connection in every reopen
        • Positional read was expensive (seek, read, seek)
      – Current implementation
        • Seek is a no-op call
        • Performs the real seek only when needed
    ⬢ Connection Abort Problem
      – Backward seeks caused connection aborts
      – Recent modifications to S3AFileSystem fix these and add support for sequential reads and random reads
        • fs.s3a.experimental.input.fadvise
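
The lazy seek idea can be sketched with a toy stream over an in-memory byte string. This is a simplified model written for this transcript, assuming that a "reopen" stands in for closing and reopening the HTTP connection; the `LazySeekStream` class is hypothetical and is not the S3A input stream.

```python
class LazySeekStream:
    """Toy model of a lazy-seek input stream over an in-memory 'object'."""

    def __init__(self, data: bytes):
        self._data = data
        self._stream_pos = 0   # where the (simulated) open connection is positioned
        self._target_pos = 0   # where the caller last asked to be
        self.reopens = 0       # number of simulated connection reopens

    def seek(self, pos: int) -> None:
        # No remote call here: only remember the requested position.
        if pos < 0 or pos > len(self._data):
            raise ValueError("seek out of range")
        self._target_pos = pos

    def read(self, n: int) -> bytes:
        # Reposition only when a read actually needs a different offset.
        if self._target_pos != self._stream_pos:
            self.reopens += 1
            self._stream_pos = self._target_pos
        chunk = self._data[self._stream_pos:self._stream_pos + n]
        self._stream_pos += len(chunk)
        self._target_pos = self._stream_pos
        return chunk


stream = LazySeekStream(b"0123456789")
stream.seek(5)
stream.seek(2)
stream.seek(7)           # three seeks, no reopen yet
data = stream.read(2)    # the first read triggers a single reposition
print(data, stream.reopens)
```

In the eager design described under "Earlier implementation", each of the three `seek` calls would have cost a reopen; here the cost is paid once, at the first read that needs it.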
  • 16. Hive Access Patterns
    ⬢ ETL and Admin Activities
      – Bringing in dataset / Creating Tables
      – Cleansing / Transforming Data
      – Analyze Tables, Compute Column Statistics
      – MSCK to fix partition related information
    ⬢ Read
      – Running Queries
    ⬢ Write
      – Store Output
  • 17. Hive - MSCK Improvements
    ⬢ MSCK helps in fixing the metastore for a partitioned dataset
      – Scans the table path to identify missing partitions (expensive in S3)
  • 18. Hive - Analyze Column Statistics Improvements
    ⬢ Hive needs statistics to run queries efficiently
      – Gathering table and column statistics can be expensive in partitioned datasets
  • 19. Performance Considerations When Running Hive Queries
    ⬢ Splits Generation
      – File formats like ORC provide a thread pool for split generation
    ⬢ ORC Footer Cache
      – hive.orc.cache.stripe.details.size > 0
      – Caches footer details; helps reduce data reads during split generation
    ⬢ Reduce S3A reads on the task side
      – hive.orc.splits.include.file.footer=true
      – Sends ORC footer information in the splits payload.
      – Helps reduce the amount of data read on the task side.
  • 20. Performance Considerations When Running Hive Queries
    ⬢ Tez Splits Grouping
      – Hive uses Tez as its default execution engine
      – Tez groups splits based on min/max group setting, location details and so on
      – S3A always provides “localhost” as its block location information
      – When the length of all splits falls below the min group setting, Tez aggressively groups them into a single split. This causes issues with S3A, as a single task ends up doing sequential operations.
      – Fixed in recent releases
    ⬢ Container Launches
      – S3A always provides “localhost” for block locations.
      – Good to set “yarn.scheduler.capacity.node-locality-delay=0”
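
To see why a large min group setting collapses small splits into one task, consider the following size-only sketch. It is a deliberately simplified model, assuming every split reports the same location (as with S3A's "localhost") and ignoring the max group setting; the `group_splits` function is hypothetical and is not Tez's grouping algorithm.

```python
def group_splits(split_sizes, min_group_size):
    """Greedily pack splits into groups of at least min_group_size (toy model)."""
    groups, current, current_size = [], [], 0
    for size in split_sizes:
        current.append(size)
        current_size += size
        if current_size >= min_group_size:
            groups.append(current)
            current, current_size = [], 0
    if current:  # leftover splits form a final, undersized group
        groups.append(current)
    return groups


small_splits = [1, 2, 3]                 # sizes in MB, all at "localhost"
print(group_splits(small_splits, 16))    # total below the minimum: one group
print(group_splits(small_splits, 2))     # smaller minimum: work is spread out
```

With the total size below the minimum, every split lands in a single group, so one task would read all the data sequentially from the object store.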
  • 21. Hive-TestBench Benchmark Results
    ⬢ Hive-TestBench has a subset of queries from TPC-DS (https://github.com/hortonworks/hive-testbench)
    ⬢ m4.4xlarge - 5 nodes
    ⬢ TPC-DS @ 200 GB Scale in S3
    ⬢ “HDP 2.3 + S3 in cloud” vs “HDP 2.4 + S3 in cloud”
      – Average speedup 2.5x
      – Queries like 15, 17, 25, 73, 75 etc. did not run in HDP 2.3 (AWS timeout exceptions were thrown)
  • 22. Hive-TestBench Benchmark Results - LLAP
    ⬢ LLAP DAG runtime comparison with Hive
    ⬢ Reduces the amount of data to be read from S3 significantly; improves runtime.
  • 23. Best Practices
    ⬢ Tune multipart settings
      – fs.s3a.multipart.threshold (default: Integer.MAX_VALUE)
      – fs.s3a.multipart.size (default: 100 MB)
      – fs.s3a.connection.timeout (default: 200 seconds)
    ⬢ Disable node locality delay in YARN
      – Set “yarn.scheduler.capacity.node-locality-delay=0” to avoid delays in container launches
    ⬢ Disable Storage Based authorization in Hive
      – hive.security.metastore.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveMetastoreAuthorizationProvider
      – hive.metastore.pre.event.listeners= (set to empty value)
    ⬢ Tune ORC threads for reducing split generation times
      – hive.orc.compute.splits.num.threads (default 10)
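
The settings above that come with an explicit value can be rendered as Hadoop-style `<property>` entries. The sketch below is only a convenience illustration: the `SETTINGS` dictionary and the `to_hadoop_xml` helper are hypothetical, and it assumes that in a real deployment the YARN and Hive properties would go into their respective configuration files rather than a single one. The multipart and ORC thread settings are left out because the slide lists only their defaults, not recommended values.

```python
from xml.sax.saxutils import escape

# Only properties whose values are stated on the slide.
SETTINGS = {
    "yarn.scheduler.capacity.node-locality-delay": "0",
    "hive.security.metastore.authorization.manager":
        "org.apache.hadoop.hive.ql.security.authorization."
        "DefaultHiveMetastoreAuthorizationProvider",
    "hive.metastore.pre.event.listeners": "",  # intentionally empty
}


def to_hadoop_xml(settings: dict) -> str:
    """Render name/value pairs in the Hadoop configuration XML layout."""
    lines = ["<configuration>"]
    for name, value in settings.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


print(to_hadoop_xml(SETTINGS))
```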
  • 24. Next Steps for S3A and other Object Stores
    ⬢ S3A Phase III
      – https://issues.apache.org/jira/browse/HADOOP-13204
    ⬢ Output Committers
      – Logical commit operation decoupled from rename (non-atomic and costly in object stores)
    ⬢ Object Store Abstraction Layer
      – Avoid impedance mismatch with FileSystem API
      – Provide specific APIs for better integration with object stores: saving, listing, copying
    ⬢ Ongoing Performance Improvement
      – Less chatty call pattern for object listings
      – Metadata caching to mask latency of remote object store calls
    ⬢ Consistency
  • 25. Summary
    ⬢ Evolution towards cloud storage
    ⬢ Hadoop-compatible File System Architecture fosters integration with cloud storage
    ⬢ Integration with multiple cloud providers available: Azure, AWS, Google
    ⬢ Recent enhancements in S3A
    ⬢ Hive usage and TPC-DS benchmarks show significant S3A performance improvements
    ⬢ More coming soon for S3A and other object stores
  • 26. Q & A
    Thank You!