Tachyon: memory-centric, fault-tolerant storage for cluster frameworks


Tachyon is a memory-centric storage system that provides reliable data sharing across different cluster frameworks and jobs at memory speeds. It uses a lineage-based approach to track the sequence of jobs and tasks that create output files to enable fault tolerance through recomputation rather than replication. Tachyon consists of a lineage layer to capture metadata and deliver high throughput I/O, and a persistence layer that takes asynchronous checkpoints of hot files to bound recomputation costs following failures while avoiding impacting system performance. Evaluation results showed Tachyon was over 100x faster than disk-based systems and reduced network traffic by up to 50% while keeping recomputation overhead below 1.6%.

Tachyon: memory-centric, fault-tolerant
storage for cluster frameworks
presented by Viet-Trung Tran

Memory is King
• RAM throughput increasing exponentially
• Disk throughput increasing slowly
Memory locality is key to interactive response time

Memory as cache
• Improve READ
• Cannot help much with writes
• Replication for fault tolerance
• Network bandwidth and latency are much worse than those of memory
• Write throughput is limited by disk I/O
• Requires at least one copy on disk
• Inter-job data sharing cost dominates pipeline end-to-end latency
• 34% of jobs have outputs as large as their inputs (Cloudera survey)

Different jobs share data
Slow writes to disk
[Figure: two Spark tasks, each with its own Spark memory block manager holding blocks 1 and 3, share data through HDFS / Amazon S3 (blocks 1 to 4). The storage engine and execution engine run in the same process, and writes are slow.]

Different frameworks share data
Slow writes to disk
[Figure: a Spark task (Spark memory block manager holding blocks 1 and 3) and Hadoop MR on YARN share data through HDFS / Amazon S3 (blocks 1 to 4). The storage engine and execution engine run in the same process, and writes are slow.]

Tachyon: reliable data sharing at memory speed
within and across frameworks/jobs
[Figure: Tachyon sits between the frameworks (Spark, MapReduce, Spark SQL, H2O, GraphX, Impala, ...) and the storage systems (HDFS, S3, GlusterFS, OrangeFS, NFS, Ceph, ...).]

Challenges
How to achieve reliable data sharing without replication?

Target workload properties
• Immutable data
• Deterministic jobs
• Locality based scheduling
• All data vs working set
• Program size vs data size

System architecture
Consists of two layers
• Lineage
  • Deliver high throughput I/O
  • Capture sequence of jobs/tasks that create output
• Persistence
  • Asynchronous checkpoints
Facts
• One data copy in memory
• Recomputation for fault-tolerance

Master Node
• Similar to HDFS and GFS
• Passive standby model
• BUT also contains a workflow manager
  • Track lineage information
  • Compute checkpoint order
  • Interact with cluster resource manager to allocate resources for recomputations

Lineage metadata
• Binary program
• Configuration
• Input Files List
• Output Files List
• Dependency Type
  • Narrow (filter, map)
  • Wide (shuffle, join)
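
The metadata fields listed above could be held in a record like the one below. This is only an illustrative sketch: the names `LineageRecord`, `DependencyType`, and `index_by_output` are hypothetical and not taken from Tachyon's code base.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class DependencyType(Enum):
    NARROW = "narrow"  # e.g. filter, map
    WIDE = "wide"      # e.g. shuffle, join


@dataclass
class LineageRecord:
    """Hypothetical record with the fields named on the slide."""
    binary_program: str            # path or identifier of the job binary
    configuration: Dict[str, str]  # settings the job was run with
    input_files: List[str]
    output_files: List[str]
    dependency_type: DependencyType


def index_by_output(records: List[LineageRecord]) -> Dict[str, LineageRecord]:
    """Map each output file to the record of the job that produced it."""
    index: Dict[str, LineageRecord] = {}
    for record in records:
        for path in record.output_files:
            index[path] = record
    return index
```

Indexing by output file is one plausible lookup for recovery: given a lost file, it answers which program, configuration, and inputs would have to be rerun.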

Fault-recovery by recomputations
• Challenge
• Bounding the recomputation cost for a long running storage
• Asynchronous checkpointing
• Allocate resources for recomputations
• Make sure recomputation tasks get enough resources
• Do not impact system performance (task priorities)
• Assumptions
• Input files are immutable
• Job executions are deterministic
• Client side caching to mitigate read hotspots
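
A minimal sketch of recovery by recomputation, assuming immutable inputs and deterministic jobs as stated above. The lineage table layout and the `recover` function are hypothetical simplifications: a real system would schedule recomputation tasks through the cluster resource manager rather than call functions in-process.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical lineage table: output file -> (deterministic job, input files).
Lineage = Dict[str, Tuple[Callable[..., object], List[str]]]


def recover(path: str, lineage: Lineage, store: Dict[str, object]) -> object:
    """Return the contents of `path`, recomputing lost files from lineage."""
    if path in store:
        # Still available (in memory or checkpointed): nothing to recompute.
        return store[path]
    if path not in lineage:
        raise KeyError(f"{path} is lost and has no lineage")
    job, inputs = lineage[path]
    # Recover missing inputs first; this may cascade up the lineage chain.
    args = [recover(p, lineage, store) for p in inputs]
    # Determinism plus immutable inputs means rerunning yields the same data.
    store[path] = job(*args)
    return store[path]
```

The recursion makes the cost concern visible: without checkpoints, recovering one file can force recomputation of every ancestor back to the original inputs.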

Asynchronous checkpointing
• Goals
  • Bounded recomputation time
  • Checkpointing hot files
  • Avoid checkpointing temp files
• Edge algorithm
  • Modeling relationships of files with a DAG
    • Vertices are files
    • Edge from A to B if B is generated by a job that read A
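
The file DAG described above can be sketched as follows. The representation of a job as an `(inputs, outputs)` pair and the function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Job = Tuple[List[str], List[str]]  # hypothetical shape: (input files, output files)


def build_file_dag(jobs: Iterable[Job]) -> Dict[str, Set[str]]:
    """Return adjacency sets: dag[a] contains b if a job read a and wrote b."""
    dag: Dict[str, Set[str]] = {}
    for inputs, outputs in jobs:
        # Register every file as a vertex, even if it has no outgoing edge.
        for path in list(inputs) + list(outputs):
            dag.setdefault(path, set())
        for inp in inputs:
            dag[inp].update(outputs)
    return dag


def leaves(dag: Dict[str, Set[str]]) -> Set[str]:
    """Files that no job has read yet, i.e. vertices with no outgoing edge."""
    return {path for path, children in dag.items() if not children}
```

For example, with jobs `[(["A"], ["B"]), (["B"], ["C", "D"])]` the DAG has edges A to B, B to C, and B to D, and its leaves are C and D.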

Edge algorithm
• Checkpoint leaves
• Checkpointing hot files
  • Most files are accessed fewer than 3 times (Yahoo survey of big data workloads)
  • Thus, files accessed more than twice get checkpointed
• Dealing with large datasets
  • 96% of active jobs fit in the cluster memory
  • Synchronously write datasets above a defined threshold to disk
  • Most files in memory that are already checkpointed can be evicted to make room
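
A simplified sketch of the selection rule on this slide: checkpoint the leaves of the file DAG, plus any file accessed more than twice. The function name, its arguments, and the constant are hypothetical. Checkpoint ordering and the synchronous write path for large datasets are deliberately left out.

```python
from typing import Dict, Set

HOT_ACCESS_THRESHOLD = 2  # slide: files accessed more than twice are hot


def select_for_checkpoint(children: Dict[str, Set[str]],
                          access_counts: Dict[str, int],
                          checkpointed: Set[str]) -> Set[str]:
    """Pick files not yet checkpointed that are DAG leaves or hot."""
    selected: Set[str] = set()
    for path, kids in children.items():
        if path in checkpointed:
            continue  # already persisted, nothing to do
        is_leaf = not kids  # no job has consumed this file yet
        is_hot = access_counts.get(path, 0) > HOT_ACCESS_THRESHOLD
        if is_leaf or is_hot:
            selected.add(path)
    return selected
```

Intermediate files that have been consumed and are rarely read are skipped, which is how this rule tends to avoid checkpointing temporary files.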

Resource allocation
• Depend on the scheduling policy of the running cluster
• Requirements
• Priority compatibility
• Resource sharing
• Avoid cascading recomputation
• Best ordering recomputation
• Most common policies
• priority based
• weighted fair sharing

Evaluation
• 110x faster than MemHDFS
• 4x faster in realistic jobs
• 3.8x faster in case of failure
• Recovers from master failure within 1 second
• Reduces replication-caused network traffic by up to 50%
• Recomputation impact is less than 1.6%
