Companies with batch and stream processing pipelines need to serve the insights they glean back to their users, an often-overlooked problem that can be hard to achieve reliably and at scale. Felix GV and Yan Yan offer an overview of Venice, a new data store capable of ingesting data from Hadoop and Kafka, merging it together, replicating it globally, and serving it online at low latency.
Venice was designed to be the next-generation replacement of the Voldemort Read-Only system, with the intent to provide a broader feature set, better availability characteristics, and a more efficient architecture. Venice is designed for high-throughput ingestion from Hadoop and Kafka, and these data sources can be merged at ingestion time in order to provide semantics similar to those of a lambda architecture but with a simpler, faster, and more available read path. Robustness is a primary architectural concern and, as such, Venice provides highly available reads and writes, self-healing, stringent data validation guarantees, and the ability to roll back entire datasets in cases where bad data is pushed.
Venice is a derived data store that can handle both batch and streaming data. It uses Kafka to ingest all data, whether from Hadoop batch jobs or real-time sources like Samza. This allows Venice to offer a hybrid storage model that can merge the two data types. Venice improves on earlier systems by offering high availability, automatic data distribution, and seamless rollbacks between versions. It has been in production at LinkedIn since 2016 and is replacing their legacy Voldemort read-only stores.
1. Introducing Venice
A Derived Data Store for Batch, Streaming & Lambda Architectures
Felix GV
Engineer
Yan Yan
Engineer
4. Kinds of Data
Primary Data
• Source of Truth
• Example use case:
• Profile
• Example systems:
• SQL
• Document Stores
• K-V Stores
Derived Data
• Derived from computing primary data
• Example use case:
• People You May Know
• Example systems:
• Search Indices
• Graph Databases
• K-V Stores
9. Overview
Voldemort Read-Only
• Generates binary files on Hadoop
• Bulk loads data from Hadoop
• (in the background)
• Swaps new data when ready
• Keeps last dataset as a backup
• Allows quick rollbacks
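To make the bulk-load-and-swap pattern above concrete, here is a minimal Java sketch of the idea: a new dataset is built offline, swapped in atomically, and the previous version is kept around so a rollback is cheap. The class and method names are illustrative assumptions, not Voldemort's actual API.

```java
// Illustrative sketch of bulk-load-and-swap with rollback; names are hypothetical.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

final class ReadOnlyStore {
    private volatile Map<String, String> current;            // version currently serving reads
    private final Deque<Map<String, String>> backups = new ArrayDeque<>();

    /** Bulk data is built offline (e.g., on Hadoop) and handed over here. */
    synchronized void swapIn(Map<String, String> newVersion) {
        if (current != null) {
            backups.push(current);                            // keep the last dataset as a backup
        }
        current = newVersion;                                 // atomic swap for readers
    }

    /** Quick rollback: re-activate the previously served version. */
    synchronized void rollback() {
        if (!backups.isEmpty()) {
            current = backups.pop();
        }
    }

    String get(String key) {
        return current.get(key);                              // reads always hit the active version
    }
}
```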
10. Scale
Voldemort Read-Only
• At LinkedIn:
• ~1000 nodes
• > 500 stores
• ~ 1 PB of SSD storage
• > 240 TB refreshed / day
• ~ 500K queries / second
18. Design Goals
Venice
• To replace Voldemort Read-Only
• Drop-in replacement
• More efficient
• More resilient
• More operable
• To enable new use cases “as a service”
• Nearline derived data
• Lambda Architecture
19. Read/Write API
Venice
• Derived data K-V store
• Single Get
• Batch Get
• High throughput ingestion from:
• Hadoop
• Samza
• Or both (hybrid)
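To illustrate the read API, here is a hedged, hypothetical client-side sketch of single get and batch get. The interface and class names are assumptions for illustration, not Venice's published client API.

```java
// Hypothetical sketch of a derived-data K-V read client: single get and batch get.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

interface DerivedStoreClient<K, V> {
    CompletableFuture<V> get(K key);                      // single get
    CompletableFuture<Map<K, V>> batchGet(Set<K> keys);   // batch get
}

class ReadExample {
    static void lookup(DerivedStoreClient<Long, String> client) throws Exception {
        String one = client.get(42L).get();                              // single-key lookup
        Map<Long, String> many = client.batchGet(Set.of(1L, 2L, 3L)).get(); // multi-key lookup
        System.out.println(one + " / " + many.size() + " results");
    }
}
```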
28. Global Replication
Single replica pushed across the WAN
Many replicas consumed locally
[Diagram: Hadoop Push Job in the Source DC (DC 1) → Mirror Maker → DC 2, DC 3 → Storage Nodes]
30. Metadata Replication
Architecture
• Admin operations performed on parent
• Store creation/deletion
• Schema evolution
• Quota changes, etc.
• Metadata replicated via “admin topic”
• Resilient to transient DC failures
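The following is a minimal sketch of how such an admin topic could be consumed idempotently, so that replaying messages after a transient DC failure is safe. The message fields, operation types, and handlers are assumptions for illustration, not Venice's actual admin protocol.

```java
// Sketch of replaying metadata operations from an "admin topic"; all names are assumed.
import java.util.List;

enum AdminOpType { CREATE_STORE, DELETE_STORE, ADD_SCHEMA, UPDATE_QUOTA }

record AdminOp(long executionId, AdminOpType type, String storeName, String payload) {}

final class AdminTopicConsumer {
    private long lastAppliedExecutionId = -1;    // persisted so replay after a DC outage is idempotent

    void apply(List<AdminOp> batchFromKafka) {
        for (AdminOp op : batchFromKafka) {
            if (op.executionId() <= lastAppliedExecutionId) {
                continue;                         // already applied; safe to re-consume after recovery
            }
            switch (op.type()) {
                case CREATE_STORE -> System.out.println("create store " + op.storeName());
                case DELETE_STORE -> System.out.println("delete store " + op.storeName());
                case ADD_SCHEMA   -> System.out.println("evolve schema for " + op.storeName());
                case UPDATE_QUOTA -> System.out.println("update quota for " + op.storeName());
            }
            lastAppliedExecutionId = op.executionId();
        }
    }
}
```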
31. Kafka Usage
Architecture
• One topic per store-version
• Kafka is fully managed by the controller
• Dynamic topic creation/deletion
• Infinite retention
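As a hedged sketch of the "one topic per store-version" pattern, the snippet below uses the standard Kafka AdminClient to create and delete version topics. The topic-naming convention, partition and replication counts, and retention settings are assumptions for illustration.

```java
// Sketch of dynamic per-version topic management with the Kafka AdminClient.
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

class VersionTopicManager {
    private final Admin admin;

    VersionTopicManager(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        this.admin = Admin.create(props);
    }

    /** Called by the controller when a new push creates a new store-version. */
    void createVersionTopic(String storeName, int version) throws Exception {
        String topic = storeName + "_v" + version;               // assumed naming convention
        NewTopic newTopic = new NewTopic(topic, 12, (short) 3)   // assumed partition/replica counts
                .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "-1")); // "infinite" retention
        admin.createTopics(List.of(newTopic)).all().get();
    }

    /** Called when the oldest version is retired after a successful swap. */
    void deleteVersionTopic(String storeName, int version) throws Exception {
        admin.deleteTopics(List.of(storeName + "_v" + version)).all().get();
    }
}
```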
32. Step 1/3: Steady State, In-between Bulkloads
[Diagram: Data Source (Hadoop) | Kafka Topics (Store v6, Store v7) | Venice Processes (Router). Not consuming, unless restoring a failed replica.]
33. Step 2/3: Offline Bulkload Into New Store-Version
[Diagram: Data Source (Hadoop) → Push Job → Kafka Topics (Store v6, Store v7, Store v8) | Venice Processes (Router)]
34. Step 3/3: Bulkload Finished, Router Swaps to New Version
[Diagram: Data Source (Hadoop) → Push Job → Kafka Topics (Store v6, Store v7, Store v8) | Venice Processes (Router, now serving Store v8)]
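The three steps above can be condensed into a small sketch of the version lifecycle a controller might keep: ingest a new version in the background, swap read traffic to it, then retire the oldest version. Names and the number of retained versions are illustrative assumptions, not Venice's implementation.

```java
// Condensed sketch of the lifecycle shown in slides 32-34; names are illustrative.
import java.util.ArrayList;
import java.util.List;

final class StoreVersionLifecycle {
    private final List<Integer> onlineVersions = new ArrayList<>(List.of(6, 7)); // v6 (backup), v7 (current)
    private volatile int currentVersion = 7;
    private final int versionsToKeep = 2;

    /** Step 2/3: push job starts; the new version ingests without touching v7 reads. */
    int startPush() {
        int newVersion = onlineVersions.get(onlineVersions.size() - 1) + 1;   // e.g. v8
        onlineVersions.add(newVersion);
        return newVersion;
    }

    /** Step 3/3: ingestion complete; the router swaps, and the oldest version is retired. */
    void completePush(int newVersion) {
        currentVersion = newVersion;                        // atomic swap for read traffic
        while (onlineVersions.size() > versionsToKeep) {
            int retired = onlineVersions.remove(0);         // e.g. drop v6: delete its topic + local data
            System.out.println("retiring version " + retired);
        }
    }

    int servingVersion() { return currentVersion; }
}
```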
36. Overview
Hybrid Store
• Hybrid Stores aim to
• Merge batch and streaming data
• Better read path performance than Lambda Architecture
• Minimize application complexity
37. Data Merge
Hybrid Store
• Write-time merge
• All writes go through Kafka
• Hadoop writes into store-version topics
• Samza writes into a Real-Time Buffer topic (RTB)
• The RTB gets replayed into store-version topics
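A minimal sketch of the write-time merge described above, assuming the batch data lands in the store-version topic first and the Real-Time Buffer is then replayed on top of it, so nearline writes win on key conflicts. All names are illustrative, not Venice's implementation.

```java
// Sketch of write-time merge: bulk load first, then replay the RTB in offset order.
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class HybridVersionBuilder {
    private final Map<String, String> versionTopic = new HashMap<>(); // stand-in for the store-version topic

    /** Bulk-loaded records from Hadoop land first. */
    void ingestBatch(Map<String, String> hadoopRecords) {
        versionTopic.putAll(hadoopRecords);
    }

    /** The RTB (written by Samza) is replayed afterwards; later writes overwrite batch values. */
    void replayRealTimeBuffer(LinkedHashMap<String, String> rtbInOffsetOrder) {
        rtbInOffsetOrder.forEach(versionTopic::put);
    }

    Map<String, String> mergedView() { return versionTopic; }
}
```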
38. Step 1/4: Steady State, In-between Bulkloads
[Diagram: Data Sources (Hadoop, Samza) | Kafka Topics (Store v7) | Venice Processes (Router)]
39. Step 2/4: Offline Bulkload Into New Store-Version
[Diagram: Data Sources (Hadoop via Push Job, Samza) | Kafka Topics (Store v7, Store v8) | Venice Processes (Router)]
40. Step 3/4: Bulkload Finished, Start Buffer Replay
[Diagram: Data Sources (Hadoop via Push Job, Samza) | Kafka Topics (Store v7, Store v8) | Venice Processes (Router)]
41. Step 4/4: Replay Caught Up, Router Swaps to New Version
[Diagram: Data Sources (Hadoop, Samza) | Kafka Topics (Store v7, Store v8) | Venice Processes (Router, now serving Store v8)]
42. Usage Patterns
Hybrid Store
• Offline Source
• Traditional Hadoop job
• Samza “reprocessing” job
• Nearline Source
• Overwrite same keys
• Write into different keys
Thanks, Felix. I am Yan from the Venice team, and I am going to give you a brief introduction to Venice, our new-generation derived data platform, highlighting our design goals, the API and main features we have, the scalability Venice supports, and some tradeoffs we made when we designed the system.
So let's start with the design goals. Venice is the successor of Voldemort Read-Only, so it obviously has to take over everything Voldemort Read-Only has been doing, but more efficiently, more resiliently, and more operably. One more important point is that we want Venice to be a drop-in replacement for Voldemort, because we have hundreds of users living on Voldemort, and all of their data will eventually be moved to Venice. As you can imagine, migrating that much data is not easy, which is why we have to make the migration as smooth as possible. Another main goal of Venice is that, in addition to offline derived data, it can also serve nearline derived data, and it has the ability to merge offline and nearline data together to give users a unified view of both in one system.
All right, now that we are clear on the goals, let's see what APIs and features Venice has to achieve them. On the read path, you can think of Venice as a distributed key-value store, so the single-get and batch-get APIs are obviously required. On the write path, high-throughput ingestion from both Hadoop and Samza keeps the system efficient, because we have hundreds of terabytes of data being written into the system every day. Venice provides an asynchronous way to write data from the data source into Venice; I will explain the details when we talk about trade-offs.
The first feature I want to introduce is data versioning. Inside Venice, we keep multiple versions of your data. Once a user starts to ingest a new offline dataset, that data is written in the background without impacting the current data version, which keeps serving read requests. When the new version is ready to serve, Venice does an atomic swap so that all read requests hit it instead. The whole process is almost transparent to the user; the only thing the user needs to do is start the ingestion, and Venice manages everything else. Of course, if you find that your data has any issue, you can also roll back quickly: just tell Venice which version you want to use.
On this slide I want to talk about three new features in Venice, all aimed at solving pain points we had with Voldemort. First, Avro schema evolution allows users to update the schema of their data instead of creating a new store, which is what they had to do in Voldemort. Second, dynamic service discovery: we built our own service discovery feature on top of D2, a dynamic discovery framework open-sourced by LinkedIn. On the client side, users do not need to specify which endpoint to talk to; instead, Venice finds the proper server for you based on the store you are using, and in case of any server failure, Venice gives you another server as a replacement, so your application can focus on its business logic. Another benefit of this feature is that no extra configuration is needed if you migrate from Voldemort to Venice. Third, we introduced Helix, the open-source cluster management framework widely used at LinkedIn. It provides fully automatic, rack-aware data replica placement. On top of this framework we implemented further features, namely zero-downtime cluster expansion and upgrades. This is a big improvement over Voldemort in terms of availability: it means users can continue their data ingestion during our maintenance windows, which normally take two to three hours each.
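As a hedged illustration of the Avro schema evolution just mentioned, the example below adds a field with a default value, which keeps records written with the old schema readable under the new one. The schemas themselves are made up for illustration.

```java
// Avro schema evolution: a new field with a default keeps old and new schemas compatible.
import org.apache.avro.Schema;

class SchemaEvolutionExample {
    static final String V1 = """
        {"type":"record","name":"MemberFeature","fields":[
          {"name":"memberId","type":"long"},
          {"name":"score","type":"float"}]}""";

    static final String V2 = """
        {"type":"record","name":"MemberFeature","fields":[
          {"name":"memberId","type":"long"},
          {"name":"score","type":"float"},
          {"name":"modelVersion","type":"string","default":"baseline"}]}""";

    public static void main(String[] args) {
        Schema oldSchema = new Schema.Parser().parse(V1);
        Schema newSchema = new Schema.Parser().parse(V2);
        // Records written with V1 can still be read with V2: the missing
        // "modelVersion" field is filled in from its default value.
        System.out.println(oldSchema.getFields().size() + " -> " + newSchema.getFields().size());
    }
}
```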
In terms of scalability, scalability means two things in Venice. The first is that we have to support a large scale of machines and data: Venice runs across multiple data centers on different continents, and it can also run multiple clusters in one physical data center in order to get better resource utilization for different use cases. Secondly, we want to run Venice as a service, which means Venice has to support a large scale of users. This is why we provide self-service onboarding through our internal cloud management platform, letting users manage their stores without involving our SREs or developers. Each cluster is a multi-tenant environment, which means users share resources like CPU and storage, so we introduced several ways to do resource isolation, such as QPS quotas, storage quotas, and multiple clusters, to prevent users from impacting each other.
Now, trade-offs. As I said, we need a way to do high-throughput ingestion to keep the system efficient. The main trade-off we made is how to ingest a large amount of data into Venice. Basically we have two options: fetch data from the data source directly, or write data into an intermediary and then fetch it from that intermediary asynchronously. We decided to make all writes go through Kafka: we write all data into Kafka first and treat Kafka as the source of truth for Venice. As you know, Kafka is a scalable message queue that provides good support for high-throughput writes, so we can accept users' data as fast and in as large a volume as possible. Remember that we do bulk loads from Hadoop, so we always face bursts of messages; with Kafka we are burst tolerant, because Kafka persists those messages first and Venice consumes them gradually, preventing us from running out of capacity when a large number of messages arrive in a short period. What we paid for this asynchronous push mechanism is that, in nearline cases, Venice does not provide "read your writes" semantics. Imagine that your data has been written into Kafka and Venice returns that your write succeeded. You still cannot see the data you just wrote, because Venice normally takes several seconds to consume that data and persist it locally before making it visible to the client. We think this is acceptable for most of our use cases, and we are also working on a workaround to support read-your-writes semantics with this push mechanism.
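A small sketch of the trade-off just described, assuming hypothetical writer and reader interfaces: the write is acknowledged once it is durably in Kafka, but a subsequent read may not see it until the storage nodes have consumed and persisted it.

```java
// Illustration of asynchronous ingestion without read-your-writes; interfaces are hypothetical.
import java.util.Optional;
import java.util.concurrent.TimeUnit;

interface AsyncWriter<K, V> { void write(K key, V value); }   // returns once the write is in Kafka
interface StoreReader<K, V> { Optional<V> read(K key); }      // serves whatever is persisted locally

class ReadYourWritesDemo {
    static void demo(AsyncWriter<String, String> writer, StoreReader<String, String> reader)
            throws InterruptedException {
        writer.write("member:42", "updated-profile");          // "success" = durably queued, not yet visible
        Optional<String> value = reader.read("member:42");     // may still return the old value (or nothing)
        while (value.isEmpty() || !value.get().equals("updated-profile")) {
            TimeUnit.MILLISECONDS.sleep(200);                  // visibility typically lags by seconds
            value = reader.read("member:42");
        }
        System.out.println("now visible: " + value.get());
    }
}
```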
Now I want to describe the architecture of Venice to help you understand how we implement the features mentioned above. I am going to introduce the main components in Venice and how they interact with each other, then jump into global replication, which lets us sync both data and metadata across multiple data centers, and finally explain how we use Kafka, because the way we use it is slightly different from the common case.
There are two kinds of components in Venice. We have processes running on the server side: the storage nodes, which actually host the datasets; the routers, which are the gateway of the cluster, so every request hits a router first and is then forwarded to the proper nodes; and the controller, which manages the whole cluster. The other kind of component is the libraries embedded in users' applications. Of course we have the client library that lets users read data from Venice, and an H2V push job plugin in Azkaban that lets users push offline derived data into Venice. For nearline derived data, we provide a Samza system producer, so users can push data from a Samza job into Venice as well.
This diagram shows how the components interact with each other. The blue shapes are Venice components we built and the gray shapes are the dependencies we rely on. Your data starts out in a data source like Hadoop, and in order to ingest it into Venice, you start a new push job and wait for it to complete. Underneath the job, the Venice controller creates all the essentials for you, such as a new data version and a new Kafka topic. Besides that, the controller picks the proper storage nodes to host your dataset. Those storage nodes consume the data written by your push job from Kafka in parallel and report their status to the controller regularly. Once the controller has enough information and considers the data ready to serve, it notifies the routers to use the new data version; after the version swap, your data is visible to read clients. That is the whole lifecycle of an offline push job. For nearline derived data, Felix will give more details later.
As I said, Venice is a planet-scale system, so each piece of data must be replicated to multiple data centers located on different continents. In Voldemort, global replication meant that each replica in each data center read its own copy of the data from the source data center, so every server pulled data redundantly, which ate up a lot of overseas bandwidth and slowed down the whole push.
In Venice we built a new global replication mechanism: the push job writes data into Kafka in the source data center, and then we rely on Kafka MirrorMaker to replicate each message to all target data centers. Note that we only send one copy of the data to each remote data center; to keep enough replicas, multiple storage nodes consume that same copy from their local Kafka cluster and persist it in their local storage engine.
With this new global replication mechanism, we saved 40% of the time spent on the whole push job, and we also reduced cross-data-center bandwidth usage: depending on the replication factor, two thirds or half of the bandwidth cost is saved.
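As a rough illustration of where the bandwidth saving comes from (the exact figure depends on the replication factor): with a replication factor of three, the old approach shipped three copies of the data across the WAN to each remote data center, one per replica, while the MirrorMaker approach ships a single copy that the three local replicas then consume from their local Kafka cluster, cutting cross-data-center traffic for that push to one third, a two-thirds saving; with a replication factor of two, the saving is one half.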
Besides data replication, we replicate our metadata across data centers as well. There is a dedicated Kafka topic, called the admin topic, that we use to transfer metadata operations. Once you create a store in the source data center, all target data centers receive the corresponding admin message and execute the related operation to create the store, keeping the metadata consistent. If an entire data center fails, we still have that data in Kafka, either in the source Kafka cluster or in a target Kafka cluster. Once the data center recovers, the admin messages are eventually consumed by the Venice processes running there and handled properly by that data center's controller, so no manual operation is needed to handle data center failures and metadata inconsistency.
Unlike most Kafka use cases, we create a topic whenever a new data version is created and delete that topic once the associated version is retired. All of this topic creation and deletion is dynamic and fully managed by the controller, so a topic here is no longer a pre-created resource with a long-term lifecycle.
Imagine we already have two data versions in your store, v6 and v7. V7 is the current version serving read requests.
Now you start a new push job, so Venice creates v8 for this push.
Once the push job succeeds, v8 is ready to serve, so Venice swaps the current version from v7 to v8 and meanwhile retires the oldest version, v6, deleting the associated Kafka topic and the data persisted in the local storage engine. We still keep two versions of your dataset, but we have completed one round of data swap.
All right, that's all about the Venice architecture and the offline push job. I will hand it back to Felix to introduce more about our hybrid design. Thank you.