This document summarizes Facebook's Data Freeway and Puma systems for real-time analytics. Data Freeway is a scalable data streaming framework whose components, including Scribe, Calligraphus, and Continuous Copier, reliably stream data at high throughput with low latency. Puma is Facebook's real-time aggregation engine: it consumes streams from Data Freeway, performs aggregations as data arrives, and stores the results in HBase, enabling real-time queries with latency of only seconds. Puma evolved from an initial version, Puma2, to Puma3, which supports more complex aggregations and queries.
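To make the aggregation model concrete, here is a minimal sketch (not Facebook's actual implementation) of Puma3-style processing: counts are aggregated in memory per key, then the window is checkpointed to HBase with atomic increments. The table, family, and qualifier names are hypothetical.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamAggregatorSketch {
    // In-memory per-key counts for the current aggregation window.
    private final Map<String, Long> window = new HashMap<>();

    // Called for every incoming event; aggregation happens in memory first.
    public void onEvent(String key) {
        window.merge(key, 1L, Long::sum);
    }

    // Periodically checkpoint the window to HBase using atomic increments.
    public void checkpoint(Configuration conf) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("puma_counters"))) { // hypothetical table
            for (Map.Entry<String, Long> e : window.entrySet()) {
                Increment inc = new Increment(Bytes.toBytes(e.getKey()));
                inc.addColumn(Bytes.toBytes("d"), Bytes.toBytes("count"), e.getValue());
                table.increment(inc);
            }
            window.clear();
        }
    }
}
```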
Twitter uses HBase for mutable data storage in batch processing and for operational intelligence. They run HBase 0.94 on Hadoop 2.0 across multiple clusters managed by Puppet. HBase stores operational audit logs and serves Python applications. hRaven stores and analyzes MapReduce job metrics to optimize Pig reducers, plan cluster capacity, and troubleshoot problems. It tracks 12.6M jobs across flows, clusters, users and versions.
The document discusses NHN Japan's use of HBase for the LINE messaging platform's storage infrastructure. Some key points:
- HBase is used to store tens of billions of message rows per day for LINE, achieving sub-10ms response times and high availability through dual clusters.
- The presentation covers their experience migrating HBase clusters between data centers online, handling NameNode failures, and stabilizing the LINE message storage cluster.
- It describes the custom HBase replication and bulk data migration tools developed by NHN Japan to support online cluster migrations without downtime. Failure handling and cluster stabilization techniques are also discussed.
This document provides an overview of effective HBase health checking and troubleshooting. It discusses HBase architecture including the roles of the master, regionservers and Zookeeper. It then describes various tools and utilities for troubleshooting like the master and regionserver UIs, process logs, JMX stats, the HBase shell, HBCK and performance evaluation tools. It also covers common problems like HBase not serving data or abrupt regionserver restarts and provides steps to troubleshoot these issues.
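As a taste of programmatic health checking (beyond the UIs, logs, and HBCK the deck covers), below is a hedged sketch using the HBase 1.x Admin API to report live and dead regionservers; the surrounding setup and output format are assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HealthCheckSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            ClusterStatus status = admin.getClusterStatus();
            System.out.println("Live regionservers: " + status.getServersSize());
            System.out.println("Average regions per server: " + status.getAverageLoad());
            for (ServerName dead : status.getDeadServerNames()) {
                System.out.println("Dead regionserver: " + dead); // a starting point for troubleshooting
            }
        }
    }
}
```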
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment (HBaseCon)
Pinterest runs 38 different HBase clusters in production, doing a lot of different types of work—with some doing up to 5 million operations per second. In this talk, you'll get details about how we do capacity planning, maintenance tasks such as online automated rolling compaction, configuration management, and monitoring.
This document compares Apache Flume and Apache Kafka for use in data pipelines. It describes Conversant's evolution from a homegrown log collection system to using Flume and then integrating Kafka. Key points covered include how Flume and Kafka work, their capabilities for reliability, scalability, and ecosystems. The document also discusses customizing Flume for Conversant's needs, and how Conversant monitors and collects metrics from Flume and Kafka using tools like JMX, Grafana dashboards, and OpenTSDB.
Adobe has packaged HBase in Docker containers and uses Marathon and Mesos to schedule them—allowing us to decouple the RegionServer from the host, express resource requirements declaratively, and open the door for unassisted real-time deployments, elastic (up and down) real-time scalability, and more. In this talk, you'll hear what we've learned and explain why this approach could fundamentally change HBase operations.
HBase Accelerated introduces an in-memory flush and compaction pipeline for HBase to improve performance of real-time workloads. By keeping data in memory longer and avoiding frequent disk flushes and compactions, it reduces I/O and improves read and scan latencies. Evaluation on workloads with high update rates and small working sets showed the new approach significantly outperformed the default HBase implementation by serving most data from memory. Work is ongoing to further optimize the in-memory representation and memory usage.
The document discusses several key factors for optimizing HBase performance including:
1. Reads and writes compete for disk, network, and thread resources so they can cause bottlenecks.
2. Memory allocation needs to balance space for memstores, block caching, and Java heap usage.
3. The write-ahead log can be a major bottleneck and increasing its size or number of logs can improve write performance.
4. Flushes and compactions need to be tuned to avoid premature flushes causing "compaction storms".
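To ground points 3 and 4, the sketch below shows how such knobs are commonly set programmatically. The property names are standard HBase configuration keys, but the values are illustrative assumptions, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TuningSketch {
    public static Configuration tunedConf() {
        Configuration conf = HBaseConfiguration.create();
        // Point 3: allow more WAL files before forced flushes.
        conf.setInt("hbase.regionserver.maxlogs", 64);
        // Point 4: a larger memstore flush size to avoid premature flushes...
        conf.setLong("hbase.hregion.memstore.flush.size", 256L * 1024 * 1024);
        // ...and more store files tolerated before writes block ("compaction storms").
        conf.setInt("hbase.hstore.blockingStoreFiles", 20);
        conf.setInt("hbase.hstore.compactionThreshold", 4);
        return conf;
    }
}
```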
HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W... (Cloudera, Inc.)
Gap Inc Direct, the online division of Gap Inc., uses HBase to serve, in real time, the apparel catalog for all of its brands' and markets' web sites. This case study will review the business case as well as key decisions regarding schema selection and cluster configurations. We will also discuss implementation challenges and insights gained along the way.
Apache HBase, Accelerated: In-Memory Flush and Compaction (HBaseCon)
Eshcar Hillel and Anastasia Braginsky (Yahoo!)
Real-time HBase application performance depends critically on the amount of I/O in the datapath. Here we’ll describe an optimization of HBase for high-churn applications that frequently insert/update/delete the same keys, such as for high-speed queuing and e-commerce.
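This work later shipped in HBase 2.0 as the CompactingMemStore. As a hedged illustration only (2.0-era API; the table and family names are hypothetical), in-memory compaction can be enabled per column family:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.MemoryCompactionPolicy;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class InMemoryCompactionSketch {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("queue")) // hypothetical
                .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("f"))
                    // BASIC keeps flattened segments in memory; EAGER also eliminates
                    // duplicate and deleted cells, which suits high-churn keys.
                    .setInMemoryCompaction(MemoryCompactionPolicy.BASIC)
                    .build())
                .build());
        }
    }
}
```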
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ... (Cloudera, Inc.)
HBase Coprocessors allow user code to be run on region servers within each region of an HBase table. Coprocessors are loaded dynamically and scale automatically as regions are split or merged. They provide hooks into various HBase operations via observer classes and define an interface for custom endpoint calls between clients and servers. Examples of use cases include secondary indexes, filters, and replacing MapReduce jobs with server-side processing.
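For flavor, here is a minimal observer sketch against the 1.x coprocessor API. It only logs each Put reaching a region, standing in for real use cases such as secondary-index maintenance; the class name is hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Loaded per region; hooks fire inside the regionserver and scale with splits/merges.
public class AuditObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // A real observer might update a secondary-index table here instead of logging.
        System.out.println("prePut row=" + Bytes.toStringBinary(put.getRow()));
    }
}
```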
Introduction to Streaming and Messaging: Flume, Kafka, SQS, Kinesis (Omid Vahdaty)
Does big data leave you a bit confused? Messaging? Batch processing? Data streaming? In-flight analytics? Cloud? Open source? Flume? Kafka? Flafka (both)? SQS? Kinesis? Firehose?
Yfrog uses HBase as its scalable database backend to store and serve 250 million photos from over 60 million monthly users across 4 HBase clusters ranging from 50TB to 1PB in size. The authors provide best practices for configuring and monitoring HBase, including using smaller commodity servers, tuning JVM garbage collection, monitoring metrics like thread usage and disk I/O, and implementing caching and replication for high performance and reliability. Following these practices has allowed Yfrog's HBase deployment to run smoothly and efficiently.
This document summarizes a talk about Facebook's use of HBase for messaging data. It discusses how Facebook migrated data from MySQL to HBase to store metadata, search indexes, and small messages in HBase for improved scalability. It also outlines performance improvements made to HBase, such as for compactions and reads, and future plans such as cross-datacenter replication and running HBase in a multi-tenant environment.
The document provides an introduction and agenda for an HBase presentation. It begins with an overview of HBase and discusses why relational databases are not scalable for big data through examples of a growing website. It then introduces concepts of HBase including its column-oriented design and architecture. The document concludes with hands-on examples of installing HBase and performing basic operations through the HBase shell.
This document provides an overview of HBase architecture and advanced usage topics. It discusses course credit requirements, HBase architecture components like storage, write path, read path, files, region splits and more. It also covers advanced topics like secondary indexes, search integration, transactions and bloom filters. The document emphasizes that HBase uses log-structured merge trees for efficient data handling and operates at the disk transfer level rather than disk seek level for performance. It also provides details on various classes involved in write-ahead logging.
The document summarizes the HBase 1.0 release which introduces major new features and interfaces including a new client API, region replicas for high availability, online configuration changes, and semantic versioning. It describes goals of laying a stable foundation, stabilizing clusters and clients, and making versioning explicit. Compatibility with earlier versions is discussed and the new interfaces like ConnectionFactory, Connection, Table and BufferedMutator are introduced along with examples of using them.
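A short usage sketch of those 1.0 interfaces (table and column names are hypothetical): ConnectionFactory creates a heavyweight, shareable Connection; Table replaces HTable as a lightweight per-use handle; and BufferedMutator batches asynchronous writes.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NewApiSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // Table: lightweight, not thread-safe, cheap to create per operation.
            try (Table table = conn.getTable(TableName.valueOf("example"))) {
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
                table.put(put);
                Result r = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"))));
            }
            // BufferedMutator: client-side write buffering for high-throughput ingestion.
            try (BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("example"))) {
                Put put = new Put(Bytes.toBytes("row2"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
                mutator.mutate(put); // buffered; flushed in the background or on close()
            }
        }
    }
}
```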
Apache HBase is Hadoop's open source, distributed, versioned storage manager, well suited for random, real-time read/write access. This talk gives an overview of how HBase achieves random I/O, focusing on the storage layer internals: starting from how the client interacts with Region Servers and the Master, then going into WAL, MemStore, compactions, and on-disk format details. It also looks at how the storage is used by features like snapshots, and how it can be improved to gain flexibility, performance, and space efficiency.
This document discusses tuning HBase and HDFS for performance and correctness. Some key recommendations include:
- Enable HDFS sync on close and sync behind writes for correctness on power failures.
- Tune HBase compaction settings like blockingStoreFiles and compactionThreshold based on whether the workload is read-heavy or write-heavy.
- Size RegionServer machines based on disk size, heap size, and number of cores to optimize for the workload.
- Set client and server RPC chunk sizes like hbase.client.write.buffer to 2MB to maximize network throughput.
- Configure various garbage collection settings in HBase such as -Xmn512m and -XX:+UseCMSInitiatingOccupancyOnly.
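A hedged sketch of the write-buffer recommendation above: the property name is a real HBase key and the 2 MB value comes from the summary, while the table name is hypothetical. GC flags such as -Xmn512m belong in hbase-env.sh rather than application code.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class WriteBufferSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Default client write buffer for all BufferedMutators created from this conf.
        conf.setLong("hbase.client.write.buffer", 2L * 1024 * 1024); // 2 MB, per the tuning advice
        try (Connection conn = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator = conn.getBufferedMutator(
                 // ...or set it per mutator via BufferedMutatorParams.
                 new BufferedMutatorParams(TableName.valueOf("example")) // hypothetical table
                     .writeBufferSize(2L * 1024 * 1024))) {
            // Writes issued via mutator.mutate(...) are batched until the buffer fills.
        }
    }
}
```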
Hadoop World 2011: Apache HBase Road Map - Jonathan Gray - Facebook (Cloudera, Inc.)
This technical session will provide a quick review of the Apache HBase project, looking at it from the past to the future. It will cover the imminent HBase 0.92 release as well as what is slated for 0.94 and beyond. A number of companies and use cases will be used as examples to describe the overall direction of the HBase community and project.
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage (Cloudera, Inc.)
Pinterest uses Apache HBase to store data for users' personalized "following feeds" at scale. This involves storing billions of pins and updates per day. Some key challenges addressed are handling high throughput writes from fanouts, providing low latency reads, and resolving potential data inconsistencies from race conditions. Optimizations to HBase include increased memstore size, block cache tuning, and prefix compression. Maintaining high availability involves writing to dual clusters, tight Zookeeper timeouts, and automated repairs.
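One of the optimizations listed, prefix compression, is exposed through data block encoding. A hedged 1.x sketch with hypothetical table and family names:

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class FeedSchemaSketch {
    public static HTableDescriptor feedTable() {
        HColumnDescriptor cf = new HColumnDescriptor("d"); // hypothetical family
        // PREFIX encoding shrinks repeated key prefixes, common in feed-style row keys.
        cf.setDataBlockEncoding(DataBlockEncoding.PREFIX);
        cf.setBlockCacheEnabled(true); // keep hot feed blocks cached for low-latency reads
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("following_feed"));
        table.addFamily(cf);
        return table;
    }
}
```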
HBase-2.0.0 has been a couple of years in the making. It is chock-a-block full of a long list of new features and fixes. In this session, the 2.0.0 release manager will perform the impossible, describing the release content inside the session time bounds.
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce (Cloudera, Inc.)
The strength of an open source project resides entirely in its developer community; a strong democratic culture of participation and hacking makes for a better piece of software. The key requirement is having developers who are not only willing to contribute, but also knowledgeable about the project’s internal structure and architecture. This session will introduce developers to the core internal architectural concepts of HBase, not just “what” it does from the outside, but “how” it works internally, and “why” it does things a certain way. We’ll walk through key sections of code and discuss key concepts like the MVCC implementation and memstore organization. The goal is to convert serious “HBase Users” into “HBase Developer Users”, and give voice to some of the deep knowledge locked in the committers’ heads.
Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes (DataWorks Summit)
At Yahoo!, HBase has been running as a hosted multi-tenant service since 2013. In a single HBase cluster we have around 30 tenants running various types of workloads (i.e., batch, near real-time, ad hoc, etc.). Typically such a deployment would cause tenant workloads to negatively affect each other because of resource contention (disk, CPU, network, cache thrashing, etc.). Using RegionServer Groups we are able to designate a dedicated subset of RegionServers in a cluster to host only tables of a given tenant (HBASE-6721).
Most HBase deployments use HDFS as their distributed filesystem, which does not guarantee that a region’s data is locally available to the hosting regionserver. This poses a problem for isolation, since HDFS data blocks may have to be read remotely from a different tenant’s host, contending for disk or network resources. The favored-nodes feature addresses this problem by hinting to HDFS which datanodes should store the data, and regions are assigned only to those favored regionservers (HBASE-15531).
We will walk through these features explaining our motivation, how they work as well as our experiences running these multi-tenant clusters. These features will be available in Apache HBase 2.0.
This document summarizes a presentation about optimizing for low latency in HBase. It discusses how to measure latency, the write and read paths in HBase, sources of latency like garbage collection and compactions, and techniques for reducing latency like streaming puts, block caching, and timeline consistency. The key points are that single puts can achieve millisecond latency, while garbage collection and machine failures can cause pauses of tens of milliseconds to seconds, and that optimizing for the "magical 1%" of requests beyond the 99th percentile is important to improve average latency.
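The timeline-consistency technique mentioned here pairs with region replicas (introduced in HBase 1.0). A hedged sketch of a replica-tolerant read, with hypothetical table and row names:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineReadSketch {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("example"))) {
            Get get = new Get(Bytes.toBytes("row1"));
            // TIMELINE lets a secondary region replica answer if the primary is slow,
            // trading strict consistency for better tail latency.
            get.setConsistency(Consistency.TIMELINE);
            Result r = table.get(get);
            System.out.println("stale=" + r.isStale()); // true if served by a replica
        }
    }
}
```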
Now that you've seen HBase 1.0, what's ahead in HBase 2.0, and beyond—and why? Find out from this panel of people who have designed and/or are working on 2.0 features.
This document summarizes Facebook's real-time analytics systems. It describes Data Freeway, a scalable data streaming framework that collects log data with low latency, and Puma, which performs reliable stream aggregation and storage by sharding computations in memory and checkpointing to HBase. Future work may include open sourcing components and adding scheduler support.
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012 (larsgeorge)
This document summarizes Lars George's presentation on moving from batch to real-time processing with Hadoop. It discusses using Hadoop (HDFS and MapReduce) for batch processing of large amounts of data and integrating real-time databases and stream processing tools like HBase and Storm to enable faster querying and analytics. Example architectures shown combine batch and real-time systems by using real-time tools to process streaming data and periodically syncing results to Hadoop and HBase for long-term storage and analysis.
Building Mission Critical Messaging System On Top Of HBase
Facebook chose HBase as the storage system for its messaging platform due to HBase's high write throughput, good random read performance, horizontal scalability, and automatic failover. Facebook stores messages, metadata, and search indices in HBase. To improve performance and reliability, Facebook developed the system on a production-stabilized branch of HBase, used shadow testing, added extensive monitoring, and contributed improvements back to the HBase community.
The document discusses Facebook's use of HBase to store messaging data. It provides an overview of HBase, including its data model, performance characteristics, and how it was a good fit for Facebook's needs due to its ability to handle large volumes of data, high write throughput, and efficient random access. It also describes some enhancements Facebook made to HBase to improve availability, stability, and performance. Finally, it briefly mentions Facebook's migration of messaging data from MySQL to their HBase implementation.
The document discusses Facebook's use of HBase as the database storage engine for its messaging platform. It provides an overview of HBase, including its data model, architecture, and benefits like scalability, fault tolerance, and simpler consistency model compared to relational databases. The document also describes Facebook's contributions to HBase to improve performance, availability, and achieve its goal of zero data loss. It shares Facebook's operational experiences running large HBase clusters and discusses its migration of messaging data from MySQL to a de-normalized schema in HBase.
Near-realtime analytics with Kafka and HBase (dave_revell)
A presentation at OSCON 2012 by Nate Putnam and Dave Revell about Urban Airship's analytics stack. Features Kafka, HBase, and Urban Airship's own open source projects statshtable and datacube.
This document discusses Apache Kudu, an open source column-oriented storage system that provides fast analytics on fast data. It describes Kudu's design goals of high throughput for large scans, low latency for short accesses, and database-like semantics. The document outlines Kudu's architecture, including its use of columnar storage, replication for fault tolerance, and integrations with Spark, Impala and other frameworks. It provides examples of using Kudu for IoT and real-time analytics use cases. Performance comparisons show Kudu outperforming other NoSQL systems on analytics and operational workloads.
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam K Dey | Current 2022 (Hosted by Confluent)
Robinhood’s mission is to democratize finance for all, and data-driven decision making is key to achieving this goal. The data needed is hosted in various OLTP databases; reliably replicating it in near real time to the data lakehouse powers many critical use cases for the company. At Robinhood, CDC is not only used for ingestion into the data lake but is also being adopted for inter-system message exchange between online microservices.
In this talk, we will describe the evolution of change-data-capture-based ingestion at Robinhood, not only in terms of the scale of data stored and queries made, but also the use cases it supports. We will go in depth into the CDC architecture built around our Kafka ecosystem using the open source systems Debezium and Apache Hudi, and cover online inter-system message exchange use cases, our experience running this service at scale, and lessons learned.
The document discusses running Hive/Spark on S3 object storage using S3A committers, and running HBase on NFS file storage instead of HDFS. This separates compute from storage and avoids HDFS operations and complexity. S3A committers allow fast, atomic writes to S3 without renaming files; benchmark results show the magic committer is faster than the file committer for S3 writes. HBase performance tests show FlashBlade NFS providing low latency for random reads/writes compared to Amazon EFS.
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes (HBaseCon)
Zhiyong Bai
Zhihu uses HBase, a high-performance and scalable key-value database, to provide its online data store alongside MySQL and Redis. Zhihu's platform team had accumulated experience with container technology, and this time, based on Kubernetes, we built a flexible platform for the online HBase system: it rapidly creates multiple logically isolated HBase clusters on a shared physical cluster and provides customized service for different business needs. Combined with Consul and a DNS server, we implemented highly available HBase access using a client written mainly in Python. This presentation shares the architecture of Zhihu's online HBase platform and some practical experience from the production environment.
Speaker: Varun Sharma (Pinterest)
Over the past year, HBase has become an integral component of Pinterest's storage stack. HBase has enabled us to quickly launch and iterate on new products and create amazing pinner experiences. This talk briefly describes some of these applications, the underlying schema, and how our HBase setup stays highly available and performant despite billions of requests every week. It will also include some performance tips for running on SSDs. Finally, we will talk about a homegrown serving technology we built from a mashup of HBase components that has gained wide adoption across Pinterest.
Facebook uses HBase running on HDFS to store messaging data and metadata. Key reasons for choosing HBase include high write throughput, horizontal scalability, and integration with HDFS. Typical clusters have multiple regions and racks for redundancy. Facebook stores small messages, metadata, and attachments in HBase, while larger messages and attachments are stored separately. The system processes billions of read and write operations daily and continues to optimize performance and reliability.
Rocketfuel processes over 120 billion ad auctions per day and needs to detect fraud in real time to prevent losses. They developed Helios, which ingests event data from Kafka and HDFS into Storm in real time, joins the streams in HBase, then runs MapReduce jobs hourly to populate an OLAP cube for analyzing feature vectors and detecting fraud patterns. This architecture on Hadoop allows them to easily scale real-time processing and experiment with different configurations to quickly react to fraud.
Introduction to HBase. HBase is a NoSQL database that has seen a tremendous increase in popularity in recent years; large companies like Facebook, LinkedIn, and Foursquare use it. This presentation addresses questions such as: What is HBase? How does it compare to relational databases? What is the architecture? How does HBase work? What about schema design? What about IT resources? These questions should help you consider whether this solution might be suitable in your case.
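To complement the overview, here is a minimal 1.x sketch of creating a table and writing and reading a cell; all names are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseBasicsSketch {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            try (Admin admin = conn.getAdmin()) {
                HTableDescriptor users = new HTableDescriptor(TableName.valueOf("users"));
                users.addFamily(new HColumnDescriptor("info")); // columns live inside families
                admin.createTable(users);
            }
            try (Table table = conn.getTable(TableName.valueOf("users"))) {
                Put put = new Put(Bytes.toBytes("user42")); // row key design drives performance
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
                table.put(put);
                byte[] name = table.get(new Get(Bytes.toBytes("user42")))
                    .getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }
}
```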
Hive, Impala, and Pig are tools for querying and analyzing data stored in Hadoop HDFS. Hive uses SQL-like queries that run as MapReduce jobs. Impala also uses SQL but is designed like MPP systems to run queries faster than Hive. Pig uses a programming-like language called Pig Latin that compiles to MapReduce jobs. Each tool has its own strengths: Hive is best for complex queries, Impala for fast queries, and Pig for programmers. They provide scalable options to explore Hadoop data.
The document discusses Kudu, a new updatable columnar storage system for Hadoop that was built to address gaps in the transactional and analytic capabilities of existing Hadoop storage technologies like HDFS and HBase. Kudu aims to provide both high throughput for large scans like HDFS and low latency for individual row lookups and updates like HBase, while supporting SQL queries and a relational data model. It leverages improvements in hardware by using a columnar format and indexes to improve CPU efficiency for these workloads compared to traditional storage systems. The document outlines Kudu's goals and capabilities and provides examples of use cases, like time series analytics, machine data analytics, and online reporting, that would benefit from Kudu's simultaneous support for sequential scans and random access.
This document provides a high-level overview of Impala, an open-source SQL query engine for Apache Hadoop. It describes how Impala addresses limitations of MapReduce by providing faster, more interactive queries using MPP (Massively Parallel Processing). Key points include that Impala runs directly on data files without ETL, uses a distributed query planner and execution engine for high performance, and supports commonly used file formats like Parquet for columnar storage.
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013) (VMware Tanzu)
Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about Gemfire XD at:
http://www.gopivotal.com/big-data/pivotal-hd
At StampedeCon 2012 in St. Louis, Pritam Damania presents: Reliable backup and recovery is one of the main requirements for any enterprise grade application. HBase has been very well embraced by enterprises needing random, real-time read/write access with huge volumes of data and ease of scalability. As such, they are looking for backup solutions that are reliable, easy to use, and can co-exist with existing infrastructure. HBase comes with several backup options but there is a clear need to improve the native export mechanisms. This talk will cover various options that are available out of the box, their drawbacks and what various companies are doing to make backup and recovery efficient. In particular it will cover what Facebook has done to improve performance of backup and recovery process with minimal impact to production cluster.
Large-scale projects development (scaling LAMP) (Alexey Rybak)
This 8-hour tutorial was given at various conferences including the Percona conference (London), DevConf (Moscow), and Highload++ (Moscow).
ABSTRACT
During this tutorial we will cover various topics related to high scalability for the LAMP stack. This workshop is divided into three sections.
The first section covers basic principles of shared-nothing architectures and horizontal scaling for the app/cache/database tiers.
Section two of this tutorial is devoted to MySQL sharding techniques, queues and a few performance-related tips and tricks.
In section three we will cover a practical approach to measuring site performance and quality, providing a "lean" support philosophy, and connecting business and technology metrics.
In addition we will cover the very useful Pinba real-time statistics server, its features, and various use cases. All sections are based on real-world examples built at Badoo, one of the biggest dating sites on the Internet.
Similar to Hic 2011 realtime_analytics_at_facebook (20)
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfflufftailshop
When it comes to unit testing in the .NET ecosystem, developers have a wide range of options available. Among the most popular choices are NUnit, XUnit, and MSTest. These unit testing frameworks provide essential tools and features to help ensure the quality and reliability of code. However, understanding the differences between these frameworks is crucial for selecting the most suitable one for your projects.
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365.
Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. 🚀 This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution. 💻
The narrative then shifts to a captivating exploration of prominent desktop OSs, Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape. 🖥️
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟
Trusted Execution Environment for Decentralized Process MiningLucaBarbaro3
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, a free SAP software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
4. Facebook Insights
• Use cases
▪ Websites/Ads/Apps/Pages
▪ Time series
▪ Demographic break-downs
▪ Unique counts/heavy hitters
• Major challenges
▪ Scalability
▪ Latency
5. Analytics based on Hadoop/Hive
(Diagram) HTTP → Scribe → NFS, within seconds; Copier/Loader into Hadoop; hourly/daily Hive pipeline jobs; results loaded into MySQL
• 3000-node Hadoop cluster
• Copier/Loader: Map-Reduce hides machine failures
• Pipeline Jobs: Hive allows SQL-like syntax
• Good scalability, but poor latency! 24 – 48 hours.
6. How to Get Lower Latency?
• Small-batch Processing
▪ Run Map-Reduce/Hive every hour, every 15 min, every 5 min, …
▪ How do we reduce the per-batch overhead?
• Stream Processing
▪ Aggregate the data as soon as it arrives
▪ How to solve the reliability problem?
9. Scribe
(Diagram) Scribe Clients → Scribe Mid-Tier → Scribe Writers → NFS log, consumed by a Batch Copier (into HDFS) and by tail/fopen (Log Consumer)
• Simple push/RPC-based logging system (a sketch follows below)
• Open-sourced in 2008. 100 log categories at that time.
• Routing driven by static configuration.
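The push model above is simple enough to sketch in a few lines of Python. This is an illustration only, with invented names; the real Scribe is a Thrift RPC service with static routing configuration.

class ScribeNode:
    def __init__(self, downstream=None):
        self.downstream = downstream   # next hop: mid-tier or writer
        self.buffer = []               # buffered (category, message) pairs

    def log(self, category, message):
        self.buffer.append((category, message))
        self.flush()                   # the real system flushes in batches

    def flush(self):
        for category, message in self.buffer:
            if self.downstream is not None:
                self.downstream.log(category, message)  # stands in for the RPC
            else:
                print(f"[{category}] {message}")        # writer: append to the NFS log
        self.buffer.clear()

writer = ScribeNode()
mid_tier = ScribeNode(downstream=writer)
client = ScribeNode(downstream=mid_tier)
client.log("ad_clicks", "uid=42 adid=7")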
10. Data Freeway
(Diagram) Scribe Clients → Calligraphus Mid-tier → Calligraphus Writers → HDFS → PTail → Log Consumer, with Zookeeper coordinating the Calligraphus tier; a Continuous Copier replicates files to a second HDFS, from which PTail is also planned to read
• 9GB/sec at peak, 10 sec latency, 2500 log categories
11. Calligraphus
• RPC → File System
▪ Each log category is represented by 1 or more FS directories
▪ Each directory is an ordered list of files
• Bucketing support
▪ Application buckets are application-defined shards.
▪ Infrastructure buckets allow log streams from x B/s to x GB/s (see the bucketing sketch below)
• Performance
▪ Latency: Call sync every 7 seconds
▪ Throughput: Easily saturate 1Gbit NIC
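To make the two-level bucketing concrete, here is a minimal sketch; the path layout and function name are invented for illustration.

import hashlib

def route(category, app_bucket, key, n_infra):
    # Application buckets shard a category for its consumers; infrastructure
    # buckets spread one application bucket over several directories so its
    # throughput can range from a trickle to GB/s.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return f"/calligraphus/{category}/app-{app_bucket:03d}/infra-{h % n_infra:03d}"

print(route("ad_clicks", app_bucket=5, key="uid42", n_infra=4))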
12. Continuous Copier
• File System → File System
• Low latency and smooth network usage
• Deployment
▪ Implemented as long-running map-only job
▪ Can move to any simple job scheduler
• Coordination
▪ Use lock files on HDFS for now
▪ Plan to move to Zookeeper
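The lock-file coordination described above can be sketched as follows, with a local directory standing in for HDFS and hypothetical names throughout.

import os

def try_claim(lock_dir, source_name):
    # Atomically create a lock file; creation fails if another copier node
    # already holds this source.
    lock_path = os.path.join(lock_dir, source_name + ".lock")
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True     # we own this source; start copying its files
    except FileExistsError:
        return False    # another copier node owns it; move on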
13. PTail
(Diagram) PTail tails the files of several directories, with checkpoints inserted into the stream
• File System → Stream (RPC)
• Reliability
▪ Checkpoints inserted into the data stream
▪ Can roll back to tail from any data checkpoints
▪ No data loss/duplicates
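A toy version of the checkpoint idea, under the assumption that a checkpoint maps each file to the byte offset already consumed (the real PTail checkpoint format is not specified here). Rolling back just means reloading an older copy of the dict.

def tail_file(path, checkpoint):
    # checkpoint: file path -> byte offset already consumed
    with open(path, "rb") as f:
        f.seek(checkpoint.get(path, 0))
        while True:
            line = f.readline()
            if not line:
                break
            checkpoint[path] = f.tell()   # position right after this line
            yield line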
16. Overview
(Diagram) Log Stream → Aggregations → Serving, with persistent Storage behind the Aggregations
• ~ 1M log lines per second, but light read
• Multiple Group-By operations per log line
• The first key in Group By is always time/date-related
• Complex aggregations: Unique user count, most frequent elements
17. MySQL and HBase: one page
                  MySQL                         HBase
Parallel          Manual sharding               Automatic load balancing
Fail-over         Manual master/slave switch    Automatic
Read efficiency   High                          Low
Write efficiency  Medium                        High
Columnar support  No                            Yes
18. Puma2 Architecture
(Diagram) PTail → Puma2 → HBase → Serving
• PTail provides parallel data streams
• For each log line, Puma2 issues “increment” operations to HBase. Puma2 is symmetric (no sharding).
• HBase: single increment on multiple columns (sketched below)
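A sketch of that write path, with a nested dict standing in for HBase; the row-key layout and column names are invented for illustration.

from collections import defaultdict

table = defaultdict(lambda: defaultdict(int))    # row -> column -> counter

def process_log_line(hour, adid, gender, age):
    row = f"{hour}:{adid}"                       # first group-by key is time-based
    for column in ("total", f"gender:{gender}", f"age:{age}"):
        table[row][column] += 1                  # one increment covers many columns

process_log_line("2011-03-01T10", adid=7, gender="f", age="25-34")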
19. Puma2: Pros and Cons
• Pros
▪ Puma2 code is very simple.
▪ Puma2 service is very easy to maintain.
• Cons
▪ “Increment” operation is expensive.
▪ Does not support complex aggregations.
▪ Hacky implementation of “most frequent elements”.
▪ Can cause small data duplicates.
20. Improvements in Puma2
• Puma2
▪ Batching of requests. Didn’t work well because of the long-tail distribution.
• HBase
▪ “Increment” operation optimized by reducing locks.
▪ HBase region/HDFS file locality; short-circuited read.
▪ Reliability improvements under high load.
• Still not good enough!
21. Puma3 Architecture
(Diagram) PTail → Puma3 → HBase → Serving
• Puma3 is sharded by aggregation key.
• Each shard is a hashmap in memory.
• Each entry in the hashmap is a pair of an aggregation key and a user-defined aggregation.
• HBase as persistent key-value storage. (A sketch of one shard follows below.)
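A minimal sketch of one Puma3 shard under these assumptions; the names are invented, and the aggregation shown is just one example of a user-defined aggregation (count, sum, avg, etc. look the same).

class Count:
    def __init__(self):
        self.value = 0
    def add(self, v):
        self.value += v

class PumaShard:
    def __init__(self, make_agg):
        self.state = {}            # aggregation key -> aggregation object
        self.make_agg = make_agg
    def process(self, key, value):
        if key not in self.state:
            self.state[key] = self.make_agg()
        self.state[key].add(value)  # apply the user-defined aggregation

shard = PumaShard(Count)
shard.process(("2011-03-01T10", "adid=7"), 1)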
22. Puma3 Architecture
(Diagram) PTail → Puma3 → HBase → Serving
• Write workflow
▪ For each log line, extract the columns for key and value.
▪ Look up the key in the hashmap and call the user-defined aggregation.
23. Puma3 Architecture
(Diagram) PTail → Puma3 → HBase → Serving
• Checkpoint workflow (sketched below)
▪ Every 5 min, save the modified hashmap entries and the PTail checkpoint to HBase
▪ On startup (after node failure), load from HBase
▪ Get rid of items in memory once the time window has passed
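In miniature, the checkpoint step might look like this; a plain dict stands in for HBase, and the eviction rule assumes the first key component is the time window, as slide 16 states.

def checkpoint(state, dirty_keys, ptail_offsets, hbase, current_hour):
    for key in dirty_keys:                    # persist modified entries...
        hbase[key] = state[key]
    hbase["__ptail__"] = dict(ptail_offsets)  # ...plus the PTail checkpoint
    dirty_keys.clear()
    # Evict closed time windows; HBase now holds their final values.
    for key in [k for k in state if k[0] < current_hour]:
        del state[key]

state = {("09", "adid=7"): 12, ("10", "adid=7"): 3}   # (hour, dims) -> value
hbase = {}
checkpoint(state, {("09", "adid=7"), ("10", "adid=7")},
           {"dir1": 4096}, hbase, current_hour="10")
# state now holds only the open "10" window; hbase has both saved values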
24. Puma3 Architecture
(Diagram) PTail → Puma3 → HBase → Serving
• Read workflow (sketched below)
▪ Read uncommitted: serve directly from the in-memory hashmap; load from HBase on a miss.
▪ Read committed: read from HBase and serve.
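The two read modes, in sketch form, using the same in-memory-dict-plus-store shape as the sketches above.

def read(key, state, hbase, committed=False):
    # Read uncommitted: prefer the live in-memory value (fresh to within
    # seconds, but it may roll back if the process dies before the next
    # checkpoint). Read committed: serve only checkpointed values.
    if not committed and key in state:
        return state[key]
    return hbase.get(key)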
25. Puma3 Architecture
(Diagram) PTail → Puma3 → HBase → Serving
• Join (sketched below)
▪ Static join table in HBase.
▪ Distributed hash lookup in a user-defined function (udf).
▪ A local cache improves the throughput of the udf a lot.
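A sketch of the lookup join with a local cache; functools.lru_cache stands in for whatever cache the udf actually uses, and the dimension table is invented.

from functools import lru_cache

AD_CAMPAIGNS = {"adid=7": "spring_sale"}   # stands in for the static HBase join table

@lru_cache(maxsize=100_000)                # the local cache the slide credits
def lookup(join_key):
    return AD_CAMPAIGNS.get(join_key)      # in reality: a remote HBase row get

print(lookup("adid=7"))                    # first call misses; repeats hit the cache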
26. Puma2 / Puma3 comparison
• Puma3 is much better in write throughput
▪ Use 25% of the boxes to handle the same load.
▪ HBase is really good at write throughput.
• Puma3 needs a lot of memory
▪ Use 60GB of memory per box for the hashmap
▪ SSDs can scale this to 10x per box.
27. Puma3 Special Aggregations
• Unique Counts Calculation
▪ Adaptive sampling
▪ Bloom filter (in the plan)
• Most frequent item (in the plan)
▪ Lossy counting
▪ Probabilistic lossy counting
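For reference, the lossy counting algorithm named on this slide is a classic; the sketch below is the textbook version, not Facebook's implementation.

def lossy_count(stream, epsilon=0.01):
    width = int(1 / epsilon)          # bucket width = 1/epsilon items
    counts, bucket = {}, 1            # item -> (count, max possible undercount)
    for n, item in enumerate(stream, start=1):
        f, delta = counts.get(item, (0, bucket - 1))
        counts[item] = (f + 1, delta)
        if n % width == 0:            # bucket boundary: prune rare items
            counts = {k: (f, d) for k, (f, d) in counts.items() if f + d > bucket}
            bucket += 1
    return counts                     # each count is within epsilon*n of the truth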
28. PQL – Puma Query Language
• CREATE INPUT TABLE t ('time', 'adid', 'userid');

• CREATE VIEW v AS
  SELECT *, udf.age(userid)
  FROM t
  WHERE udf.age(userid) > 21

• CREATE HBASE TABLE h …

• CREATE LOGICAL TABLE l …

• CREATE AGGREGATION 'abc'
  INSERT INTO l (a, b, c)
  SELECT
    udf.hour(time),
    adid,
    age,
    count(1),
    udf.count_distinct(userid)
  FROM v
  GROUP BY
    udf.hour(time),
    adid,
    age;
30. Future Works
• Scheduler Support
▪ Just need simple scheduling because the workload is continuous
• Mass adoption
▪ Migrate most daily reporting queries from Hive
• Open Source
▪ Biggest bottleneck: Java Thrift dependency
▪ Will come one by one
31. Similar Systems
• STREAM from Stanford
• Flume from Cloudera
• S4 from Yahoo
• Rainbird/Storm from Twitter
• Kafka from LinkedIn
32. Key differences
• Scalable Data Streams
▪ 9 GB/sec with < 10 sec of latency
▪ Both Push/RPC-based and Pull/File System-based
▪ Components to support arbitrary combination of channels
• Reliable Stream Aggregations
▪ Good support for Time-based Group By, Table-Stream Lookup Join
▪ Query Language: Puma : Realtime-MR = Hive : MR
▪ No support for sliding window, stream joins
33. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
Editor's Notes
Good morning everyone. My name’s Zheng Shao. Today I am going to talk about real-time analytics at Facebook.
This is the agenda of the talk. We will start with why we need real-time analytics, then get into the details of how we implemented it, and finally cover future work and comparisons with other systems.
First of all, what is real-time analytics, and why do we want to do it?
This is the main use case for our analytics. We have a product called Facebook Insights, which allows website owners, advertisers, Facebook application developers, and Facebook page owners to view the time series of impression/click/action counters, the counters broken down by demographics like gender and age, as well as the unique user counters and heavy hitters like most popular URLs. The challenges of building the backend of this Insights product are twofold. On one hand, we have a huge amount of data coming from both Facebook and non-Facebook websites. On the other hand, customers of the Insights product really want low-latency summaries, so that they can immediately know how popular a new article or a new game is.
We did have an existing complete data warehouse solution at Facebook to handle the Insights workload. In short, log streams got generated from HTTP servers and transferred to NFS via a log collection framework called Scribe, all within seconds, and then got copied/loaded into Hadoop. Summaries got generated by daily pipeline jobs and eventually got loaded into MySQL for serving. Specifically, we have a 3000-node Hadoop cluster to handle the scalability issue. Copier/Loader are map-reduce jobs which handle machine failures automatically, and the pipeline jobs are written in Hive, which has a SQL-like syntax. Pretty good scalability, until we hit the data center power limit. But latency is terrible.
We had 2 ideas on how to improve the latency. The first one is small-batch processing. Instead of using a batch of 1 day, we can produce much smaller batches. The question is how to reduce the per-batch overhead, so that tiny batches like 1 min or less make sense. The second one is stream processing. We can aggregate the data as soon as it arrives, which will produce near-real-time results. The question is how to make the system reliable against hardware failures. It turns out the per-batch overhead of map-reduce is so high that it’s not practical to have even 5-minute batches on our Hadoop cluster, so we finally decided to go with stream processing.
The rest of the talk will focus on the two key systems that we built for real-time analytics. The first one, Data Freeway, is a scalable data stream framework on top of Scribe and HDFS. The second one, Puma, is a reliable stream aggregation engine on top of HBase.
This was our old data stream framework. It has several layers of data transportation. The first transport, from clients to mid-tier, reduces the fan-out from tens of thousands to hundreds; the second transport shuffles the data based on log categories, so that one log category goes to a single writer. Then log data gets written into NFS, which is consumed by the batch copier as well as unix tail/fopen. In short, it’s a simple push/RPC-based logging system. Scribe was open-sourced in 2008, when we had 100 log categories. It quickly got adopted by a lot of other companies. The routing is driven by static configuration, which is flexible but has two problems: 1. it is not scalable, because we need to maintain a config for each box in the writers, and a single writer is not scalable; 2. the writers are a single point of failure.
We came up with Data Freeway in 2011. Right now it’s handling 9GB/sec of data at peak with 10 sec end-to-end latency, and has over 2500 log categories. It contains 4 major components. The first one is Scribe, used only at the client, responsible for sending out data via RPCs. The second one is called Calligraphus. It utilizes Zookeeper to manage the ownership of categories, shuffles the data, and writes to HDFS. The third one is called Continuous Copier, which continuously copies files from one HDFS to another as the files grow. The fourth one is called PTail, which tails multiple directories on HDFS in parallel and writes out to stdout. Right now we directly ptail from the HDFS written by Calligraphus, but we plan to tail from the HDFS written by Continuous Copier in the future. Let’s get into the details of these components.
Calligraphus is responsible for getting log data from RPC and writing it to the file system. Each log category is represented by 1 or more FS directories. Each directory is an ordered list of files, with the date in the file name; the files can be compressed. This is a very simple protocol for storing log data, probably the simplest that I can think of. The most interesting feature of Calligraphus is the bucketing support. We have application buckets, which are application-defined shards. These are used for sharded log consumers; most of the big log consumers are sharded because their log streams are too big. We also support infrastructure buckets, which allow a single application bucket to have a throughput from several bytes per second to several gigabytes per second. Each infrastructure bucket is a directory, so big streams can go to multiple directories at the same time. Calligraphus has pretty high performance. We call file system sync every 7 seconds, which is the major source of data latency right now. The network throughput can easily saturate a 1Gbit NIC, and we are planning to move to 10Gbit NICs some time soon.
Continuous Copier is for continuous data transfer from one file system to another. Compared with the batch-based map-reduce copier, it provides much lower latency as well as smooth network usage. Right now it’s implemented as a long-running map-only job, but it can easily be moved to any simple job scheduling system other than map-reduce. Right now it uses lock files in HDFS for coordination among different nodes, and we plan to move to Zookeeper very soon. The peak throughput of Continuous Copier in production is about 3GB/sec compressed right now.
The last component in Data Freeway is PTail, which transfers data from a file system to an output stream. The key feature of PTail is the checkpoint. A PTail checkpoint contains the current files and the file offsets in each of the directories. This makes it possible for PTail to roll back to an earlier checkpoint and reproduce the data stream without any data loss/duplicates at the boundary.
To wrap up Data Freeway: we support 2 channels for data transfer. Push via RPC has lower latency, can potentially have some loss/dups when the network has a problem, is less robust with respect to machine failures, and has very low code complexity. Pull via the file system has longer latency, but it does not have any loss/dups and is robust to machine failures. The problem is that the code of the file system, especially HDFS, can be pretty complex, and we still need to identify and fix some bugs there. Data Freeway consists of 4 components that allow data transfer between these 2 channels.
This is the simplified architecture of a typical stream aggregation engine. Log streams get aggregated on a set of machines. The summaries are usually saved to storage for persistence. Online serving gets summaries either from the aggregations directly or from the storage. Usually the write throughput is much higher than the read throughput, because analytics data is only viewed by the owners of the website, for example. In our environment, we have on the order of 1M log lines per second. For each of the log lines, we need to do multiple group-by operations, like by age or by gender. The first key in the group-by is always time/date-related, which means the summaries become static after some time. We also need to support complex aggregations like unique counts and heavy hitters.
Let’s look at our storage choices first. We considered using either MySQL or HBase as our storage engine. HBase is much easier to manage in a distributed environment, which was the major reason we chose HBase. It also has better write efficiency as well as columnar support. Its read efficiency is inferior because HBase’s cache is less memory-space-efficient.
The first architecture that we came up with is called Puma2. We run Puma2 on a set of machines and use PTail to provide parallel data streams. For each log line, Puma2 issues “increment” operations to HBase. Note that the Puma2 servers are all symmetric, which means the same row in HBase can be incremented by multiple Puma2 servers at the same time. HBase can do a single increment operation on multiple columns of the same row, so we can use a single HBase increment operation to handle multiple group-bys. Puma2 went into production in March 2011 and is handling 600K log lines on 100 boxes (Puma2 + HBase).
Here are the pros and cons of the Puma2 architecture. The good thing about Puma2 is that it is extremely simple and easy to maintain. The root reason is that Puma2 servers are symmetric and almost stateless; the only state is the PTail checkpoint, which is saved to HBase periodically. As a result, we can easily add more boxes, or reboot a box if it goes down. However, Puma2 also has its problems. First of all, the HBase increment operation is expensive because it’s a read-and-write, and the read is expensive. It’s also not possible to support aggregations other than counts, because that would need a lot of customized code in HBase. We did a hacky implementation of “most frequent elements” using multiple layers of “frequent element” tables. Finally, Puma2 can have small data duplicates because “increments” and checkpoint writes are not in a single transaction.
We made some small improvements to Puma2. On the Puma2 service, an obvious idea is to batch the increment requests to reduce the load on HBase. However, it didn’t work well because of the long-tail distribution of group-by keys, and it made the data less accurate because we cannot save checkpoints in the middle of a batch. On the HBase side, we first optimized the “increment” operation by reducing the number of locks. Another big efficiency improvement came from the short-circuited read, which reads HDFS block files on disk directly instead of going through the DataNode daemon. We also improved HBase reliability under high load. All in all, we were still not happy with Puma2, especially when we tried to support unique counters. So we switched to a new architecture called Puma3.
The biggest difference between Puma2 and Puma3 is that in Puma3, we do aggregations in the memory of the Puma3 process instead of in HBase. Local memory operations are much faster, so we can achieve a much higher throughput. In order to do in-memory aggregations, we made Puma3 sharded by aggregation key. That means the input PTail data stream has to be sharded as well, which is supported by the application bucketing feature of Calligraphus. Each shard of Puma3 is basically a hashmap in memory. Each entry of the hashmap is a pair of an aggregation key and a user-defined aggregation, which can be count, sum, avg, or anything else. We use HBase as persistent storage but usually don’t read from it.
The write workflow for Puma3 is pretty simple. Basically, for each log line, we extract the columns for the key and value. We use the key to look up the in-memory hashmap, and call the user-defined aggregation with the value. Note that, since the log streams are sharded by aggregation key, the same aggregation key won’t appear in more than 1 Puma3 process. This is the key to making Puma3 work.
We checkpoint the state of the Puma3 process into HBase every 5 minutes. Basically, we save all the modified hashmap entries as well as the PTail checkpoint. That means if Puma3 crashes and restarts, it can load the state from HBase via a sequential read, which is pretty fast in HBase. In order to save memory, we also get rid of hashmap entries once the time window for the aggregation has passed, because we are not going to receive new log lines for that time window again.
There are 2 choices for the read workflow. If we want to read uncommitted aggregations, which usually have about 10 seconds of latency, we serve directly from the in-memory hashmap. We go to HBase only on a miss, which will only happen if the time window of the aggregation has passed. If we want to read committed data, Puma3 will read from HBase and serve that. Note that an uncommitted aggregation result can decrease in value if the Puma3 process dies before making the next checkpoint. We plan to have a cache layer between serving and Puma3 to make sure the numbers don’t decrease.
Puma3 also supports joining with a static table in HBase. The join key has to be the row key of the static HBase table. The join is implemented as a simple distributed hash lookup in a user-defined function. We have found that a local cache improves the throughput of the udf a lot.
Comparing Puma2 and Puma3, we found that Puma3 is much better in write throughput. We only need 25% of the boxes to handle the same workload. The main reason is that HBase is really good at write throughput. At the same time, Puma3 needs a lot of memory. Basically, all aggregations that can still change need to be stored in memory to sustain the log stream write throughput. Right now we use 60GB of memory per box for the hashmap. In the future, we may use SSDs, which can easily scale to 10x more space per box.
With Puma3, we can easily support these special aggregations, with some approximation. For unique counts, we have implemented a simple adaptive sampling algorithm that samples more aggressively as the number of unique items increases. We can also easily implement a standard bloom filter for counting. For the most frequent items, we plan to implement the classic lossy counting algorithm and the probabilistic lossy counting algorithm.
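The note above does not spell out the sampling algorithm. One standard adaptive-sampling distinct counter that matches the description (sample more aggressively as uniques grow) looks like this:

import hashlib

def estimate_uniques(stream, max_sample=1024):
    # Keep only items whose hash is divisible by 2^level; every time the
    # sample overflows, double the sampling rate (level += 1) and re-filter.
    level, sample = 0, set()
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        if h % (1 << level) == 0:
            sample.add(h)
            while len(sample) > max_sample:
                level += 1
                sample = {x for x in sample if x % (1 << level) == 0}
    return len(sample) * (1 << level)   # scale the sample back up

print(estimate_uniques(range(10_000)))  # close to 10000 in expectation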
The most important feature of Puma that distinguishes it from other stream processing projects is the language. We have built a SQL-like query language that allows us to define the input stream, the output table, as well as the query itself. Note that the query contains user-defined functions for the join as well as the aggregations. Puma3 is right now in the pre-production stage. We plan to push it out to production as soon as we have verified all the summaries against Puma2 and Hive.
Here is a list of things we plan to do next. First is scheduler support for Puma3. We just need very simple scheduling because the workload is continuous; most likely we will reuse some existing framework. Second is mass adoption inside the company. We plan to migrate most daily reporting queries from Hive, as long as the query is simple enough to be supported by Puma. This will reduce latency as well as improve efficiency, because of the savings in compression/decompression. The third one is open source. Right now the biggest bottleneck is Java Thrift, which has diverged between Facebook and open source. We plan to open-source the projects one by one, starting from Calligraphus.
There are lots of similar systems in academia as well as at other companies.
Instead of comparing them one by one, I will end the presentation with a summary of the key differences. Data Freeway is a scalable data stream framework with 9GB/sec throughput at 10 sec latency. It supports both push/RPC-based and pull/file-system-based channels, with components that support arbitrary combinations of channels to adapt to the use case. Puma is a reliable stream aggregation engine. It has good support for time-window-based group-by as well as table-stream lookup join, and it has a query language: Puma is to real-time MR what Hive is to MR. Puma has no support, and no plan to support, sliding windows and stream joins, because those are very hard problems that we don’t see in our environment.