Procella is a distributed SQL query engine built for flexible workloads within YouTube. It is highly scalable and is designed primarily to serve high volumes of queries at low latency while ingesting real-time data. It serves video and channel statistics for users watching videos, as well as OLAP-style queries for video analytics (youtube.com/analytics) and public dashboards (artists.youtube.com). Procella also supports complex SQL operations over structured data and is used by YouTube analysts for ad-hoc analysis.
Procella runs on the Google distributed computing stack, operating directly on data that resides in accessible columnar formats on Colossus, the Google distributed file system. The underlying data can therefore be produced and consumed directly by other tools such as MapReduce and Dremel. The compute runs on shared machines in Borg clusters and does not need dedicated virtual (or physical) machines. These features allow Procella to fit nicely into the Google ecosystem, scale compute and storage independently, and gracefully handle evictions and machine failures without compromising availability or performance.
Procella has been in production for over two years and is currently serving billions of SQL queries per day across various workloads at YouTube and several other Google product areas.
Speaker
Aniket Mokashi, Google, Senior Software Engineer
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, with the ability to join Pulsar data with other data sources. This plugin will soon be renamed to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit... (Amazon Web Services)
Struggling to keep up with an ever-increasing demand for data at your organisation? Do you spend hours tinkering with your streaming data pipelines? Does that one data scientist with direct EDW access keep you up at night? Introducing Snowflake, a brand new SQL data warehouse built for the cloud. We’ve designed and implemented a unique cloud-based architecture that addresses the most common shortcomings of existing data solutions. With Snowflake, you can unlock unlimited concurrency, enable instant scalability, and take advantage of built-in tuning and optimisation. Join us and find out what Netflix, Adobe, and Nike all have in common.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Alongside the Hive Metastore, these table formats aim to solve problems that have long affected traditional data lakes, with features such as ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
Learn the current state of the NoSQL landscape and discover the different data models within it, from document stores and key-value databases to graph and wide-column stores. Then you'll learn why wide-column databases are the most appropriate for scalable, high-performance use cases, including capabilities for massive scale-out architecture, peer-to-peer clustering to avoid bottlenecks, and built-in multi-datacenter replication.
How to use Parquet as a basis for ETL and analytics (Julien Le Dem)
Parquet is a columnar format designed to be extremely efficient and interoperable across the Hadoop ecosystem. Its integration in most of the Hadoop processing frameworks (Impala, Hive, Pig, Cascading, Crunch, Scalding, Spark, …) and serialization models (Thrift, Avro, Protocol Buffers, …) makes it easy to use in existing ETL and processing pipelines, while giving flexibility of choice on the query engine (whether in Java or C++). In this talk, we will describe how one can use Parquet with a wide variety of data analysis tools like Spark, Impala, Pig, Hive, and Cascading to create powerful, efficient data analysis pipelines. Data management is simplified as the format is self-describing and handles schema evolution. Support for nested structures enables more natural modeling of data for Hadoop compared to flat representations that create the need for often costly joins.
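To make the interoperability described above concrete, here is a minimal sketch (not from the talk) that writes a small dataset to Parquet and reads back only the columns a downstream job needs; the file name and columns are invented, and it assumes pandas with the PyArrow engine installed.

```python
import pandas as pd

# Hypothetical example data; any DataFrame with a schema works the same way.
events = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "JP"],
    "watch_seconds": [120, 45, 300],
})

# Write a self-describing, columnar Parquet file.
events.to_parquet("events.parquet", engine="pyarrow", index=False)

# Read back only the columns a downstream job needs; column pruning is
# one of the main benefits of the columnar layout.
subset = pd.read_parquet("events.parquet", columns=["user_id", "watch_seconds"])
print(subset)
```

Because the schema travels with the file, any Parquet-aware engine (Spark, Hive, Impala, and so on) can read the same output without extra metadata.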
Data governance with Unity Catalog Presentation (Knoldus Inc.)
Databricks Unity Catalog is the industry’s first unified governance solution for data and AI on the lakehouse. With Unity Catalog, organizations can seamlessly govern their structured and unstructured data, machine learning models, notebooks, dashboards and files on any cloud or platform. Data scientists, analysts and engineers can use Unity Catalog to securely discover, access and collaborate on trusted data and AI assets, leveraging AI to boost productivity and unlock the full potential of the lakehouse environment. This session will cover the potential of Unity Catalog to achieve a flexible and scalable governance implementation without sacrificing the ability to manage and share data effectively.
ApacheCon 2022: From Column-Level to Cell-Level: Towards Finer-grained Encryp... (XinliShang1)
This talk is about the Apache Parquet cell-level encryption feature. It allows encryption to happen at the cell (intersection of column and row) level, which is finer-grained than the column level.
Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake.
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
Data Quality With or Without Apache Spark and Its Ecosystem (Databricks)
A few solutions exist in the open-source community, either in the form of libraries or complete stand-alone platforms, which can be used to assure a certain level of data quality, especially when continuous imports happen. Organisations may consider picking up one of the available options: Apache Griffin, Deequ, DDQ, and Great Expectations. In this presentation we'll compare these different open-source products across different dimensions, like maturity, documentation, extensibility, and features like data profiling and anomaly detection.
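To make the kind of check these libraries automate concrete, here is a minimal hand-rolled sketch in PySpark (not taken from any of the tools named above); the table and rules are hypothetical, and it assumes a local Spark installation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("toy-data-quality").getOrCreate()

# Hypothetical input; in practice this would be a continuously imported table.
orders = spark.createDataFrame(
    [(1, "US", 10.0), (2, None, 25.5), (3, "DE", 40.0)],
    ["order_id", "country", "amount"],
)

# Two simple rules of the kind these tools express declaratively:
# completeness of `country` and non-negativity of `amount`.
report = orders.agg(
    F.count(F.lit(1)).alias("total_rows"),
    F.sum(F.col("country").isNull().cast("int")).alias("null_country"),
    F.sum((F.col("amount") < 0).cast("int")).alias("negative_amount"),
).first()

print(f"rows={report.total_rows}, null_country={report.null_country}, "
      f"negative_amount={report.negative_amount}")
```

The dedicated libraries add scheduling, profiling, and anomaly detection on top of checks like these, which is what the talk compares.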
Data Lakes are meant to support many of the same analytics capabilities of Data Warehouses while overcoming some of the core problems. Yet Data Lakes have a distinctly different technology base. This webinar will provide an overview of the standard architecture components of Data Lakes.
This will include:
The Lab and the factory
The base environment for batch analytics
Critical governance components
Additional components necessary for real-time analytics and ingesting streaming data
Improving Python and Spark Performance and Interoperability: Spark Summit Eas... (Spark Summit)
Apache Spark has become a popular and successful way for Python programmers to parallelize and scale up their data processing. In many use cases, though, a PySpark job can perform worse than an equivalent job written in Scala. It is also costly to push and pull data between the user’s Python environment and the Spark master. In this talk, we’ll examine some of the data serialization and other interoperability issues, especially with Python libraries like pandas and NumPy, that are impacting PySpark performance, and the work that is being done to address them. This relates closely to other work on binary columnar serialization and data exchange tools in development, such as Apache Arrow and Feather files.
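A minimal sketch of the Arrow integration the abstract refers to is shown below (not the talk's own code); the config key is the Spark 3.x name, and the data is invented.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Use Arrow for Spark <-> pandas conversion (Spark 3.x key; older releases
# used spark.sql.execution.arrow.enabled).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame(pd.DataFrame({"x": range(100_000)}))

# toPandas() now transfers data as Arrow record batches instead of
# pickled rows, which is where most of the speedup comes from.
pdf = df.toPandas()
print(len(pdf))
```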
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ... (Databricks)
The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
by Darin Briskman, Technical Evangelist, AWS
Database Freedom means being able to use the database engine that’s right for you as your needs evolve. Being locked into a specific technology can prevent you from achieving your mission. Fortunately, AWS Database Migration Service makes it easy to switch between different database engines. We’ll look at how to use Schema Migration Tool with DMS to switch from a commercial database to open source. You’ll need a laptop with a Firefox or Chrome browser.
This presentation examines the main building blocks for building a big data pipeline in the enterprise. The content draws inspiration from some of the top big data pipelines in the world, like the ones built by Netflix, LinkedIn, Spotify, or Goldman Sachs.
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million (DataWorks Summit)
A Fortune 100 company recently introduced Hadoop into their data warehouse environment and ETL workflow to save $30 Million. This session examines the specific use case to illustrate the design considerations, as well as the economics behind ETL offload with Hadoop. Additional information about how the Hadoop platform was leveraged to support extended analytics will also be referenced.
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark (Databricks)
The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially for those in a rapidly evolving organization
Learn different types of tips about the importance of data warehousing, data cleansing, extraction, and more. For more details visit: http://www.skylinecollege.com/business-analytics-course
Manufacturers have an abundance of data, whether from connected sensors, plant systems, manufacturing systems, claims systems, or external data from industry and government. Manufacturers face increasing challenges, from continually improving product quality and reducing warranty and recall costs to efficiently leveraging their supply chain. For example, giving the manufacturer a complete view of product and customer information (integrating manufacturing and plant-floor data, as-built product configurations, and sensor data from customer use) in order to efficiently analyze warranty claim information, reduce detection-to-correction time, detect fraud, and even become proactive around issues requires a capable enterprise data hub that integrates large volumes of both structured and unstructured information. Learn how an enterprise data hub built on Hadoop provides the tools to support analysis at every level of the manufacturing organization.
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ... (HostedbyConfluent)
Apache Kafka is used as the primary message bus for propagating events and logs across Uber. In particular, it pairs with Apache Pinot, a real-time distributed OLAP datastore, to deliver real-time insights seconds after the messages are produced to Kafka.
One challenge we faced was to update existing data in Pinot with the changelog in Kafka and deliver an accurate view in the real-time analytical results. For example, the financial dashboard can report gross bookings with the corrected ride fares, and restaurant owners can analyze their UberEats orders with the latest delivery status.
Implementing upserts in an immutable real-time OLAP store like Pinot is nontrivial. We need to make architectural changes in how data is distributed via Kafka amongst the server nodes, how it's indexed and queried in a distributed fashion. In this talk I will discuss how we leveraged Kafka's partition-by-key feature to this end and how we added this ability in Pinot without any performance degradation.
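The partition-by-key mechanism mentioned above can be illustrated with a small, hypothetical producer sketch using the kafka-python client; the topic name, broker address, and record fields are invented for the example and do not reflect Uber's actual pipeline.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Changelog events for the same ride share a key, so Kafka's default
# hash partitioner routes them to the same partition; a downstream
# upsert-enabled table can then keep only the latest state per key.
for fare in (12.50, 11.75):  # original fare, then a correction
    producer.send("ride-fares", key="ride-42", value={"ride_id": "ride-42", "fare": fare})

producer.flush()
```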
Half of the work that it takes to do data science is plumbing and wrangling. I’ll discuss some tricks we’ve learned while building AddThis over the years to collect and process data at web scale.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... (StampedeCon)
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and the creation and maintenance of Hive databases and tables become much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover results of performance testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old-fashioned Hive as a tool for easily and efficiently converting existing datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
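As a rough illustration of the format-conversion step mentioned in this abstract (not the exact Hive workflow from the talk), a Spark job can rewrite an existing dataset into ORC or Parquet; the paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar-conversion").getOrCreate()

# Existing row-oriented data, e.g. JSON landed by an ingestion pipeline.
events = spark.read.json("/data/raw/events/")

# Rewrite into columnar formats; analytical scans then read only the
# columns (and row groups) they actually need.
events.write.mode("overwrite").orc("/data/warehouse/events_orc/")
events.write.mode("overwrite").parquet("/data/warehouse/events_parquet/")
```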
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix use the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses, how they set up & keep schemas in sync between Hive, Presto, Redshift & Spark and make access easy for their data scientists, etc. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
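To illustrate two of the optimizations the abstract lists, column pruning and predicate pushdown, here is a minimal sketch with PyArrow (not the talk's code); the file, columns, and filter values are invented, and it assumes pandas and pyarrow are installed.

```python
import pandas as pd
import pyarrow.parquet as pq

# Hypothetical fact data written elsewhere in the pipeline.
pd.DataFrame({
    "day": ["2020-01-01"] * 3 + ["2020-01-02"] * 3,
    "views": [10, 20, 30, 40, 50, 60],
}).to_parquet("views.parquet", engine="pyarrow")

# Column pruning + predicate pushdown: only the `views` column is read,
# and row groups whose min/max statistics exclude the predicate can be
# skipped entirely on larger files.
table = pq.read_table(
    "views.parquet",
    columns=["views"],
    filters=[("day", "=", "2020-01-02")],
)
print(table.to_pandas())
```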
Presto, an open source distributed SQL engine originally built at Facebook, has a rapidly growing community of developers and users. In this talk, speakers from both Facebook and Teradata will discuss technical details of some of the recent developments such as integration with the Hadoop ecosystem (YARN/Slider and Ambari), security features (Kerberos), enabling BI tools via JDBC/ODBC drivers, new connectors (Redis, MongoDB) and storage engines (Raptor), as well as improvements in performance and ANSI SQL coverage. In addition, we will present a few use cases and major new users that leverage the interactive SQL capabilities Presto offers. Finally, we will present our roadmap for the next year.
See the video at https://youtu.be/wMy3LXuTb0U
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced, which users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with Python’s scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of Python is highly recommended.
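In the spirit of the hands-on portion described above, a minimal scikit-learn train/evaluate loop looks like the following (an illustrative sketch, not the workshop's actual lab code).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset, hold out a test split, then train and evaluate.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```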
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables mapped to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it may not be trivial to design applications that make the most of it, nor the simplest to operate. As it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, there is also the fact that HBase is still an evolving product, with different release versions in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and annotates each incoming change with the location in HDFS where this data should be written. This component is called Global Indexing. Without this component, all records get treated as inserts and get re-written to HDFS instead of being updated. This leads to duplication of data, breaking data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems, and it needs to have strong consistency and provide high throughput for index writes and reads.
At Uber, we have chosen HBase to be the backing store for the Global Indexing component; it is critical in allowing us to scale our jobs, which now handle more than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how this helps scale out our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly to the backend (circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints), as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
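As a toy sketch of the idea of a key-to-file-location index backed by HBase (using the happybase client), consider the following; the table name, column family, and key layout are invented and do not reflect Uber's actual implementation.

```python
import happybase

# Assumes an HBase Thrift server on localhost and a pre-created table
# 'record_index' with column family 'loc' (both hypothetical).
conn = happybase.Connection("localhost")
index = conn.table("record_index")

def locate_or_insert(record_key: bytes, default_file: bytes) -> bytes:
    """Return the HDFS file holding this record; register unseen keys as inserts."""
    row = index.row(record_key)
    existing = row.get(b"loc:file")
    if existing is not None:
        return existing  # update: rewrite the file that already holds this key
    index.put(record_key, {b"loc:file": default_file})
    return default_file  # insert: key seen for the first time

print(locate_or_insert(b"trip#42", b"hdfs://warehouse/trips/part-0001.parquet"))
```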
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
Introducing MLflow: An Open Source Platform for the Machine Learning Lifecycl... (DataWorks Summit)
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Projects and Models components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
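The "few lines of code" described above can be sketched roughly as follows; this is a hypothetical example (the model, parameter value, and metric are invented) and assumes the mlflow and scikit-learn packages are installed.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X_train, y_train)

    # Parameters, metrics, and the model artifact are recorded for every run.
    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("r2", r2_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```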
Extending Twitter's Data Platform to Google Cloud (DataWorks Summit)
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries to help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and we present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we dive into deeply in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi (DataWorks Summit)
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift, and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... (DataWorks Summit)
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations with a deep learning system attached to a camera stream, such as item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark (DataWorks Summit)
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Key Trends Shaping the Future of Infrastructure.pdf (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies must adapt and embrace new ideas to keep up with the competition. However, fostering a culture of innovation takes work. It takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
2. About me
● Tech Lead on YouTube Data Team
○ Procella
○ YouTube Data Warehouse
○ YouTube Analytics Backend
○ Data Quality: Anomaly Detection
● Prior to Google
○ Data Infra teams at Twitter, Netflix
○ Apache Parquet PMC
○ Apache Pig PMC
○ Contributor: Hadoop, Hive
3. Agenda
● Products at YouTube and Google
● Use cases and features
● Architecture
● Key features deep dive
● Q&A
6. And more...
● YouTube Metrics
● Firebase Performance Console
● Google Play Console
● YouTube Internal Analytics
● ...
7. About Procella
A widely used SQL query engine at Google
● Fully featured: Most of SQL++: Structured, joins, set ops, analytic functions
● Super fast: Most queries are sub-second.
● Highly scalable: Petabytes of data, thousands of tables, trillions of rows, ...
● High QPS: Millions of point queries @ msec, thousands of reporting queries @ sub-second, hundreds of ad-hoc queries @ seconds
● Designed for real-time: Millions QPS instant ingestion, highly efficient, …
● Easy to use: SQL, DDL, Virtual tables, …
● Compatible: Capacitor, Dremel, ACLs, Encryption, …
8. External Reporting/Analytics
● Use cases
○ YouTube Analytics
○ Firebase Perf Monitoring
○ Google Play Console
○ Data Studio
● Properties
○ High QPS, low latency ingestion and queries
○ Hundreds of metrics across dozens of dimensions
○ SQL
● Technology:
○ On the fly aggregations
○ Indexing & partitioning
○ Real-time ingestion
○ Batch & Real-time Stitching
■ Lambda architecture
○ Caching
9. Internal Reporting/Dashboards
● Use cases
○ Dashboards
○ Experiments Analysis
○ Custom reporting
● Properties
○ Low qps
○ Speed & scale
○ SQL
● Technology
○ Stateful caching
○ Columnar optimized data format (Artus)
○ Vector evaluation
○ Virtual tables
○ Tools (Dremel) and format (Capacitor) compatibility
10. Real Time Insights/Monitoring
● Use cases
○ YTA Real Time (External)
○ YTA Insights (External)
○ YT Site Health (Internal)
● Properties
○ Native time-series support
○ Complex compaction, complex metrics
○ High scan rate
○ Very high real time ingestion rate
● Technology
○ TTL
○ SQL based compaction
○ Approx algorithms (Quantiles, Top, ..)
11. Real Time Stats Serving
● Use cases
○ YouTube metrics (subscriptions, likes, etc.) on various YT pages
● Properties
○ Millions of rows ingested per sec
○ Millions of queries per sec
○ Milliseconds response time
○ Simple point queries
○ Many replicas
● Technology
○ Indexable columnar format (Artus)
○ Heavily optimized serving path
○ Pubsub based ingestion
15. Procella Scale
Queries executed: 450+ million per day
Queries on real-time data: 200+ million per day
Leaf plans executed: 90+ billion per day
Rows scanned: 20+ quadrillion per day
Rows returned: 100+ billion per day
Peak scan rate per query: 100+ billion rows per second
Schema size: 200+ metrics and dimensions
Tables: 100+
16. Procella Latencies
E2E Latency: p50 25 ms, p99 450 ms, p99.9 3000 ms
Rows Scanned: p50 17 M, p99 270 M, p99.9 1000 M
MDS Latency: p50 6 ms, p99 120 ms, p99.9 250 ms
DS Latency: p50 1 ms, p99 75 ms, p99.9 220 ms
Query Size: p50 300 B, p99 4.7 KB, p99.9 10 KB
18. Procella Secret Sauce
● Stateful Caching
○ Cached indexes - zone-maps, dictionaries, bloom filters etc.
○ File handle, metadata caching
○ Data segments caching
○ Data segments have data server affinity (Primary DS, Secondary DS,...)
○ Cached expressions
● Tail Latency Reduction
○ Backup RPCs - issued once about 90% of the RPCs have completed, covering the remaining stragglers.
○ Minibatch backup RPCs - one per n RPCs, as a backup for the slowest of those n RPCs.
19. Procella Secret Sauce
● Indexing
○ Separate metadata server optimized for compact storage of zone maps
○ Dynamic range partitioning
○ Use of dictionaries and bloom filters
○ Posting lists for repeated data
● Distributed Execution
○ N tier execution, X000 parallel
● Vectorized Evaluation: Superluminal
○ Performs block eval
○ Natively understands RLE and dictionary
○ Columnar evaluation
20. Procella Secret Sauce
● Artus File Format
○ Multi-pass adaptive encodings to select the best encoding for a column
○ Uses adaptive encodings vs generalized compression
○ O(log n) lookups, O(1) seeks
○ Length based representation for repeated data types (arrays)
○ Exposes dictionary, RLE information to execution engine
○ Allows for rich metadata, inverted index
21. Procella Secret Sauce
● Real-time Ingestion
○ Data ingested into memory buffers
○ Periodically checkpointed to files
○ Small files are periodically aggregated and compacted into large files
● Virtual Tables
○ Index aware aggregate selection
○ Stitched Queries
○ Lambda architecture awareness
○ Normalization friendly (dimensional joins)
22. TPC-H queries*
* Run of TPC-H queries on:
● Static Artus data
● Large shared instance
● Manually optimized queries
Geomean: 9.6 Seconds
Hi everyone, I’m Aniket Mokashi, a tech lead at Google. Let me give you a little bit of background about myself. I work on YouTube’s Data team. At YouTube, I primarily contribute to Procella, which is what this talk is about. In addition, I work on the YouTube Analytics backend and on a variety of projects under the YouTube Data Warehouse. Prior to Google, I worked on data infra teams at Twitter and Netflix. I’m an Apache Parquet PMC member; I’ve contributed a few encodings and the Pig integration to Parquet, and I was responsible for rolling out the first production use of Parquet at Twitter. I’m also an Apache Pig PMC member; my main contributions have been support for UDFs in scripting languages, native MapReduce, scalar values support, auto local mode and jar caching. I enjoy working on large-scale data processing systems, and in this talk I will cover Procella, a versatile analytics engine we’ve built at YouTube to serve most of our analytics needs.
We will cover a number of topics in this talk. First, we will look at products at YouTube and Google that are powered by Procella. Next, we will explore the variety of use cases enabled by Procella and discuss the features required to support those use cases. After that, I will give you an overview of Procella’s architecture. And lastly, we will spend some time on a deep dive into some of the key features that make Procella work so well. In the interest of time, I will be taking questions at the end.
Let’s look at some products that are powered by Procella. But, before we get started, let me ask you all a few questions:
So, raise of hands,
How many of you use YouTube regularly like at least once a week… Perfect - that makes all of you.
How many of you are YouTube creators - that is, you have uploaded at least one video to YouTube? Those of you who haven’t done that yet, I highly encourage you to try it out. Especially to get rich analytics... :)
How many of you have used YouTube Analytics for tracking your videos, or have at least seen YouTube Analytics before? Excellent.
So, YouTube Analytics is this amazingly rich analytics dashboard that lets YouTube content creators or content owners analyze the usage and popularity of their videos and channels on the YouTube platform. It lets them track all of their success metrics, such as number of views, watch time, and revenue from monetized videos, across various dimensions such as demographics, geography, devices, playback location, etc. We also have a mobile application, which you can see on the right, called Creator Studio, which shows similar details. Some of this information is also available in real time.
Three years back, we started building Procella primarily to serve as a backend for YouTube Analytics, which is often referred to as YTA. We wanted to make sure that the backend of YTA would continue to scale horizontally as YouTube grows and as we add more features to the product. We had a few design goals in mind when we started working on Procella. First, we wanted to provide a widely understood SQL interface so that navigating through this data is easy and integrating various frontends with the backend is less cumbersome. Second, we wanted to provide the flexibility to extend the product without a lot of involvement from the backend team, so we wanted to build as generic a product as possible. We launched Procella as the backend for YouTube Analytics about two and a half years ago. Since then, we have been able to make several changes to the product seamlessly. For example, recently we started showing impressions and impressions CTR in YTA, and it required almost no involvement of the Procella team.
In addition to YouTube Analytics, over the last few years Procella has been used for several other internal- and external-facing analytics dashboards and products. One example is the YouTube Artists product that you can see on the slides. This product allows everyone to track the popularity of artists and their music videos in a region on the YouTube platform. If you haven’t already, I recommend you check it out.
There are a few other fairly large products at Google that are powered by Procella.
For example -
Firebase Performance Console, which lets app developers track various growth- and success-related metrics for their apps that are integrated with the Firebase platform.
Google Play Console, which lets Android application developers track the performance and popularity of their apps.
Also, various metric numbers like subscriptions, likes, and dislikes that you see on the youtube.com website or app are now powered by Procella. Given the popularity of YouTube, as you can imagine, this was a significant achievement in terms of scale. I will describe later in the talk the system properties that allowed us to achieve it.
So, now that you have looked at these interesting products that use Procella, let’s look at Procella itself.
What is Procella?
Procella is a widely used SQL query engine at Google.
It supports most of the standard SQL functionality including joins, set operations and analytic functions. It also supports queries on top of complex or structured data types like arrays, structs and maps.
What makes it unique is that it’s built to be super fast at scale - so most of the queries running on Procella have sub-second latencies.
It’s highly scalable - works on petabytes of data, on thousands of tables storing trillions of rows.
It can handle a high QPS of queries - millions of QPS of point-lookup queries at millisecond latencies, thousands of QPS of dashboard queries at sub-second latencies, and hundreds of ad-hoc queries running in seconds.
It is also designed to serve data ingested in real time. So, it supports ingestion of millions of data rows per second.
It’s easy to use, with a SQL interface that supports DDL statements to create tables and partitions.
It also supports virtual tables. Virtual tables are used to hide the complexity of the data model that is required to serve a use case.
Procella is developed to be compatible with existing tools at Google so that it can be adopted easily. It exposes Dremel’s query interface, which is a popular query system at Google, and it can process files in the Capacitor file format, which is Dremel’s native file format.
Procella is a composable and versatile system. It powers a variety of use cases ranging from external facing metrics reporting applications to data pipelines. Let’s go through these use cases one by one to understand the motivations behind the architecture and various features in Procella.
External reporting applications are public-facing products like YouTube Analytics, Firebase Performance Console, and Google Play Console that I showed before. Being public facing, these applications have high QPS and blink-of-an-eye latency requirements. These products typically surface over a hundred metrics, sliced and diced by a few dozen dimensions, for a given entity or customer.
Having a SQL interface instead of an API interface makes it easy to build the backends and frontends independently.
These applications usually work on large datasets. Precomputing all the metric values required to power these dashboards is expensive or sometimes even infeasible. So, having the ability to perform fast aggregations of these metrics on the fly is important.
To enable these applications, Procella provides efficient indexing and tablet pruning functionality, as most of these queries need to process a small number of rows out of a very large amount of data. These applications also need the ability to ingest real-time data to enable fast insights. And, in most cases, the ability to stitch between realtime datasets and batch datasets using a lambda architecture is crucial. These applications can leverage temporal locality of data, so caching at different layers helps performance.
Another use case is internal reporting or dashboards. Some examples of internal reporting are product dashboards, experiment analysis dashboards, and custom reports.
Internal reporting has much lower QPS and relatively less aggressive latency requirements compared to external reporting. However, internal dashboards typically work on more complex and larger datasets. So, data scan efficiency and the ability to handle complex data types are important to be able to serve their queries. To enable these use cases, we have developed features such as a fast, optimized data format and a vectorized evaluation library.
Real-time time-series analysis and monitoring is another use case. YTA provides insights in real time. In addition, we also track YouTube site health using Procella. These applications require native support for the time dimension, especially to filter based on different time boundaries like the last 60 minutes or the last 2 days.
As I mentioned previously, using Procella to power the metrics on various pages of the YouTube platform is one of the most exciting use cases. YouTube gets millions of activity updates per second. So, this requires millions of rows per second of reliable ingestion in real time. This is mainly done through pub/sub, which is a persistent message queue. Updates are replicated to multiple ingestion servers for reliability. On the serving side, this use case requires the ability to serve millions of simple point-lookup queries per second. These queries take only a few milliseconds to run on the ingested data. To make this possible, we have heavily optimized the query serving path with additional features that can look up and scan the required data efficiently.
Ad-hoc analytics essentially covers interactive querying at a low QPS. Data scientists and analysts make use of tables stored in Procella to identify trends, growth factors and other important custom business metrics. This typically requires querying complex data types like arrays and nested structs, and the ability to join arbitrary datasets.
We support a number of joins that are required for these queries to work. In particular, we support broadcast joins, which broadcast the right side of the join to all the leaf nodes so that they can construct local hash maps for lookups during the join. We support remote lookup joins that leverage our lookup-friendly data format. We support pipelined joins that run in stages: they first compute a small amount of join data from the right side so that it can be used as a filter on the left side in subsequent stages. Shuffled joins are essentially reduce-side joins that can scale to arbitrarily large datasets. Lastly, co-partitioned joins leverage the partitioning of the left and right sides to efficiently merge them during the join.
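To make the broadcast case concrete, here is a minimal Python sketch of one leaf’s share of a broadcast hash join; the row shapes and column names are made up for illustration, and this is not Procella’s actual implementation.

```python
def broadcast_hash_join(left_partition, right_side, key):
    """One leaf's share of a broadcast join: the (small) right side is shipped to
    every leaf, which builds a local hash map and probes it with its own left rows."""
    lookup = {}
    for row in right_side:                      # build phase (identical on every leaf)
        lookup.setdefault(row[key], []).append(row)
    joined = []
    for row in left_partition:                  # probe phase (leaf-local rows only)
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined

# Hypothetical rows: this leaf's slice of a fact table joined to a small dimension.
left = [{"video_id": 1, "views": 10}, {"video_id": 2, "views": 7}]
right = [{"video_id": 1, "channel": "music"}]
print(broadcast_hash_join(left, right, "video_id"))
```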
Supporting SQL-based data pipelines requires the ability to handle large amounts of data and process it using complex business logic. These are typically multi-stage queries that join various data sources. These joins and aggregations require large shuffles. For logic that cannot be expressed in SQL, UDF support is necessary.
To enable this use case, we have a pluggable shuffle component, and we leverage the efficient RDMA shuffle service available across Google. We have also implemented an adaptive cost-based optimizer that can optimize parts of these queries on the fly.
All of the categories of use cases I mentioned so far are powered by Procella in production at YouTube. Procella’s unique scalable, composable architecture makes that possible. Let’s dive into the architecture now. This diagram roughly shows the architecture of Procella. All rectangular boxes in the diagram are essentially independent services in the Procella query engine.
Towards your left are the metadata server, the metadata store and the registration server. Users register their tables with Procella using DDL commands like CREATE/ALTER TABLE or programmatically using RPCs. This is done to define the schemas of the tables. Then users set up upstream batch processes to periodically create partitions of datasets. These datasets are stored on Google’s distributed file system called Colossus. After datasets are generated, users register them with Procella using ALTER TABLE commands or programmatically with RPCs. These datasets are typically stored in columnar file formats such as Capacitor or Artus. Each dataset consists of many blocks of rows called tablets. A tablet typically has tens of thousands to millions of rows. Data is generally range or hash partitioned to achieve clustering of data by the columns that are frequently filtered on. During data registration, the registration server extracts metadata for all tablets, such as stats, zone-maps, dictionaries and bloom filters, and stores it in the metadata store in a highly compact and organized way.
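As an illustration of the registration step, here is a small Python sketch of the kind of per-tablet metadata a registration service might extract; the tablet layout, field names and dictionary threshold are hypothetical, not Procella’s actual code.

```python
from dataclasses import dataclass, field

@dataclass
class TabletMetadata:
    """Per-tablet index info of the kind a registration server could store."""
    min_key: int                                   # zone-map lower bound for the partition column
    max_key: int                                   # zone-map upper bound
    dictionary: set = field(default_factory=set)   # kept only for low-cardinality columns
    row_count: int = 0

def register_tablet(rows, key_column, dict_threshold=1024):
    """Extract zone-map and (optionally) dictionary metadata for one tablet."""
    keys = [row[key_column] for row in rows]
    meta = TabletMetadata(min_key=min(keys), max_key=max(keys), row_count=len(rows))
    distinct = set(keys)
    if len(distinct) <= dict_threshold:            # a dictionary only pays off for few values
        meta.dictionary = distinct
    return meta

# Usage: a tiny fake tablet of (video_id, views) rows, keyed by video_id.
tablet = [{"video_id": vid, "views": vid * 10} for vid in range(100, 200)]
meta = register_tablet(tablet, "video_id")
print(meta.min_key, meta.max_key, meta.row_count)   # 100 199 100
```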
In the query path, shown on your right with yellow colored arrows, dashboard clients or human users send their SQL queries as strings. These are compiled into distributed multi-level query plans to be executed on data servers. The root server is responsible for compiling and coordinating execution of the query. It also does the last stage of query execution before returning results back to the users.
The metadata server is primarily responsible for pruning tablets based on the partition metadata stored in the metadata store. To make this efficient, metadata is cached in the metadata servers in an LRU cache. We use zone-maps, dictionaries and bloom filters for pruning the tablets required for the query.
Once the tablets to be processed are identified, the distributed query plan is executed on the data servers for those tablets. Data servers load the columns required for query execution on an as-needed basis and cache them in large local caches at every data server. Data shuffles between stages of the query are handled using the remote RDMA shuffle service.
In realtime ingestion, shown with blue arrows, data producers send rows of data directly via RPC or using a persistent pub/sub queue. Based on the pre-configured partitioning on selected columns, each row is ingested into two ingestion servers, a primary and a secondary. This data is held in memory buffers that are efficient for lookups and is periodically checkpointed to Colossus. These checkpointed files are eventually compacted or aggregated into larger files for querying efficiency. Data ingested into Procella is servable from the point it enters the system. So, queries on realtime tables are served from the buffers, the small files and the large compacted files. The lifecycle of these buffers, small files and large files is coordinated with transactions on the metadata store via the registration server.
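The buffer-to-small-file-to-compacted-file lifecycle can be sketched roughly as follows; the checkpoint trigger, data shapes and class names are simplified assumptions, not the real ingestion server.

```python
class IngestionServer:
    """Toy model of the buffer -> small file -> compacted file lifecycle."""
    def __init__(self, checkpoint_every=3):
        self.buffer = []          # rows ingested but not yet checkpointed
        self.small_files = []     # checkpointed to "Colossus", not yet compacted
        self.large_files = []     # compacted files
        self.checkpoint_every = checkpoint_every

    def ingest(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.checkpoint_every:
            self.small_files.append(list(self.buffer))   # periodic checkpoint
            self.buffer.clear()

    def compact(self):
        if self.small_files:
            merged = [row for f in self.small_files for row in f]
            self.large_files.append(merged)              # small files folded into a large one
            self.small_files.clear()

    def queryable_rows(self):
        # Data is servable from the moment it is ingested: buffers + small + large files.
        return (self.buffer
                + [r for f in self.small_files for r in f]
                + [r for f in self.large_files for r in f])

srv = IngestionServer()
for i in range(7):
    srv.ingest({"video_id": i, "views": 1})
srv.compact()
print(len(srv.queryable_rows()))   # all 7 rows stay visible throughout the lifecycle
```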
Now that we have looked at the architecture. Let’s look at some scale and latency numbers.
These are numbers from one of our instances serving analytics on YouTube. On this instance, we execute over 450 million queries per day, of which more than 200 million are queries on real-time datasets. This results in over 90 billion leaf plans that run on the data servers. In total, this scans over 20 quadrillion rows per day and returns over 100 billion rows. We can achieve a peak scan rate of 100+ billion rows per second. This instance has over 100 tables with more than 200 columns.
Let’s look at the latency numbers for our analytics instance. As you can see, our queries have just 25 ms of e2e latency at median. Our slowest queries take about 3 seconds possibly because of the size of the query and cache misses. At median, a query scans about 17M rows.
On range-partitioned data, the Procella metadata server can prune a large number of the tablets to be scanned. This is done by using efficient data structures to store tablet boundaries in memory and then binary searching through them to identify the tablets that satisfy the given filter clause.
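A minimal Python sketch of that binary-search pruning, assuming each range-partitioned tablet is identified by its sorted start key (the real metadata structures are considerably richer):

```python
import bisect

def prune_tablets(tablet_start_keys, lo, hi):
    """Return indices of tablets whose key range may overlap the filter [lo, hi].

    tablet_start_keys holds the sorted start key of each tablet; tablet i covers
    keys from tablet_start_keys[i] up to (but excluding) tablet_start_keys[i + 1].
    """
    first = max(bisect.bisect_right(tablet_start_keys, lo) - 1, 0)   # first tablet that may hold lo
    last = max(bisect.bisect_right(tablet_start_keys, hi) - 1, 0)    # last tablet that may hold hi
    return list(range(first, last + 1))

# Usage: 5 tablets starting at these keys; a filter on keys 25..45 touches tablets 1 and 2.
start_keys = [0, 20, 40, 60, 80]
print(prune_tablets(start_keys, 25, 45))   # -> [1, 2]
```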
Now, let’s deep dive into some of the features that make Procella work so well. These few features are essentially the secret sauce that makes Procella so fast and efficient.
Stateful caching:
For fast and efficient processing, we cache at various layers in Procella.
On the metadata server, constraints or zone-maps and other indexes are cached in an LRU cache to avoid going to the metadata store. These are stored in a versioned cache so that cache invalidation is easy.
As I mentioned before, data processed by Procella is stored remotely on Colossus, the distributed file system. To allow efficient fetching of data, we cache file handles, which essentially hold the location of the remote Colossus server and metadata such as the number, size and location of chunks. We also cache column-level metadata, including the size of various columns and their offsets in the file. In addition, in a separate cache, we store the blocks of columns required during query processing. We use a smart age- and cost-aware caching policy for this cache. Data segments have data server affinity to achieve a high cache hit rate. So, any data segment is loaded and processed by only two data servers, a primary and a secondary for that segment. Any request to process a data segment first goes to the primary, and if the primary does not respond fast enough, the secondary handles that request. In addition, we support expression caching that allows users to cache expensive expressions in the query.
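One common way to get that kind of stable primary/secondary affinity is rendezvous (highest-random-weight) hashing; the sketch below illustrates the idea, though Procella’s actual assignment scheme is not described in this talk.

```python
import hashlib

def segment_affinity(segment_id, servers, replicas=2):
    """Pick a stable primary and secondary data server for a data segment."""
    def score(server):
        digest = hashlib.md5(f"{segment_id}:{server}".encode()).hexdigest()
        return int(digest, 16)
    ranked = sorted(servers, key=score, reverse=True)
    return ranked[:replicas]            # [primary, secondary]

servers = [f"ds-{i}" for i in range(8)]
# Same answer on every call, so repeated requests for this segment hit warm caches.
print(segment_affinity("table42/tablet-0017", servers))
```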
Now, lets talk about tail latency reduction.
To achieve tail latency reduction, we use speculative execution for data server RPCs. During the execution of any query, after 90% of the query execution completes, we send backup RPCs to secondary data servers for the remaining 10% of incomplete RPCs. We also send minibatch backup RPCs at regular intervals during the execution of the query; that is, one backup RPC is sent for the slowest RPC in every set of n RPCs.
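The two hedging policies can be summarized with a toy calculation like the one below; the 90% threshold follows the description above, while the minibatch size n is an illustrative placeholder.

```python
def planned_backup_rpcs(total_rpcs, completed_rpcs, minibatch_size=50):
    """Rough count of backup RPCs issued under the two policies described above."""
    backups = 0
    # Policy 1: once ~90% of the leaf RPCs have finished, hedge every straggler.
    if completed_rpcs >= 0.9 * total_rpcs:
        backups += total_rpcs - completed_rpcs
    # Policy 2: throughout execution, one backup RPC per minibatch of n RPCs,
    # covering the slowest RPC in each batch.
    backups += total_rpcs // minibatch_size
    return backups

# 70 straggler backups plus 20 minibatch backups for a 1000-RPC query.
print(planned_backup_rpcs(total_rpcs=1000, completed_rpcs=930))   # -> 90
```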
Indexing is one of the things that makes Procella architecturally different from other query engines. Procella uses a separate metadata server to store indexing information and does a separate, efficient processing pass over it before query execution to prune and select tablets. Procella uses dynamic range partitioning, for both batch and realtime tables, to achieve uniform distribution of data across tablets. Dictionary pruning is helpful when there is a relatively small number of values compared to the number of rows, and bloom filters help significantly for needle-in-the-haystack queries, for example getting metrics for less popular videos, like views for the videos on my channel over the last 28 days.
Our query execution is fully distributed and massively parallel, so we can leverage the power of all the data servers during execution of a query.
We also have a vectorized evaluation engine, called Superluminal, that works on blocks of values and leverages modern CPU instructions during evaluation. Superluminal is a columnar evaluation library that can natively understand and leverage RLE and dictionary encodings to make group-by and filtering operations much cheaper to execute.
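As a flavor of why RLE awareness helps, here is a tiny sketch that evaluates a filter over run-length-encoded data without expanding it; Superluminal itself is of course far more sophisticated.

```python
def count_matching_rows_rle(runs, predicate):
    """Apply a predicate to an RLE-encoded column without decoding it.

    `runs` is a list of (value, run_length) pairs; each run is touched once
    instead of once per row.
    """
    return sum(length for value, length in runs if predicate(value))

# Usage: a country-code column stored as runs; count rows where country == "US".
runs = [("US", 1_000_000), ("IN", 500_000), ("US", 250_000)]
print(count_matching_rows_rle(runs, lambda v: v == "US"))   # 1,250,000 rows, 3 predicate calls
```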
Procella supports querying data stored in two columnar file formats: Capacitor and Artus. Capacitor is Dremel’s native file format, which is widely used at Google. It supports RLE and dictionary encoded columns and has support for storing bloom filters for cheaply checking whether a value exists in a column.
We have also developed another file format, Artus, which in addition to the features in Capacitor provides efficient lookup and seek APIs on columns to make Procella evaluation significantly more efficient. In Artus, we use multi-pass adaptive encoding to select the best encoding for a column. Artus gets space savings from this without using the generalized compression used by other file formats; using adaptive encodings instead of generalized compression avoids the cost of fully decompressing columns. It allows lookups using binary search and O(1) seek APIs to move between rows. It also simplifies handling of complex data types by using a length-based representation for arrays instead of repetition and definition levels. Artus exposes dictionary and RLE information to the execution engine so that operations like filtering and group-bys can be made more efficient.
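The multi-pass idea can be illustrated with a toy cost model that estimates a column’s size under a few candidate encodings and keeps the cheapest; the byte estimates below are made up and are not Artus’s real cost model.

```python
def choose_encoding(column):
    """Estimate the encoded size of a column under candidate encodings and pick the smallest."""
    n = len(column)
    distinct = set(column)
    runs = 1 + sum(1 for a, b in zip(column, column[1:]) if a != b)
    candidates = {
        "plain": n * 8,                            # assume 8 bytes per value
        "dictionary": len(distinct) * 8 + n * 2,   # dictionary entries + 2-byte codes
        "rle": runs * (8 + 4),                     # (value, run length) pairs
    }
    best = min(candidates, key=candidates.get)
    return best, candidates

col = ["US"] * 10_000 + ["IN"] * 5_000
print(choose_encoding(col)[0])   # -> 'rle' for this highly repetitive column
```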
We’ve already talked about realtime ingestion, but one more thing to add is that once the data is persisted to files, queries on those files are served by the data servers. So ingestion servers only manage the memory buffers and queries on them.
Lastly Procella supports virtual tables. These are used for various purposes.
One, to simplify development and allow flexibility, they allow for efficient aggregate selection. Users can write their queries to select metrics and dimensions from the virtual table without knowing which physical aggregates contain them, and Procella will rewrite these to query the most efficient aggregate that can correctly serve the query. This also allows backend teams to add necessary aggregates and change backend data models independently, without modifying any incoming queries.
Two, virtual tables provide automatic stitching between batch and realtime tables. So, for pipelines using a lambda architecture, as the batch-processed data arrives, queries automatically add the necessary filters on real-time data to show a stitched view of the datasets.
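Conceptually, the stitching boils down to something like the sketch below, where a batch high watermark decides which side serves each time range; the row shape and watermark handling are illustrative only.

```python
from datetime import date, timedelta

def stitch(batch_rows, realtime_rows, batch_high_watermark):
    """Serve batch data up to its high watermark and realtime data after it."""
    return ([r for r in batch_rows if r["day"] <= batch_high_watermark]
            + [r for r in realtime_rows if r["day"] > batch_high_watermark])

today = date(2019, 6, 1)
batch = [{"day": today - timedelta(days=d), "views": 100} for d in range(1, 4)]
realtime = [{"day": today - timedelta(days=d), "views": 90} for d in range(0, 4)]
# Batch is complete through yesterday, so only today's row comes from the realtime side.
print(stitch(batch, realtime, batch_high_watermark=today - timedelta(days=1)))
```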
And lastly, virtual tables are normalization friendly. They provide a denormalized view of the tables to the users and add the necessary joins during the rewrite to perform denormalization on the fly.
I haven’t covered all the features here but these key features are primarily responsible for making Procella a success.
Here are some results of running TPC-H benchmark queries on Procella. As you can see, using the techniques discussed before, we are able to achieve some of the best results for this benchmark compared to other query engines. Note that these are slightly older numbers from running the queries on a static Artus TPC-H dataset, using a large instance, with manually rewritten queries to add join hints, etc.
With that, let me conclude my talk by summarizing what we learned so far. In this talk, we learned about Procella, a composable, scalable and versatile query engine we have built at Youtube. We learned about the architecture and what makes it work so well. If this is something that excites you, feel free to come talk to me later to learn about opportunities. I can take questions now. But, before that, I would like to mention that I will not be sharing any numbers other than ones mentioned on the slides in this talk. Also, I will not comment on comparison of Procella with any other existing Google products.