The document discusses online approximate OLAP in SparkSQL. It introduces generalized online aggregation (G-OLA), which provides continuously refined query answers with shrinking error bars in SparkSQL. This is achieved by making delta maintenance a first-class citizen in query execution. The document also discusses how BlinkDB performs error estimation on samples of the data to provide error bounds for aggregate query results.
In this talk about Apache Flink we will touch on three main things: an introductory look at Flink, a look under the hood, and a demo.
* In the introduction we will briefly look at the history of Flink and then go on to the API and different use cases. Here we will also see how it can be deployed in practice and what some of the pitfalls in a cluster setting can be.
* In the second section we will look at the streaming execution engine that lies at the heart of Flink. Here we will see what makes it tick and also what distinguishes it from other approaches, such as the mini-batch execution model.
* In the final section we will see a live demo of a fault-tolerant streaming job that performs analysis of the Wikipedia edit stream.
Ufuk Celebi - PMC member at Apache Flink and co-founder and software engineer at data Artisans
University of Virginia
cs4414: Operating Systems
http://rust-class.org
For embedded notes, see: http://rust-class.org/class-21-bakers-and-philosophers.html
Instead of randomly injecting faults (e.g., Chaos Monkey), what if we could order our experiments to perform the minimum number of experiments for maximum yield? We present a solution (and results) to the problem of experiment selection, using Lineage-Driven Fault Injection to reduce the search space of faults.
Lineage-Driven Fault Injection (LDFI) is a state-of-the-art technique for chaos engineering experiment selection. Since its inception, LDFI has used a SAT solver under the hood, which presents solutions to the decision problem (which faults to inject) in no particular order. As SREs, we would like to perform first the experiments that reveal the bugs customers are most likely to hit. In this talk, we present new improvements to LDFI that order the experiment suggestions.
In the first half of the talk we will show that LDFI is a technique that can be widely used within an enterprise. We present the motivation for ordering the chaos experiments, along with some prioritizations we utilized while conducting the experiments. We also highlight how ordering is a general-purpose technique that can encode the peculiarities of a heterogeneous microservices architecture. LDFI can work in an enterprise by harnessing the observability infrastructure to model the redundancy of the system.
Next, we present experiments conducted within our organization using ordered LDFI and some preliminary results. We show examples of services where we discovered bugs, and how carefully controlling the order of experiments allowed LDFI to avoid running unnecessary experiments. We also present an example of an application whose service we declared shippable under the crash-stop model. We also compare with Chaos Monkey and show how LDFI found the known bugs in a given application using orders of magnitude fewer experiments than a random fault-injection tool.
Finally, we discuss how we plan to take LDFI forward: open problems and possible solutions for scalarizing probabilities of failure, latency injection, integration with service-mesh technologies like Envoy for fine-grained fault injection, and fault injection for stateful systems.
Key takeaways: 1) Understand how LDFI can be integrated in the enterprise by harnessing the observability infrastructure. 2) Limitations of LDFI with respect to unordered solutions, and why ordering matters for chaos engineering experiments. 3) Preliminary results of prioritized LDFI and a future direction for the community.
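The abstract does not spell out the ranking function; as a loose illustration of "ordering experiments by what customers are most likely to hit", here is a minimal Python sketch (the candidate fault sets, fault names and rates are all invented) that sorts solver-suggested experiments by estimated co-occurrence probability:

```python
# Hypothetical candidates: each entry is a set of faults that the SAT
# solver says could jointly violate an invariant, plus per-fault
# occurrence rates estimated from observability/tracing data.
candidates = [
    ({"cache-timeout"}, {"cache-timeout": 0.02}),
    ({"db-crash", "replica-crash"}, {"db-crash": 0.001, "replica-crash": 0.004}),
    ({"lb-drop"}, {"lb-drop": 0.01}),
]

def likelihood(faults, rates):
    # Probability that all faults in the set co-occur, assuming independence.
    p = 1.0
    for fault in faults:
        p *= rates[fault]
    return p

# Run the experiments customers are most likely to hit first.
for faults, rates in sorted(candidates, key=lambda c: likelihood(*c), reverse=True):
    print(sorted(faults), likelihood(faults, rates))
```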
Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri... (corehard_by)
This is an update of the Clang-based implementation of Herb Sutter’s Lifetime safety profile for the C++ Core Guidelines, available online at cppx.godbolt.org.
University of Virginia
cs4414: Operating Systems
http://rust-class.org
Explicit Memory Management
4.3BSD
Morris Worm
fingerd code
NX bit
For embedded notes, see: http://rust-class.org/class-8-managing-memory.html
An overview of the inner workings of OpenJDK, with emphasis on...
- what triggers the just-in-time compiler (JIT)
- types of speculative optimizations performed by the JIT
- aspects of the Java language & ecosystem that make ahead-of-time (AOT) compilation challenging
[Defcon Russia #29] Alexey Tyurin - Spring autobinding (DefconRussia)
Spring MVC has a great feature: autobinding. But if it is used incorrectly, "invisible" vulnerabilities can appear, sometimes with serious impact. We will look at a couple of examples and dig into the subtleties of how autobinding bugs arise. Writeup [ENG]: http://agrrrdog.blogspot.ru/2017/03/autobinding-vulns-and-spring-mvc.html
EdSketch: Execution-Driven Sketching for Java (Lisa Hua)
Sketching is a relatively recent approach to program synthesis which has shown much promise. The key idea in sketching is to allow users to write partial programs that have "holes" and provide test harnesses or reference implementations, and let synthesis tools create program fragments that fill the holes such that the resulting complete program has the desired functionality. Traditional solutions to the sketching problem perform a translation to SAT and employ CEGIS. While effective for a range of programs, such translation-based approaches have a key limitation when applied to real applications: they require either translating all relevant libraries that are invoked directly or indirectly by the given sketch (which can lead to impractical SAT problems) or creating models of those libraries (which can require much manual effort).
This paper introduces execution-driven sketching, a novel approach for synthesis of Java programs using a backtracking search that is commonly employed in software model checkers. The key novelty of our work is to introduce effective pruning strategies to efficiently explore the actual program behaviors in the presence of libraries and to provide a practical solution to sketching small parts of real-world applications, which may use complex constructs of modern languages, such as reflection or native calls. Our tool EdSketch embodies our approach in two forms: a stateful search based on the Java PathFinder model checker, and a stateless search based on re-execution inspired by the VeriSoft model checker. Experimental results show that EdSketch's performance compares well with the well-known SAT-based Sketch system for a range of small but complex programs, and moreover, that EdSketch can complete some sketches that require handling complex constructs.
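EdSketch itself targets Java, but the execution-driven search is easy to illustrate; below is a minimal Python sketch (the partial program, hole candidates and tests are all invented for illustration) that fills a single hole by running candidates against a test harness and pruning the ones that fail:

```python
import operator

# A partial program with one "hole": the comparison operator is unknown.
def make_max(hole_op):
    def max2(a, b):
        return a if hole_op(a, b) else b
    return max2

# Test harness: the synthesized program must pass all of these.
tests = [((3, 5), 5), ((7, 2), 7), ((4, 4), 4)]

# Candidate pool for the hole; executing each candidate against the
# harness prunes wrong choices immediately, no SAT translation needed.
candidates = [operator.lt, operator.le, operator.gt, operator.ge, operator.eq]

solutions = []
for cand in candidates:
    prog = make_max(cand)
    if all(prog(*args) == want for args, want in tests):
        solutions.append(cand.__name__)

print(solutions)  # ['gt', 'ge']: both satisfy the harness
```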
Veniamin Gvozdikov: Specifics of Using DTrace (Yandex)
First, we will look at DTrace's capabilities for tracing applications in real time. Then we will move on to the specifics of using it on different operating systems and their limitations. At the end, we will walk through exercises with dynamically created probes for application development in Lua on Tarantool.
Story of static code analyzer development (Andrey Karpov)
Greetings from the past: simple tools and bad standards. Regular expressions don't work. What is inside modern static code analyzers, using PVS-Studio as an example. About machine learning. Learning vs. data-flow analysis.
This webinar by Oleksandr Navka (Lead Software Engineer, Consultant, GlobalLogic) was delivered at Java Community Webinar #1 on August 12, 2020.
Webinar agenda:
- Java Records, a new structural unit of the program
- The updated instanceof operator
- The updated switch expression
More details and presentation: https://www.globallogic.com/ua/about/events/java-community-webinar-1/
How to deploy Apache Spark in a multi-tenant, on-premises environment (BlueData, Inc.)
Adoption of Apache Spark in the enterprise is increasing rapidly - it's become one of the fastest growing and most popular technologies in the Big Data ecosystem.
However, implementing an enterprise-ready, on-premises Spark deployment can be very complex, and it requires expertise that is not widely available.
BlueData makes it easier to deploy Apache Spark on-premises. With BlueData, you can spin up virtual Spark clusters within minutes – providing secure, self-service, on-demand access to Big Data analytics and infrastructure. You can deploy Spark in standalone mode or with Hadoop / YARN. You can also build analytical pipelines and create Spark clusters using our RESTful APIs, and use web-based Zeppelin notebooks for interactive data analytics.
BlueData’s software platform leverages virtualization and Docker containers – combined with our own patent-pending innovations – to make it faster, and more cost-effective for enterprises to get up and running with a multi-tenant Spark deployment on-premises.
Learn more at www.bluedata.com
Big Data Day LA 2016 / Hadoop / Spark / Kafka track - Data Provenance Support in... (Data Con LA)
Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. To aid this effort, we built Titian, a library that enables tracking data provenance through transformations in Apache Spark.
With Lakehouse as the future of data architecture, Delta becomes the de facto data storage format for all data pipelines. By using Delta to build curated data lakes, users achieve end-to-end efficiency and reliability. Curated data lakes involve multiple hops in the end-to-end data pipeline, which are executed regularly (mostly daily) depending on the need. As data travels through each hop, its quality improves and becomes suitable for end-user consumption. Real-time capabilities are key for any business and an added advantage; luckily, Delta has seamless integration with Structured Streaming, which makes it easy for users to achieve real-time capability using Delta. Overall, Delta Lake as a streaming source is a marriage made in heaven for various reasons, and we are already seeing the rise in adoption among our users.
In this talk, we will discuss the various functional components of Structured Streaming with Delta as a streaming source. We will deep dive into Query Progress Logs (QPLs) and their significance for operating streams in production: how to track the progress of any streaming job and map it to the source Delta table using the QPL, what exactly gets persisted in the checkpoint directory, and how the contents of the checkpoint directory map to QPL metrics and why they matter for Delta streams.
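As a rough illustration of where the Query Progress Log surfaces in code, here is a minimal PySpark sketch (paths, table layout and app name are hypothetical, and it assumes a Spark session with the Delta Lake package available) that reads a Delta table as a streaming source and inspects the progress of the last micro-batch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-stream-demo").getOrCreate()

# Read a Delta table as a streaming source (hypothetical path).
events = spark.readStream.format("delta").load("/delta/events")

# Aggregate and write to another Delta table; the checkpoint directory
# records which versions/files of the source each micro-batch consumed.
query = (events.groupBy("city").count()
         .writeStream.format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "/chk/events_by_city")
         .start("/delta/events_by_city"))

# lastProgress is the Query Progress Log (QPL) of the latest micro-batch:
# a JSON-like dict with per-source offsets, row counts and durations.
print(query.lastProgress)
```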
BlinkDB - Approximate Queries on Very Large Data (Knoldus Inc.)
Big datasets are growing exponentially, but our need to get quick, interactive responses to our queries remains as important as ever. This talk will feature an overview of the various components in BlinkDB used to incrementally process massive amounts of data on clusters of tens, hundreds or thousands of machines while returning approximate answers. More precisely, this new execution model enables SparkSQL to present the user with meaningful approximate results (with error bars) that are continuously refined and updated, at a speed comfortable to the user, while it crunches larger and larger fractions of the whole dataset in the background. This not only alleviates the need for pre-processing the data in advance for a wide range of queries, but also enables users to observe the progress of a query and control its execution on the fly, enabling a smooth time/accuracy trade-off.
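The "error bars" rest on classical sample-based estimation; here is a minimal numpy sketch of the idea (synthetic data, not BlinkDB's actual implementation), estimating an average and a 95% confidence interval from a uniform sample via the bootstrap:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=35.0, size=1_000_000)  # synthetic "latencies"

# Query only a small uniform sample instead of the full dataset.
sample = rng.choice(population, size=2_000, replace=False)
estimate = sample.mean()

# Bootstrap: resample the sample to approximate the sampling distribution.
boots = [rng.choice(sample, size=sample.size, replace=True).mean()
         for _ in range(1_000)]
lo, hi = np.percentile(boots, [2.5, 97.5])

print(f"avg = {estimate:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```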
Open Source SQL databases enter millions queries per second era (Alexander Korotkov)
It's a widespread belief that SQL DBMSes are doomed to be hulking because of the burden of backward compatibility. This belief is frequently used in the marketing of various NoSQL DBMSes. However, this is not necessarily true. Development in the open source community makes product development flexible enough to meet the needs of the times. MySQL and PostgreSQL, the most popular open source DBMSes, recently made optimizations for big servers which make it possible to process more than a million SQL queries per second on a single instance. This talk will cover the particular optimizations made in PostgreSQL to achieve this result, which could previously seem like fantasy. One may say that open source DBMSes open a new era of millions of queries per second.
Presentation slides of Gaihre, Anil, et al. "XBFS: Exploring Runtime Optimizations for Breadth-First Search on GPUs." Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019.
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. Relational queries are compiled to executable physical plans consisting of transformations and actions on RDDs, with the generated Java code compiled to Java bytecode, executed by the JVM, and optimized by the JIT to native machine code at runtime. This talk will take a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
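You can inspect these compiled plans and the whole-stage generated code yourself; a minimal PySpark sketch (assuming Spark 3.x, with a synthetic table standing in for real data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

# A synthetic table with a key column and a numeric column.
df = spark.range(1_000_000).selectExpr("id % 10 AS key", "id AS latency")

# "formatted" prints the physical plan; operators fused by whole-stage
# code generation are grouped under WholeStageCodegen nodes.
df.groupBy("key").avg("latency").explain(mode="formatted")

# "codegen" prints the generated Java code itself.
df.groupBy("key").avg("latency").explain(mode="codegen")
```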
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi (Databricks)
Modern Data-Intensive Scalable Computing (DISC) systems such as Apache Spark do not support sophisticated cost-based query optimizers because they are specifically designed to process data that resides in external storage systems (e.g. HDFS), or they lack the necessary data statistics. Consequently, many crucial optimizations, such as join order and plan selection, are presently out of scope in these DISC system optimizers. Yet join order is one of the most important decisions a cost optimizer can make, because a wrong order can result in query response times that are more than an order of magnitude slower than with the better order.
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans (Evention)
This talk will start with a brief introduction to stream processing and Flink itself. Next, we will take a look at some of the most interesting recent improvements in Flink, such as incremental checkpointing, end-to-end exactly-once processing guarantees and network latency optimizations. We'll discuss real problems that Flink's users were facing and how they were addressed by the community and data Artisans.
Slides for a talk given by Fede Fernández & Fran Pérez. It describes some common tips to improve a Kafka and Spark application, going through improving table joins, operational parameters such as blockIntervalTime or the number of partitions, serialization, and how byKey operations work behind the scenes.
Spark Streaming Tips for Devs and Ops by Fran Perez and Federico Fernández (J On The Beach)
During this talk we will see a regular Kafka/Spark Streaming application, going through some of the most common issues and how we fixed them. We'll see how to improve our Spark app from two different points of view: code quality and Spark tuning. The final goal is to have a robust and resilient Spark application deployable in a production-like environment.
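One of the classic byKey tips such talks cover is preferring reduceByKey over groupByKey, so that Spark can combine values map-side before shuffling; a minimal PySpark sketch (local mode, toy data):

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "bykey-demo")
pairs = sc.parallelize([("nyc", 30), ("nyc", 38), ("sf", 28), ("sf", 35)])

# groupByKey ships every value across the shuffle, then aggregates.
sums_slow = pairs.groupByKey().mapValues(sum)

# reduceByKey pre-aggregates within each partition (map-side combine),
# shuffling only one partial sum per key per partition.
sums_fast = pairs.reduceByKey(lambda a, b: a + b)

print(sorted(sums_fast.collect()))  # [('nyc', 68), ('sf', 63)]
```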
Similar to Online Approximate OLAP in SparkSQL (20)
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques followed by light introduction to DL and short discussion what is current state-of-the-art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and will walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
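In the spirit of the workshop's scikit-learn samples, here is a minimal, self-contained example (not the workshop's actual code) that trains and evaluates a classifier on a popular built-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a classic dataset and hold out a test split for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(f"test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```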
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data, streaming it in real time into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data using Apache Zeppelin against Phoenix tables as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be trivial to design applications that make the most of it, nor the simplest to operate. As it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables have to be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of data into many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and that annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records get treated as inserts and get re-written to HDFS instead of being updated, leading to duplication of data and breaking data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems, and it needs strong consistency and high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component, a critical choice in allowing us to scale our jobs to more than 500 billion writes a day. In this talk, we will discuss data at Uber and expound on why we built the global index using Apache HBase and how this helps scale out our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly into the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... (DataWorks Summit)
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
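Those "few lines of code" look roughly like the following; a minimal sketch using MLflow's Python tracking API (the parameter names and values are illustrative):

```python
import mlflow

# Everything logged inside the run is recorded automatically,
# even if the person launching the script does nothing special.
with mlflow.start_run(run_name="demo"):
    mlflow.log_param("n_estimators", 100)   # hyperparameters
    mlflow.log_metric("accuracy", 0.967)    # evaluation metrics
    with open("notes.txt", "w") as f:
        f.write("trained on the iris dataset\n")
    mlflow.log_artifact("notes.txt")        # plots, files, etc.
```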
Extending Twitter's Data Platform to Google Cloud (DataWorks Summit)
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and various tools and libraries to help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extended our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale on the cloud, and we present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we deep dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi (DataWorks Summit)
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One of the challenges they face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... (DataWorks Summit)
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations with a deep learning system attached to a camera stream, such as identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark (DataWorks Summit)
Whole genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and the number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We finished with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Online Approximate OLAP in SparkSQL
1. Online Approximate OLAP in SparkSQL
Kai Zeng and Sameer Agarwal
Hadoop Summit | San Jose, CA | June 9th 2015
2. Need for Quick & Continuous Feedback
Scanning 100 TB on 1000 machines takes roughly ½ - 1 hour from hard disks (daily reports) and 1 - 5 minutes from memory (hourly reports), but exploratory queries need feedback in about 1 second.
3. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation [1]
What is the average latency in
the table?
34.6667
[1] Joseph M. Hellerstein, Peter J. Haas, Helen J. Wang. Online Aggregation. In SIGMOD 1997.
4. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
35
5. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83
35
6. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83
34.6667
35
7. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
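The sequence above is the whole idea of online aggregation: keep refining the estimate and its error bar as more data is processed. A minimal sketch in Scala (my own illustration, not the talk's implementation) that folds the latency column in chunk by chunk and reports a CLT-based 95% confidence interval after each chunk; the ± values will not exactly match the slides', which come from the system's own estimator:

// Online aggregation sketch: a running mean whose 95% confidence
// interval tightens as each new chunk of rows is folded in.
object OnlineAvg extends App {
  val latencies = Seq(30.0, 38.0, 34.0, 36.0, 37.0, 28.0,
                      32.0, 38.0, 36.0, 35.0, 38.0, 34.0)
  var n = 0; var sum = 0.0; var sumSq = 0.0
  for (chunk <- latencies.grouped(4)) {
    chunk.foreach { x => n += 1; sum += x; sumSq += x * x }
    val mean = sum / n
    val variance = (sumSq - n * mean * mean) / (n - 1) // sample variance
    val err = 1.96 * math.sqrt(variance / n)           // CLT 95% half-width
    println(f"avg(latency) = $mean%.2f ± $err%.2f after $n rows")
  }
}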
19. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Part 1: Generalized Online Aggregation (G-OLA)
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
20. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
Part 2: Error Estimation in BlinkDB
21. Key Idea: Introduce Delta Maintenance as a First-Class Citizen in Query Execution
Part 1: Generalized Online Aggregation (G-OLA)
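In other words, rather than re-running the query on each larger sample, the engine updates the previous answer with just the newly arrived tuples. A sketch of delta maintenance for a simple AVG (my own illustration of the idea, not G-OLA's code):

// Delta maintenance sketch for AVG: the aggregate keeps (sum, count)
// state, and each mini-batch of new tuples only touches that state.
final case class AvgState(sum: Double, count: Long) {
  def merge(batch: Seq[Double]): AvgState =
    AvgState(sum + batch.sum, count + batch.size)
  def value: Double = sum / count
}

val batches = Seq(Seq(30.0, 38.0, 34.0, 36.0), Seq(37.0, 28.0, 32.0, 38.0))
val state = batches.foldLeft(AvgState(0, 0))(_ merge _)
// state.value is the running average over all tuples seen so far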
22. Query Execution: Under The Hood
SQL query:
SELECT avg(latency)
FROM log
Query plan: AGGR [AVG(latency)] over SCAN [log]
23. Query Execution: Under The Hood
SQL query:
SELECT avg(latency)
FROM log
WHERE latency >
(
SELECT avg(latency)
FROM log
)
Query plan: AGGR [AVG(latency)] over FILTER [latency > AVG(latency)] over JOIN of SCAN [log] with the subquery plan AGGR [AVG(latency)] over SCAN [log]
33. Intermediate State
SELECT avg(latency)
FROM log
WHERE latency >
(
SELECT avg(latency)
FROM log
)
G-OLA: Generalized Online Aggregation for
Interactive Analysis on Big Data. In SIGMOD 2015
34. Intermediate State
SELECT avg(latency)
FROM log
WHERE latency >
(
SELECT avg(latency)
FROM log
)
G-OLA: Generalized Online Aggregation for
Interactive Analysis on Big Data. In SIGMOD 2015
35. 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > (SELECT avg(latency) FROM log)
Intermediate State
36. 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > (SELECT avg(latency) FROM log)
Intermediate State
33.83
37. 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > (SELECT avg(latency) FROM log)
Intermediate State
33.83
38. 35 ± 1.5
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > (SELECT avg(latency) FROM log)
Intermediate State
33.83
39. 35 ± 1.5
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > (SELECT avg(latency) FROM log)
Intermediate State
33.83
40. 35 ± 1.5
Uncertain Tuples
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > (SELECT avg(latency) FROM log)
Intermediate State
33.83
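The uncertain tuples can be characterized mechanically: while the subquery's average is only known up to an error bar, a tuple certainly satisfies the filter only if its latency exceeds the interval's upper bound, certainly fails only if it is below the lower bound, and everything in between must be carried in intermediate state. A sketch (mine, with made-up names):

// Partition tuples against an uncertain filter threshold (estimate ± err):
// only the uncertain band needs to be kept in intermediate state.
case class LogRow(id: Int, city: String, latency: Double)

def classify(rows: Seq[LogRow], estimate: Double, err: Double)
    : (Seq[LogRow], Seq[LogRow], Seq[LogRow]) = {
  val pass      = rows.filter(_.latency > estimate + err)  // certainly satisfy
  val fail      = rows.filter(_.latency < estimate - err)  // certainly fail
  val uncertain = rows.filter(r => math.abs(r.latency - estimate) <= err)
  (pass, fail, uncertain)
}
// e.g. with estimate = 33.83 and err = 1.3, latencies in [32.53, 35.13] are uncertain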
42. Online Execution Model
Goal: minimize intermediate state. Each delta-update query is:
• Optimized using Catalyst in Spark SQL
– Capable of using all existing optimizations
• Compiled into a distributed Spark job (see the sketch below)
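Conceptually, each delta update then runs as an ordinary Spark SQL query over just the new batch of data, so Catalyst can optimize it like any other query. A hedged sketch of that loop using the public DataFrame API (illustrative only; the batch paths are hypothetical, and G-OLA's actual integration hooks into the engine below this level):

// Sketch: run the aggregate as a sequence of per-batch Spark jobs,
// merging each batch's partial (sum, count) into the running state.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{sum, count}

val spark = SparkSession.builder().appName("gola-sketch").getOrCreate()
import spark.implicits._

var totalSum = 0.0; var totalCnt = 0L
for (path <- Seq("log/batch-0", "log/batch-1")) {   // hypothetical batch paths
  val row = spark.read.parquet(path)
    .agg(sum($"latency"), count($"latency")).head()
  totalSum += row.getDouble(0); totalCnt += row.getLong(1)
  println(s"avg(latency) so far: ${totalSum / totalCnt}")
}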
43. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
Part 2: Error Estimation in BlinkDB
44. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.6667
45. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.5
34.6667
46. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.5 ± 2.0
34.6667
47. Error Estimation on a Sample of Data
Focused on estimating aggregate errors given representative samples. Two families of techniques: closed forms based on the Central Limit Theorem (CLT) [HOE:ASTAT63, BIL:WILEY86, CGL:ASTAT83, PH:IBM96] and error estimation using the bootstrap [EFRON:JAS82, EFRON:JAS87, VP:TPMS80, FGK:IJCAI99, ET:CH93].
CLT closed forms (asymptotic in the sample size n; p is the selectivity of the query's predicate, μ and σ² the mean and variance of the aggregated column, μ₄ its fourth central moment; the unknown population quantities are replaced by plug-in sample estimates, e.g. μ by (1/n) Σᵢ Xᵢ where Xᵢ is the i-th sample value):
1. Count: N(np, np(1 − p))
2. Sum: N(npμ, np(σ² + (1 − p)μ²))
3. Mean: N(μ, σ²/n)
4. Variance: N(σ², (μ₄ − σ⁴)/n)
5. Stddev: N(σ, (μ₄ − σ⁴)/(4σ²n))
For estimators other than Sum and Count, the closed forms above assume no filtering (p = 1); filtering increases the variance somewhat, or potentially a lot for extremely selective queries (p → 0).
Bootstrap: draw resamples S₁, …, S₁₀₀ from the sample S of the dataset D, compute the estimator θ(Sᵢ) on each resample, and use the spread of θ(S₁), …, θ(S₁₀₀) around θ(S) to form a 95% confidence interval.
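As a worked instance of the Sum closed form (my own arithmetic, with made-up plug-in values, just to show how the formula is used):

// Sum estimator: N(n·p·μ, n·p·(σ² + (1 − p)·μ²)) with plug-in values.
val (n, p, mu, sigma2) = (12, 0.5, 35.0, 10.0)      // hypothetical sample estimates
val meanSum = n * p * mu                            // 210.0
val varSum  = n * p * (sigma2 + (1 - p) * mu * mu)  // 3735.0
val seSum   = math.sqrt(varSum)                     // ≈ 61.1
println(s"sum(latency) ~ N($meanSum, $varSum), std err ≈ $seSum")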
48. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
What is the average latency in
the table?
49. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
50. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
θ1 = 34
51. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
θ1 = 34, θ2 = 32.5
52. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34, θ2 = 32.5, …, θ100 = 35
53. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34, θ2 = 32.5, …, θ100 = 35
Result: 34.5 ± 2
54. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34, θ2 = 32.5, …, θ100 = 35
Result: 34.5 ± 2
55. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
ID City Latency
1 SLC 37
2 NYC 30
3 SLC 34
4 LA 36
5 NYC 30
ID City Latency
1 SLC 34
2 SLC 37
3 SLC 34
4 LA 36
5 LA 36
...
θ1 = 34.6, θ2 = 33.4, …, θ100 = 35.4
Result: 35 ± 1.6
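The procedure the slides animate is the classical bootstrap; a compact sketch (mine, using a percentile interval, which may differ from the interval construction the talk's system uses):

// Bootstrap sketch: resample the sample with replacement, recompute the
// estimator on each resample, and read a 95% interval off the quantiles.
val rng = new scala.util.Random(42)
val sample = Seq(30.0, 38.0, 34.0, 36.0, 37.0)
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

val thetas = (1 to 100).map { _ =>
  mean(Seq.fill(sample.size)(sample(rng.nextInt(sample.size))))
}.sorted
val (lo, hi) = (thetas(2), thetas(97))   // ~2.5th and ~97.5th percentiles
println(f"theta = ${mean(sample)}%.2f, 95%% CI: [$lo%.2f, $hi%.2f]")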
56. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
Error Estimation in BlinkDB
Leverage Poissonized Resampling to generate samples with replacement.
Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems. In SIGMOD 2014.
57. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
Sample counts from a Poisson(1) distribution
θ1 = 33.8
Error Estimation in BlinkDB
58. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
6 SF 28 2
Incremental Error
Estimation
Error Estimation in BlinkDB
59. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1 #2
1 NYC 30 2 1
2 NYC 38 1 0
3 SLC 34 0 2
4 SLC 34 1 2
5 SLC 37 1 0
6 SF 28 2 1
Construct all resamples in a single pass
Error Estimation in BlinkDB
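Putting the two ideas together: instead of materializing 100 resamples, each incoming tuple draws, for each of the 100 virtual resamples, a Poisson(1) count giving its multiplicity in that resample, so all 100 bootstrap aggregates are maintained in one scan. A sketch (mine; the inversion-method Poisson sampler is my choice, not necessarily BlinkDB's):

// Poissonized bootstrap in one pass: tuple i contributes #ij copies of
// itself to resample j, where #ij ~ Poisson(1), drawn on the fly.
val rng = new scala.util.Random(7)
def poisson1(): Int = {                 // inversion sampler for Poisson(1)
  var k = 0; var p = rng.nextDouble()
  while (p > math.exp(-1.0)) { k += 1; p *= rng.nextDouble() }
  k
}

val latencies = Seq(30.0, 38.0, 34.0, 36.0, 37.0)
val numResamples = 100
val sums = Array.fill(numResamples)(0.0)
val counts = Array.fill(numResamples)(0L)
for (x <- latencies; j <- 0 until numResamples) {
  val c = poisson1()                    // multiplicity of this tuple in resample j
  sums(j) += c * x; counts(j) += c
}
val thetas = sums.zip(counts).collect { case (s, c) if c > 0 => s / c }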
60. Bootstrap Performance
Compute Sum(latency): rSumi = rSumi + latency * #i
Interpreted evaluation walks the expression tree Add(rSum, Multiply(latency, #)) for every row: a virtual call to Add.eval(), a virtual call to rSum.eval() returning a boxed Double, a virtual call to Multiply.eval(), a virtual call to latency.eval() returning a boxed Double, a virtual call to #.eval(), and so on.
61. Using Scala Quasiquotes / Runtime Reflection
import scala.reflect.runtime.universe._

// Recursively translate an expression tree into a Scala AST via
// quasiquotes; the AST is compiled once and then run against each row.
def generateCode(e: Expression): Tree = e match {
  case Attribute(ordinal) =>
    q"inputRow.getInt($ordinal)"
  case Add(left, right) =>
    q"""
    {
      val leftResult = ${generateCode(left)}
      val rightResult = ${generateCode(right)}
      leftResult + rightResult
    }
    """
}
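To see the approach end to end, here is a self-contained toy (the Expr ADT and the IndexedSeq row type below are stand-ins I define myself, not SparkSQL's Catalyst classes; running it needs the scala-compiler artifact on the classpath for the ToolBox):

import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox

sealed trait Expr
case class Attr(ordinal: Int) extends Expr       // column reference
case class Plus(l: Expr, r: Expr) extends Expr   // addition node

object CodegenToy {
  private val tb = scala.reflect.runtime.currentMirror.mkToolBox()

  def gen(e: Expr): Tree = e match {
    case Attr(i)    => q"inputRow($i)"
    case Plus(l, r) => q"${gen(l)} + ${gen(r)}"
  }

  // Wrap the generated body in a function literal and compile it once;
  // the resulting closure evaluates the expression directly, avoiding
  // the per-node eval() virtual calls of interpreted evaluation.
  def compile(e: Expr): IndexedSeq[Int] => Int =
    tb.eval(q"(inputRow: IndexedSeq[Int]) => ${gen(e)}")
      .asInstanceOf[IndexedSeq[Int] => Int]
}

// CodegenToy.compile(Plus(Attr(0), Attr(1)))(Vector(3, 4)) returns 7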
62. Code Generation for Bootstrap
Compute Sum(latency): rSumi = rSumi + latency * #i
Consolidate all bootstrap updates into one tight for loop:

val left0 = inputRow.getDouble(j)          // latency
for (i <- 0 until n) {
  val right0 = inputRow.getInt(k + i)      // #i, the resampling count
  val left1  = inputRow.getDouble(0 + i)   // rSumi
  val right1 = left0 * right0
  val result1 = left1 + right1
  resultRow.setDouble(0 + i, result1)      // write back the updated rSumi
}

2-5x Faster!
63. Conclusion
1. Online approximation is an important means to achieve interactivity in processing large datasets
2. New SparkSQL libraries:
- G-OLA for continuous answers
- BlinkDB for continuous error bars
64. Check out our code!
1. Early Adopter Preview: http://github.com/amplab/bootstrap-sql. Send an email to kaizeng@cs.berkeley.edu and sameer@databricks.com to get access!
2. Spark Package coming soon
3. Gradual native SparkSQL integration in 1.5, 1.6 and beyond