Gian will offer his reflections on the Druid journey to date, plus describe his vision for what Druid will become. He will lay out the near-term Druid roadmap and take your questions.
Watch video: https://imply.io/virtual-druid-summit/apache-druid-vision-and-roadmap-gian-merlino
Apache Druid ingests many billions of events and enables instant queries on them in real time. But how? In this talk, each component of an Apache Druid cluster is described, along with the data and query optimisations at its core, to show what unlocks fresh, fast data for all.
Apache Druid®: A Dance of Distributed Processes – Imply
Apache Druid® is an open source analytics database powering fresh, fast analytics in companies from AirBnB to Zeotap on clickstream, telemetry, financial transactions, applications and more. In this talk, we open the box on the three distributed processes in Druid led by the coordinator, overlord, and broker, and the ways that these come together to deliver reliable, performant query, ingestion, and management services.
Building a Real-Time Gaming Analytics Service with Apache Druid – Imply
At GameAnalytics we receive and process real-time behavioural data from more than 100 million daily active users, helping thousands of game studios and developers understand user behaviour and improve their games. In this talk, you will learn how we migrated our legacy backend from an in-house streaming analytics service to Apache Druid, and the lessons learned along the way. By adopting Druid, we have been able to reduce development costs, increase the reliability of our systems, and implement new features that would not have been possible with our old stack. We will provide an overview of our approach to schema design, segment optimization, the creation of our query layer, caching, and datasource optimisation, which can help you understand how to successfully use Druid as a key component in your data processing and reporting infrastructure.
MoPub, a Twitter company, provides monetization solutions for mobile app publishers and developers around the globe. MoPub receives over 33 billion ad requests per day, generating more than 200 TB of raw logs daily. We built MoPub Analytics on Druid and Imply as the analytics platform for our end users: publishers, demand-side partners, and internal users.
We will talk about the architecture of the analytics platform, our Druid cluster setup, hardware choices, monitoring, use cases, limiting factors, and the challenges with lookups along with the solutions we adopted.
Watch video: https://imply.io/virtual-druid-summit/analytics-over-terabytes-of-data-at-twitter-apache-druid
Data Analytics and Processing at Snap – Druid Meetup LA – September 2018 – Charles Allen
Charles Allen covers data processing, analytics, and insights systems at Snap. Strengths of Druid for these use cases are called out, as are differences among some of the processing systems used.
This is the slide collection from the second talk at:
https://www.meetup.com/druidio-la/events/254080924/
Archmage, Pinterest’s Real-time Analytics Platform on Druid – Imply
In this talk, we will cover:
1) the motivation for switching from an HBase-backed analytics system to Druid
2) the architecture of Druid as a platform at Pinterest (Archmage, Hadoop, Kafka), including the query interface. Archmage is a Thrift service in front of Druid that exposes a Thrift API to company-wide clients, handles Druid broker host discovery, serves as a relay to broker hosts to abstract away the async HTTP connection, and provides query optimizations that are transparent to clients, including directly translating fixed-pattern SQL to Druid native JSON queries to save planning time (a sketch of this translation follows the list below). In addition, we’ll cover the production Hadoop batch and Kafka real-time ingestion pipeline setup, and why we picked a pull-based rather than a push-based solution for real-time ingestion.
3) the use cases currently running in production on this platform, including their data volume, QPS, and Druid cluster setup; the unique challenges we met while onboarding, how we addressed them with extensive tuning to meet SLAs, and the lessons learned. The use cases include partner insights, which provides partners with stats on organic pins; real-time spam detection, which catches anomalous user-login events and pin-related spam such as pin creation and repins; and migrating the backend for ads experiment analysis from Presto to Druid.
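To make the fixed-pattern translation concrete, here is a minimal sketch in Python of the idea (not Pinterest's actual code; the datasource, column names, and SQL pattern are hypothetical, while the native query shape is standard Druid):

```python
import json
import re

# Hypothetical fixed SQL pattern: daily event counts for one partner.
# A real relay like Archmage would keep a small catalog of such patterns.
PATTERN = re.compile(
    r"SELECT\s+COUNT\(\*\)\s+FROM\s+pins\s+WHERE\s+partner_id\s*=\s*'(\w+)'",
    re.IGNORECASE,
)

def sql_to_native(sql: str) -> dict:
    """Translate one fixed-pattern SQL query straight into a Druid
    native timeseries query, skipping SQL planning entirely."""
    match = PATTERN.match(sql.strip())
    if match is None:
        raise ValueError("query does not match a known fixed pattern")
    return {
        "queryType": "timeseries",
        "dataSource": "pins",                    # hypothetical datasource
        "granularity": "day",
        "intervals": ["2020-01-01/2020-02-01"],  # would come from the request
        "filter": {
            "type": "selector",
            "dimension": "partner_id",
            "value": match.group(1),
        },
        "aggregations": [{"type": "count", "name": "count"}],
    }

print(json.dumps(sql_to_native(
    "SELECT COUNT(*) FROM pins WHERE partner_id = 'p123'"), indent=2))
```

The payoff is that a query matching a known pattern goes straight to the brokers as native JSON, bypassing SQL planning on every request.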
Why data warehouses cannot support hot analytics – Imply
Check out the full webinar: https://imply.io/videos/why-data-warehouses-cannot-support-hot-analytics
Today’s data warehouses - whether traditional, specialized or cloud-based - are good at supporting cold analytics, such as reporting, where query times can take minutes. But they cannot cost-effectively support hot analytics—interactive ad hoc analytics usually performed by larger groups of users against batch or streaming data. Examples of hot analytics include clickstream analytics; service, network and application performance monitoring; and risk analytics.
Data warehouses struggle with hot analytics use cases because they are too slow, unable to scale, or too expensive. Learn how a new class of real-time data platforms overcome these limitations, and how companies implement a “temperature-based” approach to analytics.
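As a hedged illustration of what a temperature-based layout can mean in Druid specifically: coordinator retention rules can keep fresh data on a fast "hot" Historical tier and age older data onto cheaper nodes. A minimal Python sketch (the tier name, datasource, and host are hypothetical; the rules API is standard Druid):

```python
import json
import urllib.request

# Illustrative retention rules: the last week of data gets two replicas on
# a fast "hot" Historical tier; everything older keeps a single replica on
# the default tier. Tiers are just Historical processes started with
# druid.server.tier set to the matching name.
rules = [
    {"type": "loadByPeriod", "period": "P7D",
     "tieredReplicants": {"hot": 2}},
    {"type": "loadForever",
     "tieredReplicants": {"_default_tier": 1}},
]

req = urllib.request.Request(
    "http://localhost:8081/druid/coordinator/v1/rules/clickstream",
    data=json.dumps(rules).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```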
One of the most popular use cases for Apache Druid is building data applications. Data applications exist to put data into the hands of everyone in a business, and teams use them to make faster, better decisions. To fulfill this role, they need to support granular drill-down, because the devil is in the details, but they also need to be extremely fast, because otherwise people won't use them!
In this talk, Gian Merlino will cover:
*The unique technical challenges of powering data-driven applications
*What attributes of Druid make it a good platform for data applications
*Some real-world data applications powered by Druid
Apache Druid ingests many billions of events and enables instant queries on them in real time. But how? In this talk, each component of an Apache Druid cluster is described, along with the data and query optimisations at its core, to show what unlocks fresh, fast data for all.
Bio: Peter Marshall (https://linkedin.com/in/amillionbytes/) leads outreach and engineering across Europe for Imply (http://imply.io/), a company founded by the original developers of Apache Druid. He has 20 years of architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA (hons) degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.
Peter Marshall, Technology Evangelist at Imply
Abstract: Apache Druid® can revolutionise business decision-making with a view of the freshest of fresh data in web, mobile, desktop, and data science notebooks. In this talk, we look at key activities to integrate into Apache Druid POCs, discussing common hurdles and signposting to important information.
Bio: Peter Marshall (https://petermarshall.io) is an Apache Druid Technology Evangelist at Imply (http://imply.io/), a company founded by the original developers of Apache Druid. He has 20 years of architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experience – Imply
Ensuring a consistently great Netflix experience while continuously pushing innovative technology updates is no easy feat.
We'll look at how Netflix turns log streams into real-time metrics to provide visibility into how devices are performing in the field, and we'll share some of the lessons learned while optimizing Druid to handle our load.
Check out the webinar: https://imply.io/videos/whats-new-imply-3-3-apache-druid-0-18
The most recent Imply 3.3 release, based on Apache Druid 0.18, brings several major new features, including joins, query laning, and Clarity Alerts. These features increase flexibility during design, improve ingestion performance, and deliver sub-second response times to help accelerate data warehouse and data lake deployments and add real-time analytics in general.
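As a taste of the join support, here is a minimal Python sketch of issuing a Druid SQL join through the standard SQL HTTP endpoint (the datasource and lookup names are made up; the endpoint, request shape, and the lookup's "k"/"v" columns are standard Druid):

```python
import json
import urllib.request

# Hypothetical: enrich a fact datasource with a lookup table. Joins
# arrived in Apache Druid 0.18 for both SQL and native queries.
sql = """
SELECT c.v AS country_name, COUNT(*) AS events
FROM clickstream AS e
INNER JOIN lookup.country_names AS c ON e.country_code = c.k
GROUP BY c.v
ORDER BY events DESC
"""

req = urllib.request.Request(
    "http://localhost:8888/druid/v2/sql",          # Router's SQL endpoint
    data=json.dumps({"query": sql}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```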
How TrafficGuard uses Druid to Fight Ad Fraud and Bots – Imply
In this session, TrafficGuard’s Head of Data Science, Raigon Jolly, will discuss how TrafficGuard uses Druid and its partnership with Imply to:
- Provide granular reporting to clients in near-real time
- Monitor rules and concept drift
- Stay ahead of the moving target that is ad fraud
- Facilitate performance tuning and right-sizing infrastructure so our team can focus on innovation of our core product
Nicolas Trésegnie, Chief Architect at SuperAwesome
Abstract: SuperAwesome's mission is to make the internet safer for kids. At the core of SuperAwesome's analytics is Druid. In this talk, we walk through how we run Druid on spot instances. We explain the consequences in terms of cost and reliability, how we managed to build a reliable system despite the risks, and how you could do the same.
Nicolas works as Chief Architect at SuperAwesome, where he looks after the overall architecture of the systems and the infrastructure. He is all about automation and how technology can be used to achieve business goals. Nicolas studied Computer Science and Bioinformatics, and he is now pursuing an MBA at Imperial.
Splunk: Druid on Kubernetes with Druid-operator – Imply
We went through the journey of deploying Apache Druid clusters on Kubernetes (K8s) and created druid-operator (https://github.com/druid-io/druid-operator). This talk introduces the Druid Kubernetes operator, how to use it to deploy Druid clusters, and how it works under the hood. We will also share how we use this operator to deploy Druid clusters at Splunk.
Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. Druid is a complex, stateful distributed system: a Druid cluster consists of multiple services such as the Broker, Historical, Coordinator, Overlord, and MiddleManager, each deployed with multiple replicas. Deploying a single service on K8s requires creating several K8s resources via YAML files, and this multiplies across the many services inside a Druid cluster. Doing it for multiple Druid clusters (dev, staging, and production environments) makes it even more tedious and error-prone.
K8s enables the creation of application-specific extensions, called “Operators”, that combine Kubernetes knowledge with application-specific knowledge (such as Druid's) into a reusable K8s extension that makes deploying complex applications simple.
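For a flavor of what the operator buys you, here is a minimal Python sketch of creating a Druid custom resource with the official Kubernetes Python client. The spec fields shown are illustrative and heavily abbreviated, not a complete working cluster definition; consult the druid-operator examples for real specs:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

# Abbreviated, illustrative Druid custom resource: one resource describes
# the whole cluster, and the operator expands it into the many underlying
# K8s objects (StatefulSets/Deployments, Services, ConfigMaps, ...).
druid_cluster = {
    "apiVersion": "druid.apache.org/v1alpha1",
    "kind": "Druid",
    "metadata": {"name": "tiny-cluster", "namespace": "druid"},
    "spec": {
        "image": "apache/druid:0.18.0",
        # ... common runtime properties and mounts elided ...
        "nodes": {
            "brokers": {"nodeType": "broker", "replicas": 1},
            "historicals": {"nodeType": "historical", "replicas": 2},
            # ... coordinators, overlords, middlemanagers elided ...
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="druid.apache.org", version="v1alpha1",
    namespace="druid", plural="druids", body=druid_cluster,
)
```

One document per cluster replaces the pile of per-service resources described above.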
A Day in the Life of a Druid Implementor and Druid's Roadmap – Itai Yaffe
Benjamin Hopp (Solutions Architect) @ Imply:
Druid is an emerging standard in the data infrastructure world, designed for high-performance slice-and-dice analytics (“OLAP”-style) on large data sets.
This talk is for you if you’re interested in learning more about pushing Druid’s analytical performance to the limit.
Perhaps you’re already running Druid and are looking to speed up your deployment, or perhaps you aren’t familiar with Druid and are interested in learning the basics.
Some of the tips in this talk are Druid-specific, but many of them will apply to any operational analytics technology stack.
The most important contributor to a fast analytical setup is getting the data model right.
The talk will center around the various choices you can make to prepare your data for the best possible query performance.
We’ll look at some general best practices to model your data before ingestion such as OLAP dimensional modeling (called “roll-up” in Druid), data partitioning, and tips for choosing column types and indexes.
We’ll also look at how more can be less: often, storing copies of your data partitioned, sorted, or aggregated in different ways can speed up queries by reducing the amount of computation needed.
We’ll also look at Druid-specific optimizations that take advantage of approximations, where you can trade accuracy for performance and reduced storage.
You’ll get introduced to Druid’s features for approximate counting, set operations, ranking, quantiles, and more (sketched just below).
And we will finish with the latest and greatest Druid news, including details about the latest roadmap and releases.
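For a flavor of those approximate features, here is a minimal Python sketch of Druid SQL using approximate aggregators (the datasource and columns are hypothetical; the functions and SQL endpoint are standard Druid):

```python
import json
import urllib.request

# Hypothetical "events" datasource with user_id and latency_ms columns.
# The approximate functions trade a small, bounded error for large
# savings in memory, storage, and query time.
sql = """
SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour,
  APPROX_COUNT_DISTINCT(user_id) AS unique_users,
  APPROX_QUANTILE(latency_ms, 0.95) AS p95_latency_ms
FROM events
GROUP BY 1
ORDER BY 1
"""

req = urllib.request.Request(
    "http://localhost:8888/druid/v2/sql",
    data=json.dumps({"query": sql}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```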
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...) – Confluent
Do you know who is knocking on your network’s door? Have new regulations left you scratching your head over how to handle what is happening in your network? Network flow data helps answer many questions across a multitude of use cases, including network security, performance, capacity planning, routing, operational troubleshooting, and more. Today’s streaming data pipelines need tools that can scale to meet the demands of these service providers while continuing to provide responsive answers to difficult questions. In addition to stream processing, data needs to be stored in a redundant, operationally focused database to provide fast, reliable answers to critical questions. Kafka and Druid work together to create such a pipeline (the Druid side of such a pipeline is sketched after the topic list below).
In this talk Eric Graham and Rachel Pedreschi will discuss these pipelines and cover the following topics:
-Network flow use cases and why this data is important.
-Reference architectures from production systems at a major international Bank.
-Why Kafka and Druid and other OSS tools for Network Flows.
-A demo of one such system.
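To ground the Druid half of the pipeline, here is a minimal Python sketch of a Kafka ingestion supervisor spec for a flow-like stream (the topic, columns, and broker address are hypothetical; the supervisor spec shape and Overlord endpoint are standard Druid Kafka indexing):

```python
import json
import urllib.request

# Hypothetical netflow topic: Druid's Kafka indexing service pulls from
# Kafka continuously and publishes queryable segments as it goes.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "netflow",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["src_ip", "dst_ip", "protocol"]},
            "metricsSpec": [
                {"type": "count", "name": "flows"},
                {"type": "longSum", "name": "bytes", "fieldName": "bytes"},
            ],
            "granularitySpec": {"segmentGranularity": "hour",
                                "queryGranularity": "minute"},
        },
        "ioConfig": {
            "topic": "netflow-events",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        },
    },
}

req = urllib.request.Request(
    "http://localhost:8090/druid/indexer/v1/supervisor",  # Overlord API
    data=json.dumps(supervisor_spec).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```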
Gregorry Letribot – Druid at Criteo – NoSQL matters 2015 – NoSQLmatters
How do you monitor performance for one of your clients on a specific user segmentation when dealing with billions of events a day? With over 2 billion ads served and 230 TB of data processed a day, we at Criteo have a comprehensive need for an interactive analytics stack. And by interactive, we mean a querying system with dynamic filtering to drill down over multiple dimensions, answering within sub-second latency. This session will take you on our journey with Druid, "an open-source data store designed for real-time exploratory analytics on large data sets". We will explore Druid's architecture and notable concepts, how relevant they are for some use cases, and how it really performs.
Big Data made easy in the era of the Cloud – Demi Ben-Ari
A talk about the ease of using and handling Big Data technologies in the cloud, with Google Cloud Platform, Amazon Web Services, and the tools around them, showing the problems that arise and how to solve them with simple tools.
Infrastructure – a journey from datacentres to cloud – Equal Experts
What is infrastructure, and how do I avoid it forever? Where does the software that runs so much of the world, actually run? In this talk, we look at the terms "infrastructure" and "platform", what they currently mean and how they are built and managed; we rant about how bad a metaphor "The Cloud" is; and we speculate wildly about the future for our servers, our planet and ourselves
Interconnection Automation For All – Extended – MPS 2023 – Chris Grundemann
Matt "Grizz" Griswold and Chris Grundemann are both IX founders, internetworking experts, and automation proponents. With over 4 decades of combined experience they are now turning to sharing what they've learned about automating BGP and interconnection through a set of open source tools, along with support and services for those that need it.
This talk will share what they have learned, both from personal experience and through dozens of recent interviews with IX operators and interconnection engineers over the past several months, including common challenges, productive methodologies, and best practices.
The highlight of the talk will be announcing and describing two open source automation tools built to make interconnection and BGP easier for everyone. One is ixCtl, which is built to automate the most common and problematic tasks involved in running an internet exchange point, particularly configuring and managing secure route servers. The other is PeerCtl, which is built to automate the most common and problematic tasks involved in interconnecting an AS; from bilateral and multilateral peering to PNI and also transit connections.
Code for both (along with several other tools) is available on GitHub: https://github.com/fullctl.
Speaker: Chris Grundemann
Speaker: Matt Griswold
GOAI: GPU-Accelerated Data Science – DataSciCon 2017 – Joshua Patterson
The GPU Open Analytics Initiative (GOAI) is accelerating data science like never before. CPUs are not improving at the same rate as networking and storage, and by leveraging GPUs, data scientists can analyze more data than ever with less hardware. Learn more about how GPUs are accelerating data science (not just deep learning), and how to get started.
Designing a Distributed Cloud Database for Dummies – DataStax
Join Designing a Distributed Cloud Database for Dummies – the webinar. The webinar "stars" industry vet Patrick McFadin, best known among developers for his seven years with Apache Cassandra, where he held pivotal community roles. Register for the webinar today to learn: why you need distributed cloud databases, the technology you need to create the best user experience, the benefits of data autonomy, and much more.
View the recording: https://youtu.be/azC7lB0QU7E
To explore all DataStax webinars: https://www.datastax.com/resources/webinars
Powering Real-Time Big Data Analytics with a Next-Gen GPU Database – Kinetica
Freed from the constraints of storage, network, and memory, many big data analytics systems are now routinely revealing themselves to be compute bound. To compensate, big data analytics systems often sprawl horizontally (300-node Spark or NoSQL clusters are not unusual!) to bring in enough compute for the task at hand. High system complexity and crushing operational costs often result. As the world shifts from physical to virtual assets and methods of engagement, there is an increasing need for systems of intelligence to live alongside the more traditional systems of record and systems of analysis. New approaches to data processing are required to support the real-time processing needed to drive these systems of intelligence.
Join 451 Research and Kinetica to learn:
•An overview of the business and technical trends driving widespread interest in real-time analytics
•Why systems of analysis need to be transformed and augmented with systems of intelligence bringing new approaches to data processing
•How a new class of solution – a GPU-accelerated, scale-out, in-memory database – can bring you orders of magnitude more compute power, a significantly smaller hardware footprint, and unrivaled analytic capabilities.
•Hear how other companies in a variety of industries, such as financial services, entertainment, pharmaceutical, and oil and gas, benefit from augmenting their legacy systems with a modern analytics database.
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline – ScyllaDB
Numberly operates business-critical data pipelines and applications where failure and latency mean "lost money" in the best-case scenario. Most of their data pipelines and applications are deployed on Kubernetes and rely on Kafka and ScyllaDB, with Kafka acting as the message bus and ScyllaDB as the source of data for enrichment. The availability and latency of both systems are thus very important for the data pipelines. While most of Numberly's applications are developed in Python, they found a need to move high-performance applications to Rust to benefit from a lower-level programming language.
Learn the lessons from Numberly’s experience, including:
- Rationale in selecting a lower-level language
- Developing using a lower-level Rust code base
- Observability and analyzing latency impacts with Rust
- Tuning everything from Apache Avro to driver client settings
- How to build a mission-critical system combining Apache Kafka and ScyllaDB
- Feedback from half a year of Rust in production
Big Data in Action – Real-World Solution Showcase – Inside Analysis
The Briefing Room with Radiant Advisors and IBM
Live Webcast on February 25, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=53c9b7fa2000f98f5b236747e3602511
The power of Big Data depends heavily upon the context in which it's used, and most organizations are just beginning to figure out where, how and when to leverage it. One key to success is integration with existing information systems, many of which still rely on relational database technologies. Finding ways to blend these two worlds can help companies generate measurable business value in fairly short order.
Register for this episode of The Briefing Room to hear Analysts Lindy Ryan and John O'Brien as they explain how the combination of traditional Business Intelligence with Big Data Analytics can provide game-changing results in today's information economy. They'll be briefed by Eric Poulin and Paul Flach of Stream Integration who will share best practices for designing and implementing Big Data solutions. They'll discuss the components of IBM BigInsights, and explain how BigSheets can empower non-technical users who need to explore self-structured data.
Visit InsideAnalysis.com for more information.
EMFcamp2022 – What if apps logged into you, instead of you logging into apps? – Chris Swan
As a hacker and engineer I've been interested in identity and privacy since the dawn of the Internet and the online services it's enabled. For the past year I've been helping to build and open source The @ Platform, which inverts the usual model by giving everybody (and every thing) their own place to store data and control who (and what) has access to it. This talk will give an overview of the platform and its underlying protocol, and illustrate how it can be used to build privacy preserving apps and Internet connected things. It will also cover how the platform can be self hosted on devices like the Raspberry Pi, and how people can get involved in the open source community growing around it.
"Industrial Internet IoT bootcamp" meetup, 11-5-2015 hosted by GE Digital at HackerDojo. Discussing topics ranging from IoT architecture to connectivity and protocols, cyber security, data science and industrial UX design.
DevOps and data privacy do not need to oppose each other. Rather, they can complement one another.
The automation and audit trails that DevOps processes introduce to database development can ease compliance with data protection regulations and enable organizations to balance the need to deliver software faster with the requirement to protect and preserve personal data.
So how can the promise of releasing changes to the database faster and easier be balanced with the need to keep data safe and remain compliant with legislation?
Redgate’s Data Privacy and Protection Specialist Chris Unwin shows how the answer lies in going one step further than database DevOps and thinking about Compliant Database DevOps:
• Introduce standardized team-based development
• Automate deployments
• Monitor performance and availability
• Protect and preserve data
More than a year of extremely intensive Big Data development, with Hadoop, HBase, MapReduce, and ZooKeeper as key technologies. A new company with established infrastructure that is growing fast. Lots of experience in networking and distributed systems, but a completely new world of enterprise solutions. What tasks does this bring? What issues and traps? What lessons were learned, and what are the near-future tasks? How can an embedded developer enter this new world, and what advantages does he or she have? What challenges should you be ready to face?
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
Presented on April 14, 2018 at CarolinaCon (https://www.carolinacon.org). This talk will provide a quick overview of honeypots, an explanation of the cyber deception space, and the benefits of implementing deception as part of your cyber defense program. In addition, this talk will highlight the HoneyDB project, which enables anyone to get started operating deception sensors and collecting threat information. Finally, this presentation will describe how I built scalable honeypot sensor collection, employing a "Frankenstein Cloud Architecture", for minimal cost.
As Uber continues to grow, our big data systems need to grow in scalability, reliability, and performance to help Uber make business decisions, give user recommendations, and analyze experiments across all data sources. We put Presto into production in 2016. Presto now serves ~100K queries per day at Uber, and it has become a key component for interactive SQL queries on big data. In this presentation, we talk about our experiences and engineering efforts. We start with a general introduction to Hadoop infrastructure and analytics at Uber, followed by a brief introduction to Presto, the interactive SQL engine for big data. We then focus on how we built the new Parquet reader for Presto and its detailed techniques: columnar reads, lazy reads, and nested column pruning. We will show the performance improvements and Uber's use cases. Finally, we share our ongoing plan and future work for big data analytics at Uber.
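The wins from columnar, lazy, and nested-column-pruned reads are easy to demonstrate even outside Presto. Here is a minimal Python sketch using pyarrow (the file path and schema are hypothetical) of reading only the columns, and only the nested leaves, that a query actually touches:

```python
import pyarrow.parquet as pq

# Columnar read: only the requested columns are decoded from disk.
table = pq.read_table("trips.parquet", columns=["city", "fare"])

# Nested column pruning: dotted paths select individual struct leaves,
# so the rest of the "pickup" struct is never materialized.
nested = pq.read_table("trips.parquet", columns=["pickup.lat", "pickup.lng"])

# Lazy, batch-at-a-time reads let an engine stop early instead of
# materializing the whole file.
rows_seen = 0
for batch in pq.ParquetFile("trips.parquet").iter_batches(
        columns=["fare"], batch_size=65536):
    rows_seen += batch.num_rows
    if rows_seen >= 1_000_000:  # e.g., a LIMIT has been satisfied
        break
```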
Pivot 2.0 – The next generation visualization tool for your streaming data – Imply
We have rearchitected Pivot from the ground up for enhanced dimensional analysis while ensuring that it is even faster, if that was even possible.
Pivot 2.0 has plenty of new ways for you to visualize your data so that you can figure out the complex relationships in your data, plus enhanced comparative analysis to quickly gain insight.
In this webinar, we will walk you through the exciting new features that are coming soon to Pivot.
Zeotap: Data Modeling in Druid for Non-temporal and Nested Data – Imply
Druid has been the production workhorse at Zeotap for the past 2+ years, powering core audience planning across our Connect and Targeting products. Though Druid is best suited to data with time as a dimension, since it partitions data on time first, we have used it to serve ML-powered enhanced insights and estimates of potential dataset sizes, supporting our core business case of audience planning. These are datasets without a timestamp (non-temporal), at high scale, and with nested dimensions (one common way to model non-temporal data is sketched after the outline below). We achieved millisecond-latency retrieval on top of them through nuanced data modelling. The core of the presentation is the data modelling journey behind these use cases, detailing the query access patterns. We also delve into the architecture, covering ingestion into the Druid sink and processing including ML, and finish by going over the production setup and configuration and the performance tunings applied. The presentation covers the following topics:
* Business case in Ad-Tech and Mar-Tech vertical
* Audience Planner Use Case 1 - Insights
-Lambda Architecture and data flow
-Deep dive on data model
-Takeaways
* Audience Planner Use Case 2 - Estimator
-Architecture and data flow
-Stratified sampling explained
-Data model to solve nested data - deep dive
-Takeaways
* Audience Planner Use Case 3 - Skew correction
-Skew correction model
-Query Access
-Data model in Druid to accommodate output from ML models
-Takeaways
* Production setup, config, and tunings
* Production operation experience takeaways
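One common trick for non-temporal data in Druid (a sketch of the general approach, not necessarily Zeotap's exact spec) is to stamp every row with the same constant timestamp so that time partitioning collapses to a single chunk. In a batch ingestion spec that can look like the following fragment; the datasource and dimensions are hypothetical, while timestampSpec's missingValue and the "all" segment granularity are standard Druid:

```python
# Fragment of a Druid batch ingestion dataSchema (illustrative). The input
# has no timestamp column, so "missingValue" stamps every row with the same
# constant instant; combined with segmentGranularity "all", the entire
# datasource lives in one time chunk and queries are driven purely by the
# (possibly nested) dimensions.
data_schema = {
    "dataSource": "audience_estimates",           # hypothetical
    "timestampSpec": {
        "column": "timestamp",                    # absent from the input...
        "missingValue": "2010-01-01T00:00:00Z",   # ...so this constant is used
    },
    "dimensionsSpec": {
        "dimensions": ["segment_id", "country", "platform"],
    },
    "granularitySpec": {
        "segmentGranularity": "all",              # one chunk for everything
        "rollup": True,
    },
}
```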
Nielsen: Casting the Spell – Druid in Practice – Imply
At Nielsen Identity, we leverage Druid to provide our customers with real-time analytics tools for various use cases, including in-flight analytics, reporting, and building target audiences. The common challenge across these use cases is counting distinct elements at scale, in real time. We've been using Druid to solve these problems for the past 4 years and have gained a lot of experience with it.
In this talk, we will share some of the best practices and tips we've gathered over the years (a count-distinct sketch follows the list), including:
*Data modeling
*Ingestion
*Retention and deletion
*Query optimization
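As a hedged sketch of the count-distinct approach (the datasource and columns are hypothetical; the functions are standard Druid SQL from the druid-datasketches extension), Theta sketches make distinct counts approximate but cheap, and support set operations such as intersections:

```python
import json
import urllib.request

# Approximate distinct count of users who appear in BOTH campaigns,
# something exact COUNT(DISTINCT ...) cannot do cheaply at scale.
sql = """
SELECT THETA_SKETCH_ESTIMATE(
  THETA_SKETCH_INTERSECT(
    DS_THETA(user_id) FILTER (WHERE campaign = 'spring_sale'),
    DS_THETA(user_id) FILTER (WHERE campaign = 'summer_sale')
  )
) AS users_in_both
FROM impressions
"""

req = urllib.request.Request(
    "http://localhost:8888/druid/v2/sql",
    data=json.dumps({"query": sql}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```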
Maximizing Apache Druid performance: Beyond the basics – Imply
Druid is a powerful real-time database, and part of that power is the level of control you get over cluster configuration, allowing you to get maximum performance for your specific data and query types.
In this talk, Gian Merlino, one of the original authors of Druid and CTO and co-founder of Imply, will walk you through some advanced techniques that can provide a multiplier to your Druid performance. Afterwards, he’ll take your questions about performance, or anything else Druid-related.
Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the C... – Imply
Target is one of the largest retailers in the United States, with brick-and-mortar stores in all 50 states and one of the most-visited ecommerce sites in the country. In addition to typical merchandising functions like assortment planning, pricing and inventory management, Target also operates a large supply chain, financial/banking operations and property management organizations. As a data-driven organization, we need a data analytics platform that can address the unique needs of each of these various business units, while scaling to hundreds of thousands of users and accommodating an ever-increasing amount of data.
In this talk we’ll cover why Target chose to create our own analytics platform and specifically how Druid makes this platform successful. We’ll cover how we utilize key features in Druid, such as union datasources, arbitrary granularities, real-time ingestion, complex aggregation expressions and lightning-fast query response to provide analytics to users at all levels of the organization. We’ll also cover how Druid’s speed and flexibility allow us to provide interactive analytics to front-line, edge-of-business consumers to address hundreds of unique use-cases across several business units.
As Twitch grew, both the amount of data we received and the number of employees interested in that data grew rapidly. In order to continue empowering decision making as we scaled, we turned to Druid and Imply to provide self-service analytics to both our technical and non-technical staff, allowing them to drill into high-level metrics in lieu of reading generated reports.
In this talk, learn how Twitch implemented a common analytics platform for the needs of many different teams supporting hundreds of users, thousands of queries, and ~5 billion events each day. This session will explain our Druid architecture in detail, including:
-The end-to-end architecture deployed on Amazon that includes Kinesis, RDS, S3, Druid, Pivot and Tableau
-How the data is brought together to deliver a unified view of live customer engagement and historical trends
-Operational best practices we learnt scaling Druid
-An example walk through using the platform
Apache Druid: Lightning Fast Analytics on Real-time and Historical Data (Atla...) – Imply
Talk abstract:
Users are demanding access to large, multi-petabyte, multi-dimension, real-time datasets to answer business critical questions. Providing a self-service interface that meets the performance expectations of these users can be challenging.
Enter Apache Druid: an open source analytics database powering real-time, ad hoc, lightning fast analytics. It is used for clickstream analytics, network telemetry, fraud detection, application monitoring and so much more by companies like Apple, Netflix, Twitter, and AirBnb. Druid can ingest millions of records per second and deliver sub-second response times on OLAP-style slice and dice queries.
In this talk, we will start with an overview of Apache Druid followed by a look at several examples of how Druid is being used in the real-world. We'll finish up with Q&A and some virtual networking.
Speaker Bio:
Mike McLaughlin is a senior field engineer at Imply. He helps customers run and optimize Apache Druid in production. He has 20 years of experience developing, architecting, and deploying software.
Matt Sarrel of Imply draws on his work benchmarking Apache Druid with the Star Schema Benchmark (SSB) and shows how you can performance test Druid with your workload. Virtual meetup of July 16, 2020.
Watch the video: https://www.youtube.com/watch?v=RbwMCy4GsIE
GraphRAG is All You Need? LLM & Knowledge Graph – Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Solutions Apricot) – Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
JMeter webinar – integration with InfluxDB and Grafana – RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
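For a feel of the wiring, here is a minimal Python sketch of querying the test metrics after a run, assuming JMeter's InfluxDB Backend Listener has been writing results into an InfluxDB 1.x database named "jmeter" (the database, measurement, and field names follow the listener's common defaults but are assumptions; adjust to your setup). Grafana panels chart essentially the same queries:

```python
from influxdb import InfluxDBClient  # pip install influxdb (InfluxDB 1.x client)

client = InfluxDBClient(host="localhost", port=8086, database="jmeter")

# Average response time per minute over the last hour, roughly what a
# Grafana panel on a JMeter dashboard would ask for.
result = client.query(
    'SELECT MEAN("avg") FROM "jmeter" '
    "WHERE time > now() - 1h GROUP BY time(1m) fill(none)"
)
for point in result.get_points():
    print(point)
```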
Transcript: Selling digital books in 2024: Insights from industry leaders - T... – BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Key Trends Shaping the Future of Infrastructure – Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality – Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Accelerate your Kubernetes clusters with Varnish Caching – Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Smart TV Buyer Insights Survey 2024 – 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Elevating Tactical DDD Patterns Through Object Calisthenics – Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
State of ICS and IoT Cyber Threat Landscape Report 2024 preview – Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio, using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
2. Who am I?
Gian Merlino
Committer & PMC chair at Apache Druid
Cofounder at Imply (we’re hiring!)
3.
Druid Summit 2020
Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
talks from...
netflix ✯ twitter ✯ ntt ✯ paypal ✯ cisco ✯ splunk ✯ central bank of turkey ✯ swisscom ✯ dbs ✯ nielsen ✯ lyft ✯ pinterest ✯ unity ✯ target ✯ expedia ✯ outbrain ✯ verizon ✯ confluent ✯ sk telecom ✯ game analytics
plus fundamental or advanced training.
November 2-4, 2020
San Francisco Waterfront Marriott
Early Bird Rates Available
druidsummit.org
6. Druid in the wild
100+ billion rows/day
1+ trillion rows, 1+ year retained
100s of servers
sub-second to few seconds query latency
mix of streaming and batch ingest
10. Thinking of real-time as “hot”
🔥
⏱ 0.1–3s query
🚰 fresh data
🏋♀ high concurrency
🚴 interactive workloads
11. Hot vs. cold
🔥
⏱ 0.1–3s query
🚰 fresh data
🏋♀ high concurrency
🚴 highly interactive
⚙
⏱ slow queries are ok
🚰 less fresh data is ok
🏋♀ low concurrency
🚴 reporting / planning
12. How about “warm”?
🍞
⏱ 5–30s query
🚰 less fresh data is ok
🏋♀ high concurrency
🚴 somewhat interactive
16. Towards Druid 1.0
◆ Coming together of many efforts
◆ Native batch ingestion
◆ New and improved query engines
◆ SQL support
◆ Stay tuned!
17. Stay in touch
@druidio
Join the community
(Mailing lists, Slack, meetups)
https://druid.apache.org/community/
Follow the Druid project on Twitter!
18. Time for questions
@gianmerlino
Thank you!
19.
Register Now for
Druid Summit
November 2-4, 2020
San Francisco, CA
druidsummit.org