Introducing Rainbird, Twitter's high-volume distributed counting service for real-time analytics, built on Cassandra. This presentation looks at the motivation, design, and uses of Rainbird across Twitter.
Real Time Analytics: Algorithms and Systems (Arun Kejariwal)
In this tutorial, we present an in-depth overview of the streaming analytics landscape: applications, algorithms, and platforms. We walk through how the field has evolved over the last decade and then discuss the current challenges, namely the impact of the other three Vs (Volume, Variety, and Veracity) on Big Data streaming analytics.
Geospatial Advancements in Elasticsearch (Elasticsearch)
Geospatial data structures in Elasticsearch and Apache Lucene have been evolving and improving at a rapid pace. Learn how Elastic has simplified the geospatial indexing, search, and analytics use case by advancing the state of geo in Apache Lucene and Elasticsearch. From new spatial indexing strategies and performance improvements in Lucene to changes in geospatial field mappings in Elasticsearch, take a journey through the world of spatial and multi-dimensional indexing and search in the Elastic Stack.
ksqlDB is a stream processing SQL engine that allows stream processing on top of Apache Kafka. ksqlDB is based on Kafka Streams and provides capabilities for consuming messages from Kafka, analyzing these messages in near real time with a SQL-like language, and producing results back to a Kafka topic. That way, not a single line of Java code has to be written and you can reuse your SQL know-how. This lowers the bar for getting started with stream processing significantly.
ksqlDB offers powerful stream processing capabilities, such as joins, aggregations, time windows, and support for event time. In this talk I will present how ksqlDB integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using ksqlDB for the most part. This will be done in a live demo on a fictitious IoT example.
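To make the idea concrete, here is a minimal, Kafka-free sketch in plain Python of what a ksqlDB push query such as "SELECT sensor, COUNT(*) FROM readings GROUP BY sensor EMIT CHANGES;" conceptually does: maintain a running aggregate and emit an updated row for every input record. The stream, key, and value names below are invented for illustration.

```python
from collections import defaultdict

def continuous_count(stream):
    """Yield (key, updated_count) for every input record, like EMIT CHANGES."""
    counts = defaultdict(int)
    for key, _value in stream:
        counts[key] += 1
        yield key, counts[key]

# A toy "topic" of (sensor, temperature) records.
readings = [("sensor-1", 20.5), ("sensor-2", 19.0), ("sensor-1", 21.0)]
changelog = list(continuous_count(readings))
print(changelog)  # [('sensor-1', 1), ('sensor-2', 1), ('sensor-1', 2)]
```

Note how every input record produces an output row: a push query emits a continuously updated changelog rather than a one-shot result set.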
Kafka Streams State Stores Being Persistent (Confluent)
Being Persistent: A Look Into Kafka Streams State Stores, Neil Buesing, Principal Solutions Architect, Rill Data
Meetup link: https://www.meetup.com/TwinCities-Apache-Kafka/events/284002062/
Lightweight Transactions in Scylla versus Apache Cassandra (ScyllaDB)
Lightweight transactions (LWT) have been a long-anticipated feature for Scylla. Join Scylla VP of Product Tzach Livyatan and Software Team Lead Konstantin Osipov for a webinar introducing the Scylla implementation of LWT, a feature that brings strong consistency to our NoSQL database.
In this webinar we will cover the tradeoffs typically made between database consistency, availability and latency; how to use lightweight transactions in Scylla; the similarities and differences between Scylla’s Paxos implementation and Cassandra’s, and what it all means to users.
By attending this live webinar, you'll learn:
The advantages and disadvantages of various consistency options
Scylla lightweight transactions: syntax and semantics
A design and implementation overview, changes in Paxos
Performance comparisons with Apache Cassandra
Scylla’s future roadmap for LWT beyond Paxos
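At their core, lightweight transactions are conditional (compare-and-set) writes: the mutation is applied only if a condition on the current row holds, and the outcome is reported back to the client. Below is a toy, single-node Python model of "INSERT ... IF NOT EXISTS" semantics; the real feature coordinates this decision across replicas via Paxos, and all names here are illustrative.

```python
def insert_if_not_exists(table, key, row):
    """Apply the insert only if the key is absent.

    Returns (applied, current_row): applied is False when the key was
    already taken, and current_row then shows the existing (winning) row,
    mirroring the [applied] column a CQL LWT returns.
    """
    if key in table:
        return False, table[key]
    table[key] = row
    return True, row

users = {}
print(insert_if_not_exists(users, "alice", {"email": "a@example.com"}))
# (True, {'email': 'a@example.com'})
print(insert_if_not_exists(users, "alice", {"email": "b@example.com"}))
# (False, {'email': 'a@example.com'})
```

The second call loses the race and the existing row is returned unchanged, which is exactly the strong-consistency guarantee that plain last-write-wins inserts cannot give.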
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Along with the Hive Metastore, these table formats try to solve problems that have stood in traditional data lakes for a long time, with declared features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka (Kai Wähner)
Spoilt for Choice – Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka:
Apache Kafka is a de facto standard streaming data processing platform and is widely deployed as an event streaming platform. Part of Kafka is its stream processing API, "Kafka Streams". In addition, the Kafka ecosystem now offers KSQL, a declarative, SQL-like stream processing language that lets you define powerful stream-processing applications easily. What once took some moderately sophisticated Java code can now be done at the command line with a familiar and eminently approachable syntax.
This session discusses and demos the pros and cons of Kafka Streams and KSQL to understand when to use which stream processing alternative for continuous stream processing natively on Apache Kafka infrastructures. The end of the session compares the trade-offs of Kafka Streams and KSQL to separate stream processing frameworks such as Apache Flink or Spark Streaming.
Flink Forward San Francisco 2022.
This talk will take you on Apache Flink's long journey into the cloud-native era, starting all the way back when Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by David Moravek
A brief introduction to Apache Kafka and its usage as a platform for streaming data. It will introduce some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
What's the time? ...and why? (Matthias Sax, Confluent) Kafka Summit SF 2019
Data stream processing is built on the core concept of time. However, understanding time semantics and reasoning about time is not simple, especially if deterministic processing is expected. In this talk, we explain the difference between processing, ingestion, and event time and their impact on data stream processing. Furthermore, we explain how Kafka clusters and stream processing applications must be configured to achieve specific time semantics. Finally, we deep dive into the time semantics of the Kafka Streams DSL and KSQL operators, and explain in detail how the runtime handles time.
Apache Kafka offers many ways to handle time on the storage layer, i.e., the brokers, allowing users to build applications with different semantics. Time semantics in the processing layer, i.e., Kafka Streams and KSQL, are even richer and more powerful, but also more complicated. Hence, it is paramount for developers to understand the different time semantics and to know how to configure Kafka to achieve them. This talk enables developers to design applications with their desired time semantics, helps them reason about runtime behavior with regard to time, and allows them to understand processing/query results.
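The difference the talk describes can be made concrete with a small, Kafka-free Python sketch: the same stream of records aggregated into 10-second tumbling windows once by event time and once by processing time gives different results when a record arrives late. All timestamps and field names below are invented for illustration.

```python
from collections import defaultdict

def tumbling_window_counts(events, time_field, window_ms=10_000):
    """Count events per tumbling window, keyed by the chosen time attribute."""
    counts = defaultdict(int)
    for ev in events:
        window_start = (ev[time_field] // window_ms) * window_ms
        counts[window_start] += 1
    return dict(counts)

# One late-arriving record: produced at t=9s but only processed at t=21s.
events = [
    {"event_time": 1_000, "processing_time": 2_000},
    {"event_time": 9_000, "processing_time": 21_000},  # late arrival
    {"event_time": 12_000, "processing_time": 13_000},
]

by_event_time = tumbling_window_counts(events, "event_time")
by_processing_time = tumbling_window_counts(events, "processing_time")

print(by_event_time)       # {0: 2, 10000: 1}
print(by_processing_time)  # {0: 1, 20000: 1, 10000: 1}
```

Event-time windowing assigns the late record to the window it logically belongs to, which is what makes reprocessing deterministic; processing-time windowing depends on when the record happened to arrive.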
Flink Forward San Francisco 2022.
Resource elasticity is a frequently requested feature in Apache Flink: users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost-saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced; later releases added more improvements to make the feature production-ready. In this talk, we'll explain scenarios for deploying Reactive Mode to various environments to achieve autoscaling and resource elasticity. We'll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we'll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by Robert Metzger
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture (Kai Wähner)
Apache Kafka in conjunction with Apache Spark became the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable.
Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.
This session explores different architectures to build serverless Apache Kafka and Apache Spark multi-cloud architectures across regions and continents.
We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern Data Lakehouse.
Real-world use cases show the joint value and explore the benefit of the "delta lake" integration.
ksqlDB: A Stream-Relational Database System (Confluent)
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on GitHub and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB's architecture, which is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high availability. Furthermore, we explore ksqlDB's streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
Optimizing Delta/Parquet Data Lakes for Apache Spark (Databricks)
This talk will start by explaining the optimal file format, compression algorithm, and file size for plain vanilla Parquet data lakes. It discusses the small file problem and how you can compact the small files. Then we will talk about partitioning Parquet data lakes on disk and how to examine Spark physical plans when running queries on a partitioned lake.
We will discuss why it’s better to avoid PartitionFilters and directly grab partitions when querying partitioned lakes. We will explain why partitioned lakes tend to have a massive small file problem and why it’s hard to compact a partitioned lake. Then we’ll move on to Delta lakes and explain how they offer cool features on top of what’s available in Parquet. We’ll start with Delta 101 best practices and then move on to compacting with the OPTIMIZE command.
We’ll talk about creating a partitioned Delta lake and how OPTIMIZE works on a partitioned lake. Then we’ll talk about ZORDER indexes and how to incrementally update lakes with a ZORDER index. We’ll finish with a discussion on adding a ZORDER index to a partitioned Delta data lake.
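As a rough intuition for what compaction does, the sketch below greedily bin-packs many small files into fewer output files. This is plain Python with an illustrative 128 MB target size, not the actual Delta Lake OPTIMIZE algorithm or its defaults.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """First-fit-decreasing bin packing: group small files into
    output files whose total size stays at or below target_mb."""
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for b in bins:
            if sum(b) + size <= target_mb:
                b.append(size)
                break
        else:
            bins.append([size])  # no existing bin fits: open a new output file
    return bins

small_files = [5, 90, 40, 10, 70, 30, 15]  # sizes in MB, made up
plan = plan_compaction(small_files)
print(len(plan), "output files instead of", len(small_files))
# 3 output files instead of 7
```

Fewer, larger files mean fewer listing and open calls per query, which is the core of why compaction speeds up reads on both plain Parquet and Delta lakes.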
Tame the small files problem and optimize data layout for streaming ingestion... (Flink Forward)
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion into Iceberg tables can suffer from two problems: (1) a small files problem that can hurt read performance, and (2) poor data clustering that can make file pruning less effective. To address these two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning. This can reduce the number of concurrent files that every task writes and can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
by Gang Ye & Steven Wu
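The range-partition idea behind such a shuffling stage can be sketched in a few lines of plain Python: route records to writer subtasks by key range, so each subtask writes fewer, better-clustered files. The boundaries and keys below are illustrative, not the talk's actual implementation.

```python
import bisect

def range_partition(records, boundaries):
    """Assign each (key, payload) record to one of len(boundaries)+1
    writer subtasks based on sorted key-range boundaries."""
    writers = [[] for _ in range(len(boundaries) + 1)]
    for key, payload in records:
        writers[bisect.bisect_left(boundaries, key)].append((key, payload))
    return writers

boundaries = ["g", "p"]  # 3 writers: keys < "g", "g".."p", keys >= "p"
records = [("apple", 1), ("kiwi", 2), ("zebra", 3), ("grape", 4)]
writers = range_partition(records, boundaries)
print([len(w) for w in writers])  # [1, 2, 1]
```

Because each writer now sees a contiguous key range, the files it produces are naturally clustered, which is what makes min/max-based file pruning effective at read time.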
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli... (Flink Forward)
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by Mason Chen
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
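As a flavor of the DSL's style, here is a Kafka-free Python sketch of the classic word-count topology (flatMapValues, then groupBy, then count), modeled over an in-memory list of records instead of a Kafka topic. The comments map each step to the corresponding Kafka Streams DSL operation.

```python
from collections import defaultdict

def word_count(lines):
    """Toy word-count 'topology' over an in-memory stream of text records."""
    counts = defaultdict(int)
    for line in lines:                     # each record from the input stream
        for word in line.lower().split():  # flatMapValues: split into words
            counts[word] += 1              # groupBy(word).count(): update state
    return dict(counts)

print(word_count(["hello kafka", "hello streams"]))
# {'hello': 2, 'kafka': 1, 'streams': 1}
```

In real Kafka Streams, the counts dictionary would be a fault-tolerant state store backed by a changelog topic, and the result would itself be a stream of updates rather than a final dictionary.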
Tuning Apache Kafka Connectors for Flink (Flink Forward)
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But life is not all roses and rainbows, and in this session we’ll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We’ll inspect the params that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves, since they could cause more harm than benefit. Moreover, we’ll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we’ll discuss the Kafka sink. After browsing the available options, we'll dive deep into how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by Olena Babenko
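For reference, here is a sample of the kind of consumer and producer settings such a session typically touches, expressed as plain Python dicts. The keys are standard Apache Kafka client configs, but the values are illustrative starting points, not recommendations from this talk.

```python
# Consumer-side knobs for spotting stuck applications and tuning throughput.
consumer_overrides = {
    "max.poll.records": 500,          # cap records per poll to keep loops short
    "max.poll.interval.ms": 300_000,  # stuck processing triggers a rebalance
    "session.timeout.ms": 45_000,     # how quickly a dead consumer is noticed
    "fetch.min.bytes": 1,             # raise to trade latency for throughput
    "max.partition.fetch.bytes": 1_048_576,  # must grow for enormous records
}

# Producer-side knobs for batching, spikes, and durability.
producer_overrides = {
    "linger.ms": 5,        # small delay lets more records batch per request
    "batch.size": 16_384,  # larger batches absorb bursts of small updates
    "acks": "all",         # durability vs. latency trade-off
}
```

Each of these is exactly the kind of "knob" whose default is fine until an atypical workload (huge records, spikes, slow processing) makes it the difference between a healthy and a crashing pipeline.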
Hadoop, Pig, and Twitter (NoSQL East 2009), Kevin Weil
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.
• 3. We’ll cover... (1) What’s the point of analytics? (2) A treasure map for dealing with data (3) 4 Google Analytics tactics you can use right now
• 4. How do you know what works? In business, we build as fast as we can. New projects, new products, and new customers.
• 6. Build
• 7. Measure
• 8. Learn
• 9. If you want to win, focus on speed The more times you go through the cycle, the better your business will be.
• 10. Analytics is critical for measuring and learning Analytics makes the Measure and Learn steps MUCH faster.
• 11. TIME FOR THETreasure Map
• 12. Guiding Principle #1 AVOID VANITY Metrics
• 13. These are vanity metrics
• 14. No connection to your business Vanity metrics make us feel good when they improve but they don’t help us grow our business.
• 15. Web analytics is vanity metrics In Google Analytics, vanity metrics are everywhere.
• 16. Vanity metrics are a distraction
• 17. Difficult to take action If pageviews go up, what should you do?
• 18. Be careful with web analytics Improving traffic doesn’t necessarily improve your business.
• 19. Google Analytics Tactic #1: (1) Pageviews (2) Time on site (3) Pages/visit (4) Visits (5) Bounce rate
• 20. Guiding Principle #2 FOCUS ON Customers
• 21. How do we focus on our customers? Using customer analytics is the easiest way.
• 22. What if I don’t have customer analytics?
• 23. How can we tell what’s working?
• 24. This tells us where our customers come from
• 25. Google Analytics Tactic #2 How to Setup Google Analytics Goals
• 26. Google Analytics Tactic #3
• 27. Always ask yourself this question: I see all this data on my traffic. But where are my customers?
• 28. Guiding Principle #3 DO CUSTOMERS COME Back?
• 29. Who gets credit?
• 30. Conversions get split up
• 31. Only possible when tracking real people
• 32. This data includes returning customers
• 33. Stop guessing
• 34. Guiding Principle #4 TRACK YOUR Funnel
• 35. Define your funnel
• 36. Funnels track two metrics at each step
• 37. Google Analytics Tactic #4
• 38. How to set up Google Analytics Funnels
• 39. Most businesses don’t have a single path
• 40. A customer analytics funnel
• 41. Focus on your business Instead of forcing your business to match a Google Analytics funnel, use customer analytics funnels to match your business.
• 42. Guiding Principle #5 START WITH YOUR Benchmarks
• 43. Your metrics will be different
• 44. Compare yourself to yourself
• 45. 5 Guiding Principles to Find Treasure
• 46. 4 Google Analytics Tactics
• 47. WHAT’S THE Next Step?
• 48. Get customer analytics
• 49. Get reports on real people
• 50. Every point of engagement for a customer
• 51. Why is customer analytics so great?
• 52. Who offers customer analytics?
• 53. Other options? Pay an engineer $100,000+ in salary to build it for you.
Email Optimization: A discussion about how A/B testing generated $500 million...MarketingSherpa
In this webinar you’ll hear from Amelia Showalter, who headed the email and digital analytics teams for President Barack Obama’s 2012 presidential campaign.
She and Daniel Burstein, Director of Editorial Content, MECLABS, discuss how the campaign maintained a breakneck testing schedule for its massive email program. Take an inside look at the campaign headquarters, detailing the thrilling successes and informative failures Obama for America encountered at the cutting edge of digital politics.
In this session, you'll learn:
• Why subject lines mattered so much, and the back stories behind some of the most memorable ones
• Why the prettiest email isn’t always the best email
• How a free bumper sticker offer can pay for itself many times over
• The most important takeaway from all those tests and trials
Leading in the new world of work – Human Resonance
With so many models and approaches – from large firms to business schools to boutiques – it is hard for companies to architect the tailored yet integrated experiences they need.
In our “Human Resonance” approach we offer what is needed.
Next level practice instead of best practice!
In this new world of work, the barriers between work and life are eliminated. The “new world of work” is one that requires a dramatic change in strategies for leadership, talent, and human resources.
A new playbook for new times
Growth, volatility, change, and disruptive technology drive companies to shift their underlying business model. It is time to address this disruption, transforming leaders from a transaction-execution function into dominant partners who push innovative solutions to managers at all levels. Unless C-level managers embrace this transformation, they will struggle to solve problems at the pace the business demands.
Today’s challenges require a new playbook – one that makes leaders more agile, forward-thinking, bolder, and more assertive in their solutions. Our goal in this presentation is to give business leaders fresh ideas and perspectives to shape thinking about priorities for 2015. In a growing, changing economy, business challenges abound. Yet few can be addressed successfully without new approaches to solving the people challenges that accompany them – challenges that have grown in importance and complexity.
Using Single Keyword Ad Groups To Drive PPC PerformanceSam Owen
A look at how a simplified campaign structure can improve PPC results through top performer single keyword ad group campaigns. From Sam Owen at Hero Conf 2014.
The Social Consumer, study explores the factors that inform, impact and shape trust, loyalty and preferences of the digitally connected consumer.
In this study, we tested the belief that brands which can tap into emotions about and awareness of their values (human/social) are most likely to inspire positive action and loyalty from consumers.
Our view is that the super-connectedness of global communications has challenged how companies interact, engage and maintain relevance and trust with their key audiences and the public-at-large. As such, the reputation of a company is no longer defined by what they “report” or what they “say” they stand for. Instead, they are increasingly defined by the shared opinions and experiences of socially-connected consumers.
The findings reflect a number of surprising and validating insights, informed by surveys completed by 927 respondents mostly from the U.S. with about 10 percent from rest-of-world with great distribution and balance across age and gender.
Case study detailing Affinitive's work on Prevacid®24HR Panel, an online consumer trial panel developed around the launch of new OTC heartburn product.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
Description of some of the elements that go into creating a PostgreSQL-as-a-Service for organizations with many teams and a diverse ecosystem of applications.
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleSriram Krishnan
The Data Platform at Twitter supports engineers and data scientists running batch jobs on Hadoop clusters that are several 1000s of nodes, and real-time jobs on top of systems such as Storm. In this presentation, I discuss the overall Data Platform stack at Twitter. In particular, I talk about enabling real-time and batch analytics at scale with the help of Scalding, which is a Scala DSL for batch jobs using MapReduce, Summingbird, which is a framework for combined real-time and batch processing, and Tsar, which is a framework for real-time time-series aggregations.
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformScyllaDB
Grab is one of the most frequently used mobile platforms in Southeast Asia, providing the everyday services that matter most to consumers. Its users commute, eat, arrange shopping deliveries, and pay with one e-wallet. Grab relies on the combination of Apache Kafka and Scylla for a very critical use case -- instantaneously detecting fraudulent transactions that might occur across more than six million on-demand rides per day in eight countries across Southeast Asia. Doing this successfully requires many things to happen in near-real time.
Join our webinar for this fascinating real-time big data use case, and learn the steps Grab took to optimize their fraud detection systems using the Scylla NoSQL database along with Apache Kafka.
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...Amazon Web Services
Organizations processing mission critical high-volume data must be able to achieve high levels of throughput and durability in data processing workflows. In this session, we will learn how DataXu is using Amazon Kinesis, Amazon S3, and Amazon EMR for its patented approach to programmatic marketing. Every second, the DataXu Marketing Cloud processes over 1 Million ad requests and makes more than 40 billion decisions to select and bid on ad impressions that are most likely to convert. In addition to addressing the scalability and availability of the platform, we will explore Amazon Kinesis producer and consumer applications that support high levels of scalability and durability in mission-critical record processing.
TSAR (TimeSeries AggregatoR) Tech TalkAnirudh Todi
Twitter's 250 million users generate tens of billions of tweet views per day. Aggregating these events in real time - in a robust enough way to incorporate into our products - presents a massive scaling challenge. In this presentation I introduce TSAR (the TimeSeries AggregatoR), a robust, flexible, and scalable service for real-time event aggregation designed to solve this problem and a range of similar ones. I discuss how we built TSAR using Python and Scala from the ground up, almost entirely on open-source technologies (Storm, Summingbird, Kafka, Aurora, and others), and describe some of the challenges we faced in scaling it to process tens of billions of events per day.
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
The functional paradigm is not only applicable to programming. There is even more reason for using functional patterns at an architectural level. MapReduce is the most famous example of such a pattern. In this talk, we will go through a few other architectural patterns, and their corresponding stateful anti-patterns.
Introduction to Artificial Intelligence and Machine Learning services at AWS ...Amazon Web Services
AWS offers a family of intelligent services that provide cloud-native machine learning and deep learning technologies to address your different use cases and needs. For developers looking to add managed AI services to their applications, AWS brings natural language understanding (NLU) and text-to-speech (TTS) with Amazon Polly, visual search and image recognition with Amazon Rekognition, and developer-focused machine learning with Amazon Machine Learning. In this talk you will learn about these services and see demos of their capabilities
AWS Speaker: Denis V. Batalov, Solutions Architect - Amazon Web Services
Customer Speaker: Tom Wells - Synthesis Software Technologies
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPDaniel Zivkovic
Enterprises traditionally think of App Platforms as PCF (Pivotal Cloud Foundry) or Red Hat OpenShift. In reality, public Clouds have evolved into Application Platforms - especially when using Managed Services & Serverless.
• If you are an IT Executive under increased pressure to cut costs, see how better Technology Stack choices – not layoffs or pay cuts, can reduce IT costs + increase business agility (while avoiding vendor lock-in):
• If you are a Developer lost in the sea of the Cloud Computing choices, watch Ray Tsang (Java Champion from GCP) live-code, and you will walk away Cloud-Native :)
See how to stop cannibalization of IT by deploying your good ol' Java Spring Boot Apps directly to Google Cloud Platform - no Servers/PCF/OpenShift/Kubernetes to manage, nor to limit your creativity: https://youtu.be/2B0wWagE0dc
P.S. For more forward-looking Software Development topics, join ServerlessToronto.org Meetups, and if you have any questions about the Architectural Patterns discussed, reach out to me to chat.
Cloud Native Data Pipelines (in Eng & Japanese) - QCon TokyoSid Anand
Slides from "Cloud Native Data Pipelines" talk given @ QCon Tokyo 2016. The slides are in both English and Japanese. Thanks to Kiro Harada (https://jp.linkedin.com/in/haradakiro) for the translation.
(Presentation slides are in English.)
We are producing an ever-greater volume of data, and businesses need that information analyzed within seconds (or even milliseconds). AWS provides technologies to solve Big Data problems, but which services should you use, why, when, and how? In this session we cover the different phases of data analysis -- ingestion, storage, processing, and visualization -- and how to choose the right technology for each one.
Similar to Rainbird: Realtime Analytics at Twitter (Strata 2011) (20)
2. Agenda
‣ Why Real-time Analytics?
‣ Rainbird and Cassandra
‣ Production Uses at Twitter
‣ Open Source
4. My Background
‣ Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data
‣ Now revenue products!
5. Agenda
‣ Why Real-time Analytics?
‣ Rainbird and Cassandra
‣ Production Uses at Twitter
‣ Open Source
11. Real-time Reporting
‣ Discussion around ad-based revenue model
‣ Help shape the conversation in real-time with Promoted Tweets
‣ Realtime reporting ties it all together
12. Agenda
‣ Why Real-time Analytics?
‣ Rainbird and Cassandra
‣ Production Uses at Twitter
‣ Open Source
16. Requirements
‣ Extremely high write volume
‣ Needs to scale to 100,000s of WPS
‣ High read volume
‣ Needs to scale to 10,000s of RPS
‣ Horizontally scalable (reads, storage, etc)
‣ Needs to scale to 100+ TB
‣ Low latency
‣ Most reads <100 ms (esp. recent data)
18. Cassandra
‣ Pro: In-house expertise
‣ Pro: Open source Apache project
‣ Pro: Writes are extremely fast
‣ Pro: Horizontally scalable, low latency
‣ Pro: Other startup adoption (Digg, SimpleGeo)
‣ Con: It was really young (0.3a)
23. Cassandra
‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
‣ Say hi to @kelvin
‣ And @lenn0x
‣ A dude from Sweden began helping: @skr
‣ Now all at Twitter :)
25. Rainbird
‣ It counts things. Really quickly.
‣ Layers on top of the distributed counters patch, CASSANDRA-1072
‣ Relies on Zookeeper, Cassandra, Scribe, Thrift
‣ Written in Scala
26. Rainbird Design
‣ Aggregators buffer for 1m
‣ Intelligent flush to Cassandra
‣ Query servers read once written
‣ 1m is configurable
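A toy sketch of this write path (illustrative Python, not Rainbird's actual Scala code): aggregators pre-sum increments in memory for the buffer interval, then flush one combined increment per key, which is what keeps write volume to Cassandra manageable. In the real system the flush would be driven by a timer and the combined counts written via the distributed counters patch.

```python
from collections import defaultdict


class BufferingAggregator:
    """Pre-aggregate increments in memory, then flush combined counts."""

    def __init__(self):
        self.pending = defaultdict(int)

    def increment(self, key, count=1):
        # Many events for the same key collapse into one buffered entry.
        self.pending[key] += count

    def flush(self):
        # Called once per buffer interval (1 minute by default): emits one
        # combined increment per key and clears the buffer.
        out, self.pending = dict(self.pending), defaultdict(int)
        return out
```

The trade-off is a small window of data loss if an aggregator dies before flushing, in exchange for far fewer writes hitting the storage layer.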
33. Hierarchical Aggregation
‣ Say we’re counting Promoted Tweet impressions
‣ category = pti
‣ keys = [advertiser_id, campaign_id, tweet_id]
‣ count = 1
‣ Rainbird automatically increments the count for
‣ [advertiser_id, campaign_id, tweet_id]
‣ [advertiser_id, campaign_id]
‣ [advertiser_id]
‣ Means fast queries over each level of hierarchy
‣ Configurable in rainbird.conf, or dynamically via ZK
34. Hierarchical Aggregation
‣ Another example: tracking URL shortener tweets/clicks
‣ full URL = http://music.amazon.com/some_really_long_path
‣ keys = [com, amazon, music, full URL]
‣ count = 1
‣ Rainbird automatically increments the count for
‣ [com, amazon, music, full URL]
‣ [com, amazon, music]
‣ [com, amazon]
‣ [com]
‣ Means we can count clicks on full URLs
‣ And automatically aggregate over domains and subdomains!
35. Hierarchical Aggregation
‣ Another example: tracking URL shortener tweets/clicks
‣ full URL = http://music.amazon.com/some_really_long_path
‣ keys = [com, amazon, music, full URL]
‣ count = 1
‣ Rainbird automatically increments the count for
‣ [com, amazon, music, full URL]: how many people tweeted the full URL?
‣ [com, amazon, music]: how many people tweeted any music.amazon.com URL?
‣ [com, amazon]: how many people tweeted any amazon.com URL?
‣ [com]: how many people tweeted any .com URL?
‣ Means we can count clicks on full URLs
‣ And automatically aggregate over domains and subdomains!
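The hierarchical increments above can be sketched in a few lines (an illustrative Python model of the idea, not Rainbird's API; Rainbird itself is Scala on Cassandra). One event bumps the full key and every prefix of it, so any level of the hierarchy is a single lookup at read time:

```python
from collections import defaultdict


class HierarchicalCounter:
    """Toy model of hierarchical counter increments."""

    def __init__(self):
        self.counts = defaultdict(int)

    def increment(self, category, keys, count=1):
        # Bump (category, keys[:1]), (category, keys[:2]), ... up to the
        # full key, so each hierarchy level has its own running total.
        for n in range(1, len(keys) + 1):
            self.counts[(category, *keys[:n])] += count

    def get(self, category, keys):
        return self.counts[(category, *keys)]
```

For the URL example, incrementing `("urls", ["com", "amazon", "music", full_url])` makes per-domain and per-subdomain totals readable without any scan over the full URLs.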
39. Temporal Aggregation
‣ Rainbird also does (configurable) temporal aggregation
‣ Each count is kept minutely, but also denormalized hourly, daily, and all time
‣ Gives us quick counts at varying granularities with no large scans at read time
‣ Trading storage for latency
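The denormalization above can be modeled as one write per granularity (a hedged Python sketch with made-up names, not Rainbird's code): every event lands in its minutely, hourly, and daily bucket plus an all-time total, so a read at any granularity is a single lookup.

```python
from collections import defaultdict

# Bucket widths in seconds; "all time" is kept as a separate running total.
GRANULARITIES = {"minutely": 60, "hourly": 3600, "daily": 86400}


class TemporalCounter:
    def __init__(self):
        self.buckets = defaultdict(int)  # (granularity, bucket_start) -> count
        self.all_time = 0

    def increment(self, epoch_seconds, count=1):
        # One write per granularity: storage traded for read latency.
        for name, width in GRANULARITIES.items():
            bucket_start = epoch_seconds - epoch_seconds % width
            self.buckets[(name, bucket_start)] += count
        self.all_time += count

    def get(self, granularity, epoch_seconds):
        width = GRANULARITIES[granularity]
        return self.buckets[(granularity, epoch_seconds - epoch_seconds % width)]
```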
40. Multiple Formulas
‣ So far we have talked about sums
‣ Could also store counts (1 for each event)
‣ ... which gives us a mean
‣ And sums of squares (count * count for each event)
‣ ... which gives us a standard deviation
‣ And min/max as well
‣ Configure this per-category in rainbird.conf
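Keeping the sum, the number of events, and the sum of squares per key is enough to answer mean and standard-deviation queries without rescanning raw events. A sketch of the arithmetic (illustrative Python; the field names are made up, and the variance formula is the population form E[x²] − mean²):

```python
import math
from dataclasses import dataclass, field


@dataclass
class Stats:
    n: int = 0            # number of events
    total: float = 0.0    # sum of event counts
    total_sq: float = 0.0 # sum of squared event counts
    lo: float = math.inf
    hi: float = -math.inf

    def add(self, count):
        # Each incoming event contributes its count, its squared count,
        # and updates min/max; nothing else needs to be stored.
        self.n += 1
        self.total += count
        self.total_sq += count * count
        self.lo = min(self.lo, count)
        self.hi = max(self.hi, count)

    def mean(self):
        return self.total / self.n

    def stddev(self):
        # Population standard deviation from the running aggregates.
        return math.sqrt(self.total_sq / self.n - self.mean() ** 2)
```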
41. Rainbird
‣ Write 100,000s of events per second, each with hierarchical structure
‣ Query with minutely granularity over any level of the hierarchy, get back a time series
‣ Or query all time values
‣ Or query all time means, standard deviations
‣ Latency < 100ms
42. Agenda
‣ Why Real-time Analytics?
‣ Rainbird and Cassandra
‣ Production Uses at Twitter
‣ Open Source
43. Production Uses
‣ It turns out we need to count things all the time
‣ As soon as we had this service, we started finding all sorts of use cases for it
‣ Promoted Products
‣ Tweeted URLs, by domain/subdomain
‣ Per-user Tweet interactions (fav, RT, follow)
‣ Arbitrary terms in Tweets
‣ Clicks on t.co URLs
45. Production Uses
‣ Promoted Tweet Analytics
‣ Each different metric is part of the key hierarchy
46. Production Uses
‣ Promoted Tweet Analytics
‣ Uses the temporal aggregation to quickly show different levels of granularity
47. Production Uses
‣ Promoted Tweet Analytics
‣ Data can be historical, or from 60 seconds ago
48. Production Uses
‣ Internal Monitoring and Alerting
‣ We require operational reporting on all internal services
‣ Needs to be real-time, but also want longer-term aggregates
‣ Hierarchical, too: [stat, datacenter, service, machine]
49. Production Uses
‣ Tweet Button Counts
‣ Tweet Button counts are requested many, many times each day from across the web
‣ Uses the all time field
50. Agenda
‣ Why Real-time Analytics?
‣ Rainbird and Cassandra
‣ Production Uses at Twitter
‣ Open Source
53. Open Source?
‣ Yes! ... but not yet
‣ Relies on unreleased version of Cassandra
‣ ... but the counters patch is committed in trunk (0.8)
‣ ... also relies on some internal frameworks we need to open source
‣ It will happen
‣ See http://github.com/twitter for proof of how much Twitter open sources
58. Team
‣ John Corwin (@johnxorz)
‣ Adam Samet (@damnitsamet)
‣ Johan Oskarsson (@skr)
‣ Kelvin Kakugawa (@kelvin)
‣ Chris Goffinet (@lenn0x)
‣ Steve Jiang (@sjiang)
‣ Kevin Weil (@kevinweil)
59. If You Only Remember One Slide...
‣ Rainbird is a distributed, high-volume counting service built on top of Cassandra
‣ Write 100,000s of events per second, query with hierarchy and multiple time granularities, and get results back in <100 ms
‣ Used by Twitter for multiple products internally, including our Promoted Products, operational monitoring and Tweet Button
‣ Will be open sourced so the community can use and improve it!