© Rocana, Inc. All Rights Reserved. | 1
Eric Sammer – CTO and co-founder, @esammer
DataEngConf 2016
High cardinality time series search
A new level of scale
© Rocana, Inc. All Rights Reserved. | 2
Context
• We build a system for large-scale, real-time collection, processing, and
analysis of event-oriented machine data
• On-prem or in the cloud, but not SaaS
• Supportability is a big deal for us
• Predictability of performance, including under failures
• Ease of configuration and operation
• Behavior in wacky environments
• All of our decisions are informed by this - YMMV
© Rocana, Inc. All Rights Reserved. | 3
What I mean by “scale”
• Typical: 10s of TB of new data per day
• Average event size ~200-500 bytes
• 20TB per day
• @200 bytes = 1.2M events / second, ~109.9B events / day, 40.1T events / year
• @500 bytes = 509K events / second, ~43.9B events / day, 16T events / year (a quick back-of-the-envelope check follows this list)
• Retaining years online for query
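As a sanity check on those rates, here is a minimal back-of-the-envelope sketch. It assumes "20TB per day" is measured in tebibytes (2^40 bytes), which is how the figures above work out; the slide rounds slightly differently.

```java
// Back-of-the-envelope check of the ingest rates above.
// Assumption: "20TB per day" means 20 TiB (2^40 bytes) per day.
public class ScaleMath {
    public static void main(String[] args) {
        double bytesPerDay = 20.0 * Math.pow(2, 40); // 20 TiB/day
        for (int eventSize : new int[] {200, 500}) {
            double eventsPerDay = bytesPerDay / eventSize;
            System.out.printf(
                "@%d bytes: %.2fM events/sec, %.1fB events/day, %.1fT events/year%n",
                eventSize,
                eventsPerDay / 86_400 / 1e6,  // events per second, in millions
                eventsPerDay / 1e9,           // events per day, in billions
                eventsPerDay * 365 / 1e12);   // events per year, in trillions
        }
    }
}
```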
© Rocana, Inc. All Rights Reserved. | 4
General purpose search – the good parts
• We originally built against SolrCloud (but most of this goes for Elasticsearch too)
• Amazing feature set for general purpose search
• Good support for moderate scale
• Excellent at
• Content search – news sites, document repositories
• Finite size datasets – product catalogs, job postings, things you prune
• Low(er) cardinality datasets that (mostly) fit in memory
© Rocana, Inc. All Rights Reserved. | 5
Problems with general purpose search systems
• Fixed shard allocation models – always N partitions
• Multi-level and semantic partitioning is painful without building your own
macro query planner
• All shards open all the time; poor resource control for high retention
• APIs are record-at-a-time focused for NRT indexing; poor ingest
performance (aka: please stop making everything REST!)
• Ingest concurrency is wonky
• High write amplification on data we know won’t change
• Other smaller stuff…
© Rocana, Inc. All Rights Reserved. | 6
“Well actually…”
Plenty of ways to push general purpose systems
(We tried many of them)
• Using multiple collections as partitions, macro query planning
• Running multiple JVMs per node for better utilization
• Pushing historical searches into another system
• Building weirdo caches of things
At some point the cost of hacking outweighed the cost of building
© Rocana, Inc. All Rights Reserved. | 7
Warning!
• This is not a condemnation of general purpose search systems!
• Unless the sky is falling, use one of those systems
© Rocana, Inc. All Rights Reserved. | 8
We built a thing: Rocana Search
High cardinality, low latency, parallel search system for time-oriented events
© Rocana, Inc. All Rights Reserved. | 9
Features of Rocana Search
• Fully parallelized ingest and query, built for large clusters
• Every node is an indexer, query coordinator, and executor
• Optimized for high cardinality time-oriented event data
• Built to keep all data online and queryable without wasting resources for
infrequently used data
• Fully durable, resistant to node failures
• Operationally friendly: online ops, predictable resource usage and
performance
• Uses battle tested open source components (Kafka, Lucene, HDFS, ZK)
© Rocana, Inc. All Rights Reserved. | 10
Major differences
• Storage and partition model looks more like range-partitioned tables in
databases; new partitions easily added, old ones dropped, support for
multi-field partitioning, allows for fine grained resource management
• Partitions subdivided into slices for parallel writes
• Query engine aggressively prunes partitions by analyzing predicates (sketched below)
• Ingestion path is Kafka, built for extremely high throughput of small
events
What we know about our data allows us to optimize
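As a rough illustration of predicate-driven pruning (a sketch only, not Rocana Search's actual code), a query's time range can be intersected with each partition's bounds so that only overlapping partitions ever produce work:

```java
// Hypothetical sketch of time-based partition pruning. Names and types are
// illustrative stand-ins, not Rocana Search's actual API.
import java.time.Instant;
import java.util.List;

record TimePartition(String id, Instant start, Instant end) {}

class PartitionPruner {
    // Keep only partitions whose [start, end) range overlaps [queryStart, queryEnd).
    static List<TimePartition> prune(List<TimePartition> all,
                                     Instant queryStart, Instant queryEnd) {
        return all.stream()
                .filter(p -> p.start().isBefore(queryEnd) && p.end().isAfter(queryStart))
                .toList();
    }
}
```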
© Rocana, Inc. All Rights Reserved. | 11
Architecture
(A single node)
[Architecture diagram, single node: the RS process contains Index Management, Metadata, Coordinator, Executor, and Indexes components; it serves a Query Client, consumes from Kafka (fed by Data Producers), stores indexes in HDFS, and uses ZooKeeper (ZK).]
© Rocana, Inc. All Rights Reserved. | 12
Collections, partitions, and slices
• A search collection is split into partitions by a partition strategy (a minimal sketch follows this list)
• Think: “By year, month, day, hour”
• Partitioning is invisible to queries (e.g. `time:[x TO y] AND host:z` works normally)
• Partitions are divided into slices to support (mostly) lock-free parallel
writes
• Think: “This hour has 20 slices, each of which is independent for write”
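A minimal sketch of what such a partition strategy might look like, assuming an hourly scheme; the interface and class names are hypothetical, not the actual Rocana Search API:

```java
// Illustrative partition strategy: map an event timestamp to a partition key
// such as "2016/01/01/13". Names are hypothetical, not Rocana Search's API.
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

interface PartitionStrategy {
    String partitionFor(long eventTimestampMillis);
}

class HourlyPartitionStrategy implements PartitionStrategy {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    @Override
    public String partitionFor(long eventTimestampMillis) {
        // e.g. an event at 2016-01-01T13:42Z lands in partition "2016/01/01/13"
        return FMT.format(Instant.ofEpochMilli(eventTimestampMillis));
    }
}
```

A daily strategy would simply use a "yyyy/MM/dd" pattern, yielding partition keys like the “2016/01/01” examples on the next slides.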
© Rocana, Inc. All Rights Reserved. | 13
Collections, partitions, and slices
[Diagram: collection “events” contains partitions “2016/01/01” and “2016/01/02”; each partition contains Slice 0, Slice 1, Slice 2, … Slice N.]
© Rocana, Inc. All Rights Reserved. | 14
From events to partitions to slices
[Diagram: Kafka topic “events” with partitions KP 0 and KP 1 carrying Event 1 (2016/01/01), Event 2 (2016/01/01), and Event 3 (2016/01/02). E1 and E2 are indexed into slices of RS partition “2016/01/01”; E3 goes to RS partition “2016/01/02”. Each partition shows Slice 0 and Slice 1.]
© Rocana, Inc. All Rights Reserved. | 15
Assigning slices to nodes
[Diagram: Kafka topic “events” with partitions KP 0, KP 1, KP 2, and KP 3. Node 1 owns slices S 0 and S 2 of partitions 2016/01/01 through 2016/01/04; Node 2 owns slices S 1 and S 3 of the same partitions.]
© Rocana, Inc. All Rights Reserved. | 16
Following the write path
• One of the search nodes is the exclusive owner of KP 0 and KP 1
• Consume a batch of events
• Use the partition strategy to figure out which RS partition each event belongs to
• Kafka messages carry the partition so we know the slice
• Event written to the proper partition/slice
• Eventually the indexes are committed
• If the partition or slice is new, the metadata service is informed (a simplified sketch of this loop follows)
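A simplified sketch of that consume/route/write loop, reusing the PartitionStrategy interface sketched earlier; SliceWriterFactory, SliceWriter, and MetadataService are hypothetical stand-ins, and the real implementation is more involved:

```java
// Simplified write-path loop: consume a batch from Kafka, route each event to
// an RS partition via the partition strategy, and derive the slice from the
// Kafka partition. Error handling and batching details are omitted.
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

class WritePath {
    interface SliceWriterFactory { SliceWriter get(String partition, int slice); void commitAll(); }
    interface SliceWriter { boolean isNew(); void write(byte[] event); }
    interface MetadataService { void registerSlice(String partition, int slice); }

    void run(KafkaConsumer<byte[], byte[]> consumer, PartitionStrategy strategy,
             SliceWriterFactory writers, MetadataService metadata) {
        while (true) {
            ConsumerRecords<byte[], byte[]> batch = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<byte[], byte[]> record : batch) {
                // Assumption: the event's time is carried as the Kafka record timestamp.
                String rsPartition = strategy.partitionFor(record.timestamp());
                int slice = record.partition();  // the Kafka partition determines the slice
                SliceWriter writer = writers.get(rsPartition, slice);
                if (writer.isNew()) {
                    metadata.registerSlice(rsPartition, slice);  // inform the metadata service
                }
                writer.write(record.value());    // index the event into that partition/slice
            }
            writers.commitAll();   // eventually the Lucene indexes are committed
            consumer.commitSync(); // then the consumed Kafka offsets are acknowledged
        }
    }
}
```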
© Rocana, Inc. All Rights Reserved. | 17
Query engine basics
• Queries submitted to coordinator via RPC
• Coordinator (smart) parses, plans, schedules and monitors fragments,
merges results, responds to client
• Fragments are submitted to executors for processing
• Executors (dumb) search exactly what they’re told, stream to coordinator
• One fragment is generated for every partition/slice that may contain data (see the fan-out sketch below)
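A rough sketch of that fan-out (illustrative names only; the real coordinator streams and merges results incrementally rather than collecting them like this):

```java
// Illustrative fragment fan-out: the coordinator builds one fragment per
// surviving partition/slice, submits them to executors, and merges the
// results. Executor is a hypothetical stand-in for the RPC layer.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

record Fragment(String partition, int slice, String query) {}

class Coordinator {
    interface Executor { CompletableFuture<List<String>> search(Fragment fragment); }

    List<String> execute(String query, List<String> prunedPartitions,
                         int slicesPerPartition, List<Executor> executors) {
        // One fragment per partition/slice that may contain matching data.
        List<Fragment> fragments = new ArrayList<>();
        for (String partition : prunedPartitions) {
            for (int slice = 0; slice < slicesPerPartition; slice++) {
                fragments.add(new Fragment(partition, slice, query));
            }
        }
        // Round-robin fragments over executors, then merge what they send back.
        List<CompletableFuture<List<String>>> pending = new ArrayList<>();
        for (int i = 0; i < fragments.size(); i++) {
            pending.add(executors.get(i % executors.size()).search(fragments.get(i)));
        }
        List<String> merged = new ArrayList<>();
        for (CompletableFuture<List<String>> future : pending) {
            merged.addAll(future.join());
        }
        return merged;
    }
}
```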
© Rocana, Inc. All Rights Reserved. | 18
Some implications
• Search processes are on the same nodes as the HDFS DataNode
• First replica of any event received by search from Kafka is written locally
• Result: Unless nodes fail, all reads are local (HDFS short circuit reads)
• Linux kernel page cache is useful here
• HDFS caching can be used
• Search has an off-heap block cache as well
• In case of failure, any search node can read any index
• HDFS overhead winds up being very small, and we still get the advantages
© Rocana, Inc. All Rights Reserved. | 19
Contrived query scenario
• 80 Kafka partitions (80 slices)
• Collection partitioned by day
• 80 nodes, 16 executor threads each
• Query: time:[2015-01-01 TO 2016-01-01] AND service:sshd
• 365 * 80 = 29200 fragments generated for the query (a lot!)
• 29200 / (80 * 16) = ~22 “waves” of fragments
• If each “wave” takes ~0.5 seconds, the query takes ~11 seconds (see the arithmetic sketch below)
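The same fan-out arithmetic, spelled out (it works out to ~22.8 waves and ~11.4 seconds, which the slide rounds to ~22 and ~11):

```java
// Fragment and "wave" arithmetic for the contrived scenario above.
public class QueryFanout {
    public static void main(String[] args) {
        int days = 365;            // collection partitioned by day, one year queried
        int slices = 80;           // 80 Kafka partitions => 80 slices per partition
        int nodes = 80;
        int threadsPerNode = 16;
        double waveSeconds = 0.5;  // assumed time per wave of fragments

        int fragments = days * slices;                    // 29,200 fragments
        int parallelism = nodes * threadsPerNode;         // 1,280 concurrent fragments
        double waves = (double) fragments / parallelism;  // ~22.8 waves
        System.out.printf("%d fragments, ~%.1f waves, ~%.1f seconds%n",
                fragments, waves, waves * waveSeconds);   // ~11.4 seconds
    }
}
```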
© Rocana, Inc. All Rights Reserved. | 20
More real, but a little outdated
• 24 AWS EC2 d2.2xlarge instances, instance storage
• Ingesting data at ~3 million events per minute (~50K events/sec)
• 24 Kafka partitions / RS slices
• Index size: 5.9 billion events
• Query: All events, facet by 3 fields
• No tuning (default config): ~10 seconds (with a silly bug)
• 10 concurrent instances of the same query: ~21 seconds total
• 50 concurrent instances: ~41 seconds
• We do much better today
© Rocana, Inc. All Rights Reserved. | 21
What we’ve really shown
In the context of search, scale means:
• High cardinality: Billions of events per day
• High speed ingest: Millions of events per second
• Not having to age data out of the collection
• Handling large, concurrent queries, while ingesting data
• Fully utilizing modern hardware
These things are very possible
© Rocana, Inc. All Rights Reserved. | 22
Next steps
• Read replicas
• Smarter partition elimination in complex queries
• Speculative execution of query fragments
• Additional metadata for index fields to improve storage efficiency
• Smarter cache management
• Better visibility into performance and health
• Strong consensus (e.g. Raft, multi-paxos) for metadata?
© Rocana, Inc. All Rights Reserved. | 23
Thank you!
Hopefully I still have time for
questions.
rocana.com
@esammer
esammer@rocana.com
(ask me for stickers)
The (amazing) core search team:
• Michael Peterson - @quux00
• Mark Tozzi - @not_napoleon
• Brad Cupit
• Joey Echeverria - @fwiffo

More Related Content

What's hot

Deploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean FellowsDeploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean Fellowsconfluent
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Exampleconfluent
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
 
Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...
Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...
Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...Amazon Web Services
 
Inside Kafka Streams—Monitoring Comcast’s Outside Plant
Inside Kafka Streams—Monitoring Comcast’s Outside Plant Inside Kafka Streams—Monitoring Comcast’s Outside Plant
Inside Kafka Streams—Monitoring Comcast’s Outside Plant confluent
 
Events and metrics the Lifeblood of Webops
Events and metrics the Lifeblood of WebopsEvents and metrics the Lifeblood of Webops
Events and metrics the Lifeblood of WebopsDatadog
 
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails? Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails? confluent
 
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...HostedbyConfluent
 
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013Amazon Web Services
 
Deploying Confluent Platform for Production
Deploying Confluent Platform for ProductionDeploying Confluent Platform for Production
Deploying Confluent Platform for Productionconfluent
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineDataWorks Summit
 
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...confluent
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasMonal Daxini
 
Ingesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah WhitacreIngesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah Whitacreconfluent
 
Kafka Summit SF 2017 - Fast Data in Supply Chain Planning
Kafka Summit SF 2017 - Fast Data in Supply Chain PlanningKafka Summit SF 2017 - Fast Data in Supply Chain Planning
Kafka Summit SF 2017 - Fast Data in Supply Chain Planningconfluent
 
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka StreamsKafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streamsconfluent
 
Honest performance testing with NDBench
Honest performance testing with NDBenchHonest performance testing with NDBench
Honest performance testing with NDBenchVinay Kumar Chella
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsMonal Daxini
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkVinoth Chandar
 

What's hot (20)

Deploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean FellowsDeploying Kafka at Dropbox, Mark Smith, Sean Fellows
Deploying Kafka at Dropbox, Mark Smith, Sean Fellows
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...
Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...
Monitoring, Hold the Infrastructure - Getting the Most out of AWS Lambda – Da...
 
Inside Kafka Streams—Monitoring Comcast’s Outside Plant
Inside Kafka Streams—Monitoring Comcast’s Outside Plant Inside Kafka Streams—Monitoring Comcast’s Outside Plant
Inside Kafka Streams—Monitoring Comcast’s Outside Plant
 
Events and metrics the Lifeblood of Webops
Events and metrics the Lifeblood of WebopsEvents and metrics the Lifeblood of Webops
Events and metrics the Lifeblood of Webops
 
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails? Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
Kafka Summit NYC 2017 - Apache Kafka in the Enterprise: What if it Fails?
 
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
 
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
 
Deploying Confluent Platform for Production
Deploying Confluent Platform for ProductionDeploying Confluent Platform for Production
Deploying Confluent Platform for Production
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paas
 
Ingesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah WhitacreIngesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah Whitacre
 
Kafka Summit SF 2017 - Fast Data in Supply Chain Planning
Kafka Summit SF 2017 - Fast Data in Supply Chain PlanningKafka Summit SF 2017 - Fast Data in Supply Chain Planning
Kafka Summit SF 2017 - Fast Data in Supply Chain Planning
 
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka StreamsKafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams
 
Honest performance testing with NDBench
Honest performance testing with NDBenchHonest performance testing with NDBench
Honest performance testing with NDBench
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
 

Viewers also liked

DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleHakka Labs
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkHakka Labs
 
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataDataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataHakka Labs
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...Hakka Labs
 
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringDataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringHakka Labs
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceHakka Labs
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInHakka Labs
 
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesHakka Labs
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityHakka Labs
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale Hakka Labs
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...Hakka Labs
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartHakka Labs
 
Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Hakka Labs
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQHakka Labs
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestHakka Labs
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresHakka Labs
 
DataEngConf: Apache Spark in Financial Modeling at BlackRock
DataEngConf: Apache Spark in Financial Modeling at BlackRock DataEngConf: Apache Spark in Financial Modeling at BlackRock
DataEngConf: Apache Spark in Financial Modeling at BlackRock Hakka Labs
 
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleDataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleHakka Labs
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity ResolutionBenjamin Bengfort
 

Viewers also liked (20)

DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scale
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
 
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataDataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...
 
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringDataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineering
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data Science
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
 
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with Ourselves
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at Instacart
 
Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data Structures
 
DataEngConf: Apache Spark in Financial Modeling at BlackRock
DataEngConf: Apache Spark in Financial Modeling at BlackRock DataEngConf: Apache Spark in Financial Modeling at BlackRock
DataEngConf: Apache Spark in Financial Modeling at BlackRock
 
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleDataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
 

Similar to DataEngConf SF16 - High cardinality time series search

Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis Ryft
 
Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaAmazon Web Services
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache ApexApache Apex
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchJoe Alex
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingApache Apex
 
Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Felicia Haggarty
 
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Emprovise
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkScrapinghub
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentationEdward Capriolo
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraJon Haddad
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsDataWorks Summit/Hadoop Summit
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexApache Apex
 
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement VMware Tanzu
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingApache Apex
 
Alexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraAlexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraPyData
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyPeter Clapham
 

Similar to DataEngConf SF16 - High cardinality time series search (20)

Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
 
Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis Supercharging Data Performance for Real-Time Data Analysis
Supercharging Data Performance for Real-Time Data Analysis
 
Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS Lambda
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 
Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015
 
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
 
Kafka & Hadoop in Rakuten
Kafka & Hadoop in RakutenKafka & Hadoop in Rakuten
Kafka & Hadoop in Rakuten
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling framework
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
 
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
 
Alexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraAlexander Sibiryakov- Frontera
Alexander Sibiryakov- Frontera
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 

More from Hakka Labs

DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...Hakka Labs
 
DataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris WigginsDataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris WigginsHakka Labs
 
DataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation EngineDataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation EngineHakka Labs
 
DataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde NastDataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde NastHakka Labs
 
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...Hakka Labs
 
DataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeedDataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeedHakka Labs
 
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...Hakka Labs
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataHakka Labs
 
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...Hakka Labs
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInHakka Labs
 

More from Hakka Labs (11)

DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
 
DataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris WigginsDataEngConf: Data Science at the New York Times by Chris Wiggins
DataEngConf: Data Science at the New York Times by Chris Wiggins
 
DataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation EngineDataEngConf: Building the Next New York Times Recommendation Engine
DataEngConf: Building the Next New York Times Recommendation Engine
 
DataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde NastDataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
DataEngConf: Measuring Impact with Data in a Distributed World at Conde Nast
 
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
DataEngConf: Talkographics: Using What Viewers Say Online to Measure TV and B...
 
DataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeedDataEngConf: The Science of Virality at BuzzFeed
DataEngConf: The Science of Virality at BuzzFeed
 
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
 
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
 

Recently uploaded

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

DataEngConf SF16 - High cardinality time series search

  • 1. © Rocana, Inc. All Rights Reserved. | 1 Eric Sammer – CTO and co-founder, @esammer DataEngConf 2016 High cardinality time series search A new level of scale
  • 2. © Rocana, Inc. All Rights Reserved. | 2 Context • We build a system for large scale realtime collection, processing, and analysis of event-oriented machine data • On prem or in the cloud, but not SaaS • Supportability is a big deal for us • Predictability of performance and under failures • Ease of configuration and operation • Behavior in wacky environments • All of our decisions are informed by this - YMMV
  • 3. © Rocana, Inc. All Rights Reserved. | 3 What I mean by “scale” • Typical: 10s of TB of new data per day • Average event size ~200-500 bytes • 20TB per day • @200 bytes = 1.2M events / second, ~109.9B events / day, 40.1T events / year • @500 bytes = 509K events / second, ~43.9B events / day, 16T events / year, • Retaining years online for query
  • 4. © Rocana, Inc. All Rights Reserved. | 4 General purpose search – the good parts • We originally built against Solr Cloud (but most of this goes for Elastic Seach too) • Amazing feature set for general purpose search • Good support for moderate scale • Excellent at • Content search – news sites, document repositories • Finite size datasets – product catalogs, job postings, things you prune • Low(er) cardinality datasets that (mostly) fit in memory
  • 5. © Rocana, Inc. All Rights Reserved. | 5 Problems with general purpose search systems • Fixed shard allocation models – always N partitions • Multi-level and semantic partitioning is painful without building your own macro query planner • All shards open all the time; poor resource control for high retention • APIs are record-at-a-time focused for NRT indexing; poor ingest performance (aka: please stop making everything REST!) • Ingest concurrency is wonky • High write amplification on data we know won’t change • Other smaller stuff…
  • 6. © Rocana, Inc. All Rights Reserved. | 6 “Well actually…” Plenty of ways to push general purpose systems (We tried many of them) • Using multiple collections as partitions, macro query planning • Running multiple JVMs per node for better utilization • Pushing historical searches into another system • Building weirdo caches of things At some point the cost of hacking outweighed the cost of building
  • 7. © Rocana, Inc. All Rights Reserved. | 7 Warning! • This is not a condemnation of general purpose search systems! • Unless the sky is falling, use one of those systems
  • 8. © Rocana, Inc. All Rights Reserved. | 8 We built a thing: Rocana Search High cardinality, low latency, parallel search system for time-oriented events
  • 9. © Rocana, Inc. All Rights Reserved. | 9 Features of Rocana Search • Fully parallelized ingest and query, built for large clusters • Every node is an indexer, query coordinator, and executor • Optimized for high cardinality time-oriented event data • Built to keep all data online and queryable without wasting resources for infrequently used data • Fully durable, resistant to node failures • Operationally friendly: online ops, predictable resource usage and performance • Uses battle tested open source components (Kafka, Lucene, HDFS, ZK)
  • 10. © Rocana, Inc. All Rights Reserved. | 10 Major differences • Storage and partition model looks more like range-partitioned tables in databases; new partitions easily added, old ones dropped, support for multi-field partitioning, allows for fine grained resource management • Partitions subdivided into slices for parallel writes • Query engine aggressively prunes partitions by analyzing predicates • Ingestion path is Kafka, built for extremely high throughput of small events What we know about our data allows us to optimize
  • 11. © Rocana, Inc. All Rights Reserved. | 11 Architecture (A single node) RS HDFS MetadataIndex Management Coordinator ExecutorIndexes Query Client Kafka Data Producers ZK
  • 12. © Rocana, Inc. All Rights Reserved. | 12 Collections, partitions, and slices • A search collection is split into partitions by a partition strategy • Think: “By year, month, day, hour” • Partitioning invisible to queries (e.g. `time:[x TO y] AND host:z` works normally) • Partitions are divided into slices to support (mostly) lock-free parallel writes • Think: “This hour has 20 slices, each of which is independent for write”
  • 13. © Rocana, Inc. All Rights Reserved. | 13 Collections, partitions, and slices Collection “events” Partition “2016/01/01” Slice 0 Slice 1 Slice 2 Slice N Partition “2016/01/02” Slice 0 Slice 1 Slice 2 Slice N
  • 14. © Rocana, Inc. All Rights Reserved. | 14 From events to partitions to slices Partition 2016/01/01 Slice 0 Slice 1 Topic events KP 0 KP 1 Event 1 2016/01/01 Event 2 2016/01/01 Event 3 2016/01/02 Partition 2016/01/02 Slice 0 Slice 1 E1 E2 E3
© Rocana, Inc. All Rights Reserved. | 15
Assigning slices to nodes
[Diagram: Kafka topic “events” with partitions KP 0–KP 3; node 1 owns KP 0 and KP 2 and therefore writes slices 0 and 2 of every partition (2016/01/01 through 2016/01/04); node 2 owns KP 1 and KP 3 and writes slices 1 and 3; see the ownership sketch below]
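The deck doesn’t say how Kafka-partition ownership is assigned to nodes, only that node 1 ends up with KP 0/KP 2 and node 2 with KP 1/KP 3. A simple modulo assignment reproduces that picture and is used here purely as an illustrative assumption.

```java
// Hypothetical sketch: ownership of a Kafka partition (and therefore of the matching
// slice in every RS partition) is spread across nodes. The real assignment mechanism
// is not shown in the deck; modulo assignment just reproduces the diagram.
public class SliceOwnership {
    static int ownerNode(int kafkaPartition, int numNodes) {
        return kafkaPartition % numNodes; // node index in [0, numNodes)
    }

    public static void main(String[] args) {
        int numNodes = 2;
        for (int kp = 0; kp < 4; kp++) {
            System.out.printf("Kafka partition %d -> node %d -> slice %d of every RS partition%n",
                    kp, ownerNode(kp, numNodes), kp);
        }
    }
}
```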
© Rocana, Inc. All Rights Reserved. | 16
Following the write path
• One of the search nodes is the exclusive owner of KP 0 and KP 1
• Consume a batch of events
• Use the partition strategy to figure out which RS partition each event belongs to
• Kafka messages carry the partition, so we know the slice
• Each event is written to the proper partition/slice (see the structural sketch below)
• Eventually the indexes are committed
• If the partition or slice is new, the metadata service is informed
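A structural sketch of that loop follows. Every type here (Event, PartitionStrategy, SliceWriter, MetadataService, SliceWriterFactory) is a stand-in invented for illustration, not a Rocana Search API; offset management, error handling, and the actual Lucene indexing are omitted.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Structural sketch of the write path described on the slide; all types are stand-ins.
public class WritePathSketch {

    // A consumed event: its timestamp, the Kafka partition it arrived on, and a payload.
    record Event(long timestampMillis, int kafkaPartition, String body) {}

    interface PartitionStrategy { String partitionFor(long timestampMillis); }

    // One writer per (RS partition, slice); commit() makes the index durable.
    interface SliceWriter { void write(Event e); void commit(); }

    interface MetadataService { void registerSlice(String rsPartition, int slice); }

    interface SliceWriterFactory { SliceWriter open(String rsPartition, int slice); }

    private final PartitionStrategy strategy;
    private final SliceWriterFactory writers;
    private final MetadataService metadata;
    private final Map<String, SliceWriter> openSlices = new ConcurrentHashMap<>();

    WritePathSketch(PartitionStrategy strategy, SliceWriterFactory writers, MetadataService metadata) {
        this.strategy = strategy;
        this.writers = writers;
        this.metadata = metadata;
    }

    // Called with each batch consumed from the Kafka partitions this node owns.
    void indexBatch(List<Event> batch) {
        for (Event e : batch) {
            String rsPartition = strategy.partitionFor(e.timestampMillis()); // which RS partition
            int slice = e.kafkaPartition();                                  // which slice
            SliceWriter writer = openSlices.computeIfAbsent(rsPartition + "/" + slice, key -> {
                metadata.registerSlice(rsPartition, slice); // new partition/slice -> tell metadata
                return writers.open(rsPartition, slice);
            });
            writer.write(e);
        }
        // Commits can be infrequent; the ingest pipeline (Kafka) acts as the recovery log.
        openSlices.values().forEach(SliceWriter::commit);
    }

    public static void main(String[] args) {
        PartitionStrategy daily = ts -> java.time.format.DateTimeFormatter.ofPattern("yyyy/MM/dd")
                .withZone(java.time.ZoneOffset.UTC).format(java.time.Instant.ofEpochMilli(ts));
        SliceWriterFactory inMemory = (p, s) -> new SliceWriter() {
            public void write(Event e) { System.out.println("write " + p + "/slice-" + s + ": " + e.body()); }
            public void commit()       { System.out.println("commit " + p + "/slice-" + s); }
        };
        MetadataService metadata = (p, s) -> System.out.println("metadata: new slice " + p + "/" + s);
        WritePathSketch node = new WritePathSketch(daily, inMemory, metadata);
        node.indexBatch(List.of(
                new Event(1451649600000L, 0, "E1"),   // 2016-01-01
                new Event(1451649600000L, 1, "E2"),   // 2016-01-01
                new Event(1451736000000L, 0, "E3"))); // 2016-01-02
    }
}
```

The design point the slide (and the speaker notes) make is that commits can be infrequent because Kafka is the recovery log: on restart, a node replays from the last writer checkpoint rather than relying on frequent index commits.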
© Rocana, Inc. All Rights Reserved. | 17
Query engine basics
• Queries are submitted to a coordinator via RPC
• Coordinator (smart): parses, plans, schedules and monitors fragments, merges results, responds to the client
• Fragments are submitted to executors for processing
• Executors (dumb): search exactly what they’re told, stream results to the coordinator
• A fragment is generated for every partition/slice that may contain data (see the planning sketch below)
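Here is a sketch of how a coordinator could turn a time-range predicate into fragments while pruning partitions that cannot match. The planner, the names, and the day-granularity assumption are illustrative only; the real query engine is not shown in the deck.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of fragment planning: prune partitions outside the query's
// time predicate, then emit one fragment per surviving (partition, slice) pair.
public class FragmentPlanner {

    record Fragment(LocalDate partition, int slice) {}

    static List<Fragment> plan(List<LocalDate> partitions, int slicesPerPartition,
                               LocalDate from, LocalDate to) {
        List<Fragment> fragments = new ArrayList<>();
        for (LocalDate day : partitions) {
            // Partition pruning: skip days that fall entirely outside time:[from TO to].
            if (day.isBefore(from) || day.isAfter(to)) continue;
            for (int slice = 0; slice < slicesPerPartition; slice++) {
                fragments.add(new Fragment(day, slice));
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        // A year of daily partitions, 80 slices each (matching the contrived scenario later in the deck).
        List<LocalDate> partitions = new ArrayList<>();
        for (LocalDate d = LocalDate.of(2015, 1, 1); d.isBefore(LocalDate.of(2016, 1, 1)); d = d.plusDays(1)) {
            partitions.add(d);
        }
        List<Fragment> fullYear = plan(partitions, 80, LocalDate.of(2015, 1, 1), LocalDate.of(2015, 12, 31));
        List<Fragment> oneWeek  = plan(partitions, 80, LocalDate.of(2015, 6, 1), LocalDate.of(2015, 6, 7));
        System.out.println("full year: " + fullYear.size() + " fragments"); // 365 * 80 = 29200
        System.out.println("one week:  " + oneWeek.size() + " fragments");  // 7 * 80 = 560
    }
}
```

The pruning step is why narrowing the time range shrinks the work roughly linearly: only the surviving partitions produce fragments for the executors.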
© Rocana, Inc. All Rights Reserved. | 18
Some implications
• Search processes run on the same nodes as the HDFS DataNode
• The first replica of any event received by search from Kafka is written locally
• Result: unless nodes fail, all reads are local (HDFS short-circuit reads)
• The Linux kernel page cache is useful here
• HDFS caching can be used
• Search has an off-heap block cache as well
• In case of failure, any search node can read any index
• HDFS overhead winds up being very little, and we still get the advantages
© Rocana, Inc. All Rights Reserved. | 19
Contrived query scenario
• 80 Kafka partitions (80 slices)
• Collection partitioned by day
• 80 nodes, 16 executor threads each
• Query: time:[2015-01-01 TO 2016-01-01] AND service:sshd
• 365 × 80 = 29,200 fragments generated for the query (a lot!)
• 29,200 / (80 × 16) ≈ 22–23 “waves” of fragments
• If each “wave” takes ~0.5 seconds, the query takes ~11 seconds (checked in the sketch below)
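A tiny sanity check of that arithmetic. The 0.5 s per wave is the slide’s assumption, not a measurement; rounding the wave count up gives 23 waves (≈11.5 s), consistent with the slide’s ~11 seconds.

```java
// Back-of-envelope check of the contrived scenario: fragments, "waves", and latency.
public class QueryEstimate {
    public static void main(String[] args) {
        int partitions = 365;           // daily partitions, one year of retention
        int slices = 80;                // 80 Kafka partitions -> 80 slices per partition
        int nodes = 80;
        int executorThreadsPerNode = 16;
        double secondsPerWave = 0.5;    // per-wave cost assumed on the slide

        int fragments = partitions * slices;                           // 29,200
        int parallelism = nodes * executorThreadsPerNode;              // 1,280 fragments in flight
        int waves = (int) Math.ceil((double) fragments / parallelism); // 23 waves
        System.out.printf("%d fragments, %d in flight, %d waves, ~%.1f s%n",
                fragments, parallelism, waves, waves * secondsPerWave);
    }
}
```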
© Rocana, Inc. All Rights Reserved. | 20
More real, but a little outdated
• 24 AWS EC2 d2.2xlarge nodes, instance storage
• Ingesting data at ~3 million events per minute (~50K events/sec)
• 24 Kafka partitions / RS slices
• Index size: 5.9 billion events
• Query: all events, faceted by 3 fields
• No tuning (default config): ~10 seconds (with a silly bug)
• 10 concurrent instances of the same query: ~21 seconds total
• 50 concurrent instances: ~41 seconds
• We do much better today
© Rocana, Inc. All Rights Reserved. | 21
What we’ve really shown
In the context of search, scale means:
• High cardinality: billions of events per day
• High speed ingest: millions of events per second
• Not having to age data out of the collection
• Handling large, concurrent queries while ingesting data
• Fully utilizing modern hardware
These things are very possible
© Rocana, Inc. All Rights Reserved. | 22
Next steps
• Read replicas
• Smarter partition elimination in complex queries
• Speculative execution of query fragments
• Additional metadata for index fields to improve storage efficiency
• Smarter cache management
• Better visibility into performance and health
• Strong consensus (e.g. Raft, multi-Paxos) for metadata?
© Rocana, Inc. All Rights Reserved. | 23
Thank you! Hopefully I still have time for questions.
rocana.com | @esammer | esammer@rocana.com (ask me for stickers)
The (amazing) core search team:
• Michael Peterson - @quux00
• Mark Tozzi - @not_napoleon
• Brad Cupit
• Joey Echeverria - @fwiffo

Editor's Notes

  1. YMMV Not necessarily true for you Enterprise software – shipping stuff to people Fine grained events – logs, user behavior, etc. For everything – solving the problem of “enterprise wide” ops, so it’s everything from everywhere from everyone for all time (until they run out of money for nodes). This isn’t condemnation of general purpose search engines as much as what we had to do for our domain
  3. It does most of what you want for most cases most of the time. They’ve solved some really hard problems. Content search (e.g. news sites, document repos), finite size datasets (e.g. product catalogs), low cardinality datasets that fit in memory. Not us.
  4. Flexible systems with a bevy of full text search features. Moderate and fixed document count: big by historical standards, small by ours. Design reflects these assumptions. Fixed sharding at index creation: partition events into N buckets. For long retention time-based systems, this isn’t how we think. Let’s keep it until it’s painful. Then we add boxes. When that’s painful, we prune. Not sure what that looks like. Repartitioning is not feasible at scale. Partition count should be dynamic. Multi-level partitioning is painful without building your own query layer; by range(time), then hash(region) or identity(region). All shards are open all the time. Implicit assumption that either you 1. have queries that touch the data evenly or 2. have infinite resources. Recent events are hotter than distant ones, but distant data still needs to be available for query. Poor cache control. Recent data should be in cache. Historical scans shouldn’t push recent data out of cache. APIs are extremely “single record” focused. REST with record-at-a-time is absolutely abysmal for high throughput systems. Batch indexing is not useful. No in between. Read replicas are expensive and homogeneous. Ideally we have 3 read replicas for the last N days and 1 for others. Replicas (for performance) should take up space in memory, but not on disk. Ingest concurrency tends to be wonky; whole lotta locking going on. Anecdotally, it’s difficult to get Solr Cloud to light up all cores on a box without running multiple JVMs; something is weird. We can get the benefits of NRT indexing speed with fewer writer checkpoints because our ingest pipeline acts as a reliable log. We recover from Kafka based on the last time the writer checkpointed, so we can checkpoint very infrequently if we want. We know our data doesn’t change, or changes very little, after a certain point, so we can optimize and freeze indexes, reducing write amplification from compactions.
  5. There are plenty of ways we could have pushed the general purpose systems, and we did. We layered our own partitioning and shard selection on top of Solr Cloud with time-based collection round robining. That got us pretty far, but not far enough. We were starting to do a lot of query rewriting and scheduling. Run multiple JVMs per box. Gross. Unsupportable. Push historical queries out of search to a system such as Spark. Build weird caches of frequent data sets. At some point, the cost of hacking outweighed the cost of building.