
Indexes in geo-temporal data sets... How much is enough?

By CCRi's Chris Eichelberger: The query is slow. Ah, create an index! The query is fast. Now another query is slow... Quickly, the complexity of the query-planner, in addition to the volume of stored data, balloons. Is it worth it? That depends. It depends upon your data distribution, your per-field (and per-field-group) selectivities, your query planner fu, your query distribution, your support responsibilities... it depends on a lot of things. In this talk, we summarize our experience with GeoMesa, first as a geo-temporal data store, and then as a more general purpose data management platform; we cover the benefits and costs of adding exciting, new indexes every time someone's query is slow.

Published in: Data & Analytics

  1. Strictly (Ordered) Ballroom (AKA, "geo indexing and sufficiency")
      Chris Eichelberger, FOSS4G NA 2017
  2. part 1: the pain
      "You can raise welts like nobody else / as we dance to the Masochism Tango"
      with sincere apologies to the great Tom Lehrer
  3. searching for a NoSQL analog...
  4. to answer the hard questions... which comes first?
      ● Virginia
      ● Massachusetts
      this is the entire purpose of an index: given a data element, tell me the bin (disk page, tablet, ...) in which it will be found, if it exists
  5. which comes first?
  6. indexing properties
      DATA-SPECIFIC
      ● think: Japanese street addresses
      ● lot numbers depend on building order
      ● good when: cross-index joins are cheap (RDBMS)
      SPACE-SPECIFIC
      ● think: US street addresses
      ● lots are aligned to block ranges
      ● good when: cross-index joins are expensive (NoSQL)
  7. [no text on this slide]
  8. what does an SFC (space-filling curve) look like, do?
  9. Z-order curve, 4 bits (2 per dimension), 16 cells
  10. Z-order curve, 6 bits (3 per dimension), 64 cells
  11. Z-order curve progression
  12. SFCurve... a LocationTech project is born
  13. but have you tried the...
      API
      with the help of
  14. this solution would become FOSS
      with the help of
  15. life is good
      ● we have AN INDEX
      ● we can ingest geo-temporal data
      ● we can query with geometric bounds and a time span
  16. part 2: the pain
      "Blacken my eye, set fire to my tie / as we dance to the Masochism Tango"
  17. real data are often non-uniformly distributed
  18. a real place
  19. how real data are often distributed
  20. how SFC indexes might be distributed (gridded)
  21. how real data tend to map to SFC indexes (bins)
  22. how to trade density for uniformity
  23. life is good
      ● we have a (POTENTIALLY SHARDED) index
      ● we can ingest geo-temporal data
      ● we can query with geometric bounds and a time span
      ● we don't suffer from hot-spotting
  24. part 3: the pain
      "My heart entreats, just hear those savage beats / and go put on your cleats / and come and trample me"
  25. more than geo-temporal attributes?
      https://www.britannica.com/technology/airplane/Types-of-aircraft
  26. add more indexes!
      ● add an ATTRIBUTE index
        ○ tied to the SimpleFeatureType (in user data)
        ○ each indexed attribute has all values recorded
        ○ contains a complete copy of every simple feature
      ● add a RECORD ID index
        ○ automatically created, populated
        ○ values are assumed to be unique to the SimpleFeature
        ○ contains a complete copy of every simple feature
  27. index selection
      ● simple cases
        ○ if you only filter on an indexed attribute, use the attribute index
        ○ if you only filter on a record ID, use the record-ID index
        ○ if you only filter on location and time, use the geo-temporal index
      ● all other cases
        ○ this is a geo-temporal store... use the geo-temporal index
  28. life is good
      ● we have some indexes
      ● we can ingest geo-temporal data
      ● we can query with geometric bounds and a time span
      ● we don't suffer from hot-spotting
      ● we have per-attribute indexes and a record-ID index
      ● we have the option of querying by any one attribute OR record ID or geo-time
  29. part 4: the pain
      "Your heart is hard as stone or mahogany / that's why I'm in such exquisite agony"
  30. pointedly...
      ● the world is not flat
      ● it (the world) contains non-point geometries
  31. handling non-point geometries
      Christian Böhm, Gerald Klump and Hans-Peter Kriegel. "XZ-Ordering: A Space-Filling Curve for Objects with Spatial Extension". 6th Int. Symposium on Large Spatial Databases (SSD), 1999, Hong Kong, China
  32. add more indexes!
      ● add an XZ3 index
        ○ indexes longitude, latitude, and time
        ○ contains a complete copy of every simple feature
      ● add an XZ2 index, just to be sure
        ○ indexes longitude and latitude alone
        ○ contains a complete copy of every simple feature
  33. life is good
      ● we have some indexes
      ● we can ingest geo-temporal data
      ● we can query with geometric bounds and a time span
      ● we don't suffer from hot-spotting
      ● we have per-attribute indexes and a record-ID index
      ● we have the option of querying by any one attribute
      ● we have non-duplicative indexes for non-point geometries, even those that cross the anti-meridian
  34. part 5: the pain
      "My soul is on fire; it's aflame with desire / which is why I perspire when we tango"
  35. an embarrassment of riches
      http://i.ebayimg.com/00/s/NTY2WDg0OQ==/z/U~IAAOSw-jhUBFhb/$_32.JPG?set_id=880000500F
  36. for once!
      ● what we need is NOT another index... exactly
  37. cost-based optimizer... oh, and summary statistics
      ● CBO
        ○ rewrite query using DNF... or CNF
        ○ estimate cost of using a particular index
          ■ at least whether a full-table scan is required
        ○ requires knowing something about cardinalities
        ○ ought to be able to explain why it made its choice
      ● statistics collection
        ○ responsible for providing some estimates of cardinalities (HyperLogLog, count-min sketch, etc.)
      this is really just a fancy version of the board game Guess Who?
  38. life is good
      ● we have some indexes
      ● we can ingest geo-temporal data
      ● we can query with geometric bounds and a time span
      ● we don't suffer from hot-spotting
      ● we have per-attribute indexes and a record-ID index
      ● we have the option of querying by any one attribute
      ● we have non-duplicative indexes for non-point geometries, even those that cross the anti-meridian
  39. part 6: the pain
      "You caught my nose in your left castanet, love / I can feel the pain yet, love / every time I hear drums"
  40. serious fun requires serious thought
  41. analytics, streaming, and cross-platform support
      Apache Arrow
  42. "who knew [geo data] could be so complicated?"
      ● there exist simpler solutions
        ○ D4M works very well, albeit not specifically for geo-time data
        ○ Elasticsearch has geographic, temporal indexes
      ● do you have a simpler problem?
        ○ do you need low-latency, high-velocity streaming data ingest, processing?
        ○ does even your streaming, in-memory geo-time data store require secondary indexing?
        ○ do your clients require access via OGC services, the GeoTools API?
        ○ must you support multiple flavors of NoSQL?
  43. if it doesn't hurt, you're doing it wrong
      "Fracture my spine / and swear that you're mine / as we dance to the Masochism Tango"
  44. for additional questions...
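
The Z-order curve slides (4 bits giving 16 cells, 6 bits giving 64 cells) can be illustrated with a small bit-interleaving sketch. This is not GeoMesa's or SFCurve's code; the function names `z_encode`, `z_decode`, and `cell_of` are hypothetical, and the choice to put x bits in the even positions and y bits in the odd positions is an assumption made only for this example.

```python
def z_encode(x: int, y: int, bits_per_dim: int) -> int:
    """Interleave the low bits of x and y into one Z-order (Morton) index."""
    z = 0
    for i in range(bits_per_dim):
        z |= ((x >> i) & 1) << (2 * i)      # assumption: x bits fill even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # assumption: y bits fill odd positions
    return z


def z_decode(z: int, bits_per_dim: int) -> tuple[int, int]:
    """Invert z_encode: split the interleaved bits back into (x, y)."""
    x = y = 0
    for i in range(bits_per_dim):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y


def cell_of(lon: float, lat: float, bits_per_dim: int) -> tuple[int, int]:
    """Map a longitude/latitude pair onto a uniform grid of 2^bits cells per side."""
    n = 1 << bits_per_dim
    col = min(int((lon + 180.0) / 360.0 * n), n - 1)  # clamp lon = 180 into last column
    row = min(int((lat + 90.0) / 180.0 * n), n - 1)   # clamp lat = 90 into last row
    return col, row


if __name__ == "__main__":
    # 2 bits per dimension: a 4x4 grid whose 16 cells receive the indexes 0..15.
    for row in range(3, -1, -1):
        print(" ".join(f"{z_encode(col, row, 2):2d}" for col in range(4)))
```

Each cell gets a distinct index, and cells that share high-order bits sit in the same quadrant, which is why a bounding-box query can usually be covered by a modest number of contiguous index ranges. The same uniform gridding is also what lets unevenly distributed real data pile up in a few bins, as the later slides on hot-spotting describe.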
