A talk about how the Big Data and geospatial processing worlds are merging to deliver better insights.
(The presentation with effects is available here: https://docs.google.com/presentation/d/1EniUHMrRR3vQaJp6q0qBdOyZxv62DcSv3-iZXpcfwOM/edit?usp=sharing)
Big Data Geopositioning, or It's bigger on the inside (Commit Conf 2018)
1. Big Data and geopositioning
or
it’s bigger on the inside
Jorge López-Malla Matute
Senior Data Engineer
2. 1. Presentation
2. What does Key Value mean and why does it matter so
much?
3. Why do we need Geopositioning analytics?
4. How can we merge these two worlds?
5. Q&A
Index
4. SKILLS
JORGE LÓPEZ-MALLA
@jorgelopezmalla
Big Data Architect, holder of Spark certification
number 13, from La Rioja and short-sighted.
After years of trying to solve
modern problems with traditional
technologies, I tried Big Data
and saw that it solved them!
5. What we do
Geoblink is the ultimate location
Intelligence solution that helps companies
of any size make strategic, location-related
decisions on an easy-to-use platform
6. COLLECTING
DATA
We combine our
client’s internal data
with external data and
Geoblink’s proprietary
location data
TRANSFORMING
DATA
We process and analyze
data using advanced
analytics (big data) and
artificial intelligence
techniques
PROVIDING
INSIGHTS
We present insights on a
user-friendly platform to
help companies make
powerful, data-driven
decisions
How we do it
7. What does “Key Value” mean
and why does it matter so
much?
8. ● Big Data was born in the early 2000s
● Data is no longer small enough to fit in a single commodity
machine
● Data grows exponentially
● Vertical scaling is both dangerous and expensive
A little bit of history
● Solutions?
13. Processing & Storing
● Choosing a proper key is critical not only in a storage system but
also in distributed processing frameworks
● Spark, probably the most important distributed processing
framework right now, is no exception
● The key matters in both streaming and batch processing
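A minimal sketch of why the key choice matters in distributed processing, assuming a simple hash partitioner. The record layout, the `partition_for` helper and the number of partitions are hypothetical and only illustrate the idea; this is not Spark's own partitioner.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner: the key alone decides where a record lands."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(records, key_fn, num_partitions):
    """Count how many records each partition receives for a given key choice."""
    sizes = Counter(partition_for(key_fn(r), num_partitions) for r in records)
    return [sizes.get(p, 0) for p in range(num_partitions)]


# Hypothetical records: (user_id, country). Nine out of ten are from "ES".
records = [(f"user-{i}", "ES" if i % 10 else "FR") for i in range(1000)]

# Keying by country sends almost everything to a single partition (skew).
by_country = partition_sizes(records, lambda r: r[1], 4)

# Keying by user id spreads the same records much more evenly.
by_user = partition_sizes(records, lambda r: r[0], 4)

print("by country:", by_country)
print("by user:   ", by_user)
```

With a low-cardinality key, one worker ends up with most of the data while the others stay idle, which is the kind of problem a well-chosen key avoids.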
15. The Five Ws are questions whose answers are considered basic in
information gathering or problem solving
● Who was involved?
● What happened?
● Why did that happen?
● When did it take place?
● Where did it take place?
The Five Ws
16. ● The digital society needs immediate reactions
● “Slow” responses are no longer useful
● Big Data allows us to answer 4 of the 5 W questions
● The geospatial problem is not just an enterprise problem
The where matters
20. ● Knowing both the problem to solve and the technology should be
enough
● Obtaining the proper key is the “key” in every Big Data project
● In geospatial projects it is fundamental to obtain the results
exactly where we want them
● With this in mind, we should find the key for each record of our
dataset. Easy … or not?
Merging worlds
23. ● Remember: we should assign a key to a value using as little
logic as possible
● All geospatial logic must be understandable by humans
● The intuitive approach is to assign each point to a known
geospatial area
The real problem
25. Intersection
● A single coordinate is not relevant by itself
● To assign each coordinate to a recognizable area we need both
geometries
● So we need to intersect the coordinates with the areas
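The intersection of a coordinate with an area can be sketched as a point-in-polygon test. The example below assumes plain ray casting on planar coordinates; the area names and polygons are hypothetical, and a real project would normally rely on a geometry library rather than on hand-written code like this.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_area(point, areas):
    """Return the name of the first area containing the point, or None."""
    for name, polygon in areas.items():
        if point_in_polygon(point[0], point[1], polygon):
            return name
    return None


# Hypothetical recognizable areas, given as lists of (x, y) vertices.
areas = {
    "district_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "district_b": [(4, 0), (8, 0), (8, 4), (4, 4)],
}

print(assign_area((1, 1), areas))
print(assign_area((5, 2), areas))
print(assign_area((9, 9), areas))
```

Note that every point is tested against every area until a match is found, so the cost grows with the number of points times the number of areas.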
27. Intersection
● The intersect operation has a high computational cost
● We need to do this operation only in cases where an
intersection is probable
● We need to find a key to reduce the operation cost
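One common way to run the exact test only when an intersection is probable is a bounding-box prefilter. The sketch below assumes planar coordinates and hypothetical areas; the `exact_tests` counter is there only to show how many expensive checks are actually executed.

```python
def bounding_box(polygon):
    """Smallest axis-aligned rectangle that contains the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point, box):
    """Cheap test: four comparisons, regardless of polygon complexity."""
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def point_in_polygon(point, polygon):
    """Exact (and comparatively expensive) ray-casting test."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def assign_with_prefilter(points, areas):
    """Assign each point to an area, running the exact test only on candidates."""
    boxes = {name: bounding_box(poly) for name, poly in areas.items()}
    exact_tests = 0
    result = {}
    for point in points:
        result[point] = None
        for name, poly in areas.items():
            if not in_box(point, boxes[name]):
                continue  # cheap rejection: an intersection is impossible
            exact_tests += 1
            if point_in_polygon(point, poly):
                result[point] = name
                break
    return result, exact_tests


# Hypothetical areas and points.
areas = {
    "triangle": [(0, 0), (4, 0), (0, 4)],
    "square": [(10, 10), (14, 10), (14, 14), (10, 14)],
}
points = [(1, 1), (3, 3), (12, 12), (50, 50)]

assignments, exact_tests = assign_with_prefilter(points, areas)
print(assignments)
print("exact tests run:", exact_tests)
```

The point (3, 3) passes the box check for the triangle but fails the exact test, which shows that the prefilter only discards impossible candidates and never replaces the real intersection.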
28. ● First of all, there is no silver bullet
● The “key” problem is worse in the Geospatial world
● Both storing and processing technologies have similar problems
● Geospatial indexes help a lot
Finding a proper Key
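As an illustration of how a geospatial index can act as a key, the sketch below maps each coordinate to a fixed-size grid cell and aggregates points per cell. A regular grid is a deliberate simplification chosen for this example (production indexes are usually geohashes, quadtrees or R-trees); the cell size and the sample coordinates are assumptions.

```python
import math
from collections import Counter


def grid_key(lon, lat, cell_size_deg=0.01):
    """Map a coordinate to the (column, row) of the grid cell that contains it."""
    return (math.floor(lon / cell_size_deg), math.floor(lat / cell_size_deg))


def count_by_cell(points, cell_size_deg=0.01):
    """Aggregate points by their grid key, as a keyed reduce would do."""
    return Counter(grid_key(lon, lat, cell_size_deg) for lon, lat in points)


# Hypothetical (longitude, latitude) points; the first two are a few metres apart.
points = [(-3.7038, 40.4168), (-3.7036, 40.4169), (-3.6883, 40.4530)]

counts = count_by_cell(points)
for cell, total in counts.items():
    print(cell, total)
```

Computing the key needs no geometry at all, which keeps the logic small, but cell borders rarely match the borders of real areas, so such a key is a good approximation rather than a perfect answer.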
30. ● Several geospatial technologies have been grouped by Eclipse
under LocationTech
● GeoSpark and Magellan are spatial modules for Spark
● Although we only talk about Spark, other processing engines
offer this functionality too
● We have only tested processing engines, but we have also
researched storage technologies
Big Data initiatives
32. ● Both Magellan and Geospark offer geospatial functionality
powered by Apache Spark
● Both allow us to use SparkSQL for Geospatial queries
● Both optimize the queries in Spark
● GeoSpark’s documentation is better than Magellan’s
Processing engines
34. ● Spatial joins allow us to assign several geometries to a
geometry
● Remember that intersect operations come at a high cost
● In most use cases you only want a 1:1 mapping
● You can use Broadcast variables!
Do you really need a join?
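A plain-Python sketch of the broadcast idea, with no Spark dependency: a small lookup table is handed to every partition, and each record is resolved locally with a dictionary access instead of a join. The grid cells, area names and partitions are hypothetical. In Spark the lookup would typically be shipped with `SparkContext.broadcast` and read through `.value` inside the map function.

```python
import math

CELL = 1.0  # hypothetical grid cell size


def cell_of(point):
    """Key of the grid cell that contains the point."""
    return (math.floor(point[0] / CELL), math.floor(point[1] / CELL))


# Small lookup (grid cell -> area name) that fits in memory on every worker.
cell_to_area = {(0, 0): "centre", (1, 0): "east", (0, 1): "north"}


def map_partition(partition, lookup):
    """Runs independently on each partition; needs no data from other partitions."""
    return [(point, lookup.get(cell_of(point))) for point in partition]


# Two hypothetical partitions of the large dataset.
partitions = [
    [(0.2, 0.3), (1.5, 0.1)],
    [(0.9, 1.7), (5.0, 5.0)],
]

assigned = [row for part in partitions for row in map_partition(part, cell_to_area)]
for point, area in assigned:
    print(point, area)
```

This only works for a 1:1 mapping with a lookup small enough to copy to every worker; when both sides are large, a real join is still needed.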
35. GeoMesa: Big Data storage
● GeoMesa is an open-source project that allows performing
geospatial operations against several data sources and
processing engines
● It has connectors for visualization tools (like GeoServer)
● So far, we have only tested GeoMesa with HBase, and only as a POC
39. Takeaways
● We really need to deliver insights at the proper location
● Big Data requires finding a suitable key for our problem
● When dealing with large amounts of data, we have to aggregate it
● Spatial indexes are adequate keys, but they are not perfect
● If you only need to assign one geometry to another, a spatial
join is not a good idea