2. Contents
What is Spatial Tajo?
Motive for Development
Why I chose Apache Tajo?
Plan for the implementation of the plug-in
Current status
◦ Parts implemented
◦ Parts not yet implemented
Conclusion
References
3. What is Spatial Tajo?
A plug-in to provide spatial queries for Tajo
In detail, it is a plug-in allowing the provision
and performance of queries about spatial
relations and spatial analysis for data sets
stored in the distributed data warehouse system.
◦ Providing spatial functions for spatial queries
◦ Supporting data types
◦ Supporting an index structure for spatial data sets
◦ Supporting raster data
Overall Architecture of Apache Tajo
Tajo Worker
Local
File-
System
HDFS
Amazon
S3
QueryMaster
Local Query Engine
StorageManager Spatial Tajo
Tajo Master
CatalogStore
Allocate
a query
Manage
metadata
Client
JDBC SQL Shell Web UI
4. Motive for development
The volume of the spatial data sets to be analyzed come near big data, and I’d like
to analyze this using SQL.
I'd like to use a system working without batch processing.
I'd like to use free software or free solution (Provided that I contribute my
experiences to communities).
5. Motive for development
Of course, there are good ones among the known software and solutions. But…
◦ Relational Database and DBMS
◦ Oracle Spatial and Graph (Oracle Database+Plug-in)
◦ MySQL DBMS
◦ PostGIS with PostgreSQL
◦ NoSQL
◦ Document-oriented database: MongoDB, CouchDB (Plug-in), RethinkDB
◦ HBase, Hive
◦ Cluster and Cloud
◦ GeoMesa, ESRI GIS Tools for Hadoop, SpatialHadoop (http://spatialhadoop.cs.umn.edu) (on Hadoop)
◦ CartoDB (on top of PostgreSQL, PostGIS and SaaS)
Conclusion: FOSS + Spatial → Apache Tajo + Spatial Plug-in!
6. Why I chose Apache Tajo?
Apache Tajo
◦ A robust big data relational and distributed data warehouse system for Apache Hadoop
◦ Designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL on large
-data sets stored on HDFS and other data sources.
Why did I choose Apache Tajo?: Features
◦ Motions of distributing and storing data are entrusted to Hadoop or Amazon S3
◦ It supports the insertion of data, but doesn’t support the update of the data
◦ Compatibility: Can query with ANSI/ISO SQL
◦ It is faster than processing using MapReduce and guarantees fault tolerance
◦ It is easy to build up and manage
◦ The user can implement and plug it in yourself if necessary
7. Plan for the implementation of the plug-in
Spatial functions for spatial queries
◦ Distances, Equals, Disjoints, Intersects and Touches,
Crosses, Overlaps, Contains, Lengths, Areas and Centroids
◦ Transforming functions for spatial types (like from_OOOO or to_OOOO)
Adding spatial data types
Enabling to run kNN queries
Supporting an index for spatial data
◦ R-tree, Quad-tree and KD-tree, etc.
8. Current status – Parts implemented
Most primary spatial functions
◦ Implementing most primary spatial functions using JTS
◦ Distances, Equals, Disjoints, Intersects, Touches, Crosses, Overlaps and Contains
Running kNN queries
◦ It can run kNN queries using the implemented spatial functions.
Indexing for spatial data
◦ R-tree indexing using Sort-Tile-Recursive (STR)
Local indexes
Global
Index
The process of reading data
using the index
1. Reading the global index
and finding search keys
2. Finding local indexes
corresponding to the
search keys,
3. Finding the search keys in
the local indexes
4. Directly reading the tuples
Wikipedia: R-tree - STR, SpatialHadoop Paper and Tajo Document
9. Current status – Parts not yet implemented
Adding spatial data types
◦ Parameters of spatial functions are inaccurate and inconvenient to use.
Spatial functions not yet implemented
◦ Lengths, areas and centroids
◦ Transform functions (e.g. from_OOO and to_OOO)
Optimizing functions and queries
◦ Optimizing spatial functions
◦ Optimizing kNN queries
Indexing spatial data
◦ Quad-tree (with GeoHash) and KD-tree
Modularization
◦ Currently, since it is not separated from Apache Tajo, it is impossible to install the plug-in.
10. Conclusion
What is Spatial Tajo?
Motive for Development
Why I chose Apache Tajo?
Plan for the implementation of the plug-in
Current status
◦ Parts implemented
◦ Parts not yet implemented
11. References
Apache Tajo
◦ Official Website, Source codes, User documentation
◦ Efficient In-situ processing of various storage types on Apache Tajo
◦ Apache Tajo Enters the SQL-on-Hadoop Space
◦ SQL-on-Hadoop: What does “100 times faster than Hive” actually mean?
◦ Setting up an Apache Tajo Cluster on Amazon EMR
PostgreSQL and PostGIS Document
SpatialHadoop
◦ Official Website, Source codes
◦ A demonstration of SpatialHadoop: an efficient mapreduce framework for spatial data
◦ Spatialhadoop: towards flexible and scalable spatial processing using MapReduce
12. Reference
Spatial Databases: With Application to GIS
Indexing
◦ STR: A simple and efficient algorithm for R-tree packing
◦ Spatialhadoop: towards flexible and scalable spatial processing using mapreduce
◦ Tajo: A distributed data warehouse system on large clusters
◦ Apache Tajo Documents: Index types
Wikipedia
◦ Spatial Database
◦ Spatial Query
◦ R-tree
13. Thank You for listening
Do you have any questions?
Please e-mail(pseudojo.1989@gmail.com) me with the questions,
and I’ll answer them in detail.
Editor's Notes
Hello, everyone. My name is Hyungu Cho, and I am doing my master’s in Computer Science at Kunsan National University. The title of my presentation is Spatial Tajo (Supporting spatial queries on Tajo). I would appreciate if you understand that I am going to make the presentation by reading the transcript because I am not that good at English.
The presentation will be made in the following order: “What is Tajo?”, “Motive for development”, “Why I chose Apache Tajo?”, “Plan for the implementation of the plug-in”, “Parts implemented and parts not yet implemented” and “Conclusion.”
What is spatial Tajo? To describe it briefly, it is a plug-in to provide spatial queries for Tajo. To describe it more in detail, it is a plug-in to provide and perform querying datasets by spatial queries using SQL in a distributed data warehouse system, Tajo, which provides spatial functions, supports spatial data types for spatial queries, supports indexing spatial data and allows the use of raster data.
Then, what would be the motive for the development of this plug-in? As there was getting more interest in big data for the last decade, attempts to analyze spatial big data, too, have increased naturally. I, also, thought that the volume of ‘the data containing the spatial information’ I attempted to analyze could come close to big data. The data held by my Lab are a collection of tweets on Twitter in real time, which are not the ones consisting only of spatial information, but I attempted a few analyses using the data. When I analyzed the data, analysis using Hadoop was a trend, and I, too, conducted an analysis using Hadoop. However, I had to use MapReduce to conduct an analysis using Hadoop, and often thought that using SQL whenever I conduct an experiment would be convenient. Since it is somewhat difficult for an analyst to use MapReduce freely, which essentially is batch processing, the more data, the higher latency becomes.
So, I did research to use existing software or solutions. Of course, there are good ones among the existing software and solutions.
Traditional spatial database and DBMS include Oracle Spatial and Graph which can be installed to Oracle database as a plug-in and MySQL DBMS. Both are satisfactory solutions, but are commercial products, somewhat difficult to build up a cluster. My school did not provide them separately, and solutions that continuously cost were not an appropriate option. Using PostGIS that does not cost seems to be a good method; however, since it is not itself software to analyze big data, it was held off.
As for NoSQL, there are document-oriented databases as MongoDB, CouchDB and RethinkDB. I could have used this, but I already had structured data, so I did not have to use this. HBase was made by modeling after Google’s BigTable, but it is somewhat difficult for an analyst to use that. Hive is convenient since it does not use SQL, but HiveQL, similar to that, but it was not judged that Hive which uses MapReduce would be an appropriate choice.
Other solutions that use Hadoop include GeoMesa, ESRI GIS Tools for Hadoop and SpatialHadoop. ESRI GIS Tools for Hadoop is close to tools or libraries prepared for analysis using Hadoop, while SpatialHadoop, too, is close to the form of a combination of spatial plug-in with Hadoop.
Overall, in that it does not cost, despite a little effort should be invested, it seems that performing spatial queries using Hive or using GeoMesa may be most appropriate. However, I finally, decided to prepare plug-in that allows Tajo, free and open-source software with low latency to perform spatial queries by myself.
First, I will introduce Tajo before I will speak about why I chose Tajo. Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and Extract-Transform-Load (ETL) process on large-data sets stored on Hadoop Distributed File System (HDFS) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities.
Then, why did I choose Tajo? There are a few features that become the reasons for my choice. Since Tajo is designed to run on Hadoop, using Hadoop can alleviate my concern about data distribution, and also supports Amazon S3. Tajo has a function of external table, so it can bring existing files and make queries. It does not support update syntax like HBase or Hive, but it can overwrite. Tajo supports SQL standard and does not use MapReduce, so it is faster than batch processing using MapReduce, has fault tolerance and supports dynamic scheduling for long-running queries. Tajo is convenient in terms of installation, construction and operation. Of course, it is a project still growing, so it is certainly difficult to solve when it fails. And if you need, you can implement it yourself and attach it in the form of plug-in. Tajo, as described above, has merits and demerits as compared to the solutions or software in the examples, but it is free and open-source software, so I decided to implement it on Tajo, which was judged to be appropriate considering the characteristics introduced.
The plan for the implementation of the plug-in is broadly divided into four steps. First, for the implementation of spatial functions for spatial queries, I decided to implement 11 basic functions including distance and equals and the function for the conversion of spatial data types. Second, the addition of types of spatial data, which means the custom type containing spatial information like ‘Point, LineString’. Third, kNN query, which means the implementation for smooth performance of kNN after the implementation of spatial functions. Fourth, indexing spatial data, which means the implementation of an index that allows retrieval of necessary data in making spatial queries.
Parts that have been implemented so far include spatial functions, kNN query and spatial data indexing. For the spatial functions, I implemented most of the primary functions using JTS. I carried out kNN queries using the spatial functions implemented, which were working smoothly. For spatial data indexing, borrowing the operation methods of index in Tajo and SpatialHadoop, I implemented a two level R-tree using Sort-Tile-Recursive (STR).
The two-level R-tree has two forms, a global index and local indexes, as shown in the picture.
The process of building the two-level R-tree index is as follows: First, Tajo's workers divide the stored datasets by each area using the STR. Second, they build local indexes for each area. Third, they extract only the data on a certain upper level from each of the local indexes and build a global index using them.
The process of reading the two-level R-tree index is as follows: First, Tajo’s workers read the global index and find search keys. Second, they find the local indexes corresponding to the search keys. Third, they find the search keys in the local indexes. Lastly, they read the data directly from the storage and construct tuples.
Parts that are not yet implemented include spatial data types, spatial functions, kNN query, spatial data index and modularization. Spatial data types are currently not implemented, so there is an inconvenience that spatial functions have to use primitive types. I am going to resolve the inconvenience of spatial functions, once spatial data types are implemented.
First, as for spatial functions, I am going to implement the functions as lengths, areas, centroids that are not implemented and then, optimize each of the spatial functions and kNN queries.
Next, as for the indexing of spatial data, I am going to implement Quad-tree or KD-Tree using GeoHash as well as R-tree.
Lastly, I am going to modularize the plug-in. Currently, it is combined since it is difficult to separate it from Tajo, but after going through the final modularization in the plug-in, it will be distributed in the form of a plug-in.
Conclusion
What is Spatial Tajo? It is a plug-in to provide spatial queries for Tajo.
What is a motive for development? I began developing it as I wanted an analysis using SQL in a distributed data warehouse system, instead of using MapReduce or similar batch processing.
Why did I choose Apache Tajo? I did so because I can use it without a great concern about the distribution storage, it supports SQL syntax, is faster than MapReduce, guarantees fault tolerance, and above all, I can implement and attach it in the form of a plug-in myself if necessary.
A plan for implementation is as follows: As an overall plan for implementation, I am going to implement spatial functions for spatial queries, add spatial data types and allow running kNN queries and supporting the function of indexing the spatial data.
The current status is as follows: The parts implemented so far include most spatial functions, the running of kNN queries and support for indexing through implementing a two-level R-tree. The parts not yet implemented include spatial data types, the remaining spatial functions, the optimization of spatial functions and spatial queries, another method of indexing the spatial data and the modularization of plug-in.
For today’s presentation, I mainly referred to the documents and source codes of Tajo, SpatialHadoop and PostGIS and etc., and there are books or websites for information about the contents, as well.
For today’s presentation, I mainly referred to the documents and source codes of Tajo, SpatialHadoop and PostGIS and etc., and there are books or websites for information about the contents, as well.
For Q&A, please e-mail your questions to the e-mail address stated here, and I will answer them in detail. Thank you for listening to my presentation.