SlideShare a Scribd company logo
1 of 13
Spatial Tajo
Supporting Spatial Queries on Apache Tajo
Contents
 What is Spatial Tajo?
 Motive for Development
 Why I chose Apache Tajo?
 Plan for the implementation of the plug-in
 Current status
◦ Parts implemented
◦ Parts not yet implemented
 Conclusion
 References
What is Spatial Tajo?
A plug-in to provide spatial queries for Tajo
In detail, it is a plug-in allowing the provision
and performance of queries about spatial
relations and spatial analysis for data sets
stored in the distributed data warehouse system.
◦ Providing spatial functions for spatial queries
◦ Supporting data types
◦ Supporting an index structure for spatial data sets
◦ Supporting raster data
Overall Architecture of Apache Tajo
Tajo Worker
Local
File-
System
HDFS
Amazon
S3
QueryMaster
Local Query Engine
StorageManager Spatial Tajo
Tajo Master
CatalogStore
Allocate
a query
Manage
metadata
Client
JDBC SQL Shell Web UI
Motive for development
 The volume of the spatial data sets to be analyzed come near big data, and I’d like
to analyze this using SQL.
 I'd like to use a system working without batch processing.
 I'd like to use free software or free solution (Provided that I contribute my
experiences to communities).
Motive for development
 Of course, there are good ones among the known software and solutions. But…
◦ Relational Database and DBMS
◦ Oracle Spatial and Graph (Oracle Database+Plug-in)
◦ MySQL DBMS
◦ PostGIS with PostgreSQL
◦ NoSQL
◦ Document-oriented database: MongoDB, CouchDB (Plug-in), RethinkDB
◦ HBase, Hive
◦ Cluster and Cloud
◦ GeoMesa, ESRI GIS Tools for Hadoop, SpatialHadoop (http://spatialhadoop.cs.umn.edu) (on Hadoop)
◦ CartoDB (on top of PostgreSQL, PostGIS and SaaS)
 Conclusion: FOSS + Spatial → Apache Tajo + Spatial Plug-in!
Why I chose Apache Tajo?
 Apache Tajo
◦ A robust big data relational and distributed data warehouse system for Apache Hadoop
◦ Designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL on large
-data sets stored on HDFS and other data sources.
 Why did I choose Apache Tajo?: Features
◦ Motions of distributing and storing data are entrusted to Hadoop or Amazon S3
◦ It supports the insertion of data, but doesn’t support the update of the data
◦ Compatibility: Can query with ANSI/ISO SQL
◦ It is faster than processing using MapReduce and guarantees fault tolerance
◦ It is easy to build up and manage
◦ The user can implement and plug it in yourself if necessary
Plan for the implementation of the plug-in
 Spatial functions for spatial queries
◦ Distances, Equals, Disjoints, Intersects and Touches,
Crosses, Overlaps, Contains, Lengths, Areas and Centroids
◦ Transforming functions for spatial types (like from_OOOO or to_OOOO)
 Adding spatial data types
 Enabling to run kNN queries
 Supporting an index for spatial data
◦ R-tree, Quad-tree and KD-tree, etc.
Current status – Parts implemented
 Most primary spatial functions
◦ Implementing most primary spatial functions using JTS
◦ Distances, Equals, Disjoints, Intersects, Touches, Crosses, Overlaps and Contains
 Running kNN queries
◦ It can run kNN queries using the implemented spatial functions.
 Indexing for spatial data
◦ R-tree indexing using Sort-Tile-Recursive (STR)
Local indexes
Global
Index
The process of reading data
using the index
1. Reading the global index
and finding search keys
2. Finding local indexes
corresponding to the
search keys,
3. Finding the search keys in
the local indexes
4. Directly reading the tuples
Wikipedia: R-tree - STR, SpatialHadoop Paper and Tajo Document
Current status – Parts not yet implemented
 Adding spatial data types
◦ Parameters of spatial functions are inaccurate and inconvenient to use.
 Spatial functions not yet implemented
◦ Lengths, areas and centroids
◦ Transform functions (e.g. from_OOO and to_OOO)
Optimizing functions and queries
◦ Optimizing spatial functions
◦ Optimizing kNN queries
Indexing spatial data
◦ Quad-tree (with GeoHash) and KD-tree
Modularization
◦ Currently, since it is not separated from Apache Tajo, it is impossible to install the plug-in.
Conclusion
 What is Spatial Tajo?
 Motive for Development
 Why I chose Apache Tajo?
 Plan for the implementation of the plug-in
 Current status
◦ Parts implemented
◦ Parts not yet implemented
References
Apache Tajo
◦ Official Website, Source codes, User documentation
◦ Efficient In-situ processing of various storage types on Apache Tajo
◦ Apache Tajo Enters the SQL-on-Hadoop Space
◦ SQL-on-Hadoop: What does “100 times faster than Hive” actually mean?
◦ Setting up an Apache Tajo Cluster on Amazon EMR
PostgreSQL and PostGIS Document
SpatialHadoop
◦ Official Website, Source codes
◦ A demonstration of SpatialHadoop: an efficient mapreduce framework for spatial data
◦ Spatialhadoop: towards flexible and scalable spatial processing using MapReduce
Reference
Spatial Databases: With Application to GIS
Indexing
◦ STR: A simple and efficient algorithm for R-tree packing
◦ Spatialhadoop: towards flexible and scalable spatial processing using mapreduce
◦ Tajo: A distributed data warehouse system on large clusters
◦ Apache Tajo Documents: Index types
Wikipedia
◦ Spatial Database
◦ Spatial Query
◦ R-tree
Thank You for listening
Do you have any questions?
Please e-mail(pseudojo.1989@gmail.com) me with the questions,
and I’ll answer them in detail.

More Related Content

What's hot

End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkBurak Yavuz
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Spark Summit
 
Accelerating Data Processing in Spark SQL with Pandas UDFs
Accelerating Data Processing in Spark SQL with Pandas UDFsAccelerating Data Processing in Spark SQL with Pandas UDFs
Accelerating Data Processing in Spark SQL with Pandas UDFsDatabricks
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Mark Kromer
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Denny Lee
 
Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mark Kromer
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPDatabricks
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Spark Summit
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIRyuji Tamagawa
 
Redshift Spectrum & AWS Athena Deep Dive
Redshift Spectrum & AWS Athena Deep DiveRedshift Spectrum & AWS Athena Deep Dive
Redshift Spectrum & AWS Athena Deep DiveOz Levi
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopDatabricks
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Vinoth Chandar
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...DataWorks Summit
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Databricks
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Databricks
 
Revealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine DataRevealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine DataDatabricks
 

What's hot (20)

End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
Accelerating Data Processing in Spark SQL with Pandas UDFs
Accelerating Data Processing in Spark SQL with Pandas UDFsAccelerating Data Processing in Spark SQL with Pandas UDFs
Accelerating Data Processing in Spark SQL with Pandas UDFs
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 
Big data clustering
Big data clusteringBig data clustering
Big data clustering
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)
 
Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
 
Redshift Spectrum & AWS Athena Deep Dive
Redshift Spectrum & AWS Athena Deep DiveRedshift Spectrum & AWS Athena Deep Dive
Redshift Spectrum & AWS Athena Deep Dive
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
 
Revealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine DataRevealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine Data
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 

Similar to [FOSS4G 2015 SEOUL] Spatial tajo supporting spatial queries on Apache Tajo

Apache Tajo - BWC 2014
Apache Tajo - BWC 2014Apache Tajo - BWC 2014
Apache Tajo - BWC 2014Gruter
 
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of GruterBig Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of GruterData Con LA
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadooplarsgeorge
 
Apache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouseApache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehousehadoopsphere
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterKamran Munshi
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014Eli Singer
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)Sprinkle Data Inc
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Imply
 
Ledingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkLedingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkMukesh Singh
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Amazon Web Services LATAM
 
How to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer CertificationHow to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer Certificationelephantscale
 

Similar to [FOSS4G 2015 SEOUL] Spatial tajo supporting spatial queries on Apache Tajo (20)

Apache Tajo - BWC 2014
Apache Tajo - BWC 2014Apache Tajo - BWC 2014
Apache Tajo - BWC 2014
 
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of GruterBig Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Apache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouseApache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouse
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @Twitter
 
Presto
PrestoPresto
Presto
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
Prashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEWPrashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEW
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)
 
Ledingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkLedingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lk
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Presto@Uber
Presto@UberPresto@Uber
Presto@Uber
 
How to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer CertificationHow to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer Certification
 
InternReport
InternReportInternReport
InternReport
 

Recently uploaded

Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 

Recently uploaded (20)

Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 

[FOSS4G 2015 SEOUL] Spatial tajo supporting spatial queries on Apache Tajo

  • 1. Spatial Tajo Supporting Spatial Queries on Apache Tajo
  • 2. Contents  What is Spatial Tajo?  Motive for Development  Why I chose Apache Tajo?  Plan for the implementation of the plug-in  Current status ◦ Parts implemented ◦ Parts not yet implemented  Conclusion  References
  • 3. What is Spatial Tajo? A plug-in to provide spatial queries for Tajo In detail, it is a plug-in allowing the provision and performance of queries about spatial relations and spatial analysis for data sets stored in the distributed data warehouse system. ◦ Providing spatial functions for spatial queries ◦ Supporting data types ◦ Supporting an index structure for spatial data sets ◦ Supporting raster data Overall Architecture of Apache Tajo Tajo Worker Local File- System HDFS Amazon S3 QueryMaster Local Query Engine StorageManager Spatial Tajo Tajo Master CatalogStore Allocate a query Manage metadata Client JDBC SQL Shell Web UI
  • 4. Motive for development  The volume of the spatial data sets to be analyzed come near big data, and I’d like to analyze this using SQL.  I'd like to use a system working without batch processing.  I'd like to use free software or free solution (Provided that I contribute my experiences to communities).
  • 5. Motive for development  Of course, there are good ones among the known software and solutions. But… ◦ Relational Database and DBMS ◦ Oracle Spatial and Graph (Oracle Database+Plug-in) ◦ MySQL DBMS ◦ PostGIS with PostgreSQL ◦ NoSQL ◦ Document-oriented database: MongoDB, CouchDB (Plug-in), RethinkDB ◦ HBase, Hive ◦ Cluster and Cloud ◦ GeoMesa, ESRI GIS Tools for Hadoop, SpatialHadoop (http://spatialhadoop.cs.umn.edu) (on Hadoop) ◦ CartoDB (on top of PostgreSQL, PostGIS and SaaS)  Conclusion: FOSS + Spatial → Apache Tajo + Spatial Plug-in!
  • 6. Why I chose Apache Tajo?  Apache Tajo ◦ A robust big data relational and distributed data warehouse system for Apache Hadoop ◦ Designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL on large -data sets stored on HDFS and other data sources.  Why did I choose Apache Tajo?: Features ◦ Motions of distributing and storing data are entrusted to Hadoop or Amazon S3 ◦ It supports the insertion of data, but doesn’t support the update of the data ◦ Compatibility: Can query with ANSI/ISO SQL ◦ It is faster than processing using MapReduce and guarantees fault tolerance ◦ It is easy to build up and manage ◦ The user can implement and plug it in yourself if necessary
  • 7. Plan for the implementation of the plug-in  Spatial functions for spatial queries ◦ Distances, Equals, Disjoints, Intersects and Touches, Crosses, Overlaps, Contains, Lengths, Areas and Centroids ◦ Transforming functions for spatial types (like from_OOOO or to_OOOO)  Adding spatial data types  Enabling to run kNN queries  Supporting an index for spatial data ◦ R-tree, Quad-tree and KD-tree, etc.
  • 8. Current status – Parts implemented  Most primary spatial functions ◦ Implementing most primary spatial functions using JTS ◦ Distances, Equals, Disjoints, Intersects, Touches, Crosses, Overlaps and Contains  Running kNN queries ◦ It can run kNN queries using the implemented spatial functions.  Indexing for spatial data ◦ R-tree indexing using Sort-Tile-Recursive (STR) Local indexes Global Index The process of reading data using the index 1. Reading the global index and finding search keys 2. Finding local indexes corresponding to the search keys, 3. Finding the search keys in the local indexes 4. Directly reading the tuples Wikipedia: R-tree - STR, SpatialHadoop Paper and Tajo Document
  • 9. Current status – Parts not yet implemented  Adding spatial data types ◦ Parameters of spatial functions are inaccurate and inconvenient to use.  Spatial functions not yet implemented ◦ Lengths, areas and centroids ◦ Transform functions (e.g. from_OOO and to_OOO) Optimizing functions and queries ◦ Optimizing spatial functions ◦ Optimizing kNN queries Indexing spatial data ◦ Quad-tree (with GeoHash) and KD-tree Modularization ◦ Currently, since it is not separated from Apache Tajo, it is impossible to install the plug-in.
  • 10. Conclusion  What is Spatial Tajo?  Motive for Development  Why I chose Apache Tajo?  Plan for the implementation of the plug-in  Current status ◦ Parts implemented ◦ Parts not yet implemented
  • 11. References Apache Tajo ◦ Official Website, Source codes, User documentation ◦ Efficient In-situ processing of various storage types on Apache Tajo ◦ Apache Tajo Enters the SQL-on-Hadoop Space ◦ SQL-on-Hadoop: What does “100 times faster than Hive” actually mean? ◦ Setting up an Apache Tajo Cluster on Amazon EMR PostgreSQL and PostGIS Document SpatialHadoop ◦ Official Website, Source codes ◦ A demonstration of SpatialHadoop: an efficient mapreduce framework for spatial data ◦ Spatialhadoop: towards flexible and scalable spatial processing using MapReduce
  • 12. Reference Spatial Databases: With Application to GIS Indexing ◦ STR: A simple and efficient algorithm for R-tree packing ◦ Spatialhadoop: towards flexible and scalable spatial processing using mapreduce ◦ Tajo: A distributed data warehouse system on large clusters ◦ Apache Tajo Documents: Index types Wikipedia ◦ Spatial Database ◦ Spatial Query ◦ R-tree
  • 13. Thank You for listening Do you have any questions? Please e-mail(pseudojo.1989@gmail.com) me with the questions, and I’ll answer them in detail.

Editor's Notes

  1. Hello, everyone. My name is Hyungu Cho, and I am doing my master’s in Computer Science at Kunsan National University. The title of my presentation is Spatial Tajo (Supporting spatial queries on Tajo). I would appreciate if you understand that I am going to make the presentation by reading the transcript because I am not that good at English.
  2. The presentation will be made in the following order: “What is Tajo?”, “Motive for development”, “Why I chose Apache Tajo?”, “Plan for the implementation of the plug-in”, “Parts implemented and parts not yet implemented” and “Conclusion.”
  3. What is spatial Tajo? To describe it briefly, it is a plug-in to provide spatial queries for Tajo. To describe it more in detail, it is a plug-in to provide and perform querying datasets by spatial queries using SQL in a distributed data warehouse system, Tajo, which provides spatial functions, supports spatial data types for spatial queries, supports indexing spatial data and allows the use of raster data.
  4. Then, what would be the motive for the development of this plug-in? As there was getting more interest in big data for the last decade, attempts to analyze spatial big data, too, have increased naturally. I, also, thought that the volume of ‘the data containing the spatial information’ I attempted to analyze could come close to big data. The data held by my Lab are a collection of tweets on Twitter in real time, which are not the ones consisting only of spatial information, but I attempted a few analyses using the data. When I analyzed the data, analysis using Hadoop was a trend, and I, too, conducted an analysis using Hadoop. However, I had to use MapReduce to conduct an analysis using Hadoop, and often thought that using SQL whenever I conduct an experiment would be convenient. Since it is somewhat difficult for an analyst to use MapReduce freely, which essentially is batch processing, the more data, the higher latency becomes. So, I did research to use existing software or solutions. Of course, there are good ones among the existing software and solutions.
  5. Traditional spatial database and DBMS include Oracle Spatial and Graph which can be installed to Oracle database as a plug-in and MySQL DBMS. Both are satisfactory solutions, but are commercial products, somewhat difficult to build up a cluster. My school did not provide them separately, and solutions that continuously cost were not an appropriate option. Using PostGIS that does not cost seems to be a good method; however, since it is not itself software to analyze big data, it was held off. As for NoSQL, there are document-oriented databases as MongoDB, CouchDB and RethinkDB. I could have used this, but I already had structured data, so I did not have to use this. HBase was made by modeling after Google’s BigTable, but it is somewhat difficult for an analyst to use that. Hive is convenient since it does not use SQL, but HiveQL, similar to that, but it was not judged that Hive which uses MapReduce would be an appropriate choice. Other solutions that use Hadoop include GeoMesa, ESRI GIS Tools for Hadoop and SpatialHadoop. ESRI GIS Tools for Hadoop is close to tools or libraries prepared for analysis using Hadoop, while SpatialHadoop, too, is close to the form of a combination of spatial plug-in with Hadoop. Overall, in that it does not cost, despite a little effort should be invested, it seems that performing spatial queries using Hive or using GeoMesa may be most appropriate. However, I finally, decided to prepare plug-in that allows Tajo, free and open-source software with low latency to perform spatial queries by myself.
  6. First, I will introduce Tajo before I will speak about why I chose Tajo. Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and Extract-Transform-Load (ETL) process on large-data sets stored on Hadoop Distributed File System (HDFS) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities. Then, why did I choose Tajo? There are a few features that become the reasons for my choice. Since Tajo is designed to run on Hadoop, using Hadoop can alleviate my concern about data distribution, and also supports Amazon S3. Tajo has a function of external table, so it can bring existing files and make queries. It does not support update syntax like HBase or Hive, but it can overwrite. Tajo supports SQL standard and does not use MapReduce, so it is faster than batch processing using MapReduce, has fault tolerance and supports dynamic scheduling for long-running queries. Tajo is convenient in terms of installation, construction and operation. Of course, it is a project still growing, so it is certainly difficult to solve when it fails. And if you need, you can implement it yourself and attach it in the form of plug-in. Tajo, as described above, has merits and demerits as compared to the solutions or software in the examples, but it is free and open-source software, so I decided to implement it on Tajo, which was judged to be appropriate considering the characteristics introduced.
  7. The plan for the implementation of the plug-in is broadly divided into four steps. First, for the implementation of spatial functions for spatial queries, I decided to implement 11 basic functions including distance and equals and the function for the conversion of spatial data types. Second, the addition of types of spatial data, which means the custom type containing spatial information like ‘Point, LineString’. Third, kNN query, which means the implementation for smooth performance of kNN after the implementation of spatial functions. Fourth, indexing spatial data, which means the implementation of an index that allows retrieval of necessary data in making spatial queries.
  8. Parts that have been implemented so far include spatial functions, kNN query and spatial data indexing. For the spatial functions, I implemented most of the primary functions using JTS. I carried out kNN queries using the spatial functions implemented, which were working smoothly. For spatial data indexing, borrowing the operation methods of index in Tajo and SpatialHadoop, I implemented a two level R-tree using Sort-Tile-Recursive (STR). The two-level R-tree has two forms, a global index and local indexes, as shown in the picture. The process of building the two-level R-tree index is as follows: First, Tajo's workers divide the stored datasets by each area using the STR. Second, they build local indexes for each area. Third, they extract only the data on a certain upper level from each of the local indexes and build a global index using them. The process of reading the two-level R-tree index is as follows: First, Tajo’s workers read the global index and find search keys. Second, they find the local indexes corresponding to the search keys. Third, they find the search keys in the local indexes. Lastly, they read the data directly from the storage and construct tuples.
  9. Parts that are not yet implemented include spatial data types, spatial functions, kNN query, spatial data index and modularization. Spatial data types are currently not implemented, so there is an inconvenience that spatial functions have to use primitive types. I am going to resolve the inconvenience of spatial functions, once spatial data types are implemented. First, as for spatial functions, I am going to implement the functions as lengths, areas, centroids that are not implemented and then, optimize each of the spatial functions and kNN queries. Next, as for the indexing of spatial data, I am going to implement Quad-tree or KD-Tree using GeoHash as well as R-tree. Lastly, I am going to modularize the plug-in. Currently, it is combined since it is difficult to separate it from Tajo, but after going through the final modularization in the plug-in, it will be distributed in the form of a plug-in.
  10. Conclusion What is Spatial Tajo? It is a plug-in to provide spatial queries for Tajo. What is a motive for development? I began developing it as I wanted an analysis using SQL in a distributed data warehouse system, instead of using MapReduce or similar batch processing. Why did I choose Apache Tajo? I did so because I can use it without a great concern about the distribution storage, it supports SQL syntax, is faster than MapReduce, guarantees fault tolerance, and above all, I can implement and attach it in the form of a plug-in myself if necessary. A plan for implementation is as follows: As an overall plan for implementation, I am going to implement spatial functions for spatial queries, add spatial data types and allow running kNN queries and supporting the function of indexing the spatial data. The current status is as follows: The parts implemented so far include most spatial functions, the running of kNN queries and support for indexing through implementing a two-level R-tree. The parts not yet implemented include spatial data types, the remaining spatial functions, the optimization of spatial functions and spatial queries, another method of indexing the spatial data and the modularization of plug-in.
  11. For today’s presentation, I mainly referred to the documents and source codes of Tajo, SpatialHadoop and PostGIS and etc., and there are books or websites for information about the contents, as well.
  12. For today’s presentation, I mainly referred to the documents and source codes of Tajo, SpatialHadoop and PostGIS and etc., and there are books or websites for information about the contents, as well.
  13. For Q&A, please e-mail your questions to the e-mail address stated here, and I will answer them in detail. Thank you for listening to my presentation.