Building a geospatial processing pipeline using Hadoop and HBase and how Mons... - DataWorks Summit
Monsanto built a geospatial platform on Hadoop and HBase capable of managing over 120 billion polygons. As a result of the extreme data volumes and compute complexities, we were forced to migrate our data processing from a traditional RDBMS to a scale-out Hadoop implementation. Data processing that took over 30 days on 8% of the data now runs in under 12 hours on the entire data set. Very little concrete material exists on how to process spatial data via MapReduce or model it in HBase. We will provide concrete and novel examples for processing and storing spatial data on Hadoop and HBase. As part of the data processing pipeline, we integrated the popular open-source geospatial processing library GDAL with MapReduce to convert all geospatial datasets to a common format and projection. We developed a method for splitting and processing images via MapReduce in which the boundaries of splits needed to be shared by multiple tasks due to the nature of the computation being performed on the data. Bulk writes to HBase were performed by writing HFiles directly. Finally, we developed a novel method for storing geospatial data in HBase that met the needs of our access pattern.
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned - DataWorks Summit
Scientific data services are a critical aspect of the NASA Center for Climate Simulation's (NCCS) mission. Hadoop, via MapReduce, provides an approach to high-performance analytics that is proving useful for data-intensive problems in climate research. It offers an analysis paradigm that uses clusters of computers and combines distributed storage of large data sets with parallel computation. The NCCS is particularly interested in the potential of Hadoop to speed up basic operations common to a wide range of analyses. In order to evaluate this potential, we prototyped a series of canonical MapReduce operations over a test suite of observational and climate simulation datasets. The initial focus was on averaging operations over arbitrary spatial and temporal extents within Modern-Era Retrospective Analysis for Research and Applications (MERRA) data. After preliminary results suggested that this approach improves efficiencies within data-intensive analytic workflows, we invested in building a cyberinfrastructure resource for developing a new generation of climate data analysis capabilities using Hadoop. This resource is focused on reducing the time spent in the preparation of reanalysis data used in data-model intercomparison, a long-sought goal of the climate community. This paper summarizes the related use cases and lessons learned.
Sept 17 2013 - THUG - HBase a Technical Introduction - Adam Muise
HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).
Accompanying slides for the class “Introduction to Hadoop” at the PRACE Autumn school 2020 - HPC and FAIR Big Data organized by the faculty of Mechanical Engineering of the University of Ljubljana (Slovenia).
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations.
Whether an on-premise or cloud-based platform is used for storing, processing and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
A presentation on Hadoop for scientific researchers given at Universitat Rovira i Virgili in Catalonia, Spain in October 2010. http://etseq.urv.cat/seminaris/seminars/3/
Introduction to Big Data & Hadoop Architecture - Module 1 - Rohit Agrawal
Learning Objectives - In this module, you will understand what Big Data is, the limitations of existing solutions to the Big Data problem, how Hadoop solves the Big Data problem, the common Hadoop ecosystem components, Hadoop architecture, HDFS and the MapReduce framework, and the anatomy of a file write and read.
This is an updated version of Amr's Hadoop presentation. Amr gave this talk recently at the NASA CIDU event, the TDWI LA chapter, and Netflix HQ. You should watch the PowerPoint version, as it has animations. The slides also include handout notes with additional information.
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING - ijiert bestjournal
Unstructured data poses challenges for storage. Experts estimate that 80 to 90 percent of the data in any organization is unstructured, and the amount of unstructured data in enterprises is growing significantly, often many times faster than structured databases are growing. Structured data exists in table format with a proper schema, while unstructured data is schema-less, which directly signifies the importance of the NoSQL storage model and the MapReduce platform. In the existing system, unstructured data is processed against a Cassandra dataset. In the present system, MongoDB is implemented alongside the Cassandra dataset, as MongoDB provides a flexible data model and a large number of options for querying unstructured data, whereas Cassandra models its data so as to minimize the total number of queries through more careful planning and denormalization. Cassandra offers basic secondary indexes, but for the best performance it is recommended to model the data so that they are used infrequently. So to process...
Learning objectives
• Understand how to handle massive amounts of data using a data grid.
• Explain data replication and namespaces.
• Identify the various data access models.
Definitive Guide to Select Right Data Warehouse (2020) - Sprinkle Data Inc
Choosing the right data warehouse is a big challenge for organisations. In this doc, we have made an end-to-end comparison of leading data warehouses: Snowflake vs Redshift vs BigQuery vs Hive vs Athena.
Sprinkledata.com
This is a slide deck that I have been using to present on GeoTrellis at various meetings and workshops. The information speaks to GeoTrellis pre-1.0 as of Q4 2016.
Cosmos DB Real-time Advanced Analytics Workshop - Databricks
The workshop implements an innovative fraud detection solution as a PoC for a bank that provides payment processing services for commerce to merchant customers all across the globe, helping them save costs by applying machine learning and advanced analytics to detect fraudulent transactions. Since their customers are around the world, the right solution should minimize any latencies experienced using their service by distributing as much of the solution as possible, as closely as possible, to the regions in which their customers use the service. The workshop designs a data pipeline solution that leverages Cosmos DB for both the scalable ingest of streaming data and the globally distributed serving of both pre-scored data and machine learning models. Cosmos DB's major advantage when operating at a global scale is its high concurrency with low latency and predictable results.
This combination is unique to Cosmos DB and ideal for the bank's needs. The solution leverages the Cosmos DB change data feed in concert with Azure Databricks Delta and Spark capabilities to enable a modern data warehouse solution that can be used to create risk reduction solutions for scoring transactions for fraud in an offline, batch approach and in a near real-time, request/response approach. https://github.com/Microsoft/MCW-Cosmos-DB-Real-Time-Advanced-Analytics Takeaway: How to leverage Azure Cosmos DB + Azure Databricks along with Spark ML for building innovative advanced analytics pipelines.
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Samsung Business USA
Which storage technology, HDDs or SSDs, excels in big data architecture? SSDs clearly win on speed, offering higher sequential read/write speeds and higher IOPS. However, deploying SSDs in hundreds or thousands of nodes could add up to a very expensive proposition. A better approach identifies critical locations where SSDs enable immediate cost-per-performance wins. This whitepaper will look at the basics of big data tools, review two performance wins with SSDs in a well-known framework, as well as present some examples of emerging opportunities on the leading edge of big data technology.
Assimilating sense into disaster recovery databases and judgement framing pr... - IJECEIAES
The replication between the primary and secondary (standby) databases can be configured in either synchronous or asynchronous mode. It is referred to as out-of-sync in either mode if there is any lag between the primary and standby databases. In the previous research, the advantages of the asynchronous method were demonstrated over the synchronous method on highly transactional databases. The asynchronous method requires human intervention and a great deal of manual effort to configure disaster recovery database setups. Moreover, in existing setups there was no accurate calculation process for estimating the lag between the primary and standby databases in terms of sequences and time factors with intelligence. To address these research gaps, the current work has implemented a self-image looping database link process and provided decision-making capabilities at standby databases. Those decisions from standby are always in favor of selecting the most efficient data retrieval method and being in sync with the primary database. The purpose of this paper is to add intelligence and automation to the standby database to begin taking decisions based on the rate of concurrency in transactions at primary and out-of-sync status at standby.
Similar to HGrid A Data Model for Large Geospatial Data Sets in HBase (20)
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf - Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Generative AI Deep Dive: Advancing from Proof of Concept to Production - Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Key Trends Shaping the Future of Infrastructure.pdf - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
A tale of scale & speed: How the US Navy is enabling software delivery from l... - sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf - Paige Cruz
Monitoring and observability aren't traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party. I will share these foundational concepts to build on:
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Essentials of Automations: Optimizing FME Workflows with Parameters - Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
3. The General Research Problem
The Geospatial Problem Instance
The Data Set
HBase data-organization alternatives
Performance analysis
Some Lessons Learned
10. [1] built a multi-dimensional index layer on top of a one-dimensional key-value store (HBase) to perform spatial queries.
[2] presented a novel key formulation scheme, based on the R+-tree, for spatial indexing in HBase.
Both focus on row-key design; there is no discussion of column and version design.
[1] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Mobile Data Management (1) 2011: 7-16
[2] Ya-Ting Hsu, Yi-Chin Pan, Ling-Yin Wei, Wen-Chih Peng, Wang-Chien Lee: Key Formulation Schemes for Spatial Index in Cloud Data Managements. MDM 2012: 21-26
11. Two Synthetic Datasets
Uniform and Zipf distributions
Based on the Bixi dataset; each object includes:
▪ station ID
▪ latitude, longitude, station name, terminal name
▪ number of docks
▪ number of bikes
100 million objects (70 GB) in a 100 km × 100 km simulated space
12. Regular Grid Indexing
Row key: Grid rowID
Column: Grid columnID
Version: counter of Objects
Value: one object in JSON format
[Figure: a 4x4 regular grid with row IDs and column IDs 00-03; the counter (version) dimension stacks the objects that fall into the same cell.]
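To make the layout concrete, here is a minimal sketch of writing one Bixi-style object under the regular-grid model. The table name, the single column family "d", and the string encoding of the grid indexes are assumptions for illustration (the slides do not specify them), and the sketch uses the current HBase client API rather than the 0.94 API used in the experiments.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RegularGridWrite {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("rg_points"))) {   // hypothetical table name

            String gridRowId = "02";      // row key: the grid cell's row index
            String gridColumnId = "03";   // column qualifier: the grid cell's column index
            long counter = 5;             // version: per-cell object counter
            String json = "{\"stationId\":42,\"lat\":45.51,\"lon\":-73.58,"
                        + "\"docks\":20,\"bikes\":7}";   // value: one object in JSON format

            Put put = new Put(Bytes.toBytes(gridRowId));
            // The object counter is written as the cell's version in the single column family "d".
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(gridColumnId), counter, Bytes.toBytes(json));
            table.put(put);
        }
    }
}
```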
13. Trie-based quad-tree indexing
Z-value linearization
Row key: Z-value
Column: object ID
Value: one object in JSON format
[Figure: table layout with Z-value row keys and object-ID column qualifiers.]
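For reference, a minimal sketch of the Z-value computation by bit interleaving of a tile's column and row indexes. The slides note that the keys were ultimately encoded in decimal to keep HBase row keys short, so the exact encoding below is illustrative only.

```java
public class ZOrder {
    // Interleave the bits of the tile's column index (x) and row index (y) to form the Z-value.
    static long zValue(int x, int y) {
        long z = 0;
        for (int i = 0; i < 31; i++) {
            z |= ((long) ((x >> i) & 1)) << (2 * i);       // x bits go to even positions
            z |= ((long) ((y >> i) & 1)) << (2 * i + 1);   // y bits go to odd positions
        }
        return z;
    }

    public static void main(String[] args) {
        // Neighbouring tiles (2, 3) and (3, 3) map to Z-values 14 and 15.
        System.out.println(zValue(2, 3) + " " + zValue(3, 3));
    }
}
```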
14. Quad-tree data model
More rows with a deeper tree
Z-ordering linearization (violates data locality)
In-time construction vs. pre-construction implies a tradeoff between query performance and memory allocation
Regular Grid data model
Very easy to locate a cell by row ID and column ID
Cannot handle a large space and a fine-grained grid, because in-memory indexes are subject to memory constraints
How much unrelated data is examined in a query matters a lot!
16. The row key is the QT Z-value + the RG row index.
The column name is the RG column index + the object ID.
The attributes of the data point are stored in the third dimension.
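A minimal sketch of how such composite keys could be assembled. The concatenation order follows the slide, but the zero-padding widths, the separator-free format, and the "d" column family are assumptions rather than the authors' exact encoding.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HGridKeys {
    // Row key: quad-tree tile Z-value followed by the regular-grid row index within that tile.
    static byte[] rowKey(long zValue, int rgRow) {
        return Bytes.toBytes(String.format("%06d%03d", zValue, rgRow));   // padding widths are assumptions
    }

    // Column qualifier: regular-grid column index within the tile followed by the object ID.
    static byte[] columnQualifier(int rgCol, String objectId) {
        return Bytes.toBytes(String.format("%03d%s", rgCol, objectId));
    }

    public static void main(String[] args) {
        Put put = new Put(rowKey(9L, 12));
        // The point's attributes go into the third dimension (the cell's value/versions), as before.
        put.addColumn(Bytes.toBytes("d"), columnQualifier(7, "station-42"),
                      Bytes.toBytes("{\"docks\":20,\"bikes\":7}"));
        System.out.println(put);
    }
}
```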
17. Range query processing:
1. Compute the minimum bounding square from the query's input location and range.
2. Compute the quad-tree tiles that overlap the bounding square (their Z-codes).
3. Compute the indexes of all regular-grid cells in these quad-tree tiles (the secondary index of rows and columns).
4. Issue one sub-query for each selected quad-tree tile; process it with user-level coprocessors on the HBase regions (see the sketch below).
5. Collect the results of the sub-queries at the client side.
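A sketch of step 4: one Scan per overlapping quad-tree tile, bounded by the regular-grid row indexes from step 3, so that the irrelevant tiles lying between the Z-codes are never read. The row-key format follows the hypothetical encoding sketched earlier, and withStartRow/withStopRow assume a recent HBase client (older clients use setStartRow/setStopRow).

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HGridRangeQuery {
    // Build one Scan per quad-tree tile that overlaps the query's bounding square.
    static List<Scan> buildSubQueries(List<Long> overlappingTiles, int minRgRow, int maxRgRow) {
        List<Scan> scans = new ArrayList<>();
        for (long z : overlappingTiles) {
            Scan scan = new Scan();
            scan.withStartRow(Bytes.toBytes(String.format("%06d%03d", z, minRgRow)));
            scan.withStopRow(Bytes.toBytes(String.format("%06d%03d", z, maxRgRow + 1)));  // stop row is exclusive
            scan.setCaching(5000);   // matches the 5K scan-cache size used in the experiments
            scans.add(scan);
        }
        // Each Scan is then executed (ideally inside a coprocessor on the region), and the client
        // merges the results, discarding regular-grid cells outside the bounding square.
        return scans;
    }
}
```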
21. k-nearest-neighbor (kNN) query processing:
1. Estimate the search range (density-based range estimation).
2. Compute the row and column indexes (steps 2 and 3 of the range query).
3. Issue a scan query to retrieve the relevant data points.
4. If fewer than K data points are returned, re-estimate the search range and repeat steps 2-3 (see the sketch of this loop below).
5. Sort the result set in increasing distance from the input location.
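A minimal, self-contained sketch of this loop. For brevity, the HBase range scan of steps 2-3 is replaced by a filter over an in-memory list, and the radius is simply doubled on each shortfall instead of being re-estimated from density; both simplifications are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class KnnSketch {
    record Point(double x, double y) {}

    static double dist(Point a, Point b) {
        return Math.hypot(a.x() - b.x(), a.y() - b.y());
    }

    // Stand-in for the HGrid range query of steps 2-3: here it just filters an in-memory list.
    static List<Point> rangeQuery(List<Point> data, Point q, double radius) {
        List<Point> hits = new ArrayList<>();
        for (Point p : data) {
            if (dist(p, q) <= radius) hits.add(p);
        }
        return hits;
    }

    static List<Point> knn(List<Point> data, Point q, int k, double initialRadius) {
        double radius = initialRadius;                        // step 1: density-based estimate in the paper
        List<Point> candidates = rangeQuery(data, q, radius);
        while (candidates.size() < k && candidates.size() < data.size()) {
            radius *= 2;                                      // step 4: enlarge the range and retry
            candidates = rangeQuery(data, q, radius);
        }
        candidates.sort(Comparator.comparingDouble(p -> dist(p, q)));   // step 5
        return candidates.subList(0, Math.min(k, candidates.size()));
    }
}
```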
22. Experiment Environment
A four-node cluster on virtual machines with Ubuntu on OpenStack
Hadoop 1.0.2 (replication factor of 2), HBase 0.94
HBase configuration
▪ 5K scan caching size
▪ Block cache enabled
▪ ROWCOL Bloom filter
Query-processing implementation
Native Java API
User-level coprocessor implementation
23. The granularity of the grid affects query-processing performance.
Explore the "best" cell configuration of each model:
Quad-tree => (t = 1)
RG => (t = 0.1)
HGrid => (T = 10, t = 0.1)
HG ≈ 10:0.1: fewer sub-queries, more false positives
HG ≈ 1:0.1: more sub-queries, fewer false positives
HG ≈ 10:0.01: more rows
HG ≈ 10:0.1: fewer rows
24. Range query: given a location and a radius, return the data points located within a distance less than or equal to the radius from the input location.
25. kNN query: given the coordinates of a location, return the K points nearest to that location.
28. Data Organization
Keep row keys and column names short.
Better to have one column family and few columns.
Avoid large amounts of data in one row.
Row-key design should make it easy to prune unrelated data.
The third dimension can store data as well.
The Bloom filter should be configured so that it can prune both rows and columns.
Compression can reduce the amount of data transmitted (see the configuration sketch below).
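A sketch of how these recommendations (a single short column family, a ROWCOL Bloom filter, compression) might be applied when creating the table. The table and family names are placeholders, and the builder-style API shown is from the HBase 2.x client; the experiments themselves used HColumnDescriptor on HBase 0.94.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateHGridTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("hgrid"))        // placeholder name
                    .setColumnFamily(
                        ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("d")) // one short family
                            .setBloomFilterType(BloomType.ROWCOL)                    // prune rows and columns
                            .setCompressionType(Compression.Algorithm.GZ)            // gzip, as in the experiments
                            .setBlockCacheEnabled(true)
                            .build())
                    .build());
        }
    }
}
```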
29. Query Processing
The rows scanned for one query should not exceed the scan cache size; otherwise, split the query into sub-queries.
"Scan" is better than "Get" for retrieving discontinuous keys, even though the scan result also includes unrelated data.
Use "Scan" for small queries and coprocessors for large queries.
It is better to split one large query into multiple sub-queries than to use one query with the row-filter mechanism.
30. HGrid benefits from the good locality of the RG index; it suffers from the poor locality of the Z-ordering QT linearization.
Performance could be improved with other linearization techniques.
It can be flexibly configured and extended:
The QT index can be replaced by the hash code of each sub-space.
The granularity in the second stage can be varied from sub-space to sub-space based on the varying densities.
HGrid is more suitable for homogeneously covered and discontinuous spaces.
31. A data model for spatio-temporal datasets
Towards general, systematic guidance for column families and other NoSQL databases
Applying the data model to cloud-based applications and big-data analytics systems
Editor's Notes
Cloud computing is attracting business owners because of its perceived benefits, such as (1) elasticity, which provides on-demand service based on fluctuating load; (2) excellent system scalability, with no practical limit to storage; and (3) low latency and high availability of service (latency is a strange thing to discuss...). Given these advantages, some enterprises have been working on migrating legacy applications to the cloud. Currently, the majority of applications deployed in the cloud include social networking, online shopping, and monitoring systems. In these applications, the data grows monotonically over time. In order to improve existing services and discover new knowledge, business owners are committing substantial budgets to the analysis of these large time-series data, ranging from simple descriptive statistics to complex analytics. In this movement, successful adoption requires a new model of storage.
Therefore, to address these challenges in this migration, we aim to develop a systematic method for guiding data organization in NoSQL databases, given the data type, the data size, and its usage pattern. We start our investigation with HBase, a NoSQL database offering built on top of Hadoop. It provides two frameworks for parallel distributed computation: MapReduce and Coprocessor. MapReduce is very effective for distributed computation over data stored within HBase tables, but in many cases, for example simple additive or aggregating operations like summing and counting, Coprocessor can give a dramatic performance improvement over HBase's already good scanning performance. Therefore, we used the Coprocessor implementation in our experiments.
Both of them investigate how to access multidimensional data efficiently with spatial indices, which is part of the problem that we are addressing in this paper. Their methods demonstrate efficient performance with spatial indices. However, both focus only on the row-key design in HBase for data organization; there is little discussion of column and version design. To work out an appropriate data model for geospatial datasets which can be easily and directly applied to location-based applications, in addition to the row key, we also take column and version into account when modeling the data in HBase. Furthermore, we implemented the queries with the HBase Coprocessor to harness the benefits of parallelism, while the above studies processed the queries with HBase Scan.
It relies on a regular-grid index. The row key is the row index of the cell in the grid, the column is the column index of the cell, and the version is a counter of objects, so the third dimension holds a stack of data points located in the same grid cell. The value of each storage cell represents one object in JSON format, holding all other attributes and values.
It relies on a trie-based quad-tree index and applies Z-ordering to transform the two-dimensional spatial data into a one-dimensional array. In this model, the row key is the Z-value. The column is the ID of the object located in the cell. Usually the cells are encoded with binary digits, but as the row key should be short in HBase, we use decimal encoding here.
QT: If the index is built in real time for each query, the construction cost dominates many small queries. If the index is maintained in memory, the granularity of the grid is limited by the amount of memory available, since the memory needed to maintain the index increases as the depth of the tree increases and the size of the grid cells becomes smaller. RG: The third dimension holds a stack of data points located in the same grid cell, and an index is maintained to keep the count of objects in each cell stack in order to support updates.
1) the data-set space is divided into equally-sized rectangular tiles T, encoded with their Z-value. 2) the data points are organized in a regular grid of continuous uniform fine-grained cells. In this model, each data point is uniquely identified in terms of its row key and column name.
The row key is the concatenation of the quad-tree Z-value and the Regular Grid row index. The column name is the concatenation of the Regular Grid column and the object id of the data point. The attributes of the data point are stored in the third dimension.
Range queries are commonly used in location-based applications. Given the coordinates of a location and a radius, a vector of data points located within a distance less than or equal to the radius from the input location is returned. Relying on the HGrid data model, the following process is followed to answer this query:
In this example, four tiles are involved. If the query is computed in one call, the scan range is from 03 to 09, and it will include the irrelevant tiles 04, 05, 07, 08, 10, and 11. Therefore, instead of triggering one query, we split it into four sub-queries, each corresponding to one tile. The sub-queries are called in sequence, and within each sub-query the work is executed by a coprocessor in parallel. But still, within a coprocessor tile, irrelevant regular-grid cells will be seen, so we filter them out.
Have not got a better way to describe this algorithm
Our experiments were performed on a four-node cluster, running on four virtual machines on an Openstack Cloud. The virtual machines run 64bit Ubuntu 11.10 and have 2 cores, 4GB of RAM, and a 200 GB disk. We used Hadoop version 1.0.2, and HBase version 0.94. Hadoop and HBase were each given 2GB of Heap size in every running node. Configurations: 1) HDFS was configured with a replication factor of 2. 2) The data was compressed with gzip. 3) the ROWCOL filter was applied on each table. 4) The scan cache size was set to 5K and the block cache was set to true, for the query processing. We implemented the Range Query processing with the Coprocessor framework and KNN with Scan in order to examine the implications of these different implementations to performance.
Across these three data models, a common configuration parameter is the size of each cell, which determines the granularity of grid. This is a very important variable which substantially affects the query-processing performance. Therefore, before we compare the three data models against each other, we have explored the best cell configuration for each model. We varied the size of the cell to observe how the different sizes of cell affect the performance of each data model. Result: In our experiments, for the QT data model, the appropriate cell configuration was 1, while for the RG data model the acceptable cell size was 0.1.
We evaluated range-query performance under the three data models with both uniform and Zipf-distributed data. The table shows the query response time of the three data models for various ranges when the system contains 100 million objects. As the radius increases, the ratio of irrelevant data to the return-set size increases, and the running time also increases because more data points are retrieved. Comparing the three models, we can see that the regular-grid data model outperforms the others: because it supports better data locality, it demonstrates better performance, since the percentage of irrelevant rows scanned is low. The HGrid data model is much better than the quad-tree data model and worse than the regular-grid data model. The same performance trends persist with both uniform and skewed data.
We now evaluate the performance of k-nearest-neighbor (kNN) queries using the same data set, under the three data models. This table shows the response time (in seconds) for kNN queries, where k takes the values 1, 10, 100, 1,000, and 10,000. As the density-based range estimation method is employed, there is only one scan operation in the query processing for uniform data, while for skewed data more than one scan iteration is invoked to retrieve the data. That is why the performance with skewed data under all data models is a little worse than with the uniform data set. For both uniform and skewed data, the regular-grid data model demonstrates the best performance among the three data models; the HGrid data model comes second, with slightly worse performance than the regular-grid data model; and the quad-tree data model is outperformed by the other two. The poor locality preservation due to the Z-order linearization method contributes to the poor performance of the quad-tree data model, and also impacts the performance of HGrid, albeit less strongly. For skewed data, with too many false positives, queries on data points having more than 70% probability cannot return a result below the timeout threshold under any data model when k equals 10K. To improve performance, a finer granularity is required to filter out irrelevant data scanning.
In summary, the query performance of the HGrid data model is better than the quad-tree data model and worse than the RG data model. It benefits from the good locality of the second-tier regular-grid index, while at the same time suffering from the poor locality of the Z-ordering linearization at the first tier. Better performance can potentially be obtained with alternative linearization techniques. For skewed data, HGrid behaves better with an appropriate configuration, while the regular-grid and QT data models are subject to memory constraints. It can be flexibly configured and extended: 1) the quad-tree index can be replaced by the hash code of each sub-space; 2) the point-based quad-tree index method can be employed; 3) the granularity in the second stage can be varied from sub-space to sub-space based on the varying densities. Therefore, HGrid is more scalable and suitable for both homogeneously covered and discontinuous spaces.
The row key and column name should be short, since they are stored with every cell in the row. It is better to have one column family, only introducing more column families where data access is usually column-scoped. The number of columns should be limited; a number in the hundreds is likely to lead to good performance. The amount of data in one row should be kept relatively small: the cost (in time) of retrieving a row with n data points increases more than linearly with n [12]. The row key should be designed to support pruning of unrelated data easily. When the third dimension is used for storing other information rather than time-to-live values, it is preferable to keep it shallow, limited to no more than hundreds of data points, as deep stacks lead to poor insertion performance. The Bloom filter [10] should be configured, as it can accelerate performance by pruning the data from both the row and column sides. Compression can improve performance by reducing the amount of data transmission.
Scan operations are preferable to Get operations for retrieving discontinuous keys, even though the Scan result is bound to also include data points that are not part of the response data set. It is more efficient to Get one row with n data points than n rows with one data point each [12]. It is advisable to narrow the range of queried columns with the Filter mechanism. The number of rows to be scanned for a query should not exceed the scan cache size, which depends on client and server memory; otherwise, it is better to split the query into several sub-queries. When there are too many unrelated rows within the defined scan range, splitting one query into multiple sub-queries with multiple Scan operations is more efficient than one query with the Filter mechanism retrieving rows one by one. The Scan operation is preferable for small queries, while Coprocessor should be used for large queries.