Talk Abstract
GeoWave is an open source software project developed at the National Geospatial-Intelligence Agency (NGA) in collaboration with Booz Allen Hamilton and RadiantBlue Technologies. GeoWave leverages Accumulo’s architecture to manage petabytes of raster and vector data, serving as an enterprise-level geospatial data store. To efficiently index geospatial data and answer queries with geospatial constraints, GeoWave employs a space-filling curve to form bidirectional mappings between multi-dimensional data and Accumulo’s sorted row identifiers. As a complete offering, GeoWave provides a plug-in to the Open Source Geospatial Foundation’s GeoServer platform, enabling management of geospatial data and associated attributes through Open Geospatial Consortium (OGC) standard services, along with map-reduce input/output formats to support scalable post-processing and analysis of geospatial data.
Speakers
Eric Robertson
Lead Technologist, Booz Allen Hamilton
Eric Robertson is a Data Scientist at Booz Allen Hamilton and has over twenty years of experience in software development across many diverse vertical domains, including telecommunications, pharmaceuticals, finance, economics and defense. Eric has extensive experience in designing and developing identity correlation systems using graph analytics. Eric holds an M.S. in Computer Science from University of Maryland Baltimore County. Eric's current interests include machine learning and linear programming.
Rich Fecher
Senior Software Engineer, RadiantBlue
Over the past 10 years, Rich Fecher has been solving the hard technical challenges that face the U.S. Defense and Intelligence Communities. Rich has extensive expertise in architecting and building end-to-end systems. His experience ranges from visualization to distributed computing, and he has primarily focused his career on enriching geospatial content and delivery. Rich holds an M.S. in Computer Science from George Mason University; he received his post-graduate certificate in GIS from Pennsylvania State University, and a B.S. in Computer Science with minors in Applied Math and Physics from the University of Virginia.
2. • Geographic Information Systems (GIS)
• GeoWave Overview
– Features
– Components
– Data Types
• The Fundamentals
– How does GeoWave organize geospatial data?
• Set of problems and solutions with Accumulo
– Deduplication
– WFS-T Transaction Isolation
– Map Occlusion Culling
– Raster Data
– Statistics
OUTLINE
3. • GIS Technology Explosion
– E.g. Smart Phone and GPS Applications
• Data Explosion
– Satellite Imagery, Ground Based Imagery,
Aerial Photography
• Problems:
– Generate Maps: Create base image and
add vector data (shapes):
• points of interest
• roads
• boundaries
– Find Features
“restaurants near you”
– Analysis
Density, Surface Analysis, Interpolation,
Pattern Discovery
GIS: GEOGRAPHIC INFORMATION SYSTEM
Generated by OpenStreetMap.org
4. • Leverage Accumulo offerings as distributed
data store
– High-performance ingest
– Horizontally scalable
– Per-entry access constraints
• Fast geospatial retrieval
• Geo-temporal indexing
• Pre-calculated statistics:
– Counts per Data Type
– Bounding Region
– Time Range
– Numeric Range
– Histograms
FEATURES OF GEOWAVE
5. Tested Versions
Accumulo 1.5.1, 1.6.x
Cloudera 2.0.0-cdh4.7.0, 2.5.0-cdh5.2
Hortonworks HDP 2.1
Apache 2.6
GeoTools 11.4, 12.1, 12.2
GeoServer 2.5.2, 2.6.1
INTEGRATED COMPONENTS
Accumulo Data Store
Hadoop Map-Reduce input/output formats
GeoServer integration with GeoTools
Vector and Raster Data
Multi-Threaded Ingest Tools
Administrative RESTful Services
Layers and Data Stores
Analytics
Kernel Density
K-means Clustering
Sampling
PDBScan coming soon with Apache Spark 1.2.1
6. • Data Structures
– Simple Feature (ISO 19125) via GeoTools (http://www.geotools.org/).
– Raster Images
– Custom
• Provided Ingest Types
– Vector Data Sources (GeoTools)
• Examples: Shapefiles, GeoJSON, PostGIS, etc.
– Grid Formats (GeoTools)
• Examples: ArcGrid, GeoTIFF, etc.
– GeoLife GPS Trajectories (http://research.microsoft.com/en-us/projects/GeoLife/)
– GPX (http://www.topografix.com/gpx.asp)
– T-Drive (http://research.microsoft.com/en-us/projects/tdrive/)
– PDAL
DATA TYPES
7. • Basic Problem: Efficiently locate and retrieve vectors or tiles intersecting a polygon (e.g., a bounding box).
• Accumulo: Each table is organized into blocks of sorted row identifiers.
• Revised Problem: Provide a two-way mapping between multiple dimensions and a single-dimension row ID, to support location-efficient storage and retrieval of vectors or tiles given constraints expressed as multi-dimensional boundaries.
MAIN PROBLEM:
INDEX TWO DIMENSIONS IN A SINGLE-DIMENSION INDEX
8. GENERALIZED PROBLEMS
Solve the general problem first. Then apply to Geospatial specific problems.
Multi-Dimension Index supporting efficient data retrieval given bounded
set of constraints for each dimension.
Indexed data includes scalars and intervals per dimension.
For example, a range of time or a polygon.
Index over a mix of bounded and unbounded dimensions.
9. Curves are constructed iteratively. Each iteration produces a sequence of piecewise linear continuous curves, each one more closely approximating the space-filling limit.
Each discrete value on the curve represents a hyper-rectangle in n-dimensional space.
Space Filling Curve: A curve whose range contains the entire n-dimensional hypercube.
FUNDAMENTAL APPROACH:
SPACE FILLING CURVES TRAVERSE N-DIMENSIONAL SPACE
10. Achieve optimal read performance through contiguous series of values
across two or more dimensions.
Reading 11 records over a contiguous range 23->33 is faster than reading a non-contiguous range such as 15,18,34,56-58,83,99,101-102.
Consider: Latitude and Longitude defined by a range (latA, lonA) -> (latB, lonB) should
map to the least number of ranges on the space filling curve.
Haverkort and Walderveen[1] describe 3 metrics to help quantify this.
CURVE SELECTION : SEQUENTIAL IO OPTIMIZATION
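Metrics illustrated: Worst Case Dilation, Worst Case Bounding Box, Average Bounding Box.
Dilation: $\frac{\text{squared distance between } p \text{ and } q}{\text{area filled by curve between } p \text{ and } q}$
Bounding box quality: $\frac{\text{area of minimum bounding rectangle (blue)}}{\text{area filled by curve (green)}}$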
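To make the sequential-IO argument concrete, the sketch below collapses a set of sorted keys into inclusive ranges and counts them; each range stands for one scan. The helper name `to_ranges` is illustrative, not part of GeoWave or Accumulo.

```python
def to_ranges(values):
    """Collapse integer keys into sorted, inclusive (start, end) ranges."""
    ranges = []
    for v in sorted(values):
        if ranges and v == ranges[-1][1] + 1:
            # Key continues the current run: extend the range.
            ranges[-1] = (ranges[-1][0], v)
        else:
            # Gap in the keys: a new scan range is needed.
            ranges.append((v, v))
    return ranges


contiguous = to_ranges(range(23, 34))  # the 23->33 example
scattered = to_ranges([15, 18, 34, 56, 57, 58, 83, 99, 101, 102])
print(len(contiguous), "range vs.", len(scattered), "ranges")
```

The contiguous keys need a single scan, while the scattered keys need seven separate ones, which is the cost a good curve tries to minimize.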
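11. Metric | Z-Order | Hilbert | H-order | Peano | AR2W2 | BΩ
Worst Case Dilation (L∞) | ∞ | 6 | 4 | 8 | 5.40 | 5.00
Worst Case Dilation (L2) | ∞ | 6 | 4 | 8 | 6.04 | 5.00
Worst Case Dilation (L1) | ∞ | 9 | 8 | 10.66 | 12.00 | 9.00
Worst Case Area | ∞ | 2.40 | 3.00 | 2.00 | 3.05 | 2.22
Average Box Area | 2.86 | 1.41 | 1.69 | 1.42 | 1.47 | 1.40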
[1] Haverkort, Walderveen Locality and Bounding-Box Quality of Two-
Dimensional Space-Filling Curves 2008 arXiv:0806.4787v2
CURVE SELECTION : LOCALITY
12. • Place a grid on the globe (dotted lines).
• Connect all the points on the grid with a Hilbert SFC.
• The curve provides a linear ordering over two-dimensional space.
• A bounding box is defined by the set of ranges covered by the Hilbert SFC.
HILBERT CURVE MAPPING IN 2D: THE GLOBE
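A minimal sketch of the two-way mapping between grid cells and positions on a Hilbert curve, using the widely published textbook bit-manipulation algorithm. This is not GeoWave's implementation (the bibliography points to a compact Hilbert index library), and the curve orientation here may differ from the figures; the function names are illustrative.

```python
def _rotate(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the sub-curve is oriented correctly."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y


def xy2d(n, x, y):
    """Cell (x, y) on an n x n grid (n a power of two) -> curve position."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d


def d2xy(n, d):
    """Curve position -> cell (x, y) on an n x n grid."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = _rotate(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


n = 16
path = [d2xy(n, d) for d in range(n * n)]
print(path[:4])
```

Consecutive positions on the curve are always neighbouring cells, which is what gives nearby locations nearby row identifiers.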
13. • Precision is determined by the ‘depth’ of the curve. In this example the globe is defined by a 16x16 grid.
• Resolution is 22.5 degrees longitude and 11.25 degrees latitude per cell (360/16 and 180/16).
• Each elbow (discrete point) in the Hilbert SFC maps to a grid cell.
• The precision of the Hilbert SFC, defined in terms of the number of bits, determines the grid. Thus, more bits equate to finer-grained cells.
HILBERT CURVE PRECISION
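The relationship between bits of precision and cell size can be checked with a few lines of arithmetic. The sketch assumes longitude spans 360 degrees and latitude 180 degrees, with the same number of bits per dimension; `cell_size_degrees` is an illustrative name.

```python
def cell_size_degrees(bits_per_dimension):
    """Return (longitude, latitude) degrees covered by one grid cell."""
    cells = 2 ** bits_per_dimension  # cells per dimension
    return 360.0 / cells, 180.0 / cells


for bits in (4, 8, 16):
    lon, lat = cell_size_degrees(bits)
    print(f"{bits} bits: {2 ** bits} x {2 ** bits} grid, "
          f"{lon} deg lon x {lat} deg lat per cell")
```

Four bits per dimension gives the 16x16 grid of the example; every additional bit halves the cell size in each dimension.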
14. Recursively decompose the Hilbert region to find only those covered regions that overlap the query box.
The figure depicts a third-order (2^3 “buckets” per dimension) Hilbert curve in 2D.
This forms a quad-tree view over the data.
Each pair of bits, from most significant to least significant, represents a “quadrant.”
[Figure: nested quadrants labeled 00, 01, 10, 11 at each level of the decomposition]
Hilbert Index (52) = 11 01 00
RECURSIVE DECOMPOSITION : TWO DIMENSION EXAMPLE
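The quad-tree reading of an index can be sketched by splitting it into two-bit groups. The helpers below are illustrative: `quadrant_path` reproduces the 52 = 11 01 00 example, and `prefix_range` shows that every quadrant prefix corresponds to one contiguous block of curve positions.

```python
def quadrant_path(index, order):
    """Split a curve index into two-bit quadrant labels, coarsest first."""
    return [format((index >> (2 * level)) & 0b11, "02b")
            for level in range(order - 1, -1, -1)]


def prefix_range(prefix_bits, order):
    """Inclusive index range covered by a quadrant prefix (list of labels)."""
    remaining = order - len(prefix_bits)      # levels below the prefix
    prefix = int("".join(prefix_bits), 2) if prefix_bits else 0
    start = prefix << (2 * remaining)
    return start, start + 4 ** remaining - 1


print(quadrant_path(52, 3))
print(prefix_range(["11"], 3))
print(prefix_range(["11", "01"], 3))
```

Because a shared prefix means a shared quadrant, a whole quadrant can be fetched with a single range scan.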
15. Bounding box over grid cells (2,9) (lower left) to (5,13) (upper right).
Decompose the cells intersecting the bounding box, shown in blue.
The range decomposes to three (color-coded) ranges:
• 70 -> 75
• 92 -> 99
• 116 -> 121
Note: The bounding box from a geospatial query window does not necessarily “snap” perfectly to the grid cells (e.g., 6.2, 8.8 instead of 6, 9). The bounding box is expanded to encompass all intersecting cells.
DECODE THE BOUNDING BOX: RANGE DECOMPOSITION
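A sketch of range decomposition by recursive quadrant splitting, under the assumption of a textbook Hilbert mapping on a 16x16 grid. It stops descending as soon as a quadrant lies fully inside the box. The concrete ranges depend on the curve's orientation, so they need not match the 70->75, 92->99, 116->121 values in the figure; only the set of covered cells is guaranteed to agree. All function names are illustrative.

```python
def xy2d(n, x, y):
    """Cell (x, y) on an n x n grid (n a power of two) -> Hilbert index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip into the sub-curve's orientation
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def decompose(n, box, x0=0, y0=0, size=None):
    """Return index ranges for quadrants covering box = (xmin, ymin, xmax, ymax)."""
    size = n if size is None else size
    xmin, ymin, xmax, ymax = box
    x1, y1 = x0 + size - 1, y0 + size - 1
    if x0 > xmax or x1 < xmin or y0 > ymax or y1 < ymin:
        return []                      # quadrant misses the box entirely
    if xmin <= x0 and x1 <= xmax and ymin <= y0 and y1 <= ymax:
        block = size * size            # an aligned quadrant is one index block
        start = xy2d(n, x0, y0) // block * block
        return [(start, start + block - 1)]
    half = size // 2
    ranges = []
    for dx in (0, half):
        for dy in (0, half):
            ranges.extend(decompose(n, box, x0 + dx, y0 + dy, half))
    return ranges


def merge_adjacent(ranges):
    """Sort ranges and fuse those that touch, minimizing the number of scans."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start == merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


ranges = merge_adjacent(decompose(16, (2, 9, 5, 13)))
print(ranges)
```

The box covers 4 x 5 = 20 cells, the same total as the three ranges listed on the slide (6 + 8 + 6).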
16. Here we see the query
range fully decomposed
into the underlying
“quadrants.”
Decomposition stops
when the query window
fully contains the quad.
(See segment 3 and
segment 8)
RANGE DECOMPOSITION OPTIMIZATION
17. INTERVALS: POLYGONS AND MULTI-POLYGONS
An entry is duplicated for each hyper-rectangle that intersects the interval.
The polygon covers 66 cells in the example, so it is stored as 66 duplicates, one per cell.
Duplicates are removed at query time: de-duplication is applied in an Accumulo iterator as well as client-side.
A query is defined by a range per dimension (a bounding rectangle in 2D).
18. INTERVALS: POLYGONS AND MULTI-POLYGONS
High-resolution curves force an excessive number of duplicates for large intervals.
Consider a high-resolution 2D curve (2^31 x 2^31) and a large polygon such as the Pacific Ocean.
The Pacific Ocean covers ~33% of the Earth's surface, which amplifies to ~1.5 quintillion duplicate entries.
Solution: Tiered Indexing[8]
• Each tier has a resolution of 2^n x 2^n, where n is the tier number. Thus, each lower tier increases the resolution by a factor of two per dimension.
• Polygons are stored in the lowest tier possible that minimizes the number of duplicates.
• Example: Blue polygon indexed in tier 2; red polygon indexed in tier 3.
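The duplicate count and one possible tier-selection rule can be sketched as follows. The rule used here (pick the finest tier at which a shape's bounding box covers no more than a fixed number of cells) is an assumption for illustration, as are the function names and the use of coordinates normalized to [0, 1); it is not taken from GeoWave's code.

```python
def duplicates_for_fraction(bits, fraction):
    """Cells (hence duplicates) touched by a shape covering `fraction` of a
    2^bits x 2^bits grid."""
    return fraction * (2 ** bits) ** 2


def cells_covered(tier, xmin, ymin, xmax, ymax):
    """Cells of a 2^tier x 2^tier grid touched by a normalized bounding box."""
    cells = 2 ** tier
    nx = min(cells - 1, int(xmax * cells)) - int(xmin * cells) + 1
    ny = min(cells - 1, int(ymax * cells)) - int(ymin * cells) + 1
    return nx * ny


def choose_tier(box, max_tier, max_duplicates=4):
    """Finest tier whose cell count for `box` stays within the cap."""
    for tier in range(max_tier, -1, -1):
        if cells_covered(tier, *box) <= max_duplicates:
            return tier
    return 0


print(f"{duplicates_for_fraction(31, 0.33):.2e}")   # Pacific Ocean example
print(choose_tier((0.1, 0.1, 0.2, 0.2), max_tier=10))
print(choose_tier((0.5, 0.5, 0.5001, 0.5001), max_tier=10))
```

A shape spanning a tenth of the world per axis lands in a coarse tier, while a tiny shape can be stored at the finest tier without multiplying its entries.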
19. TIERS: QUERY REGIONS WITH FALSE POSITIVES
Balance an acceptable number of duplicates against the false positives caused by the lower granularity of higher tiers.
Consider the query region in orange. It does not intersect either polygon. However, it does intersect shared quadrants at the respective tiers for both shapes. Thus, more rows are filtered during the range scan.
Without tiers, using a higher resolution, this false positive does not occur. However, consider that, for a resolution of 10 (e.g., 2^10), hundreds of duplicates occur.
20. TIERS: WORST CASE
Cap the number of duplicates by choosing an appropriate tier.
Our analysis indicates that an optimal number of duplicates is represented by 2^d, where d is the number of dimensions (i.e., in 2 dimensions, cap at 4).
Consider the worst case: a small square polygon centered on the inner intersecting boundary (example polygon in red). Regardless of size, there are always four duplicates at all tiers except the 2^0 tier (the orange box, representing the entire world).
21. UNBOUNDED DIMENSION: TIME
To normalize real-world values to fit on a space-filling curve, the sample space must be bounded.
Solution: Binning
• A bin represents a period for EACH
dimension. For example, a
periodicity of a year can be used
for time.
• Each bin covers its own Hilbert
space.
• Entries that contain ranges may
span multiple bins resulting in
duplicates.
• The Bin ID is part of row identifier.
1997 1998 1999
A single bin for an unbounded dimension :
[min + (period * period duration),
min + ((period+1) * period duration))
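The bin formula above can be exercised with a short sketch. Time is expressed in fractional years with `minimum = 1997` and a one-year period, matching the 1997/1998/1999 figure; the function names are illustrative.

```python
import math


def bin_id(value, minimum, duration):
    """Index of the period that contains `value`."""
    return math.floor((value - minimum) / duration)


def bin_bounds(period, minimum, duration):
    """Half-open interval [start, end) covered by bin `period`."""
    return minimum + period * duration, minimum + (period + 1) * duration


def bins_spanned(start, end, minimum, duration):
    """All bins touched by an interval; each one receives a duplicate entry."""
    return list(range(bin_id(start, minimum, duration),
                      bin_id(end, minimum, duration) + 1))


print(bin_id(1998.5, 1997, 1))               # a point in mid-1998
print(bin_bounds(1, 1997, 1))                # the bin that holds it
print(bins_spanned(1997.9, 1999.1, 1997, 1))  # a range crossing two boundaries
```

Within a bin, the value's offset from the bin start is bounded by the period duration, so it can be normalized onto that bin's own curve.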
22. BIN: VARIABILITY OVER DIMENSIONS
Time
Elevation
Velocity
Each Bin is a
hyper-rectangle
representing
ranges of data
labeled by points
on a Hilbert
curve.
Bounded dimensions assume a single Bin.
For example, Latitude and Longitude.
25. MAP OCCLUSION CULLING
At a given zoom level, each pixel represents a range in degrees.
When scanning the data, only one entry is needed within each pixel range. The rest of the entries can be skipped.
The block identified in red represents many data points, but is rendered by only 9 pixels.
26. MAP OCCLUSION CULLING: ITERATORS
The Accumulo iterator starts at the first pixel, scans until it hits a geometry, then skips to the next pixel.
[Figure: database data (entries 1 to 4) mapped to displayed pixels. Labels: “Scan to the first pixel”; “Seek to the beginning of the next pixel”; “The rendering engine receives only these points”; “Points that were all skipped”.]
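The scan-then-seek behaviour can be simulated over a sorted list of keys. This is a simplified model, not Accumulo iterator code: it assumes each pixel owns a fixed-size, contiguous block of `keys_per_pixel` curve positions, and it uses a binary search to stand in for the iterator's seek.

```python
from bisect import bisect_left


def first_entry_per_pixel(sorted_keys, keys_per_pixel):
    """Keep one entry per pixel, seeking past the rest of each pixel's keys."""
    kept = []
    i = 0
    while i < len(sorted_keys):
        key = sorted_keys[i]
        kept.append(key)                      # first geometry hit in this pixel
        next_pixel_start = (key // keys_per_pixel + 1) * keys_per_pixel
        # "Seek" to the first stored key at or after the next pixel boundary.
        i = bisect_left(sorted_keys, next_pixel_start, i + 1)
    return kept


stored = [0, 1, 2, 5, 9, 10, 11, 17]
print(first_entry_per_pixel(stored, keys_per_pixel=4))
```

Eight stored entries reduce to four results, one for each occupied pixel, and the skipped entries are never sent to the renderer.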
27. DISTRIBUTED RENDERING
[Figure: a Map Request (Layer, Style) goes to GeoServer (GeoWave Plugin), which scans Accumulo (GeoWave Iterators) and returns the Rendered Map as the Map Response.]
Each scan result is an image with the data in the range.
All resultant images are composited together.
29. RASTER DATA PERSISTENCE MODEL
[Figure labels: SFC Curve Hierarchy; SFC Value is effectively a Tile ID; Coverage Name; Image Data Buffer + Image Metadata]
Image Metadata is customizable. The default is to store “no data” values.
Tiles are unique; ignore duplication.
Unique name for global coverage.
30. RASTER DATA: GRID COVERAGE
Tiled, each “cell” fit to boundary.
“No Data” values must be maintained.
Multi-band, more than just RGB.
31. RASTER DATA: ADVANCED OPTIONS
Histogram Equalization [10]
Image Pyramid [11]
Tile Merge Strategy: tiles t1, t2, t3, ..., tn are merged pairwise, e.g. f(f(t1, t2), t3) = final.
[Figure: each tile value holds an Image Data Buffer and Metadata. Custom data per tile, in scope for f(x).]
32. STATISTICS: STRUCTURE
Statistics infrastructure supports summary data.
Currently, each row ID includes adapter ID and a statistics ID.
Current statistics types include population bounding boxes, counts and ranges.
[Figure: entry layout. Key: Row ID (Adapter ID, Statistic ID), Column (Family “STATS”, Qualifier, Visibility), Time. Value. Annotations: “Matches represented data”; “Attribute Name & Statistic Type”.]
33. STATISTICS: COMBINER
Statistic ID | Value | Adapter ID | Family | Qualifier | Visibility | Time
“Count” | 30 | 0xA43E | “STATS” | | A&B | 2
“Count” | 60 | 0xA43E | “STATS” | | A&C | 4
“Count” | 20 | 0xA43E | “STATS” | | A&B | 7
MERGE
“Count” | 50 | 0xA43E | “STATS” | | A&B | 9
BBOX: Grow Envelope to Minimum and Maximum corners.
RANGE: Minimum and Maximum
HISTOGRAM: Update bins from coverage over raster image
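The merge rules can be sketched as plain associative, commutative functions, which is what lets a combiner apply them in any order. The example reads the slide's flattened rows as counts 30, 60 and 20 for adapter 0xA43E, which is an interpretation of the table. The code is a stand-alone illustration and does not use the Accumulo Combiner API; all names are hypothetical.

```python
def merge_count(a, b):
    """COUNT: add the partial counts."""
    return a + b


def merge_range(a, b):
    """RANGE: keep the overall (minimum, maximum)."""
    return min(a[0], b[0]), max(a[1], b[1])


def merge_bbox(a, b):
    """BBOX: grow the envelope (minx, miny, maxx, maxy) to cover both."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def combine(rows, merge):
    """Merge values that share a visibility label, as a combiner would for
    entries with the same key."""
    merged = {}
    for visibility, value in rows:
        if visibility in merged:
            merged[visibility] = merge(merged[visibility], value)
        else:
            merged[visibility] = value
    return merged


counts = combine([("A&B", 30), ("A&C", 60), ("A&B", 20)], merge_count)
print(counts)
print(merge_bbox((0, 0, 1, 1), (-1, 0.5, 0.5, 2)))
```

Rows with different visibility labels stay separate here; they are only merged at query time, once the reader's authorizations are known.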
34. STATISTICS: TRANSFORMATION ITERATOR
Statistic ID | Value | Adapter ID | Family | Qualifier | Visibility | Time
“Count” | 50 | 0xA43E | “STATS” | | A&B | 9
“Count” | 60 | 0xA43E | “STATS” | | A&C | 4
MERGE
“Count” | 110 | 0xA43E | “STATS” | | A&B&C | 9
Query authorization may authorize multiple rows.
Query with authorization A, B & C
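A sketch of the query-time merge: rows whose visibility the caller satisfies are summed, and the result carries the union of their labels. It reads the slide's rows as counts 50 (A&B) and 60 (A&C). The visibility check handles only `&` conjunctions, a deliberate simplification of Accumulo's visibility expressions, and the function names are illustrative.

```python
def is_visible(expression, authorizations):
    """True if every term of an '&'-only expression is authorized."""
    return all(term in authorizations for term in expression.split("&"))


def merge_for_query(rows, authorizations):
    """Sum the counts the caller may see; return (merged visibility, total)."""
    total = 0
    terms = set()
    for visibility, count in rows:
        if is_visible(visibility, authorizations):
            total += count
            terms.update(visibility.split("&"))
    return "&".join(sorted(terms)), total


rows = [("A&B", 50), ("A&C", 60)]
print(merge_for_query(rows, {"A", "B", "C"}))
print(merge_for_query(rows, {"A", "B"}))
```

A caller holding A, B and C sees a count of 110, while a caller holding only A and B sees 50, so the statistic never reveals data the caller could not read directly.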
35. WFS-T[12] TRANSACTIONS: ISOLATION
• Problem: Isolation of updates and new records until commit.
• Solution:
– Use a managed set of transaction identifiers as authorization tags. A single transaction places an authorization tag in all new entries.
– Upon commit, the authorization tag is removed using a transforming iterator.
[Figure: an entry tagged “Role1, role2, tx123” becomes “Role1, role2” after Commit.]
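The isolation scheme can be modelled in a few lines: an uncommitted entry carries the transaction tag among its required labels, so only a session holding that tag can read it, and commit strips the tag. This is a toy model with hypothetical names, not WFS-T or Accumulo code.

```python
def can_read(required_tags, authorizations):
    """An entry is readable when the reader holds every required tag."""
    return set(required_tags) <= set(authorizations)


def commit(entries, tx_id):
    """Remove the transaction tag, as the transforming iterator would."""
    return [(key, [tag for tag in tags if tag != tx_id])
            for key, tags in entries]


entries = [("feature-1", ["role1", "role2", "tx123"])]
other_reader = {"role1", "role2"}
transaction_owner = {"role1", "role2", "tx123"}

before = [can_read(tags, other_reader) for _, tags in entries]
owner_view = [can_read(tags, transaction_owner) for _, tags in entries]
after = [can_read(tags, other_reader) for _, tags in commit(entries, "tx123")]
print(before, owner_view, after)
```

Before commit only the transaction owner sees the new entry; afterwards any reader with role1 and role2 does.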
37. Microsoft GeoLife
Microsoft Research has made available a trajectory data set that contains the GPS coordinates of 182 users over a period of more than five years (April 2007 to August 2012).
There are 17,621 trajectories in this data set, with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours, recorded by GPS loggers and GPS phones that often sample every 1-5 seconds or every 5-10 meters.
http://research.microsoft.com/jump/131675
42. OSM – Planet GPX dump
Every track ever uploaded to OpenStreetMap
Complete data attribution
2.9 billion spatial entities (points)
https://blog.openstreetmap.org/2013/04/12/bulk-gpx-track-data/
49. BIBLIOGRAPHY
[1] Haverkort, Walderveen. Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves. 2008. arXiv:0806.4787v2
[2] Hamilton, Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. 2008. Information Processing Letters 105 (155-163)
[3] Hayes. Crinkly Curves. 2013. American Scientist 100-3 (178). DOI: 10.1511/2013.102.1
[4] Skilling. Programming the Hilbert Curve. Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 23rd Workshop Proceedings. 2004. American Institute of Physics 0-7354-0182-9/04
[5] Wikipedia. Well-known_binary. http://en.Wikipedia.org/wiki/Well-known_binary 2013
[6] Wikipedia. Hilbert curve. http://en.wikipedia.org/wiki/Hilbert_curve 2013
[7] Aioanei. Uzaygezen: Compact Hilbert Index implementation in Java. http://code.google.com/p/uzaygezen/ 2008. Google Inc.
[8] Surratt, Boyd, Russelavage. Z-Value Curve Index Evaluation. 2012. Internal Presentation.
[9] Open Geospatial Consortium. Standard List. http://www.opengeospatial.org/standards/is
[10] Remote Sensed Image Processing on Grids for Training in Earth Observation. http://www.intechopen.com/source/html/6674/media/image3.jpeg
[11] OSGeo Wiki. http://wiki.osgeo.org/images/thumb/d/d0/Pyramid.jpg/286px-Pyramid.jpg
[12] WFS-T (http://www.opengeospatial.org/standards/wfs)
Editor's Notes
Expect velocity in the beginning.
Slow down during fundamentals and even slower during the Accumulo-specific material.
Thinking no more than 2 minutes a slide up to fundamentals.
Shift Focus to two dimensions
TBD: Change examples from red to other color and only show four, not six
Three dimensions a box, etc.
Per side ~ side of box. In this example, lowest tier is an 8X8 grid
Use Polygons and Multi-Polygons as an example of intervals
Special focus on generality…without specifics
TO DO: Fix graphic
Index ID…if added and slide not updated,
Reason for Index ID:
Keep index/adapter statistics…all measures missing components…a like a partialed file indx
Eventually to support cost based optimizer, picking the best index
Non-idempotent operations such as count
Statistics: Commutative Monoids
Recall: the Combiner occurs before the Versioner.
A semigroup M is a nonempty set equipped with a binary operation, which is required (only!) to be associative.
An element I of a semigroup M is said to be an identity if for all x ∈ M, Ix = xI = x.
A semigroup can have at most one identity.
Definition: A monoid is a semigroup with an identity element.
A semigroup M is commutative if x · y = y · x for all x, y ∈ M.
Dropped slide on statistics future so mention
Bloom filters, hyper log log, bin stats, etc.