The algorithm operates as follows:
1. Create a set S = {s1, s2, …, sn} representing each office space.
2. Create a kd-tree T using the set S.
3. While S is not empty:
a) Pop point s from S and compute the radius r around it containing the nearest 50 neighboring
buildings using a pre-built SciPy KDTree and starting maximum distance d = 0.082° ≅ 9 km.
b) Using T, find all spaces within r and add them to a new cluster, and remove them from S.
c) Merge the new cluster into an existing cluster, if there is overlap between them.
4. If the number of clusters is greater than k, recursively perform Steps 3-4 with the original set S
and 2d as the new maximum distance. Otherwise, merge intersecting clusters and compute a
weighted centroid and radius for each cluster.
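The steps above can be sketched roughly as follows. This is a minimal illustration, not the poster's implementation: a brute-force distance scan stands in for the kd-tree T, latitude/longitude pairs are treated as planar coordinates in degrees, and the names `cluster_spaces`, `neighbors`, and `max_depth` are hypothetical. The weighted centroid and radius from Step 4 are omitted.

```python
import math

def cluster_spaces(points, k, d=0.082, neighbors=50, max_depth=26):
    """Group (lat, lon) points into at most k clusters of point indices.

    Assumption: a brute-force scan replaces the kd-tree lookup, and
    distances are plain Euclidean distances measured in degrees.
    """
    remaining = set(range(len(points)))  # the set S, as indices
    clusters = []                        # each cluster is a set of indices

    while remaining:
        s = remaining.pop()
        # Sorted distances from s to every other point.
        dists = sorted(
            math.dist(points[s], p) for i, p in enumerate(points) if i != s
        )
        if len(dists) >= neighbors:
            r = dists[neighbors - 1]     # radius holding the nearest neighbors
        else:
            r = dists[-1] if dists else 0.0
        r = min(r, d)                    # never exceed the maximum distance d

        # Step 3b: everything within r of s joins the new cluster.
        members = {
            i for i, p in enumerate(points) if math.dist(points[s], p) <= r
        }
        remaining -= members

        # Step 3c: merge with any existing cluster that shares a point.
        for existing in [c for c in clusters if c & members]:
            members |= existing
            clusters.remove(existing)
        clusters.append(members)

    # Step 4: too many clusters, so retry from scratch with doubled d.
    if len(clusters) > k and max_depth > 0:
        return cluster_spaces(points, k, 2 * d, neighbors, max_depth - 1)
    return clusters


# Hypothetical coordinates: two tight groups far apart.
demo_points = [(0.0, 0.0), (0.01, 0.0), (0.02, 0.0), (10.0, 10.0), (10.01, 10.0)]
demo_clusters = cluster_spaces(demo_points, k=2)
```

With these sample points the first pass already yields two groups, so no doubling is needed; asking for `k=1` forces the radius to double until a single cluster covers both groups.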
Creating a Real-Time Recommendation Engine using Modified K-Means
Clustering and Remote Sensing Signature Matching Algorithms
Abstract
Built on Google App Engine, RealMassive encountered challenges
while attempting to scale its recommendation engine to match a
14% week-over-week increase of data. To address this problem of
scale, we applied techniques from spectral data processing to
transform our domain-specific problem. The result: a quantitative
solution to a qualitative problem that can match the skill of
domain experts while operating in sub-second time.
David Lippa*,†
Jason Vertrees**,†
Background
Spectral analysis algorithms provide one way to quantify similarity when comparing a data collection
against a known signature. This process, material identification [3, 6, 9], is quite literally finding a
needle in a pixelated haystack. One such algorithm, Spectral Angle Mapper, treats each pixel as an
n-dimensional vector and computes the angle between a pixel and a signature using the definition of
the dot product: A · B = ||A|| ||B|| cos θ. Similarity increases as |θ| approaches 0. 10° is a typical
upper threshold. Negative angles are valid in spectral datasets, but not in our case, since values are
always positive.
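As an illustration of the dot-product definition above, the angle can be recovered by solving for θ. The following is a minimal sketch in plain Python; the function name `spectral_angle` and the sample vectors are assumptions made for this example, not part of the original system.

```python
import math

def spectral_angle(a, b):
    """Return the angle in degrees between two equal-length vectors.

    Derived from A · B = ||A|| ||B|| cos θ, so θ = acos(A · B / (||A|| ||B||)).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("the angle is undefined for a zero-length vector")
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))


# Hypothetical vectors: perpendicular axes and a scaled copy.
right_angle = spectral_angle((1.0, 0.0), (0.0, 1.0))
same_direction = spectral_angle((1.0, 1.0, 1.0), (2.0, 2.0, 2.0))
```

Perpendicular vectors give 90°, while vectors pointing in the same direction give an angle of approximately 0°, well inside the typical 10° threshold.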
To remap the problem, we:
● Treat the list of potential candidates as “pixels” of a
spectral data cube.
● Create a library of “signature” vectors.
● Cluster using a stripped-down version of SciPy's
kdtree.py, since Google App Engine prohibits
execution of native code in 3rd-party libraries [2].
● Use independent object attributes for vector
components, such as cost, size, number of
parking spaces, etc.
● Avoid ratios and dependent variables.
● Aggregate each cluster's vector components to
produce a “signature.”
● Sort the results in ascending order by θ.
This solution results in a quantifiable, accurate, and
flexible measurement of similarity.
Phase 1: Clustering
K-means clustering is one of the best-known
methods for breaking up n data points into k
discrete clusters. While it is easy to implement
and fast in practice, worst-case scenarios may
arise in certain unusual data conditions [8].
To mitigate this, we exploit known attributes
of the data: limited overlap between data
points, since they exist physically in
3-dimensional space; limited data range, since
the data is clustered by latitude and longitude;
and related data that can be used to improve
estimation of the initial cluster sizes.
Results
Since its inception, the new recommendation service has provided more than 302,925 recommendations
in sub-second time. With each call, it sifts through over 80,000 spaces and has handled a workload of
18,327 requests per work day and 6,188 per hour. The result was the product of just 3 weeks of
implementation time, from design to production.
In the future, we will add refinements to the clustering algorithm to consider client-specific needs and other
related data sets. We can also improve the matching algorithm by applying a cosine rule or Euclidean
distance calculation to prevent an extreme case of collinearity, such as the vectors (1, 1, 1) and
(1000, 1000, 1000), showing as a perfect match.
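The collinearity case is easy to reproduce. The sketch below, using hypothetical helper names, shows that the two vectors from the text have a cosine of 1 (an angle of 0) even though they are far apart by Euclidean distance, which is why a distance check could serve as a tie-breaker.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


small = (1, 1, 1)
large = (1000, 1000, 1000)

# Same direction, so the angle-based score cannot tell them apart.
similarity = cosine_similarity(small, large)
# The magnitude difference is obvious once distance is considered.
distance = euclidean_distance(small, large)
```

Here `similarity` is 1 up to rounding, while `distance` is 999·√3, roughly 1730.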
Summary
Google App Engine provides a powerful search engine in a scalable infrastructure. It can be customized to
address new problems outside of typical keyword searching. To address our problem of pattern matching
in commercial real estate, we created a new scalable, domain-specific recommendation engine. We
borrowed techniques from the field of remote sensing, while also taking advantage of constraints and
satisficing over optimizing to overcome our rapid data growth and the restrictions of Google App Engine.
* david.lippa@realmassive.com
** jason.vertrees@realmassive.com
† RealMassive, Inc. 1717 West 6th St. Austin, TX 78703
+ This data cube measures 614 x 512 pixels x 224 bands spanning the entire
visible, near-infrared, and short-wave infrared spectrum. Visualizations provided
by the open-source Opticks remote sensing toolkit [4].
References:
1. AVIRIS Home page. (2015, June 26). Retrieved from http://aviris.jpl.nasa.gov/data/free_data.html
2. Google. (2015, June 11). Google App Engine for Python 1.9.21 Documentation.
Retrieved from https://cloud.google.com/appengine/docs/python
3. Landgrebe, David A (2005). Signal Theory Methods in Multispectral Remote Sensing. Hoboken, NJ: John Wiley & Sons.
4. Opticks. (2015, June 26). Opticks remote sensing toolkit. Retrieved from https://opticks.org
5. RealMassive. (2015, June 10). Retrieved from https://www.realmassive.com
Method
There are 3 phases needed to overcome constraints imposed by App Engine [2]:
● Cluster user inputs into “signatures” to reduce the length of query strings and sort expressions.
● Apply fixed filters to limit search results to within the 10,000 hit sort limit.
● Score results by signature match to override the default search-term relevance score.
Doubling the initial radius results in an absolute maximum of 26 recursive calls for an overall
asymptotic complexity of O(2n log₂ n). This never happens in practice due to low building density.
The final result is similar to the representation of clusters in Figure 3 [5]. Once the spaces have
been clustered, it is trivial to compute the average of each vector component to produce each
cluster’s signature.
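The per-component averaging can be sketched in a few lines. The function name `cluster_signature` and the attribute order (cost, size, parking spaces) are assumptions chosen for this example, and the sample values are made up.

```python
def cluster_signature(vectors):
    """Average each component across a cluster's attribute vectors."""
    if not vectors:
        raise ValueError("a cluster needs at least one vector")
    count = len(vectors)
    # zip(*vectors) regroups the data by component: all costs, all sizes, ...
    return tuple(sum(component) / count for component in zip(*vectors))


# Hypothetical cluster of two spaces: (cost, size, parking spaces).
cluster = [
    (20.0, 1000.0, 4.0),
    (30.0, 3000.0, 8.0),
]
signature = cluster_signature(cluster)
```

For this cluster the signature is (25.0, 2000.0, 6.0), a single vector that can stand in for all of the cluster's members when scoring candidates.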
Figure 3: Clustering 50 spaces from across the US
Figure 2: Graphic representation of hyper-spectral data [7]
Figure 1: A Commercial Real Estate Survey with Recommendations
Phase 2: Filtering
Next, we apply fixed filters informed by domain expertise. For commercial real estate, this includes the
building type (e.g. "office", "industrial", etc.), location, and any necessary exclusions. These constraints
produce a reasonable subset that can be matched against signatures.
Figure 4: AVIRIS data courtesy NASA/JPL-Caltech,
showing a signature match [1] +
6. M. Richmond. Licensed under Creative Commons. Retrieved from http://spiff.rit.edu/classes/phys301/lectures/comp/comp.html
7. N Short, Sr. Graphic representation of hyperspectral data. Licensed under Creative Commons.
Retrieved from http://rst.gsfc.nasa.gov/
8. A. Vattani. K-means Requires Exponentially Many Iterations Even in the Plane, Discrete Comput Geom. 45(4): 596–616. 2011.
9. H. Zhang, Y. Lan, R. Lacey, W. Hoffmann, Y. Huang. Analysis of vegetation indices derived from aerial multispectral and ground
hyperspectral data, International Journal of Agricultural and Biological Engineering. 2(3): 33. 2009.
Acknowledgments
The authors would like to thank Fatih Akici, John Leonard, Natalya Shelburne, and Michael Westgate for their suggestions for this poster.
Phase 3: Sort by Angle
Executing the Spectral Angle Mapper algorithm
on a reduced dataset of 10,000 items
equates to performing material identification
on a 115 x 87 pixel x 3-band data cube from
a multi-spectral sensor, or 3% of the
computations required for a small data cube, such
as Figure 4. Google App Engine can quickly
perform calculations in-place on search
results, but it lacks the inverse cosine function [2].
Our solution uses the cosine ratio as a proxy
for the angle: sorting by the cosine ratio in
descending order is equivalent to sorting by
the angle in ascending order to find the most
similar matches.
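The equivalence holds because the inverse cosine is strictly decreasing on [-1, 1]: a larger cosine always means a smaller angle. A minimal sketch in plain Python follows; it does not use the App Engine Search API, and the names `cosine_ratio` and `rank_by_signature` as well as the sample vectors are hypothetical.

```python
import math

def cosine_ratio(a, b):
    """cos θ = A · B / (||A|| ||B||) for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def rank_by_signature(candidates, signature):
    """Order candidates from most to least similar without calling acos."""
    return sorted(
        candidates,
        key=lambda candidate: cosine_ratio(candidate, signature),
        reverse=True,  # descending cosine is ascending angle
    )


signature = (1.0, 1.0)
candidates = [(1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
ranked = rank_by_signature(candidates, signature)

# Cross-check against an explicit sort by angle, using acos where available.
by_angle = sorted(
    candidates,
    key=lambda c: math.acos(max(-1.0, min(1.0, cosine_ratio(c, signature)))),
)
```

Both orderings agree: the candidate identical to the signature comes first, followed by (2.0, 1.0) and then (1.0, 0.0).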

More Related Content

What's hot

Presentation of my master thesis
Presentation of my master thesisPresentation of my master thesis
Presentation of my master thesis
MichaelRra
 
Big Data and Geospatial with HPCC Systems
Big Data and Geospatial with HPCC SystemsBig Data and Geospatial with HPCC Systems
Big Data and Geospatial with HPCC Systems
HPCC Systems
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
Subhas Kumar Ghosh
 
Spark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W uSpark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark Summit
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
Yoav chernobroda
 
OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3
Robert Grossman
 
Spark algorithms
Spark algorithmsSpark algorithms
Spark algorithms
Ashutosh Trivedi
 
An Efficient Cluster Tree Based Data Collection Scheme for Large Mobile With ...
An Efficient Cluster Tree Based Data Collection Scheme for Large Mobile With ...An Efficient Cluster Tree Based Data Collection Scheme for Large Mobile With ...
An Efficient Cluster Tree Based Data Collection Scheme for Large Mobile With ...
kavitha.s kavi
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
Albert Bifet
 
Analysis_of_Remote_Sensing_Quantitative_Inversion_in_Cloud_Computing.ppt
Analysis_of_Remote_Sensing_Quantitative_Inversion_in_Cloud_Computing.pptAnalysis_of_Remote_Sensing_Quantitative_Inversion_in_Cloud_Computing.ppt
Analysis_of_Remote_Sensing_Quantitative_Inversion_in_Cloud_Computing.pptgrssieee
 
K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
Sarah Guido
 
How Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather EventsHow Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather Events
inside-BigData.com
 
Fast and Probvably Seedings for k-Means
Fast and Probvably Seedings for k-MeansFast and Probvably Seedings for k-Means
Fast and Probvably Seedings for k-Means
Kimikazu Kato
 
Using parallel hierarchical clustering to
Using parallel hierarchical clustering toUsing parallel hierarchical clustering to
Using parallel hierarchical clustering to
Biniam Behailu
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream Classifiers
Albert Bifet
 
Denis Reznik Data driven future
Denis Reznik Data driven futureDenis Reznik Data driven future
Denis Reznik Data driven future
Аліна Шепшелей
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
Spiros Oikonomakis
 

What's hot (18)

Presentation of my master thesis
Presentation of my master thesisPresentation of my master thesis
Presentation of my master thesis
 
Big Data and Geospatial with HPCC Systems
Big Data and Geospatial with HPCC SystemsBig Data and Geospatial with HPCC Systems
Big Data and Geospatial with HPCC Systems
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Improved k-means
Improved k-meansImproved k-means
Improved k-means
 
Spark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W uSpark for Behavioral Analytics Research: Spark Summit East talk by John W u
Spark for Behavioral Analytics Research: Spark Summit East talk by John W u
 
Probabilistic data structures
Probabilistic data structuresProbabilistic data structures
Probabilistic data structures
 
OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3
 
Spark algorithms
Spark algorithmsSpark algorithms
Spark algorithms
 
An Efficient Cluster Tree Based Data Collection Scheme for Large Mobile With ...
An Efficient Cluster Tree Based Data Collection Scheme for Large Mobile With ...An Efficient Cluster Tree Based Data Collection Scheme for Large Mobile With ...
An Efficient Cluster Tree Based Data Collection Scheme for Large Mobile With ...
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
 
Analysis_of_Remote_Sensing_Quantitative_Inversion_in_Cloud_Computing.ppt
Analysis_of_Remote_Sensing_Quantitative_Inversion_in_Cloud_Computing.pptAnalysis_of_Remote_Sensing_Quantitative_Inversion_in_Cloud_Computing.ppt
Analysis_of_Remote_Sensing_Quantitative_Inversion_in_Cloud_Computing.ppt
 
K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
 
How Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather EventsHow Deep Learning Could Predict Weather Events
How Deep Learning Could Predict Weather Events
 
Fast and Probvably Seedings for k-Means
Fast and Probvably Seedings for k-MeansFast and Probvably Seedings for k-Means
Fast and Probvably Seedings for k-Means
 
Using parallel hierarchical clustering to
Using parallel hierarchical clustering toUsing parallel hierarchical clustering to
Using parallel hierarchical clustering to
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream Classifiers
 
Denis Reznik Data driven future
Denis Reznik Data driven futureDenis Reznik Data driven future
Denis Reznik Data driven future
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
 

Similar to Development Infographic

PointNet
PointNetPointNet
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET Journal
 
Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2IAEME Publication
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016
ijcsbi
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
Robert Grossman
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
Benjamin Bengfort
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
Editor IJCATR
 
Scalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexingScalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexing
eSAT Journals
 
Scalable and efficient cluster based framework for
Scalable and efficient cluster based framework forScalable and efficient cluster based framework for
Scalable and efficient cluster based framework for
eSAT Publishing House
 
IRJET - Object Detection using Hausdorff Distance
IRJET -  	  Object Detection using Hausdorff DistanceIRJET -  	  Object Detection using Hausdorff Distance
IRJET - Object Detection using Hausdorff Distance
IRJET Journal
 
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET Journal
 
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSIONEFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
AM Publications,India
 
IRJET- Object Detection using Hausdorff Distance
IRJET-  	  Object Detection using Hausdorff DistanceIRJET-  	  Object Detection using Hausdorff Distance
IRJET- Object Detection using Hausdorff Distance
IRJET Journal
 
House price prediction
House price predictionHouse price prediction
House price prediction
SabahBegum
 
Fast top k path-based relevance query on massive graphs
Fast top k path-based relevance query on massive graphsFast top k path-based relevance query on massive graphs
Fast top k path-based relevance query on massive graphs
ieeechennai
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
IRJET Journal
 
C42011318
C42011318C42011318
C42011318
IJERA Editor
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
Sanjeev Mishra
 

Similar to Development Infographic (20)

PointNet
PointNetPointNet
PointNet
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 
Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2Dynamic approach to k means clustering algorithm-2
Dynamic approach to k means clustering algorithm-2
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
 
Scalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexingScalable and efficient cluster based framework for multidimensional indexing
Scalable and efficient cluster based framework for multidimensional indexing
 
Scalable and efficient cluster based framework for
Scalable and efficient cluster based framework forScalable and efficient cluster based framework for
Scalable and efficient cluster based framework for
 
Noura2
Noura2Noura2
Noura2
 
IRJET - Object Detection using Hausdorff Distance
IRJET -  	  Object Detection using Hausdorff DistanceIRJET -  	  Object Detection using Hausdorff Distance
IRJET - Object Detection using Hausdorff Distance
 
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
 
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSIONEFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
 
IRJET- Object Detection using Hausdorff Distance
IRJET-  	  Object Detection using Hausdorff DistanceIRJET-  	  Object Detection using Hausdorff Distance
IRJET- Object Detection using Hausdorff Distance
 
House price prediction
House price predictionHouse price prediction
House price prediction
 
Fast top k path-based relevance query on massive graphs
Fast top k path-based relevance query on massive graphsFast top k path-based relevance query on massive graphs
Fast top k path-based relevance query on massive graphs
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
 
C42011318
C42011318C42011318
C42011318
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
 

Recently uploaded

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
The algorithm operates as follows:
1. Create a set S = { s1, s2, …, sn } representing each office space.
2. Create a kd-tree T using the set S.
3. While S is not empty:
a) Pop point s from S and compute the radius r around it containing the nearest 50 neighboring buildings, using a pre-built SciPy KDTree and a starting maximum distance d = 0.082° ≅ 9 km.
b) Using T, find all spaces within r, add them to a new cluster, and remove them from S.
c) Merge the new cluster into an existing cluster, if there is overlap between them.
4. If the number of clusters is greater than k, recursively perform Steps 3-4 with the original set S and 2d as the new maximum distance. Otherwise, merge intersecting clusters and compute a weighted centroid and radius for each cluster.

Creating a Real-Time Recommendation Engine using Modified K-Means Clustering and Remote Sensing Signature Matching Algorithms

David Lippa*,† Jason Vertrees**,†

Abstract
Built on Google App Engine, RealMassive encountered challenges while attempting to scale its recommendation engine to match a 14% week-over-week increase of data. To address this problem of scale, we applied techniques from spectral data processing to transform our domain-specific problem. The result: a quantitative solution to a qualitative problem that can match the skill of domain experts while operating in sub-second time.

Background
Spectral analysis algorithms provide one way to quantify similarity when comparing a data collection against a known signature. This process, material identification [3,6,9], is quite literally finding a needle in a pixelated haystack. One such algorithm, Spectral Angle Mapper, treats each pixel as an n-dimensional vector, computing the angle between them using the definition of a dot product: A · B = ||A|| ||B|| cos θ. Similarity increases as |θ| approaches 0. 10° is a typical upper threshold. Negative angles are valid in spectral datasets, but not in our case, since values are always positive.
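As a minimal sketch of the angle computation described in the Background, the following Python derives θ from the dot-product definition and applies the 10° threshold. The function name `spectral_angle` and the sample vectors (cost, size, parking spaces) are illustrative assumptions, not values from the poster.

```python
import math

def spectral_angle(a, b):
    """Return the angle in degrees between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] so floating-point drift cannot break acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))

# Hypothetical signature and candidate: (cost, size, parking spaces).
signature = (25.0, 12000.0, 40.0)
candidate = (27.0, 11500.0, 35.0)

angle = spectral_angle(signature, candidate)
is_match = angle <= 10.0  # 10 degrees is the typical upper threshold
```

Because all components are positive, the cosine stays in [0, 1] and the angle stays between 0° and 90°.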
To remap the problem, we:
● Treat the list of potential candidates as “pixels” of a spectral data cube.
● Create a library of “signature” vectors.
● Cluster using a stripped-down version of SciPy's kdtree.py, since Google App Engine prohibits execution of native code in third-party libraries [2].
● Use independent object attributes for vector components, such as cost, size, and number of parking spaces.
● Avoid ratios and dependent variables.
● Aggregate each cluster's vector components to produce a “signature.”
● Sort the results in ascending order by θ.
This solution results in a quantifiable, accurate, and flexible measurement of similarity.

Phase 1: Clustering
K-means clustering is one of the best-known methods for breaking up n data points into k discrete clusters. While it is easy to implement and fast in practice, a few worst-case scenarios may arise under certain unusual data conditions [8]. To mitigate this, we exploit known attributes of the data: limited overlap between data points, since they exist physically in 3-dimensional space; limited data range, since the data is clustered by latitude and longitude; and related data that can be used to improve estimation of the initial cluster sizes.

Results
Since its inception, the new recommendation service has provided more than 302,925 recommendations in sub-second time. With each call, it sifts through over 80,000 spaces, and it has handled a workload of 18,327 requests per work day and 6,188 per hour. The result was the product of just 3 weeks of implementation time, from design to production. In the future, we will add refinements to the clustering algorithm to consider client-specific needs and other related data sets. We can also improve the matching algorithm by applying a cosine rule or Euclidean distance calculation to prevent an extreme case of collinearity, such as the vectors (1, 1, 1) and (1000, 1000, 1000), from showing as a perfect match.
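The collinearity case mentioned in the Results can be checked numerically. The sketch below is illustrative only (the helper names `cosine_ratio` and `euclidean_distance` are assumptions): the two vectors point in the same direction, so the angle-based score reports a perfect match, while the Euclidean distance shows how far apart they actually are.

```python
import math

def cosine_ratio(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

small = (1.0, 1.0, 1.0)
large = (1000.0, 1000.0, 1000.0)

cos_value = cosine_ratio(small, large)       # collinear, so effectively 1
distance = euclidean_distance(small, large)  # 999 * sqrt(3), far from 0
```

A secondary check on distance (or on vector magnitude) is one way to separate such candidates. Which check the authors intended is not stated in the poster.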
Summary
Google App Engine provides a powerful search engine in a scalable infrastructure. It can be customized to address new problems outside of typical keyword searching. To address our problem of pattern matching in commercial real estate, we created a new scalable, domain-specific recommendation engine. We borrowed techniques from the field of remote sensing, while also taking advantage of constraints and satisficing over optimizing to overcome our rapid data growth and the restrictions of Google App Engine.

* david.lippa@realmassive.com
** jason.vertrees@realmassive.com
† RealMassive, Inc. 1717 West 6th St. Austin, TX 78703
+ This data cube measures 614 x 512 pixels x 224 bands, spanning the entire visible, near-infrared, and short-wave infrared spectrum. Visualizations provided by the open-source Opticks remote sensing toolkit [4].

References:
1. AVIRIS Home page. (2015, June 26). Retrieved from http://aviris.jpl.nasa.gov/data/free_data.html
2. Google. (2015, June 11). Google App Engine for Python 1.9.21 Documentation. Retrieved from https://cloud.google.com/appengine/docs/python
3. Landgrebe, David A. (2005). Signal Theory Methods in Multispectral Remote Sensing. Hoboken, NJ: John Wiley & Sons.
4. Opticks. (2015, June 26). Opticks remote sensing toolkit. Retrieved from https://opticks.org
5. RealMassive. (2015, June 10). Retrieved from https://www.realmassive.com

Method
There are 3 phases needed to overcome constraints imposed by App Engine [2]:
● Cluster user inputs into “signatures” to reduce the length of query strings and sort expressions.
● Apply fixed filters to limit search results to within the 10,000 hit sort limit.
● Score results by signature match to override the default search-term relevance score.

In the clustering algorithm, doubling the initial radius results in an absolute maximum of 26 recursive calls, for an overall asymptotic complexity of O(2n log₂ n). This worst case never happens in practice due to low building density.
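The radius-doubling clustering steps can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: a brute-force distance scan stands in for the kd-tree, degrees of latitude and longitude are treated as planar coordinates, the recursion is written as a loop capped at 26 doublings, and the final weighted centroid and radius step is omitted because the weights are not specified. The function names and sample coordinates are hypothetical.

```python
import math

def cluster_once(points, max_dist, neighbors=50):
    """One pass of Step 3: return a list of disjoint sets of point indices."""
    remaining = set(range(len(points)))
    clusters = []
    while remaining:
        seed = remaining.pop()
        sx, sy = points[seed]
        # Brute-force stand-in for the kd-tree query: distance from the seed
        # to every point (the seed itself is included at distance 0).
        dists = sorted(
            (math.hypot(px - sx, py - sy), i) for i, (px, py) in enumerate(points)
        )
        # Radius covering the seed plus up to `neighbors` points within max_dist.
        within = [d for d, _ in dists if d <= max_dist][: neighbors + 1]
        radius = within[-1]
        members = {i for d, i in dists if d <= radius}
        remaining -= members
        # Step 3c: absorb every existing cluster that shares a point.
        merged, kept = set(members), []
        for cluster in clusters:
            if cluster & members:
                merged |= cluster
            else:
                kept.append(cluster)
        clusters = kept + [merged]
    return clusters

def cluster_spaces(points, k, max_dist=0.082, neighbors=50, max_doublings=26):
    """Step 4: double the maximum distance until at most k clusters remain."""
    clusters = cluster_once(points, max_dist, neighbors)
    for _ in range(max_doublings):
        if len(clusters) <= k:
            break
        max_dist *= 2
        clusters = cluster_once(points, max_dist, neighbors)
    return clusters

# Hypothetical (latitude, longitude) pairs: three near Austin, two near New York.
points = [(30.27, -97.74), (30.28, -97.75), (30.26, -97.73),
          (40.71, -74.00), (40.72, -74.01)]
two = cluster_spaces(points, k=2)  # the two cities stay separate
one = cluster_spaces(points, k=1)  # the radius doubles until they merge
```

With k = 2 the starting distance already separates the two cities. With k = 1 the loop keeps doubling the distance until a single cluster covers all five points.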
The final result is similar to the representation of clusters in Figure 3 [5]. Once the spaces have been clustered, it is trivial to compute the average of each vector component to produce each cluster’s signature.

Figure 1: A Commercial Real Estate Survey with Recommendations
Figure 2: Graphic representation of hyperspectral data [7]
Figure 3: Clustering 50 spaces from across the US

Phase 2: Filtering
Next, we apply fixed filters informed by domain expertise. For commercial real estate, this includes the building type (e.g. "office", "industrial", etc.), location, and any necessary exclusions. These constraints produce a reasonable subset that can be matched against signatures.

Figure 4: AVIRIS data courtesy NASA/JPL-Caltech, showing a signature match [1] +

6. M. Richmond. Licensed under Creative Commons. Retrieved from http://spiff.rit.edu/classes/phys301/lectures/comp/comp.html
7. N. Short, Sr. Graphic representation of hyperspectral data. Licensed under Creative Commons. Retrieved from http://rst.gsfc.nasa.gov/
8. A. Vattani. K-means Requires Exponentially Many Iterations Even in the Plane, Discrete Comput Geom. 45(4): 596–616. 2011.
9. H. Zhang, Y. Lan, R. Lacey, W. Hoffmann, Y. Huang. Analysis of vegetation indices derived from aerial multispectral and ground hyperspectral data, International Journal of Agricultural and Biological Engineering. 2(3): 33. 2009.

Acknowledgments
The authors would like to thank Fatih Akici, John Leonard, Natalya Shelburne, and Michael Westgate for their suggestions for this poster.

Phase 3: Sort by Angle
Executing the Spectral Angle Mapper algorithm on a reduced dataset of 10,000 items equates to performing material identification on a 115 x 87 pixel x 3-band data cube from a multi-spectral sensor, or 3% of the computations required for a small data cube, such as the one in Figure 4. Google App Engine can quickly perform calculations in-place on search results, but it lacks the inverse cosine function [2].
Our solution uses the cosine ratio as a proxy for the angle: sorting by the cosine ratio in descending order is equivalent to sorting by the angle in ascending order to find the most similar matches.
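The equivalence holds because the inverse cosine is strictly decreasing on [0, 1], so a larger cosine always means a smaller angle. The following sketch checks this on made-up data. The signature, the candidate vectors, and the helper name `cosine_ratio` are assumptions for illustration, and the sort runs in ordinary Python, not in an App Engine sort expression.

```python
import math

def cosine_ratio(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical cluster signature and candidate vectors.
signature = (3.0, 4.0, 5.0)
candidates = {
    "A": (3.1, 4.2, 4.9),
    "B": (9.0, 1.0, 2.0),
    "C": (6.0, 8.0, 10.5),
}

# Proxy ordering: cosine ratio, descending (no inverse cosine needed).
by_cosine = sorted(
    candidates,
    key=lambda name: cosine_ratio(signature, candidates[name]),
    reverse=True,
)

# Reference ordering: actual angle, ascending (clamped before acos).
by_angle = sorted(
    candidates,
    key=lambda name: math.acos(min(1.0, cosine_ratio(signature, candidates[name]))),
)
```

Both lists come out in the same order, with the candidate closest in direction to the signature first.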