This document describes how to cluster web pages crawled by Apache Nutch using Apache Spark. It presents structural and style similarity measures that compare pages by their DOM structure and CSS styles. Shared near neighbor clustering, implemented with Spark's GraphX library, then groups the pages from the resulting similarity matrix without requiring prior knowledge of the size or shape of the clusters. A demo visualizes the clustered results.
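
To make the clustering step concrete, the following is a minimal single-machine sketch of shared near neighbor clustering over a precomputed similarity matrix. It is plain Python, not the Spark GraphX implementation the document refers to. The function names and the parameters `k` (neighbors kept per page) and `min_shared` (shared neighbors required to link two pages) are assumptions chosen for illustration. Two pages are linked when each appears in the other's neighbor list and the two lists overlap enough; the clusters are the connected components of the resulting link graph.

```python
from typing import List, Set


def top_k_neighbors(sim: List[List[float]], k: int) -> List[Set[int]]:
    """For each item, return the indices of its k most similar other items."""
    n = len(sim)
    neighbors = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Most similar first; ties keep their original index order.
        others.sort(key=lambda j: sim[i][j], reverse=True)
        neighbors.append(set(others[:k]))
    return neighbors


def shared_near_neighbor_clusters(
    sim: List[List[float]], k: int, min_shared: int
) -> List[int]:
    """Return one cluster label per item (hypothetical helper, not a Spark API).

    Items i and j are linked when they are in each other's k-neighbor
    lists and those lists share at least `min_shared` members. Linked
    items are merged with a union-find structure, so each final label
    identifies a connected component of the link graph.
    """
    n = len(sim)
    nbrs = top_k_neighbors(sim, k)
    parent = list(range(n))

    def find(x: int) -> int:
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in nbrs[i] and i in nbrs[j]
            if mutual and len(nbrs[i] & nbrs[j]) >= min_shared:
                parent[find(i)] = find(j)

    return [find(i) for i in range(n)]


# Toy similarity matrix (made-up values): pages 0-2 resemble each other,
# pages 3-5 resemble each other, and the two groups are dissimilar.
HIGH, LOW = 0.9, 0.1
sim = [
    [1.0 if i == j else (HIGH if (i < 3) == (j < 3) else LOW) for j in range(6)]
    for i in range(6)
]
labels = shared_near_neighbor_clusters(sim, k=2, min_shared=1)
```

On this toy input the first three pages end up with one label and the last three with another, and neither the number of clusters nor their shape is supplied in advance. The label values themselves are arbitrary component identifiers. The all-pairs loop is quadratic in the number of pages, which is one reason a distributed graph formulation is attractive for a full crawl.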