This work analyses the problems arising from the so-called Big Data scenario, which can be defined as the technological challenge of managing and administering quantities of information on a global scale, in the order of terabytes (10^12 bytes) or petabytes (10^15 bytes), and growing at an exponential rate.
We’ll explore a technological and algorithmic approach to handling and processing these amounts of data, which exceed the computational limits of a traditional architecture based on real-time request processing. In particular, we’ll analyze one open source technology, Apache Hadoop, which implements the approach known as Map and Reduce.
We’ll also describe how to organize a cluster of commodity servers into a distributed virtual file system, and how to use this environment to populate a centralized search index (built with another open source technology, Apache Lucene).
The practical implementation will be a web-based application that offers the user a unified search interface over a collection of technical papers.
The aim is to demonstrate that a high-performance search system can be obtained by pre-processing the data with the Map and Reduce paradigm, so as to achieve a real-time response that is independent of the underlying amount of data.
Finally, we’ll compare this solution with different approaches based on clustering or NoSQL solutions, with the aim of describing the characteristics of the concrete scenarios that suggest the adoption of those technologies.