Find Me a Roof !
project for “Gestione dell’informazione sul Web” class
                    AA 2009-2010
Alessandro Manfredi & Marco Bontempi & Marco Giannone
     {a.n0on3,bontempi,marco.giannone}@gmail.com
Goals
✓ Build a search engine for the vertical domain of real estate
  advertisements.

✓ Index and link information from multiple sources.
✓ Design it so that adding new sources is easy.
✓ Enrich sparse information through web service
  integration.

✓ Provide a user-friendly interface for efficient searches that
  are localized and selective by domain field.

✓ “Did you mean ... ?” and search suggestions.
✓ Deploy on Amazon EC2/S3.
Preview
Preview ( autocomplete )
Preview ( results )
Preview ( did you mean ... ? )
What we used
Back End Overview
[Diagram components: roof bots, url repository, Download & Dispatch,
 Extractor 1, Extractor 2, ..., Extractor n, DB,
 LUCENE Indexes (Main, SpellChecker, AutoCompleter)]
Why the DB? This will be explained later ...
Crawling
• Collecting information from
  • www.trova-casa.net
  • www.immobiliare.it
• First attempt on trova-casa.net:
  • multithreaded brute force over identically
    structured URLs: after 75k ...
  • ... we got banned :-)
Crawling

• WebSphinx (Carnegie Mellon University)
   • http://www-2.cs.cmu.edu/~rcm/websphinx/

• Timeout: 1s
• Scope limited to Rome and
   surroundings

   • Regex on the URLs to visit and save
   • Coordinate filtering
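
The two scope filters above can be sketched as follows. This is only an illustration in Python, not the WebSphinx configuration used in the project: the URL pattern, the host name, the function names `should_visit` and `in_scope`, and the bounding box around Rome are all assumptions made for the example.

```python
import re

# Hypothetical pattern for ad pages; the real URL structure is not given in the slides.
AD_URL = re.compile(r"^https?://www\.example-realty\.test/annunci/\d+\.html$")

# Assumed, rough bounding box around Rome, in degrees:
# (min_lat, max_lat, min_lon, max_lon).
ROME_BOX = (41.6, 42.2, 12.2, 12.9)


def should_visit(url):
    """Return True when the URL looks like an ad page worth visiting and saving."""
    return AD_URL.match(url) is not None


def in_scope(lat, lon, box=ROME_BOX):
    """Return True when the coordinates fall inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
```

A crawler would call `should_visit` before fetching a link and `in_scope` after extracting the coordinates of an ad.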
Crawling
• For some reason, WebSphinx stopped before reaching
  all of the real estate ads ...

• We wrote a simple PHP roofbot that:
  • Starts from the sitemaps
  • Reaches the index pages
  • Collects URLs along given navigation paths
• This way we reached all of the ~87k ads
  available in Rome and surroundings.
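
The sitemap-driven approach can be sketched roughly as below. The original roofbot was written in PHP and its code is not shown in the slides; this Python version, the sample sitemap, the `example.test` host, and the `/roma/` path pattern are assumptions used only to illustrate the idea of reading a sitemap and keeping the URLs that follow a given navigation path.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Tiny made-up sitemap, standing in for one downloaded from a real site.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.test/roma/index-1.html</loc></url>
  <url><loc>http://example.test/roma/index-2.html</loc></url>
  <url><loc>http://example.test/milano/index-1.html</loc></url>
</urlset>"""


def urls_from_sitemap(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def follow_path(urls, pattern):
    """Keep only the URLs that match the given navigation-path regex."""
    rx = re.compile(pattern)
    return [u for u in urls if rx.search(u)]


index_pages = follow_path(urls_from_sitemap(SAMPLE), r"/roma/index-\d+\.html$")
```

In a full crawler, each index page would then be downloaded and scanned for links to individual ads in the same way.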
Data Extraction
• HtmlUnit + Neko

• JTidy + XPath
    (although JTidy issue #562127 forced us to skip a few fields)


• Information collected:
     • Data (realty type, contract type, address,
          surface, price, coordinates, contacts)

     • Text (description)
• Data was cleaned with regular expressions
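
As an illustration of regex-based cleaning, the sketch below normalizes price and surface strings. The slides do not show the actual expressions, so the input formats (Italian number formatting with "." as thousands separator and "," as decimal separator, surfaces written as "mq", "m2" or "m²") and the function names are assumptions.

```python
import re


def parse_price(raw):
    """Turn a string such as '€ 250.000,00' into 250000; None if no digits are found.

    Assumes Italian formatting: '.' separates thousands, ',' starts the decimals.
    """
    whole_part = raw.split(",")[0]           # drop the decimal part, if any
    digits = re.sub(r"\D", "", whole_part)    # keep digits only
    return int(digits) if digits else None


def parse_surface(raw):
    """Turn a string such as '120,5 m²' into 120.5 (square metres); None if absent."""
    text = raw.replace(".", "")              # remove thousands separators
    m = re.search(r"(\d+(?:,\d+)?)\s*(?:mq|m2|m²)", text, flags=re.IGNORECASE)
    return float(m.group(1).replace(",", ".")) if m else None
```

Values written with a decimal point instead of a comma would be misread by this sketch, so a real cleaner would need rules matched to each source site.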
Data Enrichment
• Using the Google Maps API and web services
   • Adding coordinates from the address
       • Geocoding WS with CSV output:
         http://maps.google.com/maps/geo?output=csv&sensor=false&q=...


   • Adding the address from coordinates
       • API Geocoding WS, max 2,500 requests/day:
         http://maps.google.com/maps/api/geocode/xml?sensor=false&latlng=...


• This works for 83% of the requests performed.
   • E.g., it fails when street numbers are unknown to Google
       or when street names are misspelled.
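
A minimal sketch of the forward-geocoding step, assuming the CSV reply consists of a single line laid out as `status,accuracy,latitude,longitude` with status `200` meaning success. That layout is an assumption about the legacy service quoted above, which dates from 2010 and may no longer respond; the function names are illustrative. The code only builds the request URL and parses a reply, it performs no network call.

```python
from urllib.parse import urlencode

# Endpoint quoted in the slides (legacy service).
GEOCODE_CSV = "http://maps.google.com/maps/geo"


def geocode_url(address):
    """Build the request URL for an address, with proper query-string escaping."""
    return GEOCODE_CSV + "?" + urlencode({"output": "csv", "sensor": "false", "q": address})


def parse_csv_reply(line):
    """Parse one reply line; assumed layout: status,accuracy,latitude,longitude.

    Returns (lat, lon) on success, or None when the status is not 200.
    """
    status, _accuracy, lat, lon = line.strip().split(",")
    if status != "200":
        return None
    return float(lat), float(lon)
```

Ads for which `parse_csv_reply` returns `None` correspond to the failed share of requests mentioned above and would keep whatever location data they already had.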
Text search
• While the user is typing, the AutoCompleter
  index is queried via JavaScript to provide
  suggestions.

• The Main index is used for search
  • If fewer results than a threshold are
    returned, or if the highest score is too
    low, the SpellChecker index is invoked to
    guess possible spelling errors, and results
    for the deduced correct query are also
    displayed.
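
The fallback rule described above can be sketched as follows. The project used Lucene from Java; this Python sketch only models the decision logic. The `search` and `suggest` callables stand in for the Main and SpellChecker indexes, and the default thresholds are invented for the example.

```python
def search_with_fallback(query, search, suggest, min_hits=5, min_top_score=0.3):
    """Run the main search and consult the spell checker when results look weak.

    search(query)  -> list of (ad_id, score) pairs   (stand-in for the Main index)
    suggest(query) -> corrected query string or None (stand-in for the SpellChecker index)

    Returns (hits, corrected_query, hits_for_corrected_query).
    """
    hits = search(query)
    top_score = max((score for _, score in hits), default=0.0)

    # Results are good enough: no "Did you mean ... ?" needed.
    if len(hits) >= min_hits and top_score >= min_top_score:
        return hits, None, []

    corrected = suggest(query)
    if corrected is None or corrected == query:
        return hits, None, []

    # Show the original hits together with those for the corrected query.
    return hits, corrected, search(corrected)
```

Returning both result lists matches the behaviour described on the slide, where results for the corrected query are displayed in addition to the original ones.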
Suggestions

• In practice, since the AutoCompleter index often
  returned results for negligible words and
  doesn't support phrase queries,
  we returned suggestions by searching a
  list of common locations and keywords.

• In production, this list may be fed with
  the most common searches.
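
A list-based suggester of this kind can be sketched with a sorted list and binary search. The slides do not say how the list was searched, so prefix matching, the function names, and the sample phrases are assumptions.

```python
from bisect import bisect_left


def build_index(phrases):
    """Normalize, de-duplicate and sort the suggestion phrases."""
    return sorted({p.strip().lower() for p in phrases if p.strip()})


def suggest(index, prefix, limit=5):
    """Return up to `limit` phrases starting with `prefix` (multi-word phrases included)."""
    prefix = prefix.lower().lstrip()
    if not prefix:
        return []
    out = []
    # All phrases sharing the prefix are contiguous in the sorted list.
    for phrase in index[bisect_left(index, prefix):]:
        if not phrase.startswith(prefix):
            break
        out.append(phrase)
        if len(out) == limit:
            break
    return out
```

Because whole phrases are stored, a location such as "san giovanni" is suggested as one unit, which addresses the phrase-query limitation mentioned above.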
Why use a DB ?
        • To take advantage of indexes for
          efficient range searches in data
          analysis.
        • E.g., provide the average price per unit of
          surface area around a location, within a selectable range.
        • Chance to delegate filtering to the ...

[Diagram components: QUERY, LUCENE Main Index, DB, ID-based Merge, Results]
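
An ID-based merge of the kind named in the diagram could look like the sketch below. It assumes that Lucene returns ranked `(ad_id, score)` pairs and that the database returns the IDs of the ads passing the filters; both shapes, and the function name, are assumptions.

```python
def merge_by_id(lucene_hits, db_ids):
    """Keep the Lucene hits whose ID was also returned by the DB, preserving rank order."""
    allowed = set(db_ids)  # set lookup keeps the merge linear in the number of hits
    return [(ad_id, score) for ad_id, score in lucene_hits if ad_id in allowed]
```

The text index decides the ranking, while the database only decides which ads are eligible.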
An Example
-- Annunci = ads, Prezzo = price, Superficie = surface,
-- Contratto = contract, Vendita = sale
SELECT avg("Prezzo"/"Superficie") FROM "Annunci"
WHERE "Contratto" = 'Vendita'
AND "Latitudine" < X AND "Latitudine" > Y
AND "Longitudine" > Z AND "Longitudine" < W
AND "Superficie" != 0 AND "Prezzo" != 0 ;
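
To make the query concrete, here is a small self-contained run of it. The slides do not name the database engine or show the schema, so the in-memory SQLite database, the column types, the index, and the sample rows are assumptions chosen only so the example can be executed.

```python
import sqlite3


def average_unit_price(conn, contract, south, north, west, east):
    """Average price per unit of surface for ads of one contract type inside a bounding box."""
    row = conn.execute(
        'SELECT avg("Prezzo" / "Superficie") FROM "Annunci" '
        'WHERE "Contratto" = ? '
        'AND "Latitudine" > ? AND "Latitudine" < ? '
        'AND "Longitudine" > ? AND "Longitudine" < ? '
        'AND "Superficie" != 0 AND "Prezzo" != 0',
        (contract, south, north, west, east),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "Annunci" ("Id" INTEGER PRIMARY KEY, "Contratto" TEXT, '
    '"Prezzo" REAL, "Superficie" REAL, "Latitudine" REAL, "Longitudine" REAL)'
)
# An index on the coordinates is what makes the range conditions cheap.
conn.execute('CREATE INDEX idx_position ON "Annunci" ("Latitudine", "Longitudine")')
conn.executemany(
    'INSERT INTO "Annunci" ("Contratto", "Prezzo", "Superficie", "Latitudine", "Longitudine") '
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Vendita", 200000, 100, 41.90, 12.50),  # inside the box, 2000 per unit
        ("Vendita", 300000, 100, 41.91, 12.51),  # inside the box, 3000 per unit
        ("Affitto", 1000, 50, 41.90, 12.50),     # rental, excluded by contract type
        ("Vendita", 500000, 100, 45.46, 9.19),   # outside the box
        ("Vendita", 100000, 0, 41.90, 12.50),    # zero surface, excluded
    ],
)

local_average = average_unit_price(conn, "Vendita", 41.8, 42.0, 12.4, 12.6)
```

Passing the bounds as parameters keeps the statement reusable for any area the user selects.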
The current implementation
 • Filtering is performed at the application level
   over the Lucene main index results
 • The database is used for data analysis

[Diagram components: QUERY, LUCENE Main Index, Data Analysis, DB, Merge, Results]
Data Analysis
• Right now, limited to comparing each ad
  with the local price per unit of surface area.
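
One way to express that comparison is as a relative difference between the unit price of an ad and the local average. The slides do not say how the comparison was presented, so the formula and the function name below are assumptions.

```python
def compare_to_local(price, surface, local_avg_unit_price):
    """Relative difference between an ad's unit price and the local average.

    Returns e.g. 0.10 for an ad 10% above the local average,
    or None when the inputs do not allow a comparison.
    """
    if surface <= 0 or local_avg_unit_price <= 0:
        return None
    unit_price = price / surface
    return (unit_price - local_avg_unit_price) / local_avg_unit_price
```

The local average would come from a database query over the surrounding area.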
Geolocation




• Users can navigate the map to select their
  area of interest and filter out ads
  located outside it, even if they match the
  query.
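
Filtering ranked results by the visible map area can be sketched as below. The `Ad` record and the viewport expressed as south/west/north/east bounds are assumptions; the slides only state that ads outside the selected area are dropped.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    ad_id: str
    lat: float
    lon: float
    score: float


def filter_by_viewport(ads, south, west, north, east):
    """Drop ads outside the map viewport, keeping the original ranking order."""
    return [a for a in ads if south <= a.lat <= north and west <= a.lon <= east]
```

An ad that matches the text query but lies outside the viewport is simply not returned.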
Deploy on AWS


• Launch and configure an EC2 instance, starting
  from a community-provided "Debian" Linux AMI
  (Amazon Machine Image)

• Save the instance to S3 to preserve the
  filesystem:
  •   ec2-bundle-vol -k <KEY> -c <CERT> -u <USER-ID> --destination /mnt --exclude /mnt

  •   ec2-upload-bundle -b <S3-bucket-name> -m /mnt/image.manifest.xml -a <ACCESS-KEY> -s <SECRET-KEY>

  •   ec2-register <S3-bucket-name>/image.manifest.xml -n <AMI-NAME> -K <KEY> -C <CERT>
Find Me a Roof !
                      ( we won't let you live under a bridge )




                  Thanks

