Solr Extracting Data



A presentation showing how to extract data from Solr. It is part 3 of a three-part series, originally from my YouTube channel.



Usage Rights

© All Rights Reserved


    Solr Extracting Data: Presentation Transcript

    • Solr Extracting Data
      ● Start this session with a full Solr indexed repository
        – Movie cAiYBD4BQeE showed installation
        – Movie Th5Scvlyt-E showed Nutch web crawl
      ● This movie will show how to
        – Extract data from Solr
        – Extract to XML or CSV
        – Show the aim of loading the extract into a data warehouse
      ● This movie assumes you know Linux
    • Solr Extracting Data
      ● Progress so far; the greyed-out area is yet to be examined
    • Checking Solr Data
      ● Data should have been indexed in Solr
      ● In Solr Admin window
        – Set 'Core Selector' = collection1
        – Click 'Query'
        – In Query window set fl field = url
        – Click Execute Query
      ● The result (next) shows the filtered list of urls in Solr
    • Checking Solr Data
    • How To Extract
      ● How could we get at Solr data?
        – In admin console via query
        – Via http solr select
        – Via curl -o call using solr http select
      ● What format of data suits this purpose?
        – XML
        – Comma-separated values (CSV)
    • How To Extract
      ● We want to extract two columns from Solr
        – tstamp, url
      ● We want to extract as csv (csv in call below could be xml)
      ● We want to extract to a file
      ● So we will use an http call
        – http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv
      ● We will also use a curl call
        – curl -o <csv file> '<http call>'
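The http call above differs only in its fl (field list) and wt (response format) parameters, so it can be sketched as a small helper. This is an illustration only: the function name solr_select_url is hypothetical, and the host, port, and path are simply the ones shown on the slide.

```shell
#!/bin/bash
# Build a Solr select URL for a given field list and response format.
# solr_select_url is a hypothetical helper name; host, port, and path
# are copied from the slide and may differ on your installation.
solr_select_url() {
  local fields="$1"   # comma-separated field list, e.g. tstamp,url
  local format="$2"   # response writer, e.g. csv or xml
  printf 'http://localhost:8983/solr/select?q=*:*&fl=%s&wt=%s\n' "$fields" "$format"
}

# Print the CSV and XML variants of the same query.
solr_select_url tstamp,url csv
solr_select_url tstamp,url xml
```

Keeping the URL in one place makes it easier to add fields later without retyping the whole call.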
    • How To Extract
      ● Create a bash file in Solr install directory
        – cd solr-4-2-1/extract ; touch solr_url_extract.bash
        – chmod 755 solr_url_extract.bash
      ● Add contents to bash file
        – #!/bin/bash
        – curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'
        – mv result.csv result.csv.$(date +"%Y%m%d.%H%M%S")
      ● Now run the bash script
        – ./solr_url_extract.bash
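A possible variation on the script above writes straight to the timestamped name and asks curl to report HTTP errors through its exit status. This is a sketch, not the presenter's script: the function names stamped_name and fetch_to_stamped_file are hypothetical, and the block only defines the fetch function without calling it.

```shell
#!/bin/bash
# Return a file name with a date-time suffix in the same
# YYYYMMDD.HHMMSS layout the slide's mv command produces.
stamped_name() {
  local base="$1"
  printf '%s.%s\n' "$base" "$(date +"%Y%m%d.%H%M%S")"
}

# Fetch a URL into a timestamped file and print the file name on success.
# --fail makes curl exit non-zero on an HTTP error response;
# --silent --show-error hides the progress meter but keeps error messages.
fetch_to_stamped_file() {
  local url="$1" base="$2"
  local out
  out="$(stamped_name "$base")"
  curl --fail --silent --show-error -o "$out" "$url" && echo "$out"
}

# Show what the generated name looks like (no network access needed).
stamped_name result.csv
```

With a running Solr instance, fetch_to_stamped_file 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv' result.csv would take the place of the curl and mv lines.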
    • Check Output
      ● Now we check whether we have data
      ● ls -l shows
        – result.csv.20130506.124857
      ● Check the content, wc -l shows 11 lines
      ● Check the content, head -2 shows
        – tstamp, url
        – 2013-05-04T01:56:58.157Z,http://www.mysite.co.nz/Search? DateRange=7& ...
      ● Congratulations, you have extracted data from Solr
      ● It's in CSV format ready to be loaded into a data warehouse
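The manual wc -l and head checks above could be wrapped in a small function so the same test can run after every extract. This is a sketch under assumptions: check_extract is a hypothetical name, and it assumes the header line is exactly "tstamp,url" with no space, matching the fl field list in the http call.

```shell
#!/bin/bash
# Confirm the header line of an extract and report how many data rows follow.
# Assumption: the header is exactly "tstamp,url" (the fl list used in the call).
check_extract() {
  local file="$1"
  local header rows
  header="$(head -n 1 "$file")"
  if [ "$header" != "tstamp,url" ]; then
    echo "unexpected header: $header" >&2
    return 1
  fi
  # Total line count minus the header line.
  rows=$(( $(wc -l < "$file") - 1 ))
  echo "$rows data rows"
}
```

Calling check_extract result.csv.20130506.124857 would then print the row count, or return a non-zero status if the header is not the expected one.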
    • Possible Next Steps
      ● Choose more fields to extract from data
      ● Allow Nutch crawl to go deeper
      ● Allow Nutch crawl to collect a lot more data
      ● Look at facets in Solr data
      ● Load CSV files into Data Warehouse Staging schema
      ● Next movie will show next step in progress
    • Contact Us
      ● Feel free to contact us at
        – www.semtech-solutions.co.nz
        – info@semtech-solutions.co.nz
      ● We offer IT project consultancy
      ● We are happy to hear about your problems
      ● You can just pay for those hours that you need to solve your problems