OCTOBER 11-14, 2016 • BOSTON, MA
Solr JDBC
Kevin Risden
Apache Lucene/Solr Committer; Avalon Consulting, LLC
About me
•  Consultant with Avalon Consulting, LLC
•  ~4 years working with Hadoop and Search
•  Contributed patches to Ambari, HBase, Knox, Solr, Storm
•  Installation, security, performance tuning, development,
administration
•  Kevin Risden
•  Apache Lucene/Solr Committer
•  YCSB Contributor
Overview
•  Background
•  Use Case
•  Solr JDBC
•  Demo
•  Future Development/Improvements
Background - What is JDBC?
The JDBC API is a Java API that can access any kind of tabular data, especially
data stored in a Relational Database.
Source: https://docs.oracle.com/javase/tutorial/jdbc/overview/
JDBC drivers convert SQL into a backend query.
Background - Why should you care about Solr JDBC?
•  SQL skills are widespread.
•  JDBC drivers exist for most relational databases.
•  Existing reporting tools work with JDBC/ODBC drivers.
Solr 6 works with SQL and existing JDBC tools!
Use Case – Analytics – Utility Rates
Data set: 2011 Utility Rates
Questions:
•  How many utility companies serve the state of Maryland?
•  Which Maryland utility has the cheapest residential rates?
•  What are the minimum and maximum residential power rates excluding missing data elements?
•  What is the state and zip code with the highest residential rate?
How could you answer those questions with Solr?
•  Facets
•  Filter Queries
•  Filters
•  Grouping
•  Sorting
•  Stats
•  String queries together
Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
Use Case – Analytics – Utility Rates
Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
Method: Lucene syntax
Questions:
•  How many utility companies serve the state of Maryland?
http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=10&group.limit=1
•  Which Maryland utility has the cheapest residential rates?
http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=1&group.limit=1&sort=res_rate+asc
•  What are the minimum and maximum residential power rates excluding missing data elements?
http://solr:8983/solr/rates/select?q=*:*&fq=%7b!frange+l%3D0.0+incl%3Dfalse%7dres_rate&wt=json&indent=true&rows=0&stats=true&stats.field=res_rate
•  What is the state and zip code with the highest residential rate?
http://solr:8983/solr/rates/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1
Is there a better way?
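As a rough sketch of how the URLs above are put together, the request parameters can be assembled and percent-encoded in Python. The select_url helper and the variable names are illustrative assumptions; the host, collection, and parameters are the ones shown on the slide. urlencode escapes a few more characters than the hand-written URLs (for example, ! becomes %21), but Solr decodes both forms to the same parameters.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Host and collection as shown on the slide.
SOLR_SELECT = "http://solr:8983/solr/rates/select"


def select_url(params):
    """Return a /select URL with the given parameters percent-encoded."""
    return SOLR_SELECT + "?" + urlencode(params)


# How many utility companies serve the state of Maryland?
md_utilities = select_url({
    "q": 'state:"MD"',
    "wt": "json",
    "indent": "true",
    "group": "true",
    "group.field": "utility_name",
    "rows": 10,
    "group.limit": 1,
})

# Minimum and maximum residential rates, excluding rates of 0.0.
rate_range = select_url({
    "q": "*:*",
    "fq": "{!frange l=0.0 incl=false}res_rate",
    "wt": "json",
    "indent": "true",
    "rows": 0,
    "stats": "true",
    "stats.field": "res_rate",
})

print(md_utilities)
print(rate_range)
```

Even with a helper, each question needs its own combination of grouping, filter, and stats parameters, which is the pain point the slide is pointing at.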
Solr JDBC
Highlights
•  JDBC Driver for Solr
•  Powered by Streaming Expressions and Parallel SQL
•  Thursday - Parallel SQL and Analytics with Solr – Yonik Seeley
•  Thursday - Creating New Streaming Expressions – Dennis Gove
•  Integrates with any* JDBC client
* tested with the JDBC clients in this presentation
Usage
jdbc:solr://SOLR_ZK_CONNECTION_STRING?collection=COLLECTION_NAME
Apache Solr Reference Guide - Parallel SQL Interface
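A minimal sketch of filling in the usage template above, assuming a ZooKeeper ensemble given as host:port pairs with an optional chroot. The solr_jdbc_url helper and the zk1/zk2 host names are hypothetical. Extra keyword arguments are passed through as connection properties; aggregationMode is shown as an example of such a property, and whether a given property is honored depends on the Solr version, so check the Reference Guide.

```python
from urllib.parse import urlencode


def solr_jdbc_url(zk_hosts, collection, chroot="", **properties):
    """Build jdbc:solr://SOLR_ZK_CONNECTION_STRING?collection=COLLECTION_NAME.

    zk_hosts   -- list of "host:port" strings for the ZooKeeper ensemble
    collection -- Solr collection to query
    chroot     -- optional ZooKeeper chroot such as "/solr"
    properties -- extra connection properties, appended as-is (assumption:
                  the driver accepts them as URL query parameters)
    """
    zk_connection_string = ",".join(zk_hosts) + chroot
    query = urlencode({"collection": collection, **properties})
    return "jdbc:solr://" + zk_connection_string + "?" + query


# The URL used by the demos in this presentation.
demo_url = solr_jdbc_url(["solr:9983"], "test")

# Hypothetical three-part example: two ZooKeeper hosts, a chroot, one property.
cluster_url = solr_jdbc_url(
    ["zk1:2181", "zk2:2181"], "rates", chroot="/solr", aggregationMode="map_reduce"
)

print(demo_url)
print(cluster_url)
```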
Solr JDBC - Architecture
Demo
Programming Languages
•  Java
•  Python/Jython
•  R
•  Apache Spark
Web
•  Apache Zeppelin
•  RStudio
GUI – JDBC
•  DbVisualizer
•  SQuirreL SQL
GUI – ODBC
•  Microsoft Excel
•  Tableau*
https://github.com/risdenk/solrj-jdbc-testing
Demo – Java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.sql.*;
public class SolrJJDBCTestingJava {
    private static final Logger LOGGER = LoggerFactory.getLogger(SolrJJDBCTestingJava.class);
    public static void main(String[] args) throws Exception {
        String sql = args[0];
        try (Connection con = DriverManager.getConnection("jdbc:solr://solr:9983?collection=test")) {
            try (Statement stmt = con.createStatement()) {
                try (ResultSet rs = stmt.executeQuery(sql)) {
                    ResultSetMetaData rsMetaData = rs.getMetaData();
                    int columns = rsMetaData.getColumnCount();
                    StringBuilder header = new StringBuilder();
                    for(int i = 1; i < columns + 1; i++) {
                        header.append(rsMetaData.getColumnLabel(i)).append(",");
                    }
                    LOGGER.info(header.toString());
                    while (rs.next()) {
                        StringBuilder row = new StringBuilder();
                        for(int i = 1; i < columns + 1; i++) {
                            row.append(rs.getObject(i)).append(",");
                        }
                        LOGGER.info(row.toString());
                    }
                }
            }
        }
    }
}
Apache Solr Reference Guide - Generic
Demo – Python
#!/usr/bin/env python
# https://pypi.python.org/pypi/JayDeBeApi/
import jaydebeapi
import sys
if __name__ == '__main__':
    jdbc_url = "jdbc:solr://solr:9983?collection=test"
    driverName = "org.apache.solr.client.solrj.io.sql.DriverImpl"
    statement = "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10"
    conn = jaydebeapi.connect(driverName, jdbc_url)
    curs = conn.cursor()
    curs.execute(statement)
    print(curs.fetchall())
    conn.close()
Apache Solr Reference Guide - Python/Jython
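The Java demo prints a header row from the result set metadata; a Python DB-API cursor exposes the same information through cursor.description. The sketch below shows that pattern. Because it has to run without a Solr cluster, it uses the standard library's sqlite3 module as a stand-in connection, and the table contents are made-up sample values; the rows_as_csv helper is a hypothetical name. JayDeBeApi follows the same DB-API interface, so the helper should apply to a jaydebeapi cursor as well, though that is not exercised here.

```python
import sqlite3


def rows_as_csv(cursor):
    """Return the header and every remaining row of an executed cursor as CSV lines."""
    # DB-API: description holds one 7-item sequence per column; item 0 is the name.
    header = ",".join(column[0] for column in cursor.description)
    lines = [header]
    for row in cursor.fetchall():
        lines.append(",".join(str(value) for value in row))
    return lines


# Stand-in for the Solr connection: an in-memory SQLite table with sample values.
conn = sqlite3.connect(":memory:")
curs = conn.cursor()
curs.execute("create table test (fielda text, fielde_i integer)")
curs.executemany("insert into test values (?, ?)", [("a1", 1), ("a2", 2)])

curs.execute("select fielda, fielde_i from test order by fielde_i limit 10")
csv_lines = rows_as_csv(curs)
print("\n".join(csv_lines))
conn.close()
```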
Demo – Jython
#!/usr/bin/env jython
# http://www.jython.org/jythonbook/en/1.0/DatabasesAndJython.html
# https://wiki.python.org/jython/DatabaseExamples#SQLite_using_JDBC
import sys
from java.lang import Class
from java.sql import DriverManager, SQLException
if __name__ == '__main__':
    jdbc_url = "jdbc:solr://solr:9983?collection=test"
    driverName = "org.apache.solr.client.solrj.io.sql.DriverImpl"
    statement = "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10"
    dbConn = DriverManager.getConnection(jdbc_url)
    stmt = dbConn.createStatement()
    resultSet = stmt.executeQuery(statement)
    while resultSet.next():
        print(resultSet.getString("fielda"))
    resultSet.close()
    stmt.close()
    dbConn.close()
Apache Solr Reference Guide - Python/Jython
Demo – R
# https://www.rforge.net/RJDBC/
library("RJDBC")
solrCP <- c(list.files('/opt/solr/dist/solrj-lib', full.names=TRUE),
            list.files('/opt/solr/dist', pattern='solrj', full.names=TRUE, recursive = TRUE))
drv <- JDBC("org.apache.solr.client.solrj.io.sql.DriverImpl",
            solrCP,
            identifier.quote="`")
conn <- dbConnect(drv, "jdbc:solr://solr:9983?collection=test", "user", "pwd")
dbGetQuery(conn, "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10")
dbDisconnect(conn)
Apache Solr Reference Guide - R
Demo – Apache Zeppelin
Apache Solr Reference Guide - Apache Zeppelin
Demo – RStudio
Demo – DbVisualizer
Apache Solr Reference Guide - DbVisualizer
Demo – SQuirreL SQL
Apache Solr Reference Guide - SQuirreL SQL
Demo – Microsoft Excel
Use Case – Analytics – Utility Rates
Method: SQL
Questions:
•  How many utility companies serve the state of Maryland?
select distinct utility_name from rates where state='MD';
•  Which Maryland utility has the cheapest residential rates?
select utility_name,min(res_rate) from rates where state='MD' group by utility_name order by min(res_rate) asc limit 1;
•  What are the minimum and maximum residential power rates excluding missing data elements?
select min(res_rate),max(res_rate) from rates where not res_rate = 0;
•  What is the state and zip code with the highest residential rate?
select state,zip,max(res_rate) from rates group by state,zip order by max(res_rate) desc limit 1;
How should you answer those questions with Solr? – Using SQL!
Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
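To make it concrete what the four SQL statements compute, here is a small sketch that runs them unchanged against a handful of rows. Two loud assumptions: the rows are hypothetical sample values, not the 2011 Utility Rates data set, and the engine is Python's built-in sqlite3 rather than Solr, so this illustrates the SQL semantics only. Against Solr, the same statements would go through the JDBC driver instead.

```python
import sqlite3

# Hypothetical sample rows: (zip, state, utility_name, res_rate).
# A res_rate of 0.0 stands in for a missing value, as in the slide's third question.
SAMPLE_ROWS = [
    ("21201", "MD", "Utility A", 0.12),
    ("21202", "MD", "Utility A", 0.11),
    ("21401", "MD", "Utility B", 0.09),
    ("10001", "NY", "Utility C", 0.20),
    ("10002", "NY", "Utility C", 0.0),
]

# The four statements from the slide, keyed by an illustrative name.
QUERIES = {
    "md_utilities": "select distinct utility_name from rates where state='MD'",
    "cheapest_md": "select utility_name,min(res_rate) from rates where state='MD' "
                   "group by utility_name order by min(res_rate) asc limit 1",
    "rate_range": "select min(res_rate),max(res_rate) from rates where not res_rate = 0",
    "highest_rate": "select state,zip,max(res_rate) from rates "
                    "group by state,zip order by max(res_rate) desc limit 1",
}

conn = sqlite3.connect(":memory:")
conn.execute("create table rates (zip text, state text, utility_name text, res_rate real)")
conn.executemany("insert into rates values (?, ?, ?, ?)", SAMPLE_ROWS)

# Run every query and keep the fetched rows for inspection.
results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
conn.close()

for name, rows in results.items():
    print(name, rows)
```

The first question is answered by counting the rows that the distinct query returns.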
Future Development/Improvements
•  Replace Presto with Apache Calcite - SOLR-8593
•  Improve SQL compatibility
•  Ability to specify optimization rules (push downs, joins, etc)
•  Potentially use Avatica JDBC/ODBC drivers
•  Streaming Expressions/Parallel SQL improvements - SOLR-8125
•  JDBC driver improvements - SOLR-8659
Info on how to get involved
Future Development/Improvements
SQL Join
SELECT
  movie_title,character_name,line
FROM
  movie_dialogs_movie_titles_metadata a
JOIN
  movie_dialogs_movie_lines b
ON
  a.movieID=b.movieID;
Streaming Expression Join
select(
  innerJoin(
    search(movie_dialogs_movie_titles_metadata,
      q=*:*,
      fl="movieID,movie_title",
      sort="movieID asc"),
    search(movie_dialogs_movie_lines,
      q=*:*,
      fl="movieID,character_name,line",
      sort="movieID asc"),
    on="movieID"
  ),
  movie_title,character_name,line
)
Info on how to get involved
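Both search streams in the expression above are sorted by movieID, which is what lets an inner join run as a single merge pass over the two streams. The sketch below illustrates that idea on in-memory lists. It is not Solr's implementation; the inner_join function and the sample documents are hypothetical, and only the field names come from the slide.

```python
def inner_join(left, right, key):
    """Merge-join two lists of dicts that are both sorted ascending by key."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right-hand rows that share this key.
            run_end = j
            while run_end < len(right) and right[run_end][key] == left_key:
                run_end += 1
            # Pair every left-hand row with this key against the whole run.
            while i < len(left) and left[i][key] == left_key:
                for right_row in right[j:run_end]:
                    joined.append({**left[i], **right_row})
                i += 1
            j = run_end
    return joined


# Hypothetical documents, already sorted by movieID like sort="movieID asc".
titles = [
    {"movieID": "m0", "movie_title": "Title 0"},
    {"movieID": "m1", "movie_title": "Title 1"},
    {"movieID": "m2", "movie_title": "Title 2"},
]
lines = [
    {"movieID": "m0", "character_name": "ALICE", "line": "Hello."},
    {"movieID": "m0", "character_name": "BOB", "line": "Hi."},
    {"movieID": "m2", "character_name": "CAROL", "line": "Goodbye."},
]

# Equivalent of the outer select(...): keep only the three requested fields.
fields = ("movie_title", "character_name", "line")
projected = [{f: row[f] for f in fields} for row in inner_join(titles, lines, "movieID")]

for row in projected:
    print(row)
```

Title 1 has no matching lines, so it drops out of the result, as it would with the SQL JOIN.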
Questions?

Solr JDBC: Presented by Kevin Risden, Avalon Consulting
