This document discusses using Apache Solr with Apache Spark for faster data analytics. It gives examples of using Solr's SQL capabilities through Fusion SQL to run queries and aggregations over data in Solr, including count-distinct queries, joins, full-text search, and queries over time-partitioned collections. It also covers how Spark SQL integrates with Solr as a data source and how to plug in custom query strategies.
Fun Experiment with Fusion SQL Engine
• A Periscope Data blog post compared count-distinct query performance across common databases using a query typical for dashboards
• 14M log rows, 1,200 distinct dashboards, 1,700 distinct user_id/dashboard_id pairs
• Replicated the experiment with Fusion on EC2 (m1.xlarge), single instance of Solr
Fusion: 1.2 secs (cold)
https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
Solr SQL and limitations
• Combines SQL with full-text search capabilities
• MapReduce and JSON Facet API aggregations
• SQL queries are implemented as streaming expressions
Cons:
• Complex joins are not supported
• No UDF/UDAF support
• COUNT DISTINCT is not supported
• Incompatible with analytical tools such as Tableau
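Despite these limitations, basic aggregations can still be issued against Solr's Parallel SQL interface through the JDBC driver that ships with SolrJ. A minimal sketch, assuming a SolrCloud cluster behind a placeholder ZooKeeper address (`localhost:9983`) and a hypothetical `logs` collection with a `host_s` field; the function is not invoked here since it needs a live cluster:

```scala
// Sketch: querying Solr's Parallel SQL interface via its JDBC driver.
// The ZooKeeper address and collection name below are placeholders.
import java.sql.{Connection, DriverManager, ResultSet}

val jdbcUrl = "jdbc:solr://localhost:9983?collection=logs&aggregationMode=facet"

// Requires solr-solrj on the classpath and a running SolrCloud cluster.
def topHosts(): List[String] = {
  val conn: Connection = DriverManager.getConnection(jdbcUrl)
  try {
    val rs: ResultSet = conn.createStatement().executeQuery(
      "SELECT host_s, COUNT(*) AS ct FROM logs GROUP BY host_s ORDER BY ct DESC LIMIT 10")
    // Walk the result set and collect the host values
    Iterator.continually(rs).takeWhile(_.next()).map(_.getString("host_s")).toList
  } finally conn.close()
}
```

The `aggregationMode` parameter chooses between facet- and map-reduce-based aggregation on the Solr side.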
TPC-DS Benchmark
WITH customer_total_return AS
(SELECT sr_customer_sk AS ctr_customer_sk, sr_store_sk AS ctr_store_sk,
sum(sr_return_amt) AS ctr_total_return
FROM store_returns, date_dim
WHERE sr_returned_date_sk = d_date_sk AND d_year = 2000
GROUP BY sr_customer_sk, sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1, store, customer
WHERE ctr1.ctr_total_return >
(SELECT avg(ctr_total_return)*1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
AND s_store_sk = ctr1.ctr_store_sk
AND s_state = 'TN'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id LIMIT 100
Query from the Transaction Processing Performance Council's Decision Support (TPC-DS) benchmark
Spark SQL
• Spark SQL provides a powerful and extensible query-plan optimizer (Catalyst) with SQL:2003 support
• SQL parser supports both ANSI SQL and HiveQL
• Whole-stage code generation (since 2.0)
• Ability to run SQL queries across data files, Hive tables, external databases, and Spark connectors
• Supports predicate pushdown
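A minimal local-mode sketch of the points above (table and column names are made up for illustration): register an in-memory DataFrame as a view and query it with ANSI SQL, letting Catalyst optimize the plan and whole-stage code generation compile it to JVM bytecode.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-sketch")
  .master("local[*]") // local mode, just for the sketch
  .getOrCreate()
import spark.implicits._

// Small in-memory table standing in for a real data source
val logs = Seq(("d1", "u1"), ("d1", "u2"), ("d2", "u1"))
  .toDF("dashboard_id", "user_id")
logs.createOrReplaceTempView("time_on_site_logs")

// ANSI SQL over the registered view; Catalyst plans and optimizes it
val counts = spark.sql(
  """SELECT dashboard_id, COUNT(DISTINCT user_id) AS ct
    |FROM time_on_site_logs
    |GROUP BY dashboard_id""".stripMargin)
counts.show()
```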
Solr as Spark SQL Data Source
• Connectivity between Spark and Solr is provided by the open-source spark-solr library (https://github.com/LucidWorks/spark-solr)
• Read/write data from/to Solr (request handlers, streaming expressions, SQL)
• Uses the Solr Schema API to access field-level metadata
• Supports Basic auth and Kerberos authentication to Solr
• Maintained and supported by Lucidworks
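A sketch of reading a Solr collection as a DataFrame through spark-solr; the ZooKeeper address and collection name are placeholders, and the read is wrapped in a function since it needs a running SolrCloud cluster and the spark-solr package on the classpath:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Placeholder connection details for the sketch
val solrOptions = Map(
  "zkhost"     -> "localhost:9983", // ZooKeeper ensemble of the SolrCloud cluster
  "collection" -> "logs"
)

// spark-solr registers the "solr" data source; filters and column
// pruning are pushed down to Solr where possible
def readLogs(spark: SparkSession): DataFrame =
  spark.read.format("solr").options(solrOptions).load()

// Writing back uses the same format:
//   df.write.format("solr").options(solrOptions).save()
```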
Fusion SQL
select
  dashboards.name,
  log_counts.ct
from dashboards
join (
  select distinct_logs.dashboard_id,
         count(1) as ct
  from (
    select distinct dashboard_id, user_id
    from time_on_site_logs
  ) as distinct_logs
  group by distinct_logs.dashboard_id
) as log_counts
on log_counts.dashboard_id = dashboards.id
order by log_counts.ct desc
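The shape of this query — dedupe (dashboard_id, user_id) pairs, then count per dashboard — can be mirrored in plain Scala collections to make the two steps explicit (sample data is made up for illustration):

```scala
// Plain-Scala rendering of the query's two steps:
// 1) SELECT DISTINCT dashboard_id, user_id
// 2) GROUP BY dashboard_id, counting the remaining pairs
case class LogRow(dashboardId: String, userId: String)

val logRows = Seq(
  LogRow("d1", "u1"), LogRow("d1", "u1"), // duplicate pair collapses
  LogRow("d1", "u2"), LogRow("d2", "u3")
)

val distinctPairs = logRows.map(r => (r.dashboardId, r.userId)).distinct

val countsByDashboard: Map[String, Int] =
  distinctPairs.groupBy(_._1).map { case (dash, pairs) => dash -> pairs.size }
// countsByDashboard == Map("d1" -> 2, "d2" -> 1)
```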
Fusion SQL
sort(
  select(
    rollup(
      facet(
        time_on_site_logs,
        q="dashboard_id:[* TO *]",
        buckets="dashboard_id,user_id",
        bucketSizeLimit=10000,
        bucketSorts="dashboard_id asc",
        count(*)
      ),
      over="dashboard_id",
      count(*)
    ),
    dashboard_id, count(*) as ct
  ),
  by="ct desc"
)

Equivalent SQL for the aggregation above:

select distinct_logs.dashboard_id,
       count(1) as ct
from (
  select distinct dashboard_id, user_id
  from time_on_site_logs
) as distinct_logs
group by distinct_logs.dashboard_id
Fusion SQL (Full text search)
q=*:*&rows=5000&collection=movies&qt=/sql&sql=select+movie_id+from+movies+where+_query_%3D'plot_txt_en:dogs'

select movie_id from movies where _query_='plot_txt_en:dogs';
Fusion SQL (timestamp pushdown)
q=*:*&rows=5000&qt=/export&fl=host_s,port_s&fq=timestamp_tdt:{2017-08-31T00:00:00.00Z+TO+*]&sort=id+asc&collection=logs

SELECT host_s, port_s FROM logs WHERE timestamp_tdt > '2017-08-31';
Fusion SQL (Solr SQL)
q=*:*&rows=5000&qt=/sql&sql=SELECT+movie_id,+COUNT(*)+as+num_ratings,+avg(rating)+as+aggAvg+FROM+ratings+GROUP+BY+movie_id+HAVING+COUNT(*)+>+100+ORDER+BY+aggAvg+ASC+LIMIT+10&collection=ratings
SELECT movie_id
FROM
(SELECT movie_id,
COUNT(*) AS num_ratings,
avg(rating) AS aggAvg
FROM ratings
GROUP BY movie_id
HAVING COUNT(*) > 100
ORDER BY aggAvg ASC
LIMIT 10) as solr;
Plugging in Custom strategies
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.expressions.{Alias, IntegerLiteral}
import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Count}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.execution.SparkPlan

// Register the custom strategy with the session's experimental methods
sqlContext.experimental.extraStrategies ++= Seq(new SpecialPushdownStrategy)

class SpecialPushdownStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // Match a global "SELECT count(1) FROM ..." with no GROUP BY
    case Aggregate(Nil,
        Seq(Alias(AggregateExpression(Count(Seq(IntegerLiteral(_))), _, _, _), name)),
        child) =>
      // Form a Solr query, read numFound, and return a SparkPlan
      // that yields the count directly
      Nil // placeholder for the custom physical plan
    case _ => Nil // fall through so other strategies can plan the query
  }
}