Faster Data Analytics with Apache Spark using Apache Solr
Kiran Chitturi
Data Engineer, Lucidworks
01 ALL ABOUT THAT SQL
Fun Experiment with Fusion SQL Engine
• A blog post compared COUNT DISTINCT performance across
common databases using a query typical for dashboards
• 14M log rows, 1,200 distinct dashboards, 1,700 distinct
user_id/dashboard_id pairs
• Replicated the experiment with Fusion on EC2 (m1.xlarge),
single Solr instance
Fusion: 1.2 secs (cold)
https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
Best of Spark and Solr (Fusion SQL)
Solr SQL and limitations
• Combines SQL with full-text search capabilities
• MapReduce and JSON Facet API aggregations
• SQL queries implemented as Streaming expressions
Cons:
• Complex joins not supported
• No UDF/UDAF support
• COUNT DISTINCT not supported
• Incompatible with analytical tools like Tableau
TPC-DS Benchmark
WITH customer_total_return AS
(SELECT sr_customer_sk AS ctr_customer_sk, sr_store_sk AS ctr_store_sk,
sum(sr_return_amt) AS ctr_total_return
FROM store_returns, date_dim
WHERE sr_returned_date_sk = d_date_sk AND d_year = 2000
GROUP BY sr_customer_sk, sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1, store, customer
WHERE ctr1.ctr_total_return >
(SELECT avg(ctr_total_return)*1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
AND s_store_sk = ctr1.ctr_store_sk
AND s_state = 'TN'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id LIMIT 100
TPC-DS: Transaction Processing Performance Council, Decision Support benchmark
Spark SQL
• Spark SQL provides a powerful, extensible query plan
optimizer with SQL:2003 support
• SQL parser that supports both ANSI SQL and Hive QL
• Whole-stage code generation (since 2.0)
• Ability to run SQL queries across data files, Hive tables,
external databases, and Spark connectors
• Supports predicate pushdown
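A minimal sketch of what predicate pushdown looks like from the Spark side; the SparkSession setup and the Parquet path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pushdown-sketch")
  .master("local[*]")
  .getOrCreate()

// Register a Parquet file as a temporary view (path is hypothetical)
spark.read.parquet("/data/logs.parquet").createOrReplaceTempView("logs")

// Catalyst can push the WHERE predicate down to the Parquet reader,
// so only matching row groups are scanned
val errors = spark.sql(
  "SELECT host, count(*) AS hits FROM logs WHERE status = 500 GROUP BY host")

// explain() shows PushedFilters in the physical plan
errors.explain()
```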
Spark SQL execution
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Solr as Spark SQL Data Source
• Connectivity between Spark and Solr supported via spark-solr
(https://github.com/LucidWorks/spark-solr) (open source)
• Read/write data from/to Solr (request handlers, streaming
expressions, SQL)
• Use the Solr Schema API to access field-level metadata
• Supports Basic auth and Kerberos authentication to Solr
• Maintained and supported by Lucidworks
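As a sketch, reading a Solr collection into a DataFrame with spark-solr looks like this; the ZooKeeper address and collection names are placeholders, and a SparkSession (`spark`) is assumed:

```scala
// zkhost points at the SolrCloud ZooKeeper ensemble (placeholder address)
val options = Map(
  "zkhost" -> "localhost:9983",
  "collection" -> "movies"
)

// Read the collection as a DataFrame; the schema is derived
// from the Solr Schema API
val movies = spark.read.format("solr").options(options).load()

// Writes go through the same data source
movies.limit(10).write.format("solr")
  .options(Map("zkhost" -> "localhost:9983", "collection" -> "movies_sample"))
  .save()
```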
Fun Experiment with Fusion SQL Engine
• A blog post compared COUNT DISTINCT performance across
common databases using a query typical for dashboards
• 14M log rows, 1,200 distinct dashboards, 1,700 distinct
user_id/dashboard_id pairs
• Replicated the experiment with Fusion on EC2 (m1.xlarge),
single Solr instance
Fusion: 1.2 secs (cold)
https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
Fusion SQL
select
dashboards.name,
log_counts.ct
from dashboards
join (
select distinct_logs.dashboard_id,
count(1) as ct
from (
select distinct dashboard_id, user_id
from time_on_site_logs
) as distinct_logs
group by distinct_logs.dashboard_id
) as log_counts
on log_counts.dashboard_id = dashboards.id
order by log_counts.ct desc
Fusion SQL (streaming expression for the count-distinct subquery)
sort(
select(
rollup(
facet(
time_on_site_logs,
q="dashboard_id:[* TO *]",
buckets="dashboard_id,user_id",
bucketSizeLimit=10000,
bucketSorts="dashboard_id asc",
count(*)
),
over="dashboard_id",
count(*)
),
dashboard_id, count(*) as ct
),
by="ct desc"
)
select distinct_logs.dashboard_id,
count(1) as ct
from (
select distinct dashboard_id, user_id
from time_on_site_logs
) as distinct_logs
group by distinct_logs.dashboard_id
Fusion SQL (full-text search)
q=*:*&rows=5000&collection=movies&qt=/sql&sql=select+movie_id+from+movies+where+_query_%3D'plot_txt_en:dogs'
select movie_id from movies where _query_='plot_txt_en:dogs';
Fusion SQL (timestamp pushdown)
q=*:*&rows=5000&qt=/export&fl=host_s,port_s&fq=timestamp_tdt:{2017-08-31T00:00:00.00Z+TO+*]&sort=id+asc&collection=logs
SELECT host_s, port_s from logs WHERE timestamp_tdt > '2017-08-31';
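The same pushdown can be triggered from the DataFrame API; a sketch, assuming a SparkSession and a spark-solr DataFrame over a hypothetical logs collection:

```scala
val logs = spark.read.format("solr")
  .options(Map("zkhost" -> "localhost:9983", "collection" -> "logs"))
  .load()

// The timestamp filter is translated into a Solr fq range clause
// rather than being evaluated row-by-row in Spark
logs.filter("timestamp_tdt > '2017-08-31'")
  .select("host_s", "port_s")
  .show()
```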
Fusion SQL (Solr SQL)
q=*:*&rows=5000&qt=/sql&sql=SELECT+movie_id,+COUNT(*)+as+num_ratings,+avg(rating)+as+aggAvg+FROM+ratings+GROUP+BY+movie_id+HAVING+COUNT(*)+>+100+ORDER+BY+aggAvg+ASC+LIMIT+10&collection=ratings
SELECT movie_id
FROM
(SELECT movie_id,
COUNT(*) AS num_ratings,
avg(rating) AS aggAvg
FROM ratings
GROUP BY movie_id
HAVING COUNT(*) > 100
ORDER BY aggAvg ASC
LIMIT 10) as solr;
Fusion SQL (Time partitioned collections)
q=*:*&rows=5000&qt=/export&fl=song_s,gender_s&sort=id+asc&fq=ts_dt:{2017-09-10T00:00:00.00Z+TO+*]&collection=eventsim_2017_09_11
SELECT song_s, gender_s FROM eventsim WHERE ts_dt > '2017-09-10';
Fusion SQL
Zeppelin DEMO
Plugging in custom strategies

sqlContext.experimental.extraStrategies ++= Seq(new SpecialPushdownStrategy)

class SpecialPushdownStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case Aggregate(Nil,
        Seq(Alias(AggregateExpression(Count(Seq(IntegerLiteral(n))), _, _, _), name)),
        child) =>
      // Form a Solr query, get numFound, and return a SparkPlan object
      Nil // placeholder: return the custom SparkPlan here
    case _ => Nil // strategy does not apply; fall back to Spark's planner
  }
}
Thank You
