Faster Data Analytics with Apache Spark using Apache Solr
Kiran Chitturi
Data Engineer, Lucidworks
01 ALL ABOUT THAT SQL
Fun Experiment with Fusion SQL Engine
• A blog post compared COUNT DISTINCT performance across
common databases using a query typical for dashboards
• 14M log rows, 1,200 distinct dashboards, 1,700 distinct
user_id/dashboard_id pairs
• Replicated the experiment with Fusion on EC2 (m1.xlarge),
single Solr instance
Fusion: 1.2 secs (cold)
https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
Best of Spark and Solr (Fusion SQL)
Solr SQL and limitations
• Combines SQL with full-text search capabilities
• MapReduce and JSON Facet API aggregations
• SQL queries implemented as Streaming expressions
Cons:
• Complex joins not supported
• No UDF/UDAF support
• COUNT DISTINCT not supported
• Incompatible with analytical tools like Tableau
TPC-DS Benchmark
WITH customer_total_return AS
(SELECT sr_customer_sk AS ctr_customer_sk, sr_store_sk AS ctr_store_sk,
sum(sr_return_amt) AS ctr_total_return
FROM store_returns, date_dim
WHERE sr_returned_date_sk = d_date_sk AND d_year = 2000
GROUP BY sr_customer_sk, sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1, store, customer
WHERE ctr1.ctr_total_return >
(SELECT avg(ctr_total_return)*1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
AND s_store_sk = ctr1.ctr_store_sk
AND s_state = 'TN'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id LIMIT 100
TPC-DS: Transaction Processing Performance Council, Decision Support benchmark
Spark SQL
• Spark SQL provides a powerful, extensible query plan
optimizer with SQL:2003 support
• SQL parser that supports both ANSI SQL and Hive QL
• Whole-stage code generation (since 2.0)
• Ability to run SQL queries across data files, Hive tables,
external databases, and Spark connectors
• Supports predicate pushdown
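A minimal sketch of what predicate pushdown looks like from the Spark side; the SparkSession setup and the Parquet path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pushdown-sketch")
  .master("local[*]")
  .getOrCreate()

// Register a Parquet file as a temporary view (path is hypothetical)
spark.read.parquet("/data/logs.parquet").createOrReplaceTempView("logs")

// Catalyst can push the WHERE predicate down to the Parquet reader,
// so only matching row groups are scanned
val errors = spark.sql(
  "SELECT host, count(*) AS hits FROM logs WHERE status = 500 GROUP BY host")

// explain() shows PushedFilters in the physical plan
errors.explain()
```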
Spark SQL execution
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Solr as Spark SQL Data Source
• Connectivity between Spark and Solr supported via spark-solr
(https://github.com/LucidWorks/spark-solr) (open source)
• Read/write data from/to Solr (request handlers, streaming
expressions, SQL)
• Use the Solr Schema API to access field-level metadata
• Supports Basic auth and Kerberos authentication to Solr
• Maintained and supported by Lucidworks
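As a sketch, reading a Solr collection into a DataFrame with spark-solr looks like this; the ZooKeeper address and collection names are placeholders, and a SparkSession (`spark`) is assumed:

```scala
// zkhost points at the SolrCloud ZooKeeper ensemble (placeholder address)
val options = Map(
  "zkhost" -> "localhost:9983",
  "collection" -> "movies"
)

// Read the collection as a DataFrame; the schema is derived
// from the Solr Schema API
val movies = spark.read.format("solr").options(options).load()

// Writes go through the same data source
movies.limit(10).write.format("solr")
  .options(Map("zkhost" -> "localhost:9983", "collection" -> "movies_sample"))
  .save()
```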
Fun Experiment with Fusion SQL Engine
• A blog post compared COUNT DISTINCT performance across
common databases using a query typical for dashboards
• 14M log rows, 1,200 distinct dashboards, 1,700 distinct
user_id/dashboard_id pairs
• Replicated the experiment with Fusion on EC2 (m1.xlarge),
single Solr instance
Fusion: 1.2 secs (cold)
https://www.periscopedata.com/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
Fusion SQL
select
dashboards.name,
log_counts.ct
from dashboards
join (
select distinct_logs.dashboard_id,
count(1) as ct
from (
select distinct dashboard_id, user_id
from time_on_site_logs
) as distinct_logs
group by distinct_logs.dashboard_id
) as log_counts
on log_counts.dashboard_id = dashboards.id
order by log_counts.ct desc
Fusion SQL (streaming expression for the count-distinct subquery)
sort(
select(
rollup(
facet(
time_on_site_logs,
q="dashboard_id:[* TO *]",
buckets="dashboard_id,user_id",
bucketSizeLimit=10000,
bucketSorts="dashboard_id asc",
count(*)
),
over="dashboard_id",
count(*)
),
dashboard_id, count(*) as ct
),
by="ct desc"
)
select distinct_logs.dashboard_id,
count(1) as ct
from (
select distinct dashboard_id, user_id
from time_on_site_logs
) as distinct_logs
group by distinct_logs.dashboard_id
Fusion SQL (full-text search)
q=*:*&rows=5000&collection=movies&qt=/sql&sql=select+movie_id+from+movies+where+_query_%3D'plot_txt_en:dogs'
select movie_id from movies where _query_='plot_txt_en:dogs';
Fusion SQL (timestamp pushdown)
q=*:*&rows=5000&qt=/export&fl=host_s,port_s&fq=timestamp_tdt:{2017-08-31T00:00:00.00Z+TO+*]&sort=id+asc&collection=logs
SELECT host_s, port_s from logs WHERE timestamp_tdt > '2017-08-31';
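The same pushdown can be triggered from the DataFrame API; a sketch, assuming a SparkSession and a spark-solr DataFrame over a hypothetical logs collection:

```scala
val logs = spark.read.format("solr")
  .options(Map("zkhost" -> "localhost:9983", "collection" -> "logs"))
  .load()

// The timestamp filter is translated into a Solr fq range clause
// rather than being evaluated row-by-row in Spark
logs.filter("timestamp_tdt > '2017-08-31'")
  .select("host_s", "port_s")
  .show()
```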
Fusion SQL (Solr SQL)
q=*:*&rows=5000&qt=/sql&sql=SELECT+movie_id,+COUNT(*)+as+num_ratings,+avg(rating)+as+aggAvg+FROM+ratings+GROUP+BY+movie_id+HAVING+COUNT(*)+>+100+ORDER+BY+aggAvg+ASC+LIMIT+10&collection=ratings
SELECT movie_id
FROM
(SELECT movie_id,
COUNT(*) AS num_ratings,
avg(rating) AS aggAvg
FROM ratings
GROUP BY movie_id
HAVING COUNT(*) > 100
ORDER BY aggAvg ASC
LIMIT 10) as solr;
Fusion SQL (Time partitioned collections)
q=*:*&rows=5000&qt=/export&fl=song_s,gender_s&sort=id+asc&fq=ts_dt:{2017-09-10T00:00:00.00Z+TO+*]&collection=eventsim_2017_09_11
SELECT song_s, gender_s FROM eventsim WHERE ts_dt > '2017-09-10';
Fusion SQL
Zeppelin DEMO
Plugging in custom strategies

sqlContext.experimental.extraStrategies ++= Seq(new SpecialPushdownStrategy)

class SpecialPushdownStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case Aggregate(Nil,
        Seq(Alias(AggregateExpression(Count(Seq(IntegerLiteral(n))), _, _, _), name)),
        child) =>
      // Form a Solr query, get numFound, and return a SparkPlan object
      Nil // placeholder: return the custom SparkPlan here
    case _ => Nil // strategy does not apply; fall back to Spark's planner
  }
}
Thank You
