Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Lessons Learned with
Spark at the US Patent &
Trademark Office
Christopher Bradford
Big Data Architect at OpenSource Conne...
Christopher Bradford
Twitter: @bradfordcp
GitHub: bradfordcp
OpenSource Connections
Exploring Search Technologies - EST
EST – Technology Stack
EST – Data Loading
CSS Ingestion (CSS2C) Solr Ingestion (C2S)
EST – C2S Process
Note: some connections are omitted for clarity
EST – C2S Process (Scaled Out)
Note: some connections are omitted for clarity
EST – C2S Review
Did it work?
Why change it?
How could we make it better?
EST – Old C2S Process
Note: some connections are omitted for clarity
EST – Spark C2S Process
Note: some connections are omitted for clarity
How did this work out?
Poorly
Poor Performance
joinedRDD = …
joinedRDD.foreach()
document = … // build document
sc = new SolrConnection()
sc.push(docume...
Poor Performance
sc = new SolrConnection()
sc.push(document)
sc.disconnect()
Optimum Performance
joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
document = … // build document
sc.push(doc...
The Solution!
joinedRDD = …
joinedRDD.mapPartitions()
sc = new SolrConnection()
partition.foreach()
document = … // build
...
Results?
Solr Indexing
Better Solr Indexing
Note: some connections are omitted for clarity
EST – Spark C2S Process v2
Note: some connections are omitted for clarity
Success?
YUP
5x faster than the original C2S process (with optimizations)
What’s Next?
•  Optimization of the C2S Spark job
•  More Spark jobs
•  Newer version of Spark & DSE
•  Scala Spark jobs i...
Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections)
Upcoming SlideShare
Loading in …5
×

Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections)

1,997 views

Published on

Presentation at Spark Summit 2015

Published in: Data & Analytics
  • Be the first to comment

Lessons Learned with Spark at the US Patent & Trademark Office-(Christopher Bradford, Open Source Connections)

  1. 1. Lessons Learned with Spark at the US Patent & Trademark Office Christopher Bradford Big Data Architect at OpenSource Connections
  2. 2. Christopher Bradford Twitter: @bradfordcp GitHub: bradfordcp
  3. 3. OpenSource Connections
  4. 4. Exploring Search Technologies - EST
  5. 5. EST – Technology Stack
  6. 6. EST – Data Loading CSS Ingestion (CSS2C) Solr Ingestion (C2S)
  7. 7. EST – C2S Process Note: some connections are omitted for clarity
  8. 8. EST – C2S Process (Scaled Out) Note: some connections are omitted for clarity
  9. 9. EST – C2S Review Did it work? Why change it? How could we make it better?
  10. 10. EST – Old C2S Process Note: some connections are omitted for clarity
  11. 11. EST – Spark C2S Process Note: some connections are omitted for clarity
  12. 12. How did this work out? Poorly
  13. 13. Poor Performance joinedRDD = … joinedRDD.foreach() document = … // build document sc = new SolrConnection() sc.push(document) sc.disconnect() // Job is done
  14. 14. Poor Performance sc = new SolrConnection() sc.push(document) sc.disconnect()
  15. 15. Optimum Performance joinedRDD = … sc = new SolrConnection() joinedRDD.foreach() document = … // build document sc.push(document) sc.disconnect() // Job is done joinedRDD = … joinedRDD.foreachPartition() sc = new SolrConnection() partition.foreach() document = … // build document sc.push(document) sc.disconnect() // Job is done Almost
  16. 16. The Solution! joinedRDD = … joinedRDD.mapPartitions() sc = new SolrConnection() partition.foreach() document = … // build document sc.push(document) sc.close() return partition.rows .collect() joinedRDD = … joinedRDD.mapPartitions() sc = new SolrConnection() partition.foreach() document = … // build document sc.push(document) sc.close() return partitions.rows.count .collect()
  17. 17. Results?
  18. 18. Solr Indexing
  19. 19. Better Solr Indexing Note: some connections are omitted for clarity
  20. 20. EST – Spark C2S Process v2 Note: some connections are omitted for clarity
  21. 21. Success? YUP 5x faster than the original C2S process (with optimizations)
  22. 22. What’s Next? •  Optimization of the C2S Spark job •  More Spark jobs •  Newer version of Spark & DSE •  Scala Spark jobs instead of Java

×