Using Spark and Shark for
Fast Cycle Analysis on Diverse Data
Vaibhav Nivargi
12.2.13
clearstorydata.com

Clear story _spark_

  1. Using Spark and Shark for Fast Cycle Analysis on Diverse Data
     Vaibhav Nivargi
     12.2.13
     clearstorydata.com
  2. About ClearStory Data
  3. Analysis in the New Data Landscape
     New use cases seen in all industries.
     • Live situational analysis requiring fast-cycle analysis across internal data and sources of external data
     • Multi-source analysis with data refreshing on new insights, as data from sources evolves
     • Large-scale analysis of structured and unstructured data combined in integrated insights
  4. Example: Interactive Multi-source Analysis
     More data and more people change the analysis.
     [Diagram: Interactive analysis on diverse internal & external data]
     • News Coverage: Online, Print, Television
     • Data Intelligence
     • Donations: New Members, Donations
     • Website Traffic: Traffic, Referrals, Content
     • Facebook: Shares, Likes, Comments
     • Twitter: Followers, Tweets, Retweets
     • Corporate Sponsors: Corporate Engagement, New Inquiries
  5. Today’s Need is Speed, Scale & Ad Hoc Flexibility
     With more sources, more data and more people.
  6. Why Spark and Shark?
     • RDDs
       – Low latency & scale
       – Iterative and Interactive computation
     • Lineage and fault tolerance
       – Able to re-derive data
     • Expressive power of Scala and SQL
       – Operations beyond aggregations, joins, and statistical operators
       – Advanced: ML, data mining, segmentation, approximate queries, graphs …
     • Support for structured and semi-structured data
     • BDAS Stack & AMPLab
       – Tachyon, MLBase, BlinkDB, GraphX …
     • Community and adoption
  7. The ClearStory Solution
     [Diagram labels]
     • Data Sources
     • ClearStory Platform
     • ClearStory Application
     • Harmonization
     • Data Inference & Profiling
     • In-Memory Data Units
     • Visualization
     • Collaboration
  8. Where do Spark & Shark fit?
     [Diagram labels]
     • User Application
     • ClearStory API
     • Harmonization Engine and Blended Data Processing
     • Spark Cluster + ClearStory IP
     • Data Access, Inference and Lineage
     • Data Source API
     • RDBMS, Hadoop, Files, Public Web, Premium
  9. How we leverage Spark & Shark
     • User intent captured and translated to custom API
     • Harmonization-as-a-Service
     • Manages Spark and Shark query execution
     • Read cached data from HDFS
     • RESTful
     • Merges datasets (RDDs) on the fly – on user request
     • Support conversion of user actions to backend queries
     • Query optimizations
     • Performance optimizations
     • Mixed-mode execution (sql2rdd & spark native)
     • Caching
     • Pre-computation
  10. How we leverage Spark & Shark
     • Query results returned to the application for scalable visualization and ClearStory-specific viz techniques
     • RDDs cached/un-cached and materialized at strategic points based on usage patterns and signals
     • Data updates automatically processed as source data changes
     • ClearStory’s own deployment, packaging, and integrated monitoring for operations at scale
  11. Spark Developments – What We Like
     • Query cancellation, progress indication (0.8.1 and beyond)
     • More performance breakthroughs
     • Workload Management
     • BlinkDB
     • MLBase
     • Tachyon
     • GraphX
  12. We’re Hiring!
     • Working with the community, giving back
     • Lots of exciting new developments
     • This is like the early days of Hadoop – massive momentum gathering
     The First Spark Summit!
     More Meet-ups!
  13. clearstorydata.com
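
Slide 6 cites lineage ("able to re-derive data") and slide 10 mentions RDDs being cached and un-cached based on usage patterns. The deck shows no code, so the following is only a minimal sketch of those two ideas in plain Python. It is not Spark's API and not ClearStory's implementation: `ToyDataset`, `UsageBasedCache`, `evict`, and the `threshold` parameter are hypothetical names invented for this illustration.

```python
from collections import Counter
from typing import Callable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for an RDD: it stores the recipe (lineage), not the data."""

    def __init__(self, name: str, compute: Callable[[], List[int]],
                 parent: Optional["ToyDataset"] = None):
        self.name = name
        self.parent = parent
        self._compute = compute          # how to rebuild this dataset from its parent
        self._cache: Optional[List[int]] = None
        self.cache_enabled = False
        self.computations = 0            # how many times the recipe was actually run

    def map(self, name: str, fn: Callable[[int], int]) -> "ToyDataset":
        return ToyDataset(name, lambda: [fn(x) for x in self.collect()], parent=self)

    def filter(self, name: str, pred: Callable[[int], bool]) -> "ToyDataset":
        return ToyDataset(name, lambda: [x for x in self.collect() if pred(x)], parent=self)

    def persist(self) -> "ToyDataset":
        self.cache_enabled = True
        return self

    def unpersist(self) -> "ToyDataset":
        self.cache_enabled = False
        self._cache = None
        return self

    def evict(self) -> None:
        """Simulate losing the cached copy (for example, a failed node)."""
        self._cache = None

    def collect(self) -> List[int]:
        if self._cache is not None:
            return list(self._cache)     # served from cache, no recomputation
        self.computations += 1
        data = self._compute()           # re-derive from lineage
        if self.cache_enabled:
            self._cache = data
        return list(data)

    def lineage(self) -> List[str]:
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))


class UsageBasedCache:
    """Assumed policy: start caching a dataset once it has been read `threshold` times."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.reads = Counter()

    def read(self, ds: ToyDataset) -> List[int]:
        self.reads[ds.name] += 1
        if self.reads[ds.name] >= self.threshold:
            ds.persist()
        return ds.collect()


source = ToyDataset("source", lambda: list(range(10)))
evens = source.filter("evens", lambda x: x % 2 == 0)
squares = evens.map("squares", lambda x: x * x)

policy = UsageBasedCache(threshold=2)
first = policy.read(squares)      # computed, not yet cached
second = policy.read(squares)     # computed again, now cached
third = policy.read(squares)      # served from cache
squares.evict()                   # cached copy is lost
recovered = policy.read(squares)  # rebuilt from lineage, same result

print(squares.lineage())          # ['source', 'evens', 'squares']
print(recovered)                  # [0, 4, 16, 36, 64]
```

The point of the sketch is that a dataset defined by its transformations can always be rebuilt after a cached copy disappears, which is what makes it safe to cache and un-cache freely in response to usage signals.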
