An efficient data mining solution by integrating Spark and Cassandra

Integrating Cassandra (C*) and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain better results than running Spark over HDFS, because Cassandra's philosophy is much closer to that of Spark's RDDs than HDFS's is. With Cassandra, the aim is a system that mines all the information stored in C* much more efficiently than if the same information were stored in HDFS. Cassandra's data storage and Spark's data mining power: an unrivalled mix.

Published in: Technology

  1. AN EFFICIENT DATA MINING SOLUTION
  2. Hadoop?
  3. Cassandra?
  4. Spark?
  5. Stratio Deep: An efficient data mining solution. “Two and two are four? Sometimes… Sometimes they are five.” G. Orwell #StratioB
  6. Goals
     • Why do you need Cassandra?
     • What is the problem?
     • Why do you need Spark?
     • How do they work together?
  7. Cassandra
     • Based on Dynamo…
     • Replication, key/value, P2P
     • And based on Bigtable…
     • Column oriented
  8. ROBUST, FAST, EFFICIENT
  9. NO BOTTLENECK, DECENTRALIZED, REPLICATED
  10. Another database?
  11. Why?
  12. Case A: One user – Lots of data
  13. Case B: Many users – Little data
  14. Case C: Many users – Lots of data
  15. Crawler app: 100 M indexed pages, 3k reads, query time < 1 s. Cassandra, I choose you
  16. But…
  17. Marketing walks in
  18. New query: “I need to find all the references to the domain ACME. I need the answer by Friday.”
  19. Problem: Cassandra is not well suited to resolving this type of query. You need to design the schema with the query in mind
  20. Challenge Accepted
  21. What options do we have?
      • Run Hive Query on top of C*
      • Write an ETL script and load data into another DB
      • Clone the cluster
  22. What options do we have?
      • Run Hive Query on top of C*
      • Write ETL scripts and load into another DB
      • Clone the cluster
  23. And now… what can we do? “We can't solve problems by using the same kind of thinking we used when we created them” Albert Einstein
  24. Spark
      • Alternative to MapReduce
      • A low-latency cluster computing system
      • For very large datasets
      • Created by UC Berkeley AMPLab in 2010
      • May be up to 100 times faster than MapReduce for:
        • Iterative algorithms
        • Interactive data mining
  25. Logistic regression in Spark vs Hadoop. SOURCE | http://spark.incubator.apache.org/
  26. WHO USES SPARK?
  27. Spark and Cassandra: integration points
  28. Cassandra’s HDFS abstraction layer (INTEGRATION POINTS: HDFS OVER CASSANDRA)
      • Advantages:
        • Easily integrates with legacy systems
      • Drawbacks:
        • Very high-level: no access to Cassandra’s low-level features
        • Questionable performance
  29. Cassandra’s Hadoop Interface (INTEGRATION POINTS: HDFS OVER CASSANDRA)
      • Thrift protocol
      • CQL3 (our implementation)
        • Uses Cassandra’s new CqlPagingInputFormat
  30. CQL3 Integration (INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3)
      • Supports CQL3 features
      • Respects data locality
      • Good compromise between performance and implementation complexity
  31. CQL3 Integration (II) (INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3). Provides a Java-friendly API:
      • Developers map Column Families to custom serializable POJOs
      • StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs
  32. Demo
  33. CQL3 Integration (III) (INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3). Drawbacks:
      • Still not performing as well as we’d like
        • Uses Cassandra’s Hadoop Interface
      • No analyst-friendly interface:
        • No SQL-like query features
  34. Future extensions. What are we currently working on? Bringing the integration to another level:
      • Drop Cassandra’s Hadoop Interface
      • Direct access to Cassandra’s SSTable files
      • Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power
  35. Conclusion
  36. THANKS
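
Slides 18 and 19 describe a query the schema was not designed for. The following is a minimal sketch of that problem in plain Python, not Cassandra or Stratio Deep code: a dictionary stands in for a table partitioned by page URL, and every name and URL in it (`pages_by_url`, `acme.example`, and so on) is a hypothetical chosen for illustration. It contrasts the lookup the table was built for with the full scan that the marketing query forces, and then with a second table keyed by the referenced domain.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical crawler rows: page URL (the partition key) -> outgoing links.
pages_by_url = {
    "http://a.example/1": ["http://acme.example/home", "http://b.example/x"],
    "http://b.example/x": ["http://c.example/"],
    "http://c.example/": ["http://acme.example/jobs"],
}

def links_of(url):
    """The query the table was designed for: one lookup by partition key."""
    return pages_by_url.get(url, [])

def pages_referencing_scan(domain):
    """The marketing query: there is no key to look up, so every row is read."""
    return sorted(
        url
        for url, links in pages_by_url.items()
        if any(urlparse(link).hostname == domain for link in links)
    )

def build_pages_by_domain(table):
    """Query-first alternative: a second table keyed by the referenced domain."""
    index = defaultdict(set)
    for url, links in table.items():
        for link in links:
            index[urlparse(link).hostname].add(url)
    return index

pages_by_domain = build_pages_by_domain(pages_by_url)
```

With 100 M indexed pages, the scan is the kind of whole-dataset job that a distributed engine such as Spark is meant to handle, while the second table is what "designing the schema with the query in mind" would look like if the query had been known in advance.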
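
Slides 24 and 25 single out iterative algorithms, with logistic regression as the benchmark. A rough sketch of why such algorithms benefit from keeping data in memory, written in plain Python rather than against the Spark API: the training loop below makes a full pass over the same dataset on every iteration. The dataset, learning rate, and iteration count are made-up values for illustration only.

```python
import math

# Tiny hypothetical dataset of (feature, label) pairs, reused on every pass.
points = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, iterations=200, rate=0.5):
    """Fit a one-weight logistic regression by batch gradient descent."""
    w = 0.0
    for _ in range(iterations):
        # Each iteration reads the whole dataset again to compute the gradient.
        gradient = sum((sigmoid(w * x) - y) * x for x, y in data)
        w -= rate * gradient / len(data)
    return w

weight = train(points)
```

When every pass has to re-read its input from disk, as in a chain of MapReduce jobs, the read cost is paid once per iteration. Spark lets a dataset be cached in memory after the first pass, which is the usual explanation for the speed-ups reported for this kind of workload.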
