Stratio big data spain

Published in: Technology
1 Comment
  • Interesting. I agree that at some point reading directly from SSTables is the way to go.
    I don't know if you've looked at Netflix's Aegisthus, which is a MapReduce implementation for reading Cassandra SSTables. I think it could be a good starting point for integrating Cassandra SSTables with custom Spark RDDs, and also for using CQL3 data structures in Spark as native RDDs.
    Worth exploring. Great job.
  • Good afternoon. By now, everybody should know Stratio, the big data company. First, I need to know whether some concepts sound familiar to you.

    1. AN EFFICIENT DATA MINING SOLUTION
    2. Hadoop?
    3. Cassandra?
    4. Spark?
    5. Stratio Deep: An efficient data mining solution. “Two and two are four? Sometimes… Sometimes they are five.” G. Orwell #StratioBD
    6. Goals
       • Why do you need Cassandra?
       • What is the problem?
       • Why do you need Spark?
       • How do they work together?
    7. Cassandra
       • Based on Dynamo…
       • Replication, Key/Value, P2P
       • And based on Bigtable…
       • Column oriented
    8. ROBUST FAST EFFICIENT
    9. NO BOTTLENECK DECENTRALIZED REPLICATED
    10. Another Database?
    11. Why?
    12. Case A: one user, lots of data
    13. Case B: many users, little data
    14. Case C: many users, lots of data
    15. Crawler app: 100M indexed pages, 3k reads, query time < 1s. Cassandra, I choose you
    16. But…
    17. Marketing walks in
    18. New query: “I need to find all the references to the domain ACME. I need the answer by Friday.”
    19. Problem: Cassandra is not well suited to resolving this type of query. You need to design the schema with the query in mind.
    20. Challenge Accepted
    21. What options do we have?
       • Run Hive queries on top of C*
       • Write an ETL script and load data into another DB
       • Clone the cluster
    22. What options do we have? Run Hive queries on top of C*. Write ETL scripts and load into another DB. Clone the cluster.
    23. And now… what can we do? “We can't solve problems by using the same kind of thinking we used when we created them” Albert Einstein
    24. Spark
       • Alternative to MapReduce
       • A low latency cluster computing system
       • For very large datasets
       • Created by UC Berkeley AMP Lab in 2010.
       • May be 100 times faster than MapReduce for:
         • Iterative algorithms.
         • Interactive data mining
    25. Logistic regression in Spark vs Hadoop. SOURCE | http://spark.incubator.apache.org/
    26. WHO USES SPARK?
    27. Spark and Cassandra: Integration points
    28. Cassandra’s HDFS abstraction layer
       Advantages:
       • Easily integrates with legacy systems.
       Drawbacks:
       • Very high-level: no access to Cassandra’s low-level features.
       • Questionable performance.
       INTEGRATION POINTS: HDFS OVER CASSANDRA
    29. Cassandra’s Hadoop Interface
       • Thrift protocol
       • CQL3 (our implementation)
         • Uses Cassandra’s new CqlPagingInputFormat
       INTEGRATION POINTS: HDFS OVER CASSANDRA
    30. CQL3 Integration
       • Supports CQL3 features
       • Respects data locality
       • Good compromise between performance and implementation complexity
       INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
    31. CQL3 Integration (II)
       Provides a Java-friendly API:
       • Developers map Column Families to custom serializable POJOs
       • StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs.
       INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
    32. Demo
    33. CQL3 Integration (III)
       Drawbacks:
       • Still not performing as well as we’d like:
         • Uses Cassandra’s Hadoop Interface
       • No analyst-friendly interface:
         • No SQL-like query features
       INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
    34. Future extensions
       What are we currently working on? Bring the integration to another level:
       • Dump Cassandra’s Hadoop Interface
       • Direct access to Cassandra’s SSTable files.
       • Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power
    35. Conclusion
    36. THANKS
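
The problem described on slides 18 and 19 can be illustrated with a small sketch. It is plain Python, not Cassandra or Stratio Deep code: the `Page` class, the in-memory `pages` table and both function names are hypothetical, and the dictionary only stands in for a table keyed by page URL. The point is that a lookup by key is cheap, while "find every page that links to a domain" has to read every row because the schema was not designed for that query.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Page:
    """One crawled page (hypothetical schema for this sketch)."""
    url: str       # plays the role of the row key
    links: tuple   # outgoing links found on the page


# Hypothetical crawler table, keyed by page URL.
pages = {
    p.url: p
    for p in [
        Page("http://a.example/1", ("http://acme.example/home", "http://b.example/")),
        Page("http://b.example/", ("http://c.example/x",)),
        Page("http://c.example/x", ("http://acme.example/about",)),
    ]
}


def get_page(url):
    """Key lookup: the access pattern the table was designed for."""
    return pages.get(url)


def pages_referencing(domain):
    """Full scan: every row is read, because links are not indexed by domain."""
    return sorted(
        page.url
        for page in pages.values()
        if any(urlparse(link).hostname == domain for link in page.links)
    )


if __name__ == "__main__":
    print(get_page("http://b.example/"))
    print(pages_referencing("acme.example"))
```

With 100M rows the second function is the kind of work that is better distributed across the cluster, which is the role the deck assigns to Spark.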
