Cassandra + Hadoop @ApacheCon

Published in: Technology
  • Talk a little about background of the theme – hippies, The Turtles, readability.
  • Mention Jeff Hodges, Johan, Stu, and Todd Lipcon.
  • Mention how InputSplit works and how it can choose among replicas – array of locations returned.
  • Highlight how this is the same extension point that is used with HDFS, HBase and any other data source/destination for MapReduce.
  • In other words, are people using this in the real world, in production?
    Put some notes in here about Raptr's and Imagini's use cases.
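The speaker note on InputSplit locations can be sketched in a few lines. This is a pure-Python illustration, not Cassandra or Hadoop code: `Split`, `pick_host`, and the host names are hypothetical, and in a real cluster the Hadoop scheduler makes this choice from the array of locations each split reports.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Split:
    """Hypothetical stand-in for an input split over a Cassandra token range."""
    start_token: str
    end_token: str
    locations: List[str]  # hosts that hold a replica of this range


def pick_host(split: Split, task_trackers: List[str]) -> Optional[str]:
    """Prefer a task tracker co-located with one of the split's replicas."""
    trackers = set(task_trackers)
    for host in split.locations:
        if host in trackers:
            # Data-local: the map task reads from the node it runs on.
            return host
    # No co-located tracker: fall back to any tracker (a remote read).
    return task_trackers[0] if task_trackers else None


split = Split("0", "100", ["cass2", "cass3"])
print(pick_host(split, ["cass1", "cass3"]))
```

Because each split carries several replica locations, the scheduler has more than one chance to place the task next to its data.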
  • Cassandra + Hadoop @ApacheCon

    1. So Happy Together
    2.
       • BigTable + Dynamo
       • Semi-structured data model
       • Decentralized – no special roles
       • Ridiculously fast writes, fast reads
       • Tunably consistent
       • Cross-DC capable
    3.
       • You design your data model based off of your query model
       • Real-time ad-hoc queries aren’t viable
       • Secondary indexes help (0.7)
       • What about analytics?
    4.
       • Hadoop has analytics
       • MapReduce
       • Pig/Hive and other tools built above MapReduce
       • Configurable data sources/destinations
       • Many already familiar with it
       • Active community
    5.
       • Always able to output to Cassandra directly
       • 0.6
         • ColumnFamilyInputFormat
         • Pig support – Cassandra LoadFunc
       • 0.7
         • ColumnFamilyOutputFormat
         • Hadoop Streaming Output
         • Streamlined configuration
    6.
       • Recipe
         • Overlay Hadoop on top of Cassandra
         • Separate server for name node and job tracker
         • Co-locate task trackers with Cassandra nodes
         • Add data nodes to taste
       • Voilà
         • Data locality
         • Analytics engine scales with data
       • Example
    7.
       • Cassandra-specific InputFormat
       • Configuration – ConfigHelper, Hadoop variables
       • InputSplits over the data – tunable
       • Example usage in contrib/word_count
    8.
       • OutputFormat
       • Configuration – ConfigHelper, Hadoop variables
       • Batches output – tunable
       • Don’t have to use the Cassandra API
       • Some optimizations (e.g. ConsistencyLevel.ONE)
       • Example usage in contrib/word_count
    9.
       • 60,000+ Documented UFO Sightings
       • Data set from http://infochimps.com

       sighted_at | reported_at | location | shape | duration | description
       19951009 | 19951009 | Iowa City, IA | | | Man repts. Witnessing “flash, followed by a classic UFO, w/ a tailfin at back.” …
       19940801 | 19950220 | Renton, WA | | | Man repts. seeing 2x large ships hovering in night sky while using Russian-made night binoculars.
       19970111 | 19970111 | St. Cloud, MN | pyramid | 2 min. | Summary : Right when me and my friend left my house we saw a bright green glowing object that looked like a 4 sided pyramid then after about 2 min it took off straight into the sky leaving a yellow trail behind it…
    10.
       • What about languages outside of Java?
       • Build on what Hadoop uses - Streaming
       • Output streaming in 0.7.0
       • Example in contrib/hadoop_streaming_output
       • Input streaming in progress, likely 0.7.1
    11.
       • Developed at Yahoo!
       • PigLatin/Grunt shell
       • Powerful scripting language for analytics
       • Example usage in contrib/pig
       • Configuration – Hadoop/Env variables
    12.
       • Raptr.com
         • Home-grown solution -> Cassandra + Hadoop
         • Query time: hours -> minutes
         • Pig obviated their need for multi-lingual MR
         • Speed and ease are enabling
       • Imagini/Visual DNA
       • US Government (Digital Reasoning)
         • See http://github.com/digitalreasoning/PyStratus
    13.
       • Hive support in progress (HIVE-1434)
       • Hadoop Input Streaming (likely 0.7.1)
       • Performance improvements
    14.
       • Hadoop analytics for Cassandra
       • Data locality for processing
       • Scales with the cluster
    15.
       • More information
         • http://cassandra.apache.org
         • http://wiki.apache.org/cassandra/HadoopSupport
         • Cassandra: The Definitive Guide
       • About me
         • jeremy.hanna@rackspace.com
         • @jeromatron on Twitter
         • jeromatron on IRC in #cassandra
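To make the InputFormat and OutputFormat slides (7 and 8) more concrete, here is a minimal word count sketch with batched output. It is a pure-Python simulation under stated assumptions, not the code in contrib/word_count and not the Cassandra or Hadoop API: the row shape, the `text` column name, and the helpers `map_words`, `reduce_counts`, and `batched` are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed row shape: (row key, {column name: column value}).
Row = Tuple[str, Dict[str, str]]


def map_words(rows: Iterable[Row], column: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for each word in the chosen column."""
    for _key, columns in rows:
        for word in columns.get(column, "").split():
            yield word, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the counts emitted for each word."""
    totals: Counter = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Group results into batches, mimicking a tunable output batch size."""
    batch: List = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly smaller, batch


rows = [("row1", {"text": "ufo ufo light"}), ("row2", {"text": "light"})]
counts = reduce_counts(map_words(rows, "text"))
for batch in batched(sorted(counts.items()), 1):
    print(batch)  # in a real job, each batch would be one write to the store
```

A larger batch size means fewer round trips to the store at the cost of more buffered results per task, which is the trade-off behind a tunable batch setting.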