High-order bits from Cassandra & Hadoop

Transcript

  • 1. High-order bits from Cassandra & Hadoop
    SriSatish Ambati
    @srisatish
  • 2. Thank You!
    SVCCG is on the first page of Google search results for “cloud”!
  • 3. NoSQL-
    Know your queries.
  • 4. Points
    Use cases
    Why Cassandra?
    Use case: Hadoop, Brisk
    FUD: Consistency
    Why is Facebook not using Cassandra?
    Anti-patterns
    Community, Code, Tools
    Q&A
  • 5. Users. Netflix.
    Key by Customer, read-heavy
    Key by Customer:Movie, write-heavy
  • 6. Time series: (several customers)
    Periodic readings: dev0, dev1, …
    deviceID:metric:timestamp -> value
    Metrics are typically a far larger dataset than users.
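The key layout on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TimeSeries` class, the zero-padded timestamp width, and the in-memory dict standing in for a column family are all hypothetical, and device IDs and metric names are assumed not to contain ":".

```python
class TimeSeries:
    """Toy store keyed by deviceID:metric:timestamp (hypothetical helper)."""

    def __init__(self):
        self._store = {}  # composite key -> value

    @staticmethod
    def make_key(device_id: str, metric: str, timestamp: int) -> str:
        # Zero-pad the timestamp so lexicographic order matches time order.
        return f"{device_id}:{metric}:{timestamp:010d}"

    def put(self, device_id: str, metric: str, timestamp: int, value: float) -> None:
        self._store[self.make_key(device_id, metric, timestamp)] = value

    def scan(self, device_id: str, metric: str, start: int, end: int):
        # Range scan over sorted keys between the two boundary keys (inclusive).
        lo = self.make_key(device_id, metric, start)
        hi = self.make_key(device_id, metric, end)
        return [
            (int(key.rsplit(":", 1)[1]), self._store[key])
            for key in sorted(self._store)
            if lo <= key <= hi
        ]


ts = TimeSeries()
ts.put("dev0", "temp", 10, 1.0)
ts.put("dev0", "temp", 20, 2.0)
ts.put("dev1", "temp", 20, 9.0)
readings = ts.scan("dev0", "temp", 15, 30)  # only dev0 readings in [15, 30]
```

Sorting all keys on every scan keeps the sketch short; a real store would keep the keys ordered so that a range read touches only the relevant slice.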
  • 7. Why Cassandra?
  • 8. Operational simplicity
    peer-to-peer
  • 10. Replication:
    Multi-datacenter
    Multi-region EC2
    Multi-availability zones
  • 11. Reads local
    dc1
    dc2
    Replication:
    Multi-datacenter
    Multi-region EC2, AWS
    Multi-availability zones
  • 12. 4.21.2011, Amazon Web Services outage:
    “Movie marathons on Netflix awaiting AWS to come back up.” #ec2disabled
  • 13. 4.21.2011, Amazon Web Services outage:
    Netflix was running on AWS.
  • 14. fast durable writes.
    fast reads.
  • 15. Writes
    Sequential, append-only.
    ~1-5ms
  • 16. Writes
    Sequential, append-only.
    ~1-5ms
    On cloud: ephemeral disks rock!
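A minimal sketch of why sequential, append-only writes are cheap: each write is one append to a log file plus an in-memory update, with no seek or read involved. The `CommitLog` and `Node` classes and the tab-separated record format are assumptions made for illustration, not Cassandra's actual on-disk format; keys and values are assumed to contain no tabs or newlines.

```python
import os


class CommitLog:
    """Append-only log file (hypothetical, simplified record format)."""

    def __init__(self, path: str):
        self._fh = open(path, "ab")  # append mode: writes always go to the end

    def append(self, key: str, value: str) -> None:
        self._fh.write(f"{key}\t{value}\n".encode("utf-8"))
        self._fh.flush()
        os.fsync(self._fh.fileno())  # force the record to disk for durability

    def close(self) -> None:
        self._fh.close()


class Node:
    """Write path: append to the log first, then update the in-memory table."""

    def __init__(self, log_path: str):
        self.log = CommitLog(log_path)
        self.memtable = {}

    def write(self, key: str, value: str) -> None:
        self.log.append(key, value)
        self.memtable[key] = value


def replay(log_path: str) -> dict:
    """Rebuild the in-memory table from the log, e.g. after a restart."""
    memtable = {}
    with open(log_path, "rb") as fh:
        for line in fh:
            key, value = line.decode("utf-8").rstrip("\n").split("\t", 1)
            memtable[key] = value  # later records win over earlier ones
    return memtable
```

Calling `fsync` on every write is the most conservative choice; batching several writes per sync trades a little durability for throughput.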
  • 17. Reads
    Local
    Key & row caches (also JNA-based off-heap)
    indexes, materialized
  • 18. Reads
    Local
    Key & row caches (also JNA-based off-heap)
    indexes, materialized
    SSDs: improved read performance!
  • 19. Distribution between nodes
    Gossip
    Anti-entropy
    Failure-detector
    Lightweight
  • 20. Clients: CQL, Thrift
    pycassa, phpcassa
    hector, pelops
    (Scala, Ruby, Clojure)
  • 21. Use case #3: Hadoop
    HDFS, Cassandra, Hive
    Logs, stats, analytics
  • 22. Brisk
    Truly peer-to-peer hadoop.
  • 23. mv computation
    not data
  • 24.
  • 25. Parallel Execution View
  • 26.
  • 27. JobTracker, TaskTracker
    HDFS: NameNode, DataNode
  • 28. Cloudera
    Amazon: Elastic MapReduce
    Hortonworks
    MapR
    Brisk
  • 29. Tools & Analytics
    Hive, Pig, R
    Karmasphere
    Datameer
    … dozens of stealth startups!
  • 30. Namenode decomposition, explained.
  • 31.
  • 32.
  • 33. Use column families (tables)
    inode
    sblock
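One way to picture the decomposition is as two tables: one holding file metadata, the other holding block contents. The sketch below borrows the names `inode` and `sblock` from the slide, but the field layout, the block size, and the `ToyFileSystem` class are assumptions made for illustration and are not Brisk's actual schema.

```python
import uuid


class ToyFileSystem:
    """File metadata and block data kept in two separate tables (illustrative)."""

    def __init__(self, block_size: int = 8):
        self.block_size = block_size  # tiny on purpose, to force several blocks
        self.inode = {}   # path -> {"size": int, "blocks": [block_id, ...]}
        self.sblock = {}  # block_id -> bytes

    def write_file(self, path: str, data: bytes) -> None:
        block_ids = []
        for offset in range(0, len(data), self.block_size):
            block_id = uuid.uuid4().hex
            self.sblock[block_id] = data[offset:offset + self.block_size]
            block_ids.append(block_id)
        # The metadata row is written last and points at the blocks in order.
        # Overwriting a path leaves the old blocks behind; cleanup is omitted.
        self.inode[path] = {"size": len(data), "blocks": block_ids}

    def read_file(self, path: str) -> bytes:
        meta = self.inode[path]
        return b"".join(self.sblock[block_id] for block_id in meta["blocks"])
```

Because both tables are ordinary key-value data, they can be replicated like any other rows, which is the point of not keeping file metadata on a single dedicated node.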
  • 34. near-real time hadoop
    Low latency: cassandra_dc nodes
    Batch Analytics: brisk_dc nodes
  • 35. FUD,
    acronym: fear, uncertainty, doubt.
  • 36. Consistency: R + W > N
    Oracle, 2-node: R=1, W=2, N=2 (T=2)
    DNS
    * N is the replication factor, not to be confused with T, the total number of nodes.
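The rule R + W > N says that every read set of R replicas must share at least one replica with every write set of W replicas, so a read always reaches a replica that holds the latest write. A small sketch, with hypothetical function names, checks the inequality against a brute-force enumeration of replica sets:

```python
from itertools import combinations


def is_overlapping(r: int, w: int, n: int) -> bool:
    """The rule from the slide: reads and writes overlap when R + W > N."""
    return r + w > n


def always_intersects(r: int, w: int, n: int) -> bool:
    """Brute force: does every R-subset of N replicas meet every W-subset?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


# The 2-node example from the slide: R=1, W=2, N=2.
two_node_ok = is_overlapping(1, 2, 2) and always_intersects(1, 2, 2)
```

The two functions agree for every valid choice of R and W, which is why the one-line inequality is enough in practice.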
  • 37. Tunable, flexible.
    For high consistency:
    read: quorum, write: quorum
    For high read availability:
    high W, low R.
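To make the trade-off concrete, the sketch below maps three consistency levels to replica counts for a replication factor N, taking a quorum as a strict majority (N // 2 + 1). The level names ONE, QUORUM, and ALL follow Cassandra's; the helper functions themselves are hypothetical.

```python
def quorum(n: int) -> int:
    """Strict majority of N replicas."""
    return n // 2 + 1


LEVELS = {
    "ONE": lambda n: 1,
    "QUORUM": quorum,
    "ALL": lambda n: n,
}


def replicas_required(level: str, n: int) -> int:
    """How many replicas must respond at this level."""
    return LEVELS[level](n)


def overlaps(read_level: str, write_level: str, n: int) -> bool:
    """True when R + W > N for the chosen levels."""
    return replicas_required(read_level, n) + replicas_required(write_level, n) > n


def tolerated_failures(level: str, n: int) -> int:
    """Replicas that can be down while the operation still succeeds."""
    return n - replicas_required(level, n)
```

With N = 3, quorum reads plus quorum writes overlap and each tolerate one replica down. Writing at ALL and reading at ONE also overlaps, but a single failed replica then blocks writes.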
  • 38.
  • 39. Inbox Search:
    600+ cores, 120+ TB (2008)
    Went from 100M to 500M users.
    Average NoSQL deployment size: ~6-12 nodes.
  • 40. Use case #5: search
    Apache Solr + Cassandra = Solandra
    Other inbox/file Searches:
    xobni, c3
    github.com/tjake/solandra
  • 41. “Eventual consistency is harder to program.”
    mostly immutable data.
    complex systems at scale.
  • 42. Miscellaneous,
    Myth: data-loss, partial rows.
    writes are durable.
  • 43. Anti-Patterns
    Transactions
    Joins
    Read before write
  • 44. Anti-Patterns for cloud
    EBS
    JVM, virtualized
    single region
  • 45. Three good reasons for Cassandra...
  • 46. Tools
    AMIs, OpsCenter, DataStax
    AppDynamics
    Netflix just builds AMIs for deployment!
  • 47. Beautiful C0de
    = new code(); //less is more
    ~90k.java.concurrent.@annotate.
    Bloom filters, Merkle trees.
    non-blocking, staged event-driven.
    Bigtable, Dynamo.
  • 48. Current & Future Focus:
    Distributed Counters, CQL.
    Simple client.
    Operational smoothing.
    compaction.
  • 49. Community
    Robust. Rapid. #
    Professional support from DataStax.
    Filesystem innovation from Acunu
    Engineers: independents, startups, large companies, Rackspace, Twitter, Netflix...
    Come join the efforts!
  • 50.
  • 51. Use case #4: first NoSQL, then scale!
    SimpleDB -> Cassandra
    MongoDB -> Cassandra
  • 52.
  • 53.
  • 54.
  • 55. Copyright: xkcd
  • 56. Copyright: plantoys
    … more than one way to do it!
  • 57. Summary -
    high scale peer-to-peer datastore
    best friend for
    multi-region, multi-zone availability.
    Hadoop – HDFS engulfing the DataWorld
  • 58. Q&A
    @srisatish
  • 59. NoSQL-
    Know your queries.