Brisk hadoop june2011



Brisk: truly peer-to-peer Hadoop.

Brisk is an open-source Hadoop & Hive distribution that uses Apache Cassandra for its core services and storage. Brisk makes it possible to run Hadoop MapReduce on top of CassandraFS, an HDFS-compatible storage layer. By replacing HDFS with CassandraFS, users can run MapReduce jobs on Cassandra's peer-to-peer, fault-tolerant and scalable architecture.

With CassandraFS all nodes are peers. Data files can be loaded through any node in the cluster, and any node can serve as the JobTracker for MapReduce jobs. The Hive MetaStore is stored & accessed as just another column family (table) on the distributed data store. Brisk makes Hadoop truly peer-to-peer.

We demonstrate visualisation & monitoring of Brisk using OpsCenter. The operational simplicity of Cassandra's multi-datacenter & multi-region aware replication makes Brisk well suited to a rich set of applications and use cases. And because HDFS-style data & online data can be stored and isolated within the same cluster, Brisk makes analytics possible without ETL!
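
As a reminder of the model that the description refers to (slide 4 puts it as working on small pieces of a large data set in parallel), here is a minimal word-count sketch in plain Java. It does not use the Hadoop or Brisk APIs; the class name `WordCountSketch`, its methods, and the sample input are all hypothetical and only illustrate the map and reduce phases.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, not the Hadoop MapReduce API.
public class WordCountSketch {

    // Map phase: turn one chunk of text into per-word counts.
    static Map<String, Integer> mapChunk(String chunk) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Reduce phase: merge two partial results by summing counts per word.
    // A new map is returned so the inputs are never mutated.
    static Map<String, Integer> reduce(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> merged = new TreeMap<>(a);
        b.forEach((word, n) -> merged.merge(word, n, Integer::sum));
        return merged;
    }

    // Each chunk is mapped independently (possibly in parallel),
    // then the partial counts are combined pairwise.
    static Map<String, Integer> wordCount(List<String> chunks) {
        return chunks.parallelStream()
                .map(WordCountSketch::mapChunk)
                .reduce(new TreeMap<>(), WordCountSketch::reduce);
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList(
                "brisk hive hadoop",
                "hadoop cassandra",
                "cassandra brisk hadoop");
        System.out.println(wordCount(chunks));
        // prints {brisk=2, cassandra=2, hadoop=3, hive=1}
    }
}
```

In a real cluster the chunks would be file blocks and the map tasks would be scheduled close to where those blocks are stored, which is the "move computation, not data" point made in the slides.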

LA Scalability Talk, Mahalo
May 31, 2011

Published in: Technology

Transcript of "Brisk hadoop june2011"

  1. Brisk: Truly peer-to-peer Hadoop. srisatish.ambati AT gmail.com, DataStax/OpenJDK, @srisatish
  2. Brisk: Hive + Hadoop + Cassandra
  3. Map Reduce
  4. Have large sets of data & you can work on small pieces in parallel.
  5. Map Reduce
  6. Multi-core map reduce framework, Kunle, et al
  7. Parallel Execution View
  8. (no text)
  9. (no text)
  10. JobTracker, NameNode, HDFS
  11. Write-once-read-many! A file once created, written & closed need not be changed.
  12. Move computation, not data
  13. (no text)
  14. DataNodes: Read, Write Blocks
  15. NameNode: Single Master node; Single Machine Address space; Single Point of failure
  16. When “it” does not fit in a single node! … Enter the distributed dragon! Enter the Cassandra: High Scale, Peer-to-peer
  17. NameNode, DataNodes
  18. One-kind-of-node!
  19. Cassandra: High Scale, Peer-to-peer
  20. Portfolio Demo. Low latency: Live tick prices for stocks. Batch Analytics: Historical EOD prices, Value at Risk. http://www.datastax.com/docs/0.8/brisk/brisk_demo
  21. Demo URLs (good for this demo only):
      http://ec2-50-19-4-143.compute-1.amazonaws.com:8888/opscenter/index.html
      http://ec2-67-202-12-176.compute-1.amazonaws.com:50030/jobdetails.jsp?job
      http://ec2-50-19-4-143.compute-1.amazonaws.com:8983/portfolio/
  22. Dynamo, 2007; Bigtable, 2006; OSS, 2008; Incubator, 2009; TLP, 2010
  23. Cassandra: High Scale, Peer-to-peer, No SPOF [diagram with lettered nodes and Key “C”]
  24. (no text)
  25. (no text)
  26. Brisk
  27. Brisk, HowStuffWorks version
  28. YDH security edition (soon to be Apache); Apache Hive: SQL-like access; CassandraHandler; Cassandra 0.8
  29. Use ColumnFamilies: inode, sblock
  30. String keyspace = "cfs";
      CfDef cf = new CfDef();
      cf.setName(inodeDefaultCf);
      cf.setComparator_type("BytesType");
      …
      cf.setName(sblockDefaultCf);
      cf.setKey_cache_size(1000000); // written as "1M" on the slide
      cf.setComment("Stores blocks of information associated with a inode");
      cf.setKeyspace(keyspace);
  31. Consistency: R + W > N
      "brisk.consistencylevel.read", "QUORUM";
      "brisk.consistencylevel.write", "QUORUM";
  32. Hadoop: job tracker, task tracker
  33. BriskSnitch: brisk nodes, cassandra nodes
  34. BriskSimpleSnitch.java
      if (TrackerInitializer.isTrackerNode)
      {
          myDC = BRISK_DC;
          logger.info("Detected Hadoop trackers are enabled, setting my DC to " + myDC);
      }
      else
      {
          myDC = CASSANDRA_DC;
          logger.info("Looks like Vanilla Cassandra nodes, setting my DC to " + myDC);
      }
  35. Hive: SQL-like access. cli, hwi, jdbc, metastore. Pushdown predicates (v beta2)
  36. hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
      hive> LOAD DATA LOCAL INPATH '$BRISK_HOME/resources/hive/examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
      hive> SELECT count(*), ds FROM invites GROUP BY ds;
      http://www.datastax.com/docs/0.8/brisk/about_hive
  37. ETL, Real-time, Cassandra CFs, DataCenters, Scale
  38. (no text)
  39. No me in team! Ben Coverston, Ben Werther, Brandon Williams, Cathy Daw, Daria Hutchinson, Jackson Chung, Jake Luciani, Joaquin Casares, Jonathan Ellis, Michael Allen, Mike Bulman, Michael Weir, Nate McCall, Nick M Bailey, Patricio Echague, Tyler Hobbs, SriSatish Ambati, Yewei Zhang
  40. 100-node Brisk Cluster on Opscenter
  41. Dynamo, 2007 + Bigtable, 2006; OSS, 2008; Incubator, 2009; TLP, 2010; Cassandra + + Brisk
  42. git clone git@github.com:riptano/brisk.git
      http://www.datastax.com/product/brisk
      Getting Started via Brisk AMI. Mahalo. Thank You.
  43. References
      ● Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, 2004. http://bit.ly/googmr_pdf
      ● Kunle, et al., Multi-core MapReduce. http://bit.ly/iRJd1n
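
Slide 31 states the rule R + W > N and sets both Brisk consistency levels to QUORUM. The short Java sketch below works through that arithmetic. It is not Brisk or Cassandra code: the class `QuorumSketch` and its methods are hypothetical names, and the replication factor of 3 is an assumed example value. The one fact taken from Cassandra is that a QUORUM is a majority of replicas, floor(N/2) + 1.

```java
// Illustrative only: checks when a read is guaranteed to see the latest write.
public class QuorumSketch {

    // QUORUM for replication factor n is a majority of replicas.
    static int quorum(int n) {
        return n / 2 + 1;
    }

    // A read of r replicas must intersect a write to w replicas
    // (out of n) whenever r + w > n.
    static boolean overlaps(int r, int w, int n) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3; // assumed replication factor for this example
        int q = quorum(n);
        System.out.println("N=" + n + " QUORUM=" + q
                + " overlap=" + overlaps(q, q, n));
        // prints N=3 QUORUM=2 overlap=true
        System.out.println("ONE/ONE overlap=" + overlaps(1, 1, n));
        // prints ONE/ONE overlap=false
    }
}
```

With QUORUM on both sides, any read set and write set share at least one replica, which is why the slide pairs the two settings.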