Polyglot Persistence & Big Data in the Cloud

    1. Polyglot Persistence
       Big Data in the Cloud
       Andrei Savu / andrei.savu@cloudsoftcorp.com
    2. Overview
       • Introduction
       • Databases
       • Search
       • Processing
       • Deployment
    3. Polyglot Persistence
       “Polyglot Persistence, like polyglot programming, is all about choosing the right persistence option for the task at hand”
       http://www.nearinfinity.com/blogs/scott_leberknight/polyglot_persistence.html
       http://martinfowler.com/bliki/PolyglotPersistence.html
    4. It all started from ... a set of papers released by Google & Amazon
    5. The papers
       • Google Filesystem (2003) http://research.google.com/archive/gfs.html
       • Google MapReduce (2004) http://research.google.com/archive/mapreduce.html
       • Google BigTable (2006) http://research.google.com/archive/bigtable.html
       • Amazon Dynamo (2007) http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
    6. Databases
    7. Apache HBase
       • Java
       • designed to be able to store massive amounts of data
       • speaks HTTP / REST, Thrift, Avro
       • based on Google BigTable
       • persistence through HDFS (Hadoop)
       • Map/Reduce with Hadoop
       • designed for real time workloads
       • https://hbase.apache.org/
    8. Apache Cassandra
       • Java
       • inspired by Google BigTable and Amazon Dynamo
       • tunable trade-offs
       • query by column and range of keys
       • really fast writes
       • excellent for a large number of high speed counters
       • Map/Reduce possible with Hadoop
       • http://cassandra.apache.org/
    9. MongoDB
       • C++
       • document database (BSON) with rich indexing
       • master / slave replication
       • built-in sharding
       • auto failover with replica sets
       • map/reduce with JavaScript
       • server side JavaScript
       • journaling
       • fast in-place updates
       • http://www.mongodb.org/
    10. Apache CouchDB
        • Erlang
        • document database (JSON)
        • bi-directional replication
        • advanced conflict resolution
        • MVCC - writes do not block reads
        • exposes a stream of realtime updates
        • needs compacting
        • indexing via views (JS)
        • attachment handling
        • https://couchdb.apache.org/
    11. Riak (Basho)
        • Erlang, C, JavaScript
        • key, value store
        • focus on fault tolerance and cross datacenter replication
        • speaks HTTP/REST or custom binary
        • tunable trade-offs (N, R, W)
        • mapreduce in JS or Erlang
        • full-text indexing with riak search
        • http://wiki.basho.com/
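The "tunable trade-offs (N, R, W)" bullet can be made concrete with a little arithmetic. In Dynamo-style stores, N is usually the number of replicas per key, R the number of replicas that must answer a read, and W the number that must acknowledge a write. The sketch below is not Riak's API; `quorum_properties` and its result keys are hypothetical names chosen for illustration, and it ignores refinements such as sloppy quorums.

```python
def quorum_properties(n: int, r: int, w: int) -> dict:
    """Summarise what an (N, R, W) choice implies in a Dynamo-style store.

    Illustrative helper only, not part of any Riak client library.
    """
    if n < 1 or not (1 <= r <= n) or not (1 <= w <= n):
        raise ValueError("need N >= 1 and 1 <= R, W <= N")
    return {
        # Any R replicas and any W replicas share at least one node
        # exactly when R + W > N.
        "read_and_write_sets_overlap": r + w > n,
        # How many replicas may be down while the operation can still succeed.
        "replicas_down_tolerated_by_reads": n - r,
        "replicas_down_tolerated_by_writes": n - w,
    }


if __name__ == "__main__":
    for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 1, 3)]:
        print((n, r, w), quorum_properties(n, r, w))
```

Lower R and W favour latency and availability; higher values favour reading what was last written. That is the dial the slide refers to.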
    12. Neo4j
        • Java
        • graph database
        • speaks HTTP/REST
        • standalone or embeddable in Java apps
        • full ACID
        • web admin interface
        • nodes & relationships can have metadata
        • indexing
        • http://neo4j.org/
    13. Redis
        • C/C++
        • disk-backed data structure server
        • master-slave replication
        • supports: strings, lists, sets, hashes, sorted sets
        • batch operations
        • values can be expired
        • Pub/Sub for messaging
        • ideal for rapidly changing data that fits in memory
        • http://redis.io/
    14. Search
    15. elasticsearch
        • Java
        • based on Apache Lucene
        • distributed by design
        • cloud aware (Amazon)
        • understands JSON objects
        • no schema required
        • simple multi-tenancy
        • real-time search
        • scales to 100s of machines
        • http://www.elasticsearch.org/
    16. Apache SolrCloud
        • Java
        • based on Apache Lucene (share the same repo)
        • adds distributed capabilities to Solr
        • based on ZooKeeper for coordination & config
        • automatic management of multiple shards
        • automatic fail-over
        • durable writes
        • https://wiki.apache.org/solr/SolrCloud
    17. Processing
    18. Apache Hadoop
        • Java, C/C++
        • set of distributed systems (HDFS, MapReduce etc.)
        • framework for distributed data processing
        • simple programming model (map / reduce)
        • can scale to 1000s of machines
        • designed to be highly available at the application level
        • https://hadoop.apache.org/
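To show what the "simple programming model (map / reduce)" looks like, here is a single-process word count in plain Python. It is only a sketch of the model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and a real cluster would run the map and reduce steps on many machines with the framework handling the grouping in between.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key (done by the framework in Hadoop)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data in the cloud", "big data"]))
```

Because each `map_phase` call looks at one line and each `reduce_phase` call at one key, both steps can be spread across machines without coordination, which is what lets the model scale.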
    19. Hadoop Ecosystem
        • HDFS (Storage)
        • MapReduce (Processing)
        • Hive, Pig (high level languages)
        • HBase (database)
        • ZooKeeper (coordination)
        • Oozie (workflow)
        • Mahout (machine learning)
        • Flume (log streaming)
        • Sqoop (data import)
        • Whirr (deployment)
    20. Deployment
        on Cloud Infrastructure (using jclouds)
    21. Apache Whirr
        https://whirr.apache.org/
        * disclaimer: I am a member of the PMC
    22. First Steps
        • Download
            $ curl -O http://www.apache.org/dist/whirr/whirr-0.7.1/whirr-0.7.1.tar.gz
            $ tar zxf whirr-0.7.1.tar.gz; cd whirr-0.7.1
        • Use
            # export credentials
            $ bin/whirr launch-cluster --config ...
            $ bin/whirr destroy-cluster --config ...
        https://whirr.apache.org/docs/latest/whirr-in-5-minutes.html
    23. Deploy Hadoop
        whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker, 10 hadoop-datanode+hadoop-tasktracker
        https://whirr.apache.org/docs/0.7.1/quick-start-guide.html
    24. With Mahout
        whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+mahout-client, 10 hadoop-datanode+hadoop-tasktracker
    25. Or with HBase
        whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+hbase-master+zookeeper, 10 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
    26. Or Cassandra
        whirr.instance-templates=10 cassandra
    27. And elasticsearch
        whirr.instance-templates=10 elasticsearch
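The `whirr.instance-templates` values on the slides above all follow one pattern: comma-separated groups, each an instance count followed by roles joined with `+`. The helper below is a hypothetical sketch (it is not part of Whirr) that builds such a value from Python data, which can be handy when generating several cluster variants. The `whirr.cluster-name` line and the cluster name `demo` are assumptions for illustration; credentials and provider settings are deliberately left out.

```python
from typing import Iterable, List, Sequence, Tuple

# One group = (number of instances, roles installed on each instance).
Group = Tuple[int, Sequence[str]]


def instance_templates(groups: Iterable[Group]) -> str:
    """Build the value for whirr.instance-templates from (count, roles) pairs."""
    parts: List[str] = []
    for count, roles in groups:
        if count < 1 or not roles:
            raise ValueError("each group needs a positive count and at least one role")
        parts.append(f"{count} {'+'.join(roles)}")
    return ", ".join(parts)


def render_properties(cluster_name: str, groups: Iterable[Group]) -> str:
    """Render a minimal properties fragment (no provider or credentials)."""
    return "\n".join([
        f"whirr.cluster-name={cluster_name}",
        f"whirr.instance-templates={instance_templates(groups)}",
    ])


if __name__ == "__main__":
    hadoop_with_hbase = [
        (1, ["hadoop-namenode", "hadoop-jobtracker", "hbase-master", "zookeeper"]),
        (10, ["hadoop-datanode", "hadoop-tasktracker", "hbase-regionserver"]),
    ]
    print(render_properties("demo", hadoop_with_hbase))
```

The rendered text would be saved to a file and passed to `bin/whirr launch-cluster --config`, alongside whatever provider and credential settings the target cloud requires.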
    28. Thanks!
        andrei.savu@cloudsoftcorp.com
