Polyglot Persistence & Big Data in the Cloud
Polyglot Persistence & Big Data in the Cloud Polyglot Persistence & Big Data in the Cloud Presentation Transcript

  • Polyglot Persistence & Big Data in the Cloud
      • Andrei Savu / andrei.savu@cloudsoftcorp.com
  • Overview
      • Introduction
      • Databases
      • Search
      • Processing
      • Deployment
  • Polyglot Persistence
      • “Polyglot Persistence, like polyglot programming, is all about choosing the right persistence option for the task at hand”
      • http://www.nearinfinity.com/blogs/scott_leberknight/polyglot_persistence.html
      • http://martinfowler.com/bliki/PolyglotPersistence.html
  • It all started from... a set of papers released by Google & Amazon
  • Google Filesystem (2003): http://research.google.com/archive/gfs.html
  • Google MapReduce (2004): http://research.google.com/archive/mapreduce.html
  • Google BigTable (2006): http://research.google.com/archive/bigtable.html
  • Amazon Dynamo (2007): http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
  • Databases
  • Apache HBase
      • Java
      • based on Google BigTable
      • designed to store massive amounts of data
      • designed for real-time workloads
      • persistence through HDFS (Hadoop)
      • Map/Reduce with Hadoop
      • speaks HTTP/REST, Thrift, Avro
      • https://hbase.apache.org/
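A minimal sketch of the BigTable-style data model HBase exposes, using the standard hbase shell; the 'metrics' table and 'cf' column family are made-up names for illustration.

    $ hbase shell
    hbase> # create a table with a single column family
    hbase> create 'metrics', 'cf'
    hbase> # cells are addressed by (row key, column family:qualifier)
    hbase> put 'metrics', 'page:home', 'cf:hits', '42'
    hbase> get 'metrics', 'page:home'
    hbase> scan 'metrics'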
  • Apache Cassandra
      • Java
      • inspired by Google BigTable and Amazon Dynamo
      • tunable trade-offs
      • query by column and range of keys
      • really fast writes
      • excellent for a large number of high-speed counters
      • Map/Reduce possible with Hadoop
      • http://cassandra.apache.org/
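A sketch of the high-speed counter use case mentioned above, assuming a node that speaks CQL3 via cqlsh (newer than the release the slides date from); the keyspace and table names are hypothetical.

    $ cqlsh
    cqlsh> CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
    cqlsh> CREATE TABLE demo.page_views (url text PRIMARY KEY, hits counter);
    cqlsh> -- counter columns support fast, concurrent increments
    cqlsh> UPDATE demo.page_views SET hits = hits + 1 WHERE url = 'http://example.com/';
    cqlsh> SELECT * FROM demo.page_views WHERE url = 'http://example.com/';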
  • MongoDB
      • C++
      • document database (BSON) with rich indexing
      • master/slave replication
      • built-in sharding
      • auto failover with replica sets
      • map/reduce with JavaScript
      • server-side JavaScript
      • journaling
      • fast in-place updates
      • http://www.mongodb.org/
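A small mongo shell session illustrating the document model, secondary indexing and fast in-place updates listed above; the collection and field names are invented for the example.

    $ mongo
    > // documents are schemaless BSON; indexes can be added on any field
    > db.users.ensureIndex({ email: 1 })
    > db.users.insert({ email: "a@example.com", visits: 1 })
    > // $inc performs an in-place update on the server
    > db.users.update({ email: "a@example.com" }, { $inc: { visits: 1 } })
    > db.users.find({ email: "a@example.com" })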
  • Apache CouchDB
      • Erlang
      • document database (JSON)
      • bi-directional replication
      • advanced conflict resolution
      • MVCC: writes do not block reads
      • exposes a stream of realtime updates
      • needs compacting
      • indexing via views (JS)
      • attachment handling
      • https://couchdb.apache.org/
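A hedged sketch of defining a JavaScript view over HTTP, assuming a local CouchDB on the default port; the database, documents and design document are hypothetical.

    $ # create a database and a document
    $ curl -X PUT http://localhost:5984/blog
    $ curl -X PUT http://localhost:5984/blog/post-1 \
           -d '{"type": "post", "author": "andrei", "title": "Polyglot Persistence"}'
    $ # a design document holds map/reduce views written in JavaScript
    $ curl -X PUT http://localhost:5984/blog/_design/posts \
           -d '{"views": {"by_author": {"map": "function(doc) { if (doc.type == \"post\") emit(doc.author, 1); }", "reduce": "_sum"}}}'
    $ # query the view
    $ curl 'http://localhost:5984/blog/_design/posts/_view/by_author?group=true'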
  • Riak (Basho)
      • Erlang, C, JavaScript
      • key/value store
      • focus on fault tolerance and cross-datacenter replication
      • speaks HTTP/REST or a custom binary protocol
      • tunable trade-offs (N, R, W)
      • map/reduce in JS or Erlang
      • full-text indexing with Riak Search
      • http://wiki.basho.com/
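A sketch of the HTTP/REST interface and per-request quorum tuning mentioned above, assuming Riak's default HTTP port (8098); bucket and key names are invented.

    $ # store and fetch a JSON object under bucket/key
    $ curl -X PUT http://localhost:8098/riak/users/alice \
           -H 'Content-Type: application/json' \
           -d '{"name": "Alice", "city": "Bucharest"}'
    $ curl http://localhost:8098/riak/users/alice
    $ # read quorum (R) can be tuned per request
    $ curl 'http://localhost:8098/riak/users/alice?r=1'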
  • Neo4j
      • Java
      • graph database
      • speaks HTTP/REST
      • standalone or embeddable in Java apps
      • full ACID
      • web admin interface
      • nodes & relationships can have metadata
      • indexing
      • http://neo4j.org/
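A rough sketch of creating nodes and a typed relationship with properties over the REST API, assuming a local Neo4j 1.x server on port 7474; the data and node ids are illustrative.

    $ # create two nodes with properties
    $ curl -X POST http://localhost:7474/db/data/node \
           -H 'Content-Type: application/json' -d '{"name": "Alice"}'
    $ curl -X POST http://localhost:7474/db/data/node \
           -H 'Content-Type: application/json' -d '{"name": "Bob"}'
    $ # connect them with a relationship (node ids depend on the responses above)
    $ curl -X POST http://localhost:7474/db/data/node/1/relationships \
           -H 'Content-Type: application/json' \
           -d '{"to": "http://localhost:7474/db/data/node/2", "type": "KNOWS", "data": {"since": 2012}}'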
  • Redis
      • C/C++
      • disk-backed data structure server
      • master-slave replication
      • supports: strings, lists, sets, hashes, sorted sets
      • batch operations
      • values can be expired
      • Pub/Sub for messaging
      • ideal for rapidly changing data that fits in memory
      • http://redis.io/
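A few redis-cli commands illustrating expiring values, sorted sets and Pub/Sub from the list above; the key names are made up.

    $ redis-cli SET session:42 alice
    $ redis-cli EXPIRE session:42 3600             # values can be expired
    $ redis-cli ZINCRBY trending 1 article:7       # sorted set as a fast-changing ranking
    $ redis-cli ZREVRANGE trending 0 9 WITHSCORES
    $ redis-cli PUBLISH events "article:7 viewed"  # Pub/Sub messaging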
  • Search
  • elasticsearch
      • Java
      • based on Apache Lucene
      • distributed by design
      • cloud aware (Amazon)
      • understands JSON objects
      • no schema required
      • simple multi-tenancy
      • real-time search
      • scales to 100s of machines
      • http://www.elasticsearch.org/
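A minimal sketch of indexing and searching a JSON document over HTTP, assuming a single local node on the default port; the index and field names are invented.

    $ # index a JSON document (no schema required up front)
    $ curl -XPUT 'http://localhost:9200/articles/article/1' \
           -d '{"title": "Polyglot Persistence", "tags": ["nosql", "cloud"]}'
    $ # near real-time full-text search
    $ curl 'http://localhost:9200/articles/_search?q=title:polyglot'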
  • Apache SolrCloud
      • Java
      • based on Apache Lucene (shares the same repo)
      • adds distributed capabilities to Solr
      • based on ZooKeeper for coordination & config
      • automatic management of multiple shards
      • automatic fail-over
      • durable writes
      • https://wiki.apache.org/solr/SolrCloud
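A hedged sketch of starting a SolrCloud node with an embedded ZooKeeper and bootstrapping its configuration, loosely following the Solr 4.x example distribution; paths and flags are assumptions and vary by version.

    $ cd example
    $ java -Dbootstrap_confdir=./solr/collection1/conf \
           -Dcollection.configName=myconf \
           -DzkRun -DnumShards=2 \
           -jar start.jar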
  • Processing
  • Apache Hadoop
      • Java, C/C++
      • set of distributed systems (HDFS, MapReduce etc.)
      • framework for distributed data processing
      • simple programming model (map/reduce)
      • can scale to 1000s of machines
      • designed to be highly available at the application level
      • https://hadoop.apache.org/
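As a sketch of the map/reduce programming model in practice, the bundled example job can be run from the shell; the jar name and HDFS paths below are assumptions that vary by Hadoop version.

    $ # copy input into HDFS
    $ hadoop fs -mkdir /input
    $ hadoop fs -put *.txt /input
    $ # run the bundled word-count example: the map step emits words, the reduce step sums counts
    $ hadoop jar hadoop-examples-*.jar wordcount /input /output
    $ hadoop fs -cat /output/part-r-00000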
  • Hadoop Ecosystem
      • HDFS (storage)
      • MapReduce (processing)
      • Hive, Pig (high-level languages)
      • HBase (database)
      • ZooKeeper (coordination)
      • Oozie (workflow)
      • Mahout (machine learning)
      • Flume (log streaming)
      • Sqoop (data import)
      • Whirr (deployment)
  • Deployment on Cloud Infrastructure (using jclouds)
  • Apache Whirr: https://whirr.apache.org/ (disclaimer: I am a member of the PMC)
  • First Steps
      • Download:
        $ curl -O http://www.apache.org/dist/whirr/whirr-0.7.1/whirr-0.7.1.tar.gz
        $ tar zxf whirr-0.7.1.tar.gz; cd whirr-0.7.1
      • Use:
        # export credentials
        $ bin/whirr launch-cluster --config ...
        $ bin/whirr destroy-cluster --config ...
      • https://whirr.apache.org/docs/latest/whirr-in-5-minutes.html
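The --config option points at a properties recipe. A minimal sketch for Amazon EC2, following the conventions of the Whirr quick-start guide (the cluster name, credential variables and key paths below are assumptions):

    # hadoop.properties (hypothetical recipe)
    whirr.cluster-name=demo-hadoop
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker, 10 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub

    $ bin/whirr launch-cluster --config hadoop.properties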
  • Deploy Hadoop
      whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker, 10 hadoop-datanode+hadoop-tasktracker
      https://whirr.apache.org/docs/0.7.1/quick-start-guide.html
  • With Mahout
      whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+mahout-client, 10 hadoop-datanode+hadoop-tasktracker
  • Or with HBase
      whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+hbase-master+zookeeper, 10 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
  • Or Cassandra
      whirr.instance-templates=10 cassandra
  • And elasticsearch
      whirr.instance-templates=10 elasticsearch
  • Thanks! andrei.savu@cloudsoftcorp.com