Key and value can be arbitrarily complex; they are serialized and can
Statistical learning as the ultimate agile development tool (Peter Norvig), “business logic” through data rather than code
Reference to previous presentation and the Amazon Dynamo model
Simple configuration to use a compressed store
EC2 testing should be ready in the next few weeks
Main project for Bhupesh and me. Minimal and tunable performance for the cluster.
Will be in the next release, on November 15th
Client bound
Give Azkaban demo here. Show HelloWorldJob. Show hello1.properties. Show dependencies with graph job. Started out as a week project, was much more complex than we realized.
Example: member data. It does not make sense to repeatedly join positions, emails, groups, etc. Explain about joins. How to better model in Java? JSON-like data model.
Period can be daily or hourly, depending on need and database availability.
Uses Hadoop's power for the computation-intensive index build. Provides Voldemort's online serving advantages.
Voldemort storage engine is very fast. Key lookup uses binary search. Value lookup uses a single seek in the value file. The operating system cache helps.
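Sketch for this slide, to make "binary search, then a single seek" concrete. This is not the actual Voldemort file format. The layout is an assumption for illustration: a sorted index of fixed-width entries (16-byte MD5 of the key plus a 4-byte offset) and a value file of length-prefixed records. The names `build` and `lookup` are hypothetical.

```python
import hashlib
import io
import struct

# Assumed index entry layout: 16-byte MD5 of the key, 4-byte offset into the value file.
ENTRY = struct.Struct(">16sI")


def build(records):
    """Build (index_bytes, value_bytes) from a dict of bytes -> bytes."""
    index, values = bytearray(), bytearray()
    hashed = sorted((hashlib.md5(k).digest(), v) for k, v in records.items())
    for digest, value in hashed:
        index += ENTRY.pack(digest, len(values))           # where this value starts
        values += struct.pack(">I", len(value)) + value    # length-prefixed value
    return bytes(index), bytes(values)


def lookup(index, value_file, key):
    """Binary search the sorted index, then do one seek into the value file."""
    target = hashlib.md5(key).digest()
    lo, hi = 0, len(index) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        digest, offset = ENTRY.unpack_from(index, mid * ENTRY.size)
        if digest == target:
            value_file.seek(offset)                        # the single seek
            (size,) = struct.unpack(">I", value_file.read(4))
            return value_file.read(size)
        if digest < target:
            lo = mid + 1
        else:
            hi = mid
    return None
```

In the batch setting, `build` would play the role of the offline index build and `lookup` the online read path; here both run in memory with `io.BytesIO` standing in for the value file.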
Client bound
Questions, comments, etc
Switch to Bhup
- Strong consistency: all clients see the same view, even in the presence of updates.
- High availability: all clients can find some replica of the data, even in the presence of failures.
- Partition tolerance: the system properties hold even when the system is partitioned.
High availability is the mantra for websites. Better to deal with inconsistencies, because their primary need is to scale well to allow for a smooth user experience.
Hashing: why do we need it?
Basic problem: clients need to know which data is where.
Many ways of solving it:
- Central configuration
- Hashing
Linear hashing works; the issue is when the cluster is dynamic. The key hash to node ID mapping changes for a lot of entries when you add new slots.
Consistent hashing preserves the key to node mapping for most of the keys and changes only the minimal amount needed.
How to do it?
Number of partitions: arbitrary. Each node is allocated many partitions (better load balancing and fault tolerance). A few hundred to a few thousand.
The key to partition mapping is fixed; only ownership of partitions can change.
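Sketch for this slide, assuming a toy cluster: the key to partition mapping is a fixed hash, and adding a node only reassigns ownership of a few partitions. The partition count, the MD5-based hash, and the names `partition_for`, `assign`, and `add_node` are all assumptions for illustration, not Voldemort's implementation.

```python
import hashlib

NUM_PARTITIONS = 12  # fixed for the life of the cluster; the notes suggest hundreds to thousands


def partition_for(key: bytes) -> int:
    """Key -> partition mapping. Never changes, whatever the cluster size."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def assign(nodes):
    """Initial ownership: spread partitions round-robin over the nodes."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}


def add_node(owners, new_node):
    """Hand the new node its fair share of partitions; all others stay put."""
    owners = dict(owners)
    old_nodes = sorted(set(owners.values()))
    share = NUM_PARTITIONS // (len(old_nodes) + 1)
    for _ in range(share):
        # Take one partition from whichever old node currently owns the most.
        counts = {n: 0 for n in old_nodes}
        for node in owners.values():
            if node in counts:
                counts[node] += 1
        donor = max(old_nodes, key=lambda n: counts[n])
        victim = max(p for p, n in owners.items() if n == donor)
        owners[victim] = new_node
    return owners
```

Contrast with hashing keys directly to `hash % number_of_nodes`: going from three nodes to four would remap most keys, whereas here only the partitions handed to the new node move.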
Fancy way of doing optimistic locking
Will discuss each layer in more detail.
Layered design:
- One interface for all layers: put/get/delete
- Each layer decorates the next
- Very flexible
- Easy to test
Client API: very basic API, just provides the raw interface to the user.
Conflict resolution layer: handles all the versioning issues and provides hooks to supply a custom conflict resolution strategy.
Serialization: object <=> data.
Network layer: depending on configuration, can fit either here or below. Main job is to handle the network interface, socket pooling, and other performance-related optimizations.
Routing layer: handles and hides many details from the client/user:
- hashing schema
- failed nodes
- replication
- required reads/required writes
Storage engine: handles disk persistence.
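Sketch for this slide: "one interface for all layers, each layer decorates the next" in miniature. The class names are hypothetical, JSON stands in for whatever serializer is configured, and only a serialization layer and a simple call-counting layer are shown; routing, networking, and conflict resolution would wrap the same put/get/delete interface in the same way.

```python
import json


class InMemoryStore:
    """Bottom layer: stands in for a storage engine (bytes in, bytes out)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        return self._data.pop(key, None) is not None


class CountingStore:
    """Middle layer: counts calls, then delegates to the layer below."""

    def __init__(self, inner):
        self._inner = inner
        self.calls = {"get": 0, "put": 0, "delete": 0}

    def get(self, key):
        self.calls["get"] += 1
        return self._inner.get(key)

    def put(self, key, value):
        self.calls["put"] += 1
        self._inner.put(key, value)

    def delete(self, key):
        self.calls["delete"] += 1
        return self._inner.delete(key)


class SerializingStore:
    """Top layer: object <=> bytes, so lower layers only ever see bytes."""

    def __init__(self, inner):
        self._inner = inner

    @staticmethod
    def _encode(obj):
        return json.dumps(obj, sort_keys=True).encode("utf-8")

    def get(self, key):
        raw = self._inner.get(self._encode(key))
        return None if raw is None else json.loads(raw.decode("utf-8"))

    def put(self, key, value):
        self._inner.put(self._encode(key), self._encode(value))

    def delete(self, key):
        return self._inner.delete(self._encode(key))
```

Because every layer exposes the same three methods, each one can be tested alone against `InMemoryStore`, which is the "easy to test" point on the slide.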
Very simple APIs. No range scans. No iterator on key set / entry set: very hard to fix performance. Have plans to provide such an iterator.
Give example of reads and writes with vector clocks. Pros and cons vs. Paxos and 2PC. User can supply a strategy for handling cases where v1 and v2 are not comparable.
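Sketch for the vector clock example, assuming a clock is a plain dict from node name to counter. The function names and the shape of the user-supplied strategy (a callable that receives the list of conflicting values) are assumptions for illustration.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter bumped (one write)."""
    bumped = dict(clock)
    bumped[node] = bumped.get(node, 0) + 1
    return bumped


def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for clock a relative to b."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"   # not comparable: neither version saw the other
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def resolve(versions, strategy):
    """Drop versions that another version supersedes; ask the strategy if several remain.

    versions is a list of (clock, value) pairs.
    """
    survivors = [
        value
        for clock, value in versions
        if not any(compare(clock, other) == "before" for other, _ in versions)
    ]
    return survivors[0] if len(survivors) == 1 else strategy(survivors)
```

Walkthrough for the slide: a write on node A gives `{"A": 1}`. A second write on A gives `{"A": 2}`, which supersedes the first. A write on B that only saw `{"A": 1}` gives `{"A": 1, "B": 1}`, which is concurrent with `{"A": 2}`, so the user's strategy decides.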
Avro is good, but new and unreleased.
Storing data is really different from an RPC IDL. Data is forever (some data).
Examples: inverted index, threaded comment tree.
Don’t hardcode serialization.
Don’t just do byte[]: checks are good, and many features depend on knowing what the data is.
XML profile.
Explain about partitions. Make things fast by removing slow things, not by tuning. HTTP client not performant. Separate caching layer.
Client vs. server: client is better, but harder.
You can write an implementation very easily. We support plugins, so you can run this backend without being in the main code base.