Big Data Loading:
Project Voldemort
Big Data Loading
●   So you've processed your data...
●   Now, how do you get it to people quickly?

●   Project Voldemort's Read-Only stores
    ●   Simple key-value store
    ●   Based on Amazon's Dynamo
    ●   Simple Java interface and operation
    ●   Immutable, read-only stores
Read Only Stores
●   Precompute in Hadoop or elsewhere
●   Creates an indexed key-value store
    ●   One reducer (or file) per node
    ●   Replicated data for failover


●   Atomically loads into nodes
    ●   Copy from HDFS or another HTTP source
    ●   Very fast, limited only by network or storage I/O
    ●   Can be throttled so live services are not affected
●   Can also roll back to previous versions
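The atomic load and rollback described above can be pictured with a small sketch. It is only an in-memory illustration of the idea (readers see one complete version, and the previous version is kept around), not how Voldemort implements it. The class name `SwappableStore` and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration: a read-only data set that is replaced in one step. */
final class SwappableStore {
    // The version currently served to readers.
    private final AtomicReference<Map<String, String>> current =
            new AtomicReference<>(Collections.<String, String>emptyMap());
    // Earlier versions, most recent first, kept so a load can be undone.
    private final Deque<Map<String, String>> previous = new ArrayDeque<>();

    /** Readers always see one complete version, never a half-loaded one. */
    String get(String key) {
        return current.get().get(key);
    }

    /** Publish a fully built version; the old one is retained for rollback. */
    synchronized void swap(Map<String, String> newVersion) {
        Map<String, String> frozen =
                Collections.unmodifiableMap(new HashMap<>(newVersion));
        previous.push(current.getAndSet(frozen));
    }

    /** Go back to the version served before the last swap, if there is one. */
    synchronized boolean rollback() {
        if (previous.isEmpty()) {
            return false;
        }
        current.set(previous.pop());
        return true;
    }

    public static void main(String[] args) {
        SwappableStore store = new SwappableStore();
        store.swap(Collections.singletonMap("alice", "v1"));
        store.swap(Collections.singletonMap("alice", "v2"));
        System.out.println(store.get("alice")); // prints "v2"
        store.rollback();
        System.out.println(store.get("alice")); // prints "v1"
    }
}
```

Because each version is immutable once published, a swap or rollback is a single reference change and readers need no locking.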
Example Hadoop Store Builder
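import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
// Assumption: JSONParser/JSONObject come from json-simple
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import voldemort.store.readonly.mr.AbstractHadoopStoreBuilderMapper;

public class JsonStoreBuilder
   extends AbstractHadoopStoreBuilderMapper<LongWritable, Text> {

    private final JSONParser parser = new JSONParser();

    // Key: the "name" field of the JSON record on this line
    @Override
    public Object makeKey(LongWritable lineNo, Text line) {
        try {
            JSONObject json = (JSONObject) parser.parse(line.toString());
            return json.get("name");
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid JSON: " + line, e);
        }
    }

    // Value: the whole JSON line, stored unchanged
    @Override
    public Object makeValue(LongWritable lineNo, Text line) {
        return line.toString();
    }
}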
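The mapper above only decides which key and value each input line contributes. As a rough picture of the "indexed key-value store" those pairs end up in, here is a minimal sketch of an index that is built once and then only read: key hashes are sorted at build time and looked up by binary search. The class name `SortedHashIndex`, the choice of MD5, and the in-memory arrays are assumptions for illustration. This is not Voldemort's on-disk format, and hash collisions are ignored.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of a build-once, read-only index. */
final class SortedHashIndex {
    private final long[] hashes;   // sorted key hashes (stands in for an index file)
    private final String[] values; // values in the same order (stands in for a data file)

    SortedHashIndex(Map<String, String> pairs) {
        // Sort entries by key hash, as a build job could do before writing files.
        TreeMap<Long, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            sorted.put(hash(e.getKey()), e.getValue());
        }
        hashes = new long[sorted.size()];
        values = new String[sorted.size()];
        int i = 0;
        for (Map.Entry<Long, String> e : sorted.entrySet()) {
            hashes[i] = e.getKey();
            values[i] = e.getValue();
            i++;
        }
    }

    /** Binary search over the sorted hashes; returns null if the key is absent. */
    String get(String key) {
        int pos = Arrays.binarySearch(hashes, hash(key));
        return pos >= 0 ? values[pos] : null;
    }

    /** First 8 bytes of the MD5 digest of the key, read as a signed long. */
    static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every JVM", e);
        }
    }
}
```

Since the store never changes after it is built, no locking or in-place update logic is needed; a lookup is just a search over sorted data.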
Example Hadoop Job
$VOLDEMORT_HOME/bin/hadoop-build-readonly-store.sh \
  --input hdfs/JsonFile.json \
  --output hdfs/StoreOut \
  --tmpdir hdfs/temp_dir \
  --mapper uk.co.danharvey.hadoop.JsonStoreBuilder \
  --jar hadoop-core.jar \
  --cluster config/cluster.xml \
  --storename example_store \
  --storedefinitions config/store.xml \
  --chunksize 1073741824 \
  --replication 1
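The job above uses a chunk size of 1073741824 bytes (1 GiB) and a replication factor of 1, so each key lives on exactly one node. To show what a higher replication factor means for failover, here is a deliberately simplified sketch that places each key on a primary node and copies it to the following nodes. The class `ReplicaPlacement`, the method `nodesFor`, and the hash-modulo placement are assumptions for illustration only; Voldemort's real routing is driven by the partitions defined in cluster.xml.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of replicated key placement across nodes. */
final class ReplicaPlacement {

    /**
     * Returns the ids of the nodes that hold a key: one primary chosen by
     * hash, plus (replication - 1) following nodes, wrapping around.
     */
    static List<Integer> nodesFor(String key, int numNodes, int replication) {
        if (numNodes < 1 || replication < 1 || replication > numNodes) {
            throw new IllegalArgumentException(
                    "need 1 <= replication <= numNodes");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        int primary = Math.floorMod(key.hashCode(), numNodes);
        List<Integer> nodes = new ArrayList<>();
        for (int i = 0; i < replication; i++) {
            nodes.add((primary + i) % numNodes);
        }
        return nodes;
    }

    public static void main(String[] args) {
        // With replication 1 a key has a single home; with 2 it survives one node loss.
        System.out.println(nodesFor("example_key", 3, 1).size()); // prints "1"
        System.out.println(nodesFor("example_key", 3, 2).size()); // prints "2"
    }
}
```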
Pig to JSON Index
●   Output JSON from Pig
        STORE bag INTO 'data.json' USING JsonStorage();


●   JsonStoreBuilder
    ●   Extends Voldemort's AbstractHadoopStoreBuilderMapper
    ●   Easily index any field


●   Code up here:
    http://github.com/danharvey/pigJsonUtils
