2. What is Avro?
• Avro is a serialization framework developed within Apache's Hadoop
project. It uses JSON for defining data types and protocols, and
serializes data in a compact binary format. Its primary use is in Apache
Hadoop, where it can provide both a serialization format for persistent
data and a wire format for communication between Hadoop nodes.
• Avro provides a good way to convert unstructured and semi-structured
data into structured form using schemas
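For example, a record schema describing a customer (the names here are illustrative) looks like:

```json
{
  "type" : "record",
  "name" : "Customer",
  "namespace" : "example.avro",
  "fields" : [
    { "name" : "FirstName", "type" : "string" },
    { "name" : "Account", "type" : "int", "default" : 0 }
  ]
}
```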
7. How to create Avro record? (cont. 2)
1. The first step in creating an Avro record is to define a JSON-based schema
2. Avro provides a parser that takes an Avro schema string and returns a schema object
3. Once the schema object is created, a builder lets us create
records with default values
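A minimal sketch of these three steps, assuming Avro 1.x on the classpath (the schema and field names are illustrative):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecordBuilder;

public class RecordExample {
    public static void main(String[] args) {
        // Step 1: a JSON-based schema (illustrative field names)
        String schemaJson = "{\"type\":\"record\",\"name\":\"Ex1\",\"fields\":["
                + "{\"name\":\"FirstName\",\"type\":\"string\",\"default\":\"\"},"
                + "{\"name\":\"Account\",\"type\":\"int\",\"default\":0}]}";
        // Step 2: the parser turns the schema string into a Schema object
        Schema schema = new Schema.Parser().parse(schemaJson);
        // Step 3: the builder fills any unset fields with their schema defaults
        GenericData.Record rec = new GenericRecordBuilder(schema)
                .set("FirstName", "Ada")
                .build();
        System.out.println(rec);
    }
}
```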
10. How to create Avro schema dynamically?
String[] fields = {"FirstName", "LastName", "Account"};
Schema s = Schema.createRecord("Ex2", "desc", "namespace", false);
List<Schema.Field> lstFields = new LinkedList<Schema.Field>();
for (String f : fields) {
    lstFields.add(new Schema.Field(f,
        Schema.create(Schema.Type.STRING),
        "doc",
        new TextNode(""))); // TextNode: org.codehaus.jackson.node (Avro 1.7.x API)
}
s.setFields(lstFields);
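If run, the loop above builds a schema roughly equivalent to this JSON (the "desc" and "doc" strings come from the createRecord and Field arguments):

```json
{
  "type" : "record",
  "name" : "Ex2",
  "namespace" : "namespace",
  "doc" : "desc",
  "fields" : [
    { "name" : "FirstName", "type" : "string", "doc" : "doc", "default" : "" },
    { "name" : "LastName", "type" : "string", "doc" : "doc", "default" : "" },
    { "name" : "Account", "type" : "string", "doc" : "doc", "default" : "" }
  ]
}
```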
12. How to sort Avro records?
You can also specify which fields you would like to order on and in which order:
Options: ascending, descending, ignore
{
"name" : "isActive",
"type" : "boolean",
"default" : true,
"order" : "ignore"
}, {
"name" : "Account",
"type" : "int",
"default" : 0,
"order" : "descending"
}
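A small sketch of how the "order" attribute affects comparisons, assuming the Avro 1.7.x API used elsewhere in this deck (GenericData.compare honors per-field order; the record name is illustrative):

```java
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.codehaus.jackson.node.IntNode;

public class OrderExample {
    public static void main(String[] args) {
        Schema s = Schema.createRecord("Ex3", "doc", "ns", false);
        s.setFields(Arrays.asList(new Schema.Field(
            "Account", Schema.create(Schema.Type.INT), "doc",
            IntNode.valueOf(0), Schema.Field.Order.DESCENDING)));
        GenericData.Record a = new GenericData.Record(s);
        a.put("Account", 1);
        GenericData.Record b = new GenericData.Record(s);
        b.put("Account", 2);
        // DESCENDING flips the natural int ordering: a (1) sorts after b (2)
        System.out.println(GenericData.get().compare(a, b, s) > 0);
    }
}
```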
14. How to write Avro records in a file?
File file = new File(“<file-name>");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
for (Record rec : list) {
dataFileWriter.append(rec);
}
dataFileWriter.close();
16. How to read Avro records from a file?
File file = new File(“<file-name>");
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader
= new DataFileReader<GenericRecord>(file, reader);
while (dataFileReader.hasNext()) {
Record r = (Record) dataFileReader.next();
System.out.println(r.toString());
}
17. Running MapReduce Jobs on Avro Data
1. Set the input schema on AvroJob based on the schema read from the input path
File file = new File(DATA_PATH);
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
new DataFileReader<GenericRecord>(file, reader);
Schema s = dataFileReader.getSchema();
AvroJob.setInputSchema(job, s);
19. Running MapReduce Jobs on Avro Data - Mapper
public static class MapImpl extends
        AvroMapper<GenericRecord, Pair<String, GenericRecord>> {
    public void map(GenericRecord datum,
                    AvroCollector<Pair<String, GenericRecord>> collector,
                    Reporter reporter)
            throws IOException {
        // …
    }
}
20. Running MapReduce Jobs on Avro Data - Reducer
public static class ReduceImpl extends
        AvroReducer<Utf8, GenericRecord, GenericRecord> {
    public void reduce(Utf8 key, Iterable<GenericRecord> values,
                       AvroCollector<GenericRecord> collector,
                       Reporter reporter) throws IOException {
        collector.collect(values.iterator().next());
    }
}
21. Running Avro MapReduce Jobs on Data with Different schema
List<Schema> schemas = new ArrayList<Schema>();
schemas.add(schema1);
schemas.add(schema2);
Schema schema3 = Schema.createUnion(schemas);
This allows data from different sources to be read and processed
in the same mapper
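A self-contained sketch of building such a union, assuming the Avro 1.7.x API used elsewhere in this deck (the record and field names are illustrative stand-ins for schema1 and schema2):

```java
import java.util.Arrays;
import org.apache.avro.Schema;

public class UnionExample {
    public static void main(String[] args) {
        // Two record schemas coming from different data sources
        Schema schema1 = Schema.createRecord("Customer", "doc", "ns", false);
        schema1.setFields(Arrays.asList(new Schema.Field(
            "FirstName", Schema.create(Schema.Type.STRING), "doc", null)));
        Schema schema2 = Schema.createRecord("Order", "doc", "ns", false);
        schema2.setFields(Arrays.asList(new Schema.Field(
            "Account", Schema.create(Schema.Type.INT), "doc", null)));
        // A union schema accepts records matching either branch
        Schema union = Schema.createUnion(Arrays.asList(schema1, schema2));
        System.out.println(union.getTypes().size());
    }
}
```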
22. Summary
• Avro is a great tool for semi-structured and structured data
• Simplifies MapReduce development
• Provides good compression mechanism
• Great tool for conversion from existing SQL code
• Questions?