2. What is Avro?
• Avro is a serialization framework developed within Apache's Hadoop
project. It uses JSON for defining data types and protocols, and
serializes data in a compact binary format. Its primary use is in Apache
Hadoop, where it can provide both a serialization format for persistent
data and a wire format for communication between Hadoop nodes.
• Avro provides a good way to convert unstructured and semi-structured
data into structured form using schemas
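For example, a record schema describing a customer (the names here are illustrative) looks like:

```json
{
  "type" : "record",
  "name" : "Customer",
  "namespace" : "example.avro",
  "fields" : [
    { "name" : "FirstName", "type" : "string" },
    { "name" : "Account", "type" : "int", "default" : 0 }
  ]
}
```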
7. How to create Avro record? (cont. 2)
1. The first step in creating an Avro record is to define a JSON-based schema
2. Avro provides a parser that takes an Avro schema string and returns a schema object
3. Once the schema object is created, a builder lets us create
records with default values
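A minimal sketch of these three steps, assuming Avro 1.x on the classpath (the schema and field names are illustrative):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecordBuilder;

public class RecordExample {
    public static void main(String[] args) {
        // Step 1: a JSON-based schema (illustrative field names)
        String schemaJson = "{\"type\":\"record\",\"name\":\"Ex1\",\"fields\":["
                + "{\"name\":\"FirstName\",\"type\":\"string\",\"default\":\"\"},"
                + "{\"name\":\"Account\",\"type\":\"int\",\"default\":0}]}";
        // Step 2: the parser turns the schema string into a Schema object
        Schema schema = new Schema.Parser().parse(schemaJson);
        // Step 3: the builder fills any unset fields with their schema defaults
        GenericData.Record rec = new GenericRecordBuilder(schema)
                .set("FirstName", "Ada")
                .build();
        System.out.println(rec);
    }
}
```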
10. How to create Avro schema dynamically?
String[] fields = {"FirstName", "LastName", "Account"};
Schema s = Schema.createRecord("Ex2", "desc", "namespace", false);
List<Schema.Field> lstFields = new LinkedList<Schema.Field>();
for (String f : fields) {
    lstFields.add(new Schema.Field(f,
        Schema.create(Schema.Type.STRING),
        "doc",
        new TextNode(""))); // TextNode: org.codehaus.jackson.node (Avro 1.7.x API)
}
s.setFields(lstFields);
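If run, the loop above builds a schema roughly equivalent to this JSON (the "desc" and "doc" strings come from the createRecord and Field arguments):

```json
{
  "type" : "record",
  "name" : "Ex2",
  "namespace" : "namespace",
  "doc" : "desc",
  "fields" : [
    { "name" : "FirstName", "type" : "string", "doc" : "doc", "default" : "" },
    { "name" : "LastName", "type" : "string", "doc" : "doc", "default" : "" },
    { "name" : "Account", "type" : "string", "doc" : "doc", "default" : "" }
  ]
}
```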
12. How to sort Avro records?
You can also specify which fields you would like to order on and in which order:
Options: ascending, descending, ignore
{
"name" : "isActive",
"type" : "boolean",
"default" : true,
"order" : "ignore"
}, {
"name" : "Account",
"type" : "int",
"default" : 0,
"order" : "descending"
}
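A small sketch of how the "order" attribute affects comparisons, assuming the Avro 1.7.x API used elsewhere in this deck (GenericData.compare honors per-field order; the record name is illustrative):

```java
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.codehaus.jackson.node.IntNode;

public class OrderExample {
    public static void main(String[] args) {
        Schema s = Schema.createRecord("Ex3", "doc", "ns", false);
        s.setFields(Arrays.asList(new Schema.Field(
            "Account", Schema.create(Schema.Type.INT), "doc",
            IntNode.valueOf(0), Schema.Field.Order.DESCENDING)));
        GenericData.Record a = new GenericData.Record(s);
        a.put("Account", 1);
        GenericData.Record b = new GenericData.Record(s);
        b.put("Account", 2);
        // DESCENDING flips the natural int ordering: a (1) sorts after b (2)
        System.out.println(GenericData.get().compare(a, b, s) > 0);
    }
}
```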
14. How to write Avro records in a file?
File file = new File(“<file-name>");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
for (Record rec : list) {
dataFileWriter.append(rec);
}
dataFileWriter.close();
16. How to read Avro records from a file?
File file = new File(“<file-name>");
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader
= new DataFileReader<GenericRecord>(file, reader);
while (dataFileReader.hasNext()) {
Record r = (Record) dataFileReader.next();
System.out.println(r.toString());
}
17. Running MapReduce Jobs on Avro Data
1. Set the input schema on AvroJob based on the schema read from the input path
File file = new File(DATA_PATH);
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
new DataFileReader<GenericRecord>(file, reader);
Schema s = dataFileReader.getSchema();
AvroJob.setInputSchema(job, s);
19. Running MapReduce Jobs on Avro Data - Mapper
public static class MapImpl extends
        AvroMapper<GenericRecord, Pair<String, GenericRecord>> {
    public void map(GenericRecord datum,
                    AvroCollector<Pair<String, GenericRecord>> collector,
                    Reporter reporter)
            throws IOException {
        // …
    }
}
20. Running MapReduce Jobs on Avro Data - Reducer
public static class ReduceImpl extends
        AvroReducer<Utf8, GenericRecord, GenericRecord> {
    public void reduce(Utf8 key, Iterable<GenericRecord> values,
                       AvroCollector<GenericRecord> collector,
                       Reporter reporter) throws IOException {
        collector.collect(values.iterator().next());
    }
}
21. Running Avro MapReduce Jobs on Data with Different schema
List<Schema> schemas = new ArrayList<Schema>();
schemas.add(schema1);
schemas.add(schema2);
Schema schema3 = Schema.createUnion(schemas);
This allows data from different sources to be read and processed
in the same mapper
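A self-contained sketch of building such a union, assuming the Avro 1.7.x API used elsewhere in this deck (the record and field names are illustrative stand-ins for schema1 and schema2):

```java
import java.util.Arrays;
import org.apache.avro.Schema;

public class UnionExample {
    public static void main(String[] args) {
        // Two record schemas coming from different data sources
        Schema schema1 = Schema.createRecord("Customer", "doc", "ns", false);
        schema1.setFields(Arrays.asList(new Schema.Field(
            "FirstName", Schema.create(Schema.Type.STRING), "doc", null)));
        Schema schema2 = Schema.createRecord("Order", "doc", "ns", false);
        schema2.setFields(Arrays.asList(new Schema.Field(
            "Account", Schema.create(Schema.Type.INT), "doc", null)));
        // A union schema accepts records matching either branch
        Schema union = Schema.createUnion(Arrays.asList(schema1, schema2));
        System.out.println(union.getTypes().size());
    }
}
```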
22. Summary
• Avro is a great tool for semi-structured and structured data
• Simplifies MapReduce development
• Provides good compression mechanism
• Great tool for conversion from existing SQL code
• Questions?