Introduction
Introduction to Avro and Integration with
Hadoop
What is Avro?
• Avro is a serialization framework developed within Apache's Hadoop
project. It uses JSON for defining data types and protocols, and
serializes data in a compact binary format. Its primary use is in Apache
Hadoop, where it provides both a serialization format for persistent
data and a wire format for communication between Hadoop nodes.
• Avro provides a good way to convert unstructured and semi-structured
data into structured data using schemas.
Creating your first Avro schema
Schema description:
{
  "name": "User",
  "type": "record",
  "fields": [
    {"name": "FirstName", "type": "string", "doc": "First Name"},
    {"name": "LastName", "type": "string"},
    {"name": "isActive", "type": "boolean", "default": true},
    {"name": "Account", "type": "int", "default": 0}
  ]
}
Avro schema features
1. Primitive types (null, boolean, int, long, float, double, bytes, string)
2. Records
{ "type": "record",
"name": "LongList",
[ {"name": "value", "type": "long"},
{"name": ”description", "type”:”string”}]
}
3. Others (Enums, Arrays, Maps, Unions, Fixed)
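A hedged sketch of a few of these complex types combined in one record schema (the field names here are illustrative, not from the examples above):
{ "type": "record",
  "name": "UserEx",
  "fields": [
    {"name": "role", "type": {"type": "enum", "name": "Role", "symbols": ["ADMIN", "USER"]}},
    {"name": "tags", "type": {"type": "array", "items": "string"}},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}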
How to create an Avro record?
String schemaDescription = " { \n"
  + " \"name\": \"User\", \n"
  + " \"type\": \"record\",\n"
  + " \"fields\": [\n"
  + " {\"name\": \"FirstName\", \"type\": \"string\", \"doc\": \"First Name\"},\n"
  + " {\"name\": \"LastName\", \"type\": \"string\"},\n"
  + " {\"name\": \"isActive\", \"type\": \"boolean\", \"default\": true},\n"
  + " {\"name\": \"Account\", \"type\": \"int\", \"default\": 0} ]\n"
  + "}";
Schema.Parser parser = new Schema.Parser();
Schema s = parser.parse(schemaDescription);
GenericRecordBuilder builder = new GenericRecordBuilder(s);
How to create an Avro record? (cont. 2)
1. The first step in creating an Avro record is to define a JSON-based schema
(or load one from a file, as in the sketch after this list)
2. Avro provides a parser that takes an Avro schema string and returns a Schema object
3. Once the Schema object is created, we create a GenericRecordBuilder that lets us build
records pre-populated with the schema's default values
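The schema does not have to live in a Java string. A minimal sketch, assuming the same schema has been saved to a file named user.avsc (a hypothetical path):
// Requires org.apache.avro.Schema and java.io.File; parse() throws IOException.
Schema s = new Schema.Parser().parse(new File("user.avsc"));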
How to create an Avro record? (cont. 3)
GenericRecord r = builder.build();
System.out.println("Record" + r);
r.put("FirstName", "Joe");
r.put("LastName", "Hadoop");
r.put("Account", 12345);
System.out.println("Record" + r);
System.out.println("FirstName:" + r.get("FirstName"));
{"FirstName": null, "LastName": null, "isActive": true, "Account": 0}
{"FirstName": "Joe", "LastName": "Hadoop", "isActive": true, "Account": 12345}
FirstName:Joe
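To make the "compact binary format" from the first slide concrete, a hedged sketch (assuming Avro 1.5+ for EncoderFactory) that serializes the record r against schema s without a file container:
// Requires org.apache.avro.io.*, org.apache.avro.generic.*, java.io.ByteArrayOutputStream.
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(s);
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
datumWriter.write(r, encoder);
encoder.flush();
byte[] bytes = out.toByteArray(); // field names are not stored, only values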
How to create an Avro schema dynamically?
String[] fields = {"FirstName", "LastName", "Account"};
Schema s = Schema.createRecord("Ex2", "desc", "namespace", false);
List<Schema.Field> lstFields = new LinkedList<Schema.Field>();
for (String f : fields) {
lstFields.add(new Schema.Field(f,
Schema.create(Schema.Type.STRING),
"doc",
new TextNode(""))); // default value as a Jackson JsonNode (older Avro API)
}
s.setFields(lstFields);
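Newer Avro releases (1.7.5+) also ship a fluent SchemaBuilder that achieves the same thing; a hedged sketch of the equivalent schema:
// Requires org.apache.avro.SchemaBuilder.
Schema s2 = SchemaBuilder.record("Ex2").namespace("namespace")
  .fields()
  .requiredString("FirstName")
  .requiredString("LastName")
  .requiredString("Account")
  .endRecord();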
How to sort Avro records?
You can also specify which fields to sort on and in which direction:
Options: ascending, descending, ignore
{
"name" : "isActive",
"type" : "boolean",
"default" : true,
"order" : "ignore"
}, {
"name" : "Account",
"type" : "int",
"default" : 0,
"order" : "descending"
}
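A minimal sketch of where these order attributes take effect, assuming two GenericRecords rec1 and rec2 that conform to the same schema:
// Requires org.apache.avro.generic.GenericData.
// Compares field by field in schema order; "ignore" fields are skipped,
// "descending" fields reverse the comparison result.
int c = GenericData.get().compare(rec1, rec2, schema);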
How to write Avro records to a file?
File file = new File("<file-name>");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
for (Record rec : list) {
dataFileWriter.append(rec);
}
dataFileWriter.close();
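The compression mentioned in the summary maps to the container file's optional codec. A hedged sketch, assuming the built-in deflate codec (this must be set before create()):
// Requires org.apache.avro.file.CodecFactory.
dataFileWriter.setCodec(CodecFactory.deflateCodec(6)); // compression level 1-9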
How to read Avro records from a file?
File file = new File(“<file-name>");
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader
= new DataFileReader<GenericRecord>(file, reader);
while (dataFileReader.hasNext()) {
Record r = (Record) dataFileReader.next();
System.out.println(r.toString());
}
Note that no schema is passed in: DataFileReader reads the writer's schema from the file header.
Running MapReduce Jobs on Avro Data
1. Set the input schema on AvroJob from the schema stored in the input file
File file = new File(DATA_PATH);
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
new DataFileReader<GenericRecord>(file, reader);
Schema s = dataFileReader.getSchema();
AvroJob.setInputSchema(job, s);
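For context, a hedged sketch of the rest of the job wiring with the old org.apache.avro.mapred API (the class and path names here are illustrative):
JobConf job = new JobConf(MyAvroJob.class);
FileInputFormat.setInputPaths(job, new Path("<input-dir>"));
FileOutputFormat.setOutputPath(job, new Path("<output-dir>"));
AvroJob.setInputSchema(job, s);
AvroJob.setMapOutputSchema(job,
  Pair.getPairSchema(Schema.create(Schema.Type.STRING), s));
AvroJob.setOutputSchema(job, s);
AvroJob.setMapperClass(job, MapImpl.class);
AvroJob.setReducerClass(job, ReduceImpl.class);
JobClient.runJob(job);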
Running MapReduce Jobs on Avro Data - Mapper
public static class MapImpl extends
AvroMapper<GenericRecord, Pair<String, GenericRecord>> {
public void map(GenericRecord datum,
AvroCollector<Pair<String, GenericRecord>> collector,
Reporter reporter)
throws IOException {
// ... transform datum and emit key/value pairs via collector.collect(...)
}
}
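As a hedged illustration of a mapper body, keying each record by its FirstName field (the field name is taken from the User schema earlier; the Pair convenience constructor derives the schemas):
public void map(GenericRecord datum,
AvroCollector<Pair<String, GenericRecord>> collector,
Reporter reporter) throws IOException {
// Key each record by FirstName and pass the record through unchanged.
String key = String.valueOf(datum.get("FirstName"));
collector.collect(new Pair<String, GenericRecord>(key, datum));
}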
Running MapReduce Jobs on Avro Data - Reducer
public static class ReduceImpl extends
AvroReducer<Utf8, GenericRecord, GenericRecord> {
public void reduce(Utf8 key, Iterable<GenericRecord> values,
AvroCollector<GenericRecord> collector,
Reporter reporter) throws IOException {
// Emit only the first record per key, i.e. de-duplicate by key.
collector.collect(values.iterator().next());
}
}
Running Avro MapReduce Jobs on Data with Different Schemas
List<Schema> schemas = new ArrayList<Schema>();
schemas.add(schema1);
schemas.add(schema2);
Schema schema3 = Schema.createUnion(schemas);
Setting this union as the input schema allows the job to read data from different
sources and process both of them in the same mapper, as sketched below.
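Inside the mapper, a hedged sketch of telling the two apart ("User" is just the schema name from the earlier example; substitute your own):
// Every GenericRecord carries its concrete schema.
if ("User".equals(datum.getSchema().getName())) {
// handle records written with schema1
} else {
// handle records written with schema2
}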
Summary
• Avro is a great tool to use for semi-structured and structured data
• Simplifies MapReduce development
• Provides a good compression mechanism
• A good target when converting data out of existing SQL systems
• Questions?