ApacheCon09: Avro

Doug Cutting presents Avro at the 2009 ApacheCon US.

  1. State of the Elephant
     Doug Cutting, Cloudera

  2. 2009 Hadoop Milestones
     ● A 2009 Sort Champion (O'Malley)
       ● won 100TB “Gray” sort @ .578TB/minute
       ● won “Minute” sort with 500GB
     ● Split Core Project in Three
       ● Common, HDFS & MapReduce
     ● Released 0.18.3, 0.19.[0-2], 0.20.[0-1]
     ● Many meetups, conferences, etc.
     ● Yada, yada, yada.

  3. Goals for 2010 and beyond
     ● IMHO, YMMV, IANAM, etc.
     ● Concrete
       ● 1.0 release
         – compatible APIs & RPCs for > 1 year
         – Kerberos-based authentication
     ● Abstract
       ● faster, more reliable, available
       ● easier sharing
         – of data & hardware resources
       ● spreadsheet-like interfaces
         – provide non-programmers
         – with powerful, interactive tools

  4. Abstract Requirements
     ● security
       ● facilitates sharing of resources
     ● stable cross-language APIs
       ● facilitate diverse tools & apps
     ● expressive, inter-operable data
       ● facilitates sharing of datasets
       ● facilitates dynamic analyses

  5. Data Formats
     ● today in Hadoop:
       ● text
         – pro: inter-operable
         – con: not expressive, inefficient
       ● Java Writable
         – pro: expressive, efficient
         – con: platform-specific, fragile

  6. Protocol Buffers & Thrift
     ● expressive
     ● efficient (small & fast)
     ● but not very dynamic
       ● cannot browse arbitrary data
       ● no DESCRIBE or SHOW
       ● viewing a new dataset
         – requires code generation & load
       ● writing a new dataset
         – requires generating schema text
         – plus code generation & load

  7. Avro Data
     ● as expressive
     ● smaller and faster
     ● dynamic
       ● schema stored with data
         – but factored out of instances
       ● APIs permit reading & creating
         – arbitrary datatypes
         – without generating & loading code

  8. Avro Data
     ● includes a file format
     ● includes a textual encoding
     ● handles versioning
       ● if schema changes
       ● can still process data
     ● hope Hadoop apps will
       ● upgrade from text; and
       ● standardize on Avro for data

  9. Avro RPC
     ● leverage versioning support
       ● to permit different versions of services to interoperate
     ● for Hadoop services, will
       ● provide cross-language access
       ● let apps talk to clusters running different versions

  10. Avro Status
     ● Java & Python APIs
     ● C & C++ APIs making rapid progress
     ● 1.1 release
       ● added JSON data and comparators
     ● 1.2 release
       ● added HTTP & UDP-based RPC
     ● included in Hadoop 0.21
       ● as format for job history
       ● in sequence files

  11. Avro Near Future
     ● full mapreduce support for Avro data
       ● enables fast comparators for non-Java apps
     ● Avro RPC used in Hadoop 0.22 (1.0)?
       ● provides compatibility; &
       ● native access from non-Java

  12. Thanks!
     hadoop.apache.org/avro
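
The "Avro Data" slides say the schema is stored with the data but factored out of the individual instances, and that arbitrary data can be read without generating and loading code. A minimal Python sketch of that idea follows. The zig-zag varint encoding of longs and the length-prefixed UTF-8 strings follow Avro's binary encoding; the container layout used by `write_container` and `read_container` (one JSON schema, a record count, then the records) is a hypothetical simplification for illustration, not Avro's actual file format, and only `long` and `string` fields are handled.

```python
import io
import json


def encode_long(n):
    """Zig-zag the value, then emit it as a base-128 varint (low 7 bits first)."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes map to small unsigned values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set means "more bytes follow"
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf):
    """Inverse of encode_long; reads from a binary stream."""
    shift, z = 0, 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def encode_string(s):
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw  # byte length, then the bytes


def decode_string(buf):
    return buf.read(decode_long(buf)).decode("utf-8")


ENCODERS = {"long": encode_long, "string": encode_string}
DECODERS = {"long": decode_long, "string": decode_string}


def encode_record(schema, record):
    """A record is just its field values in schema order: no names, no type tags."""
    return b"".join(ENCODERS[f["type"]](record[f["name"]]) for f in schema["fields"])


def decode_record(schema, buf):
    return {f["name"]: DECODERS[f["type"]](buf) for f in schema["fields"]}


def write_container(schema, records):
    """Hypothetical toy layout: schema once up front, then count, then records."""
    out = encode_string(json.dumps(schema)) + encode_long(len(records))
    return out + b"".join(encode_record(schema, r) for r in records)


def read_container(data):
    """Reads data it has never seen before, driven only by the embedded schema."""
    buf = io.BytesIO(data)
    schema = json.loads(decode_string(buf))
    count = decode_long(buf)
    return schema, [decode_record(schema, buf) for _ in range(count)]


if __name__ == "__main__":
    user_schema = {
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "long"},
        ],
    }
    blob = write_container(user_schema, [{"name": "doug", "age": 42}])
    schema, rows = read_container(blob)
    print([f["name"] for f in schema["fields"]], rows)
```

With this schema, the record `{"name": "doug", "age": 42}` encodes to six bytes, because field names and types live only in the schema. The reader recovers them from the header, which is what makes browsing an unfamiliar dataset possible without a code-generation step.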

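The versioning bullets ("if schema changes, can still process data") can be illustrated the same way. The sketch below assumes the resolution rules Avro applies between a writer's schema and a reader's schema: fields are matched by name, fields only the writer knows are dropped, and fields only the reader knows take the default declared in the reader's schema. The function names (`encode_record`, `resolve_record`) and the `User` schemas are hypothetical, only `long` and `string` fields are handled, and unlike a real implementation this one decodes dropped fields before discarding them.

```python
import io


def write_long(n):
    """Zig-zag varint, as in Avro's binary encoding of longs."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf):
    shift, z = 0, 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def write_string(s):
    raw = s.encode("utf-8")
    return write_long(len(raw)) + raw


def read_string(buf):
    return buf.read(read_long(buf)).decode("utf-8")


WRITERS = {"long": write_long, "string": write_string}
READERS = {"long": read_long, "string": read_string}

# Version 1 of a hypothetical record, used by whoever wrote the data.
V1 = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "long"},
]}

# Version 2, used by the reader: drops "age", adds "email" with a default.
V2 = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string", "default": ""},
]}


def encode_record(writer_schema, record):
    return b"".join(
        WRITERS[f["type"]](record[f["name"]]) for f in writer_schema["fields"]
    )


def resolve_record(writer_schema, reader_schema, data):
    """Decode bytes using the writer's schema, then shape them to the reader's."""
    buf = io.BytesIO(data)
    # The bytes carry no field names, so the writer's schema drives decoding.
    written = {f["name"]: READERS[f["type"]](buf) for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = written[name]      # present in both versions
        elif "default" in field:
            result[name] = field["default"]   # new field: use the reader's default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return result                             # writer-only fields are dropped


if __name__ == "__main__":
    old_bytes = encode_record(V1, {"name": "doug", "age": 42})
    print(resolve_record(V1, V2, old_bytes))
```

The reader needs both schemas: the writer's to parse the bytes, and its own to decide what shape the application sees. That is why storing the schema with the data matters for versioning.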