20080529dublinpt2

One in a series of presentations given at the IBM Cloud Computing Center in Dublin.

Published in: Technology

Transcript

  • 1. Working with Structured Data in Hadoop
    Jeff Hammerbacher, Manager, Data
    May 28 - 29, 2008
  • 2. Structured Data Management in Hadoop: State of the World
    ▪ HBase is a Hadoop subproject
      ▪ Powerset and Rapleaf are the main contributors
    ▪ Hypertable is Bigtable in C++
      ▪ Zvents is the main contributor
    ▪ Pig is an Apache Incubator project
      ▪ Yahoo! is the main contributor
    ▪ JAQL has been released as open source
      ▪ IBM is the main contributor
    ▪ Hive is not available publicly yet; hopefully under contrib/ soon
      ▪ Facebook is the main contributor
  • 3. Pig Philosophy
    ▪ Pigs Eat Anything
      ▪ Operate on data with or without metadata
      ▪ Operate on relational, nested, or unstructured data
    ▪ Pigs Live Anywhere
      ▪ The language is independent of the execution environment
    ▪ Pigs Are Domestic Animals
      ▪ Integrate user code wherever possible
      ▪ Allow control over code reorganization when optimizing
    ▪ Pigs Fly
  • 4. Pig Components
    ▪ Pig Latin
      ▪ Dataflow programming language; procedural, not declarative
      ▪ Algebraic: each step specifies only a single data transformation
      ▪ Parse, verify, and build a logical plan
    ▪ Evaluation Mechanisms
      ▪ Local evaluation in a single JVM
      ▪ Compilation to Hadoop MapReduce
    ▪ Grunt: interactive shell
    ▪ Pig Pen: debugging environment
  • 5. Pig Data Model
    ▪ Pig has four types of data items:
      ▪ Atom: string or number
      ▪ Tuple: “data record” consisting of an ordered sequence of “fields”
        ▪ Denoted with < > bracketing
      ▪ Bag: an unordered collection of tuples with possible duplicates and possibly inconsistent schemas
        ▪ Denoted with { } bracketing
      ▪ Map: an unordered collection of data items where each data item has an associated key; the key must be a string
        ▪ Denoted with [ ] bracketing
  • 6. Pig Data Model, continued
    ▪ Fields in a tuple may be named for easier access
    ▪ A “relation” is a Bag that has been assigned a name (“alias”)
    ▪ Example:
      ▪ Let t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, [‘apache’: ‘search’] >
      ▪ Give the fields of t the names “f1”, “f2”, and “f3”
      ▪ Give the fields of the tuples of the bag the names “g1”, “g2”, and “g3”
    ▪ We’ll look at Pig’s data access syntax on the next slide
  • 7. Pig Data Access
    ▪ t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, [‘apache’: ‘search’] >

    Method of Data Access    Example                  Value for t                    Applies to which Data Item
    Constant                 ‘1.0’ or ‘apache.org’    the constant itself            Atom
    Positional Reference     $0                       ‘1’                            Tuple
    Named Reference          f1                       ‘1’                            Tuple
    Projection               f2.$0                    { <2>, <4>, <5> }              Bag
    Multiple Projection      f2.(g1, g3)              { <2, 4>, <4, 8>, <5, 11> }    Bag
    Map Lookup               f3#’apache’              ‘search’                       Map
    Multiple Map Lookup (?)  ?                        ?                              Map
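The access methods in the table can all be exercised inside a single FOREACH. A hedged sketch, assuming a relation r whose tuples have the shape of t above (the alias r and the field names are illustrative, following the naming on the previous slide):

```pig
-- Assume r is a relation of tuples shaped like t, with fields f1, f2, f3
-- and the bag's inner fields named g1, g2, g3.
projected = FOREACH r GENERATE
    $0,              -- positional reference to the first field
    f1,              -- the same field, referenced by name
    f2.$0,           -- projection: a bag of one-field tuples
    f2.(g1, g3),     -- multiple projection: a bag of two-field tuples
    f3#'apache';     -- map lookup by string key
```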
  • 8. Pig Questions
    ▪ How does a tuple with named fields differ from a map?
    ▪ How does a tuple of tuples differ from a bag?
    ▪ When do you ever use a map?
    ▪ For further information, see Pig’s documentation and mailing lists:
      ▪ Web site: incubator.apache.org/pig
      ▪ Wiki: http://wiki.apache.org/pig
      ▪ Paper: http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf
      ▪ Language reference: http://wiki.apache.org/pig/PigLatin
  • 9. Pig Statements
    ▪ A Pig Latin statement is a command that produces a relation
    ▪ Pig commands can take zero, one, or more relations as input
    ▪ Pig commands can span multiple lines and must include “;” at the end
    ▪ To play with Pig syntax, you can use the Grunt shell or the StandAloneParser
  • 10. Pig Example Data
    ▪ Let ‘a.txt’ be a tab-delimited file with values:
      1	2	3
      4	2	1
      8	3	4
      4	3	3
      7	2	5
      8	4	3
  • 11. Pig Example Data
    ▪ Let ‘b.txt’ be a tab-delimited file with values:
      2	4
      8	9
      1	3
      2	7
      2	9
      4	6
      4	9
  • 12. Pig Statements: LOAD and STORE
    ▪ LOAD <filename> [USING <function>] [AS <schema>]
    ▪ Example:
      grunt> a = LOAD ‘a.txt’ USING PigStorage(‘\t’) AS (f1, f2, f3);
    ▪ Now a is a relation with six tuples which share a common schema:
      a = { <1, 2, 3>, <4, 2, 1>, <8, 3, 4>, <4, 3, 3>, <7, 2, 5>, <8, 4, 3> }
      ▪ All the tuples have field names “f1”, “f2”, and “f3”
    ▪ PigStorage() can be any deserialization function
    ▪ STORE <relation> INTO <filename> [USING <function>] does the reverse
      ▪ PigStorage() can’t handle nested relations; use BinStorage() instead
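Putting LOAD and STORE together, a round trip through Pig might look like the sketch below (the output path a_copy is illustrative):

```pig
-- Load the tab-delimited file from slide 10, naming the three fields
a = LOAD 'a.txt' USING PigStorage('\t') AS (f1, f2, f3);

-- Write the relation back out, tab-delimited, to an illustrative path
STORE a INTO 'a_copy' USING PigStorage('\t');
```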
  • 13. Pig Statements: FILTER
    ▪ FILTER <relation> BY <condition>
    ▪ Example:
      grunt> x = FILTER a BY f1 == ‘8’ OR f3 > 4;
    ▪ The relation x has three tuples which again share the schema (f1, f2, f3):
      x = { <8, 3, 4>, <8, 4, 3>, <7, 2, 5> }
    ▪ In addition to standard numerical comparisons, you can also do string comparisons and even regular expression matching
    ▪ You can also use your own comparison function
  • 14. Pig Statements: GROUP
    ▪ GROUP <relation> BY [<fields> | ALL | ANY]
    ▪ Only makes sense if tuples in the relation have partially shared schemas
    ▪ Example:
      grunt> y = GROUP x BY f1;
    ▪ The relation y has two tuples which share the schema (group, x):
      y = { < 7, { <7, 2, 5> } >, < 8, { <8, 3, 4>, <8, 4, 3> } > }
    ▪ Using ANY will return a single tuple that collects all the tuples into a single bag
    ▪ Note that GROUP is just syntactic sugar for COGROUP on a single relation
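A common use of collapsing an entire relation into a single bag is computing relation-wide aggregates. A sketch over the relation a from slide 12, using the ALL form of the grammar above (aliases are illustrative):

```pig
-- Collapse all of a into one group, then count the tuples in its bag
all_rows  = GROUP a ALL;
row_count = FOREACH all_rows GENERATE COUNT(a);  -- a has six tuples
```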
  • 15. Pig Statements: COGROUP
    ▪ COGROUP <relation> BY <fields> [INNER][, <relation> BY <fields> [INNER]];
    ▪ Example:
      grunt> z = COGROUP x BY f3 INNER, b BY $0 INNER;
    ▪ The relation z has a single tuple with the schema (group, x, b):
      z = { < 4, { <8, 3, 4> }, { <4, 6>, <4, 9> } > }
    ▪ Note that we could have used multiple fields with BY
    ▪ The INNER keyword on either relation will toss out the group records whose bag for that relation is empty
  • 16. Pig Statements: FOREACH ... GENERATE
    ▪ FOREACH <relation> GENERATE <data item>, <data item>, ...;
    ▪ Example:
      grunt> w = FOREACH x GENERATE f1, f3;
      ▪ Equivalent to the projection x.(f1, f3)
    ▪ The relation w has three tuples which share the schema (f1, f3):
      w = { <8, 4>, <8, 3>, <7, 5> }
    ▪ Can also have “nested projections”:
      grunt> u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
      ▪ u = { <7, 5>, <8, 7> }, where tuples have the schema (group, thirdcolsum)
  • 17. Pig More Keywords and Statements
    ▪ FLATTEN
    ▪ JOIN
    ▪ ORDER
    ▪ DISTINCT
    ▪ CROSS
    ▪ UNION
    ▪ SPLIT
    ▪ Write your own functions: http://wiki.apache.org/pig/PigFunctions
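Hedged sketches of a few of these keywords over the relations a and b loaded from the example files (b is assumed loaded without a schema, so its first field is $0; exact syntax may differ slightly across early Pig versions):

```pig
j = JOIN a BY f3, b BY $0;    -- equality join of a's third field with b's first
o = ORDER a BY f1;            -- sort a by its first field
p = FOREACH a GENERATE f2;
d = DISTINCT p;               -- unique values of a's second field
c = CROSS a, b;               -- cartesian product of the two relations
SPLIT a INTO low IF f1 < 5, high IF f1 >= 5;  -- partition a by a predicate
```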
  • 18. Pig Physical Execution via Hadoop MapReduce
    ▪ How is a logical Pig plan executed via Hadoop?
    ▪ Details in the SIGMOD paper
    ▪ Essentially, each (CO)GROUP results in a new map and reduce function
    ▪ Similar to Teradata, intermediate data is materialized in the DFS
    ▪ For Pig commands that take multiple relations as input, an additional field is inserted into each tuple to indicate which relation it came from
  • 19. Pig Grunt Shell
    ▪ Allows you to maintain a working session
    ▪ You can interact with the DFS as well as your Pig logical objects
    ▪ The DUMP command will let you see the objects you are working with
    ▪ The ILLUSTRATE command provides for simple debugging
    ▪ For more, check out http://wiki.apache.org/pig/Grunt
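A minimal Grunt session illustrating the commands above (output omitted; the ls line assumes Grunt's DFS commands, which the slide's DFS-interaction point refers to):

```pig
grunt> a = LOAD 'a.txt' USING PigStorage('\t') AS (f1, f2, f3);
grunt> DUMP a;        -- materializes and prints the relation
grunt> ILLUSTRATE a;  -- shows a small example dataset flowing through the plan
grunt> ls             -- Grunt also exposes DFS commands such as ls and cat
```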
  • 20. Pig Pen
    ▪ Run a sequence of Pig commands over a representative sample of data
    ▪ Difficult to generate a representative sample when using highly selective FILTER or COGROUP statements
    ▪ The algorithm runs multiple sampling passes over the data and generates representative data if necessary
    ▪ Allows for incremental construction of complex Pig commands
  • 21. Pig Pen
  • 22. Pig What’s Missing?
    ▪ Metadata repository
      ▪ Browse schemas for persistent data
    ▪ Library of serialization and deserialization functions
    ▪ Optimized logical and physical organization of data
    ▪ SQL interface
    ▪ UDFs in any language
    ▪ Execution dataflows other than MapReduce
      ▪ Hash joins, aggregate operators that don’t require a sort, etc.
    ▪ Query optimization
  • 23. (c) 2008 Facebook, Inc. or its licensors. “Facebook” is a registered trademark of Facebook, Inc. All rights reserved. 1.0
