20080529dublinpt2

    20080529dublinpt2 - Presentation Transcript

    1. Working with Structured Data in Hadoop. Jeff Hammerbacher, Manager, Data. May 28 - 29, 2008
    2. Structured Data Management in Hadoop: State of the World
       ▪ HBase is a Hadoop subproject
         ▪ Powerset and Rapleaf are the main contributors
       ▪ Hypertable is Bigtable in C++
         ▪ Zvents is the main contributor
       ▪ Pig is an Apache Incubator project
         ▪ Yahoo! is the main contributor
       ▪ JAQL has been released as open source
         ▪ IBM is the main contributor
       ▪ Hive is not yet publicly available; hopefully under contrib/ soon
         ▪ Facebook is the main contributor
    3. Pig Philosophy
       ▪ Pigs Eat Anything
         ▪ Operate on data with or without metadata
         ▪ Operate on relational, nested, or unstructured data
       ▪ Pigs Live Anywhere
         ▪ The language is independent of execution environment
       ▪ Pigs are Domestic Animals
         ▪ Integrate user code wherever possible
         ▪ Allow control over code reorganization when optimizing
       ▪ Pigs Fly
    4. Pig Components
       ▪ Pig Latin
         ▪ Dataflow programming language; procedural, not declarative
         ▪ Algebraic: each step specifies only a single data transformation
         ▪ Parse, verify, and build a logical plan
       ▪ Evaluation Mechanisms
         ▪ Local evaluation in single JVM
         ▪ Compilation to Hadoop MapReduce
       ▪ Grunt: interactive shell
       ▪ Pig Pen: debugging environment
    5. Pig Data Model
       ▪ Pig has four types of data items:
         ▪ Atom: string or number
         ▪ Tuple: “data record” consisting of an ordered sequence of “fields”
           ▪ Denoted with < > bracketing
         ▪ Bag: an unordered collection of tuples with possible duplicates and possibly inconsistent schemas
           ▪ Denoted with { } bracketing
         ▪ Map: an unordered collection of data items where each data item has an associated key; the key must be a string
           ▪ Denoted with [ ] bracketing
    6. Pig Data Model, continued
       ▪ Fields in a tuple may be named for easier access
       ▪ A “relation” is a Bag that has been assigned a name (“alias”)
       ▪ Example:
         ▪ Let t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, ['apache': 'search'] >
         ▪ Give the fields of t the names “f1”, “f2”, and “f3”
         ▪ Give the fields of the tuples of the bag the names “g1”, “g2”, and “g3”
       ▪ We’ll look at Pig’s data access syntax on the next page
    7. Pig Data Access
       ▪ t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, ['apache': 'search'] >

       Method of Data Access   Example                 Value for t                    Applies to which Data Item
       Constant                '1.0' or 'apache.org'   Constant                       Atom
       Positional Reference    $0                      '1'                            Tuple
       Named Reference         f1                      '1'                            Tuple
       Projection              f2.$0                   { <2>, <4>, <5> }              Bag
       Multiple Projection     f2.(g1, g3)             { <2, 4>, <4, 8>, <5, 11> }    Bag
       Map Lookup              f3#'apache'             'search'                       Map
       Multiple Map Lookup     (?)                     ?                              Map
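
    The access methods above can be imitated in plain Python as a rough analogy. This is only a sketch, not how Pig represents data internally: it assumes an atom maps to a Python number or string, a tuple to a tuple, a bag to a list of tuples, and a map to a dict. The lookup tables T_FIELDS and BAG_FIELDS and the helper project() are hypothetical names introduced for illustration.

```python
# Rough Python stand-ins for Pig data items (an analogy, not Pig's implementation):
# atom -> number/str, tuple -> tuple, bag -> list of tuples, map -> dict with str keys.
t = (1, [(2, 3, 4), (4, 6, 8), (5, 7, 11)], {"apache": "search"})

# Field names from the slides mapped to positions (hypothetical lookup tables).
T_FIELDS = {"f1": 0, "f2": 1, "f3": 2}
BAG_FIELDS = {"g1": 0, "g2": 1, "g3": 2}


def project(bag, positions):
    """Keep only the given positions of every tuple in a bag."""
    return [tuple(row[p] for p in positions) for row in bag]


positional = t[0]                                   # $0
named = t[T_FIELDS["f1"]]                           # f1
projection = project(t[T_FIELDS["f2"]], [0])        # f2.$0
multi = project(t[T_FIELDS["f2"]],
                [BAG_FIELDS["g1"], BAG_FIELDS["g3"]])  # f2.(g1, g3)
lookup = t[T_FIELDS["f3"]]["apache"]                # f3#'apache'
```

    Note that projecting a bag yields another bag of (shorter) tuples, which is why f2.$0 gives one-field tuples rather than bare atoms.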
    8. Pig: Questions
       ▪ How does a tuple with named fields differ from a map?
       ▪ How does a tuple of tuples differ from a bag?
       ▪ When do you ever use a map?
       ▪ For further information, see Pig’s documentation and mailing lists:
         ▪ Web site: incubator.apache.org/pig
         ▪ Wiki: http://wiki.apache.org/pig
         ▪ Paper: http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf
         ▪ Language reference: http://wiki.apache.org/pig/PigLatin
    9. Pig Statements
       ▪ A Pig Latin statement is a command that produces a relation
       ▪ Pig commands can take zero, one, or more relations as input
       ▪ Pig commands can span multiple lines and must end with “;”
       ▪ To play with Pig syntax, you can use the grunt shell or the StandAloneParser
    10. Pig Example Data
       ▪ Let 'a.txt' be a tab-delimited file with values:
         ▪ 1  2  3
         ▪ 4  2  1
         ▪ 8  3  4
         ▪ 4  3  3
         ▪ 7  2  5
         ▪ 8  4  3
    11. Pig Example Data
       ▪ Let 'b.txt' be a tab-delimited file with values:
         ▪ 2  4
         ▪ 8  9
         ▪ 1  3
         ▪ 2  7
         ▪ 2  9
         ▪ 4  6
         ▪ 4  9
    12. Pig Statements: LOAD and STORE
       ▪ LOAD <filename> [USING <function>] [AS <schema>]
       ▪ Example:
         ▪ grunt> a = LOAD 'a.txt' USING PigStorage('\t') AS (f1, f2, f3);
       ▪ Now a is a relation with six tuples which share a common schema:
         ▪ a = { <1, 2, 3>, <4, 2, 1>, <8, 3, 4>, <4, 3, 3>, <7, 2, 5>, <8, 4, 3> }
         ▪ All the tuples have field names “f1”, “f2”, and “f3”
       ▪ PigStorage() can be replaced by any deserialization function
       ▪ STORE <relation> INTO <filename> [USING <function>] does the reverse
       ▪ PigStorage() can’t handle nested relations; use BinStorage() instead
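
    To make the LOAD/STORE round trip concrete, here is a minimal Python sketch of what a delimiter-based deserializer and serializer might do. The functions load() and store() are hypothetical stand-ins, not Pig APIs, and converting every field to int is an assumption that happens to fit the example file.

```python
import csv
import io

# Contents of a.txt from the example-data slide, tab-delimited.
A_TXT = "1\t2\t3\n4\t2\t1\n8\t3\t4\n4\t3\t3\n7\t2\t5\n8\t4\t3\n"


def load(text, schema, delimiter="\t"):
    """Hypothetical stand-in for LOAD ... USING PigStorage(delimiter) AS schema."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Assumption: all fields are integers, as in the example data.
    rows = [tuple(int(value) for value in record) for record in reader if record]
    return schema, rows


def store(rows, delimiter="\t"):
    """Hypothetical stand-in for STORE: write flat tuples back as delimited text."""
    out = io.StringIO()
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return out.getvalue()


schema, a = load(A_TXT, ("f1", "f2", "f3"))
round_trip = store(a)
```

    Like the delimiter-based storage described on the slide, this store() only makes sense for flat tuples; nested bags would need a richer serialization.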
    13. Pig Statements: FILTER
       ▪ FILTER <relation> BY <condition>
       ▪ Example:
         ▪ grunt> x = FILTER a BY f1 == '8' OR f3 > 4;
       ▪ The relation x has three tuples which again share the schema (f1, f2, f3):
         ▪ x = { <8, 3, 4>, <8, 4, 3>, <7, 2, 5> }
       ▪ In addition to standard numerical comparisons, you can also do string comparisons and even regular expression matching
       ▪ You can also use your own comparison function
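
    The FILTER example can be mirrored in Python as a sketch of the semantics only. pig_filter() is a hypothetical helper; the condition is an ordinary Python function, which also covers the "your own comparison function" case. The regular-expression condition is an illustrative addition that is not on the slide.

```python
import re

# Relation a from the LOAD slide; F1..F3 are positional indices for f1..f3.
a = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
F1, F2, F3 = 0, 1, 2


def pig_filter(relation, condition):
    """Keep the tuples for which condition(tuple) is true."""
    return [row for row in relation if condition(row)]


# x = FILTER a BY f1 == '8' OR f3 > 4;
x = pig_filter(a, lambda row: row[F1] == 8 or row[F3] > 4)

# Illustrative regular-expression condition: f1 is the single digit 7 or 8.
matched = pig_filter(a, lambda row: re.fullmatch(r"[78]", str(row[F1])) is not None)
```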
    14. Pig Statements: GROUP
       ▪ GROUP <relation> BY [<fields> | ALL | ANY]
       ▪ Only makes sense if tuples in relation have partially shared schemas
       ▪ Example:
         ▪ grunt> y = GROUP x BY f1;
       ▪ The relation y has two tuples which share the schema (group, x):
         ▪ y = { < 7, { < 7, 2, 5 > } >, < 8, { < 8, 3, 4 >, < 8, 4, 3 > } > }
       ▪ Using ALL will return a single tuple with all tuples collected into a single bag
       ▪ Note that GROUP is just syntactic sugar for COGROUP for a single relation
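
    A short Python sketch of the grouping semantics, under the same tuple/list analogy. group_by() is a hypothetical helper, and the constant key "all" used to collect every tuple into one bag is just a placeholder chosen for this example.

```python
from collections import defaultdict

# Relation x from the FILTER slide.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]


def group_by(relation, key):
    """Return a list of (group, bag) pairs, one per distinct key value."""
    groups = defaultdict(list)
    for row in relation:
        groups[key(row)].append(row)
    return sorted(groups.items())


# y = GROUP x BY f1;
y = group_by(x, key=lambda row: row[0])

# Grouping on a constant puts every tuple into a single bag.
everything = group_by(x, key=lambda row: "all")
```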
    15. Pig Statements: COGROUP
       ▪ COGROUP <relation> BY <fields> [INNER][, <relation> BY <fields> [INNER]];
       ▪ Example:
         ▪ grunt> z = COGROUP x BY f3 INNER, b BY $0 INNER;
       ▪ The relation z has one tuple with three fields and the schema (group, x, b):
         ▪ z = { < 4, { < 8, 3, 4 > }, { < 4, 6 >, < 4, 9 > } > }
       ▪ Note that we could have used multiple fields with BY
       ▪ The INNER keyword on either relation will toss out the group records for which that relation’s bag is empty
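
    The effect of INNER is easier to see when run on the example data. The following Python sketch models the semantics only; cogroup() and its left_inner/right_inner flags are hypothetical names, not Pig's implementation.

```python
# Relations x (FILTER slide) and b (b.txt) from the earlier examples.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]


def cogroup(left, left_key, right, right_key, left_inner=False, right_inner=False):
    """Return (group, left_bag, right_bag) tuples, one per distinct key value."""
    keys = {left_key(r) for r in left} | {right_key(r) for r in right}
    result = []
    for k in sorted(keys):
        left_bag = [r for r in left if left_key(r) == k]
        right_bag = [r for r in right if right_key(r) == k]
        # INNER on a relation drops groups where that relation's bag is empty.
        if (left_inner and not left_bag) or (right_inner and not right_bag):
            continue
        result.append((k, left_bag, right_bag))
    return result


# z = COGROUP x BY f3 INNER, b BY $0 INNER;
z = cogroup(x, lambda r: r[2], b, lambda r: r[0], left_inner=True, right_inner=True)

# Without INNER, every key seen in either relation produces a group.
z_outer = cogroup(x, lambda r: r[2], b, lambda r: r[0])
```

    With both INNER flags set, only the key 4 survives because it is the only value present in both x.f3 and the first field of b; without them, keys with an empty bag on one side are kept.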
    16. Pig Statements: FOREACH ... GENERATE
       ▪ FOREACH <relation> GENERATE <data item>, <data item>, ...;
       ▪ Example:
         ▪ w = FOREACH x GENERATE f1, f3;
         ▪ Equivalent to the projection x.(f1, f3)
       ▪ The relation w has three tuples which share the schema (f1, f3):
         ▪ w = { <8, 4>, <8, 3>, <7, 5> }
       ▪ Can also have “nested projections”:
         ▪ u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
         ▪ u = { <7, 5>, <8, 7> }, where tuples have the schema (group, thirdcolsum)
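
    Both FOREACH examples amount to applying one function to every tuple. A minimal Python sketch, with foreach() as a hypothetical helper and the built-in sum() standing in for Pig's SUM:

```python
# Relation x (FILTER slide) and its grouping y (GROUP slide).
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
y = [(7, [(7, 2, 5)]), (8, [(8, 3, 4), (8, 4, 3)])]


def foreach(relation, generate):
    """Apply generate() to each tuple and collect the resulting tuples."""
    return [generate(row) for row in relation]


# w = FOREACH x GENERATE f1, f3;
w = foreach(x, lambda row: (row[0], row[2]))

# u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
u = foreach(y, lambda row: (row[0], sum(t[2] for t in row[1])))
```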
    17. Pig: More Keywords and Statements
       ▪ FLATTEN
       ▪ JOIN
       ▪ ORDER
       ▪ DISTINCT
       ▪ CROSS
       ▪ UNION
       ▪ SPLIT
       ▪ Write your own functions: http://wiki.apache.org/pig/PigFunctions
    18. Pig: Physical Execution via Hadoop MapReduce
       ▪ How is a logical Pig plan executed via Hadoop?
       ▪ Details in SIGMOD paper
       ▪ Essentially each (CO)GROUP results in a new map and reduce function
       ▪ Similar to Teradata, intermediate data is materialized in the DFS
       ▪ For Pig commands that take multiple relations as input, an additional field is inserted into each tuple to indicate which relation it came from
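
    The tagging idea in the last bullet can be sketched as a toy, single-process map/shuffle/reduce in Python. This is an illustration of the general technique under simplifying assumptions, not Pig's actual compiled plan: map_phase() and reduce_phase() are hypothetical names, and an in-memory sort stands in for Hadoop's shuffle.

```python
from itertools import groupby
from operator import itemgetter

# Relations x and b from the COGROUP example.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]


def map_phase(relation, tag, key_index):
    """Emit (key, (tag, tuple)); the tag records which relation the tuple came from."""
    for row in relation:
        yield row[key_index], (tag, row)


def reduce_phase(key, tagged_rows, n_relations):
    """Use the tag to route each tuple into the bag of its source relation."""
    bags = [[] for _ in range(n_relations)]
    for tag, row in tagged_rows:
        bags[tag].append(row)
    return (key, *bags)


# Map both inputs: x keyed on f3 (index 2), b keyed on $0 (index 0).
emitted = list(map_phase(x, 0, 2)) + list(map_phase(b, 1, 0))
emitted.sort(key=itemgetter(0))  # stand-in for the shuffle/sort between map and reduce

result = [
    reduce_phase(key, [value for _, value in group], 2)
    for key, group in groupby(emitted, key=itemgetter(0))
]
```

    Each reduce call sees all tuples for one key from both inputs, so the tag is the only way to tell the relations apart when rebuilding the per-relation bags.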
    19. Pig: Grunt Shell
       ▪ Allows you to maintain a working session
       ▪ You can interact with the DFS as well as your Pig logical objects
       ▪ DUMP command will let you see the objects you are working with
       ▪ ILLUSTRATE command provides for simple debugging
       ▪ For more, check out http://wiki.apache.org/pig/Grunt
    20. Pig: Pig Pen
       ▪ Run sequence of Pig commands over a representative sample of data
       ▪ Difficult to generate a representative sample when using highly selective FILTER or COGROUP statements
       ▪ Algorithm runs multiple sampling passes over the data and generates representative data if necessary
       ▪ Allows for incremental construction of complex Pig commands
    21. Pig Pen
    22. Pig: What’s Missing?
       ▪ Metadata repository
       ▪ Browse schemas for persistent data
       ▪ Library of serialization and deserialization functions
       ▪ Optimized logical and physical organization of data
       ▪ SQL interface
       ▪ UDF in any language
       ▪ Execution dataflows other than MapReduce
       ▪ Hash joins, aggregate operators that don’t require a sort, etc.
       ▪ Query optimization
    23. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

    jhammerb, 2 years ago

    One in a series of presentations given at the IBM C more

    © All Rights Reserved