1. Pig 0.11 - New Features
Daniel Dai
Member of Technical Staff
Committer, VP of Apache Pig
© Hortonworks Inc. 2011 Page 1
2. Pig 0.11 release plan
• Branched on Oct 12, 2012
• Will be released within weeks
– Fix tests: PIG-2972
– Documentation: PIG-2756
– Several last minute fixes
Architecting the Future of Big Data
Page 2
© Hortonworks Inc. 2011
3. New features
• CUBE operator
• Rank operator
• Groovy UDFs
• New data type: DateTime
• SchemaTuple optimization
• Works with JDK 7
• Works with Windows?
4. New features
• Faster local mode
• Better stats/notification
– Ambrose
• Default scripts: pigrc
• Integrate HCat DDL
• Grunt enhancement: history/clear
• UDF enhancement
– New/enhanced UDFs
– AvroStorage enhancement
5. CUBE operator
rawdata = load 'input' as (ptype, pstore, number);
cubed = cube rawdata by rollup(ptype, pstore);
result = foreach cubed generate flatten(group), SUM(cube.number);
dump result;
Input:

Ptype   Pstore  number
Dog     Miami   12
Cat     Miami   18
Turtle  Tampa   4
Dog     Tampa   14
Dog     Naples  5
Cat     Naples  9
Turtle  Naples  1

Result:

Ptype   Pstore  Sum
Cat     Miami   18
Cat     Naples  9
Cat             27
Dog     Miami   12
Dog     Tampa   14
Dog     Naples  5
Dog             31
Turtle  Tampa   4
Turtle  Naples  1
Turtle          5
                63
6. CUBE operator
• Syntax
outalias = CUBE inalias BY { CUBE expression | ROLLUP expression } [, CUBE expression | ROLLUP expression ] [PARALLEL n];
• Umbrella Jira PIG-2167
• Non-distributed version will be in 0.11 (PIG-2765)
• Distributed version still in progress (PIG-2831)
– Push algebraic computation to map/combiner
– Reference: "Distributed cube materialization on holistic measures", Arnab Nandi et al., ICDE 2011
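The ROLLUP semantics can be sketched outside Pig. The following minimal Python illustration (a hypothetical helper, not Pig internals) materializes the same three group levels ((ptype, pstore), (ptype), and the grand total) and reproduces the sums from the slide-5 example:

```python
from collections import defaultdict

def rollup_sum(rows):
    """SUM(number) over the ROLLUP groups of (ptype, pstore):
    (ptype, pstore), (ptype, None), and the grand total (None, None)."""
    sums = defaultdict(int)
    for ptype, pstore, number in rows:
        for key in ((ptype, pstore), (ptype, None), (None, None)):
            sums[key] += number
    return dict(sums)

data = [
    ("Dog", "Miami", 12), ("Cat", "Miami", 18), ("Turtle", "Tampa", 4),
    ("Dog", "Tampa", 14), ("Dog", "Naples", 5), ("Cat", "Naples", 9),
    ("Turtle", "Naples", 1),
]
sums = rollup_sum(data)
```

A full CUBE would differ only in emitting every subset of the dimensions rather than just the prefixes.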
7. Rank operator
rawdata = load 'input' as (name, gpa:double);
ranked = rank rawdata by gpa;
dump ranked;
Input:

Name   Gpa
Katie  3.5
Fred   4.0
Holly  3.7
Luke   3.5
Nick   3.7

Output:

Rank  Name   Gpa
1     Katie  3.5
5     Fred   4.0
2     Holly  3.7
1     Luke   3.5
2     Nick   3.7
8. Rank operator
rawdata = load 'input' as (name, gpa:double);
ranked = rank rawdata by gpa desc dense;
dump ranked;
Input:

Name   Gpa
Katie  3.5
Fred   4.0
Holly  3.7
Luke   3.5
Nick   3.7

Output:

Rank  Name   Gpa
3     Katie  3.5
1     Fred   4.0
2     Holly  3.7
3     Luke   3.5
2     Nick   3.7
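The dense variant can be sketched in plain Python (an illustration of the semantics only, not Pig's implementation): distinct gpa values are sorted descending, and tied rows share a rank with no gap before the next distinct value:

```python
def rank_desc_dense(rows, key):
    """Prepend a dense rank: ties share a rank and the next
    distinct key value gets the next consecutive rank."""
    distinct = sorted({key(r) for r in rows}, reverse=True)
    rank_of = {v: i + 1 for i, v in enumerate(distinct)}
    return [(rank_of[key(r)],) + r for r in rows]

students = [("Katie", 3.5), ("Fred", 4.0), ("Holly", 3.7),
            ("Luke", 3.5), ("Nick", 3.7)]
ranked = rank_desc_dense(students, key=lambda r: r[1])
```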
9. Rank operator
• Limitation
– Only 1 reducer
• Possible improvements
– Provide a distributed implementation
• PIG-2353
10. Groovy UDFs
register 'test.groovy' using groovy as myfuncs;
a = load '1.txt' as (a0, a1:long);
b = foreach a generate myfuncs.square(a1);
dump b;
test.groovy:
import org.apache.pig.builtin.OutputSchema;
class GroovyUDFs {
@OutputSchema('x:long')
long square(long x) {
return x*x;
}
}
11. Embed Pig into Groovy
import org.apache.pig.scripting.Pig;
public static void main(String[] args) {
String input = "input"
String output = "output"
Pig P = Pig.compile("A = load '$in'; store A into '$out';")
result = P.bind(['in':input, 'out':output]).runSingle()
if (result.isSuccessful()) {
print("Pig job succeeded")
} else {
print("Pig job failed")
}
}
Command line:
bin/pig -x local demo.groovy
12. New data type: DateTime
a = load 'input' as (a0:datetime, a1:chararray, a2:long);
b = foreach a generate a0, ToDate(a1, 'yyyyMMdd HH:mm:ss'), ToDate(a2), CurrentTime();
• Timezone support
• Millisecond precision
13. New data type: DateTime
• DateTime UDFs
– Field extraction: GetYear, GetMonth, GetDay, GetWeekYear, GetWeek, GetHour, GetMinute, GetSecond, GetMilliSecond
– Intervals: YearsBetween, MonthsBetween, WeeksBetween, DaysBetween, HoursBetween, MinutesBetween, SecondsBetween, MilliSecondsBetween
– Durations: AddDuration, SubtractDuration
– Conversion: ToDate, ToDateISO, ToMilliSeconds, ToString, ToUnixTime, CurrentTime
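Most of these map onto ordinary calendar arithmetic. A rough Python sketch of a few of them (the function names mirror the UDFs for illustration only, and the DaysBetween(end, start) argument order is an assumption):

```python
from datetime import datetime

def get_year(dt):
    # Analogue of GetYear
    return dt.year

def get_month(dt):
    # Analogue of GetMonth
    return dt.month

def days_between(end, start):
    # Analogue of DaysBetween; assumed to count whole days from start to end
    return (end - start).days

d1 = datetime(2012, 10, 12)
d2 = datetime(2012, 9, 12)
```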
14. SchemaTuple optimization
• Idea
– Generate schema-specific tuple code when the schema is known
• Benefit
– Decrease memory footprint
– Better performance
15. SchemaTuple optimization
• When tuple schema is known
(a0: int, a1: chararray, a2: double)
Original Tuple:

Tuple {
  List<Object> mFields;
  Object get(int fieldNum) {
    return mFields.get(fieldNum);
  }
  void set(int fieldNum, Object val) {
    mFields.set(fieldNum, val);
  }
}

Schema Tuple:

SchemaTuple {
  int f0;
  String f1;
  double f2;
  Object get(int fieldNum) {
    switch (fieldNum) {
      case 0: return f0;
      case 1: return f1;
      case 2: return f2;
    }
  }
  void set(int fieldNum, Object val) {
    ……
  }
}
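The same trade-off can be illustrated in Python (a loose analogy, not Pig code): the generic tuple keeps boxed fields in a per-instance list, while the schema-specific class pins one fixed slot per field, which is what shrinks the per-tuple footprint:

```python
class GenericTuple:
    """Generic tuple: every field lives in a per-instance list."""
    def __init__(self, n):
        self.fields = [None] * n
    def get(self, i):
        return self.fields[i]
    def set(self, i, v):
        self.fields[i] = v

class SchemaTupleIntStrDouble:
    """Schema-specific tuple for (int, chararray, double):
    fixed slots, no per-instance list or dict."""
    __slots__ = ("f0", "f1", "f2")
    def get(self, i):
        return getattr(self, self.__slots__[i])
    def set(self, i, v):
        setattr(self, self.__slots__[i], v)

t = SchemaTupleIntStrDouble()
t.set(0, 42)
t.set(1, "cat")
t.set(2, 3.5)
```

Field access by switch/slot index also skips the generic list lookup, which is where the performance gain comes from.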
16. Pig on new environment
• JDK 7
– All unit tests pass
– Jira: PIG-2908
• Hadoop 2.0.0
– Jira: PIG-2791
• Windows
– No need for Cygwin
– Jira: PIG-2793
– Try to make it to 0.11
17. Faster local mode
• Skip generating job.jar
– PIG-2128
– Also in 0.9 and 0.10, though unadvertised
• Remove the hardcoded 5-second wait time for JobControl
– PIG-2702
18. Better stats
• Information on the aliases/lines that make up a map/reduce job
– An information line for every map/reduce job
detailed locations: M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4], C[5, 4]
Explanation:
Map contains:
alias A: line 1 column 4
alias A: line 3 column 4
alias B: line 2 column 4
Combiner contains:
alias A: line 3 column 4
alias B: line 2 column 4
Reduce contains:
alias A: line 3 column 4
alias C: line 5 column 4
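For consumers of the log, the compact location string can be unpacked mechanically. A small Python sketch (the field layout is assumed solely from the example above):

```python
import re

def parse_locations(line):
    """Split 'M: A[1,4],... C: ... R: ...' into
    {phase: [(alias, line, column), ...]}."""
    phases = {}
    # A phase is M, C, or R followed by ':'; each entry is alias[line,col]
    for phase, body in re.findall(
            r'([MCR]):\s*((?:\w+\[\d+,\s*\d+\]\s*,?\s*)*)', line):
        phases[phase] = [
            (alias, int(ln), int(col))
            for alias, ln, col in re.findall(r'(\w+)\[(\d+),\s*(\d+)\]', body)
        ]
    return phases

loc = parse_locations(
    "M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4], C[5, 4]")
```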
19. Better notification
• Support for Twitter Ambrose
– Check out "Twitter Ambrose", open source on GitHub
– Monitors Pig job progress in a UI
20. Integrate HCat DDL
• Embed HCat DDL command in Pig script
• Run HCat DDL command in Grunt
grunt> sql create table pig_test(name string, age int, gpa double) stored as textfile;
grunt>
• Embed HCat DDL in scripting language
from org.apache.pig.scripting import Pig
ret = Pig.sql("""drop table if exists table_1;""")
if ret == 0:
    # success
21. Grunt enhancement
• History
grunt> a = load '1.txt';
grunt> b = foreach a generate $0, $1;
grunt> history
1 a = load '1.txt';
2 b = foreach a generate $0, $1;
grunt>
• Clear
– Clear screen
22. New/enhanced UDFs
• New UDFs
STARTSWITH, BagToString, BagToTuple, INVERSEMAP, KEYSET, VALUELIST, VALUESET
• Enhanced UDFs
– RANDOM: takes a seed
– AvroStorage: supports recursive records, globs and commas; upgraded to Avro 1.7.1
• EvalFunc enhancement
– getInputSchema(): Get input schema for UDF
23. Hortonworks Data Platform
• Simplify deployment to get started quickly and easily
• Monitor, manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future proof your cluster growth

Reduce risks and cost of adoption
Lower the total cost to administer and provision
Integrate with your existing ecosystem
24. Hortonworks Training
The expert source for
Apache Hadoop training & certification
Role-based Developer and Administration training
– Coursework built and maintained by the core Apache Hadoop development team.
– The “right” course, with the most extensive and realistic hands-on materials
– Provide an immersive experience into real-world Hadoop scenarios
– Public and Private courses available
Comprehensive Apache Hadoop Certification
– Become a trusted and valuable Apache Hadoop expert
25. Next Steps?
1. Download Hortonworks Data Platform: hortonworks.com/download
2. Use the getting started guide: hortonworks.com/get-started
3. Learn more… get support

Hortonworks Training
• Expert role based training
• Courses for admins, developers and operators
• Certification program
• Custom onsite options
hortonworks.com/training

Hortonworks Support
• Full lifecycle technical support across four service levels
• Delivered by Apache Hadoop Experts/Committers
• Forward-compatible
hortonworks.com/support
26. Thank You!
Questions & Answers
Follow: @hortonworks
Read: hortonworks.com/blog
Editor's Notes

Hortonworks Data Platform (HDP) is the only 100% open source Apache Hadoop distribution that provides a complete and reliable foundation for enterprises that want to build, deploy and manage big data solutions. It allows you to confidently capture, process and share data in any format, at scale, on commodity hardware and/or in a cloud environment.

As the foundation for the next-generation enterprise data architecture, HDP delivers all of the necessary components to uncover business insights from the growing streams of data flowing into and throughout your business. HDP is a fully integrated data platform that includes the stable core functions of Apache Hadoop (HDFS and MapReduce), the baseline tools to process big data (Apache Hive, Apache HBase, Apache Pig), as well as a set of advanced capabilities (Apache Ambari, Apache HCatalog and High Availability) that make big data operational and ready for the enterprise.

Run through the points on left…