1. Pig 0.11 - New Features
Daniel Dai
Member of Technical Staff
Committer, VP of Apache Pig
© Hortonworks Inc. 2011 Page 1
2. Pig 0.11 release plan
• Branched on Oct 12, 2012
• Will be released within weeks
– Fix tests: PIG-2972
– Documentation: PIG-2756
– Several last minute fixes
Architecting the Future of Big Data
Page 2
© Hortonworks Inc. 2011
3. New features
• CUBE operator
• Rank operator
• Groovy UDFs
• New data type: DateTime
• SchemaTuple optimization
• Works with JDK 7
• Works with Windows?
4. New features
• Faster local mode
• Better stats/notification
– Ambrose
• Default scripts: pigrc
• Integrate HCat DDL
• Grunt enhancement: history/clear
• UDF enhancement
– New/enhanced UDFs
– AvroStorage enhancement
5. CUBE operator
rawdata = load 'input' as (ptype, pstore, number);
cubed = cube rawdata by rollup(ptype, pstore);
result = foreach cubed generate flatten(group), SUM(cube.number);
dump result;
Input:

Ptype   Pstore  number
Dog     Miami   12
Cat     Miami   18
Turtle  Tampa   4
Dog     Tampa   14
Dog     Naples  5
Cat     Naples  9
Turtle  Naples  1

Result:

Ptype   Pstore  Sum
Cat     Miami   18
Cat     Naples  9
Cat             27
Dog     Miami   12
Dog     Tampa   14
Dog     Naples  5
Dog             31
Turtle  Tampa   4
Turtle  Naples  1
Turtle          5
                63
6. CUBE operator
• Syntax
outalias = CUBE inalias BY { CUBE expression | ROLLUP expression } [, CUBE expression | ROLLUP expression ] [PARALLEL n];
• Umbrella Jira PIG-2167
• Non-distributed version will be in 0.11 (PIG-2765)
• Distributed version still in progress (PIG-2831)
– Push algebraic computation to map/combiner
– Reference: "Distributed cube materialization on holistic measures", Arnab Nandi et al., ICDE 2011
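The ROLLUP semantics can be sketched outside Pig. The following minimal Python illustration (a hypothetical helper, not Pig internals) materializes the same three group levels ((ptype, pstore), (ptype), and the grand total) and reproduces the sums from the slide-5 example:

```python
from collections import defaultdict

def rollup_sum(rows):
    """SUM(number) over the ROLLUP groups of (ptype, pstore):
    (ptype, pstore), (ptype, None), and the grand total (None, None)."""
    sums = defaultdict(int)
    for ptype, pstore, number in rows:
        for key in ((ptype, pstore), (ptype, None), (None, None)):
            sums[key] += number
    return dict(sums)

data = [
    ("Dog", "Miami", 12), ("Cat", "Miami", 18), ("Turtle", "Tampa", 4),
    ("Dog", "Tampa", 14), ("Dog", "Naples", 5), ("Cat", "Naples", 9),
    ("Turtle", "Naples", 1),
]
sums = rollup_sum(data)
```

A full CUBE would differ only in emitting every subset of the dimensions rather than just the prefixes.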
7. Rank operator
rawdata = load 'input' as (name, gpa:double);
ranked = rank rawdata by gpa;
dump ranked;
Input:

Name   Gpa
Katie  3.5
Fred   4.0
Holly  3.7
Luke   3.5
Nick   3.7

Output:

Rank  Name   Gpa
1     Katie  3.5
5     Fred   4.0
2     Holly  3.7
1     Luke   3.5
2     Nick   3.7
8. Rank operator
rawdata = load 'input' as (name, gpa:double);
ranked = rank rawdata by gpa desc dense;
dump ranked;
Input:

Name   Gpa
Katie  3.5
Fred   4.0
Holly  3.7
Luke   3.5
Nick   3.7

Output:

Rank  Name   Gpa
3     Katie  3.5
1     Fred   4.0
2     Holly  3.7
3     Luke   3.5
2     Nick   3.7
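The dense variant can be sketched in plain Python (an illustration of the semantics only, not Pig's implementation): distinct gpa values are sorted descending, and tied rows share a rank with no gap before the next distinct value:

```python
def rank_desc_dense(rows, key):
    """Prepend a dense rank: ties share a rank and the next
    distinct key value gets the next consecutive rank."""
    distinct = sorted({key(r) for r in rows}, reverse=True)
    rank_of = {v: i + 1 for i, v in enumerate(distinct)}
    return [(rank_of[key(r)],) + r for r in rows]

students = [("Katie", 3.5), ("Fred", 4.0), ("Holly", 3.7),
            ("Luke", 3.5), ("Nick", 3.7)]
ranked = rank_desc_dense(students, key=lambda r: r[1])
```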
9. Rank operator
• Limitation
– Only 1 reducer
• Possible improvements
– Provide a distributed implementation
• PIG-2353
10. Groovy UDFs
register 'test.groovy' using groovy as myfuncs;
a = load '1.txt' as (a0, a1:long);
b = foreach a generate myfuncs.square(a1);
dump b;
test.groovy:
import org.apache.pig.builtin.OutputSchema;
class GroovyUDFs {
@OutputSchema('x:long')
long square(long x) {
return x*x;
}
}
11. Embed Pig into Groovy
import org.apache.pig.scripting.Pig;
public static void main(String[] args) {
String input = "input"
String output = "output"
Pig P = Pig.compile("A = load '$in'; store A into '$out';")
result = P.bind(['in':input, 'out':output]).runSingle()
if (result.isSuccessful()) {
print("Pig job succeeded")
} else {
print("Pig job failed")
}
}
Command line:
bin/pig -x local demo.groovy
12. New data type: DateTime
a = load 'input' as (a0:datetime, a1:chararray, a2:long);
b = foreach a generate a0, ToDate(a1, 'yyyyMMdd HH:mm:ss'), ToDate(a2), CurrentTime();
• Timezone support
• Millisecond precision
13. New data type: DateTime
• DateTime UDFs
– Field extraction: GetYear, GetMonth, GetDay, GetWeekYear, GetWeek, GetHour, GetMinute, GetSecond, GetMilliSecond
– Intervals: YearsBetween, MonthsBetween, WeeksBetween, DaysBetween, HoursBetween, MinutesBetween, SecondsBetween, MilliSecondsBetween
– Durations: AddDuration, SubtractDuration
– Conversion: ToDate, ToDateISO, ToMilliSeconds, ToString, ToUnixTime, CurrentTime
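Most of these map onto ordinary calendar arithmetic. A rough Python sketch of a few of them (the function names mirror the UDFs for illustration only, and the DaysBetween(end, start) argument order is an assumption):

```python
from datetime import datetime

def get_year(dt):
    # Analogue of GetYear
    return dt.year

def get_month(dt):
    # Analogue of GetMonth
    return dt.month

def days_between(end, start):
    # Analogue of DaysBetween; assumed to count whole days from start to end
    return (end - start).days

d1 = datetime(2012, 10, 12)
d2 = datetime(2012, 9, 12)
```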
14. SchemaTuple optimization
• Idea
– Generate schema-specific tuple code when the schema is known
• Benefit
– Decrease memory footprint
– Better performance
15. SchemaTuple optimization
• When tuple schema is known
(a0: int, a1: chararray, a2: double)
Original Tuple:

Tuple {
  List<Object> mFields;
  Object get(int fieldNum) {
    return mFields.get(fieldNum);
  }
  void set(int fieldNum, Object val) {
    mFields.set(fieldNum, val);
  }
}

Schema Tuple:

SchemaTuple {
  int f0;
  String f1;
  double f2;
  Object get(int fieldNum) {
    switch (fieldNum) {
      case 0: return f0;
      case 1: return f1;
      case 2: return f2;
    }
  }
  void set(int fieldNum, Object val) {
    ……
  }
}
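The same trade-off can be illustrated in Python (a loose analogy, not Pig code): the generic tuple keeps boxed fields in a per-instance list, while the schema-specific class pins one fixed slot per field, which is what shrinks the per-tuple footprint:

```python
class GenericTuple:
    """Generic tuple: every field lives in a per-instance list."""
    def __init__(self, n):
        self.fields = [None] * n
    def get(self, i):
        return self.fields[i]
    def set(self, i, v):
        self.fields[i] = v

class SchemaTupleIntStrDouble:
    """Schema-specific tuple for (int, chararray, double):
    fixed slots, no per-instance list or dict."""
    __slots__ = ("f0", "f1", "f2")
    def get(self, i):
        return getattr(self, self.__slots__[i])
    def set(self, i, v):
        setattr(self, self.__slots__[i], v)

t = SchemaTupleIntStrDouble()
t.set(0, 42)
t.set(1, "cat")
t.set(2, 3.5)
```

Field access by switch/slot index also skips the generic list lookup, which is where the performance gain comes from.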
16. Pig on new environment
• JDK 7
– All unit tests pass
– Jira: PIG-2908
• Hadoop 2.0.0
– Jira: PIG-2791
• Windows
– No need for Cygwin
– Jira: PIG-2793
– Try to make it to 0.11
17. Faster local mode
• Skip generating job.jar
– PIG-2128
– Also in 0.9 and 0.10, though unadvertised
• Remove the hardcoded 5-second wait time for JobControl
– PIG-2702
18. Better stats
• Information on the aliases/lines that make up a map/reduce job
– An information line for every map/reduce job
detailed locations: M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4], C[5, 4]
Explanation:
Map contains:
alias A: line 1 column 4
alias A: line 3 column 4
alias B: line 2 column 4
Combiner contains:
alias A: line 3 column 4
alias B: line 2 column 4
Reduce contains:
alias A: line 3 column 4
alias C: line 5 column 4
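For consumers of the log, the compact location string can be unpacked mechanically. A small Python sketch (the field layout is assumed solely from the example above):

```python
import re

def parse_locations(line):
    """Split 'M: A[1,4],... C: ... R: ...' into
    {phase: [(alias, line, column), ...]}."""
    phases = {}
    # A phase is M, C, or R followed by ':'; each entry is alias[line,col]
    for phase, body in re.findall(
            r'([MCR]):\s*((?:\w+\[\d+,\s*\d+\]\s*,?\s*)*)', line):
        phases[phase] = [
            (alias, int(ln), int(col))
            for alias, ln, col in re.findall(r'(\w+)\[(\d+),\s*(\d+)\]', body)
        ]
    return phases

loc = parse_locations(
    "M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4], C[5, 4]")
```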
19. Better notification
• Support for Twitter Ambrose
– Check out "Twitter Ambrose", open source on GitHub
– Monitors Pig job progress in a UI
20. Integrate HCat DDL
• Embed HCat DDL command in Pig script
• Run HCat DDL command in Grunt
grunt> sql create table pig_test(name string, age int, gpa double) stored as textfile;
grunt>
• Embed HCat DDL in scripting language
from org.apache.pig.scripting import Pig
ret = Pig.sql("""drop table if exists table_1;""")
if ret == 0:
    # success
21. Grunt enhancement
• History
grunt> a = load '1.txt';
grunt> b = foreach a generate $0, $1;
grunt> history
1 a = load '1.txt';
2 b = foreach a generate $0, $1;
grunt>
• Clear
– Clear screen
22. New/enhanced UDFs
• New UDFs
STARTSWITH, BagToString, BagToTuple, INVERSEMAP, KEYSET, VALUELIST, VALUESET
• Enhanced UDFs
– RANDOM: takes a seed
– AvroStorage: supports recursive records, globs and commas; upgraded to Avro 1.7.1
• EvalFunc enhancement
– getInputSchema(): Get input schema for UDF
23. Hortonworks Data Platform
• Simplify deployment to get started quickly and easily
• Monitor, manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future proof your cluster growth

Reduce risks and cost of adoption
Lower the total cost to administer and provision
Integrate with your existing ecosystem
24. Hortonworks Training
The expert source for
Apache Hadoop training & certification
Role-based Developer and Administration training
– Coursework built and maintained by the core Apache Hadoop development team.
– The “right” course, with the most extensive and realistic hands-on materials
– Provide an immersive experience into real-world Hadoop scenarios
– Public and Private courses available
Comprehensive Apache Hadoop Certification
– Become a trusted and valuable Apache Hadoop expert
25. Next Steps?
1. Download Hortonworks Data Platform: hortonworks.com/download
2. Use the getting started guide: hortonworks.com/get-started
3. Learn more… get support

Hortonworks Training
• Expert role based training
• Courses for admins, developers and operators
• Certification program
• Custom onsite options
hortonworks.com/training

Hortonworks Support
• Full lifecycle technical support across four service levels
• Delivered by Apache Hadoop Experts/Committers
• Forward-compatible
hortonworks.com/support
26. Thank You!
Questions & Answers
Follow: @hortonworks
Read: hortonworks.com/blog
Editor's Notes

Hortonworks Data Platform (HDP) is the only 100% open source Apache Hadoop distribution that provides a complete and reliable foundation for enterprises that want to build, deploy and manage big data solutions. It allows you to confidently capture, process and share data in any format, at scale, on commodity hardware and/or in a cloud environment.

As the foundation for the next-generation enterprise data architecture, HDP delivers all of the necessary components to uncover business insights from the growing streams of data flowing into and throughout your business. HDP is a fully integrated data platform that includes the stable core functions of Apache Hadoop (HDFS and MapReduce), the baseline tools to process big data (Apache Hive, Apache HBase, Apache Pig), as well as a set of advanced capabilities (Apache Ambari, Apache HCatalog and High Availability) that make big data operational and ready for the enterprise.

Run through the points on left…