Pig 0.11 - New Features

Daniel Dai
Member of Technical Staff
Committer, VP of Apache Pig




Pig 0.11 release plan

• Branched on Oct 12, 2012
• Release expected in the coming weeks
 – Fix tests: PIG-2972
 – Documentation: PIG-2756
 – Several last-minute fixes




New features

•   CUBE operator
•   Rank operator
•   Groovy UDFs
•   New data type: DateTime
•   SchemaTuple optimization
•   Works with JDK 7
•   Works with Windows (tentative)

New features

• Faster local mode
• Better stats/notification
  – Ambrose support
• Default scripts: pigrc
• Integrate HCat DDL
• Grunt enhancements: history/clear
• UDF enhancements
  – New/enhanced UDFs
  – AvroStorage enhancements
CUBE operator
  rawdata = load 'input' as (ptype, pstore, number);
  cubed = cube rawdata by rollup(ptype, pstore);
  result = foreach cubed generate flatten(group), SUM(cube.number);
  dump result;

Input:

  Ptype    Pstore   Number
  Dog      Miami    12
  Cat      Miami    18
  Turtle   Tampa    4
  Dog      Tampa    14
  Cat      Naples   9
  Dog      Naples   5
  Turtle   Naples   1

Output:

  Ptype    Pstore   Sum
  Cat      Miami    18
  Cat      Naples   9
  Cat               27
  Dog      Miami    12
  Dog      Tampa    14
  Dog      Naples   5
  Dog               31
  Turtle   Tampa    4
  Turtle   Naples   1
  Turtle            5
                    63
CUBE operator
• Syntax (a combined sketch follows below)
    outalias = CUBE inalias BY { CUBE expression | ROLLUP expression }
               [, { CUBE expression | ROLLUP expression } ] [PARALLEL n];

• Umbrella Jira: PIG-2167
• Non-distributed version will be in 0.11 (PIG-2765)
• Distributed version still in progress (PIG-2831)
  – Pushes algebraic computation to the map/combiner
  – Reference: “Distributed cube materialization on holistic measures”,
    Arnab Nandi et al., ICDE 2011
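
A minimal sketch of a full CUBE over the same input (all 2^2 grouping
combinations, versus the hierarchical groups of ROLLUP); the column types are
illustrative, and the grouped bag is assumed to be named cube as in the
ROLLUP example on the previous slide:

  rawdata = load 'input' as (ptype:chararray, pstore:chararray, number:int);
  -- cube() generates every combination: (ptype,pstore), (ptype,*), (*,pstore), (*,*)
  cubed = cube rawdata by cube(ptype, pstore);
  result = foreach cubed generate flatten(group) as (ptype, pstore),
                                  SUM(cube.number) as total;
  dump result;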

Rank operator
                      rawdata = load 'input' as (name, gpa:double);
                      ranked = rank rawdata by gpa;
                      dump ranked;




Input:

  Name    Gpa
  Katie   3.5
  Fred    4.0
  Holly   3.7
  Luke    3.5
  Nick    3.7

Output:

  Rank   Name    Gpa
  1      Katie   3.5
  5      Fred    4.0
  2      Holly   3.7
  1      Luke    3.5
  2      Nick    3.7




Rank operator
                      rawdata = load 'input' as (name, gpa:double);
                      ranked = rank rawdata by gpa desc dense;
                      dump ranked;




Input:

  Name    Gpa
  Katie   3.5
  Fred    4.0
  Holly   3.7
  Luke    3.5
  Nick    3.7

Output:

  Rank   Name    Gpa
  3      Katie   3.5
  1      Fred    4.0
  2      Holly   3.7
  3      Luke    3.5
  2      Nick    3.7
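
RANK also works without a BY clause, assigning each tuple a sequential id —
a minimal sketch, assuming 0.11 behavior:

  rawdata = load 'input' as (name, gpa:double);
  -- no BY clause: tuples are simply numbered in order, so there are no ties
  numbered = rank rawdata;
  dump numbered;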




Rank operator
• Limitation
  – Runs on a single reducer
• Possible improvement
  – Provide a distributed implementation
• PIG-2353




Groovy UDFs
                   register 'test.groovy' using groovy as myfuncs;

                   a = load '1.txt' as (a0, a1:long);
                   b = foreach a generate myfuncs.square(a1);
                   dump b;


  test.groovy:

                import org.apache.pig.builtin.OutputSchema;

                class GroovyUDFs {
                  @OutputSchema('x:long')
                  long square(long x) {
                    return x*x;
                  }
                }




Embed Pig into Groovy
   import org.apache.pig.scripting.Pig;

   public static void main(String[] args) {
    String input = "input"
    String output = "output"

       Pig P = Pig.compile("A = load '$in'; store A into '$out';")

       result = P.bind(['in':input, 'out':output]).runSingle()

       if (result.isSuccessful()) {
         print("Pig job succeeded")
       } else {
         print("Pig job failed")
       }
   }


Command line:
   bin/pig -x local demo.groovy


New data type: DateTime
    a = load 'input' as (a0:datetime, a1:chararray, a2:long);
    b = foreach a generate a0, ToDate(a1, 'yyyyMMdd HH:mm:ss'),
        ToDate(a2), CurrentTime();


•   Supports time zones (sketch below)
•   Millisecond precision
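
A minimal sketch of the time-zone support via the three-argument ToDate; the
format string and the '+08:00' offset are illustrative:

  a = load 'input' as (t:chararray);
  -- parse the string in an explicit time zone
  b = foreach a generate ToDate(t, 'yyyyMMdd HH:mm:ss', '+08:00') as dt;
  dump b;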




New data type: DateTime
•     DateTime UDFs (a duration sketch follows the table)
    GetYear                                    YearsBetween          SubtractDuration
    GetMonth                                   MonthsBetween         ToDate
    GetDay                                     WeeksBetween          ToDateISO
    GetWeekYear                                DaysBetween           ToMilliSeconds
    GetWeek                                    HoursBetween          ToString
    GetHour                                    MinutesBetween        ToUnixTime
    GetMinute                                  SecondsBetween        CurrentTime
    GetSecond                                  MilliSecondsBetween
    GetMilliSecond                             AddDuration
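
The *Between UDFs take two datetimes, and the duration UDFs take ISO 8601
duration strings — a hedged sketch; the field names and the 'P1D' (one day)
duration are illustrative:

  a = load 'input' as (t1:datetime, t2:datetime);
  -- add one day to t1; count whole days between t1 and t2
  b = foreach a generate AddDuration(t1, 'P1D'), DaysBetween(t2, t1);
  dump b;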




SchemaTuple optimization
• Idea
  – Generate schema-specific tuple code when the schema is known
• Benefits
  – Smaller memory footprint
  – Better performance




SchemaTuple optimization
•     When tuple schema is known
                     (a0: int, a1: chararray, a2: double)

Original Tuple:

  Tuple {
    List<Object> mFields;

    Object get(int fieldNum) {
      return mFields.get(fieldNum);
    }

    void set(int fieldNum, Object val) {
      mFields.set(fieldNum, val);
    }
  }

Schema Tuple:

  SchemaTuple {
    int f0;
    String f1;
    double f2;

    Object get(int fieldNum) {
      switch (fieldNum) {
        case 0: return f0;
        case 1: return f1;
        case 2: return f2;
      }
    }

    void set(int fieldNum, Object val) {
      ……
    }
  }
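
SchemaTuple generation is opt-in; assuming the property is named
pig.schematuple (an assumption — the slides do not give the property name),
a script would enable it like this:

  -- opt in to generated, schema-specific tuple classes (property name assumed)
  set pig.schematuple true;
  a = load 'input' as (a0:int, a1:chararray, a2:double);
  b = foreach a generate a0 + 1, a1, a2;
  dump b;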



Pig on new environments
•    JDK 7
    – All unit tests pass
    – Jira: PIG-2908
•    Hadoop 2.0.0
    – Jira: PIG-2791
•    Windows
    – No need for Cygwin
    – Jira: PIG-2793
    – May still make it into 0.11


Faster local mode
• Skip generating job.jar
  – PIG-2128
  – Also in 0.9 and 0.10, though unadvertised
• Remove the hardcoded 5-second wait in JobControl
  – PIG-2702




Better stats
• Reports which aliases (and their source lines) run in each map/reduce job
  – One information line per map/reduce job, e.g.:
      detailed locations: M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4],C[5,4]

    Explanation:
      Map contains:
                   alias A: line 1 column 4
                   alias A: line 3 column 4
                   alias B: line 2 column 4
      Combiner contains:
                   alias A: line 3 column 4
                   alias B: line 2 column 4
      Reduce contains:
                   alias A: line 3 column 4
                   alias C: line 5 column 4

Better notification
• Adds job-progress notifications to support Ambrose
  – Check out “Twitter Ambrose”, open source on GitHub
  – Monitors Pig job progress in a web UI




Integrate HCat DDL
•   Embed HCat DDL command in Pig script
•   Run HCat DDL command in Grunt
    grunt> sql create table pig_test(name string, age int, gpa double)
    stored as textfile;
    grunt>



•   Embed HCat DDL in scripting language
    from org.apache.pig.scripting import Pig
    ret = Pig.sql("""drop table if exists table_1;""")
    if ret==0:
        #success




Grunt enhancement
•     History
    grunt> a = load '1.txt';
    grunt> b = foreach a generate $0, $1;
    grunt> history
    1 a = load '1.txt';
    2 b = foreach a generate $0, $1;
    grunt>

•     Clear
    – Clear screen




New/enhanced UDFs
• New UDFs (a usage sketch follows below)
    STARTSWITH      INVERSEMAP      VALUESET
    BagToString     KEYSET
    BagToTuple      VALUELIST

• Enhanced UDFs
    RANDOM          Now takes a seed
    AvroStorage     Supports recursive records
                    Supports globs and comma-separated paths
                    Upgraded to Avro 1.7.1

• EvalFunc enhancement
    – getInputSchema(): get the input schema inside a UDF
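
A hedged sketch exercising a few of the new UDFs; the input schema and field
names are hypothetical:

  a = load 'input' as (name:chararray, tags:bag{t:(tag:chararray)}, props:map[]);
  b = foreach a generate
        STARTSWITH(name, 'pig'),   -- true if name begins with 'pig'
        BagToString(tags, '|'),    -- join the bag's items with a delimiter
        KEYSET(props);             -- the map's keys, as a bag (assumed return shape)
  dump b;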
Hortonworks Data Platform
• Simplify deployment to get started quickly and easily
• Monitor, manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high-availability architecture
• Tested at scale to future-proof your cluster growth

 Reduce risks and cost of adoption
 Lower the total cost to administer and provision
 Integrate with your existing ecosystem
Hortonworks Training
The expert source for Apache Hadoop training & certification

Role-based Developer and Administration training
   – Coursework built and maintained by the core Apache Hadoop development team.
   – The “right” course, with the most extensive and realistic hands-on materials
   – Provide an immersive experience into real-world Hadoop scenarios
   – Public and Private courses available


Comprehensive Apache Hadoop Certification
   – Become a trusted and valuable
     Apache Hadoop expert




Next Steps?

1. Download Hortonworks Data Platform
   hortonworks.com/download

2. Use the getting started guide
   hortonworks.com/get-started

3. Learn more… get support

Hortonworks Training
• Expert role-based training
• Courses for admins, developers and operators
• Certification program
• Custom onsite options
hortonworks.com/training

Hortonworks Support
• Full lifecycle technical support across four service levels
• Delivered by Apache Hadoop Experts/Committers
• Forward-compatible
hortonworks.com/support


Thank You!
Questions & Answers

Follow: @hortonworks
Read: hortonworks.com/blog






Editor's Notes

  1. Hortonworks Data Platform (HDP) is the only 100% open source Apache Hadoop distribution that provides a complete and reliable foundation for enterprises that want to build, deploy and manage big data solutions. It allows you to confidently capture, process and share data in any format, at scale on commodity hardware and/or in a cloud environment. As the foundation for the next generation enterprise data architecture, HDP delivers all of the necessary components to uncover business insights from the growing streams of data flowing into and throughout your business. HDP is a fully integrated data platform that includes the stable core functions of Apache Hadoop (HDFS and MapReduce), the baseline tools to process big data (Apache Hive, Apache HBase, Apache Pig) as well as a set of advanced capabilities (Apache Ambari, Apache HCatalog and High Availability) that make big data operational and ready for the enterprise.  Run through the points on left…