Pig at LinkedIn

Pig at LinkedIn by Chris Riccomini from LinkedIn
Pig is an integral part of data analytics at LinkedIn. Learn about LinkedIn’s analytic stack, and see how Pig is used to design, develop, and deliver data products at LinkedIn. We’ll explore a successful example of Pig deployment at LinkedIn, pain points, and integration with Azkaban, Voldemort, Hadoop, and the rest of LinkedIn’s ecosystem.

  • Chris Riccomini
    Senior Data Scientist at LinkedIn
    Involved in People You May Know, Who’s Viewed My Profile, Avatara, and Distributed Computing at LinkedIn
    Previously worked on PayPal’s anti-fraud team as a data visualization engineer
  • Talking about LinkedIn’s analytics environment, the motivation for Pig at LinkedIn, how we integrated it, and Pig at LinkedIn going forward
  • The analytics stack: Aster, Hadoop, Voldemort, Azkaban, Pig
  • 40% of the jobs we run are Pig jobs
    Number of production products that use Pig:
    PYMK, ads, profile stats, Jobs For You, Talent Match, Groups You Might Like, browse maps, experimentation platform
  • In early 2009 we were working on converting PYMK from Aster to Hadoop
    Everything was Java based
    We were tired of writing joins, filters, etc. (glue code)
    Built and deployed Pig on a laptop while at a conference
    Wrote a serializer in a few days
    Significantly sped up delivery time for PYMK
  • The motivation was not ad hoc/SQL/business analytics
    The motivation was product analytics, and PRODUCTION products
    Stability was key
    Reproducibility was key
    Simplicity/understandability was key (both the scripts and the system itself)
    "If it runs now, it will always run"
  • As streaming became more popular, Pig is still used as glue, but complex jobs are now just Python instead of Java (a sketch follows below)
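    A minimal sketch of what such a Pig streaming step might look like; the script name, paths, and output schema here are hypothetical, not from the talk:
    -- Ship a Python script to the cluster and pipe records through it
    DEFINE score_members `python score_members.py` SHIP('score_members.py');
    members = LOAD '/data/members' USING VoldemortStorage();
    -- Each input tuple goes to the script's stdin; its stdout is parsed back into tuples
    scored = STREAM members THROUGH score_members AS (member_id:long, score:double);
    STORE scored INTO '/data/derived/member-scores' USING PigStorage();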
  • we use "voldemort" serialization (binary json) .. basically the same as avro
    not much csv (pigstorage) used
    some pain was involved in writing/updating the serializer (0.3 interface was insufficient)
  • we use "voldemort" serialization (binary json) .. basically the same as avro
    not much csv (pigstorage) used
    some pain was involved in writing/updating the serializer (0.3 interface was insufficient)
  • we use "voldemort" serialization (binary json) .. basically the same as avro
    not much csv (pigstorage) used
    some pain was involved in writing/updating the serializer (0.3 interface was insufficient)
  • We use Pig to read from and write to Voldemort
    All writes are currently done with read-only stores
    Reads are done using Roshan's Voldemort loader func
    Can also use Roshan's Voldemort store func to write directly to read-write stores
  • One problem that we had with Pig was how to handle folders partitioned by date (yyyy/mm/dd)
    People were querying the root directory and filtering out only the days they needed
    Other people were writing custom jobs that would add only the subfolders they were interested in as input paths
    Our solution was to add a filter parameter to VoldemortStorage (a sketch of how these loads get used follows below):
    views = LOAD '/data/etl/tracking/extracted/profile-view' USING VoldemortStorage('date.range', 'num.days=90;days.ago=1');
    member_position = LOAD '/data/etl/replicated/member/member_position/#LATEST' USING VoldemortStorage();
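    A sketch of how these two loads might be used downstream; the field names (viewee_id, member_id) and the output path are hypothetical, since the real schemas aren't shown in the talk:
    -- Join the last 90 days of profile views to the latest positions and count views per viewed member
    joined = JOIN views BY viewee_id, member_position BY member_id;
    grouped = GROUP joined BY views::viewee_id;
    counts = FOREACH grouped GENERATE group AS viewee_id, COUNT(joined) AS num_views;
    STORE counts INTO '/data/derived/profile-view-counts' USING PigStorage();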
  • We use Azkaban (like a very simple version of Oozie)
    Azkaban has a "pig" job type: specify type=pig, pig.script=path/to/pig/script.pig
    Supports parameter passing between Azkaban properties and Pig parameters
    Azkaban also provides resource locking
    And dependencies
    And scheduling
    Makes it very easy to write a production Pig job:
    Write the Pig file
    Write the job file (a sketch follows below)
    Throw the Pig and job files into a zip
    Upload the zip
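    A minimal sketch of such a job file; the file name, script path, and dependency are made up for illustration, and only the type and pig.script keys come from the talk:
    # count-views.job
    type=pig
    pig.script=pig/count-views.pig
    # run only after the upstream extract job finishes
    dependencies=extract-profile-views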
  • Just starting to use Pig for ad hoc analysis
    Mostly engineers using it now
    Some business analysts are starting to use it
    Also looking at Hive
  • Pig 0.8, Avro, Hive, UDFs
  • Dates
    The promise of Pig as a generic MapReduce language (not just Hadoop)
    Fix the data structures
    More JSON
  • Pig at LinkedIn (slide transcript)

    1. Pig at LinkedIn Chris Riccomini 9/29/10
    2. Who?
    3. What?
    4. LinkedIn Analytics
    5. Pig at LinkedIn
    6. Why?
    7. Production Quality
    8. Streaming
    9. Serialization
    10. VoldemortStorage ~ Avro
    11. views = LOAD '/data/awesome' USING VoldemortStorage();
    12. Voldemort ♥ Pig
    13. Partitioning
    14. YYYY/MM/DD
    15. Last N days?
    16. views = LOAD '/data/etl/tracking/extracted/profile-view' USING VoldemortStorage('date.range', 'num.days=90;days.ago=1')
    17. Some-file-YYYY-MM-DD
    18. member_position = LOAD '/data/etl/replicated/member/member_position/#LATEST' USING VoldemortStorage()
    19. Scheduling
    20. Azkaban
    21. type=pig pig.script=myscript.pig
    22. Ad hoc?
    23. Future at LinkedIn
    24. Wishes
    25. Dates
    26. Fix Data Types
    27. JSON
    28. Cross Platform
    29. Questions? • criccomini@linkedin.com • http://www.riccomini.name • http://www.sna-projects.com • http://www.project-voldemort.com • @criccomini • LinkedIn is Hiring! Email me!
