
Retrieving Big Data for the Non-Developer



  1. Retrieving Big Data for the Non-Developer
  2. Intended Audience People who do not write code But don’t want to wait for IT to bring them data
  3. Disclaimer You will have to write code. Sorry...
  4. Worth Noting A common objection, “But I’m not a developer” Coding does not make you a developer any more than patching some drywall makes you a carpenter
  5. Agenda ● The minimum you need to know about Big Data (Hadoop) o Specifically, HBase and Pig ● How you can retrieve data in HBase with Pig o How to use Python with Pig to make querying easier
  6. One Big Caveat ● We are not talking about analysis ● Analysis is hard ● Learning code and trying to understand an analytical approach is really hard ● Following a straightforward Pig tutorial is better than a boring lecture
  7. Big Data in One Slide (oh boy) ● Today, Big Data == Hadoop ● Hadoop is both a distributed file system (HDFS) and an approach to messing with data on the file system (MapReduce) o HBase is a popular database that sits on top of HDFS o Pig is a high-level language that makes messing with data on HDFS or in HBase easier
  8. HBase in One Slide ● HBase = Hadoop Database, based on Google’s Bigtable ● Column-oriented database – basically one giant table
  9. Pig in One Slide ● A data flow language we will use to write queries against HBase ● Pig is not the developer’s solution for retrieving data from HBase, but it works well enough for the BI analyst (and, of course, we aren’t developers)
  10. Pig Is Easier... Not Easy ● If you have no coding background, Pig will not be easy ● But it’s the best of a bad set of options right now ● Not hating on SQL-on-Hadoop providers, but with SQL you tell the computer what you want, which quickly gets complicated
  11. Here’s our HBase table
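The table itself appeared as an image on the original slide and didn’t survive the transcript. As a stand-in, here is a minimal Python sketch of what one row of the ‘peeps’ table might look like, assuming the info column family used by the load statements; the row key and cell values are hypothetical:

```python
# Hypothetical sketch of one row in the 'peeps' HBase table:
# row key -> {column_family:qualifier -> value}. The row key "row-001"
# and the cell values are made up for illustration.
peeps = {
    "row-001": {
        "info:first_name": "Steve",
        "info:last_name": "Buscemi",
    },
}

# The Pig LOAD statement pulls the row key plus the info: columns,
# yielding a tuple shaped like ('row-001', 'Steve', 'Buscemi').
row = peeps["row-001"]
print(row["info:first_name"], row["info:last_name"])  # Steve Buscemi
```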
  12. Let’s dive in - Load raw = LOAD 'hbase://peeps' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'info:first_name info:last_name', '-loadKey true -limit 1') AS (id:chararray, first_name:chararray, last_name:chararray); You have to specify each field and its type in order to load it
  13. Response is as expected 'info:first_name info:last_name', AS (first_name:chararray, last_name:chararray); Will return a first name and last name as separate fields, e.g., “Steve”, “Buscemi”
  14. If you can write a VLOOKUP() =VLOOKUP(C34, Z17:AZ56, 17, FALSE) You can write a load statement in Pig. Both are equally esoteric.
  15. But what if we don’t know the fields? ● Suppose we have a column family of friends ● Each record will contain zero to many friends, e.g., friend_0: “John”, friend_1: “Paul”
  16. The number of friends is variable ● There could be thousands of friends per row ● And we cannot specify “friend_5” because there is no guarantee that each record has five friends
  17. This is common... ● NoSQL databases are known for flexible schemas and flat table structures ● Unfortunately, the way Pig handles this problem utterly sucks...
  18. Loading unknown friends raw = LOAD 'hbase://SampleTable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5') AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]); Now we have info:friends_* that is represented as a “map”
  19. A map is just a collection of key-value pairs ● That look like this: friend_1# ‘Steve’, friend_2# ‘Willie’ ● They are very similar to Python dictionaries...
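The analogy is worth making concrete. A quick Python sketch (friend names taken from the slide) of why a dict is the shape we want:

```python
# The Pig map friend_1#'Steve', friend_2#'Willie' corresponds to this dict.
friends = {"friend_1": "Steve", "friend_2": "Willie"}

# Unlike a Pig map, a dict can be iterated: .items() yields every
# (key, value) pair without knowing the keys in advance.
pairs = sorted(friends.items())
print(pairs)  # [('friend_1', 'Steve'), ('friend_2', 'Willie')]
```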
  20. Here’s why they suck ● We can’t iterate over them ● To access a value, in this case a friend’s name, I have to provide the specific key, e.g., friend_5, to receive the name of the fifth friend
  21. But I thought you said we didn’t know the number of friends? ● You are right – Pig expects us to provide the specific value of something unknown ● If only there were some way to iterate over a collection of key-value pairs…
  22. Enter Python ● Pig may not allow you to iterate over a map, but it does allow you to write User-Defined Functions (UDFs) in Python ● In a Python UDF we can read in a map as a Python dict and return its key-value pairs
  23. Python UDF for Pig @outputSchema("values:bag{t:tuple(key, value)}") def bag_of_tuples(map_dict): return map_dict.items() We are passing in a map, e.g., “Friend_1#Steve, Friend_2#Willie”, and manipulating a Python dict, e.g., {'Friend_1': 'Steve', 'Friend_2': 'Willie'} Based on a blog post by Chase Seibert
  24. We can add loops and logic too @outputSchema("status:chararray") def get_steve(map_dict): for key, value in map_dict.items(): if value == 'Steve': return "I hate that guy" return "no Steve here"
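Before registering UDFs with Pig, you can prototype the same logic in plain Python on a sample dict. Below is a cleaned-up sketch of the slide’s logic, iterating with .items() so every friend gets checked; the sample data is assumed, and inside Pig the map arrives as a dict-like object:

```python
def bag_of_tuples(map_dict):
    # becomes a Pig bag of (key, value) tuples once flattened
    return list(map_dict.items())

def get_steve(map_dict):
    # scan every friend; flag Steve if he shows up anywhere
    for key, value in map_dict.items():
        if value == "Steve":
            return "I hate that guy"
    return "no Steve here"

friends = {"friend_1": "Steve", "friend_2": "Willie"}
print(get_steve(friends))                 # I hate that guy
print(get_steve({"friend_1": "Willie"}))  # no Steve here
```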
  25. Or if you just want the data in Excel register '' using jython as my_udf; raw = LOAD 'hbase://peeps' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5') AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]); clean_table = FOREACH raw GENERATE id, FLATTEN(my_udf.bag_of_tuples(friends)); dump clean_table;
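What FLATTEN does to the bag the UDF returns can be sketched in plain Python. The sample row data is assumed; in the real script the rows come from HBase:

```python
# Each input row is (id, friends-map). FLATTEN turns every (key, value)
# tuple in the UDF's output bag into its own flat output row, repeating
# the id -- exactly the tall, skinny shape Excel wants.
rows = [("row-001", {"friend_1": "Steve", "friend_2": "Willie"})]

clean_table = []
for row_id, friends in rows:
    for key, value in sorted(friends.items()):
        clean_table.append((row_id, key, value))

print(clean_table)
# [('row-001', 'friend_1', 'Steve'), ('row-001', 'friend_2', 'Willie')]
```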
  26. Final Thought Make Your Big Data Small ● Prototype your Pig scripts on your local file system o Download some data to your local machine o Start your Pig shell from the command line: pig -x local o Load - Transform - Dump
  27. Notes Pig Tutorials ● Excellent video on Pig ● Mortar Data introduction to Pig ● Flatten HBase column with Python Me ● @GusCavanaugh