4. Worth Noting
A common objection, “But I’m not a developer”
Coding does not make you a developer
any more than patching some drywall
makes you a carpenter
5. Agenda
● The minimum you need to know about Big
Data (Hadoop)
o Specifically, HBase and Pig
● How you can retrieve data in HBase with Pig
o How to use Python with Pig to make querying easier
6. One Big Caveat
● We are not talking about analysis
● Analysis is hard
● Learning code and trying to understand an
analytical approach is really hard
● Following a straightforward Pig tutorial is
better than a boring lecture
7. Big Data in One Slide (oh boy)
● Today, Big Data == Hadoop
● Hadoop is both a distributed file system
(HDFS) and an approach to messing with
data on the file system (MapReduce)
o HBase is a popular database that sits on top of
HDFS
o Pig is a high level language that makes messing
with data on HDFS or in HBase easier
8. HBase in one slide
● HBase = Hadoop Database, based on
Google’s Big Table
● Column-oriented database – basically one
giant table
9. Pig in one slide
● A data flow language we will use to write
queries against HBase
● Pig is not the developer’s solution for
retrieving data from HBase, but it works well
enough for the BI analyst (and, of course, we
aren’t developers)
10. Pig is easier...Not Easy
● If you have no coding background, Pig will
not be easy
● But it’s the best of a bad set of options right
now
● Not hating on SQL-on-Hadoop providers, but
with SQL you tell the computer what you
want, which quickly gets complicated
12. Let’s dive in - Load
raw = LOAD 'hbase://peeps'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:first_name info:last_name', '-loadKey true -limit 1')
AS (id:chararray, first_name:chararray, last_name:chararray);
You have to specify each field and its type in
order to load it
13. Response is as expected
'info:first_name info:last_name',
AS (first_name:chararray,
last_name:chararray);
Will return a first name and last name as
separate fields, e.g., “Steve”, “Buscemi”
14. If you can write a Vlookup()
=VLOOKUP(C34, Z17:AZ56, 17, FALSE)
You can write a load statement in Pig.
Both are equally esoteric.
15. But what if we don’t know the fields?
● Suppose we have a column family of friends
● Each record will contain zero to many
friends, e.g., friend_0: “John”, friend_1:
“Paul”
16. The number of friends is variable
● There could be thousands of friends per row
● And we cannot specify “friend_5” because
there is no guarantee that each record has
five friends
17. This is common...
● NoSQL databases are known for flexible
schemas and flat table structures
● Unfortunately, the way Pig handles this
problem utterly sucks...
18. Loading unknown friends
raw = LOAD 'hbase://SampleTable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);
Now we have info:friends_* that is represented
as a “map”
19. A map is just a collection of key-
value pairs
● They look like this: friend_1# ‘Steve’,
friend_2# ‘Willie’
● They are very similar to Python
dictionaries...
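The analogy is worth making concrete: a Pig map such as friend_1# ‘Steve’ maps directly onto a Python dict, and the dict, unlike the Pig map, can be iterated. A minimal sketch (the friend data here is illustrative, not from a real HBase table):

```python
# A Pig map like "friend_1#'Steve', friend_2#'Willie'" corresponds
# to a Python dictionary (sample data for illustration):
friends = {"friend_1": "Steve", "friend_2": "Willie"}

# Unlike a Pig map, a dict can be iterated directly,
# yielding each key-value pair in turn:
for key, value in friends.items():
    print(key, value)
```

This is exactly the property we will lean on when we drop into a Python UDF.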
20. Here’s why they suck
● We can’t iterate over them
● In order to access a value, in this case a
friend’s name, I have to provide the specific
key value, e.g., friend_5, in order to receive
the name of the fifth friend
21. But I thought you said we didn’t
know the number of friends?
● You are right – Pig expects us to provide the
specific value of something unknown
● If only there were some way to iterate over a
collection of key-value pairs…
22. Enter Python
● Pig may not allow you to iterate over a map,
but it does allow you to write User-Defined
Functions (UDFs) in Python
● In a python UDF we can read in a map as a
python dict and return key-value pairs
23. Python UDF for Pig
@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
    return map_dict.items()
We are passing in a map, e.g., “Friend_1#Steve, Friend_2#Willie”,
and manipulating a python dict, e.g., {'Friend_1': 'Steve', 'Friend_2':
'Willie'}
Based on blog post by Chase Seibert
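Because the UDF body is ordinary Python, you can exercise it outside Pig with a plain dict to see the bag-of-tuples shape it returns (the friend data below is illustrative):

```python
def bag_of_tuples(map_dict):
    # Convert a dict (what Pig hands a UDF for a map field)
    # into a list of (key, value) tuples, which Pig reads
    # back as a bag of tuples.
    return list(map_dict.items())

result = bag_of_tuples({"Friend_1": "Steve", "Friend_2": "Willie"})
```

Testing the function this way, before registering it with Pig, saves a lot of slow MapReduce round-trips.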
24. We can add loops and logic too
@outputSchema("status:chararray")
def get_steve(map_dict):
    for key, value in map_dict.items():
        if value == 'Steve':
            return "I hate that guy"
        else:
            return value
25. Or if you just want the data in Excel
register 'sample_udf.py' using jython as my_udf;
raw = LOAD 'hbase://peeps'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);
clean_table = FOREACH raw GENERATE id, FLATTEN(my_udf.bag_of_tuples(friends));
dump clean_table;
26. Final Thought
Make Your Big Data Small
● Prototype your Pig Scripts on your local file
system
o Download some data to your local machine
o Start your Pig shell from the command line: pig -x
local
o Load - Transform - Dump
27. Notes
Pig Tutorials
● Excellent video on Pig
● Mortar Data introduction to Pig
● Flatten HBase column with Python
Me
● codingcharlatan.com
● @GusCavanaugh