Retrieving Big Data
For the non-developer
Intended Audience
People who do not write code
But don’t want to wait for IT to bring them data
Disclaimer
You will have to write code. Sorry...
Worth Noting
A common objection: “But I’m not a developer”
Coding does not make you a developer
any more than patching some drywall makes
you a carpenter
Agenda
● The minimum you need to know about Big
Data (Hadoop)
o Specifically, HBase and Pig
● How you can retrieve data in HBase with Pig
o How to use Python with Pig to make querying easier
One Big Caveat
● We are not talking about analysis
● Analysis is hard
● Learning to code while also trying to understand
an analytical approach is really hard
● Following a straightforward Pig tutorial is
better than a boring lecture
Big Data in One Slide (oh boy)
● Today, Big Data == Hadoop
● Hadoop is both a distributed file system
(HDFS) and an approach to messing with
data on the file system (MapReduce)
o HBase is a popular database that sits on top of
HDFS
o Pig is a high-level language that makes messing
with data on HDFS or in HBase easier
HBase in one slide
● HBase = Hadoop Database, based on
Google’s Bigtable
● Column-oriented database – basically one
giant table
Pig in one slide
● A data flow language we will use to write
queries against HBase
● Pig is not the developer’s solution for
retrieving data from HBase, but it works well
enough for the BI analyst (and, of course, we
aren’t developers)
Pig is easier...Not Easy
● If you have no coding background, Pig will
not be easy
● But it’s the best of a bad set of options right
now
● Not hating on SQL-on-Hadoop providers, but
with SQL you describe the entire result you
want in one statement, which quickly gets
complicated; Pig lets you build the result one
step at a time
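For example, a minimal sketch of that step-at-a-time style, assuming a relation raw with first_name and last_name fields (loaded as on the next slides):

filtered = FILTER raw BY last_name == 'Buscemi';
names = FOREACH filtered GENERATE first_name;
DUMP names;

Each line just builds a new relation from the previous one.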
Here’s our HBase table
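A hypothetical sketch of the peeps table (a row key plus columns in the info column family; the names are made up for illustration):

row key | info:first_name | info:last_name
row1    | Steve           | Buscemi
row2    | Willie          | Nelson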
Let’s dive in - Load
raw = LOAD 'hbase://peeps'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:first_name info:last_name', '-loadKey true -limit 1')
AS (id:chararray, first_name:chararray, last_name:chararray);
You have to specify each field and its type in
order to load it
Response is as expected
'info:first_name info:last_name',
AS (first_name:chararray,
last_name:chararray);
Will return a first name and last name as
separate fields, e.g., “Steve”, “Buscemi”
If you can write a Vlookup()
=VLOOKUP(C34, Z17:AZ56, 17, FALSE)
You can write a load statement in Pig.
Both are equally esoteric.
But what if we don’t know the fields?
● Suppose we have a column family of friends
● Each record will contain zero to many
friends, e.g., friend_0: “John”, friend_1:
“Paul”
The number of friends is variable
● There could be thousands of friends per row
● And we cannot specify “friend_5” because
there is no guarantee that each record has
five friends
This is common...
● NoSQL databases are known for flexible
schemas and flat table structures
● Unfortunately, the way Pig handles this
problem utterly sucks...
Loading unknown friends
raw = LOAD 'hbase://SampleTable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);
Now we have info:friends_* that is represented
as a “map”
A map is just a collection of key-value pairs
● They look like this: friend_1# ‘Steve’,
friend_2# ‘Willie’
● They are very similar to Python
dictionaries...
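For illustration, the same map written as a Python dict (hypothetical names):

friends = {'friend_1': 'Steve', 'friend_2': 'Willie'}
print(friends['friend_1'])  # prints: Steve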
Here’s why they suck
● We can’t iterate over them in Pig
● To access a value, in this case a friend’s
name, I have to provide the specific key,
e.g., friend_5, to get the name of the fifth
friend
But I thought you said we didn’t
know the number of friends?
● You are right – Pig expects us to provide a
specific key for something we can’t know
● If only there were some way to iterate over a
collection of key-value pairs…
Enter Python
● Pig may not allow you to iterate over a map,
but it does allow you to write User-Defined
Functions (UDFs) in Python
● In a Python UDF we can read the map in as a
Python dict and return its key-value pairs
Python UDF for Pig
@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
return map_dict.items()
We are passing in a map, e.g., “Friend_1#Steve, Friend_2#Willie”,
which arrives in Python as a dict, e.g., {‘Friend_1’: ‘Steve’, ‘Friend_2’:
‘Willie’}
Based on blog post by Chase Seibert
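For illustration, in Python 2 (which Jython UDFs run on) items() turns that dict into a list of key-value tuples, which Pig receives as a bag:

{'Friend_1': 'Steve', 'Friend_2': 'Willie'}.items()
# -> [('Friend_1', 'Steve'), ('Friend_2', 'Willie')]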
We can add loops and logic too
@outputSchema("status:chararray")
def get_steve(map_dict):
for key, value in map_dict:
if value == 'Steve':
return "I hate that guy"
else:
return value
Or if you just want the data in Excel
register 'sample_udf.py' using jython as my_udf;
raw = LOAD 'hbase://peeps'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:first_name info:last_name info:friends_*', '-loadKey true -limit 5')
AS (id:chararray, first_name:chararray, last_name:chararray, friends:map[]);
clean_table = FOREACH raw GENERATE id, FLATTEN(my_udf.bag_of_tuples(friends));
dump clean_table;
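FLATTEN turns each map entry into its own row, so the dump looks something like this (hypothetical row key and names):

(row1,friend_1,Steve)
(row1,friend_2,Willie)

And if you would rather open the result in Excel, a sketch of storing it as CSV instead of dumping it to the console ('peeps_output' is a hypothetical output directory):

STORE clean_table INTO 'peeps_output' USING PigStorage(',');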
Final Thought
Make Your Big Data Small
● Prototype your Pig scripts on your local file
system
o Download some data to your local machine
o Start your Pig shell from the command line: pig -x
local
o Load - Transform - Dump (see the sketch below)
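A minimal local-mode sketch, assuming a hypothetical tab-delimited file people.tsv in your working directory:

-- run inside the shell started with: pig -x local
people = LOAD 'people.tsv' USING PigStorage('\t')
    AS (id:chararray, first_name:chararray, last_name:chararray);
names = FOREACH people GENERATE first_name;
DUMP names;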
Notes
Pig Tutorials
● Excellent video on Pig
● Mortar Data introduction to Pig
● Flatten HBase column with Python
Me
● codingcharlatan.com
● @GusCavanaugh