The document describes a project to organize US Census data into a more accessible format. The initial objectives are to create a repeatable process to build a "Data-Mart" of census data using Hive and Hadoop, and to make the data more understandable by organizing it into logical categories. The project involves downloading census data, converting it to SAS tables and CSVs, learning Hadoop and Hive, categorizing the census data topics, creating logical column headers using Ruby scripts, and loading the data onto HDFS to build an external Hive table and query the data in Impala.
2. “MAKE THE CENSUS DATA EASIER TO USE”
PROJECT OVERVIEW
INITIAL OBJECTIVES
• Create a repeatable process to build a “Data-Mart” of all census
data
• Utilize Hive and Hadoop to write and store data
• Make census data more accessible and understandable by
organizing data into categories with logical column headers
3. QUICK REVIEW OF ACS DATA
• “American Community Survey”
• Sent to approximately 295,000 addresses monthly (or 3.5 million
per year)
• The ACS only includes approximately 2 million final interviews per
year
• The survey was fully implemented in 2005
• Data comes in 3 forms:
• 1-year, 3-year, and 5-year estimates
4. 5 YEAR ACS
The 2014 ACS 5-year estimates were
released in 2015 and summarize
responses received in 2010, 2011, 2012,
2013 and 2014 for all geographies.
This is most suitable for data users
interested in longer-term changes
at small geographic scales.
FYI - “Places” refer to the statistical counterparts of incorporated places, and are
delineated to provide data for settled concentrations of population that are identifiable by
name but are not legally incorporated under the laws of the state in which they are
located. e.g. Boroughs in New York
(Figure: data-availability chart; red circles denote available data)
5. ACS DATA STRUCTURE
2014 5 YR ACS: 52 states/territories, each with 121 “sequences”.
Each sequence is a table containing the same geographic data with
different census data.
6. PRIOR AFTER-ACTION ITEMS MOTIVATING PROJECT DEVELOPMENT
WHY?
• During my starter project with census data, it was noted
that the topics of information were scattered across
several tables, or “sequences”
• The column headers were also coded, requiring the use
of a lookup table to decipher the headers
• Human-readable titles then had to be created manually and
entered as headers or metadata in place of the coded
headers
8. THAT’S A LOTTA BYTES
STEP 1:
• The first step taken was downloading the census data
• The census data used previously lacked granular detail and
represented estimates based on only one year
• The 5-year ACS was chosen for more reliable estimates, and a
more granular version of that data was selected
• Once selected and downloaded, the data had to be converted to
SAS tables
9. SAS TABLE CONVERSION
• After installing SAS and running it on a VM, the census sequence files were
converted to SAS tables using a macro that directs the conversion. 176 GB of
SAS tables were created.
• The macro is supplied by the Census Bureau, but must be modified to output the
desired data
10. THE JOURNEY CONTINUES
SAS TABLE TO CSV
• A Ruby script calls Python
scripts to convert the SAS
tables to CSVs
• CSVs reduce overall data
size from 176 GB to 53 GB
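A minimal sketch of that conversion step, assuming a hypothetical helper
script sas_to_csv.py that uses pandas to read one sas7bdat file and write
it back out as a CSV (all paths and names are illustrative):

require "fileutils"

# Convert every SAS table in sas_tables/ by shelling out to Python once
# per file; warn loudly if any single conversion fails.
FileUtils.mkdir_p("csv_tables")
Dir.glob("sas_tables/*.sas7bdat") do |sas_file|
  csv_file = File.join("csv_tables", File.basename(sas_file, ".sas7bdat") + ".csv")
  ok = system("python", "sas_to_csv.py", sas_file, csv_file)
  warn "conversion failed for #{sas_file}" unless ok
end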
11. I’M HADOOP AND SO CAN YOU!
STEP 2
• Once the data was obtained,
the next step was learning
Hadoop
• The Hadoop Hortonworks VM
sandbox was installed and I
began training on Hive
• Once I had created training
tables and felt comfortable
working in HDFS, I began
deciding on a data structure
12. THE WALMART THEOREM
STEP 3
• It made the most sense to organize the data into sensible categories,
each on a single table. While the tables would be large, this avoids
the encumbrance of having only half the data needed on any one table.
• Logical titles would eliminate the need for a lookup table and for
manually entering titles when the desired data was selected.
• It would also reduce the redundancy of repeating geographical
information on every sequence.
13. CATEGORIES
• Manually went through the entire
census, categorizing tables
into topics and subtopics
• Requested a change in the scope of
the objectives
• Focused on marriage data for
a proof-of-concept model
14. WE’VE GOT SOME WORK TO DO.
• Began looking at how the table
metadata was organized and whether
a script could automate
creating logical names
15. • Created a hierarchy and row number to build logical titles
• Problems with this method:
-Can’t be easily applied to other subjects, no repeatability
-Would be just as effective to write out the names manually
16. IN COMES PETER.
• While discussing my dilemma
with Peter, he showed me the
Census Reporter, a group that
“helps journalists navigate
and understand information
from the U.S. Census
bureau.”
• They had already organized
the entire census, providing the
indents and, most importantly,
the parent column ID
17. TIME TO CODE
STEP 4
• Throughout this process, I had
continued training in Ruby and
was convinced it would be the
best method for me to create
the logical titles. This was
based on several factors:
- Ruby has a great CSV
library built in
- I was simultaneously training
in Ruby and had this project in
mind during training sessions
- A Ruby algorithm would be
easily repeated, shared, and
improved
18. LOADING THE CSV OF COLUMN HEADERS
• Created an array of hashes from
each row
• The parent column ID points to the
predecessor row containing the next
portion of a logical title
• Realized that all that is needed are
the title, the column ID, and the
parent column ID
• Learned how to use a hash lookup
from Thon in a previous code
challenge
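A minimal sketch of that load, assuming a Census Reporter-style metadata
file (the name column_metadata.csv and its field names are assumptions)
with column_id, column_title, and parent_column_id fields:

require "csv"

# Key the lookup by column_id so each later parent walk is a constant-time
# hash lookup rather than an array scan. parent_column_id comes back as nil
# for top-level columns, which terminates the walk.
columns = {}
CSV.foreach("column_metadata.csv", headers: true) do |row|
  columns[row["column_id"]] = {
    title:     row["column_title"],
    parent_id: row["parent_column_id"],
  }
end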
19. PUTTING TITLES TOGETHER
• Using the column_id as the
key to the hash, each lookup
returns that column’s title
and its parent column ID.
• A hash lookup on the parent
column ID then returns the
corresponding column title
and its parent ID, which are
interpolated into the growing
title. This process continues
until there are no more
parents (nil).
20. LET’S MAKE IT LEGIBLE
• Use flatten to remove array
nesting
• Compact removes the nil
values
• Reverse puts the string in the
correct order
• Join creates a single string
from the arrays
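Putting the last two slides together, a sketch of the whole title build,
reusing the columns hash from the earlier sketch (the example ID and
output are illustrative):

# Walk the parent chain, building a nested array [title, [parent title, ...]]
# that terminates in nil once a column has no parent.
def title_chain(column_id, columns)
  return nil unless column_id
  entry = columns[column_id]
  [entry[:title], title_chain(entry[:parent_id], columns)]
end

def logical_title(column_id, columns)
  title_chain(column_id, columns)
    .flatten   # remove the array nesting
    .compact   # drop the terminating nil
    .reverse   # put the string in root-first order
    .join(": ")
end

# logical_title("B12001003", columns)  # => "Total: Male: Never married"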
23. DONE? NOT SO MUCH.
Each column ID represents metadata for both an
estimate (e) and a margin of error (m) value:
B01001003 = B01001e3 & B01001m3
24. TRANSFORM COLUMN TITLE & ID FOR E & M
Original code (shown on the slide):
ex. B01001003 = B01001e3 & B01001m3 | B01001103 = B01001e103 & B01001m103
Issue: codes with ‘e’ or ‘m’ column numbers greater than 99 were not being replaced
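A sketch of one fix, assuming every column ID is a 6-character table ID
followed by a 3-digit column number; slicing by position and converting
with to_i sidesteps the regex problem with numbers over 99:

def estimate_and_margin(column_id)
  table_id = column_id[0, 6]        # e.g. "B01001"
  number   = column_id[6, 3].to_i   # "003" -> 3, "103" -> 103
  ["#{table_id}e#{number}", "#{table_id}m#{number}"]
end

estimate_and_margin("B01001003")  # => ["B01001e3", "B01001m3"]
estimate_and_margin("B01001103")  # => ["B01001e103", "B01001m103"]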
26. FINALLY! READY TO CHANGE A .CSV
• Need to load the CSVs created from the SAS tables, find the coded headers,
replace them with the full formatted titles, and write the results to a new CSV
27. LOOKING FOR A HASH
• Iterating the parsed file yields CSV::Row objects, not plain arrays (a
hard-learned lesson); fields must be accessed by index with row[i]
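A sketch of this pass; the file names and the titles_by_code hash (mapping
each coded header to its full title) are assumptions:

require "csv"

def retitle_csv(in_path, out_path, titles_by_code)
  table = CSV.read(in_path, headers: true)
  # Swap each coded header for its full title; keep the code if no match.
  new_headers = table.headers.map { |code| titles_by_code.fetch(code, code) }
  CSV.open(out_path, "w") do |out|
    out << new_headers
    table.each do |row|                           # row is a CSV::Row, not an Array,
      out << (0...row.length).map { |i| row[i] }  # so fields are read with [i]
    end
  end
end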
28. SMALL ISSUES
• When loading files, load only CSVs;
other file types were being picked up
• How to retitle the new CSV files
• Combining two arrays
(sketches of all three below)
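Sketches of the three fixes, with illustrative paths and values:

csv_files = Dir.glob("csv_tables/*.csv")           # 1. pick up only CSVs

csv_files.each do |path|
  titled = path.sub(/\.csv\z/, "_titled.csv")      # 2. derive the new file name
  # ...retitle_csv(path, titled, titles_by_code)...
end

geo_headers  = ["STATE", "LOGRECNO"]               # 3. combining two arrays
data_headers = ["Total: Male: Never married"]
all_headers  = geo_headers + data_headers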
29. COOL!! ALL DONE RIGHT??
WELLLLLL….
• Still have 52 versions of each sequence
• The total number of sequences is 121
(ex: sf0037ak.csv - sequence file 37 of 121 for the state of Alaska)
• Topics are spread across multiple sequences
• Implement the WalMart Theorem.
30. ONE. BIG. TABLE.
• Used integrator to combine the sequences and their
respective columns that relate to marriage data
• Needed to create a ‘recspec’ key: a concatenation of LOGRECNO and STATE
• Very large memory load; sorted the data and used extMerge
• 535,345 rows, 1,011 columns, 541,233,795 cells
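A sketch of that key, with illustrative values; the real concatenation was
done inside integrator:

# A sequence row is unique only within one state's file, so concatenating
# STATE and LOGRECNO produces a single sortable join key.
def recspec(state, logrecno)
  "#{state}#{logrecno}"
end

recspec("ak", "0001962")  # => "ak0001962"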
31. UPLOADING TO HDFS
• Originally uploaded the file to the Hortonworks Sandbox, but the
sandbox was ill-equipped for such a large table
• Switched to the production Hadoop cluster
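For reference, a sketch of that upload driving the standard hdfs CLI from
Ruby; the target directory is an assumption:

local_csv = "census_marriage.csv"
hdfs_dir  = "/user/census/marriage/"
system("hdfs", "dfs", "-mkdir", "-p", hdfs_dir)           # create target dir
system("hdfs", "dfs", "-put", "-f", local_csv, hdfs_dir)  # -f overwrites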
32. CREATE HIVE/IMPALA TABLE
• Exported the metadata created by the integrator join
• Created an external table pointing to the loaded census CSV
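A minimal sketch of what that external-table DDL could look like, generated
from Ruby so the ~1,011 column definitions need not be typed by hand. The
table name, HDFS path, and columns are illustrative, and whether Hive
accepts every character in the long titled headers is one of the foreseen
issues noted later:

# Build one backticked STRING column definition per exported header.
columns = ["STATE", "LOGRECNO", "Total: Male: Never married"]
column_defs = columns.map { |c| "  `#{c}` STRING" }.join(",\n")
puts <<~HQL
  CREATE EXTERNAL TABLE census_marriage (
  #{column_defs}
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/user/census/marriage/';
HQL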
34. CRADLE TO GRAVE PICTOGRAPH
SEQUENCE FILES → SAS TABLE → CSV TABLE WITH CODED HEADERS →
CSV TABLE WITH TITLED HEADERS → SEQUENCE TABLE CSVS WITH COMMON
CATEGORY JOINED INTO ONE CSV → CSV LOADED ONTO HDFS →
EXTERNAL HIVE TABLE CREATED → IMPORT INTO BDD
Tools used along the way: SAS, Python script, Ruby script, Integrator,
Hortonworks, Hive/Impala
Next steps: repeat for additional categories; the process is easily
repeatable for additional census categories
35. USE CASE
• David’s Bridal becomes a client
• They are trying to decide among 10
counties for the location of their next
store
• Their research says first-time
marriages involve more expensive dresses
• Add census marriage data with a
choropleth “heat map” of the areas with
the highest concentration of unmarried
women aged 22-35, where a first-time
marriage is more probable
• Give clients a “menu” of applicable census categories to be
included in their build
36. NEXT STEPS?
• Refactor code with classes
• Automate the subject-area code
selection process
• Utilize integrator’s Hadoop writer
tool to eliminate the need to write to
.csv, import into HDFS, and build the
Hive table separately
• Create more common abbreviations
in titles to reduce overall length and
redundancy; use something more
sophisticated than multiple .gsubs
(a sketch follows)
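One possible step up from chained .gsub calls: a single pass over a
substitution table (the abbreviations here are illustrative):

ABBREVIATIONS = {
  "Population"      => "Pop",
  "Household"       => "HH",
  "Margin of Error" => "MOE",
}.freeze

# gsub accepts a hash as the replacement, so one call handles every entry.
def abbreviate(title)
  title.gsub(Regexp.union(ABBREVIATIONS.keys), ABBREVIATIONS)
end

abbreviate("Total Population: Margin of Error")  # => "Total Pop: MOE"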
37. FORESEEN ISSUES
• Uploading the entire census would take up a lot of cluster space and may
not be practical
• Are there limits on giant tables, or in Hive itself, especially the
header character limits?
• A possible solution would be uploading categories as needed (PRN) when
applicable to projects
38. LESSONS LEARNED
• Booting and running a VM
• Hive Syntax, commands
• Loading files into HDFS
• Using Hive and HDFS through the command line
• File permissions
• Using SAS
• Manipulating Macros
• Maintaining a Git repo / using GitX
• How to communicate technical problems to my superiors
and solve them
• Basic Ruby coding and program design
• Regular Expressions
• ETL concepts and practices
• Working well outside comfort zone
• How to easily manipulate JSON and CSVs
• Using all resources at my disposal
• Different areas of staff expertise
• Staying within a project timeline, keeping supervisor
informed of status
• Working with new and different filetypes (.gz, .fmt)
• Using integrator for large amounts of data and designing
graphs that optimize memory load
• Becoming more self-sufficient, researching and
implementing new concepts on my own
• When to dig in, and when it is appropriate to ask for help
• Repurposing lessons taught from prior projects into
current ones
• Reading and applying the code of others for my
purposes