ACS DATAMART
JEREMY SEARLS
“MAKE THE CENSUS DATA EASIER TO USE”
PROJECT OVERVIEW
INITIAL OBJECTIVES
• Create a repeatable process to build a “Data-Mart” of all census
data
• Utilize Hive and Hadoop to write and store data
• Make census data more accessible and understandable by
organizing data into categories with logical column headers
QUICK REVIEW OF ACS DATA
• “American Community Survey”
• Sent to approximately 295,000 addresses monthly (or 3.5 million
per year)
• The ACS only includes approximately 2 million final interviews per
year
• The survey was fully implemented in 2005
• Data comes in 3 forms:
• 1-year, 3-year, and 5-year
5 YEAR ACS
The 2014 ACS 5-year estimates were
released in 2015 and summarize
responses received in 2010, 2011, 2012,
2013 and 2014 for all geographies.
This is most suitable for data users
interested in longer-term changes
at small geographic scales.
FYI - “Places” refer to the statistical counterparts of incorporated places, and are
delineated to provide data for settled concentrations of population that are identifiable by
name but are not legally incorporated under the laws of the state in which they are
located. e.g. Boroughs in New York
Red Circles denote available data
ACS DATA STRUCTURE
2014 5 YR ACS
121 “SEQUENCES” FOR EACH OF 52 STATES/TERRITORIES
EACH SEQUENCE IS A TABLE CONTAINING
THE SAME GEOGRAPHIC DATA WITH
DIFFERENT CENSUS DATA
AFTER-ACTION ITEMS FROM A PRIOR PROJECT MOTIVATING THIS ONE
WHY?
• During my starter project with census data, it was noted
that the topics of information were scattered across
several tables, or “sequences”
• The column headers were also coded, requiring the use
of a lookup table to decipher the headers
• Titled headers then had to be manually created and
entered as headers or metadata to replace coded
headers
THAT’S A LOTTA BYTES
STEP 1:
• The first step was downloading the census data
• The census data used previously lacked granular detail and represented
estimates based on only one year
• The 5-year census was chosen to get more accurate data, and a more
granular version of that data was selected
• Once selected and downloaded, the data had to be converted to SAS
tables
SAS TABLE CONVERSION
• After downloading SAS and running it on a VM, the census sequence files are
converted to SAS tables using a macro that directs the conversion. 176 GB of
SAS tables were created.
• The macro is supplied by the Census Bureau, but must be modified to output the
desired data
THE JOURNEY CONTINUES
SAS TABLE TO CSV
• A Ruby script calls Python
scripts to convert the SAS
tables to CSVs (sketched below)
• CSVs reduce the overall data
size from 176 GB to 53 GB
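The original driver and conversion scripts aren't shown on the slides; a minimal sketch of the idea, assuming a hypothetical convert_sas.py helper (e.g. one built on pandas.read_sas) and hypothetical directory names:

```ruby
require 'fileutils'

SAS_DIR = 'sas_tables'   # hypothetical input directory of .sas7bdat files
CSV_DIR = 'csv_tables'   # hypothetical output directory
FileUtils.mkdir_p(CSV_DIR)

Dir.glob(File.join(SAS_DIR, '*.sas7bdat')).each do |sas_path|
  csv_path = File.join(CSV_DIR, File.basename(sas_path, '.sas7bdat') + '.csv')
  # Each conversion runs as its own Python process via the hypothetical helper.
  ok = system('python', 'convert_sas.py', sas_path, csv_path)
  warn "Conversion failed for #{sas_path}" unless ok
end
```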
I’M HADOOP AND SO CAN YOU!
STEP 2
• Once the data was obtained,
the next step was
learning Hadoop
• The Hortonworks Hadoop VM
sandbox was installed and I
began training in Hive
• Once I had created training
tables and felt comfortable
working in HDFS, I began
deciding on a data structure
THE WALMART THEOREM
STEP 3
• It made the most sense to organize the data into sensible categories, each on a
single table. While the tables would be large, it would remove the
burden of finding only half the data needed on any single table.
• Having logical titles would eliminate the need for a lookup table and for
manually entering titles when the desired data was selected.
• It would also reduce the redundancy of repeating geographical
information on every sequence.
CATEGORIES
• Manually went through the entire
census, categorizing tables
into topics and subtopics
• Requested a change in the scope of
the objectives
• Focused on marriage data for
a proof-of-concept model
WE’VE GOT SOME WORK TO DO.
• Began looking at how table
metadata was organized and
whether a script could automate
creating logical names.
• Created a hierarchy and row number to build logical titles
• Problems with this method:
- Can’t be easily applied to other subjects; no repeatability
- Would be just as effective to write out the names manually
IN COMES PETER.
• While discussing my dilemma
with Peter, he showed me
Census Reporter, a project that
“helps journalists navigate
and understand information
from the U.S. Census
Bureau.”
• They had already organized
the entire census, providing the
indents and, most importantly,
the parent column ID
TIME TO CODE
STEP 4
• Throughout this process, I had
continued training in Ruby and
was convinced it would be the
best way for me to create
the logical titles. This was
based on several factors:
- Ruby has a great CSV
library built in
- I was simultaneously training
in Ruby and had this project in
mind during training sessions
- A Ruby algorithm would be
easily repeatable and
shared/improved
LOADING THE CSV OF COLUMN HEADERS
• Created an array of hashes from
each row (see the sketch below)
• The value of the parent column id points to the
predecessor containing the next
portion of a logical title
• Realized all that is needed are the title,
the col_id and the parent column id
• Learned how to use a hash lookup
from Thon in a previous code
challenge
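A minimal sketch of that loading step, assuming Census Reporter-style metadata with column_id, column_title, and parent_column_id fields (the file name is hypothetical):

```ruby
require 'csv'

# Build a lookup keyed by column_id. The header names follow the
# Census Reporter metadata layout and are assumptions here.
columns = {}
CSV.foreach('census_column_metadata.csv', headers: true) do |row|
  columns[row['column_id']] = {
    title:  row['column_title'],
    parent: row['parent_column_id']   # nil for top-level columns
  }
end

# columns['B01001003'] might now return
# { title: 'Under 5 years', parent: 'B01001002' }
```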
PUTTING TITLES TOGETHER
• Using the column_id as
the key to the hash, when it
is looked up, its value is
returned, containing the column
title and its parent column id.
• Performing a hash lookup on
the parent column id causes
the corresponding column
title and its parent id to be
interpolated. This process
continues until there are no
more parents (nil).
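Sketched in Ruby, that chain of lookups can be written as a recursion that returns a nested array ending in nil, which is exactly what the cleanup on the next slide deals with:

```ruby
# Each lookup returns [title, lookup(parent)], so the result is a nested
# array that bottoms out at nil when a column has no parent.
def nested_titles(columns, column_id)
  entry = columns[column_id]
  return nil unless entry
  [entry[:title], nested_titles(columns, entry[:parent])]
end

# nested_titles(columns, 'B01001003')
# => ["Under 5 years", ["Male:", ["Total:", nil]]]
```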
LET’S MAKE IT LEGIBLE
• Use flatten to remove array
nesting
• Compact removes the nil
values
• Reverse puts the string in the
correct order
• Join creates a single string
from the array
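Chained together on the nested result (the ' - ' separator is an assumption, not necessarily the formatting used in the project):

```ruby
parts = nested_titles(columns, 'B01001003')
# flatten the nesting, drop the trailing nil, reverse so the most general
# title comes first, and join into one readable string
title = parts.flatten.compact.reverse.join(' - ')
# => "Total: - Male: - Under 5 years"
```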
COMBINE WITH TABLE NAME
CONCATENATE AND FORMAT
DONE? NOT SO MUCH.
Each column id maps to metadata for both the
estimate (e) and the margin of error (m)
B01001003 = B01001e3 & B01001m3
TRANSFORM COLUMN TITLE & ID FOR E & M
Original Code:
ex. B01001003 = B01001e3 & B01001m3 | B01001103 = B01001e103 & B01001m103
Issue: codes with ‘e’ or ‘m’ values greater than 99 were not being replaced
CODE FIX
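The original code and its fix were shown as screenshots and aren't reproduced here; one hedged way to handle line numbers above 99 is to split on the fixed three-digit suffix of the column ID instead of matching one or two digits:

```ruby
# ACS column IDs end in a three-digit, zero-padded line number, so take the
# last three characters and strip the leading zeros when building the
# estimate (e) and margin-of-error (m) variable names.
def estimate_and_margin_codes(column_id)
  table_id = column_id[0...-3]
  line_no  = column_id[-3..-1].to_i   # "003" -> 3, "103" -> 103
  ["#{table_id}e#{line_no}", "#{table_id}m#{line_no}"]
end

estimate_and_margin_codes('B01001003')  # => ["B01001e3", "B01001m3"]
estimate_and_margin_codes('B01001103')  # => ["B01001e103", "B01001m103"]
```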
FINALLY! READY TO CHANGE A .CSV
• Need to load the CSVs created from the SAS tables, find the coded headers,
replace them with the full formatted titles, and write them out to a new CSV
(sketched below)
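A minimal sketch of that step, assuming a full_titles hash that maps each coded header (e.g. B01001e3) to its assembled title; the file names are examples:

```ruby
require 'csv'

# Read one sequence CSV, swap the coded header row for full titles,
# and write everything back out to a new file.
rows = CSV.read('sf0037ak.csv')
rows[0] = rows[0].map { |code| full_titles.fetch(code, code) } # keep unknowns as-is

CSV.open('sf0037ak_titled.csv', 'w') do |csv|
  rows.each { |row| csv << row }
end
```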
LOOKING FOR A HASH
• ‘row’ returns a CSV::Row object (a hard-learned lesson). It indexes into a
nested array; need [i] to access the fields.
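For illustration, when a CSV is read with headers: true, each row comes back as a CSV::Row rather than a plain Array:

```ruby
require 'csv'

CSV.foreach('sf0037ak.csv', headers: true) do |row|
  puts row[0]           # field by position on the CSV::Row
  puts row.fields.size  # .fields gives the whole row as a plain Array
  break
end
```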
SMALL ISSUES
• When loading files, only load CSVs;
other files were being picked up
• How to retitle the new CSV files
• Combining two arrays
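Hedged one-liners for each of these, with placeholder names:

```ruby
# Only pick up .csv files from the directory of converted tables.
csv_files = Dir.glob('csv_tables/*.csv')

# Derive a new name for the retitled output file.
out_name = File.basename('sf0037ak.csv', '.csv') + '_titled.csv'

# Combine two header arrays into one (placeholder contents).
geo_headers  = ['STATE', 'LOGRECNO']
data_headers = ['B01001e3', 'B01001m3']
combined     = geo_headers + data_headers
```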
COOL!! ALL DONE RIGHT??
• Still have 52 versions of each sequence
• The total number of sequences is 121
(ex: sf0037ak.csv - sequence file 37 of 121 for the state of Alaska)
• Topics are spread across multiple sequences
• Implement the Walmart Theorem.
WELLLLLL….
ONE. BIG. TABLE.
• Used the integrator to combine the sequences and their
respective columns that relate to marriage data (illustrated below)
• Needed to create a ‘recspec’ key, using a concatenation of LOGRECNO and STATE
• Very large memory load; sorted the data and used extMerge
• 535,345 rows, 1,011 columns, 541,233,795 cells
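The real join was done in the integrator, but the recspec idea can be illustrated in Ruby; the file names are placeholders and the geographic column names follow the slide:

```ruby
require 'csv'

# Build a 'recspec' join key from STATE + LOGRECNO, then merge rows from
# different sequence files that describe the same geography.
def recspec(row)
  "#{row['STATE']}#{row['LOGRECNO']}"
end

marriage_rows = Hash.new { |h, k| h[k] = {} }
%w[seq_a_titled.csv seq_b_titled.csv].each do |file|   # hypothetical inputs
  CSV.foreach(file, headers: true) do |row|
    marriage_rows[recspec(row)].merge!(row.to_h)
  end
end
```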
UPLOADING TO HDFS
• Originally uploaded the file to the Hortonworks Sandbox, but the sandbox
was ill-equipped for such a large table.
• Switched to production Hadoop cluster
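For reference, the upload itself is a standard HDFS copy; a hedged Ruby wrapper with placeholder paths:

```ruby
# Push the joined CSV into HDFS on the production cluster using the
# standard 'hdfs dfs' commands; both paths are placeholders.
local_csv = 'acs_marriage.csv'
hdfs_dir  = '/user/acs/marriage'

system('hdfs', 'dfs', '-mkdir', '-p', hdfs_dir)
system('hdfs', 'dfs', '-put', '-f', local_csv, hdfs_dir)
```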
CREATE HIVE/IMPALA TABLE
• Exported the metadata created by the integrator join
• Created an external table over the loaded census CSV (sketch below)
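The DDL isn't reproduced on the slide; a sketch of generating a CSV-backed external table from the exported column list might look like this (the table name, HDFS location, and column list are placeholders, while the DDL pattern is standard Hive syntax):

```ruby
# Generate CREATE EXTERNAL TABLE DDL from a list of column names.
column_names = ['STATE', 'LOGRECNO', 'B12001_total']   # placeholder columns
columns_sql  = column_names.map { |name| "`#{name}` STRING" }.join(",\n  ")

ddl = <<~SQL
  CREATE EXTERNAL TABLE acs_marriage (
    #{columns_sql}
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/user/acs/marriage';
SQL

puts ddl
```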
QUERY IMPALA
• Ensure the data set has been loaded with the correct metadata
CRADLE TO GRAVE PICTOGRAPH
SEQUENCE FILES → SAS TABLES → CSV TABLES WITH CODED HEADERS →
CSV TABLES WITH TITLED HEADERS → SEQUENCE-TABLE CSVS WITH A COMMON
CATEGORY JOINED INTO ONE CSV → CSV LOADED ONTO HDFS →
EXTERNAL HIVE TABLE CREATED → IMPORT INTO BDD
Next steps: repeat for additional categories (the process is easily
repeatable for additional census categories)
Tools used along the way: SAS, Python script, Ruby script, Integrator,
Hortonworks, Hive/Impala
USE CASE
• David’s Bridal becomes a client
• Trying to decide in which of 10
counties to place their next store
• Their research says first-time
marriages use more expensive dresses
• Add census marriage data with a
choropleth “heat map” of the areas with the
highest concentration of unmarried
women aged 22-35, for a
greater probability of a first-time
marriage
• Give clients a “menu” of applicable census categories to be
included in their build
NEXT STEPS?
• Refactor code with classes
• Automate the subject-area code
selection process
• Utilize the integrator’s Hadoop writer
tool to eliminate the need to write to
.csv, import into HDFS, and build a
Hive table
• Create more common abbreviations
in titles to reduce overall length and
redundancy. Use something more
sophisticated than multiple .gsubs
FORESEEN ISSUES
• Uploading the entire census would take up a lot of cluster space and may
not be practical
• Limitations on giant tables? Hive limitations? Especially with header
character limits
• A possible solution would be uploading categories as needed (PRN) when
applicable to projects
LESSONS LEARNED
• Booting and running a VM
• Hive syntax and commands
• Loading files into HDFS
• Using Hive and HDFS through the command line
• File permissions
• Using SAS
• Manipulating Macros
• Maintaining a Git repo / using GitX
• How to communicate technical problems to my superiors
and solve them
• Basic Ruby coding and program design
• Regular Expressions
• ETL concepts and practices
• Working well outside comfort zone
• How to easily manipulate JSON and CSVs
• Using all resources at my disposal
• Different areas of staff expertise
• Staying within a project timeline, keeping supervisor
informed of status
• Working with new and different filetypes (.gz, .fmt)
• Using integrator for large amounts of data and designing
graphs that optimize memory load
• Becoming more self-sufficient, researching and
implementing new concepts on my own
• When to dig in, and when it is appropriate to ask for help
• Repurposing lessons taught from prior projects into
current ones
• Reading and applying the code of others for my
purposes