Who Wants To Be a Munger


Lonestar Ruby Conference presentation about the three step process to data munging.


  1. So Who Wants to Be a Munger?
     Dana, LSRC 2009
     Friday, August 28, 2009
  2. Who am I?
     • Dana
     • 8 years in corporate world
     • Responsible for munging a massive amount of data every day
     • Now develop Rails applications for a living
  3. Why is this important?
     • We live in a data-driven society
     • Companies feed on reports
     • Clients have data and want ways to display it
     • Important to know what data you have and what needs to happen with it
     • The more you know about the final output, the easier you can manipulate the data
  4. The Process
  5. The Rule of 3: In - Munge - Out
     • Read data into some construct
       • anything that understands each()
     • Transform the data
     • Output transformed data
       • some format that understands puts()
  6. 1 - Reading
  7. A Basic Munging Script
        open("new_numbers.txt", "w") do |f|      # the output file
          File.foreach("numbers.txt") do |n|     # the input file
            n.capitalize!                        # the transformation
            f.puts n
          end
        end
     Input and output:
        one   -> One
        two   -> Two
        three -> Three
        four  -> Four
        five  -> Five
  8. Simplify
     • Don’t confuse reading with munging
     • May have to read various files for the same output
     • Use Ruby’s each() and puts() methods to your advantage
        def munge(input, output)
          input.each do |record|
            record.capitalize!
            output.puts record
          end
        end
     Slide callouts on the method signature: "pass out another object to munge", "pass some object as output"
  9. Why this is better
        names  = %w[dana james sarah storm gypsy]
        stream = $stdout
        munge(names, stream)

        numbers = open("numbers.txt")
        stream  = open("new_numbers.txt", "w")
        munge(numbers, stream)
  10. each() and puts()
        class Rubyist
          def each
            yield "i"
            yield "love"
            yield "ruby"
          end
        end

        class Speaker
          def puts(words)
            `say #{words}`
          end
        end
  11. Reaching ultimate munging power
        class Munger
          def initialize(input, output)
            @input  = input
            @output = output
          end

          def munge
            @input.each do |record|
              munged = yield(record)
              @output.puts munged unless munged.nil?
            end
          end
        end

        m = Munger.new(open("numbers.txt"),
                       open("new_numbers.txt", "w"))
        m.munge do |n|
          n.strip!
          if n =~ /\At/i
            n.reverse
          elsif n == "four"
            nil
          else
            n.capitalize
          end
        end
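The Munger class can be tried without touching the filesystem, since any object with each() can be the input and any object with puts() can be the output. The following is a minimal sketch under that assumption: the Array input, the StringIO output, and the helper name munge_numbers are stand-ins chosen for illustration and do not appear in the slides. The class body is repeated so the example runs on its own.

```ruby
require "stringio"

class Munger
  def initialize(input, output)
    @input  = input
    @output = output
  end

  # Pass each record to the block and write whatever comes back,
  # skipping records for which the block returns nil.
  def munge
    @input.each do |record|
      munged = yield(record)
      @output.puts munged unless munged.nil?
    end
  end
end

# Hypothetical helper: run the slide's transformation over a list of words
# and return everything that was written as a single string.
def munge_numbers(words)
  out = StringIO.new
  Munger.new(words, out).munge do |n|
    n = n.strip
    if n =~ /\At/i          # words starting with "t" are reversed
      n.reverse
    elsif n == "four"       # returning nil drops the record
      nil
    else
      n.capitalize
    end
  end
  out.string
end

print munge_numbers(%w[one two three four five])
```

Returning nil from the block is what turns the munger into a filter as well as a transformer, so "four" never reaches the output.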
  12. Data
     • Different kinds of data
       • Structured - record oriented data
       • Unstructured
         • Most difficult to work with
     • Vast majority of data reading is pattern matching
  13. Somewhere in between (slide callouts: "headers", "hierarchical categories")
        SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 6
        Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
        --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
        Salesperson 22 BILL PRICE
        Customer 1014 KECK'S MEAT & FOODSERVICE
        SA Sort Code 4.42 PORK RIBS
        44-531 53/3 CU PRK RIB SOY 0 0 0 0 2 -100
        44-531-0 100/2.5 CU PRK RIB SOY 10 21 -52 14 31 -55
        15-230 53/3 BB PUB BURGER 150 0 0 150k 0 0
        3680 40/4 RB PRK WHLMSC HARKER 187 243 -23 412 405 2
        3681 30/5.3 RB PRK WHLMS HARKR 207 162 28 378 243 56
        3686 30/5.3 RB PRK WHLMS HARKR 27 45 -40 72 180 -60
        3008 33/4.92 RB PRK HARKER 270 300 -10 580 600 -3
        3010 25/6.4 RB PRK CNTRY HARKR 510 540 -6 1,000 1,080 -7
        3402 40/4 RU PRK RIB PAT HARKR 0 0 0 0 40k -100
        3403 51/3.14 RU PRK RIB HARKER 558 900 -38 1,008 1,170 -14
        3404 40/4.14 RU PRK RIB HARKER 73k 1,052 -30 1,296 1,592 -19
        ---------- ---------- ------- ---------- ---------- -------
        SA Sort Code subtotals 2,567 3,263 -21 6,260 5,703 9
        SA Sort Code 19.1 WAFFLES
        5018 36/5 KING B WAFFLES 10 10 0 10 14 -29
        ---------- ---------- ------- ---------- ---------- -------
        SA Sort Code subtotals 10 10 0 10 14 -29
        ---------- ---------- ------- ---------- ---------- -------
        SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 7
        Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
        --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
        Customer subtotals 2,577 3,273 -21 6,270 5,717 9
        ---------- ---------- ------- ---------- ---------- -------
        Salesperson subtotals 9,857 8,756 12 45,889 42,556 8
        ---------- ---------- ------- ---------- ---------- -------
        Report Totals 15,008 13,225 13 75,896 72,359 4
  14. A first reader: skip blank lines, rule lines, and totals
        require "munger"

        class RossReader
          def initialize(file)
            @file = file
          end

          def each
            open(@file) do |report|
              report.each do |line|
                break if line =~ /\AReport Totals/
                next if line =~ /\A\s+\z/ or
                        line =~ /\A\s+-/ or
                        line =~ /\b(sub)?totals\b/i
                yield line
              end # report.each
            end # open
          end # def
        end

        report = Munger.new(RossReader.new("sample_report.txt"),
                            open("ross_writer.txt", "w"))
        report.munge do |n|
          n
        end
  15. The report after that first pass
        SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 6
        Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
        --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
        Salesperson 22 BILL PRICE
        Customer 1014 KECK'S MEAT & FOODSERVICE
        SA Sort Code 4.42 PORK RIBS
        44-531 53/3 CU PRK RIB SOY 0 0 0 0 2 -100
        44-531-0 100/2.5 CU PRK RIB SOY 10 21 -52 14 31 -55
        15-230 53/3 BB PUB BURGER 150 0 0 150k 0 0
        3680 40/4 RB PRK WHLMSC HARKER 187 243 -23 412 405 2
        3681 30/5.3 RB PRK WHLMS HARKR 207 162 28 378 243 56
        3686 30/5.3 RB PRK WHLMS HARKR 27 45 -40 72 180 -60
        3008 33/4.92 RB PRK HARKER 270 300 -10 580 600 -3
        3010 25/6.4 RB PRK CNTRY HARKR 510 540 -6 1,000 1,080 -7
        3402 40/4 RU PRK RIB PAT HARKR 0 0 0 0 40k -100
        3403 51/3.14 RU PRK RIB HARKER 558 900 -38 1,008 1,170 -14
        3404 40/4.14 RU PRK RIB HARKER 73k 1,052 -30 1,296 1,592 -19
        SA Sort Code 19.1 WAFFLES
        5018 36/5 KING B WAFFLES 10 10 0 10 14 -29
        SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 7
        Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
        --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
  16. Ugly Headers
        SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 6
        Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
        --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
        Salesperson 22 BILL PRICE
        Customer 1014 KECK'S MEAT & FOODSERVICE
        SA Sort Code 4.42 PORK RIBS
        SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 7
        Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
        --------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
  17. unpack()
     • Designed for breaking up binary data
     • Very handy for this kind of fixed-width work
     • unpack() takes in a format string
     • You describe what the data looks like
     • “a” means ascii character
     • “x” means skip
        "cookies and cream".unpack("a7xa3xa5")
        # => ["cookies", "and", "cream"]

        "--- --- -----".split.map { |d| "a#{d.length}" }.join("x")
        # => "a3xa3xa5"
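The two unpack() steps above combine naturally: derive the format string from a rule line of dashes, then apply it to a data line. This is a small sketch of that idea, assuming a made-up three-column layout; the helper names format_from_rule and fields and the sample row are hypothetical and not taken from the slides.

```ruby
# Build an unpack format from a rule line of dashes: one "aN" per column,
# with "x" skipping the single space between columns.
def format_from_rule(rule)
  rule.split.map { |col| "a#{col.size}" }.join("x")
end

# Split a fixed-width line into fields and trim the padding from each one.
def fields(line, format)
  line.unpack(format).map { |f| f.strip }
end

rule   = "----- ---------- ----"          # hypothetical 3-column layout
format = format_from_rule(rule)

# Pad a sample row to the same widths so it lines up with the rule.
line = ["3680".ljust(5), "RIB PATTY".ljust(10), "187".rjust(4)].join(" ")

puts format
p fields(line, format)
```

Because the format comes from the report's own rule line, the reader keeps working if a column width changes, as long as the dashes change with it.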
  18. Parsing the header
        def initialize(file)
          @file    = file
          @headers = nil
          @format  = nil
        end

        def each
          open(@file) do |report|
            parse_header(Array.new(4) { report.gets })
            report.each do |line|
              ...
            end # report.each
          end # open
        end # def

        def parse_header(headers)
          @format  = headers[3].split.map { |col| "a#{col.size}" }.join("x")
          @headers = headers[2].unpack(@format).map { |f| f.strip }
        end
  19. Skipping the repeated page headers
        def initialize(file)
          @file      = file
          @in_header = false
          @headers   = nil
          @format    = nil
        end

        def each
          open(@file) do |report|
            parse_header(Array.new(4) { report.gets })
            report.each do |line|
              if line =~ /\ASAA_R/
                @in_header = true
              elsif @in_header
                @in_header = false if line =~ /\A-/
              else
                ...
              end
            end # report.each
          end # open
        end # def
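The @in_header flag is a tiny state machine: a line starting with the report marker switches header mode on, and the rule line starting with "-" switches it off again. The sketch below isolates that idea on an in-memory sample. The method name each_body_line and the shortened two-page sample are assumptions for illustration, and unlike the slide's reader it lets the same loop handle the first header instead of reading it with parse_header.

```ruby
require "stringio"

# Yield only the body lines of a paginated report, skipping each repeated
# page header. A header begins at a line matching the marker and ends at
# the rule line that begins with "-".
def each_body_line(io, marker = /\ASAA_R/)
  in_header = false
  io.each_line do |line|
    if line =~ marker
      in_header = true
    elsif in_header
      in_header = false if line =~ /\A-/
    else
      yield line
    end
  end
end

# Hypothetical two-page sample, far shorter than the real report.
SAMPLE = <<~REPORT
  SAA_R_009 Page 6

  Part Code Qty
  --------- ---
  3680      187
  SAA_R_009 Page 7

  Part Code Qty
  --------- ---
  3681      207
REPORT

each_body_line(StringIO.new(SAMPLE)) { |line| puts line }
```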
  20. The report with page headers removed
        Salesperson 22 BILL PRICE
        Customer 1014 KECK'S MEAT & FOODSERVICE
        SA Sort Code 4.42 PORK RIBS
        44-531 53/3 CU PRK RIB SOY 0 0 0 0 2 -100
        44-531-0 100/2.5 CU PRK RIB SOY 10 21 -52 14 31 -55
        15-230 53/3 BB PUB BURGER 150 0 0 150k 0 0
        3680 40/4 RB PRK WHLMSC HARKER 187 243 -23 412 405 2
        3681 30/5.3 RB PRK WHLMS HARKR 207 162 28 378 243 56
        3686 30/5.3 RB PRK WHLMS HARKR 27 45 -40 72 180 -60
        3008 33/4.92 RB PRK HARKER 270 300 -10 580 600 -3
        3010 25/6.4 RB PRK CNTRY HARKR 510 540 -6 1,000 1,080 -7
        3402 40/4 RU PRK RIB PAT HARKR 0 0 0 0 40k -100
        3403 51/3.14 RU PRK RIB HARKER 558 900 -38 1,008 1,170 -14
        3404 40/4.14 RU PRK RIB HARKER 73k 1,052 -30 1,296 1,592 -19
        SA Sort Code 19.1 WAFFLES
        5018 36/5 KING B WAFFLES 10 10 0 10 14 -29
  21. assoc()
     • lookup method
     • call it on an array of arrays
     • pass in the data you want to look up
     • walks through the outer array and returns the inner array that starts with the argument
     • slower than a hash - don’t use on LARGE amounts of data
     • assoc() becomes a poor man’s ordered hash
        names = [["James", "Gray"], ["Dana", "Gray"]]
        p names.assoc("James")
        # => ["James", "Gray"]
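The "poor man's ordered hash" idea comes down to one pattern: look the label up with assoc(), overwrite the value if the pair exists, and append a new pair otherwise. Here is a minimal sketch of that pattern; the helper name remember is hypothetical, and the values are borrowed from the sample report only as illustration.

```ruby
# Remember the latest value for each category label, keeping the labels
# in the order they were first seen.
def remember(categories, label, value)
  if (pair = categories.assoc(label))
    pair[-1] = value               # label already known: overwrite its value
  else
    categories << [label, value]   # new label: append a fresh pair
  end
  categories
end

categories = []
remember(categories, "Salesperson", "22 BILL PRICE")
remember(categories, "SA Sort Code", "4.42 PORK RIBS")
remember(categories, "SA Sort Code", "19.1 WAFFLES")
p categories
```

Since assoc() returns the inner array itself rather than a copy, assigning to pair[-1] updates the stored category in place.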
  22. Tracking the hierarchical categories
        def initialize(file)
          ...
          @categories = []
        end

        def each
          open(@file) do |report|
            ...
                if line =~ /\A\s+(\w[\w\s]+?)\s+(\d.+?)\s+\z/
                  if cat = @categories.assoc($1)
                    cat[-1] = $2
                  else
                    @categories << [$1, $2]
                  end
                else
                  yield @headers.zip(line.unpack(@format).map { |f| f.strip }) +
                        @categories
                end
              end
            end # report.each
          end # open
        end # def
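The yield line above packs three steps into one expression: unpack the fixed-width line, pair each field with its header via zip(), and append the current categories. The sketch below spells that out on a hypothetical three-column layout. The method name build_record, the format string, and the sample row are assumptions made for this example; only the zip/unpack/append shape comes from the slide.

```ruby
# Pair each header with its stripped field, then append the category pairs
# that currently apply to the line.
def build_record(headers, format, categories, line)
  headers.zip(line.unpack(format).map { |f| f.strip }) + categories
end

headers    = ["Part Code", "Description", "Qty"]   # hypothetical layout
format     = "a9xa11xa3"                           # widths 9, 11 and 3
categories = [["SA Sort Code", "19.1 WAFFLES"]]

# Pad a sample row to the column widths described by the format.
line = ["5018".ljust(9), "36/5 WAFFLE".ljust(11), "10".rjust(3)].join(" ")

p build_record(headers, format, categories, line)
```

Each record is an array of [label, value] pairs, which is the shape a writer's puts() can split into a header row and a data row.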
  23. The records the reader yields
        [["Part Code", "44-531"],
         ["Description", "53/3 CU PRK RIB SOY"],
         ["Qty Period", "0"],
         ["QTY LastYr", "0"],
         ["Var", "0"],
         ["Lbs Period", "0"],
         ["Lbs LastYr", "2"],
         ["Var", "-100"],
         ["Salesperson", "22 BILL PRICE"],
         ["Customer", "1014 KECK'S MEAT & FOODSERVICE"],
         ["SA Sort Code", "4.42 PORK RIBS"]]
        [["Part Code", "44-531-0"],
         ["Description", "100/2.5 CU PRK RIB SOY"],
         ["Qty Period", "10"],
         ["QTY LastYr", "21"],
         ["Var", "-52"],
         ["Lbs Period", "14"],
         ["Lbs LastYr", "31"],
         ["Var", "-55"],
         ["Salesperson", "22 BILL PRICE"],
         ["Customer", "1014 KECK'S MEAT & FOODSERVICE"],
         ["SA Sort Code", "4.42 PORK RIBS"]]
        ...
        [["Part Code", "5018"],
         ["Description", "36/5 KING B WAFFLES"],
         ["Qty Period", "10"],
         ["QTY LastYr", "10"],
         ["Var", "0"],
         ["Lbs Period", "10"],
         ["Lbs LastYr", "14"],
         ["Var", "-29"],
         ["Salesperson", "22 BILL PRICE"],
         ["Customer", "1014 KECK'S MEAT & FOODSERVICE"],
         ["SA Sort Code", "19.1 WAFFLES"]]
  24. The complete reader
        class RossReader
          def initialize(file)
            @file       = file
            @in_header  = false
            @headers    = nil
            @format     = nil
            @categories = []
          end

          def parse_header(headers)
            @format  = headers[3].split.map { |col| "a#{col.size}" }.join("x")
            @headers = headers[2].unpack(@format).map { |f| f.strip }
          end

          def each
            open(@file) do |report|
              parse_header(Array.new(4) { report.gets })
              report.each do |line|
                if line =~ /\ASAA_R/
                  @in_header = true
                elsif @in_header
                  @in_header = false if line =~ /\A-/
                else
                  break if line =~ /\AReport Totals/
                  next if line =~ /\A\s+\z/ or
                          line =~ /\A\s+-/ or
                          line =~ /\b(sub)?totals\b/i
                  if line =~ /\A\s+(\w[\w\s]+?)\s+(\d.+?)\s+\z/
                    if cat = @categories.assoc($1)
                      cat[-1] = $2
                    else
                      @categories << [$1, $2]
                    end
                  else
                    yield @headers.zip(line.unpack(@format).map { |f| f.strip }) +
                          @categories
                  end
                end
              end # report.each
            end # open
          end # def
        end
  25. 2 - Writing
  26. A CSV writer
        require "rubygems"
        require "faster_csv"

        class CSVWriter
          def initialize
            @headers = nil
          end

          def puts(record)
            if @headers.nil?
              @headers = record.map { |field| field.first }
              FCSV { |csv| csv << @headers }
            end
            FCSV { |csv| csv << record.map { |field| field.last } }
          end
        end
  27. 3 - Munging
  28. Putting reader, munger, and writer together
        require "munger"
        require "ross_reader"
        require "csv_writer"

        report = Munger.new(RossReader.new(ARGV.shift), CSVWriter.new)
        report.munge do |record|
          record.each do |field|
            if field.last =~ /\A(?:\d+,)+\d+k?\z/
              field.last.delete!(",")
            end
            field.last.sub!(/\A\d+k\z/) { |num| num.to_i * 1000 }
          end
          record
        end
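The munging block does two clean-ups on each value: it removes thousands separators from figures such as 1,000, and it expands a trailing "k" such as 40k into thousands. The sketch below pulls those two rules into a standalone method so they can be tried on individual values. The method name clean_number is hypothetical, and unlike the slide it returns a new string instead of modifying the field in place.

```ruby
# Normalise one report figure: drop thousands separators, then expand a
# trailing "k" into thousands. Any other text is returned unchanged.
def clean_number(text)
  cleaned =
    if text =~ /\A(?:\d+,)+\d+k?\z/
      text.delete(",")
    else
      text.dup
    end
  cleaned.sub(/\A(\d+)k\z/) { ($1.to_i * 1000).to_s }
end

%w[1,000 40k 73k -100 412].each do |n|
  puts "#{n} -> #{clean_number(n)}"
end
p clean_number("53/3 CU PRK RIB SOY")
```

Both patterns are anchored with \A and \z, so descriptions and part codes that merely contain digits, commas, or a "k" pass through untouched.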
  29. So what can I do with all this?
     • Output your data into a spreadsheet such as Excel
     • Open the data in your text editor
     • Import the data into a database
     • Let’s see it in action
  30. Examples
  31. A database writer and its migration
        require "munger"
        require "rubygems"
        require "faster_csv"
        require "active_record"

        class DBWriter
          def initialize(model, path = "db.sqlite")
            ActiveRecord::Base.establish_connection(
              :adapter  => "sqlite3",
              :database => path
            )
            @model = model
          end

          def puts(record)
            @model.create!(record)
          end
        end

        class PartCode < ActiveRecord::Base
        end

        unless File.exist? "db.sqlite"
          class CreatePartCodes < ActiveRecord::Migration
            def self.up
              create_table :part_codes do |t|
                t.string  :part_code
                t.string  :description
                t.integer :qty_period
                t.integer :qty_lastyr
                t.integer :qty_var
                t.integer :lbs_period
                t.integer :lbs_lastyr
                t.integer :lbs_var
                t.string  :salesperson
                t.string  :customer
                t.string  :sa_sort_code
              end
            end

            def self.down
              drop_table :part_codes
            end
          end
        end
  32. Loading the CSV into the database
        reader = FCSV($stdin, :headers => true, :header_converters => :symbol)
        writer = DBWriter.new(PartCode)
        CreatePartCodes.up if defined? CreatePartCodes
        m = Munger.new(reader, writer)
        m.munge do |row|
          row.to_hash
        end
  33. Congratulations! You, too, are now a Munger!