Module 3.1



  1. We’ve described histograms as extremely flexible and able to condense large data sets into usable terms. The key to that flexibility and versatility is bin size. “Bins” are the intervals that define the x-axis of a histogram. Bins should be equal in size; constructing a histogram with varied bin sizes can lead to a lot of confusion. [Figure: Salary Ranges (Sears, LLC)]
  2. One challenge when creating a histogram is selecting the number of bins. Determining the interval can be arbitrary, but there are a few methods for selecting the number of bins. Here is one:
     1. Count the total number of data points, n.
     2. Take the square root of n and round up.
     Let’s try one. You have a data set where n = 55. To determine the number of bins, take the square root of 55, which is 7.416. Rounded up, that is 8. With 8 as the suggested number of bins, you can then look at the type of data and decide which eight equal intervals you would like to display.
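The square-root rule above can be sketched in a few lines of Python. This is only an illustration; the function name `sqrt_rule_bins` is an assumption, not something from the slides.

```python
import math


def sqrt_rule_bins(n: int) -> int:
    """Return the square root of n, rounded up (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be a positive count of data points")
    # math.isqrt gives the integer part of the square root exactly,
    # so we only add 1 when n is not a perfect square.
    root = math.isqrt(n)
    return root if root * root == n else root + 1


print(sqrt_rule_bins(55))  # 8, matching the worked example
```

Using integer arithmetic avoids any floating-point surprises for very large n; `math.ceil(math.sqrt(n))` would give the same answer for data sets of ordinary size.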
  3. Here’s another way to determine the number of bins to use:
     1. Determine the range of the data (max p - min p).
     2. Determine the width “h” you want for the bins.
     3. Divide the range by the desired width: b = (max p - min p) / h
     Let’s try one of these. You have a data set where the largest number is 100 and the smallest is 5, so the range is 100 - 5 = 95. You decide you want the bin interval to be 10. To calculate the number of bins, take 95 / 10 to get 9.5, then round up to 10. Easy, right?
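The same calculation, b = (max p - min p) / h rounded up, might look like this in Python. The function and parameter names are assumptions chosen for illustration.

```python
import math


def bins_from_width(smallest: float, largest: float, width: float) -> int:
    """Number of equal-width bins needed to cover the data range."""
    if width <= 0:
        raise ValueError("bin width h must be positive")
    if largest < smallest:
        raise ValueError("largest must not be less than smallest")
    data_range = largest - smallest
    # Round up so the last bin still reaches the largest value;
    # always return at least one bin, even when all values are equal.
    return max(1, math.ceil(data_range / width))


print(bins_from_width(5, 100, 10))  # 10, matching the worked example
```

One detail the sketch leaves open: when the range divides evenly by the width, the largest value falls exactly on the last bin edge, so the final bin has to include its upper edge.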
  4. Remember, histograms are flexible because of their bins. You don’t have to do fancy calculations; you can just arbitrarily adopt a number of bins so long as they have equal intervals. But another thing to remember is that after a certain point, usually beyond 20 bins, even a histogram can get difficult to follow. Here is a quick reference chart to help you choose the right number of bins.

     Number of Data Points    Number of Bars
     20 - 50                  6
     51 - 100                 7
     101 - 200                8
     201 - 500                9
     501 - 1000               10
     1000+                    11-20
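The reference chart can be written as a small lookup. This is a sketch: the function name is hypothetical, and because the chart lists 1000 in both of its last two rows, the code assumes that exactly 1000 points belongs to the 501 - 1000 row.

```python
def suggested_bars(n: int) -> tuple[int, int]:
    """Return the (low, high) number of bars the chart suggests for n points."""
    if n < 20:
        raise ValueError("the chart does not cover fewer than 20 data points")
    # Upper limit of each row in the chart, paired with its suggested bars.
    rows = [(50, 6), (100, 7), (200, 8), (500, 9), (1000, 10)]
    for upper, bars in rows:
        if n <= upper:
            return (bars, bars)
    # More than 1000 points: the chart gives a range rather than one value.
    return (11, 20)


print(suggested_bars(55))    # (7, 7)
print(suggested_bars(5000))  # (11, 20)
```

For n = 55 the chart suggests 7 bars while the square-root rule gives 8. Both are starting points rather than strict answers.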
  5. In Slide 3, we just picked a number out of the blue for bin width h, but we have some cautions about this one too. The shape of a histogram is sensitive to the width of its bins. If the bins are too wide, important information might be hidden. If the bins are too narrow, what appears to be a meaningful pattern might just be random variation.
  6. Depending on the actual data distribution and the goals of your analysis, you may choose different widths. In fact, depending on the situation, you might create two histograms from the same data set with different bin sizes. How do you know which h to use? Sometimes it just comes down to experimenting. Try different widths until the histogram tells an honest story about what you’re trying to analyze. Now you see why we caution against bin size tampering!
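To see how much the bin width h changes the picture, here is a minimal sketch that counts the same data with two different widths. The data values and the helper name `bin_counts` are made up for illustration.

```python
from collections import Counter


def bin_counts(values, width, start=0):
    """Count values in equal-width bins [start + k*width, start + (k+1)*width)."""
    if width <= 0:
        raise ValueError("bin width must be positive")
    if any(v < start for v in values):
        raise ValueError("all values must be at or above start")
    # Integer division maps each value to the index of its bin.
    counts = Counter(int((v - start) // width) for v in values)
    return [counts.get(k, 0) for k in range(max(counts) + 1)]


# Hypothetical data with three clusters: low, middle, and high values.
data = [1, 2, 2, 3, 11, 12, 13, 13, 14, 28, 29]

print(bin_counts(data, 5))   # [4, 0, 5, 0, 0, 2]
print(bin_counts(data, 15))  # [9, 2]
```

With a width of 5 the three clusters and the gaps between them are visible. With a width of 15 the first two clusters merge into a single bar, which is the kind of hidden information the slides warn about.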
  7. EXAMPLE SCENARIO: You are the Sales Manager for an online retail company. You want to track revenue numbers for the year, so you decide to generate a histogram. Theoretically, you could create a histogram with a bin size of one day, and there would be 365 bins. Or you could use a bin size of one full quarter, and there would be only four. Would these two histograms tell a different story? Absolutely! (continued)
  8. Showing 365 bins for revenue would make for a cumbersome histogram that might show random events. Showing 4 bins for revenue is much simpler, but might hide important variations. A more accurate story might be told by a histogram with 12 bins, one for revenue in each month. It is the same data with a different story, but 12 bins might do a better job of condensing a large amount of data while still capturing important variations, like seasonal fluctuations. Experimentation is the key to bin width h.
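The revenue scenario can be sketched by grouping one year of daily figures into monthly and quarterly bins. The revenue numbers below are invented purely for illustration (a flat 100 per day with a December bump), and `total_by` is a hypothetical helper.

```python
from datetime import date, timedelta


def total_by(daily, key):
    """Sum daily amounts into bins chosen by key(day)."""
    totals = {}
    for day, amount in daily.items():
        k = key(day)
        totals[k] = totals.get(k, 0) + amount
    return totals


# Hypothetical data: 100 per day, rising to 300 per day in December.
start = date(2023, 1, 1)
daily = {}
for i in range(365):
    day = start + timedelta(days=i)
    daily[day] = 300 if day.month == 12 else 100

by_month = total_by(daily, lambda d: d.month)
by_quarter = total_by(daily, lambda d: (d.month - 1) // 3 + 1)

print(len(daily), len(by_month), len(by_quarter))  # 365 12 4
print(by_month[11], by_month[12])                  # 3000 9300
print(by_quarter[3], by_quarter[4])                # 9200 15400
```

In this made-up data the monthly bins show that the jump happens in December specifically, while the quarterly bins only show that the fourth quarter was higher than the third.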
  9. LET’S RECAP!
     1. Find the number of bins by taking the square root of n and rounding up.
     2. Experimentation works for both number of bins and bin width.
     3. Too many bins can show random events; too few can hide important information.
     4. The appropriate number of bins combined with the proper bin width can tell a powerful story of data!
  10. CRITICAL THINKING: In the previous example, what would be the arguments for using bin sizes of one day, one month, and one quarter? Think about what you’re trying to analyze.