This document discusses randomizing data in Microsoft SQL Server. It explains how to generate unique identifiers (GUIDs) to randomize data, create a temporary table to insert randomized data, and display results randomly ordered by the GUID field. Examples are provided of inserting a subset of user data from Stack Overflow based on location into a temporary table, and displaying 10 random records from that table. The randomized results could be useful for applications like jury selection, assigning volunteers, or selecting employees for training.
2. About the author
A bit about me:
I'm Wally M. Pons, an IT professional with over 20 years
of experience in software programming and databases,
and a solutions provider. You may
contact me as follows:
Twitter: @datagrupo / @wallypons
Web: Https://www.datagrupo.com
Email: wpons@datagrupo.com
3. The software you will need
You will need the following software:
Microsoft® SQL Server™ 2008, 2012, 2014, 2016 or 2017,
Developer, Standard or greater edition. The Express editions
won’t be able to handle a 10GB or bigger database. You can
download the Developer editions free from the following link:
https://sqlserverupdates.com/
A Microsoft® Windows installation or virtual machine that supports
the Microsoft® SQL Server™ version of your choice.
A torrent client, to download our test database(s).
7-zip or WinRAR, to decompress the files.
4. What is randomization?
• Let’s start with my usual definition of randomization: “It’s the
process of making something random; more specifically, a technique
to produce results that are not controlled by a human
decision-making process, thus removing the possibility of a
manipulated outcome and also preventing misjudgment.”
• Probably one of the most primitive examples of randomization
is the “coin-tossing” technique, which picks one of two
choices (usually heads or tails). It is also by far the simplest
form of randomization.
• Beyond that, there are other techniques and processes to
randomize information to suit our needs. In this case, I will
explain how to accomplish this using Microsoft® SQL Server™ and
some sample data from the Stack Overflow database, which can be
downloaded freely for testing purposes and is provided under the
cc-by-sa 3.0 license terms.
5. Getting some sample data
• If you have a sample Microsoft® SQL Server™ database that you can
work on, then you may skip to the “Selecting the data” section.
Such a sample may include some or all of the following
data:
– Customer names
– Vendor names
– Inventory items
– Automobile brands & models or other type of data collection
• If you don’t have a sample database, you can download the
Stack Overflow database from the following link:
https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/
6. Downloading the data
• Before you download from the previous link, make sure you have a
torrent client and either 7-zip or WinRAR installed on your
machine; you will need them to fetch and decompress the file(s).
Links to those apps are included in the download page for your
convenience.
• There are three versions of the Stack Overflow database, and
depending on your disk space you can download them all. For the
purpose of this excerpt I have downloaded the 10GB and 50GB
versions (which will more than suffice), but you may also download
the 312GB version if you have the available disk space. Here’s a
preview of how the downloaded files look:
• You then decompress them to specific folders, as shown next.
7. Decompressing the data files
• The smaller 1.08GB (1,140,633KB) file contains the 10GB database,
which is composed of a Primary and a Translog file. The other 9.43GB
(9,898,904KB) file contains the 50GB database, which is a little
more complex than the previous one because it contains the Primary,
the Translog and three secondary files.
• In my case I have created a folder structure (yours doesn’t have to
match or look the same) with a folder for every file, named
according to its usage. For a better reference, please see the
image below:
8. Attaching the data files
• Although your files are neatly in place, you still need to tell
Microsoft® SQL Server™ to use them. This is accomplished with a
short script; you can use the same script and modify it to match
your file locations:
• As you can see, I’m attaching both databases using T-SQL, which
also allows me to give the databases more adequate names.
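The script image is not reproduced here, so the following is only a minimal sketch of what such an attach script could look like. The database names, drive, folders and file names are hypothetical placeholders rather than the ones I used; adjust them to your own layout. The 50GB database lists a primary file, three secondary files and a log file, as described in the previous section.

```sql
-- Sketch only: database names and file paths are hypothetical placeholders.

-- Attach the 10GB database (primary data file + transaction log).
CREATE DATABASE StackOverflow10GB
ON (FILENAME = N'D:\SQLData\SO10GB\Primary\StackOverflow10GB.mdf'),
   (FILENAME = N'D:\SQLData\SO10GB\Translog\StackOverflow10GB_log.ldf')
FOR ATTACH;
GO

-- Attach the 50GB database (primary, three secondary files, transaction log).
CREATE DATABASE StackOverflow50GB
ON (FILENAME = N'D:\SQLData\SO50GB\Primary\StackOverflow50GB_1.mdf'),
   (FILENAME = N'D:\SQLData\SO50GB\Secondary\StackOverflow50GB_2.ndf'),
   (FILENAME = N'D:\SQLData\SO50GB\Secondary\StackOverflow50GB_3.ndf'),
   (FILENAME = N'D:\SQLData\SO50GB\Secondary\StackOverflow50GB_4.ndf'),
   (FILENAME = N'D:\SQLData\SO50GB\Translog\StackOverflow50GB_log.ldf')
FOR ATTACH;
GO
```

The name after CREATE DATABASE is the name the attached database will have on your instance, which is how the attach step lets you rename it.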
9. Record counts
• Both databases have the same nine tables but not the same
number of records. Here’s what the 10GB and 50GB database tables
look like:
• We will be using the “Users” table, since the data it contains is
more meaningful for the purpose of this excerpt.
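One possible way to get the record counts per table yourself is to read them from the catalog views instead of counting every row. This is a sketch, not necessarily the query behind the image; the database name StackOverflow10GB is a hypothetical placeholder for whatever name you gave the database when attaching it.

```sql
-- Sketch only: StackOverflow10GB is a hypothetical database name.
USE StackOverflow10GB;
GO

-- Row counts per user table, taken from partition metadata.
-- index_id 0 = heap, 1 = clustered index, so each table is counted once.
SELECT t.name      AS TableName,
       SUM(p.rows) AS RecordCount
FROM sys.tables AS t
JOIN sys.partitions AS p
  ON p.object_id = t.object_id
WHERE p.index_id IN (0, 1)
GROUP BY t.name
ORDER BY t.name;
```

Run it once in each database to compare the two.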
10. Selecting the data
• Now we’re going to see a sequential sample of the contents of
the “Users” table. Please note that only specific fields are
included in the query image:
• With the above sample you can get a better idea of
the data we are going to randomize.
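A query along these lines returns such a sequential sample. It is a sketch: the exact field list in the image may differ, and I am assuming the “Users” table lives in the dbo schema and has Reputation and CreationDate columns in addition to the Id, DisplayName and Location fields used later on.

```sql
-- Sketch only: the column list is an assumption about the "Users" table.
-- Returns the first 20 users in Id order, i.e. a sequential (not random) sample.
SELECT TOP (20)
       Id,
       DisplayName,
       Location,
       Reputation,
       CreationDate
FROM dbo.Users
ORDER BY Id;
```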
11. Filtering the data
• Let’s filter the data in the “Users” table. In this case we will
group by the “Location” field to get an idea of which locations
are used and how often:
• Interestingly, you may observe that the most used location has a
value of NULL, followed by empty, India, London, United
Kingdom, United States, Germany and so forth.
• Now we can choose a location for our randomization process.
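The location breakdown can be obtained with a simple GROUP BY. This is a sketch under the same assumption as before, namely that the table is dbo.Users; the alias UsersPerLocation is just an illustrative name.

```sql
-- Count how many users share each location, most used locations first.
-- NULL and empty locations show up as their own groups.
SELECT Location,
       COUNT(*) AS UsersPerLocation
FROM dbo.Users
GROUP BY Location
ORDER BY UsersPerLocation DESC;
```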
12. Creating unique values
• To randomize our data we need to assign a unique value to each
record. To create unique values we will use a data type known as
“uniqueidentifier”, which takes 16 bytes of storage and stores a
GUID (globally unique identifier). In text form a GUID is 36
characters long: 32 hexadecimal digits (the numbers 0 to 9 and the
letters a to f) separated by 4 hyphens. A valid GUID looks like
the following:
69BC9D6C-B22B-476E-AD09-008661F165C3
• And just in case you were wondering about getting duplicate
GUIDs, the probability of finding a duplicate within 103 trillion
GUIDs is one in a billion, so you can rest assured that duplicates
are very unlikely with this approach.
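To see these properties for yourself, the sketch below generates a GUID with NEWID() and checks its storage and text lengths. The second query is a back-of-the-envelope check of the duplicate figure using the birthday approximation p ≈ n² / (2N); it assumes the GUIDs are random (version 4) values with 122 random bits, which is my assumption about where the quoted figure comes from.

```sql
-- Generate a GUID and inspect it.
-- DATALENGTH returns 16 (bytes of storage); LEN returns 36 (characters as text).
SELECT NEWID()                                AS SampleGuid,
       DATALENGTH(NEWID())                    AS StorageBytes,
       LEN(CONVERT(char(36), NEWID()))        AS TextLength;

-- Birthday approximation: p ~ n^2 / (2 * N)
--   n = 103 trillion GUIDs (1.03E14)
--   N = 2^122 possible values (assumes 122 random bits per GUID)
-- The result is roughly 1E-9, i.e. about one in a billion.
SELECT POWER(CAST(1.03E14 AS float), 2)
       / (2 * POWER(CAST(2 AS float), 122))   AS ApproxDuplicateProbability;
```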
13. Creating a Temporary Table
• Now we’re going to create a temporary table into which we will
insert a GUID and the data to randomize, so that we can sort it and
display it.
• The temporary table has 5 fields (RandomGUID, DisplayName,
CurrentReputation, Location, Id). The script is as follows:
• Please note that this is a global temporary table, which means it
is available to all sessions on your current SQL instance and not
just yours. If you wish to keep the table accessible only to your
SQL session, you may remove one # sign from the name.
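Here is a minimal sketch of what the table creation could look like. The five field names come from the description above, but the table name ##RandomUsers and the data types and lengths are my assumptions, so adjust them to match your source table.

```sql
-- Sketch only: table name, data types and lengths are assumptions.
-- "##" makes this a global temporary table; use "#" for a session-local one.
CREATE TABLE ##RandomUsers
(
    RandomGUID        uniqueidentifier NOT NULL DEFAULT NEWID(), -- filled in automatically
    DisplayName       nvarchar(40)     NULL,
    CurrentReputation int              NULL,
    Location          nvarchar(100)    NULL,
    Id                int              NOT NULL
);
```

The DEFAULT NEWID() on RandomGUID is what gives every inserted record its own GUID without us having to supply one.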
14. Selecting, Sorting and Inserting Data to the Temporary Table
• Once the table has been created we must insert data into it. In
this case we will insert a selected portion of the data based on
location. First we specify which fields will be populated and
then we make our insert, as shown in the script below:
• For this example I have chosen the location ‘San Francisco, CA’,
but you may choose any other location you wish. Now I have a
temporary table with 4,465 records in it, and they can be sorted
by the RandomGUID field for random results.
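The insert step could be sketched as follows. The table name ##RandomUsers is a hypothetical name for the global temporary table, and I am assuming the source column for CurrentReputation is called Reputation in dbo.Users; the number of records you get depends on the database version you attached.

```sql
-- Sketch only: ##RandomUsers and the Reputation source column are assumptions.
-- RandomGUID is not listed on purpose, so its default value is used.
INSERT INTO ##RandomUsers (DisplayName, CurrentReputation, Location, Id)
SELECT DisplayName,
       Reputation,
       Location,
       Id
FROM dbo.Users
WHERE Location = N'San Francisco, CA'
ORDER BY Id;
```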
15. Displaying Random Results
• As you may or may not have noticed, the RandomGUID field does not
appear in the insert and select statements we used to populate
our temporary table. This is because that field has a default value,
which creates a GUID automatically for every record you insert.
• This is what we will use to randomize results from the table, by
doing a Select top 10 ordered by that field.
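A sketch of the display step, again assuming the hypothetical table name ##RandomUsers:

```sql
-- Sketch only: ##RandomUsers is a hypothetical table name.
-- Sorting by the GUID puts the records in an arbitrary order,
-- so the first 10 rows are effectively a random pick.
SELECT TOP (10)
       RandomGUID,
       DisplayName,
       CurrentReputation,
       Location,
       Id
FROM ##RandomUsers
ORDER BY RandomGUID;

-- Clean up once the results have been displayed.
DROP TABLE ##RandomUsers;
```

Keep in mind that the GUIDs are assigned when the records are inserted, so running only the Select again against the same table returns the same 10 records. To get a different pick, the table has to be repopulated.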
16. More Random Results
• At the end of our last script we added a ‘Drop Table’ command to
delete the temporary table, but you may omit it if you or
someone else is going to keep using the table.
• The script image to the right performs the whole process of
creating, populating, displaying and dropping the table, which is
useful for multiple runs with different results.
• On the next slide I will show two
results from this script.
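For reference, the whole process can be sketched as one rerunnable batch. As before, the table name, data types and the Reputation source column are assumptions, and the OBJECT_ID check at the top is my addition so the batch can be run again even if a previous run left the table behind.

```sql
-- Sketch only: names and data types are assumptions, see the notes above.

-- Start clean if the table survived an earlier run.
IF OBJECT_ID(N'tempdb..##RandomUsers') IS NOT NULL
    DROP TABLE ##RandomUsers;

-- 1. Create the table; every record gets a fresh GUID by default.
CREATE TABLE ##RandomUsers
(
    RandomGUID        uniqueidentifier NOT NULL DEFAULT NEWID(),
    DisplayName       nvarchar(40)     NULL,
    CurrentReputation int              NULL,
    Location          nvarchar(100)    NULL,
    Id                int              NOT NULL
);

-- 2. Populate it with the users from the chosen location.
INSERT INTO ##RandomUsers (DisplayName, CurrentReputation, Location, Id)
SELECT DisplayName, Reputation, Location, Id
FROM dbo.Users
WHERE Location = N'San Francisco, CA';

-- 3. Display 10 records in GUID order, i.e. a random pick.
SELECT TOP (10) RandomGUID, DisplayName, CurrentReputation, Location, Id
FROM ##RandomUsers
ORDER BY RandomGUID;

-- 4. Drop the table so the next run generates new GUIDs.
DROP TABLE ##RandomUsers;
```

Because new GUIDs are generated on every run, each execution should return a different set of 10 records.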
17. Random Results Examples
• For better results, you may use a larger number of records, which
increases the randomization possibilities.
18. Use of Random Results
• One good use of randomizing data this way is to pick one or all
of the following (randomly):
1. Jury selection
2. Volunteers
3. Responsible assignees
4. Group leaders
5. Employees that will attend a SQL seminar in Vegas (you wish!)
• I hope you find this excerpt useful. Please share and practice the
gift of knowledge; it doesn’t matter if it’s one line of code or two
thousand lines. Thanks!