Social Science-Conscious Analysis Case Study: The Cost of Public School

Social Science-Conscious Analysis
Case Study: The Cost of Public School
Riley H

Why The Cost of Public School?
New York City has some
of the best and worst
schools in all of the state
as well as the country.
Sometimes these are
right next to each other.

A Closer Look at Adjacent Schools
P.S. 11 in Midtown
West performed worse
than 60% of all schools
in New York State.
P.S. 59 in Midtown
East, a 10-minute walk
away, is the 19th best
elementary school in
the state.

In a perfect world, how would you answer
your question?
For us, the perfect solution involved selling identical
houses directly across a school zone boundary from each other.
We’d then measure the price
difference. It is important to make
sure that the other neighborhood
factors that drive price are as
stable as possible between the two,
so that we capture only the price
difference associated with the school.
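
As a rough sketch of that ideal experiment: if we really had pairs of identical homes on either side of a zone boundary, the school effect would simply be the average price gap within each pair. The prices and the `school_premium` helper below are invented for illustration and are not project data.

```python
from statistics import mean

# Hypothetical matched pairs: (price in the zone of the higher-rated school,
# price in the zone of the lower-rated school) for otherwise identical homes.
# All numbers are made up for illustration.
matched_pairs = [
    (650_000, 600_000),
    (720_000, 690_000),
    (510_000, 500_000),
]

def school_premium(pairs):
    """Average price gap across matched pairs of homes."""
    return mean(high - low for high, low in pairs)

premium = school_premium(matched_pairs)
print(f"Estimated school premium: ${premium:,.0f}")
```

Real listings are never this neatly matched, which is why the rest of the process is about getting as close to this comparison as the data allows.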

With unlimited data, how would you
demonstrate your hypothesis was true?
Identifying an exact method to nail down the problem
we want to solve is sometimes the hardest step.
Start by detailing your “ideal experiment”: what you
would do with all the data you could ever want.
From there, you can break it down into pieces that are
possible.

What can you actually acquire?
High-quality data and computational time are in
extremely short supply, with few exceptions!
Cut down your question based on what data you can
acquire, but make sure you remain true to the core
social issue!

For The Cost of Public School Project
We focused on the following:
● What data do we need on housing?
● What types of housing can we acquire, and how will
the data we can't get affect the impact of the
experiment?
● What factors other than housing could affect the cost
of housing, and how can we grab accurate data for
them and quantify them?

Community Data Sites
Community sites are great if they’re available. They
can be a godsend for projects like these if the
community in question has been diligent in upgrading
their processes.
Unfortunately, most cities are still using handwritten
forms for a lot of their workings, leaving details
scanned into the system in the dreaded PDF format
with barely readable text. In other words, useless.

NOOOO! NOT HANDWRITTEN PDFS!
OUR PRECIOUS
DATA…
LOST TO THE ETHER!
D:

Caveats of Third Party Sites
● May not be free and clear to use, even just for
research purposes. Make sure you check the
terms!
● Limits on how much data you can get in a period of
time.
● May require a sign up and approval process before
allowing API usage.
● API may be slow.
● Pulling data in general moves slowly.
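
One way to live with rate limits and slow APIs is to pause between requests and retry politely after a rejection. The sketch below assumes a hypothetical `fetch_page` callable that returns a list of records and raises `RuntimeError` when the site rejects a request. It is not the API of any particular site.

```python
import time

def pull_in_batches(fetch_page, pages, min_interval=1.0, max_retries=3,
                    sleep=time.sleep):
    """Call fetch_page(page) for each page, pausing between requests.

    fetch_page is a hypothetical callable supplied by the caller. It should
    return a list of records and raise RuntimeError on a rejected request.
    """
    records = []
    for page in pages:
        for attempt in range(max_retries):
            try:
                records.extend(fetch_page(page))
                break
            except RuntimeError:
                # Wait longer after each failed attempt before retrying.
                sleep(min_interval * 2 ** attempt)
        else:
            # Every attempt for this page failed.
            raise RuntimeError(f"page {page} failed after {max_retries} tries")
        sleep(min_interval)  # stay under the site's request limit
    return records
```

The `sleep` argument is only there so the pause can be swapped out when testing. The right interval depends on the site's terms, so check them first.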

Fixing the Data:
Sometimes Your Research Needs Researching
Preliminary data exploration is important to make sure
what you have makes sense.
But what does “sense” refer to?
In some cases, it will be obvious, but not in all of
them. Cross-referencing what you have with other
sources of information may save you trouble later!

Well, the data looks okay...
Cursory summaries of the data (means, medians,
quartiles and ranges) may not show anything
particularly strange, even when something is wrong.
Check for duplicate rows and for wrong values
that look realistic enough to go unnoticed!
These are common side effects of using an API from
a third-party site, and they won’t be easy to find!
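
A duplicate check can be as simple as counting how often the same combination of fields appears. This is a minimal sketch. The listings and the choice of `address`, `price`, and `sqft` as the identifying fields are assumptions for illustration.

```python
from collections import Counter

# Hypothetical listings, as they might come back from repeated API pulls.
listings = [
    {"address": "1 Example St #2A", "price": 500_000, "sqft": 650},
    {"address": "1 Example St #2A", "price": 500_000, "sqft": 650},
    {"address": "9 Sample Ave #5", "price": 750_000, "sqft": 900},
]

def find_duplicates(rows, key_fields=("address", "price", "sqft")):
    """Return the keys that appear more than once, with their counts."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

def drop_duplicates(rows, key_fields=("address", "price", "sqft")):
    """Keep the first row for each key, preserving the original order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

print(find_duplicates(listings))
print(len(drop_duplicates(listings)), "unique listings")
```

Looking at what `find_duplicates` reports before dropping anything helps you tell true repeats from distinct listings that happen to share values.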

Feature or Flat Wrong?
After coming up with odd results in our regression
models, we looked back to the data and found many
listings with very small square footage listed. Some
were clearly wrong, like listings with 10 square feet.
Others were dubious, even by the standards of tiny NYC living.
Where should we have drawn the line? You may find
yourself making this sort of judgement call, and that’s
where your community research comes in handy!
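
One way to make that judgement explicit is to sort listings into three buckets instead of two: clearly wrong, worth a manual look, and plausible. Both cutoffs below are placeholder assumptions, not the values used in the project. Your community research should set them.

```python
# Placeholder cutoffs (assumptions for illustration only).
IMPOSSIBLE_BELOW = 50   # sq ft: almost certainly a data-entry error
REVIEW_BELOW = 200      # sq ft: possibly real in NYC, check by hand

def classify_sqft(sqft):
    """Label a listing's square footage as 'drop', 'review', or 'keep'."""
    if sqft < IMPOSSIBLE_BELOW:
        return "drop"
    if sqft < REVIEW_BELOW:
        return "review"
    return "keep"

for value in (10, 150, 650):
    print(value, "->", classify_sqft(value))
```

Keeping a "review" bucket means the borderline cases get looked at by a person rather than silently dropped or silently trusted.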

Reasonable results don’t always
mean you have good data.

Yay! It’s a Clean Dataset!
After a lot of hard work, we finally have what we need to
proceed, a beautiful, clean data set.
At this point, you probably notice that your clean data is
substantially smaller than what you originally had, maybe
too small to enact your original experiment idea.
You can try to find more data, or use a model!

Modeling For a New Purpose
We did not use our model's predictions directly.
Instead, the model helped us create the data we were
missing so that we could actually complete the
experiment.
With our secondary experiment in mind, we constructed a
set of “fake” housing data to give us price averages in
areas of New York City that our third-party site did not
cover.

The Actual Model
Ours was a linear regression model including the
following features. Make sure that the type of model
and the features involved work for your project.
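
To illustrate the general idea of fitting a linear regression on real listings and then using it to price a constructed listing, here is a one-feature sketch. Square footage as the only feature, the training numbers, and the helper names are all assumptions for illustration. The project's actual model and feature set are not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented training data: square footage and sale price of known listings.
sqft = [500, 700, 900]
price = [550_000, 750_000, 950_000]

slope, intercept = fit_line(sqft, price)

def predict_price(square_feet):
    """Predicted price for a constructed ("fake") listing."""
    return slope * square_feet + intercept

print(f"Predicted price for 800 sq ft: ${predict_price(800):,.0f}")
```

A real model would need more features and a check of how well it fits before its predictions are trusted to stand in for missing data.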

Variety Helps Catch Errors
Analysis can be one of the most intense parts of a social
science project. It's more than just getting averages and
crunching numbers; you have to know not only what
the numbers mean, but also what they are defining
SOCIALLY.
This is where a diverse team comes in handy! Personal
experience may be an indication of where to go next and
what you've missed.

Don’t Forget the People Aspect
We specifically brought in people who know a lot about
certain areas of NYC, former realtors who are now
researchers, and people who own property in the areas
we were examining closely.
We also used our own experiences as residents of the
city to guide our choices.
We found that our numbers were in fact reflecting
lived experiences.

Don’t forget the community
you want to serve.
They should be driving your
research direction.

Look For the Reasons Why
If it turns out that your research doesn’t reflect lived
experience, examine why!
It could mean a drastic error in either your question, its
framing, the data set, or your analysis of the results!
Use the community to your advantage rather than work
against them.

Thank You
To my team at Microsoft, Glenda Ascencio, Anastassiya
Neznanova, and Thomas Patino, and our leads, Jake
Hofman, Amit Sharma, and Jenn Wortman Vaughan.
To Microsoft's Data Science Summer School, headed by
Jennifer Chayes at Microsoft Research.
And to everyone who encouraged me to give a data
science talk!

More about Myself
I am a student at CUNY Queens College graduating in
May with a BS in Computer Science and BA in
Mathematics.
If you have questions, comments, or want to recruit,
please contact me!
techiecheckie@gmail.com
https://github.com/techiecheckie
https://www.linkedin.com/in/techiecheckie

Bibliography
1. NYCOpenData, nycopendata.socrata.com
2. GreatSchools, greatschools.org
3. StreetEasy, streeteasy.com
4. NYC GeoClient API,
developer.cityofnewyork.us/api/geoclient-api
5. Microsoft Data Science Summer School,
ds3.research.microsoft.com
