Students – and not a few staff – come to urban data science from a wide range of backgrounds and with vastly different levels of experience of programming and collaboration. While diversity is good from an ecosystem standpoint, for a new Master's or PhD student it can be hard to know where to begin: R or Python? LaTeX or Markdown? Git or SVN? MySQL or Postgres? This talk will draw on experience of both professional software development and research hacking, incorporating examples from the speaker's research, to offer one perspective on tools and workflows that help you to pick the right tool for the job, that help to get things done, and that help you to recover when things (inevitably) go wrong. This talk will provide the start of a discussion, not the final answer.
Generally, I can talk about the majority of these tools at any level of detail you like, but I’ve tried to focus on the big picture and to group them into categories so that you can think about the wide range of things that go into developing good research and supporting long-term development.
You'll notice that I have a very pragmatic, practical focus here. The really big thing to take away from this is that I have: a) used more tools than I'd care to remember while doing my job; and b) no particular axe to grind. I prefer to use things that work, regardless of where they came from.
This talk will draw on my experience of professional software development and research hacking to offer one perspective on tools and workflows that help get things done, and that help you to recover when things (inevitably) break in the course of your work.
Does someone give me data and ask me to find a question? Or do I have a question and go looking for data? Mix of both? This cycle operates at many scales – the biggest mistake that you can make is to think that a piece of analysis is done when it’s sent off to the reviewer. Or even when it appears in print. These works take on a life all their own over time. Many ‘snippets’ somehow escalate into core operational applications by some insane evolutionary process.
Figure 2 is why good ‘hygiene’ practices are so important – they can make or break your research. Big data is deep enough that you can drown in it, so you need to be careful.
Even MATLAB can make maps, but there are no real choices besides R and Python at the moment. Neither ticks every box, but there is obvious convergence occurring. I know someone will come up to me after my talk and say "But what about d3?" or name some other language, but my simple question is this: if you are convinced that the rest of the world is wrong, it's probably because you're an evangelist.
MySQL. MongoDB. PostgreSQL/PostGIS. Hive/Hadoop. I'm sceptical of the long-term utility of in-memory dbs. One thing that I always forget to do is to log the queries that generate derived tables, or the steps by which I created linking tables between separate 'areas' of the schema. Imagine losing all of your derived data in one go: how easy would it be for you to just check out the code from Git and hit 'run' to rebuild your analytical data warehouse?
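To make the 'check out and hit run' idea concrete, here is a minimal sketch in Python – using the standard library's sqlite3 as a stand-in for Postgres, with invented table and file names – of replaying versioned SQL scripts to rebuild derived tables:

```python
import sqlite3
from pathlib import Path

def rebuild(db_path, script_dir):
    """Rebuild the analytical warehouse by replaying versioned SQL scripts.

    Scripts are applied in sorted order (01_base.sql, 02_derived.sql, ...),
    so the queries that generate derived tables are logged *by construction*:
    they live in files that can be committed to Git.
    """
    con = sqlite3.connect(db_path)
    for script in sorted(Path(script_dir).glob("*.sql")):
        print(f"-- replaying {script.name}")   # a running log of what was run
        con.executescript(script.read_text())
    con.commit()
    return con
```

If every derived table is created this way, losing the warehouse costs you a re-run, not a reconstruction from memory.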
ArcMap. QGIS (+Postgres!). Python. R. Why would anyone use ArcMap now? R for research scriptability and 'simple' mapping (but see: sketchy maps). QGIS for 'proper' mapping. Down the rabbit hole with Python! QGIS is advancing by leaps and bounds, and planned integration with PySAL will give it analytics features far surpassing the ArcGIS toolbox; however, in quite a few ways it is still 'Photoshop for maps' – it can make them look prettier, faster than ArcMap can. Integration with Postgres gives you very nice features for manipulating and visualising large data sets.
Git. SVN/CVS. I still have some doubts about Git with large binary outputs rather than just code.
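The usual workaround is the pointer-file idea behind tools like git-lfs: version a tiny, hash-based pointer and keep the heavy bytes in a separate store. A hedged stdlib sketch of the principle only – the paths and the pointer format here are invented, not git-lfs's actual format:

```python
import hashlib
from pathlib import Path

def stash_binary(src, store_dir, repo_dir):
    """Keep a large binary out of version control: the content goes into a
    hash-keyed store; only a small pointer file sits in the repository."""
    data = Path(src).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    (store / digest).write_bytes(data)          # content-addressed copy
    repo = Path(repo_dir)
    repo.mkdir(parents=True, exist_ok=True)
    pointer = repo / (Path(src).name + ".ptr")
    pointer.write_text(f"sha256:{digest}\n")    # this is what Git versions
    return pointer

def fetch_binary(pointer, store_dir):
    """Resolve a pointer file back to the stored bytes."""
    digest = Path(pointer).read_text().strip().split(":", 1)[1]
    return (Path(store_dir) / digest).read_bytes()
```

The repository history then stays small no matter how often the binary changes; only the store grows.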
LaTeX. Markdown. Google Docs. Word. No right answer here, but there is an interesting range of apps to help writers. Please learn Word's Styles feature (it should be easy for LaTeX or web developers). I have seen some interesting apps recently: Texts; Scrivener.
Dropbox. TimeMachine. rsync/scp. Backblaze, CrashPlan, etc. Assume that it will take 3 weeks to recover 2 weeks' work. Postgres has one major flaw as far as I'm concerned, and that's replicating the database across machines: as far as I can tell this tends to involve dumping individual tables in their entirety and then restoring them on the other machine, and the synchronisation methods I've seen assume a very different type of system. Virtualisation could work, I guess.
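To illustrate the pain point, a minimal stdlib sketch of that whole-table dump-and-restore cycle – sqlite3 standing in for Postgres, CSV standing in for pg_dump output, table and column names invented:

```python
import csv
import sqlite3
from pathlib import Path

def dump_table(con, table, out_dir):
    """Dump an entire table to CSV -- no incremental sync, all or nothing."""
    rows = con.execute(f"SELECT * FROM {table}").fetchall()
    cols = [c[1] for c in con.execute(f"PRAGMA table_info({table})")]
    path = Path(out_dir) / f"{table}.csv"
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(cols)
        writer.writerows(rows)
    return path

def restore_table(con, csv_path):
    """Recreate the table on the other machine from the CSV dump."""
    table = Path(csv_path).stem
    with open(csv_path, newline="") as fh:
        reader = csv.reader(fh)
        cols = next(reader)
        con.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
        con.executemany(
            f"INSERT INTO {table} VALUES ({', '.join('?' * len(cols))})",
            list(reader),
        )
    con.commit()
```

Even this toy version shows the problem: every sync moves every row, which is exactly why it scales so badly across machines.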
None – but this is not optional. Audit ACLs. Let me tell you a story…
rctrack and YAML seem to be trying to solve aspects of this, but our attempts at replicating the Goddard research suggest that what we are doing now will be just as dated as mainframe work from 50 years ago!
Hardware is both more, and less, of a problem than you think – to see real performance boosts you need to spend a lot of money; otherwise you can get by on a lot less than you think.
I got an email about a PETL 10 years after leaving the company.
The Full Stack
Department of Geography
School of Social Science & Public Policy
To provide an overview of the tools and technologies that I have found – or seen – to enable good development practice & good research.
BA in Comparative Literature in 1997.
Went to work for dot.com start-up.
Learned to program, on the job.
Learned SQL, on the job.
Learned to back up more often, on the job.
Managed sites, ETL systems & analytics over many years.
Re-entered academia in 2006.
PhD at CASA; collaboration with SENSEable City lab.
Lecturer at King’s since 2013; helped set up
MY EXPECTATIONS FOR (GOOD) TOOLS
They must be useful when I need them.
They must get out of the way when I don’t.
They must fail gracefully when they can’t help it.
They must play well with other tools where feasible.
They must make it easy for me to do the right thing.
They should grow gracefully into operational systems.
WHERE DO WE GO FROM HERE?
In the remainder of this talk I will try to link my outputs – the pretty pictures – to the process by which they were created. If you want to know more about something you see, just stop me.
Coherence of syntax
Coherence of libraries
Spatial analytic support
Map-making & data viz
Ability to get things done
Availability of a good IDE
But it's really the 'value added' features that matter.
Cellular Census (2007)
(Spatial) Feature set (esp. …)
Replication & distribution
Access controls & user …
A lot can be done without spatial queries. Learn about indexing, query & schema design, and …
DATA STORAGE & MANAGEMENT
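A tiny illustration of the point above about getting far without spatial queries – pure Python standard library, with invented station names, a plain composite index, and a crude bounding box standing in for a true spatial predicate:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stops (name TEXT, lon REAL, lat REAL)")
con.executemany("INSERT INTO stops VALUES (?, ?, ?)", [
    ("Aldgate", -0.0756, 51.5143),
    ("Brixton", -0.1145, 51.4627),
    ("Morden",  -0.1948, 51.4022),
])
# An ordinary composite index is enough to make bounding-box filters fast:
con.execute("CREATE INDEX idx_stops_lonlat ON stops (lon, lat)")

# 'Spatial' query without PostGIS: everything inside a central-London box.
box = (-0.2, 0.0, 51.45, 51.55)   # min_lon, max_lon, min_lat, max_lat
rows = con.execute(
    "SELECT name FROM stops WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ?",
    box,
).fetchall()
```

For point-in-box filtering, joins, and aggregation this is all you need; true spatial types and indexes only become essential for distances, polygons, and projections.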
The ‘Big Bubble’? (2014)
Ability to layer.
Mapping to communicate results with a spatial dimension, and mapping to produce actual maps?
Global Health Partnerships (2016)
Ease of recovery
Scale of use
Best if you never learn SVN/CVS; then your brain will not be done in by Git.
VERSION CONTROL & RECOVERY
Oyster Card Work (2012)
Getting out of the way
Editing & comments
Quality of output
What helps you to think?
What helps you write first, but makes formatting later easy?
Thesis & ‘Space of Flows’ (2011, 2014)
How easy to backup/share?
How easy to recover?
How selective is recovery?
Backup early & backup often. Never trust one solution or one location.
Note: data protection
BACKUP & REPLICATION
Pint of Science (2014)
Encrypt! Encrypt! Encrypt!
Encourage use of
COMPLIANCE & DATA SECURITY
Also worth watching:
Travis CI: automated testing with GitHub integration.
Docker/Vagrant: replication & …
Full replication of someone else's entire data analysis process is harder than you think!
N/S Housing Divide (2017?)
• Better ways of specifying the full analytical 'context' – including versions of libraries, platform, etc. – as well as the input/output 'pipeline' – such as data and results (rctrack seems to want to do this, but only with R; YAML is more promising).
• Ways of talking about data processing pipelines & steps (UML is not the answer).
• Valuing of good (open) code & good data by institutions and research councils.
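One low-tech way to start on that first point is simply to snapshot the analytical context with every run. A sketch in Python (stdlib only; the field names and the choice of libraries are illustrative, not any standard):

```python
import json
import platform
import sys
from datetime import datetime, timezone

def capture_context(libraries):
    """Record enough of the environment to help someone replicate a run."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "libraries": {
            name: getattr(__import__(name), "__version__", "unknown")
            for name in libraries
        },
    }

# Write the context next to the results so it is versioned with them.
context = capture_context(["json", "csv"])
print(json.dumps(context, indent=2))
```

It is nowhere near full replication – no input data, no pipeline steps – but it costs a few lines and answers the "which versions were you on?" question years later.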
THE BIG PICTURE
Tools (ca. 2006):
Tools (ca. 2016):
Postgres + PostGIS
THE BIG PICTURE
Massive shift from expensive proprietary to cheap open (both software & hardware).
Underlying distinction between operational and development/research environments.
The problem: one tends to evolve into the other.
Document your code.
And any sources it drew upon.
You will regret not doing it.