The Full Stack

Department of Geography
School of Social Science & Public Policy
THE FULL STACK
JON READES

OBJECTIVE
To provide an overview of the tools and
technologies that I have found – or seen –
to enable good development practice &
productive research.

MY BACKGROUND
BA in Comparative Literature in 1997.
Went to work for dot.com start-up.
Learned to program, on the job.
Learned SQL, on the job.
Learned to back up more often, on the job.
Managed sites, ETL systems & analytics over many
years.
Re-entered academia in 2006.
PhD at CASA; collaboration with SENSEable City lab.
Lecturer at King’s since 2013; helped set up

HOW DOES ‘BIG DATA WORK’
WORK?
Idea
Exploration
DevelopmentRevision
Writing Up

BIG DATA WORK ON A PRACTICAL
LEVEL

MY EXPECTATIONS FOR (GOOD)
TOOLS
They must be useful when I need them.
They must get out of the way when I don’t.
They must fail gracefully when they can’t help it.
They must play well with other tools where feasible.
They must make it easy for me to do the right thing.
They should grow gracefully into operational systems.

WHERE DO WE GO FROM HERE?
In the remainder of this talk I will try to link
my outputs – the pretty pictures – to the
process by which they were created.
If you want to know more about something
you see, just stop me.

Considerations:
 Coherence of syntax
 Coherence of libraries
 Data-munging features
 Spatial analytic support
 Map-making & data viz
 Ability to get things done
 Availability of a good IDE
But it’s really the ‘value
added’ features that
matter.
PROGRAMMING LANGUAGES
Cellular Census (2007)

Considerations:
 Standards compliance
 (Spatial) Feature set (esp.
indexing)
 Replay/Logging
 Replication & distribution
 Access controls & user
management
A lot can be done without
spatial queries. Learn
about indexing, query &
schema design, and
DATA STORAGE & MANAGEMENT
The ‘Big Bubble’? (2014)

Considerations:
 Ease-of-use
 Scriptability
 Ability to layer
 Interoperability
Distinguish between
mapping to communicate
results with a spatial
dimension and mapping
to produce actual maps?
GEODATA VISUALISATION
Global Health Partnerships (2016)

Considerations:
 Collaboration
 Scalability
 Ease of recovery
 Scale of use
Best if you never learn
SVN/CVS, then your brain
will not be done in by Git.
VERSION CONTROL & RECOVERY
Oyster Card Work (2012)

Considerations:
 Getting out of the way
 Compatibility
 Collaboration
 Editing & comments
 Quality of output
What helps you to think?
What helps you write first,
but makes formatting later
easy?
WRITING
Thesis & ‘Space of Flows’ (2011, 2014)

Considerations:
 How easy to backup/share?
 How often?
 Where stored?
 How easy to recover?
 How selective is recovery?
Backup early & backup
often. Never trust one
solution or one location.
Note: data protection
issues.
BACKUP & REPLICATION
STRATEGIES
Pint of Science (2014)

Considerations:
 Performance
 Encryption
 ACLs (users/groups/systems)
 Password Managers
Encrypt! Encrypt! Encrypt!
Encourage use of
password managers.
COMPLIANCE & DATA SECURITY

Also worth watching:
 Travis CI: automated testing
with GitHub integration.
 Docker/Vagrant: replication &
virtualisation.
Full replication of
someone else’s entire
data analysis process is
harder than you think!
REPLICABLE RESEARCH
N/S Housing Divide (2017?)

WHAT’S MISSING?
• Better ways of specifying the full analytical ‘context’ –
including versions of libraries, platform, etc. – as well
as the input/output ‘pipeline’ – such as data and
results (rctrack seems to want to do this, but only
with R, YAML more promising).
• Ways of talking about data processing pipelines &
steps (UML is not the answer).
• Valuing of good (open) code & good data by
institutions and research councils.

THE BIG PICTURE
Tools (ca. 2006):
 Eclipse
 Perl/Java
 Oracle 8i
 Cron jobs
 OLAP Tools
 CVS
 ArcMap
Tools (ca. 2016):
 R/Rstudio
 Python
 Postgres + PostGIS
 Cron jobs
 Knitr, etc.
 Git
 QGIS

THE BIG PICTURE
Massive shift from expensive proprietary to
cheap open (both software & hardware).
Underlying distinction between operational
and development/research environments
persists.
The problem: one tends to evolve into the
other.

FINAL THOUGHT
Document your code.
And any sources it drew upon.
You will regret not doing it.

THANK YOU
Jon Reades
@jreades
reades.com
kingsgeocomputation.org

The Full Stack

More Related Content

Viewers also liked

Similar to The Full Stack

Recently uploaded

The Full Stack

Editor's Notes