The Full Stack



Students – and not a few staff – come to urban data science from a wide range of backgrounds and with vastly different levels of experience of programming and collaboration. While diversity is good from an ecosystem standpoint, for a new Masters or PhD student it can be hard to know where to begin: R or Python? LaTeX or Markdown? Git or SVN? MySQL or Postgres? This talk will draw on experience of both professional software development and research hacking, incorporating examples from the speaker’s research, to offer one perspective on tools and workflows that help you to pick the right tool for the job, that help to get things done, and that help you to recover when things (inevitably) go wrong. This talk will provide the start of a discussion, not the final answer.

Published in: Data & Analytics


  1. Department of Geography School of Social Science & Public Policy THE FULL STACK JON READES
  2. OBJECTIVE To provide an overview of the tools and technologies that I have found – or seen – to enable good development practice & productive research.
  3. MY BACKGROUND BA in Comparative Literature in 1997. Went to work for start-up. Learned to program, on the job. Learned SQL, on the job. Learned to back up more often, on the job. Managed sites, ETL systems & analytics over many years. Re-entered academia in 2006. PhD at CASA; collaboration with SENSEable City lab. Lecturer at King’s since 2013; helped set up
  5. HOW DOES ‘BIG DATA WORK’ WORK? Idea, Exploration, Development, Revision, Writing Up
  7. MY EXPECTATIONS FOR (GOOD) TOOLS They must be useful when I need them. They must get out of the way when I don’t. They must fail gracefully when they can’t help it. They must play well with other tools where feasible. They must make it easy for me to do the right thing. They should grow gracefully into operational systems.
  8. WHERE DO WE GO FROM HERE? In the remainder of this talk I will try to link my outputs – the pretty pictures – to the process by which they were created. If you want to know more about something you see, just stop me.
  9. Considerations: • Coherence of syntax • Coherence of libraries • Data-munging features • Spatial analytic support • Map-making & data viz • Ability to get things done • Availability of a good IDE But it’s really the ‘value added’ features that matter. PROGRAMMING LANGUAGES Cellular Census (2007)
  10. Considerations: • Standards compliance • (Spatial) Feature set (esp. indexing) • Replay/Logging • Replication & distribution • Access controls & user management A lot can be done without spatial queries. Learn about indexing, query & schema design, and DATA STORAGE & MANAGEMENT The ‘Big Bubble’? (2014)
  11. Considerations: • Ease-of-use • Scriptability • Ability to layer • Interoperability Distinguish between mapping to communicate results with a spatial dimension and mapping to produce actual maps? GEODATA VISUALISATION Global Health Partnerships (2016)
  12. Considerations: • Collaboration • Scalability • Ease of recovery • Scale of use It is best never to have learned SVN/CVS; then your brain will not be done in by Git. VERSION CONTROL & RECOVERY Oyster Card Work (2012)
  13. Considerations: • Getting out of the way • Compatibility • Collaboration • Editing & comments • Quality of output What helps you to think? What helps you write first, but makes formatting later easy? WRITING Thesis & ‘Space of Flows’ (2011, 2014)
  14. Considerations: • How easy to backup/share? • How often? • Where stored? • How easy to recover? • How selective is recovery? Backup early & backup often. Never trust one solution or one location. Note: data protection issues. BACKUP & REPLICATION STRATEGIES Pint of Science (2014)
  15. Considerations: • Performance • Encryption • ACLs (users/groups/systems) • Password Managers Encrypt! Encrypt! Encrypt! Encourage use of password managers. COMPLIANCE & DATA SECURITY
  16. Also worth watching: • Travis CI: automated testing with GitHub integration. • Docker/Vagrant: replication & virtualisation. Full replication of someone else’s entire data analysis process is harder than you think! REPLICABLE RESEARCH N/S Housing Divide (2017?)
  17. WHAT’S MISSING? • Better ways of specifying the full analytical ‘context’ (versions of libraries, platform, etc.) as well as the input/output ‘pipeline’ (data and results). rctrack seems to want to do this, but only with R; YAML looks more promising. • Ways of talking about data processing pipelines & steps (UML is not the answer). • Valuing of good (open) code & good data by institutions and research councils.
  18. THE BIG PICTURE Tools (ca. 2006): • Eclipse • Perl/Java • Oracle 8i • Cron jobs • OLAP Tools • CVS • ArcMap Tools (ca. 2016): • R/RStudio • Python • Postgres + PostGIS • Cron jobs • Knitr, etc. • Git • QGIS
  19. THE BIG PICTURE Massive shift from expensive proprietary to cheap open (both software & hardware). Underlying distinction between operational and development/research environments persists. The problem: one tends to evolve into the other.
  20. FINAL THOUGHT Document your code. And any sources it drew upon. You will regret not doing it.
  21. THANK YOU Jon Reades @jreades
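
Slide 17 asks for better ways of recording the analytical ‘context’ (library versions, platform) alongside the inputs to a pipeline. The sketch below is one possible illustration using only the Python standard library, not a tool mentioned in the talk. The function names `capture_context` and `file_sha256`, the JSON output (chosen because the standard library has no YAML writer), and the package names in the usage section are all assumptions made for the example.

```python
import hashlib
import json
import platform
from importlib import metadata


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_context(packages, input_files):
    """Collect interpreter, platform, library versions and input checksums.

    Hypothetical helper: `packages` is a list of distribution names and
    `input_files` a list of paths to the data the analysis reads.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
        "inputs": {str(path): file_sha256(path) for path in input_files},
    }


if __name__ == "__main__":
    # Hypothetical package list; replace with what the analysis imports.
    context = capture_context(["numpy", "pandas"], [])
    print(json.dumps(context, indent=2, sort_keys=True))
```

Saving this record next to each set of results, and committing it with the code, gives a later reader a starting point for rebuilding the environment. It does not capture system libraries or random seeds, which is part of why full replication remains hard.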