The Full Stack

Jon Reades, Lecturer in Quantitative Human Geography at King's College London
Department of Geography
School of Social Science & Public Policy
THE FULL STACK
JON READES
OBJECTIVE
To provide an overview of the tools and
technologies that I have found – or seen –
to enable good development practice &
productive research.
MY BACKGROUND
BA in Comparative Literature in 1997.
Went to work for dot.com start-up.
Learned to program, on the job.
Learned SQL, on the job.
Learned to back up more often, on the job.
Managed sites, ETL systems & analytics over many
years.
Re-entered academia in 2006.
PhD at CASA; collaboration with SENSEable City lab.
Lecturer at King’s since 2013; helped set up
MOTIVATION
HOW DOES ‘BIG DATA WORK’
WORK?
Idea
Exploration
Development
Revision
Writing Up
BIG DATA WORK ON A PRACTICAL
LEVEL
MY EXPECTATIONS FOR (GOOD)
TOOLS
They must be useful when I need them.
They must get out of the way when I don’t.
They must fail gracefully when they can’t help it.
They must play well with other tools where feasible.
They must make it easy for me to do the right thing.
They should grow gracefully into operational systems.
WHERE DO WE GO FROM HERE?
In the remainder of this talk I will try to link
my outputs – the pretty pictures – to the
process by which they were created.
If you want to know more about something
you see, just stop me.
Considerations:
 Coherence of syntax
 Coherence of libraries
 Data-munging features
 Spatial analytic support
 Map-making & data viz
 Ability to get things done
 Availability of a good IDE
But it’s really the ‘value
added’ features that
matter.
PROGRAMMING LANGUAGES
Cellular Census (2007)
Considerations:
 Standards compliance
 (Spatial) Feature set (esp.
indexing)
 Replay/Logging
 Replication & distribution
 Access controls & user
management
A lot can be done without
spatial queries. Learn
about indexing, query &
schema design, and
DATA STORAGE & MANAGEMENT
The ‘Big Bubble’? (2014)
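One way to read the 'Replay/Logging' consideration is that every statement which creates a derived table should be recorded so that the table can be rebuilt later. The sketch below is a minimal illustration only: it uses Python's built-in sqlite3 module rather than any database named in this talk, and the table names (trips, trips_by_origin) and the helpers run_logged and replay are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def run_logged(conn, sql, log):
    """Execute a statement and record it so the derived table can be rebuilt."""
    conn.execute(sql)
    log.append((datetime.now(timezone.utc).isoformat(), sql))


def replay(conn, log):
    """Re-run every logged statement, in the order it was first executed."""
    for _, sql in log:
        conn.execute(sql)


# Hypothetical source data and derived table, for illustration only.
query_log = []
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trips (origin TEXT, dest TEXT, n INTEGER)")
db.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [("A", "B", 3), ("A", "C", 2), ("B", "C", 5)],
)

# The derived table is only ever created through run_logged().
run_logged(
    db,
    "CREATE TABLE trips_by_origin AS "
    "SELECT origin, SUM(n) AS total FROM trips GROUP BY origin",
    query_log,
)
rows = db.execute(
    "SELECT origin, total FROM trips_by_origin ORDER BY origin"
).fetchall()
```

In practice the log would be written to a file kept under version control, so that losing the derived tables means re-running the log rather than reconstructing the queries from memory.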
Considerations:
 Ease-of-use
 Scriptability
 Ability to layer
 Interoperability
Distinguish between
mapping to communicate
results with a spatial
dimension and mapping
to produce actual maps?
GEODATA VISUALISATION
Global Health Partnerships (2016)
Considerations:
 Collaboration
 Scalability
 Ease of recovery
 Scale of use
Best if you never learn
SVN/CVS, then your brain
will not be done in by Git.
VERSION CONTROL & RECOVERY
Oyster Card Work (2012)
Considerations:
 Getting out of the way
 Compatibility
 Collaboration
 Editing & comments
 Quality of output
What helps you to think?
What helps you write first,
but makes formatting later
easy?
WRITING
Thesis & ‘Space of Flows’ (2011, 2014)
Considerations:
 How easy to backup/share?
 How often?
 Where stored?
 How easy to recover?
 How selective is recovery?
Back up early & back up
often. Never trust one
solution or one location.
Note: data protection
issues.
BACKUP & REPLICATION
STRATEGIES
Pint of Science (2014)
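The advice not to trust one solution or one location can be sketched in a few lines: copy a file to more than one destination and check that each copy matches the original. This is an assumption-laden illustration, not a backup tool. The backup helper, the file name results.csv and the directory names disk_a and disk_b are hypothetical, and temporary directories stand in for genuinely separate locations.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path):
    """Return the SHA-256 hex digest of a file (reads the whole file into memory)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def backup(src, destinations):
    """Copy src into each destination directory and verify every copy."""
    src = Path(src)
    expected = sha256(src)
    verified = []
    for dest_dir in destinations:
        dest_dir = Path(dest_dir)
        dest_dir.mkdir(parents=True, exist_ok=True)
        copy = dest_dir / src.name
        shutil.copy2(src, copy)  # copy2 also preserves timestamps where possible
        if sha256(copy) != expected:
            raise IOError(f"Backup to {copy} does not match the original")
        verified.append(copy)
    return verified


# Demonstration: two temporary directories stand in for two separate locations.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "results.csv"
    original.write_text("zone,count\nA,5\n")
    copies = backup(original, [Path(tmp) / "disk_a", Path(tmp) / "disk_b"])
    n_copies = len(copies)
    all_match = all(sha256(c) == sha256(original) for c in copies)
```

Verifying the checksum after each copy is the part that speaks to 'How easy to recover?': a backup that has never been checked is only a hope.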
Considerations:
 Performance
 Encryption
 ACLs (users/groups/systems)
 Password Managers
Encrypt! Encrypt! Encrypt!
Encourage use of
password managers.
COMPLIANCE & DATA SECURITY
Also worth watching:
 Travis CI: automated testing
with GitHub integration.
 Docker/Vagrant: replication &
virtualisation.
Full replication of
someone else’s entire
data analysis process is
harder than you think!
REPLICABLE RESEARCH
N/S Housing Divide (2017?)
WHAT’S MISSING?
• Better ways of specifying the full analytical ‘context’ –
including versions of libraries, platform, etc. – as well
as the input/output ‘pipeline’ – such as data and
results (rctrack seems to want to do this, but only
with R; YAML looks more promising).
• Ways of talking about data processing pipelines &
steps (UML is not the answer).
• Valuing of good (open) code & good data by
institutions and research councils.
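As a rough illustration of what recording the analytical 'context' might involve, the sketch below collects the platform, the interpreter version, the versions of named libraries and checksums of input files into a single manifest. It uses only the Python standard library and is not how rctrack works; the function name describe_context and the manifest layout are assumptions made for this example.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def describe_context(packages, input_files=()):
    """Collect platform, interpreter, library versions and input checksums."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    checksums = {}
    for path in input_files:
        with open(path, "rb") as fh:
            checksums[str(path)] = hashlib.sha256(fh.read()).hexdigest()
    return {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        "packages": versions,
        "inputs": checksums,
    }


# The package names here are placeholders; list whatever the analysis imports.
context = describe_context(["pip", "a-package-that-does-not-exist"])
manifest = json.dumps(context, indent=2, sort_keys=True)
```

Saving the manifest next to the results gives a later reader at least a starting point for rebuilding the environment, though it does not capture system libraries or the data pipeline itself.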
THE BIG PICTURE
Tools (ca. 2006):
 Eclipse
 Perl/Java
 Oracle 8i
 Cron jobs
 OLAP Tools
 CVS
 ArcMap
Tools (ca. 2016):
 R/RStudio
 Python
 Postgres + PostGIS
 Cron jobs
 Knitr, etc.
 Git
 QGIS
THE BIG PICTURE
Massive shift from expensive proprietary to
cheap open (both software & hardware).
Underlying distinction between operational
and development/research environments
persists.
The problem: one tends to evolve into the
other.
FINAL THOUGHT
Document your code.
And any sources it drew upon.
You will regret not doing it.
THANK YOU
Jon Reades
@jreades
reades.com
kingsgeocomputation.org

Editor's Notes

  1. Generally, I can talk about the majority of these tools at any level of detail you like, but I’ve tried to focus on the big picture and to group them into categories so that you can think about the wide range of things that go into developing good research and supporting long-term development.
  2. You’ll notice that I have a very pragmatic, practical focus here. The really big thing to take from this is that: a) I’ve used more tools than I’d care to remember while doing my job; and b) I don’t have any particular axe to grind. I prefer to use things that work, regardless of where they came from.
  3. This talk will draw on my experience of professional software development and research hacking to offer one perspective on tools and workflows that help get things done, and that help you to recover when things (inevitably) break in the course of your work.
  4. Does someone give me data and ask me to find a question? Or do I have a question and go looking for data? Mix of both? This cycle operates at many scales – the biggest mistake that you can make is to think that a piece of analysis is done when it’s sent off to the reviewer. Or even when it appears in print. These works take on a life all their own over time. Many ‘snippets’ somehow escalate into core operational applications by some insane evolutionary process.
  5. Figure 2 is why good ‘hygiene’ practices are so important – they can make or break your research. Big data is deep enough that you can drown in it, so you need to be careful.
  6. Even MATLAB can make maps, but there are no choices besides R and Python at the moment. Neither ticks every box, but obvious convergence occurring. I know someone will come up to me after my talk and say “But what about d3?” or some other language, but my simple answer is this: if you are convinced that the rest of the world is wrong, it’s probably because you’re an evangelist.
  7. MySQL, MongoDB, PostgreSQL/PostGIS, Hive/Hadoop. Sceptical of long-term utility of in-memory dbs. One thing that I always forget to do is log the queries that generate derived tables, or the steps by which I created linking tables between separate ‘areas’ of the schema. Imagine losing all of your derived data in one go: how easy would it be for you to just check out the code from Git and hit ‘run’ to rebuild your analytical data warehouse?
  8. ArcMap, QGIS (+Postgres!), Python, R. Why would anyone use ArcMap now? R for research scriptability and ‘simple’ mapping (but see: sketchy maps). QGIS for ‘proper’ mapping. Down the rabbit hole with Python! QGIS is advancing by leaps and bounds, and planned integration with PySAL will give it analytics features far surpassing the ArcGIS toolbox; however, in quite a few ways it is still ‘Photoshop for maps’ – it can make them look prettier, faster than ArcMap. Integration with Postgres gives you very nice features for manipulating and visualising large data sets.
  9. Git, SVN/CVS. Still have some doubts about Git with large binary outputs instead of just code.
  10. LaTeX, Markdown, Google Docs, Word. No right answer here, but interesting range of apps to help writers. Please learn Word’s Styles feature (should be easy for LaTeX or web developers). Have seen some interesting apps recently: Texts, Scrivener.
  11. Dropbox, Time Machine, rsync/scp, Backblaze, CrashPlan, etc. Assume that it will take 3 weeks to recover 2 weeks’ work. Postgres has one major flaw as far as I’m concerned, and that’s replicating the database across machines. As far as I can tell this tends to involve dumping individual tables in their entirety and then restoring on the other machine. The synchronisation methods I’ve seen assume a very different type of system. Virtualisation could work, I guess.
  12. None, this is not optional. Audit ACLs. Let me tell you a story…
  13. rctrack and YAML seem to be trying to solve aspects of this, but our attempts at replicating the Goddard research -- what we are doing now will be just as dated as mainframe work from 50 years ago!
  14. Hardware is both more, and less, of a problem than you think – to see real performance boosts you need to spend a lot of money, otherwise you can get by on a lot less than you think.
  15. I got an email about a PETL 10 years after leaving the company.