The document surveys tools that support different stages of the data pipeline: acquisition, cleaning, and analysis. For acquisition, it covers Morph for scraping data from websites and Tabula for extracting tables from PDFs. For cleaning, it recommends OpenRefine for tasks such as normalization. It also discusses packaging such tools in Docker containers so they can be run consistently and accessed through a web browser. Finally, Jupyter notebooks and RStudio are highlighted as environments for interactive data exploration and analysis.
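As an illustration of the Docker-based hosting approach, the commands below sketch how browser-accessible tools of this kind are typically launched as containers. The image names and ports are examples of commonly published images, not prescriptions from the document; the general pattern is to publish the tool's web port to the host and then open it in a browser.

```shell
# Launch a Jupyter notebook server in a container (example image name);
# the notebook UI becomes reachable at http://localhost:8888
docker run -d -p 8888:8888 --name notebook jupyter/base-notebook

# Launch RStudio Server the same way (example image from the Rocker project);
# the IDE becomes reachable at http://localhost:8787
docker run -d -p 8787:8787 -e PASSWORD=changeme --name rstudio rocker/rstudio

# The identical pattern applies to OpenRefine or Tabula images published
# by the community: run the container, publish its web port, open the browser.
```

The design benefit noted in the document is that the tool's dependencies live inside the container, so the same one-line `docker run` works across machines, and the browser is the only client users need.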