This presentation was provided by Tim Lloyd of LibLynx, during the NISO Webinar "Wrestling with Access and Authentication Control" held on February 6, 2019.
Lloyd, "Web proxy vs Federated SSO: A practical guide to the pros & cons"
1. Web proxy vs Federated SSO:
A practical guide to
the pros & cons
Tim Lloyd, CEO of LibLynx
Identity & Access
for Libraries & Publishers
2. A brief refresher on
Web Proxy
• User’s browser routed via web proxy server
• URLs re-written to refer to proxy server
• Service provider sees proxy server IP address
21. Web Proxy vs
Federated SSO
Patron Privacy
User Experience
Security
Cost & Efficiency
Reporting
22. “We don’t want publishers to
track patron identities or what
they are reading”
Privacy
23. Privacy
Proxy
• Publisher never receives personal data (PD)
• HOWEVER, patrons forced to register for personalization risk exposing PD
SSO
• Shibboleth has good privacy controls to limit the release of PD
• Scalable Consent will provide tools & policies to empower institutional and individual choice
• Other SAML systems can be configured for privacy e.g. Active Directory
25. “My colleague can’t access
https://example.com.proxy.college.edu”
or
“Life, uh, finds a way”
User Experience
26. User Experience
Proxy
• Patrons have to access via institutional links, supported by significant community effort
• Poor experience outside proxied links
• Proxy configs can fail
SSO
• Patrons can access at point of discovery
• Poor experience to select institutional affiliation (RA21
27. “Someone at IP 1.2.3.4 violated the
terms of service 3 days ago between 2
and 4pm - please block their access”
Security
28. Security
Proxy
• Access can be secured by credentials
• Human error can be a significant factor
• Blocked IPs are blunt tools that impact good and bad users alike
• Tracing unauthorized activity can be time consuming and usually requires collaboration between librarian & publisher
• Soft target for fraudulent access
SSO
• Federated trust fabric provides strong set of security tools & policies
• Unauthorized activity quicker to spot and block by publisher - more like a scalpel than a hammer
29. “Does anyone have an up to
date stanza for XYZ Journal?”
Cost & Efficiency
30. Cost & Efficiency
Proxy
• Inherently unstable approach as it has to respond to changes in website design
• Requires regular, ongoing maintenance as resources change and proxy configurations need updating
SSO
• Stable technology
• Low ongoing maintenance required
31. “Which group of patrons
accesses this resource
the most?”
Reporting
32. Reporting
Proxy
• Access tracked through proxy
• Granular user level reporting available to library if proxy requires login
• Publisher can’t distinguish users
SSO
• Access tracked via SSO architecture
• Granular user level reporting available to library AND publisher depending on institutional policies
Hi, my name is Tim Lloyd and I run a business called LibLynx that specializes in Identity & Access for publishers and libraries. This means that we work with a wide variety of access and authentication technologies on both sides of the transaction – those providing services as well as those consuming services.
My presentation today is a practical guide to the pros and cons of Web Proxy authentication vs Federated Single Sign-On, or SSO. It’s intended to provide an introduction and general background to the later presentations from my colleagues.
These 2 technologies represent contrasting approaches to user authentication. While web proxy is the dominant technology solution for authenticating offsite access to electronic library resources, Federated SSO is increasingly viewed as an alternative that better fits the modern user workflow.
My observations are based on our experience of working with libraries across the board. They won’t necessarily hold true in all situations or for all libraries - but they should represent the experience for a significant proportion of libraries and, as such, are useful context for policy discussions.
Before we start the comparison, I’m going to briefly review how these 2 technologies work.
Firstly, web proxy servers.
One of the limitations of IP authentication is the requirement for users to be associated with a fixed set of IP addresses. This was easier back in the day when everyone logged in on campus from a fixed desktop, but increasingly usage comes from users on devices that are off-campus. A web proxy server allows a user to be assigned a specific IP address regardless of where they are physically in the world.
It works by routing the user’s browser via the proxy server. The proxy server re-writes URLs within the browser page to ensure that the user keeps being re-directed through the proxy as they navigate between web pages. The service provider associates the user with the IP address of the proxy server, rather than their real, underlying IP address. This allows a library to authenticate off-campus, or off-network, users as if they were within the campus IP range. It also allows libraries to outsource IP authentication to 3rd party services that give service providers a fixed IP range to authenticate users – helping to insulate resource access from local changes in IP addresses.
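To make the URL re-writing concrete, here is a minimal sketch in Python. It assumes a proxy that works by appending its own hostname to the provider's hostname, as in the https://example.com.proxy.college.edu example from the slides. The function names and the proxy.college.edu host are illustrative assumptions, not taken from any particular proxy product.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical proxy hostname, borrowed from the example in the slides.
PROXY_HOST = "proxy.college.edu"


def _with_host(parts, host: str) -> str:
    """Rebuild a URL with a different hostname, keeping any explicit port."""
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def to_proxied_url(url: str, proxy_host: str = PROXY_HOST) -> str:
    """Rewrite a provider URL so the browser requests it via the proxy."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host or host == proxy_host or host.endswith("." + proxy_host):
        return url  # relative link or already proxied: leave untouched
    return _with_host(parts, f"{host}.{proxy_host}")


def to_origin_url(url: str, proxy_host: str = PROXY_HOST) -> str:
    """Recover the provider's real URL, as the proxy must before fetching it."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    suffix = "." + proxy_host
    if not host.endswith(suffix):
        return url  # not a proxied URL
    return _with_host(parts, host[: -len(suffix)])


if __name__ == "__main__":
    print(to_proxied_url("https://example.com/article?id=1"))
```

A page fetched through the proxy would have each of its links passed through something like to_proxied_url before being returned to the browser, which is what keeps the user inside the proxy as they navigate.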
Although some of my comments relate to IP authentication more broadly (i.e. not necessarily via a web proxy), most of my comments are specific to web proxy technologies so I’ve gone with that title.
While IP authentication is an easy concept to grasp, Federated SSO is more complex and, in my experience, much less well understood. It’s laden with jargon and acronyms that obscure a series of transactions that aren’t that complex to grasp at a basic level.
If you’re unfamiliar with the term Federated Single Sign-On (SSO), you may recognize the name Shibboleth instead - Shibboleth is open source software commonly used to implement Federated Single Sign-On. Another term you may have heard of that we’ll be discussing later is RA21, which builds on top of Federated SSO. Because it’s easy to confuse RA21 with Federated Single Sign-On, I’ll make the distinction between them clear.
Let’s start with a simple analogy:
Bob runs a conference booth that provides books to anyone who studies at a subscribing institution. Amy comes up to the booth and says "Hi, can I have a book?"
Bob says, "Sure", and asks her if she’s at a subscribing institution.
Amy says that she’s a student at ABC College.
However, Bob doesn’t know Amy so he needs to verify that she’s registered with ABC College.
Luckily, he has a phone book where he can look up someone who can help him. In the case of ABC College, the person to talk to is Carol.
Bob calls Carol to ask if she can confirm that the person at his booth is a student at ABC College.
Carol asks him to pass the phone to the student so she can talk to her directly.
Carol talks to Amy and is able to confirm that she’s a valid student.
Amy passes the phone back to Bob so that Carol can confirm to him that Amy is a student at ABC College.
Now, Bob would ideally like to know the student’s name so that he can learn more about her interests and recommend other books to her in future.
However, ABC College’s policy is not to release student names and so Carol can’t provide Bob with any additional information on the student.
Bob has now verified that the student in front of him is at ABC College
Bob gives Amy her book, and also gives her a bright green badge to wear that says "I'm with ABC College" - Bob tells her that if the other booths see that badge, it'll save some time as she won't need to tell every booth which institution she studies at.
This simple scenario is actually very close to how Federated Single Sign-On works!
Bob is the 'service provider' that needs to check a visitor’s institutional affiliation before providing access to services.
His phone book is a federation - a trusted list that details how to talk to a set of vetted institutions and vendors. Examples of federations include InCommon in the United States, and the UK Access Management Federation.
Carol is the 'identity provider' - the institution’s Single Sign-On service that confirms a visitor’s identity.
And while our characters in this scenario speak English, in reality Bob, Carol and the Federation communicate using a language called SAML.
Finally, the badge that Bob gives to Amy is what RA21 is really about - making it easier for Amy to deal with other service providers.
It's also important to note that Carol was in control of Amy’s identity, and opted not to tell Bob Amy’s name. This might have been institutional policy, or Carol might have asked Amy if she wanted to share it. Either way, all Bob got was confirmation that Amy was definitely affiliated with ABC College and, as Bob trusts the phone book, he trusts Carol is the right person to confirm that.
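For those who prefer code to analogies, the same exchange can be sketched as a toy Python model. This is only an illustration of the roles: there is no SAML, no signing or encryption, and every class, method, and attribute name here is an assumption made for the example rather than part of any real federation software.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set, Tuple


@dataclass
class IdentityProvider:
    """Carol: the institution's sign-on service."""
    institution: str
    users: Dict[str, str]  # username -> password (toy data only)
    released_attributes: Tuple[str, ...] = ("affiliation",)  # institutional policy

    def authenticate(self, username: str, password: str) -> Optional[Dict[str, str]]:
        if self.users.get(username) != password:
            return None
        everything = {"affiliation": self.institution, "name": username}
        # Only attributes allowed by policy leave the institution.
        return {k: v for k, v in everything.items() if k in self.released_attributes}


@dataclass
class Federation:
    """The phone book: a trusted registry of vetted identity providers."""
    members: Dict[str, IdentityProvider] = field(default_factory=dict)

    def register(self, idp: IdentityProvider) -> None:
        self.members[idp.institution] = idp

    def lookup(self, institution: str) -> Optional[IdentityProvider]:
        return self.members.get(institution)


@dataclass
class ServiceProvider:
    """Bob: grants access once the user's institution vouches for them."""
    federation: Federation
    subscribers: Set[str]

    def request_access(
        self,
        institution: str,
        sign_in: Callable[[IdentityProvider], Optional[Dict[str, str]]],
    ) -> Optional[Dict[str, str]]:
        idp = self.federation.lookup(institution)
        if idp is None or institution not in self.subscribers:
            return None
        # "Passing the phone": the user signs in with the identity provider
        # directly, so the service provider never handles the credentials.
        return sign_in(idp)


if __name__ == "__main__":
    federation = Federation()
    federation.register(IdentityProvider("ABC College", {"amy": "s3cret"}))
    bob = ServiceProvider(federation, {"ABC College"})
    print(bob.request_access("ABC College", lambda idp: idp.authenticate("amy", "s3cret")))
```

With the default policy, Bob receives the affiliation but not the name. Adding "name" to released_attributes models an institution that chooses to share more.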
Now we’ve had a refresher on Web Proxy authentication and Federated SSO, let’s analyze them against 5 areas of concern that we regularly field questions about from libraries.
Patron Privacy
User Experience
Security
Cost and Efficiency
Reporting
As each of these areas could easily be a presentation in its own right, I’ll briefly provide some context before each one.
We encounter very diverse views on privacy within libraries. At one end of the spectrum are libraries that want to anonymize all information relating to patron usage - at the other are libraries that want granular data that records each time a particular patron accesses a particular resource. And in between are many shades of grey.
For example, we work with a library that requires patrons to verify their network credentials before accessing resources via a web proxy, but doesn’t want to store the selections that are made. Another doesn’t want to require credentials on-campus, but is interested in recording patrons’ identities in the background where possible - for example, via cookies or by passively querying their local identity provider to see if they have an active session.
We often find that organizational IT departments tend to give privacy little thought. Some pay a lot of attention to the security of access (and some none), but decisions about what data is shared tend to be based on quality and accessibility, not patron privacy.
So, to what extent do these access technologies allow libraries to reflect their institutional policy towards privacy?
Proxy Access
IP authentication is inherently anonymous, and so privacy protecting. Proxy servers make it more so by obscuring a patron’s underlying IP address (which can identify a user in certain situations). So, if your policy is never to provide personal data under any circumstances then this fits the bill.
But what if your patrons want to personalize their experience, say to access ‘My Account’ features, or to build personalized recommendations?
By anonymising access, IP authentication forces them to register directly with service providers - which may inadvertently harm their privacy more than SSO. Their options are to re-use social logins (and further increase exposure of their life to Facebook, Google etc) or to store yet more usernames and passwords with third parties. We know from research that most users tend to re-use existing credentials in these situations, potentially exposing the passwords they use at work.
Federated SSO
Federated SSO offers fine-grained control over what personal data (“attributes” in identity-speak) are available to 3rd parties.
For example, using Shibboleth an institution can just release an opaque identifier that doesn’t identify an individual, and isn’t shared between different service providers and so can’t be used to build up a profile of usage across service providers. Although the identifier is persistent, that user may be assigned a new identifier over time (depending on institutional policy), and so a service provider may not be aware that the same user has returned.
Another example attribute can identify whether a user is, say, faculty or a student. Institutions can opt to share additional data, such as a name and email address, but it’s under their control.
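The idea of an opaque, per-provider identifier can be illustrated with a short Python sketch. This is not Shibboleth's actual algorithm or configuration. It is a generic keyed-hash approach, and the function name, the secret, and the example service provider URLs are all assumptions made for illustration.

```python
import hashlib
import hmac


def opaque_id(user_id: str, sp_entity_id: str, secret_salt: bytes) -> str:
    """Derive an identifier that is stable for one user at one service provider.

    The same user gets a different value at each service provider, so the
    values cannot be joined up to profile usage across providers. Changing the
    secret (an institutional policy decision) yields fresh identifiers.
    """
    message = f"{sp_entity_id}!{user_id}".encode("utf-8")
    return hmac.new(secret_salt, message, hashlib.sha256).hexdigest()


if __name__ == "__main__":
    salt = b"institution-secret"  # hypothetical secret held only by the institution
    print(opaque_id("amy", "https://publisher-a.example/sp", salt))
    print(opaque_id("amy", "https://publisher-b.example/sp", salt))
```

Because the secret never leaves the institution, a service provider cannot work backwards from the identifier to the person, yet the institution can still recompute it when it needs to investigate a specific account.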
Ironically, we often have to dampen publisher expectations that Federated SSO will provide all the information needed to fully personalize their interfaces - in our experience, most institutions simply don’t provide that level of information by default.
While it’s possible for service providers to engage with institutions and make the case for sharing, say, a name and email address, this is a time-consuming process and many institutions lack the tools and policies to effectively participate.
One important development in this area is Internet2’s Scalable Consent initiative, which will provide institutions with tools and policies to manage the release of personal data and provide individual patrons with choice over what’s released.
As a picture speaks a thousand words, here’s a screenshot from one of their demos showing an example of how that might work. In this case, a patron is using Shibboleth to access a 3rd party service called CILogon (which could equally be a publisher service). At first glance it looks similar to the screens that Facebook and Google show when you use them to access a 3rd party app and they ask you to agree to share data. But this is actually much more sophisticated - combining both an institutional policy on what is recommended to share + an individual option to go with those choices or edit them further.
From a technical perspective, this is the best of both worlds - institutions can develop privacy policies to suit their context, and individuals can control their data privacy within that policy. To learn more about this project, Google ‘scalable consent’.
I want to look at the user access experience from a couple of perspectives.
Firstly, place. Patrons used to be primarily located on campus and the library website played a key role in discovery of resources. As we all know, patrons are increasingly located off network, and off campus, and discovery is increasingly happening in a wider range of diverse places outside direct library control. How well do our authentication technologies adapt to this new reality?
Secondly, friction. The user experience for access systems is fundamentally about balancing simplicity and security. Add too much friction into the process, and users will flow to the point of least resistance. A good analogy for me is the scene in Jurassic Park where the scientist played by Jeff Goldblum is describing how dinosaurs could overcome the park’s genetically engineered population control. His comment is “Life, uh, finds a way”, and that’s what I think patrons do if they encounter access friction - they figure a way around it. Unfortunately, that can mean simply avoiding access-controlled library resources. How much access friction do our authentication technologies add?
Proxy Access
While on-campus IP authentication is clearly seamless, I think it’s more useful to analyse the off-campus user experience as that’s an increasingly common use case.
Unfortunately, the need to access via library-enabled links puts proxy access at odds with user behaviour.
We support proxy access and I’m constantly impressed by the creativity of the whole community in trying to plug the gaps:
we add proxy URLs to the library site
we provide proxy prefixes to discovery services
we register IP ranges and resources with services like PubMed, and with tools like CASA to enable access via Google Scholar
we try to remember to provide proxied versions when we send links to patrons
we use browser-extension services like Lean Library and Kopernio that automatically proxy links that you come across.
Collectively, it all adds up to a huge amount of ongoing effort to get around the basic fact that patrons want access at the point of discovery, wherever that is for them.
But even this leaves plenty of gaps, such as resources discovered outside library-managed services, and unproxied links shared by friends and colleagues. These gaps lead to a poor user experience because there’s no clear pathway for the patron to access a proxied version of the resource - outside of repeating their discovery process again but this time via a library service.
We all know the impact on patrons - some will give up, some will go the extra mile to figure out how to get access, and a few end up buying individual access to a resource that their library has already paid for.
It’s a lot easier to measure on-campus usage than to assess the off-campus access lost to alternative sources because users, uh, find a way.
In addition, the inherently unstable nature of proxy access can occasionally cause links to fail. More on that later.
Federated SSO
The big advantage of Federated SSO is the ability to authenticate at the point of discovery, which aligns more naturally with user behavior. No need to access via special URLs.
Access friction arises at 2 points in the process - when a patron selects their institution (so the publisher knows which institutional identity source to authenticate against), and when the patron enters their credentials to sign in.
Identifying your institutional affiliation is a real issue as researchers move across publisher sites, each of which has to repeatedly ask the same question - which institution are you from? This is definitely a poor user experience but is being tackled head-on by the RA21 project as we’ll hear about later.
The 2nd point of friction, entering credentials, isn’t a significant barrier for patrons in our experience. Users have been trained by social logins to authenticate for services. When that authentication involves their organizational credentials, it may be seamless in practice if they already have an active session. And institutions can exercise a lot of control over how frequently they require users to re-authenticate - it’s not uncommon for some users to log in at the start of their day and not have to re-enter credentials, while others are required to re-authenticate throughout the day. You can decide how much friction to add to that process.
The issues we see around Security tend to fall into 2 areas:
How effectively are users authenticated before access?
Once access is granted, how effectively can unauthorized activity be stopped?
The quote is an example of the kind of request we have mediated between publishers and libraries. It can take days of back and forth to resolve.
I’m not going to cover all the possible security holes that can arise in these technologies, such as spoofing IPs or phishing attacks for credentials. Firstly, because good security practices can prevent the majority of opportunistic attacks, and we should all be following them anyway. Secondly, because some problems are universal - such as our willingness to believe that emails that seem too good to be true are true ...
Proxy Access
When it comes to authenticating users before access, proxy servers have a mixed record. The authentication process is really 2 distinct processes - one controlled by the library, and one controlled by the service provider.
Firstly, the library determines who can access their proxy server. Libraries can secure access to their proxy servers with individual authentication, and it’s possible to do this in a way that continues to support a variety of patron use cases, such as combinations of established network credentials for more permanent users and more flexible credentials for walk-ins, alumni, and other non-traditional users. This is a common implementation scenario that we’re very familiar with.
Secondly, the service provider receives incoming requests for access and matches them against registered IPs. And this is where problems arise.
A major source is simply human error. PSI - a business that specialises in IP audits - finds that 58% of IP ranges held by publishers to authenticate libraries are inaccurate. Having worked at a publisher, I can testify to the number of problems that arise when IP ranges are manually communicated with myriad opportunities for error. It’s not surprising when you consider how many people touch this data. For example:
IT fails to notify the library about old IPs no longer being used, or new IPs being added
The library fails to communicate those changes to a publisher (and in some cases this is via an intermediary, such as a purchasing consortium or agent)
The service provider fails to record those changes in its records - a process that will likely involve those IPs being passed through several people
And at each step of the chain there’s an opportunity for those IPs to be inaccurately transcribed into the next communication method.
The impact is often hidden:
users are turned away because their IP address isn’t recognized and simply go elsewhere – they don’t notify you because they’re unaware that their library provides access, or because it’s seen as too much effort
unauthorized users who get access when they shouldn’t
valid users who get access but whose usage is attributed to another library because the data is incorrect (or due to overlapping IP ranges).
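The matching step, and the way overlapping ranges lead to misattributed usage, can be sketched with Python's standard ipaddress module. The institutions and ranges below are made up for illustration (they use address blocks reserved for documentation), and the function names are assumptions rather than any publisher's real system.

```python
import ipaddress
from typing import Dict, List, Tuple

# Hypothetical registrations held by a service provider.
REGISTERED: Dict[str, List[str]] = {
    "ABC College": ["192.0.2.0/24"],
    # The first range below overlaps ABC College's registration.
    "XYZ University": ["192.0.2.128/25", "198.51.100.0/24"],
}


def match_institutions(ip: str, registered: Dict[str, List[str]] = REGISTERED) -> List[str]:
    """Return every institution whose registered ranges contain this address."""
    address = ipaddress.ip_address(ip)
    return [
        name
        for name, ranges in registered.items()
        if any(address in ipaddress.ip_network(r) for r in ranges)
    ]


def overlapping_ranges(registered: Dict[str, List[str]] = REGISTERED) -> List[Tuple[str, str]]:
    """List pairs of ranges registered to different institutions that overlap."""
    entries = [
        (name, ipaddress.ip_network(r))
        for name, ranges in registered.items()
        for r in ranges
    ]
    return [
        (f"{a_name} {a_net}", f"{b_name} {b_net}")
        for i, (a_name, a_net) in enumerate(entries)
        for b_name, b_net in entries[i + 1:]
        if a_name != b_name and a_net.overlaps(b_net)
    ]


if __name__ == "__main__":
    print(match_institutions("192.0.2.200"))   # matches more than one institution
    print(match_institutions("203.0.113.5"))   # unrecognized: user is turned away
    print(overlapping_ranges())
```

An address that matches no institution corresponds to the valid user who is turned away, and an address that matches two corresponds to usage that may be credited to the wrong library.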
Online registries can significantly reduce the level of inaccuracy, but the issue arises from the inherent need to accurately communicate large volumes of dynamic information and the system is ultimately still only as good as the data put into it.
On a side note, a more recent issue we’ve come across is institutions using 3rd party web security products like Zscaler that dynamically apply an unregistered IP address - great for security, and terrible for access authentication.
Of perhaps bigger concern is the challenge faced when unauthorized access occurs. As service providers only see a generic IP, their only options are to shut down access from that IP (cutting off all legitimate users too), or to ask the library to investigate and resolve it. But that task can involve painstaking analysis of logs to match access attempts to user logins. If the unauthorized access was via an on-campus IP without any individual authentication, it can take many days to trace the access back from an IP address through the proxy to a physical computer, and then back to a specific login. There’s little appetite from publishers and their content providers to leave the barn door open that long.
Altogether, these issues can make web proxies a soft target for fraudulent access.
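To give a feel for the log analysis a library faces when a publisher reports a time window, here is a small Python sketch. It assumes the proxy required an individual login and kept a simple log of who accessed which provider and when. The log structure and names are hypothetical and far simpler than real proxy logs.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class ProxyLogEntry:
    timestamp: datetime
    username: str     # only available if the proxy required an individual login
    target_host: str  # the provider site that was accessed via the proxy


def suspects(
    log: Iterable[ProxyLogEntry],
    target_host: str,
    start: datetime,
    end: datetime,
) -> List[str]:
    """Usernames that reached target_host via the proxy between start and end."""
    return sorted(
        {
            entry.username
            for entry in log
            if entry.target_host == target_host and start <= entry.timestamp <= end
        }
    )
```

Even in this idealized form the result is a list of candidates rather than a culprit, which is why the process usually needs further back and forth between librarian and publisher.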
Federated SSO
User authentication before access is pretty secure under SSO. Volatile user data is stored centrally - via Identity Federations in the case of Shibboleth - and corrections are quickly and easily made, significantly reducing manual error.
Federated SSO provides a strong trust fabric that is reflected in a set of security tools and best practice policies defining standards like the exchange of encryption certificates and the use of signing data. It provides participants with a high degree of confidence that the counterparty in authentication is who they say they are, and greatly limits opportunities for 3rd party attacks.
Aside from the general problem of hacked credentials, we’ve never encountered any major security issues with Shibboleth, and the security holes that are occasionally discovered are quickly patched with updates.
And, in comparison to proxy access, unauthorized activity can be directly traced by the service provider to a specific identifier (anonymous or otherwise) and access using that identifier shut down without inconveniencing other users.
In our experience, both service providers and libraries underestimate the true cost of access and authentication because the largest cost is staff time. Open-source implementations of these technologies are freely available, and don’t require expensive hardware to run. However, they are tricky and have a steep learning curve - experienced practitioners are worth their weight in gold, but you only realize when they leave and you need to replace them!
And I want to re-emphasize that these comments are based on the experience of libraries in general.
Proxy Access
The big problem with proxy technology is that it’s an inherently unstable approach - the technical solution has to adapt to changes in website design. As these changes are out of your control, implemented at random across 1000s of provider sites, it’s a constant game of whack-a-mole.
For example, here are some seemingly innocuous website changes that can break proxy configurations:
Changes of domain
Changes to features to use more client-side JavaScript
Changes in the use of cookies
In addition, in our experience, publisher IT staff don’t have a good understanding of how their website changes impact proxy technologies – which isn’t surprising, because it’s knowledge that isn’t easily acquired. Even where they distribute recommended proxy setups, there’s no guarantee they’ll be deployed. So, control over the quality of the service is out of the publisher’s hands.
Another example of a major problem we’re all collectively dealing with is the transition of websites from HTTP to HTTPS. As many browsers are moving towards explicitly marking HTTP-based websites as ‘insecure’, publishers have been moving to make their resources HTTPS by default. Proxies generally handle this with a wildcard security certificate, but the way these work generally necessitates changes to proxy configurations. (This is because the certificates are issued for a single level, e.g. *.proxy.example.com, which can be problematic when resources use cookies across multiple domains, e.g. login.resource.com and content.resource.com.)
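The single-level limitation can be sketched in a few lines of Python. This is a simplified illustration of wildcard matching, not the full certificate validation that browsers perform. The hostnames are assumptions for the example, and the hyphenated form is just one conceivable way of flattening a hostname into a single level.

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name such as '*.proxy.example.com' covers hostname.

    Simplified rule: a leading '*' stands for exactly one label (the part
    between dots), never several.
    """
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    if pattern_labels[0] == "*":
        return (
            len(host_labels) == len(pattern_labels)
            and host_labels[1:] == pattern_labels[1:]
        )
    return host_labels == pattern_labels


if __name__ == "__main__":
    cert = "*.proxy.example.com"
    for host in (
        "resource.proxy.example.com",
        "login.resource.com.proxy.example.com",
        "login-resource-com.proxy.example.com",
    ):
        print(host, wildcard_matches(cert, host))
```

A proxied hostname that keeps the provider's own dots spans several levels and so falls outside the certificate, whereas a flattened, single-label form is covered. The catch is that re-writing hostnames this way can affect how cookies are shared between a provider's domains.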
The end result is a large and ongoing effort to maintain and update proxy configurations - like cleaning the Augean stables, it never ends. It’s hard to think of a better example of IT job security.
If you add all this ongoing maintenance activity to the additional effort we’re all expending to ensure proxied links are delivered to users, you’ve got to ask yourself whether all this cost is telling us something. Would we design a solution that required all this effort if we were to start again from scratch?
Federated SSO
In comparison, Federated SSO requires similarly specialist technical expertise to set up but is a very stable technology with low maintenance needs. It’s independent of the technology for delivering content and services, so isn’t impacted by ongoing and unpredictable changes in resource provider websites and applications.
A final note: while you can obviously outsource management of both technologies (and we manage both), proxy servers are fundamentally a less efficient and more expensive technical solution - those costs still arise, and you incur them directly or indirectly.
Regardless of your attitude towards patron privacy, libraries need some level of usage reporting to inform outreach and budgeting.
And, as with privacy, we see a huge variety in approach. Some libraries are happy to rely on COUNTER stats from service providers - and others distrust COUNTER and prefer to rely on their own, independently generated access stats. Some libraries want very granular detail, others are happy with aggregated data.
Proxy Access
With a proxy server, it’s possible to record access to specific resources and provide aggregated stats. If a library also authenticates individual users, those stats can be mapped against a variety of additional metadata to provide more understanding about usage patterns, such as user departments, specialities, locations, and names. While it’s much harder to track usage within resources, these access stats provide libraries with a valuable and independent assessment of usage.
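As a rough illustration of how authenticated proxy access can answer a question like "which group of patrons accesses this resource the most?", here is a small Python sketch. The log format, usernames, departments, and function name are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def usage_by_group(
    access_log: Iterable[Tuple[str, str]],  # (username, resource) pairs from the proxy
    departments: Dict[str, str],            # username -> department, from library records
    resource: str,
) -> Counter:
    """Count accesses to one resource, grouped by the user's department."""
    return Counter(
        departments.get(user, "Unknown")
        for user, accessed in access_log
        if accessed == resource
    )


if __name__ == "__main__":
    log = [
        ("amy", "XYZ Journal"),
        ("dave", "XYZ Journal"),
        ("erin", "Other Database"),
        ("amy", "XYZ Journal"),
    ]
    depts = {"amy": "Biology", "dave": "History", "erin": "Biology"}
    print(usage_by_group(log, depts, "XYZ Journal").most_common())
```

Grouping by department rather than by name is one way a library could get useful outreach data while keeping individual reading habits out of its reports.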
We’ve encountered libraries that choose to implement proxy access both on and off-site in order to provide them with the fullest possible picture of who is using what.
On the flip side, service providers experience a sea of anonymity that gives them a minimal understanding of user needs and usage patterns. This makes it much harder for them to develop more engaging services that better meet your patrons’ needs.
Federated SSO
Federated SSO moves that access point away from the library proxy server to an organization’s SSO architecture.
Although the library has no direct visibility of usage outside of clicks on its own links or via services it controls (e.g. browser extensions), it’s possible to track access requests via SSO. Depending on your institutional policies, you could still build an independent picture of access to resources across direct clicks, via the proxy server, and via SSO.
And SSO offers service providers the potential for a more granular understanding of their users, again based on your institutional policy towards sharing more data, such as permanent anonymous identifiers or personal data like an email address.