A talk given at the August 2010 meeting of the Linux Users of Victoria. It uses the group's mailing list archive (some 20,000 messages since the start of 2007, over 2 million words) to demonstrate working with a web corpus in NLTK (the Natural Language Toolkit), a Python library.
Language Sleuthing HOWTO with NLTK
1. Language Sleuthing HOWTO
or
Discovering Interesting Things
with Python's
Natural Language Tool Kit
Brianna Laugher
modernthings.org
brianna[@.]laugher.id.au
3. Because the web is full of
language data
Because linguistic techniques
can reveal unexpected insights
Because I don't want to have to
read everything
15. what are we aiming for?
what do NLTK corpora look like?
16. Getting NLTK
sudo apt-get install python-nltk
in Ubuntu 10.04
or
sudo apt-get install python-pip
pip install nltk
or
from source at nltk.org/download
21. Inaugural corpus
A Plaintext corpus:
My fellow citizens:
I stand here today humbled by the task before us,
grateful for the trust you have bestowed, mindful
of the sacrifices borne by our ancestors. I thank
President Bush for his service to our nation, as
well as the generosity and cooperation he has
shown throughout this transition.
Forty-four Americans have now taken the
presidential oath. ...............
24. BeautifulSoup to the rescue
>>> from BeautifulSoup import BeautifulSoup as BS
>>> data = open(filename,'r').read()
>>> soup = BS(data)
>>> print '\n'.join(soup.findAll(text=True))
27. What about blockquotes?
>>> bqs = s.findAll('blockquote')
>>> [bq.extract() for bq in bqs]
>>> print '\n'.join(s.findAll(text=True))
On 05/08/2007, at 12:05 PM, [...] wrote:
If u want it USB bootable, just burn the DSL boot disk to CD and fire it
up.  Then from the desktop after boot, right click and create the
bootable USB key yourself.  I havent actually done this myself (only
seen the option from the menu), but I am assuming it will be a fairly painless
process if you are happy with the stock image.  Would be interested in
how you go as I have to build 50 USB bootable DSL's in the next couple weeks.
Regards,
[...]
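The two slides above rely on BeautifulSoup. For readers who want to try the same idea without installing anything, here is a minimal sketch using only the standard library's html.parser, in Python 3 syntax. It is a swapped-in technique, not the code from the talk: the class name TextExtractor, the function html_to_text and the sample HTML are all made up for illustration. It keeps every text node except those inside a blockquote, which is roughly what extracting the blockquotes and then joining findAll(text=True) achieves.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <blockquote>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <blockquote> elements we are inside
        self.chunks = []  # text nodes kept so far

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every blockquote.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of `html`, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical message body, standing in for one archived mailing list page.
sample = (
    "<html><body><p>Would be interested in how you go.</p>"
    "<blockquote><p>quoted earlier message</p></blockquote>"
    "<p>Regards,</p></body></html>"
)
print(html_to_text(sample))
```

Dropping quoted replies matters for a mailing list corpus because otherwise the same sentences are counted once for every message that quotes them.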
29. Getting it into NLTK
import nltk
path = 'path/to/files'
corpus = nltk.corpus.PlaintextCorpusReader(path, '.*\.html')
30. What about our metadata?
Create a Python dictionary that maps filenames to categories
e.g.
categories = {}
categories['2008-12/msg00226.html'] = ['year-2008',
                                       'month-12',
                                       'author-BM<bm@xxxxx>']
....etc
then...
import nltk
path = 'path/to/files/'
corpus = nltk.corpus.CategorizedPlaintextCorpusReader(path,
    '.*\.html', cat_map=categories)
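The slide shows one entry of the category dictionary but not how the rest is filled in. One possible way to build it is sketched below in Python 3. It assumes every fileid looks like the single example shown (YYYY-MM/msgNNNNN.html) and that the author has already been pulled out of each message by some other means. The names FILEID_RE, categories_for and messages are hypothetical.

```python
import re

# Assumed fileid layout, inferred from the one example on the slide.
FILEID_RE = re.compile(r"^(\d{4})-(\d{2})/msg\d+\.html$")


def categories_for(fileid, author):
    """Derive year/month categories from the path and add the author."""
    match = FILEID_RE.match(fileid)
    if match is None:
        raise ValueError("unexpected fileid: %r" % fileid)
    year, month = match.groups()
    return ["year-" + year, "month-" + month, "author-" + author]


# Hypothetical (fileid, author) pairs; real ones would come from the archive.
messages = [
    ("2008-12/msg00226.html", "BM"),
    ("2009-01/msg00003.html", "JW"),
]
categories = {fileid: categories_for(fileid, author)
              for fileid, author in messages}
print(categories["2008-12/msg00226.html"])
```

Encoding the metadata as prefixed strings ('year-', 'month-', 'author-') is what lets the later slides pick out one kind of category with a simple startswith test.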
31. Simple categories
cats = corpus.categories()
authorcats=[c for c in cats if c.startswith('author')]
#>>> len(authorcats)
#608
yearcats=[c for c in cats if c.startswith('year')]
monthcats=[c for c in cats if c.startswith('month')]
32. ...who are the top posters?
posts = [(len(corpus.fileids(author)), author) for author in authorcats]
posts.sort(reverse=True)
for count, author in posts[:10]:
    print "%5d\t%s" % (count, author)
→
1304 author-JW
1294 author-RC
1243 author-CS
1030 author-JH
868 author-DP
752 author-TWB
608 author-CS#2
556 author-TL
452 author-BM
412 author-RM
(email me if you're curious to know if you're on it...)
33. Frequency distributions
popular =['ubuntu','debian','fedora','arch']
niche = ['gentoo','suse','centos','redhat']
def getcfd(distros,limit):
    cfd = nltk.ConditionalFreqDist(
        (distro, fileid[:limit])
        for fileid in corpus.fileids()
        for w in corpus.words(fileid)
        for distro in distros
        if w.lower().startswith(distro))
    return cfd
popularcfd = getcfd(popular,4) # or 7 for months
popularcfd.plot()
nichecfd = getcfd(niche,4)
nichecfd.plot()
another “NLTKism”
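To see what the conditional frequency distribution above is counting, here is a small standard-library analogue in Python 3. It is not NLTK's implementation, only a sketch of the same bookkeeping: each distro name is a condition, and the sample is the first `limit` characters of the fileid (4 gives the year, 7 the year and month). The function name conditional_counts and the toy documents are assumptions for illustration.

```python
from collections import Counter, defaultdict


def conditional_counts(docs, distros, limit):
    """Map each distro to a Counter of fileid prefixes where it is mentioned."""
    cfd = defaultdict(Counter)
    for fileid, words in docs.items():
        for w in words:
            for distro in distros:
                if w.lower().startswith(distro):
                    # Condition: the distro. Sample: year (or year-month).
                    cfd[distro][fileid[:limit]] += 1
    return cfd


# Toy stand-in for corpus.fileids() / corpus.words(fileid).
docs = {
    "2007-03/msg00001.html": ["Ubuntu", "and", "Debian", "packages"],
    "2008-12/msg00226.html": ["ubuntu", "ubuntu-server", "on", "Fedora"],
}
cfd = conditional_counts(docs, ["ubuntu", "debian", "fedora"], 4)
for distro, counts in sorted(cfd.items()):
    print(distro, sorted(counts.items()))
```

Plotting one line per condition against the sorted samples gives the mentions-per-year chart that cfd.plot() draws.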
37. Random text generation
import random
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print word,
        words = list(cfdist[word])
        word = random.choice(words)
text = [w.lower() for w in corpus.words()]
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
generate_model(cfd, 'hi', num=20)
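The generator above can be tried without a corpus or NLTK. The Python 3 sketch below follows the same idea under a few assumptions: it stores, for each word, the set of distinct words seen after it, then repeatedly picks one of those at random (uniformly, not weighted by frequency, which matches choosing from list(cfdist[word])). Unlike the slide code it stops when a word has no recorded successor instead of raising an error. The names build_successors and generate, the rng parameter and the toy sentence are all illustrative.

```python
import random
from collections import defaultdict


def build_successors(words):
    """Map each word to the set of distinct words that follow it."""
    successors = defaultdict(set)
    for first, second in zip(words, words[1:]):
        successors[first].add(second)
    return successors


def generate(successors, word, num=15, rng=random):
    """Walk the bigram table from `word`, returning up to `num` words."""
    out = []
    for _ in range(num):
        out.append(word)
        choices = sorted(successors.get(word, ()))
        if not choices:
            break  # dead end: this word was never followed by anything
        word = rng.choice(choices)
    return " ".join(out)


# Toy text standing in for the lowercased corpus words.
toy = "hi all i have a problem hi folks i have ubuntu".split()
table = build_successors(toy)
print(generate(table, "hi", num=8, rng=random.Random(0)))
```

Because each step looks only at the previous word, the output is locally plausible and globally nonsensical, which is exactly the effect on the next slide.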
38. hi...
hi allan : ages since apparently yum erased . attempts
now venturing into config run ip 10 431 ms 57
hi serg it illegal address entries must *, t close relative info
many families continue fi into modem and reinstalled
hi wen and amended :) imageshack does for grade service
please blame . warning issued an overall environment
consists in
hi folks i accidentally due cause excitingly stupid idiots ,
deletion flag on adding option ? branded ) mounting them
hi guys do composite required </ emulator in for
unattended has info to catalyse a dbus will see atz init3
39. hi from Peter...
text = [w.lower() for w in corpus.words(categories=
[c for c in authorcats if 'PeterL' in c])]
hi everyone , hence the database schema and that run on memberdb on mail
store is 12 . yep ,
hi anita , your favourite piece of cpu cycles , he was thinking i hear the middle
of failure .
hi anita , same vhost b internal ip / nine seem odd occasion i hazard . 25ghz
g4 ibook here
hi everyone , same ) on removes a "-- nicelevel nn " as intended . 00 . main
host basis
hi cameron , no biggie . candidates in to upgrade . ubuntu dom0 install if there
! now ). txt
hi cameron , attribution for 30 seconds , and runs out on linux to on www .
luv , these
40. interesting collocations
...or not
text = [w.lower() for w in corpus.words() if w.isalpha()]
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)
→
bufnewfile bufread
busmaster speccycle
cellx celly
cheswick bellovin
cread clocal
curtail atl
dmcrs rscem
dmmrbc dmost
dmost dmcrs
...
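Why does ranking by PMI surface pairs like these rather than everyday phrases? A rough Python 3 sketch of pointwise mutual information, using the textbook definition log2(P(x, y) / (P(x) P(y))) estimated from raw counts, makes the bias visible. Whether NLTK normalises its counts in exactly this way is an assumption here. The function name pmi_scores, the min_freq parameter and the toy word list are illustrative only.

```python
import math
from collections import Counter


def pmi_scores(words, min_freq=1):
    """Score each adjacent word pair by PMI, skipping pairs rarer than min_freq."""
    word_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    total = len(words)
    scores = {}
    for (w1, w2), n in bigram_counts.items():
        if n < min_freq:
            continue
        # log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated from counts
        scores[(w1, w2)] = math.log2(
            n * total / (word_counts[w1] * word_counts[w2]))
    return scores


toy = "the cat sat the cat ran the cat hid cread clocal".split()
scores = pmi_scores(toy)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 2))
```

In the toy text, 'cread clocal' occurs once and 'the cat' three times, yet the rare pair scores higher because its words never appear apart. That is why a frequency filter is applied first, and why the top of the list is still dominated by technical tokens that only ever occur together.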
41. oblig tag cloud
stopwords = nltk.corpus.stopwords.words('english')
words = [w.lower() for w in corpus.words()
if w.isalpha()]
words = [w for w in words if w not in stopwords]
word_fd = nltk.FreqDist(words)
wordmax = word_fd[word_fd.max()]
wordmin = 1000 #YMMV
taglist = word_fd.items()
ranges = getRanges(wordmin, wordmax)
writeCloud(taglist, ranges, 'tags.html')
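The helpers getRanges and writeCloud are not shown in the slides, so their behaviour is unknown. The Python 3 sketch below is one hypothetical way such helpers could work, under different names to avoid suggesting it is the original code. It splits the count range into equal-width buckets and emits one HTML span per word, with a CSS class for its bucket. Words below the minimum count are left out. The bucket count, class names and sample tags are all assumptions.

```python
from html import escape


def size_ranges(count_min, count_max, steps=5):
    """Split [count_min, count_max] into `steps` equal-width buckets."""
    width = (count_max - count_min) / steps
    return [(count_min + i * width, count_min + (i + 1) * width)
            for i in range(steps)]


def bucket(count, ranges):
    """Return the bucket index for `count`, or None if it is below the minimum."""
    for i, (low, high) in enumerate(ranges):
        if low <= count < high:
            return i
    # Counts at or above the top edge go in the largest bucket.
    return len(ranges) - 1 if count >= ranges[-1][1] else None


def render_cloud(tags, ranges):
    """Return HTML spans for (word, count) pairs, alphabetically ordered."""
    spans = []
    for word, count in sorted(tags):
        size = bucket(count, ranges)
        if size is None:
            continue  # too rare to show
        spans.append('<span class="size%d">%s</span>' % (size, escape(word)))
    return " ".join(spans)


# Hypothetical (word, count) pairs, as word_fd.items() might supply.
tags = [("linux", 110), ("ubuntu", 35), ("rare", 3)]
print(render_cloud(tags, size_ranges(10, 110)))
```

A stylesheet would then map size0 through size4 to increasing font sizes.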
42.
43. another one for Peter :)
cats = [c for c in corpus.categories()
if 'PeterL' in c]
words=[w.lower() for w in corpus.words(categories=cats)
if w.isalpha()]
wordmin = 10
→