1. scraping, scripting and hacking your way to API-less data
[AKA: if you don’t have data feeds, we’ll get it anyway]
Image credit: http://www.flickr.com/photos/juan23/82888194/
2. overview
• “getting data out”
• non-exhaustive (and rapid!)
• slightly random
• live examples (hopefully)
• mainly non-technical(ish)
• mainly non-illegal. I think.
3. anything goes
• have no fear!
• feel no remorse!
• be shameless!
• long live the open data revolution!
5. me
• not really a developer
• ..but code enough ASP (stop giggling) to do what I want to do
• slides will be at slideshare.net/dmje
• www.electronicmuseum.org.uk
• mike.ellis@eduserv.org.uk
6. we <3 data
• we want programmatic access...
• ...but sites are often lacking
• ...and APIs are usually a pipe dream
http://www.ucas.com/instit/i/h60.html
http://unicorn.lib.ic.ac.uk/uhtbin/opac/webcentral
7. scraping
• copy & paste, without having to copy & paste...
• an inexact but really rather beautiful science
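' Classic ASP (VBScript): fetch a page server-side; url holds the address to scrape
Set xmlhttp = Server.CreateObject("MSXML2.ServerXMLHTTP.4.0")
Call xmlhttp.Open("GET", url, False) ' False = synchronous request
Call xmlhttp.send
ReturnedXML = xmlhttp.responseText ' the page source, as a string
Set xmlhttp = Nothing ' release the object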
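If you don't have a classic ASP server to hand, the same fetch can be sketched with Python's standard library. This is a rough equivalent rather than the code used in the talk: `fetch_page` is a made-up helper name, and the `data:` URL is only there so the example runs without a network connection.

```python
from urllib.request import urlopen


def fetch_page(url, encoding="utf-8"):
    """Return the body of `url` as text (hypothetical helper, not from the talk)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode(encoding, errors="replace")


# A data: URL keeps the demo offline; swap in a real http:// address to scrape.
page_source = fetch_page("data:text/plain,hello%20world")
```

Whatever comes back is just a string, which is all the extraction and munging steps need.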
9. extraction #1: Y!Pipes
• find your data on page
• view source
• determine the delimiters
• put it into Pipes
• extract the output
originating page | output
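The "determine the delimiters" step is the heart of it, and the same idea works in a few lines of script without Pipes. A minimal sketch in Python, assuming you have already spotted the strings that sit either side of your data; `extract_between` and the sample markup are invented for illustration.

```python
def extract_between(source, start, end):
    """Return every chunk of `source` found between `start` and `end`."""
    chunks = []
    position = 0
    while True:
        begin = source.find(start, position)
        if begin == -1:
            break
        begin += len(start)
        finish = source.find(end, begin)
        if finish == -1:
            break
        chunks.append(source[begin:finish].strip())
        position = finish + len(end)
    return chunks


# Invented page fragment, standing in for what "view source" would show.
sample = '<li class="course">Physics</li><li class="course">Chemistry</li>'
courses = extract_between(sample, '<li class="course">', "</li>")
```

It is inexact by design: if the site changes its markup, the delimiters need finding again.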
10. extraction #2: Google Docs
• create a new google spreadsheet
• find the URL of the data you want
• identify how it is encapsulated (list/table)
• use the importHTML() function (others for feeds, xml, data, etc)
• dump out data as...CSV/XML/RSS/etc
originating page | output
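Outside a spreadsheet, the "table in, CSV out" step can be approximated with Python's standard library. This is a sketch of the idea only, not of how Google's function works internally; `TableRows`, `table_to_csv` and the sample table are invented names and data.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Turn the table rows found in `html` into CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Invented table, standing in for the one on the originating page.
sample = "<table><tr><th>Name</th><th>Code</th></tr><tr><td>Alpha</td><td>1</td></tr></table>"
csv_text = table_to_csv(sample)
```

This version lumps every table on the page together; picking one out by position is left as an exercise.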
11. extraction #3: dapper.net
• go to dapper.net/open
• identify several of the URLs with the same “shapes” that you want to scrape
• use the dapper dashboard to identify content areas
• build the “dapp”
• pass in URLs of pages you want to extract data from
• extract results from the output (xml, flash, csv, etc)
originating page | output
12. extraction #4: YQL
• view source on the page you want to grab
• go to http://developer.yahoo.com/yql/console/
• get your XPath hat on and build a query
• grab the data from a RESTful query
http://developer.yahoo.com/yql/console/?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fopenlibrary.org%2Fsearch%3Fq%3Dkeri%2Bhulme%22%20and%20xpath%3D%27%2F%2Fa%5B%40class%3D%22result%22%5D%27
originating page | output
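That console URL is easier to read once decoded: it is a YQL statement with a page address and an XPath expression, percent-encoded into the `q` parameter. A small Python sketch of building it; the variable names are mine, and nothing here calls the service.

```python
from urllib.parse import quote, unquote

page = "http://openlibrary.org/search?q=keri+hulme"
xpath = '//a[@class="result"]'

# The statement the slide's URL carries, written out in plain text.
query = 'select * from html where url="{}" and xpath=\'{}\''.format(page, xpath)

# Percent-encode everything except "*", matching the URL on the slide.
encoded = quote(query, safe="*")
console_url = "http://developer.yahoo.com/yql/console/?q=" + encoded

# Decoding goes the other way, handy for reading someone else's query.
decoded = unquote(encoded)
```

Swapping `page` and `xpath` for your own values gives a new query without hand-encoding anything.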
13. extraction #5: httrack
• grab a copy of httrack (or similar) from http://www.httrack.com/
• point it at the bit of the site you want, make sure the filters are correct, and push go...
• you now have a local copy of the site, to munge as you see fit
14. extraction #6: hacked search
• get an API key from Yahoo!
• use it to search within a domain
• write a standard download script to pick out each page and download it
• hack that mumma
• (variation on a theme: build a simple spider...)
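The "simple spider" variation mostly comes down to one job: pull the links out of a page and keep the ones that stay on the same site. A minimal Python sketch of that step, with invented names (`LinkCollector`, `same_site_links`) and invented sample markup; it does no downloading itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_links(html, base_url):
    """Return absolute, de-duplicated links that stay on base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href).split("#")[0]  # drop fragments
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


sample = (
    '<a href="/about">About</a> '
    '<a href="http://other.example/x">Elsewhere</a> '
    '<a href="/about#team">Team</a>'
)
links = same_site_links(sample, "http://site.example/index.html")
```

A spider then fetches each returned link, runs the same function on it, and keeps a list of pages already visited so it knows when to stop.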
15. now you’ve got your data..
• once you’ve got your data, you usually need to munge it...
17. munging #2: find/replace
• use whatever scripting language you work best with
• (even Word...)
• you’ll find that replacing double spaces, weird characters and paragraph marks are about the most common needs
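Those three replacements fit in one small function. A sketch in Python, assuming "weird characters" means things like non-breaking spaces and curly quotes, and "paragraph marks" means Windows or old Mac line endings; the mapping is illustrative, so extend it for whatever your source throws at you.

```python
import re

# Characters to swap for plain equivalents (illustrative, not exhaustive).
WEIRD_CHARACTERS = str.maketrans({
    "\u00a0": " ",   # non-breaking space
    "\u201c": '"',   # left curly double quote
    "\u201d": '"',   # right curly double quote
    "\u2018": "'",   # left curly single quote
    "\u2019": "'",   # right curly single quote
})


def clean_text(text):
    """Apply the three common find/replace passes to `text`."""
    text = text.translate(WEIRD_CHARACTERS)   # replace weird characters
    text = re.sub(r"\r\n?", "\n", text)       # normalise paragraph marks
    text = re.sub(r" {2,}", " ", text)        # collapse double spaces
    return text


cleaned = clean_text("A\u00a0 \u201cquote\u201d\r\nnext  line")
```

The order matters a little: non-breaking spaces are converted first so that they get collapsed along with ordinary double spaces.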
18. munging #3: mail merge!
• for rapid builds of html, javascript or xml
• have a source document (often extracted or munged from other sites) in Excel
• you can use filters to effectively grab the data you need
• build the merge in Word, using the “directory” option
• copy and paste the result out
• copy and paste the result out
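If Word and Excel aren't to hand, the same merge can be sketched in Python: one template line, one row of data per output line. The column names, template and sample rows below are invented for illustration, with the spreadsheet standing in as CSV text.

```python
import csv
import io
from html import escape
from string import Template

# One line of output per source row; $title and $url are column names.
ROW_TEMPLATE = Template('<li><a href="$url">$title</a></li>')


def merge(csv_text, template=ROW_TEMPLATE):
    """Fill `template` once per CSV row and join the results."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(
        template.substitute({key: escape(value) for key, value in row.items()})
        for row in rows
    )


# Invented data, standing in for a sheet exported from Excel.
source = "title,url\nFish & chips,http://example.org/a\nTea,http://example.org/b\n"
html_list = merge(source)
```

Escaping each value on the way in is the one thing the Word route won't do for you, so stray ampersands are handled here.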
19. munging #4: html removal
• have a function handy that you can pass a block of html to
• it is handy to have a script where you can define which particular tags to remove or leave in place
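A minimal sketch of such a function in Python, built on the standard library's `HTMLParser`. The names `TagStripper` and `strip_tags` and the `keep` parameter are my own: tags listed in `keep` survive with their attributes, everything else is dropped and its text kept.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Drop every tag except those named in `keep`; always keep the text."""

    def __init__(self, keep=()):
        super().__init__(convert_charrefs=False)
        self.keep = {name.lower() for name in keep}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written.
        if tag in self.keep:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&{};".format(name))   # leave &amp; etc. untouched

    def handle_charref(self, name):
        self.parts.append("&#{};".format(name))


def strip_tags(html, keep=()):
    """Return `html` with all tags removed except those in `keep`."""
    stripper = TagStripper(keep)
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)


sample = '<p>Hello <b>big</b> <a href="/x">world</a> &amp; co</p>'
text_only = strip_tags(sample)
links_kept = strip_tags(sample, keep=("a",))
```

Note that the contents of script and style blocks count as text here, so strip those separately if your pages have them.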
20. munging #5: html tidy
• grab a copy of html tidy from http://tidy.sourceforge.net/
• tidy is available as a downloadable .exe or a component that you can pass data to in your code
21. processing #1: Open Calais
• a service from Reuters for analysing blocks of text for semantic “meaning”
• get an API key from Open Calais
• send data via a POST to the REST service
• retrieve results from the RDF
• OR...just paste your text into http://sws.clearforest.com/calaisviewer/
output
22. processing #2: Yahoo! TE
• a webservice for grabbing tags/terms from blocks of text
• sign up for a Yahoo! API key
• pass your block of text using POST
• grab the results..
output