  • I’d like to suggest universal access to all human knowledge IS within our grasp. What role will you play in it?

Presentation Transcript

  • Archiving the Web: Challenges on the Horizon
    • NLA Innovative Ideas Forum
    • Gordon Mohr, Internet Archive
    • April 9, 2008
  • Overview
    • About the Internet Archive, Heritrix, and automated web harvesting
    • Some big challenges
    • Speculative approaches
      • “what *might* be”
  • The Internet Archive and current web harvesting
  • What is the Internet Archive?
    • Founded in 1996 by Brewster Kahle
    • An independent digital library
    • Purpose: facilitate universal access to all human knowledge
    http://www.archive.org
  • What is the Internet Archive?
    • Non-Profit
      • mission of community service
      • no shareholders or profits allowed
      • tax benefits
    • Funded by…
      • donations by individuals & foundations
      • public grants and contracts
  • Based in San Francisco, California, USA since 1996, 2 km from the Golden Gate Bridge
  • In 2008: 2 little houses in a park
  • What does Internet Archive do?
    • Originally: Archive the public web
      • crawling and storing (since 1996)
        • partnership with Alexa Internet
      • offering public access (since 2001)
        • “the Wayback Machine”
      • collaborating on open tools (since 2003)
  • Internet Archive’s other collections
    • Audio
      • Concerts
      • ‘NetLabels’
    • Video
      • Prelinger collection
      • User content
    • Educational courseware
    • Digitized books
  • Open Content Alliance
    • Large-scale digitization for open access
    • Library & heritage institution partners
      • Boston Library Consortium, Library of Congress (LoC), British Library (BL), University of California, many others
    • Foundation and corporate sponsorship
      • Sloan Foundation, Yahoo, Microsoft
    http://www.opencontentalliance.org
    • > 100 billion captures (URL+datetime)
    • > 1 petabyte compressed
    • > 10 years (1996-)
    • Collects anything accessible to public
    • Obeys ‘robots.txt’ restrictions
    • Respects rightsholder/site-owner takedown requests
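
As a minimal sketch of what obeying ‘robots.txt’ can look like inside a harvester, the following uses Python's standard `urllib.robotparser`. The user-agent name `example-archiver` and the sample rules are assumptions made for illustration; they are not the Internet Archive's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site (illustration only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(url: str, user_agent: str = "example-archiver") -> bool:
    """Return True if the sample rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())  # parse rules without any network access
    return parser.can_fetch(user_agent, url)

print(allowed("http://example.org/index.html"))
print(allowed("http://example.org/private/diary.html"))
```

A real crawler would fetch each site's own robots.txt and check every candidate URL before collecting it; here the rules are embedded so the example runs offline.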
  • Bibliotheca Alexandrina, Egypt: mirror deliveries in 2002, 2006
  • Web Archiving Partners
  • Web Harvesting 101
    1. Collect a page/resource (URL)
    2. Examine it for references to other pages/resources
    3. Add those to the list to be collected
    4. Go to (1)
    • For simple hypertext from honest sources, this works great
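
The four harvesting steps above can be sketched as a short loop. This is a toy illustration, not Heritrix: the in-memory `FAKE_WEB` dictionary, the `LinkExtractor` class, and the `harvest` function are all hypothetical names, and a dictionary lookup stands in for real HTTP fetching.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# A tiny in-memory "web" standing in for real HTTP fetches (hypothetical pages).
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/">home</a>',
}

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def harvest(seed, fetch=FAKE_WEB.get):
    """Breadth-first harvest from seed; returns URLs in collection order."""
    frontier = deque([seed])  # the list of URLs still to be collected
    seen = {seed}             # avoids collecting the same URL twice
    collected = []
    while frontier:                    # step 4: go back to step 1
        url = frontier.popleft()
        body = fetch(url)              # step 1: collect the resource
        if body is None:
            continue                   # nothing to collect at this URL
        collected.append(url)
        extractor = LinkExtractor()
        extractor.feed(body)           # step 2: examine for references
        for href in extractor.links:
            target = urljoin(url, href)
            if target not in seen:     # step 3: add new URLs to the list
                seen.add(target)
                frontier.append(target)
    return collected

print(harvest("http://example.org/"))
```

The `seen` set is what keeps the loop from running forever on pages that link back to each other. The challenges in the rest of the talk are cases where step 2 finds nothing useful, or finds far too much.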
  • Emerging challenges: Where simple harvesting fails
  • Emerging Challenges
    • Professional Web Spam & Malware
    • Desktop-like Interactive Applications
      • (“AJAX” and “Web 2.0” style)
    • Social Networks & Applications
    • Virtual Worlds
      • (Second Life, World of Warcraft, etc.)
  • Challenge: Professional Web Spam and Malware
  • Professional Web Spam
    • Problem: Manipulating web users and search engines makes $$$
    • “Eyeballs” == ad revenue
      • “made for AdSense” pages
    • High SERP position == traffic, profit
    • Trick users into malware installs, identity compromise == low-risk theft
  • Web Spam: A Problem of Proportion
    • Can become a criminal – or just shady – multi-millionaire
    • Arms race between well-motivated bad actors and well-funded commercial engines
    • We’re tiny bystanders, relatively
      • Collateral damage in costs, average quality
      • Synthetic content can swamp real content
      • We want to archive a little of it – but not all
  • Commercial countermeasures?
    • Secret algorithms
    • Big paid editorial/quality staff
    • Massive clickstream data
      • “anonymized” ISP logs
      • Toolbars & analytics
  • Can public harvests match?
    • Secret algorithms
    • Big paid editorial/quality staff
    • Massive clickstream data
      • “anonymized” ISP logs
      • Toolbars & analytics
    volunteer/community effort? public-interest toolbar/analytics?
  • Community Web Markup
    • Wikipedia/open source model
      • “many eyes” make bugs (and fraud) visible
    • Public interest & smaller commercial operators share interests
      • Vertical crawls and search engines
      • “Wikia” Wikipedia-style web search
      • Internet Archive & IIPC crawls
    • Markup of “neighborhoods” by spamminess, other indicators
    • Prediction: very likely
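
One way to picture community markup of “neighborhoods” is to pool volunteer votes per host and report the fraction that say “spam”. This is only a sketch under stated assumptions: treating a hostname as the “neighborhood”, the `spamminess_by_host` function, and the sample votes are all hypothetical, and a real system would also need to weigh voter trust and resist manipulation.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def spamminess_by_host(labels):
    """labels: iterable of (url, is_spam) votes. Returns {host: fraction of spam votes}."""
    spam = defaultdict(int)
    total = defaultdict(int)
    for url, is_spam in labels:
        host = urlsplit(url).hostname  # assumption: host == "neighborhood"
        total[host] += 1
        if is_spam:
            spam[host] += 1
    return {host: spam[host] / total[host] for host in total}

# Hypothetical volunteer votes on individual pages.
votes = [
    ("http://spam.example/buy-now", True),
    ("http://spam.example/cheap", True),
    ("http://spam.example/about", False),
    ("http://library.example/catalog", False),
]
scores = spamminess_by_host(votes)
print(scores)
```

A crawler could use such scores to spend less of its budget in high-scoring neighborhoods while still sampling them, in keeping with wanting a little of the synthetic content but not all of it.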
  • Public-interest Toolbar & Analytics
    • Critical mass of stakeholders?
      • Toolbar like any other
      • Opt-in ‘web bug’ for participating sites
    • Big privacy issues:
      • Trusted aggregator
      • Peer blinding
        • “Crowds” research
    • Prediction: long-shot, but possible
      • the required anonymity gives bad actors room
  • Commercial Malware
    • Depressing summary:
      • “The Commercial Malware Industry”
      • Peter Gutmann, University of Auckland
      • http://www.cs.auckland.ac.nz/~pgut001/pubs/malware_biz.pdf
      • p. 66: “What Should I Do? (Non-geeks)
      • Put your head between your legs and kiss …”
  • Malware – impact on Web Archives
    • Again: we want a little
      • Accurate view of dangerous web
      • Important for researchers, legal prosecution
    • But: we must not put our users at risk
    • For now: accessing web archives can be as dangerous as the live web
  • Malware – possible answers
    • Archive – and retroactively apply – today’s blacklists
      • E.g.: Google “Safe Browsing”
    • But how about treating as a biohazard?
      • 1919 flu
      • Can we give users a ‘containment lab’ for looking at old hazards?
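
Retroactively applying a current blacklist to old captures could look roughly like the sketch below. It does not call Google “Safe Browsing” or any real service: the `BLACKLIST_TODAY` set, the sample captures, the timestamp strings, and the `flag_captures` function are assumptions for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist snapshot: hosts flagged as dangerous today.
BLACKLIST_TODAY = {"malware.example"}

# Hypothetical archived captures as (url, capture timestamp) pairs.
CAPTURES = [
    ("http://news.example/story", "19990412000000"),
    ("http://malware.example/free-codec", "20010203000000"),
]

def flag_captures(captures, blacklist):
    """Return (url, timestamp, flagged) triples, matching each capture's host against the blacklist."""
    return [
        (url, timestamp, urlsplit(url).hostname in blacklist)
        for url, timestamp in captures
    ]

for url, timestamp, flagged in flag_captures(CAPTURES, BLACKLIST_TODAY):
    print(timestamp, url, "WARN" if flagged else "ok")
```

Flagging rather than deleting keeps the dangerous capture available to researchers while letting the access layer decide whether to show a warning or route the viewer into a contained environment.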
  • Sidebar: Virtualization
    • Broadly, “the abstraction of computer resources”
    • I prefer “turning hardware into software”
      • pretend “virtual machine”
      • can be ancient, unrelated system
        • Atari or Nintendo on modern PCs (gaming)
        • x86 DOS in any Java VM (JPC, Dioscuri)
      • can be for current convenience
        • Alternate OS on desktops (VMware, Parallels)
        • Modern desktop PCs on modern servers (VMware, Citrix)
        • Standardized servers on varying datacenters (EC2, 3Tera, etc.)
      • A full system becomes a relocatable “image”
  • Virtualization Against Malware
    • Disposable “sandbox”
      • User should still be wary
      • Total reset every session
    • Added benefit: veracity to target era
      • E.g.: construct an image of a “typical 1998 desktop”
      • Perfect viewer – and bug – compatibility
      • Nostalgia? Pity? Terror?
  • Challenge: Desktop-like Interactive Applications (“AJAX” and “Web 2.0” style)
  • Interactive Web Applications
    • Every gesture may generate reaction
      • Javascript
      • AJAX – server contact without full page reload
      • Flash and applets
    • Server-side does more than document provision
      • may span many opaque systems
      • nothing like an install disk to save
    • Often highly personalized experience
  • Interactive Web Applications
    • One Bright Spot:
    • Search Engine Optimization motivates alternate paths
      • Search bots the “third browser”
      • Simple link paths, text (e.g.: sitemaps)
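
Sitemaps are one such alternate path: a plain XML list of URLs that a harvester can read without running any JavaScript. The sketch below parses an embedded sample in the sitemaps.org 0.9 format using Python's standard library; the sample URLs and the `sitemap_urls` function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap for an interactive site (illustration only).
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/</loc></url>
  <url><loc>http://example.org/app/item/42</loc></url>
</urlset>
"""

# Namespace prefix mapping used when searching the parsed tree.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return the <loc> URLs listed in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

print(sitemap_urls(SITEMAP_XML))
```

Each returned URL can be fed straight into a crawler's list of things to collect, which sidesteps the need to simulate clicks for any content the site owner chose to expose this way.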
  • Possible Approach: Androids
    • From crawler robots to “androids”
      • Act more like human users
      • Detect free registration, login!
      • Simulate mouseovers, in-page-clicks
    • Problems:
      • Risk of backlash
      • Hard to scale beyond simple
  • Possible Approach: Shoulder-Surfing
    • Have volunteers allow software to watch their interaction
    • Record at either protocol level, or as simple screen movie
      • “screencasting”
    • Might also be considered an “automated documentary”
    • At the extreme: lifelogging
  • Aside: Lifelogging
    • Record your entire life
      • “MyLifeBits”
      • Gordon Bell, MS Research
      • http://research.microsoft.com/barc/mediapresence/MyLifeBits.aspx
    • Desktop-logging trivial in comparison
      • fallback for any hard-to-archive digital content
  • Challenge: Social Networks/Applications
  • Social Networks/Applications
    • Only a fraction of the content is visible to anonymous visitors
    • Almost all views are personalized
    • High expectations of privacy
    • “Walled garden”: a feature and a challenge
  • Social Networks: Valid Target?
    • Previously open activity moving into private areas
      • Utility of friend networks
      • Retreat from spam/anonymous hostility
      • Replacing homepages, events, organization pages that were archivable
    • Still subject to same decay
      • Companies change, fail, don’t value preservation
  • Social Networks: Approaches
    • Again: the Android harvester
      • Register and login
      • Get *some* of internal experience – without tripping privacy concerns (yet)
  • Would you accept this request?
    • Next: the Android “friend”
      • Opt-in archiving by “friending” the robot
      • Makes archived versions subject to network’s own privacy controls
  • Social Networks: Approaches
    • Finally: the Android assistant
      • In-network application or client-side recorder
      • “shoulder-surfing” / “auto-documentary” / “lifelogging” with permission
      • Initially, appropriate for private archive
      • But: these are the private correspondences of today’s generation – some will be donated to public archives, and of intense interest to researchers
  • Challenge: Virtual Worlds
  • Virtual Worlds
    • Are they important?
      • Replacing other forums/communication
      • Displacing other mass entertainment
        • Games, TV, Sports
      • Unique art and architecture
      • New narratives and experiences
    • At the very least:
      • a new pop/children’s literature
  • Virtual Worlds: Challenges
    • Very unlike other media
      • Interactive – other participants’ actions mean endless possibilities
      • Not even much like the web itself – no discrete documents, links
      • “Install disks”/client software alone are of little value
        • (somewhat like Interactive/AJAX apps case)
    • A world data dump from host organization?
      • Empty, sterile
  • Virtual Worlds: In-world Android
    • Privacy concerns
      • Like security cameras?
      • Get world operator/community approval
      • Limit android range to…
        • Declared areas
        • Public spaces or invited areas
        • Events
    • Follow people/events of interest
    • “Automated SecondLifeLogging”
  • Virtual Worlds: Shoulder Surfing
    • Optional recording assistant/“familiar”
    • Again, privacy/disclosure/expectations are important
    • Autodocumentary of virtual world
  • A final thought…
    • Is all this recording of virtual world activity unprecedented?
  • Thank You
    • Gordon Mohr
    • Internet Archive Web Group
    • [email_address]