Laura Welcher - The Rosetta Project and The Language Commons
 

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

    Laura Welcher - The Rosetta Project and The Language Commons: Presentation Transcript

    • The Rosetta Project: Building a 10,000 Year Library of All Human Language
    • A bit of background…
    • The 10,000 Year Clock: “I want to build a clock that ticks once a year. The century hand advances once every one hundred years, and the cuckoo comes out on the millennium.” - Danny Hillis
    • Prototype 1
    • Clock Mountain
    • The 10,000 Year Library: “The Clock dramatizes the scope of historic time past and to come but offers no content. The Library is all content, especially past content with future significance…The value could lie in providing civilizations with a wisdom line: slow, robust, apparently inefficient.” - Stewart Brand, from “Clock/Library” in The Clock of the Long Now
    • Library Projects: A Responsibility Record
    • Library Projects: All Species
    • Library Projects: Time & Bits. “…We risk creating a Digital Dark Age – a void in the continuity of cultural record – because the formats and hardware which we entrust with our data are unlikely to outlast even the next ten years, much less our own lives.” - Danny Hillis
    • Strategies That Improve Data Longevity
        • For starters, expand your scope: aim for at least 500 years (can you do better than paper?)
        • Use it or lose it – unused data dies
        • Provide access – promotes use, reuse, LOCKSS
        • Consider saving everything (e.g. Internet Archive)
        • Move it or lose it (“Movage”)
        • Consider atoms over bits (analog)
    • Library Projects: Long Server
    • Library Projects: The Rosetta Project
        • Thousands of years ago we stored information on stone tablets – some of these are still around.
        • Hundreds of years ago we stored information in books – print on acid-free paper can reliably be preserved 500 years.
        • Now we store information digitally, using hardware, software and encodings that are highly ephemeral.
    • The Rosetta Disk (One Possible Solution)
    • Microscopic Analog Data Storage
    • Microetched Pages
    • Human Eye Readable Side
    • Parallel Content in Multiple Languages
        • The Rosetta Stone includes a decree of the divine cult of King Ptolemy V, carved in 196 BC
        • Same text written in three different forms: Egyptian Hieroglyphs, Demotic (Early Egyptian Script preceding Coptic), and Ancient Greek
        • Working back from the Greek and the somewhat known Demotic, scholars were able to decipher the Hieroglyphs – thereby unlocking records of an entire ancient civilization
    • Rosetta Disk Goal - Parallel content for all languages
        • Vocabulary
        • Maps
        • Sound Structure
        • Writing Systems
        • Word and Sentence Structure
        • Ethnographic Information
        • Parallel texts
        • Numbering Systems
        • Other texts
        • Color Systems
    • Building the Collection: Book Scanning
    • Building the Collection: Swadesh Wordlists
    • Audio Digitization
    • Google Earth Interface
    • “Born Digital” Materials: Endangered Language Documentation Project
    • 6 First Edition Disks
        • Brewster Kahle, Internet Archive
        • Charles Butcher, Lazy 8 Foundation – now in the permanent special collection of the University of Colorado Boulder Library
        • William Lidwell, author of Universal Principles of Design
        • Oliver Wilke – Oliver Wilke Stiftung für Sprachen
        • One is held by an anonymous donor, and one is in the Long Now Museum
    • 02004 Rosetta European Space Agency Mission
    • Rosetta Disk Museum Edition: In August 02009 we presented the prototype of the Rosetta Disk Museum Edition to Secretary Wayne Clough for the Smithsonian.
    • Endangered Languages: “The coming century will see either the death or doom of 90% of mankind’s languages” - Michael Krauss
    • Top Ten Languages by Native Speakers (Millions) [bar chart]: Mandarin, Spanish, English, Bengali, Hindi, Portuguese, Russian, Japanese, German, Javanese. Data: The Ethnologue (02009), available at www.ethnologue.com
    • Language Distribution [chart; vertical axis marked 1 Billion, 100 Million, 10 Thousand; horizontal axis: Number of Languages]
        • Half the world population speaks one of 10 languages (<1% of languages)
        • Most everyone else speaks one of 300 languages (4%)
        • 5% of the world speaks one of 6,500 languages (95%)
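
The percentages of languages on the Language Distribution slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the tier names and the `language_shares` helper are made up for this example, and the 45% population figure for the middle tier is an assumption inferred from "most everyone else" (the slide states only the 50% and 5% figures).

```python
# Tiers as read off the slide: (label, number of languages, % of world population).
# ASSUMPTION: the 45% for the middle tier is inferred, not stated on the slide.
TIERS = [
    ("top 10", 10, 50.0),
    ("next 300", 300, 45.0),
    ("long tail", 6500, 5.0),
]


def language_shares(tiers):
    """Return each tier's share of the total language count, in percent."""
    total = sum(count for _, count, _ in tiers)
    return {label: 100.0 * count / total for label, count, _ in tiers}


shares = language_shares(TIERS)
for label, count, population in TIERS:
    print(
        f"{label:>9}: {count:>5} languages = {shares[label]:.1f}% of languages, "
        f"about {population:.0f}% of speakers"
    )
```

With these inputs the ten largest languages come out well under 1% of all languages, while the long tail holds roughly 95% of languages but only about 5% of speakers.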
    • Why does it matter?
    • Languages are... Great Works of Art!
    • Languages are... Great Libraries!
    • Languages are “How to” guides for Living on Planet Earth
    • Languages Provide a window into our minds
    • Freedom of Language - an inalienable human right. Individually you have:
        • The right to be recognized as a member of a language community
        • The right to use your language in private and in public
        • The right to use your own name
        • The right to interrelate and associate with your native speech community
        • The right to maintain and develop your own culture
    • Freedom of Language - an inalienable human right. Collectively your speech community has:
        • The right for your own language and culture to be taught
        • The right of access to cultural services
        • The right to an equitable presence of your language and culture in the communications media
        • The right to receive attention in your own language from government bodies and in socioeconomic relations
      From the Universal Declaration on Linguistic Rights, Barcelona, June 1996
    • Rosetta Project: Long Now, Here & Now
    • Open Digital Collection on All Human Languages
    • Rosetta Special Collection in the Internet Archive
    • Rosetta Language Base – Linguistic Metastructure
        • Freebase: over 10,000 languages and linguistic entities linked by language family relationship
        • All data is linked to other kinds of data in Freebase
        • We have rectified ~1,500 Wikipedia pages about human languages to our data set
    • Rosetta Prototype Wiki
    • New Initiative: The Language Commons
    • The Language Commons Working Group
    • Language Commons Goals:
        • To scale the amount of open language data (PD/CCZero to GPL to CCNC-BY to MIT/BSD)
        • To seek the participation of holders of language data including publishers, corporations, and authors (including web authors), funders of research that generates language data, and the institutes, researchers, and projects who are themselves creating and/or curating language data.
        • To build open and available language data resources to further research, development, and global access to knowledge
        • To help preserve and promote endangered languages
    • Language Commons Participants
        • Translate.org, Meedan.net, Miro Project, Rosetta Project / Long Now Foundation, the Kamusi Project, Rosetta Foundation (translation service organization in Ireland), Fostering Language Resources Network (FLaReNet), European Language Resources Association (ELRA), The Berkman Center for Internet and Society
        • Bibliotheca Alexandrina, Berkman Center for Internet and Society, IBM Watson Language Group, Center for Research in Computational Linguistics, King Abdullah’s Initiative for Arabic Content, International Development Research Center (Canada)
        • Saint Louis University, University of Melbourne, University of Michigan, Vassar, Universitat d’Alacant, University of Edinburgh, University of Pittsburgh, University of Pennsylvania, Eastern Michigan University, Tufts University
    • Language Distribution [chart; vertical axis marked 1 Billion, 100 Million, 10 Thousand; horizontal axis: Number of Languages]
        • Half the world population speaks one of 10 languages (<1% of languages)
        • Most everyone else speaks one of 300 languages (4%)
        • 5% of the world speaks one of 6,500 languages (95%)
    • Want to use your language in the digital domain?
        1. Is there a writing system for your language?
            a. Yes: Continue to (2).
            b. No: But you can still talk on your mobile phone, and post YouTube videos of yourself and your friends. Note you will need to type alphanumeric text (or use voice commands) in another more widely used language.
    • Want to use your language in the digital domain?
        2. Is there a unique identifier (ISO 639 code) for your language?
            a. Yes: Continue to (3).
            b. No: Bummer. Go back to (1).
    • Want to use your language in the digital domain?
        3. Is your writing system in Unicode?
            a. Yes: Congratulations! Your script is now supported in the essential architecture of the digital domain.
            b. No: Bummer. Either create one by adapting a supported script, or build a proposal to get your script/unique characters supported in Unicode (contact the Script Encoding Initiative for help on this). Go back to (1).
    • Want to use your language in the digital domain?
        4. Do you have a large corpus of natural texts – written and spoken?
            a. Yes: Congratulations! You must be a speaker of a very economically powerful language. You continue to grow these corpora as you interact online every day (email, internet searches, SMS texts, depending somewhat on which ones you use) – and the services based on them keep getting better for you – natural language search, machine translation, speech recognition, etc.
            b. No: Bummer. Go back to (3). You and billions of others are in the same circumstance. Many give up and simply use a mainstream language in the digital domain.
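
The four questions above form a simple decision procedure, which can be sketched in code. This is a minimal illustration, not anything from the talk: `LanguageStatus`, `first_blocker`, `describe` and the field names are hypothetical, and the four yes/no answers are assumed to be supplied by hand rather than looked up from ISO 639 or Unicode data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LanguageStatus:
    """Hypothetical record of the four yes/no answers from the slides."""
    has_writing_system: bool   # question 1
    has_iso639_code: bool      # question 2
    script_in_unicode: bool    # question 3
    has_large_corpus: bool     # question 4


# Where each slide sends you on a "No" answer; question 1 has no earlier step.
GO_BACK_TO = {2: 1, 3: 1, 4: 3}


def first_blocker(status: LanguageStatus) -> Optional[int]:
    """Return the number of the first question answered "No", or None."""
    answers = [
        status.has_writing_system,
        status.has_iso639_code,
        status.script_in_unicode,
        status.has_large_corpus,
    ]
    for step, answer in enumerate(answers, start=1):
        if not answer:
            return step
    return None


def describe(status: LanguageStatus) -> str:
    """Summarize where a language stands in the four-question sequence."""
    step = first_blocker(status)
    if step is None:
        return "All four checks pass."
    if step in GO_BACK_TO:
        return f"Blocked at step {step}; go back to step {GO_BACK_TO[step]}."
    return f"Blocked at step {step}."


# A language with a script, an ISO 639 code and Unicode support, but no large corpus.
example = LanguageStatus(True, True, True, False)
print(describe(example))
```

Checking the questions in order mirrors the slides: each later question only matters once the earlier ones are answered "Yes".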
    • The Growing Linguistic Digital Divide: “There are hundreds of seriously under-documented languages that remain very much alive with hundreds of thousands to tens of millions of speakers each. The speakers of these languages number collectively in the billions, and as linguistic technology grows in importance, they find themselves on the far side of an increasingly large digital divide.” - NSF Proposal “Seeding The Language Commons”
    • Enabling Top 300 Languages as well as The Long Tail
        • We have substantial machine readable corpora for only about 20-30 of the world’s 6,900 languages. [Bird and Abney, 2010]
        • There is a commercial motivation in enabling the 300 most widely spoken languages – if digital services and devices work for this group, that is 95% of humanity.
        • The other 6,500 or so – the long tail – has no commercial motivation, but these languages can be documented and enabled by non-profit/academic/philanthropic efforts.
        • The Long Tail can benefit from development of the 300 (and vice versa – if we are building better algorithms that can work with less data).
    • What we want to build…
    • The Language Commons Proposal: Build an Encyclopedia of Human Language. An aggregation and discovery portal for information and resources on all 6,900 human languages. For use by:
        • language speakers
        • educators
        • researchers
        • general public
    • Why an Encyclopedia of human language?
        • To create the go-to place for information and resources on any and all human languages – for education, for research, for preservation
        • To provide resources on lower density languages in case of crisis or emergency
        • To take action in the face of impending language loss
        • To act as testament for the genius of human cultural and linguistic diversity, and stand for freedom of language as a basic human right
        • To provide a forward path for the use of the world’s languages in the digital domain (by building a massive repository of open linguistic corpora)
    • Basic Design Principles:
        • Comprehensive – One page (minimum) for every human language
        • Extensible – includes language families, subgroups, languages, dialects, maybe even unique/noteworthy idiolects
        • Flexible – multiple navigation options and suited for a variety of users and user views: by language taxonomy, by alternate taxonomy, by other grouping – like linguistic area, geographic, with robust search by language name, alternate names, ISO 639 code
        • Open – open content, open contribution – the world should build it
        • Visible – the site should be easily discoverable and references to it ubiquitous
    • Model: WikiLanguage
    • Model: The Encyclopedia of Life
    • Where will the Data Come From?
    • Where will the Data Come From?
    • Where will the Data Come From? Global Lives Project: World Premiere, February 02010, Yerba Buena Center for the Arts, San Francisco, California
    • Where will the Data Come From? You! Everyone has a language and can help document it. (Photo by Erik Hersman)
    • Language Commons: What we’ve done this year
        • Established a special collection at the Internet Archive, built an uploader, and have accessioned several major corpora from working group participants
        • Declaration of purpose, Identity
        • Written grants, most notably to NSF for “Seeding the Language Commons: Software for Large Scale Transcription and Translation of Oral Literature”
        • Participants have made presentations about The Language Commons all over the world (Long Now presented at Wikimania in Gdansk last summer)
    • How Long Now is Helping
        • Long Now has offered to be the umbrella organization for The Language Commons, as a project closely related to the aims and goals of The Rosetta Project.
        • We are looking towards integrating the two digital collections – so that Rosetta’s parallel collection can seed the Language Commons.
        • The Language Commons collection would continue to serve as a source for future Rosetta Disks and other Long Now data preservation projects.
    • Language Commons: How YOU can help!
        • Please tell other people about The Language Commons – Tweet, Facebook, write blog posts or articles about the need for an open Language Commons.
        • We need serious funding to build the Encyclopedia of Human Language – and we are working on this! But if you have any leads or suggestions please let us know.
        • Consider a generous contribution of open language data.
    • Thank you! laura@longnow.org