Petition predictor final

This presentation was made by Lucky Adike, Marty McEnroe, and Dann Ormond for the CS410 class at the University of Illinois, taught by Prof. ChengXiang Zhai, in the Spring 2013 semester.

Published in: Technology

  • 0 sec
  • 25 Sec
  • 20 Sec
  • 15 Sec. We noticed that of approximately 90 current predictions, three of them have to do with marijuana. This gave us an idea.
  • work to do: Marty – compute numbers, sort by start date

    1. CS410 Course Project Presentation
       Petition Predictor
       CS410 Spring 2013
       Lucky Adike, Martin McEnroe, Dann Ormond
    2. Problem Statement
       "Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances."
       - The First Amendment of the United States Constitution
       • January 2012: Congress proposed legislation on behalf of content distributors
       • The internet community grew increasingly alarmed about the change and its side effects
       • Several well-publicized events took place on January 18, 2012 as part of the SOPA blackout day: Google, Reddit, Wired, Wikipedia and 115,000 other websites modified their web presence to protest the pending legislation.
       • January 20th: the legislation was shelved indefinitely
       What useful information retrieval tool could be built?
       • Could this citizenry-government action have been anticipated and predicted?
       • Could information retrieval and analysis of the online conversation anticipate and predict the end result?
    3. [Diagram: how a wh.gov petition works]
       • Must register with email and zip code
       • 100,000 signatures in 30 days
       • Reach threshold and Whitehouse responds
       • Which new petitions will hit threshold?
       Related work: On 2/21/13 the Whitehouse hosts a hackathon and releases project results on 5/1. Pulse predicts when a petition will pass the 100k threshold:
       http://youtu.be/5-2P4GFZf8Y
       https://github.com/DruRly/pulse
    4. Solution Approach
       • 1st Idea: Classify the petition:
         – "1": Petition will receive 100,000 discrete, validated signatures within 30 days
         – "0": Petition will not pass the 100,000 threshold in time
       • How to make a classification decision?
         1. Statistical analysis of past performance.
            • Wrote a Python program to scrape the whitehouse website every 8 hours. Stored in a JSON object for use in subsequent analysis and retrieval.
       Data collected for each petition:
       • Title of petition
       • Text of petition
       • Signature count every 8 hours starting 4/28
       • Petition create date (but only viewable on website after 150 signatures)
       • A unique identifier, also useful as a search term
       During the course of the project we changed to ranking petitions.
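The slide says the scraper stored its results in a JSON object but does not show the layout. The sketch below is one possible shape for such a record, not the authors' actual format: the field names, the helper functions `new_record` and `add_sample`, and the sample values are all assumptions made for illustration, and the network scraping step is left out entirely.

```python
import json
from datetime import datetime, timezone


def new_record(petition_id, title, body, created):
    """Build a per-petition record. Field names are illustrative assumptions."""
    return {
        "id": petition_id,      # unique identifier, also usable as a search term
        "title": title,
        "body": body,
        "created": created,     # create date as shown on the site
        "samples": [],          # signature counts, one entry per scrape
    }


def add_sample(record, signature_count, when=None):
    """Append one signature-count observation (the slides sample every 8 hours)."""
    when = when or datetime.now(timezone.utc)
    record["samples"].append(
        {"time": when.isoformat(), "signatures": signature_count}
    )
    return record


# Hypothetical petition and counts, for illustration only.
record = new_record("abc123XY", "Hypothetical title", "Hypothetical text", "2013-04-20")
add_sample(record, 150, datetime(2013, 4, 28, 0, 0, tzinfo=timezone.utc))
add_sample(record, 212, datetime(2013, 4, 28, 8, 0, tzinfo=timezone.utc))
print(json.dumps(record, indent=2))
```

Keeping the samples as a time-stamped list makes it straightforward to shift every petition to a common origin later, which the curve fit on the next slide relies on.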
    5. Logarithmic Curve Fit of 10 Most Likely Petitions
       [Chart: Number of Signatures (log scale, 150 to 150,000) vs. Time in 8 hour increments (0 to 100; petitions time shifted to common origin = creation date). Series, each with a logarithmic trendline: archbishops, marijuana3, airgun, postal, Malaysian, assault, habeas, aggag, thallium, transnational. Threshold line @ 100,000 signatures.]
       • Curve fit, then predict the 30th day value (x = 90 since we sample every 8 hours)
       • Petition 'fatigue' suggests the logarithmic model is a better predictor
       • Ln used (base w1 = e); can be tuned
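The fit-then-extrapolate step described on this slide can be sketched as an ordinary least-squares fit of y = a + b·ln(x), evaluated at x = 90 (30 days at three samples per day). This is a minimal sketch under assumptions: the slides do not say which tool produced the trendlines, the function names `fit_log_curve` and `predict_at` are hypothetical, and the sample data is synthetic.

```python
import math


def fit_log_curve(samples):
    """Least-squares fit of y = a + b * ln(x).

    samples: list of (x, y) pairs, where x >= 1 is the number of 8-hour
    increments since the petition was created and y is the signature count.
    """
    xs = [math.log(x) for x, _ in samples]
    ys = [y for _, y in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_at(samples, x):
    """Extrapolate the fitted curve to increment x (x = 90 is day 30)."""
    a, b = fit_log_curve(samples)
    return a + b * math.log(x)


# Synthetic data that follows a log curve exactly, for illustration only.
samples = [(x, 1000 + 400 * math.log(x)) for x in range(1, 13)]
print(round(predict_at(samples, 90)))
```

A petition would be flagged as likely to succeed when the extrapolated value at x = 90 reaches the 100,000 threshold.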
    6. Twitter: Tweets and Followers
       After signing, the wh.gov site encourages you to promote the petition.
       • Used public Twitter REST API
       • Search on the petition title
       • Tweet Rate = count / # of days (Twitter limits age of tweets in API)
       • Use transformation of rate to reward place in rank, not absolute value difference
         – sublinear
         – linear
         – exponential
       • Guess: linear
       Tweet Weight Adjusted for ∑ƒ(followers)
       • Are some tweeters more important than others?
       • Can we develop something like authorities/hubs?
       • Weighted Rate incorporates number of followers to increase/decrease score of each tweet
       Adj. Score = ∑ log_5(followers) / (days of tweets)
       Base 5 -> Pivot point is w2 = 5 followers – can be tuned
       [Chart: rank axis, with marks at 1.05 and 0.95]
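The two Twitter measures on this slide can be sketched as follows. The formulas come from the slide; the function names are hypothetical, and clamping follower counts below 1 up to 1 (so that the logarithm is defined and such tweets contribute zero) is an assumption, since the slide does not say how accounts with no followers were handled.

```python
import math


def tweet_rate(tweet_count, days):
    """Tweet Rate = count / number of days covered by the search results."""
    return tweet_count / days


def follower_weighted_rate(follower_counts, days, base=5):
    """Adj. Score = sum(log_base(followers)) / days of tweets.

    follower_counts holds one entry per tweet: the follower count of its author.
    With base = 5 (w2), a tweet from an account with 5 followers scores exactly 1.
    Assumption: counts below 1 are clamped to 1, so they add 0 to the sum.
    """
    total = sum(math.log(max(f, 1), base) for f in follower_counts)
    return total / days


# Three tweets over two days, from accounts with 5, 25 and 125 followers.
print(follower_weighted_rate([5, 25, 125], days=2))
```

Because the logarithm grows slowly, an account with thousands of followers counts for only a few times more than an account with a handful, which keeps a single popular tweeter from dominating the score.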
    7. Transforming Rank to Boost Factors
       • Petition rank is mapped via a linear function – function type can be tuned
       • Tuning scaling parameter applied based on judgment of importance of each IR category
         – tweet rate: w3 = .02; 1st -> 1.10; 10th -> .90
         – follower adjusted tweet rate: w4 = .04; 1st -> 1.20; 10th -> .80
       Columns 3 to 7 use the tweet rate; columns 8 to 12 use the follower weighted rate.
       Petition ID | Ln Curve Fit (w1 = e) | Tweets | Tweet Rate per Day | Rank | Boost Factor | Boost | Follower Weighted Rate | Weighted/Rate Ratio | Rank | Boost Factor | Boost
       xNskxL1q | 16,545 | 94 | 11.8 | 10 | 0.90 | -1,655 | 39.7 | 3.379 | 7 | 0.92 | -1,324
       xqNMVRB4 | 9,115 | 97 | 12.1 | 9 | 0.92 | -729 | 44.1 | 3.636 | 1 | 1.20 | 1,823
       khpw6LCt | 50,898 | 1022 | 127.8 | 2 | 1.08 | 4,072 | 459.8 | 3.600 | 3 | 1.12 | 6,108
       drCmyCHZ | 21,280 | 231 | 28.9 | 5 | 1.02 | 426 | 103.1 | 3.570 | 4 | 1.08 | 1,702
       nBqKR7bm | 446,841 | 1676 | 838.0 | 1 | 1.10 | 44,684 | 2675.7 | 3.193 | 9 | 0.84 | -71,494
       kVhNfHQ1 | 14,720 | 168 | 21.0 | 6 | 0.98 | -294 | 71.7 | 3.412 | 6 | 0.96 | -589
       bMJpDrNq | 6,769 | 114 | 14.3 | 8 | 0.94 | -406 | 49.6 | 3.479 | 5 | 1.04 | 271
       KQWSvsKr | 5,380 | 127 | 15.9 | 7 | 0.96 | -215 | 57.7 | 3.635 | 2 | 1.16 | 861
       Rd8C54p1 | 83,231 | 93 | 31.0 | 4 | 1.04 | 3,329 | 63.5 | 2.047 | 10 | 0.80 | -16,646
       V3hNt2fB | 17,376 | 508 | 63.5 | 3 | 1.06 | 1,043 | 208.5 | 3.283 | 8 | 0.88 | -2,085
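The slide gives only the endpoints of the linear mapping (1st and 10th place), so the sketch below infers the intermediate steps from the table: ranks 1 to 5 map to 1 + 5w down to 1 + w, and ranks 6 to 10 map to 1 - w down to 1 - 5w, with no rank landing on exactly 1.00. That step rule, and the function names, are assumptions read off the table rather than something the slides state. The boost is then (factor - 1) times the Ln curve fit value.

```python
def rank_steps(rank, n=10):
    """Signed number of steps away from neutral for a rank in 1..n (n even).

    Inferred from the table: the top half gets +n/2 .. +1, the bottom half
    gets -1 .. -n/2, so the factor never equals exactly 1.0.
    """
    half = n // 2
    return half + 1 - rank if rank <= half else half - rank


def boost_factor(rank, w, n=10):
    """Multiplicative factor, e.g. w = .02 gives 1.10 for 1st and 0.90 for 10th."""
    return 1.0 + w * rank_steps(rank, n)


def boost(rank, w, ln_curve_fit, n=10):
    """Signed adjustment in signatures: (factor - 1) * Ln curve fit value."""
    return w * rank_steps(rank, n) * ln_curve_fit


# Petition xqNMVRB4 from the table: Ln curve fit 9,115.
print(round(boost(9, 0.02, 9115)))   # tweet rate rank 9
print(round(boost(1, 0.04, 9115)))   # follower weighted rank 1
```

Working from rank instead of the raw rate means one petition with an extreme tweet count cannot shift the prediction by more than 5w of its curve fit value.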
    8. Can Google Trends help us?
       Chunks, value (0 – 100):
       Chunk | Value
       Revoke US Visa | 7
       on | 83
       National Security Grounds | 7
       to | 83
       Venezuelan Government Officials | 0
       involved | 65
       in | 93
       Transnational Organized Crime | 65
       Converted the petition title into search phrases using OpenNLP
       • sentence detector
       • tokenizer & POS tagger => Chunker
       Some observations
       • Chunking produced common terms with high scores
       • Would be more useful to build a custom Query background language model – need more data
       • Not clear how Google Trends computes values from 0 to 100 – different petitions are not relative to each other
       • Doesn't appear to be a "bag of words" model. What about semantically equivalent terms? We were hoping for a tf-idf weighting from the web
       • Is there another tool out there? Are there functions of the API we didn't exploit? Will the API evolve?
       • Most unreliable IR source, therefore w5 = .01
       [Figures: Results from web interface; Results from API interface]
    9. Authority Sites via Bing API
       • Created list of 30 authoritative web sites (e.g., cnn.com). Each weighted equally.
       • Sent full title of petition as query to Bing API exactly as listed on wh.gov: "Invest and deport Jasmine Sun who was the main suspect of a famous Thallium poison murder case (victim:Zhu Lin) in China"
       • Measured number of responses in the top 50 results that came from an authoritative domain - eliminated self-posting parts of domain: http://ireport.cnn.com/docs/DOC-965382
       • Observation: Most petitions do not receive mainstream attention
       • Second most reliable, w6 = .03
       Petition ID | keyword | Close Date | Ln Curve Fit | Authority Sites | Rank | Boost Factor | Boost
       xNskxL1q | archbishops | 5/27 | 16,545 | 5 | 3 | 1.09 | 1489
       xqNMVRB4 | marijuana3 | 5/17 | 9,115 | 8 | 1 | 1.15 | 1367
       khpw6LCt | airgun | 5/15 | 50,898 | 4 | 4 | 1.06 | 3054
       drCmyCHZ | postal | 5/24 | 21,280 | 6 | 2 | 1.12 | 2554
       nBqKR7bm | Malaysian | 6/4 | 446,841 | 2 | 6 | 0.97 | -13405
       kVhNfHQ1 | assault | 5/21 | 14,720 | 0 | 10 | 0.85 | -2208
       bMJpDrNq | habeas | 5/27 | 6,769 | 3 | 5 | 1.03 | 203
       KQWSvsKr | aggag | 5/10 | 5,380 | 2 | 8 | 0.91 | -484
       Rd8C54p1 | thallium | 6/4 | 83,231 | 0 | 9 | 0.88 | -9988
       V3hNt2fB | transnational | 6/3 | 17,376 | 2 | 7 | 0.85 | -2606
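Counting authoritative hits among the top 50 search results might look like the sketch below. It starts from a list of result URLs and makes no call to any search API. Only cnn.com and the ireport.cnn.com exclusion come from the slide; the second domain in the list, the function name, and the sample URLs are placeholders, since the actual list of 30 sites is not given.

```python
from urllib.parse import urlparse

# Placeholder list: the slides name only cnn.com; example.org is hypothetical.
AUTHORITY_DOMAINS = {"cnn.com", "example.org"}
# Self-posting sections are excluded, as the slide does for ireport.cnn.com.
EXCLUDED_HOSTS = {"ireport.cnn.com"}


def count_authority_hits(result_urls, domains=AUTHORITY_DOMAINS,
                         excluded=EXCLUDED_HOSTS, top_n=50):
    """Count results in the top_n whose host is, or is a subdomain of, a listed domain."""
    hits = 0
    for url in result_urls[:top_n]:
        host = (urlparse(url).hostname or "").lower()
        if host in excluded:
            continue
        if any(host == d or host.endswith("." + d) for d in domains):
            hits += 1
    return hits


# Hypothetical result list, for illustration only.
urls = [
    "http://www.cnn.com/some-story",
    "http://ireport.cnn.com/docs/DOC-965382",
    "http://blog.example.net/post",
    "https://cnn.com/another-story",
]
print(count_authority_hits(urls))
```

Matching on the parsed hostname, not on a substring of the URL, avoids counting pages that merely mention an authoritative domain in their path or query string.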
    10. Putting it together
       • Our focus was on acquiring data and constructing a model, automating where necessary, and using open tools, APIs, and information sources
       • Some work on transfer between modules and on final ranking and computation needs more automation if we are to run unattended
       • Much data analysis, both manual and automated, to guess at important sources and parameters. Many initial ideas didn't pan out:
         – Sentiment analysis (no such thing as bad publicity)
         – Google Trends surprisingly useless – forced to do manual manipulation – very low confidence in this as a prediction
         – Facebook button on wh.gov, but it didn't appear to be used as much as Twitter
         – No training data to choose parameters. Chose a simple "boost" model to start and used intuition from the project to guess at the relative size of boost from different sources.
       Chunk | Value
       Stop | 85
       seismic airgun testing | 0
       for | 86
       oil and gas | 77
       off | 80
       the U.S. East Coast . | 0
    11. Putting Our Money Where Our Mouth Is…
       Ranked predictions of the 10 most likely [1] of the 84 [2] petitions started between April 5 and May 4. How will we do?
       [1] Only petitions that have at least 150 signatures are visible to us
       [2] One petition (0MNp0Bys) started on 4/15 and hit 100k before we started collecting statistics, so we excluded this from our data set
       Baseline: Linear Curve Fit and Naïve Order. IR Model Prediction: the remaining columns.
       Petition ID | keyword | Close Date | Linear Curve Fit | Naïve Order | Ln Curve Fit (w1 = e) | Twitter (w3 = .02) | Twitter+ (w2 = 5, w4 = .04) | Google Trends (w5 = .01) | Authority Sites (w6 = .03) | Combined model | Predicted Order
       xNskxL1q | archbishops | 5/27 | 31,309 | 5 | 16,545 | -1,655 | -1,324 | 165 | 1,489 | 15,222 | 5
       xqNMVRB4 | marijuana3 | 5/17 | 10,048 | 9 | 9,115 | -729 | 1,823 | 456 | 1,367 | 12,032 | 7
       khpw6LCt | airgun | 5/15 | 57,185 | 4 | 50,898 | 4,072 | 6,108 | -2,036 | 3,054 | 62,096 | 2
       drCmyCHZ | postal | 5/24 | 27,929 | 6 | 21,280 | 426 | 1,702 | -426 | 2,554 | 25,536 | 4
       nBqKR7bm | Malaysian | 6/4 | 2,895,387 | 1 | 446,841 | 44,684 | -71,494 | 17,874 | -13,405 | 424,499 | 1
       kVhNfHQ1 | assault | 5/21 | 16,568 | 7 | 14,720 | -294 | -589 | -442 | -2,208 | 11,187 | 8
       bMJpDrNq | habeas | 5/27 | 12,036 | 8 | 6,769 | -406 | 271 | -68 | 203 | 6,769 | 9
       KQWSvsKr | aggag | 5/10 | 5,448 | 10 | 5,380 | -215 | 861 | -269 | -484 | 5,273 | 10
       Rd8C54p1 | thallium | 6/4 | 734,304 | 2 | 83,231 | 3,329 | -16,646 | 1,665 | -9,988 | 61,591 | 3
       V3hNt2fB | transnational | 6/3 | 88,236 | 3 | 17,376 | 1,043 | -2,085 | 521 | -2,606 | 14,248 | 6
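Reading across the table, the combined model appears to be the Ln curve fit value plus the four signed boosts, with petitions then ordered by that sum. The slides do not spell the formula out, so treat this as an inference from the numbers; the function names are hypothetical. The sketch reproduces three rows of the table.

```python
def combined_score(ln_curve_fit, boosts):
    """Combined model = Ln curve fit + sum of the per-source boosts."""
    return ln_curve_fit + sum(boosts)


def predicted_order(rows):
    """Return keywords sorted from highest to lowest combined score.

    rows maps keyword -> (ln_curve_fit, [twitter, twitter_plus, trends, authority]).
    """
    scores = {k: combined_score(fit, boosts) for k, (fit, boosts) in rows.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Three rows copied from the table above.
rows = {
    "marijuana3": (9115, [-729, 1823, 456, 1367]),
    "airgun": (50898, [4072, 6108, -2036, 3054]),
    "thallium": (83231, [3329, -16646, 1665, -9988]),
}
for keyword, (fit, boosts) in rows.items():
    print(keyword, combined_score(fit, boosts))
print(predicted_order(rows))
```

The airgun and thallium rows show the effect of the boosts: thallium has the larger curve fit value, but its poor follower-weighted and authority-site ranks pull it just below airgun in the predicted order.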
    12. Quo Vadis?
       Do Research
       • Collect more data, train parameters, learn different ways to make predictions
       • Publish
       • Awesome? idea for a team competition homework 5 in a future class
       Sharpen CS skills
       • Whitehouse.gov released API on 5/1 and a historical corpus on 5/2
       • Next Whitehouse hackathon on 6/1
       Make money
       • Turn this into an actual app and host it on a web site
         – Business model: tweet dashboard link to anyone who tweets a petition; dashboard site is advertising supported
       • Apply methods to other petition sites: change.org, gopetition.com, ipetitions.com, signon.org, thepetitionsite.com, care2.com (or get a job at one of these companies)
       Give back
       • Fraudulent petition signature detection
       • Mine the web for new petition topics with high success potential
