The Bones of a Bestseller: Visualizing Fiction

Using Python and D3 to find sex scenes and identify story arcs in fiction (bestsellers).

Published in: Technology, Education

  • Book stars for today
  • Additionally, I want to illustrate using some statistical tricks on data, particularly some simple machine learning – and the tools I built to visualize those results.
  • Start with a little motivating graphic, or “pornographic”, that inspired me about 6 months ago. This was actually on the Economist’s blog! Can we do this automatically?
  • The way everyone would start this problem…
  • Then you can use a good model to predict new things you haven’t seen before. Spam classifiers work this way.
  • Supervised learning: Have some data you label with the truth, and feed it into some code to learn what the truth is all about. To do this properly, you divide the data up into a training set and an evaluation set, and you see how your code did on the evaluation set: how much did it get right? Once you’re satisfied with the tweaks on the classifier code, you can use it on new data in the wild.
  • What the text looks like…
  • Doing the sex scenes labels myself sucked, so I outsourced it to Mechanical Turk, Amazon’s crowdsourcing remote work tool. It was super easy (to spend a lot of money on this). So I did (spend a lot of money).
  • But let’s step back a little…
  • Lots would say this is sexy (maybe not all women, though).
  • Some would say this set is sexy, others definitely would not. This turns out to be a lot of what 50 Shades is about… So, hmm. Also, this set is on Ebay in the UK if you’re into it.
  • So, apart from the bondage, the Mechanical Turkers are seeing small chunks of text, with no context, in random orders. Suppose there’s a steamy shower scene where they are getting it on – but they stop to discuss a horrible childhood incident and cry? Is that in a sex scene, or not? Tough to say.
  • Even worse – some parts of the first book are long sections of contract, which contain sexual rules and regulations – but it’s a contract. Sexy, or not? Probably not to most…
  • Results from Mechanical Turk as a CSV file.
  • We can see a fair amount of variation here, some good agreement, but the blue raters were more turned on by the beginning of the book.
  • A pretty good match, actually. Good for the Turkers and the porno-graphic team!
  • Again, what’s up with the blue raters – they loved this book. Red did not find it sexy at all.
  • NLTK outputs a list of top terms, unlike scikit-learn – just wanted to show you what they looked like.
  • This is an illustration (not by me) of how many classifiers there are that can be used on text, in scikit-learn… Picked one that has general good performance, to see how it compared to Naïve Bayes – Stochastic Gradient Descent. Notice there’s a Passive Aggressive Classifier, too. Best.name.ever.
  • Just to show you how little code it is to run a classifier pipeline – and check the results.
  • To be able to browse the results by content and context, I built a little tool in D3; you can see the matches and mismatches in the sex scenes, and roll over each little block to inspect the text itself. Useful!
  • I did spend a lot of money on Mturk, getting ratings of sex scenes. For future talks…
  • Switching back to another theme… overall arc of action in a novel.
  • The movie version of story arcs… “height” is tension, or some kind of measure of excitement, or drama…
  • But the more I investigated this online, the more I found people saying it’s bullshit. This is the best quote I found on the subject. Totally worth reading the essay.
  • Vonnegut is here talking about sentiment of events, not really “tension” or “excitement” or rising action – but there are still structural differences of some kind going on across each book/story. The question is, what do best sellers look like over the course of the whole story? By whatever measure will illustrate the pacing/movement of the story. Lower right hand corner is me working on this talk.
  • Vonnegut on real life, compared to fiction. But because that’s depressing, here is a tiny owl.
  • Did this cheer you up? It’s better than a kitten! IMO.
  • So – back to the initial thought: rising action, crises, resolution, etc. Can we find this in books? Automatically, I mean?
  • Using action/exciting scenes as proxy for major events in a book…
  • Brief digression on how hard this is.
  • This seems obvious… fights, chases, etc.
  • Small chunks out of context don’t always look like action. Remember Mechanical Turk folks are seeing small pieces without context, so their judgments are based on only a tiny window. Or, in the case of Dan Brown, it might ALL look like action. (You might look up “bathos” here too.)
  • A sample of the text in question… This is an action scene.
  • It’s possible I could’ve improved this with some other trickery, but heck, let’s move on.
  • I thought maybe I’d get somewhere with another “bag of words” unstructured technique that’s popular now: topic analysis.
  • A snippet from a classic article.
  • A network view of topics and documents, by Elijah Meeks. This is a pretty obvious way to visualize the results of LDA on text. But my data is ordered chapters, so I didn’t want to do this. I wanted to keep the relationship, but still see the topics…
  • Built another tool to see if there was anything in this – showing them as ordered chapters connected by the “best” matching topics.
  • Outsource the summary writing for each chapter, to make it easier to see how topics relate to chapter contexts. … Add them as text under the leaves (the boxes that represent chapters). Now it’s hard to read – so use svg cute rotate trick and some resizing…!
  • Some UI niceties I added to make it slightly usable, even for myself. Unfortunately I had to shorten the text my friend created for each chapter; the originals were pretty hilarious…
  • This was the best way I could find on short notice to get a list of divergent bright colors… but I still had to hand-tweak the ones I used till they were all more or less readable and distinguishable!
  • For this tool I used Jim Vallandingham’s network code from the flowingdata tutorial – my first major use of coffeescript. I intended to try the radial layout, but ended up not. There was a lot of preprocessing of the stats to get the topics related to chapter “excitingness.”
  • On rollover, you can see the links, and since I created links between shared words (colored in blue), you can see little constellations for them, which I liked. This could’ve been simplified, of course.
  • No structure visible, no story structure… In using just the bag of words, I’d lost all structure or relationships across time, which seems important to me for things like pacing in a novel.
  • Previous work, sketch showing relationship of dialog to exposition in different texts; this I did in early spring of 2013, for another talk. It was a simple visual of dialog vs. exposition.
  • At the time I theorized that the reason Angels and Demons (which I bought on sale on Amazon) had less dialog towards the end was because the action had increased, to the detriment of dialog. Maybe I was right? Simple is best?
  • New process: Check for relationships of everything, across time; and relationship to “excitement” ratings.
  • First note the ratings differences (using ipython notebook and pandas). I used the avg scores.
  • After some code magic, nouns, for instance, look like this. Lots of chapters, lots of variation. I used rolling means to get a smoother curve!
  • These are hard to plot together on the same scale – giant mess.
  • Standardize the data with a small transform, to get it all on a comparable scale. Still kind of a mess.
  • Packaging to generate rickshaw.js graphs built on d3.
  • The result in a browser – just showing you Twilight for example, back to Brown in a sec.
  • There were nice inverse relations between nouns and verbs in both books. (Both done proportionally and as absolutes. These plots are proportional numbers.)
  • Notice the nice climbing excitement curve on A&D. This is based on Turker avg scores, of course. DVC has more peaks in the first half, it seems. But does climb at the end.
  • So what’s the closest correlate to the action? Inverse correlation (mild) with the dialog, as I suspected initially. (Yes, logistic regression and other stats can be done here – I did, too. But for this talk, it was about the visuals.)
  • Not necessarily true – the giant peak of expository stuff early in Twilight turns out to be the trip to the beach where all the vampire/werewolf stories come out.
  • Another tool needed… to check the numbers with the text visible. You can eyeball for highlighted correlations with the excitement, and roll over the blocks again to see the text.
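
The supervised-learning notes above (label some chunks, hold out an evaluation set, measure how much the classifier gets right) can be sketched in a few lines. This is only an illustration in plain Python, standing in for the NLTK and scikit-learn classifiers the talk actually used: the toy chunks, their labels, and the function names `tokenize`, `train_naive_bayes`, `classify`, and `accuracy` are all invented here and are not from the talk.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase bag-of-words tokens; word order is deliberately ignored."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(labeled_chunks):
    """Count word frequencies per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of chunks
    for text, label in labeled_chunks:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_chunks = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_chunks)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:   # skip words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, labeled_chunks):
    """Fraction of held-out chunks whose predicted label matches the truth."""
    hits = sum(classify(model, text) == label for text, label in labeled_chunks)
    return hits / len(labeled_chunks)


# Invented toy data, NOT the talk's Mechanical Turk ratings.
training_set = [
    ("he kissed her neck and pulled her close", "sexy"),
    ("her lips parted as he kissed her again", "sexy"),
    ("she signed the contract and filed the paperwork", "not sexy"),
    ("he called his mother about the taxes", "not sexy"),
]
evaluation_set = [
    ("he pulled her close and kissed her lips", "sexy"),
    ("the paperwork and taxes were filed", "not sexy"),
]

model = train_naive_bayes(training_set)
print(accuracy(model, evaluation_set))
```

With real data the evaluation set would be chunks the model never saw during training, which is presumably why the talk trains on one book and tests on another.
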
  • The Bones of a Bestseller: Visualizing Fiction

    1. The Bones of a Bestseller: Visualizing Fiction. Lynn Cherny, @arnicas. OpenvisConf 2013. Monday, June 17, 13
    2. Language, Sex, Violence (also spoilers) TEXT
    3. Notes: Book stars for today
    4. Study what’s popular, because it tells us something about people.
        Notes: Additionally, I want to illustrate using some statistical tricks on data, particularly some simple machine learning – and the tools I built to visualize those results.
    5. http://www.economist.com/blogs/graphicdetail/2012/11/fifty-shades-data-visualisationsBY
        Notes: Start with a little motivating graphic, or “pornographic”, that inspired me about 6 months ago. This was actually on the Economist’s blog! Can we do this automatically?
    6. Text Classification (Commonly)
        • “Bag of words” – each document is considered a collection of words, independent of order
        • Frequencies of certain words are used to identify the texts
        Seems like this should work with sex scenes, right? Only so many body parts and behaviors, right?!
        Notes: The way everyone would start this problem…
    7. [Diagram labels: Data, Label (placeholder rows marked Yes/No), Algorithm, Train, Test, New data in the wild]
        Notes: Supervised learning: Have some data you label with the truth, and feed it into some code to learn what the truth is all about. To do this properly, you divide the data up in a training set, and an evaluation set – and you see how your code did on the evaluation set: how much did it get right? Once you’re satisfied with the tweaks on the classifier code, you can use it on new data in the wild.
    8. Sex Scene Detection First Steps
        1. Buy 50 Shades on Amazon, unlock text in Calibre, save as TXT file.
        2. Cut up a doc into 500 “word” chunks using Python
        3. Try to label each chunk: “not sexy” (e.g., paperwork, taxes, calls to Mom), “maybe steamy” (e.g. kissing, limited touching, long looks), “sexy!” (fill in the ____ here)
    9. “Would you like to sit?” He waves me toward an L-shaped white leather couch. His office is way too big for just one man. In front of the floor-to-ceiling windows, there’s a modern dark wood desk that six people could comfortably eat around. It matches the coffee table by the couch. Everything else is white—ceiling, floors, and walls, except for the wall by the door, where a mosaic of small paintings hang, thirty-six of them arranged in a square. They are exquisite—a series of mundane, forgotten objects painted in such precise detail they look like photographs. Displayed together, they are breathtaking. “A local artist. Trouton,” says Grey when he catches my gaze. “They’re lovely. Raising the ordinary to extraordinary,” I murmur, distracted both by him and the paintings. He cocks his head to one side and regards me intently. “I couldn’t agree more, Miss Steele,” he replies, his voice soft, and for some inexplicable reason I find myself blushing. (Sample of 50 Shades of Grey)
        Notes: What the text looks like…
    10. Outsourced to Mechanical Turk
        Notes: Doing the sex scenes labels myself sucked, so I outsourced it to Mechanical Turk, Amazon’s crowdsourcing remote work tool. It was super easy (to spend a lot of money on this). So I did (spend a lot of money).
    11. WHAT’S A SEX SCENE, ANYWAY?
        Notes: But let’s step back a little…
    12. Zara.com
        Notes: Lots would say this is sexy (maybe not all women, though).
    13. http://www.ebay.com/itm/Adult-Sex-Toys-Tools-Handcuffs-Eye-mask-Neck-Band-Strap-Whip-Rope-/330845727274?pt=UK_Home_Garden_Celebrations_Occasions_ET&hash=item4d07f12a2a
        Notes: Some would say this set is sexy, others definitely would not. This turns out to be a lot of what 50 Shades is about… So, hmm. Also, this set is on Ebay in the UK if you’re into it.
    14. trendir.com
        Notes: So, apart from the bondage, the Mechanical Turkers are seeing small chunks of text, with no context, in random orders. Suppose there’s a steamy shower scene where they are getting it on – but they stop to discuss a horrible childhood incident and cry? Is that in a sex scene, or not? Tough to say.
    15. Sexually Exxxplicit, but still a [image: http://www.icts.uiowa.edu/sites/default/files/contract.jpg]
        Notes: Even worse – some parts of the first book are long sections of contract, which contain sexual rules and regulations – but it’s a contract. Sexy, or not? Probably not to most…
    16. Notes: Results from Mechanical Turk as a CSV file.
    17. How’d the raters do? [Chart panels: Sex Scenes, Steamy Scenes]
        Notes: We can see a fair amount of variation here, some good agreement, but the blue raters were more turned on by the beginning of the book.
    18. Comparing to “Pornographic”…
        Notes: A pretty good match, actually. Good for the Turkers and the porno-graphic team!
    19. Comparing:
        Notes: Again, what’s up with the blue raters – they loved this book. Red did not find it sexy at all.
    20. On to the learning algorithm…
        The training data:
        • The text chunks
        • The score the raters gave it (averaged) as “truth”
        I started with Python’s NLTK (Natural Language Toolkit) and Naïve Bayes for classifying (working in an ipython notebook).
    21. NLTK Naïve Bayes not so great on 50 Shades… 68%. “packet” (they use a lot of condoms)
        Notes: NLTK outputs a list of top terms, unlike scikit-learn – just wanted to show you what they looked like.
    22. Python’s sklearn (scikit-learn). Lots of classifiers for sparse data like text! http://scikit-learn.org/0.13/auto_examples/document_classification_20newsgroups.html
        Notes: This is an illustration (not by me) of how many classifiers there are that can be used on text, in scikit-learn… Picked one that has general good performance, to see how it compared to Naïve Bayes – Stochastic Gradient Descent. Notice there’s a Passive Aggressive Classifier, too. Best.name.ever.
    23. [Code screenshot annotations]
        • Using a lemmatizer step in the pipeline (to strip endings off words, since some fiction in my later samples was in present tense)
        • Pipelines in sklearn make it incredibly easy to run lots of experiments.
        • Fit the model, using training data and “target” answers (in this case, “50 Shades of Grey”)
        • Test the model on new data (in this case, “50 Shades Darker”). Check how it did against the answers.
        Now we’re at 88%
        Notes: Just to show you how little code it is to run a classifier pipeline – and check the results.
    24. Interpreting the results… Demo: http://www.ghostweather.com/essays/talks/openvisconf/text_scores/rollover.html
        Notes: To be able to browse the results by content and context, I built a little tool in D3; you can see the matches and mismatches in the sex scenes, and rollover each little block to inspect the text itself. Useful!
    25. Really amazing P.S. here… I paid for coding of a bunch of fan-fiction for sex scenes too, and fed them in to the SGD classifier. (Recall that 50 Shades started life as Twilight fanfic.) 97% accuracy achieved!* (*cross-validating with entire set, not just 50 Shades books.)
        Notes: I did spend a lot of money on Mturk, getting ratings of sex scenes. For future talks…
    26. But hey - I SAID I’D TALK ABOUT STORY ARCS TOO
        Notes: Switching back to another theme… overall arc of action in a novel.
    27. http://www.musik-therapie.at/PederHill/Structure&Plot.htm
        Notes: The movie version of story arcs… “height” is tension, or some kind of measure of excitement, or drama…
    28. PLEASE. IF YOU WRITING SCREENPLAY. HULK TELLING YOU. THE 3 ACT STRUCTURE = GARBAGE. STOP CITING IT IN ARTICLES. STOP TALKING ABOUT IT WITH FRIENDS. IT WILL NOT HELP YOU. STAY THE FUCK AWAY FROM ANYONE WHO EVEN CLAIM IT EXIST. IF THEY SAY IT DO. SAY “OF COURSE SHIT HAS BEGINNING, MIDDLE, AND ENDING YOU INSUFFERABLE TURD” THEN THROW A DRINK IN THEIR FACE AND RUN AWAY…
        Source: “The HULK Presents the Myth of the 3-Act Structure”, http://filmcrithulk.wordpress.com/2011/07/07/hulk-presents-the-myth-of-3-act-structure/
        Notes: But the more I investigated this online, the more I found people saying it’s bullshit. This is the best quote I found on the subject. Totally worth reading the essay.
    29. Vonnegut - http://thedesigngym.com/simpleshapesofstories/
        Notes: Vonnegut is here talking about sentiment of events, not really “tension” or “excitement” or rising action – but there’s still some kind of structural differences going on across each book/story. The question is, what do best sellers look like over the course of the whole story? By whatever measure will illustrate the pacing/movement of the story. Lower right hand corner is me working on this talk.
    30. Notes: Vonnegut on real life, compared to fiction. But because that’s depressing, here is a tiny owl.
    31. http://24.media.tumblr.com/ba77d04cb210b8e24ff73a49a19b3111/tumblr_mfc6dv2SER1qh66wqo1_1280.jpg
        Notes: Did this cheer you up? It’s better than a kitten! IMO.
    32. Notes: So – back to the initial thought: rising action, crises, resolution, etc. Can we find this in books? Automatically, I mean?
    33. Can we detect exciting scenes? Back to Mechanical Turk, with Dan Brown books: 2 raters again, chunks of 500 words. Odd factoid: I got ratings of sex scenes in 2-4 hours. It took ~13 hours to get Dan Brown action scenes.
        Notes: Using action/exciting scenes as proxy for major events in a book…
    34. “ACTION” SCENES ARE TOUGH, TOO
        Notes: Brief digression on how hard this is.
    35. Raven.theraider.net
        Notes: This seems obvious… fights, chases, etc.
    36. Objects in the mirror are closer than they appear. www.badhaven.com / Jurassic Park
        Notes: Small chunks out of context don’t always look like action. Remember Mechanical Turk folks are seeing small pieces without context, so their judgments are based on only a tiny window. Or, in the case of Dan Brown, it might ALL look like action. (You might look up “bathos” here too.)
    37. Almost naked, Silas hurled his pale body down the staircase. He knew he had been betrayed, but by whom? When he reached the foyer, more officers were surging through the front door. Silas turned the other way and dashed deeper into the residence hall. The women’s entrance. Every Opus Dei building has one. Winding down narrow hallways, Silas snaked through a kitchen, past terrified workers, who left to avoid the naked albino as he knocked over bowls and silverware, bursting into a dark hallway near the boiler room. He now saw the door he sought, an exit light gleaming at the end. Running full speed through the door out into the rain, Silas leapt off the low landing, not seeing the officer coming the other way until it was too late. The two men collided, Silas’s broad, naked shoulder grinding into the man’s sternum with crushing force. He drove the officer backward onto the pavement, landing hard on top of him. The officer’s gun clattered away. Silas could hear men running down the hall shouting. Rolling, he grabbed the loose gun just as the officers emerged. A shot rang out on the stairs, and Silas felt a searing pain below his ribs. Filled with rage, he opened fire at all three officers, their blood spraying. A dark shadow loomed behind, coming out of nowhere. The angry hands that grabbed at his bare shoulders felt as if they were infused with the power of the devil himself. The man roared in his ear. SILAS, NO! Silas spun and fired. Their eyes met. Silas was already screaming in horror as Bishop Aringarosa fell. (Chapter 96, DaVinci Code)
        Notes: A sample of the text in question… This is an action scene.
    38. SO WHAT ABOUT “BAGS OF WORDS” HERE? Text content worked for sex scenes…..
    39. SGD Classifier on “exciting” scenes washed out – about 60% accuracy on Dan Brown.
        Notes: It’s possible I could’ve improved this with some other trickery, but heck, let’s move on.
    40. LDA Topic Analysis
        • Topic analysis produces associations between words and chunks of text, by probabilistic methods.
        • “Topics” are described by lists of most informative words.
        • A topic may be associated with multiple documents.
        Notes: I thought maybe I’d get somewhere with another “bag of words” unstructured technique that’s popular now: topic analysis.
    41. Blei (2011) from http://www.scottbot.net/HIAL/?p=221
        Notes: A snippet from a classic article.
    42. Elijah Meeks: https://dhs.stanford.edu/comprehending-the-digital-humanities/topics/
        Notes: A network view of topics and documents, by Elijah Meeks. This is a pretty obvious way to visualize the results of LDA on text. But my data is ordered chapters, so I didn’t want to do this. I wanted to keep the relationship, but still see the topics…
    43. Another tool: DaVinci Code topics to chapters mapping. [Figure labels: “Excitement” rating color scale, avg by chapter, ordered (obviously); Topics (48ish) per chapter (108); Chapter 1… to Chapter 108]
        Notes: Built another tool to see if there was anything in this – showing them as ordered chapters connected by the “best” matching topics.
    44. But, maybe I need chapter summaries…. So I can relate them to the topics? Ah, but since it’s svg/d3… var chart = chart.append("g").attr("translate","0," +y).attr("transform","rotate(90 600 600)");
        Notes: Outsource the summary writing for each chapter, to make it easier to see how topics relate to chapter contexts. … Add them as text under the leaves (the boxes that represent chapters). Now it’s hard to read – so use svg cute rotate trick and some resizing…!
    45. Add some topic-tooltips and fade-outs…. Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_arc_diagram/TopicArc.html
        Notes: Some UI niceties I added to make it slightly usable, even for myself. Unfortunately I had to shorten the text my friend created for each chapter; the originals were pretty hilarious…
    46. This project featured a Crayola color scheme. http://en.wikipedia.org/wiki/List_of_Crayola_crayon_colors
        Notes: This was the best way I could find on short notice to get a list of divergent bright colors… but I still had to hand-tweak the ones I used till they were all more or less readable and distinguishable!
    47. Maybe I need One More Tool. Any word relations of interest? Let’s try a hairball… Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_words_network/index.html
        Notes: For this tool I used Jim Vallandingham’s network code from the flowingdata tutorial – my first major use of coffeescript. I intended to try the radial layout, but ended up not. There was a lot of preprocessing of the stats to get the topics related to chapter “excitingness.”
    48. Small “constellations” show shared words (an accident that’s useful!) Filtered to only the “exciting” nodes…
        Notes: On rollover, you can see the links, and since I created links between shared words (colored in blue), you can see little constellations for them, which I liked. This could’ve been simplified, of course.
    50. 50. Slide by me in a talk on Nodebox: http://blogger.ghostweather.com/2013/03/data-visualization-with-nodebox.htmlCovered up forcheap theatrics…Monday, June 17, 13Previous work, sketch showing relationship of dialog to exposition in different texts; this I did in early spring of 2013, for anothertalk. It was a simple visual of dialog vs. exposition.
    51. 51. Slide by me in a talk on Nodebox: http://blogger.ghostweather.com/2013/03/data-visualization-with-nodebox.htmlMonday, June 17, 13At the time I theorized that the reason Angels and Demons (which I bought on sale on Amazon) had less dialog towards the endwas because the action had increased, to the detriment of dialog. Maybe I was right? Simple is best?
    52. 52. Back to Python.§ Chunk book by chapter, get POS tags, punctuation,and word counts + more for each chapter…§ Import scores from Turkers id’ing which bits areexciting/action, incorporate with the other data.Monday, June 17, 13New process: Check for relationships of everything, across time; and relationship to “excitement” ratings.
    53. 53. Some pretty big raterdifferences,actually.Monday, June 17, 13First note the ratings differences (using ipython notebook and pandas). I used the avg scores.
    54. 54. Item = chapterMagic: Pandas’ rolling_mean functionon different window sizes!Monday, June 17, 13After some code magic, nouns, for instance, look like this. Lots of chapters, lots of variation. I used rolling means to get asmoother curve!
    55. 55. Well, this is a mess.Monday, June 17, 13These are hard to plot together on the same scale – giant mess.
    56. 56. Monday, June 17, 13Standardize the data with a small transform, to get it all on a comparable scale.Still kind of a mess.
    57. 57. Hey, now I want to play with it live, with UI controls…Enter Bearcart (by Rob Story/@oceankidbilly)Monday, June 17, 13Packaging to generate rickshaw.js graphs built on d3.
    58. 58. Notice the nicecheckbox perseries controls –what I needed!Monday, June 17, 13The result in a browser – just showing you Twilight for example, back to Brown in a sec.
    59. 59. A few oddities… nouns & verbsAngels & DemonsDaVinci CodeverbsverbsnounsnounsMonday, June 17, 13There were nice inverse relations between nouns and verbs in both books. (Both done proportionally and as absolutes.Theseplots are proportional numbers.)
    60. 60. Basic “excitement/action” arcsDaVinci CodeAngels & DemonsMonday, June 17, 13Notice the nice climbing excitement curve on A&D. This is based on Turker avg scores, of course. DVC has more peaks in thefirst half, it seems. But does climb at the end.
    61. 61. Angels & DemonsDaVinci Code“score”Action “score”QuotesQuotesDemo http://www.ghostweather.com/essays/talks/openvisconf/bearcart/index_dav.htmlDemo http://www.ghostweather.com/essays/talks/openvisconf/bearcart/index_ang.htmlChapter numberSo I was right –lots of runningaround and stuff!Monday, June 17, 13So what’s the closest correlate to the action? Inverse correlation (mild) with the dialog, as I suspected initially. (Yes, logisticregression and other stats can be done here – I did, too. But for this talk, it was about the visuals.)
    62. 62. TwilightThe talky bits…So…The action?Invert --Monday, June 17, 13Not necessarily true – the giant peak of expository stuff early in Twilight turns out to be the trip to the beach where all thevampire/werewolf stories come out.
    63. 63. Yet.Another.Tool. !Demo:http://www.ghostweather.com/essays/talks/openvisconf/chapter_scores/score_rollover_dav.htmlMonday, June 17, 13Another tool needed… to check the numbers with the text visible. You can eyeball for highlighted correlations with theexcitement, and rollover the blocks again to see the text.
    64. 64. Some final thoughtsCreate minimum viable tools (to help youvisualize/analyse) in whatever you can use, fast.And boy, machine learning sure can useinteractive visual tools!A browser can easily hold an entire trashy novel.Monday, June 17, 13
    65. 65. THANKS!@arnicas, Lynn@ghostweather.comMy thanks to….Luminosity (help with Dan Brown summaries)Yves Fey (help with romance genreconventions) Fan friends with sex-filled long fanfic refs (Dorinda, Movies_Michelle,Gwyn Rhys) Rob Story/@oceankidbilly (for help with Bearcart under pressure) JimVallandingham/@vlandham for his code/advice, Irene and Bocoup for hosting!Monday, June 17, 13
    66. A Few References
        • Applied Machine Learning with Scikit-Learn: http://scikit-learn.github.io/scikit-learn-tutorial/index.html
        • Naïve Bayes for text in Scikit-Learn: http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes
        • Stochastic Gradient Descent in Scikit-Learn: http://scikit-learn.org/0.13/modules/sgd.html
        • Nice tutorial overview of working with text data: scikit-learn.github.io/scikit-learn-tutorial/working_with_text_data.html
        • Bearcart by Rob Story – Rickshaw timeseries graphs from python pandas data structure in 4 lines (https://github.com/wrobstory/bearcart)
        • LDA topic modeling tool with UI - https://code.google.com/p/topic-modeling-tool/
        • Scott Weingart’s nice overview of LDA Topic Modeling in Digital Humanities: http://www.scottbot.net/HIAL/?p=221
        • Elijah Meeks’ lovely set of articles on LDA & Digital Humanities vis: https://dhs.stanford.edu/comprehending-the-digital-humanities/
        • Jim Vallandingham’s tooltip code and a great demo/tutorial: http://flowingdata.com/2012/08/02/how-to-make-an-interactive-network-visualization/
        • Rickshaw for timeseries graphs: https://github.com/shutterstock/rickshaw
    67. THE VIDEO OF THE TALK: http://blogger.ghostweather.com/2013/06/analysis-of-fiction-my-openvisconf-talk.html http://www.youtube.com/watch?v=f41U936WqPM P.S. SEE THE BLOG POST/EXAMPLES LIVE…
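
Step 2 on slide 8 (cut a document into 500 “word” chunks using Python) is easy to sketch. The talk does not show the code, so this is only one plausible version: it assumes a “word” is any whitespace-separated token, and the name `chunk_words` and the sample text are made up for illustration.

```python
def chunk_words(text, chunk_size=500):
    """Split text on whitespace and group the words into consecutive chunks."""
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Hypothetical stand-in for the book text: 1,200 numbered "words".
sample_text = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_words(sample_text)
print([len(chunk.split()) for chunk in chunks])  # → [500, 500, 200]
```

The last chunk is simply shorter than the rest. Chunking this way ignores scene and chapter boundaries, which is part of why raters see pieces with no context.
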

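Slides 54 and 56 describe two small transforms: a rolling mean to smooth each per-chapter series, and standardizing so different series share a comparable scale. The sketch below uses plain Python so it runs without pandas. The talk does not say which standardizing transform was used, so a z-score is assumed here, and the data and function names are invented. In pandas releases since the talk, the top-level `rolling_mean` function has been replaced by `Series.rolling(window).mean()`.

```python
from statistics import mean, pstdev


def rolling_mean(values, window):
    """Trailing moving average; positions without a full window are None."""
    smoothed = []
    for i in range(len(values)):
        if i + 1 < window:
            smoothed.append(None)
        else:
            smoothed.append(mean(values[i + 1 - window:i + 1]))
    return smoothed


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Invented per-chapter numbers on very different scales.
noun_counts = [120, 150, 90, 180, 210, 160]
excitement = [1.0, 1.5, 1.0, 2.5, 3.0, 2.0]

print(rolling_mean(noun_counts, window=3))
print(standardize(noun_counts))
print(standardize(excitement))
```

After standardizing, both series have mean 0 and standard deviation 1, so they can share one axis. A larger window gives a smoother curve at the cost of losing more chapters at the start.
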