Reproducible Data Science with R

Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.

The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.


  1. Reproducible Data Science with R
     David Smith, R Community Lead, Microsoft (@revodavid)
  2. What is reproducibility?
     “Two honest researchers would get the same result” – John Mount
     • Transparent data sourcing and availability
     • Fully automated analysis pipeline (code provided)
     • Traceability from published results back to data
  3. Why Reproducibility?
     • Save time
     • Better science
     • More authoritative research
     • Reduce risk of errors
     • Facilitate collaboration
  8. Accessing Data
     • Data sources
       – Databases, sensor logs, spreadsheets, file download, APIs, …
       – Remember, these are sources: don’t modify them!
     • Snapshot data into local static files (text is good)
       – You will likely include some ETL steps here
       – Record a timestamp of when data was extracted: Sys.time()
       – With big data, sample during development, scale up to finalize
       – Document how this was all done, preferably with code (source file)!
     • Import text files using R functions
       – Recommended package: readr (avoids issues with locale, dates, etc.)
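The snapshot-and-import steps above can be sketched roughly as follows. This is only an illustration: `extract_source()` is a hypothetical stand-in for a real database query or API call, the file names are made up, and `tempdir()` is used so the example runs anywhere (a real project would use a relative path such as `data/`). The import falls back to base R if readr is not installed.

```r
# Hypothetical stand-in for a real extraction step (database query, API call, ...).
extract_source <- function() {
  data.frame(id = 1:5, value = c(2.5, 3.1, 4.7, 1.2, 3.8))
}

# Illustrative location; in a project this would be a relative path like "data".
snapshot_dir <- file.path(tempdir(), "data")
dir.create(snapshot_dir, showWarnings = FALSE, recursive = TRUE)
snapshot_file <- file.path(snapshot_dir, "snapshot.csv")
stamp_file <- file.path(snapshot_dir, "snapshot_extracted_at.txt")

# Snapshot the source into a static text file and record when it was extracted.
raw <- extract_source()
write.csv(raw, snapshot_file, row.names = FALSE)
writeLines(format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), stamp_file)

# Import the snapshot (never the live source) with explicit column types.
if (requireNamespace("readr", quietly = TRUE)) {
  dat <- readr::read_csv(
    snapshot_file,
    col_types = readr::cols(
      id = readr::col_integer(),
      value = readr::col_double()
    )
  )
} else {
  dat <- read.csv(snapshot_file, colClasses = c("integer", "numeric"))
}
```

Declaring the column types up front means the import does not depend on type guessing, which is one way to keep the result stable across machines.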
  9. Analysis process
     • Interactively explore data & develop analyses as usual
     • Capture the entire process in an R script
       – library(tidyverse) is helpful for cleaning, feature generation, etc.
       – Re-usable components can be shared as (private) packages
     • Generate artifacts using scripts
       – Graphics (please, no JPGs!)
       – Tables
       – Documents
     • Include timestamps in output: Sys.time()
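As a rough sketch of generating artifacts from a script, the example below writes a table and a timestamped plot using only base R and the built-in `mtcars` data. The model, file names and output directory are illustrative assumptions, not part of the talk; PDF is used here simply as one non-JPG format that base R can always produce.

```r
# Illustrative output location; a project would use a relative path like "output".
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

run_time <- Sys.time()

# A small example analysis on a built-in dataset.
fit <- lm(mpg ~ wt, data = mtcars)

# Table artifact: model coefficients written as plain text (CSV).
coef_table <- data.frame(
  term = names(coef(fit)),
  estimate = unname(coef(fit))
)
table_file <- file.path(out_dir, "coefficients.csv")
write.csv(coef_table, table_file, row.names = FALSE)

# Graphic artifact: a vector PDF with the generation time in the subtitle.
plot_file <- file.path(out_dir, "mpg_vs_wt.pdf")
pdf(plot_file, width = 6, height = 4)
plot(
  mpg ~ wt,
  data = mtcars,
  main = "Fuel economy vs. weight",
  sub = paste("Generated", format(run_time, "%Y-%m-%d %H:%M %Z"))
)
abline(fit)
invisible(dev.off())
```

Re-running the script regenerates both files, so nothing in the output depends on manual steps.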
  10. A reproducible analysis environment
      • Operating system and R version
        – For most purposes, not the biggest cause of issues
          • But do document your R session info: sessionInfo()
        – For production, consider VM / container
      • A clean R environment
        – Organize work into independent projects (directories)
        – Use relative paths in scripts
        – Avoid use of .Rprofile
        – Set explicit random seeds
        – Do not save R workspace
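A few of the points above, sketched in base R. The seed value and file names are arbitrary choices for illustration, and `tempdir()` stands in for a project directory.

```r
# Set an explicit seed so random draws repeat exactly from run to run.
set.seed(20170420)
draws <- rnorm(3)

set.seed(20170420)
draws_again <- rnorm(3)
identical(draws, draws_again)

# Build paths relative to the project directory rather than hard-coding
# an absolute location (hypothetical file; it is not read here).
input_file <- file.path("data", "snapshot.csv")

# Record the R version, platform and loaded packages alongside the results.
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_file)
```

Starting R with `R --no-save --no-restore` (or the equivalent IDE setting) is one way to avoid carrying a saved workspace between sessions.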
  11. Managing a changing R package ecosystem
      • One command to lock package versions to a specific date: checkpoint("2017-04-11")
        (For collaborators, the same command downloads the required package versions)
      • Applies just to this project
        – Global package upgrades won’t break reproducibility in other projects
      • checkpoint package available on CRAN, included with Microsoft R
  12. Presenting results
      • Eliminate manual processes (as far as possible)
        – Annotations (graphs / tables)
        – Cut-and-paste into documents
      • Notebooks
        – Combines code, output and narrative
        – Good for collaboration with other researchers
      • Document Generation
        – Best for automating reports
  13. knitr / Rmarkdown
      • Generate HTML, Word, or PDF reports
        – Or books, blogs, slides, …
      • Combine narrative and R code in a single document
      • Human-readable, easy to edit
      • Single-click update!
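To make the idea concrete, here is a minimal sketch that writes a small R Markdown file and knits it to Markdown. The report content and file names are invented for illustration, and the knitting step is skipped if knitr is not installed. The chunk fence is assembled with `strrep()` only to avoid nesting backtick fences inside this example.

```r
fence <- strrep("`", 3)  # three backticks, built programmatically

# A tiny report: YAML header, narrative with inline R, and one code chunk.
rmd_lines <- c(
  "---",
  "title: \"Fuel economy report\"",
  "output: html_document",
  "---",
  "",
  "Report generated on `r format(Sys.time(), '%Y-%m-%d %H:%M %Z')`.",
  "",
  paste0(fence, "{r fit}"),
  "fit <- lm(mpg ~ wt, data = mtcars)",
  "round(coef(fit), 2)",
  fence
)

rmd_file <- file.path(tempdir(), "report.Rmd")
writeLines(rmd_lines, rmd_file)

# Knit to Markdown if knitr is available; the code is re-run on every knit.
md_file <- NA_character_
if (requireNamespace("knitr", quietly = TRUE)) {
  md_file <- knitr::knit(
    rmd_file,
    output = file.path(tempdir(), "report.md"),
    quiet = TRUE
  )
}
```

For HTML, Word or PDF output, `rmarkdown::render(rmd_file)` would be the usual next step; it additionally requires the rmarkdown package and pandoc.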
  14. Collaboration and sharing
      • Just share R project folder
      • Publish on GitHub
        – Happy Git and GitHub for the useR, Jenny Bryan http://happygitwithr.com/
        – Version retention and tracking
        – Collaboration (code and comments)
  15. Take-Aways
      Reproducibility is Beneficial
      • Saves time
      • Produces better science
      • More trusted research
      • Reduced risk of errors
      • Encourages collaboration
      Reproducibility is Simple
      • Document and automate processes with R scripts
      • Read and clean data with tidyverse
      • Use checkpoint to manage package versions
      • Generate documents with knitr
      • Share reproducible projects with GitHub