A case for teaching SQL to scientists

A quick, off-the-cuff talk about why I think SQL is good for scientists. Please send me notes correcting my Python, arguing, or asking for more information! And see the tutorial at: http://uwescience.github.io/sqlshare

Published in: Education, Technology

Transcript

  • 1. A case for teaching SQL to scientists
      Daniel Halperin
      #w2tbac @SESYNC
      2013-07-09
  • 2. SQL: think like data
      • SQL is a Language for expressing Queries over Structured data.
      • Compared with Python/R, SQL is
          • strictly less powerful
          • better for concisely, clearly, and efficiently expressing data manipulation
      • ... and anecdotally, “many” scripts written by scientists just manipulate data
  • 3. Claim 1: SQL is Concise & Clear
      • English questions often translate directly into SQL
      • Scripting languages have a lot of language overhead -- syntactic sugar
      • Let’s see some (admittedly biased) examples
  • 4. What does this code do?

        with open('file.txt') as input_file:
            cnt = 0
            for line in input_file:
                cnt += 1
            print(cnt)

  • 5. What does this code do?

        with open('file.txt') as input_file:
            cnt = 0
            for line in input_file:
                cnt += 1
            print(cnt)

        SELECT COUNT(*) AS cnt
        FROM file
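
The equivalence claimed on slides 4 and 5 can be checked with a minimal sketch like the one below. It assumes Python 3 and the standard-library `sqlite3` module; the sample lines and the single-column layout of the `file` table are hypothetical stand-ins for `file.txt`, not part of the talk.

```python
import sqlite3

# Hypothetical sample data standing in for the lines of file.txt.
lines = ["alpha", "beta", "gamma"]

# Python version from the slide: count by looping over the lines.
cnt = 0
for line in lines:
    cnt += 1

# SQL version: load the same lines into an in-memory table named "file"
# (the one-column schema is an assumption made for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (line TEXT)")
conn.executemany("INSERT INTO file VALUES (?)", [(l,) for l in lines])
sql_cnt = conn.execute("SELECT COUNT(*) AS cnt FROM file").fetchone()[0]

print(cnt, sql_cnt)
```

Both counts should agree, since each approach counts one unit per line or row.
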
  • 6. What does this code do?

        with open('file.txt') as input_file:
            for line in input_file:
                # keep lines whose fourth field is greater than 5
                if int(line.split()[3]) > 5:
                    print(line, end='')

  • 7. What does this code do?

        with open('file.txt') as input_file:
            for line in input_file:
                # keep lines whose fourth field is greater than 5
                if int(line.split()[3]) > 5:
                    print(line, end='')

        SELECT *
        FROM file
        WHERE value > 5
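
A runnable sketch of the filter on slides 6 and 7, again assuming Python 3 and `sqlite3`. The slides only name the `value` column (and later `counts`); the other column names, the five-field layout, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical lines with five whitespace-separated fields.
# Field 3 (zero-based) plays the role of "value", field 4 of "counts".
lines = [
    "s1 x y 3 10",
    "s2 x y 7 20",
    "s3 x y 9 30",
]

# Python version from the slide: keep lines whose fourth field exceeds 5.
py_rows = [line for line in lines if int(line.split()[3]) > 5]

# SQL version: parse the fields once, then let the database filter.
rows = [(f[0], f[1], f[2], int(f[3]), int(f[4]))
        for f in (line.split() for line in lines)]
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file (id TEXT, a TEXT, b TEXT, value INTEGER, counts INTEGER)"
)
conn.executemany("INSERT INTO file VALUES (?, ?, ?, ?, ?)", rows)
sql_rows = conn.execute("SELECT * FROM file WHERE value > 5").fetchall()

print(py_rows)
print(sql_rows)
```

The SQL version refers to the column by name, while the Python version depends on the field's position in each line.
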
  • 8. What does this code do?

        SELECT value, SUM(counts) AS tot_count
        FROM file
        GROUP BY value

  • 9. What does this code do?

        from collections import defaultdict

        with open('file.txt') as input_file:
            tot_counts = defaultdict(int)  # missing keys start at 0
            for line in input_file:
                fields = line.split()
                tot_counts[fields[3]] += int(fields[4])
            for value in tot_counts:
                print(value, tot_counts[value])

        SELECT value, SUM(counts) AS tot_count
        FROM file
        GROUP BY value
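
The grouping on slides 8 and 9 can be sketched the same way. This assumes Python 3 and `sqlite3`; the sample lines and the column names other than `value` and `counts` are hypothetical. Unlike the slide, the Python side converts `value` to an integer so that its keys match the integer column on the SQL side.

```python
import sqlite3
from collections import defaultdict

# Hypothetical lines: field 3 is "value", field 4 is "counts".
lines = [
    "s1 x y 3 10",
    "s2 x y 7 20",
    "s3 x y 3 5",
]

# Python version: accumulate counts per value in a dictionary.
tot_counts = defaultdict(int)
for line in lines:
    fields = line.split()
    tot_counts[int(fields[3])] += int(fields[4])

# SQL version: GROUP BY does the same accumulation declaratively.
rows = [(f[0], f[1], f[2], int(f[3]), int(f[4]))
        for f in (line.split() for line in lines)]
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file (id TEXT, a TEXT, b TEXT, value INTEGER, counts INTEGER)"
)
conn.executemany("INSERT INTO file VALUES (?, ?, ?, ?, ?)", rows)
sql_totals = dict(
    conn.execute(
        "SELECT value, SUM(counts) AS tot_count FROM file GROUP BY value"
    ).fetchall()
)

print(dict(tot_counts))
print(sql_totals)
```
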
  • 10. What does this code do?

        SELECT census.county,
               electoral.votes / census.population AS voting_rate
        FROM electoral, census
        WHERE electoral.county = census.county

  • 11. What does this code do?

        SELECT census.county,
               electoral.votes / census.population AS voting_rate
        FROM electoral, census
        WHERE electoral.county = census.county

        <Complicated stuff with dictionaries>
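
Slide 11 leaves the Python side as a placeholder. The sketch below shows one possible dictionary-based version next to the slide's query; it is an illustration, not the speaker's code. The county names and numbers are made up. The columns are declared `REAL` on purpose: in SQLite, dividing two integer columns truncates, so integer columns would give a rate of 0 here.

```python
import sqlite3

# Hypothetical data; county names and numbers are invented for illustration.
electoral = [("King", 600.0), ("Pierce", 150.0)]
census = [("King", 1000.0), ("Pierce", 500.0), ("Skagit", 100.0)]

# Dictionary-based join in plain Python: index one table by county,
# then look up each row of the other table.
population_by_county = dict(census)
py_rates = {
    county: votes / population_by_county[county]
    for county, votes in electoral
    if county in population_by_county
}

# SQL version: the query from the slide, run against in-memory tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE electoral (county TEXT, votes REAL)")
conn.execute("CREATE TABLE census (county TEXT, population REAL)")
conn.executemany("INSERT INTO electoral VALUES (?, ?)", electoral)
conn.executemany("INSERT INTO census VALUES (?, ?)", census)
sql_rates = dict(
    conn.execute(
        "SELECT census.county, "
        "       electoral.votes / census.population AS voting_rate "
        "FROM electoral, census "
        "WHERE electoral.county = census.county"
    ).fetchall()
)

print(py_rates)
print(sql_rates)
```

Counties that appear in only one table (Skagit here) drop out of both results, which is the inner-join behavior the slide's `WHERE` clause expresses.
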
  • 12. Claim 2: SQL is Efficient
      Scaling up your data
      • What happens when Python/R data doesn’t fit in memory? Either the script crashes, or you rewrite it as much more complicated code
      • Most databases spill to disk automatically and transparently, and are heavily optimized for performance
  • 13. Claim 2: SQL is Efficient
      Say you inherit a really well-engineered Python script

        ./highly_optimized_code.py < TB.dataset > GB.result

  • 14. Claim 2: SQL is Efficient
      Say you inherit a really well-engineered Python script

        ./highly_optimized_code.py < TB.dataset > GB.result

      But are only interested in a small fraction of the result

        ./simple_data_filter.py < GB.result > MB.answer

  • 15. Claim 2: SQL is Efficient
      Say you inherit a really well-engineered Python script

        ./highly_optimized_code.py < TB.dataset > GB.result

      But are only interested in a small fraction of the result

        ./simple_data_filter.py < GB.result > MB.answer

      1) Dive into the complex code and modify its internals to filter inside
      2) Suffer the long running time of the first program
  • 16. Claim 2: SQL is Efficient

        CREATE VIEW their_query AS
        SELECT <... their code ...>
        FROM terabyte_dataset

      Gives their query a name, but doesn’t execute it!

  • 17. Claim 2: SQL is Efficient

        CREATE VIEW their_query AS
        SELECT <... their code ...>
        FROM terabyte_dataset

      Gives their query a name, but doesn’t execute it!

        SELECT *
        FROM their_query
        WHERE <... your filter ...>

      Combine both queries and optimize together!

  • 18. Claim 2: SQL is Efficient

        CREATE VIEW their_query AS
        SELECT <... their code ...>
        FROM terabyte_dataset

      Gives their query a name, but doesn’t execute it!

        SELECT *
        FROM their_query
        WHERE <... your filter ...>

      Combine both queries and optimize together! Fast!
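
A small sketch of the view idea from slides 16 to 18, assuming Python 3 and `sqlite3`. The table schema, the aggregation standing in for "their code", and the filter are all hypothetical. It shows only that defining a view runs nothing: rows inserted after `CREATE VIEW` still appear when the view is queried. Whether and how far a filter is pushed into the view depends on the database's optimizer, which this sketch does not measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terabyte_dataset (sample TEXT, value INTEGER)")

# Stand-in for "their code": a hypothetical aggregation per sample.
conn.execute(
    "CREATE VIEW their_query AS "
    "SELECT sample, SUM(value) AS total "
    "FROM terabyte_dataset GROUP BY sample"
)

# Insert data AFTER defining the view. The view stores only the query,
# not its result, so these rows are still visible through it.
conn.executemany(
    "INSERT INTO terabyte_dataset VALUES (?, ?)",
    [("a", 1), ("a", 2), ("b", 10), ("c", 100)],
)

# "Your filter", applied on top of the view; the database sees both
# queries together when it plans the execution.
result = conn.execute(
    "SELECT * FROM their_query WHERE sample = 'a'"
).fetchall()

print(result)  # [('a', 3)]
```
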
  • 19. SQL for Science
      • UW’s SQLShare - open, view-oriented, web database service
      • Easy data import, public & private sharing, permalinks (DOI support coming)
      • Use a series of views instead of scripts for:
          • data cleaning, transformation, integration
          • simple stats, analytics, format conversion
          • provenance and publishing
          • mashups: integrated with R, Sage, etc.
  • 20. escience.washington.edu/sqlshare
      “An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
        - Andrew D White, grad student in UW Chem Eng
      “I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.”
        - Robin Kodner, as asst professor at Western Washington U
      "That [SQL query that finished in 1 second] took me a week [manually in Excel]!"
        - Robin Kodner, as postdoc at UW Oceanography
      * yes, we need (and are interested in) more than anecdotes!!
  • 21. SQL can do more than you think (here vs R)
