A case for teaching SQL to scientists

A quick, off-the-cuff talk about why I think SQL is good for scientists. Please send me notes correcting my Python, arguing, or asking for more information! And see the tutorial at: http://uwescience.github.io/sqlshare

Published in: Education, Technology

  1. A case for teaching SQL to scientists. Daniel Halperin, #w2tbac @SESYNC, 2013-07-09
  2. SQL: think like data
     • SQL is a Language for expressing Queries over Structured data.
     • Compared with Python/R, SQL is
       • strictly less powerful
       • better for concisely, clearly, and efficiently expressing data manipulation
     • ... and anecdotally, “many” scripts written by scientists just manipulate data
  3. Claim 1: SQL is Concise & Clear
     • English questions often translate directly into SQL
     • Scripting languages have a lot of language overhead -- syntactic sugar
     • Let’s see some (admittedly biased) examples
  4. What does this code do?
         with open('file.txt') as input_file:
             cnt = 0
             for line in input_file:
                 cnt += 1
         print(cnt)
  5. What does this code do?
         with open('file.txt') as input_file:
             cnt = 0
             for line in input_file:
                 cnt += 1
         print(cnt)
     In SQL:
         SELECT COUNT(*) AS cnt FROM file
  6. What does this code do?
         with open('file.txt') as input_file:
             for line in input_file:
                 # the 4th whitespace-separated field holds the value
                 if int(line.split()[3]) > 5:
                     print(line, end='')  # line already ends with a newline
  7. What does this code do?
         with open('file.txt') as input_file:
             for line in input_file:
                 # the 4th whitespace-separated field holds the value
                 if int(line.split()[3]) > 5:
                     print(line, end='')  # line already ends with a newline
     In SQL:
         SELECT * FROM file WHERE value > 5
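The SQL on slides 5 and 7 assumes the text file has already been loaded into a table called `file` with a column called `value`. Here is a minimal sketch of that step using Python's built-in `sqlite3` module. The sample rows and the column names `a`, `b`, `c` are made up for illustration; the slides only say that the fourth field is numeric.

```python
import sqlite3

# Hypothetical sample rows standing in for file.txt. Only the 4th field
# (value) comes from the slides; the columns a, b, c are placeholders.
ROWS = [
    ("x", "y", "z", 3),
    ("x", "y", "z", 7),
    ("x", "y", "z", 9),
]


def load_table(rows):
    """Create an in-memory table named `file` and load the rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE file (a TEXT, b TEXT, c TEXT, value INTEGER)")
    conn.executemany("INSERT INTO file VALUES (?, ?, ?, ?)", rows)
    return conn


conn = load_table(ROWS)

# Slide 5: count the rows.
cnt = conn.execute("SELECT COUNT(*) AS cnt FROM file").fetchone()[0]

# Slide 7: keep only the rows whose value exceeds 5.
big = conn.execute("SELECT * FROM file WHERE value > 5").fetchall()

print(cnt)
print(big)
```

Once the data is in a table, each question is a single query, and the parsing logic (`line.split()[3]`, `int(...)`) lives in the table definition instead of being repeated in every script.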
  8. What does this code do?
         SELECT value, SUM(counts) AS tot_count
         FROM file
         GROUP BY value
  9. What does this code do?
         from collections import defaultdict

         with open('file.txt') as input_file:
             tot_counts = defaultdict(int)  # missing keys start at 0
             for line in input_file:
                 fields = line.split()
                 tot_counts[fields[3]] += int(fields[4])
         for value in tot_counts:
             print(value, tot_counts[value])
     In SQL:
         SELECT value, SUM(counts) AS tot_count
         FROM file
         GROUP BY value
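To check that the dictionary loop and the GROUP BY query on slide 9 really compute the same thing, here is a small sketch that runs both on the same made-up data. The `(value, counts)` pairs are hypothetical, and SQLite (via Python's standard `sqlite3` module) stands in for whatever database you would actually use.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (value, counts) pairs; on the slide these are the 4th and
# 5th whitespace-separated fields of each line in file.txt.
PAIRS = [("a", 2), ("b", 5), ("a", 3)]


def totals_python(pairs):
    """Sum counts per value with a dictionary, as in the slide's loop."""
    tot_counts = defaultdict(int)
    for value, counts in pairs:
        tot_counts[value] += counts
    return dict(tot_counts)


def totals_sql(pairs):
    """Sum counts per value with the slide's GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE file (value TEXT, counts INTEGER)")
    conn.executemany("INSERT INTO file VALUES (?, ?)", pairs)
    rows = conn.execute(
        "SELECT value, SUM(counts) AS tot_count FROM file GROUP BY value"
    ).fetchall()
    conn.close()
    return dict(rows)


print(totals_python(PAIRS))
print(totals_sql(PAIRS))
```

Both functions return the same mapping from value to total count.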
  10. What does this code do?
          SELECT census.county, electoral.votes / census.population AS voting_rate
          FROM electoral, census
          WHERE electoral.county = census.county
  11. What does this code do?
          SELECT census.county, electoral.votes / census.population AS voting_rate
          FROM electoral, census
          WHERE electoral.county = census.county
      <Complicated stuff with dictionaries>
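For the curious, one guess at what the "complicated stuff with dictionaries" on slide 11 might look like in Python. This is a sketch, not the code from the talk: the sample rows are invented, and it assumes each county appears at most once in the census table.

```python
# Hypothetical rows; the slides do not show the tables' contents.
ELECTORAL = [("King", 900), ("Pierce", 300), ("Nowhere", 10)]  # (county, votes)
CENSUS = [("King", 2000), ("Pierce", 800), ("Kitsap", 250)]  # (county, population)


def voting_rates(electoral, census):
    """Join the two tables on county and compute votes / population."""
    # Index one table by the join key (assumes county is unique in census).
    population_by_county = {county: population for county, population in census}
    rates = []
    for county, votes in electoral:
        # Like the SQL join, drop counties that appear in only one table.
        if county in population_by_county:
            rates.append((county, votes / population_by_county[county]))
    return rates


for county, rate in voting_rates(ELECTORAL, CENSUS):
    print(county, rate)
```

One caveat when comparing the two versions: Python 3's `/` always produces a fraction, while some SQL engines truncate when both operands are integers, so the SQL may need a cast to a floating-point type to give the same rates.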
  12. Claim 2: SQL is Efficient
      Scaling up your data
      • What happens when Python/R data doesn’t fit in memory? The script crashes, or you rewrite it as much more complicated code
      • Database systems are built to spill to disk automatically and transparently, and are heavily optimized for performance
  13. Claim 2: SQL is Efficient
      Say you inherit a really well-engineered Python script
          ./highly_optimized_code.py < TB.dataset > GB.result
  14. Claim 2: SQL is Efficient
      Say you inherit a really well-engineered Python script
          ./highly_optimized_code.py < TB.dataset > GB.result
      But you are only interested in a small fraction of the result
          ./simple_data_filter.py < GB.result > MB.answer
  15. Claim 2: SQL is Efficient
      Say you inherit a really well-engineered Python script
          ./highly_optimized_code.py < TB.dataset > GB.result
      But you are only interested in a small fraction of the result
          ./simple_data_filter.py < GB.result > MB.answer
      1) Dive into the complex code and modify its internals to filter inside
      2) Suffer the long running time of the first program
  16. Claim 2: SQL is Efficient
          CREATE VIEW their_query AS
          SELECT <... their code ...>
          FROM terabyte_dataset
      Gives their query a name, but doesn’t execute it!
  17. Claim 2: SQL is Efficient
          CREATE VIEW their_query AS
          SELECT <... their code ...>
          FROM terabyte_dataset
      Gives their query a name, but doesn’t execute it!
          SELECT *
          FROM their_query
          WHERE <... your filter ...>
      Combine both queries and optimize together!
  18. Claim 2: SQL is Efficient
          CREATE VIEW their_query AS
          SELECT <... their code ...>
          FROM terabyte_dataset
      Gives their query a name, but doesn’t execute it!
          SELECT *
          FROM their_query
          WHERE <... your filter ...>
      Combine both queries and optimize together! Fast!
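The view pattern from slides 16 to 18 can be tried out at toy scale with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table columns (`id`, `reading`), the doubling query standing in for "their code", and the `id < 3` filter are all invented, and SQLite is just a convenient stand-in for a real database server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny stand-in for the terabyte dataset (hypothetical columns).
conn.execute("CREATE TABLE terabyte_dataset (id INTEGER, reading REAL)")
conn.executemany(
    "INSERT INTO terabyte_dataset VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1000)],
)

# "Their code": defining the view only names the query, nothing runs yet.
conn.execute(
    "CREATE VIEW their_query AS "
    "SELECT id, reading * 2 AS doubled FROM terabyte_dataset"
)

# "Your filter": the database sees the view and the filter as one query.
rows = conn.execute("SELECT * FROM their_query WHERE id < 3").fetchall()
print(rows)

# Ask SQLite how it plans to execute the combined query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM their_query WHERE id < 3"
).fetchall()
for step in plan:
    print(step)
```

The plan output differs between SQLite versions, so it is printed rather than interpreted here. The point is that the filter is handed to the database together with the view definition, so the optimizer gets to decide where to apply it.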
  19. SQL for Science
      • UW’s SQLShare - open, view-oriented, web database service
      • Easy data import, public & private sharing, permalinks (DOI support coming)
      • Use a series of views instead of scripts for:
        • data cleaning, transformation, integration
        • simple stats, analytics, format conversion
        • provenance and publishing
        • mashups: integrated with R, Sage, etc.
  20. escience.washington.edu/sqlshare
      “An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.” - Andrew D White, grad student in UW Chem Eng
      “I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.” - Robin Kodner, as asst professor at Western Washington U
      “That [SQL query that finished in 1 second] took me a week [manually in Excel]!” - Robin Kodner, as postdoc at UW Oceanography
      * yes, we need (and are interested in) more than anecdotes!!
  21. SQL can do more than you think (here vs R)
