When Too Much Data is Barely Enough

A presentation for a workshop at IACSS09

Published in: Sports, Technology, Business

  1. 1. When Too Much Data is Barely Enough. Bob Buckley, Chris Barnes
  2. 2. When Too Much Data is Barely Enough <ul><li>Most sports face an explosion of training and competition data: detailed logs of performance, training and injuries, together with medical, physiological and anthropometric data. This workshop discusses the challenges and opportunities that come with this plethora of athlete and team data, and offers practical suggestions on how to get “bang for bucks”. </li></ul>
  3. 3. Workshop – Part #1: Overview <ul><li>Data collection: sources, formats </li></ul><ul><li>Aggregating data </li></ul><ul><ul><li>Normalise data </li></ul></ul><ul><ul><li>Standardise values </li></ul></ul><ul><li>Preparing for analysis </li></ul><ul><ul><li>Linking data </li></ul></ul><ul><ul><li>Example: adding history </li></ul></ul><ul><li>Part #2: Chris will discuss analysis </li></ul>
  4. 4. Data sources <ul><li>Performance data </li></ul><ul><ul><li>Match/competition results </li></ul></ul><ul><ul><li>Match measures, split times, etc. </li></ul></ul><ul><ul><li>Video & instruments (e.g. GPS) </li></ul></ul><ul><li>Injury data/logs & medical records </li></ul><ul><li>Training & recovery logs </li></ul><ul><ul><li>RPE and instruments </li></ul></ul><ul><li>Sports science testing </li></ul><ul><ul><li>Physio, psych, anthro, biomech, … </li></ul></ul>
  5. 5. Data source formats <ul><li>MS Excel - just say “No!” </li></ul><ul><ul><li>If you must … </li></ul></ul><ul><ul><li>Row 1 only has column headings </li></ul></ul><ul><ul><li>Single row </li></ul></ul><ul><ul><li>Consistent data in columns </li></ul></ul><ul><ul><li>Indicate missing data clearly </li></ul></ul><ul><ul><li>Comments column … after last data column </li></ul></ul><ul><li>Database tables </li></ul><ul><li>Text file versions of database tables </li></ul><ul><ul><li>CSV or TAB-separated text files </li></ul></ul>
  6. 6. Simple example <ul><li>Event-based source data files … e.g. Olympics and World Champs </li></ul><ul><li>Athlete-oriented transformation … </li></ul><ul><li>Normalise data, reorder columns & sort </li></ul>
     AAA Comp 100m 12/5/2009
       pos | name | time
       1   | XXX  | 9.85
       2   | YYY  | 10.02
       3   | ZZZ  | 10.15
       4   | …    | …
     Result history: 100m, Name: YYY
       date       | time  | pos | comp
       12/5/2009  | 10.02 | 2   | AAA
       3/2/2009   | 10.45 | 3   | AUS 09
       21/11/2008 | 10.44 | 2   | Asia 08
       …          | …     | …   | …
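The athlete-oriented transformation on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the event results have already been read into dictionaries; the key names (`comp`, `date`, `pos`, `name`, `time`) and the function name `athlete_history` are hypothetical, while the values are the ones shown on the slide.

```python
from datetime import datetime

# Event-based rows. Key names are assumptions; values follow the slide.
results = [
    {"comp": "AAA", "date": "12/5/2009", "pos": 1, "name": "XXX", "time": 9.85},
    {"comp": "AAA", "date": "12/5/2009", "pos": 2, "name": "YYY", "time": 10.02},
    {"comp": "AAA", "date": "12/5/2009", "pos": 3, "name": "ZZZ", "time": 10.15},
    {"comp": "AUS 09", "date": "3/2/2009", "pos": 3, "name": "YYY", "time": 10.45},
    {"comp": "Asia 08", "date": "21/11/2008", "pos": 2, "name": "YYY", "time": 10.44},
]


def athlete_history(rows, name):
    """Return one athlete's results, newest first, as (date, time, pos, comp)."""
    mine = [r for r in rows if r["name"] == name]
    # Sort on the parsed date, not the dd/mm/yyyy text, which does not sort correctly.
    mine.sort(key=lambda r: datetime.strptime(r["date"], "%d/%m/%Y"), reverse=True)
    return [(r["date"], r["time"], r["pos"], r["comp"]) for r in mine]


history = athlete_history(results, "YYY")
for row in history:
    print(row)
```

The columns are reordered to date, time, pos, comp, as in the result history table, and the athlete's name drops out of the rows because it now identifies the whole table.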
  7. 7. Data Normalisation <ul><li>Puts data into “tables” </li></ul><ul><li>Add identifying info to detail rows… </li></ul>
     AAA Comp 100m 12/5/2009
       pos | name | time
       1   | XXX  | 9.85
       2   | YYY  | 10.02
       3   | ZZZ  | 10.15
       4   | …    | …
     Detail rows with Event ID added
       pos | name | time  | Event ID
       1   | XXX  | 9.85  | 2009475
       2   | YYY  | 10.02 | 2009475
       3   | ZZZ  | 10.15 | 2009475
       4   | …    | …     | 2009475
     Event table
       Event ID | Comp          | Event type   | date
       2009475  | AAA Comp 2009 | M 100m final | 12/5/2009
       2009476  | AAA Comp 2009 | W 100m final | 12/5/2009
       …        | …             | …            | …
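As a rough sketch of this normalisation step, the function below splits one event block into a single event row plus detail rows tagged with the event ID. The function name `normalise_event` and the dictionary keys are assumptions made for illustration; the values are the ones on the slide.

```python
def normalise_event(event_id, comp, event_type, date, detail_rows):
    """Split one event block into an event row and ID-tagged detail rows."""
    event_row = {
        "event_id": event_id,
        "comp": comp,
        "event_type": event_type,
        "date": date,
    }
    # Each detail row gets the event ID, so it still identifies its event
    # once all events are stacked into one table.
    tagged = [dict(row, event_id=event_id) for row in detail_rows]
    return event_row, tagged


detail = [
    {"pos": 1, "name": "XXX", "time": 9.85},
    {"pos": 2, "name": "YYY", "time": 10.02},
    {"pos": 3, "name": "ZZZ", "time": 10.15},
]
event, rows = normalise_event(
    2009475, "AAA Comp 2009", "M 100m final", "12/5/2009", detail
)
print(event)
for row in rows:
    print(row)
```

The competition name, event type and date are stored once in the event table instead of being repeated, or lost, in every detail row.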
  8. 8. Tips for Names <ul><li>Use ID or standard name </li></ul><ul><li>Perhaps use a table to map non-standard to ID (or standard names) </li></ul><ul><li>Family name can be first or last … </li></ul><ul><li>Names use capitals inconsistently, e.g. Debeer, De Beer and de Beer </li></ul><ul><li>Sorting: McXXX = MacXXX </li></ul>
       Non-standard         | Standard
       Nova Perris          | Nova Perris
       Perris, Nova         | Nova Perris
       Nova Perris-Kneebone | Nova Perris
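One possible way to apply these tips is a lookup table plus a sort key. The sketch below is illustrative only: `NAME_MAP`, `standard_name` and `sort_key` are hypothetical names, and the mapping entries are the three rows from the slide's table.

```python
# Mapping table from the slide: non-standard spelling -> standard name.
NAME_MAP = {
    "Nova Perris": "Nova Perris",
    "Perris, Nova": "Nova Perris",
    "Nova Perris-Kneebone": "Nova Perris",
}


def standard_name(raw, name_map=NAME_MAP):
    """Return the standard name, or None so unmapped names can be reviewed."""
    key = " ".join(raw.split())  # collapse stray whitespace before the lookup
    return name_map.get(key)


def sort_key(family_name):
    """Sort key that ignores case and treats a leading 'Mc' like 'Mac'."""
    key = family_name.strip().lower()
    if key.startswith("mc"):
        key = "mac" + key[2:]
    return key


print(standard_name("Perris,  Nova"))
print(standard_name("Somebody Else"))
print(sort_key("McXXX") == sort_key("MacXXX"))
```

Returning `None` for an unmapped name, rather than guessing, keeps new spellings visible so they can be added to the table by hand.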
  9. 9. Simple data issues <ul><li>Data consistency, e.g. format </li></ul><ul><li>Consistent units, e.g. metres vs km </li></ul><ul><li>Dates: dd/mm/yyyy vs mm/dd/yy and a plethora of date/time formats </li></ul><ul><li>Times: e.g. seconds vs mm:ss.ss </li></ul><ul><li>Names: First Last vs Last, First or maiden/married, formal/informal </li></ul>
  10. 10. Tips for dates & times <ul><li>yyyy/mm/dd sorts correctly as text too </li></ul><ul><li>Transform to a common format that preserves accuracy </li></ul><ul><li>… and where date/time subtraction gives a numeric result </li></ul><ul><li>Maybe use hh:mm:ss.ss </li></ul><ul><ul><li>Avoids confusing hh:mm with mm:ss </li></ul></ul>
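These tips might look as follows in Python. This is a sketch under one important assumption: the source format of each file is known in advance, because dd/mm/yyyy and mm/dd/yyyy cannot be told apart reliably from the text alone. The function names are hypothetical.

```python
from datetime import datetime


def to_sortable_date(text, fmt="%d/%m/%Y"):
    """Convert a date in a known source format to zero-padded yyyy/mm/dd text."""
    return datetime.strptime(text, fmt).strftime("%Y/%m/%d")


def days_between(earlier, later, fmt="%d/%m/%Y"):
    """Subtract two dates and return a plain number of days."""
    return (datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)).days


def to_seconds(text):
    """Convert 'ss.ss', 'mm:ss.ss' or 'hh:mm:ss.ss' to seconds as a float."""
    total = 0.0
    for part in text.split(":"):
        total = total * 60 + float(part)
    return total


dates = ["12/5/2009", "3/2/2009", "21/11/2008"]
print(sorted(to_sortable_date(d) for d in dates))
print(days_between("3/2/2009", "12/5/2009"))
print(to_seconds("1:02.50"))
```

Note that `to_seconds` reads a two-part value as mm:ss, which is exactly the ambiguity the slide warns about; writing every time as hh:mm:ss.ss in the source data avoids it.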
  11. 11. Overview of Analysis <ul><li>Problem selection … choose questions </li></ul><ul><li>Collect the required data </li></ul><ul><li>Prepare data for analysis </li></ul><ul><li>Analyse data </li></ul><ul><li>Review results … which messages matter </li></ul><ul><li>Present results … deliver your message(s) </li></ul><ul><li>Check message delivery is working </li></ul>
  12. 12. Link data for analysis <ul><li>Flat data … analyse data in rows </li></ul><ul><li>Join data tables … database term </li></ul><ul><li>Handle unknown values properly </li></ul><ul><ul><li>Join with unknown </li></ul></ul><ul><ul><li>Calculations with unknown </li></ul></ul>
     Results
       ID  | Date      | Time | pos
       435 | 3/5/2008  | 10.3 | 3
       512 | 3/5/2008  | 10.2 | 2
       512 | 21/3/2008 | 10.1 | 4
       607 | 3/5/2008  | 11.6 | 7
     Athletes
       ID  | DoB
       435 | 24/2/1987
       512 | 17/5/1989
     Joined
       ID  | Date      | Time | pos | DoB
       435 | 3/5/2008  | 10.3 | 3   | 24/2/1987
       512 | 3/5/2008  | 10.2 | 2   | 17/5/1989
       512 | 21/3/2008 | 10.1 | 4   | 17/5/1989
       607 | 3/5/2008  | 11.6 | 7   | ?
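The join on this slide keeps every result row and marks the date of birth as unknown when an athlete ID has no match (a left outer join in database terms). A minimal sketch, using the slide's data; the names `left_join_dob` and `age_in_days` are hypothetical, and returning `None` from a calculation on an unknown value is one reasonable convention, not the only one.

```python
from datetime import datetime

UNKNOWN = "?"

results = [
    {"ID": 435, "Date": "3/5/2008", "Time": 10.3, "pos": 3},
    {"ID": 512, "Date": "3/5/2008", "Time": 10.2, "pos": 2},
    {"ID": 512, "Date": "21/3/2008", "Time": 10.1, "pos": 4},
    {"ID": 607, "Date": "3/5/2008", "Time": 11.6, "pos": 7},
]
dob_by_id = {435: "24/2/1987", 512: "17/5/1989"}


def left_join_dob(rows, dobs):
    """Keep every result row; use UNKNOWN when the ID has no DoB record."""
    return [dict(r, DoB=dobs.get(r["ID"], UNKNOWN)) for r in rows]


def age_in_days(row, fmt="%d/%m/%Y"):
    """Age on the result date, or None when the DoB is unknown."""
    if row["DoB"] == UNKNOWN:
        return None  # an unknown input gives an unknown result, not a guess
    born = datetime.strptime(row["DoB"], fmt)
    competed = datetime.strptime(row["Date"], fmt)
    return (competed - born).days


joined = left_join_dob(results, dob_by_id)
for row in joined:
    print(row, age_in_days(row))
```

Dropping athlete 607 because the DoB is missing would silently remove a result from any analysis that does not need the DoB, which is why the unmatched row is kept.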
  13. 13. Preparation: add history <ul><li>Adding athlete history to rows e.g. previous times, … </li></ul><ul><li>Process sorted CSV files … </li></ul><ul><ul><li>Export table/query as CSV </li></ul></ul><ul><ul><li>Process text file </li></ul></ul><ul><ul><li>Load into analysis tool(s) </li></ul></ul><ul><li>Example program in Python </li></ul><ul><ul><li>Open source programming language </li></ul></ul>
  14. 14. Preparation: history (cont.) <ul><li>Setup for processing … part 1 </li></ul><ul><li>import csv </li></ul><ul><li>n, fields = 2, ['time', 'pos'] </li></ul><ul><li>fd = open(src, 'r', newline='') </li></ul><ul><li>reader = csv.reader(fd) </li></ul><ul><li>fz = open(dst, 'w', newline='') </li></ul><ul><li>writer = csv.writer(fz) </li></ul><ul><li>hdr = next(reader) </li></ul><ul><li>hidx = [hdr.index(x) for x in fields] </li></ul><ul><li>idxid = hdr.index('name') </li></ul><ul><li>hdrx = [h+str(i+2) for i in range(n) for h in fields] </li></ul><ul><li>writer.writerow(hdr+hdrx) </li></ul>
  15. 15. Preparation: history (cont.) <ul><li>Adding athlete history to rows … part 2 </li></ul><ul><li>hnx = ['?' for x in hdrx] </li></ul><ul><li>previd = None </li></ul><ul><li>for r in reader: </li></ul><ul><ul><li>if len(r) < len(hdr): </li></ul></ul><ul><ul><ul><li>r += ['?' for x in range(len(r), len(hdr))] </li></ul></ul></ul><ul><ul><li>if previd != r[idxid]: </li></ul></ul><ul><ul><ul><li>previd = r[idxid] </li></ul></ul></ul><ul><ul><ul><li>hist = hnx </li></ul></ul></ul><ul><ul><li>writer.writerow(r+hist) </li></ul></ul><ul><ul><li>hist = [r[i] for i in hidx]+hist[0:-len(hidx)] </li></ul></ul><ul><li>fd.close(); fz.close() </li></ul>
  16. 16. Preparation: history (cont.) <ul><li>The resulting CSV file … </li></ul>
       ID  | Date      | Time | pos | DoB       | Time2 | pos2 | Time3 | pos3
       435 | 3/5/2008  | 10.3 | 3   | 24/2/1987 | ?     | ?    | ?     | ?
       512 | 21/3/2008 | 10.1 | 4   | 17/5/1989 | ?     | ?    | ?     | ?
       512 | 3/5/2008  | 10.2 | 2   | 17/5/1989 | 10.1  | 4    | ?     | ?
       607 | 3/5/2008  | 11.6 | 7   | ?         | ?     | ?    | ?     | ?
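To try the history step without files on disk, the same logic can be wrapped in a function that takes open file objects. This is a sketch, not the presenters' program: the function name `add_history` and its parameters are assumptions, the ID column is taken to be `ID` as in the table on this slide, and the input is assumed to be sorted by ID and then oldest result first.

```python
import csv
import io


def add_history(src, dst, id_field, fields, n=2, unknown="?"):
    """Copy CSV rows from src to dst, appending each athlete's n previous results.

    Rows must already be sorted by athlete ID, then oldest first.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    hdr = next(reader)
    hidx = [hdr.index(f) for f in fields]
    idxid = hdr.index(id_field)
    # Header order matches the history list: Time2, pos2, Time3, pos3, ...
    hdrx = [f + str(i + 2) for i in range(n) for f in fields]
    writer.writerow(hdr + hdrx)
    blank = [unknown] * len(hdrx)
    previd, hist = None, blank
    for r in reader:
        r += [unknown] * (len(hdr) - len(r))  # pad short rows
        if r[idxid] != previd:  # new athlete: start with an empty history
            previd, hist = r[idxid], blank
        writer.writerow(r + hist)
        # Put this result at the front of the history and drop the oldest.
        hist = [r[i] for i in hidx] + hist[:-len(hidx)]


source = io.StringIO(
    "ID,Date,Time,pos,DoB\n"
    "435,3/5/2008,10.3,3,24/2/1987\n"
    "512,21/3/2008,10.1,4,17/5/1989\n"
    "512,3/5/2008,10.2,2,17/5/1989\n"
    "607,3/5/2008,11.6,7,?\n"
)
output = io.StringIO()
add_history(source, output, id_field="ID", fields=["Time", "pos"])
print(output.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `newline=''`, as the `csv` module documentation recommends.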
  17. 17. Over to Chris
