When Too Much Data is Barely Enough
 

A presentation for a workshop at IACSS09

    Presentation Transcript

    • When Too Much Data is Barely Enough (Bob Buckley, Chris Barnes)
    • When Too Much Data is Barely Enough
      • Most sports face an explosion of training and competition data: detailed logs of performance, training and injuries, together with medical, physiological and anthropometric data. This workshop discusses the challenges and opportunities that come with this plethora of athlete and team data, and offers practical suggestions on how to get “bang for bucks”.
    • Workshop – Part #1: Overview
      • Data collection: source, formats
      • Aggregating data
        • Normalise data
        • Standardise values
      • Preparing for analysis
        • Linking data
        • Example: adding history
      • Part #2: Chris will discuss analysis
    • Data sources
      • Performance data
        • Match/competition results
        • Match measures, split times, etc.
        • Video & instruments (e.g. GPS)
      • Injury data/logs & medical records
      • Training & recovery logs
        • RPE and instruments
      • Sports science testing
        • Physio, psych, anthro, biomech, …
    • Data source formats
      • MS Excel - just say “No!”
        • If you must …
        • Row 1 only has column headings
        • Single row
        • Consistent data in columns
        • Indicate missing data clearly
        • Comments column … after last data
      • Database tables
      • Text file versions of database tables
        • CSV or TAB-separated text files
      • Event-based source data files … e.g. Olympics and World Champs
      • Athlete-oriented transformation …
      • Normalise data, reorder columns & sort
      Simple example
      AAA Comp 100m 12/5/2009
        pos  name  time
        1    XXX   9.85
        2    YYY   10.02
        3    ZZZ   10.15
        4    …     …
      Result history: 100m, Name: YYY
        date        time   pos  comp
        12/5/2009   10.02  2    AAA
        3/2/2009    10.45  3    AUS 09
        21/11/2008  10.44  2    Asia 08
        …           …      …    …
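      The event-based to athlete-oriented transformation can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the `events` structure and the `athlete_history` function are assumed names, only athlete YYY's rows are known for the two earlier competitions, and dates are written as yyyy-mm-dd so that they sort correctly as text.

```python
from collections import defaultdict

# Illustrative event-based data (assumed layout); values follow the slide's example.
events = [
    {"comp": "Asia 08", "date": "2008-11-21", "rows": [(2, "YYY", 10.44)]},
    {"comp": "AUS 09", "date": "2009-02-03", "rows": [(3, "YYY", 10.45)]},
    {"comp": "AAA", "date": "2009-05-12",
     "rows": [(1, "XXX", 9.85), (2, "YYY", 10.02), (3, "ZZZ", 10.15)]},
]

def athlete_history(events):
    """Return {name: [(date, time, pos, comp), ...]}, newest result first."""
    history = defaultdict(list)
    for event in events:
        for pos, name, time in event["rows"]:
            # Reorder columns and carry the event's identifying info on each row.
            history[name].append((event["date"], time, pos, event["comp"]))
    for rows in history.values():
        rows.sort(reverse=True)  # tuples compare on the date string first
    return dict(history)

history = athlete_history(events)
for row in history["YYY"]:
    print(*row)
```

      Each athlete's list then corresponds to one "Result history" table like the one shown for YYY.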
    • Data Normalisation
      • Puts data into “tables”
      • Add identifying info to detail rows…
      Source sheet: AAA Comp 100m 12/5/2009
        pos  name  time
        1    XXX   9.85
        2    YYY   10.02
        3    ZZZ   10.15
        4    …     …
      Detail rows:
        pos  name  time   Event ID
        1    XXX   9.85   2009475
        2    YYY   10.02  2009475
        3    ZZZ   10.15  2009475
        4    …     …      2009475
      Event table:
        Event ID  Comp           Event type    date
        2009475   AAA Comp 2009  M 100m final  12/5/2009
        2009476   AAA Comp 2009  W 100m final  12/5/2009
        …         …              …             …
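      A minimal sketch of this normalisation step in Python, assuming a result sheet is held as a dictionary. The `sheet` layout and the `normalise` function are hypothetical names chosen for illustration; the Event ID is taken from the example above and passed in rather than generated.

```python
# One event-based result sheet (assumed layout), using the example values.
sheet = {
    "comp": "AAA Comp 2009",
    "event_type": "M 100m final",
    "date": "12/5/2009",
    "rows": [(1, "XXX", 9.85), (2, "YYY", 10.02), (3, "ZZZ", 10.15)],
}

def normalise(sheet, event_id):
    """Split one sheet into an event-table row and ID-tagged detail rows."""
    event_row = (event_id, sheet["comp"], sheet["event_type"], sheet["date"])
    # Append the identifying Event ID to every detail row.
    detail_rows = [row + (event_id,) for row in sheet["rows"]]
    return event_row, detail_rows

event_row, detail_rows = normalise(sheet, 2009475)
print(event_row)
for row in detail_rows:
    print(row)
```

      Keeping the heading information in its own table means it is stored once, while every detail row can still be traced back to its event.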
    • Tips for Names
      • Use ID or standard name
      • Perhaps use a table to map non-standard to ID (or standard names)
      • Family name can be first or last …
      • Names use Capitals inconsistently e.g. Debeer, De Beer and de Beer
      • Sorting: McXXX = MacXXX
      Non-standard          Standard
      Nova Perris           Nova Perris
      Perris, Nova          Nova Perris
      Nova Perris-Kneebone  Nova Perris
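      One way to apply such a mapping table is a dictionary lookup on a cleaned-up key. The sketch below is an assumption about how this might be done, not the presenters' method: `NAME_MAP` and `standard_name` are hypothetical names, and the key is normalised by collapsing whitespace and ignoring case so that inconsistent capitals still match.

```python
# Hypothetical lookup table: cleaned non-standard form -> standard name.
NAME_MAP = {
    "nova perris": "Nova Perris",
    "perris, nova": "Nova Perris",
    "nova perris-kneebone": "Nova Perris",
}

def standard_name(raw):
    """Return the standard name for raw, or None if it is not in the table."""
    key = " ".join(raw.split()).casefold()  # collapse spaces, ignore case
    return NAME_MAP.get(key)

print(standard_name("  PERRIS,  Nova "))
print(standard_name("Nova Perris-Kneebone"))
print(standard_name("Someone Else"))
```

      Returning `None` for names missing from the table makes unmapped entries easy to find and add, rather than letting them pass through silently.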
    • Simple data issues
      • Data consistency, e.g. format
      • Consistent units, e.g. metres vs km
      • Dates: dd/mm/yyyy vs mm/dd/yy and a plethora of date/time formats
      • Times: e.g. seconds vs mm:ss.ss
      • Names: First Last vs Last, First or maiden/married, formal/informal
    • Tips for dates & times
      • yyyy/mm/dd sorts as text too
      • Transform to common format that preserves accuracy
      • … and where date/time subtraction gives numeric result
      • Maybe use hh:mm:ss.ss
        • Avoids confusing hh:mm with mm:ss
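      These tips can be illustrated with Python's standard `datetime` module. The helper names below (`to_iso_date`, `to_seconds`, `days_between`) are hypothetical, the input format is assumed to be dd/mm/yyyy, and a two-part time is assumed to mean mm:ss, which is exactly the ambiguity a full hh:mm:ss.ss format avoids.

```python
from datetime import datetime

def to_iso_date(text, fmt="%d/%m/%Y"):
    """Convert a date string to yyyy-mm-dd, which also sorts correctly as text."""
    return datetime.strptime(text, fmt).date().isoformat()

def to_seconds(text):
    """Convert 'ss.ss', 'mm:ss.ss' or 'hh:mm:ss.ss' to a number of seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

def days_between(first, second, fmt="%d/%m/%Y"):
    """Date subtraction that gives a numeric result (whole days)."""
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return delta.days

print(to_iso_date("12/5/2009"))
print(to_seconds("1:02.5"))
print(days_between("3/2/2009", "12/5/2009"))
```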
    • Overview of Analysis
      • Problem selection … choose questions
      • Collect the required data
      • Prepare data for analysis
      • Analyse data
      • Review results … which messages matter
      • Present results … deliver your message(s)
      • Check message delivery is working
    • Link data for analysis
      • Flat data … analyse data in rows
      • Join data tables … database term
      • Handle unknown values properly
        • Join with unknown
        • Calculations with unknown
      Results:
        ID   Date       Time  pos
        435  3/5/2008   10.3  3
        512  3/5/2008   10.2  2
        512  21/3/2008  10.1  4
        607  3/5/2008   11.6  7
      Dates of birth:
        ID   DoB
        435  24/2/1987
        512  17/5/1989
      Joined (unknown DoB shown as ?):
        ID   Date       Time  pos  DoB
        435  3/5/2008   10.3  3    24/2/1987
        512  3/5/2008   10.2  2    17/5/1989
        512  21/3/2008  10.1  4    17/5/1989
        607  3/5/2008   11.6  7    ?
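      The join above keeps every result row even when no date of birth is known (a left join, in database terms). A minimal Python sketch, assuming the tables are held as a list of tuples and a dictionary; `left_join`, `age_days` and the use of `None` as the unknown marker are illustrative choices, not taken from the presentation.

```python
from datetime import datetime

results = [
    (435, "3/5/2008", 10.3, 3),
    (512, "3/5/2008", 10.2, 2),
    (512, "21/3/2008", 10.1, 4),
    (607, "3/5/2008", 11.6, 7),
]
dob = {435: "24/2/1987", 512: "17/5/1989"}

UNKNOWN = None  # explicit marker for a missing value

def left_join(results, dob):
    """Append DoB to each result row; rows without a match get UNKNOWN."""
    return [row + (dob.get(row[0], UNKNOWN),) for row in results]

def age_days(row, fmt="%d/%m/%Y"):
    """Age in days on the competition date; stays UNKNOWN if DoB is unknown."""
    if row[4] is UNKNOWN:
        return UNKNOWN
    return (datetime.strptime(row[1], fmt) - datetime.strptime(row[4], fmt)).days

joined = left_join(results, dob)
for row in joined:
    print(row, age_days(row))
```

      Propagating the unknown marker through calculations, instead of substituting zero or dropping the row, keeps athlete 607's result available for analyses that do not need the date of birth.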
    • Preparation: add history
      • Adding athlete history to rows e.g. previous times, …
      • Process sorted CSV files …
        • Export table/query as CSV
        • Process text file
        • Load into analysis tool(s)
      • Example program in Python
        • Open source programming language
    • Preparation: history (cont.)
      • Setup for processing … part 1
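      • import csv
      • n, fields = 2, ['time', 'pos']  # keep the n previous values of these columns
      • src, dst = 'results.csv', 'results_hist.csv'  # example file names (assumed); input sorted by athlete, then date
      • fd = open(src, 'r', newline='')
      • reader = csv.reader(fd)
      • fz = open(dst, 'w', newline='')
      • writer = csv.writer(fz)
      • hdr = next(reader)  # header row
      • hidx = [hdr.index(x) for x in fields]  # positions of the history columns
      • idxid = hdr.index('name')  # athlete identifier column; must match the CSV header
      • hdrx = [h+str(i+2) for i in range(n) for h in fields]  # time2, pos2, time3, pos3
      • writer.writerow(hdr+hdrx)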
    • Preparation: history (cont.)
      • Adding athlete history to rows … part 2
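      • hnx = ['?' for x in hdrx]  # empty history: all values unknown
      • previd, hist = None, hnx
      • for r in reader:
      •     if len(r)<len(hdr):  # pad short rows with unknowns
      •         r += ['?' for x in range(len(r),len(hdr))]
      •     if previd!=r[idxid]:  # new athlete: reset the history
      •         previd = r[idxid]
      •         hist = hnx
      •     writer.writerow(r+hist)
      •     hist = [r[i] for i in hidx]+hist[0:-len(hidx)]  # newest first, drop the oldest
      • fd.close(); fz.close()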
    • Preparation: history (cont.)
      • The resulting CSV file …
      ID   Date       Time  pos  DoB        Time2  pos2  Time3  pos3
      435  3/5/2008   10.3  3    24/2/1987  ?      ?     ?      ?
      512  21/3/2008  10.1  4    17/5/1989  ?      ?     ?      ?
      512  3/5/2008   10.2  2    17/5/1989  10.1   4     ?      ?
      607  3/5/2008   11.6  7    ?          ?      ?     ?      ?
    • Over to Chris