Bertenthal
    Bertenthal Presentation Transcript

    • Social Informatics Data Grid: Cyberinfrastructure for Collaborative Research in the Neural, Social and Behavioral Sciences. Bennett I. Bertenthal, Indiana University, bbertent@indiana.edu
    • Infrastructure for Social and Behavioral Sciences
        • Goal: Compare, measure and search for patterns in structured, semi-structured, and heterogeneous data sets.
        • Challenge: Integrate information over time, place, and types of data.
        • Needs:
            (1) Data interface (shared datasets & databases)
            (2) Service interface (shared tools for analysis)
            (3) Intellectual interface (shared problems & theories)
    • Primary Objectives
        • Develop prototype of core facility for collecting multiple measures of time-synchronized data
        • Develop integrated tools for storage, retrieval, annotation, and analyses of multiple data sets at different time scales
        • Develop scripts for parallelizing code to run on grid clusters
    • What is SIDGrid?
    • Social Informatics Data Grid
        • A general purpose architecture for streaming data applications (e.g., video, audio, time series)
        • Built on well established database, multimedia, web and grid services standards
        • Time alignment in distributed heterogeneous datasets
            – Software and hardware based
            – Integrated with existing laboratory time stamping and registration techniques
        • Scalable
            – Number of datasets
            – Types of data
            – Multiple end user applications
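A minimal sketch of what software-based time alignment can look like, assuming each stream carries its own timestamps plus a known clock offset to a shared clock. The function name, the offset parameter, and the sample values are illustrative assumptions, not SIDGrid's actual implementation; the stream is assumed to be non-empty and sorted by time.

```python
from bisect import bisect_left


def align_to_timeline(timestamps, values, offset, timeline):
    """Linearly interpolate one stream onto a shared timeline.

    timestamps: sorted sample times in the stream's own clock (seconds)
    values:     one numeric sample per timestamp
    offset:     seconds to add so the stream's clock matches the shared clock
    timeline:   shared-clock times at which samples are wanted
    Points outside the stream's coverage are returned as None.
    """
    shifted = [t + offset for t in timestamps]
    aligned = []
    for t in timeline:
        if t < shifted[0] or t > shifted[-1]:
            aligned.append(None)  # no data recorded at this time
            continue
        i = bisect_left(shifted, t)
        if shifted[i] == t:
            aligned.append(values[i])  # exact hit, no interpolation needed
            continue
        t0, t1 = shifted[i - 1], shifted[i]
        weight = (t - t0) / (t1 - t0)
        aligned.append(values[i - 1] + weight * (values[i] - values[i - 1]))
    return aligned


# Hypothetical 1 Hz sensor whose recording started 0.5 s after the shared clock.
sensor_times = [0.0, 1.0, 2.0]
sensor_values = [60.0, 62.0, 64.0]
shared_timeline = [0.0, 0.5, 1.0, 1.5, 2.0]
aligned = align_to_timeline(sensor_times, sensor_values, 0.5, shared_timeline)
print(aligned)  # [None, 60.0, 61.0, 62.0, 63.0]
```

Returning None outside the covered interval, rather than extrapolating, keeps gaps visible when streams of different lengths are compared.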
    • Server / Client
    • Client Side
    • Client Side
        • Leveraging efforts for annotation and analysis of multimodal data
            – Familiarity and interoperability
                • ELAN (Max Planck Institute for Psycholinguistics, The Netherlands)
                • TalkBank (Carnegie Mellon University, US)
                • Digital Replay System (Nottingham University, UK)
            – XML, Java
            – Cross platform interoperability
        • Adding SIDGrid functionality to ELAN
            – Minimally intrusive
                • Avoid complicated co-development with the ELAN team
            – Browsing SIDGrid data
            – Additional data types
            – Upload / Download to SIDGrid server
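Because ELAN stores annotations as XML (.eaf files), other tools can read them with a standard XML parser. The sketch below uses Python's standard library and the core EAF elements (TIME_SLOT, TIER, ALIGNABLE_ANNOTATION, ANNOTATION_VALUE). The sample document and the helper name read_tiers are made up for illustration, reference annotations are ignored, and none of this is SIDGrid's own client code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EAF document used only for this example.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="4000"/>
  </TIME_ORDER>
  <TIER TIER_ID="gesture">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>point</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>reach</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="speech">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>look at that</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tiers(eaf_text):
    """Return {tier_id: [(start_ms, end_ms, value), ...]} for time-aligned annotations."""
    root = ET.fromstring(eaf_text)
    # Map each time slot id to its time in milliseconds (slots without a value are skipped).
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            value = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((start, end, value))
        tiers[tier.get("TIER_ID")] = rows
    return tiers


tiers = read_tiers(SAMPLE_EAF)
for tier_id, rows in tiers.items():
    print(tier_id, rows)
```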
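    • Storage size and media file counts per entry:
        • 66 GB: 5 mov, 2 wav, …
        • 368 GB: 23 mov, 6 wav, …
        • 5 GB: 1 mov, 0 wav, …
        • 21 GB: 3 mov, 12 wav, …
        • 4 GB: 9 mov, 1 wav, …
        • 4 GB: 4 mov, 1 wav, …
        • 1 GB: 0 mov, 2 wav, …
        • 945 GB: 1 mov, 66 wav, …
        • 8 GB: 3 mov, 0 wav, …
        • 20 GB: 13 mov, 2 wav, …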
    • Server Side
    • [Table with columns .mov, .wav, .eaf, and GB; the cell values were fused during extraction and cannot be reliably separated]
    • Search and Query (4,000 projects)
        • Data Files
            – Names
            – Keywords
            – Attributes (keyword-value)
            – Date
            – Type (Elan, Chat)
        • Contents of Files
            – Metadata
            – Tier
            – Annotations
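The file-level search criteria listed above (name, keyword, attribute, date, type) can be pictured as filters over a catalog of file records. The sketch below is an in-memory illustration only: the DataFile record, the search function, and the three catalog entries are hypothetical and do not reflect SIDGrid's query service or schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataFile:
    name: str
    file_type: str  # e.g. "Elan" or "Chat"
    created: date
    keywords: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)  # keyword-value pairs


def search(files, name_contains=None, keyword=None, attribute=None,
           file_type=None, created_after=None):
    """Return the files matching every criterion that was supplied."""
    results = []
    for f in files:
        if name_contains is not None and name_contains.lower() not in f.name.lower():
            continue
        if keyword is not None and keyword not in f.keywords:
            continue
        if attribute is not None and f.attributes.get(attribute[0]) != attribute[1]:
            continue
        if file_type is not None and f.file_type != file_type:
            continue
        if created_after is not None and f.created <= created_after:
            continue
        results.append(f)
    return results


# Hypothetical catalog entries.
catalog = [
    DataFile("dyad01_session1.eaf", "Elan", date(2008, 3, 1),
             {"gesture", "infant"}, {"lab": "IU"}),
    DataFile("dyad01_session1.cha", "Chat", date(2008, 3, 2),
             {"speech"}, {"lab": "IU"}),
    DataFile("pilot_gaze.eaf", "Elan", date(2007, 11, 15),
             {"gaze"}, {"lab": "UofC"}),
]

elan_from_iu = search(catalog, file_type="Elan", attribute=("lab", "IU"))
print([f.name for f in elan_from_iu])  # ['dyad01_session1.eaf']
```

Criteria combine with a logical AND, so adding a filter can only narrow the result set.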
    • Server Side
        • Web services
            – Query
            – Data download / upload
        • Portal interface
            – Security
            – Data and metadata browsing
            – Preview
            – Tags, attributes
            – Projects
            – Groups
            – Search
            – Data transformation using grid resources
    • Science Gateway
    • What Is The TeraGrid? (circa 2006)
        • 75 Teraflops (trillion calculations per second) = 12,500 times faster than all 6 billion humans on earth each doing one calculation per second
        • 16 Supercomputers - 9 different types, multiple sizes
        • World’s fastest network: 30 Gigabits per second to large sites = 20-30 times major university connections = 30,000 times my home broadband = 1 full length feature film per second
        • Globus Toolkit and other middleware providing single login, application management, data movement, web services
        • Sites: LA, Starlight, Atlanta, SDSC, TACC, NCSA, PU, IU, PSC, ORNL, ANL
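The TeraGrid comparisons are simple ratios and can be checked directly. The snippet below redoes the arithmetic from the slide's own figures; the implied home broadband speed, campus link speed, and film size are derived values, not numbers stated in the presentation.

```python
# Compute: 75 teraflops versus 6 billion people doing one calculation per second.
teragrid_flops = 75 * 10**12
human_flops = 6 * 10**9
speedup = teragrid_flops // human_flops
print(speedup)  # 12500

# Network: 30 gigabits per second to large sites.
network_bps = 30 * 10**9
home_broadband_bps = network_bps // 30_000  # implied home link speed
campus_low_bps = network_bps // 30          # implied campus link, lower bound
campus_high_bps = network_bps // 20         # implied campus link, upper bound
film_bytes = network_bps // 8               # data moved in one second

print(home_broadband_bps)               # 1000000 (1 Mbit/s)
print(campus_low_bps, campus_high_bps)  # 1000000000 1500000000 (1 to 1.5 Gbit/s)
print(film_bytes)                       # 3750000000 (3.75 GB per "film")
```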
    • Scripts for Running Jobs on Grid
        • Matlab (high-level language and interactive environment for performing computationally intensive tasks)
        • R (software environment for statistical computing and graphics)
        • Praat (software for acoustic analysis)
        • FreeSurfer (automated tools for reconstruction of the brain’s cortical surface from structural MRI data)
        • AFNI (programs for processing, analyzing, and displaying fMRI data)
        • SUMA (adds cortical surface based functional imaging analysis to the AFNI suite of programs)
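The common pattern behind such scripts is fan-out: the same analysis is applied independently to many data sets, so the work can be spread across workers. The sketch below shows that pattern with a local thread pool standing in for grid nodes. The run_jobs helper, the RMS analysis, and the sample data are illustrative assumptions; actual grid submission through a scheduler or middleware is not shown.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def rms(samples):
    """Root-mean-square amplitude of one signal (stand-in for a real analysis)."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def run_jobs(datasets, analysis, max_workers=4):
    """Apply `analysis` to every data set independently and collect results by name.

    A local thread pool stands in for grid workers here; on a real grid each
    task would be submitted as a separate job instead.
    """
    names = list(datasets)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names.
        results = list(pool.map(analysis, (datasets[n] for n in names)))
    return dict(zip(names, results))


# Hypothetical signals keyed by file name.
signals = {
    "session1.wav": [1.0, -1.0, 1.0, -1.0],
    "session2.wav": [0.0, 0.0, 0.0],
    "session3.wav": [3.0, 4.0],
}
job_results = run_jobs(signals, rms)
for name, value in job_results.items():
    print(name, round(value, 3))
```

Because each task depends only on its own input, the same structure scales from a laptop to a cluster by swapping the executor for a job submission layer.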
    • Advantages of Grid Computing
        • Vastly expanded computing and storage
        • Reduced effort as needs scale up
        • Improved resource utilization; lower costs
        • Facilities and models for collaboration
        • Sharing of tools, data, procedures, and protocols
        • Recording, assessment and reuse of complex tasks
    • Lessons Learned
        • Fast prototyping vs production quality software
            – After one year of development, no product available for user feedback
            – Optimal design vs practical design
        • Public vs private website
            – Need for dissemination
            – Need for security and protection of user groups and data
        • Tools for diverse user groups with varying degrees of technical expertise
            – Non-intuitive interface with minimal user support
                • Importance of user manuals, technical support, and FAQs
        • Multiple levels of privacy and confidentiality dictated by type of data and informed consent
    • If you build it, will they come?
        • Dissemination of SIDGrid
            – Website and movie
            – Invited workshops at UofC and IU
            – Pre-conference workshops
        • Start-up is time consuming
            – Scale of most projects conducted by social scientists does not justify time to learn web services and tools
            – Added value for larger, collaborative projects requires shift in goals and organization of research
        • Resistance to data sharing
            – Original proposal required that all data stored on SIDGrid servers would be publicly available
    • Objections to Data Sharing
        • It’s my data!
        • Protection of confidentiality and anonymity
        • Need to first establish standards for coding and analysis
        • Reporting of misleading and confusing findings
        • Raw data but not coded data should be shared
            – Annotation and coding are very time consuming, so coded data should not become available to others
        • If availability of web and software tools were contingent on sharing data, most users would opt out
    • Questions