
Big Data in a neurophysiology research lab… what?


Big Data in a neurophysiology research lab… what? by Max Novelli

At RNEL, we have been working hard to lay the foundation for better data management and for integrating big data and AI technologies into our data management and analysis pipelines. These needs arise from the sheer size of our experimental data, which is becoming unmanageable even on powerful workstations. We also determined that better query methodologies and better validation and visualization tools are needed.

Our long-term goal is to answer the following question: will we ever be able to go from raw experimental data to querying curated data with a simple SQL-like language, without spending humongous resources and manpower, while using a process that is organic, intuitive and flexible? Can we also leverage modern big data technologies and data science to achieve our goal?

This presentation is the story of an inter-disciplinary journey that started approximately 5 years ago. The journey enabled us to build a deeper knowledge of our data, a better system of management methodologies, as well as tools that allow us to query and aggregate across various datasets and easily improve such functionalities.

In this presentation, we will provide a general background of the work that we do in our lab. First, we will provide some examples of experiments that we conduct, as context for explaining the data that are acquired and the challenges that come with them. Next, we will outline some of the questions that researchers asked (and keep asking) when they attempt to work with large data structures to answer their own scientific questions without being bogged down by the technologies used and the original format of the data. Finally, we will raise some questions related to data management, which will help to improve validation and reduce the manpower necessary to curate the data. Starting from the big picture, we will walk through the decisions and requirements that came out of our brainstorming sessions and show how far along we currently are in our journey and the path we took to get here. We will conclude by highlighting some of the amazing results that we were able to achieve, such as activation maps and central nervous system stimulation counts.

Published in: Technology

  1. 1. Big Data in a neurophysiology research lab… what? by Max Novelli Rehabilitation Neural Engineering Lab, University of Pittsburgh, J on The Beach, Malaga, Spain 2018/05/24
  2. 2. Myself 1990: Technical diploma, Vimercate, Italy; 1998: Graduate degree (Laurea), Milano, Italy; 2000: Started playing with Linux and Open Source, started playing with web technologies; 2001: Started practicing yoga; 2002: Consulting: system administration; 2004: UCIS - system administrator; 2006: BIRC - system administrator, data manager; 2008: LRDC - system administrator, data manager; 2012: RNEL - software engineer, system administrator for multiple neurocognitive research labs, data manager, data architect; 2013: 200hrs yoga teaching certification; 2016: Big data technologies in research lab; 2018: RNEL - Head of Data and Informatics
  3. 3. 3 RNEL Rehabilitation Neural Engineering Lab Restoring sensory and motor functions after nervous system injury and limb loss Neurophysiology Research
  4. 4. 4 Sensory Functions Motor functions Able-bodied individuals Central and peripheral nervous system Limbs Position Force Pressure Texture Temperature Shape ... health!!!
  5. 5. 5 Sensory Functions Brain – Computer interface Motor functions Nervous system injury or limb loss Central and peripheral nervous system Prosthetic limb ...otherwise!!!
  6. 6. 6 Neural Activity Experiments Sensory Signals Data & Metadata Experimental system Kinematics Videos and Images Raw neural activity Intramuscular signals Nerve activity Control signals Stimulation patterns Events Control levels Joint positions Forces and torques Stimulation Patterns Control Signals
  7. 7. 7 Data vs Metadata Data Metadata Quantity measured Units Sampling frequency Type of sensor Operator Notes Issues Sensor serial number Range settings Sensor location
  8. 8. 8 First try ● Finding the information that you are looking for ● Access to proprietary formats ● Data and information sprawling and duplication Use raw data files
  9. 9. 9 Second try Matlab “database” (MDB) based on HDS Toolbox Matlab-based hierarchical file repository of objects accessible through dedicated functions, offering lazy loading of data that is completely transparent to the user. Each object aggregates data and the related metadata together. Subject Experiment System 2 Trial 1 System 1 Trial 2 Trial n Data type 1 Data type 2 Data type 1 Data type 2 Data type 1 Data type 2
  10. 10. 10 Second try chapter 2 ● Enhance user experience ● Interactive exploration ● Step away from raw data formats ● Structure our data in the most logical way for us ● Explore our data and associated metadata easily ● Did not scale well when data started to grow in size ● Data and functionalities bundled together ● Corruption ● Code base maintenance and upgrades ● Flexibility in data and metadata properties ● Queries: for loops MDB (HDS-toolbox implementation in Matlab)
  11. 11. 11 How to move forward? Big data approach Brainstorm session Wish List: ● Queries (Database) ● Flexible hierarchy ● Flexible data and metadata ● Minimal coding ● Direct access to data ● Cross-platform ● Data – Code Separation Maintain MDB features: ● Hierarchical structure ● Relationships between objects ● Data lazy loading ● Pairing data – metadata Decision: Start from scratch and build a new tool
  12. 12. 12 Big Data
  13. 13. 13 Big Data
  14. 14. 14 Big Data: Volume 50 TeraBytes… and counting Millions of files Hundreds of subjects Thousands of data recordings Volume Team of 2 people, with multiple responsibilities
  15. 15. 15 Big Data: Variety Structured data Still images and videos Time series: neural, kinematics, ... Variety ● type of data ● format they are saved in ● Information collected jpg png mov mpg tdms plx pl2 tdt nev mat cfg mat txt json yml
  16. 16. 16 Big Data: Velocity Within experiment: continuous / stream Within lab: bursts / batch Velocity Constant stream of messages containing: data, control signals and events (similar to IoT) Dragonfly messaging system Data is transferred to the central server after each experiment and analyzed Time (days) Size Activity
  17. 17. 17 Big Data: Veracity In RNEL terms: Data Validation
  18. 18. 18 Big Data: Veracity In RNEL terms: Data Validation What do the labels mean? Which is the unit of measurement?
  19. 19. 19 Big Data: Veracity In RNEL terms: Data Validation Which sensor was used to collect this signal?
  20. 20. 20 Big Data: Veracity In RNEL terms: Data Validation What is the difference between the signals in columns 1 and 2?
  21. 21. 21 Big Data: Veracity In RNEL terms: Data Validation Did we drop any data point?
  22. 22. 22 Big Data: Veracity In RNEL terms: Data Validation Important for experimental reproducibility and replicability, Consistency in user experience, and optimal prosthetic control
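One possible way to approach the "did we drop any data point?" check is to look for gaps in a counter that should increase by one per message. The sketch below assumes every message carries an integer sequence number (a `sequenceno` field appears in the stimulation event example on slide 36). The function name `find_sequence_gaps` and the sample numbers are hypothetical and not part of MDF.

```python
from typing import Iterable, List, Tuple


def find_sequence_gaps(sequence_numbers: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (first_missing, last_missing) ranges between consecutive numbers.

    Assumption: numbers are expected to increase by exactly 1 per message.
    """
    gaps = []
    previous = None
    for current in sorted(sequence_numbers):
        if previous is not None and current - previous > 1:
            gaps.append((previous + 1, current - 1))
        previous = current
    return gaps


if __name__ == "__main__":
    # Made-up stream in which messages 6769 and 6770 never arrived.
    print(find_sequence_gaps([6766, 6767, 6768, 6771, 6772]))
```

An empty result only shows that no number is missing inside the observed range; messages lost at the very start or end of a recording would need a separate check.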
  23. 23. 23 RNEL addition Continuous Curation Researchers, Data managers, Data curators, Others Manual, Automated Multiple sources Platform independent
  24. 24. 24 RNEL Big Data 4 V 2 C Volume Variety Velocity Validation (Veracity) Continuous Curation
  25. 25. 25 Data management Multipurpose Data Framework: MDF A framework to organize and manage data, including metadata, designed to provide consistent, normalized and easy data access Design Requirements: ● Solid unique id (uuid) ● Platform independence (Matlab, Python, ...) ● Light-weight ● Direct access to underlying data ● Query functionality ● Lazy loading for data and objects ● Dynamic metadata and data properties ● Metadata in database, data in .mat files
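Several of these requirements (a unique id, metadata paired with data, lazy loading) can be illustrated with a small sketch. This is not the MDF implementation: the class `LazyObject`, its fields and the `loader` callback are hypothetical names chosen for illustration, and the bulk data is assumed to come from any callable, such as a file reader.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional
from uuid import uuid4


@dataclass
class LazyObject:
    """Hypothetical stand-in for an MDF-style object (not the MDF API)."""

    mdf_type: str
    metadata: Dict[str, Any]
    loader: Callable[[], Any]  # fetches the bulk data, e.g. from a file
    uuid: str = field(default_factory=lambda: str(uuid4()))
    _data: Optional[Any] = field(default=None, repr=False)
    _loaded: bool = field(default=False, repr=False)

    @property
    def data(self) -> Any:
        # The loader runs only on first access; later reads reuse the cache.
        if not self._loaded:
            self._data = self.loader()
            self._loaded = True
        return self._data


if __name__ == "__main__":
    trial = LazyObject("Trial", {"name": "Block_030"}, lambda: [0.1, 0.2, 0.3])
    print(trial.metadata["name"])  # metadata is usable before any data is read
    print(trial.data)              # first access triggers the loader
```

Keeping the metadata small and always available while deferring the bulk data is what lets queries run on metadata alone.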
  26. 26. 26 First benefits We started querying our data
      >> tr = mdf.load('mdf_type','Trial','name', 'Block_30')
      tr =
              type : Trial
              uuid : 42d1a9f5-acbe-4265-b31f-2be327d34fde
              data : []
          metadata :
                      date : 07/25/2014
                  duration : 131.3131
                        id : 30
                      name : Block_030
                 startTime : 10:33:43
                   success : 1
                  hardware :
                              Omniplex : 30
                      PlexonStimulator : 30
                         OfflineSorter : 30
                              Platform : 30
                              ...
          children :
                 spikeData : [2 SpikeData]
                      ...
      >> sbj = mdf.load('mdf_type','Subject')
      sbj =
              type : Subject
              uuid : b2d6cc61-d550-4e82-8fa7-514cc3b10c2a
              data : []
          metadata :
                      name : Flahr
                    number : 40
                      ...
          children :
                       exp : [1 Experiment]
                      ...
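The idea behind a metadata query such as the one on this slide can be sketched in Python over plain dictionaries. This is a hypothetical illustration, not how `mdf.load` is implemented: the function `query` and the record layout are assumptions, and the sample values are taken from the slide.

```python
from typing import Any, Dict, List


def query(objects: List[Dict[str, Any]], mdf_type: str, **metadata: Any) -> List[Dict[str, Any]]:
    """Return objects of the given type whose metadata matches every keyword."""
    return [
        obj
        for obj in objects
        if obj["type"] == mdf_type
        and all(obj["metadata"].get(key) == value for key, value in metadata.items())
    ]


if __name__ == "__main__":
    # Hypothetical in-memory records using values shown on the slide.
    objects = [
        {"type": "Trial", "metadata": {"name": "Block_030", "id": 30}},
        {"type": "Subject", "metadata": {"name": "Flahr", "number": 40}},
    ]
    print(query(objects, "Trial", name="Block_030"))
    print(query(objects, "Subject"))
```

Because only metadata is inspected, a query like this never needs to touch the underlying data files.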
  27. 27. 27 ...more benefits We can visually validate our data Data before importing new data Data after importing new data
  28. 28. 28 Application Sensory Experiment
  29. 29. 29 Application Sensory Experiment Evoked sensation when stimulation is applied to a selected electrode
  30. 30. 30 Reasons to adopt MDF ●Efficient way to organize data ●Flexibility ●Continuous Curation ●Data reusability ●Metadata queries ●Separation between data accessing and data usage
  31. 31. 31 What’s next? Real Big Data… almost!!! More queries Bigger Big Data Millions of objects
  32. 32. 32 Going beyond... C5/C6 incomplete spinal injury Sensory functions 2 6x10 Utah microelectrode arrays Motor functions 2 10x10 Utah microelectrode arrays
  33. 33. 33 Experiment Velocity-based optimal linear estimator decoder Neural firing rates Velocity commands Sensor Stimulus Transformation Force feedback Intracortical Microstimulation Intracortical microstimulation as source of somatosensory feedback Current controlled, charge balanced, asymmetric pulses
  34. 34. 34 Stimulation stream Time 0 Stimulation Pattern Definition Stimulation Application Individual Stimulation Stim pattern 1 on Channel 10 Stim pattern 3 on Channel 50 Stim pattern 1 on Channel 23 The information about which stimulation pattern is delivered on which channel is known only by analyzing the sequence of events. Encoder Stimulator Safety
  35. 35. 35 Scientific question Is the performance of each electrode degrading with the amount of charge delivered over time? We need to count the total number of stimulation impulses delivered to each electrode in order to be able to compute the total charge delivered Time since implant: ~3 years Number of days of recording: >500 Total number of files: >150000 Total number of events: to be determined
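The step from pulse counts to delivered charge can be sketched as follows. The deck does not give the formula, so this is an assumption: the charge of one pulse is approximated as the current amplitude multiplied by the phase width, and the amplitudes and widths in the example are made up. The name `charge_per_channel_nc` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (channel, amplitude in microamps, phase width in microseconds)
Pulse = Tuple[int, float, float]


def charge_per_channel_nc(pulses: Iterable[Pulse]) -> Dict[int, float]:
    """Sum amplitude * phase width per channel, in nanocoulombs."""
    totals: Dict[int, float] = defaultdict(float)
    for channel, amplitude_ua, width_us in pulses:
        # 1 uA * 1 us = 1 pC, so divide by 1000 to express the result in nC.
        totals[channel] += amplitude_ua * width_us / 1000.0
    return dict(totals)


if __name__ == "__main__":
    # Made-up pulses: two on channel 10, one on channel 50.
    pulses = [(10, 60.0, 200.0), (10, 60.0, 200.0), (50, 20.0, 200.0)]
    print(charge_per_channel_nc(pulses))
```

If every pulse on a channel shares the same amplitude and width, this reduces to the pulse count times the charge of a single pulse, which is why counting stimulation events per electrode is the key step.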
  36. 36. 36 Solution MDF One object for each stimulation event Each stimulation event = one pulse on Channel x at time t Object example:
      "experiment" : "CL",
      "subject" : "CRS02b",
      "location" : "Home",
      "session" : "00231",
      "set" : "0009",
      "block" : "0001",
      "trial" : "0001",
      "rep" : "0080",
      "date" : "2016-11-21",
      "time" : "12:44:53",
      "raw_file" : "…/QL.Task_State0002.Set0009….bin",
      "name" : "STIM_SYNC_EVENT",
      "sequenceno" : 6767,
      "raw" : {
        "header" : {
          "msg_type" : 1808,
          …
        },
        "data" : {
          "source_index" : 0,
          "source_timestamp" : 4399.589533,
          …
        }
      }
      Queries: we can extract any group of stimulation events: ● by session, set, trial (SST) ● by file ● by day or hour ● by channel ● by amplitude ● by type of experiment ● by condition Major hurdle: importing data
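Grouping stimulation events by any combination of fields, as listed above, can be sketched with a counter over dictionaries. This is an illustration only, not the MDF or database query used in the lab. The function `count_events` is hypothetical, and the `channel` field is an assumption (it is not visible in the object example and would be attached during the assignment step).

```python
from collections import Counter
from typing import Any, Dict, Iterable, Sequence


def count_events(events: Iterable[Dict[str, Any]], keys: Sequence[str]) -> Counter:
    """Count events grouped by the values of the given fields."""
    return Counter(tuple(event[key] for key in keys) for event in events)


if __name__ == "__main__":
    # Made-up events; session/set/trial values follow the style of the slide.
    events = [
        {"session": "00231", "set": "0009", "trial": "0001", "channel": 10},
        {"session": "00231", "set": "0009", "trial": "0001", "channel": 10},
        {"session": "00231", "set": "0009", "trial": "0002", "channel": 50},
    ]
    print(count_events(events, ["channel"]))
    print(count_events(events, ["session", "set", "trial"]))
```

The same function covers the "by channel" and "by session, set, trial (SST)" groupings simply by changing the list of keys.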
  37. 37. 37 Numbers and Time Version 1 We extracted and imported into the db the minimal set of information, data and metadata needed to perform our task and answer the scientific question. Information filtering Number of objects Estimated: between 10 and 20 million Time Estimated completion time: 6 months
  38. 38. 38 Back to MDF ● Metadata in database, data in .mat files MDF Design Requirements: ● Metadata and data only in database MongoDB Analyzing logs from the first pass...
  39. 39. 39 Numbers and Time Version 2 We extracted and imported into the db the minimal set of information, data and metadata needed to perform our task and answer the scientific question. Information filtering Number of objects Estimated: between 10 and 20 million Total count: ~34 million Time Estimated completion time: 6 months After first round of optimization: 2 weeks
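To put these figures in perspective, the import throughput can be estimated with simple arithmetic. The pairing of counts and durations below is an assumption, since the slide does not state it explicitly: the 6-month estimate is applied to the upper bound of 20 million objects (taking 6 months as roughly 182 days), and the 2-week figure is applied to the full count of about 34 million.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def objects_per_second(n_objects: float, days: float) -> float:
    """Average import rate needed to store n_objects within the given days."""
    return n_objects / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Assumed pairing: original estimate vs. result after optimization.
    before = objects_per_second(20e6, 182)
    after = objects_per_second(34e6, 14)
    print(f"estimated rate before optimization: {before:.2f} objects/s")
    print(f"rate after first optimization:      {after:.2f} objects/s")
```

Under these assumptions the optimized import sustains a few tens of objects per second, compared with roughly one object per second in the original estimate.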
  40. 40. 40 Workflow Import: ● Stim events ● Configuration ql ql ● Stim events ● Configuration (Amplitude, Channel) ql_sst ● Session ● Set ● Trial ● # events SST listing Assignment Assign config to stim events ql_sch Complete stim events by channel Counting stim by channel Validation Visualization ql_count Stim event counts Different modalities ql_val Validation metrics Configuration, Events *.bin, *.mat
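The assignment step ("assign config to stim events") can be sketched as a single pass over the events in chronological order, carrying the most recent configuration forward. This is a simplified guess at the logic, not the lab's code: the event layout, the `kind` field and the function `assign_config` are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional


def assign_config(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach the most recent configuration to each stimulation event.

    Assumption: events are already sorted in chronological order.
    """
    current: Optional[Dict[str, Any]] = None
    completed = []
    for event in events:
        if event["kind"] == "config":
            current = {"channel": event["channel"], "amplitude": event["amplitude"]}
        elif event["kind"] == "stim":
            if current is None:
                raise ValueError("stimulation event seen before any configuration")
            completed.append({**event, **current})
    return completed


if __name__ == "__main__":
    # Made-up stream: a configuration followed by the pulses it applies to.
    stream = [
        {"kind": "config", "channel": 10, "amplitude": 60},
        {"kind": "stim", "time": 1.0},
        {"kind": "stim", "time": 1.1},
        {"kind": "config", "channel": 50, "amplitude": 20},
        {"kind": "stim", "time": 2.0},
    ]
    for completed_event in assign_config(stream):
        print(completed_event)
```

Because each stimulation event inherits its channel and amplitude from the preceding configuration, the events must be processed in order, which is one reason the import is hard to parallelize naively.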
  41. 41. 41 Challenges ● Number of objects, amount of information ● Information filtering ● MDF saving metadata in database and data in file ● Single import process. One object at a time ● Validation of the information How can I be reasonably sure that the information and data imported are correct? ● Hardware ● Platform: Matlab or Python
  42. 42. 42 Hardware First iteration: ● Virtual machine on Xen hypervisor ● 4 cores ● 16 GB ● Virtual OS disk ● Database drive: NFS mounted from server RAID ● Raw files: lab file server mounted through SMB/CIFS ● MDF: metadata in db, data in Matlab files Current configuration: ● Dedicated server ● 16 cores ● 32 GB ● OS drive: mechanical ● Database drive: SSD ● Raw files: lab file server mounted through SMB/CIFS ● MDF: data and metadata in db Next iteration: ● Distributed system ● Parallel processes ● MDF: v2.x
  43. 43. 43 Conclusions ● Big data approach was and is a successful strategy in managing lab data ● We were able to manage big data using MDF Minimal changes were required ● Query capabilities are priceless Allowed faster access to data and more compact code ● Logging is invaluable, priceless
  44. 44. 44 Future ● Scaling: more data, computing power, parallel processing ● MDF v.2.x: more storage options, other languages, integration with other systems ● Batch creation of MDF objects ● Explore new queries and expand query functionality. SQL-like language: select data.waveform from sensory where metadata.subject = “sbj_01” and data.time > 10s ● Automation: import, validation, processing ● Quantitative analysis of logs: processing time, statistics, errors
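The SQL-like query on this slide can be mimicked over nested dictionaries to show what such a language would need to do. This is only a sketch of the semantics, not a planned MDF feature: the helpers `get_path` and `select`, the record layout, and the reading of `10s` as 10 seconds are all assumptions.

```python
from typing import Any, Callable, Dict, Iterable, List


def get_path(record: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as 'metadata.subject' through nested dicts."""
    value: Any = record
    for part in dotted.split("."):
        value = value[part]
    return value


def select(
    records: Iterable[Dict[str, Any]],
    column: str,
    where: Callable[[Dict[str, Any]], bool],
) -> List[Any]:
    """Return the chosen column for every record that satisfies the predicate."""
    return [get_path(record, column) for record in records if where(record)]


if __name__ == "__main__":
    # Made-up 'sensory' collection; times are assumed to be in seconds.
    sensory = [
        {"metadata": {"subject": "sbj_01"}, "data": {"time": 12.0, "waveform": [1, 2]}},
        {"metadata": {"subject": "sbj_01"}, "data": {"time": 5.0, "waveform": [3]}},
        {"metadata": {"subject": "sbj_02"}, "data": {"time": 20.0, "waveform": [4]}},
    ]
    # select data.waveform from sensory
    # where metadata.subject = "sbj_01" and data.time > 10s
    print(select(
        sensory,
        "data.waveform",
        lambda r: get_path(r, "metadata.subject") == "sbj_01" and get_path(r, "data.time") > 10,
    ))
```

A real implementation would parse the query text and push the filter down to the database instead of scanning records in memory.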
  45. 45. 46 Thank you Thanks to: ● Rob Gaunt ● Lee Fisher ● Ameya Nanivadekar ● Tyler Simpson ● All my colleagues at RNEL Research was sponsored by the U.S. Army Research Office and the Defense Advanced Research Projects Agency (DARPA) and was accomplished under Cooperative Agreement Number W911NF-15-2-0016. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office, Army Research Laboratory, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. Questions, suggestions: ● ● Available at Production versions: 1.4 and 1.5. Currently working on v1.6 and v2.0 MDF Acknowledgements RNEL website: