Big Data in a neurophysiology research lab… what?

by Max Novelli
Rehabilitation Neural Engineering Lab,
University of Pittsburgh
man8@pitt.edu, max.novelli@pitt.edu
J on The Beach, Malaga, Spain 2018/05/24

Myself
1990: Technical diploma, Vimercate, Italy
1998: Graduate degree Laurea, Milano, Italy
2000: Started playing with Linux and Open Source; started playing with web technologies
2001: Started practicing yoga
2002: Consulting: system administration for multiple neurocognitive research labs
2004: UCIS - system administrator
2006: BIRC - system administrator, data manager
2008: LRDC - system administrator, data manager
2012: RNEL - software engineer, system administrator, data manager, data architect
2013: 200hrs yoga teaching certification
2016: Big data technologies in research lab
2018: RNEL - Head of Data and Informatics

3
RNEL
Rehabilitation Neural Engineering Lab
Restoring sensory and motor functions
after nervous system injury and limb loss
Neurophysiology Research

4
Able-bodied individuals
Central and peripheral nervous system
Limbs
Motor functions
Sensory functions: position, force, pressure, texture, temperature, shape, ...
...in health!!!

5
Nervous system injury or limb loss
Central and peripheral nervous system
Brain – Computer interface
Prosthetic limb
Motor functions
Sensory functions
...otherwise!!!

6
Experiments
Experimental system
Data & Metadata
Neural activity
Sensory signals
Kinematics
Videos and images
Raw neural activity
Intramuscular signals
Nerve activity
Control signals
Stimulation patterns
Events
Control levels
Joint positions
Forces and torques

7
Data vs Metadata
Data
Metadata
Quantity measured
Units
Sampling frequency
Type of sensor
Operator
Notes
Issues
Sensor serial number
Range settings
Sensor location

8
First try
Use raw data files
Problems:
● Finding the information that you are looking for
● Access to proprietary formats
● Data and information sprawl and duplication

9
Second try
Matlab “database” (MDB) based on HDS Toolbox
http://jwagenaar.github.io/HDS-Toolbox/
Matlab-based hierarchical file repository of objects accessible through dedicated
functions, offering lazy loading of data, completely transparent to the user.
Each object aggregates data and the related metadata together.
Hierarchy levels shown in the diagram:
Subject
Experiment
System 1, System 2
Trial 1, Trial 2, ..., Trial n
Data type 1, Data type 2 (repeated per branch)
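
The idea of a hierarchy of objects that pair metadata with lazily loaded data can be sketched in a few lines. This is only an illustration in Python, not the HDS Toolbox or MDB code: the `Node` class, the `loader` callable and `fake_reader` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Node:
    """One object in the hierarchy: metadata stays in memory, data is read on demand."""
    kind: str                                   # e.g. "Subject", "Experiment", "Trial"
    metadata: Dict[str, Any]
    loader: Optional[Callable[[], Any]] = None  # hypothetical callable that reads the data
    children: List["Node"] = field(default_factory=list)
    _data: Any = field(default=None, init=False, repr=False)
    _loaded: bool = field(default=False, init=False, repr=False)

    @property
    def data(self) -> Any:
        # Lazy loading: the loader runs only on first access, then the result is cached.
        if not self._loaded and self.loader is not None:
            self._data = self.loader()
            self._loaded = True
        return self._data

    def add(self, child: "Node") -> "Node":
        """Attach a child object and return it, so calls can be chained."""
        self.children.append(child)
        return child


calls = []


def fake_reader():
    # Stand-in for reading a large recording from a file.
    calls.append("read")
    return [0.1, 0.2, 0.3]


subject = Node("Subject", {"name": "sbj_01"})
experiment = subject.add(Node("Experiment", {"name": "exp_01"}))
trial = experiment.add(Node("Trial", {"name": "Trial 1"}, loader=fake_reader))
```

Browsing `subject.children` touches only metadata; the recording is read the first time `trial.data` is accessed and is then served from the cache.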

10
Second try chapter 2
MDB (HDS-toolbox implementation in Matlab)
Pros:
● Enhance user experience
● Interactive exploration
● Step away from raw data formats
● Structure our data in the most logical way for us
● Explore our data and associated metadata easily
Cons:
● Did not scale well when data started to grow in size
● Data and functionalities bundled together
● Corruption
● Code base maintenance and upgrades
● Limited flexibility in data and metadata properties
● Queries: required for loops

11
How to move forward?
Big data approach
Brainstorm session
Wish List:
● Queries (Database)
● Flexible hierarchy
● Flexible data and metadata
● Minimal coding
● Direct access to data
● Cross-platform
● Data – Code Separation
Maintain MDB features:
● Hierarchical structure
● Relationships between objects
● Data lazy loading
● Pairing data – metadata
Decision: Start from scratch and build a new tool

12
Big Data
https://www.datasciencecentral.com/profiles/blogs/data-veracity
https://haritbigdata.wordpress.com/2015/06/15/bigdata-introduction/
https://www.theviable.co/how-big-data-impact-to-corporate/3v-model-of-big-data/

14
Big Data: Volume
50 terabytes… and counting
Millions of files
Hundreds of subjects
Thousands of data recordings
Team of 2 people, with multiple responsibilities

15
Big Data: Variety
Variety in:
● type of data
● format they are saved in
● information collected
Still images and videos: jpg, png, mov, mpg
Time series (neural, kinematics, ...): tdms, plx, pl2, tdt, nev, mat
Structured data: cfg, mat, txt, json, yml

16
Big Data: Velocity
Within experiment: continuous / stream
Constant stream of messages containing data, control signals and events (similar to IoT)
Dragonfly messaging system (www.dragonfly-msg.org)
Within lab: bursts / batch
Data is transferred to the central server after each experiment and analyzed
Plot: Size and Activity vs Time (days)

17
Big Data: Veracity
In RNEL terms: Data Validation

18
Big Data: Veracity
What do the labels mean?
What is the unit of measurement?

19
Big Data: Veracity
Which sensor was used to collect
this signal?

20
Big Data: Veracity
What is the difference between the signals in columns 1 and 2?

21
Big Data: Veracity
Did we drop any data point?
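
One way to answer this question automatically is to compare the spacing of consecutive timestamps with the nominal sampling period. The sketch below assumes that each sample carries a timestamp in seconds and that the sampling frequency is known from the metadata; `find_gaps` and its `tolerance` parameter are hypothetical names, not part of any RNEL tool.

```python
from typing import List, Tuple


def find_gaps(timestamps: List[float], sampling_hz: float,
              tolerance: float = 0.5) -> List[Tuple[int, int]]:
    """Return (index, estimated_missing_samples) for every interval longer than expected."""
    period = 1.0 / sampling_hz
    gaps = []
    for i in range(1, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        # An interval clearly longer than one period suggests dropped samples.
        if delta > period * (1.0 + tolerance):
            missing = round(delta / period) - 1
            gaps.append((i, missing))
    return gaps


# Hypothetical 10 Hz signal where the samples at 0.3 s and 0.4 s are missing.
ts = [0.0, 0.1, 0.2, 0.5, 0.6]
gaps = find_gaps(ts, sampling_hz=10.0)
```

Here `gaps` holds one entry, pointing at index 3 with two samples estimated missing.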

22
Big Data: Veracity
Important for experimental reproducibility and replicability,
consistency in user experience, and optimal prosthetic control

23
RNEL addition
Continuous
Curation
Researchers, Data managers, Data curators, Others
Manual, Automated
Multiple sources
Platform independent

24
RNEL Big Data
4 V: Volume, Variety, Velocity, Validation (Veracity)
2 C: Continuous, Curation

25
Data management
Multipurpose Data Framework: MDF
A framework to organize and manage data, including
metadata, designed to provide consistent, normalized and
easy data access
Design Requirements:
● Solid unique id (uuid)
● Platform independence (Matlab, Python, ...)
● Light-weight
● Direct access to underlying data
● Query functionality
● Lazy loading for data and objects
● Dynamic metadata and data properties
● Metadata in database, data in .mat files

26
First benefits
>> tr = mdf.load('mdf_type','Trial','name', 'Block_030')
tr =
      type : Trial
      uuid : 42d1a9f5acbe4265b31f2be327d34fde
      data : []
  metadata :
          date : 07/25/2014
      duration : 131.3131
            id : 30
          name : Block_030
     startTime : 10:33:43
       success : 1
      hardware :
          Omniplex : 30
  PlexonStimulator : 30
     OfflineSorter : 30
          Platform : 30
            ...
  children :
     spikeData : [2 SpikeData]
            ...
>> sbj = mdf.load('mdf_type','Subject')
sbj =
       type : Subject
       uuid : b2d6cc61d5504e828fa7514cc3b10c2a
       data : []
   metadata :
          name : Flahr
        number : 40
            ...
   children :
           exp : [1 Experiment]
            ...
We started querying our data
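
The key/value style of the calls above can be illustrated outside Matlab. The following Python sketch is not the MDF implementation: `load` is a hypothetical in-memory stand-in, and the records are made-up examples shaped like the objects printed above (`Block_031` does not come from the lab data).

```python
from typing import Any, Dict, List

# Hypothetical records shaped like the objects printed above.
objects: List[Dict[str, Any]] = [
    {"type": "Subject", "metadata": {"name": "Flahr", "number": 40}},
    {"type": "Trial", "metadata": {"name": "Block_030", "id": 30, "success": 1}},
    {"type": "Trial", "metadata": {"name": "Block_031", "id": 31, "success": 0}},
]


def load(records, mdf_type=None, **criteria):
    """Return records of the given type whose metadata equals every criterion."""
    hits = []
    for rec in records:
        if mdf_type is not None and rec["type"] != mdf_type:
            continue
        if all(rec["metadata"].get(key) == value for key, value in criteria.items()):
            hits.append(rec)
    return hits


trials = load(objects, mdf_type="Trial", name="Block_030")
successful = load(objects, mdf_type="Trial", success=1)
subjects = load(objects, mdf_type="Subject")
```

The point of the design is that the caller states what it wants (type and metadata values) instead of looping over the hierarchy by hand.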

27
...more benefits
We can visually validate our data
Figure panels: data before importing new data; data after importing new data

28
Application
Sensory Experiment

29
Application
Sensory Experiment
Evoked sensation when stimulation
is applied to a selected electrode

30
Reasons to adopt MDF
● Efficient way to organize data
● Flexibility
● Continuous Curation
● Data reusability
● Metadata queries
● Separation between data accessing and data usage

31
What’s next?
Real Big Data… almost!!!
More queries
Bigger Big Data
Millions of objects

32
Going beyond...
C5/C6 incomplete spinal injury
Sensory functions
2 6x10 Utah microelectrode arrays
Motor functions
2 10x10 Utah microelectrode arrays

33
Experiment
Intracortical microstimulation as source of somatosensory feedback
Neural firing rates
Velocity-based optimal linear estimator decoder
Velocity commands
Force feedback
Sensor Stimulus Transformation
Intracortical Microstimulation
Current-controlled, charge-balanced, asymmetric pulses

34
Stimulation stream
Diagram components: Encoder, Stimulator, Safety
Event types along the time axis (starting at time 0):
● Stimulation Pattern Definition
● Stimulation Application
● Individual Stimulation
Examples:
● Stim pattern 1 on Channel 10
● Stim pattern 3 on Channel 50
● Stim pattern 1 on Channel 23
The information about which stimulation pattern is delivered on which
channel is known only by analyzing the sequence of events.
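
Recovering pattern and channel from the order of events amounts to replaying the stream while keeping track of the current state. The sketch below is a simplified illustration under strong assumptions: the event layout (`kind`, `pattern`, `channel`, `params`), the amplitude values, and the rule that each pulse inherits the most recent application are all hypothetical and are not taken from the lab's message format.

```python
from typing import Any, Dict, List


def resolve_stream(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Replay events in time order and attach channel and pattern parameters to each pulse."""
    patterns: Dict[int, Dict[str, Any]] = {}   # pattern id -> definition
    active = None                              # most recent (pattern id, channel)
    pulses = []
    for ev in sorted(events, key=lambda e: e["time"]):
        if ev["kind"] == "define":
            patterns[ev["pattern"]] = ev["params"]
        elif ev["kind"] == "apply":
            active = (ev["pattern"], ev["channel"])
        elif ev["kind"] == "stim" and active is not None:
            pattern_id, channel = active
            pulses.append({"time": ev["time"], "channel": channel,
                           "pattern": pattern_id, **patterns.get(pattern_id, {})})
    return pulses


# Made-up stream: two pattern definitions, two applications, three pulses.
stream = [
    {"time": 0.0, "kind": "define", "pattern": 1, "params": {"amplitude_uA": 20}},
    {"time": 0.1, "kind": "define", "pattern": 3, "params": {"amplitude_uA": 60}},
    {"time": 1.0, "kind": "apply", "pattern": 1, "channel": 10},
    {"time": 1.1, "kind": "stim"},
    {"time": 2.0, "kind": "apply", "pattern": 3, "channel": 50},
    {"time": 2.1, "kind": "stim"},
    {"time": 2.2, "kind": "stim"},
]
resolved = resolve_stream(stream)
```

A pulse event alone carries no channel; only after the replay does each entry in `resolved` say which pattern went to which channel.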

35
Scientific question
Is the performance of each electrode
degrading with the amount of charge
delivered over time?
We need to count the total number of
stimulation impulses delivered to each
electrode in order to be able to compute the
total charge delivered
Time since implant: ~3 years
Number of days of recording: >500
Total number of files: >150000
Total number of events: to be determined
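
Once every pulse is attributed to a channel, the count and the charge are simple accumulations. The sketch below assumes that the charge of one pulse can be approximated as amplitude times the duration of one phase; that formula, the field names and the numeric values are assumptions made for illustration, not the lab's actual computation.

```python
from collections import defaultdict


def charge_per_channel(pulses, phase_width_s):
    """Count pulses and sum charge (coulombs) per channel, assuming Q = I * phase width."""
    counts = defaultdict(int)
    charge_c = defaultdict(float)
    for p in pulses:
        counts[p["channel"]] += 1
        charge_c[p["channel"]] += p["amplitude_a"] * phase_width_s
    return dict(counts), dict(charge_c)


# Hypothetical pulses: amplitudes in amperes, one shared phase width of 200 microseconds.
pulses = [
    {"channel": 10, "amplitude_a": 20e-6},
    {"channel": 10, "amplitude_a": 20e-6},
    {"channel": 50, "amplitude_a": 60e-6},
]
counts, charge = charge_per_channel(pulses, phase_width_s=200e-6)
```

With these made-up numbers, channel 10 receives two pulses of 4 nC each, so about 8 nC in total.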

36
Solution
MDF
One object for each stimulation event
Object example:
"experiment" : "CL",
"subject" : "CRS02b",
"location" : "Home",
"session" : "00231",
"set" : "0009",
"block" : "0001",
"trial" : "0001",
"rep" : "0080",
"date" : "20161121",
"time" : "12:44:53",
"raw_file" : "…/QL.Task_State0002.Set0009….bin",
"name" : "STIM_SYNC_EVENT",
"sequenceno" : 6767,
"raw" : {
  "header" : {
   "msg_type" : 1808,
   … },
  "data" : {
   "source_index" : 0,
   "source_timestamp" : 4399.589533,
   … }
  }
Queries:
We can extract any group of stimulation events:
● by session, set, trial (SST)
● by file
● by day or hour
● by channel
● by amplitude
● by type of experiment
● by condition
Each stimulation event = one pulse on one channel (on Channel x, at time t)
Major hurdle: importing data
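
Grouping by hour or by session and set only needs the `date`, `time`, `session` and `set` fields of the object example. The sketch below works on plain dictionaries rather than on the database; the first event reuses values from the example, the other two are hypothetical, and `hour_bucket` is a name introduced here.

```python
from collections import Counter
from datetime import datetime


def hour_bucket(event):
    """Combine "date" (YYYYMMDD) and "time" (HH:MM:SS) and truncate to the hour."""
    stamp = datetime.strptime(event["date"] + " " + event["time"], "%Y%m%d %H:%M:%S")
    return stamp.replace(minute=0, second=0)


events = [
    {"session": "00231", "set": "0009", "date": "20161121", "time": "12:44:53"},
    {"session": "00231", "set": "0009", "date": "20161121", "time": "12:59:01"},  # hypothetical
    {"session": "00231", "set": "0010", "date": "20161121", "time": "13:02:10"},  # hypothetical
]

per_hour = Counter(hour_bucket(e) for e in events)
per_session_set = Counter((e["session"], e["set"]) for e in events)
```

The same grouping keys could be handed to a database aggregation; the in-memory version only shows which fields drive each kind of query.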

37
Numbers and Time
Version 1
We extracted and imported into the db the minimal set of
information, data and metadata needed to perform our task and
answer the scientific question.
Information filtering
Number of objects
Estimated: between 10 and 20 million
Time
Estimated completion time: 6 months

38
Back to MDF
Analyzing logs from the first pass...
MDF Design Requirements:
● Before: metadata in database, data in .mat files
● After: metadata and data only in database (MongoDB)

39
Numbers and Time
Version 2
We extracted and imported into the db the minimal set of
information, data and metadata needed to perform our task and
answer the scientific question.
Information filtering
Number of objects
Estimated: between 10 and 20 million
Total count: ~34 million
Time
Estimated completion time: 6 months
After first round of optimization: 2 weeks

40
Workflow
Input: Configuration, Events (*.bin, *.mat)
● Import: Stim events, Configuration → ql (Stim events; Configuration: Amplitude, Channel)
● SST listing → ql_sst (Session, Set, Trial, # events)
● Assignment: assign config to stim events → ql_sch (complete stim events by channel)
● Counting: stim by channel → ql_count (stim event counts, different modalities)
● Validation → ql_val (validation metrics)
● Visualization

41
Challenges
● Number of objects, amount of information
● Information filtering
● MDF saving metadata in database and data in file
● Single import process. One object at a time
● Validation of the information
How can I be reasonably sure that the information
and data imported are correct?
● Hardware
● Platform: Matlab or Python

42
Hardware
First iteration:
● Virtual machine on Xen hypervisor
● 4 cores
● 16 GB
● Virtual OS disk
● Database drive: NFS mounted from server RAID
● Raw files: lab file server mounted through SMB/CIFS
● MDF: metadata in db, data in Matlab files
Current configuration:
● Dedicated server
● 16 cores
● 32 GB
● OS drive: mechanical
● Database drive: SSD
● Raw files: lab file server mounted through SMB/CIFS
● MDF: data and metadata in db
Next iteration:
● Distributed system
● Parallel processes
● MDF: v2.x

43
Conclusions
● The big data approach was and is a successful strategy for managing lab data
● We were able to manage big data using MDF
Minimal changes were required
● Query capabilities are priceless
They allowed faster access to data and more compact code
● Logging is invaluable, priceless

44
Future
● Scaling:
more data, computing power, parallel processing
● MDF v.2.x:
more storage options, other languages,
integration with other systems
● Batch creation of MDF objects
● Explore new queries and expand query functionality.
SQL-like language:
select data.waveform from sensory where
metadata.subject = "sbj_01" and data.time > 10s
● Automation:
import, validation, processing
● Quantitative analysis of logs
processing time, statistics, errors
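
The SQL-like query listed above maps naturally onto a MongoDB-style filter with dotted field paths and the `$gt` operator. The sketch below evaluates such a filter in memory so that it runs without a database; `get_path`, `matches` and `select` are hypothetical helpers, the documents are made up, and it assumes `data.time` is stored in seconds.

```python
from typing import Any, Dict, List


def get_path(doc: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as "metadata.subject" through nested dicts."""
    current: Any = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches(doc: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Check equality conditions and "$gt" conditions (the only operator handled here)."""
    for path, cond in where.items():
        value = get_path(doc, path)
        if isinstance(cond, dict):
            if "$gt" in cond and not (value is not None and value > cond["$gt"]):
                return False
        elif value != cond:
            return False
    return True


def select(collection: List[Dict[str, Any]], field: str, where: Dict[str, Any]) -> List[Any]:
    """Return the requested field from every document that satisfies the filter."""
    return [get_path(doc, field) for doc in collection if matches(doc, where)]


# Made-up documents standing in for the "sensory" collection.
sensory = [
    {"metadata": {"subject": "sbj_01"}, "data": {"time": 12.0, "waveform": [1, 2, 3]}},
    {"metadata": {"subject": "sbj_01"}, "data": {"time": 4.0, "waveform": [4, 5, 6]}},
    {"metadata": {"subject": "sbj_02"}, "data": {"time": 15.0, "waveform": [7, 8, 9]}},
]

# select data.waveform from sensory where metadata.subject = "sbj_01" and data.time > 10s
where = {"metadata.subject": "sbj_01", "data.time": {"$gt": 10}}
waveforms = select(sensory, "data.waveform", where)
```

Only the first document satisfies both conditions, so `waveforms` contains a single waveform.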

46
Thank you
Acknowledgements
Thanks to:
● Rob Gaunt
● Lee Fisher
● Ameya Nanivadekar
● Tyler Simpson
● All my colleagues at RNEL
Research was sponsored by the U.S. Army Research Office and the Defense Advanced Research Projects
Agency (DARPA) and was accomplished under Cooperative Agreement Number W911NF-15-2-0016. The views and
conclusions contained in this document are those of the authors and should not be interpreted as representing the
official policies, either expressed or implied, of the Army Research Office, Army Research Laboratory, or the U.S.
Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes
notwithstanding any copyright notation hereon.
Questions, suggestions:
● man8@pitt.edu
● max.novelli@pitt.edu
MDF
Available at https://bitbucket.org/nitrosx/mdf
Production versions: 1.4 and 1.5. Currently working on v1.6 and v2.0
RNEL website:
http://www.rnel.pitt.edu
