Big Data in a neurophysiology research lab… what? by Max Novelli
At RNEL, we have been working hard to lay the foundation to better manage our data and to integrate big data and AI technologies into our data management and analysis pipelines. These needs arose from the sheer size of our experimental data, which is becoming unmanageable even on powerful workstations. We also determined that we need better query methodologies and better validation and visualization tools.
Our long-term goal is to answer the following question: will we ever be able to go from raw experimental data to querying curated data with a simple SQL-like language, without spending enormous resources and manpower, using a process that is organic, intuitive and flexible? Can we also leverage modern big data technologies and data science to achieve this goal?
This presentation is the story of an interdisciplinary journey that started approximately 5 years ago. The journey gave us a deeper knowledge of our data, a better set of management methodologies, and tools that let us query and aggregate across datasets and extend those capabilities easily.
In this presentation, we first provide a general background of the work we do in our lab, with examples of the experiments we conduct as context for the data we acquire and the challenges that come with them. Next, we outline some of the questions that researchers asked (and keep asking) when they work with large data structures to answer their own scientific questions without being bogged down by the technologies used and the original format of the data. Then we raise some questions related to data management, which help improve validation and reduce the manpower needed to curate the data. Starting from the big picture, we walk through the decisions and requirements that came out of our brainstorming sessions, and show how far along we are in our journey and the path we took to get here. We conclude by highlighting some of the results we were able to achieve, such as activation maps and central nervous system stimulation counts.
1. Big Data in a neurophysiology research lab… what?
by Max Novelli
Rehabilitation Neural Engineering Lab,
University of Pittsburgh
man8@pitt.edu, max.novelli@pitt.edu
J on The Beach, Malaga, Spain 2018/05/24
2. Myself
1990: Technical diploma, Vimercate, Italy
1998: Graduate degree Laurea, Milano, Italy
2000: Started playing with Linux and Open Source
Started playing with web technologies
2001: Started practicing yoga
2002: Consulting: system administration
2004: UCIS - system administrator
2006: BIRC - system administrator, data manager
2008: LRDC - system administrator, data manager
2012: RNEL - software engineer, system administrator
for multiple neurocognitive research labs
data manager, data architect
2013: 200hrs yoga teaching certification
2016: Big data technologies in research lab
2018: RNEL - Head of Data and Informatics
5.
Sensory Functions
Brain – Computer interface
Motor functions
Nervous system injury or limb loss
Central and peripheral nervous system
Prosthetic limb
...otherwise!!!
6.
Neural Activity
Experiments
Sensory Signals
Data & Metadata
Experimental system
Kinematics
Videos and Images
Raw neural activity
Intramuscular signals
Nerve activity
Control signals
Stimulation patterns
Events
Control levels
Joint positions
Forces and torques
Stimulation Patterns
Control Signals
7. Data vs Metadata
Data
Metadata
Quantity measured
Units
Sampling frequency
Type of sensor
Operator
Notes
Issues
Sensor serial number
Range settings
Sensor location
8. First try
Use raw data files
● Finding the information that you are looking for
● Access to proprietary formats
● Data and information sprawling and duplication
9. Second try
Matlab “database” (MDB) based on HDS Toolbox
http://jwagenaar.github.io/HDS-Toolbox/
Matlab-based hierarchical file repository of objects accessible through dedicated functions, offering lazy data loading that is completely transparent to the user.
Each object aggregates data and related metadata together.
[Diagram: example hierarchy. Levels, top to bottom: Subject; Experiment; System 1, System 2; Trial 1, Trial 2, ... Trial n; Data type 1 and Data type 2 under each trial]
10. Second try chapter 2
MDB (HDS-toolbox implementation in Matlab)
Benefits:
● Enhance user experience
● Interactive exploration
● Step away from raw data formats
● Structure our data in the most logical way for us
● Explore our data and associated metadata easily
Limitations:
● Did not scale well when data started to grow in size
● Data and functionalities bundled together
● Corruption
● Code base maintenance and upgrades
● Flexibility in data and metadata properties
● Queries: for loops
11. How to move forward?
Big data approach
Brainstorm session
Wish List:
● Queries (Database)
● Flexible hierarchy
● Flexible data and metadata
● Minimal coding
● Direct access to data
● Cross-platform
● Data – Code Separation
Maintain MDB features:
● Hierarchical structure
● Relationships between objects
● Data lazy loading
● Pairing data – metadata
Decision: Start from scratch and build a new tool
14. Big Data: Volume
50 TeraBytes… and counting
Millions of files
Hundreds of subjects
Thousands of data recordings
Team of 2 people, with multiple responsibilities
15. Big Data: Variety
Variety in:
● type of data
● format they are saved in
● information collected
Structured data: cfg, mat, txt, json, yml
Still images and videos: jpg, png, mov, mpg
Time series (neural, kinematics, ...): tdms, plx, pl2, tdt, nev, mat
16. Big Data: Velocity
Within experiment: continuous / stream
Constant stream of messages containing data, control signals and events (similar to IoT)
Dragonfly messaging system (www.dragonfly-msg.org)
Within lab: bursts / batch
Data is transferred to the central server after each experiment and analyzed
[Plot: size and activity over time (days)]
22. Big Data: Veracity
In RNEL terms: Data Validation
Important for experimental reproducibility and replicability, consistency in user experience, and optimal prosthetic control
24. RNEL Big Data
4 V: Volume, Variety, Velocity, Validation (Veracity)
2 C: Continuous Curation
25. Data management
Multipurpose Data Framework: MDF
A framework to organize and manage data, including metadata, designed to provide consistent, normalized and easy data access
Design Requirements:
● Solid unique id (uuid)
● Platform independence (Matlab, Python, ...)
● Light-weight
● Direct access to underlying data
● Query functionality
● Lazy loading for data and objects
● Dynamic metadata and data properties
● Metadata in database, data in .mat files
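To make a few of these design requirements concrete (unique id, metadata kept apart from data, lazy loading), here is a minimal Python sketch. It is not the actual MDF implementation: the class `LazyObject`, its fields and the `load_waveform` helper are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional
from uuid import uuid4


@dataclass
class LazyObject:
    """Hypothetical MDF-style object (not the actual MDF class)."""
    kind: str                          # e.g. "subject", "experiment", "trial"
    metadata: Dict[str, Any]           # small, queryable, kept in memory
    loader: Callable[[], Any]          # stands in for reading a data file
    uuid: str = field(default_factory=lambda: str(uuid4()))
    _data: Optional[Any] = field(default=None, init=False, repr=False)
    _loaded: bool = field(default=False, init=False, repr=False)

    @property
    def data(self) -> Any:
        """Load the bulk data on first access only, then reuse it."""
        if not self._loaded:
            self._data = self.loader()
            self._loaded = True
        return self._data


load_calls = []


def load_waveform():
    # Placeholder for an expensive read from disk.
    load_calls.append("read")
    return [0.0, 0.1, 0.2]


trial = LazyObject(kind="trial", metadata={"subject": "sbj_01"}, loader=load_waveform)
print(trial.uuid, trial.metadata)   # nothing has been read yet
print(trial.data)                   # first access triggers the read
print(trial.data)                   # second access reuses the loaded data
print(len(load_calls))              # number of reads performed
```

The metadata stays cheap to inspect and query, while the bulk data is only read when someone actually asks for it.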
30. Reasons to adopt MDF
● Efficient way to organize data
● Flexibility
● Continuous Curation
● Data reusability
● Metadata queries
● Separation between data accessing and data usage
33. Experiment
Intracortical microstimulation as source of somatosensory feedback
Current controlled, charge balanced, asymmetric pulses
[Diagram labels: Neural firing rates; Velocity-based optimal linear estimator decoder; Velocity commands; Force feedback; Sensor Stimulus Transformation; Intracortical Microstimulation]
34. Stimulation stream
[Diagram: event timeline, time axis starting at 0. Labels: Encoder, Stimulator, Safety]
Stimulation Pattern Definition
Stimulation Application
Individual Stimulation
Stim pattern 1 on Channel 10
Stim pattern 3 on Channel 50
Stim pattern 1 on Channel 23
The information about which stimulation pattern is delivered on which channel is known only by analyzing the sequence of events.
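Because the pattern and channel of a pulse are only known from the order of events, attributing pulses means replaying the stream in sequence. The sketch below illustrates that idea in Python. The event layout (`kind`, `pattern`, `channel`, `time`) is a hypothetical simplification, not the real message format, and it assumes only one application is active at a time.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def replay(events: Iterable[dict]) -> List[Tuple[float, int, int]]:
    """Walk the stream in order; return (time, channel, pattern) per pulse."""
    patterns: Dict[int, dict] = {}                 # pattern id -> definition
    active: Optional[Tuple[int, int]] = None       # (pattern id, channel)
    delivered: List[Tuple[float, int, int]] = []
    for ev in events:
        if ev["kind"] == "define":
            patterns[ev["pattern"]] = ev
        elif ev["kind"] == "apply":
            if ev["pattern"] not in patterns:
                raise ValueError(f"pattern {ev['pattern']} applied before definition")
            active = (ev["pattern"], ev["channel"])
        elif ev["kind"] == "stim":
            if active is None:
                raise ValueError("stimulation seen before any application")
            pattern, channel = active
            delivered.append((ev["time"], channel, pattern))
    return delivered


# Hypothetical stream mirroring the slide: pattern 1 on channel 10,
# pattern 3 on channel 50, pattern 1 on channel 23.
stream = [
    {"kind": "define", "pattern": 1},
    {"kind": "define", "pattern": 3},
    {"kind": "apply", "pattern": 1, "channel": 10},
    {"kind": "stim", "time": 0.10},
    {"kind": "stim", "time": 0.11},
    {"kind": "apply", "pattern": 3, "channel": 50},
    {"kind": "stim", "time": 0.20},
    {"kind": "apply", "pattern": 1, "channel": 23},
    {"kind": "stim", "time": 0.30},
]
pulses = replay(stream)
pulses_per_channel = Counter(channel for _, channel, _ in pulses)
print(pulses_per_channel)
```

Once each pulse carries its own channel and pattern, it can be stored as an independent record and counted without re-reading the stream.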
35. Scientific question
Is the performance of each electrode degrading with the amount of charge delivered over time?
We need to count the total number of stimulation impulses delivered to each electrode in order to be able to compute the total charge delivered
Time since implant: ~3 years
Number of days of recording: >500
Total number of files: >150000
Total number of events: to be determined
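Going from pulse counts to total charge is simple arithmetic once the counts exist. As a rough sketch, assume rectangular current-controlled phases, so the charge in one phase is the current amplitude times the phase duration. The amplitudes, phase widths and counts below are made-up illustration values, not experimental data, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def charge_per_phase_nc(amplitude_ua: float, phase_width_us: float) -> float:
    """Charge of one rectangular phase: uA * us = pC, divided by 1000 gives nC."""
    return amplitude_ua * phase_width_us / 1000.0


def total_charge_nc(groups: Iterable[Tuple[int, float, float]]) -> float:
    """Sum charge over groups of (pulse count, amplitude in uA, phase width in us)."""
    return sum(n * charge_per_phase_nc(a, w) for n, a, w in groups)


# Hypothetical pulse groups for a single electrode.
electrode_groups = [
    (1000, 60.0, 200.0),   # 1000 pulses at 60 uA, 200 us per phase
    (500, 20.0, 200.0),    # 500 pulses at 20 uA, 200 us per phase
]
print(total_charge_nc(electrode_groups), "nC")
```

Grouping pulses by amplitude before summing matters because a single electrode rarely receives the same amplitude throughout an experiment, which is one reason every pulse needs to be queryable on its own.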
36. Solution
MDF
One object for each stimulation event
Each stimulation event = one pulse on one channel (on Channel x, at time t)
Object example:
{
  "experiment" : "CL",
  "subject" : "CRS02b",
  "location" : "Home",
  "session" : "00231",
  "set" : "0009",
  "block" : "0001",
  "trial" : "0001",
  "rep" : "0080",
  "date" : "20161121",
  "time" : "12:44:53",
  "raw_file" : "…/QL.Task_State0002.Set0009….bin",
  "name" : "STIM_SYNC_EVENT",
  "sequenceno" : 6767,
  "raw" : {
    "header" : {
      "msg_type" : 1808,
      … },
    "data" : {
      "source_index" : 0,
      "source_timestamp" : 4399.589533,
      … }
  }
}
Queries:
we can extract any group of stimulation events:
● by session, set, trial (SST)
● by file
● by day or hour
● by channel
● by amplitude
● by type of experiment
● by condition
Major hurdle: importing data
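The kinds of queries listed above can be written as plain filter dictionaries. The sketch below uses equality filters with dot notation for nested fields, the same shape MongoDB accepts as a query document, and evaluates them in memory so it runs without a database. The helper names `get_path`, `matches` and `find` are hypothetical, and the second sample event is invented for contrast.

```python
from typing import Any, Dict, List


def get_path(doc: Dict[str, Any], dotted: str) -> Any:
    """Resolve a dotted path such as 'raw.header.msg_type'; None if missing."""
    current: Any = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(doc: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """True when every field in the filter equals the value in the document."""
    return all(get_path(doc, path) == value for path, value in flt.items())


def find(docs: List[Dict[str, Any]], flt: Dict[str, Any]) -> List[Dict[str, Any]]:
    return [doc for doc in docs if matches(doc, flt)]


events = [
    {"subject": "CRS02b", "session": "00231", "set": "0009", "trial": "0001",
     "date": "20161121", "name": "STIM_SYNC_EVENT",
     "raw": {"header": {"msg_type": 1808}}},
    # Invented second event, only to show that filters exclude something.
    {"subject": "CRS02b", "session": "00232", "set": "0001", "trial": "0001",
     "date": "20161122", "name": "STIM_SYNC_EVENT",
     "raw": {"header": {"msg_type": 1808}}},
]

by_sst = find(events, {"session": "00231", "set": "0009", "trial": "0001"})
by_day = find(events, {"date": "20161122"})
by_msg = find(events, {"raw.header.msg_type": 1808})
print(len(by_sst), len(by_day), len(by_msg))
```

With a real MongoDB collection and the pymongo driver, the same filter dictionaries could be passed to `collection.find(...)` instead of the in-memory `find` used here.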
37. Numbers and Time
Version 1
We extracted and imported into the db the minimal set of information, data and metadata needed to perform our task and answer the scientific question.
Information filtering
Number of objects
Estimated: between 10 and 20 million
Time
Estimated completion time: 6 months
38. Back to MDF
Analyzing logs from the first pass...
MDF Design Requirements:
● Before: metadata in database, data in .mat files
● After: metadata and data only in database (MongoDB)
39. Numbers and Time
Version 2
We extracted and imported into the db the minimal set of information, data and metadata needed to perform our task and answer the scientific question.
Information filtering
Number of objects
Estimated: between 10 and 20 million
Total count: ~34 million
Time
Estimated completion time: 6 months
After first round of optimization: 2 weeks
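A quick back-of-the-envelope check puts these numbers in perspective. The sketch assumes "6 months" means 26 weeks and that the two-week figure covers all ~34 million objects. The original estimate may have been made against the earlier 10 to 20 million object guess, so the comparison is only indicative.

```python
SECONDS_PER_DAY = 86_400


def objects_per_second(n_objects: int, days: float) -> float:
    """Average import rate needed to finish n_objects in the given days."""
    return n_objects / (days * SECONDS_PER_DAY)


total_objects = 34_000_000      # "~34 million" from the slide
estimated_days = 182            # assumption: "6 months" taken as 26 weeks
optimized_days = 14             # "2 weeks" from the slide

speedup = estimated_days / optimized_days
rate_after = objects_per_second(total_objects, optimized_days)
print(f"wall-clock speedup: about {speedup:.0f}x")
print(f"average rate after optimization: about {rate_after:.0f} objects/s")
```

Under these assumptions the import time drops by roughly 13x, and two weeks for ~34 million objects averages out to about 28 objects per second.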
41. Challenges
● Number of objects, amount of information
● Information filtering
● MDF saving metadata in database and data in file
● Single import process, one object at a time
● Validation of the information: how can I be reasonably sure that the information and data imported are correct?
● Hardware
● Platform: Matlab or Python
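Two of these challenges, importing one object at a time and validating what was imported, lend themselves to a small sketch: write events in batches, then compare per-file counts between the source and what was written. This illustrates the idea rather than the import code used in the lab. The function names, the `write_batch` callback and the sample file names are hypothetical.

```python
from collections import Counter
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def import_events(events: Iterable[dict],
                  write_batch: Callable[[List[dict]], object],
                  size: int = 1000) -> Counter:
    """Write events batch by batch; return how many were written per raw file."""
    written: Counter = Counter()
    for batch in batched(events, size):
        write_batch(batch)
        written.update(ev["raw_file"] for ev in batch)
    return written


def mismatched_files(expected: Counter, written: Counter) -> List[str]:
    """Raw files whose written count differs from the expected count."""
    return sorted(f for f in expected.keys() | written.keys()
                  if expected[f] != written[f])


# Hypothetical source events: three from one file, two from another.
source_events = (
    [{"raw_file": "a.bin", "sequenceno": i} for i in range(3)]
    + [{"raw_file": "b.bin", "sequenceno": i} for i in range(2)]
)
store: List[dict] = []          # stands in for the database
batch_sizes: List[int] = []


def write_batch(batch: List[dict]) -> None:
    batch_sizes.append(len(batch))
    store.extend(batch)


expected = Counter(ev["raw_file"] for ev in source_events)
written = import_events(source_events, write_batch, size=2)
print(batch_sizes)
print(mismatched_files(expected, written))
```

With pymongo, `write_batch` could be a collection's `insert_many` method. The count comparison is only a first-level check, since it catches missing or duplicated events but not wrong field values.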
42. Hardware
First iteration:
● Virtual machine on Xen hypervisor
● 4 cores
● 16 GB RAM
● Virtual OS disk
● Database drive: NFS mounted from server RAID
● Raw files: lab file server mounted through SMB/CIFS
● MDF: metadata in db, data in Matlab files
Current configuration:
● Dedicated server
● 16 cores
● 32 GB RAM
● OS drive: mechanical
● Database drive: SSD
● Raw files: lab file server mounted through SMB/CIFS
● MDF: data and metadata in db
Next iteration:
● Distributed system
● Parallel processes
● MDF: v2.x
43. Conclusions
● Big data approach was and is a successful strategy in managing lab data
● We were able to manage big data using MDF
Minimal changes were required
● Query capabilities are priceless
Allowed faster access to data and more compact code
● Logging is invaluable
44. Future
● Scaling: more data, computing power, parallel processing
● MDF v2.x: more storage options, other languages, integration with other systems
● Batch creation of MDF objects
● Explore new queries and expand query functionality.
SQL-like language:
select data.waveform from sensory where metadata.subject = "sbj_01" and data.time > 10s
● Automation: import, validation, processing
● Quantitative analysis of logs: processing time, statistics, errors
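To give a feel for what the SQL-like language on this slide could involve, here is a small Python sketch that translates a query of that shape into a collection name, a projection and a MongoDB-style filter. It is not part of MDF. The grammar is an assumption that covers only `select ... from ... where ...` with conditions joined by `and`, and it assumes times are stored in seconds, so a trailing `s` on a number is dropped.

```python
import re
from typing import Any, Dict, Tuple

QUERY_RE = re.compile(
    r"^\s*select\s+(?P<fields>.+?)\s+from\s+(?P<source>\w+)"
    r"(?:\s+where\s+(?P<where>.+?))?\s*$",
    re.IGNORECASE,
)
COND_RE = re.compile(r"^\s*(?P<field>[\w.]+)\s*(?P<op>>=|<=|=|>|<)\s*(?P<value>.+?)\s*$")
OPERATORS = {">": "$gt", "<": "$lt", ">=": "$gte", "<=": "$lte"}


def parse_value(text: str) -> Any:
    """Quoted text becomes a string; a number (optionally ending in 's') a float."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return text[1:-1]
    number = re.fullmatch(r"(-?\d+(?:\.\d+)?)s?", text)
    if number:
        return float(number.group(1))
    raise ValueError(f"cannot parse value: {text!r}")


def translate(query: str) -> Tuple[str, Dict[str, int], Dict[str, Any]]:
    """Return (collection name, projection, filter) for a simple query."""
    parsed = QUERY_RE.match(query)
    if not parsed:
        raise ValueError("unsupported query")
    projection = {name.strip(): 1 for name in parsed.group("fields").split(",")}
    flt: Dict[str, Any] = {}
    if parsed.group("where"):
        # Limitation: string values containing " and " are not supported,
        # and a field used in two conditions keeps only the last one.
        for cond in re.split(r"\s+and\s+", parsed.group("where"), flags=re.IGNORECASE):
            match = COND_RE.match(cond)
            if not match:
                raise ValueError(f"unsupported condition: {cond!r}")
            value = parse_value(match.group("value"))
            op = match.group("op")
            flt[match.group("field")] = value if op == "=" else {OPERATORS[op]: value}
    return parsed.group("source"), projection, flt


example = 'select data.waveform from sensory where metadata.subject = "sbj_01" and data.time > 10s'
source, projection, flt = translate(example)
print(source)
print(projection)
print(flt)
```

The operator names in the filter (`$gt`, `$lt`, `$gte`, `$lte`) are standard MongoDB comparison operators, so the output could be handed to a MongoDB query. A production version would need a proper parser rather than regular expressions.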
45. Thank you
Thanks to:
● Rob Gaunt
● Lee Fisher
● Ameya Nanivadekar
● Tyler Simpson
● All my colleagues at RNEL
Acknowledgements
Research was sponsored by the U.S. Army Research Office and the Defense Advanced Research Projects Agency (DARPA) and was accomplished under Cooperative Agreement Number W911NF-15-2-0016. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office, Army Research Laboratory, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
MDF
Available at https://bitbucket.org/nitrosx/mdf
Production versions: 1.4 and 1.5. Currently working on v1.6 and v2.0
Questions, suggestions:
● man8@pitt.edu
● max.novelli@pitt.edu
RNEL website:
http://www.rnel.pitt.edu