2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
biosphere_poster
1. Chantal Roth Bioinformatics Department TMRI, San Diego, CA
Acknowledgements: many people have contributed to this project, including: Jason Wu, Dana Alcivare, Hemant
Varma, E Li, Roman Rozenshteyn, Derek Guist, Don Hutchison, Darrell Ricke and Chris Martin
Abstract:Abstract: BioSphere consists of several components: a data warehouse containing integrated data originating from various sources, an application server that executes most of the business logic, and a client for user interactions and data visualizations. The poster zooms into the various components to show how the
architecture, the software design and the datasets it contains.
Design
3-Tier Architecture
• Java Swing and JSP clients
• Weblogic application server
• Oracle 8i database (9i soon)
GoF and J2EE Design Patterns
• FastLaneReader using Data Access Objects
• Servlet/Session Bean as front controller
• Factories and abstract factories for
persistence layer components
• Code generator pattern
• Visitor, singleton, command, adapter,
delegate, observer and many other design
patterns
Software
Code
• Over 280,000 lines of Java code:
– Client: ca 108,000
– Server: ca 172,000
• 75,000 are automatically generated
• 210,000 lines are manually written.
Code Generator
• creates value objects (java beans)
• data access objects (persistence)
• xml objects (xml parsers and creators)
XML
• Meta data definitions (table graph, db definitions)
• Database and server configuration
• User preferences
• Application module definitions
Major Data Sets
• Rice: myriad contigs, traces (sequences), markers,
FPC’s, gene predictions, proprietary gene models,
motifs, rice affy chip, cDNA’s, various blast
analysis
• Cochliobolus, Fusarium, Botrytis: contigs,
traces, affy chips, blast analysis, gene predictions,
motifs, pfam
• Arabidopsis: entire chromosomes, gene models
(TAIR), GARLIC, markers, affy chip
• Drosophila: genbank, P-elements, various blast
analysis, EST’s
• Other genomes: ashbya, neurospora, human,
banana, c.elegans, phytophthora, barley, maize,
potatoe, rye, sorghum, vinca, wheat, aphid,
heliothis, manduca, helicoverpa armigera,
melincognita, silverleaf whitefly, pombe, yeast,
magnaporthe
• Analysis: various blast analysis such as wheat
EST’s, plant repeats, swissprot, genbank, phytoseq
EST’s, various gene predictions, pfam
Hardware
Application Server
• Weblogic Application Server runs on a Sun E220R
with 1.5 GB RAM and 2 cpus
Database
• Oracle Server: Sun E4500 with 8 cpus and 8BGB
RAM
Analysis
• AnalysisClient: 24 analysis demons running on the
titan cluster (Linux)
Database
• Oracle 8i database (soon upgraded to 9i)
• More than 100 tables
• More than ½ Terabyte of data
• Data warehousing approach adapted to
bioinformatics
Data Integration
Data Sources
• Typically LIMS databases are specialized for one particular
data type, such as sequences or expression
• They usually allow a user to enter detailed information about
the sample (such as tissue)
• These systems are normally not linked and do not allow a
user to ask questions across datatypes
ETL
• The conclusions of the experiments, a very small subset of
the LIMS data, need to be integrated in order to allow
sophisticated data mining and visualizations (colored boxes)
• To do this, the results are extracted, transformed and then
loaded into a data warehouse
• Ideally a middleware would be used that handles the
extraction, transformation and loading
Client PC
EventsEvents
Events
Server Services
Request
Handler
Save
Service
Logging
Manager
Login
Component
Utilities
GUI
Print
Service
Undo
Manager
GUI
Export
Service
GUI
Preferences
Manager
GUI
Import
Service
Frameworks/Components
Graphics
Framework
Threading
Framework
Object
Pool
XML
Framework
Graph
Framework
Applications
X
M
L
X
M
L
X
M
L
X
M
L
X
M
L
Pathfinder
Contig
Viewer
Analysis
Client
Feature
Viewer
Chromosome
Viewer
X
M
L
Pathways
X
M
L
BioSphere
Messenger
X
M
L
Statistics
X
M
L
News
Communication
EJB
Service
Servlet
Service
Direct
Request
Service
XML
Communic.
Service
Application
Registry
Event
Manager
Control
Analysis Layer
Results
Processor
Compute
Host
Daemon
Compute
Host
Monitor
Blastdb
Monitor
Application Layer
Chromosome
Retriever
Contig
Assembler
Retrieve
Service
Save
Service
Feature
Service
Pathway
Service
Analysis
Service
Email
Service
User
Service
Request
Response
Communication
Ejb
Request
Handler
Servlet
Request
Handler
Direct
Request
Handler
XML
Request
Handler
Service Layer
XML
Startup
Service
Application Server
Persistence Layer
Schema
Service
DAL
ObjectsDAL
ObjectsDAL
ObjectsDAL
Objects
Loader
createscreates
Database
Handler
Schema
Service
Factory
Verifier
XML
Database
Manager
RMI
SER
MC
Data
Warehouse
Data Storage
flatfiles
Other
Databases
JDBC
Metabolomic
LIMS
Mapping,
Markers
Sequencing
LIMS
Proteomic
LIMS
Expression
LIMS
ETL (Extract, Transform, Load)
Data Integration
BioSphere Data
Warehouse
Client
PC
Server
BioSphere
Results of experiments
Software Framework
BioSphere is not only a bioinformatics platform, but
also a software development framework that allows the
development of any 3-tier software:
Persistence Layer
• It handles the retrieval, updating and storage of any data
type to a database. For instance, it could handle the
access of customer and product data in an e-commerce
system
Client-Server Communication
• The framework handles and communication to a server,
whether the server is at a remote location or even in the
same virtual machine. It can use RMI, object
serialization (servlets), XML and direct calls
Request processing
• For each request that goes to the server, the framework
automatically creates a new thread so that the client
never freezes (even if the processing takes a long time).
The requests are relayed to the corresponding business
logic component
Other frameworks
• It also includes a graphics framework for drawing
objects, zooming, selection, moving objects and graph
layout optimization.
• A code generator creates value objects, persistent
objects and XML objects