Big Analytics
without Big Hassles

Bryan Lewis
Chief Data Scientist
Alex Poliakov
Solutions Architect
Paradigm4’s SciDB

SciDB is an open source, scalable array database, with
native complex math analytics, integrated with R...
Paradigm4’s SciDB

SciDB helps data scientists, bioinformaticians, quants,
analysts, and scientists tackle their toughest ...
Webinar Replay
These slides are from a Paradigm4 webinar held on 11/12/13

You can find this webinar, and additional webin...
Agenda

1. Brief Introduction to SciDB
2. Demos
3. Q & A

© Paradigm4
Developed by Paradigm4

Open-source high-performance database
Data organized in multi-dimensional sparse arrays
Horizontal...
About Paradigm4
Paradigm4 develops & supports SciDB

CTO is MIT database researcher Mike Stonebraker
Force behind many maj...
Developed by Paradigm4

Community edition
• Open Source
• Unrestricted
• Fully scalable

• More math
• Fault tolerance
• S...
SciDB Powers NIH NCBI’s
1000 Genomes Project

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/

Run...
SciDB Builds ARCA NBBO Book

• 186 million quotes for one day

• Runs in about half the time
on a cluster twice as large

...
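
The slides do not show how the NBBO (national best bid and offer) book is built in SciDB. As a rough illustration of the underlying idea only, the sketch below keeps the latest quote per exchange and reports the highest bid and lowest ask after each update. The quote layout, the exchange codes, the prices, and the `build_nbbo` name are all hypothetical.

```python
from typing import Dict, List, Tuple

# A quote is (usec, exchange, bid, ask); this layout is an assumption for illustration.
Quote = Tuple[int, str, float, float]


def build_nbbo(quotes: List[Quote]) -> List[Tuple[int, float, float]]:
    """Return (usec, best_bid, best_ask) after each quote, in time order."""
    latest: Dict[str, Tuple[float, float]] = {}  # most recent quote per exchange
    book = []
    for usec, exchange, bid, ask in sorted(quotes):
        latest[exchange] = (bid, ask)
        best_bid = max(b for b, _ in latest.values())
        best_ask = min(a for _, a in latest.values())
        book.append((usec, best_bid, best_ask))
    return book


# Hypothetical quotes from two exchanges, not data from the webinar.
quotes = [
    (1, "P", 450.60, 450.75),
    (2, "Q", 450.65, 450.80),
    (3, "P", 450.55, 450.70),
]
nbbo = build_nbbo(quotes)
print(nbbo)
```

This serial loop recomputes the best prices on every update, which is fine for a toy example; the point of the slide is that the same computation scales across a cluster.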
SciDB Powers Recommendation Engines

• Fast truncated SVD

• Minutes per singular value
on a four node Linux cluster

© Pa...
SciDB System Architecture

“Shared Nothing” cluster of commodity hardware nodes

Interconnected with standard ethernet and...
SciDB Arrays
Each cell in a SciDB array consists of a fixed number
of typed attributes (variables).

Here is an example ce...
SciDB Arrays
A 1-D array looks like a spreadsheet
This picture shows five cells, each with four attributes
Attributes
Volu...
SciDB Arrays
The same data “redimensioned” into a 2D array

.

Dimension Symbol
“AAPL”
Volume

36013008713

450.61
450.73
...
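
To make the idea of redimensioning concrete, here is a minimal Python sketch that re-keys the five example cells from the 1-D array by (symbol, usec), leaving price and volume as attributes. A dictionary is only a stand-in for illustration; it says nothing about how SciDB stores arrays, and the `redimension` function below is a hypothetical helper, not the SciDB operator.

```python
# The five cells from the 1-D example: (i, price, volume, symbol, usec).
cells = [
    (1, 450.61, 150, "AAPL", 36013008713),
    (2, 450.73, 200, "AAPL", 36013008915),
    (3, 450.84, 10, "AAPL", 36013208113),
    (4, 36.57, 75, "MSFT", 36019008713),
    (5, 36.20, 100, "MSFT", 36003200113),
]


def redimension(rows):
    """Re-key each cell by (symbol, usec); price and volume stay as attributes."""
    array_2d = {}
    for _, price, volume, symbol, usec in rows:
        key = (symbol, usec)
        if key in array_2d:
            # Two cells landing on the same coordinates would need a collision rule.
            raise ValueError(f"cell collision at {key}")
        array_2d[key] = {"price": price, "volume": volume}
    return array_2d


array_2d = redimension(cells)
# Look up one cell directly by its coordinates.
print(array_2d[("MSFT", 36019008713)])
```

Only five of the possible (symbol, usec) coordinates are occupied, which is the sense in which the resulting 2-D array is sparse.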
Access multi-dimensional
subsets in constant time
Products

Customers

(price, location, age, gender, …)

Vendors
other...
High Performance Windowing

Fast, one-pass, running stats over arbitrary time or data windows
Even when time intervals cro...
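
The slide mentions a simple running median outlier filter without giving its definition. One plausible reading, sketched below, flags a point when it lies far from the median of a centered window, measured in units of the window's median absolute deviation (MAD). The window size, the threshold, the MAD rule, and the sample prices are assumptions for illustration, and this plain Python loop is not the one-pass implementation the slide refers to.

```python
from statistics import median


def running_median_filter(values, window=5, threshold=3.0):
    """Flag points that lie far from the median of a centered window.

    A point is flagged when its distance from the window median exceeds
    `threshold` times the window's median absolute deviation (MAD).
    """
    half = window // 2
    flags = []
    for k, v in enumerate(values):
        lo, hi = max(0, k - half), min(len(values), k + half + 1)
        win = values[lo:hi]  # the window shrinks at the edges of the series
        med = median(win)
        mad = median(abs(x - med) for x in win)
        if mad > 0:
            flags.append(abs(v - med) > threshold * mad)
        else:
            flags.append(v != med)  # degenerate window: any deviation counts
    return flags


# Hypothetical price series with one obvious bad tick.
prices = [450.61, 450.73, 450.84, 999.0, 450.70, 450.66, 450.80]
flags = running_median_filter(prices)
print(flags)
```

A median-based rule is used here because a single bad tick barely moves the median, whereas it would drag a running mean toward itself.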
SciDB Arrays
Arrays can be joined
along dimensions or subsets of dimensions

Values can be aggregated
along dimensions and...
• Work in familiar IDE
• Data persisted in SciDB
• Offload large computations to cluster
Demos
Quantitative Finance example
• Regularized correlation
• Relevance network graph

Remote Sensing application

Surviv...
Live demos
Two modes for using R & Python
SciDB-R/Py

R/Py-exec

(global)

(local)

Program SciDB naturally
from R or Python

Invoke ...
Rationale

Provide a simple, robust way to run R or Python from
inside SciDB queries, in parallel

Extend ...
Really simple example
Instance-parallel Monte Carlo estimate of π
avg(
r_exec(
build(<z:double>[i=1:1000,1,0],0),

'expr=x...
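
For readers who do not use R, the same estimate can be sketched in plain Python. Each of 1000 chunks draws 1000 random points in the unit square and counts those inside the quarter circle; dividing the count by 250 is the same as multiplying the fraction by 4. This version runs serially on one machine, so it shows the arithmetic only, not the instance-parallel execution. The function names and the seed are assumptions for illustration.

```python
import random


def pi_estimate_one_chunk(rng, n=1000):
    """Mirror the per-chunk R expression: count points inside the quarter circle.

    count / (n / 4) equals 4 * count / n, which is why the slide divides by 250.
    """
    inside = sum(1 for _ in range(n) if rng.random() ** 2 + rng.random() ** 2 < 1)
    return inside / (n / 4)


def pi_estimate(chunks=1000, seed=0):
    """Average the per-chunk estimates, as avg() does over the r_exec output."""
    rng = random.Random(seed)
    return sum(pi_estimate_one_chunk(rng) for _ in range(chunks)) / chunks


estimate = pi_estimate()
print(estimate)
```

With one million points in total, the standard error of the estimate is roughly 0.0016, so agreement with π to two or three decimal places is what one should expect.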
Big data bootstrap example
Consider a matrix named "events" with 8 columns:
Race
Age
Group
Gender

(categorical)
(numeric)...
Big data bootstrap example
Randomly partition rows of the events matrix into blocks of
at most 1000 rows (the "bag" part o...
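
The partitioning step can be sketched outside SciDB as follows: shuffle the row indices, then cut the shuffled order into consecutive blocks of at most 1000 rows. The 10,000-row size matches the `k=0:9999` range in the slide's query; the `random_blocks` name and the seed are hypothetical, and this is an illustration of the idea rather than a translation of the query.

```python
import random


def random_blocks(n_rows, block_size=1000, seed=0):
    """Shuffle row indices, then cut the shuffled order into consecutive blocks."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return [order[start:start + block_size] for start in range(0, n_rows, block_size)]


blocks = random_blocks(10_000)
print(len(blocks), len(blocks[0]))
```

Every row lands in exactly one block, so each block is a disjoint random subsample on which the bootstrap can then be run independently.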
Big data bootstrap example
store(redimension(
apply(
r_exec(P,
"expr=
require(survival);
D <- as.data.frame(matrix(val,nco...
Big data bootstrap result

Group 2 exhibits
significantly lower
relative risk of an
event than Group
1 in this example.

p...
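
The plot on this slide draws each coefficient on the relative-risk scale with a 95% interval of plus or minus 1.96 standard errors. A small Python sketch of that arithmetic is shown below. The coefficient and standard error used here are hypothetical placeholders, not results from the webinar.

```python
import math


def relative_risk_interval(coef, se, z=1.96):
    """Return (lower, estimate, upper) on the relative-risk (exp) scale."""
    return (math.exp(coef - z * se), math.exp(coef), math.exp(coef + z * se))


# Hypothetical coefficient and standard error, not values from the webinar.
low, est, high = relative_risk_interval(coef=-0.5, se=0.1)
print(low, est, high)
```

When the whole interval lies below 1, as it does for these placeholder inputs, the group has a significantly lower relative risk than the reference group at roughly the 5% level.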
Take Away

In-database, scalable, complex math
Less coding, more analysis
Transparent scale-up & speed-up
Interactive expl...
Questions?
Tell us about your application
• info@paradigm4.com

Try our Quick Start
• scidb.org/forum
• Download a VM or E...
Big Analytics Without Big Hassles

Complex analytics should work as nimbly on extremely large data sets as on small ones. You don’t want to think about whether your data fits in-memory, about parallelism, or formatting data for math packages. You’d like to use your favorite analytical language and have it transparently scale up to Big Data volumes.

Paradigm4 presents a webinar about SciDB—the massively scalable, open source, array database with native complex analytics, integrated with R and Python.

Details:

Presenter: Bryan Lewis, Chief Data Scientist, Paradigm4
Day/Time: Tuesday November 12th, 2013 at 1pm EST


Learn how SciDB enables you to:

-Explore rich data sets interactively
-Do complex math in-database, without being constrained by memory limitations
-Perform multi-dimensional windowing, filtering, and aggregation
-Offload large computations to a commodity hardware cluster—on-premise or in a cloud
-Use R and Python to analyze SciDB arrays as if they were R or Python objects.
-Share data among users, with multi-user data integrity guarantees and version control
Webinar Agenda:

-Introduction to SciDB
-Demo
-Live Q&A

  • If you are building a PoC, we can help you get off on the right foot. If you are building an enterprise app and betting the business on your software, you will want to be supported by the people who wrote the code. The forums can help you get started, but for mission-critical support consider P4 support; no one knows the code, or how to use it, better than P4. Bullets #2 and #3: you can download these today at  ..
  • There are a lot of recent databases that this picture applies to. We say "instance" instead of "node" because many instances can run on one physical node.
  • Big Analytics Without Big Hassles

    1. 1. Big Analytics without Big Hassles Bryan Lewis Chief Data Scientist Alex Poliakov Solutions Architect
    2. 2. Paradigm4’s SciDB SciDB is an open source, scalable array database, with native complex math analytics, integrated with R & python © Paradigm4 Inc. 2
    3. 3. Paradigm4’s SciDB SciDB helps data scientists, bioinformaticians, quants, analysts, and scientists tackle their toughest “Big Data” management and complex analytics challenges. © Paradigm4 Inc. 3
    4. 4. Webinar Replay These slides are from a Paradigm4 webinar held on 11/12/13 You can find this webinar, and additional webinars, at: http://www.paradigm4.com/video/ www.paradigm4.com © Paradigm4 Inc. 4
    5. 5. Agenda 1. Brief Introduction to SciDB 2. Demos 3. Q & A © Paradigm4 5
    6. 6. Developed by Paradigm4 Open-source high-performance database Data organized in multi-dimensional sparse arrays Horizontally scalable Excels at parallel linear algebra ACID, data replication, versioned data © Paradigm4 6
    7. 7. About Paradigm4 Paradigm4 develops & supports SciDB CTO is MIT database researcher Mike Stonebraker Force behind many major advances in commercial database products (Postgres, Illustra, Streambase, Vertica, VoltDB, …) Commercial applications Computational Genomics Imaging Quantitative Finance E-commerce Industrial Analytics Internet of Things © Paradigm4 7
    8. 8. Developed by Paradigm4 Community edition • Open Source • Unrestricted • Fully scalable Enterprise edition • More math • Fault tolerance • System management tools © Paradigm4 8
    9. 9. SciDB Powers NIH NCBI’s 1000 Genomes Project http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/ Running 24 x 7 since Fall 2012 © Paradigm4 9
    10. 10. SciDB Builds ARCA NBBO Book • 186 million quotes for one day • 80 seconds on a 32-instance cluster • Runs in about half the time on a cluster twice as large © Paradigm4 10
    11. 11. SciDB Powers Recommendation Engines • Fast truncated SVD • Sparse 50M x 50M matrix, 4 billion nonzero values • Minutes per singular value on a four node Linux cluster © Paradigm4 11
    12. 12. SciDB System Architecture “Shared Nothing” cluster of commodity hardware nodes Interconnected with standard ethernet and TCP/IP SciDB Client ( iquery, Python, R, C++, C, JDBC ) © Paradigm4 12
    13. 13. SciDB Arrays Each cell in a SciDB array consists of a fixed number of typed attributes (variables). Here is an example cell with four attributes: Price 450.61, Volume 150, Symbol “AAPL”, usec 36013008713 © Paradigm4 13
    14. 14. SciDB Arrays A 1-D array looks like a spreadsheet This picture shows five cells, each with four attributes
        Dimension i | Price  | Volume | Symbol | usec
        1           | 450.61 | 150    | “AAPL” | 36013008713
        2           | 450.73 | 200    | “AAPL” | 36013008915
        3           | 450.84 | 10     | “AAPL” | 36013208113
        4           | 36.57  | 75     | “MSFT” | 36019008713
        5           | 36.20  | 100    | “MSFT” | 36003200113
        © Paradigm4 14
    15. 15. SciDB Arrays The same data “redimensioned” into a 2D array
        Dimension Symbol | Dimension usec | Price  | Volume
        “AAPL”           | 36013008713    | 450.61 | 150
        “AAPL”           | 36013008915    | 450.73 | 200
        “AAPL”           | 36013208113    | 450.84 | 10
        “MSFT”           | 36019008713    | 36.57  | 75
        “MSFT”           | 36003200113    | 36.20  | 100
        © Paradigm4 15
    16. 16. Access multi-dimensional subsets in constant time Products Customers (price, location, age, gender, …) Vendors other dimensions …. Customer [1] © Paradigm4 16
    17. 17. High Performance Windowing Fast, one-pass, running stats over arbitrary time or data windows Even when time intervals cross over internal storage shards Simple running median outlier filter © Paradigm4 17
    18. 18. SciDB Arrays Arrays can be joined along dimensions or subsets of dimensions Values can be aggregated along dimensions and over windows Functions can be applied over values in arrays Linear algebra operations, matrix decompositions, and other interesting operations are defined for matrices and vectors Arrays can be sparse © Paradigm4 18
    19. 19. • Work in familiar IDE • Data persisted in SciDB • Offload large computations to cluster © Paradigm4 19
    20. 20. Demos Quantitative Finance example • Regularized correlation • Relevance network graph Remote Sensing application • NASA MODIS satellite images • Regrid with spatial interpolation • Visualize (multiple resolutions) Survival Analysis on Healthcare Data • Estimate Cox proportional hazards model with the big data bootstrap © Paradigm4 20
    21. 21. Live demos © Paradigm4 21
    22. 22. Two modes for using R & Python SciDB-R/Py (global) Program SciDB naturally from R or Python R/Py-exec (local) Invoke R or Python from within SciDB queries SciDB coordinator R-exec © Paradigm4 22
    23. 23. Rationale Provide a simple, robust way to run R or Python from inside SciDB queries, in parallel Extend SciDB's powerful native analysis capabilities © Paradigm4 23
    24. 24. Really simple example Instance-parallel Monte Carlo estimate of π
        avg(
          r_exec(
            build(<z:double>[i=1:1000,1,0],0),
            'expr=x<-runif(1000);y<-runif(1000);list(sum(x^2+y^2<1)/250)'
          )
        )
        {i} x_avg
        {0} 3.14119
        © Paradigm4 24
    25. 25. Big data bootstrap example Consider a matrix named "events" with 8 columns: ID (numeric), Race (categorical), SES (numeric), Age (numeric), Days_to_event (numeric), Event (binary), Group (categorical), Gender (categorical) Apply the bag of little bootstraps to estimate confidence intervals for coefficients of a Cox proportional hazards survival model. © Paradigm4 25
    26. 26. Big data bootstrap example Randomly partition rows of the events matrix into blocks of at most 1000 rows (the "bag" part of the BLB method).
        store(
          redimension(
            cross_join(events as A,
              redimension(
                apply(project(sort(apply(build(<v:int64>[k=0:9999,1000,0],random()),p,k)),p),m,n),
                <p:int64>[m=0:*,1000,0]) as B,
              A.i, B.m),
            <val:double>[p=0:9999,1000,0,j=0:7,8,0]),
          P)
        © Paradigm4 26
    27. 27. Big data bootstrap example
        store(redimension(
          apply(
            r_exec(P,
              "expr=
               require(survival);
               D <- as.data.frame(matrix(val,ncol=8,byrow=TRUE));
               names(D) <- c('ID','Race','SES','Age','Days','Event','Group','Gender');
               D[,'Race'] <- factor(D[,'Race'], levels=1:13);
               D[,'Group'] <- factor(D[,'Group'], levels=1:2);
               D[,'Gender'] <- factor(D[,'Gender'], levels=1:2);
               ans <- sapply(1:500, function(x) {
                 M <- coxph(Surv(Days, Event) ~ Age + Race + Group + Gender + SES + cluster(ID),
                            data=D[sample(nrow(D),nrow(D),replace=1),]);
                 c(coef(M), sqrt(diag(M[['var']])))});
               list(apply(ans, 1, mean));
              "),
            m, n%32),
          <ans:double null>[m=0:31,32,0], avg(val) as ans),
        coefs)
        © Paradigm4 27
    28. 28. Big data bootstrap result Group 2 exhibits significantly lower relative risk of an event than Group 1 in this example.
        library("scidb")
        cf = scidb("coefs")[c(0,13:15)][]
        se = scidb("coefs")[c(16,29:31)][]
        plot(exp(cf))
        lapply(1:4, function(j) { lines(c(j, j), c(exp(cf[j]-1.96*se[j]), exp(cf[j]+1.96*se[j]))) })
        © Paradigm4 28
    29. 29. Take Away In-database, scalable, complex math Less coding, more analysis Transparent scale-up & speed-up Interactive exploratory analytics Seamless R and Python integration www.paradigm4.com
    30. 30. Questions? Tell us about your application • info@paradigm4.com Try our Quick Start • scidb.org/forum • Download a VM or EC2 AMI www.paradigm4.com © Paradigm4 Inc. 30
