Paradigm4 presents a webinar about SciDB—the massively scalable, open source, array database with native complex analytics, integrated with R and Python.

Details:

Presenter: Bryan Lewis, Chief Data Scientist, Paradigm4

Day/Time: Tuesday, November 12th, 2013 at 1pm EST

Learn how SciDB enables you to:

-Explore rich data sets interactively

-Do complex math in-database, without being constrained by memory limitations

-Perform multi-dimensional windowing, filtering, and aggregation

-Offload large computations to a commodity hardware cluster—on-premise or in a cloud

-Use R and Python to analyze SciDB arrays as if they were R or Python objects

-Share data among users, with multi-user data integrity guarantees and version control

Webinar Agenda:

-Introduction to SciDB

-Demo

-Live Q&A

Published in:
Technology

- 1. Big Analytics without Big Hassles. Bryan Lewis, Chief Data Scientist; Alex Poliakov, Solutions Architect
- 2. Paradigm4’s SciDB: SciDB is an open source, scalable array database with native complex math analytics, integrated with R & Python. © Paradigm4 Inc.
- 3. Paradigm4’s SciDB: SciDB helps data scientists, bioinformaticians, quants, analysts, and scientists tackle their toughest “Big Data” management and complex analytics challenges.
- 4. Webinar Replay: These slides are from a Paradigm4 webinar held on 11/12/13. You can find this webinar, and additional webinars, at http://www.paradigm4.com/video/ www.paradigm4.com
- 5. Agenda: (1) Brief Introduction to SciDB, (2) Demos, (3) Q & A
- 6. Developed by Paradigm4 • Open-source high-performance database • Data organized in multi-dimensional sparse arrays • Horizontally scalable • Excels at parallel linear algebra • ACID, data replication, versioned data
- 7. About Paradigm4: Paradigm4 develops & supports SciDB. CTO is MIT database researcher Mike Stonebraker, the force behind many major advances in commercial database products (Postgres, Illustra, Streambase, Vertica, VoltDB, …). Commercial applications: Computational Genomics, Imaging, Quantitative Finance, E-commerce, Industrial Analytics, Internet of Things
- 8. Developed by Paradigm4. Community edition: • Open Source • Unrestricted • Fully scalable. Enterprise edition: • More math • Fault tolerance • System management tools
- 9. SciDB Powers NIH NCBI’s 1000 Genomes Project: running 24 x 7 since Fall 2012. http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/
- 10. SciDB Builds ARCA NBBO Book • 186 million quotes for one day • 80 seconds on a 32-instance cluster • Runs in about half the time on a cluster twice as large
- 11. SciDB Powers Recommendation Engines • Sparse 50M x 50M matrix, 4 billion nonzero values • Fast truncated SVD • Minutes per singular value on a four node Linux cluster
- 12. SciDB System Architecture: SciDB Client (iquery, Python, R, C++, C, JDBC) connects to a “Shared Nothing” cluster of commodity hardware nodes, interconnected with standard Ethernet and TCP/IP
- 13. SciDB Arrays: Each cell in a SciDB array consists of a fixed number of typed attributes (variables). Here is an example cell with four attributes: Price 450.61, Volume 150, Symbol “AAPL”, usec 36013008713
- 14. SciDB Arrays: A 1-D array looks like a spreadsheet. This picture shows five cells along dimension i, each with four attributes (Price, Volume, Symbol, usec): i=1: 450.61, 150, “AAPL”, 36013008713; i=2: 450.73, 200, “AAPL”, 36013008915; i=3: 450.84, 10, “AAPL”, 36013208113; i=4: 36.57, 75, “MSFT”, 36019008713; i=5: 36.20, 100, “MSFT”, 36003200113
- 15. SciDB Arrays: The same data “redimensioned” into a 2-D array. Dimensions: Symbol (“AAPL”, “MSFT”) and usec; each cell holds the attributes Price and Volume, for example the cell at (“AAPL”, 36013008713) holds Price 450.61 and Volume 150
- 16. Access multi-dimensional subsets in constant time. [Figure: array with dimensions Products, Customers (price, location, age, gender, …), Vendors, other dimensions …; label: Customer [1]]
- 17. High Performance Windowing: Fast, one-pass, running stats over arbitrary time or data windows, even when time intervals cross over internal storage shards. Example: simple running median outlier filter
- 18. SciDB Arrays • Arrays can be sparse • Arrays can be joined along dimensions or subsets of dimensions • Values can be aggregated along dimensions and over windows • Functions can be applied over values in arrays • Linear algebra operations, matrix decompositions, and other interesting operations are defined for matrices and vectors
- 19. • Work in familiar IDE • Data persisted in SciDB • Offload large computations to cluster
- 20. Demos. Quantitative Finance example: • Regularized correlation • Relevance network graph. Remote Sensing application: • NASA MODIS satellite images • Regrid with spatial interpolation • Visualize (multiple resolutions). Survival Analysis on Healthcare Data: • Estimate Cox proportional hazards model with the big data bootstrap
- 21. Live demos
- 22. Two modes for using R & Python. SciDB-R/Py (global): program SciDB naturally from R or Python. R/Py-exec (local): invoke R or Python from within SciDB queries. [Diagram labels: SciDB coordinator, R-exec]
- 23. Rationale: Extend SciDB's powerful native analysis capabilities. Provide a simple, robust way to run R or Python from inside SciDB queries, in parallel
- 24. Really simple example: instance-parallel Monte Carlo estimate of π. Query: avg(r_exec(build(<z:double>[i=1:1000,1,0],0), 'expr=x<-runif(1000);y<-runif(1000);list(sum(x^2+y^2<1)/250)')) Result: {i} x_avg, {0} 3.14119
- 25. Big data bootstrap example: Consider a matrix named "events" with 8 columns: ID (numeric), Race (categorical), SES (numeric), Age (numeric), Days_to_event (numeric), Event (binary), Group (categorical), Gender (categorical). Apply the bag of little bootstraps to estimate confidence intervals for coefficients of a Cox proportional hazards survival model.
- 26. Big data bootstrap example: Randomly partition rows of the events matrix into blocks of at most 1000 rows (the "bag" part of the BLB method). Query: store( redimension( cross_join(events as A, redimension(apply(project(sort(apply( build(<v:int64>[k=0:9999,1000,0],random()),p,k)),p),m,n), <p:int64> [m=0:*,1000,0]) as B, A.i, B.m), <val:double>[p=0:9999,1000,0,j=0:7,8,0]), P)
- 27. Big data bootstrap example. Query: store(redimension( apply( r_exec(P, "expr= require(survival); D <- as.data.frame(matrix(val,ncol=8,byrow=TRUE)); names(D) <-c ('ID','Race','SES','Age','Days','Event','Group','Gender'); D[,'Race'] <- factor(D[,'Race'], levels=1:13); D[,'Group'] <- factor(D[,'Group'], levels=1:2); D[,'Gender'] <- factor(D[,'Gender'], levels=1:2); ans <- sapply(1:500, function(x) { M <- coxph(Surv(Days, Event) ~ Age + Race + Group + Gender + SES + cluster(ID), data=D[sample(nrow(D),nrow(D),replace=1),]); c(coef(M), sqrt(diag(M[['var']])))}); list(apply(ans, 1, mean)); '), m, n%32), <ans:double null>[m=0:31,32,0], avg(val) as ans), coefs)
- 28. Big data bootstrap result: Group 2 exhibits significantly lower relative risk of an event than Group 1 in this example. R code: library("scidb"); cf = scidb("coefs")[c(0,13:15)][]; se = scidb("coefs")[c(16,29:31)][]; plot(exp(cf)); lapply(1:4, function(j){lines(c(j, j), c(exp(cf[j]-1.96*se[j]), exp(cf[j]+1.96*se[j])))})
- 29. Take Away • In-database, scalable, complex math • Less coding, more analysis • Transparent scale-up & speed-up • Interactive exploratory analytics • Seamless R and Python integration. www.paradigm4.com
- 30. Questions? Tell us about your application • info@paradigm4.com. Try our Quick Start • scidb.org/forum • Download a VM or EC2 AMI. www.paradigm4.com
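
Slides 14 and 15 describe turning a 1-D array of cells into a 2-D array by promoting two attributes (Symbol and usec) to dimensions. The idea can be illustrated in plain Python with dictionaries. This is only a sketch of the concept, not SciDB's implementation; the function name `redimension` and the lowercase field names are chosen here for illustration, and the cell values are the five rows shown on slide 14.

```python
# Cells of the 1-D array from slide 14, keyed by dimension i.
cells_1d = {
    1: {"price": 450.61, "volume": 150, "symbol": "AAPL", "usec": 36013008713},
    2: {"price": 450.73, "volume": 200, "symbol": "AAPL", "usec": 36013008915},
    3: {"price": 450.84, "volume": 10, "symbol": "AAPL", "usec": 36013208113},
    4: {"price": 36.57, "volume": 75, "symbol": "MSFT", "usec": 36019008713},
    5: {"price": 36.20, "volume": 100, "symbol": "MSFT", "usec": 36003200113},
}


def redimension(cells, dim_names, attr_names):
    """Re-key cells by the attributes named in dim_names (hypothetical helper).

    Returns a mapping {coordinates: remaining attributes}. Collisions are
    rejected here; a real array database needs a policy for them.
    """
    out = {}
    for attrs in cells.values():
        coords = tuple(attrs[d] for d in dim_names)
        if coords in out:
            raise ValueError(f"two cells map to the same coordinates {coords}")
        out[coords] = {a: attrs[a] for a in attr_names}
    return out


# 2-D array with dimensions (symbol, usec) and attributes price, volume.
cells_2d = redimension(cells_1d, ("symbol", "usec"), ("price", "volume"))
print(cells_2d[("MSFT", 36019008713)])  # {'price': 36.57, 'volume': 75}
```

Once the data is keyed by (symbol, usec), a lookup for one symbol at one timestamp is a direct coordinate access rather than a scan over rows.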

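The query on slide 24 builds 1000 single-cell chunks, draws 1000 random points in the unit square for each, and averages the per-chunk estimates of π. The same arithmetic can be written serially in Python. This is a minimal sketch that assumes nothing about SciDB or `r_exec`; the function names and the seed are illustrative, and it runs on one process rather than in parallel across instances.

```python
import random
import statistics


def estimate_pi_one_cell(rng, n_points=1000):
    """One cell's estimate: share of random points inside the quarter circle, times 4.

    With n_points=1000 this equals hits / 250, as in the slide's R expression.
    """
    hits = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            hits += 1
    return 4 * hits / n_points


def estimate_pi(n_cells=1000, n_points=1000, seed=0):
    """Average the per-cell estimates, mirroring avg(...) over the 1000 cells."""
    rng = random.Random(seed)
    per_cell = [estimate_pi_one_cell(rng, n_points) for _ in range(n_cells)]
    return statistics.fmean(per_cell)


print(estimate_pi())
```

Because every cell's estimate is independent, the per-cell work can be distributed freely and only the final average needs to be combined, which is what makes the slide's version instance-parallel.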
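
Slides 26 and 27 randomly partition the rows into blocks of at most 1000, bootstrap within each block, and average the per-block results. The structure of that computation can be sketched in Python with a much simpler statistic. The sketch below swaps the Cox proportional hazards fit for a sample mean, takes the standard error as the spread of the resampled means rather than from a fitted model, and uses 200 resamples instead of the slide's 500. Like the slide's R code, it resamples each block at the block's own size, whereas the full bag of little bootstraps method resamples up to the size of the whole data set. All names here are illustrative.

```python
import random
import statistics


def block_bootstrap_mean(values, block_size=1000, n_resamples=200, seed=0):
    """Estimate a mean and a per-block standard error by bootstrapping in blocks."""
    rng = random.Random(seed)

    # Randomly partition the rows into blocks of at most block_size.
    shuffled = list(values)
    rng.shuffle(shuffled)
    blocks = [shuffled[i:i + block_size]
              for i in range(0, len(shuffled), block_size)]

    block_means, block_ses = [], []
    for block in blocks:
        # Resample the block with replacement and recompute the statistic.
        estimates = [statistics.fmean(rng.choices(block, k=len(block)))
                     for _ in range(n_resamples)]
        block_means.append(statistics.fmean(estimates))
        block_ses.append(statistics.stdev(estimates))

    # Combine the per-block summaries by averaging, as the final avg() does.
    return statistics.fmean(block_means), statistics.fmean(block_ses)


# Synthetic data for illustration only: 5000 draws from a normal distribution.
data_rng = random.Random(42)
data = [data_rng.gauss(10.0, 2.0) for _ in range(5000)]
mean, se = block_bootstrap_mean(data)
print(mean, se)
```

Each block is processed independently, so the per-block loop is the part that a cluster could run in parallel; only the small per-block summaries are combined at the end.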