User 2013-oracle-big-data-analytics-1971985


  1. Big Data Analytics – Scaling R to Enterprise Data. useR! 2013, Albacete, Spain (#useR2013). Luis Campos, Big Data Solutions Lead, Oracle EMEA (@luigicampos); Mark Hornick, Director, Oracle Database Advanced Analytics (@MarkHornick). Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
  3. The girl with all the questions! “The real innovation here is that we can ask questions and get the answer back before we have forgotten why we asked the question in the first place.” (Hilary Mason, Chief Scientist at Bit.ly and member of NYC Mayor Bloomberg’s Technology and Innovation Advisory Council)
  4. What are analysts and industry groups saying? Nexus of Forces, Platform 3.0, Four Pillars.
  5. New Information Challenges: Data Explosion, Combinatory Explosion, Dimension Explosion. Chart: A Decade of Digital Universe Growth, Storage in Exabytes (Source: IDC’s Digital Universe Study, June 2011).
  6. Big Data Solution = Data + Analytics + Tools
      • DATA: Any Data, Any Source
      • ANALYTICS: Out-of-the-box Analytics, New Models
      • TOOLS: Self Service Data Discovery; On Premise, On Cloud, On Mobile
      Source: McKinsey study “Big data: What’s your plan?” (March 2013) http://www.mckinsey.com/insights/business_technology/big_data_whats_your_plan
  7. Oracle Complete Business Analytics Solution: Big Data Appliance, Big Data Connectors, NoSQL DB; Oracle Advanced Analytics (Data Mining, Oracle R Enterprise), Spatial and Graph, Real Time Decisions (RTD); OBIEE, Endeca, Collective Intellect (CI). On Premise, Oracle Cloud, On Mobile.
  8. Apply Advanced Analytics on All Data (Hadoop HDFS data and relational data). Visualise it with any BI tool.
  9. Oracle R Advantages
      1. Keep the R tools
      2. Keep the data where it sits (Relational or HDFS)
      3. Keep the SQL-based BI tools
      4. Scale to LARGE data sets
      Diagram: Development (R workspace console), Production (Oracle statistics engine; function push-down of data transformation and statistics), Consumption (OBIEE, Web Services).
  10. Oracle’s Advanced Analytics Strategic Offerings. Deliver enterprise-level advanced analytics in the Database
      • Oracle in-Database Data Mining algorithms
        – Access through the free GUI in SQL Developer, or programmatically from SQL, PL/SQL, R or Java
        – Predictive model APIs for Oracle R Enterprise
        – Exadata architecture advantages for up to 5x improvement with Smart Scan
      • Oracle R Distribution
        – Free download, pre-installed on Oracle Big Data Appliance, bundled with Oracle Linux
        – Enhanced linear algebra performance: Intel’s Math Kernel Library, AMD’s Core Math Library (Windows and Linux), SUN Solaris and IBM AIX
        – Enterprise support for customers of Oracle Advanced Analytics, Big Data Appliance, and Oracle Linux
  11. Oracle’s Advanced Analytics Strategic Offerings. Deliver enterprise-level R in the Database or Hadoop
      • Oracle R Enterprise
        – Transparent access to database-resident data from R
        – Embedded R script execution through database-managed R engines
        – Statistics engine
        – Enhanced support for high-speed Exadata scoring
      • Oracle R Connector for Hadoop [ORCH] (part of Oracle Big Data Connectors)
        – R interface to Oracle Hadoop Cluster on BDA and non-Oracle Hadoop clusters
        – Access and manipulate data in HDFS, database, and file system
        – Write MapReduce functions using R and execute through a natural R interface
        – Predictive models with execution in-cluster against Hadoop-stored data
  12. Oracle R Components: component layout (diagram). Hosts shown: Analyst Laptop, Oracle Database (Exadata), Big Data Appliance. Components shown: Oracle R Distribution, Oracle R Enterprise Server Components, Oracle R Enterprise Client Packages, Oracle R Connector for Hadoop, Oracle R Connector for Hadoop Client. Diagram note: “Optional with ORCH”.
  13. Knowledge Exploitation Process: typical stages in a Big Data project. Stages in the cycle, with the Data Scientist at the centre: Business Understanding, Data Selection, Discovery, Data Preparation, Model Building, Evaluation, Deployment.
  14. Data Loading with Oracle R Enterprise

      R> library(ORE)
      R> df <- data.frame(A=1:26, B=letters[1:26])
      R> dim(df)
      [1] 26 2
      R> class(df)
      [1] "data.frame"
      R> ore.create(df, table="DF_TABLE")
      R> ore.ls()
      [1] "DF_TABLE"
      R> class(DF_TABLE)
      [1] "ore.frame"
      attr(,"package")
      [1] "OREbase"
      R> dim(DF_TABLE)
      [1] 26 2
  15. Discovery with Oracle R in-DB and HDFS

      library(ORE)
      ore.ls()            # list tables in DB
      class(MY_TABLE)     # ore.frame
      dim(MY_TABLE)       # overloaded R functions
      head(MY_TABLE)
      sample(MY_TABLE)
      summary(MY_TABLE)

      library(ORCH)
      hdfs.ls()
      hdfs.dim("myHDFSdata")
      hdfs.head("myHDFSdata")
      hdfs.sample("myHDFSdata")
      hdfs.toHive("myHDFSdata", tablename="my_hive_data")
      summary(my_hive_data)
  16. Data Prep with Oracle R in-DB and HDFS

      library(ORE)   # or: library(ORCH)

      # join
      merge(MY_TABLE1, MY_TABLE2, by.x="x1", by.y="x2")

      # project columns
      df <- MY_TABLE[,c("X","Y","Z")]

      # filter rows (the filter must use a projected column; X, Y and Z are the only ones kept)
      df <- df[df$Z<=4.3 | df$X=="B",1:3]

      # binning
      IRIS_TAB <- ore.push(iris[1:4])
      IRIS_TAB$PetalBins = ifelse(IRIS_TAB$Petal.Length < 2.0, "SMALL PETALS",
                           ifelse(IRIS_TAB$Petal.Length < 4.0, "MEDIUM PETALS",
                                  "LARGE PETALS"))
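
The data-preparation idioms on this slide are ordinary R syntax that ORE overloads for database tables, so they can be rehearsed on in-memory data frames first. The following is a minimal local sketch in base R only; the `customers` and `orders` tables and their columns are hypothetical, and nothing here calls ORE or ORCH.

```r
# Local, base-R analogue of the join / project / filter / binning steps.
# `customers` and `orders` are hypothetical toy tables, not from the talk.
customers <- data.frame(x1 = 1:3, name = c("a", "b", "c"),
                        stringsAsFactors = FALSE)
orders <- data.frame(x2 = c(1L, 1L, 3L), amount = c(2.5, 6.0, 4.1))

# join on differently named key columns
joined <- merge(customers, orders, by.x = "x1", by.y = "x2")

# project columns, then filter rows
proj <- joined[, c("x1", "name", "amount")]
kept <- proj[proj$amount <= 4.3 | proj$name == "b", ]

# binning: nested ifelse, as on the slide, on the built-in iris data
petal_bins <- ifelse(iris$Petal.Length < 2.0, "SMALL PETALS",
              ifelse(iris$Petal.Length < 4.0, "MEDIUM PETALS",
                     "LARGE PETALS"))

# cut() expresses the same bins with explicit break points
petal_bins_cut <- cut(iris$Petal.Length,
                      breaks = c(-Inf, 2.0, 4.0, Inf),
                      labels = c("SMALL PETALS", "MEDIUM PETALS",
                                 "LARGE PETALS"),
                      right = FALSE)
```

The `cut()` form is often easier to maintain than nested `ifelse()` once there are more than three bins; whether it is pushed down to the database by ORE is not stated in the slides.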
  17. “Densifying” data: custom MapReduce jobs. Count occurrences of hash tags in tweets per customer for select tags.

      mapHashTags <- function(k, v) {
        x <- strsplit(v$text, " ")
        # x is a list with one character vector per tweet; drop empty tokens in each
        x <- lapply(x, function(words) words[words != ''])
        importantTags <- tolower(importantTags)
        for (twt in seq_along(x)) {
          for (tag in x[[twt]]) {
            if (substr(tag, 1, 1) == "#") {
              tagL <- tolower(tag)
              if (tagL %in% importantTags) {
                orch.keyval(v[twt, "screenName"], tagL)
              }
            }
          }
        }
      }

      reduceHashTags <- function(k, vals) {
        # k = screenName, vals = vector(tags)
        importantTags <- tolower(importantTags)
        vals <- factor(vals$val, levels=importantTags)
        x <- as.data.frame(t(as.matrix(table(vals))))
        # k = screenName, x = df(importantTags as cols) with counts
        orch.keyval(k, x)
      }
  18. ORCH: Create your own MapReduce jobs. Count occurrences of hash tags in tweets per customer for select tags.

      importantTags <- c("#bigdata","#database","#oracle","#sql")
      tag.summary <- hadoop.exec(tweets.id,
        mapper=mapHashTags,
        reducer=reduceHashTags,
        export=orch.export(importantTags=importantTags),
        config=new("mapred.config",
          job.name = "TwitterScreenNameHashTags",
          reduce.tasks = 5,
          map.output = data.frame(key='a', val='a'),
          reduce.output = data.frame(key='a', bigdata=0, database=0, oracle=0, sql=0)))

      > hdfs.get(tag.summary)
                   key bigdata database oracle sql
      1 twitter.user.1       4        7     37  91
      2 twitter.user.2      15       19      1  32
      3 twitter.user.3     104       57      8   0
      4 twitter.user.4       0       64    549   0
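
The map, shuffle and reduce phases of the hash-tag job can be imitated on a laptop to check the logic before submitting anything to a cluster. The sketch below is a local simulation in base R, not the ORCH API: `map_hash_tags`, `reduce_hash_tags` and the toy `tweets` data frame are hypothetical names and data introduced for illustration, and `split()` stands in for Hadoop's grouping by key.

```r
# Toy input: hypothetical tweets, not data from the talk.
tweets <- data.frame(
  screenName = c("user.a", "user.a", "user.b"),
  text = c("Loving #Oracle and #SQL", "#oracle again", "#BigData with #R"),
  stringsAsFactors = FALSE
)
importantTags <- c("#bigdata", "#database", "#oracle", "#sql")

# Map step: emit one (screenName, tag) pair per important hash tag.
map_hash_tags <- function(v, tags) {
  words <- strsplit(v$text, " ", fixed = TRUE)
  pairs <- lapply(seq_along(words), function(i) {
    w <- tolower(words[[i]])
    hit <- w[startsWith(w, "#") & w %in% tags]
    data.frame(key = rep(v$screenName[i], length(hit)), val = hit,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

# Reduce step: for one key, count every important tag (zeros included).
reduce_hash_tags <- function(k, vals, tags) {
  counts <- table(factor(vals, levels = tags))
  data.frame(key = k, t(as.matrix(counts)),
             check.names = FALSE, stringsAsFactors = FALSE)
}

pairs <- map_hash_tags(tweets, importantTags)
by_key <- split(pairs$val, pairs$key)   # stands in for the shuffle phase
tag_summary <- do.call(rbind, lapply(names(by_key), function(k) {
  reduce_hash_tags(k, by_key[[k]], importantTags)
}))
rownames(tag_summary) <- NULL
tag_summary
```

Fixing the factor levels to `importantTags` is what "densifies" the result: every user gets one row with a column for each tag, even when a count is zero.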
  19. Modelling with Oracle R in-DB and HDFS

      # Clustering with ORE
      X <- ore.push(data.frame(x))
      km.mod1 <- ore.odmKMeans(~., X, num.centers=2, num.bins=5)
      summary(km.mod1)
      rules(km.mod1)
      clusterhists(km.mod1)

      # Regression with ORCH
      mod.lm <- orch.lm(myFormula, myData, nReducers = 2)
      summary(mod.lm)
      pred <- predict.orch.lm(mod.lm, newdata = myData)
      res.pred <- hdfs.get(pred)
      head(res.pred)
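
For comparison, the same two modelling steps look as follows in open-source R on in-memory data. This is only a baseline sketch using the `stats` functions `kmeans()` and `lm()` on the built-in `iris` data; the choice of variables is an assumption made for illustration, and the ORE and ORCH functions on the slide are not called.

```r
# Open-source R baseline for the two modelling steps (no ORE/ORCH involved).
set.seed(42)                      # kmeans starts from random centres
x <- iris[, 1:4]

# Clustering: two centres, mirroring num.centers = 2 on the slide
km <- kmeans(x, centers = 2)
table(km$cluster)

# Regression: fit, summarise, and score the training data
fit <- lm(Petal.Width ~ Petal.Length + Sepal.Length, data = iris)
summary(fit)
pred <- predict(fit, newdata = iris)
head(pred)
```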
  20. In-database performance advantage: R lm vs. ORE ore.lm. Data: 500k to 1.5m records, 3 predictors. Performance: 2x-3x improvement for build, 4x improvement for scoring.
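
A comparison like this needs a client-side baseline. One way to time the open-source `lm()` build and scoring steps is sketched below on synthetic data with three predictors; the data-generating model, the seed and the reduced row count are assumptions for illustration, and the timings it prints say nothing about the figures quoted on the slide.

```r
# Timing harness for the open-source lm() baseline on synthetic data.
# n is kept small here; the slide's range is 500k to 1.5m records.
set.seed(1)
n <- 1e5
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 0.5 * d$x2 + 0.25 * d$x3 + rnorm(n)

# Time the model build and the scoring pass separately
build_time <- system.time(fit <- lm(y ~ x1 + x2 + x3, data = d))
score_time <- system.time(pred <- predict(fit, newdata = d))

build_time["elapsed"]
score_time["elapsed"]
round(coef(fit), 2)
```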
  21. In-database performance advantage – lm. More tests at http://blogs.oracle.com/R/entry/oracle_r_enterprise_1_32
  22. Deploying with Oracle R Enterprise
      • Load R scripts into the ORE script repository
      • Invoke R scripts by name from SQL
      • Store R objects directly in Oracle Database (no separate files)
      • Optional return values:
        – Data frame consumable by any SQL-ready application
        – XML containing structured data, complex R objects, PNG images
        – PNG table with BLOB column containing images for immediate consumption
      • Schedule for automatic execution
  23. Oracle Advanced Analytics: Embedded R Execution, SQL interface. rqEval – generate XML string for graphic output. Consumer shown: Oracle BI Publisher.

      -- Oracle PL/SQL wrapping the R language function
      begin
        sys.rqScriptCreate('Example6',
       'function(){
          res <- 1:10
          plot( 1:100, rnorm(100), pch = 21, bg = "red", cex = 2 )
          res
        }');
      end;
      /

      -- Oracle SQL
      select value
      from table(rqEval(NULL,'XML','Example6'));
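
Before registering a function with `sys.rqScriptCreate`, it can help to run the same body in a local R session and confirm what it returns. The sketch below does only that; the name `example6` is a local stand-in, and the null PDF device is an assumption made so the plot call succeeds on a machine without a display.

```r
# The R function registered as 'Example6' on the slide, run locally.
example6 <- function() {
  res <- 1:10
  plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)
  res
}

# Draw to a null PDF device so the call works without a display.
pdf(file = NULL)
result <- example6()
invisible(dev.off())

result
```

The function's value is the last expression, `res`, while the plot is a side effect on the graphics device; the slide's `'XML'` output option is what carries both back to SQL.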
  24. Summary
      Oracle R Enterprise (ORE)
      • A comprehensive, database-centric environment for end-to-end analytical processes in R with immediate deployment to production environments
      • Wide range of in-database advanced analytics algorithms exposed through R
      • Eliminate R client memory limits
      Oracle R Connector for Hadoop (ORCH)
      • A collection of R packages enabling Big Data analytics from an R environment
      • Allows R users to leverage a Hadoop Cluster with HDFS and MapReduce from R
      • Prepackaged advanced analytics algorithms
      • Transparent manipulation of HIVE data

      • Enable R users to conduct Big Data projects from R
      • Eliminate client R engine memory barrier
      • Scale to large data sets
      • Deploy R-based solutions without translation to other languages or environments
  25. Resources
      • Blog: http://www.oracle.com/goto/R https://blogs.oracle.com/R/
      • Forum: https://forums.oracle.com/forums/forum.jspa?forumID=1397
      • Oracle R Distribution: http://www.oracle.com/technetwork/indexes/downloads/r-distribution-1532464.html
      • ROracle: http://cran.r-project.org/web/packages/ROracle
      • Oracle R Enterprise: http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise
      • Oracle R Connector for Hadoop: http://www.oracle.com/us/products/database/big-data-connectors/overview