Big data Big Analytics

Experienced Data Scientist
Jul. 13, 2012


  2. Pre-Agenda -Presenter Introduction -Audience Introduction -Expectations --------------------------------------------
  3. Presenter Introduction Working with analytics since 2004 Educated at IIM Lucknow, DCE, U Tenn Author of R for Business Analytics (Springer) Blogger at Interviewed 100+ analytics leaders
  4. Audience Introduction ● Affiliation-Academic/ Govt/Private ● Years of working with Big Data- ● Specific Interest Area in Analytics-
  5. Great Expectations From You 1. No mobile rings, no sleeping (discreet sleeping), 2. Please take notes using pencil, parchment, paper, pen, computer, tablet, stylus, mobile etc., 3. Please ask questions at the END (from notes taken at Step 2) From Me 1. Breadth of case studies (!) 2. Open source focus (mostly R, plus Clojure and Python) 3. Actionable ideas are useful! i.e. I spent 3 hours in talk X but I did learn to do Y, or I am now interested in trying out Z
  6. Agenda -Presenter Introduction -Audience Identification -Expectations -------------------------------------------- -Big Data -Big Data Analytics using R -Case Study 1 (Amazon AWS, SAP HANA DB) -Big Data Analytics using other tools -Case Study 2 --------------------------------------------
  7. Big Data What is Big Data? "Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Examples include web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce. IBM: "Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data."
  8. Big Data Big Data Conferences --O'Reilly's Strata --Hadoop World --Many, many conferences... including ours
  9. Thought for Today In 2012, data that is classified as Big Data will be classified as Little Data by 2018. True ---------- False?
  10. What is Cloud Computing? Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models. -- National Institute of Standards and Technology
  11. Cloud Computing and Big Data Analytics The cost of computing on Big Data would be prohibitive, but for cloud computing. Clouds run predominantly on one OS family and, as of 2012, need customized solutions. Open source solutions (OS-Analytics) are more easily customized.
  12. Sources of Big Data --Internet ------Server logs, clickstream, analytics --Social media --Governments and UN bodies --Internal data from customers
  13. Storing Big Data for R --Lots of RAM (?!) --RDBMS --Documents (CouchDB, MongoDB) --HDFS (Hadoop)
  14. Storing Big Data for R --Documents (CouchDB, MongoDB) Package RMongo provides an R interface to a Java client for MongoDB databases, which are queried using JavaScript rather than SQL. Package rmongodb is another client, built on MongoDB's C driver. R can also talk to CouchDB through Couch's RESTful HTTP API: construct HTTP calls with RCurl, then move on to the R4CouchDB package for a higher-level interface.
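As a minimal sketch of the rmongodb route just described (not from the original deck; the collection name `web.logs` and the `status` field are invented, and a local mongod on the default port is assumed):

```r
# Illustrative rmongodb sketch: query a local MongoDB for log documents
# with status 404. Collection and field names are invented.
library(rmongodb)

mongo <- mongo.create(host = "localhost")
if (mongo.is.connected(mongo)) {
  buf <- mongo.bson.buffer.create()
  mongo.bson.buffer.append(buf, "status", 404L)    # query: { status: 404 }
  query <- mongo.bson.from.buffer(buf)

  cursor <- mongo.find(mongo, "web.logs", query)
  while (mongo.cursor.next(cursor))
    print(mongo.bson.to.list(mongo.cursor.value(cursor)))
  mongo.destroy(mongo)
}
```

The BSON-buffer dance is rmongodb's low-level style; RMongo trades that verbosity for a simpler JSON-string query interface.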
  15. Big Data Packages in R - 1/2 ● The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality for data sets stored outside of R's main memory. ● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions. ● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them. This permits transparent access from R without bumping against R's internal memory limits. Several R processes on the same computer can also share big memory objects. ● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming; it also facilitates operating on data in a streaming fashion, without Hadoop.
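To make biglm's incremental idea concrete, here is a sketch (not from the original deck; the file and column names are invented) of fitting a regression over a CSV far larger than RAM by streaming it in fixed-size chunks:

```r
# Illustrative sketch: biglm keeps only the model's sufficient statistics,
# so memory use stays constant regardless of file size.
library(biglm)

con   <- file("transactions.csv", open = "r")
first <- read.csv(con, nrows = 10000)               # first chunk, with header
fit   <- biglm(amount ~ visits + tenure, data = first)

repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = 10000, col.names = names(first)),
    error = function(e) NULL)                       # NULL at end of file
  if (is.null(chunk)) break
  fit <- update(fit, chunk)                         # incremental update
}
close(con)
summary(fit)
```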
  16. Big Data Packages in R - 2/2 ● The biganalytics package extends the bigmemory package with various analytics. Functions bigkmeans and binit may also be used with native R objects. ● The bigtabulate package extends the bigmemory package with table- and split-like support for big.matrix objects. The functions may also be used with regular R matrices for improving speed and memory efficiency. ● For mutex (locking) support for advanced shared-memory usage, see synchronicity. For linear algebra support, see bigalgebra. The bigmemory project site lists more projects.
  17. Big Data and Revolution Analytics Primary: RevoScaleR package / XDF format. Also sponsored RHadoop.
  18. RHadoop - rhdfs package Overview: this R package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS. The following functions are part of this package: ● File manipulation ● hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get ● File read/write ● hdfs.file, hdfs.write, hdfs.close, hdfs.flush,,, hdfs.tell, hdfs.line.reader ● Directory ● hdfs.dircreate, hdfs.mkdir ● Utility ● hdfs.ls, hdfs.list.files, hdfs.exists ● Initialization ● hdfs.init, hdfs.defaults. HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
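A short sketch of how the rhdfs functions above fit together (not from the original deck; the paths are invented, and a configured Hadoop cluster with the HADOOP_CMD environment variable set is assumed):

```r
# Illustrative rhdfs workflow: push a local file into HDFS, list the
# directory, then stream the file back into R line by line.
library(rhdfs)
hdfs.init()

hdfs.mkdir("/user/analyst/logs")
hdfs.put("weblog.txt", "/user/analyst/logs")       # local file -> HDFS
hdfs.ls("/user/analyst/logs")

reader <- hdfs.line.reader("/user/analyst/logs/weblog.txt")
lines  <- reader$read()                            # read a batch of lines
reader$close()
```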
  19. RHadoop - rhbase package Overview: this R package provides basic connectivity to HBase, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase. The following functions are part of this package: ● Table manipulation ● hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table ● Read/write ● hb.insert, hb.get, hb.delete, hb.scan ● Utility ● hb.list.tables ● Initialization ● hb.defaults, hb.init. HBase is the Hadoop database: think of it as a distributed, scalable big data store.
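A sketch of the rhbase calls above (not from the original deck; the table and column family are invented, an HBase Thrift server on localhost:9090 is assumed, and the exact mutation format may differ between rhbase versions):

```r
# Illustrative rhbase round trip: create a table, insert a cell,
# read it back, and drop the table.
library(rhbase)
hb.init(host = "localhost", port = 9090)

hb.new.table("visits", "info")                     # one column family: info
hb.insert("visits", list(list("row1", c("info:country"), list("IN"))))
hb.get("visits", "row1")
hb.delete.table("visits")
```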
  20. RHadoop - rmr package Overview: this R package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster. ● Average flight delay (Orbitz): original and updated version with presentation ● Network analysis: original and a summary ● Also see the package tutorials for logistic regression and k-means
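In the spirit of the Orbitz flight-delay example, here is a sketch of an rmr job (not from the original deck; the data, field names, and Hadoop configuration are assumed): the mapper emits (carrier, delay) pairs and the reducer averages the delays per carrier.

```r
# Illustrative rmr map/reduce job: mean delay per carrier over a toy
# data set pushed into the DFS. A working Hadoop setup is assumed.
library(rmr)

flights <- to.dfs(lapply(1:100, function(i)
  list(carrier = sample(c("AA", "UA"), 1), delay = rnorm(1, 15, 5))))

avg.delay <- mapreduce(
  input  = flights,
  map    = function(k, v) keyval(v$carrier, v$delay),
  reduce = function(carrier, delays) keyval(carrier, mean(unlist(delays))))

from.dfs(avg.delay)
```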
  21. Big Data Social Network Analysis Analyzing a big social network using R and distributed graph engines
  22. Big Data Social Media Analysis Can be used for customers (and also for latent influencers)
  23. Big Data Social Media Analysis The R package twitteR can be used for prototyping, but Twitter's API is rate limited (to roughly 1500 calls per hour/day?), so for bigger volumes we can use the Datasift API.
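A minimal twitteR prototype of the kind suggested above (not from the original deck; the hashtag is invented, and note that the open search endpoint this relied on in 2012 has since been replaced by OAuth-authenticated access, so treat it purely as a sketch):

```r
# Illustrative twitteR prototype: pull recent tweets for a hashtag and
# extract their text for downstream text mining.
library(twitteR)

tweets <- searchTwitter("#bigdata", n = 100)        # rate-limited call
texts  <- sapply(tweets, function(t) t$getText())   # text of each status
head(texts)
```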
  24. Big Data Social Media Analysis How does information propagate through a social network?
  25. Big Data Social Network Analysis Can be used for terrorists (and also for potential protestors) - Drew Conway. Primary focus is on three aspects of network analysis: 1. Identifying leadership and key actors 2. Revealing underlying structure and intra-network community structure 3. Evolution and decay of social networks
  26. Big Data and Revolution Analytics Primary: RevoScaleR package / XDF format. Also sponsored RHadoop. ● For a case study, see UpStream Software (slide 16) ● Big data GLMs (you might find the chart on this page useful) ● Data distillation with Hadoop and R ● Analysis of the million-row movie data set (building recommendation engines)
  27. Big Data and Revolution Analytics The marketing analytics company UpStream Software used map-reduce to convert transactions from Omniture logs (web visits, emails clicked on, ads displayed) into customer behaviors: response to an offer, research into a product, purchases.
  28. More R and Hadoop Case Studies A few examples where R and Hadoop are used for data distillation: ● Using robust regression on a series of raw voice-over-IP packets to calculate how long participants talk during a phone conversation. ● Using graph theory (and R's igraph package) to quantify the number of close friends of members of a social network. ● Orbitz uses R and Hadoop to extract flights and hotels that will be presented during a travel search, based on previous transactions. ● Using k-means clustering to extract similar "groups" of transactions, which are then aggregated and used as the record level for structured analysis.
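The "close friends" bullet above can be sketched with igraph (not from the original deck; the toy edge list stands in for real friendship data, and direct neighbours naively approximate close friends):

```r
# Illustrative igraph sketch: count each member's direct friends in a
# small undirected friendship graph.
library(igraph)

edges <- data.frame(from = c("a", "a", "b", "c"),
                    to   = c("b", "c", "c", "d"))
g <- graph.data.frame(edges, directed = FALSE)
degree(g)   # direct-friend count per member
```

At Hadoop scale the same degree computation becomes a map/reduce job emitting one count per edge endpoint, with igraph reserved for the distilled per-community graphs.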
  29. Using RDBMS (Big?) Data through R --RDBMS - RODBC package, RMySQL, ROracle, RPostgreSQL, RSQLite
  30. Using RDBMS (is it Big Data?) through R --RDBMS - RODBC package > library(RODBC) > odbcDataSources(type = c("all", "user", "system")) SQLServer: "SQL Server" PostgreSQL30: "PostgreSQL ANSI(x64)" PostgreSQL35W: "PostgreSQL Unicode(x64)" MySQL: "MySQL ODBC 5.1 Driver"
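Building on the data sources listed above, a sketch of the usual RODBC pattern (not from the original deck; the DSN, credentials, and table are invented): aggregate inside the database so only the small summary crosses into R's memory.

```r
# Illustrative RODBC query: let the RDBMS do the heavy lifting and
# return only the aggregated result to R.
library(RODBC)

ch <- odbcConnect("PostgreSQL35W", uid = "analyst", pwd = "secret")
sales <- sqlQuery(ch,
  "SELECT region, SUM(amount) AS total FROM transactions GROUP BY region")
odbcClose(ch)
head(sales)
```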
  31. Querying Big Data --RDBMS: SQL --Hadoop: Pig (but many ways)
  32. Big Data Analytics - Challenges --Traditional statistics theory grew up when data was constrained --Traditional analytics programming was NOT parallel --Shortage of trained people
  33. Big Data Analytics - Solutions --Teaching more parallel programming and algorithms --More focus on data-reduction techniques such as clustering and segmentation rather than on hypothesis testing (sampling, anyone?) --Training more data scientists
  34. Big Data Analytics - Tools used -Why R -High Performance Computing -Big Data Within R
  35. Using R (interfaces) --Using RStudio for easier development --Using the Rattle GUI for off-the-shelf data mining, and R Commander for extensions --Using the Revolution Analytics R Productivity Environment (RPE) --Example of code snippets
  36. Using R --Using R for text mining ---Text mining from Twitter case study ---Datasift export to Amazon S3 --Using R for geo-coded analysis ---HANA DB --Using R for graphical analysis of Big Data ---TablePlot, 3D using R Commander --Using R for forecasting ---Using the R Commander e-pack plugin
  37. Existing Big Data Case Studies Departures of aeroplanes - SAP HANA, 200 million rows !/2012/04/big-data-r-and-hana-analyze-200-million.html R using SAP HANA
  38. SAP HANA DB uses R
  39. Oracle R Enterprise Case Studies and Examples enterprise/index.html
  41. Revolution Analytics RevoScaleR package The RevoScaleR package can extract time series data from time-stamped logs (in this case, the "US Domestic Flights From 1990 to 2009" dataset on Infochimps): "Analyzing time series data of all sorts is a fundamental business analytics task to which the R language is beautifully suited. In addition to the time series functions built into the base stats library there are dozens of R packages devoted to time series... We have shown how the data manipulation functions of the RevoScaleR package extract time-stamped data from a large data file, aggregate it, and form it into monthly time series that can easily be analyzed with standard R functions."
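A sketch of that RevoScaleR workflow (not from the original deck; it requires Revolution R Enterprise, and the file and variable names are invented): convert the large CSV to the chunked XDF format once, then summarize it out of memory.

```r
# Illustrative RevoScaleR sketch: import once to XDF, then compute
# summaries chunk by chunk without loading the whole file into RAM.
library(RevoScaleR)

rxImport(inData = "flights.csv", outFile = "flights.xdf", overwrite = TRUE)
rxSummary(~ delay, data = "flights.xdf")   # computed chunk-wise over the XDF
```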
  42. Using R on Amazon - Case Study --Bioconductor in the cloud --Custom Amazon instance --Concerns for non-American users of Amazon
  43. Using BigML on the Cloud - Case Study Classification using Clojure on the cloud --Concerns about depending on third-party tools --Example
  44. Using Google APIs Google Storage API Google Prediction API Introduction to other APIs ----Concerns for users of Google APIs
  45. Using Google APIs case study Google Storage API Google Prediction API
  46. Using Google APIs case study Introduction to other Big Data Google APIs ----Concerns for users of Google APIs
  47. Using Python - PiCloud http://www.picloud.com/
  48. Privacy hazards of big data analytics Big Brother: 1984 --- 2012 They know where you are (mobiles) They know what you are looking for (internet) They know your past (financial history + social media) They can use your medical history Laws authorize them (Patriot Act?) --Example: emotional analysis of images
  49. References and Acknowledgements David Smith, Revolution Analytics David Champagne, Revolution Analytics All R bloggers, developers, and package authors Blag - SAP HANA Analytics Charlie Berger and the Oracle R team Jim Kobielus - IBM Big Data team R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
  50. Thanks
  51. Book- R for Business Analytics