Big data Big Analytics

Usage Rights

© All Rights Reserved

Big data Big Analytics: Presentation Transcript

  • Pre-Agenda
    - Presenter Introduction
    - Audience Introduction
    - Expectations
  • Presenter
    With Analytics since 2004
    Educated at IIM Lucknow, DCE, U Tenn
    Author (R for Business Analytics (Springer))
    Blogger at www.decisionstats.com
    Interviewed 100+ Analytics leaders
  • Audience Introduction
    ● Affiliation: Academic / Govt / Private
    ● Years of working with Big Data
    ● Specific Interest Area in Analytics
  • Great Expectations
    From You
    1. No mobile rings, no sleeping (discreet sleeping).
    2. Please take notes using pencil, parchment, paper, pen, computer, tablet, stylus, mobile etc.
    3. Please ask questions at the END (from the notes taken at Step 2).
    From Me
    1. Breadth of Case Studies (!)
    2. Open Source focus (R mostly, Clojure, Python)
    3. Actionable ideas are useful! i.e. "I spent 3 hours in X talk but I did learn to do Y", or "I am now interested in trying out Z".
  • Agenda
    - Presenter Introduction
    - Audience Identification
    - Expectations
    --------
    - Big Data
    - Big Data Analytics using R
    - Case Study 1 (Amazon AWS, SAP HanaDB)
    - Big Data Analytics using other tools
    - Case Study 2 (,
  • Big Data: What is Big Data?
    "Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Examples include web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.
    IBM: Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.
  • Big Data: What is Big Data?
    Big Data Conferences
    -- O'Reilly's Strata
    -- Hadoop World
    -- Many, many conferences... including ours
  • Thought for Today
    Data that is classified as Big Data in 2012 will be classified as Little Data by 2018.
    True ---------- False?
  • What is Cloud Computing?
    Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models.
    -- National Institute of Standards and Technology
  • Cloud Computing and Big Data Analytics
    Cost of computing Big Data would be too much, but for cloud computing.
    Cloud runs on X OS predominantly, and needs customized solutions as of 2012.
    Open source solutions (OS-Analytics) are more easily customized.
  • Sources of Big Data
    -- Internet: Server Logs, Clickstream, Analytics
    -- Social Media
    -- Governments and UN bodies
    -- Internal Data from customers
  • Storing Big Data for R
    -- Lots of RAM (?!)
    -- RDBMS
    -- Documents (CouchDB, MongoDB)
    -- HDFS (Hadoop)
  • Storing Big Data for R
    -- Documents (CouchDB, MongoDB)
    Package RMongo provides an R interface to a Java client for MongoDB databases, which are queried using JavaScript rather than SQL. Package rmongodb is another client, using mongodb's C driver. For talking to CouchDB using Couch's RESTful HTTP API, construct HTTP calls with RCurl, then move on to the R4CouchDB package for a higher-level interface.
  • Big Data Packages in R - 1/2
    ● The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality for data sets stored outside of R's main memory.
    ● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.
    ● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them. This permits transparent access from R without bumping against R's internal memory limits. Several R processes on the same computer can also share big memory objects.
    ● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming. It also facilitates operating on data in a streaming fashion, without Hadoop.
  • Big Data Packages in R - 2/2
    ● One package extends the bigmemory package with various analytics. Functions bigkmeans and binit may also be used with native R objects.
    ● Another package extends the bigmemory package with table- and split-like support for big.matrix objects. The functions may also be used with regular R matrices for improving speed and memory-efficiency.
    ● For mutex (locking) support for advanced shared-memory usage, see synchronicity. [...] lists more projects. For linear algebra support, see bigalgebra.
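The "incremental computations" idea behind out-of-memory regression can be sketched in base R. The sketch below is an illustration only and is not biglm's own algorithm: it accumulates the cross-products X'X and X'y one chunk at a time and solves the normal equations at the end. The simulated data frame, the chunk size, and all variable names are assumptions made for the example.

```r
# Sketch: fit y ~ x by accumulating cross-products chunk by chunk,
# so that only one chunk has to be held in memory at a time.
set.seed(42)
n <- 10000
full <- data.frame(x = rnorm(n))              # simulated data (assumption)
full$y <- 2 + 3 * full$x + rnorm(n, sd = 0.5)

chunk_size <- 1000                            # hypothetical chunk size
starts <- seq(1, n, by = chunk_size)

xtx <- matrix(0, 2, 2)                        # running sum of X'X
xty <- matrix(0, 2, 1)                        # running sum of X'y

for (s in starts) {
  rows <- s:min(s + chunk_size - 1, n)
  chunk <- full[rows, ]                       # stands in for reading one chunk from disk
  X <- cbind(1, chunk$x)                      # design matrix: intercept + x
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, chunk$y)
}

beta_chunked <- drop(solve(xtx, xty))         # coefficients from the accumulated sums
beta_lm <- unname(coef(lm(y ~ x, data = full)))  # in-memory fit, for comparison
```

The two coefficient vectors should agree to numerical precision, which is the point: the fit never needs more than one chunk plus two small matrices in memory.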
  • Big Data and Revolution Analytics
    Primary - RevoScaleR package / XDF format
    Also sponsored RHadoop
  • RHadoop - rhdfs package
    rhdfs: this R package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS. The following functions are part of this package:
    ● File Manipulations: hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
    ● File Read/Write: hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.tell, hdfs.line.reader
    ● Directory: hdfs.dircreate, hdfs.mkdir
    ● Utility: hdfs.list.files, hdfs.exists
    ● Initialization: hdfs.init, hdfs.defaults
    Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
  • RHadoop - rhbase package
    rhbase: this R package provides basic connectivity to HBase, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase. The following functions are part of this package:
    ● Table Manipulation: hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table
    ● Read/Write: hb.insert, hb.get, hb.delete, hb.scan
    ● Utility: hb.list.tables
    ● Initialization: hb.defaults, hb.init
    HBase is the Hadoop database. Think of it as a distributed, scalable, big data store.
  • RHadoop - rmr package
    rmr - Overview
    This R package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster.
    ● Average flight delay (Orbitz): original and updated version with presentation
    ● Network analysis: original and a summary
    Also see logistic regression and k-means
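The map, shuffle, and reduce phases behind an "average flight delay" job can be imitated on a single machine in base R. This is a sketch of the pattern only; it does not use the rmr API or a Hadoop cluster, and the toy flights table and the function names map_fn and reduce_fn are assumptions made for the example.

```r
# Toy input: one row per flight (assumed data, not the Orbitz data set).
flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA", "DL"),
  delay   = c(10, 20, 5, 15, 25, 0),
  stringsAsFactors = FALSE
)

# Map: emit one (key, value) pair per record.
map_fn <- function(record) list(key = record$carrier, value = record$delay)

# Reduce: collapse all values that share a key into one summary value.
reduce_fn <- function(key, values) mean(values)

# Map phase: apply map_fn to every record.
pairs <- lapply(seq_len(nrow(flights)), function(i) map_fn(flights[i, ]))

# Shuffle phase: group the emitted values by key.
keys    <- vapply(pairs, function(p) p$key, character(1))
values  <- vapply(pairs, function(p) p$value, numeric(1))
grouped <- split(values, keys)

# Reduce phase: one call per key; the result is a named numeric vector.
avg_delay <- mapply(reduce_fn, names(grouped), grouped)
```

On a real cluster the map and reduce calls would run on different nodes and the shuffle would move data between them; the logic per key stays the same.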
  • Big Data Social Network Analysis
    Analyzing a big social network using R and distributed graph engines
  • Big Data Social Media Analysis
    Can be used for customers (and also for latent influencers)
  • Big Data Social Media Analysis
    R package twitteR can be used for prototyping, but Twitter's API is rate limited to 1500 per hour(?)/day, so we can use the Datasift API
  • Big Data Social Media Analysis
    How does information propagate through a social network?
  • Big Data Social Network Analysis
    Can be used for terrorists (and also for potential protestors)
    Drew Conway's focus is on three aspects of network analysis:
    1. Identifying leadership and key actors
    2. Revealing underlying structure and intra-network community structure
    3. Evolution and decay of social networks
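As a rough illustration of the first aspect, identifying key actors, the sketch below ranks the members of a small network by degree (number of connections) using base R only. Degree is just one simple centrality measure, chosen here for brevity; the edge list and the names in it are made up for the example.

```r
# Hypothetical undirected network, one row per connection.
edges <- data.frame(
  from = c("ann", "ann", "ann", "ann", "bob", "dan"),
  to   = c("bob", "cat", "dan", "eve", "cat", "eve"),
  stringsAsFactors = FALSE
)

# Degree: how many edges touch each person.
degree_tab <- table(c(edges$from, edges$to))
degree <- sort(
  setNames(as.vector(degree_tab), names(degree_tab)),
  decreasing = TRUE
)

# The best-connected member under this measure.
key_actor <- names(degree)[1]
```

For larger graphs, or for measures such as betweenness, a dedicated graph package would be the more natural tool.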
  • Big Data and Revolution Analytics
    Primary - RevoScaleR package / XDF format
    Also sponsored RHadoop
    ● For a case study, UpStream software (slide 16)
    ● Big data GLMs (you might find the chart on this page useful)
    ● Data distillation with Hadoop and R
    ● Analysis of the million row movie data set (building recommendation engines)
  • Big Data and Revolution Analytics
    The marketing analytics company UpStream Software used map-reduce to convert transactions from Omniture logs (web visits, emails clicked on, ads displayed) into customer behaviors: response to an offer, research into a product, purchases.
  • More R and Hadoop Case Studies
    A few examples where R and Hadoop are used for data distillation:
    ● Using robust regression on a series of raw voice-over-IP packets to calculate how long participants talk during a phone conversation.
    ● Using graph theory (and R's igraph package) to quantify the number of close friends of members of a social network.
    ● Orbitz uses R and Hadoop to extract flights and hotels that will be presented during a travel search, based on previous transactions.
    ● Using k-means clustering to extract similar "groups" of transactions, which are then aggregated and used as the record level for structured analysis.
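The last example above, clustering transactions and then aggregating each group into one record, might look roughly like this on a single machine with base R's kmeans() and aggregate(). The simulated transactions, the two columns, and the choice of two clusters are assumptions for illustration, not details from the case study.

```r
set.seed(1)

# Simulated transactions: a "small basket" and a "large basket" population.
transactions <- data.frame(
  amount = c(rnorm(50, mean = 20, sd = 2), rnorm(50, mean = 200, sd = 10)),
  items  = c(rpois(50, lambda = 2), rpois(50, lambda = 12))
)

# Cluster on standardised columns so neither variable dominates the distance.
fit <- kmeans(scale(transactions), centers = 2, nstart = 10)
transactions$group <- fit$cluster

# Distil: one summary record per group, usable as input to later analysis.
group_summary <- aggregate(cbind(amount, items) ~ group,
                           data = transactions, FUN = mean)
```

The distilled table has one row per cluster instead of one row per transaction, which is what makes the follow-on analysis small enough for ordinary tools.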
  • Using RDBMS (Big?) Data through R
    -- RDBMS - RODBC Package
    RMySQL, ROracle, RPostgreSQL, RSQLite
  • Using RDBMS (is it Big Data?) through R
    -- RDBMS - RODBC Package
    > library(RODBC)
    > odbcDataSources(type = c("all", "user", "system"))
                    SQLServer              PostgreSQL30             PostgreSQL35W
                 "SQL Server"    "PostgreSQL ANSI(x64)" "PostgreSQL Unicode(x64)"
                        MySQL
      "MySQL ODBC 5.1 Driver"
  • Querying Big Data
    -- RDBMS - SQL
    -- Hadoop - Pig (but many ways)
  • Big Data Analytics - Challenges
    -- Traditional statistics theory grew up when data was constrained
    -- Traditional analytics programming was NOT parallel processing
    -- Shortage of trained people
  • Big Data Analytics - Solutions
    -- Teaching more parallel programming and algorithms
    -- More focus on data reduction techniques like clustering and segmentation than on hypothesis testing. Sampling, anyone?
    -- Training more data scientists
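On "Sampling, anyone?": one standard way to draw a fixed-size uniform sample from data too large to hold at once is reservoir sampling, sketched below in base R. The function name reservoir_sample is hypothetical, and the in-memory vector only stands in for records that would really arrive one at a time from a file or stream.

```r
# Reservoir sampling: keep a uniform random sample of k items
# while seeing each item of the stream exactly once.
reservoir_sample <- function(stream, k) {
  reservoir <- stream[seq_len(min(k, length(stream)))]  # fill with the first k items
  if (length(stream) > k) {
    for (i in (k + 1):length(stream)) {
      j <- sample.int(i, 1)            # uniform position in 1..i
      if (j <= k) {
        reservoir[j] <- stream[i]      # item i replaces a kept item with probability k/i
      }
    }
  }
  reservoir
}

set.seed(7)
records <- 1:100000                    # stand-in for a large stream of record ids
sampled <- reservoir_sample(records, k = 1000)
```

Only the k kept items are ever stored, so memory use does not grow with the length of the stream.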
  • Big Data Analytics - Tools used
    - Why R
    - High Performance Computing Data Within R
  • Using R (interfaces)
    -- Using R Studio for easier development
    -- Using Rattle GUI for straight off the shelf data mining, and using R Commander for extensions
    -- Using Revolution Analytics RPE
    ----- Example of Snippets
  • Using R
    -- Using R for text mining
    --- Text Mining from Twitter Case Study
    --- Datasift Export to Amazon S3
    -- Using R for geo-coded analysis
    --- Hana DB
    -- Using R for Graphical Analysis of Big Data
    TablePlot3D using R Commander
    -- Using R for forecasting
    Using Plugin R Commander E-Pack
  • Existing Big Data Case Studies
    Departure of Aeroplanes - SAP Hana 200m
    !/2012/04/big-data-r-and-hana-analyze-200-million.html
    R using SAP Hana
  • SAP Hana DB uses R
  • Oracle R Enterprise
    Case Studies and Examples
  • Revolution Analytics RevoScaleR package
    Using the RevoScaleR package to extract time series data from time-stamped logs (in this case, the "US Domestic Flights From 1990 to 2009" dataset on Infochimps):
    Analyzing time series data of all sorts is a fundamental business analytics task to which the R language is beautifully suited. In addition to the time series functions built into the base stats library, there are dozens of R packages devoted to time series...
    We have shown how to use data manipulation functions of the RevoScaleR package to extract time-stamped data from a large data file, aggregate it, and form it into monthly time series that can easily be analyzed with standard R functions.
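The extract, aggregate, and "form into a monthly time series" steps can be sketched in base R on a small simulated log. This does not use RevoScaleR or the Infochimps flights data; the log_records table, its columns, and the date range are assumptions made for the example.

```r
set.seed(3)

# Simulated time-stamped log: one row per record (assumed layout).
days <- seq(as.Date("2008-01-01"), as.Date("2009-12-31"), by = "day")
log_records <- data.frame(
  date       = sample(days, 5000, replace = TRUE),
  passengers = rpois(5000, lambda = 100)
)

# Label each record with its month; fixing the levels keeps empty months in place.
all_months <- format(seq(as.Date("2008-01-01"), as.Date("2009-12-01"), by = "month"),
                     "%Y-%m")
month <- factor(format(log_records$date, "%Y-%m"), levels = all_months)

# Aggregate to one total per month (months with no records become 0).
totals <- tapply(log_records$passengers, month, sum)
totals[is.na(totals)] <- 0

# Form a monthly time series that standard R functions can work with.
monthly_ts <- ts(as.vector(totals), start = c(2008, 1), frequency = 12)
```

Once the data is a ts object, base functions such as plot(), acf(), or decompose() apply directly.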
  • Using R on Amazon - Case Study
    -- Bioconductor in the Cloud
    -- Custom Amazon Instance
    -- Concerns for non-American users of Amazon
  • Using BigML on cloud - Case Study
    Classification using Clojure on Cloud
    -- On depending on third party tools
    -- Example
  • Using Google APIs
    Storage API
    Google Predictive Analysis API
    Introduction to other APIs
    ---- Concerns to users of Google APIs
  • Using Google APIs case study
    Google Storage API
    Google Predictive Analysis API
  • Using Google APIs case study
    Introduction to other Big Data Google APIs
    ---- Concerns to users of Google APIs
  • Using Python - PiCloud
    http://www.picloud.com/
  • Privacy hazards of big data analytics
    Big Brother - 1984 --- 2012
    They know where you are (mobiles)
    They know what you are looking for (internet)
    They know your past (financial history + social media)
    They can use your medical history
    Laws authorize them (Patriot Act?)
    -- Example: Emotional Analysis of Images
  • References and Acknowledgements
    David Smith, Revolution Analytics
    David Champagne, Revolution Analytics
    All R Bloggers, Developers, Packagers
    Blag - SAP Hana Analytics
    Charlie Berger and Oracle R Team
    Jim Kobielus - IBM Big Data Team
    R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
  • Thanks
  • Book - R for Business Analytics