Presenter Introduction
www.linkedin.com/in/ajayohri
Working with Analytics since 2004
Educated at IIM Lucknow, DCE, U Tenn
Author (R for Business Analytics (Springer))
Blogger at www.decisionstats.com
Interviewed 100+ Analytics leaders

Audience Introduction
● Affiliation: Academic / Govt / Private
● Years of working with Big Data:
● Specific Interest Area in Analytics:

Great Expectations
From You
1. No mobile rings, no sleeping (discreet sleeping)
2. Please take notes using pencil, parchment, paper, pen, computer, tablet, stylus, mobile, etc.
3. Please ask questions at the END (from the notes taken at Step 2)
From Me
1. Breadth of Case Studies (!)
2. Open Source focus (R mostly, Clojure, Python)
3. Actionable ideas are useful! i.e. "I spent 3 hours in X talk but I did learn to do Y, or I am now interested in trying out Z"

Agenda
-Presenter Introduction
-Audience Identification
-Expectations
---------------------------------------------
-Big Data
-Big Data Analytics using R
 -Case Study 1 (Amazon AWS, SAP HanaDB)
-Big Data Analytics using other tools
 -Case Study 2 (BigML.com, Picloud.com)
---------------------------------------------

Big Data
What is Big Data?
"Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Examples include web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.
IBM - http://www-01.ibm.com/software/data/bigdata/
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

Big Data
What is Big Data?
Big Data Conferences
--O'Reilly's Strata
--Hadoop World
--Many, many conferences... including ours

Thought for Today
Data that is classified as Big Data in 2012 will be classified as Little Data by 2018.
True ---------- False?

What is Cloud Computing?
Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models.
-- National Institute of Standards and Technology
http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf

Cloud Computing and Big Data Analytics
--The cost of computing on Big Data would be prohibitive but for cloud computing.
--Cloud runs on X OS predominantly, and needs customized solutions as of 2012.
--Open source solutions (OS-Analytics) are more easily customized.

Sources of Big Data
--Internet
------Server Logs, Clickstream, Analytics
--Social Media
--Governments and UN bodies
--Internal Data from customers

Storing Big Data for R
--Lots of RAM (?!)
--RDBMS
--Documents (CouchDB, MongoDB)
--HDFS (Hadoop)

Big Data Packages in R - 1/2
http://cran.r-project.org/web/views/HighPerformanceComputing.html
● The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality to data sets stored outside of R's main memory.
● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.
● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them. This permits transparent access from R without bumping against R's internal memory limits. Several R processes on the same computer can also share big memory objects.
● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming. It also facilitates operating on data in a streaming fashion, without Hadoop.
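
The "incremental computations" idea behind out-of-memory regression can be sketched in a few lines. This is an illustration only, written in Python so it runs with nothing installed. It is not biglm's code, and the package itself may use a more numerically careful update. The `chunks` data source and its y = 1 + 2x rows are made up for the example.

```python
# Sketch of chunk-wise ("incremental") least squares for y = a + b*x.
# Only five running totals are kept in memory, never the full data set.

def update(state, chunk):
    """Fold one chunk of (x, y) pairs into the running totals."""
    n, sx, sy, sxx, sxy = state
    for x, y in chunk:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def coefficients(state):
    """Solve the normal equations from the accumulated totals."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

def chunks(n_rows, size):
    """Hypothetical data source: yields rows of y = 1 + 2x, one chunk at a time."""
    for start in range(0, n_rows, size):
        stop = min(start + size, n_rows)
        yield [(x, 1 + 2 * x) for x in range(start, stop)]

state = (0, 0.0, 0.0, 0.0, 0.0)
for chunk in chunks(n_rows=10_000, size=1_000):
    state = update(state, chunk)  # each chunk can be discarded after this call

intercept, slope = coefficients(state)
print(intercept, slope)
```

Because each chunk only changes the running totals, the data can live in a file or database and be read in pieces of any size.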

Big Data Packages in R - 2/2
● http://cran.r-project.org/web/packages/biganalytics/
  This package extends the bigmemory package with various analytics. Functions bigkmeans and binit may also be used with native R objects.
● http://cran.r-project.org/web/packages/bigtabulate/index.html
  This package extends the bigmemory package with table- and split-like support for big.matrix objects. The functions may also be used with regular R matrices for improving speed and memory-efficiency.
● http://cran.at.r-project.org/web/packages/synchronicity/index.html
  For mutex (locking) support for advanced shared-memory usage, see synchronicity.
https://r-forge.r-project.org/R/?group_id=556 lists more projects. For linear algebra support, see bigalgebra.

Big Data and Revolution Analytics
Primary - RevoScaleR package / XDF format
Also sponsored RHadoop
https://github.com/RevolutionAnalytics/RHadoop

RHadoop - rhdfs package
rhdfs - https://github.com/decisionstats/RHadoop/wiki/rhdfs
Overview
This R package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS. The following functions are part of this package:
● File Manipulations
  ● hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
● File Read/Write
  ● hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file
● Directory
  ● hdfs.dircreate, hdfs.mkdir
● Utility
  ● hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
● Initialization
  ● hdfs.init, hdfs.defaults
http://hadoop.apache.org/hdfs/
Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

RHadoop - rhbase package
rhbase - https://github.com/decisionstats/RHadoop/wiki/rhbase
Overview
This R package provides basic connectivity to HBASE, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBASE. The following functions are part of this package:
● Table Manipulation
  ● hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table
● Read/Write
  ● hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan
● Utility
  ● hb.list.tables
● Initialization
  ● hb.defaults, hb.init
http://hbase.apache.org/
HBase is the Hadoop database. Think of it as a distributed, scalable, big data store.

RHadoop - rmr package
rmr - Overview
This R package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster.
● Average flight delay (Orbitz): original and updated version with presentation
● Network analysis: original and a summary
Also see https://github.com/decisionstats/RHadoop/wiki/Tutorial for logistic regression and k-means
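
For readers new to MapReduce, the map / group-by-key / reduce pattern can be tried on a laptop before touching a cluster. The sketch below is plain Python in a single process, not the rmr API. The flight records and the names `map_fn`, `reduce_fn` and `map_reduce` are hypothetical.

```python
from collections import defaultdict

# Hypothetical flight records: (carrier, departure delay in minutes).
records = [("AA", 10), ("DL", 0), ("AA", 30), ("UA", 5), ("DL", 20)]

def map_fn(record):
    """Map step: emit one (key, value) pair per input record."""
    carrier, delay = record
    return carrier, delay

def reduce_fn(key, values):
    """Reduce step: collapse all values seen for one key into a summary."""
    return key, sum(values) / len(values)

def map_reduce(data, mapper, reducer):
    """Run map, group by key (the 'shuffle'), then reduce, in one process."""
    groups = defaultdict(list)
    for record in data:
        key, value = mapper(record)
        groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

average_delay = map_reduce(records, map_fn, reduce_fn)
print(average_delay)
```

On a real cluster the framework runs the map and reduce steps on many nodes and does the grouping in between. The analyst still writes only the two small functions.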

Big Data Social Network Analysis
Analyzing a big social network using R and distributed graph engines
http://thinkaurelius.com/2012/02/05/graph-degree-distributions-using-r-over-hadoop/
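
A degree distribution, the statistic in the linked post's title, is just a count of how many nodes have each number of connections. Here is a minimal single-machine sketch in Python. The edge list is made up, and this is not the method used in the linked article.

```python
from collections import Counter

# Hypothetical undirected edge list; a real network would be streamed from storage.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]

def degree_distribution(edge_list):
    """Return (degree per node, number of nodes having each degree)."""
    degree = Counter()
    for u, v in edge_list:
        degree[u] += 1
        degree[v] += 1
    return degree, Counter(degree.values())

degree, distribution = degree_distribution(edges)
for k in sorted(distribution):
    print(k, distribution[k])
```

Both counting passes are sums over keys, so the same computation fits the map/reduce pattern when the edge list is too large for one machine.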

Big Data Social Media Analysis
Can be used for Customers (and also for latent influencers)
- http://www.r-bloggers.com/an-example-of-social-network-analysis-with-r-using-package-igraph/

Big Data Social Media Analysis
The R package twitteR (http://cran.r-project.org/web/packages/twitteR/index.html) can be used for prototyping, but Twitter's API is rate limited to 1500 per hour(?)/day, so we can use the Datasift API:
http://datasift.com/pricing#costs

Big Data Social Media Analysis
How does information propagate through a social network?
http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/

Big Data Social Network Analysis
Can be used for Terrorists (and also for potential protestors)
- Drew Conway http://riskecon.com/wp-content/uploads/2012/02/Conway-Socio_Terrorism.pdf
Primary focus is on three aspects of network analysis:
1. Identifying leadership and key actors
2. Revealing underlying structure and intra-network community structure
3. Evolution and decay of social networks

Big Data and Revolution Analytics
Primary - RevoScaleR package / XDF format
Also sponsored RHadoop
● For a case study, UpStream Software (slide 16):
  http://www.revolutionanalytics.com/news-events/free-webinars/2012/how-big-data-is-changing-retail-marketing-analytics/
● Big data GLMs (you might find the chart on this page useful):
  http://blog.revolutionanalytics.com/2012/06/big-data-generalized-linear-models-with-revolution-r-enterprise.html
● Data distillation with Hadoop and R:
  http://blog.revolutionanalytics.com/2012/06/data-distillation-with-hadoop-and-r.html
● Analysis of the million row movie data set (building recommendation engines):
  http://blog.revolutionanalytics.com/2012/04/simple-tools-for-building-a-recommendation-engine.html

Big Data and Revolution Analytics
Marketing analytics company UpStream Software used map-reduce to convert transactions from Omniture logs (web visits, emails clicked on, ads displayed) into customer behaviors: response to an offer, research into a product, purchases.

More R and Hadoop Case Studies
A few examples where R and Hadoop are used for data distillation:
● Using robust regression on a series of raw voice-over-IP packets to calculate how long participants talk during a phone conversation.
● Using graph theory (and R's igraph package) to quantify the number of close friends of members of a social network.
● Orbitz uses R and Hadoop to extract flights and hotels that will be presented during a travel search, based on previous transactions.
● Using k-means clustering to extract similar "groups" of transactions, which are then aggregated and used as the record level for structured analysis.
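
The k-means step in the last bullet can be illustrated on toy data. This sketch uses plain Lloyd's algorithm on one-dimensional values in Python. The transaction amounts and starting centroids are invented, and it is not the code from any of the case studies above.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain Lloyd's algorithm on one-dimensional data."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to its cluster mean (unchanged if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical transaction amounts with two obvious groups.
amounts = [12, 15, 11, 14, 210, 190, 205, 13, 198]
centroids, clusters = kmeans_1d(amounts, centroids=[0, 100])
print(centroids)
print(clusters)
```

Each cluster can then be summarised (count, mean, total) and the summaries, rather than the raw transactions, passed on to the structured analysis.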

Using RDBMS (is it Big Data?) through R
--RDBMS - RODBC Package
http://cran.r-project.org/web/packages/RODBC/RODBC.pdf
> library(RODBC)
> odbcDataSources(type = c("all", "user", "system"))
                SQLServer              PostgreSQL30             PostgreSQL35W
             "SQL Server"    "PostgreSQL ANSI(x64)" "PostgreSQL Unicode(x64)"
                    MySQL
  "MySQL ODBC 5.1 Driver"

Querying Big Data
--RDBMS - SQL
--Hadoop - Pig (but many ways)
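
One reason SQL suits large tables is that the aggregation can run inside the database, so only a small summary reaches the analysis session. Here is a minimal sketch using Python's built-in sqlite3 module. An in-memory SQLite database stands in for a production RDBMS, and the `orders` table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for a production RDBMS in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.0), ("alice", 30.0), ("bob", 15.0), ("carol", 7.5)],
)

# The GROUP BY runs inside the database; only one row per customer comes back.
query = """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
"""
summary = conn.execute(query).fetchall()
conn.close()
print(summary)
```

The same query text could be sent from R over an ODBC connection. The point is where the work happens, not which client sends it.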

Big Data Analytics - Challenges
--Traditional statistics theory grew up when data was constrained
--Traditional analytics programming was NOT parallel processing
--Shortage of trained people

Big Data Analytics - Solutions
--Teaching more parallel programming and algorithms
--More focus on data reduction techniques like clustering and segmentation than on hypothesis testing. Sampling, anyone?
--Training more data scientists
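
"Sampling, anyone?" has a concrete answer for data that cannot be loaded at once: reservoir sampling keeps a uniform random sample of fixed size while reading the data exactly once. A small Python sketch follows. The function name and the generated stream are illustrative, not taken from the talk.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# Hypothetical stream: a generator, so the rows are never all in memory.
rows = (n * n for n in range(100_000))
sample = reservoir_sample(rows, k=5, seed=42)
print(sample)
```

Memory use depends only on k, so the sample can then be analysed with ordinary in-memory tools.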

Big Data Analytics - Tools used
-Why R
-High Performance Computing
 http://cran.r-project.org/web/views/HighPerformanceComputing.html
-Big Data Within R
 http://www.slideshare.net/bytemining/r-hpc

Using R (interfaces)
--Using RStudio for easier development
--Using Rattle GUI for straight off-the-shelf data mining and using R Commander for extensions
--Using Revolution Analytics RPE
----Example of Snippets

Using R
--Using R for text mining
---Text Mining from Twitter Case Study
---Datasift Export to Amazon S3
--Using R for geo-coded analysis
---Hana DB
--Using R for Graphical Analysis of Big Data
---TablePlot3D using R Commander
--Using R for forecasting
---Using the R Commander E-Pack plugin

Existing Big Data Case Studies
Departure of Aeroplanes - SAP Hana 200m
http://allthingsr.blogspot.in/#!/2012/04/big-data-r-and-hana-analyze-200-million.html
R using SAP Hana
http://www.decisionstats.com/interview-blag-sap-labs-montreal-using-sap-hana-with-rstats/

SAP Hana DB uses R
http://scn.sap.com/community/in-memory-business-data-management/blog/2011/11/28/dealing-with-r-and-hana

Oracle R Enterprise
Case Studies and Examples
http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html

Revolution Analytics RevoScaleR package
Using the RevoScaleR package to extract time series data from time-stamped logs (in this case, the "US Domestic Flights From 1990 to 2009" dataset on Infochimps):
Analyzing time series data of all sorts is a fundamental business analytics task to which the R language is beautifully suited. In addition to the time series functions built into the base stats library there are dozens of R packages devoted to time series...
We have shown how data manipulation functions of the RevoScaleR package can be used to extract time-stamped data from a large data file, aggregate it, and form it into monthly time series that can easily be analyzed with standard R functions.
http://www.inside-r.org/howto/extracting-time-series-large-data-sets
http://blog.revolutionanalytics.com/2011/09/how-to-extract-time-series-from-large-timestamped-logs-with-r.html
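
The extract, aggregate, form-a-monthly-series step described above can be sketched without any special package. The example below is plain Python, not RevoScaleR. The log format ("timestamp,passengers") and the values are hypothetical. It reads one line at a time, so the full log never needs to fit in memory.

```python
from collections import Counter
from datetime import datetime

# Hypothetical time-stamped log lines: "timestamp,passengers".
log_lines = [
    "1990-01-03 08:15:00,120",
    "1990-01-17 19:40:00,95",
    "1990-02-01 06:05:00,130",
    "1990-02-20 12:30:00,110",
    "1990-03-09 21:10:00,80",
]

def monthly_totals(lines):
    """Read one line at a time and add its value to that month's running total."""
    totals = Counter()
    for line in lines:
        stamp, value = line.split(",")
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        totals[(when.year, when.month)] += int(value)
    # Sort by (year, month) so the result is an ordered monthly series.
    return sorted(totals.items())

series = monthly_totals(log_lines)
for (year, month), total in series:
    print(year, month, total)
```

The resulting series has one value per month and is small enough for standard time series functions, whatever the size of the original log.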

Using R on Amazon - Case Study
--Bioconductor in the Cloud
--Custom Amazon Instance
--Concerns for non-American users of Amazon

Using BigML on cloud
Case Study
Classification using Clojure on Cloud
https://bigml.com/gallery/models/fraud_and_crime
--Concerns on depending on third party tools
--Example Cloudnumbers.com

Using Google APIs
https://code.google.com/apis/console/?pli=1
Google Storage API
Google Prediction API
Introduction to other APIs
----Concerns to users of Google APIs

Using Google APIs case study
Google Storage API
Google Prediction API
http://code.google.com/p/google-prediction-api-r-client/

Using Google APIs case study
Introduction to other Big Data Google APIs
----Concerns to users of Google APIs

Privacy hazards of big data analytics
Big Brother - 1984 --- 2012
They know where you are (mobiles)
They know what you are looking for (internet)
They know your past (financial history + social media)
They can use your medical history
Laws authorize them (Patriot Act?)
--Example: Emotional Analysis of Images http://www.affectiva.com/

References and Acknowledgements
David Smith, Revolution Analytics
David Champagne, Revolution Analytics
All R Bloggers, Developers, Packagers
Blag - SAP Hana Analytics
Charlie Berger and the Oracle R Team
Jim Kobielus - IBM Big Data Team
R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.