Great Expectations
From You
1. No mobile rings, no sleeping (discreet sleeping),
2. Please take notes using pencil, parchment, paper, pen,
computer, tablet, stylus, mobile, etc.
3. Please ask questions at the END (from the notes taken at
Step 2)
From Me
1 Breadth of Case Studies (!)
2 Open Source focus (R mostly, clojure, python)
3 Actionable Ideas are useful !
i.e., I spent 3 hours in talk X, but I did learn to do Y, or I am now interested in trying out Z
Big Data
What is Big Data?
"Big data" is a term applied to data sets whose size is beyond the ability of
commonly used software tools to capture, manage, and process the data within
a tolerable elapsed time. Examples include web logs, RFID, sensor networks,
social networks, social data (due to the social data revolution), Internet text and
documents, Internet search indexing, call detail records, astronomy,
atmospheric science, genomics, biogeochemical, biological, and other complex
and often interdisciplinary scientific research, military surveillance, medical
records, photography archives, video archives, and large-scale e-commerce.
IBM- http://www-01.ibm.com/software/data/bigdata/
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the
data in the world today has been created in the last two years alone. This data
comes from everywhere: sensors used to gather climate information, posts to
social media sites, digital pictures and videos, purchase transaction records,
and cell phone GPS signals to name a few. This data is big data.
Big Data
What is Big Data?
Big Data Conferences
--O'Reilly's Strata
--Hadoop World
--Many many conferences......including ours
Thought for Today
Data that is classified as Big Data in 2012 will
be classified as Little Data by 2018
True ----------False
?
What is Cloud Computing?
Cloud computing is a model for enabling ubiquitous,
convenient, on-demand network access to a shared pool of
configurable computing resources (e.g., networks, servers,
storage, applications, and services) that can be rapidly
provisioned and released with minimal management effort
or service provider interaction. This cloud model is
composed of five essential characteristics, three service
models, and four deployment models.
http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
-- National Institute of Standards and Technology
Cloud Computing and
Big Data Analytics
The cost of computing on Big Data would be prohibitive
were it not for cloud computing.
Cloud runs on X OS predominantly, and needs
customized solutions as of 2012
Open source solutions (OS- Analytics) are
more easily customized
Sources of Big Data
--Internet
------Server Logs, Clickstream, Analytics
--Social Media
--Governments and UN bodies
--Internal Data from customers
Storing Big Data for R
--Lots of RAM (?!)
--RDBMS
--Documents (CouchDB, MongoDB)
--HDFS (Hadoop)
Storing Big Data for R
--Documents (CouchDB, MongoDB)
Package RMongo provides an R interface to a Java client
for `MongoDB' (http://en.wikipedia.org/wiki/MongoDB)
databases, which are queried using JavaScript rather than
SQL. Package rmongodb is another client using
mongodb's C driver.
https://github.com/wactbprot/R4CouchDB
R talking to CouchDB using Couch's RESTful HTTP API.
Construct HTTP calls with RCurl, then move on to the
R4CouchDB package for a higher-level interface.
http://digitheadslabnotebook.blogspot.in/2010/10/couchdb-and-r.html
Big Data Packages in R- 1/2
http://cran.r-project.org/web/views/HighPerformanceComputing.html
● The biglm package by Lumley uses incremental computations to offer lm()
and glm() functionality to data sets stored outside of R's main memory.
● The ff package by Adler et al. offers file-based access to data sets that are
too large to be loaded into memory, along with a number of higher-level
functions.
● The bigmemory package by Kane and Emerson permits storing large
objects such as matrices in memory (as well as via files) and uses external
pointer objects to refer to them. This permits transparent access from R
without bumping against R's internal memory limits. Several R processes
on the same computer can also share big memory objects.
● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming. It also facilitates
operating on data in a streaming fashion, without Hadoop.
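The incremental idea mentioned for biglm above can be sketched in base R. This is only an illustration, not biglm's own code: it accumulates the cross-products X'X and X'y one chunk at a time, so only a single chunk has to sit in memory. The data, the chunk size and the function name fit_in_chunks are invented for the example.

```r
# Sketch: least squares accumulated chunk by chunk (illustration only).
set.seed(1)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 + 3 * dat$x1 - 1.5 * dat$x2 + rnorm(n, sd = 0.1)

# fit_in_chunks is a hypothetical helper, not part of any package.
fit_in_chunks <- function(data, formula, chunk_size = 100) {
  xtx <- NULL
  xty <- NULL
  for (s in seq(1, nrow(data), by = chunk_size)) {
    rows  <- s:min(s + chunk_size - 1, nrow(data))
    chunk <- data[rows, , drop = FALSE]   # in real use: read this chunk from disk
    mf <- model.frame(formula, chunk)
    X  <- model.matrix(formula, mf)
    y  <- model.response(mf)
    if (is.null(xtx)) {
      xtx <- crossprod(X)
      xty <- crossprod(X, y)
    } else {
      xtx <- xtx + crossprod(X)           # running sum of X'X
      xty <- xty + crossprod(X, y)        # running sum of X'y
    }
  }
  drop(solve(xtx, xty))                   # solve the normal equations once
}

chunked <- fit_in_chunks(dat, y ~ x1 + x2)
full    <- coef(lm(y ~ x1 + x2, data = dat))
print(rbind(chunked, full))
```

Both rows of the printed table should agree to many decimal places. Solving the normal equations directly is less numerically stable than the decomposition-based updating a production package would use, so treat this as a teaching sketch.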
Big Data Packages in R -2/2
● http://cran.r-project.org/web/packages/biganalytics/
This package extends the bigmemory package with various analytics.
Functions bigkmeans and binit may also be used with native R objects
● http://cran.r-project.org/web/packages/bigtabulate/index.html
This package extends the bigmemory package with table- and split-like support
for big.matrix objects. The functions may also be used with regular R matrices
for improving speed and memory-efficiency.
● http://cran.at.r-project.org/web/packages/synchronicity/index.html
For mutex (locking) support for advanced shared-memory usage, see
synchronicity.
https://r-forge.r-project.org/R/?group_id=556 lists more projects. For linear
algebra support, see bigalgebra.
Big Data and Revolution
Analytics
Primary -RevoScaleR package /XDF format
Also sponsored RHadoop
https://github.com/RevolutionAnalytics/RHadoop
RHadoop -rhdfs package
rhdfs-
https://github.com/decisionstats/RHadoop/wiki/rhdfs
Overview
This R package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and
modify files stored in HDFS. The following functions are part of this package
● File Manipulations
● hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
● File Read/Write
● hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file
● Directory
● hdfs.dircreate, hdfs.mkdir
● Utility
● hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
● Initialization
● hdfs.init, hdfs.defaults
http://hadoop.apache.org/hdfs/
Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple
replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid
computations
RHadoop -rhbase package
rhbase-
https://github.com/decisionstats/RHadoop/wiki/rhbase
Overview
This R package provides basic connectivity to HBASE, using the Thrift server. R programmers can browse, read, write, and modify
tables stored in HBASE. The following functions are part of this package
● Table Manipulation
● hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table
● Read/Write
● hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan
● Utility
● hb.list.tables
● Initialization
● hb.defaults, hb.init
http://hbase.apache.org/
HBase is the Hadoop database. Think of it as a distributed, scalable, big data store.
RHadoop -rmr package
rmr-
Overview
This R package allows an R programmer to perform
statistical analysis via MapReduce on a Hadoop cluster.
● Average flight delay (Orbitz): original and updated
version with presentation
● Network analysis: original and a summary
Also see https://github.com/decisionstats/RHadoop/wiki/Tutorial
for logistic regression and k-means
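To make the MapReduce idea concrete, here is a local, in-memory imitation of the map, shuffle and reduce steps for an average-delay calculation. It does not call rmr or Hadoop at all; the flight data and the function names map_fn and reduce_fn are invented for illustration.

```r
# Local imitation of map -> shuffle -> reduce (no Hadoop, no rmr calls).
flights <- data.frame(
  carrier = c("AA", "UA", "AA", "DL", "UA", "AA"),
  delay   = c(10, 5, 20, 0, 15, 30),
  stringsAsFactors = FALSE
)

# Map: turn each record into a (key, value) pair.
map_fn <- function(record) list(key = record$carrier, value = record$delay)
pairs  <- lapply(seq_len(nrow(flights)), function(i) map_fn(flights[i, ]))

# Shuffle: group the values by key.
keys    <- vapply(pairs, function(p) p$key, character(1))
values  <- vapply(pairs, function(p) p$value, numeric(1))
grouped <- split(values, keys)

# Reduce: collapse each group of values to one result per key.
reduce_fn <- function(v) mean(v)
avg_delay <- vapply(grouped, reduce_fn, numeric(1))
print(avg_delay)
```

On a cluster the map and reduce functions would run on different nodes and the shuffle would be done by Hadoop; the shape of the two functions is the part that carries over.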
Big Data Social Network
Analysis
Analyzing A Big Social Network using R and
distributed graph engines
http://thinkaurelius.com/2012/02/05/graph-degree-distributions-using-r-over-hadoop/
Big Data Social Media
Analysis
Can be used for customers (and also for latent influencers):
http://www.r-bloggers.com/an-example-of-social-network-analysis-with-r-using-package-igraph/
Big Data Social Media
Analysis
R package twitteR
http://cran.r-project.org/web/packages/twitteR/index.html
can be used for prototyping, but Twitter's API is rate
limited to 1500 per hour(?)/day, so we can use the
Datasift API http://datasift.com/pricing#costs
Big Data Social Media
Analysis
How does information propagate through a
social network?
http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/
Big Data Social Network
Analysis
Can be used for Terrorists ( and also for potential protestors )-
Drew Conway http://riskecon.com/wp-content/uploads/2012/02/Conway-Socio_Terrorism.pdf
Primary focus is on three aspects of network analysis:
1. Identifying leadership and key actors
2. Revealing underlying structure and intra-network community structure
3. Evolution and decay of social networks
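The first aspect, identifying key actors, can be sketched with the simplest centrality measure, degree. The example uses base R only so that it runs anywhere; in practice a package such as igraph would do this. The network and the actor names are invented.

```r
# Toy undirected network as an edge list (invented actors).
edges <- data.frame(
  from = c("A", "A", "A", "A", "B", "D"),
  to   = c("B", "C", "D", "E", "C", "E"),
  stringsAsFactors = FALSE
)

# Build a symmetric adjacency matrix from the edge list.
actors <- sort(unique(c(edges$from, edges$to)))
adj <- matrix(0, length(actors), length(actors),
              dimnames = list(actors, actors))
for (i in seq_len(nrow(edges))) {
  adj[edges$from[i], edges$to[i]] <- 1
  adj[edges$to[i], edges$from[i]] <- 1
}

# Degree = number of direct connections; the highest degree is a
# first, crude candidate for a key actor.
degree    <- rowSums(adj)
key_actor <- names(which.max(degree))
print(sort(degree, decreasing = TRUE))
print(key_actor)
```

Degree is only one of several centrality measures; betweenness or eigenvector centrality can rank actors quite differently on real networks.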
Big Data and Revolution
Analytics
Primary -RevoScaleR package /XDF format
Also sponsored RHadoop
● For a case study, UpStream software ( slide 16):
http://www.revolutionanalytics.com/news-events/free-webinars/2012/how-big-data-is-changing-retail-marketing-analytics/
● Big data GLMs (you might find the chart on this page
useful):
http://blog.revolutionanalytics.com/2012/06/big-data-generalized-linear-models-with-revolution-r-enterprise.html
● Data distillation with Hadoop and R:
http://blog.revolutionanalytics.com/2012/06/data-distillation-with-hadoop-and-r.html
● Analysis of the million row movie data set (building
recommendation engines):
http://blog.revolutionanalytics.com/2012/04/simple-tools-for-building-a-recommendation-engine.html
Big Data and Revolution
Analytics
Marketing analytics company UpStream Software used map-reduce to convert transactions from Omniture logs (web visits,
emails clicked on, ads displayed) into customer behaviors: response to an offer, research into a product, purchases.
More R and Hadoop Case
Studies
A few examples where R and Hadoop are used for data distillation:
● Using robust regression on a series of raw voice-over-IP packets to
calculate how long participants talk during a phone conversation.
● Using graph theory (and R's igraph package) to quantify the number of
close friends of members of a social network.
● Orbitz uses R and Hadoop to extract flights and hotels that will be
presented during a travel search, based on previous transactions.
● Using k-means clustering to extract similar "groups" of transactions, which
are then aggregated and used as the record level for structured analysis.
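The last example, clustering transactions and then aggregating by cluster, can be sketched at small scale with base R's kmeans(). The transaction data below are simulated and the column names are assumptions made for illustration; on a cluster the same two steps would be spread over many nodes.

```r
set.seed(42)
# Invented transactions forming two well-separated groups.
tx <- data.frame(
  amount = c(rnorm(50, mean = 20, sd = 2),  rnorm(50, mean = 200, sd = 10)),
  items  = c(rnorm(50, mean = 2,  sd = 0.5), rnorm(50, mean = 15,  sd = 2))
)

# Step 1: cluster the (scaled) transactions into groups.
km <- kmeans(scale(tx), centers = 2, nstart = 10)
tx$group <- km$cluster

# Step 2: aggregate, so that each group becomes one record
# for the later structured analysis.
summary_by_group <- aggregate(cbind(amount, items) ~ group,
                              data = tx, FUN = mean)
print(summary_by_group)
```

The number of centers is chosen by hand here because the simulated data have two groups by construction; with real transactions it would need to be justified.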
Using RDBMS (Big?) Data
through R
--RDBMS -RODBC
Package
http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases
http://cran.r-project.org/web/packages/RMySQL/index.html RMySQL
http://cran.r-project.org/web/packages/ROracle/index.html ROracle
http://cran.r-project.org/web/packages/RPostgreSQL/index.html RPostgreSQL
http://cran.r-project.org/web/packages/DBI/index.html
http://cran.r-project.org/web/packages/RSQLite/index.html RSQLite
Using RDBMS (is it Big
Data?)
through R
--RDBMS -RODBC
Package
http://cran.r-project.org/web/packages/RODBC/RODBC.pdf
> library(RODBC)
> odbcDataSources(type = c("all", "user", "system"))
                SQLServer              PostgreSQL30             PostgreSQL35W
             "SQL Server"    "PostgreSQL ANSI(x64)" "PostgreSQL Unicode(x64)"
                    MySQL
  "MySQL ODBC 5.1 Driver"
Big Data Analytics
- Challenges
---Traditional statistics theory grew up when data was
constrained
--Traditional analytics programming was NOT parallel
processing
--Shortage of trained people
Big Data Analytics
- Solutions
---Teaching more parallel programming and algorithms
--More focus on data reduction techniques like clustering and
segmentation than on hypothesis testing. Sampling,
anyone?
--Training more data scientists
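On the sampling point: when the data are too large to load, a fixed-size random sample can still be drawn in a single pass. Below is a minimal sketch of reservoir sampling in base R. The function name reservoir_sample is hypothetical, and a plain vector stands in for what would really be records read one at a time from a file or stream.

```r
# Reservoir sampling: keep k items from a stream of unknown length,
# each item ending up in the sample with equal probability.
reservoir_sample <- function(stream, k) {
  reservoir <- stream[seq_len(min(k, length(stream)))]
  if (length(stream) > k) {
    for (i in (k + 1):length(stream)) {
      j <- sample.int(i, 1)            # uniform position in 1..i
      if (j <= k) reservoir[j] <- stream[i]
    }
  }
  reservoir
}

set.seed(7)
s <- reservoir_sample(1:100000, 1000)
print(length(s))
print(summary(s))
```

Only the k sampled items are held at any time, which is why the approach suits data that do not fit in RAM.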
Big Data Analytics
- Tools used
-Why R
-High Performance Computing
http://cran.r-project.org/web/views/HighPerformanceComputing.html
-Big Data Within R
http://www.slideshare.net/bytemining/r-hpc
Using R (interfaces)
--Using R Studio for easier development
--Using Rattle GUI for straight off the shelf data
mining and Using R Commander for Extensions
--Using Revolution Analytics RPE
-----Example of Snippets
Using R
--Using R for text mining
---Text Mining from Twitter Case Study
---Datasift Export to Amazon S3
--Using R for geo-coded analysis
---Hana DB
--Using R for Graphical Analysis of Big Data
TablePlot
3D using R Commander
--Using R for forecasting
Using Plugin R Commander E -Pack
Existing Big Data Case
Studies
Departure of Aeroplanes - SAP Hana, 200 million records
http://allthingsr.blogspot.in/#!/2012/04/big-data-r-and-hana-analyze-200-million.html
R using SAP Hana
http://www.decisionstats.com/interview-blag-sap-labs-montreal-using-sap-hana-with-rstats/
SAP Hana DB uses R
http://scn.sap.com/community/in-memory-business-data-management/blog/2011/11/28/dealing-with-r-and-hana
Oracle R Enterprise
Case Studies and Examples
http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html
Revolution Analytics
RevoScaleR package
Using the RevoScaleR package to extract time series data from time-stamped logs (in
this case, the "US Domestic Flights From 1990 to 2009" dataset on
Infochimps):
Analyzing time series data of all sorts is a fundamental business analytics task
to which the R language is beautifully suited. In addition to the time series
functions built into the base stats library, there are dozens of R packages devoted to
time series...
We have shown how to use the data manipulation functions of the RevoScaleR package
to extract time-stamped data from a large data file, aggregate it, and form it into
monthly time series that can easily be analyzed with standard R functions.
http://www.inside-r.org/howto/extracting-time-series-large-data-sets
http://blog.revolutionanalytics.com/2011/09/how-to-extract-time-series-from-large-timestamped-logs-with-r.html
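The same extract, aggregate and form-a-monthly-series pattern can be sketched at small scale in base R. This does not use RevoScaleR or the XDF format, and the log below is simulated; the column names date and passengers are assumptions made for illustration.

```r
set.seed(3)
# Invented log: one row per event, with a date stamp and a count.
dates  <- seq(as.Date("2008-01-01"), as.Date("2009-12-31"), by = "day")
log_df <- data.frame(
  date       = sample(dates, 5000, replace = TRUE),
  passengers = rpois(5000, lambda = 100)
)

# Aggregate the time-stamped rows to one total per calendar month.
log_df$month <- format(log_df$date, "%Y-%m")
monthly <- aggregate(passengers ~ month, data = log_df, FUN = sum)

# aggregate() returns the months in sorted order, so they map directly
# onto a monthly ts. This assumes every month has at least one row.
monthly_ts <- ts(monthly$passengers, start = c(2008, 1), frequency = 12)
print(monthly_ts)
```

Once the data are in a ts object, the standard functions such as decompose(), acf() or arima() apply directly.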
Using R on Amazon -Case
Study
--Bioconductor in the Cloud
--Custom Amazon Instance
--Concerns for non-American users of Amazon
Using BigML on cloud
Case Study
Classification using Clojure on Cloud
https://bigml.com/gallery/models/fraud_and_crime
--Concerns on depending on third party tools
--Example Cloudnumbers.com
Privacy hazards of big data
analytics.
Big Brother -1984 --- 2012
They know where you are (mobiles)
They know what you are looking for (internet)
They know your past (financial history +social media)
They can use your medical history
Laws authorize them (Patriot Act?)
--example Emotional Analysis of Images http://www.affectiva.com/
References and
Acknowledgements
David Smith, Revolution Analytics
David Champagne, Revolution Analytics
All R Bloggers,Developers, Packagers
Blag - SAP Hana Analytics
Charlie Berger -and Oracle R Team
Jim Kobielus -IBM Big Data Team
R Development Core Team (2012). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.