Using R with Hadoop
R and Hadoop are changing the way organizations manage and utilize big data. Think Big Analytics and Revolution Analytics are helping clients plan, build, test and implement innovative solutions based on the two technologies, which allow clients to analyze data in new ways and expose new insights for the business. Join us as Jeffrey Breen explains the core technology concepts and illustrates how to utilize R and Revolution Analytics’ RevoR in Hadoop environments.

Usage Rights

© All Rights Reserved

Using R with Hadoop: Presentation Transcript

  • Revolution Confidential
    Using R with Hadoop
    by Jeffrey Breen, Principal, Think Big Academy, Think Big Analytics
    jeffrey.breen@thinkbiganalytics.com
    Twitter: @JeffreyBreen
  • Think Big, Start Smart, Scale Fast
  • Agenda
    - Why R?
    - What is Hadoop?
    - Counting words with MapReduce
    - RHadoop in Action
        - Client scenario: Scaling legacy analytics with RHadoop
        - Client scenario: Network security threat detection
    - Big Data Warehousing with Hive
        - Client scenario: Using Hadoop as a “drop-in” database replacement
        - Client scenario: Preprocessing data for R
    - Want to learn more?
    - Q&A
  • Image source: http://thebalancedguy.blogspot.com/2010/09/with-3-boys-and-having-been-cub-scout.html
  • Image source: http://www.wengerna.com/giant-knife-16999
  • Number of R Packages Available
    How many R packages are there now? At the command line enter:
        > dim(available.packages())
    Slide courtesy of John Versotek, organizer of the Boston Predictive Analytics Meetup
  • Poll: What do you use for analytics?
  • Revolution R Enterprise has Open-Source R Engine at the core
    3,700 community packages and growing exponentially
    [Diagram labels: R Engine, Language Libraries and Community Packages at the core; Multi-Threaded Math Libraries; Parallel Tools; Cluster support; Big Data Analysis; Web Services API; IDE / Developer GUI; Build Assurance; Technical Support]
  • What is Hadoop?
    - An open source project designed to support large scale data processing
    - Inspired by Google’s MapReduce-based computational infrastructure
    - Comprised of several components
        - Hadoop Distributed File System (HDFS)
        - MapReduce processing framework, job scheduler, etc.
        - Ingest/outgest services (Sqoop, Flume, etc.)
        - Higher level languages and libraries (Hive, Pig, Cascading, Mahout)
    - Written in Java, first opened up to alternatives through its Streaming API
      → If your language of choice can handle stdin and stdout, you can use it to write MapReduce jobs
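To make the stdin/stdout point concrete, here is a minimal sketch of what a streaming-style word-count mapper could look like in R. The helper name `run_mapper` is hypothetical, and the tab-separated `word<TAB>1` output assumes Hadoop Streaming's usual key/value text convention. The demo reads from an in-memory connection so it runs without a cluster.

```r
# Streaming-style mapper: writes one "word<TAB>1" line per word read from `con`.
# `run_mapper` is a hypothetical helper name, not part of Hadoop or any package.
run_mapper <- function(con) {
  # Read one line at a time so memory use stays flat on large inputs
  while (length(line <- readLines(con, n = 1)) > 0) {
    words <- strsplit(tolower(line), "[^a-z0-9]+")[[1]]
    words <- words[nzchar(words)]                  # drop empty tokens
    if (length(words) > 0) {
      writeLines(paste(words, 1, sep = "\t"))      # goes to stdout
    }
  }
  invisible(NULL)
}

# Local try-out on an in-memory connection instead of stdin
demo_con <- textConnection(c("Hadoop uses MapReduce", "", "There is a Map phase"))
run_mapper(demo_con)
close(demo_con)

# In an actual streaming script the last lines would instead be something like:
#   con <- file("stdin", open = "r"); run_mapper(con); close(con)
```

This read-and-emit loop is the kind of plumbing that rmr2 is described as handling for you later in the deck.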
  • Poll: Have you used Hadoop?
  • Enter RHadoop
    RHadoop is an open source project sponsored by Revolution Analytics
    Package Overview
    - rmr2 - all MapReduce-related functions
    - rhdfs - interaction with Hadoop’s HDFS file system
    - rhbase - access to the NoSQL HBase database
    rmr2
    - uses Hadoop’s Streaming API to allow R users to write MapReduce jobs in R
    - handles all of the I/O and job submission for you (no while(<stdin>)-like loops!)
  • RHadoop Advantages
    Modular
    - Packages group similar functions
    - Only load (and learn!) what you need
    - Minimizes prerequisites and dependencies
    Open Source
    - Cost: Low (no) barrier to start using
    - Transparency: Development, issue tracker, Wiki, etc. hosted on github: https://github.com/RevolutionAnalytics/RHadoop/
    Supported
    - Sponsored by Revolution Analytics
    - Training & professional services available
    - Support included with Revolution R Enterprise
  • Hadoop cluster components
    [Diagram labels:
     - Outside the cluster: SQL Store, Logs, Ingest Service, Outgest Service
     - Client Servers: Hive, Pig, ...; cron+bash, Azkaban, …; Sqoop, Scribe, …; Monitoring, Management
     - Primary Master Server: Job Tracker, Name Node
     - Secondary Master Server: Secondary Name Node
     - Slaves: each Slave Server runs a Task Tracker and a Data Node over several disks
     - Key: italics = process; ✲ = MR jobs]
    from Think Big Academy’s Hadoop Developer Course
    Copyright © 2011-2013, Think Big Analytics, All Rights Reserved
  • Hadoop’s distributed file system
    [Diagram: the same cluster layout as the previous slide, with the HDFS services highlighted: the Name Node on the Primary Master Server, the Secondary Name Node on the Secondary Master Server, and a Data Node on each Slave Server]
    - 64MB blocks
    - 3x replication
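The block size and replication factor on this slide translate into simple arithmetic. The sketch below is a back-of-the-envelope calculation only. `hdfs_footprint` is a hypothetical helper, its defaults are the slide's 64MB and 3x figures, and it assumes the final, partly filled block occupies only its actual size.

```r
# Rough HDFS footprint of a single file.
# Defaults mirror the slide (64MB blocks, 3x replication); the helper name is hypothetical.
hdfs_footprint <- function(file_mb, block_mb = 64, replication = 3) {
  blocks <- ceiling(file_mb / block_mb)            # the last block may be partly filled
  list(blocks         = blocks,
       block_replicas = blocks * replication,      # block copies spread over Data Nodes
       raw_storage_mb = file_mb * replication)     # assumes no padding of the last block
}

fp <- hdfs_footprint(1000)   # a 1,000MB file
str(fp)
```

For a 1,000MB file this gives 16 blocks, 48 block replicas, and 3,000MB of raw disk.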
  • Wordcount in Hadoop (The “hello world” of MapReduce)
  • Input → Mappers → Sort, Shuffle → Reducers → Output
    We need to convert the Input into the Output.
    Input:
        Hadoop uses MapReduce
        There is a Map phase
        (empty line)
        There is a Reduce phase
    Output:
        a 2
        hadoop 1
        is 2
        map 1
        mapreduce 1
        phase 2
        reduce 1
        there 2
        uses 1
  • Input → Mappers
    Each input line reaches a mapper as a key-value pair:
        Hadoop uses MapReduce     → (N, "…")
        There is a Map phase      → (N, "…")
        (empty line)              → (N, "")
        There is a Reduce phase   → (N, "…")
  • Input → Mappers
    Each mapper emits one (word, 1) pair per word:
        Hadoop uses MapReduce     → (N, "…") → (hadoop, 1), (uses, 1), (mapreduce, 1)
        There is a Map phase      → (N, "…") → (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
        (empty line)              → (N, "")
        There is a Reduce phase   → (N, "…") → (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)
  • Image source: http://blog.stackoverflow.com/wp-content/uploads/then-a-miracle-occurs-cartoon.png
  • Input → Mappers → Sort, Shuffle → Reducers
    Mapper output is sorted and shuffled to one of three reducers by key range:
        0-9, a-l: (hadoop, 1), (is, 1), (a, 1), (is, 1), (a, 1)
        m-q:      (mapreduce, 1), (map, 1), (phase, 1), (phase, 1)
        r-z:      (uses, 1), (there, 1), (there, 1), (reduce, 1)
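The key ranges on this slide (0-9 and a-l, m-q, r-z) can be mimicked with a small lookup table. This is an illustrative sketch only: `partition_for` is a hypothetical helper, and Hadoop's default partitioner assigns keys by hash, not by alphabetical range.

```r
# Range partitioner matching the slide's three key ranges.
# Hypothetical helper for illustration; Hadoop's default partitioner hashes keys instead.
range_of <- c(rep("0-9, a-l", 10 + 12), rep("m-q", 5), rep("r-z", 9))
names(range_of) <- c(0:9, letters)   # first character -> range label

partition_for <- function(words) {
  first <- substr(tolower(words), 1, 1)
  unname(range_of[first])            # NA for keys starting with any other character
}

words <- c("hadoop", "uses", "mapreduce", "there", "is", "a", "map", "phase", "reduce")
split(words, partition_for(words))   # which reducer sees which keys
```

Every occurrence of a given word lands on the same reducer, which is what lets each reducer total its words independently.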
  • Input → Mappers → Sort, Shuffle → Reducers
    Each reducer receives its keys with the values grouped into a list:
        0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1])
        m-q:      (map, [1]), (mapreduce, [1]), (phase, [1,1])
        r-z:      (reduce, [1]), (there, [1,1]), (uses, [1])
  • Input → Mappers → Sort, Shuffle → Reducers → Output
    Each reducer sums the list of values for each key and writes the result:
        0-9, a-l: (a, [1,1]) → a 2; (hadoop, [1]) → hadoop 1; (is, [1,1]) → is 2
        m-q:      (map, [1]) → map 1; (mapreduce, [1]) → mapreduce 1; (phase, [1,1]) → phase 2
        r-z:      (reduce, [1]) → reduce 1; (there, [1,1]) → there 2; (uses, [1]) → uses 1
  • Input → Mappers → Sort, Shuffle → Reducers → Output
    - Map: Transform one input to 0-N outputs.
    - Reduce: Collect multiple inputs into one output.
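The whole flow walked through on these slides can be reproduced on a laptop in base R. The sketch below involves no Hadoop at all: `lapply` stands in for the mappers, `split` for the sort-and-shuffle step, and `sum` for the reducers. The function and variable names are illustrative.

```r
input <- c("Hadoop uses MapReduce", "There is a Map phase", "", "There is a Reduce phase")

# Map: one input line -> zero or more (word, 1) pairs
map_line <- function(line) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  data.frame(key = words, value = rep(1, length(words)), stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(input, map_line))

# Sort and shuffle: gather every value emitted for the same key
shuffled <- split(mapped$value, mapped$key)

# Reduce: many values for one key -> one output value
counts <- vapply(shuffled, sum, numeric(1))
print(counts)
```

The empty input line yields no pairs, which illustrates the "0-N outputs" part of the Map definition.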
  • wordcount: code
        library(rmr2)

        # map: split each chunk of lines on whitespace and emit (word, 1) pairs
        map = function(k, lines) {
          words.list = strsplit(lines, '\\s')
          words = unlist(words.list)
          return( keyval(words, 1) )
        }

        # reduce: sum the counts collected for one word
        reduce = function(word, counts) {
          keyval(word, sum(counts))
        }

        wordcount = function (input, output = NULL) {
          mapreduce(input = input, output = output, input.format = "text",
                    map = map, reduce = reduce)
        }
    from Revolution Analytics’ Getting Started with RHadoop course
  • wordcount: submit job and fetch results
    Submit job
        > hdfs.root = 'wordcount'
        > hdfs.data = file.path(hdfs.root, 'data')
        > hdfs.out = file.path(hdfs.root, 'out')
        > out = wordcount(hdfs.data, hdfs.out)
    Fetch results from HDFS
        > results = from.dfs( out )
        > results.df = as.data.frame(results, stringsAsFactors=F )
        > colnames(results.df) = c('word', 'count')
        > head(results.df)
               word count
        1 greatness     2
        2    damned     3
        3       tis     5
        4      jade     1
        5  magician     1
        6     hands     2
  • Code notes
    Scalable
    - Hadoop and MapReduce abstract away system details
    - Code runs on 1 node or 1,000 nodes without modification
    Portable
    - You write normal R code, interacting with normal R objects
    - RHadoop’s rmr2 library abstracts away Hadoop details
    - All the functionality you expect is there, including Enterprise R’s
    Flexible
    - Only the mapper deals with the data directly
    - All components communicate via key-value pairs
    - Key-value “schema” chosen for each analysis rather than as a prerequisite to loading data into the system
  • Data silos have developed by source, type, system, department...
    [Diagram labels: Customer Service, Call Center; Website search, Clickstreams; Operations & Finance; Social Media]
  • Caution: Silos Kill
    Photo: AP Photo/The Kansas City Star, Keith Myers
  • Poll: How do you intend to use R & Hadoop?
  • Scaling up a legacy modeling environment
    Scenario
    - A financial services client has built a model to predict credit risk using a legacy system, but now wants to run the scoring algorithm more frequently than is possible in the existing hardware- (and license-) bound environment
    Solution
    - Model coefficients are exported from the legacy system and written to HDFS as a text file, so they are available for our R code to use while scoring the entire data set nightly
  • “Side loading” data made easy
        [...]
        # first load in our coefficients
        # Note that this data.frame will be
        # _automatically_ distributed to each node
        COEFFS.DF = as.data.frame(
          from.dfs(hdfs.coeffs, format = make.input.format("csv", sep="\t")) )

        out = mapreduce(input = hdfs.data, output = hdfs.out,
                        input.format = make.input.format("csv", sep="\t"),
                        map = scoring.map, verbose=T)
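The side-loading pattern leans on ordinary R scoping: the map function refers to an object defined outside it. The sketch below imitates that in base R with no cluster involved. The coefficient values, the column names, and this body of `scoring.map` are all hypothetical (the deck does not show the function), and `lapply` over a list of chunks stands in for input splits handled on different nodes.

```r
# Stand-in for coefficients read once from HDFS (hypothetical values)
COEFFS.DF <- data.frame(term     = c("intercept", "x1", "x2"),
                        estimate = c(-1, 0.5, 2))

# The map function uses COEFFS.DF without receiving it as an argument
scoring.map <- function(k, chunk) {
  beta   <- COEFFS.DF$estimate
  linear <- beta[1] + as.matrix(chunk) %*% beta[-1]
  list(key = k, val = as.vector(1 / (1 + exp(-linear))))   # logistic score per row
}

# Two chunks play the role of input splits processed on different nodes
chunks <- list(data.frame(x1 = c(0, 2), x2 = c(0.5, 0)),
               data.frame(x1 = 4,       x2 = -0.5))
scores <- lapply(seq_along(chunks), function(i) scoring.map(i, chunks[[i]]))
str(scores)
```

In the slide's version, rmr2 is said to take care of shipping the data frame to each node, so the map function can be written exactly as if everything were local.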
  • Logistic regression scoring “by hand”
        score.logit = function(coeffs, data, reverse.coeffs.signs=F)
        {
          if( length(coeffs) != length(data) + 1 )
          {
            warning("Bad row: ", data)
            return(NULL)
          }

          # SAS may reverse the signs of the coefficients
          # (this assumes the intercept should be changed as well)
          if (reverse.coeffs.signs)
            coeffs = -1 * coeffs

          intercept = coeffs[1]
          coeffs = coeffs[-1]

          return( 1/(1 + exp(-intercept - sum(coeffs * data))) )
        }
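A short worked example may help check the scoring formula. The function below restates the slide's logic in compact form so the block runs on its own. The coefficient and data values are made up for illustration, and base R's `plogis` (the logistic function) serves as a cross-check.

```r
# Compact restatement of the slide's scoring function
score.logit <- function(coeffs, data, reverse.coeffs.signs = FALSE) {
  if (length(coeffs) != length(data) + 1) {
    warning("Bad row: ", paste(data, collapse = " "))
    return(NULL)
  }
  if (reverse.coeffs.signs) coeffs <- -1 * coeffs
  intercept <- coeffs[1]
  1 / (1 + exp(-intercept - sum(coeffs[-1] * data)))
}

coeffs <- c(-1, 0.5, 0.25)   # hypothetical: intercept, then one weight per predictor
x      <- c(2, 4)            # one row of predictors

# Linear predictor: -1 + 0.5*2 + 0.25*4 = 1
p         <- score.logit(coeffs, x)
p_check   <- plogis(1)                                           # same value from base R
p_flipped <- score.logit(coeffs, x, reverse.coeffs.signs = TRUE) # linear predictor -1
c(p = p, p_check = p_check, p_flipped = p_flipped)
```

Reversing all signs negates the linear predictor, so the two scores sum to 1. That makes for a quick sanity test when deciding whether exported coefficients need `reverse.coeffs.signs`.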
  • Network security threat detection
    Scenario
    - A risk-conscious company with a high-profile web presence was seeking a way to detect potential attacks using captured network traffic (PCAP files).
    Solution
    - Typical traffic was modeled using a combination of algorithms on the full data set.
    - Baseline results from this “training” phase are stored on HDFS and can be refreshed periodically, as needed.
    - A detection job loads the baseline results and analyzes incoming network traffic looking for anomalies.
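The train-then-detect split described here can be sketched with a deliberately simple stand-in. The example below uses a z-score threshold on made-up request counts. It is not the combination of algorithms used for the client, which the deck does not detail, and every name and number in it is hypothetical.

```r
# "Training" phase: summarise typical traffic into a small baseline object
train_baseline <- function(x) list(mean = mean(x), sd = sd(x))

# Detection phase: flag observations far from the stored baseline
detect_anomalies <- function(x, baseline, threshold = 3) {
  abs(x - baseline$mean) / baseline$sd > threshold
}

typical  <- c(100, 104, 98, 101, 97, 103, 99, 102)   # hypothetical requests per minute
baseline <- train_baseline(typical)                  # would be stored on HDFS and refreshed

incoming <- c(101, 96, 250, 105)
flags    <- detect_anomalies(incoming, baseline)
incoming[flags]                                      # observations flagged as anomalous
```

The point of the design is that the baseline is small: it can be computed once on the full data set, saved, and loaded by each detection job.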
  • Big Data Warehousing with Hive
    - Hive supplies a SQL-like query language that is familiar to those with relational database experience
    - But Hive compiles, optimizes, and executes these queries as MapReduce jobs on the Hadoop cluster
    - Can be used in conjunction with other Hadoop jobs, such as those written with rmr2
  • Hive architecture & access
    [Diagram labels:
     - Access: Terminal (CLI); browser (HWI); RODBC, RJDBC, etc. (JDBC, ODBC, via the Thrift Server)
     - Hive: Driver (compiles, optimizes, executes); Metastore
     - Hadoop: Master (Job Tracker, Name Node); DFS]
  • Hadoop as a “drop-in” database replacement
    Scenario
    - A market research company had developed a large collection of mature R code to conduct mission-critical analysis. Having reached the limits of their legacy database infrastructure, the company decided to move their data to Hadoop, but wanted to avoid a “Big Bang” project in order to preserve existing functionality.
    Solution
    - As a stop-gap solution, ODBC connections to Hive have replaced connections to the legacy relational database with minimal recoding.
    - Existing functionality was preserved while the team got up to speed with their new Hadoop environment and its capabilities.
    - Solution allows for parallel testing of two systems.
  • Accessing Hive via ODBC/JDBC
        library(RJDBC)

        # set the classpath to include the JDBC driver location, plus commons-logging
        [...]
        class.path = c(hive.class.path, commons.class.path)
        drv = JDBC("org.apache.hadoop.hive.jdbc.HiveDriver", classPath=class.path, "`")

        # make a connection to the running Hive Server:
        conn = dbConnect(drv, "jdbc:hive://localhost:10000/default")

        # setting the database name in the URL doesn't help,
        # so issue 'use databasename' command:
        res = dbSendQuery(conn, 'use mydatabase')

        # submit the query and fetch the results as a data.frame:
        df = dbGetQuery(conn, 'SELECT name, sub FROM employees LATERAL VIEW explode(subordinates) subView AS sub')
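For readers new to Hive's `LATERAL VIEW explode(...)`, the query above returns one row per (employee, subordinate) pair. The base R sketch below mimics that result on a toy table. The `employees` data is invented for illustration and nothing here talks to Hive.

```r
# Toy stand-in for the Hive table: one row per employee, with an array-like column
employees <- data.frame(name = c("alice", "bob", "carol"), stringsAsFactors = FALSE)
employees$subordinates <- list(c("dave", "erin"), "frank", character(0))

# Equivalent of: SELECT name, sub FROM employees
#                LATERAL VIEW explode(subordinates) subView AS sub
n_subs   <- lengths(employees$subordinates)
exploded <- data.frame(name = rep(employees$name, times = n_subs),
                       sub  = unlist(employees$subordinates),
                       stringsAsFactors = FALSE)
print(exploded)
```

An employee with no subordinates (carol here) contributes no rows, which matches how a plain `LATERAL VIEW` treats an empty array.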
  • Preprocessing data with Hive
    Scenario
    - A marketing analytics team had built their own segmentation model based on their own database of web visitors.
    - A partnership with a new data provider brought an order of magnitude more visitors to segment.
    Solution
    - A series of MapReduce jobs (Hive queries and RHadoop rmr2) was used to match, filter, and transform the data into a manageable size for interactive modeling with Revolution R Enterprise.
    [Diagram labels: Raw data; Match; Filter; Extract; Sample; Segmented Output; Interactive modeling & exploratory analytics in Revolution R Enterprise]
  • Other ways to use R and Hadoop
    HDFS
    - Revolution R Enterprise can read and write files directly on the distributed file system
    - Files can include ScaleR’s XDF-formatted data sets
    MapReduce
    - Many other R packages have been written to use R and Hadoop together, including RHIPE, segue, Oracle’s R Connector for Hadoop, etc.
    Hive
    - Hadoop Streaming is also available for Hive to leverage functionality external to Hadoop and Java
    - RHive leverages Rserve to connect the two: http://cran.r-project.org/web/packages/RHive/
  • Want to learn more?
    Upcoming public Getting Started with RHadoop 1-day classes
    - Hands-on examples and exercises covering rhdfs, rhbase, and rmr2
    - Algorithms and data include wordcount, analysis of airline flight data, and collaborative filtering using structured and unstructured data from text, CSV files and Twitter
    - February 25, 2013 - Palo Alto, CA
    - March 13, 2013 - Boston, MA
    - more to come: http://www.eventbrite.com/org/3179781746?s=12196176
    - (also available as private, on-site training)
    Revolution Analytics Quick Start Program for Hadoop
    - Private Getting Started with RHadoop training
    - Onsite consulting assistance for initial use case
    - Revolution R for Hadoop licenses and support
    - http://www.revolutionanalytics.com/services/consulting/hadoop-quick-start.php
  • Thank you! Any Questions?