Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014

At StampedeCon 2014, Sukhendu Chakraborty (RichRelevance) presented "Big Data Analytics made easy using Apache Hive to R Connector."

As the leading omni-channel personalization provider, RichRelevance fully harnesses the power of Hadoop to handle petabytes of data coming from both online (clickstream) and offline (e.g. in-store) sources. Given this wealth of customer data at RichRelevance, omni-channel data integration and analytics are critical. One of the major challenges is to consolidate online, mobile, social, and other data sources to create a single view of users for making more insightful decisions.

Our use cases require clickstream analytics that leverage Apache Hive and R. Apache Hive is a good tool for performing ELT and basic analytics, but it is limited in statistical analysis and data exploration capabilities. R, on the other hand, has become a preferred language for analytics, as it offers a wide variety of statistical and graphical packages. The downside is that R is typically single-threaded and memory-intensive (all the data needs to be in memory), which makes it impractical for working with data at scale on its own.

Through a series of use cases, we will present how our version of the R to Hive connector allows us to bridge the gap between R and Hive and make big data analysis using R on terabytes of data feasible. This framework takes us a step closer to the notion of a “one solution fits all” principle where we are no longer restricted by a single compute mechanism. It is our attempt to bring the two worlds closer, such that the data source is agnostic to the tools which are used to access it.
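The abstract does not show the connector's interface, so the following is only a minimal sketch of the general idea, written in Python rather than R and not based on RichRelevance's code. It assumes a hypothetical `HiveTable` handle that records operations and compiles them to HiveQL, so that the heavy lifting runs in Hive and only a small aggregated result would need to reach the analytics client. The table and column names are made up for illustration.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class HiveTable:
    """Hypothetical lazy handle to a Hive table: it records operations
    and compiles them to HiveQL instead of loading rows into memory."""

    name: str
    filters: Tuple[str, ...] = ()
    group_cols: Tuple[str, ...] = ()
    aggregates: Tuple[str, ...] = ()

    def where(self, condition: str) -> "HiveTable":
        # Conditions are raw HiveQL fragments; no escaping is attempted here.
        return replace(self, filters=self.filters + (condition,))

    def group_by(self, *cols: str) -> "HiveTable":
        return replace(self, group_cols=self.group_cols + cols)

    def agg(self, *exprs: str) -> "HiveTable":
        return replace(self, aggregates=self.aggregates + exprs)

    def to_hiveql(self) -> str:
        # Select grouping columns first, then aggregates; fall back to "*".
        select_list = ", ".join(self.group_cols + self.aggregates) or "*"
        query = f"SELECT {select_list} FROM {self.name}"
        if self.filters:
            query += " WHERE " + " AND ".join(f"({c})" for c in self.filters)
        if self.group_cols:
            query += " GROUP BY " + ", ".join(self.group_cols)
        return query


clicks = HiveTable("clickstream")  # assumed table name, for illustration only
query = (
    clicks.where("dt = '2014-05-29'")
    .group_by("site_id")
    .agg("COUNT(*) AS clicks")
    .to_hiveql()
)
print(query)
```

Each method returns a new handle, so nothing is executed until the generated query is submitted to Hive by whatever client the surrounding system uses.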


  1. Big Data Analytics made easy using Apache Hive to R Connector
     Sukhendu Chakraborty, DataMesh Team @ {rr}
     © 2014 RichRelevance, Inc. All Rights Reserved. Confidential.
  2. (no text extracted)
  3. Did You Know?
     • Our cloud-based platform supports both real-time processes and analytical use cases, using technologies such as Crunch, Hive, HBase, Avro, Kafka, and R
     • Someone clicks on a {rr} recommendation every 21 milliseconds
     • Our data capacity includes a 1.5 PB Hadoop infrastructure, which enables us to employ 100+ algorithms in real time
     • In the US, we serve 7,000 requests per second with an average response time of 50 ms
  4. (no text extracted)
  5. What is R?
     • A letter in the English alphabet
     • An open-source statistical language for data analytics
       – Simple: Easy to install and program
       – Popular: One of the most widely used open-source statistical tools
       – Powerful: Rich set of packages (> 4000) to perform statistical analysis and plotting
       – More info: http://cran.us.r-project.org/
  6. But…
     • Performance issues
       – Typically single-threaded
       – All the data needs to be in memory
       – Not scalable
     • Need to know the internals to make it perform well
  7. What’s out there
     • RHadoop/RMR
       – Uses Hadoop MR to distribute data in the Hadoop cluster
       – No transparency: Limited data preparation support
     • RHIPE
       – Similar to RHadoop
       – Protobuf dependency
     • RHive
       – Lets you run HIVE queries from R functions
       – Users need to know HQL
       – Needs Rserve + rJava
  8. R @ {rr} - so far
     Diagram labels: {rr} cluster, R client, HIVE queries, Data access
  9. • Transparency Layer
     • Pluggable Query generation
     • R as an analytical platform
       – Data cleanup
       – Ad-hoc analytics
       – Data preparation
       – Distributed analytics using Hadoop
       – Result summarization and publishing
     Diagram labels: R HIVE connector, HIVE (UC 1), MR (UC 2)
  10. OO programming in R
     • S4 class system - classes and objects
     • Methods and multiple dispatch
     • Object validity checking
     • Extensible: setGeneric()
     • Quick overview: http://www.r-project.org/conferences/useR-2004/Keynotes/Leisch.pdf
  11. Use Case I: Rollups in HIVE
  12. (no text extracted)
  13. (no text extracted)
  14. (no text extracted)
  15. (no text extracted)
  16. (no text extracted)
  17. Use Case II: Distributed Analytics
  18. (no text extracted)
  19. (no text extracted)
  20. (no text extracted)
  21. R @ {rr}
     Diagram labels: {rr} cluster, R client, R HIVE connector, Data access
  22. Future Work
     • Extend the connector to handle other data sources
     • Add custom Analytical functions
     • Asynchronous execution
     • Performance tuning
  23. Thank You
  24. Questions?
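The slides name "pluggable query generation" and a rollup use case, but the slides that would show the details did not survive in this transcript. As a rough illustration only, and in Python rather than the S4 classes the talk mentions, pluggable generation could look like a registry of per-backend functions that each render the same logical rollup. The registry, the function names, and the `"ansi"` backend are assumptions made for this sketch. The `GROUP BY ... WITH ROLLUP` form is HiveQL syntax, and `GROUP BY ROLLUP(...)` is the standard SQL form.

```python
from typing import Callable, Dict, Sequence

# A generator turns (table, dimensions, measure) into a query string.
Generator = Callable[[str, Sequence[str], str], str]

# Hypothetical registry: backend name -> query generator.
_GENERATORS: Dict[str, Generator] = {}


def register(backend: str) -> Callable[[Generator], Generator]:
    """Decorator that plugs a generator in under a backend name."""
    def decorator(fn: Generator) -> Generator:
        _GENERATORS[backend] = fn
        return fn
    return decorator


@register("hive")
def hive_rollup(table: str, dims: Sequence[str], measure: str) -> str:
    cols = ", ".join(dims)
    # HiveQL spelling of a rollup over the grouping columns.
    return (f"SELECT {cols}, SUM({measure}) AS total "
            f"FROM {table} GROUP BY {cols} WITH ROLLUP")


@register("ansi")
def ansi_rollup(table: str, dims: Sequence[str], measure: str) -> str:
    cols = ", ".join(dims)
    # Standard SQL spelling of the same rollup.
    return (f"SELECT {cols}, SUM({measure}) AS total "
            f"FROM {table} GROUP BY ROLLUP({cols})")


def rollup_query(backend: str, table: str, dims: Sequence[str], measure: str) -> str:
    """Render a rollup for the chosen backend, or fail loudly if none is registered."""
    try:
        generator = _GENERATORS[backend]
    except KeyError:
        raise ValueError(f"no query generator registered for {backend!r}") from None
    return generator(table, dims, measure)


# Table and column names below are invented for the example.
print(rollup_query("hive", "sales", ["region", "category"], "revenue"))
```

Adding a backend then means registering one more function; the calling code stays the same, which is the property a pluggable layer is meant to provide.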

Editor's Notes

  • Nuggets or Data Points
    1.5 PB is not as big as Yahoo or Facebook, but it is huge from a retail industry perspective
