© 2014 RichRelevance, Inc. All Rights Reserved. Confidential.
Sukhendu Chakraborty
DataMesh Team @ {rr}
Big Data Analytics...

Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014


At StampedeCon 2014, Sukhendu Chakraborty (RichRelevance) presented "Big Data Analytics made easy using Apache Hive to R Connector."

As the leading omni-channel personalization provider, RichRelevance fully harnesses the power of Hadoop to handle petabytes of data coming from both online (clickstream) and offline (e.g. in-store) sources. Given this wealth of customer data at RichRelevance, omnichannel data integration and analytics are critical. One of the major challenges is to consolidate online, mobile, social, and other data sources to create a single view of users for making more insightful decisions.

Our use cases require clickstream analytics that leverage Apache Hive and R. Apache Hive is a good tool for performing ELT and basic analytics, but it is limited in its statistical analysis and data exploration capabilities. R, on the other hand, has become a preferred language for analytics, as it offers a wide variety of statistical and graphical packages. The downside is that R is single-threaded and memory-intensive, which makes it impractical for working with data at scale.

Through a series of use cases, we will present how our version of the R to Hive connector allows us to bridge the gap between R and Hive and makes big data analysis using R on terabytes of data feasible. This framework takes us a step closer to a “one solution fits all” approach in which we are no longer restricted to a single compute mechanism. It is our attempt to bring the two worlds closer, so that the data source is agnostic to the tools used to access it.
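
One way to picture the bridge described above is a layer where the analyst describes an aggregation as plain data and a swappable generator turns that description into HiveQL. The sketch below is only an illustration of that idea, not the RichRelevance connector itself: the `Aggregation` type, the `register`/`generate` helpers, and the `clicks` table with its `site_id` column are all hypothetical names introduced here, and the original connector is written in R rather than Python.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Aggregation:
    """Backend-neutral description of a grouped aggregate (hypothetical type)."""
    table: str
    group_by: List[str]
    metrics: Dict[str, str]  # output column name -> aggregate expression


# Registry of query generators keyed by backend name; swapping the
# generator is what makes the query generation "pluggable" in this sketch.
GENERATORS: Dict[str, Callable[[Aggregation], str]] = {}


def register(backend: str):
    """Decorator that records a generator function under a backend name."""
    def wrap(fn: Callable[[Aggregation], str]) -> Callable[[Aggregation], str]:
        GENERATORS[backend] = fn
        return fn
    return wrap


@register("hive")
def to_hiveql(agg: Aggregation) -> str:
    """Render the description as a single HiveQL SELECT statement."""
    select = list(agg.group_by) + [
        f"{expr} AS {name}" for name, expr in agg.metrics.items()
    ]
    sql = f"SELECT {', '.join(select)} FROM {agg.table}"
    if agg.group_by:
        sql += f" GROUP BY {', '.join(agg.group_by)}"
    return sql


def generate(agg: Aggregation, backend: str = "hive") -> str:
    """Look up the generator for `backend` and build the query text."""
    try:
        return GENERATORS[backend](agg)
    except KeyError:
        raise ValueError(f"no query generator registered for {backend!r}") from None


# Hypothetical table and column names, for illustration only.
query = generate(Aggregation("clicks", ["site_id"], {"n_clicks": "COUNT(*)"}))
print(query)
```

The caller never writes HQL by hand, which is the gap the abstract says the connector closes; supporting another backend would mean registering one more generator function under a new name.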

Published in: Technology
  • Nuggets or Data Points
    1.5 PB is not as big as Yahoo or Facebook, but it is huge from a retail industry perspective
  • Big Data Analytics made easy using Apache Hive to R Connector - StampedeCon 2014

    1. © 2014 RichRelevance, Inc. All Rights Reserved. Confidential.
       Sukhendu Chakraborty, DataMesh Team @ {rr}
       Big Data Analytics made easy using Apache Hive to R Connector
    3. Our cloud-based platform supports both real-time processes and analytical use cases, using technologies such as Crunch, Hive, HBase, Avro, Kafka, and R.
       Did You Know?
       • Someone clicks on a {rr} recommendation every 21 milliseconds
       • Our data capacity includes a 1.5 PB Hadoop infrastructure, which enables us to employ 100+ algorithms in real-time
       • In the US, we serve 7000 requests per second with an average response time of 50 ms
    5. What is R?
       • A letter in the English alphabet
       • An open-source statistical language for data analytics
         – Simple: Easy to install and program
         – Popular: One of the most widely used open-source statistical tools
         – Powerful: Rich set of packages (> 4000) to perform statistical analysis and plotting
         – More info: http://cran.us.r-project.org/
    6. But…
       • Performance issues
         – Typically single threaded
         – All the data needs to be in memory
         – Not scalable
       • Need to know the internals to make it perform well
    7. What’s out there
       • RHadoop/RMR
         – Uses Hadoop MR to distribute data in the Hadoop cluster
         – No transparency: Limited data preparation support
       • RHIPE
         – Similar to RHadoop
         – Protobuf dependency
       • RHive
         – Lets you run HIVE queries from R functions
         – Users need to know HQL
         – Needs Rserve + rJava
    8. R @ {rr} - so far
       [Diagram: {rr} cluster; R client; HIVE queries; Data access]
    9. • Transparency Layer
       • Pluggable Query generation
       • R as an analytical platform
         – Data cleanup
         – Ad-hoc analytics
         – Data preparation
         – Distributed analytics using Hadoop
         – Result summarization and publishing
       [Diagram: R HIVE connector; HIVE (UC 1); MR (UC 2)]
    10. OO programming in R
        • S4 class system - classes and objects
        • Methods and multiple dispatch
        • Object validity checking
        • Extensible: setGeneric()
        • Quick overview: http://www.r-project.org/conferences/useR-2004/Keynotes/Leisch.pdf
    11. Use Case I: Rollups in HIVE
    17. Use Case II: Distributed Analytics
    21. R @ {rr}
        [Diagram: {rr} cluster; R client; R HIVE connector; Data access]
    22. Future Work
        • Extend the connector to handle other data sources
        • Add custom Analytical functions
        • Asynchronous execution
        • Performance tuning
    23. Thank You
    24. Questions?
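
The slides for Use Case I are images, so the actual rollups the team ran are not in the transcript. As a rough illustration of what a Hive rollup can look like, the sketch below builds a HiveQL statement that uses Hive's `GROUP BY ... WITH ROLLUP` clause and lists the subtotal levels that clause produces. The helper names (`rollup_query`, `rollup_levels`) and the `clicks` table with its `day`, `site_id`, and `revenue` columns are assumptions made for this example only.

```python
from typing import List


def rollup_query(table: str, dims: List[str], measure: str) -> str:
    """Build a HiveQL statement that subtotals `measure` over `dims`."""
    if not dims:
        raise ValueError("at least one dimension is required for a rollup")
    cols = ", ".join(dims)
    return (
        f"SELECT {cols}, SUM({measure}) AS total_{measure} "
        f"FROM {table} GROUP BY {cols} WITH ROLLUP"
    )


def rollup_levels(dims: List[str]) -> List[List[str]]:
    """List the grouping levels WITH ROLLUP yields, most detailed first.

    For dims [a, b] these are (a, b), (a), and the grand total ().
    """
    return [dims[:i] for i in range(len(dims), -1, -1)]


# Hypothetical table and column names, for illustration only.
print(rollup_query("clicks", ["day", "site_id"], "revenue"))
print(rollup_levels(["day", "site_id"]))
```

Pushing the rollup into Hive keeps the heavy aggregation on the cluster, so only the much smaller summary needs to fit in the R client's memory.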
