Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email Experience for Millions of Users

“Big Data Tools to Build Personalization”

More at http://webexpo.net/prague2013/talk/using-hadoop-and-hbase-to-personalize-web-mobile-and-email-experience-for-millions-of-users/

  • The relevance problem can coarsely be divided into two conceptual parts: algorithmic aspects and scale-related issues. We’ll start on the algorithmic side of things.

Transcript

  • 1. USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar
  • 2. Ameya Kanitkar – That's me! • Big Data Infrastructure Engineer @ Groupon, Palo Alto USA (Working on Deal Relevance & Personalization Systems) ameya.kanitkar@gmail.com http://www.linkedin.com/in/ameyakanitkar @aktwits
  • 3. Agenda • Basics of Hadoop & HBase • How you can use Hadoop & HBase for big data applications • Case Study: Deal Relevance and Personalization Systems at Groupon with Hadoop & HBase
  • 4. Big Data Application Examples • Recommendation Systems • Ad targeting • Personalization Systems • BI/DW • Log Analysis • Natural Language Processing
  • 5. So what is Hadoop? • General purpose framework for processing huge amounts of data. • Open Source • Batch / Offline Oriented
  • 6. Hadoop - HDFS • Open Source Distributed File System. • Stores large files. Can easily be accessed via applications built on top of HDFS. • Data is distributed and replicated over multiple machines • Linux-style commands, e.g. ls, cp, mv, touchz etc.
  • 7. Hadoop – HDFS • Example: hadoop fs -dus /data/ 185453399927478 bytes ≈ 168 TB (One of the folders from one of our Hadoop clusters)
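The byte count on slide 7 can be checked with a quick conversion. This is a minimal Python sketch, assuming the slide's "TB" means binary terabytes (1024^4 bytes each). The helper name `bytes_to_tib` is made up for illustration and is not part of Hadoop.

```python
def bytes_to_tib(n_bytes: int) -> float:
    """Convert a byte count to binary terabytes (TiB, 1024**4 bytes each)."""
    return n_bytes / 1024 ** 4


size_bytes = 185453399927478  # figure reported on the slide
size_tib = bytes_to_tib(size_bytes)
print(f"{size_tib:.1f} TiB")
```

With decimal terabytes (10^12 bytes) the same folder would come out at roughly 185 TB, so the slide's figure appears to use the binary unit.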
  • 8. Hadoop – Map Reduce • Application Framework built on top of HDFS to process your big data • Operates on key-value pairs • Mappers filter and transform input data • Reducers aggregate mapper output
  • 9. Example • Given web logs, calculate landing page conversion rate for each product • So basically we need to see how many impressions each product received and then calculate the conversion rate for each product
  • 10. Map Reduce Example • Map Phase: Map 1, Map 2, …, Map N: Process Log File. Output: Key (Product ID), Value (Impression Count) • Reduce Phase: Reducer: Here we receive all data for a given product. Just run a simple for loop to calculate conversion rate. (Output: Product ID, Conversion Rate)
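The map and reduce phases on slide 10 can be sketched in plain Python, without a Hadoop cluster. The log format (`product_id,event`), the event names, and the definition of conversion rate as purchases divided by impressions are assumptions made for illustration. The slide only states that mappers emit per-product counts and that the reducer loops over them.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Mapper: emit (product_id, (impressions, purchases)) for each log line."""
    for line in log_lines:
        product_id, event = line.strip().split(",")
        if event == "impression":
            yield product_id, (1, 0)
        elif event == "purchase":
            yield product_id, (0, 1)


def shuffle(pairs):
    """Group mapper output by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(product_id, values):
    """Reducer: a simple loop over all values received for one product."""
    impressions = purchases = 0
    for imp, pur in values:
        impressions += imp
        purchases += pur
    rate = purchases / impressions if impressions else 0.0
    return product_id, rate


# Hypothetical log lines in the assumed "product_id,event" format.
logs = [
    "p1,impression",
    "p1,impression",
    "p1,purchase",
    "p2,impression",
]
conversion = dict(reduce_phase(k, v) for k, v in shuffle(map_phase(logs)).items())
print(conversion)
```

In a real job the shuffle step is done by the framework and each reducer call may run on a different machine. Only the mapper and reducer bodies are application code.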
  • 11. Recap • We just processed terabytes of data, and calculated conversion rate across millions of products. • Note: This is a batch process only. It takes time. You cannot start this process after someone visits your website. How about we generate recommendations in a batch process and serve them in real time?
  • 12. HBase • Provides real time random read/write access over HDFS • Built on Google's 'Bigtable' design • Open Source • This is not an RDBMS, so no joins. Access patterns are generally simple like get(key), put(key, value) etc.
  • 13. Row Cf:<qual> Cf:<qual> Cf:<qual> Row 1 Cf1:qual1 Cf1:qual2 Row 11 Cf1:qual2 Cf1:qual22 Row 2 …. Cf2:qual1 Cf1:qual3 Row N • Dynamic Column Names. No need to define columns upfront. • Both rows and columns are sorted lexicographically
  • 14. Row Cf:<qual> …. user1 Cf1:click_history:{actual_clicks_data} Cf1:purchases:{actual_purchases} user11 Cf1:purchases:{actual_purchases} user20 Cf1:mobile_impressions:{actual mobile impressions} Cf1:purchases:{actual_purchases} Note: Each row has different columns, so think about this as a hash map rather than a table with rows and columns
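The "hash map rather than a table" idea from slides 13 and 14 can be sketched with an in-memory stand-in. `ToyTable` is a hypothetical class written for this illustration, not the HBase client API. It models only three properties named on the slides: get/put by key, per-row dynamic columns, and sorted rows and columns. The row `user2` is not on the slide and is added to make the string ordering visible.

```python
class ToyTable:
    """In-memory stand-in for an HBase table: row key -> {"cf:qualifier": value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        # Columns are created on first write; nothing is declared upfront.
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        # Return the row's columns sorted by column name.
        return dict(sorted(self._rows.get(row_key, {}).items()))

    def scan(self):
        # Rows are visited in sorted key order (string order, not numeric).
        for row_key in sorted(self._rows):
            yield row_key, self.get(row_key)


table = ToyTable()
table.put("user20", "cf1:purchases", "{actual_purchases}")
table.put("user20", "cf1:mobile_impressions", "{actual mobile impressions}")
table.put("user1", "cf1:purchases", "{actual_purchases}")
table.put("user1", "cf1:click_history", "{actual_clicks_data}")
table.put("user11", "cf1:purchases", "{actual_purchases}")
table.put("user2", "cf1:purchases", "{actual_purchases}")  # extra row, not on the slide

row_order = [row_key for row_key, _ in table.scan()]
print(row_order)
print(list(table.get("user1")))
```

HBase compares keys as raw bytes. For these ASCII keys Python's string comparison gives the same order, which is why `user11` sorts before `user2`, just as "Row 11" precedes "Row 2" on slide 13.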
  • 15. Putting it all together • Store data in HDFS • Analyze Data (Map Reduce) • Generate Recommendations (Map Reduce) • Serve Real Time Requests (HBase) to Web and Mobile • Do offline analysis in Hadoop, and serve real time requests with HBase
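The split between offline generation and real time serving on slide 15 can be sketched as two small steps. Everything specific here is an assumption for illustration: the "most clicked categories" heuristic, the function names, and the plain dict standing in for the HBase table. None of it is described in the talk as Groupon's actual method.

```python
from collections import Counter


def batch_generate(click_log, top_n=2):
    """Offline step: count clicks per user and category, keep the top N per user."""
    per_user = {}
    for user_id, category in click_log:
        per_user.setdefault(user_id, Counter())[category] += 1
    return {
        user_id: [category for category, _ in counts.most_common(top_n)]
        for user_id, counts in per_user.items()
    }


serving_store = {}  # stands in for the HBase table read by web and mobile


def publish(recommendations):
    """Load the batch output into the serving store."""
    serving_store.update(recommendations)


def serve(user_id):
    """Online step: a single key lookup, no computation at request time."""
    return list(serving_store.get(user_id, []))


# Hypothetical click log of (user_id, category) pairs.
click_log = [("u1", "food"), ("u1", "food"), ("u1", "spa"), ("u2", "travel")]
publish(batch_generate(click_log))
print(serve("u1"))
print(serve("u3"))
```

The request path only does a lookup by user key, which is the simple get(key) access pattern that slide 12 says HBase is suited to.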
  • 16. Use Case: Deal Relevance & Personalization @ Groupon
  • 17. What are Groupon Deals?
  • 18. Our Relevance Scenario Users
  • 19. Our Relevance Scenario: How do we surface relevant deals to users? • Deals are perishable (Deals expire or are sold out) • No direct user intent (As in traditional search advertising) • Relatively Limited User Information • Deals are highly local
  • 20. Two Sides to the Relevance Problem • Algorithmic Issues: How to find relevant deals for individual users given a set of optimization criteria • Scaling Issues: How to handle relevance for all users across multiple delivery platforms
  • 21. Developing Deal Ranking Algorithms • Exploring Data • Understanding signals, finding patterns • Building Models/Heuristics • Employ both classical machine learning techniques and heuristic adjustments to estimate user purchasing behavior • Conduct Experiments • Try out ideas on real users and evaluate their effect
  • 22. Data Infrastructure (charts: Growing Deals and Growing Users, 2011 to 2013; chart labels 20+, 400+, 2000+) • 100 Million+ subscribers • We need to store data like user click history, email records, service logs etc. This amounts to billions of data points and TBs of data
  • 23. Deal Personalization Infrastructure Use Cases • Deliver Personalized Emails • Deliver Personalized Website & Mobile Experience • Offline System (Email): Personalize billions of emails for hundreds of millions of users • Online System: Personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users & page views
  • 24. Architecture • We can now maintain different SLAs on online and offline systems • We can tune the HBase clusters differently for online and offline systems (diagram labels: Data Pipeline, HBase Offline System, Relevance Map/Reduce, Email, Replication, HBase for Online System, Real Time Relevance)
  • 25. HBase Schema Design • User ID: Unique Identifier for Users • Column Family 1: User History and Profile Information. Overwrite user history and profile info • Column Family 2: Email History For Users. Append email history for each day as separate columns. (On avg each row has over 200 columns) • Most of our data access patterns are via “User Key” • This makes it easy to design HBase schema • The actual data is kept in JSON
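The schema on slide 25 can be sketched with a nested dict keyed by user ID. The column names (`cf1:profile`, `cf2:<date>`), the date format, and the sample values are assumptions for illustration. The slide only states that one family is overwritten, the other gets a new column per day, and values are JSON.

```python
import json


def put_profile(table, user_id, profile):
    """Column family 1: overwrite the single history/profile column."""
    table.setdefault(user_id, {})["cf1:profile"] = json.dumps(profile)


def append_email_history(table, user_id, day, emails):
    """Column family 2: add one column per day; earlier days are kept."""
    table.setdefault(user_id, {})[f"cf2:{day}"] = json.dumps(emails)


table = {}  # row key (user ID) -> {column name: JSON string}
put_profile(table, "user1", {"city": "Palo Alto"})
put_profile(table, "user1", {"city": "Prague"})  # replaces the earlier value
append_email_history(table, "user1", "2013-09-01", ["deal-a"])
append_email_history(table, "user1", "2013-09-02", ["deal-b"])

row = table["user1"]
print(sorted(row))
print(json.loads(row["cf1:profile"]))
```

Because every read and write starts from the user key, one row holds everything needed for a user, and the number of columns grows with the number of days of email history.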
  • 26. Cluster Sizing • Hadoop + HBase Cluster: 100+ machine Hadoop cluster, this runs heavy map reduce jobs. The same cluster also hosts a 15 node HBase cluster • HBase Replication • Online HBase Cluster: 10 machine dedicated HBase cluster to serve real time SLA • Machine Profile: 96 GB RAM (HBase 25 GB), 24 Virtual Cores CPU, 8 × 2 TB Disks • Data Profile: 100 Million+ Records, 2 TB+ Data, Over 4.2 Billion Data Points
  • 27. Questions? Thank You! (We are hiring!) www.groupon.com/techjobs