Big Data Warehousing: Building a Relevance Engine using Hadoop, Mahout, and Pig


Published on

Over the past few years, relevant recommendations have become an expected and essential part of the customer experience. From the customer’s perspective, marketing interactions are becoming helpful and time-saving instead of generic, out of context, and annoying. If you shop at any of the major online retailers, such as Amazon or Bluefly, you may think they have somehow gotten inside your head as they present and recommend products relevant to you. This is an exponential improvement over the traditional psycho-demographic profiling and targeting of the “old world”.

We talked about how Mahout can be leveraged to build a recommendation engine with a minimum of coding. We discussed how the open source search and machine learning capabilities of Apache Solr and Mahout can be combined to power large-scale, data-driven applications that pair real-time access with large-scale enrichment and discovery.

Caserta Concepts has grown beyond its roots as a provider of traditional data warehouse and BI consulting to also offer big data warehousing. If you’re a developer and are experienced in Hadoop, Hive, HBase, Mahout, Datameer or other Big Data technologies, we want to get to know you!

For more information, visit

Published in: Technology


  1. Big Data Warehousing Meetup. Today’s Topic: Building a Relevance Engine using Hadoop, Mahout & Pig. Sponsored By:
  2. WELCOME! Joe Caserta Founder & President, Caserta Concepts
  3. Agenda
     • 7:00 Networking: Grab a slice of pizza and a drink...
     • 7:15 Welcome: About the Meetup and about Caserta Concepts. Joe Caserta (President, Caserta Concepts; Author, Data Warehouse ETL Toolkit)
     • 7:30 Big Data Facts and Figures: Interesting observations from the world of Big Data. Erik Laurence (VP Marketing, Caserta Concepts)
     • 7:45 Relevance: Building a Big Data recommendation engine with Mahout. Elliott Cordo (Principal Consultant, Caserta Concepts)
     • 8:15 Machine Learning: Powering large scale data driven real time apps with Apache Solr and Mahout. Grant Ingersoll (Chief Scientist, Lucidworks; Mahout co-founder; Lucene/Solr committer)
     • 8:45 More Networking
     • 9:00 Tell us what you’re up to…
  4. About BDW Meetup
     • Big Data is a complex, rapidly changing landscape
     • We want to share our stories and hear about yours
     • Great networking opportunity for like-minded data nerds
     • Opportunities to collaborate on exciting projects
  5. About Caserta Concepts
     • Industries Served: Financial Services; Healthcare / Insurance; Retail / eCommerce; Digital Media / Marketing; K-12 / Higher Education
     • Focused Expertise: Big Data Analytics; Data Warehousing; Business Intelligence; Strategic Data Ecosystems
     • Founded in 2001
     • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
  6. Client Portfolio: Finance & Insurance; Retail/eCommerce & Manufacturing; Education & Services
  7. Expertise & Offerings: Strategic Roadmap/Assessment/Consulting; Big Data Analytics; Data Warehousing/ETL/Data Integration; BI/Visualization/Analytics; Master Data Management
  8. Big Data at Caserta Concepts
     Caserta Concepts is a blend of the best designers in traditional enterprise data with the best new designers in Big Data.
     • Traditional Data
       • Tools: RDBMS, DQ, MDM, BI, ETL, Analytics
       • Traditional Data: Transactions
     • Big Data
       • Tools: Hadoop, Mahout, Relevance Engine, Analytics
       • New Data: Social, Machine, Deep History, Unstructured
     • Immutable Data Concepts: Transformation, Profiling, Conforming, Processing Efficiency/Speed
  9. Contacts
     • Joe Caserta, President & Founder, Caserta Concepts. P: (855) 755-2246 x227 E:
     • Erik Laurence, VP Marketing, Caserta Concepts. P: (855) 755-2246 x528 E:
     • Elliott Cordo, Principal Consultant, Caserta Concepts. P: (855) 755-2246 x267 E:
     • 1 (855) 755-2246
  10. BIG DATA FACTS AND FIGURES Erik Laurence VP Marketing, Caserta Concepts
  11. What is Really Meant by Big Data?
      • The 4 Vs of Big Data
        • Volume
          • More data than ever before
          • Most of world’s data is unstructured, semi-structured or multi-structured
        • Variety
          • More sources than ever before
          • Social, web logs, machine logs, documents, geotags, video, …
        • Velocity
          • Some data only has value for a short period of time
          • Relevance engines, financial fraud sensors, early warning sensors, etc.
        • Vitality
          • Agility is required in analytics
          • Adapt quickly to changing business needs
      • Chart: 10% Structured; 90% Un/Semi/Multi-Structured
  12. Enterprise Involvement with Big Data
      • Chart: 6% Beyond Pilot Stage; 18% Engaged in Pilot; 76% Not Yet Involved
      • Awareness of Big Data high among enterprises, but three-quarters still wondering, “What is this all about?”
      • Answer across all businesses, “We don’t know what the business case is.”
      • Source: WSJ November 29, 2012
  13. Business Cases Have Been Identified
      • “The use of data and analytics … is going to be a basis of competition going forward for individual firms, for sectors and even for countries. Those companies that are able to use data effectively are more likely to win in the marketplace.” - Michael Chui, McKinsey Global Institute
      • In just one field, personal location data, $100 billion of value can be created globally for service providers through use of data. Benefits for consumers could be six times that.
      • Source: (WSJ 11/29/12)
  14. Big Data Played A Role in the Election
      • “This was the first presidential election campaign where all of the data that was coming into the campaign was successfully collected and centralized. The Obama campaign did a successful job with that; the Obama campaign hired an analytics department five times as large as that of the 2008 operation. Romney campaign did not.” - John Aristotle Phillips, Chief Executive of Aristotle International (WSJ 11/29/12)
  15. Big Data Example in Obama Campaign
      • $40k-a-head dinner in June at Sarah Jessica Parker’s home in NYC
      • 7 different versions of the email solicitation for the event
        • Some mentioned a 2nd fundraiser that night, a Mariah Carey concert
        • Some said Ms. Parker is a mother
        • Some said Vogue editor Anna Wintour would be at the dinner
      • Who got which email depended on big data
        • Profile info about each prospect
        • How they react to different messages
      • Campaign created a single massive system to join info from Democratic voter files to
        • pollsters, fundraisers, field workers and consumer databases, social-media, and mobile contacts
      • Sources: WSJ, Time Magazine
  16. Hadoop Market: Growing & Evolving
      • Big data outranks virtualization as #1 trend driving spending initiatives
        • Barclays CIO Survey, April 2012
      • Overall market at $100B
        • Hadoop 2nd only to RDBMS in potential
      • Estimates put market growth at > 40% CAGR
        • IDC expects Big Data tech and services market to grow to $16.9B in 2015
        • According to JPMC 50% of Big Data market will be influenced by Hadoop
  17. Hadoop Cost Effective for Archiving
      • Hadoop is orders of magnitude cheaper than traditional archival methods
      • Annual cost of 1 TB of archival storage for a credit card company: Tape $30,000; SAN $3,000; Hadoop $300
  18. Hadoop is Fast
      • Sears’ process to analyze loyalty club marketing campaigns took six weeks on mainframe, Teradata, and SAS servers
        • In retail, that’s half the season!
      • New process on Hadoop is done weekly
        • For online and mobile, daily analysis is done
      • What’s more, old models used 10% of data, new models use all the data
      • Source: Information Week (October 31, 2012)
  19. BUILDING A RECOMMENDATION ENGINE Elliott Cordo Principal Consultant, Caserta Concepts
  20. Recommendations
      • Your customers expect them
        • Good recommendations make life easier
        • Help them find information, products, and services they might not have thought of
      • What makes a good recommendation?
        • Relevant but not obvious
        • Sense of “surprise”
  21. Where can recommendation engines be found?
      • They are used in a wide variety of industries and applications:
        • Travel
        • Service Industry
        • Music/Online radio
        • TV and Video
        • Online Publications
        • Retail
        • ...and countless others
  22. Our Use Case: Online Magazine
      Goals:
      • Serve customers recommendations based on what their peers are reading.
      • Recommendations must fit the context of the article they are currently viewing.
  23. Technical Details
      Core Platform:
      • Cloudera Hadoop Cluster
      • Mahout Machine Learning Library
      • Apache Pig
      Additional Technology:
      • Talend Big Data Edition (ETL to/from relational)
      • Datameer (Analysis and Visualization)
  24. How we did it
      Solution leverages three main algorithms:
      • Mahout K-Means – identifying groups of similar articles
      • Mahout Item-Based Recommender – recommendations based on peer behavior
      • Raw Popularity – custom Pig script “people who read this article also read..”
  25. K-Means
      • Treats items as coordinates
      • Places a number of random “centroids” and assigns the nearest items
      • Moves the centroids around based on average location
      • Process repeats until the assignments stop changing
      We used the major attributes of the articles to create coordinate points: Author, Topic, Section, Region, Media, etc.
      *Diagram from Programming Collective Intelligence by Toby Segaran
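
The loop described on slide 25 can be sketched in a few lines of plain Python. This is only an illustration of the steps listed there, not the Mahout K-Means job used in the talk: the `kmeans` function, its parameters, and the sample points are assumptions made for the example, and the centroids are seeded from randomly chosen items. Turning categorical attributes such as Author or Topic into numeric coordinates is a separate step that is not shown.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster numeric points; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Seed the centroids with k distinct items picked at random (assumption).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assign every item to its nearest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical 2-D points standing in for encoded article attributes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
assignments, centroids = kmeans(points, k=2)
```

`assignments[i]` is the cluster index of `points[i]`. Because the starting centroids are random, different seeds can produce different clusterings on harder data.
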
  26. Item-Based Recommender
      • Build an item-item matrix determining relationships between pairs of items (usage)
      • Using the matrix, and the data on the current user, infer their taste
      • We use a dataset containing Customer, Article and Rating
        • Since no rating was available we used a 1 to 5 scale based on age (a ramped 6 month decay)
      • In the output a 0 to 5 scale is calculated, 5 being the most highly recommended for this customer
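
A small Python sketch can make the two ideas on slide 26 concrete: deriving a rating from the age of a read, and scoring unread articles through an item-item similarity matrix. Everything here is an assumption for illustration, not the Mahout recommender itself: the slide does not give the decay formula, so `rating_from_age` uses a linear ramp from 5 (read today) down to 1 (180 days or older); `item_similarities` uses cosine similarity; and `recommend` uses a similarity-weighted average of the customer's own ratings.

```python
from collections import defaultdict
from math import sqrt


def rating_from_age(age_days, window_days=180):
    """Assumed linear ramp: 5.0 for a read today, 1.0 at six months or older."""
    if age_days >= window_days:
        return 1.0
    return 5.0 - 4.0 * max(age_days, 0) / window_days


def item_similarities(ratings):
    """ratings: {customer: {article: rating}} -> {(a, b): cosine similarity}."""
    by_item = defaultdict(dict)
    for customer, items in ratings.items():
        for article, rating in items.items():
            by_item[article][customer] = rating
    norms = {a: sqrt(sum(v * v for v in r.values())) for a, r in by_item.items()}
    sims = {}
    for a in by_item:
        for b in by_item:
            shared = by_item[a].keys() & by_item[b].keys()
            if a == b or not shared:
                continue
            dot = sum(by_item[a][c] * by_item[b][c] for c in shared)
            sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims


def recommend(ratings, customer, top_n=5):
    """Score unread articles by a similarity-weighted average of known ratings."""
    sims = item_similarities(ratings)
    seen = ratings[customer]
    candidates = {a for items in ratings.values() for a in items} - seen.keys()
    scores = {}
    for cand in candidates:
        num = den = 0.0
        for article, rating in seen.items():
            s = sims.get((cand, article), 0.0)
            num += s * rating
            den += s
        if den > 0:
            scores[cand] = num / den
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical usage data: customer -> {article: rating derived from age}.
ratings = {
    "c1": {"A": 5.0, "B": 3.0},
    "c2": {"A": 4.0, "B": 2.0, "C": 5.0},
    "c3": {"B": 4.0, "C": 1.0},
}
recommendations = recommend(ratings, "c1")
```

With ratings between 1 and 5, the weighted average also stays in that range, so a further rescaling step would be needed to reach the 0 to 5 output scale mentioned on the slide.
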
  27. Popularity
      • Self-join the usage dataset on Customer_ID, pairing each Article with the other articles its readers read: Also_Read_Data = JOIN Readers1 BY Customer_ID, Readers2 BY Customer_ID USING 'merge';
      • Group by Article and “Also Read Article”
      • Sort descending by the number of distinct peer customers
      • Limit 25 (most popular “Also Read Article”)
      • In the output a 0 to 5 scale is calculated, 5 being the most popular for a given article
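
The same pipeline from slide 27 can be followed step by step in plain Python, which may help when checking the Pig script's output on a small sample. This is a sketch under assumptions, not the script from the talk: the `also_read` function and the sample data are hypothetical, and the 0 to 5 score is assumed to be the peer count divided by the highest count for that article, times 5.

```python
from collections import defaultdict
from itertools import permutations


def also_read(usage, limit=25):
    """usage: iterable of (customer_id, article) pairs.

    Returns {article: [(also_read_article, distinct_peers, score), ...]}.
    """
    by_customer = defaultdict(set)
    for customer, article in usage:
        by_customer[customer].add(article)

    # Self-join on customer: every ordered pair of articles read by one customer.
    peers = defaultdict(set)
    for customer, articles in by_customer.items():
        for article, other in permutations(sorted(articles), 2):
            peers[(article, other)].add(customer)

    # Group by (article, also-read article), counting distinct peer customers.
    grouped = defaultdict(list)
    for (article, other), customers in peers.items():
        grouped[article].append((other, len(customers)))

    result = {}
    for article, pairs in grouped.items():
        # Sort descending by peer count, keep the top `limit`, scale to 5.
        pairs.sort(key=lambda p: (-p[1], p[0]))
        top = pairs[:limit]
        best = top[0][1]
        result[article] = [(other, n, 5.0 * n / best) for other, n in top]
    return result


# Hypothetical usage records.
usage = [
    ("c1", "A"), ("c1", "B"),
    ("c2", "A"), ("c2", "B"), ("c2", "C"),
    ("c3", "A"), ("c3", "B"), ("c3", "C"),
]
popular = also_read(usage)
```

Here `popular["A"]` lists the articles most often read alongside article A, most popular first.
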
  28. Delivering Recommendations
      Customer views an article online and we are passed their Customer ID and the Article they are viewing. We then do the following:
      1. K-Means – get all items in the same cluster and calculate Euclidean Distance. Reverse and scale 0-5.
      2. Item-Based – get all peer recommendations for this customer
      3. Popularity – get all popular recommendations for this article
      4. Join the three data sets together, add the final rankings and bring back the most highly rated articles.
      Diagram labels: Item-Based: Peers are reading; K-Means: Similar; Popularity: Most popular
  29. Items recommended by more than 1 algorithm are the most highly rated
      Diagram labels: Item-Based: Peers are reading; K-Means: Similar; Popularity: Most popular; Best Recommendations
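
Slides 28 and 29 describe turning cluster distances into 0 to 5 scores and then joining the three result sets. A minimal Python sketch of that idea follows. The details are assumptions, since the slides do not spell them out: `similarity_scores` reverses distances linearly against the farthest item in the cluster, and `combine` adds the three scores without weights, so an article returned by more than one algorithm naturally rises to the top. The item names and scores are made up.

```python
import math


def similarity_scores(current, cluster_items):
    """Reverse Euclidean distances into 0-5 scores (closest item scores highest)."""
    distances = {
        item: math.dist(current, coords) for item, coords in cluster_items.items()
    }
    farthest = max(distances.values(), default=0.0)
    if farthest == 0.0:
        return {item: 5.0 for item in distances}
    return {item: 5.0 * (1.0 - d / farthest) for item, d in distances.items()}


def combine(*score_sets, top_n=5):
    """Join the score sets on article and rank by the (unweighted) sum."""
    totals = {}
    for scores in score_sets:
        for item, score in scores.items():
            totals[item] = totals.get(item, 0.0) + score
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical inputs: coordinates of cluster neighbours plus two score sets.
kmeans_scores = similarity_scores((0.0, 0.0), {"X": (3.0, 4.0), "Y": (6.0, 8.0)})
item_based_scores = {"X": 4.0, "Z": 4.5}
popularity_scores = {"X": 5.0, "Z": 2.0, "W": 3.0}

ranked = combine(kmeans_scores, item_based_scores, popularity_scores)
```

In this example article X is returned by all three algorithms and ranks first. Replacing the plain sum with per-algorithm weights would be one way to experiment with the weighting idea raised on the Improvements slide.
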
  30. Improvements/Ideas
      • Conditionally swap algorithms: Peer recommendations can be unwieldy for new users
      • Allow users to rate how relevant this recommendation is -> retrain the model
      • Play with the weighting of current algorithms, evaluate others
      • Hybrid search platform: Replace or supplement K-Means with Search platform
  31. MACHINE LEARNING Grant Ingersoll President, Lucidworks Mahout co-founder Lucene/Solr committer