
Big Data Warehousing: Building a Relevance Engine using Hadoop, Mahout, and Pig

Over the past few years, relevant recommendations have become expected and essential as part of the customer experience. From the customer’s perspective, marketing interactions are becoming helpful and time-saving, instead of being generic, out of context, and annoying. If you shop at any of the major online retailers such as Amazon or Bluefly, you may think they have somehow gotten inside your head as they present and recommend products relevant to you. This is an exponential improvement over the traditional psych-demographic profiling and targeting of the “old world”.

We talked about how Mahout can be leveraged to build a Recommendation Engine with a minimum of coding. We discussed how the open source search and machine learning capabilities of Apache Solr and Mahout can be combined to power large-scale, data-driven applications that effectively combine real-time access with large-scale enrichment and discovery.

Caserta Concepts has grown beyond its roots as a provider of traditional data warehouse and BI consulting to also offer big data warehousing. If you’re a developer and are experienced in Hadoop, Hive, HBase, Mahout, Datameer or other Big Data technologies, we want to get to know you!

For more information, visit http://www.casertaconcepts.com/.

Published in: Technology

Big Data Warehousing: Building a Relevance Engine using Hadoop, Mahout, and Pig

  1. Big Data Warehousing Meetup. Today’s Topic: Building a Relevance Engine using Hadoop, Mahout & Pig. Sponsored By:
  2. WELCOME! Joe Caserta, Founder & President, Caserta Concepts
  3. Agenda
     • 7:00 Networking: Grab a slice of pizza and a drink...
     • 7:15 Joe Caserta (President, Caserta Concepts; Author, Data Warehouse ETL Toolkit): Welcome. About the Meetup and about Caserta Concepts
     • 7:30 Erik Laurence (VP Marketing, Caserta Concepts): Big Data Facts and Figures. Interesting observations from the world of Big Data
     • 7:45 Elliott Cordo (Principal Consultant, Caserta Concepts): Relevance. Building a Big Data recommendation engine with Mahout
     • 8:15 Grant Ingersoll (Chief Scientist, Lucidworks; Mahout co-founder; Lucene/Solr committer): Machine Learning. Powering large scale data driven real time apps with Apache Solr and Mahout
     • 8:45 More Networking
     • 9:00 Tell us what you’re up to…
  4. About BDW Meetup
     • Big Data is a complex, rapidly changing landscape
     • We want to share our stories and hear about yours
     • Great networking opportunity for like-minded data nerds
     • Opportunities to collaborate on exciting projects
  5. About Caserta Concepts
     • Focused Expertise: Big Data Analytics; Data Warehousing; Business Intelligence; Strategic Data Ecosystems
     • Industries Served: Financial Services; Healthcare / Insurance; Retail / eCommerce; Digital Media / Marketing; K-12 / Higher Education
     • Founded in 2001
     • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
  6. Client Portfolio: Finance & Insurance; Retail/eCommerce & Manufacturing; Education & Services
  7. Expertise & Offerings
     • Strategic Roadmap/Assessment/Consulting
     • Big Data Analytics
     • Data Warehousing/ETL/Data Integration
     • BI/Visualization/Analytics
     • Master Data Management
  8. Big Data at Caserta Concepts. Caserta Concepts is a blend of the best designers in traditional enterprise data with the best new designers in Big Data.
     • Traditional Data
        • Tools: RDBMS, DQ, MDM, BI, ETL, Analytics
        • Traditional Data: Transactions
     • Big Data
        • Tools: Hadoop, Mahout, Relevance Engine, Analytics
        • New Data: Social, Machine, Deep History, Unstructured
     • Immutable Data Concepts: Transformation; Profiling; Conforming; Processing Efficiency/Speed
  9. Contacts
     • Joe Caserta, President & Founder, Caserta Concepts. P: (855) 755-2246 x227. E: joe@casertaconcepts.com
     • Erik Laurence, VP Marketing, Caserta Concepts. P: (855) 755-2246 x528. E: erik@casertaconcepts.com
     • Elliott Cordo, Principal Consultant, Caserta Concepts. P: (855) 755-2246 x267. E: elliott@casertaconcepts.com
     • General: info@casertaconcepts.com, 1 (855) 755-2246, www.casertaconcepts.com
  10. BIG DATA FACTS AND FIGURES. Erik Laurence, VP Marketing, Caserta Concepts
  11. What is Really Meant by Big Data?
      • The 4 Vs of Big Data
         • Volume: More data than ever before. Most of world’s data is unstructured, semi-structured or multi-structured
         • Variety: More sources than ever before. Social, web logs, machine logs, documents, geotags, video, …
         • Velocity: Some data only has value for a short period of time. Relevance engines, financial fraud sensors, early warning sensors, etc.
         • Vitality: Agility is required in analytics. Adapt quickly to changing business needs
      • Chart: 10% Structured; 90% Un/Semi/Multi-Structured
  12. Enterprise Involvement with Big Data
      • Chart: 6% Beyond Pilot Stage; 18% Engaged in Pilot; 76% Not Yet Involved
      • Awareness of Big Data is high among enterprises, but three-quarters are still wondering, “What is this all about?”
      • Answer across all businesses: “We don’t know what the business case is.”
      • Source: WSJ, November 29, 2012
  13. Business Cases Have Been Identified
      • “The use of data and analytics … is going to be a basis of competition going forward for individual firms, for sectors and even for countries. Those companies that are able to use data effectively are more likely to win in the marketplace.” - Michael Chui, McKinsey Global Institute
      • In just one field, personal location data, $100 billion of value can be created globally for service providers through use of data. Benefits for consumers could be six times that.
      • Source: WSJ, 11/29/12
  14. Big Data Played A Role in the Election
      • “This was the first presidential election campaign where all of the data that was coming into the campaign was successfully collected and centralized. The Obama campaign did a successful job with that; the Romney campaign did not.” - John Aristotle Phillips, Chief Executive of Aristotle International (WSJ 11/29/12)
      • Callout: Obama campaign hired an analytics department five times as large as that of the 2008 operation.
  15. Big Data Example in Obama Campaign
      • $40k-a-head dinner in June at Sarah Jessica Parker’s home in NYC
      • 7 different versions of the email solicitation for the event
         • Some mentioned a 2nd fundraiser that night, a Mariah Carey concert
         • Some said Ms. Parker is a mother
         • Some said Vogue editor Anna Wintour would be at the dinner
      • Who got which email depended on big data
         • Profile info about each prospect
         • How they react to different messages
      • Campaign created a single massive system to join info from Democratic voter files to pollsters, fundraisers, field workers and consumer databases, social-media, and mobile contacts
      • Sources: WSJ, Time Magazine
  16. Hadoop Market: Growing & Evolving
      • Big data outranks virtualization as #1 trend driving spending initiatives
         • Barclays CIO Survey, April 2012
      • Overall market at $100B
         • Hadoop 2nd only to RDBMS in potential
      • Estimates put market growth at > 40% CAGR
         • IDC expects Big Data tech and services market to grow to $16.9B in 2015
         • According to JPMC, 50% of Big Data market will be influenced by Hadoop
  17. Hadoop Cost Effective for Archiving
      • Hadoop is orders of magnitude cheaper than traditional archival methods
      • Annual cost of 1 TB of archival storage for a credit card company (Tape / SAN / Hadoop): $30,000 / $3,000 / $300
  18. Hadoop is Fast
      • Sears’ process to analyze loyalty club marketing campaigns took six weeks on mainframe, Teradata, and SAS servers
         • In retail, that’s half the season!
      • New process on Hadoop is done weekly
         • For online and mobile, daily analysis is done
      • What’s more, old models used 10% of data, new models use all the data
      • Source: Information Week (October 31, 2012)
  19. BUILDING A RECOMMENDATION ENGINE. Elliott Cordo, Principal Consultant, Caserta Concepts
  20. Recommendations
      • Your customers expect them
         • Good recommendations make life easier
         • Help them find information, products, and services they might not have thought of
      • What makes a good recommendation?
         • Relevant but not obvious
         • Sense of “surprise”
  21. Where can recommendation engines be found?
      • Applications can be found in a wide variety of industries:
         • Travel
         • Service Industry
         • Music/Online radio
         • TV and Video
         • Online Publications
         • Retail
         • ...and countless others
  22. Our Use Case: Online Magazine
      Goals:
      • Serve customers recommendations based on what their peers are reading.
      • Recommendation must have context to the article they are currently viewing.
  23. Technical Details
      Core Platform:
      • Cloudera Hadoop Cluster
      • Mahout Machine Learning Library
      • Apache Pig
      Additional Technology:
      • Talend Big Data Edition (ETL to/from relational)
      • Datameer (Analysis and Visualization)
  24. How we did it
      Solution leverages three main algorithms:
      • Mahout K-Means: identifying groups of similar articles
      • Mahout Item-Based Recommender: recommendations based on peer behavior
      • Raw Popularity: custom Pig script, “people who read this article also read...”
  25. K-Means
      • Treats items as coordinates
      • Places a number of random “centroids” and assigns the nearest items
      • Moves the centroids around based on average location
      • Process repeats until the assignments stop changing
      We used the major attributes of the articles to create coordinate points: Author, Topic, Section, Region, Media, etc.
      *Diagram from Collective Intelligence by Toby Segaran
  26. Item-Based Recommender
      • Build an item-item matrix determining relationships between pairs of items (usage)
      • Using the matrix, and the data on the current user, infer his taste
      • We use a dataset containing Customer, Article and Rating
         • Since no rating was available we used a 1 to 5 scale based on age (a ramped 6-month decay)
      • In the output a 0 to 5 scale is calculated, 5 being the most highly recommended for this customer
  27. Popularity
      • Self join usage dataset based on Article:
        Also_Read_Data = JOIN Readers1 BY Customer_ID, Readers2 BY Customer_ID USING 'merge';
      • Group article based on Article, “Also Read Article”
      • Sort descending based on the number of distinct peer customers
      • Limit 25 (most popular “Also Read Article”)
      • In the output a 0 to 5 scale is calculated, 5 being the most popular for a given article
  28. Delivering Recommendations
      Customer views an article online and we are passed their Customer ID and the Article they are viewing. We then do the following:
      1. K-Means: get all items in the same cluster and calculate Euclidean Distance. Reverse and scale 0-5.
      2. Item-Based: get all peer recommendations for this customer
      3. Popularity: get all popular recommendations for this article
      4. Join the three data sets together, add the final rankings and bring back the most highly rated articles.
      Diagram labels: Item-Based: Peers are reading; K-Means: Similar; Popularity: Most popular
  29. Items recommended by more than 1 algorithm are the most highly rated
      Diagram labels: Item-Based: Peers are reading; K-Means: Similar; Popularity: Most popular; overlap: Best Recommendations
  30. Improvements/Ideas
      • Conditionally swap algorithms: Peer recommendations can be unwieldy for new users
      • Allow users to rate how relevant this recommendation is -> retrain the model
      • Play with the weighting of current algorithms, evaluate others
      • Hybrid search platform: Replace or supplement K-Means with Search platform
  31. MACHINE LEARNING. Grant Ingersoll, President, Lucidworks; Mahout co-founder; Lucene/Solr committer
  32. NETWORKING
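
Slides 26 to 28 describe the scoring pipeline only in outline. The sketch below is one possible reading of it in plain Python; it is not the presenters' Mahout and Pig implementation. All function names are hypothetical, and three choices are assumptions rather than anything stated in the slides: the age decay is a linear ramp over roughly 183 days, raw scores are rescaled proportionally so the top value maps to 5, and the "final ranking" is a simple sum of the per-algorithm scores. The item-based scores are passed in as a plain dict standing in for the recommender's 0 to 5 output.

```python
from collections import defaultdict
import math


def decay_rating(age_days, window_days=183):
    """Implicit rating for a read: 5 for a read today, ramping linearly
    down to 1 at window_days (assumed ~6 months) and staying at 1 after."""
    frac = min(max(age_days, 0), window_days) / window_days
    return 5.0 - 4.0 * frac


def scale_0_5(raw):
    """Rescale raw scores proportionally so the largest value becomes 5."""
    top = max(raw.values(), default=0)
    return {k: 5.0 * v / top for k, v in raw.items()} if top else {}


def popularity_scores(reads, article, limit=25):
    """'People who read this article also read': count distinct customers
    who read both `article` and each other article, keep the top `limit`."""
    by_customer = defaultdict(set)
    for customer, art in reads:
        by_customer[customer].add(art)
    counts = defaultdict(int)
    for arts in by_customer.values():
        if article in arts:
            for other in arts - {article}:
                counts[other] += 1
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
    return scale_0_5(dict(top))


def similarity_scores(cluster_points, article):
    """Reverse Euclidean distance within one cluster and scale to 0-5:
    the closest article scores highest, the farthest scores 0."""
    origin = cluster_points[article]
    dist = {a: math.dist(origin, p)
            for a, p in cluster_points.items() if a != article}
    far = max(dist.values(), default=0)
    if far == 0:
        return {a: 5.0 for a in dist}
    return {a: 5.0 * (1 - d / far) for a, d in dist.items()}


def blend(*score_sets, top_n=3):
    """Sum the scores per article (assumed combination rule), so articles
    returned by more than one algorithm tend to rank highest."""
    total = defaultdict(float)
    for scores in score_sets:
        for art, score in scores.items():
            total[art] += score
    return sorted(total, key=lambda a: (-total[a], a))[:top_n]
```

A call would look like `blend(similarity_scores(points, "A"), item_based, popularity_scores(reads, "A"))`, where `points` maps article IDs to coordinate tuples for one cluster, `reads` is a list of `(customer_id, article_id)` pairs, and `item_based` is the recommender's score dict for the current customer. Summing keeps the blend easy to re-weight later, which matches the slide 30 idea of experimenting with algorithm weights.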
