Big Data Warehousing Meetup

Today’s Topic: Building a Relevance
Engine using Hadoop, Mahout & Pig




                                      Sponsored By:
WELCOME!
  Joe Caserta
  Founder & President, Caserta Concepts
Agenda
7:00     Networking
         Grab a slice of pizza and a drink...



7:15     Joe Caserta                              Welcome
         President, Caserta Concepts              About the Meetup and about Caserta Concepts
         Author, Data Warehouse ETL Toolkit


7:30     Erik Laurence                            Big Data Facts and Figures
         VP Marketing, Caserta Concepts           Interesting observations from the world of Big Data



7:45     Elliott Cordo                            Relevance
         Principal Consultant, Caserta Concepts   Building a Big Data recommendation engine with Mahout



8:15     Grant Ingersoll                          Machine Learning
         Chief Scientist, Lucidworks              Powering large scale data driven real time apps with
         Mahout co-founder                        Apache Solr and Mahout
         Lucene/Solr committer

8:45 -   More Networking
9:00     Tell us what you’re up to…
About BDW Meetup
• Big Data is a complex, rapidly
 changing landscape

• We want to share our stories and
 hear about yours

• Great networking opportunity for
 like minded data nerds

• Opportunities to collaborate on
 exciting projects
About Caserta Concepts
 Focused Expertise
 •   Big Data Analytics
 •   Data Warehousing
 •   Business Intelligence
 •   Strategic Data Ecosystems

 Industries Served
 •   Financial Services
 •   Healthcare / Insurance
 •   Retail / eCommerce
 •   Digital Media / Marketing
 •   K-12 / Higher Education

     Founded in 2001

     • President: Joe Caserta, industry thought leader,
       consultant, educator and co-author, The Data
       Warehouse ETL Toolkit (Wiley, 2004)
Client Portfolio
Finance
& Insurance




Retail/eCommerce
& Manufacturing




Education
& Services
Expertise & Offerings
 Strategic Roadmap/
 Assessment/Consulting


 Big Data
 Analytics




 Data Warehousing/
 ETL/Data Integration


 BI/Visualization/
 Analytics



 Master Data Management
Big Data at Caserta Concepts
Caserta Concepts is a blend of the best designers in traditional
enterprise data with the best new designers in Big Data.

 Traditional Data
 • Tools
     • RDBMS
     • DQ
     • MDM
     • BI
     • ETL
     • Analytics
 • Traditional Data
     • Transactions

 Big Data
 • Tools
     • Hadoop
     • Mahout
     • Relevance Engine
     • Analytics
 • New Data
     • Social
     • Machine
     • Deep History
     • Unstructured



                      Immutable Data Concepts
              • Transformation   • Profiling
              • Conforming       • Processing Efficiency/Speed


Contacts

     Joe Caserta
     President & Founder, Caserta Concepts
     P: (855) 755-2246 x227
     E: joe@casertaconcepts.com


     Erik Laurence
     VP Marketing, Caserta Concepts
     P: (855) 755-2246 x528
     E: erik@casertaconcepts.com

     Elliott Cordo
     Principal Consultant, Caserta Concepts
     P: (855) 755-2246 x267
     E: elliott@casertaconcepts.com

     info@casertaconcepts.com
     1 (855) 755-2246
     www.casertaconcepts.com
BIG DATA FACTS AND FIGURES
   Erik Laurence
   VP Marketing, Caserta Concepts
What is Really Meant by Big Data?
• The 4 Vs of Big Data
  • Volume
    • More data than ever before
    • Most of world’s data is unstructured,
      semi-structured or multi-structured
      (chart: 10% structured, 90% un/semi/multi-structured)
  • Variety
    • More sources than ever before
    • Social, web logs, machine logs, documents, geotags, video, …
  • Velocity
    • Some data only has value for a short period of time
    • Relevance engines, financial fraud sensors, early warning sensors, etc.
  • Vitality
    • Agility is required in analytics
    • Adapt quickly to changing business needs
Enterprise Involvement with Big Data
Chart: enterprise involvement with Big Data
  • Beyond Pilot Stage: 6%
  • Engaged in Pilot: 18%
  • Not Yet Involved: 76%




• Awareness of Big Data high among enterprises, but three-quarters still
  wondering, “What is this all about?”
• Answer across all businesses, “We don't know what the business case
  is.”



                                                            Source: WSJ November 29, 2012
Business Cases Have Been Identified
“The use of data and analytics … is going to be a basis of competition
going forward for individual firms, for sectors and even for countries.
Those companies that are able to use data effectively are more likely to
win in the marketplace.”
         - Michael Chui, McKinsey Global Institute

In just one field—personal location data—$100 billion of value can be
created globally for service providers through use of data.

Benefits for consumers could be six times that.




    Source: (WSJ 11/29/12)
Big Data Played A Role in the Election
“This was the first presidential
election campaign where all of the
data that was coming into the
campaign was successfully
collected and centralized.

“The Obama campaign did a
successful job with that; the
Romney campaign did not.”

  - John Aristotle Phillips, Chief Executive of
  Aristotle International (WSJ 11/29/12)

Obama campaign hired an analytics department five
times as large as that of the 2008 operation.
Big Data Example in Obama Campaign
• $40k-a-head dinner in June at Sarah Jessica
  Parker’s home in NYC
• 7 different versions of the email solicitation for the
  event
  • Some mentioned a 2nd fundraiser that night, a Mariah
    Carey concert
  • Some said Ms. Parker is a mother
  • Some said Vogue editor Anna Wintour would be at the
    dinner
• Who got which email depended on big data
  • Profile info about each prospect
  • How they react to different messages
• Campaign created a single massive system to join
 info from Democratic voter files to
  • pollsters, fundraisers, field workers and consumer
    databases, social-media, and mobile contacts

  Sources: WSJ, Time Magazine
Hadoop Market: Growing & Evolving
• Big data outranks virtualization as
 #1 trend driving spending initiatives
  • Barclays CIO Survey, April 2012


• Overall market at $100B
  • Hadoop 2nd only to RDBMS in
    potential


• Estimates put market growth at >
 40% CAGR
  • IDC expects Big Data tech and
    services market to grow to $16.9B in
    2015
  • According to JPMC 50% of Big Data
    market will be influenced by Hadoop
Hadoop Cost Effective for Archiving
• Hadoop is orders of magnitude cheaper than traditional
 archival methods

• Annual cost of 1 TB of archival storage for a credit card
 company




        Tape                SAN                     Hadoop
       $30,000             $3,000                    $300
Hadoop is Fast
• Sears' process to analyze loyalty club
 marketing campaigns took six weeks on
 mainframe, Teradata, and SAS servers
  • In retail, that’s half the season!


• New process on Hadoop is done weekly
  • For online and mobile, daily analysis is done


• What’s more, old models used 10% of data, new models use all
 the data



• Source: Information Week (October 31, 2012)
BUILDING A RECOMMENDATION ENGINE
   Elliott Cordo
   Principal Consultant, Caserta Concepts
Recommendations
• Your customers expect them
   • Good recommendations make life easier
   • Help them find information, products, and services they might not
     have thought of


• What makes a good recommendation?
  • Relevant but not obvious
  • Sense of “surprise”
Where can recommendation
engines be found?
• Applications can be found in a wide variety of
 industries:
  • Travel
  • Service Industry
  • Music/Online radio
  • TV and Video
  • Online Publications
  • Retail
   ..and countless others
Our Use Case: Online Magazine
Goals:
• Serve customers recommendations based on what their
  peers are reading.
• Recommendations must be relevant to the article they
  are currently viewing.
Technical Details
Core Platform:
• Cloudera Hadoop Cluster
• Mahout Machine Learning Library
• Apache Pig


Additional Technology:
• Talend Big Data Edition (ETL to/from relational)
• Datameer (Analysis and Visualization)
How we did it
Solution leverages three main algorithms:
• Mahout K-Means – identifying groups of similar articles
• Mahout Item-Based Recommender - recommendations
  based on peer behavior
• Raw Popularity – custom Pig script: “people who read this
  article also read...”
K-Means
• Treats items as coordinates
• Places a number of random
  “centroids” and assigns the
  nearest items
• Moves the centroids around
  based on average location
• Process repeats until the
  assignments stop changing

We used the major attributes of the articles to create
coordinate points:
Author, Topic, Section, Region, Media, etc.

                                *Diagram from Collective Intelligence by Toby Segaran
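
The loop described above can be sketched in a few lines of plain Python. This is only an illustration of the technique, not the Mahout job used in the project. The one-hot encoding of article attributes, the sample articles, and the choice to seed centroids from randomly picked items are all assumptions; the slides do not say how attributes were turned into coordinates.

```python
import random

def one_hot(records, fields):
    """Turn categorical attributes into 0/1 coordinates (assumed encoding)."""
    vocab = sorted({(f, r[f]) for r in records for f in fields})
    index = {key: i for i, key in enumerate(vocab)}
    vectors = []
    for r in records:
        v = [0.0] * len(vocab)
        for f in fields:
            v[index[(f, r[f])]] = 1.0
        vectors.append(v)
    return vectors

def kmeans(points, k, max_iter=100, seed=0):
    """Return (assignments, centroids) for a list of numeric vectors."""
    rng = random.Random(seed)
    # Seed the centroids with k randomly chosen items.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assign every item to its nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

# Hypothetical articles described by two of the attributes named on the slide.
articles = [
    {"id": "a1", "topic": "politics", "section": "news"},
    {"id": "a2", "topic": "politics", "section": "news"},
    {"id": "a3", "topic": "travel", "section": "lifestyle"},
    {"id": "a4", "topic": "travel", "section": "lifestyle"},
]
vectors = one_hot(articles, ["topic", "section"])
assignments, centroids = kmeans(vectors, k=2)
```

Articles that share a cluster number in `assignments` are the "similar articles" candidates for the article being viewed.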
Item-Based Recommender
• Build an item-item matrix determining relationships
  between pairs of items (usage)
• Using the matrix, and the data on the current user, infer
  their tastes


• We use a dataset containing Customer, Article and
  Rating
   • Since no rating was available we used a 1 to 5
      scale based on age (a ramped 6 month decay)
• In the output a 0 to 5 scale is calculated, 5 being the
  most highly recommended for this customer
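
A rough sketch of both ideas, the age-based rating and the item-item scoring, in plain Python. It is not Mahout's implementation. The linear shape of the ramp, the cosine similarity measure, and the sample reads are assumptions made for illustration; the slides only state "a ramped 6 month decay" and a 1 to 5 scale.

```python
from collections import defaultdict
from math import sqrt

def rating_from_age(age_days, horizon_days=180):
    """Assumed linear ramp: 5.0 for a read today, 1.0 at six months or older."""
    frac = min(max(age_days, 0), horizon_days) / horizon_days
    return 5.0 - 4.0 * frac

def item_similarities(ratings):
    """ratings: {customer: {article: rating}} -> {(a, b): cosine similarity}."""
    by_item = defaultdict(dict)
    for customer, items in ratings.items():
        for article, r in items.items():
            by_item[article][customer] = r
    sims = {}
    for a in by_item:
        for b in by_item:
            if a == b:
                continue
            shared = set(by_item[a]) & set(by_item[b])
            if not shared:
                continue
            dot = sum(by_item[a][c] * by_item[b][c] for c in shared)
            norm_a = sqrt(sum(v * v for v in by_item[a].values()))
            norm_b = sqrt(sum(v * v for v in by_item[b].values()))
            sims[(a, b)] = dot / (norm_a * norm_b)
    return sims

def recommend(customer, ratings, sims, top_n=5):
    """Score unseen articles by a similarity-weighted average of the customer's ratings."""
    seen = ratings[customer]
    candidates = {b for (a, b) in sims if a in seen and b not in seen}
    scores = {}
    for cand in candidates:
        pairs = [(sims[(a, cand)], r) for a, r in seen.items() if (a, cand) in sims]
        total = sum(s for s, _ in pairs)
        if total > 0:
            scores[cand] = sum(s * r for s, r in pairs) / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical usage data: (customer, article, days since the article was read).
reads = [
    ("c1", "a1", 10), ("c1", "a2", 30),
    ("c2", "a1", 5), ("c2", "a2", 60), ("c2", "a3", 20),
    ("c3", "a2", 15), ("c3", "a3", 90),
]
ratings = defaultdict(dict)
for customer, article, age in reads:
    ratings[customer][article] = rating_from_age(age)

sims = item_similarities(ratings)
recs = recommend("c1", ratings, sims)
```

Here `recs` holds the articles customer `c1` has not read yet, each with a predicted score on the same scale as the input ratings.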
Popularity
• Self join usage dataset on Customer_ID, pairing each article with
  the other articles read by the same customer
  Also_Read_Data = JOIN Readers1 BY Customer_ID,
                   Readers2 BY Customer_ID USING 'merge';
• Group by Article and “Also Read Article”
• Sort descending based on the number of distinct peer
  customers
• Limit 25 (most popular “Also Read Article”)
• In the output a 0 to 5 scale is calculated, 5 being the most
  popular for a given article
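
The same steps, written as a small Python function for readers who do not use Pig. It is a sketch rather than the production script. The sample reads and the way the 0 to 5 score is derived (dividing by the highest peer count for the article) are assumptions; the slides do not give the scaling formula.

```python
from collections import defaultdict

def also_read(reads, limit=25):
    """reads: iterable of (customer_id, article_id) -> {article: [(also_read, score)]}."""
    by_customer = defaultdict(set)
    for customer, article in reads:
        by_customer[customer].add(article)

    # Self join on customer: pair each article with the others the same customer read.
    peers = defaultdict(set)  # (article, also_read_article) -> distinct customers
    for customer, articles in by_customer.items():
        for a in articles:
            for b in articles:
                if a != b:
                    peers[(a, b)].add(customer)

    # Group by article and collect the distinct peer count per also-read article.
    grouped = defaultdict(list)
    for (a, b), customers in peers.items():
        grouped[a].append((b, len(customers)))

    result = {}
    for a, pairs in grouped.items():
        # Sort descending by peer count, break ties by article id, keep the top N.
        pairs.sort(key=lambda p: (-p[1], p[0]))
        top = pairs[:limit]
        best = top[0][1]
        # Assumed scaling: the most popular also-read article scores 5.0.
        result[a] = [(b, 5.0 * n / best) for b, n in top]
    return result

# Hypothetical usage data.
reads = [
    ("c1", "a1"), ("c1", "a2"),
    ("c2", "a1"), ("c2", "a2"), ("c2", "a3"),
    ("c3", "a1"), ("c3", "a3"),
    ("c4", "a1"), ("c4", "a2"),
]
popular = also_read(reads)
```

With this data, readers of `a1` most often also read `a2` (three peers), followed by `a3` (two peers).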
Delivering Recommendations
Customer views an article online and we are passed their
Customer ID and the Article they are viewing

We then do the following:
1. K-Means – get all items in the same cluster and calculate
   Euclidean Distance. Reverse and scale 0-5.
2. Item-Based - get all peer recommendations for this customer
3. Popularity – get all popular recommendations for this article
4. Join the three data sets together, add the final rankings and
   bring back the most highly rated articles.
Items recommended by more than 1
algorithm are the most highly rated

Diagram labels: Item-Based: Peers are reading; K-Means: Similar;
Popularity: Most popular; Best Recommendations
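
One way the four delivery steps could fit together, as a minimal Python sketch. Two details are assumptions: the distance is "reversed" by dividing by the largest distance in the cluster, and the final ranking is a plain sum of the three 0 to 5 scores. The article ids and scores are made up for the example.

```python
from math import dist  # Python 3.8+

def similarity_scores(current_vec, cluster_vecs):
    """Reverse Euclidean distance so the closest article scores highest (0-5)."""
    distances = {a: dist(current_vec, v) for a, v in cluster_vecs.items()}
    worst = max(distances.values(), default=0.0)
    if worst == 0:
        return {a: 5.0 for a in distances}
    return {a: 5.0 * (1 - d / worst) for a, d in distances.items()}

def blend(*score_sets, top_n=5):
    """Join the score sets on article id and add the scores together."""
    totals = {}
    for scores in score_sets:
        for article, s in scores.items():
            totals[article] = totals.get(article, 0.0) + s
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical inputs for one page view.
kmeans_scores = similarity_scores(
    [0, 0], {"a2": [0, 1], "a3": [0, 2], "a4": [0, 4]}
)
item_scores = {"a3": 4.5, "a5": 4.0}        # peer recommendations for this customer
popularity_scores = {"a2": 5.0, "a3": 3.0}  # popular for this article

best = blend(kmeans_scores, item_scores, popularity_scores, top_n=3)
```

Because the scores are added, an article returned by more than one algorithm (here `a3`) naturally rises to the top, which matches the behavior described on the slide.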
Improvements/Ideas
• Conditionally swap algorithms: Peer recommendations
  can be unwieldy for new users
• Allow users to rate how relevant this recommendation is ->
  retrain the model
• Play with the weighting of current algorithms, evaluate
  others
• Hybrid search platform: Replace or supplement K-Means
  with Search platform
MACHINE LEARNING
   Grant Ingersoll
   President, Lucidworks
   Mahout co-founder
   Lucene/Solr committer
NETWORKING

Big Data Warehousing: Building a Relevance Engine using Hadoop, Mahout, and Pig

  • 1. Big Data Warehousing Meetup Today’s Topic: Building a Relevance Engine using Hadoop, Mahout & Pig Sponsored By:
  • 2. WELCOME! Joe Caserta Founder & President, Caserta Concepts
  • 3. Agenda 7:00 Networking Grab a slice of pizza and a drink... 7:15 Joe Caserta Welcome President, Caserta Concepts About the Meetup and about Caserta Concepts Author, Data Warehouse ETL Toolkit 7:30 Erik Laurence Big Data Facts and Figures VP Marketing, Caserta Concepts Interesting observations from the world of Big Data 7:45 Elliott Cordo Relevance Principal Consultant, Caserta Concepts Building a Big Data recommendation engine with Mahout 8:15 Grant Ingersoll Machine Learning Chief Scientist, Lucidworks Powering large scale data driven real time apps with Mahout co-founder Apache Solr and Mahout Lucene/Solr committer 8:45 - More Networking 9:00 Tell us what you’re up to…
  • 4. About BDW Meetup • Big Data is a complex, rapidly changing landscape • We want to share our stories and hear about yours • Great networking opportunity for like minded data nerds • Opportunities to collaborate on exciting projects
  • 5. About Caserta Concepts
      • Focused Expertise: Big Data Analytics, Data Warehousing, Business Intelligence, Strategic Data Ecosystems
      • Industries Served: Financial Services, Healthcare / Insurance, Retail / eCommerce, Digital Media / Marketing, K-12 / Higher Education
      • Founded in 2001
      • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
  • 6. Client Portfolio Finance & Insurance Retail/eCommerce & Manufacturing Education & Services
  • 7. Expertise & Offerings Strategic Roadmap/ Assessment/Consulting Big Data Analytics Data Warehousing/ ETL/Data Integration BI/Visualization/ Analytics Master Data Management
  • 8. Big Data at Caserta Concepts Caserta Concepts is a blend of the best designers in traditional enterprise data with the best new designers in Big Data. Traditional Data Big Data • Tools • Tools • RDBMS • Hadoop • DQ • Mahout • MDM • Relevance Engine • BI • Analytics • ETL • New Data • Analytics • Social • Traditional Data • Machine • Transactions • Deep History • Unstructured Immutable Data Concepts • Transformation • Profiling • Conforming • Processing Efficiency/Speed
  • 9. Contacts
      • Joe Caserta, President & Founder, Caserta Concepts. P: (855) 755-2246 x227, E: joe@casertaconcepts.com
      • Erik Laurence, VP Marketing, Caserta Concepts. P: (855) 755-2246 x528, E: erik@casertaconcepts.com
      • Elliott Cordo, Principal Consultant, Caserta Concepts. P: (855) 755-2246 x267, E: elliott@casertaconcepts.com
      • General: info@casertaconcepts.com, 1(855) 755-2246, www.casertaconcepts.com
  • 10. BIG DATA FACTS AND FIGURES Erik Laurence VP Marketing, Caserta Concepts
  • 11. What is Really Meant by Big Data? The 4 Vs of Big Data:
      • Volume: More data than ever before. Most of world’s data is unstructured, semi-structured or multi-structured (chart: 10% Structured, 90% Un/Semi/Multi-Structured)
      • Variety: More sources than ever before. Social, web logs, machine logs, documents, geotags, video, …
      • Velocity: Some data only has value for a short period of time. Relevance engines, financial fraud sensors, early warning sensors, etc.
      • Vitality: Agility is required in analytics. Adapt quickly to changing business needs
  • 12. Enterprise Involvement with Big Data (chart: 6% Beyond Pilot Stage, 18% Engaged in Pilot, 76% Not Yet Involved) • Awareness of Big Data high among enterprises, but three-quarters still wondering, “What is this all about?” • Answer across all businesses, “We don't know what the business case is.” Source: WSJ November 29, 2012
  • 13. Business Cases Have Been Identified “The use of data and analytics … is going to be a basis of competition going forward for individual firms, for sectors and even for countries. Those companies that are able to use data effectively are more likely to win in the marketplace.” - Michael Chui, McKinsey Global Institute In just one field, personal location data, $100 billion of value can be created globally for service providers through use of data. Benefits for consumers could be six times that. Source: (WSJ 11/29/12)
  • 14. Big Data Played A Role in the Election “This was the first presidential election campaign where all of the data that was coming into the campaign was successfully collected and centralized. The Obama campaign did a successful job with that; the Obama campaign hired an analytics department five times as large as that of the 2008 operation. Romney campaign did not.” - John Aristotle Phillips, Chief Executive of Aristotle International (WSJ 11/29/12)
  • 15. Big Data Example in Obama Campaign
      • $40k-a-head dinner in June at Sarah Jessica Parker’s home in NYC
      • 7 different versions of the email solicitation for the event: some mentioned a 2nd fundraiser that night, a Mariah Carey concert; some said Ms. Parker is a mother; some said Vogue editor Anna Wintour would be at the dinner
      • Who got which email depended on big data: profile info about each prospect, and how they react to different messages
      • Campaign created a single massive system to join info from Democratic voter files to pollsters, fundraisers, field workers and consumer databases, social-media, and mobile contacts
      Sources: WSJ, Time Magazine
  • 16. Hadoop Market: Growing & Evolving • Big data outranks virtualization as #1 trend driving spending initiatives (Barclays CIO Survey, April 2012) • Overall market at $100B • Hadoop 2nd only to RDBMS in potential • Estimates put market growth at > 40% CAGR • IDC expects Big Data tech and services market to grow to $16.9B in 2015 • According to JPMC, 50% of Big Data market will be influenced by Hadoop
  • 17. Hadoop Cost Effective for Archiving • Hadoop is orders of magnitude cheaper than traditional archival methods • Annual cost of 1 TB of archival storage for a credit card company: Tape $30,000; SAN $3,000; Hadoop $300
  • 18. Hadoop is Fast • Sears' process to analyze loyalty club marketing campaigns took six weeks on mainframe, Teradata, and SAS servers • In retail, that’s half the season! • New process on Hadoop is done weekly • For online and mobile, daily analysis is done • What’s more, old models used 10% of data, new models use all the data • Source: Information Week (October 31, 2012)
  • 19. BUILDING A RECOMMENDATION ENGINE Elliott Cordo Principal Consultant, Caserta Concepts
  • 20. Recommendations • Your customers expect them • Good recommendations make life easier • Help them find information, products, and services they might not have thought of • What makes a good recommendation? • Relevant but not obvious • Sense of “surprise”
  • 21. Where can recommendation engines be found? • Applications can be found in a wide variety of industries: • Travel • Service Industry • Music/Online radio • TV and Video • Online Publications • Retail ...and countless others
  • 22. Our Use Case: Online Magazine Goals: • Serve customers recommendations based on what their peers are reading. • Recommendation must have context to the article they are currently viewing.
  • 23. Technical Details Core Platform: • Cloudera Hadoop Cluster • Mahout Machine Learning Library • Apache Pig Additional Technology: • Talend Big Data Edition (ETL to/from relational) • Datameer (Analysis and Visualization)
  • 24. How we did it Solution leverages three main algorithms: • Mahout K-Means – identifying groups of similar articles • Mahout Item-Based Recommender – recommendations based on peer behavior • Raw Popularity – custom Pig script, “people who read this article also read...”
  • 25. K-Means
      • Treats items as coordinates
      • Places a number of random “centroids” and assigns the nearest items
      • Moves the centroids around based on average location
      • Process repeats until the assignments stop changing
      We used the major attributes of the articles to create coordinate points: Author, Topic, Section, Region, Media, etc.
      *Diagram from Collective Intelligence by Toby Segaran
  • 26. Item-Based Recommender • Build an item-item matrix determining relationships between pairs of items (usage) • Using the matrix and the data on the current user, infer their taste • We use a dataset containing Customer, Article and Rating • Since no rating was available, we used a 1 to 5 scale based on age (a ramped 6-month decay) • In the output a 0 to 5 scale is calculated, 5 being the most highly recommended for this customer
  • 27. Popularity
      • Self join usage dataset based on Article:
        Also_Read_Data = join Readers1 by Customer_ID, Readers2 by Customer_ID using 'merge';
      • Group article based on Article, “Also Read Article”
      • Sort descending based on the number of distinct peer customers
      • Limit 25 (most popular “Also Read Article”)
      • In the output a 0 to 5 scale is calculated, 5 being the most popular for a given article
  • 28. Delivering Recommendations Customer views an article online and we are passed their Customer ID and the Article they are viewing. We then do the following:
      1. K-Means – get all items in the same cluster and calculate Euclidean Distance. Reverse and scale 0-5.
      2. Item-Based – get all peer recommendations for this customer
      3. Popularity – get all popular recommendations for this article
      4. Join the three data sets together, add the final rankings and bring back the most highly rated articles.
      (Diagram labels: Item-Based: Peers are reading; K-Means: Similar; Popularity: Most popular)
  • 29. Items recommended by more than 1 algorithm are the most highly rated (Diagram labels: Item-Based: Peers are reading; K-Means: Similar; Popularity: Most popular; Best Recommendations)
  • 30. Improvements/Ideas • Conditionally swap algorithms: Peer recommendations can be unwieldy for new users • Allow users to rate how relevant this recommendation is -> retrain the model • Play with the weighting of current algorithms, evaluate others • Hybrid search platform: Replace or supplement K-Means with Search platform
  • 31. MACHINE LEARNING Grant Ingersoll President, Lucidworks Mahout co-founder Lucene/Solr committer
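
As a rough illustration of the K-Means loop described on slide 25, the sketch below turns a few categorical article attributes into coordinates and runs the assign/recompute cycle in plain Python. It is not Mahout's implementation. The sample articles, the one-hot encoding, the squared Euclidean distance, the fixed random seed and the helper names (`one_hot`, `k_means`) are all assumptions made for this example.

```python
import random

def one_hot(rows):
    """Turn rows of categorical attributes into 0/1 coordinate vectors."""
    vocab = sorted({(i, value) for row in rows for i, value in enumerate(row)})
    index = {key: pos for pos, key in enumerate(vocab)}
    vectors = []
    for row in rows:
        vec = [0.0] * len(vocab)
        for i, value in enumerate(row):
            vec[index[(i, value)]] = 1.0
        vectors.append(vec)
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, seed=0, max_iter=100):
    """Assign points to k centroids until the assignments stop changing."""
    rng = random.Random(seed)
    # Start from k randomly chosen points (an assumed initialisation).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical articles described by (Author, Topic, Section).
articles = [
    ("smith", "politics", "news"),
    ("jones", "politics", "news"),
    ("lee", "football", "sport"),
    ("kim", "football", "sport"),
]
points = one_hot(articles)
labels, centroids = k_means(points, k=2)
```

With this toy data the two politics articles end up in one cluster and the two football articles in the other. Real article attributes would need more care in how they are encoded and how many clusters are requested.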
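
The item-based approach on slide 26 can be sketched in plain Python as follows. This is a stand-in, not the Mahout recommender: the slides do not give the shape of the "ramped 6 month decay", so a linear ramp over 180 days is assumed, "age" is assumed to mean days since the article was read, and cosine similarity is an assumed choice of item-item measure. All function names and the sample data are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def age_to_rating(age_days, window_days=180):
    """Assumed linear ramp: 5.0 for a read today, 1.0 at or beyond the window."""
    fraction = min(max(age_days, 0), window_days) / window_days
    return 5.0 - 4.0 * fraction

def item_similarities(ratings):
    """Cosine similarity between the rating vectors of co-read items."""
    by_item = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            by_item[item][user] = rating
    norms = {item: sqrt(sum(r * r for r in users.values()))
             for item, users in by_item.items()}
    sims = defaultdict(dict)
    for a in by_item:
        for b in by_item:
            shared = by_item[a].keys() & by_item[b].keys()
            if a != b and shared:
                dot = sum(by_item[a][u] * by_item[b][u] for u in shared)
                sims[a][b] = dot / (norms[a] * norms[b])
    return sims

def recommend(user, ratings, sims):
    """Score unseen items as a similarity-weighted average of the user's ratings."""
    seen = ratings[user]
    candidates = {b for a in seen for b in sims.get(a, {})} - set(seen)
    scores = {}
    for cand in candidates:
        pairs = [(sims[cand][a], r) for a, r in seen.items() if a in sims[cand]]
        weight = sum(s for s, _ in pairs)
        if weight > 0:
            scores[cand] = sum(s * r for s, r in pairs) / weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage data: customer -> {article: days since it was read}.
read_ages = {
    "ann": {"A": 0, "B": 30},
    "bob": {"A": 10, "C": 60},
    "cat": {"B": 5, "C": 90},
}
ratings = {user: {article: age_to_rating(days) for article, days in items.items()}
           for user, items in read_ages.items()}
recommendations = recommend("ann", ratings, item_similarities(ratings))
```

Here `recommendations` holds a single entry for article C, the only article that ann's peers read and she did not. Its score is a weighted average of her own derived ratings, so it stays within the 1 to 5 range of the inputs.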
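
The popularity steps on slide 27 (self join, group, sort by distinct peer customers, limit, scale) can be mirrored in a few lines of Python. The original is a custom Pig script that the slides show only in part, so this is a sketch of the same idea rather than a port. The function name `also_read`, the sample reads and the scaling rule (count divided by the top count for that article, times 5) are assumptions.

```python
from collections import defaultdict
from itertools import permutations

def also_read(reads, limit=25):
    """Return, per article, the top 'also read' articles with a 0 to 5 score."""
    # Collect the articles each customer has read.
    by_customer = defaultdict(set)
    for customer, article in reads:
        by_customer[customer].add(article)

    # Self join on customer: every ordered pair of articles read by one customer.
    peers = defaultdict(set)  # (article, also_read_article) -> distinct customers
    for customer, articles in by_customer.items():
        for article, other in permutations(articles, 2):
            peers[(article, other)].add(customer)

    # Group by article and keep the number of distinct peer customers.
    per_article = defaultdict(list)
    for (article, other), customers in peers.items():
        per_article[article].append((other, len(customers)))

    result = {}
    for article, pairs in per_article.items():
        # Sort descending by peer count (ties broken by name), then limit.
        pairs.sort(key=lambda pair: (-pair[1], pair[0]))
        top = pairs[:limit]
        best = top[0][1]
        # Assumed scaling: the most popular companion article scores 5.
        result[article] = [(other, 5.0 * count / best) for other, count in top]
    return result

# Hypothetical (Customer_ID, Article) usage rows.
reads = [
    ("c1", "A"), ("c1", "B"),
    ("c2", "A"), ("c2", "B"), ("c2", "C"),
    ("c3", "A"), ("c3", "B"),
]
popular = also_read(reads)
```

For article A in this sample, B is read by all three peers and scores 5, while C is read by one and scores a third of that.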
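
The final blending described on slides 28 and 29 might look like the sketch below. The slides say only that the K-Means distance is reversed and scaled 0-5 and that the rankings are added, so the linear rescaling against the farthest item and the equal-weight sum are assumptions, as are the function names and the sample scores.

```python
def distance_to_score(distances):
    """Reverse and scale distances: nearest item scores 5, farthest scores 0."""
    if not distances:
        return {}
    farthest = max(distances.values())
    if farthest == 0:
        # Every item sits on the viewed article's coordinates.
        return {item: 5.0 for item in distances}
    return {item: 5.0 * (1 - d / farthest) for item, d in distances.items()}

def blend(similar, peers, popular, top_n=5):
    """Join the three score sets, add the scores, and return the best articles."""
    totals = {}
    for scores in (similar, peers, popular):
        for item, score in scores.items():
            totals[item] = totals.get(item, 0.0) + score
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Hypothetical inputs for one customer viewing one article.
similar = distance_to_score({"B": 0.0, "C": 2.0, "D": 4.0})  # K-Means cluster
peers = {"C": 4.0, "E": 3.0}                                 # Item-Based
popular = {"C": 5.0, "B": 1.0}                               # Popularity
best = blend(similar, peers, popular, top_n=3)
```

Article C appears in all three inputs and ranks first, which matches the observation on slide 29 that items recommended by more than one algorithm are rated highest. Weighting the three sources differently, as slide 30 suggests, would only require multiplying each score set by a factor before summing.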