Case Study:
Retail In-Store
Analysis with Hadoop
Nils Kübler, YMC
May 13th 2013
CC 2.0 by Franck BLAIS | http://flic.kr/p/cwVnSy
What is the Status Quo? What could be possible?
Introduction
Status Quo
What is the KPI in Retail?
→ Revenue per m²
How to bring in more metrics?
Possible sensors for a real store:
● customer frequency counters at doors
● the cashier system
● free WiFi access points
● video capturing
● temperature
● ...
Many of these sensors need additional hardware and software:
⇒ Let's use the free WiFi access points
What type of Questions could we ask?
● How many people visited the store? → unique visitors?
● How many visits did we have?
● What is the average visit duration?
● How many people are new vs. returning?
● ...
CC 2.0 by Ian Carroll | http://flic.kr/p/6NWoGm
How do we answer these questions?
Preparation
Traditional Data Management Approach
From a high level of abstraction the answer is simple. We need a
data management system with three pieces:
1. ingest
2. store
3. process
Blueprint for a Data Management System
with Hadoop
We take this basic architecture and replace the generic terms
by mapping them onto the Hadoop ecosystem.
With this Hadoop architecture, a data scientist should be able to
answer the questions without any programming environment.
He or she can also use familiar BI, analysis and reporting tools.
CC 2.0 by Perry French | http://flic.kr/p/8wDMJS
What do we need?
Setup
Ingredients
1. 2 WiFi access points to simulate two different stores
2. Flume to move all log messages to HDFS
3. A 4-node CDH4 cluster
4. Pentaho Data Integration's graphical designer for data
transformation, parsing, filtering and loading into the
warehouse
5. Hive as data warehouse system on top of Hadoop to project
structure onto data
6. Impala for querying data from Hive in real time
7. MS Excel to visualize results
Ingest
● 2 WiFi routers with OpenWRT installed: one Buffalo and one
Fonera
● Installed 4 days before the hackathon, to collect some log data
● Syslogs are collected on a central syslog server
● A Flume node collects the syslogs and stores them on HDFS
without any manual intervention (no transformation, no
filtering)
● (Flume can also be run as the syslog server)
Parsing, Transformation, Filtering, Load
● Raw log data needs to be transformed to CSV
● Many open-source BI tools can help with that: Palo, SpagoBI,
Pentaho, Talend
● We used Pentaho
● Design a MapReduce job for distributed transformation of
the log data with
○ a regular expression to match each line and split it into columns
○ a filter for empty lines
○ a UDF to create the CSV and the Unix timestamp
● From this data we can easily generate a Hive schema and
store the data in our Hive data warehouse.
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,IEEE 802.1X,authorizing port
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,WPA,pairwise key handshake completed (RSN)
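The transformation steps above (regular expression, empty-line filter, CSV with Unix timestamp) can be sketched in a few lines of Python. This is only an illustration, not the Pentaho MapReduce job used in the case study. The slides do not show the raw log lines, so the input layout matched by `RAW_LINE` (ISO 8601 timestamp with UTC offset, host, program, interface, `STA`, MAC address, protocol, message) is an assumption, and the names `RAW_LINE` and `to_csv_row` are hypothetical.

```python
import csv
import io
import re
from datetime import datetime

# ASSUMPTION: this raw line layout is hypothetical; the slides only show the CSV output.
RAW_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?P<offset>[+-]\d{2}:\d{2}))\s+"
    r"(?P<host>\S+)\s+(?P<program>[^:\s]+):\s+"
    r"(?P<iface>\S+):\s+STA\s+(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+"
    r"(?P<protocol>[^:]+):\s+(?P<message>.+)$"
)


def to_csv_row(line):
    """Return one raw log line as a CSV row, or None if it is empty or unparsable."""
    match = RAW_LINE.match(line.strip())
    if match is None:  # filters empty lines and lines in another format
        return None
    ts = datetime.fromisoformat(match.group("ts"))
    fields = [
        int(ts.timestamp()),  # Unix timestamp, derived from the offset-aware time
        ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second,
        match.group("offset"),
        match.group("host"),
        match.group("program"),
        match.group("iface"),
        match.group("mac"),
        match.group("protocol"),
        match.group("message"),
    ]
    buffer = io.StringIO()
    # The csv module quotes fields that contain commas, which a plain join would not.
    csv.writer(buffer, lineterminator="").writerow(fields)
    return buffer.getvalue()


raw = ("2013-01-21T11:47:47+01:00 buffalo hostapd: wlan0: "
       "STA 10:68:3f:40:20:2d IEEE 802.1X: authorizing port")
print(to_csv_row(raw))
```

With the assumed input line, the printed row has the same columns as the sample rows shown on the slide.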
Process
● Data can now be processed by either Hive or Impala
● Create an intermediate table with events such as login/logout
and the visit duration.
● We used Impala to query our data ad hoc to answer our
questions:
○ How many people visited the store (unique visitors)?
○ How many visits did we have?
○ What is the average visit duration?
○ How many people are new vs. returning?
● The output was then loaded into Excel to create some nice
graphs.
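The processing step can be sketched as follows, under several stated assumptions. The case study ran its queries on Impala; here Python's built-in `sqlite3` module stands in for it so that the example is self-contained. The event rows, the `login`/`logout` event names, the `visits` table layout and the rule that a visitor counts as "returning" after more than one visit are all hypothetical, since the slides do not define them.

```python
import sqlite3

# Hypothetical events: (store, mac, event, unix_ts). Not data from the case study.
EVENTS = [
    ("buffalo", "aa:aa:aa:aa:aa:01", "login", 1000),
    ("buffalo", "aa:aa:aa:aa:aa:01", "logout", 1600),
    ("buffalo", "aa:aa:aa:aa:aa:02", "login", 2000),
    ("buffalo", "aa:aa:aa:aa:aa:02", "logout", 2900),
    ("fonera", "aa:aa:aa:aa:aa:03", "login", 3000),
    ("fonera", "aa:aa:aa:aa:aa:03", "logout", 3120),
    ("buffalo", "aa:aa:aa:aa:aa:01", "login", 5000),
    ("buffalo", "aa:aa:aa:aa:aa:01", "logout", 5300),
]


def build_visits(events):
    """Pair each login with the next logout of the same device in the same store."""
    open_logins = {}
    visits = []
    for store, mac, event, ts in sorted(events, key=lambda e: e[3]):
        key = (store, mac)
        if event == "login":
            open_logins.setdefault(key, ts)  # keep the earliest unmatched login
        elif event == "logout" and key in open_logins:
            visits.append((store, mac, open_logins.pop(key), ts))
    return visits


METRICS_SQL = """
SELECT store,
       COUNT(*)                               AS unique_visitors,
       SUM(n)                                 AS visits,
       SUM(total_s) * 1.0 / SUM(n)            AS avg_duration_s,
       SUM(CASE WHEN n > 1 THEN 1 ELSE 0 END) AS returning_visitors
FROM (SELECT store, mac,
             COUNT(*)                  AS n,
             SUM(logout_ts - login_ts) AS total_s
      FROM visits
      GROUP BY store, mac) AS per_visitor
GROUP BY store
ORDER BY store
"""


def store_metrics(visits):
    """Load the visits into an in-memory table and compute the per-store metrics."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits (store TEXT, mac TEXT, login_ts INTEGER, logout_ts INTEGER)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", visits)
    rows = conn.execute(METRICS_SQL).fetchall()
    conn.close()
    return rows


for row in store_metrics(build_visits(EVENTS)):
    print(row)
```

The inner query aggregates per device first, so that unique visitors, visits, average duration and returning visitors all come out of one pass over the intermediate table. A different definition of "returning", for example based on a look-back window, would only change the `CASE` expression.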
CC 2.0 by Qi Wei Fong | http://flic.kr/p/7w8vfq
Now, what did we get?
Results
Visits for stores Buffalo and Fonera
● About 85% of the visits were detected in the Buffalo store
● About 15% were detected in the Fonera store
● Is the Buffalo store in a better location?
Unique visitors
● 135 visits in the Buffalo store by only 9 unique visitors
● 24 visits in the Fonera store by 5 unique visitors
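The visit counts above are consistent with the roughly 85% / 15% split reported for the two stores. A small sketch of the arithmetic, using only the figures from the slides (the function name `summarize` is hypothetical):

```python
# Figures taken from the slides: visits and unique visitors per store.
VISITS = {"Buffalo": 135, "Fonera": 24}
UNIQUE_VISITORS = {"Buffalo": 9, "Fonera": 5}


def summarize(visits, unique_visitors):
    """Return each store's share of all visits and its visits per unique visitor."""
    total = sum(visits.values())
    return {
        store: {
            "share_pct": round(100 * count / total, 1),
            "visits_per_visitor": round(count / unique_visitors[store], 1),
        }
        for store, count in visits.items()
    }


for store, stats in summarize(VISITS, UNIQUE_VISITORS).items():
    print(store, stats)
```

Buffalo accounts for 135 of 159 visits (about 84.9%) at 15 visits per unique visitor, while Fonera accounts for 24 of 159 (about 15.1%) at 4.8 visits per unique visitor.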
New vs. returning users
● more returning than new users in both stores
● Fonera didn't see a single new visitor over the past four days
Visit duration over the past 4 days
● Buffalo has more evenly distributed durations
● Fonera shows some peaks
● visitors tend to stay much longer in the Buffalo store
Conclusion
● Analysing WiFi router log files could be done with a
traditional RDBMS approach as well.
● Such questions can be answered from WiFi router log files
without writing any software.
● Since a test cluster with a few nodes can be ramped up
quickly, similar problems can be solved within one
day by a handful of engineers.
● It might be possible to track people's paths based on WiFi
router signals using triangulation.
CC 2.0 by Aurelien Guichard | http://flic.kr/p/cjg9yw
Blog Series:
http://bitly.com/bundles/nkuebler/1
Thank you
