In-Store Analysis with Hadoop


While user tracking with WebTrends, comScore, Google Analytics etc. is a de facto standard in the online world, tracking visitors in the real world is still fragmented. Broadly speaking, potential tracking data is produced by various sensors. For a real "bricks and mortar" store, many sensors come to mind: customer frequency counters at the doors, the cashier system, free WiFi access points, video capture, temperature, background music, smells and many more. Many of those sensors would need additional hardware and software, although a few already have solutions available, e.g. video capture with face or even eye recognition. The most interesting sensor data that requires no additional hardware or software probably comes from the WiFi access points, especially since many visitors carry WiFi-enabled mobile phones. This talk demonstrates how WiFi access point log files can be used to answer different questions about a particular store.

Published in: Technology

In-Store Analysis with Hadoop

  1. CC 2.0 by Mr. T in DC
  2. CC 2.0 by Franck BLAIS
  3. CC 2.0 by John Steven Fernandez
  4. CC 2.0 by Ian Carroll
  5. CC 2.0 by Perry French
  6. CC 2.0 by John Mitchell
  7. Preparation (May 21, 2013): How do we answer these questions? Before we started designing a blueprint solution, we first asked ourselves:
     1. Who would be asked to answer questions like this?
     2. Who is this person?
     3. What tools does this person expect to use?
     4. What is the typical skill set of this person?
     5. How do they work?
  8. Traditional Data Management System Approach: So, how do we answer these questions as a Data Scientist? At a high level of abstraction the answer is simple. We need a data management system with three pieces: ingest, store and process. [Diagram labels: Data Source, Data Ingestion, Data Storage, Data Processing]
  9. Blueprint for a Data Management System with Hadoop: So, how do we answer these questions as a Data Scientist? We take this basic architecture and replace the generic terms by mapping it onto the Hadoop ecosystem. With this Hadoop architecture a Data Scientist should be able to answer the questions without any programming environment, and can keep using familiar BI, analysis and reporting tools. [Diagram labels: Data Source, Flume, HDFS, Hive, Impala, BI/Analysis/Reporting]
  10. Setup: Ingredients
     1. 2 WiFi access points, running OpenWRT (a Linux-based firmware for routers), to simulate two different stores
     2. Flume to move all log messages to HDFS without any manual intervention (no transformation, no filtering)
     3. A 4 node CDH4 cluster (2GB RAM, 100GB HDD)
     4. Pentaho Data Integration's graphical designer for data transformation, parsing, filtering and loading into the warehouse
     5. Hive as the data warehouse system on top of Hadoop, to project structure onto the data
     6. Impala for querying data from HDFS in real time
     7. MS Excel to visualize results
  11. Analytics System: How it Works. [Diagram labels: OpenWRT, 00:A0:C9:14:C8:28, UDP, Syslog Server, Flume Source, Sinks to HDFS, Hadoop/HDFS, Raw, CSV, Pentaho, M/R, Loads, Hive, Impala]
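The parsing step between the raw log files and the CSV rows in the warehouse can be sketched in a few lines. The deck performs this step in Pentaho and does not show the log format, so everything here is an assumption for illustration: the line shape (a syslog prefix followed by a hostapd association message), the host name `store1`, the function name `parse_line` and the fixed year.

```python
import re
from datetime import datetime

# Assumed line shape (not shown in the deck): syslog timestamp, host name,
# then a hostapd message naming the client MAC and the event.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) hostapd: \S+ STA "
    r"(?P<mac>(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}) IEEE 802\.11: "
    r"(?P<event>associated|disassociated)\b"
)

def parse_line(line, year=2013):
    """Return (store, mac, event, iso_timestamp), or None for unrelated lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Syslog timestamps carry no year, so the caller has to supply one.
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return (m["host"], m["mac"].upper(), m["event"], ts.isoformat())

sample = ("May 21 10:15:02 store1 hostapd: wlan0: STA 00:a0:c9:14:c8:28 "
          "IEEE 802.11: associated")
row = parse_line(sample)
```

Each returned tuple corresponds to one CSV row; lines that do not match are dropped, which mirrors the filtering role Pentaho plays in the setup.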
  12. CC 2.0 by Qi Wei Fong
  13. Analysis Result: Visits for stores number one & two. The plot indicates that about 85% of the visits were detected in store number one and about 15% in store number two. One might conclude that store number one is in a much better location with more occasional customers. But let's gain more insight by analysing the number of unique visitors.
  14. Analysis Result: Unique visitors. This plot gives us more details about the customers. It turns out that the 135 visits in store number one were caused by just 9 unique visitors, while store number two saw 5 unique visitors.
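The difference between visits and unique visitors is easy to see in code: 135 visits from 9 devices works out to 135 / 9 = 15 visits per device on average. The sketch below assumes each visit has already been reduced to a `(store, mac)` pair and that a device's MAC address stands in for a visitor; the sample rows are invented for illustration and are not the data from the talk.

```python
from collections import defaultdict

def visits_and_uniques(visits):
    """visits: iterable of (store, mac) pairs, one pair per detected visit.

    Returns {store: (visit_count, unique_visitor_count)}.
    """
    counts = defaultdict(int)
    macs = defaultdict(set)
    for store, mac in visits:
        counts[store] += 1
        macs[store].add(mac)
    return {store: (counts[store], len(macs[store])) for store in counts}

# Invented sample: three visits by two devices in store1, one visit in store2.
sample_visits = [
    ("store1", "00:A0:C9:14:C8:28"),
    ("store1", "00:A0:C9:14:C8:28"),
    ("store1", "00:A0:C9:14:C8:29"),
    ("store2", "00:A0:C9:14:C8:28"),
]
summary = visits_and_uniques(sample_visits)
```

In the setup described in the deck, the same aggregation would more likely be a grouped count and distinct count run through Hive or Impala.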
  15. Analysis Result: New vs. returning users. This plot indicates that we have more returning than new users in both stores. In store number two we didn't see a single new user over the past 4 days. It's probably a good idea to start a marketing campaign aimed at new customers, e.g. giving out vouchers for the first purchase.
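The deck does not define "new" and "returning". A minimal sketch, assuming a device counts as new when its first recorded visit falls inside the reporting window and as returning when it was already seen before the window, could look like this. The function name, the `(mac, day)` input shape and the sample dates are hypothetical.

```python
from datetime import date

def new_vs_returning(visits, window_start):
    """visits: list of (mac, day) pairs for one store, day as datetime.date.

    Returns (new_count, returning_count) among devices seen on or after
    window_start.
    """
    first_seen = {}
    for mac, day in visits:
        if mac not in first_seen or day < first_seen[mac]:
            first_seen[mac] = day
    active = {mac for mac, day in visits if day >= window_start}
    new = {mac for mac in active if first_seen[mac] >= window_start}
    return len(new), len(active - new)

# Invented sample: "aa" came back, "bb" is new, "cc" was not seen in the window.
sample = [
    ("aa", date(2013, 5, 10)),
    ("aa", date(2013, 5, 19)),
    ("bb", date(2013, 5, 20)),
    ("cc", date(2013, 5, 1)),
]
result = new_vs_returning(sample, window_start=date(2013, 5, 17))
```

Under this definition, a store with no new user in the window, as reported for store number two, would return a first element of 0.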
  16. Analysis Result: Visit duration over the past 4 days. The plot for the last 4 days shows that the visit duration in store number one was evenly distributed, while the distribution in store number two shows some peaks. We can also see that visitors tend to stay in store number one much longer.
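One way to obtain visit durations from access point logs is to pair each association event with the next disassociation of the same device. This is a sketch under assumptions: the deck does not say how durations were derived, and the event names, the tuple layout and the sample timestamps are invented for illustration.

```python
from datetime import datetime

def visit_durations(events):
    """events: list of (mac, event, timestamp) sorted by timestamp, where
    event is "associated" or "disassociated".

    Returns a list of (mac, duration_in_seconds), one entry per closed visit.
    """
    open_visits = {}
    durations = []
    for mac, event, ts in events:
        if event == "associated":
            open_visits[mac] = ts  # a repeated association restarts the visit
        elif event == "disassociated" and mac in open_visits:
            start = open_visits.pop(mac)
            durations.append((mac, (ts - start).total_seconds()))
    return durations

# Invented sample: one device stays for five minutes.
sample_events = [
    ("aa", "associated", datetime(2013, 5, 21, 10, 0, 0)),
    ("aa", "disassociated", datetime(2013, 5, 21, 10, 5, 0)),
    ("bb", "disassociated", datetime(2013, 5, 21, 10, 6, 0)),  # no open visit
]
durations = visit_durations(sample_events)
```

Disassociations without a matching association are ignored, and visits that never close produce no duration, so real data would need a timeout rule on top of this.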
  17. Analysis Result: Avg. duration between visits of one particular user. There is a lot of useful information that can be derived from this plot:
     1. There is a repeating pattern of step-ins and step-outs within a short period of time.
     2. There was a step-out of store number one and a step-in into store number two within just 28 seconds.
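The gap between a step-out and the next step-in can be computed by ordering one device's visits by start time and subtracting each visit's end from the next visit's start. The sketch assumes visits are available as `(store, start, end)` tuples; the timestamps below are invented and merely chosen so that the gap matches the 28 seconds mentioned on the slide.

```python
from datetime import datetime

def gaps_between_visits(visits):
    """visits: list of (store, start, end) datetimes for ONE device.

    Returns a list of (from_store, to_store, gap_in_seconds) for each pair of
    consecutive visits.
    """
    ordered = sorted(visits, key=lambda v: v[1])
    return [
        (prev[0], nxt[0], (nxt[1] - prev[2]).total_seconds())
        for prev, nxt in zip(ordered, ordered[1:])
    ]

# Invented sample: leave store1 at 10:05:00, enter store2 at 10:05:28.
sample_visits = [
    ("store2", datetime(2013, 5, 21, 10, 5, 28), datetime(2013, 5, 21, 10, 20, 0)),
    ("store1", datetime(2013, 5, 21, 10, 0, 0), datetime(2013, 5, 21, 10, 5, 0)),
]
gaps = gaps_between_visits(sample_visits)
```

Averaging the third element of each tuple gives the average duration between visits for that device.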
  18. CC 2.0 by Aurelien Guichard
  19. Announcement: CCAH Course in ZH
     • Cloudera Administrator Training for Apache Hadoop (CCAH)
     • June 26th to 28th, 2013
     • Limmatstrasse 50, Zurich
     • More info:
  20. Links
     1. Presentation, Video and Post Series