This document describes a case study using Hadoop to analyze retail store WiFi log data. A Hadoop cluster was set up to ingest WiFi router log files from two stores via Flume. The data was transformed and stored in Hive. Questions about store visits and customers were then answered by running queries on the data with Impala and visualizing the results in Excel. Key results included identifying the store with more visits and returning customers, and analyzing visit durations and patterns between the two stores.
4. How to bring in more metrics?
Possible sensors for a real store:
● customer frequency counters at doors
● the cashier system
● free WiFi access points
● video capturing
● temperature
● ...
For many of these sensors, additional hardware and software is
needed:
⇒ Let's use the free WiFi access points
5. What type of questions could we ask?
● How many people visited the store? → unique visitors?
● How many visits did we have?
● What is the average visit duration?
● How many people are new vs. returning?
● ....
6. CC 2.0 by Ian Carroll | http://flic.kr/p/6NWoGm
How do we answer
these questions?
Preparation
7. Traditional Data Management Approach
At a high level of abstraction, the answer is simple. We need a
data management system with three pieces:
1. ingest
2. store
3. process
8. Blueprint for a Data Management System
with Hadoop
We take this basic architecture and replace the generic terms
while mapping it onto the Hadoop ecosystem.
With this Hadoop architecture, a data scientist should be able to
answer the questions without any programming environment,
and can also use familiar BI, analysis, and reporting tools.
9. CC 2.0 by Perry French | http://flic.kr/p/8wDMJS
What do we need?
Setup
10. Ingredients
1. 2 WiFi access points to simulate two different stores
2. Flume to move all log messages to HDFS
3. A 4 node CDH4 cluster
4. Pentaho Data Integration's graphical designer for data
transformation, parsing, filtering and loading to the
warehouse
5. Hive as data warehouse system on top of Hadoop to project
structure onto data
6. Impala for querying data from Hive in real time
7. MS Excel to visualize results
11. ● 2 WiFi routers with OpenWrt installed: one Buffalo and one
Fonera
● Installed 4 days before the hackathon, to have some log data
● Syslogs are collected on a central syslog server
● A Flume node collects the syslogs and stores them on HDFS
without any manual intervention (no transformation, no
filtering)
● (Flume can also be run as a syslog server)
Ingest
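The ingest leg described above could be sketched as a Flume agent configuration. This is a hypothetical sketch, not the configuration used in the case study: the agent and component names (store-agent, syslog-src, mem-ch, hdfs-sink), the port, and the HDFS path are all assumptions.

```properties
# Hypothetical Flume agent "store-agent": syslog in, HDFS out.
store-agent.sources  = syslog-src
store-agent.channels = mem-ch
store-agent.sinks    = hdfs-sink

# Listen for syslog messages forwarded by the routers (or the central server)
store-agent.sources.syslog-src.type = syslogudp
store-agent.sources.syslog-src.host = 0.0.0.0
store-agent.sources.syslog-src.port = 5140
store-agent.sources.syslog-src.channels = mem-ch

# Buffer events in memory between source and sink
store-agent.channels.mem-ch.type = memory
store-agent.channels.mem-ch.capacity = 10000

# Write raw events to HDFS, partitioned by day: no transformation, no filtering
store-agent.sinks.hdfs-sink.type = hdfs
store-agent.sinks.hdfs-sink.channel = mem-ch
store-agent.sinks.hdfs-sink.hdfs.path = /flume/wifi-syslog/%Y-%m-%d
store-agent.sinks.hdfs-sink.hdfs.fileType = DataStream
```

A memory channel keeps the example simple; a file channel would be the safer choice if losing events on agent restart matters.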
12. Parsing, Transformation, Filtering, Load
● Raw log data needs to be transformed to CSV
● Many open-source BI tools can help with that: Palo, SpagoBI,
Pentaho, Talend
● We used Pentaho
● Design a MapReduce job for distributed transformation of
the log data with
○ a regular expression to match lines and split columns
○ a filter for empty lines
○ a UDF to create the CSV row and a Unix timestamp
● From this data we can easily generate a Hive schema and
store the data in our Hive data warehouse.
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,IEEE 802.1X,authorizing port
1358765267,2013,1,21,11,47,47,+01:00,buffalo,hostapd,wlan0,10:68:3f:40:20:2d,WPA,pairwise key handshake completed (RSN)
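The transformation step can be sketched in plain Python for a single line (in the case study it ran as a Pentaho-designed MapReduce job). The raw syslog line shape assumed here is typical hostapd output; the real routers' format may differ, and the year and UTC offset are supplied as parameters because syslog lines carry neither.

```python
import re
from datetime import datetime, timedelta, timezone

# Regex for one (assumed) hostapd syslog line, e.g.
# "Jan 21 11:47:47 buffalo hostapd: wlan0: STA 10:68:3f:40:20:2d IEEE 802.1X: authorizing port"
PATTERN = re.compile(
    r"^(?P<month>\w{3}) +(?P<day>\d+) +(?P<time>\d{2}:\d{2}:\d{2}) +"
    r"(?P<host>\S+) +(?P<proc>\w+): +(?P<iface>\w+): +STA +"
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) +(?P<proto>[^:]+): +(?P<msg>.+)$"
)

MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), 1)}

def to_csv(line, year=2013, utc_offset="+01:00"):
    """Turn one raw syslog line into the CSV layout shown above.

    Returns None for empty or unparsable lines (the filtering step)."""
    m = PATTERN.match(line.strip())
    if m is None:
        return None
    hh, mm, ss = (int(x) for x in m.group("time").split(":"))
    sign = 1 if utc_offset.startswith("+") else -1
    tz = timezone(sign * timedelta(hours=int(utc_offset[1:3]),
                                   minutes=int(utc_offset[4:6])))
    # Syslog lines have no year field, so it must be supplied externally
    dt = datetime(year, MONTHS[m.group("month")], int(m.group("day")),
                  hh, mm, ss, tzinfo=tz)
    return ",".join([str(int(dt.timestamp())), str(dt.year), str(dt.month),
                     str(dt.day), str(hh), str(mm), str(ss), utc_offset,
                     m.group("host"), m.group("proc"), m.group("iface"),
                     m.group("mac"), m.group("proto"), m.group("msg")])
```

Feeding the example line above through `to_csv` reproduces the first CSV row shown.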
13. Process
● The data can now be processed by either Hive or Impala
● Create an intermediate table with login/logout events and the
visit duration
● We used Impala to query our data ad hoc for our questions:
○ How many people visited the store (unique visitors)?
○ How many visits did we have?
○ What is the average visit duration?
○ How many people are new vs. returning?
● The output was then loaded into Excel to create some nice
graphs.
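The intermediate login/logout step can be illustrated in Python. This is a hedged sketch of the idea only: in the case study this was done with Hive/Impala queries, and the event-tuple shape here is an assumption.

```python
def sessionize(events):
    """Pair login/logout events per MAC address into visits.

    events: (unix_ts, mac, kind) tuples with kind in {"login", "logout"},
    assumed sorted by timestamp. Returns (mac, start_ts, duration_s) visits."""
    open_visits = {}   # mac -> login timestamp of the currently open visit
    visits = []
    for ts, mac, kind in events:
        if kind == "login":
            open_visits.setdefault(mac, ts)      # ignore duplicate logins
        elif kind == "logout" and mac in open_visits:
            start = open_visits.pop(mac)
            visits.append((mac, start, ts - start))
    return visits

# Toy event stream: two devices, three visits (timestamps in seconds)
events = [
    (100, "aa:bb:cc:dd:ee:01", "login"),
    (160, "aa:bb:cc:dd:ee:02", "login"),
    (400, "aa:bb:cc:dd:ee:01", "logout"),
    (760, "aa:bb:cc:dd:ee:02", "logout"),
    (800, "aa:bb:cc:dd:ee:01", "login"),
    (920, "aa:bb:cc:dd:ee:01", "logout"),
]
visits = sessionize(events)
num_visits = len(visits)
unique_visitors = len({mac for mac, _, _ in visits})
avg_duration = sum(d for _, _, d in visits) / num_visits
```

From such a visits table, the visit count, unique-visitor count, and average duration fall out directly, mirroring the three Impala questions above.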
14. CC 2.0 by Qi Wei Fong | http://flic.kr/p/7w8vfq
Now, what did we
get?
Results
15. Visits for stores Buffalo and Fonera
● about 85% of the visits were detected in the Buffalo store
● about 15% in the Fonera store.
● Is the Buffalo store in a better location?
16. Unique visitors
● 135 visits in the Buffalo store by only 9 unique visitors
● 24 visits in the Fonera store by 5 unique visitors
17. New vs. returning users
● more returning than new users in both stores
● the Fonera store saw no new visitors at all over the past four days
18. Visit duration over the past 4 days
● Buffalo has more evenly distributed durations
● Fonera shows some peaks
● visitors tend to stay much longer in the Buffalo store
19. Conclusion
● Analysing WiFi router log files could also be done with a
traditional RDBMS approach.
● Answering such questions based on WiFi router log files can
be done without writing any software.
● Given that one can quickly ramp up a test cluster with a few
nodes, similar problems can be solved within one day by a
handful of engineers.
● It could be possible to track people's paths based on WiFi
router signals using triangulation.
20. CC 2.0 by Aurelien Guichard | http://flic.kr/p/cjg9yw
Blog Series:
http://bitly.com/bundles/nkuebler/1
Thank you