Case Study: In-Store Analysis

This talk was held at the Big Data User Group Stuttgart on May 16th.

  1. CC 2.0 by Mr. T in DC | http://flic.kr/p/7khrin
  2. CC 2.0 by Franck BLAIS | http://flic.kr/p/cwVnSy
  3. CC 2.0 by John Steven Fernandez | http://flic.kr/p/a8uTzz
  4. CC 2.0 by Ian Carroll | http://flic.kr/p/6NWoGm
  5. CC 2.0 by Perry French | http://flic.kr/p/8wDMJS
  6. CC 2.0 by John Mitchell | http://flic.kr/p/5UaPg8
  7. Preparation: How do we answer these questions? Before we started designing a blueprint solution, we first asked ourselves:
      1. Who would be asked to answer questions like this?
      2. Who is this person?
      3. What tools does this person expect to use?
      4. What is a typical skill set of this person?
      5. How do they work?

     (Slide footer: May 17, 2013)
  8. Traditional Data Management System Approach: So, how do we answer these questions as a data scientist? At a high level of abstraction, the answer is simple: we need a data management system with three pieces, namely ingest, store, and process. (Diagram labels: Data Source, Data Ingestion, Data Processing, Data Storage.)
  9. Blueprint for a Data Management System with Hadoop: So, how do we answer these questions as a data scientist? We take this basic architecture and replace the generic terms by mapping it onto the Hadoop ecosystem. With this Hadoop architecture, a data scientist should be able to answer the questions without any programming environment and can keep using familiar BI, analysis, and reporting tools. (Diagram labels: Data Source, Flume, Hive, Impala, HDFS, BI/Analysis/Reporting.)
  10. Setup: Ingredients
      1. Two WiFi access points, simulating two different stores, with OpenWRT (a Linux-based firmware for routers) installed
      2. Flume to move all log messages to HDFS without any manual intervention (no transformation, no filtering)
      3. A 4-node CDH4 cluster (2 GB RAM, 100 GB HDD)
      4. Pentaho Data Integration's graphical designer for data transformation, parsing, filtering, and loading into the warehouse
      5. Hive as the data warehouse system on top of Hadoop, to project structure onto the data
      6. Impala for querying data from HDFS in real time
      7. MS Excel to visualize the results
  11. Analytics System: How it Works. (Diagram labels: OpenWRT, 00:A0:C9:14:C8:28, UDP, Syslog Server, Flume Source, Sinks to HDFS, Hadoop/HDFS, Raw, Pentaho, M/R, Loads, CSV, Hive, Impala.)
  12. CC 2.0 by Qi Wei Fong | http://flic.kr/p/7w8vfq
  13. Analysis Result: Visits for stores number one and two. The plot indicates that about 85% of the visits were detected in store number one and about 15% in store number two. One might conclude that store number one is in a much better location, with more occasional customers. But let's gain more insight by analysing the number of unique visitors.
  14. Analysis Result: Unique visitors. This plot gives us more detail about the customers. It turns out that the 135 visits in store number one were caused by just 9 unique visitors, while store number two saw 5 unique visitors.
  15. Analysis Result: New vs. returning users. This plot indicates that we have more returning than new users in both stores. In store number two we did not see a single new user over the past 4 days. It is probably a good idea to start a marketing campaign aimed at new customers, e.g. giving out vouchers for the first purchase.
  16. Analysis Result: Visit duration over the past 4 days. The plot shows that the visit duration in store number one was evenly distributed, while the distribution in store number two shows some peaks. We can also see that visitors tend to stay in store number one much longer.
  17. Analysis Result: Avg. duration between visits of one particular user. A lot of useful information can be derived from this plot:
      1. There is a repeating pattern of step-ins and step-outs within a short period of time.
      2. There was a step-out of store number one and a step-in into store number two within just 28 seconds.
  18. CC 2.0 by AurelienGuichard | http://flic.kr/p/cjg9yw
  19. Announcement: CCAH Course in ZH
      • Cloudera Administrator Training for Apache Hadoop (CCAH)
      • June 26 to 28, 2013
      • Limmatstrasse 50, Zurich
      • More info: http://www.ymc.ch/training
  20. Links
      1. Presentation, Video and Post Series: http://bitly.com/bundles/cguegi/1
      2. http://www.bigdata-usergroup.ch
      3. http://about.me/cguegi
      4. http://www.ymc.ch/training
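
The analysis slides report visits per store, unique visitors, new vs. returning users, and visit duration. The talk computed these with Hive and Impala and plotted them in Excel; the Python sketch below only illustrates the same aggregations on already-parsed visit records. The `Visit` record layout, the `summarize` helper, the rule that a "returning" visitor is a MAC address seen in the same store before the reporting window, and all sample values are assumptions made for illustration, not taken from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Visit:
    """One store visit. Hypothetical layout, not the talk's actual schema."""
    store: str        # store label, e.g. "store1"
    mac: str          # device identifier reported by the access point
    start: datetime   # step-in time
    end: datetime     # step-out time


def summarize(visits, window_start):
    """Return per-store metrics for visits starting at or after window_start.

    Assumption: a visitor is "returning" for a store if the same MAC has an
    earlier visit (before window_start) in that store, otherwise "new".
    """
    seen_before = defaultdict(set)   # store -> MACs seen before the window
    in_window = defaultdict(list)    # store -> visits inside the window
    for v in visits:
        if v.start < window_start:
            seen_before[v.store].add(v.mac)
        else:
            in_window[v.store].append(v)

    summary = {}
    for store, rows in in_window.items():
        macs = {v.mac for v in rows}
        returning = macs & seen_before[store]
        total = sum((v.end - v.start for v in rows), timedelta())
        summary[store] = {
            "visits": len(rows),
            "unique_visitors": len(macs),
            "new": len(macs - returning),
            "returning": len(returning),
            "mean_duration": total / len(rows),
        }
    return summary


# Made-up sample data: two devices, two stores, one visit before the window.
t0 = datetime(2013, 5, 13)
visits = [
    Visit("store1", "dev-a", t0 - timedelta(days=1),
          t0 - timedelta(days=1) + timedelta(minutes=5)),
    Visit("store1", "dev-a", t0 + timedelta(hours=1),
          t0 + timedelta(hours=1, minutes=10)),
    Visit("store1", "dev-b", t0 + timedelta(hours=2),
          t0 + timedelta(hours=2, minutes=20)),
    Visit("store2", "dev-a", t0 + timedelta(hours=3),
          t0 + timedelta(hours=3, minutes=4)),
]
result = summarize(visits, window_start=t0)
for store, metrics in sorted(result.items()):
    print(store, metrics)
```

Counting visits and unique visitors separately is what exposes the effect described on the unique-visitors slide: a store can log many visits that come from only a handful of devices.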
