
Data Warehouse Project Report

Project report on the design and build of a data warehouse from unstructured and structured data sources (Quandl, Yelp and the UK Office for National Statistics) using SQL Server 2016, MongoDB and IBM Watson, and the design and implementation of business intelligence visualisations using Tableau to answer cross-domain business questions.


CA Data Warehouse Project Report
Tom Donoghue x16103491
19 December 2016
MSCDAD Data Warehousing and Business Intelligence
CA2 Data Warehouse Project Report, Tom Donoghue, v1.0

Table of Contents

Introduction
    Objectives
    Project Scope
Data Warehouse Architecture and Implementation
    The Data Model
        Slowly Changing Dimensions
        Type of Fact table
        High Level Model Diagram
ETL Method and Strategy
    ETL Environment
        Data Sources
    Staging and Data Warehouse ETL
        Visits
        Currency Strength
        Business Reviews
        Edinburgh Visits
        Time
Case Studies
    Visitor Nationalities Traveling to the UK and Edinburgh
    Currency Strength Impact on Visits and Spend
    Business Review Entity Extraction
References
Introduction

The purpose of this document is to report on the data warehousing project undertaken to deliver a proof of concept data warehouse. This report is split into the following sections: Data Warehouse Architecture and Implementation, ETL Method and Strategy, and Case Studies.

Objectives

The objectives of the project are outlined below:

  • Design and implement a data warehouse to answer 3 case studies, illustrating the usefulness of a data warehousing solution
  • Use 3 or more sources of data
  • Use Business Intelligence queries and outputs to demonstrate and support the case studies

Project Scope

The scope of the project covers the 3 case studies, which are described below and in the following context diagram. HandleBig Events want to know whether they should seriously consider holding their next US and Australian trade symposium in Edinburgh. They have offices in New York, Sydney and Dublin and would like to provide useful feedback to these offices to help them build initial promotional ideas. Our task is to help them make better informed decisions using the case studies (described in the Case Studies section) and the prototype data warehouse containing the sourced data.
Data Warehouse Architecture and Implementation

The architecture and design approach taken for this project follow the principles of data warehousing promoted by Kimball, Ross, Thornthwaite, Mundy and Becker (2008). The primary reason for taking the Kimball approach is the need to swiftly design and implement a working proof of concept data warehouse. The scope of the project is narrow with a tight timescale, which favours dimensional modelling over normalised relational modelling. Data warehouse data functions as a story about past events, designed to support decision making, serving up a digest of answers in grouped and aggregated ways which are more meaningful and therefore more important to the business. Providing rollup, drilldown and cross views of the data (typical OLAP operations) requires complex queries which impact performance and may also add a maintenance overhead each time a new business question occurs. The data warehouse must also ingest data from disparate sources which need to be merged to create the desired outcomes. To overcome these issues, the data warehouse is designed using dimensional modelling. Data organised multidimensionally is fashioned in such a way that it serves a different business purpose to the usual OLTP operational database (Chaudhuri and Dayal, 1997). Adopting a methodology will produce a result, but the success of the result depends on how the methodology is executed to meet a set of business requirements. As Ariyachandra and Watson (2006) note, which of the data warehouse architectures proposed by Kimball and Inmon is better is still an ongoing debate. The authors investigated five main data warehouse architectures in their study. In terms of their classification, our prototype data warehouse architecture is closest to the type described as an Independent Data Mart.
Independent Data Marts were often frowned upon as an inferior architectural solution in operational production environments. However, they represent a good fit for prototyping and proof of concept work due to their relative simplicity and short lead time to deploy, and, as the authors conclude, they may make a valid contribution as part of a larger hybrid data warehouse solution. The diagram below shows the elements which comprise our prototype data warehouse architecture: source data is ingested and processed by the Extract, Transform and Load (ETL) process, which populates the staging area (detailed in the ETL section below) and subsequently the data warehouse. The data warehouse provides the business intelligence results to business user queries.

The Data Model

The data model was constructed using dimensional modelling, which according to Kimball et al. (2008) is an applicable way to best satisfy business intelligence needs, as it meets the underlying objectives of timely query performance and unambiguous, meaningful results. The dimensional model contains dimensions and facts. Facts record business measurements that tend to be numeric and additive. Dimensions record logical sets of descriptive attributes and are bound to the facts, enabling the fact measurements to be viewed in various descriptive combinations. The benefits of dimensional modelling are that it facilitates a multidimensional analysis domain, via the exploration of fact measures using dimensions, and that the schema is far simpler as the dimensions are denormalised, which in turn improves query performance and serves data which is instantly recognisable to the business user. The resulting schema resembles a star shape, with the dimensions surrounding a single fact entity (Kimball et al., 2008; Rowen, Song, Medsker and Ewen, 2001). Many data warehouse implementations follow the star schema when describing and constructing the data model, as again
it addresses the goals of fast query performance and ease and speed of populating the data warehouse (Chaudhuri and Dayal, 1997). In the Kimball dimensional design process, our first step is to choose the business process or measurement event to be modelled, which in this case is Passenger Visits. To obtain an understanding of this, a simple business statement was made: "I want to be able to see the number of visits made by nationality, when they visited, how long they stayed and how much they spent. I also want to get a handle on their mode of travel, purpose of visit and how many people visit Edinburgh." This is a powerful way of identifying possible facts and dimensions associated with the visits data source. However, the fact table grain needs to be defined before advancing further. Examining the visits source data helped to define the grain, as each visit is recorded quarterly. The grain should be defined as finely as possible; it is possible to roll up from it (e.g. quarters into half years and higher into years), but we will not be able to drill down any lower than the selected grain. In this case, it is not possible to drill down lower than quarters (e.g. to months or weeks, as these attributes are not present in the data). Therefore, the finest grain available in the visits data is quarters.

Looking at the business statement above, the dimensions start to appear:

  • Visits
  • Country
  • Nationality
  • Mode of Travel
  • Purpose of Visit
  • Edinburgh Visits
  • Time

The facts can also be drawn from the statement:

  • Visits
  • Spend
  • Nights Stayed

There are also three remaining data sources to cater for: Currency Rates, Business Reviews and Edinburgh Visits. As the grain has been declared, these entities also need to follow the grain and be at a quarterly level.
This raised the following issues:

  • Reviews are recorded for any given date and therefore need to be massaged to fit the quarterly grain, which is achieved by transforming the review data in the ETL stage.
  • Currency FX rates are obtained by quarter, which fits, but we have multiple currencies, and that creates a many to many relationship. What does dimensional modelling offer to resolve this dilemma? As this is a prototype we strive to keep things simple, ensuring a one to many relationship between dimensions and facts to maintain the desired star schema. There are alternatives, but these break our simple design and extend the amount of effort needed to build the additional joins required to satisfy the business queries (Rowen et al., 2001). To resolve this issue, currency data was transformed in the ETL stage and repurposed as "Currency Strength" (described in detail in the ETL section) to adhere to the one to many objective and match the grain.
  • Edinburgh Visits data had the same many to many dilemma as currency rates: although rows are recorded quarterly, there are multiple countries per quarter. The same solution of transforming the data to match the grain was applied (also described in further detail in the ETL section).
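Conforming every source to the quarterly grain amounts to mapping each calendar date onto a YYYYQQ quarter key. A minimal sketch of that mapping (in Python; the function name is illustrative, not from the project):

```python
from datetime import date

def to_quarter_key(d):
    """Map a calendar date to the quarterly grain key in YYYYQQ format."""
    quarter = (d.month - 1) // 3 + 1
    return "%04d%02d" % (d.year, quarter)

# a review dated mid-February 2012 rolls up to quarter 201201
print(to_quarter_key(date(2012, 2, 14)))  # 201201
```

Any review date can then be grouped under its quarter key alongside the quarterly visit and currency rows.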
The Time dimension also needs to follow the quarterly grain. SQL Server SSAS was used to generate a Time dimension. However, the resulting dimension needed to be modified to add an extra column catering for the exact quarterly representation required to join to the fact table.

Slowly Changing Dimensions

What method of updating the data in the dimensions and facts best suits the prototype data warehouse? Keeping the objective of simplicity in mind, we opt for Kimball Type 1: overwrite the dimension attribute. Type 1 means that the data warehouse will be completely overwritten each time the data requires a refresh. The impact of a Type 1 slowly changing dimension is that we lose all history of the previous state of the data prior to the reload (Kimball et al., 2008). It is unlikely that this would be the desired approach in a production data warehouse (depending on business requirements), but it is acceptable for this proof of concept piece, as our source data are a snapshot of a set number of years from 2010 to 2016, comprising 27 quarters in total.

Type of Fact table

According to Kimball et al. (2008), measured facts fall into one of three types of grain: transactions, periodic snapshots or accumulating snapshots. Our prototype model is aligned to the periodic snapshot type, as measures are recorded each quarter for a set number of quarters (the visits data source is by quarter). No further updates are applied to the fact table rows once the table has been populated.

High Level Model Diagram

Using the dimensions that were identified from the earlier business statement, a high level model was created and is illustrated below. This is our star schema, comprising the central fact table "Travel" surrounded by the dimensions. The grain is also defined. The next stage is to identify the dimension attributes and the fact measures.
This was achieved by taking each data source in turn and asking whether the associated attributes and measures contributed to the questions being asked in the case studies. The following tables show the source data and the dimension attributes (refer to the ETL section for further detail).
Visitor data

The following dimensions were created from the visitor source data during dimensional modelling:

Country
  Attribute         Format
  Country Id        Integer PK
  Country Code      Text
  Country Strength  Text

Mode
  Attribute    Format
  Mode Id      Integer PK
  Mode Code    Text
  Mode Name    Text
  Mode Detail  Text

Nationality
  Attribute             Format
  Nationality Id        Integer PK
  Nationality Code      Text
  Nationality Strength  Text

Purpose
  Attribute          Format
  Purpose Id         Integer PK
  Purpose Code       Text
  Currency Strength  Text

For the prototype, two separate Country and Nationality dimensions were created rather than a single dimension. The reason was that the data is grouped inconsistently (e.g. a nationality of "Other EU", with no information as to which countries this refers to) and to retain the data's original meaning. In a production scenario, the countries and nationalities would possibly be rationalised and consolidated into a single dimension and transformed to use an ISO country code as a key. Some of the data from the data source was excluded as it was not required to satisfy the 3 business cases. However, this is not to undervalue its potential contribution in a full production data warehouse.
Edinburgh Visits

  Attribute    Format
  Visit Id     Integer PK
  Visit Date   YYYYQQ
  Visit Count  Integer

Currency Rates

By quarter for the US Dollar, Australian Dollar and Euro.

Currency Strength
  Attribute               Format
  Currency Strength Id    Integer PK
  Currency Strength Date  YYYYQQ
  Currency Strength       Text

Business Reviews

This data source comprises unstructured data which will undergo entity extraction to gain the following required attributes:

Review
  Attribute         Format
  Review Id         Integer PK
  Review Date       YYYYQQ
  Review Count      Integer
  Name of Business  Text
  Nationality Id    Integer
  Entity Text       Text
  Entity Type       Text

The Fact Table

The fact table is required to store the following measures:

  Fact              Measure
  Visits            Units (days)
  Spend             Units (GBP)
  Nights Stayed     Units (days)
  Edinburgh Visits  Units (days)
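The dimensions and fact table above form the star schema. A minimal illustrative sketch (in Python with SQLite; simplified, hypothetical column sets rather than the project's actual DDL) shows how a star join rolls the quarterly fact rows up by a dimension attribute:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimNationality (nationalityId INTEGER PRIMARY KEY, nationalityCode TEXT);
CREATE TABLE DimTime (timeId INTEGER PRIMARY KEY, quarterFactDate TEXT);
CREATE TABLE FactTravel (
    nationalityId INTEGER REFERENCES DimNationality(nationalityId),
    timeId        INTEGER REFERENCES DimTime(timeId),
    visits INTEGER, spend INTEGER, nightsStayed INTEGER);
INSERT INTO DimNationality VALUES (1, 'USA'), (2, 'AUS');
INSERT INTO DimTime VALUES (1, '201001'), (2, '201002');
INSERT INTO FactTravel VALUES (1, 1, 500, 900, 1200),
                              (1, 2, 600, 950, 1400),
                              (2, 1, 200, 400, 700);
""")

# star join: roll visits up from the quarterly grain to nationality level
rows = conn.execute("""
    SELECT n.nationalityCode, SUM(f.visits)
    FROM FactTravel f
    JOIN DimNationality n ON n.nationalityId = f.nationalityId
    GROUP BY n.nationalityCode
    ORDER BY n.nationalityCode
""").fetchall()
print(rows)  # [('AUS', 200), ('USA', 1100)]
```

The same pattern extends to the remaining dimensions: each business question is a grouping of the additive measures by one or more dimension attributes.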
The motivation for dimensional modelling in the context of data warehouse architecture may be summarised as follows. Understandability: the dimensional view of consolidated data is already recognisable to the business user. Query performance: gains in performance are obtained using star joins and flatter denormalised table structures; dimensions are the pathway to measures in the fact table, which can absorb a myriad of unknown queries that users may devise over time. Dimensional extensibility: as new data arrives, the dimension is capable of taking on the change either as a new row of data or by altering the table (Kimball et al., 2008). Finally, the Business Intelligence tools used to answer the 3 case studies make use of the dimensional model designed in this project.

ETL Method and Strategy

This section describes the data sources, how they were extracted, and the steps taken to transform and load the required data into the data warehouse. This phase of the project took a considerable amount of time to complete, which, as Kimball et al. (2008) point out, may swallow up to 70% of the time and work expended in the implementation of a data warehouse. Kimball et al. (2008) suggest that taking a haphazard approach to the ETL is likely to end in a tangle of objects which have multiple points of failure and are difficult to fathom. There are many ETL tools which can assist the ETL phase. According to Vassiliadis, Simitsis, Georgantas, Terrovitis and Skiadopoulos (2005), the primary activities these tools cover are: (a) recognition of viable data in the source data; (b) obtaining this information; (c) creating a tailored and consolidated view of numerous data sources, resulting in a unified format; (d) cleansing and massaging data into shape to fit the business and target database logic; and (e) populating the data warehouse.
The diagram below illustrates a high level view of the ETL landscape covered by the project scope.

ETL Environment

Prior to performing the extraction, the database environment was created. This consisted of two databases: staging and data warehouse. The databases were partitioned to ensure that data undergoing further exploration, cleaning and transformation was kept separate from the "clean" and prepared data that exists in the data warehouse environment. The purpose was to assist overall ETL management using a simple two phase approach. Source data is extracted, undergoes initial transformation and is loaded into the staging tables.
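The two phase staging approach can be sketched with SQLite standing in for the two partitioned databases (table names and the blank-to-zero cleaning rule are illustrative; the project used SSIS against SQL Server):

```python
import sqlite3

# two partitioned databases: one for raw staged data, one for the warehouse
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS dw")
conn.execute("CREATE TABLE staging_visits (quarter TEXT, visits TEXT)")
conn.execute("CREATE TABLE dw.visits (quarter TEXT, visits INTEGER)")

# phase 1: extract raw CSV-like rows into staging with minimal transformation
raw = [("201001", "1200"), ("201002", "")]
conn.executemany("INSERT INTO staging_visits VALUES (?, ?)", raw)

# phase 2: examine, clean (blanks become zero) and load the warehouse table
conn.execute("""
    INSERT INTO dw.visits
    SELECT quarter,
           CASE WHEN visits = '' THEN 0 ELSE CAST(visits AS INTEGER) END
    FROM staging_visits
""")
print(conn.execute("SELECT * FROM dw.visits ORDER BY quarter").fetchall())
# [('201001', 1200), ('201002', 0)]
```

Keeping the raw staged rows separate means a failed or suspect transformation can be rerun without re-extracting from the source.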
The data is further examined and then undergoes a second transformation before finally being loaded into the data warehouse database. This iterative approach was followed to examine and refine the quality of the data destined for the data warehouse. On early ETL runs, as new issues occurred, the incidents were investigated, a resolution sought and a modification made to the appropriate ETL package. The various ETL changes are discussed in the following sections. When the ETL packages were fully tested and producing the expected results, they were merged into logical steps to form an ETL workflow. This resulted in a workflow for each of the data sources and a separate ETL package to load the data warehouse's fact table. The diagram below illustrates the Visits ETL, using this phased ETL design process (authored in SSIS). As mentioned in the dimensional modelling section, the tables are truncated on each package execution; no history is retained.

Data Sources

The table below shows the datasets that were sourced.

  Name      Description                                         Source                         Type of Data
  Visits    International Passenger Survey (IPS) and Edinburgh  Visit Britain (2016)           Structured
            visits
  Currency  Currency FX rates                                   QuandlAPI (2016)               Semi-structured
  Reviews   Business reviews                                    Yelp Dataset Challenge (2016)  Unstructured

Visits

The IPS visit data (uk_trend_unfiltered_report) was obtained as a CSV containing quarterly rows from 2002 to 2015. The Edinburgh visit data (detailed_towns_data_2010_-_2015) was also obtained as a CSV. The files were downloaded from Visit Britain (2016). The datasets were originally created from the International Passenger Survey data (UK Office for National Statistics, 2016).
Currency

Currency FX rates were obtained using the QuandlAPI (2016) to extract average quarterly FX rates for Pound Sterling against the US Dollar, the Australian Dollar and the Euro. Quarterly data was extracted for the period 2009 to 2016.

Reviews

Business reviews were obtained from round 8 of the Yelp dataset challenge download (Yelp Dataset Challenge, 2016). The dataset was downloaded and unzipped to produce a JSON file for each entity.

Staging and Data Warehouse ETL

The ETL process for each of the data sources is described as follows.

Visits

The source CSV files were examined in OpenRefine (2016) to identify the data to be extracted and to quickly check for format inconsistencies and missing data. OpenRefine was used to reformat the quarter rows from quarters represented as month names (e.g. "January-March") to QQ format (e.g. "01"). The decimal values were converted back to integers and the input data was mapped to the respective columns of the Visits table in the staging database. The staging dimension tables Country, Nationality, Mode and Purpose were populated using the Visits staging table from the previous step. The Country ETL is described below (the same process was followed for the Nationality, Mode and Purpose tables). The target Country table was truncated, the country narratives were taken from the Visits table, sorted and the duplicates removed. A business country code column was assigned a value
of "Unknown" (this column was created for use downstream to hold business friendly values, as none were available at ingestion; the default value of "Unknown" was assigned rather than leaving it blank or NULL). The rows were then inserted into the Country table with a unique integer key assigned by SQL Server on insert. The data warehouse ETL package truncates the DimCountry table and loads it using the staging Country table as the source. Again, SQL Server assigns a unique integer key to each row inserted; this is the surrogate key that will be used as the foreign key in the fact table.

Currency Strength

The Currency Strength ETL is shown in the diagram below. A script created in R was used to obtain average quarterly currency rates using the QuandlAPI (2016), as shown in the code snippet below. The QuandlAPI (2016) call is repeated to get the US and Australian Dollar values. The quarterly difference for each currency is calculated. The last row of the 2009 quarter used in the calculation contained "NA" and was replaced with a dummy value (the entire year 2009 is discarded downstream as it is not required). The currency code and narrative are added to the data frame before it is written out to the respective currency CSV file.
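The R snippet referenced above was embedded as an image and did not survive conversion. The basket calculation it feeds (described in the paragraphs that follow) can be sketched in Python, with illustrative rate values:

```python
def currency_strength(prev_rates, curr_rates):
    """Sum the quarter-on-quarter rate differences of GBP against a basket
    of currencies; a positive basket means a strong pound ("UP")."""
    basket = sum(curr_rates[c] - prev_rates[c] for c in curr_rates)
    if basket > 0:
        return "UP"
    if basket < 0:
        return "DOWN"
    return "NO CHANGE"

# GBP average quarterly rates against USD, EUR and AUD (illustrative values):
# USD and EUR strengthened, but the weak AUD drags the basket negative
q1 = {"USD": 1.30, "EUR": 1.15, "AUD": 1.80}
q2 = {"USD": 1.35, "EUR": 1.18, "AUD": 1.70}
print(currency_strength(q1, q2))  # DOWN
```

This mirrors the shielding effect noted later in this section: a single weak currency can flip the basket even when the others moved up.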
The R script is called by the Currency ETL package. Once the CSV files are created, the data is extracted, the date is reformatted to the desired quarterly format YYYYQQ, and the rows are inserted into the staging Currency table. The desired currency rows are selected from the Currency table, grouped by date, and the rate differences are summed. The Currency Strength is calculated and the rows are then inserted into the staging Currency Strength table. The final package is run to load the Currency Strength dimension table in the data warehouse. Currency Strength is a measure of the strength of GBP against a basket of 3 currencies, namely USD, EUR and AUD. The value of the indicator is either "UP" or "DOWN". "UP" indicates a strong pound relative to the basket, and "DOWN" indicates a weak pound relative to the basket of currencies. For overseas visitors to the UK a "DOWN" position should be more favourable (bearing in mind that the basket could be shielding a currency that has moved the other way, e.g. USD and EUR are strong but a very weak AUD has caused the overall value of the basket to be negative). The Currency Strength is calculated by taking the average quarterly exchange rate of Pound Sterling against the 3 major currencies (USD, EUR and AUD) and obtaining the quarterly differences for each currency pair. The currency pair differences are summed to provide the basket value which, if
positive, sets the Currency Strength indicator to "UP"; otherwise it is set to "DOWN". In the currency dataset no quarterly difference of zero was found; had this been the case, the indicator would have been set to "NO CHANGE".

Business Reviews

To facilitate extraction of the business review data (the project's unstructured data, supplied in the downloaded JSON files), a suitable document based database, MongoDB, was used. MongoDB was installed on the same virtual machine as SQL Server to maintain a self-contained environment. The files were imported into the yelp database using mongoimport, based on a tip from Eniod's Blog (2015) on working with the Yelp dataset:

  mongoimport --db yelp --collection businesses --file yelp_academic_dataset_business.json
  mongoimport --db yelp --collection review --file yelp_academic_dataset_review.json

Using Python and pymongo, two scripts were created. The first script finds Edinburgh businesses, retrieves the associated reviews dated from 2010 to 2016 and inserts them into a new collection. The second script reads the new collection and sends each text review for entity extraction using the AlchemyAPI (2016). The result of each entity extraction is stored in a dataframe to which a random nationality code is added (to associate a review with the Visits nationality data; this addition makes our reporting more interesting as it provides a link to the nationality of the reviewer). Once the entity extraction is complete, the results are written to a CSV file which is then processed through SSIS. The scripts can be configured to set the number of businesses and associated reviews to extract (this assisted testing and limited the API calls, as the AlchemyAPI (2016) sets a daily transaction limit). It was noticed that the Yelp dataset had businesses with a review count greater than zero but no corresponding document in the review collection.
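That consistency check can be expressed as a small pure function over exported documents (a simplification of a live pymongo query; field names follow the Yelp dataset, the sample values are invented):

```python
def businesses_missing_reviews(businesses, reviews):
    """Return ids of businesses claiming reviews but with no review document."""
    reviewed_ids = {r["business_id"] for r in reviews}
    return [b["business_id"] for b in businesses
            if b["review_count"] > 0 and b["business_id"] not in reviewed_ids]

businesses = [{"business_id": "b1", "review_count": 3},
              {"business_id": "b2", "review_count": 2},
              {"business_id": "b3", "review_count": 0}]
reviews = [{"business_id": "b1", "text": "Great cafe"}]
print(businesses_missing_reviews(businesses, reviews))  # ['b2']
```

Running such a check before extraction would flag the inconsistent businesses automatically instead of requiring the manual review count fix described below.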
The scripts could be improved in the future to handle this exception. The workaround for the few businesses in error was to update the review count to zero in the business collection. AlchemyAPI (2016) provides an entity extraction API which is used to discover objects in the textual business reviews, such as people, names, places and businesses (Meo, Ferrara, Abel, Aroyo and Houben, 2013). The two Python scripts used to obtain Edinburgh business reviews from MongoDB appear below.

Extract Reviews Script:

  #!/usr/bin/env python2
  # Connects to MongoDB and extracts Edinburgh businesses. The number of
  # businesses extracted is limited, then a limited number of associated
  # reviews is fetched and inserted into a new collection.
  from random import randint

  import pymongo
  from pymongo import MongoClient

  client = MongoClient('localhost', 27017)
  db = client.yelp
  businesses = db.businesses
  reviews = db.review

  # get top-rated Edinburgh businesses, limited for testing
  aBus = businesses.find(
      {"city": "Edinburgh", "review_count": {"$gt": 0}},
      {"business_id": 1, "name": 1, "categories": 1}
  ).sort("stars", pymongo.DESCENDING).limit(2)  # set to 80 for live run

  collReviews = []

  # for each business get the reviews destined for the new collection;
  # a nationality code is randomly assigned to each review to indicate
  # the nationality of the reviewer
  for busKey in aBus:
      print busKey['business_id'] + " " + busKey['name']
      aReview = reviews.find(
          {"business_id": busKey['business_id'],
           "review_id": {"$exists": True},
           "date": {"$gt": "2009-12-31"}},
          {"review_id": 1, "date": 1, "text": 1, "business_id": 1}
      ).sort("date", pymongo.DESCENDING).limit(3)  # set to 100 for live run
      for item in aReview:
          collReviews.append({"text": item['text'],
                              "review_id": item['review_id'],
                              "date": item['date'],
                              "name": busKey['name'],
                              "business_id": item['business_id'],
                              "nationality_id": randint(1, 75)})

  # insert the documents into the new collection
  for rec in collReviews:
      db.dwreviewvideo.insert(rec)
  print 'End of Pgm'

Extract Entities Script:

  #!/usr/bin/env python2
  # Reads the extracted reviews from MongoDB, sends each review text to the
  # AlchemyAPI entity extraction service and writes the results to a CSV.
  import time

  import pandas as pd
  import pymongo
  from pymongo import MongoClient
  from watson_developer_cloud import AlchemyLanguageV1

  alchemy_language = AlchemyLanguageV1(api_key='deleted')

  client = MongoClient('localhost', 27017)
  db = client.yelp
  # select which collection to do entity extraction on
  reviews = db.dwreviewvideo

  # get some reviews by limit
  curReview = reviews.find(
      {}, {"text": 1, "date": 1, "name": 1, "nationality_id": 1}
  ).sort("date", pymongo.DESCENDING).limit(521)  # set to 521 for live run

  mylist = []

  # loop through the cursor and call the entity extraction API
  for yReview in curReview:
      text = yReview['text'].encode('utf-8')
      response = alchemy_language.entities(text)
      time.sleep(2)  # wait for Alchemy to do its thing
      # add the results to a list of dicts
      for item in response['entities']:
          mylist.append({'type': item['type'],
                         'text': item['text'].encode('latin-1'),
                         'count': item['count'],
                         'date': yReview['date'],
                         'name': yReview['name'],
                         'nationality_id': yReview['nationality_id']})

  # assign the list to a dataframe for ease of outputting a CSV of the results
  df = pd.DataFrame(mylist)
  df.to_csv('C:dwDataSetsyelpEntities2.csv', index=False)
  print 'End of entity extraction'

Using the created CSV, the data is extracted, the date is reformatted to YYYYQQ, and the Nationality Id is used to look up the nationality name and add it to the output flow. The data is then inserted into the staging Review table.
To update the data warehouse Review dimension, the reviews are transformed to obtain the review with the highest count for each quarter (one to match each of the 27 quarters), using a crafted SQL script to update the staging table with an incremented row count. The row number in the subselect is set to limit the rows selected to satisfy a review row match for each quarter.

  update review
  set reviewDateNo = Crownumber
  from (
      select reviewId, reviewDate, reviewCount,
             ROW_NUMBER() over (PARTITION BY reviewDate
                                order by reviewDate, reviewCount DESC) as Crownumber
      from (
          select reviewId, reviewDate, reviewCount,
                 ROW_NUMBER() over (PARTITION BY reviewCount
                                    order by reviewDate, reviewCount DESC) as rownumber
          from review
          group by reviewId, reviewCount, reviewDate
      ) tempQuery
      where tempQuery.rownumber < 200
      group by reviewDate, reviewCount, reviewId
  ) as reviewz
  where reviewz.reviewId = review.reviewId

Edinburgh Visits

A mixture of Excel and OpenRefine (2016) was used to reshape the data. A row for each of the 27 quarters is required to meet the grain. The counts for the following countries are summed to provide a quarterly total: US, Australia, France, Germany, Ireland, Spain, Netherlands, Italy, Poland, Belgium, Greece, Austria and Portugal. The summed and reshaped data is shown below; the original visit count was in thousands and was multiplied by 1000. If a blank was found in the original data it was assigned a zero.
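The reshaping rules (blanks treated as zero, counts recorded in thousands, countries summed per quarter) can be sketched as a small function; the counts and column layout here are illustrative, not the actual Visit Britain figures:

```python
def quarterly_visits(raw_rows):
    """Sum per-country visit counts (recorded in thousands, with blanks
    meaning zero) into a single visitor count per quarter."""
    totals = {}
    for quarter, counts in raw_rows:
        cleaned = [0 if c in ("", None) else c for c in counts]
        totals[quarter] = sum(cleaned) * 1000  # original units are thousands
    return totals

# one row per quarter, one count per country (US, Australia, France, ...)
raw = [("201001", [12, 3, "", 5]),
       ("201002", [10, None, 4, 7])]
print(quarterly_visits(raw))  # {'201001': 20000, '201002': 21000}
```

In the project this reshaping was done manually in Excel and OpenRefine; the function simply makes the rules explicit.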
  17. 17. There was no data for 2016, so an average of each quarter between 2010 and 2015 was taken to create the 2016 quarters. The result was a total count of visitors (for the selected basket of countries) by quarter. Visits to towns are based on the towns visitors report spending at least one night in during their trip.
Time
The Time dimension was generated in SSAS and exists only in the data warehouse database. However, as mentioned above, a new column was needed to hold the exact quarterly representation required to join to the Fact table (in the date format YYYYQQ). This was achieved using the following crafted SQL code, which was run in SSMS.
update t
set t.quarterFactDate = (
    select CONVERT(varchar(4), DATEPART("YYYY", t2.PK_Date))
         + RIGHT('0' + CONVERT(varchar(2), DATEPART("QQ", t2.PK_Date)), 2)
    from Time t2
    where t2.PK_Date = t.PK_Date)
from Time t
Fact Table - Travel Fact
The Travel Fact table also exists only in the data warehouse database. The ETL created for the fact table is shown below. The ETL must extract the surrogate key from each dimension, gather the measures and merge the data into the Travel Fact table. Each row inserted into the Fact table must match the quarterly grain defined during the dimensional modelling. The result of the ETL is the data warehouse database, which is illustrated below.
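The surrogate-key lookup at the heart of the fact load can be sketched as a simple key merge. This is an illustration only; the real ETL runs in SSIS, and the table, column names and values here are hypothetical.

```python
import pandas as pd

# hypothetical Time dimension with surrogate keys
time_dim = pd.DataFrame({'timeKey': [10, 11],
                         'quarterFactDate': ['201001', '201002']})

# hypothetical staged measures at the quarterly grain
staged = pd.DataFrame({'quarterFactDate': ['201001', '201002'],
                       'visits': [1500000, 1750000]})

# swap the natural quarter key for the dimension's surrogate key,
# as the fact load does for each dimension before inserting rows
fact = (staged.merge(time_dim, on='quarterFactDate')
              .drop(columns='quarterFactDate'))
print(fact[['timeKey', 'visits']].values.tolist())
# [[10, 1500000], [11, 1750000]]
```

In the real load the same lookup is repeated for every dimension, so each fact row carries only surrogate keys plus measures.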
  18. 18. The ETL has made use of several methods and tools: manual operations with OpenRefine (2016) and Excel, automation via custom R and Python programs, and the integrated MongoDB and SQL Server tools, SSIS and SSMS. The ETL process reflects the observations of Vassiliadis et al. (2005): the required data was recognised in the source data and obtained; a unified format was created through consolidation across the various data sources (matching the grain); the data was cleansed and shaped to fit the business requirements; and finally it populated the data warehouse.
Case Studies
The deployed cube is shown below. It was connected to Tableau Desktop (2016) to produce the business intelligence charts supporting the following case studies.
Visitor Nationalities Travelling to the UK and Edinburgh
How many US and Australian nationals travel to the UK, and how does this compare with several other EU nationalities? What are they spending? Of these visitors, how many visit Edinburgh? This information will assist our local offices in better assessing and addressing the target market on their home ground.
  19. 19. The prototype data warehouse shows the amount spent and the visit figures for US, Australian and a selection of EU nationalities (France, Germany, Ireland, Spain, Netherlands, Italy, Poland, Belgium) for visits to the UK between 2010 and 2015. The bar chart to the right compares the quarterly visit numbers for the same basket of nationalities with figures for visits to Edinburgh between 2010 and 2015. There appears to be a positive correlation between visits to the UK and visits to Edinburgh. Further analysis would need to examine possible causes of fluctuations; for example, obtaining data about major events that may draw visitors to Edinburgh, or keep them away, would add value to the analysis. Further charts showing trend lines and variance (e.g. quarter on quarter and year on year) within and between both sets of visit data would also be interesting to see.
Currency Strength Impact on Visits and Spend
The business is concerned about the impact of Brexit, and that overseas visitors may stay away due to the volatility of Sterling in its wake. Is it possible to provide any information from our data warehouse to allay these fears?
  20. 20. The charts above show the visitor and spend numbers in the light of the strength of Sterling against the basket of US Dollar, Australian Dollar and Euro currencies. Currency strength does not appear to deter visits or spend. Visitor numbers have increased over the five-year period, and seasonal fluctuations are clearly visible. There appears to be a positive correlation between visits and spend. However, quarters 201502 and 201403 may warrant investigation: visits were higher in 201502 (6.407M) with lower spend (3.085B) than in 201403, which had fewer visits (6.232M) but higher spend (3.883B).
Business Review Entity Extraction
Finally, away from the symposium, it would be helpful to provide visitors with places to go and things to see and do when in Edinburgh. Can we provide any points of interest in Edinburgh that will assist them?
The treemap above shows entities extracted between 2010 and 2015 from Edinburgh business reviews. The chart provides the entity name, business name, entity type, the reviewer's nationality and the total visits to the UK for the quarter the review relates to (data is not displayed if space is unavailable, which is an issue when attempting to compare entities). Taking the entity Hanedan as an example, the AlchemyAPI (2016) returned the entity as both a person and a city; it is in fact a Turkish restaurant. The treemap highlighted this unusual pattern and provoked a web search to discover what Hanedan was. A treemap visualisation is useful for exposing patterns that could be of interest and warrant further investigation. The treemap chart works well for presenting small numbers of items; however, treemaps may present a confusing picture when the number of items displayed increases substantially (Tu and Shen, 2008).
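The anomaly between those two quarters can be made concrete by computing spend per visit from the figures quoted above (a quick illustrative check, not part of the original analysis; the variable names are hypothetical):

```python
# quarterly figures quoted above: visits in units, spend in GBP
quarters = {'201502': (6.407e6, 3.085e9),
            '201403': (6.232e6, 3.883e9)}

for q, (visits, spend) in sorted(quarters.items()):
    # average spend per visit for the quarter, rounded to whole pounds
    print(q, round(spend / visits))
```

This works out at roughly £623 per visit in 201403 against roughly £482 in 201502, which quantifies why the pair stands out despite 201502 having more visits.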
  21. 21. References
AlchemyAPI (2016) Entity Extraction API [Online] Available at: http://www.alchemyapi.com/products/alchemylanguage/entity-extraction [Accessed 10 November 2016].
Ariyachandra, T. and Watson, H.J. (2006) ‘Which Data Warehouse Architecture Is Most Successful?’. Business Intelligence Journal, 11(1): p. 4.
Chaudhuri, S. and Dayal, U. (1997) ‘An overview of data warehousing and OLAP technology’. ACM SIGMOD Record, 26(1): pp. 65-74.
Eniod's Blog (2015) Import Yelp dataset to MongoDB [Online] Available at: https://haduonght.wordpress.com/2015/02/10/import-yelp-dataset-to-mongodb [Accessed 10 November 2016].
Kimball, R., Ross, M., Thornthwaite, W., Mundy, J. and Becker, B. (2008) The data warehouse lifecycle toolkit. 2nd ed. Indianapolis: Wiley Publishing, Inc.
Meo, P., Ferrara, E., Abel, F., Aroyo, L. and Houben, G. (2013) ‘Analyzing user behavior across social sharing environments’. ACM Transactions on Intelligent Systems and Technology (TIST), 5(1): pp. 14-31.
OpenRefine (2016) A free, open source, powerful tool for working with messy data [Online] Available at: http://openrefine.org/ [Accessed 10 November 2016].
QuandlAPI (2016) Quandl API Introduction [Online] Available at: https://www.quandl.com/docs/api [Accessed 10 November 2016].
Rowen, W., Song, I.Y., Medsker, C. and Ewen, E. (2001) ‘An analysis of many-to-many relationships between fact and dimension tables in dimensional modeling’. Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW 2001). Interlaken, Switzerland, 4 June 2001.
Tableau Desktop (2016) Analytics that work the way you think [Online] Available at: http://www.tableau.com/products/desktop [Accessed 10 November 2016].
Tu, Y. and Shen, H. (2008) ‘Balloon Focus: a Seamless Multi-Focus+Context Method for Treemaps’. IEEE Transactions on Visualization and Computer Graphics, 14(6): pp. 1157-1164.
UK Office for National Statistics (2016) Methodology: International Passenger Survey background notes [Online] Available at: https://www.ons.gov.uk/peoplepopulationandcommunity/leisureandtourism/methodologies/internationalpassengersurveybackgroundnotes#sample-methodology [Accessed 10 November 2016].
Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M. and Skiadopoulos, S. (2005) ‘A generic and customizable framework for the design of ETL scenarios’. Information Systems, 30(7): pp. 492-525.
Visit Britain (2016) Inbound tourism trends by market [Online] Available at: https://www.visitbritain.org/inbound-tourism-trends [Accessed 10 November 2016].
Yelp Dataset Challenge (2016) Yelp Dataset Challenge [Online] Available at: https://www.yelp.com/dataset_challenge [Accessed 10 November 2016].
