BUS105
Business Information
Systems
Workshop Week 3
Small and big Data Collection, Storage
and Management in Relation to
Information Systems
Copyright Notice
COPYRIGHT
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Kaplan Higher
Education pursuant to Part VB of the Copyright Act 1968 (the Act). The material in
this communication may be subject to copyright under the Act. Any further reproduction
or communication of this material by you may be the subject of copyright protection under the Act.
Do not remove this notice
Lesson Learning Outcomes
1 Review different types of data
2 Contrast small and big data collection
3 Learn about data storage and management
4 Examine business case studies in relation to
the type of data requirements for particular
information systems
Splunk: Slicing Data for
Domino’s Pizza
• Watch the video on how Splunk is helping to improve
Domino’s business functions
https://www.youtube.com/watch?v=LXMjN6kVmUY
Q: What was the big event
that occurred in the US that
required many pizza orders?
• Raw data (primary data)
– Numbers, words, symbols collected from a source
– Not cleaned or processed
– may have errors or outliers
• Metadata
– Data that provides information about other data
– “Metadata explains the origin, purpose, time, geographic
location, creator, access, and terms of use of the data.”
https://data.library.arizona.edu/data-management-tips/data-documentation-and-metadata
Glossary 1
LO1
• Metadata from a pdf file
Metadata Example
Glossary 2
LO1
• Structured data is formatted for use, has a well-defined data
structure, generally stored in rows and columns
- e.g. age (in years), first name (text), address (text),
income ($), etc. We will learn more about this in the
relational database section of the slides.
• Semi-structured data has some structure
- e.g. CSV files with comma-separated data, and XML and
JavaScript Object Notation (JSON) documents used to
exchange data to/from a web server
• Parse means to analyse (a string or text) into logical syntactic
components.
EMC Education Services (Eds.) 2015, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, John Wiley &
Sons, Indianapolis, US.
https://www.google.com/search?q=parsing+definition&ie=&oe=
https://en.wikipedia.org/wiki/JSON
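To make "parse" concrete, here is a minimal Python sketch that parses a JSON string into its named components. The record and its field names are invented for illustration only.

```python
import json

# Hypothetical JSON text, e.g. as it might arrive from a web server
raw_text = '{"first_name": "Ana", "age": 31, "income": 52000.5}'

# Parsing analyses the string and returns a dictionary of components
record = json.loads(raw_text)

# Each component now has a name, a value and a data type
for field, value in record.items():
    print(field, "->", value, "(" + type(value).__name__ + ")")
```

Before parsing, the computer sees only one long string; after parsing, each field can be used on its own.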
Glossary 3
LO1
• Quasi-structured data is textual data that comes in various
formats and takes effort to handle and analyse
– e.g. web clickstream data
• Unstructured data has no predefined data model, is not
organised, and may contain multiple types of data
- e.g. data from thermostats, sensors, home electronic
devices, cars, images and sounds & pdf files.
EMC Education Services (Eds.) 2015, Data Science and Big
Data Analytics: Discovering, Analyzing,
Visualizing and Presenting Data, John Wiley & Sons,
Indianapolis, US.
https://commons.wikimedia.org/wiki/Neodythemis_hildebrandti
Numerical vs Categorical Data
LO1
Data
• Numerical (quantitative)
– Discrete: takes numerical values from counting
– Continuous: takes numerical values from measurements
• Categorical (qualitative)
– Nominal: an identifier or label that has no numerical meaning
– Ordinal: categories that can be ranked (ordered)
Examples of Numerical and
Categorical Data
Data
• Numerical (quantitative)
– Discrete: number of chairs in this room
– Continuous: height
• Categorical (qualitative)
– Nominal: colours, e.g. blue, green, yellow
– Ordinal: risk, e.g.
1. High risk
2. Medium risk
3. Low risk
Activity 1: Numerical and
Categorical Data
• Form groups and find more examples of the data types
Data
• Numerical (quantitative)
– Discrete:
– Continuous:
• Categorical (qualitative)
– Nominal:
– Ordinal:
• Suppose that you have been employed by bicycle hire
company Citibike to analyse bike trips made by customers
in 2018. Some of the questions you may have are:
• Where do the customers ride most often?
• How far do the customers ride?
• How old, on average, are the customers?
https://www.citibikenyc.com/
Q: What sort of data would
you collect and how much?
Who Wants to Ride Around
New York City?
Who Wants to Ride Around
New York City?
This is structured data.
Q: How do you think this customer data is collected?
• We obtained a data set of 12,677 trips taken in January
2018.
• Variables include:
• Trip Duration (seconds)
• Start Time and Date
• Stop Time and Date
• Start Station Name
• End Station Name
• Station ID
• Station Lat/Long
• Bike ID
• User Type (Customer = 24-hour pass or 3-day pass user;
Subscriber = Annual Member)
• Gender (Zero=unknown; 1=male; 2=female)
• Year of Birth
https://data.world/citibikenyc/citibike-tripdata-january-2018
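As a rough sketch of how two of the earlier questions (how far customers ride, and how old they are on average) could be answered from these variables, the Python below uses three invented trips. The field names and coordinates are assumptions, not the file's actual column headers, and the distance is the straight line between stations, so it can only be a lower bound on the distance actually ridden.

```python
import math
from statistics import mean

# Hypothetical sample rows: (lat, long) of start and end stations, year of birth
trips = [
    {"start": (40.7163, -74.0335), "end": (40.7276, -74.0449), "birth_year": 1985},
    {"start": (40.7276, -74.0449), "end": (40.7195, -74.0434), "birth_year": 1992},
    {"start": (40.7195, -74.0434), "end": (40.7163, -74.0335), "birth_year": 1970},
]

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, long) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

# Straight-line distance per trip, and average rider age in the trip year 2018
distances = [haversine_km(t["start"], t["end"]) for t in trips]
average_age = mean(2018 - t["birth_year"] for t in trips)

print([round(d, 2) for d in distances], round(average_age, 1))
```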
Q. What type of variables
are these?
Activity 2: Contrast Small and
Big Data
LO2
• Watch the video and list four of the ten ways in
which small and big data differ
• Report back to class
https://www.youtube.com/watch?v=nh-FrpMqlIs
Small Data Summary
LO2
1. Goal: often for a very specific purpose
2. Location: usually stored in one place
3. Structure: more likely to be structured data
4. Data preparation: often handled by a single person
5. Longevity: may only be kept for 7 years
6. Measurements: usually measurements taken by a smaller group or one person/machine and are consistent
7. Reproducibility: easier to reproduce
8. Stakes (cost): less expensive
9. Introspection: easier to interpret and data points clearer
10. Analysis: often easier to organise and analyse
Video on content from Jules Berman’s book called Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information
https://www.youtube.com/watch?v=nh-FrpMqlIs
Big Data Summary
LO2
1. Goals: one may not know how they are going to use all of their big data
2. Location: in multiple places (servers)
3. Structure: all types (structured, semi-structured, quasi-structured and unstructured)
4. Data preparation: by several persons
5. Longevity: may be kept for much longer and possibly used across different projects, or linked to other data later
6. Measurements: by different persons/machines with different protocols
7. Reproducibility: more difficult to recover data if something goes wrong
8. Stakes (cost): can be expensive
9. Introspection: you may not be able to identify data type or use
10. Analysis: more complex, e.g. requires extraction, transformation, etc.
How Business Collects Customer
Big Data
Internally collected as:
• Sales data (transaction history, customer interaction)
• Customer feedback (e.g. Facebook)
Externally collected by:
• Directly asking
• Indirect tracking (emails, apps and third-party trackers,
websites, cookies and web beacons)
• Adding other data sources to their own by
– purchasing third-party data (e.g. from data
companies Acxiom and Oracle)
Storage of Data
LO3
• Data lake
– Repository for large amounts of raw data from multiple sources and in many formats, some of which may not be useful
• Data warehouse
– A repository of data from various sources, partially reorganised, and used to support decision makers in the organisation
– Takes data from a data lake and transforms it
• Data mart
– A low-cost, scaled-down version of a data warehouse designed for the end-user needs in a strategic business unit (SBU) or a department
• Database
– Organised collection of structured data (relational) or specific semi-structured, quasi-structured and unstructured data (non-relational)
Big Data Storage and
Management Options
Top 10 Big Data Storage Companies
https://selecthub.com/big-data-storage-software/
We will learn more about
semi and unstructured data
management in week 8.
Relational Database Management Systems
• Database management system (DBMS)
– A set of tools to add, delete, access, modify, and analyse
stored data
Relational databases
• Data represented as two-dimensional tables with columns and rows
Example of a two-dimensional table: a Microsoft Excel worksheet
Software for storing and finding data: MySQL, Microsoft Access, Google Spanner, MemSQL
http://bigdata-madesimple.com/relational-vs-non-relational-databases-part-1/
Non-Relational Database Management
Systems
Non-relational databases
• For big data and real-time web data
• Usually open source and work on a distributed (parallel)
data approach
General categories of non-relational databases:
• Key-value stores for shopping cart, sensor data
• Document stores for tweets, customer data, blog posts
• Wide-column stores for time series, banking
• Graph stores for networks, social connections
http://bigdata-madesimple.com/relational-vs-non-relational-databases-part-1/
https://stackoverflow.com/questions/35281066/neo4j-is-it-possible-to-visualise-a-simple-overview-of-my-database
Non-relational databases
NoSQL databases:
• Store data in a non-tabular form,
e.g. MongoDB (JSON), Neo4j, HBase
XML databases:
• Have an XML format,
e.g. Oracle Berkeley DB XML, eXist-db, BaseX
Non-Relational Database Management
Systems Cont.
Query Languages
• Query languages request information from databases.
• The query language and method used depend on the
database used.
• One of the oldest query languages is Structured Query Language
(SQL), used for relational databases.
– SQL does complicated searches using simple keywords, e.g.
• SELECT (specifies a desired attribute)
• FROM (specifies the table to be used)
• WHERE (specifies conditions to apply in the query)
Other types:
• UnQL for NoSQL databases
• XQuery, XQL for XML databases
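The three SQL keywords can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in relational database; the customers table and its rows are invented, and ORDER BY is an extra keyword added only to make the result order predictable.

```python
import sqlite3

# In-memory relational database; table name and rows are hypothetical
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, age INTEGER, income REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ana", 31, 52000.0), ("Ben", 24, 41000.0), ("Chi", 45, 78000.0)],
)

# SELECT the attributes, FROM the table, WHERE the condition holds
query = "SELECT first_name, age FROM customers WHERE age > 30 ORDER BY age"
older_customers = conn.execute(query).fetchall()

print(older_customers)
conn.close()
```

Each result row comes back as a tuple containing only the attributes named after SELECT.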
Activity 4: Review Quiz
Q1: SQL stands for:
a. Sequence query language
b. Structured query language
c. Semi query language
d. Social query language
Q2: Would you use a data mart across a large organisation or
just in a department?
Q3: MongoDB is a
a. Relational database
b. Table
c. XML database
d. NoSQL database using JSON
Data Governance
• Data governance:
– The policies and processes for managing data and information
across an entire organisation for a specified time.
• Master data management
– How and where data is managed and maintained for the entire
organisation
• Roles and responsibilities
– Staff in charge of making policies and managing data
Example (see next slide)
• Cancer Institute NSW data governance policy
Data management is how you:
– Organise, structure, and maintain the data
– Store, back up, and preserve data
– Prepare material for analysis, or to share with others
This Photo by Unknown Author is licensed under CC BY
Management is part of
governance (hence the
overlap)
http://archive.edrm.net/resources/edrm-white-paper-series/igrm-garp
https://creativecommons.org/licenses/by/3.0/
Activity 5: Data Governance
• Form groups, watch the video on data governance
and answer the questions below.
https://www.youtube.com/watch?v=t4IOS5csv40
Q1: Define data governance. Why do we need it?
Q2: What keywords came up in the video in relation to
data governance?
Q3: What are the three key components of data
governance? Can you explain them in your own words?
Data Documentation
• Data documentation is important for transparency.
• Methods include data dictionaries, schema, metadata
A data dictionary is a reference (document) of the
variables in a database.
– Defines the format necessary to enter the data into
the database, i.e. ranges, codes, decimal places
– Creates standard definitions for all attributes
– Provides organisational data resource inventory for
effective data management
Creating a Data Dictionary
Watch the video on creating a data dictionary.
https://www.youtube.com/watch?v=AeVJy-ow2b0
Do you understand these basic elements now?
Field name
Field size
Data type
Data format
Description
Example (optional)
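The basic elements listed above can be written down as a small data dictionary and then used to check incoming values. This is only a sketch: the two entries, their sizes and their descriptions are assumptions made for illustration, not an official description of the Citibike file.

```python
# Each entry holds the basic elements of a data dictionary (values are hypothetical)
data_dictionary = [
    {"field_name": "trip_duration", "field_size": 6, "data_type": "integer",
     "data_format": "whole seconds", "description": "Length of the trip",
     "example": "932"},
    {"field_name": "user_type", "field_size": 10, "data_type": "text",
     "data_format": "Customer or Subscriber", "description": "Type of pass held",
     "example": "Subscriber"},
]

def check_value(entry, value):
    """Return True if a raw text value fits the entry's size and data type."""
    if len(value) > entry["field_size"]:
        return False
    if entry["data_type"] == "integer":
        return value.isdigit()
    return True

print(check_value(data_dictionary[0], "932"))  # fits size and type
print(check_value(data_dictionary[0], "9.5"))  # not a whole number
```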
See activity on next slide
Activity 6: Create a Simple Data
Dictionary for the Citibike Data
• Form a group
• Download the file ‘JC-201801-citibike-tripdata.xlsx’
• As a group, construct a simple data dictionary for at least
four variables in the Citibike data
• Report back to class
Case Study: H&R Block Partner
With Xero
LO3
• The video shows how H&R Block has adopted
Xero to customise service, given customer tax
data
• Click on link: Xero
• Xero partners dominate nominations for the
Australian Accounting Awards 2019
This Photo by Unknown Author is licensed under CC BY-SA
https://tv.xero.com/detail/videos/customer-stories/video/5764088895001/h-r-block:-year-round-revenue-with-xero?autoStart=true
http://www.staygeo.com/2015/07/guide-to-e-file-income-tax-returns.html
https://creativecommons.org/licenses/by-sa/3.0/
Case Study: Yamaha Partner 2nd
Watch and AWS Cloud Services
“Established in 1960 as Yamaha International Corporation, Yamaha Corporation of America (YCA) offers a full line of musical instruments and audio/visual products to the U.S. market.”
Business Problem:
• Yamaha’s data management was based at a single data centre.
• All production, test, and development systems were running in a co-location arrangement at another data centre.
• Yamaha had an expensive 30-month replacement cycle for its leased hardware.
Solution:
• Yamaha migrated data & some management to the AWS Cloud.
• The company 2nd Watch was hired to assist.
• The migration to AWS was timely.
• 2nd Watch provides ongoing management, optimisation and planning services.
https://aws.amazon.com/partners/apn-journal/all/yamaha-2nd-
Lesson Learning Outcomes
1 Learn about the data analytics project
lifecycle
2 Do a hands-on exercise in Excel with
reference to LO1
3 Interpret results as required
Excel Workshop Week 7
Vehicle Cost Analysis
Commons.wikipedia.org
Business Question: How much does it cost
to run a bus service?
Intechen.com
Today’s Tasks
• Please download today’s data file now
BUS105_ProximityBus_for_week_7.xlsx
• You will be doing a hands-on Microsoft Excel cost analysis
exercise in order to answer the business question:
How much does it cost to run a bus service?
• General Excel instructions will be followed by your specific
instructions.
• At the same time we will be learning about the data
analytics lifecycle and referring to it every now and then.
Data Analytics Lifecycle
• Business Understanding
• Data Understanding
• Data Preparation
• Data Modelling
• Evaluation
• Deployment
[Diagram: the stages above arranged around the central element, Data]
Kelleher, JD, MacNamee, B & D’Arcy A 2015, Fundamentals of machine learning for predictive analytics, The MIT Press, Cambridge Massachusetts, p12-15.
Stage 1: Business Understanding
This is stage 1 of the data analytics lifecycle.
Some questions you should answer during this stage:
• What are your objectives/aims?
e.g. Is our bus company making a profit?
• What resources do you need to start the project?
e.g. Do we need an analyst? What software do we need?
• What are your business success criteria?
e.g. How can we maintain a good bus service and keep
costs below a certain level?
• In this workshop we will work with vehicle mileage and
cost data, draw charts and perform cost calculations
using Excel built-in functions.
Opening the Excel Data File
Double click on the BUS105_ProximityBus_for_week_7.xlsx file icon to open the file in Excel.
Data Understanding
• Questions to ask at this stage:
• What data have you got and is it complete?
e.g. Bus ID, cost per km, km driven...
• What was the source? e.g. Maintenance department
• What other data would be useful?
e.g. Bus ticket prices, number of passengers per day...
• Do you have a description of the data?
e.g. a data dictionary or encyclopedia
This Photo by Unknown Author is
licensed under CC BY
http://opensource.org/node/688
https://creativecommons.org/licenses/by/3.0/
Instructions
Clicking on a cell
Ischool.utexus.edu
Cell A1
Instructions
To enter worksheet titles, numbers or text
– Open the file in Microsoft Excel
– Click on a cell to make the cell active
– Type desired text
– Press the ENTER key to complete the entry
– Move to the next cell of interest and repeat
Additional information:
(To cut and paste, use Ctrl C and Ctrl V as in Word)
Your instructions on next page…
14905.00
• Move to cell E3 and replace “M cost” with
“Mileage Cost”
Your File Should Look Like This
Are the first four columns complete?
Formulae With Simple Operators
Instructions
Using simple operators and relative cell reference
• Recall all formulae start with an = sign
• Simple operators for addition, subtraction, multiplication,
division and nth power are +, -, *, / and ^n where n is the
power, e.g. 5 squared is =5^2
• If we copy a formula by dragging, the cell addresses are changed
relative to position. This is called relative cell referencing.
Your Instructions
• Obtain mileage cost: Go to cell E4 in the Mileage Cost column
of your worksheet, type “=B4*C4” and press ENTER
Fill Handle
Instructions :
How to copy a cell calculation to adjacent cells in a col/row
• With the cell containing the contents selected (e.g. E4), to fill down
the column, point to the fill handle to activate it (i.e. click
on the lower right-hand corner of the active cell and a plus
sign “+” should appear)
• Hold on to the corner and drag the handle down the
column as required (i.e. to cell E12)
Your Instructions
• Copy the formula (using the fill handle) down the rest of
column E to cell E12
Mileage
Costs
Your File Should Look Like This
Using Simple Formulae Cont.
Your Instructions
• Obtain total cost: Go to cell F4 and type in “=D4 + E4”
ENTER
• Copy the formula (by dragging the cursor) down the rest
of column F to cell F12
• Obtain total cost per Km: Go to cell G4 and type in
“=F4/C4” ENTER
• Copy the formula (using the fill handle) down the rest of
column G to cell G12
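The same row-by-row calculations can be sketched outside Excel. The Python below assumes column B holds cost per km, column C km driven and column D maintenance cost (inferred from the formulae, not confirmed by the file), and uses two invented buses rather than the workshop data.

```python
# Hypothetical values; the real figures are in the workshop's Excel file
buses = [
    {"bus_id": "B701", "cost_per_km": 2.0, "km_driven": 1000.0, "maintenance_cost": 500.0},
    {"bus_id": "B702", "cost_per_km": 1.5, "km_driven": 2000.0, "maintenance_cost": 400.0},
]

# The loop plays the role of the fill handle: the same formula is applied to every row
for bus in buses:
    bus["mileage_cost"] = bus["cost_per_km"] * bus["km_driven"]        # like =B4*C4
    bus["total_cost"] = bus["maintenance_cost"] + bus["mileage_cost"]  # like =D4+E4
    bus["total_cost_per_km"] = bus["total_cost"] / bus["km_driven"]    # like =F4/C4

for bus in buses:
    print(bus["bus_id"], bus["mileage_cost"], bus["total_cost"], bus["total_cost_per_km"])
```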
Your File Should Look Like This
Finding Totals Using ‘SUM’
Instructions
To sum a column of numbers
– Click the first empty cell below the column of numbers to sum
– Click the AutoSum button on the HOME tab to display a formula in the formula bar and in the active cell, for example =SUM(B4:B12)
Your Instructions
• Highlight G4 to G12 and click on the decrease decimal places button in the Number menu. Reduce the decimals to 2 places.
• Sum the columns using the SUM formula in the “Totals” row (row 13)
Instructions
To sum a column of numbers
– Click the first empty cell below the column of numbers to sum
– Click the AutoSum button on the HOME tab to display a formula in the formula bar and in the active cell, for example, =SUM(B4:B12)
Your Instructions
• Find column totals using the SUM formula in the “Totals” row (row 13)
• Highlight G4 to G12 and click on the decrease decimal places button in the Number menu. Reduce the decimals to 1 place.
Finding Totals Using ‘SUM’
Your File Should Look Like This
Data Preparation
• We have carried out a small amount of data preparation.
Some of the questions you should answer during this
stage:
• Have you considered your data storage and
maintenance capacity?
e.g. Do you need new software, cloud warehousing or
just a PC?
• Do you need to transform (data wrangling) or integrate
the data in any way?
e.g. Finding total cost, reducing numbers to one decimal place
Stage 4: Modelling
Questions for the modelling phase:
• What models will you use?
e.g. Descriptive, predictive analytics or AI techniques
• How will you train/test and assess the models?
e.g. You will need a training data set if you are going to
use machine learning
Let’s look at some basic summary statistics, average, max
and min.
cell reference are adjusted to reflect the new location
B$4 (mixed reference): The column reference changes when you copy this cell to another column because it is relative. The row reference does not change because it is absolute.
$B4 (mixed reference): The column reference does not change because it is absolute. The row reference changes when you copy this cell reference to another row because it is relative.
Absolute Versus Relative Address
Instructions
To enter a formula containing absolute cell references
– Given a selected cell, enter the formula and then press the F4 key to change the most recently typed cell reference from a relative cell reference to an absolute cell reference
Your Instructions
– Go to cell A17 in your spreadsheet and type in “9”
– Calculate the average of each column as its total divided by 9, using absolute referencing: Go to cell B14 and type in “=B13/$A$17”
– Apply this to cells C14 to G14 using the fill handle and adjust the results to 2 decimal places
Find Max and Min
Instructions
To find the maximum or minimum of a range of cells type in
=max(start cell:end cell) for maximum
=min(start cell:end cell) for minimum
Your Instructions
• Fill in the “Highest” and “Lowest” column values in rows 15
and 16, using =max(B4:B12) and =min(B4:B12)
• If required, change all values so that 2 decimal places are
displayed
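For comparison, the summary statistics built in the spreadsheet (total, average, highest and lowest) can be sketched in Python. The three cost values are invented; the workshop file has nine buses.

```python
# Hypothetical total costs for three buses
total_costs = [2500.0, 3400.0, 1800.0]

column_total = sum(total_costs)                   # like =SUM(F4:F12)
column_average = column_total / len(total_costs)  # like the total divided by the bus count
highest = max(total_costs)                        # like =max(F4:F12)
lowest = min(total_costs)                         # like =min(F4:F12)

# Round only for display, as with the decrease decimal places button
print(round(column_total, 2), round(column_average, 2), highest, lowest)
```

Rounding is applied only when printing, which mirrors Excel: changing the displayed decimal places does not change the stored value.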
Your File Should Look Like This
Stage 5: Evaluation
Questions regarding the evaluation and deployment
phases:
• How will you assess the results in terms of business
success criteria?
e.g. How are these results going to help the bus company?
• Have you reviewed all the modelling so far?
e.g. What other preliminary models can we learn from?
See evaluation activity on the next page
Activity 1: Evaluation
• Answer the following questions:
1. Which bus costs the most to run per km?
2. Which bus has driven the least number of
kilometres (lowest mileage)?
3. What is the lowest maintenance cost?
Stage 6: Deployment
Questions regarding the deployment phase:
• Next steps? Do you need to gather more data, carry out
another data mining project, or start deployment?
e.g. let’s try filtering and sorting values of interest
(see next page)
• How will you implement your findings?
e.g. find a way to reduce the cost of bus 701
Filtering
Instructions
• The editing menu has sort and filter commands
• To filter items based on a particular column, click on the
column to be filtered
• Move your mouse to the editing menu and select filter; a small
box with an arrow will appear at the top of the column
• Clicking on the arrow reveals the items in the list
• Unticking individual boxes hides (filters out) those items and
the filter box changes shape
• This command is good for removing BLANKS in data sets
• To display (unfilter) the list, click on “select all” in the list of filter
boxes, and your original data should be displayed
Filtering
We want to filter out the costs per Km less than
1.80
Your Instructions
1. Click on the top of column B of your spreadsheet
2. Take the cursor to the editing menu and select filter
3. Click on the filter box in column B to reveal the details
of data in column B
4. Untick the boxes with values lower than 1.80 to hide them
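The filter can also be sketched in a few lines of Python. The cost-per-km values below are invented; only the 1.80 threshold comes from the instructions.

```python
# Hypothetical cost per km for four buses
cost_per_km = {"B701": 2.10, "B702": 1.75, "B703": 1.90, "B704": 1.60}

# Keep only the buses at or above the threshold; the others are hidden, not deleted
threshold = 1.80
visible = {bus: cost for bus, cost in cost_per_km.items() if cost >= threshold}

print(visible)
```

As in Excel, the original data (cost_per_km) is left unchanged, so the filter can be undone at any time.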
Notice that the row
with the minimum
is now hidden too
Filtered column
Your File Should Look Like This
We want to sort the mileage costs while keeping the other
row information consistent with those costs
Your Instructions
1. First unfilter column B by ticking the “select all” box in the
filter options
2. Copy just the values of the data block from cell A3 to G12 to
Sheet 1 by highlighting the data and pressing Ctrl C
3. Click on cell A1 in Sheet 1, right click on your mouse and
select paste special, click on the values option and OK
4. Select column E, go to the sort menu and select “sort
smallest to largest”
5. The “Expand selection” box will appear; make sure the
expand selection option is checked and then press SORT
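The reason for expanding the selection is that each cost must stay attached to its own bus. A minimal Python sketch of the same idea, with invented values, sorts whole rows rather than a single column:

```python
# Hypothetical rows: (bus ID, mileage cost)
rows = [("B701", 2100.0), ("B702", 1750.0), ("B703", 1900.0)]

# Sort whole rows by mileage cost, smallest to largest,
# so every bus ID travels with its own cost
sorted_rows = sorted(rows, key=lambda row: row[1])

print(sorted_rows)
```

Sorting only the cost column, without the bus IDs, would pair costs with the wrong buses.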
Sorting
Your Worksheet Sheet 1 should look like this
Total Cost per Bus as a Percentage of
the Entire Total Cost
Bus | Share of total cost
B701 | 15%
B702 | 12%
B703 | 11%
B704 | 11%
B705 | 7%
B706 | 10%
B707 | 11%
B708 | 11%
B709 | 12%
An alternative representation of total costs
Activity 2: Interpretation
Answer these questions:
1. Which bus has the greatest mileage cost?
2. What is the maintenance cost of the bus of
interest in question 1?
3. Is it easier to interpret the table or pie chart?
Why?
BUS105
Business Information
Systems
Lesson week 8
Semi-structured and unstructured data
management
Lesson Learning Outcomes
1 Define semi-structured and unstructured data
2 Distinguish between the various NoSQL and
NewSQL databases
3 Learn about various software packages for
the management of semi-structured and
unstructured data
4 Evaluate case studies
5 Final discussion with your teacher of
individual report
Dark analytics: Analyzing
unstructured data
Did you know that 95% of data in the world is unstructured?
Watch the video on Dark Analytics
https://www.youtube.com/watch?v=X4f-GCGraXI
What sorts of data are really difficult to analyse?
Glossary 1
LO1
Recall that
• Semi-structured data has some structure
- e.g. CSV files with comma-separated data, and XML &
JavaScript Object Notation (JSON) documents used to
exchange data to/from a web server.
Note: some analysts consider .csv files to be structured data
• Unstructured data has no predefined data model, is not
organised, and may contain multiple types of data
- e.g. data from thermostats, sensors, home electronic
devices, cars, images and sounds & pdf files.
EMC Education Services (Eds.) 2015, Data Science and Big
Data Analytics: Discovering, Analyzing, Visualizing and
Presenting Data, John Wiley &
Sons, Indianapolis, US.
https://www.google.com/search?q=parsing+definition&ie=&oe=
https://en.wikipedia.org/wiki/JSON
Glossary 3
LO1
• Quasi-structured data is textual data that comes in various
formats and takes effort to handle and analyse
– e.g. web clickstream data
• Unstructured data has no predefined data model, is not
organised, and may contain multiple types of data
- e.g. data from thermostats, sensors, home electronic
devices, cars, images and sounds & pdf files.
EMC Education Services (Eds.) 2015, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, John Wiley & Sons, Indianapolis, US.
https://commons.wikimedia.org/wiki/Neodythemis_hildebrandti
Why do we need non-relational
databases?
• Big data has driven the need for
• NoSQL databases
– For unstructured data
• NewSQL databases
– Bridging the gap between relational and NoSQL database
design
• Note: The query language/method depends on the
database used
This Photo by Unknown Author is licensed
under CC BY-NC
http://www.ksi.mff.cuni.cz/
https://creativecommons.org/licenses/by-nc/3.0/
Recall: NoSQL Databases
NoSQL (Not only SQL), i.e. non-relational databases
• Are used to manage unstructured & semi-structured data
• Sometimes called “Cloud” databases
• Usually open source
• Work on a distributed (parallel) data approach
• General categories of non-relational databases (DBs):
– Key-value DBs, e.g. shopping cart, sensor data
– Document DBs, e.g. tweets, customer data, blog posts
– Column-oriented DBs, e.g. time series, banking
– Graph DBs, e.g. networks, social connections
Coronel, C, and Morris, S 2019, Database Systems: Design, Implementation, & Management, 13th Edn., Cengage, Boston, USA.
Activity 1: Match database type and
Networks
Time series
Sensor data
Banking
Blog posts
Social connections
Example of Key-Value Database
For example, student names and ages. The name is
used as the key.
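A Python dictionary behaves like a very small key-value store, so it can be used to sketch the idea. The names and ages are invented, and a real key-value database would add persistence and distribution across servers on top of this.

```python
# Key = student name, value = age (hypothetical data)
student_ages = {"Ma": 21, "Jimmy": 19}

# "Put": store a value under a new key
student_ages["Peter"] = 23

# "Get": look a value up directly by its key
jimmy_age = student_ages.get("Jimmy")

# A key that was never stored returns nothing
unknown_age = student_ages.get("Sundar")

# "Delete": remove a key and its value
del student_ages["Ma"]

print(jimmy_age, unknown_age, student_ages)
```

Every operation goes through the key; there is no query on the values themselves, which is what keeps key-value stores simple and fast.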
Software
• Windows Azure
• Riak
2. They are written in key & value pairs.
3. Keys must be strings, and values must be a valid JSON data type (string (i.e. text), number, object, array, boolean or null).
4. Keys and values are separated by a colon.
5. Each key/value pair is separated by a comma.
Javascript: JSON and Ajax, 1998-2014 O’Reilly Media, Inc., available at
archive.oreilly.com/oreillyschool/courses/javascript2/Javascript%20JSON%20and%20Ajax%20v2.pdf
This work is licensed under a Creative Commons Attribution-
ShareAlike 3.0 Unported License.
https://en.wikipedia.org/wiki/JSON
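One way to check whether a piece of text follows the JSON rules is to ask a parser. The sketch below uses Python's built-in json module; the helper name is_valid_json and both sample strings are invented for illustration.

```python
import json

def is_valid_json(text):
    """Return True if the text parses as JSON, otherwise False."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Keys are double-quoted strings; pairs are separated by commas
well_formed = '{"name": "Mary", "age": 28, "pets": ["cat", "dog"]}'

# JSON strings need double quotes, so single quotes are rejected
single_quoted = "{'name': 'Mary', 'age': 28}"

print(is_valid_json(well_formed), is_valid_json(single_quoted))
```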
Activity 2: JSON code
• Why are these incorrectly coded?
1. (“name”: “John”, “age”:30, “Car”: “Ford” )
2. {name: “John”, age:30, Car: “Ford” }
3. {“name”: “age”:30, “Car”: “Ford” }
4. {“name”: “John”, “age”:30, [Car]: [Ford] }
5. {“name”: “John” “age”:30 “Car”: “Ford” }
More about MongoDB
• A document database
• Documents do not have to conform to the
same structure (schema-less)
• Documents with similar types are stored in
collections, related collections are stored in a
DB
• The documents appear as JSON files to users
Coronel, C, and Morris, S 2019, Database Systems: Design,
Implementation,
& Management, 13th Edn.,Cengage, Boston, USA.
Example: Column-Oriented Database
• The same example in a row store (relational) and a column
store (non-relational). Software: Cassandra and HBase
Relational Table
Column-centric storage
Block 1 | 125670,145679,234466,785940,785840
Block 2 | Ma,Jimmy,Peter,Sundar,Jiping
Block 3 | 130,128,144,132,110
Block 4 | 85,78,88,82,70
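The conversion from rows to column blocks can be sketched in Python. The rows below are rebuilt from the four blocks above on the assumption that the first value in each block belongs to the first row, the second to the second row, and so on; the original relational table is not reproduced in the text, so the meaning of each column is left unnamed.

```python
# Row store: one tuple per record (reconstructed from the blocks, an assumption)
rows = [
    (125670, "Ma", 130, 85),
    (145679, "Jimmy", 128, 78),
    (234466, "Peter", 144, 88),
    (785940, "Sundar", 132, 82),
    (785840, "Jiping", 110, 70),
]

# Column store: regroup the same values so each block holds one column
blocks = [list(column) for column in zip(*rows)]

for number, block in enumerate(blocks, start=1):
    print(f"Block {number} | " + ",".join(str(value) for value in block))
```

Storing a column together means a query that needs only one attribute reads a single block instead of every row.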
Activity 3: Column-Oriented Database
• Convert the subset of data from the week 7 Excel file
(shown below) into column-centric format
Relational Table
Column-centric storage
Block 1 |
Block 2 |
Block 3 |
Block 4 |
Case study: Fraud detection using a
Graph Database
• Neo4j video on Fraud detection
• Watch the video and learn about graph database design
https://www.youtube.com/watch?v=ujimD6MP87I
Aggregate awareness
• Aggregate awareness means that the data is
grouped (or “aggregated”) around a central topic
• For example, data collected in connection with an
individual blog post, including
– Title, content, date posted
– Username, screen name
– Comments made on the post, etc.
• Key-value, document and column DBs are all
aggregate aware
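The blog post example can be sketched as a single aggregate, written here as a nested Python dictionary of the kind a document database might store. All field names and values are invented for illustration.

```python
import json

# One aggregate: everything about a single blog post is kept together
blog_post = {
    "title": "Riding in January",
    "content": "Notes from a cold commute.",
    "date_posted": "2018-01-15",
    "author": {"username": "jsmith", "screen_name": "Jo"},
    "comments": [
        {"username": "amy", "text": "Great post"},
        {"username": "raj", "text": "Thanks for sharing"},
    ],
}

# The whole aggregate is read or written as one unit
comment_count = len(blog_post["comments"])
as_document = json.dumps(blog_post, indent=2)

print(comment_count)
print(as_document)
```

In a relational design the post, author and comments would usually sit in separate tables and be joined when needed; here they travel together.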
Coronel, C, and Morris, S 2019, Database Systems: Design,
Implementation,
& Management, 13th Edn.,Cengage, Boston, USA.
NewSQL Databases
• Cloud-based to handle large amounts of data
• E.g. ClustrixDB, NuoDB
• Use SQL for queries
• Use massively parallel processing (MPP) of queries,
i.e. data is spread across multiple servers which
process the data locally
• Key-value and column-oriented data stores
Case Study: Hit Labs
ClustrixDB customer success story
• Application: Hit Labs created the Bubble Group Messenger
App (for group messaging and group chat)
• It is free on iOS and Android devices
• Originally built on Amazon's Aurora
• Problem: Hit Labs wanted a database to support their rapid
user growth