Preparing raw data for analysis

Accessing and Preparing Data
Summit 2020
Clif Kranish

Contents
1 Get Data: Upload data from your PC or access server data sources
2 Sampling: Use a representative sample of your data source
3 Profiling and Preparing Data: Understand your data and make it more useful
4 Combining Data: Augment your data from other data sources
5 Loading Data: Stage your data for analysis

“Data preparation is the act of manipulating
raw data into a form that can be readily and
accurately analyzed.”
— Wikipedia

Get Data
Upload data from your desktop or access tables and files on the server.

Preparing Data
WebFOCUS is used with data from disparate sources
Some data sources can be used as-is
Some may need to be prepared prior to use
• Company star schema data warehouses
• Departmental data marts
• Other curated corporate data sources
• Personal database tables
• Publicly accessible data
• Excel worksheets and other files on a PC

Get Data
• Upload Local Files
• Connect to Server Files and Databases

Citi Bike (NYC Bike Share System)

Can we answer these questions from the data?
● How long is the typical ride?
● Which hours of the day have the most rides?
● Has the number of riders per day varied over the month?
● What generation are the riders?
● What neighborhoods do the most rides start from?

Yes, if we augment the data with additional information
● Trip duration in minutes
● Trip start time as hour of the day
● Trip day of the month
● Station zip code, neighborhood, city and county
● Generation and age in years

Get Data
Local Files or Server Files

Upload Data
Data is staged (uploaded) and previewed as Raw or Formatted

Upload data (delimited file)
Raw data shows original file

Upload Data
Select Excel workbook with multiple worksheets

Upload Data
Select workbook with multiple worksheets

Upload Data (from Excel)
Rows to skip and use as headers automatically suggested

Upload Data (from delimited file)
Header row and delimiter automatically suggested
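
As a rough illustration of how a delimiter and header row can be suggested automatically, here is a minimal Python sketch using the standard library's csv.Sniffer. It is not how WebFOCUS does it; the sample text and column names are made up for the example.

```python
import csv
import io

# Hypothetical sample of a delimited file (made-up values and column names).
sample = (
    "tripduration;starttime;birth year\n"
    "634;2020-01-01 00:05:12;1985\n"
    "1201;2020-01-01 08:40:03;1969\n"
)

sniffer = csv.Sniffer()
# Guess the delimiter from a restricted set of likely candidates.
dialect = sniffer.sniff(sample, delimiters=",;\t|")
# Heuristic: does the first row look like column names rather than data?
has_header = sniffer.has_header(sample)

# Parse the text with the guessed dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))
```

Both results are only suggestions from a heuristic, which is why an upload dialog lets you override them.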

Upload Data
Adapter and Application Folder default – or can be selected

Upload Data
Metadata and table names suggested

Upload Data
Choose columns to upload

Sampling
Use a Representative Sample instead of the entire data set (or the first nnn rows)
• Improves responsiveness when preparing large data sets
• Automatically calculates sample size
• Confidence Level 99% - Margin of Error +/- 1%
• Stages a random sample on disk… or in a database table
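
The slides do not say which formula is used to calculate the sample size. One common choice is Cochran's formula with a finite population correction, sketched below in Python under that assumption. The function names are hypothetical, and p = 0.5 is the usual worst-case proportion.

```python
import math
import random
from statistics import NormalDist

def sample_size(population, confidence=0.99, margin=0.01, p=0.5):
    """Cochran's sample size with finite population correction (an assumed formula)."""
    if population <= 0:
        return 0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-score
    n0 = z * z * p * (1 - p) / (margin * margin)        # size for an unlimited population
    n = n0 / (1 + (n0 - 1) / population)                # correct for the finite population
    return min(population, math.ceil(n))

def random_sample(rows, seed=None):
    """Draw a simple random sample of the computed size, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, sample_size(len(rows)))
```

With these defaults, a source of one million rows needs a sample on the order of 16,000 rows, which is why sampling improves responsiveness on large data sets.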

Sampling
Always enable sampling
Tools > Workspace > Settings > Settings for Web Console Preferences
Data Assist (Representative Sampling) > ENABLE_SAMPLING

Sampling
Hover to see sample size

Sampling
Recreate (Stratified) Sampling

Stratified Sampling
Ensure at least one row for every unique value of selected column(s)
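
A minimal Python sketch of the idea, assuming rows are dictionaries: first pick one random row for every distinct value of the selected column, then fill the rest of the sample at random. The function name and the "usertype" column used below are hypothetical.

```python
import random

def stratified_sample(rows, key, size, seed=0):
    """Random sample that keeps at least one row per distinct value of `key`."""
    rng = random.Random(seed)
    # Group row positions by the value of the selected column.
    by_value = {}
    for i, row in enumerate(rows):
        by_value.setdefault(row[key], []).append(i)
    # One guaranteed row per distinct value.
    chosen = {rng.choice(positions) for positions in by_value.values()}
    # Fill the remaining slots from rows not chosen yet.
    remaining = [i for i in range(len(rows)) if i not in chosen]
    extra = max(0, min(size - len(chosen), len(remaining)))
    chosen.update(rng.sample(remaining, extra))
    # Note: if there are more distinct values than `size`, the result exceeds `size`.
    return [rows[i] for i in sorted(chosen)]

# Made-up data: a rare value that a plain random sample could easily miss.
rows = [{"usertype": "Subscriber"}] * 98 + [{"usertype": "Customer"}] * 2
sample = stratified_sample(rows, "usertype", size=10)
```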

Profiling and Preparing Data
Data Profiling helps to understand data
Preparing data can make data easier to analyze

Profile Data
See how data values are distributed
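
The kind of summary a profile view shows can be approximated in a few lines of Python. This is only a sketch of the concept; the function name and the sample values are made up.

```python
from collections import Counter

def profile(values):
    """Summarize one column: row count, nulls, distinct values, range, top values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(counts),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "most_common": counts.most_common(3),  # (value, count) pairs
    }

# Made-up birth years, including a missing value.
summary = profile([1985, 1969, 1969, 1896, None])
```

Outliers and suspiciously frequent values tend to show up in the min, max and most_common entries.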

Prepare Data
Derive Trip Duration in Minutes

Prepare Data
Trip Duration in Minutes
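
A minimal Python sketch of this derivation, assuming the source column holds the trip duration in seconds (the slides do not state the unit). The values are made up; the median is one reasonable reading of "typical ride".

```python
from statistics import median

def duration_minutes(seconds):
    """Convert a duration in seconds to minutes, rounded to one decimal."""
    return round(seconds / 60, 1)

trip_seconds = [634, 1201, 305, 2400]  # made-up durations in seconds
trip_minutes = [duration_minutes(s) for s in trip_seconds]

# The median is less sensitive than the mean to a few very long rides.
typical_ride = median(trip_minutes)
```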

Prepare Data
Derive Day of Month

Prepare Data
Derived day of month

Prepare Data
Derive Hour of Day

Prepare Data
Derive Hour of Day – Use function DTIME

Prepare Data
Derived Hour of Day – Sorted by Value
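
The day-of-month and hour-of-day derivations can be sketched in Python as follows. This uses Python's datetime module, not the DTIME function named on the slide, and the timestamp format and values are assumptions.

```python
from collections import Counter
from datetime import datetime

# Made-up trip start times; the format string is an assumption about the source.
starts = ["2020-01-01 08:40:03", "2020-01-01 08:55:10", "2020-01-15 17:05:44"]
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in starts]

day_of_month = [d.day for d in parsed]   # 1..31
hour_of_day = [d.hour for d in parsed]   # 0..23

# Rides per hour, sorted by count (most rides first).
rides_by_hour = Counter(hour_of_day).most_common()
```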

Profile Data – Birth Year
A cyclist was born in 1896? And what’s up with 1969?

Prepare Data
Birth Year - Select 1969 – Brushing shows values for other columns

Prepare Data
Birth Year – Add new expression

Prepare Data
Birth Year – Valid or NULL
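
A "valid or NULL" expression might look like the Python sketch below. The rule itself is an assumption: the slides do not give the cutoff, the ride year, or how 1969 is treated, so those are parameters here and the function name is hypothetical.

```python
def valid_birth_year(year, ride_year=2020, max_age=100, placeholders=()):
    """Return the birth year if plausible, otherwise None (NULL).

    ride_year, max_age and placeholders are assumed parameters, not values
    taken from the slides.
    """
    if year is None or year in placeholders:
        return None
    age = ride_year - year
    return year if 0 <= age <= max_age else None

cleaned = [valid_birth_year(y) for y in [1985, 1896, 1969]]
# If brushing suggests 1969 is a placeholder rather than a real value:
cleaned_strict = [valid_birth_year(y, placeholders={1969}) for y in [1985, 1896, 1969]]
```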

Prepare Data
Generations – Create Groups

Prepare Data
Generations – Create Group - Millennials
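
Grouping birth years into generations amounts to a range lookup, sketched here in Python. The year ranges follow the commonly cited Pew Research Center definitions, which is an assumption; the slides do not list the ranges used.

```python
# (name, first birth year, last birth year); ranges are an assumption.
GENERATIONS = [
    ("Silent Generation", 1928, 1945),
    ("Baby Boomers", 1946, 1964),
    ("Generation X", 1965, 1980),
    ("Millennials", 1981, 1996),
    ("Generation Z", 1997, 2012),
]

def generation(birth_year):
    """Map a birth year to a generation name; None stays None."""
    if birth_year is None:
        return None
    for name, first, last in GENERATIONS:
        if first <= birth_year <= last:
            return name
    return "Other"

groups = [generation(y) for y in [1985, 1969, None, 1900]]
```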

Combine Data
Augment data with additional information from other files with UNION or JOIN

Combine Data Sources
Select JOIN or UNION

UNION
Multiple data sources with the same layout
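
Conceptually, a union stacks the rows of sources that share a layout. A minimal Python sketch, assuming two monthly delimited files with identical headers (the file contents are made up, and the function name is hypothetical):

```python
import csv
import io
from itertools import chain

# Made-up monthly extracts with the same layout.
jan = "tripduration,starttime\n634,2020-01-01 00:05:12\n"
feb = "tripduration,starttime\n1201,2020-02-01 08:40:03\n"

def union(*sources):
    """Append rows from several sources after checking that the layouts match."""
    readers = [csv.DictReader(io.StringIO(text)) for text in sources]
    layouts = {tuple(reader.fieldnames) for reader in readers}
    if len(layouts) != 1:
        raise ValueError("sources do not share the same layout")
    return list(chain.from_iterable(readers))

all_trips = union(jan, feb)
```

This sketch keeps duplicate rows, like SQL's UNION ALL.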

JOIN
Two tables with a column with shared values

Join Editor
Columns to join are suggested. See effects of inner or outer join
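
The difference between an inner and an outer join can be seen in a small Python sketch. All names and values are made up; it assumes the join key is unique in the lookup table.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dict rows on `key`; `how` is "inner" or "left"."""
    index = {row[key]: row for row in right}  # assumes `key` is unique in `right`
    extra_cols = [c for c in (right[0] if right else {}) if c != key]
    joined = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            joined.append({**row, **match})
        elif how == "left":
            # Left outer join: keep the unmatched row, pad lookup columns with None.
            joined.append({**row, **{c: None for c in extra_cols}})
    return joined

# Made-up trips and station lookup; station 9 has no lookup row.
trips = [{"station_id": 1}, {"station_id": 2}, {"station_id": 9}]
stations = [
    {"station_id": 1, "zip_code": "00001", "neighborhood": "Neighborhood A"},
    {"station_id": 2, "zip_code": "00002", "neighborhood": "Neighborhood B"},
]

inner = join(trips, stations, "station_id")              # drops the trip from station 9
outer = join(trips, stations, "station_id", how="left")  # keeps it with empty columns
```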

Profile
Zip Code, Neighborhood, City, County for each starting station

Load
Stage data for analysis in a database table, column store or as a view

Load Options
Select Adapter, Connection, Synonym and Table
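
To illustrate the difference between staging as a table and as a view, here is a minimal Python sketch using an in-memory SQLite database. SQLite stands in for whatever adapter and connection you select; the table, view and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the selected connection

# Stage the prepared rows as a table (made-up values).
conn.execute(
    "CREATE TABLE trips (duration_min REAL, hour_of_day INTEGER, generation TEXT)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(10.6, 8, "Millennials"), (20.0, 8, "Generation X"), (5.1, 17, "Millennials")],
)

# A view stores only the query, not a copy of the data.
conn.execute(
    "CREATE VIEW rides_by_hour AS "
    "SELECT hour_of_day, COUNT(*) AS rides FROM trips GROUP BY hour_of_day"
)

rides_by_hour = conn.execute(
    "SELECT hour_of_day, rides FROM rides_by_hour ORDER BY rides DESC"
).fetchall()
```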
