This document discusses accessing and preparing data for analysis in WebFOCUS. It covers getting data by uploading files from a PC or by accessing server data sources, then preparing that data: profiling it to understand it and deriving new variables from it. It also demonstrates sampling to improve performance on large datasets and combining data through union or join operations. Examples based on the Citi Bike NYC bike sharing dataset show how to profile and prepare the data to answer different analysis questions.
2. Contents
1) Get Data: Upload data from your PC or access server data sources
2) Sampling: Use a representative sample of your data source
3) Profiling and Preparing Data: Understand your data and make it more useful
4) Combining Data: Augment your data from other data sources
5) Loading Data: Stage your data for analysis
3. “Data preparation is the act of manipulating raw data into a form that can be readily and accurately analyzed.”
— Wikipedia
4. Get Data
Upload data from your desktop or access tables and files on the server.
5. Preparing Data
WebFOCUS is used with data from disparate sources
Some data sources can be used as-is
Some may need to be prepared prior to use
• Company star schema data warehouses
• Departmental data marts
• Other curated corporate data sources
• Personal database tables
• Publicly accessible data
• Excel worksheets and other files on a PC
6. Get Data
• Upload Local Files
• Connect to Server Files and Databases
9. Can we answer these questions from the data?
● How long is the typical ride?
● What hours of the day have the most rides?
● Has the number of rides per day varied over the month?
● What generation are the riders?
● What neighborhoods do the most rides start from?
10. Yes, if we augment the data with additional information
● Trip duration in minutes
● Trip start time as hour of the day
● Trip day of the month
● Station zip code, neighborhood, city and county
● Generation and age in years
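The derived fields listed above can be sketched outside WebFOCUS in a few lines of Python. This is only an illustration, not how WebFOCUS computes them: the field names (`tripduration` in seconds, `starttime`, `birth_year`), the timestamp format, and the generation boundaries are all assumptions made for the example. Station zip code, neighborhood, city and county are left out because they require a lookup against a second data source.

```python
from datetime import datetime

# Assumed birth-year ranges; generation definitions vary by source.
GENERATIONS = [
    (1946, 1964, "Baby Boomer"),
    (1965, 1980, "Generation X"),
    (1981, 1996, "Millennial"),
    (1997, 2012, "Generation Z"),
]


def generation_for(birth_year):
    """Map a birth year to a generation label (hypothetical boundaries)."""
    for first, last, name in GENERATIONS:
        if first <= birth_year <= last:
            return name
    return "Other"


def derive_fields(trip):
    """Return a copy of the trip record with the derived fields added."""
    # Assumed timestamp format: "YYYY-MM-DD HH:MM:SS"
    start = datetime.strptime(trip["starttime"], "%Y-%m-%d %H:%M:%S")
    return {
        **trip,
        "duration_minutes": round(trip["tripduration"] / 60, 1),
        "start_hour": start.hour,
        "day_of_month": start.day,
        # Approximate age: year of the trip minus year of birth
        "age_years": start.year - trip["birth_year"],
        "generation": generation_for(trip["birth_year"]),
    }


# A made-up trip record, used only to exercise the function.
trip = {"tripduration": 930, "starttime": "2019-06-14 08:05:00", "birth_year": 1985}
derived = derive_fields(trip)
```

With these fields in place, each question on the previous slide reduces to a simple aggregation, for example counting rides grouped by `start_hour`.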
25. Sampling
Use a Representative Sample instead of the entire dataset (or the first nnn rows)
• Improves responsiveness when preparing large data sets
• Automatically calculates sample size
• Confidence Level 99% - Margin of Error +/- 1%
• Stages a random sample on disk… or in a database table
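The slide does not say how the sample size is calculated. As a rough sketch, the standard formula for estimating a proportion (Cochran's formula with a finite population correction) gives a sample size for a 99% confidence level and a 1% margin of error. The function names and the worst-case proportion of 0.5 are assumptions for illustration, not WebFOCUS internals.

```python
import math
import random
from statistics import NormalDist


def sample_size(population, confidence=0.99, margin=0.01, proportion=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    if population <= 0:
        return 0
    # Two-sided z-score for the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Cochran's formula for an effectively infinite population
    n0 = z * z * proportion * (1 - proportion) / (margin * margin)
    # Correct for the finite population size
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


def draw_sample(rows, seed=None):
    """Draw a simple random sample (without replacement) of the computed size."""
    rng = random.Random(seed)
    return rng.sample(rows, sample_size(len(rows)))
```

Because of the finite population correction, the sample size levels off as the data source grows, which is why working on a sample stays responsive even for very large tables.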