FINDING AND USING
UNIQUE DATASETS
Pepar Hugo
https://www.linkedin.com/in/peparhugo/
May 25th, 2021
AGENDA
- Introduction
- Topic Overview
- Traditional vs Modern Data Sources
- Live Demo
INTRODUCTION
- Senior Data Engineer at Corel
- Working with GCP to help support data-driven initiatives
- BigQuery
- Cloud Functions
- Cloud Run
- AI Platform
- Cloud Storage
WORK EXPERIENCE
7 years of experience:
- Business Metrics Analyst/Manager Sales Compensation at Sasktel
- Information Analyst/Data Scientist at eHealth Saskatchewan
- Senior Data Analyst at Mapillary
- Senior Data Analyst at Facebook
- Data Engineer at Intersog/Knowledgehound
- Senior Data Engineer at Corel
TOOLS
- Python
- R
- Jupyter
- PyCharm
- SQL Systems (Oracle, Postgres, MySQL, SQL Server)
- ElasticSearch
- Kibana
- Docker
- Bash
- Git
- GitHub
- BitBucket
- Jenkins
- Django
- Kafka
- Spark
WHY I’M GIVING THIS TALK
- The thought process for approaching data-driven problems
- For managers to understand, at a high level, how analysts should approach problems
- For analysts to see the bigger picture of pulling data sources, programming and analysis together
WHY REGINA LEAD WATER
CONNECTORS?
In November 2019 it was announced that there are lead connectors in Regina leaching into drinking water:
- Very little information about this was provided to the public
- Journalists asked about affected schools and day cares in Regina
→ The city’s response: there’s no way to know this information
- I simply didn’t believe there was no way to know which schools or day cares were affected
- I took this a step further in January 2021 by putting together a presentation framing this as a public health issue
THE ANALYSIS I PRESENTED IN
JANUARY
file:///Users/pepar-mapillary/regina-lead/Pepar Hugo - Lead Water
Connections - Jan 2021.pdf
DATA SOURCES
TRADITIONAL
Formats
- CSV Files
- Text Files
- Excel Files
Generally easy to use but limited in their uses
Issues
- Flat files
- Pre-compiled
- Limited Data Types
- Stale
MODERN
Formats
- JSON
- XML
- Parquet
- HTML
More complex to use but offer more flexibility in data types, such as geospatial fields
Issues
- Steep learning curve
- Requires reading technical documentation
- Requires programming
- Hard to access in a user-friendly manner
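To make the contrast between the two format families concrete, here is a minimal Python sketch using only the standard library. The sample records, the field names (address, connection_type) and both helper functions are hypothetical and invented for illustration; they are not taken from the talk's data or code. The point is that a CSV row comes back as flat strings, while a GeoJSON-style document carries a nested, typed geometry.

```python
import csv
import io
import json

# Hypothetical sample data, invented for illustration only.
CSV_TEXT = "address,connection_type\n123 Example St,lead\n"

GEOJSON_TEXT = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"address": "123 Example St", "connection_type": "lead"},
            # GeoJSON stores coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [-104.6189, 50.4452]},
        }
    ],
})


def read_csv_rows(text):
    """Flat file: each row is a dict whose values are all strings."""
    return list(csv.DictReader(io.StringIO(text)))


def read_geojson_points(text):
    """Nested JSON: flatten each feature's properties and numeric coordinates."""
    collection = json.loads(text)
    points = []
    for feature in collection["features"]:
        lon, lat = feature["geometry"]["coordinates"]
        points.append({**feature["properties"], "lon": lon, "lat": lat})
    return points


csv_rows = read_csv_rows(CSV_TEXT)
geo_points = read_geojson_points(GEOJSON_TEXT)
```

The CSV reader needs no knowledge of the file beyond its header, which is why flat files are easy to start with. The JSON reader has to know the document's structure in advance, which is where the technical documentation comes in.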
USING MODERN DATA SOURCES
How I used modern data sources to create the Regina Lead Connections analysis
Data Sources
- Regina Open GIS Data
- Statistics Canada API
- Saskatchewan Registered Child Care Facilities
- Integrating geospatial data
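One possible way to integrate point locations from two sources is a distance check. The sketch below is an assumption about how such a join could look, not the method used in the analysis: it flags facilities that lie within a chosen radius of any lead connection using the haversine great-circle distance. All names, coordinates and the 50 m radius are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def facilities_near_connections(facilities, connections, radius_m=50):
    """Return names of facilities within radius_m of at least one connection."""
    flagged = []
    for facility in facilities:
        if any(
            haversine_m(facility["lat"], facility["lon"], c["lat"], c["lon"]) <= radius_m
            for c in connections
        ):
            flagged.append(facility["name"])
    return flagged


# Hypothetical records, invented for illustration only.
facilities = [
    {"name": "Example Day Care A", "lat": 50.4452, "lon": -104.6189},
    {"name": "Example Day Care B", "lat": 50.4900, "lon": -104.6189},
]
connections = [{"lat": 50.4453, "lon": -104.6189}]

flagged = facilities_near_connections(facilities, connections)
```

This brute-force comparison checks every facility against every connection, which is fine for small lists. For city-scale data a spatial index or a dedicated geospatial library would be the more usual choice.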
GOALS OF THIS DEMO
- How to think about finding sources
- How to think about accessing them
- How to think about integrating sources
- This talk will not teach the software engineering or programming steps, but the code will be shared with those interested via a public GitHub repo
