11. TensorFlow Datasets
Each dataset includes some (not all) of
the following metadata:
● Description
● Dataset Homepage
● Download Size
● Example Code
● Example Images
● etc.
Who am I
What is RIIS
Why the interest in TensorFlow
Why the interest in TensorFlow Datasets
Founder and President of RIIS, LLC. – a mobile and web development company based in Troy, MI
Author of six books on development
Adjunct Professor – currently teaching Android Development at Saginaw Valley State University in the fall term and University of Detroit Mercy in the winter term
What is TensorFlow - back in the 90s, if you wanted to do any AI or machine learning you had to roll your own back propagation neural network.
From personal experience I can say it wasn’t impossible but it took a while.
These days it’s much simpler: frameworks such as TensorFlow or PyTorch sit on top of the latest neural network and other model types, so you can call any number of machine learning models and train them easily.
The following example is a Hello World for TensorFlow: we’re trying to get the NN to match the equation y = (3*x) + 1, so if x is 1, y is 4.
We train the model on a 6 element x,y array over 500 epochs.
When we ask it for the value of y when x = 10 we get a prediction of 31.00025749. So it’s not exactly right but it’s pretty close.
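A minimal sketch of what that training run boils down to, written in plain Python so it runs without TensorFlow. It is not the code from the slide or the codelab: the six x values, the zero starting weights, the 0.01 learning rate and the full-batch gradient descent are all assumptions made for illustration.

```python
# Plain-Python stand-in for a single-neuron model: y_hat = w * x + b.
# Assumptions (not from the talk): the six x values, zero initial weights,
# learning rate 0.01, full-batch gradient descent on mean squared error.

xs = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]   # six training inputs (assumed values)
ys = [3.0 * x + 1.0 for x in xs]        # targets from y = (3*x) + 1


def train(xs, ys, epochs=500, learning_rate=0.01):
    """Fit w and b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training pair with the current w and b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train(xs, ys)
prediction = w * 10.0 + b   # close to 31, but not exactly 31
print(f"w={w:.4f} b={b:.4f} prediction for x=10: {prediction:.4f}")
```

The learned w and b end up near 3 and 1 without ever reaching them exactly, which is why the prediction lands just off 31.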
EXPLAIN THE MODEL https://codelabs.developers.google.com/codelabs/tensorflow-lab1-helloworld#0
URL is in the resources
While this is a really simple example we see a lot of the common elements of a standard TensorFlow workflow, i.e. load the data, train the model, test the model and deploy
I’m mostly interested in image classification and object detection, as we do a lot of work on drones. We typically deploy the model as a TensorFlow Lite file on an Android phone or iPhone
These differ from our Hello World because we have to label the data, so the workflow is now label, load, train, test and deploy
A few years ago now, we created an app that counts sheep or cattle from a DJI or Parrot drone.
I quickly ran into a problem: where am I going to find my labeled data?
Creating your own labeled data can be expensive. You have to get 5000 to 10000 images and then someone has to manually label (draw a box around) what you want to detect in each image. That can cost up to $1 per image depending on how many objects you have in an image
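As a rough worked example of that labeling budget, using only the figures above (5000 to 10000 images at up to $1 each). The function name is hypothetical and the result is an upper bound, not a quote.

```python
def labeling_cost_range(min_images, max_images, max_cost_per_image):
    """Return the (low, high) upper-bound labeling cost in dollars."""
    return min_images * max_cost_per_image, max_images * max_cost_per_image


low, high = labeling_cost_range(5_000, 10_000, 1.00)
print(f"Labeling budget at up to $1 per image: ${low:,.0f} to ${high:,.0f}")
```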
TensorFlow Datasets (TFDS) solves my labeling problem by providing a simple-to-use loading mechanism for 200+ public research datasets
That’s assuming your use case is covered by one of these datasets
It’s another front end to make our lives easier so I don’t have to write custom code to load each dataset
TFDS has been built with these principles in mind:
Simplicity: Standard use-cases should work out of the box
Performance: TFDS follows best practices and can achieve state-of-the-art speed
Determinism/reproducibility: All users get the same examples in the same order
Customisability: Advanced users can have fine-grained control
TensorFlow Datasets have something for everyone, I’m mostly interested in image classification for mobile apps and object detection for drones
Why - easy to use, consistency, everything collected together in one place, as well as sample code to show you how to load and train the data.
These are NOT Google datasets, the collection is mostly public research datasets
The format is always the same, which makes it easy to load into a TF notebook rather than having to figure out how to download and import each dataset every time
Currently there are 224 datasets and the number is growing all the time; there’s also a mechanism to add your own dataset
SHOW THE DATASETS - https://www.tensorflow.org/datasets/catalog/mnist
https://www.tensorflow.org/datasets/catalog/coco
https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews
If we want to load the MNIST database of handwritten digits we use the following command
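A minimal sketch of such a load call, assuming the `tensorflow_datasets` package is installed and the machine can download the data on first use. The helper names `mnist_splits` and `load_mnist` are hypothetical, and the 80/20 carve-out of a validation set is an illustration rather than the split used in the talk.

```python
def mnist_splits(train_percent=80):
    """Build TFDS split strings that carve a validation set out of 'train'."""
    return [f"train[:{train_percent}%]", f"train[{train_percent}%:]", "test"]


def load_mnist(data_dir=None):
    """Download MNIST on first use and return (train, validation, test, info)."""
    # Imported here so mnist_splits() works without TensorFlow installed.
    import tensorflow_datasets as tfds

    (train_ds, val_ds, test_ds), info = tfds.load(
        "mnist",
        split=mnist_splits(),   # a list of splits returns a list of datasets
        shuffle_files=True,
        data_dir=data_dir,      # None falls back to the default data directory
        with_info=True,         # also return the dataset metadata
    )
    return train_ds, val_ds, test_ds, info


print(mnist_splits())
```

Calling `train_ds, val_ds, test_ds, info = load_mnist()` triggers the download; `info.splits["train"].num_examples` then reports how many training examples the dataset holds.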
Some common arguments:
* `split=`: Which split to read (e.g. `'train'`, `['train', 'test']`, `'train[80%:]'`,...).
* `shuffle_files=`: Control whether to shuffle the files between each epoch (TFDS stores big datasets in multiple smaller files).
* `data_dir=`: Location where the dataset is saved (defaults to `~/tensorflow_datasets/`)
* `with_info=True`: Returns the `tfds.core.DatasetInfo` containing dataset metadata
* `download=False`: Disable download
If we want to load the Yelp database of reviews we do the following.
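A minimal sketch of the Yelp load, assuming `tensorflow_datasets` is installed and that the dataset exposes (text, label) supervised keys. The helpers `format_review` and `preview_yelp` are hypothetical, and the sample review printed at the end is made up.

```python
def format_review(text, label, width=60):
    """Render one (text, label) pair as a short single-line preview."""
    if isinstance(text, bytes):          # TFDS hands text back as UTF-8 bytes
        text = text.decode("utf-8", errors="replace")
    return f"[label={int(label)}] {text[:width]}"


def preview_yelp(count=3):
    """Download the dataset on first use and print a few training reviews."""
    # Imported here so format_review() works without TensorFlow installed.
    import tensorflow_datasets as tfds

    dataset = tfds.load(
        "yelp_polarity_reviews",
        split="train",
        as_supervised=True,   # yield (text, label) tuples instead of dicts
    )
    for text, label in tfds.as_numpy(dataset.take(count)):
        print(format_review(text, label))


# Made-up review, only to show the output format without downloading anything.
print(format_review(b"Great tacos, friendly staff.", 1))
```

Only the dataset name and a couple of arguments change compared with an image dataset; the call itself is the same `tfds.load`.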
Aside from some of the parameters it’s the same. It’s always the same which is the beauty of TFDS.
No custom code, no hacks to get at the dataset.
Because the datasets are also public research datasets at universities they have a tendency to go offline from time to time
It’s downloaded and ready for the next stage of our workflow.
We’ll return to the Yelp dataset later in our colab demo.
There are lots more, just pulling out the ones that we’re interested in
We use a retrained version of the COCO dataset to count cattle from drones
COCO – Common Objects in Context
KITTI – autonomous driving
VOC – lots of different objects, people, cars etc.
Waymo – autonomous driving
WIDER FACE – face detection
Run Through Colab
https://colab.research.google.com/drive/1hO7G4Tn-2OAlkDK9-3hpr7Yr0rzKdquz#scrollTo=FmNNPNjR3XsN
Show Colab Pro – Choose GPU – Runtime -> Change Runtime Type
Runtime -> Run all
Show RAM
Show where the files are /root/tensorflow_datasets/yelp_polarity_reviews/
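A small sketch for inspecting that folder from a notebook cell, using only the standard library. It assumes the default data directory `~/tensorflow_datasets` (which resolves to /root/tensorflow_datasets when running as root on Colab); the function name is hypothetical.

```python
from pathlib import Path


def list_dataset_files(name, data_dir="~/tensorflow_datasets"):
    """Return the files stored for one dataset, relative to its folder, sorted."""
    root = Path(data_dir).expanduser() / name
    if not root.is_dir():
        return []   # nothing has been downloaded for this dataset yet
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


for path in list_dataset_files("yelp_polarity_reviews"):
    print(path)
```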
SKIP AHEAD TO NEXT SLIDE WHILE RUNNING
One of the issues we had in the past was getting the Google Cloud Platform up and running; you could spend days trying to get it configured
With these Jupyter notebooks on Colab you’re up and running in minutes
GO BACK TO YELP COLAB
Originally I trained as a Mech Eng, so I always tend to the practical rather than the theoretical
So over the summer a couple of our interns turned some of the TFDS datasets into real apps.
Go to RIIS apps page on Google Play https://play.google.com/store/apps/developer?id=RIIS+LLC
Doggie in the window – Stanford Dogs dataset - SHOW VIDEO - SHOW CODE https://github.com/riis/DogClassificationAndroid
Identiplant - Plant Village dataset - SHOW VIDEO – this uses a non TFDS dataset
Food Classifier – Food101 dataset – SHOW VIDEO
Flower Classifier – Oxford Flower Dataset – SHOW VIDEO
The problem with other datasets is that we need a custom script to load them
Many of these are also moving into TF Datasets as you can add to the datasets
TFDS is at 225 datasets, with more being added all the time
But you still might need to look elsewhere for datasets, such as Kaggle or AIcrowd, or just by doing a search on Google
We’re planning on doing some more apps early in the new year
We’re also looking for Canadian interns this time around, if anyone knows anyone who might be interested
Snakes – want to see if we can write an app to tell if a snake is poisonous or not
Mushrooms (bad data) – There’s a Danish app that detects if a mushroom is going to make you sick or not
The app uses crowdsourcing to get lots of people to upload their own mushroom pictures and definitions
Unfortunately there’s nobody looking at the uploaded pictures so the dataset is corrupt
Spiders – is the spider a brown recluse or not?
As you can see here I have a problem with poisonous things, maybe I should add Quicksand to the list.
TensorFlow 2 COCO – the Object Detection API was recently upgraded with support for TF2, so we plan to redo our cattle and sheep counters
Cattle & Sheep TFDS - going to upload our cattle and sheep labeled dataset to TFDS