Using Public Datasets with TensorFlow.pptx


About This Presentation

Using TensorFlow and public datasets


Slide Content

Using Public Datasets with TensorFlow Datasets (Chapter 4)

There are many different ways of getting the data with which to train a model. The Fashion MNIST dataset is conveniently bundled with Keras, but many public datasets require you to learn lots of different domain-specific skills before you can even begin to consider your model architecture. The goal behind TensorFlow Datasets (TFDS) is to expose datasets in a way that's easy to consume, where all the preprocessing steps of acquiring the data and getting it into TensorFlow-friendly APIs are done for you.
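A rough sketch of what that looks like in practice (assuming the tensorflow-datasets package is installed; fashion_mnist is the TFDS catalog name for Fashion MNIST):

import tensorflow_datasets as tfds

# tfds.load downloads (if needed) and prepares the dataset, returning a
# tf.data.Dataset whose records are dictionaries of features.
# TFDS is a separate install (pip install tensorflow-datasets); it comes
# preinstalled on Google Colab.
mnist_train = tfds.load(name="fashion_mnist", split="train")

for record in mnist_train.take(1):
    print(record["image"].shape, record["label"])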

TFDS builds on this idea, greatly expanding both the number of datasets available and the diversity of dataset types. TensorFlow Datasets is a separate install from TensorFlow, so be sure to install it before trying out any samples! If you are using Google Colab, it's already preinstalled.

TFDS List
The list of available datasets is growing all the time, in categories such as:
Audio: Speech and music data
Image: From simple learning datasets like Horses or Humans up to advanced research datasets for uses such as diabetic retinopathy detection
Object detection: COCO, Open Images, and more
Structured data: Titanic survivors, Amazon reviews, and more
Summarization: News from CNN and the Daily Mail, scientific papers, wikiHow, and more
Text: IMDb reviews, natural language questions, and more
Translate: Various translation training datasets
Video: Moving MNIST, StarCraft, and more
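To browse the catalog from code, TFDS exposes a helper that lists every registered dataset builder (a minimal sketch):

import tensorflow_datasets as tfds

# Prints the names of all datasets currently registered with TFDS.
print(tfds.list_builders())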

Getting Started with TFDS

Data about the dataset is also available using the with_info parameter when loading the dataset, like this:
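A sketch of that call, assuming the fashion_mnist dataset:

import tensorflow_datasets as tfds

# with_info=True returns a second value: a DatasetInfo object describing
# the splits, feature shapes and types, and citation for the dataset.
data, info = tfds.load("fashion_mnist", with_info=True)
print(info)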

Using TFDS with Keras Models
When using TFDS the code is very similar to what we saw in Chapter 2, but with some minor changes. The Keras datasets gave us ndarray types that worked natively in model.fit, but with TFDS we'll need to do a little conversion work:
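One way that conversion can look (a sketch under those assumptions, not necessarily the exact code from the chapter): load the TFDS splits as (image, label) pairs, normalize and batch them, and pass the resulting tf.data.Dataset to model.fit.

import tensorflow as tf
import tensorflow_datasets as tfds

# as_supervised=True returns (image, label) tuples instead of dictionaries.
(train_data, test_data) = tfds.load("fashion_mnist",
                                    split=["train", "test"],
                                    as_supervised=True)

def normalize(image, label):
    # Scale pixel values to [0, 1] for training.
    return tf.cast(image, tf.float32) / 255.0, label

train_data = train_data.map(normalize).batch(32)
test_data = test_data.map(normalize).batch(32)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_data, epochs=5, validation_data=test_data)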

Horses & Humans from tfds

The Horses or Humans dataset is split into training and test sets, so if you want to do validation of your model while training, you can do so by loading the test split from TFDS as a separate validation set like this:
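A sketch of loading the two splits (the TFDS catalog name is horses_or_humans), with the test split then usable as validation_data in model.fit:

import tensorflow_datasets as tfds

# Separate calls for the training and test splits.
train_data = tfds.load("horses_or_humans", split="train", as_supervised=True)
val_data = tfds.load("horses_or_humans", split="test", as_supervised=True)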

Loading Specific Versions
All datasets stored in TFDS use a MAJOR.MINOR.PATCH numbering system.
If PATCH is updated, the data returned by a call is identical, but the underlying organization may have changed.
If MINOR is updated, the data is still unchanged, except that there may be additional features in each record.
If MAJOR is updated, there may be changes in the format of the records and their placement.
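To pin a particular version you append it to the dataset name when loading; for example (a sketch, and the version string here is illustrative, so check the catalog for what actually exists):

import tensorflow_datasets as tfds

# "3.*.*" requests any MINOR/PATCH release under MAJOR version 3.
mnist = tfds.load("mnist:3.*.*")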

Using Mapping Functions for Augmentation
Earlier we used the augmentation tools available with an ImageDataGenerator to provide the training data for your model. With TFDS, similar augmentation can be achieved by mapping a function over the dataset, as sketched below.
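A minimal sketch using standard tf.image operations (the specific augmentations chosen here are illustrative):

import tensorflow as tf
import tensorflow_datasets as tfds

data = tfds.load("horses_or_humans", split="train", as_supervised=True)

def augment(image, label):
    # Normalize, then apply a random horizontal flip and brightness jitter
    # to each record as it streams through the pipeline.
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image, label

train_data = data.map(augment).shuffle(1024).batch(32)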

Using TensorFlow Addons
The TensorFlow Addons library contains even more functions that you can use. Some of the functions used for ImageDataGenerator augmentation (such as rotate) can only be found there, so it's a good idea to check it out.
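For example, rotation can be done with tfa.image.rotate (a sketch; tensorflow-addons is a separate install, e.g. pip install tensorflow-addons):

import tensorflow as tf
import tensorflow_addons as tfa

def rotate_image(image, label):
    # tfa.image.rotate takes the angle in radians; 0.35 rad is roughly 20 degrees.
    image = tf.cast(image, tf.float32) / 255.0
    image = tfa.image.rotate(image, 0.35)
    return image, label

# This function can then be mapped over a dataset just like any other augmentation.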

Using Custom Splits
TFDS lets you define your own splits of the data, and if you're familiar with Python slice notation, you can use that as well. This notation can be summarized as defining your desired slices within square brackets like this: [<start>:<stop>:<step>]
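Applied to TFDS, the slice notation goes inside the split string; for example (a sketch), taking the first 80% of the training records for training and the remainder for validation:

import tensorflow_datasets as tfds

# The slicing API lets you combine or subdivide the named splits.
train_data = tfds.load("horses_or_humans", split="train[:80%]", as_supervised=True)
val_data = tfds.load("horses_or_humans", split="train[80%:]", as_supervised=True)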

The ETL Process
ETL (Extract, Transform, and Load) is the core pattern that TensorFlow uses for training, regardless of scale. We've been exploring small-scale, single-computer model building in this book, but the same technology can be used for large-scale training across multiple machines with massive datasets.

The Extract-Transform-Load (ETL) process is a crucial step in training machine learning models.
Extraction and Transformation: Data extraction and transformation can be performed on any processor, including a CPU. Tasks such as downloading data, unzipping files, and preprocessing records are typically executed on the CPU, and the code for these tasks does not fully exploit GPUs or TPUs, which are designed primarily for parallel computation.

Training Phase: Training a model benefits significantly from GPUs or TPUs, which excel at parallel processing, so it's advantageous to use them during the training phase whenever possible.
Workload Distribution: Where both CPU and GPU/TPU resources are available, it's beneficial to distribute the workload accordingly: the Extract and Transform stages are typically performed on the CPU, leveraging its general-purpose processing, while the Load stage, which involves the actual model training, is best executed on GPUs or TPUs because of their specialized parallel computing capabilities.

Large datasets often require data preparation, including extraction and transformation, to be performed in batches due to their size. In this scenario, while one batch is being prepared, the GPU/TPU remains idle as it awaits the data for training. Once the batch is ready, it can be sent to the GPU/TPU for training, but this leaves the CPU idle until the training process is completed. Subsequently, the CPU starts preparing the next batch, leading to significant idle time in the overall process. This idle time highlights the potential for optimization in the data preparation and training pipeline to minimize resource underutilization and improve overall efficiency.
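tf.data provides the tools to overlap these stages so that neither processor sits idle; a minimal sketch (the buffer and batch sizes here are illustrative):

import tensorflow as tf
import tensorflow_datasets as tfds

data = tfds.load("horses_or_humans", split="train", as_supervised=True)

def preprocess(image, label):
    return tf.cast(image, tf.float32) / 255.0, label

# num_parallel_calls parallelizes the Transform step across CPU cores, and
# prefetch keeps upcoming batches ready while the accelerator trains on the
# current one, overlapping Extract/Transform with Load.
train_data = (data
              .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
              .shuffle(1024)
              .batch(32)
              .prefetch(tf.data.AUTOTUNE))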

[Diagram: CPU/GPU utilization with sequential training vs. with a pipelined ETL process]