WEEK 6: INGESTING BY BATCH OR BY STREAM
6.1 Comparing batch and stream ingestion:
Fig 6.1: Batch and Streaming Ingestion
To generalize the characteristics of batch processing: batch ingestion involves running batch
jobs that query a source, move the resulting dataset or datasets to durable storage in the
pipeline, and then perform whatever transformations the use case requires. As noted in the
Ingesting and Preparing Data module, this could be as simple as cleaning and minimally
formatting data to put it into the lake, or it could involve more complex enrichment,
augmentation, and processing to support complex querying or big data and machine learning
(ML) applications. Batch processing might be started on demand, run on a schedule, or initiated
by an event. Traditional extract, transform, and load (ETL) uses batch processing, but extract,
load, and transform (ELT) processing might also be done in batch.
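As a rough illustration of this pattern, the following minimal sketch (assuming pandas, boto3, and a hypothetical source table and S3 bucket, which are not part of the course material) queries a source, applies a light transformation, and lands the batch in durable storage. A scheduler or an event trigger would invoke it:

import io
import sqlite3            # stand-in for whatever SQL source your driver supports

import boto3
import pandas as pd


def run_batch_job(source_db: str, bucket: str, prefix: str) -> None:
    # Extract: run a batch query against the source (table name is hypothetical).
    with sqlite3.connect(source_db) as conn:
        df = pd.read_sql_query("SELECT * FROM support_tickets", conn)

    # Transform: clean and minimally format the data before putting it in the lake.
    df.columns = [c.strip().lower() for c in df.columns]
    df = df.dropna(subset=["ticket_id"])

    # Load: write the batch to durable storage (Amazon S3) as a single object.
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=f"{prefix}/support_tickets.parquet",
        Body=buffer.getvalue(),
    )

The same function could run on demand, on a cron or Amazon EventBridge schedule, or in response to an event, matching the three trigger styles described above.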
Batch Ingestion Processing:
Batch ingestion is the process of transporting data from one or more sources to a target site
for further processing and analysis. This data can originate from a range of sources, including
data lakes, IoT devices, on-premises databases, and SaaS applications, and end up in different
target environments, such as cloud data warehouses or data marts.
Purpose-Built Ingestion Tools:
Fig 6.2: Purpose-Built Ingestion Tools
Use Amazon AppFlow to ingest data from a software as a service (SaaS) application. You
can do the following with Amazon AppFlow:
• Create a connector that reads from a SaaS source and includes filters.
• Map fields in each source object to fields in the destination and perform transformations.
• Perform validation on records to be transferred.
• Securely transfer data to Amazon S3 or Amazon Redshift.
You can trigger an ingestion on demand, on an event, or on a schedule.
An example use case for Amazon AppFlow is ingesting customer support ticket data from
the Zendesk SaaS product.
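As a hedged sketch of how such a flow might be driven programmatically, the snippet below uses the boto3 AppFlow client to start an on-demand run of an existing flow and check its recent executions. The flow name "zendesk-tickets-to-s3" is hypothetical and assumes the flow was already created (for example, in the console) with a Zendesk source and an S3 destination:

import boto3

appflow = boto3.client("appflow")

# Start an on-demand execution of an existing, already-configured flow.
response = appflow.start_flow(flowName="zendesk-tickets-to-s3")
print("Started execution:", response.get("executionId"))

# Inspect recent execution records to confirm that records were transferred.
records = appflow.describe_flow_execution_records(
    flowName="zendesk-tickets-to-s3",
    maxResults=5,
)
for execution in records["flowExecutions"]:
    print(execution["executionStatus"], execution.get("executionResult", {}))

The same flow could instead be configured to run on a schedule or in response to an event, matching the trigger options listed above.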