DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK

bunkertor 181 views 55 slides May 08, 2024

Slide 1 of 55

About This Presentation

Building Real-Time Pipelines With FLaNK
Timothy Spann, Principal Developer Advocate, Streaming - Cloudera Future of Data meetup, startup grind, AI Camp
The combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines is extremely powerful, as demons...

Size: 26.16 MB

Language: en

Added: May 08, 2024

Slides: 55 pages

Slide Content

Building Real-Time Pipelines With FLaNK
Tim Spann
Principal Developer Advocate

May 8, 2024

DEMO --
OPEN SOURCE

B102: Enabling Real-Time Analytics
Wednesday, May 08 2024
12:00 p.m. - 12:45 p.m.

Real-time analytics contributes to building scalable and fault-tolerant data processing
pipelines.
Building Real-Time Pipelines With FLaNK
Timothy Spann, Principal Developer Advocate - Cloudera
The combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data
processing pipelines is extremely powerful, as demonstrated by this case study using the
FLaNK-MTA project. The project leverages these technologies to process and analyze
real-time data from the New York City Metropolitan Transportation Authority (MTA).
FLaNK-MTA demonstrates how to efficiently collect, transform, and analyze high-volume data
streams, enabling timely insights and decision-making.

4
Tim Spann

Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate.

ex-Pivotal, ex-Hortonworks, ex-StreamNative,
ex-PwC, ex-HPE, ex-E&Y.

https://medium.com/@tspann
https://github.com/tspannhw

5
@PaasDev
https://www.meetup.com/futureofdata-princeton/
https://www.meetup.com/futureofdata-newyork/

From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
Future of Data - NYC + NJ + Philly + Virtual

6
This week in Apache NiFi, Apache Flink,
Apache Kafka, Milvus, LLM, ML, AI, Apache
Spark, Apache Iceberg, Python, Java, LLM,
GenAI, Vector DB and Open Source friends.
https://bit.ly/32dAJft

https://www.meetup.com/futureofdata-
princeton/
FLaNK-AIM Weekly by Tim Spann

7
https://ﬂankworkspace.slack.com/

https://join.slack.com/t/ﬂankworkspac
e/shared_invite/zt-2fycjv241-~NRHZDt
dfwDjlfvXK_Bz0A
Join Our Slack and Interact with LLM

AGENDA
Introduction to FLaNK AI

The World of the Real

Overview

Apache NiFi, Kafka, Flink,
Iceberg

Demos

Q&A

FLaNK AIM

NiFi Flink

Iceberg Kafka

12
Already using Kafka? Already using NiFi? Need for Fast Flink?
Simple setup for many tables
Want metadata augmented data
Don’t need low latency?
Visual monitoring
Easy manual scaling
Easy to combine with NiFi
Debezium
Simple JDBC queries?
Transform individual records?
Want easy development with UI?
Lots of small ﬁles, events, records, rows?
Continuous stream of rows
Support many different sources
Debezium coming
Strong control of table and joins
Want high Throughput?
Want Low Latency?
Want Advanced Windowing and State?
Automatic records immediately
Pure SQL
Debezium
Kafka Connect, NiFi, Flink? Which engine to choose? Or All 3?

FLaNK Pipelines

External Context Ingest
Ingesting, routing, clean, enrich, transforming,
parsing, chunking and vectorizing structured,
unstructured, semistructured, binary data and
documents

Prompt engineering
Crafting and structuring queries to optimize
LLM responses

Context Retrieval
Enhancing LLM with external context such as
Retrieval Augmented Generation (RAG)

Roundtrip Interface
Act as a Discord, REST, Kafka, SQL, Slack bot to
roundtrip discussions

WORLD OF THE REAL

FLaNK-MTA / Urban Transportation

https://github.com/tspannhw/ﬂank-transit

https://github.com/tspannhw/flank-transit

https://medium.com/cloudera-inc/subways-and-transit-updates-in-real-time-30c104c359ef
NYC Subway

https://medium.com/cloudera-inc/streaming-street-cams-to-yolo-v8-with-python-and-nifi-to-minio-s3-3277e73723ce
Street Cameras

https://medium.com/@tspann/septa-transit-real-time-81082878b485
Philadelphia SEPTA

https://medium.com/@tspann/real-time-irish-transit-analytics-ea76164c9595
Irish Transit

https://medium.com/cloudera-inc/boston-wheres-my-bus-llm-streaming-to-the-rescue-586df
d019237
MBTA Bus Live

APACHE NIFI

•Record Readers - Avro, CSV, Grok, IPFIX, JSAN1, JSON,
Parquet, Scripted, Syslog5424, Syslog, WindowsEvent, XML
•Record Writers - Avro, CSV, FreeFromText, Json, Parquet,
Scripted, XML
•Record Reader and Writer support referencing a schema
registry for retrieving schemas when necessary.
•Enable processors that accept any data format without
having to worry about the parsing and serialization logic.
•Allows us to keep FlowFiles larger, each consisting of
multiple records, which results in far better performance.
UNSTRUCTURED DATA WITH NIFI

•Archives - tar, gzipped, zipped, …
•Images - PNG, JPG, GIF, BMP, …
•Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, …
•Videos - MP4, Clips, Mov, Youtube URL…
•Sound - MP3, …
•Social / Chat - Slack, Discord, Twitter, REST, Email, …
•Identify Mime Types, Chunk Documents, Store to Vector Database
•Parse Documents - HTML, Markdown, PDF, Word, Excel, Powerpoint

UNSTRUCTURED DATA WITH NIFI

© 2019 Cloudera, Inc. All rights reserved.30
CLOUD ML/DL/AI/Vector Database Services

•Cloudera ML
•Amazon Polly, Translate, Textract, Transcribe, Bedrock, …
•Hugging Face
•IBM Watson X.AI
•Vector Stores Anywhere: Weaviate, Pinecone, Milvus,
Chroma DB, SOLR, …

https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
NiFi 2.0.0 Features
●Python Integration
●Parameters
●JDK 21+
●JSON Flow Serialization
●Rules Engine for Development
Assistance
●Run Process Group as Stateless
●ﬂow.json.gz

https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals

Generate Synthetic Records w/
Faker
●Python 3.10+
●faker
●Choose as many as you want
●Attribute output

Get GTFS Data
●Python 3.10+
●GTFS from Transit URL
●Alerts, Trip Updates or Vehicle Positions
●Returns JSON
●google.transit and google.protobuf

Get Compound GTFS Data
●Python 3.10+
●GTFS to JSON

https://github.com/tspannhw/FLaNK-python-processors/blob/main/GetGTFSCompoundFeed.py

Extract Company Names
●Python 3.10+
●Hugging Face, NLP, SpaCY, PyTorch

https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor

Extract Entities
●Python 3.10+
●NLP, SpaCY
●Extract locations
●Extract organizations
●Extract money
●Extract time
●Extract events
●Extract countries
●Extract objects, food, people, quantities

https://github.com/tspannhw/FLaNK-python-processors/blob/main/ExtractEntities.py

Parse Addresses
●Python 3.10+
●PYAP Library
●Simple Library if your text includes an
address
●Address Parsing
●Address Detecting
●MIT Licensed
●Looking at other libraries, GenAI, DL, ML

https://github.com/tspannhw/FLaNK-python-processors

Address To Lat/Long
●Python 3.10+
●geopy Library
●Nominatim
●OpenStreetMaps (OSM)
●openstreetmap.org/copyright
●Returns as attributes and JSON ﬁle
●Works with partial addresses
●Categorizes location
●Bounding Box

https://github.com/tspannhw/FLaNKAI-Boston

APACHE KAFKA

Let’s do a metamorphosis on your data. Don’t fear changing data.
You don’t need to be a brilliant writer to stream
data.
Franz Kafka was a
German-speaking Bohemian
novelist and short-story writer,
widely regarded as one of the
major figures of 20th-century
literature. His work fuses elements
of realism and the fantastic.
--Wikipedia
YES, FRANZ, IT’S KAFKA

●Open Source
●Log
●Distributed Event Store
●Highly Scalable, Exactly Once
●High-throughput, Low-latency
●Binary TCP-based protocol that is optimized for eﬃciency
●Source/Sinks: Debezium CDC, JDBC, Kafka, HTTP, JMS,
InﬂuxDB, HDFS, Kudu, S3, Syslog, MQTT, SFTP, MQTT

APACHE FLINK
I Can Haz Data?

●Open Source
●Framework (Java or Python)
●Distributed Engine
●Stream Processing
●Highly Scalable, Exactly Once
●High-throughput, Low-latency
●Source/Sinks: HDFS, Kudu, Iceberg,
Kafka, HBase, Hive, JDBC, OpenSearch

FLINK SQL

APACHE ICEBERG

●Open Source Performant Format for Large Analytic
Tables
●Support for multiple engines like Spark, Hive, Impala,
Trino, Flink, Presto and more.
●ACID Transactions
●Time Travel
●Rollback
●Partitioning
●Data Compaction
●Schema Evolution

FLINK & ICEBERG INTEGRATION
Robust Next Generation Architecture for Data Driven Business
Uniﬁed Processing Engine Massive Open table format
•Maximally open
•Maximally ﬂexible
•Ultra high performance for MASSIVE data
•Can be used as Source and Sink
•Supports batch and streaming modes
•Supports time travel

NIFI & ICEBERG INTEGRATION
•PutIceberg processor

DEMO
I Can Haz
Data?
https://github.com/tspannhw/FLaNK-EveryTransitSystem

CSP Community Edition
●Docker compose ﬁle of CSP to run from command line w/o any
dependencies, including Flink, SQL Stream Builder, Kafka, Kafka
Connect, Streams Messaging Manager and Schema Registry.
○$>docker compose up
●Licensed under the Cloudera Community License
●Unsupported Commercially (Community Help - Ask Tim)
●Community Group Hub for CSP
●Find it on docs.cloudera.com (see QR Code)
●Kafka, Kafka Connect, SMM, SR, Flink, Flink SQL, MV, Postgresql, SSB
●Develop apps locally

Open Source Edition
•Apache NiFi in Docker
•Try new features
quickly
•Develop applications
locally
●Docker NiFi
○docker run --name nifi -p 8443:8443 -d -e
SINGLE_USER_CREDENTIALS_USERNAME=admin -e
SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghv
vgEvjnaLjFEB apache/nifi:latest

●Licensed under the ASF License
●Unsupported
●NiFi 1.25 and NiFi 2.0.0-M2

https://hub.docker.com/r/apache/niﬁ

https://github.com/tspannhw/FLaNK-Transit
SELECT n.speed, n.travel_time, n.borough, n.link_name, n.link_points,
n.latitude, n.longitude, DISTANCE_BETWEEN(CAST(t.latitude as STRING),
CAST(t.latitude as STRING),
m.VehicleLocationLatitude, m.VehicleLocationLongitude) as miles,
t.title, t.`description`, t.pubDate, t.latitude, t.longitude,
m.VehicleLocationLatitude, m.VehicleLocationLongitude,
m.StopPointRef, m.VehicleRef,
m.ProgressRate, m.ExpectedDepartureTime, m.StopPoint,
m.VisitNumber, m.DataFrameRef, m.StopPointName,
m.Bearing, m.OriginAimedDepartureTime, m.OperatorRef,
m.DestinationName, m.ExpectedArrivalTime, m.BlockRef,
m.LineRef, m.DirectionRef, m.ArrivalProximityText,
m.DistanceFromStop, m.EstimatedPassengerCapacity,
m.AimedArrivalTime, m.PublishedLineName,
m.ProgressStatus, m.DestinationRef, m.EstimatedPassengerCount,
m.OriginRef, m.NumberOfStopsAway, m.ts
FROM jsonmta /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ m
FULL OUTER JOIN jsontranscom /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ t
ON (t.latitude >= CAST(m.VehicleLocationLatitude as float) - 0.3)
AND (t.longitude >= CAST(m.VehicleLocationLongitude as float) - 0.3)
AND (t.latitude <= CAST(m.VehicleLocationLatitude as float) + 0.3)
AND (t.longitude <= CAST(m.VehicleLocationLongitude as float) + 0.3)
FULL OUTER JOIN nytrafficspeed /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ n
ON (n.latitude >= CAST(m.VehicleLocationLatitude as float) - 0.3)
AND (n.longitude >= CAST(m.VehicleLocationLongitude as float) - 0.3)
AND (n.latitude <= CAST(m.VehicleLocationLatitude as float) + 0.3)
AND (n.longitude <= CAST(m.VehicleLocationLongitude as float) + 0.3)
WHERE m.VehicleRef is not null
AND t.title is not null
I Can Haz
Data?

MORE ARTICLES

●https://medium.com/cloudera-inc/watching-airport-traﬃc-in-real-time-32c522a6e386
●https://medium.com/cloudera-inc/building-a-real-time-data-pipeline-a-comprehensive-tutorial-on-min
iﬁ-niﬁ-kafka-and-ﬂink-ee03ee6722cb
●https://medium.com/cloudera-inc/ﬁnding-the-best-way-around-7491c76ca4cb
●https://medium.com/cloudera-inc/nyc-traﬃc-are-you-kidding-me-6d3fa853903b
●https://medium.com/@tspann/building-a-travel-advisory-app-with-apache-niﬁ-in-k8-969b44c84958
●https://medium.com/@tspann/using-ollama-with-mistral-and-apache-niﬁ-720c17f5ff12
●https://medium.com/cloudera-inc/google-gemma-for-real-time-lightweight-open-llm-inference-88efe
98e580f
●https://medium.com/@tspann/image-processing-with-custom-python-and-niﬁ-2-0-06eadc62c03c
●https://medium.com/@tspann/ai-augmented-devrel-part-1-4058af905a89
●https://medium.com/cloudera-inc/mixtral-generative-sparse-mixture-of-experts-in-dataﬂows-59744f
7d28a9
●https://medium.com/@tspann/building-an-llm-bot-for-meetups-and-conference-interactivity-c211ea
6e3b61
●https://medium.com/@tspann/yet-another-python-processor-45aaae6fe406
●https://medium.com/@tspann/real-time-irish-transit-analytics-ea76164c9595
●https://medium.com/@tspann/septa-transit-real-time-81082878b485

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Slide 25

Slide 26

Slide 27

Slide 28

Slide 29

Slide 30

Slide 31

Slide 32

Slide 33

Slide 34

Slide 35

Slide 36

Slide 37

Slide 38

Slide 39

Slide 40

Slide 41

Slide 42

Slide 43

Slide 44

Slide 45

Slide 46

Slide 47

Slide 48

Slide 49

Slide 50

Slide 51

Slide 52

Slide 53

Slide 54

Slide 55

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

8-top-ai-courses-for-customer-support-representatives-in-2025.pptx

7-essential-ai-courses-for-call-center-supervisors-in-2025.pptx

25-essential-ai-courses-for-user-support-specialists-in-2025.pptx

8-essential-ai-courses-for-insurance-customer-service-representatives-in-2025.pptx

Know for Certain

PPT OPD LES 3ertt4t4tqqqe23e3e3rq2qq232.pptx