DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK

bunkertor 181 views 55 slides May 08, 2024
Slide 1
Slide 1 of 55
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52
Slide 53
53
Slide 54
54
Slide 55
55

About This Presentation

Building Real-Time Pipelines With FLaNK
Timothy Spann, Principal Developer Advocate, Streaming - Cloudera Future of Data meetup, startup grind, AI Camp
The combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines is extremely powerful, as demons...


Slide Content

Building Real-Time Pipelines With FLaNK
Tim Spann
Principal Developer Advocate

May 8, 2024

DEMO --
OPEN SOURCE

B102: Enabling Real-Time Analytics
Wednesday, May 08 2024
12:00 p.m. - 12:45 p.m.

Real-time analytics contributes to building scalable and fault-tolerant data processing
pipelines.
Building Real-Time Pipelines With FLaNK
Timothy Spann, Principal Developer Advocate - Cloudera
The combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data
processing pipelines is extremely powerful, as demonstrated by this case study using the
FLaNK-MTA project. The project leverages these technologies to process and analyze
real-time data from the New York City Metropolitan Transportation Authority (MTA).
FLaNK-MTA demonstrates how to efficiently collect, transform, and analyze high-volume data
streams, enabling timely insights and decision-making.

4
Tim Spann

Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate.

ex-Pivotal, ex-Hortonworks, ex-StreamNative,
ex-PwC, ex-HPE, ex-E&Y.




https://medium.com/@tspann
https://github.com/tspannhw

5
@PaasDev
https://www.meetup.com/futureofdata-princeton/
https://www.meetup.com/futureofdata-newyork/

From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
Future of Data - NYC + NJ + Philly + Virtual

6
This week in Apache NiFi, Apache Flink,
Apache Kafka, Milvus, LLM, ML, AI, Apache
Spark, Apache Iceberg, Python, Java, LLM,
GenAI, Vector DB and Open Source friends.
https://bit.ly/32dAJft

https://www.meetup.com/futureofdata-
princeton/
FLaNK-AIM Weekly by Tim Spann

7
https://flankworkspace.slack.com/

https://join.slack.com/t/flankworkspac
e/shared_invite/zt-2fycjv241-~NRHZDt
dfwDjlfvXK_Bz0A
Join Our Slack and Interact with LLM

AGENDA
Introduction to FLaNK AI

The World of the Real

Overview

Apache NiFi, Kafka, Flink,
Iceberg

Demos

Q&A

FLaNK AIM

NiFi Flink



Iceberg Kafka

12
Already using Kafka? Already using NiFi? Need for Fast Flink?
Simple setup for many tables
Want metadata augmented data
Don’t need low latency?
Visual monitoring
Easy manual scaling
Easy to combine with NiFi
Debezium
Simple JDBC queries?
Transform individual records?
Want easy development with UI?
Lots of small files, events, records, rows?
Continuous stream of rows
Support many different sources
Debezium coming
Strong control of table and joins
Want high Throughput?
Want Low Latency?
Want Advanced Windowing and State?
Automatic records immediately
Pure SQL
Debezium
Kafka Connect, NiFi, Flink? Which engine to choose? Or All 3?

FLaNK Pipelines

External Context Ingest
Ingesting, routing, clean, enrich, transforming,
parsing, chunking and vectorizing structured,
unstructured, semistructured, binary data and
documents

Prompt engineering
Crafting and structuring queries to optimize
LLM responses

Context Retrieval
Enhancing LLM with external context such as
Retrieval Augmented Generation (RAG)

Roundtrip Interface
Act as a Discord, REST, Kafka, SQL, Slack bot to
roundtrip discussions

WORLD OF THE REAL

FLaNK-MTA / Urban Transportation

https://github.com/tspannhw/flank-transit

https://github.com/tspannhw/flank-transit

https://medium.com/cloudera-inc/subways-and-transit-updates-in-real-time-30c104c359ef
NYC Subway

https://medium.com/cloudera-inc/streaming-street-cams-to-yolo-v8-with-python-and-nifi-to-minio-s3-3277e73723ce
Street Cameras

https://medium.com/@tspann/septa-transit-real-time-81082878b485
Philadelphia SEPTA

https://medium.com/@tspann/real-time-irish-transit-analytics-ea76164c9595
Irish Transit

https://medium.com/cloudera-inc/boston-wheres-my-bus-llm-streaming-to-the-rescue-586df
d019237
MBTA Bus Live

FLaNK for Halifax Canada Transit —
NiFi, Kafka, Flink, SQL, GTFS-RT | by
Tim Spann | Cloudera | Dec, 2023 |
Medium

Never Get Lost in the Stream.
NiFi-Kafka-Flink for getting to work… |
by Tim Spann | Cloudera | Dec, 2023 |
Medium

Iteration 1: Building a System to
Consume All the Real-Time Transit
Data in the World At Once | by Tim
Spann | Cloudera | Medium

Watching Airport Traffic in Real-Time
| by Tim Spann | Cloudera | Medium


Building Real-Time Pipelines With FLaNK

APACHE NIFI

© 2023 Cloudera, Inc. All rights reserved.27
PROVENANCE

•Record Readers - Avro, CSV, Grok, IPFIX, JSAN1, JSON,
Parquet, Scripted, Syslog5424, Syslog, WindowsEvent, XML
•Record Writers - Avro, CSV, FreeFromText, Json, Parquet,
Scripted, XML
•Record Reader and Writer support referencing a schema
registry for retrieving schemas when necessary.
•Enable processors that accept any data format without
having to worry about the parsing and serialization logic.
•Allows us to keep FlowFiles larger, each consisting of
multiple records, which results in far better performance.
UNSTRUCTURED DATA WITH NIFI

•Archives - tar, gzipped, zipped, …
•Images - PNG, JPG, GIF, BMP, …
•Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, …
•Videos - MP4, Clips, Mov, Youtube URL…
•Sound - MP3, …
•Social / Chat - Slack, Discord, Twitter, REST, Email, …
•Identify Mime Types, Chunk Documents, Store to Vector Database
•Parse Documents - HTML, Markdown, PDF, Word, Excel, Powerpoint


UNSTRUCTURED DATA WITH NIFI

© 2019 Cloudera, Inc. All rights reserved.30
CLOUD ML/DL/AI/Vector Database Services

•Cloudera ML
•Amazon Polly, Translate, Textract, Transcribe, Bedrock, …
•Hugging Face
•IBM Watson X.AI
•Vector Stores Anywhere: Weaviate, Pinecone, Milvus,
Chroma DB, SOLR, …

https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
NiFi 2.0.0 Features
●Python Integration
●Parameters
●JDK 21+
●JSON Flow Serialization
●Rules Engine for Development
Assistance
●Run Process Group as Stateless
●flow.json.gz

https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals

Generate Synthetic Records w/
Faker
●Python 3.10+
●faker
●Choose as many as you want
●Attribute output

Get GTFS Data
●Python 3.10+
●GTFS from Transit URL
●Alerts, Trip Updates or Vehicle Positions
●Returns JSON
●google.transit and google.protobuf

Get Compound GTFS Data
●Python 3.10+
●GTFS to JSON


https://github.com/tspannhw/FLaNK-python-processors/blob/main/GetGTFSCompoundFeed.py

Extract Company Names
●Python 3.10+
●Hugging Face, NLP, SpaCY, PyTorch


https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor

Extract Entities
●Python 3.10+
●NLP, SpaCY
●Extract locations
●Extract organizations
●Extract money
●Extract time
●Extract events
●Extract countries
●Extract objects, food, people, quantities

https://github.com/tspannhw/FLaNK-python-processors/blob/main/ExtractEntities.py

Parse Addresses
●Python 3.10+
●PYAP Library
●Simple Library if your text includes an
address
●Address Parsing
●Address Detecting
●MIT Licensed
●Looking at other libraries, GenAI, DL, ML




https://github.com/tspannhw/FLaNK-python-processors

Address To Lat/Long
●Python 3.10+
●geopy Library
●Nominatim
●OpenStreetMaps (OSM)
●openstreetmap.org/copyright
●Returns as attributes and JSON file
●Works with partial addresses
●Categorizes location
●Bounding Box




https://github.com/tspannhw/FLaNKAI-Boston

APACHE KAFKA

Let’s do a metamorphosis on your data. Don’t fear changing data.
You don’t need to be a brilliant writer to stream
data.
Franz Kafka was a
German-speaking Bohemian
novelist and short-story writer,
widely regarded as one of the
major figures of 20th-century
literature. His work fuses elements
of realism and the fantastic.
--Wikipedia
YES, FRANZ, IT’S KAFKA

●Open Source
●Log
●Distributed Event Store
●Highly Scalable, Exactly Once
●High-throughput, Low-latency
●Binary TCP-based protocol that is optimized for efficiency
●Source/Sinks: Debezium CDC, JDBC, Kafka, HTTP, JMS,
InfluxDB, HDFS, Kudu, S3, Syslog, MQTT, SFTP, MQTT

APACHE FLINK
I Can Haz Data?

●Open Source
●Framework (Java or Python)
●Distributed Engine
●Stream Processing
●Highly Scalable, Exactly Once
●High-throughput, Low-latency
●Source/Sinks: HDFS, Kudu, Iceberg,
Kafka, HBase, Hive, JDBC, OpenSearch

FLINK SQL

45© Cloudera, Inc. All rights reserved.
Apache Flink SQL
Democratize access to real-time data with just SQL

APACHE ICEBERG

●Open Source Performant Format for Large Analytic
Tables
●Support for multiple engines like Spark, Hive, Impala,
Trino, Flink, Presto and more.
●ACID Transactions
●Time Travel
●Rollback
●Partitioning
●Data Compaction
●Schema Evolution

FLINK & ICEBERG INTEGRATION
Robust Next Generation Architecture for Data Driven Business
Unified Processing Engine Massive Open table format
•Maximally open
•Maximally flexible
•Ultra high performance for MASSIVE data
•Can be used as Source and Sink
•Supports batch and streaming modes
•Supports time travel

NIFI & ICEBERG INTEGRATION
•PutIceberg processor

DEMO
I Can Haz
Data?
https://github.com/tspannhw/FLaNK-EveryTransitSystem

CSP Community Edition
●Docker compose file of CSP to run from command line w/o any
dependencies, including Flink, SQL Stream Builder, Kafka, Kafka
Connect, Streams Messaging Manager and Schema Registry.
○$>docker compose up
●Licensed under the Cloudera Community License
●Unsupported Commercially (Community Help - Ask Tim)
●Community Group Hub for CSP
●Find it on docs.cloudera.com (see QR Code)
●Kafka, Kafka Connect, SMM, SR, Flink, Flink SQL, MV, Postgresql, SSB
●Develop apps locally

Open Source Edition
•Apache NiFi in Docker
•Try new features
quickly
•Develop applications
locally
●Docker NiFi
○docker run --name nifi -p 8443:8443 -d -e
SINGLE_USER_CREDENTIALS_USERNAME=admin -e
SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghv
vgEvjnaLjFEB apache/nifi:latest

●Licensed under the ASF License
●Unsupported
●NiFi 1.25 and NiFi 2.0.0-M2

https://hub.docker.com/r/apache/nifi

https://github.com/tspannhw/FLaNK-Transit
SELECT n.speed, n.travel_time, n.borough, n.link_name, n.link_points,
n.latitude, n.longitude, DISTANCE_BETWEEN(CAST(t.latitude as STRING),
CAST(t.latitude as STRING),
m.VehicleLocationLatitude, m.VehicleLocationLongitude) as miles,
t.title, t.`description`, t.pubDate, t.latitude, t.longitude,
m.VehicleLocationLatitude, m.VehicleLocationLongitude,
m.StopPointRef, m.VehicleRef,
m.ProgressRate, m.ExpectedDepartureTime, m.StopPoint,
m.VisitNumber, m.DataFrameRef, m.StopPointName,
m.Bearing, m.OriginAimedDepartureTime, m.OperatorRef,
m.DestinationName, m.ExpectedArrivalTime, m.BlockRef,
m.LineRef, m.DirectionRef, m.ArrivalProximityText,
m.DistanceFromStop, m.EstimatedPassengerCapacity,
m.AimedArrivalTime, m.PublishedLineName,
m.ProgressStatus, m.DestinationRef, m.EstimatedPassengerCount,
m.OriginRef, m.NumberOfStopsAway, m.ts
FROM jsonmta /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ m
FULL OUTER JOIN jsontranscom /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ t
ON (t.latitude >= CAST(m.VehicleLocationLatitude as float) - 0.3)
AND (t.longitude >= CAST(m.VehicleLocationLongitude as float) - 0.3)
AND (t.latitude <= CAST(m.VehicleLocationLatitude as float) + 0.3)
AND (t.longitude <= CAST(m.VehicleLocationLongitude as float) + 0.3)
FULL OUTER JOIN nytrafficspeed /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ n
ON (n.latitude >= CAST(m.VehicleLocationLatitude as float) - 0.3)
AND (n.longitude >= CAST(m.VehicleLocationLongitude as float) - 0.3)
AND (n.latitude <= CAST(m.VehicleLocationLatitude as float) + 0.3)
AND (n.longitude <= CAST(m.VehicleLocationLongitude as float) + 0.3)
WHERE m.VehicleRef is not null
AND t.title is not null
I Can Haz
Data?

MORE ARTICLES

●https://medium.com/cloudera-inc/watching-airport-traffic-in-real-time-32c522a6e386
●https://medium.com/cloudera-inc/building-a-real-time-data-pipeline-a-comprehensive-tutorial-on-min
ifi-nifi-kafka-and-flink-ee03ee6722cb
●https://medium.com/cloudera-inc/finding-the-best-way-around-7491c76ca4cb
●https://medium.com/cloudera-inc/nyc-traffic-are-you-kidding-me-6d3fa853903b
●https://medium.com/@tspann/building-a-travel-advisory-app-with-apache-nifi-in-k8-969b44c84958
●https://medium.com/@tspann/using-ollama-with-mistral-and-apache-nifi-720c17f5ff12
●https://medium.com/cloudera-inc/google-gemma-for-real-time-lightweight-open-llm-inference-88efe
98e580f
●https://medium.com/@tspann/image-processing-with-custom-python-and-nifi-2-0-06eadc62c03c
●https://medium.com/@tspann/ai-augmented-devrel-part-1-4058af905a89
●https://medium.com/cloudera-inc/mixtral-generative-sparse-mixture-of-experts-in-dataflows-59744f
7d28a9
●https://medium.com/@tspann/building-an-llm-bot-for-meetups-and-conference-interactivity-c211ea
6e3b61
●https://medium.com/@tspann/yet-another-python-processor-45aaae6fe406
●https://medium.com/@tspann/real-time-irish-transit-analytics-ea76164c9595
●https://medium.com/@tspann/septa-transit-real-time-81082878b485

55
TH N Y U