DATA SUMMIT 24: Building Real-Time Pipelines With FLaNK
bunkertor
May 08, 2024
About This Presentation
Building Real-Time Pipelines With FLaNK
Timothy Spann, Principal Developer Advocate, Streaming - Cloudera; Future of Data meetup, Startup Grind, AI Camp
The combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines is extremely powerful, as demonstrated by this case study using the FLaNK-MTA project. The project leverages these technologies to process and analyze real-time data from the New York City Metropolitan Transportation Authority (MTA). FLaNK-MTA demonstrates how to efficiently collect, transform, and analyze high-volume data streams, enabling timely insights and decision-making.
Real-time analytics contributes to building scalable and fault-tolerant data processing pipelines.
Tim Spann
Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate
From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
Future of Data - NYC + NJ + Philly + Virtual
FLaNK-AIM Weekly by Tim Spann
This week in Apache NiFi, Apache Flink, Apache Kafka, Milvus, LLM, ML, AI, Apache Spark, Apache Iceberg, Python, Java, GenAI, Vector DB and Open Source friends.
https://bit.ly/32dAJft
https://www.meetup.com/futureofdata-princeton/
Join Our Slack and Interact with LLM
https://flankworkspace.slack.com/
https://join.slack.com/t/flankworkspace/shared_invite/zt-2fycjv241-~NRHZDtdfwDjlfvXK_Bz0A
AGENDA
● Introduction to FLaNK AI
● The World of the Real
● Overview
● Apache NiFi, Kafka, Flink, Iceberg
● Demos
● Q&A

FLaNK AIM: NiFi, Flink, Iceberg, Kafka
Kafka Connect, NiFi, Flink? Which engine to choose? Or all 3?

Kafka Connect
● Already using Kafka?
● Simple setup for many tables
● Don't need low latency?
● Easy to combine with NiFi
● Simple JDBC queries?
● Continuous stream of rows
● Automatic records immediately
● Debezium

NiFi
● Already using NiFi?
● Want metadata augmented data
● Visual monitoring
● Easy manual scaling
● Transform individual records?
● Want easy development with UI?
● Lots of small files, events, records, rows?
● Support many different sources
● Debezium coming

Flink
● Need for fast Flink?
● Strong control of table and joins
● Want high throughput?
● Want low latency?
● Want advanced windowing and state?
● Pure SQL
● Debezium
FLaNK Pipelines

External Context Ingest
Ingesting, routing, cleaning, enriching, transforming, parsing, chunking and vectorizing structured, unstructured, semi-structured, and binary data and documents.

Prompt Engineering
Crafting and structuring queries to optimize LLM responses.

Context Retrieval
Enhancing the LLM with external context, such as Retrieval Augmented Generation (RAG); see the sketch after this list.

Roundtrip Interface
Act as a Discord, REST, Kafka, SQL, or Slack bot to roundtrip discussions.
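As a concrete illustration of the context-retrieval step, here is a minimal, hypothetical sketch of RAG-style prompt assembly. The function name and prompt layout are illustrative, not from the FLaNK-MTA code; in the FLaNK stack the retrieved chunks would typically come from a Milvus vector search.

# Hypothetical sketch of RAG-style prompt assembly; in a FLaNK pipeline the
# chunks would come from a vector store (e.g. Milvus) queried with an
# embedding of the question. Names here are illustrative only.
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n---\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

if __name__ == "__main__":
    chunks = ["The M15 bus runs along 1st/2nd Avenue in Manhattan."]
    print(build_prompt("Which avenue does the M15 serve?", chunks))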
• Record Readers - Avro, CSV, Grok, IPFIX, JASN1, JSON, Parquet, Scripted, Syslog5424, Syslog, WindowsEvent, XML
• Record Writers - Avro, CSV, FreeFormText, JSON, Parquet, Scripted, XML
• Record Readers and Writers support referencing a schema registry for retrieving schemas when necessary.
• Enables processors that accept any data format without having to worry about the parsing and serialization logic.
• Allows us to keep FlowFiles larger, each consisting of multiple records, which results in far better performance (see the sketch below).
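To make the reader/writer split concrete, here is a rough Python analogy (not NiFi code): one component parses records from an input format, another serializes them to an output format, and the logic in between never touches the wire format.

# Rough analogy to NiFi's Record Reader/Writer split (illustrative, not the
# NiFi API): parse CSV records in, apply format-agnostic logic, write JSON out.
import csv
import io
import json

def read_records(csv_text: str) -> list[dict]:           # "Record Reader" role
    return list(csv.DictReader(io.StringIO(csv_text)))

def write_records(records: list[dict]) -> str:           # "Record Writer" role
    return "\n".join(json.dumps(r) for r in records)

records = read_records("line,speed\nM15,12.5\nB38,9.1")
fast = [r for r in records if float(r["speed"]) > 10]    # format-agnostic logic
print(write_records(fast))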
UNSTRUCTURED DATA WITH NIFI
https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
NiFi 2.0.0 Features
● Python Integration (see the processor sketch after this list)
● Parameters
● JDK 21+
● JSON Flow Serialization
● Rules Engine for Development Assistance
● Run Process Group as Stateless
● flow.json.gz
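As an example of the Python integration, a minimal custom processor might look roughly like this. It is a sketch based on the NiFi 2.x nifiapi FlowFileTransform pattern; treat class and method names as approximate and check the current NiFi docs.

# Minimal sketch of a NiFi 2.x Python processor (verify against the nifiapi
# docs for your NiFi version; this follows the FlowFileTransform pattern).
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult

class UppercaseText(FlowFileTransform):
    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '2.0.0'
        description = 'Uppercases the text content of each FlowFile.'
        tags = ['example', 'text']

    def __init__(self, **kwargs):
        super().__init__()

    def transform(self, context, flowfile):
        text = flowfile.getContentsAsBytes().decode('utf-8')
        return FlowFileTransformResult(relationship='success',
                                       contents=text.upper())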
Parse Addresses
● Python 3.10+
● pyap library
● Simple library for when your text includes an address
● Address Parsing
● Address Detecting
● MIT Licensed
● Looking at other libraries, GenAI, DL, ML
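A minimal sketch of what pyap does (the sample text is made up):

# Minimal pyap sketch: detect and parse US addresses embedded in free text.
import pyap

text = "Visit us at 123 Main Street, Suite 5, Springfield, IL 62704 tomorrow."
for address in pyap.parse(text, country='US'):
    print(address.full_address)   # the matched, normalized address string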
Address To Lat/Long
● Python 3.10+
● geopy library
● Nominatim
● OpenStreetMap (OSM)
● openstreetmap.org/copyright
● Returns results as attributes and a JSON file
● Works with partial addresses
● Categorizes location
● Bounding Box
https://github.com/tspannhw/FLaNKAI-Boston
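A minimal geopy/Nominatim sketch of the geocoding step (the user_agent value and address are placeholders):

# Minimal geopy/Nominatim sketch: forward-geocode an address to lat/long.
# OSM's usage policy requires a descriptive user_agent; value here is made up.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="flank-address-demo")
location = geolocator.geocode("11 Wall Street, New York, NY")
if location:
    print(location.latitude, location.longitude)
    print(location.raw.get("boundingbox"))   # bounding box from OSM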
APACHE KAFKA
Let's do a metamorphosis on your data. Don't fear changing data. You don't need to be a brilliant writer to stream data.

"Franz Kafka was a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature. His work fuses elements of realism and the fantastic." --Wikipedia

YES, FRANZ, IT'S KAFKA
● Open Source
● Log
● Distributed Event Store
● Highly Scalable, Exactly Once
● High-throughput, Low-latency
● Binary TCP-based protocol that is optimized for efficiency
● Sources/Sinks: Debezium CDC, JDBC, Kafka, HTTP, JMS, InfluxDB, HDFS, Kudu, S3, Syslog, MQTT, SFTP
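A minimal illustration of producing events to Kafka from Python, using the kafka-python client; the broker address and topic name are placeholders.

# Minimal kafka-python producer sketch; broker and topic are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("transit-events", {"VehicleRef": "MTA_1234", "speed": 12.5})
producer.flush()   # block until the record is acknowledged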
APACHE ICEBERG
● Open Source Performant Format for Large Analytic Tables
● Support for multiple engines like Spark, Hive, Impala, Trino, Flink, Presto and more
● ACID Transactions
● Time Travel
● Rollback
● Partitioning
● Data Compaction
● Schema Evolution
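A rough sketch of exercising these table-format features (schema inspection, snapshots, time travel) from Python via PyIceberg; the catalog and table names are placeholders and assume a catalog is already configured.

# PyIceberg sketch; assumes a catalog configured in ~/.pyiceberg.yaml and an
# existing table. Names are placeholders, not from the FLaNK-MTA project.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("transit.bus_positions")

print(table.schema())                       # schema evolves without rewrites
for snapshot in table.snapshots():          # every commit is a snapshot
    print(snapshot.snapshot_id, snapshot.timestamp_ms)

# Time travel: read the table as of its earliest snapshot
oldest = table.snapshots()[0].snapshot_id
df = table.scan(snapshot_id=oldest).to_pandas()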
FLINK & ICEBERG INTEGRATION
Robust next-generation architecture for data-driven business.
Flink: unified processing engine. Iceberg: massive open table format.
• Maximally open
• Maximally flexible
• Ultra high performance for MASSIVE data
• Can be used as Source and Sink
• Supports batch and streaming modes
• Supports time travel
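A sketch of wiring Flink to an Iceberg table from PyFlink SQL. The connector options shown are illustrative; the exact WITH clause depends on your catalog type and requires the iceberg-flink runtime jar on the classpath.

# PyFlink sketch of an Iceberg sink table; options are illustrative and vary
# by catalog type (hive, hadoop, rest). Requires the iceberg-flink runtime jar.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE bus_positions_iceberg (
        VehicleRef STRING,
        latitude   DOUBLE,
        longitude  DOUBLE,
        ts         TIMESTAMP(3)
    ) WITH (
        'connector'    = 'iceberg',
        'catalog-name' = 'hive_prod',
        'catalog-type' = 'hive',
        'warehouse'    = 's3a://warehouse/transit'
    )
""")
# A streaming INSERT INTO ... SELECT from a Kafka-backed table would go here.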
NIFI & ICEBERG INTEGRATION
• PutIceberg processor
DEMO
I Can Haz Data?
https://github.com/tspannhw/FLaNK-EveryTransitSystem
CSP Community Edition
● Docker compose file of CSP to run from the command line without any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry.
○ $ docker compose up
● Licensed under the Cloudera Community License
● Unsupported commercially (Community Help - Ask Tim)
● Community Group Hub for CSP
● Find it on docs.cloudera.com (see QR Code)
● Kafka, Kafka Connect, SMM, SR, Flink, Flink SQL, MV, PostgreSQL, SSB
● Develop apps locally
Open Source Edition
• Apache NiFi in Docker
• Try new features quickly
• Develop applications locally
● Docker NiFi
○ docker run --name nifi -p 8443:8443 -d -e SINGLE_USER_CREDENTIALS_USERNAME=admin -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB apache/nifi:latest
● Licensed under the ASF License
● Unsupported
● NiFi 1.25 and NiFi 2.0.0-M2
https://hub.docker.com/r/apache/nifi
https://github.com/tspannhw/FLaNK-Transit
SELECT n.speed, n.travel_time, n.borough, n.link_name, n.link_points,
       n.latitude, n.longitude,
       DISTANCE_BETWEEN(CAST(t.latitude AS STRING), CAST(t.longitude AS STRING),
                        m.VehicleLocationLatitude, m.VehicleLocationLongitude) AS miles,
       t.title, t.`description`, t.pubDate, t.latitude, t.longitude,
       m.VehicleLocationLatitude, m.VehicleLocationLongitude,
       m.StopPointRef, m.VehicleRef,
       m.ProgressRate, m.ExpectedDepartureTime, m.StopPoint,
       m.VisitNumber, m.DataFrameRef, m.StopPointName,
       m.Bearing, m.OriginAimedDepartureTime, m.OperatorRef,
       m.DestinationName, m.ExpectedArrivalTime, m.BlockRef,
       m.LineRef, m.DirectionRef, m.ArrivalProximityText,
       m.DistanceFromStop, m.EstimatedPassengerCapacity,
       m.AimedArrivalTime, m.PublishedLineName,
       m.ProgressStatus, m.DestinationRef, m.EstimatedPassengerCount,
       m.OriginRef, m.NumberOfStopsAway, m.ts
FROM jsonmta /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ m
FULL OUTER JOIN jsontranscom /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ t
  ON  (t.latitude  >= CAST(m.VehicleLocationLatitude  AS FLOAT) - 0.3)
  AND (t.longitude >= CAST(m.VehicleLocationLongitude AS FLOAT) - 0.3)
  AND (t.latitude  <= CAST(m.VehicleLocationLatitude  AS FLOAT) + 0.3)
  AND (t.longitude <= CAST(m.VehicleLocationLongitude AS FLOAT) + 0.3)
FULL OUTER JOIN nytrafficspeed /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ n
  ON  (n.latitude  >= CAST(m.VehicleLocationLatitude  AS FLOAT) - 0.3)
  AND (n.longitude >= CAST(m.VehicleLocationLongitude AS FLOAT) - 0.3)
  AND (n.latitude  <= CAST(m.VehicleLocationLatitude  AS FLOAT) + 0.3)
  AND (n.longitude <= CAST(m.VehicleLocationLongitude AS FLOAT) + 0.3)
WHERE m.VehicleRef IS NOT NULL
  AND t.title IS NOT NULL
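The join conditions implement a coarse proximity filter: a traffic event matches a vehicle when each coordinate differs by at most 0.3 degrees, i.e. the event falls inside a 0.6 x 0.6 degree box centered on the vehicle. The same predicate, sketched in Python for clarity:

# The bounding-box predicate used in the joins above: event and vehicle match
# when both coordinates are within 0.3 degrees of each other.
def in_bounding_box(veh_lat: float, veh_lon: float,
                    evt_lat: float, evt_lon: float,
                    delta: float = 0.3) -> bool:
    return abs(veh_lat - evt_lat) <= delta and abs(veh_lon - evt_lon) <= delta

print(in_bounding_box(40.7128, -74.0060, 40.75, -73.99))  # True: nearby
print(in_bounding_box(40.7128, -74.0060, 41.30, -73.99))  # False: too far north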