PyData Sofia May 2024 - Intro to Apache Arrow

xhochy · 28 slides · Jul 12, 2024

About This Presentation

Exploring the tech that powers the modern data (science) stack


Slide Content

Apache Arrow

Exploring the tech that powers the
modern data (science) stack
Uwe Korn – QuantCo – May 2024

About me
•Uwe Korn

https://mastodon.social/@xhochy / @xhochy

https://www.linkedin.com/in/uwekorn/
•CTO at Data Science startup QuantCo
•Previously worked as a Data Engineer
•A lot of OSS, notably Apache {Arrow,
Parquet} and conda-forge
•PyData Südwest Co-Organizer

Agenda
1.Why do we need this?
2.What is it?
3.What’s its impact?

Why do we need this?
•Different Ecosystems
•PyData / R space
•Java/Scala „Big Data“
•SQL Databases
•Different technologies
•Pandas / SQLite

Why solve it?
•We build pipelines to move data
•We want to use all tools we can leverage
•Avoid working on converters or waiting for the data to be converted

Introducing Apache Arrow
•Columnar representation of data in main memory
•Provide libraries to access the data structures
•Building blocks for various ecosystems to use them
•Implements adapters for existing structures

Columnar?

All the languages!
1.„Pure“ implementations in

C++, Java, Go, JavaScript, C#, Rust, Julia, Swift, C(nanoarrow)
2.Wrappers on-top of them in

Python, R, Ruby, C/GLib, Matlab

There is a social component
1.A standard is only as good as its usage
2.Different communities came together to form Arrow
3.Nowadays even more communities use it to connect

Arrow Basics
1.Array: a sequence of values of the same type in contiguous buffers
2.ChunkedArray: a sequence of arrays of the same type
3.Table: an ordered collection of named ChunkedArrays of the same length

Arrow Basics: valid masks
1.Track null_count per Array
2.Each array has a buffer of bits indicating whether a value is valid,

i.e. non-null

Arrow Basics: int array
Python array: [1, null, 2, 4, 8]
Length: 5, Null count: 1
Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00011101 | 0 (padding) |
Value Buffer:
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|-------------|-------------|-------------|-------------|-------------|-----------------------|
| 1 | unspecified | 2 | 4 | 8 | unspecified (padding) |

Arrow Basics: string array
Python array: ['joe', null, null, 'mark']
Length: 4, Null count: 2
Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00001001 | 0 (padding) |
Offsets buffer:
| Bytes 0-19 | Bytes 20-63 |
|----------------|-----------------------|
| 0, 3, 3, 3, 7 | unspecified (padding) |
Value buffer:
| Bytes 0-6 | Bytes 7-63 |
|----------------|-----------------------|
| joemark | unspecified (padding) |

Impact!
Arrow is now used in all „edges“ where data passes through:
•Databases, either in clients or in UDFs
•Data Engineering tooling
•Machine Learning libraries
•Dashboarding and BI applications

Examples of Arrow’s massive impact
If it ain’t a 10x-100x+ speedup, it ain’t worth it.

Parquet

Anatomy of a Parquet file

Parquet
1.This was the first exposure of Arrow to the Python world
2.End-users only see pandas.read_parquet
3.Actually, it is:
A.C++ Parquet->Arrow reader
B.C++ Pandas<->Arrow Adapter
C.Small Python shim to connect both and give a nice API

DuckDB Interop
1.Load data in Arrow
2.Process in DuckDB
3.Convert back to Arrow
4.Hand over to another tool
All the above happened without any serialization overhead

DuckDB Interop

Fast Database Access

Fast Database Access
Nowadays, you get even more speed with
•ADBC – Arrow Database Connectivity
•arrow-odbc

Should you use Arrow?
1.Actually, No.
2.Not directly, but make sure it is used in the backend.
3.If you need performance but the current exchange is slow, dive deeper.
4.If you want to write high-performance, framework-agnostic code.

The ecosystem
…and many more.

https://arrow.apache.org/powered_by/

Questions?
Follow me:

https://mastodon.social/@xhochy / @xhochy

https://www.linkedin.com/in/uwekorn/