[Doc] How long till data is ready to be consumed at speed? #15036

Open
hpvd opened this issue Feb 12, 2025 · 4 comments

hpvd commented Feb 12, 2025

Pinot can deliver query results with stunning speed / low latency, which is described in many places,
e.g. very nicely in StarTree's blog: https://startree.ai/resources/what-makes-apache-pinot-fast-chapter-ii

In contrast, it is hard to find any numbers or examples answering: how long does it take until the data is ready to be consumed at this speed?

How long does it take from data ingest through the layers of Pinot, including updating the different indexes, etc.?

It would be handy to have some information on this in the docs, in a blog post, or as a first step directly in this issue.

hpvd changed the title from "Doc: how long till data is ready to be consumed at speed?" to "[Doc] How long till data is ready to be consumed at speed?" on Feb 12, 2025

hpvd commented Feb 12, 2025

btw: when operating Pinot, this kind of "e2e freshness" metric would also be relevant in many use cases, and would allow you to

  • find problems
  • optimize
  • decide whether the freshness of the data inside Pinot is good enough, or whether the data has to be taken from other places, e.g. event sourcing
  • prove SLAs
  • ...

There is a "well aged" issue for this: #4007,
incl. a proposal: https://cwiki.apache.org/confluence/display/PINOT/Pinot+Freshness+Metric
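
Until such a metric exists inside Pinot, a rough approximation can be measured from the outside. The following is a minimal sketch only, assuming a table with an epoch-millis ingestion/event timestamp column and the default broker SQL endpoint on port 8099; the table and column names (myTable, ingestTimeMillis) are hypothetical placeholders:

```python
# Rough external freshness probe: how far behind wall-clock time is the
# newest row that Pinot can already serve?
# Table and column names are hypothetical placeholders.
import time
import requests

BROKER_URL = "http://localhost:8099/query/sql"  # default Pinot broker SQL endpoint
TABLE = "myTable"                               # hypothetical table name
TS_COLUMN = "ingestTimeMillis"                  # hypothetical epoch-millis column

def freshness_lag_ms() -> float:
    """Return (now - newest queryable timestamp) in milliseconds."""
    sql = f"SELECT MAX({TS_COLUMN}) FROM {TABLE}"
    resp = requests.post(BROKER_URL, json={"sql": sql}, timeout=5)
    resp.raise_for_status()
    newest_ts = resp.json()["resultTable"]["rows"][0][0]
    return time.time() * 1000 - newest_ts

if __name__ == "__main__":
    while True:
        print(f"freshness lag: {freshness_lag_ms():.0f} ms")
        time.sleep(1)
```

Note this only measures freshness relative to whatever the timestamp column records (e.g. event time vs. Kafka append time), so the choice of column defines which part of the pipeline the lag covers.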


hpvd commented Feb 13, 2025

@Jackie-Jiang do you have any first example numbers on this, or a source to share,
to get a very first impression of possible time spans and the influencing factors, e.g. depending on index type
(for everyone interested, before the full doc / metric implementation is done)?

Jackie-Jiang (Contributor) commented:

I'm not sure if I completely get the question, but I can answer from the perspective of how Pinot handles streaming data. Unlike a lot of other databases that ingest streaming data as mini batches (where the delay happens), Pinot directly writes the data into the index row by row, and the data immediately becomes queryable. The delay from streaming data arriving at Pinot to it becoming queryable is usually below a millisecond (Pinot can easily ingest thousands of messages per second). If you count the end-to-end time from data being produced to the streaming system (e.g. Kafka) to it becoming queryable in Pinot, the delay is usually a few seconds, and the majority of that delay comes from the streaming system processing and then delivering the messages to Pinot.


hpvd commented Feb 14, 2025

Pinot directly writes the data into the index row by row, and the data immediately becomes queryable. The delay from streaming data arriving at Pinot to it becoming queryable is usually below a millisecond

Many thanks for this insight!

It would be really interesting to have some real end-to-end benchmarks of the duration
from the arrival of a (Kafka) message at Pinot
to writing the results of a query that uses an index and includes data from the freshly arrived message to a new message
(e.g. maybe a continuous query that keeps sending results until the new data is included).
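
Not a proper benchmark, but here is a minimal sketch of how such a measurement could be wired up, assuming kafka-python and requests, a Kafka topic and Pinot table both named events, and an eventId column; all of these names are hypothetical placeholders, and the measured number includes the polling granularity:

```python
# Minimal end-to-end latency measurement: produce one uniquely keyed message
# to Kafka, then poll Pinot until a query over that key returns it.
# Topic, table and column names are hypothetical placeholders.
import json
import time
import uuid

import requests
from kafka import KafkaProducer  # pip install kafka-python

KAFKA_BOOTSTRAP = "localhost:9092"
TOPIC = "events"                                 # hypothetical Kafka topic
BROKER_URL = "http://localhost:8099/query/sql"   # default Pinot broker SQL endpoint
TABLE = "events"                                 # hypothetical Pinot realtime table

producer = KafkaProducer(
    bootstrap_servers=KAFKA_BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event_id = str(uuid.uuid4())
produced_at = time.time()
producer.send(TOPIC, {"eventId": event_id, "eventTimeMillis": int(produced_at * 1000)})
producer.flush()

# Poll Pinot until the freshly produced row becomes queryable.
sql = f"SELECT COUNT(*) FROM {TABLE} WHERE eventId = '{event_id}'"
while True:
    resp = requests.post(BROKER_URL, json={"sql": sql}, timeout=5)
    resp.raise_for_status()
    count = resp.json()["resultTable"]["rows"][0][0]
    if count > 0:
        print(f"end-to-end visible after {(time.time() - produced_at) * 1000:.0f} ms")
        break
    time.sleep(0.01)  # 10 ms polling interval bounds the measurement resolution
```

This measures "produced to Kafka → visible in a Pinot query result", so it includes the Kafka delivery time mentioned in the previous comment, not only the in-Pinot part; an index-specific comparison would repeat the same run against tables configured with different index types.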
