Proof Of Concept to pull news from RSS feeds, and store them in a Data Lake using Delta Lake's "delta-rs" as a writer, and "PyArrow Acero" as the streaming and compute engine.

polsm91/acero-delta-lake-streaming

News Processor based on PyArrow Acero

Proof of concept that pulls news data from RSS feeds and stores it in a data lake, using Delta Lake's delta-rs as the writer. The compute pipeline is defined with the PyArrow Acero library.

It iterates over the RSS feeds, pulls the data, extracts the desired fields, and stores them in a Delta Lake table. To avoid processing the same RSS feed entry twice, the IDs of processed entries are stored in rss_state.json.
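The extraction and de-duplication steps described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in main.py: the function names, the use of `<guid>` as the entry ID, the `{"seen_ids": [...]}` layout of the state file, and the sample feed are all hypothetical. Only the field names (`title`, `published_time`, `thumbnail_url`, `category`) and the rss_state.json file name come from this README.

```python
import json
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from pathlib import Path

# Namespace commonly used for <media:thumbnail> in RSS feeds.
MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}

# Made-up feed used only for this sketch.
SAMPLE_RSS = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example headline</title>
      <guid>https://example.com/news/1</guid>
      <pubDate>Mon, 03 Mar 2025 11:19:52 GMT</pubDate>
      <media:thumbnail url="https://example.com/thumb/1.jpg"/>
    </item>
    <item>
      <title>Another headline</title>
      <guid>https://example.com/news/2</guid>
      <pubDate>Fri, 28 Feb 2025 17:48:12 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def extract_entries(rss_xml: str, category: str) -> list[dict]:
    """Pull the desired fields out of every <item> in the feed."""
    entries = []
    for item in ET.fromstring(rss_xml).iter("item"):
        thumb = item.find("media:thumbnail", MEDIA_NS)
        entries.append({
            "id": item.findtext("guid"),  # assumption: guid identifies an entry
            "title": item.findtext("title"),
            "published_time": parsedate_to_datetime(item.findtext("pubDate")),
            "thumbnail_url": thumb.get("url") if thumb is not None else None,
            "category": category,
        })
    return entries


def load_seen_ids(state_path: Path) -> set[str]:
    """Read the IDs processed so far; an absent file means none."""
    if not state_path.exists():
        return set()
    return set(json.loads(state_path.read_text())["seen_ids"])


def select_new_entries(entries: list[dict], state_path: Path) -> list[dict]:
    """Return only unseen entries and record their IDs in the state file."""
    seen = load_seen_ids(state_path)
    new_entries = [e for e in entries if e["id"] not in seen]
    seen |= {e["id"] for e in new_entries}
    state_path.write_text(json.dumps({"seen_ids": sorted(seen)}))
    return new_entries
```

Calling `select_new_entries(extract_entries(SAMPLE_RSS, "Technology"), Path("rss_state.json"))` twice in a row returns both entries the first time and an empty list the second time, because the IDs were recorded on the first call.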

The RSS data comes from the BBC's RSS feeds; explore the BBC's page for details.

Instructions

Create virtual environment

python -m venv venv
source venv/bin/activate

Install dependencies

pip install -r requirements.txt

Run the script

python main.py

Exploring the results

Try the following in your favorite data science tool:

from deltalake import DeltaTable
dt_news = DeltaTable("/tmp/curated_news/raw/")
df_news = dt_news.to_pandas()
df_news.head()

This will give you output similar to:

                                               title      published_time  ...                                      thumbnail_url    category
0  Hegseth orders pause in US cyber-offensive aga... 2025-03-03 11:19:52  ...  https://ichef.bbci.co.uk/ace/standard/240/cpsp...  Technology
1    TikTok investigated over use of children's data 2025-03-03 06:10:56  ...  https://ichef.bbci.co.uk/ace/standard/240/cpsp...  Technology
2        Microsoft announces Skype will close in May 2025-02-28 17:48:12  ...  https://ichef.bbci.co.uk/ace/standard/240/cpsp...  Technology
3    WhatsApp says it has resolved technical problem 2025-02-28 18:21:37  ...  https://ichef.bbci.co.uk/ace/standard/240/cpsp...  Technology
4  Lloyds Bank says app issues fixed after payday... 2025-02-28 12:46:48  ...  https://ichef.bbci.co.uk/ace/standard/240/cpsp...  Technology

Why?

The objective of the POC was to validate whether PyArrow Acero is an efficient way to pull RSS feeds and store them for analytical purposes.
