Proof of concept that pulls news data from RSS feeds and stores it in a data lake, using Delta Lake's delta-rs as the writer. The compute pipeline is defined with the PyArrow Acero library.
The POC iterates over the RSS feeds, pulls the data, extracts the desired fields, and stores them in a Delta Lake table. To avoid processing the same RSS feed entry twice, the IDs of processed entries are stored in rss_state.json.
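The deduplication step can be sketched as follows. This is an illustration under stated assumptions, not the code in main.py: it uses the standard library's XML parser rather than whatever feed library the project uses, it assumes each item carries a `guid`, and it assumes rss_state.json holds a flat JSON list of IDs. The helper names and `SAMPLE_RSS` are hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical two-item feed used only for illustration.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <item><guid>item-1</guid><title>First story</title>
    <pubDate>Mon, 03 Mar 2025 11:19:52 GMT</pubDate></item>
  <item><guid>item-2</guid><title>Second story</title>
    <pubDate>Mon, 03 Mar 2025 06:10:56 GMT</pubDate></item>
</channel></rss>"""


def load_seen_ids(state_path: Path) -> set:
    """Return the IDs already processed, or an empty set on first run."""
    if not state_path.exists():
        return set()
    return set(json.loads(state_path.read_text()))


def save_seen_ids(state_path: Path, seen: set) -> None:
    """Persist the processed IDs as a sorted JSON list (assumed layout)."""
    state_path.write_text(json.dumps(sorted(seen)))


def extract_new_entries(rss_xml: str, seen: set) -> list:
    """Extract fields from items not yet in `seen`, and mark them as seen."""
    root = ET.fromstring(rss_xml)
    new_entries = []
    for item in root.iter("item"):
        entry_id = item.findtext("guid")
        if entry_id is None or entry_id in seen:
            continue  # skip entries without an ID or already processed
        new_entries.append({
            "id": entry_id,
            "title": item.findtext("title"),
            "published": item.findtext("pubDate"),
        })
        seen.add(entry_id)
    return new_entries
```

A run would load the state, call `extract_new_entries` for each feed, write the new entries to the table, and only then save the state, so that a failed write does not mark entries as processed.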
The RSS data comes from the BBC's RSS feeds; see the BBC's page to explore them.
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python main.py
Try the following in your favorite data science tool:
from deltalake import DeltaTable
dt_news = DeltaTable("/tmp/curated_news/raw/")
df_news = dt_news.to_pandas()
df_news.head()
This will give you:
title published_time ... thumbnail_url category
0 Hegseth orders pause in US cyber-offensive aga... 2025-03-03 11:19:52 ... https://ichef.bbci.co.uk/ace/standard/240/cpsp... Technology
1 TikTok investigated over use of children's data 2025-03-03 06:10:56 ... https://ichef.bbci.co.uk/ace/standard/240/cpsp... Technology
2 Microsoft announces Skype will close in May 2025-02-28 17:48:12 ... https://ichef.bbci.co.uk/ace/standard/240/cpsp... Technology
3 WhatsApp says it has resolved technical problem 2025-02-28 18:21:37 ... https://ichef.bbci.co.uk/ace/standard/240/cpsp... Technology
4 Lloyds Bank says app issues fixed after payday... 2025-02-28 12:46:48 ... https://ichef.bbci.co.uk/ace/standard/240/cpsp... Technology
The objective of the POC was to validate whether PyArrow Acero can be an efficient method to pull RSS feeds and store them for analytical purposes.