soumilshah1995/An-easy-to-use-Python-utility-class-for-accessing-incremental-data-from-Hudi-Data-Lakes

An easy-to-use Python utility class for accessing incremental data from Hudi Data Lakes

One of the key features of Hudi is its support for incremental data processing. This means that Hudi can efficiently process only the changes that have occurred since the last time data was processed, rather than reprocessing the entire dataset on every run. This can significantly reduce processing time.
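As a rough illustration of what an incremental read looks like, the sketch below builds the Spark read options for a Hudi incremental query. The option keys are Hudi's datasource read settings; the helper name `incremental_read_options` and the example instant times are assumptions made for this example and are not taken from the repository's utility class.

```python
from typing import Dict, Optional


def incremental_read_options(begin_instant: str,
                             end_instant: Optional[str] = None) -> Dict[str, str]:
    """Build Spark read options for a Hudi incremental query.

    Hypothetical helper: only commits made after `begin_instant` are returned.
    """
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound on the commit range.
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


# "000" sorts before every real commit time, so it selects all commits.
opts = incremental_read_options("000")
```

With a running Spark session and a Hudi table path, these options would typically be passed as `spark.read.format("hudi").options(**opts).load(table_path)`.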

Let's move on to learning how to use Hudi Incremental Data Processing to power downstream systems. Search applications like Elasticsearch, relational databases, and non-relational databases are examples of downstream systems.

This repository provides an easy-to-use Python utility class for accessing incremental data from Hudi Data Lakes. The code logic is shown in the flow chart in the Code Logic section.

Please fork the repository and submit a pull request if you notice any flaws or have ideas to improve the template.

Videos


How To Use

Snap (1)


Code Logic

incremental drawio

PlantUML

image

  • NOTE: Make sure your environment variables are set for the AWS access key and secret key
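A quick pre-flight check can catch missing credentials before the template runs. The sketch below assumes the standard variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` that the AWS SDKs read; the template itself may expect different names, so adjust the tuple to match your setup. The function name `missing_aws_env` is hypothetical.

```python
import os
from typing import List, Mapping

# Assumed names: the standard variables read by the AWS SDKs.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```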

Demo

Performing some inserts

image

Running template

image

Metadata File on S3

image
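The demo suggests that the template keeps track of its position between runs in a metadata file on S3. The layout of that file is not shown here, so the following is only a hypothetical sketch of such a checkpoint: the JSON field name `last_commit_time` and both function names are assumptions, not the repository's actual format.

```python
import json
from typing import List, Optional

CHECKPOINT_KEY = "last_commit_time"  # hypothetical field name
EARLIEST = "000"  # sorts before every Hudi commit time


def parse_checkpoint(raw: Optional[bytes]) -> str:
    """Return the last processed commit time, or EARLIEST if no file exists yet."""
    if not raw:
        return EARLIEST
    return json.loads(raw.decode("utf-8"))[CHECKPOINT_KEY]


def build_checkpoint(commit_times: List[str], previous: str) -> bytes:
    """Serialize the newest commit time seen so far.

    Commit times are fixed-width digit strings, so string comparison
    orders them chronologically. If this run read nothing, the previous
    value is kept.
    """
    latest = max([previous, *commit_times])
    return json.dumps({CHECKPOINT_KEY: latest}).encode("utf-8")
```

In a real run, the commit times would come from the `_hoodie_commit_time` column of the incremental result, and the bytes could be read from and written to S3, for example with boto3's `get_object` and `put_object`.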

Performing one more insert

image

Running template

image
