Saturday, 8 July 2017

Pull or push - Source data

Push or pull?

Does the ETL team pull the data from the source systems or source systems push the data to the landing zone? I have always preferred the latter. But this new guy made me think about it from different perspective. Might he be right? In today's world where you hear about open architecture almost every day? Isn't such architecture the only way to meet constantly changing needs of business? What would be the benefits if we choose tightly coupled architecture instead?

I have several reasons why I prefer the push approach:
  1. If the table structure on source system is changed its owner is supposed to amend extraction programs to keep the agreed structure of the extracted file. Moreover, he is responsible to do that. But if my team pulls the data from this system and he is not informed in advance of this change our extraction programs might fail. I know what you say. In many companies (probably in most of them) we have our SLAs and other procedures to prevent such mistake to happen. But be honest - have you ever met the company which always follows their own procedures? If your answer is yes please accept my congratulations. You are the lucky one. Because I usually don't have this luck I try to keep the responsibility where it really belongs.

  2. But it is not the main reason why I prefer push over pull. It is also the knowledge that source system owner has. He knows better than me what is happening with the system.

  3. And last but not least - the architecture is really clean. At least from my perspective. Having one landing zone where source files are delivered in the same format following the same naming convention is really very helpful for the ETL team as it significantly facilitates the development and offers great time savings by using shared components. It also helps the company when new system is required to provide the data. It simply complies with the rules and nobody has to reinvent the wheel again.

Anyway, I firmly believe the architect had his own (perhaps) good reasons to push the pull architecture. Funny thing was that the company guidelines strongly recommended loosely coupled architecture. Unfortunately we have never had time to discuss his views further.



_______________________________________________

You always get the same reaction from the source system owner - "I'm not letting you touch my database with that thing!" The people responsible for service level agreements on transactional databases do not like applications they don't fully understand pulling unknown volumes of data at unknown times from their databases - especially if you are pulling off delta time slices on tables not indexed for it. They prefer to stay in control using database tools to pull out the data.

I prefer the pull method since you get end to end data lineage reporting in the one tool, the ETL tool can usually do it cheaper then other methods and you can use the one developer to modify the end to end extract to add extra columns later on. But I accept that most mission critical source system owners see outbound extracts as something they want to control.

Even in a pull method I would still stage the raw data to flat file or data store before transforming it just for a rollback and recovery point and an archive for troubleshooting.

Source :: http://it.toolbox.com/blogs/traveling-consultant/push-or-pull-loosely-or-tightly-coupled-etl-19539