First of all, let’s start with the definition and a sample architecture picture of a DWH.
“A data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.” Wikipedia
In a real-world scenario, such as a bank (financial institution), the data stored in the DWH is past data (up to yesterday).
The daily transactions (data) are loaded every night by ETL processes. These processes usually take hours (often more than 8).
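To make the nightly batch idea concrete, here is a minimal sketch of such an ETL step in Python. The table and column names (`fact_transactions`, `tx_date`, `amount`) are illustrative assumptions, and plain dicts/lists stand in for the real source database and warehouse:

```python
from datetime import date, timedelta

# Hypothetical nightly ETL step: extract yesterday's transactions from the
# source system, apply a transformation, and load them into the warehouse
# fact table. Dicts and lists stand in for real databases.

def nightly_etl(source_rows, warehouse, business_date):
    # Extract: only the rows for the business date being loaded
    extracted = [r for r in source_rows if r["tx_date"] == business_date]
    # Transform: e.g. normalise amounts to integer cents
    transformed = [
        {**r, "amount_cents": int(round(r["amount"] * 100))}
        for r in extracted
    ]
    # Load: append into the warehouse "fact table"
    warehouse.setdefault("fact_transactions", []).extend(transformed)
    return len(transformed)

yesterday = date.today() - timedelta(days=1)
source = [
    {"tx_id": 1, "tx_date": yesterday, "amount": 10.50},
    # Today's transaction: it will NOT reach the DWH until tomorrow night
    {"tx_id": 2, "tx_date": date.today(), "amount": 99.99},
]
dwh = {}
loaded = nightly_etl(source, dwh, yesterday)
print(loaded)  # 1 — only yesterday's row made it into the DWH
```

Note that today's transaction stays invisible to the DWH until the next nightly run, which is exactly the gap discussed next.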
Business issue: how can we query the data in real time?
Past data is in the DWH, and current (today's) data is in the source systems.
Current ETL tools, like ODI and GoldenGate, can do this via Change Data Capture (CDC). It is very fast, and CDC works by reading the Oracle redo logs.
The only issue is that the source system must be an Oracle DB; in a real-world situation there are several source systems with different DBs and technologies, and not all of them are Oracle 🙂
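When log-based CDC is not available, a common database-agnostic fallback is timestamp-based polling: repeatedly select rows changed since a high-water mark. A minimal sketch, assuming the source table has a `last_updated` column (that column name, and the in-memory rows, are assumptions for illustration):

```python
# Database-agnostic CDC fallback: poll the source table for rows changed
# since the last high-water mark. Assumes the source keeps a "last_updated"
# value per row; real schemas may differ (or lack such a column entirely).

def poll_changes(source_rows, high_water_mark):
    changes = [r for r in source_rows if r["last_updated"] > high_water_mark]
    # Advance the mark to the newest change we saw, so the next poll
    # only picks up rows modified after this one
    new_mark = max((r["last_updated"] for r in changes), default=high_water_mark)
    return changes, new_mark

rows = [
    {"id": 1, "last_updated": 100},
    {"id": 2, "last_updated": 205},
    {"id": 3, "last_updated": 310},
]
changes, mark = poll_changes(rows, 200)
print([r["id"] for r in changes], mark)  # [2, 3] 310
```

Polling is slower and cannot see deletes or intermediate updates, which is why log-based CDC is preferred when the source supports it.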
Proposed solution: integrate a new Real-Time Layer / Stream Layer, based on HDFS/Hadoop, in parallel with the current DWH.
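This is essentially a lambda-architecture pattern: historical data served from the DWH (batch layer), today's data from the stream layer, merged at query time. A minimal sketch of that merge, where the layer contents and record shapes are invented for illustration:

```python
# Lambda-style merged query: the batch layer holds balances as of the last
# nightly DWH load; the stream layer holds today's events not yet in the
# DWH. Both layers here are illustrative in-memory stand-ins.

batch_layer = {"acct-1": 1000.0}             # balance as of yesterday (DWH)
stream_layer = {"acct-1": [250.0, -40.0]}    # today's transactions (stream)

def real_time_balance(account):
    historical = batch_layer.get(account, 0.0)  # from the DWH
    todays = sum(stream_layer.get(account, []))  # from the real-time layer
    return historical + todays

print(real_time_balance("acct-1"))  # 1210.0
```

After each nightly ETL run, today's events are folded into the batch layer and the stream layer is reset, so the merged view always stays current.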
Part 2 will be with some hands on implementation. Stay tuned 🙂