What is the Oracle Intelligent Data Lake (IDL)?
The Intelligent Data Lake was one of my key takeaways from Oracle CloudWorld 24. So what is it, and why does it matter? In short, the Intelligent Data Lake (IDL) is Oracle's solution for building a comprehensive Data Lakehouse platform using modern data management tools. As you can see below, the IDL is a four-layer architecture. I will briefly go through each layer, starting with the data layer, since everything builds on it, and deep-dive into each layer in later posts.
Layer 1: Open Data Lake
The Open Data Lake contains three main types of data:
- Open-Source Data Formats: Apache Parquet is a columnar file format, while Delta Lake and Apache Iceberg are open table formats built on top of files such as Parquet. Together they are commonly used to create "tables" in a Data Lake (see the sketch after this list).
- Structured Data: This is your standard relational database data. It can come from one or more different databases, not necessarily an Oracle Database.
- Unstructured Data: Files of any type, including documents, images, XML files, audio, video, and CSV files. These differ from the open-source formats above, which are structured and designed for direct analytical operations (e.g., SQL statements against a Delta Lake table).
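To make the "tables" idea concrete, here is a minimal PySpark sketch that writes a Delta Lake table to object storage. The bucket path, package version, and configuration are illustrative assumptions on my part, not IDL specifics:

```python
from pyspark.sql import SparkSession

# A minimal sketch: turning plain files into a Delta Lake "table" with PySpark.
# The storage path and package version below are placeholders.
spark = (
    SparkSession.builder
    .appName("open-data-lake-demo")
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Parquet files become a queryable table once written in an open table format.
df = spark.createDataFrame([(1, "widget"), (2, "gadget")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("s3a://my-lake/products")

# Read it back like any other table.
spark.read.format("delta").load("s3a://my-lake/products").show()
```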
Layer 2: Unified Catalog
The Unified Catalog holds the Open Data Lake's metadata using the Oracle Data Catalog platform. This layer is also where data security is defined and stored, and enterprise AI models will be stored there as well.
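A catalog, at its core, maps a name to data plus metadata so every engine can resolve the same table. As a generic illustration (reusing the `spark` session from the sketch above), here is how an external location gets registered in plain Spark SQL; Oracle's Unified Catalog has its own interfaces, so treat the path and table name as assumptions:

```python
# Register an external Parquet location under a table name in the catalog.
# "sales_events" and the s3a:// path are hypothetical.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_events
    USING parquet
    LOCATION 's3a://my-lake/raw/sales_events'
""")

# Once registered, any engine sharing the catalog can discover the same table.
spark.sql("DESCRIBE TABLE EXTENDED sales_events").show(truncate=False)
```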
Layer 3: Open-Source Runtimes
The heavy hitter here is Apache Spark. It will be the main workhorse for data integration. The massively parallel nature of the platform allows for data virtualization in addition to data movement. As long as the data is registered in the Unified Catalog, the system can query a third-party cloud data object in real time (e.g., a GCP BigQuery table). I will go over these in my deep dives, but in a nutshell, Apache Flink is primarily a stream-processing framework, and Delta Sharing is an open protocol for sharing Delta Lake data.
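To illustrate the virtualization idea, here is a hedged sketch using the open-source Spark BigQuery connector. IDL's federated queries go through the Unified Catalog rather than this connector, so the connector version, GCP project, and table name below are stand-in assumptions:

```python
from pyspark.sql import SparkSession

# Sketch: querying a remote BigQuery table from Spark without copying it first.
# Connector version and the project/dataset/table names are placeholders.
spark = (
    SparkSession.builder
    .appName("federated-query-demo")
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1")
    .getOrCreate()
)

remote_df = (
    spark.read.format("bigquery")
    .option("table", "my-gcp-project.analytics.orders")  # hypothetical table
    .load()
)

# The filter and aggregation run against the live remote table.
remote_df.where("order_total > 100").groupBy("region").count().show()
```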
Layer 4: Modern Developer Experience
The modern developer experience is about letting users work with their tools of choice. Data scientists will want to use tools such as Jupyter Notebooks to write complex ML models, whereas others will simply want to interact with the data using SQL (e.g., Spark SQL). More to come.
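As a small sketch of what "tools of choice" means in practice, here is the same query written two ways against the hypothetical `sales_events` table from the catalog example, with `region` and `amount` as assumed columns:

```python
# Same table, two interfaces: a notebook-friendly DataFrame API for data
# scientists, and plain SQL for everyone else.
by_region_df = (
    spark.table("sales_events")
    .groupBy("region")
    .sum("amount")
)

by_region_sql = spark.sql("""
    SELECT region, SUM(amount)
    FROM sales_events
    GROUP BY region
""")

# Both compile to the same logical plan, so the choice is purely ergonomic.
by_region_df.show()
by_region_sql.show()
```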
What's Next?
This was a quick rundown of the four-layer Oracle Intelligent Data Lake architecture. Hopefully it gave you an idea of what each of these layers represents. Be on the lookout for my posts where I dive deeper into each of them.