Complex Data Aggregation using
Oracle Virtualized Data Sources
Real-world complex data aggregation using virtualized
data sources
Executive summary
This paper specifically
demonstrates that joins across disparate data sources to retrieve
complex data aggregation results using virtualized data access are
possible in a real-world environment. The reason for this proof of
concept is that in prior attempts, organizations trying to join such
information across disparate data sets failed to achieve the desired
response time results. Prior concepts weren’t like the proposed
concept, but the previous output in concept was identical.
Based on the assumption that confirmation was
required, a proof of concept was developed to retrieve data from a
variety of actual “system of record” sources aggregating data in
real time. The goal was to achieve acceptable (i.e. sub-second)
response times for this complex aggregation.
This proof of
concept was successful in achieving the desired results.
The results demonstrated in this paper used
real-world live data sources. Optimizations that will improve
performance beyond the testing results in this document include
moving hardware closer together, changing indexing schemes in the
data sources, using pre-aggregated data sources when available,
benchmarking other Sybase JDBC drivers, threading requests, using
faster hardware, data caching or increasing system memory). Any and
all of these optimization methods would only improve performance. In
other words, the benchmarks reported in this paper are worst-case.
The architecture discussed in “Tahiti
Integration Architecture” proposed for the delivery of these objects
as both Web Services objects and virtual Oracle Database objects.
This proof of concept delivered the objects solely as Oracle
Database objects. Web Services will perform better/faster than
Oracle Database objects because Java code running on the application
server executes faster than Java code in the database. Again, the
proof of concept demonstrates the worst-case scenario.
The ability to provide a non-obtrusive,
real-time, virtual integration solution is key to your company’s
future integration success. Virtualizing data access via Web
Services and inside the Oracle Database will allow your company to
access and analyze otherwise inaccessible information locked in
disparate parts of the organization.
Pulling data from the “system of record” at
run time rather than a dated snapshot of information provides
numerous benefits to the organization. Avoiding endless data
replication will not only save storage space and network bandwidth,
but also will provide users with current data that isn’t stale.
Because the objects demonstrated exist within an Oracle Database,
via the use of Oracle’s materialized views, data caching is easily
configurable if desired.
Other Oracle features, such as security, are
available for these objects. Additionally, the objects that were
created provide for real-time bidirectional transactions — the
underlying objects can be changed, not just read.
Business Case
A proactive approach to integration needs to
occur across all levels within your company. Without integrating
applications, data islands will exist throughout your heterogeneous
application environment.
Bad data just continues getting worse as it is
replicated. This is especially true when it is transformed in the
process of replication. Like the game of telephone, the data
quality rating (i.e. ICE) goes down each time data is replicated.
The age of the data becomes critical to its value too. The second
that data is extracted from a source system, it is dated.
Therefore, the integrity of the data decreases every time data is
moved through the systems. The only way to get a perfect ICE rating
is directly from the source (i.e. system of record).
In other words, real-time extraction of the
information is key. There needs to be one system of record for
distinct sets of data (i.e. customer from CORE, assets from Tahiti,
problems from Apollo).
Specific problems with Integration today:
·Data
quality
·Integrity
·Constancy
·Efficiency
·Security
·Lack
of data normalization
·Replicated
data everywhere
·Space
consumed over and over
·Lose
of accuracy in transformation, errors, etc.
·Tight-coupling
of interfaces
·No
access to real-time information (i.e. system of record)
·Fragmented
data set
Without access to real time information the
business can adequately define a “customer, asset, or location.”
However, historically you could not join data sets across disparate
applications. Replicating data and carrying pieces of data in each
unique system leads to “customer” data being fragmented across a
variety of applications. There is no way to pull together an entire
“customer” record in one place. This results in a number of systems
– each doing their own thing, yet everyone relies on everyone else’s
data.
In Sun’s The Inner Circle newsletter, the
article entitled “The Six Practices of Enterprise Integration,”
explains that “to stay in the leading pack, businesses invested in
technology to fuel this dash to market — but without thinking
through how the new systems would work together.” This paper goes
on to articulate how important integration is to your company’s
mission and goals - “unleashing the power of your existing
infrastructure via integration means huge payoffs in the way of
increasing revenue, reducing operations costs, improving customer
satisfaction, providing a gateway to new business channels, enabling
the enterprise to make better decisions faster…” This article
explains six best practices to save your company money in going down
this integration path - maximizing the use of existing IT assets and
preparing your company’s infrastructure to seamlessly integrate with
partners and customers. The article also points out Sun’s software
solution set that’s available to solve each of the six best practice
areas.
Data integration is not simply a matter of
copying data through every system. In fact, when source data is not
high quality data, its quality will get worse at each hand off. In
other words, bad data will continue being replicated. To define
data quality, one of our customers uses the ICE rating. I stands
for Integrity, C stands for completeness and E stands for
efficiency. The higher the rating, the better the quality of the
data. Like the game of telephone, the ICE rating goes down each
time data is replicated – especially when data is “transformed” in
the process. Transformations alter the original data.
Knowing the entire install base, its time in
service, mean time between failures, problem parts – a hardware
vendor would know that a part should be replaced (before it fails),
which would increase availability.
By successfully designing and implementing an
integrated solution comprised of data warehouses and reporting
tools, and by connecting sales, manufacturing, and service source
systems to that integrated solution, the following benefits will be
realized:
·The
ability to have field and manufacturing failure information
collected and available in a central location
·The
ability to have Install Base information, including in field
changes, available in one central location
·The
ability to slice product data by customer
·The
ability to trace product/FRU location through the supply chain
·The
ability to calculate MTBF across any FRU level part, for any company
would be readily available
·The
ability to view Vendor repair data
·The
ability to EOL duplicate systems which currently calculate MTBF will
save expense
Based on the above benefits, the following
information is required for a “central” data source (i.e.
virtualized data source):
·Field
and manufacturing
·Install
base including field changes
·Product
data by customer
·Product/FRU
location traceable through the supply chain
·MTBF
across any FRU level part down to the company level
·Vendor
repair data
The bottom line is that integration will save
resources – money, people and servers.
Complex aggregation object
The desired result set included statistics
summarized to a financial quarter for a specific model of hardware.
Details included in this information are the
number of units shipped and total internal outages experienced on
the respective hardware sorted by outage reason (or major outage
cause category).
Other components include the unique hardware
identifier (i.e. hardware_id that is used internally for each
hardware platform and the marketing_platform_dim_id that is used in
the shipped_install_base data) and the description for the hardware
platform also is included in the aggregation object.
As shown in figure 1, to create the object
described above, data was required locally (the cross-reference
information for join conditions across data sets) in the data
warehouse (product descriptions and install base shipment
information) and from a Sybase database (containing internal system
outage information). Queries were executed from a laptop attached to
the WAN using TOAD (Quest’s tool for Oracle application developers)
via SQL*Net from a laptop. The typical client machine for this type
of data would be an application or report server rather than a
laptop executing a query. Application and report servers are larger
and faster machines. Therefore, once again, this is a worst-case
example.
Figure 1 — architecture of systems
FlyingPig — local database
FlyingPig is a departmental development server
that houses an Oracle instance (bndrybay) that contains the local
tables (cross-reference data), simple objects (objects directly
referencing an external data source, such as the server table in
Sybase) and complex aggregation object (server_data).
Detailed information about FlyingPig follows:
Full domain name
flyingpig.central.xyz.com
Server
e4000, 8 processors, 1Gb RAM
Usage
This server is used for a number of internal applications in a
development environment.
IP address
172.20.0.75
Location
Colorado
Database description
Local Oracle Database
Network routing to FlyingPig is best obtained
using TraceRoute from the laptop to FlyingPig as shown here:
Hop
Timing
Node
1
<1 ms
129.147.22.254
2
<1 ms
b032b-3-5500.central.xyz.com [129.147.3.250]
3
<1 ms
flyingpig.central.xyz.com [172.20.0.75]
Since no cross-reference existed between the
internally used hardware_id and marketing_platform_dim_id, this data
set was built in a locally maintained table named hardware_crossref.
The hardware_crossref table contains two
numeric fields — hardware_id and marketing_platform_dim_id. This
table shows a sample of the data in the cross-reference table:
hardware_id
marketing_platform_dim_id
1
746
2
746
3
783
4
700
…
…
65
835
Data Warehouse
The data warehouse contains the product
descriptions and (detailed) install-base shipment information for
each hardware system.
This production Oracle Database resides on the
Global-Dss machine. Detailed information follows:
Production enterprise systems data warehouse. Over 5Tbytes of
storage and is completely utilized all the time via I/O.
Location
Colorado
DB link name on FlyingPig
Trac
TnsName
ODSDBWW
Username
TraqQual
Network routing to Global-Dss is best obtained
using TraceRoute from the laptop to Global-Dss as shown here:
Hop
Timing
Node
1
<1 ms
129.147.22.254
2
<1 ms
129.147.90.10
3
<1 ms
global-dss.central.xyz.com
Three primary tables were used from the data
warehouse in the creation of the complex aggregation object — each
is in the DWH schema:
Shipped_install_base_fct
A shipped install base for specific hardware
components sold to customers — detailed customer shipment data.
This table contains 13,068,817 records.
Marketing_platform_dim_v1
Descriptions for each marketing part based
on marketing ID. This table contains 249 records.
Day_dim
Is used to retrieve a high and low date for
a specific quarter and joined to the shipped_install_base table to
gather a specific quarter of shipped data. It contains 7,638
records.
Sybase database tables
Internal system outage data is stored in the
ENSIRQIUS Sybase instance that resides on the MoonPalace machine.
Detailed information about the MoonPalace server is as follows:
Stores internal system outages. Very quiet server. Nothing ever
really happening with this one.
Network routing to MoonPalace is best obtained
using TraceRoute from the laptop to MoonPalace as shown here:
Hop
Timing
Node
1
<1 ms
129.147.22.254
2
<1 ms
129.147.90.10
3
<1 ms
moonpalace.central.xyz.com [129.147.32.23]
There are four primary tables that were used
from the outage database in the creation of the complex aggregation
object:
Server
Each internal server and its respective
hardware ID. Contains 27,234 records. This table was queried to
get the individual machine sizes and the location of the hardware
shown in this paper.
BizHrsOutages
Detailed information of outages for internal
servers during business hours — detailed information, including
the cause of each outage. Contains 29,928 records.
CauseMaster
For each cause master ID, contains a
description of the cause and master category for it. Contains 962
records.
CauseMasterCategory
For each cause master category ID, contains
a description of the category. Contains 136 records.
Retrieving Sysbase Data
The above tables are all required to
retrieve the necessary information from the Sybase database.
TUSC’s Periscope product is used to make the
non-Oracle data look like Oracle tables. The JDBC driver used to
query the data from the Sybase instance was developed by i-net
Software (www.inetsoftware.de). The driver is called SY-TO, which is
a JDBC 2.0 driver for Sybase databases.
No performance benchmarks were done to compare
SY-TO to other Sybase JDBC drivers. Therefore, there may be
faster/better drivers available. The connection to MoonPalace was
made via TCP/IP through a specific port. The JDBC driver used to
query the data from the Sybase instance came from i-net Software (www.inetsoftware.de).
The driver is called SY-TO, which is a JDBC 2.0 driver for Sybase
databases. The connection to MoonPalace was made via TCP/IP through
a specific port.
“Simple” virtual object — server_summary
Rather than pull all of the data from the
above objects into Oracle for processing, a query was created that
joined the above four tables into a “simple” virtual object named
server_summary. Only the matching rows of data are pulled back to
Oracle (summarized by date/day and outage master cause), which, in
turn, will be summarized for a financial quarter. The Sybase
database will actually run the summarization query. Therefore, the
local Oracle database is not taxed with the indexing or
summarization burden to find the rows matching the criteria.
It might be asked why this data isn’t
summarized by financial quarter, but rather at the individual date
level. This is because financial quarters aren’t defined within the
Sybase database. Rather, financial quarters are defined in the
Oracle data warehouse.
The columns in the server_summary virtual
object include:
Column
Data Type
hardware_id
Number
bizstartdate
Date
Outage
description / reason
Varchar2(80)
Outage Duration
Number
The complete query that is used to extract
this information from the Sybase database is:
select s.hardwareId,
b.bizStartDate,
ccm.description,
sum(b.bizDuration)
from "dbo"."Server" s,
"dbo"."BizHrsOutages" b,
"dbo"."CauseMaster" cm,
"dbo"."CauseCategoryMaster" ccm'
where s."serverId" = b."serverId"
and b."causeId" = cm."causeId"
and cm."causeCategoryId" =
ccm."causeCategoryId
group by s.hardwareId,
b.bizStartDate,
ccm.description
To maximize the use of indexes in the Sybase
database, one must keep in mind the above query needs to be modified
on the fly as a result of bind variables that are passed in.
Specifically, for our queries, we pass a hardware_id and date range
to the query. Periscope automatically manages this.
Complex virtual object — server_data
The complex virtual object (server_data) is
based on a combination of data from the above local Oracle table (hardware_crossref),
data warehouse tables, and Sybase outage database tables.
The entity relationship diagram for this
object is shown here:
The following table shows the columns of the
complex object along with their respective source(s):
Column
Source (database)
hardware_id
Hardware
cross-reference (local) joined to server_summary (Sybase)
marketing_platform
Hardware
cross-reference (local) joined to marketing_platform (data
warehouse) and shipped_install_base (data warehouse)
outage_reason
server_summary
(Sybase)
platform_description
marketing_platform (data warehouse)
financial_year
day_dim (data
warehouse)
financial_quarter
day_dim (data
warehouse)
outage_duration
server_summary
(Sybase)
shipped_units
shipped_install_base (data warehouse)
A query against the above object for a
specific hardware_id, financial_year and financial_quarter returns
the respective marketing_platform, outage_reason,
platform_description, outage_duration and shipped_units.
For example, the following query:
select marketing_platform,
platform_description,
sum(shipped_units),
sum(outage_duration)
from server_data
where hardware_id = 1
and financial_year = 2003
Will yield the following results:
marketing_
platform
platform_
description
outage_
duration
shipped_
units
746
E4000
Campfire ENT server
9965
489
Marketing_platform is obtained from the
hardware_crossref table in the local database using a hardware_id of
1. Platform_description is obtained from the marketing_platform
table in the data warehouse via a database link. The outage duration
is calculated from detail data in the Sybase database via JDBC that
comes from the “simple” object that summarizes the data to the date
level, which is then summarized for the specific quarter requested.
Shipped_units is calculated by summarizing the number of records
matching the period requested for this marketing_platform from the
shipped_install_base table in the data warehouse using a database
link.
Clearly, this is a complex virtual object that
aggregates a number of sources in real time. The performance of
these queries varies depending on the number of underlying rows
needed to answer the query, but typically takes less than about one
second to return a record (i.e. in this case, a record is an answer
to a complex question). If pre-summarized star schema data
warehouse tables were queried rather than detailed information, this
query would result in consistent sub-second response times.
High-level flow
The following high-level flow shows the
operations performed by this complex object:
Figure out the hardware_id, financial_year
and financial_quarter that are being queried (from the where
clause of the select statement).
Retrieve the marketing_platform_dim_id from
hardware_crossref (local table on FlyingPig) for the hardware_id
passed in.
Retrieve the low (minimum) and high
(maximum) dates for the respective financial quarter from day_dim
(from data warehouse using a database link) for the financial_year
and financial_quarter passed in.
Retrieve the platform_description
(description for this hardware) from marketing_platform (data
warehouse) for the marketing_platform_dim_id retrieved above.
Retrieve the total number of shipped units
(sum of shipped_units) from shipped_install_base (data warehouse)
joined with day_dim (data warehouse) for the
marketing_platform_dim_id, financial_year and financial_quarter.
Retrieve the total outage time by
outage_reason from server_summary (from Sybase outage database
using JDBC) for the hardware_id, low- and high date range.
Output one (1) server_data record per
distinct outage_reason.
Benchmark
To demonstrate the performance of the complex
aggregation object, simple and complex benchmarks were performed.
Simple benchmark
For the simple benchmark, random queries were
executed against the complex aggregation object (server_data) and
timings were recorded as shown here. Ten random queries were
executed five times. All queries executed in about one second or
less – the underlying data to retrieve these results across eight
tables and three distinct databases is 1000s of rows.
Queries executed are similar to the following:
select count(*)
from server_data
where hardware_id = 1
and financial_year = 2003
and financial_quarter = 1
Complex benchmark
For purposes of further benchmarking the
performance of this complex aggregation object, another aggregation
object (sum_server) was created to summarize every product line for
a specific year (down to the quarter, but summarizing outage
reasons). In other words, it’s further aggregation of the complex
object.
The flow of this benchmark object is as
follows:
Determine the year to retrieve (from the
where clause).
Retrieve each unique hardware_id from the
hardware_crossref table (in the local FlyingPig database).
Loop through four respective quarters for
the input year (1 through 4).
Retrieve the marketing_platform,
plaform_description, summary of outages and total shipped units
for input year and respective quarters from the server_summary
(our complex aggregation object) table.
Calculate the time to process this
aggregation.
Output this information.
Timing per row was output as a column in this
query. Note that this query accesses most of the 12 million rows
that are in the install base tables to retrieve these results.
The following information shows the summary of
the benchmark results:
Minimum time per “record”
0.506 seconds
Maximum time per “record”
8.543 seconds
Although these aggregations can take up to 8.5
seconds to complete, as you can see, this is an aggregation of an
already complex object. If response time for this aggregated object
was not acceptable, a simplified object could be created to achieve
the desired performance results.
The following table shows a sample of the
detailed query results (all records, all tables, specific year) and
the detailed timing results for this complex aggregation query:
Beyond structured data
This real-world example demonstrates the
collection of structured data — such as Oracle and Sybase data.
Periscope manages any structured data source (any JDBC source, XML,
etc.), and will also work for unstructured data — such as email
(some employees have mentioned they have seven gigabytes of email),
documents on laptops, etc.
Periscope makes all data look like any other
physical Oracle table in the database. Application don’t know the
difference between a physical table and a Periscope object (called a
Monocle). Periscope also publishes these data objects as Web
Services and HTTP dynamic XML objects.
The real power of this approach is that
providing “life” beyond structured data would allow structured and
non-structured data to be joined, providing a holistic view to the
organization.
Every object that is created contains metadata
(data about the data), which can also be used for other purposes.
Summary
This paper is proof that it is possible to
perform real-world complex data aggregation using virtualized data
sources. This further proves that the theoretically-based
architecture described in the prior “Integration Architecture” paper
will work in a real-world environment. This is a major breakthrough
in technology and integration solutions going forward.