Quantum Solutions, Inc.    

"The Answer For Your Future"

 


* Warning Signs Of Businesses

* Why Businesses Don't Grow

* Why A Business Plan Is Important

* Why An Entrepreneur Is Successful

* Traits Of A Good Salesperson

* Why Strategies Fail

* Why Businesses Fail

 

 


Complex Data Aggregation using Oracle Virtualized Data Sources

 

 

 

 

 

 


 

Real-world complex data aggregation using virtualized data sources

 

Executive summary

This paper specifically demonstrates that joins across disparate data sources to retrieve complex data aggregation results using virtualized data access are possible in a real-world environment. The reason for this proof of concept is that in prior attempts, organizations trying to join such information across disparate data sets failed to achieve the desired response time results. Prior concepts weren’t like the proposed concept, but the previous output in concept was identical.

 

Based on the assumption that confirmation was required, a proof of concept was developed to retrieve data from a variety of actual “system of record” sources aggregating data in real time. The goal was to achieve acceptable (i.e. sub-second) response times for this complex aggregation.

 

This proof of concept was successful in achieving the desired results.

The results demonstrated in this paper used real-world live data sources. Optimizations that will improve performance beyond the testing results in this document include moving hardware closer together, changing indexing schemes in the data sources, using pre-aggregated data sources when available, benchmarking other Sybase JDBC drivers, threading requests, using faster hardware, data caching or increasing system memory). Any and all of these optimization methods would only improve performance. In other words, the benchmarks reported in this paper are worst-case.

 

The architecture discussed in “Tahiti Integration Architecture” proposed for the delivery of these objects as both Web Services objects and virtual Oracle Database objects. This proof of concept delivered the objects solely as Oracle Database objects. Web Services will perform better/faster than Oracle Database objects because Java code running on the application server executes faster than Java code in the database. Again, the proof of concept demonstrates the worst-case scenario.

 

The ability to provide a non-obtrusive, real-time, virtual integration solution is key to your company’s future integration success. Virtualizing data access via Web Services and inside the Oracle Database will allow your company to access and analyze otherwise inaccessible information locked in disparate parts of the organization.

 

Pulling data from the “system of record” at run time rather than a dated snapshot of information provides numerous benefits to the organization. Avoiding endless data replication will not only save storage space and network bandwidth, but also will provide users with current data that isn’t stale. Because the objects demonstrated exist within an Oracle Database, via the use of Oracle’s materialized views, data caching is easily configurable if desired.

 

Other Oracle features, such as security, are available for these objects. Additionally, the objects that were created provide for real-time bidirectional transactions — the underlying objects can be changed, not just read.

 

Business Case

A proactive approach to integration needs to occur across all levels within your company.  Without integrating applications, data islands will exist throughout your heterogeneous application environment. 

 

Bad data just continues getting worse as it is replicated.  This is especially true when it is transformed in the process of replication.  Like the game of telephone, the data quality rating (i.e. ICE) goes down each time data is replicated.  The age of the data becomes critical to its value too.  The second that data is extracted from a source system, it is dated.  Therefore, the integrity of the data decreases every time data is moved through the systems.  The only way to get a perfect ICE rating is directly from the source (i.e. system of record).

 

In other words, real-time extraction of the information is key.  There needs to be one system of record for distinct sets of data (i.e. customer from CORE, assets from Tahiti, problems from Apollo).

 

Specific problems with Integration today:

 

· Data quality

· Integrity

· Constancy

· Efficiency

· Security

· Lack of data normalization

· Replicated data everywhere

· Space consumed over and over

· Lose of accuracy in transformation, errors, etc.

· Tight-coupling of interfaces

· No access to real-time information (i.e. system of record)

· Fragmented data set

 

Without access to real time information the business can adequately define a “customer, asset, or location.”  However, historically you could not join data sets across disparate applications.  Replicating data and carrying pieces of data in each unique system leads to “customer” data being fragmented across a variety of applications.  There is no way to pull together an entire “customer” record in one place.  This results in a number of systems – each doing their own thing, yet everyone relies on everyone else’s data. 

 

In Sun’s The Inner Circle newsletter, the article entitled “The Six Practices of Enterprise Integration,” explains that “to stay in the leading pack, businesses invested in technology to fuel this dash to market — but without thinking through how the new systems would work together.”  This paper goes on to articulate how important integration is to your company’s mission and goals - “unleashing the power of your existing infrastructure via integration means huge payoffs in the way of increasing revenue, reducing operations costs, improving customer satisfaction, providing a gateway to new business channels, enabling the enterprise to make better decisions faster…”  This article explains six best practices to save your company money in going down this integration path - maximizing the use of existing IT assets and preparing your company’s infrastructure to seamlessly integrate with partners and customers.  The article also points out Sun’s software solution set that’s available to solve each of the six best practice areas. 

 

Data integration is not simply a matter of copying data through every system.  In fact, when source data is not high quality data, its quality will get worse at each hand off.  In other words, bad data will continue being replicated.  To define data quality, one of our customers uses the ICE rating.  I stands for Integrity, C stands for completeness and E stands for efficiency.  The higher the rating, the better the quality of the data.  Like the game of telephone, the ICE rating goes down each time data is replicated – especially when data is “transformed” in the process.  Transformations alter the original data. 

 

Knowing the entire install base, its time in service, mean time between failures, problem parts – a hardware vendor would know that a part should be replaced (before it fails), which would increase availability. 

 

By successfully designing and implementing an integrated solution comprised of data warehouses and reporting tools, and by connecting sales, manufacturing, and service source systems to that integrated solution, the following benefits will be realized:

 

· The ability to have field and manufacturing failure information collected and available in a central location

· The ability to have Install Base information, including in field changes, available in one central location

· The ability to slice product data by customer

· The ability to trace product/FRU location through the supply chain

· The ability to calculate MTBF across any FRU level part, for any company would be readily available

· The ability to view Vendor repair data

· The ability to EOL duplicate systems which currently calculate MTBF will save expense

 

Based on the above benefits, the following information is required for a “central” data source (i.e. virtualized data source):

      · Field and manufacturing

· Install base including field changes

· Product data by customer

· Product/FRU location traceable through the supply chain

· MTBF across any FRU level part down to the company level

· Vendor repair data

 

The bottom line is that integration will save resources – money, people and servers.

 

Complex aggregation object

The desired result set included statistics summarized to a financial quarter for a specific model of hardware.

 

Details included in this information are the number of units shipped and total internal outages experienced on the respective hardware sorted by outage reason (or major outage cause category).

 

Other components include the unique hardware identifier (i.e. hardware_id that is used internally for each hardware platform and the marketing_platform_dim_id that is used in the shipped_install_base data) and the description for the hardware platform also is included in the aggregation object.

 

As shown in figure 1, to create the object described above, data was required locally (the cross-reference information for join conditions across data sets) in the data warehouse (product descriptions and install base shipment information) and from a Sybase database (containing internal system outage information). Queries were executed from a laptop attached to the WAN using TOAD (Quest’s tool for Oracle application developers) via SQL*Net from a laptop. The typical client machine for this type of data would be an application or report server rather than a laptop executing a query. Application and report servers are larger and faster machines. Therefore, once again, this is a worst-case example.

 

Figure 1 — architecture of systems

 

FlyingPig — local database

FlyingPig is a departmental development server that houses an Oracle instance (bndrybay) that contains the local tables (cross-reference data), simple objects (objects directly referencing an external data source, such as the server table in Sybase) and complex aggregation object (server_data).

 

Detailed information about FlyingPig follows:

 

Full domain name

flyingpig.central.xyz.com

Server

e4000, 8 processors, 1Gb RAM

Usage

This server is used for a number of internal applications in a development environment.

IP address

172.20.0.75

Location

Colorado

Database description

Local Oracle Database

 

Network routing to FlyingPig is best obtained using TraceRoute from the laptop to FlyingPig as shown here:

 

Hop

Timing

Node

1

<1 ms

129.147.22.254

2

<1 ms

b032b-3-5500.central.xyz.com [129.147.3.250]

3

<1 ms

flyingpig.central.xyz.com [172.20.0.75]

 

Since no cross-reference existed between the internally used hardware_id and marketing_platform_dim_id, this data set was built in a locally maintained table named hardware_crossref.

 

The hardware_crossref table contains two numeric fields — hardware_id and marketing_platform_dim_id.  This table shows a sample of the data in the cross-reference table:

 

hardware_id

marketing_platform_dim_id

1

746

2

746

3

783

4

700

65

835

 

Data Warehouse

The data warehouse contains the product descriptions and (detailed) install-base shipment information for each hardware system.

 

This production Oracle Database resides on the Global-Dss machine. Detailed information follows:

 

Full domain name

global-dss.central.xyz.com

Machine

e6500, 12 400Mhz/8mb cache processors, 10Gb RAM, hme network card, 7 photons w/36g drives and 2 photons w/72g drives.

IP address

129.147.30.48

SID

ODSDBWW

Usage

Production enterprise systems data warehouse.  Over 5Tbytes of storage and is completely utilized all the time via I/O.

Location

Colorado

DB link name on FlyingPig

Trac

TnsName

ODSDBWW

Username

TraqQual

 

Network routing to Global-Dss is best obtained using TraceRoute from the laptop to Global-Dss as shown here:

 

Hop

Timing

Node

1

<1 ms

129.147.22.254

2

<1 ms

129.147.90.10

3

<1 ms

global-dss.central.xyz.com

 

Three primary tables were used from the data warehouse in the creation of the complex aggregation object — each is in the DWH schema:

Shipped_install_base_fct

A shipped install base for specific hardware components sold to customers — detailed customer shipment data. This table contains 13,068,817 records.

 

Marketing_platform_dim_v1

Descriptions for each marketing part based on marketing ID. This table contains 249 records.

 

Day_dim

Is used to retrieve a high and low date for a specific quarter and joined to the shipped_install_base table to gather a specific quarter of shipped data. It contains 7,638 records.

Sybase database tables

Internal system outage data is stored in the ENSIRQIUS Sybase instance that resides on the MoonPalace machine. Detailed information about the MoonPalace server is as follows:

 

Full domain name

moonpalace.central.xyz.com

Machine

e4500, 4 400Mhz/8mb cache processors and 4 400Mhz/4mb cache processors, 3Gb RAM, hme network card, 1 photon w/22 9gb drives.

IP address

129.147.32.23

Location

Colorado

Usage

Stores internal system outages.  Very quiet server. Nothing ever really happening with this one.

 

Network routing to MoonPalace is best obtained using TraceRoute from the laptop to MoonPalace as shown here:

 

Hop

Timing

Node

1

<1 ms

129.147.22.254

2

<1 ms

129.147.90.10

3

<1 ms

moonpalace.central.xyz.com [129.147.32.23]

 

There are four primary tables that were used from the outage database in the creation of the complex aggregation object:

Server

Each internal server and its respective hardware ID. Contains 27,234 records.  This table was queried to get the individual machine sizes and the location of the hardware shown in this paper.

 

BizHrsOutages

Detailed information of outages for internal servers during business hours — detailed information, including the cause of each outage. Contains 29,928 records.

 

CauseMaster

For each cause master ID, contains a description of the cause and master category for it. Contains 962 records.

 

CauseMasterCategory

For each cause master category ID, contains a description of the category. Contains 136 records.

 

Retrieving Sysbase Data

The above tables are all required to retrieve the necessary information from the Sybase database.

TUSC’s Periscope product is used to make the non-Oracle data look like Oracle tables.  The JDBC driver used to query the data from the Sybase instance was developed by i-net Software (www.inetsoftware.de). The driver is called SY-TO, which is a JDBC 2.0 driver for Sybase databases.

 

No performance benchmarks were done to compare SY-TO to other Sybase JDBC drivers. Therefore, there may be faster/better drivers available. The connection to MoonPalace was made via TCP/IP through a specific port.  The JDBC driver used to query the data from the Sybase instance came from i-net Software (www.inetsoftware.de). The driver is called SY-TO, which is a JDBC 2.0 driver for Sybase databases. The connection to MoonPalace was made via TCP/IP through a specific port.

 

“Simple” virtual object — server_summary

Rather than pull all of the data from the above objects into Oracle for processing, a query was created that joined the above four tables into a “simple” virtual object named server_summary. Only the matching rows of data are pulled back to Oracle (summarized by date/day and outage master cause), which, in turn, will be summarized for a financial quarter. The Sybase database will actually run the summarization query. Therefore, the local Oracle database is not taxed with the indexing or summarization burden to find the rows matching the criteria.

 

It might be asked why this data isn’t summarized by financial quarter, but rather at the individual date level. This is because financial quarters aren’t defined within the Sybase database. Rather, financial quarters are defined in the Oracle data warehouse.

 

The columns in the server_summary virtual object include:

 

Column

Data Type

hardware_id

Number

bizstartdate

Date

Outage description / reason

Varchar2(80)

Outage Duration

Number

 

The complete query that is used to extract this information from the Sybase database is:

 

select s.hardwareId,

            b.bizStartDate,

           ccm.description,

           sum(b.bizDuration)

from   "dbo"."Server" s,

           "dbo"."BizHrsOutages" b,

           "dbo"."CauseMaster" cm,

           "dbo"."CauseCategoryMaster" ccm'

where  s."serverId"    = b."serverId"

and      b."causeId"    = cm."causeId"

and      cm."causeCategoryId" =

            ccm."causeCategoryId

group  by s.hardwareId,

            b.bizStartDate,

           ccm.description

 

To maximize the use of indexes in the Sybase database, one must keep in mind the above query needs to be modified on the fly as a result of bind variables that are passed in. Specifically, for our queries, we pass a hardware_id and date range to the query.  Periscope automatically manages this.

 

Complex virtual object — server_data

The complex virtual object (server_data) is based on a combination of data from the above local Oracle table (hardware_crossref), data warehouse tables, and Sybase outage database tables.

 

The entity relationship diagram for this object is shown here:

 

 

The following table shows the columns of the complex object along with their respective source(s):

 

Column

Source (database)

hardware_id

Hardware cross-reference (local) joined to server_summary (Sybase)

marketing_platform

Hardware cross-reference (local) joined to marketing_platform (data warehouse) and shipped_install_base (data warehouse)

outage_reason

server_summary (Sybase)

platform_description

marketing_platform (data warehouse)

financial_year

day_dim (data warehouse)

financial_quarter

day_dim (data warehouse)

outage_duration

server_summary (Sybase)

shipped_units

shipped_install_base (data warehouse)

 

A query against the above object for a specific hardware_id, financial_year and financial_quarter returns the respective marketing_platform, outage_reason, platform_description, outage_duration and shipped_units.

 

For example, the following query:

 

select marketing_platform,

           platform_description,

           sum(shipped_units),

           sum(outage_duration)

from   server_data

where hardware_id    = 1

and     financial_year = 2003

 

Will yield the following results:

 

marketing_
platform

platform_
description

outage_
duration

shipped_
units

746

E4000 Campfire ENT server

9965

489

 

Marketing_platform is obtained from the hardware_crossref table in the local database using a hardware_id of 1. Platform_description is obtained from the marketing_platform table in the data warehouse via a database link. The outage duration is calculated from detail data in the Sybase database via JDBC that comes from the “simple” object that summarizes the data to the date level, which is then summarized for the specific quarter requested. Shipped_units is calculated by summarizing the number of records matching the period requested for this marketing_platform from the shipped_install_base table in the data warehouse using a database link.

 

Clearly, this is a complex virtual object that aggregates a number of sources in real time.  The performance of these queries varies depending on the number of underlying rows needed to answer the query, but typically takes less than about one second to return a record (i.e. in this case, a record is an answer to a complex question).  If pre-summarized star schema data warehouse tables were queried rather than detailed information, this query would result in consistent sub-second response times.

 

High-level flow

The following high-level flow shows the operations performed by this complex object:

  1. Figure out the hardware_id, financial_year and financial_quarter that are being queried (from the where clause of the select statement).

  2. Retrieve the marketing_platform_dim_id from hardware_crossref (local table on FlyingPig) for the hardware_id passed in.

  3. Retrieve the low (minimum) and high (maximum) dates for the respective financial quarter from day_dim (from data warehouse using a database link) for the financial_year and financial_quarter passed in.

  4. Retrieve the platform_description (description for this hardware) from marketing_platform (data warehouse) for the marketing_platform_dim_id retrieved above.

  5. Retrieve the total number of shipped units (sum of shipped_units) from shipped_install_base (data warehouse) joined with day_dim (data warehouse) for the marketing_platform_dim_id, financial_year and financial_quarter.

  6. Retrieve the total outage time by outage_reason from server_summary (from Sybase outage database using JDBC) for the hardware_id, low- and high date range.

  7. Output one (1) server_data record per distinct outage_reason.

Benchmark

To demonstrate the performance of the complex aggregation object, simple and complex benchmarks were performed.

 

Simple benchmark

 

 

For the simple benchmark, random queries were executed against the complex aggregation object (server_data) and timings were recorded as shown here. Ten random queries were executed five times.  All queries executed in about one second or less – the underlying data to retrieve these results across eight tables and three distinct databases is 1000s of rows.

 

Queries executed are similar to the following:

 

select    count(*)

from      server_data

where    hardware_id         = 1

and        financial_year      = 2003

and        financial_quarter = 1

 

Complex benchmark

For purposes of further benchmarking the performance of this complex aggregation object, another aggregation object (sum_server) was created to summarize every product line for a specific year (down to the quarter, but summarizing outage reasons).  In other words, it’s further aggregation of the complex object.

 

The flow of this benchmark object is as follows:

  1. Determine the year to retrieve (from the where clause).

  2. Retrieve each unique hardware_id from the hardware_crossref table (in the local FlyingPig database).

  3. Loop through four respective quarters for the input year (1 through 4).

  4. Retrieve the marketing_platform, plaform_description, summary of outages and total shipped units for input year and respective quarters from the server_summary (our complex aggregation object) table.

  5. Calculate the time to process this aggregation.

  6. Output this information.

Timing per row was output as a column in this query. Note that this query accesses most of the 12 million rows that are in the install base tables to retrieve these results.

 

The following information shows the summary of the benchmark results:

 

Minimum time per “record”

0.506 seconds

Maximum time per “record”

8.543 seconds

 

Although these aggregations can take up to 8.5 seconds to complete, as you can see, this is an aggregation of an already complex object.  If response time for this aggregated object was not acceptable, a simplified object could be created to achieve the desired performance results.

 

The following table shows a sample of the detailed query results (all records, all tables, specific year) and the detailed timing results for this complex aggregation query:

 

 

Beyond structured data

This real-world example demonstrates the collection of structured data — such as Oracle and Sybase data. Periscope manages any structured data source (any JDBC source, XML, etc.), and will also work for unstructured data — such as email (some employees have mentioned they have seven gigabytes of email), documents on laptops, etc.

 

Periscope makes all data look like any other physical Oracle table in the database.   Application don’t know the difference between a physical table and a Periscope object (called a Monocle).  Periscope also publishes these data objects as Web Services and HTTP dynamic XML objects.

 

The real power of this approach is that providing “life” beyond structured data would allow structured and non-structured data to be joined, providing a holistic view to the organization.

 

Every object that is created contains metadata (data about the data), which can also be used for other purposes.

 

Summary

This paper is proof that it is possible to perform real-world complex data aggregation using virtualized data sources.  This further proves that the theoretically-based architecture described in the prior “Integration Architecture” paper will work in a real-world environment.  This is a major breakthrough in technology and integration solutions going forward.

 

 

 

In Partnership with               

 

 

 

 

 

 

                    and

 

 

 

 


Periscope

Universal data access using


Banner

 

Best Prices on Technology

14920 Z Circle

Omaha, NE 68137

Office

01.402.681.8137

 

Send Email


home   contact    privacy policy    terms of use

© 1996-2010 Quantum Solutions, Inc.