26 Jul 2005

WebSphere Information Integrator V8.2: New Features Overview

By Neal Finkelstein, Cuneyt Goksu, Phil Nelson, and Fred Sobotka


Introduction

The product formerly known as DB2 Information Integrator has a new name. Early this year, IBM announced that it was re-branding the product as WebSphere Information Integrator, to take advantage of WebSphere's strong association with all kinds of business integration. But by whatever name, WebSphere Information Integrator (WII) offers capabilities that make it much easier to build applications against distributed, heterogeneous information of all sorts. In this article, we'll cover many of the new features introduced in the different components that make up WebSphere Information Integrator V8.2: federation, event publishing, Q replication, and OmniFind.

SQL Federation
Version 8.2 of WII introduces a number of enhancements that add new data sources and improve maintainability, scalability, and performance. Several non-relational data sources have been added to extend the reach of WII. Major improvements were made in federated query performance: WII now exploits both inter-partition parallelism in DPF-enabled databases and intra-partition parallelism on SMP servers. In addition, the DB2 optimizer can now reroute queries to locally cached nickname data. Another significant area of improvement is administration. Discovering data sources, gathering metadata, and setting up WII components are now automated to a greater extent than in previous versions, and monitoring, problem determination, and performance tuning have also been improved.

New Data Sources
WebSphere Business Integration
WII has added support for several widely deployed business applications. Using the WebSphere Business Integration (WBI) wrapper, federated queries can now access PeopleSoft, Siebel, and SAP data. These WBI data sources can be joined both to relational data, such as DB2, Oracle, or SQL Server, and to non-relational data, including XML files, flat files, or Web services.
The WebSphere Business Integration wrapper creates a business object based on an SQL query and uses WebSphere MQ message queues to communicate with an application-specific adapter. The adapter communicates with the application to obtain the information and sends it back to the wrapper. Finally, the wrapper populates the columns of the nickname or nicknames created for that business object.
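To the application, a WBI nickname behaves like any other federated object, so such a join is expressed in ordinary SQL. The following sketch assumes hypothetical objects: a nickname FEDADMIN.SIEBEL_CUSTOMER defined over a Siebel business object and a local DB2 table SALES.ORDERS.

-- Join a hypothetical Siebel business-object nickname to a local DB2 table;
-- all object and column names here are illustrative, not product-defined.
SELECT c.cust_name, c.cust_status, o.order_id, o.order_total
FROM fedadmin.siebel_customer c     -- nickname populated through the WBI wrapper
JOIN sales.orders o                 -- local DB2 table
  ON c.cust_id = o.cust_id
WHERE o.order_date >= CURRENT DATE - 30 DAYS;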
Life Sciences Data
In V8.2, WII added Kyoto Encyclopedia of Genes and Genomes (KEGG) data to its already long list of life sciences data sources.
Web Services Data
Web services providers are now accessible through SQL, using either parameters or XML documents for input and output. The XML data types are mapped to the columns of the nicknames. The Web Services wrapper can use the HTTPS transport protocol for SOAP messages, providing greater security for data transmission. The URL and SOAPACTION nickname options enable dynamic addressing, so the same nickname can be used for multiple service endpoints; the endpoint is specified at query execution time.
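In practice, the input parameters of a service are exposed as nickname columns that a query supplies through equality predicates, while output elements appear as ordinary result columns. A minimal sketch, assuming a hypothetical nickname FEDADMIN.WS_QUOTE defined over a stock-quote Web service:

-- TICKER maps to a SOAP input parameter and PRICE to an output element of the
-- hypothetical service; both names are illustrative.
SELECT price
FROM fedadmin.ws_quote
WHERE ticker = 'IBM';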
Java SDK for Wrapper Development
Custom wrappers provide additional flexibility for WII to access data sources not currently included. In V8.2, WII provides a software development kit for creating custom wrappers in Java. The SDK includes both the shared libraries and the header files required to build the wrappers. Previously, C++ was the only language that could be used to develop wrappers.

Performance Improvements
Fenced Wrappers
In previous versions, the WII wrapper processes ran in the same address space as the DB2 engine (trusted wrappers). In V8.2, wrappers can be created with an option to run in a separate address space, fenced off from the federation engine. Fenced wrappers can maintain persistent connections to the data source, their processes can be shared by multiple users, and those processes are separated from the query engine processes. The separation results in improved stability and better problem determination. Although fenced wrappers use more memory and CPU than trusted wrappers, they use those resources more efficiently and hence offer greater scalability. The fenced option is especially valuable for custom wrappers, and it is required to take advantage of parallel processing in a partitioned database environment.
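Choosing fenced execution is a wrapper-creation option. A minimal sketch for the Oracle (NET8) wrapper is shown below; the DB2_FENCED option is the relevant switch, and the default wrapper library is assumed.

-- Register the NET8 wrapper so its processes run fenced, outside the
-- DB2 engine address space (DB2_FENCED 'N', i.e. trusted, is the default).
CREATE WRAPPER net8 OPTIONS (DB2_FENCED 'Y');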
Parallelism for federated queries
WII has been improved to use the parallel processing capabilities of the federation engine. In a partitioned database environment, WII can exploit inter-partition parallelism by distributing nickname data to all partitions and performing operations, such as joins, in parallel. Inter-partition parallel processing can occur for joins of local and remote data or joins of remote data only. For local and remote data joins, WII will distribute the remote data across the partitions if the amount of remote data is less than the amount of local data. If the optimizer chooses not to distribute the nickname data, the join is done at the coordinator partition.

When a query or part of a query involves remote data only, WII can use a Computational Partition Group to distribute the remote data for parallel joins. An existing node group can be used as a Computational Partition Group by setting the registry variable DB2_COMPPARTITIONGROUP to an existing node group.
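Setting the registry variable is a one-time, instance-level task. A sketch, assuming a hypothetical existing partition group named PG_ALLNODES:

-- Point the optimizer at PG_ALLNODES for redistributing nickname-only data,
-- then recycle the instance so the registry change takes effect.
db2set DB2_COMPPARTITIONGROUP=PG_ALLNODES
db2stop
db2start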

In SMP environments, WII can exploit intra-partition parallelism by executing local portions of the query in parallel with the nickname access. SMP parallelism is enabled for both fenced and trusted wrappers.
Materialized Query Tables
Local caching of remote data can improve query performance substantially. WII can automatically route queries to a cache table, known as a Materialized Query Table (MQT), if the MQT can satisfy the query. An MQT can be used to cache the results of joins and aggregations over multiple data sources, both remote and local. In addition, MQTs can be indexed and, in a partitioned database, MQTs can be distributed or replicated to all partitions. In V8.2, MQTs can include both relational and non-relational data sources. The DB2 Control Center for V8.2 provides a wizard to assist in setting up an MQT and creating a refresh process. Updates made through the federated server are applied directly to the remote table; the results appear in the MQT according to the refresh policy previously established. For volatile nickname data, replication can be used to refresh the cache, while an on-demand or scheduled full refresh of the MQT can be used for less volatile data. DB2 Cube Views or the DB2 Design Advisor can be used with nicknames to recommend MQTs that aggregate remote data and store the results locally for better query performance.
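The sketch below caches an aggregation over a hypothetical nickname FEDADMIN.ORA_SALES; the object names are illustrative, and the REFRESH DEFERRED form shown is the one commonly used for MQTs over nicknames.

-- Define and populate the cache table.
CREATE TABLE fedadmin.sales_mqt AS
  (SELECT region, SUM(amount) AS total_amount
   FROM fedadmin.ora_sales
   GROUP BY region)
  DATA INITIALLY DEFERRED REFRESH DEFERRED ENABLE QUERY OPTIMIZATION;

REFRESH TABLE fedadmin.sales_mqt;   -- on-demand full refresh of the cache

-- Allow the optimizer to consider deferred MQTs, then query the nickname;
-- the query can now be satisfied locally from SALES_MQT.
SET CURRENT REFRESH AGE ANY;
SELECT region, SUM(amount) FROM fedadmin.ora_sales GROUP BY region;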
Improved Informational Constraints
Informational constraints have been enhanced for nicknames to provide the DB2 optimizer with information about the federated data source that can improve the access path chosen. WII will not enforce or check informational constraints, but performance can be improved. Relational nicknames can be altered to add the constraint, and constraints can be defined when non-relational nicknames are created. For example, if the check constraint C1 > 10000 is specified for column C1 of nickname N1, WII will not access N1 when queries refer only to values of C1 less than or equal to 10000. Informational constraints can be especially helpful when nicknames are created on views at the remote source, because WII will not discover any primary key information that exists on the table(s) referred to by the view. The optimizer will also use informational constraints when considering routing a query to an MQT.
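Continuing the example above, the constraint on nickname N1 might be added as follows; the clause ordering follows the standard DB2 informational-constraint syntax.

-- The constraint is never checked by WII, but the optimizer can use it to
-- avoid accessing N1 for values of C1 at or below 10000.
ALTER NICKNAME n1
  ADD CONSTRAINT c1_range CHECK (c1 > 10000)
  NOT ENFORCED ENABLE QUERY OPTIMIZATION;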
Union All Views
Support for Union All views on remote data from multiple sources has been improved. By including a “data source” field in the view definition and using it in the predicate when joining Union All views, WII is able to eliminate the join combinations that do not contain data that satisfy the query.
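A sketch of the technique, using three hypothetical regional order nicknames tagged with a literal source column:

-- The SRC column identifies the data source of each branch.
CREATE VIEW fedadmin.all_orders (src, order_id, amount) AS
  SELECT 'EMEA', order_id, amount FROM fedadmin.orders_emea UNION ALL
  SELECT 'AMER', order_id, amount FROM fedadmin.orders_amer UNION ALL
  SELECT 'APAC', order_id, amount FROM fedadmin.orders_apac;

-- A predicate on SRC lets WII prune the branches that cannot contribute rows,
-- so only the EMEA source is contacted for this query.
SELECT order_id, amount
FROM fedadmin.all_orders
WHERE src = 'EMEA' AND amount > 1000;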
Administration Improvements
Server Discovery and Configuration Files
WII has added functionality to help with discovering remote data sources. A configuration wizard for federated objects is available through the DB2 Control Center. The wizard can read the client configuration files for various data sources, such as the “tnsnames.ora” file for Oracle, the “odbc.ini” file for SQL Server, and the “interfaces” file for Sybase. Using these files, the WII configuration wizard can discover the remote servers that can be accessed from the federated server. Once access to the remote server is configured, WII can read the system catalog tables at the remote database to provide a list of tables that can be federated.

For servers not registered in these configuration files, information entered in the configuration wizard can be used to populate the configuration file and complete the task of configuring access for the nickname.
Configuration Scripts
The DB2 Control Center has always had the capability of viewing the DDL generated during the process of configuring a remote data source. In V8.2, the DDL executed on the federated server can be reviewed in an “Action Output” window and saved into a script file that can be reused to configure other federated servers. This will speed up the configuration process and reduce the potential for errors.
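A saved script typically contains the wrapper, server, user mapping, and nickname DDL in the order it was executed. The sketch below is for a hypothetical Oracle source; every name and option value is illustrative.

CREATE WRAPPER net8;
CREATE SERVER ora_fin TYPE oracle VERSION '9' WRAPPER net8
  OPTIONS (NODE 'FINPROD');                         -- entry from tnsnames.ora
CREATE USER MAPPING FOR db2admin SERVER ora_fin
  OPTIONS (REMOTE_AUTHID 'finuser', REMOTE_PASSWORD 'xxxxxxxx');
CREATE NICKNAME fedadmin.gl_accounts
  FOR ora_fin."FINUSER"."GL_ACCOUNTS";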
Updating Statistics
When a relational nickname is first created, WII copies statistics for the data source and any indexes from the remote server’s system catalog tables. These statistics should be refreshed periodically to give the optimizer the information it needs to choose the best access plan. In prior versions, WII could update the statistics only if the nickname was dropped and recreated. V8.2 provides a stored procedure, SYSPROC.NNSTAT, that collects statistics and updates them in the federated server’s system catalog tables without dropping the nickname. Nickname statistics can be updated for all nicknames defined for the remote server or for a specific table. The stored procedure can be invoked from the command line or through the Control Center, and the update task can be scheduled through the Control Center after the DB2 Tools Catalog is created. It is important to note that NNSTAT only copies statistics from the remote server; it remains the responsibility of the DBA for the remote data source to make sure that current statistics are maintained there.
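A command-line invocation might look like the following; the parameter list shown (server, schema, nickname, log file, output) is an assumption and should be verified against the SYSPROC.NNSTAT documentation for your fix pack level.

-- Refresh statistics for every nickname defined on the hypothetical server
-- ORA_FIN, writing details to a log file; NULL schema/nickname means "all".
CALL SYSPROC.NNSTAT('ORA_FIN', NULL, NULL, '/home/db2inst1/nnstat.log', ?);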
Monitoring Federated Systems
The status of remote servers and nicknames can be monitored by using the DB2 Health Center, which is accessed from the DB2 Control Center, or by issuing the GET HEALTH SNAPSHOT command from the command line or the command window. The Health Center can be configured to send alerts when federated server or nickname problems are detected.

Performance of federated queries can be analyzed by reviewing the output of dynamic SQL snapshots, which provide statistics about each query fragment sent to a remote server. Explain results can be examined to determine the access plan for nickname data, and a review of the access plan identifies which portion of the query is being pushed down to the remote server: a SHIP operator in the explain output indicates that the operations listed beneath it have been pushed down. Features such as informational constraints and the NNSTAT stored procedure mentioned above can then be employed to improve the access plan and the performance of the query fragments identified as bottlenecks.
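The commands involved are ordinary DB2 CLP commands; the database name FEDDB below is illustrative.

-- Health indicators for the federated database, including federation alerts.
GET HEALTH SNAPSHOT FOR DATABASE ON feddb

-- Enable statement timing, then capture per-fragment statistics for
-- federated queries from the dynamic SQL snapshot.
UPDATE MONITOR SWITCHES USING STATEMENT ON
GET SNAPSHOT FOR DYNAMIC SQL ON feddb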

The changes to WII Standard and Advanced Editions from V8.1 to V8.2 extend the already long list of data sources that can be federated. As important as that might be, the changes to performance and administration features are even more impressive. As experience has shown, the option of moving the data to the query usually wins over the option of sending the query to the data when performance is an issue. In a 24X7X52 environment, simplifying the administration of WII by speeding up the setup of federated access, creating a means of packaging the configuration scripts, and providing improved health monitoring to avoid outages are essential to successful strategic adoption of WII for enterprise information integration projects.

Event Publishing
Event publishing is a flexible way to inform non-relational targets of changes to specific relational tables, using XML messages and asynchronous message queues. WebSphere II Event Publisher Edition includes WebSphere MQ and publishes events from data sources in DB2 UDB for LUW. Other data sources require different editions of the WebSphere II event publishing software.
WebSphere II Event Publisher Edition
WebSphere II Event Publisher Edition captures database changes in DB2 UDB for LUW from the recovery log, translates them into XML, and sends them to WebSphere MQ. Any application or service that directly integrates with WebSphere MQ or supports the Java™ Message Service API (JMS) can asynchronously receive the data changes as they occur. For example, using WebSphere II Event Publisher Edition, WebSphere Business Integration can receive changes from a DB2 UDB database and automatically update an SAP application.

WebSphere II Classic Event Publisher
WebSphere II Classic Event Publisher captures changes made to mainframe databases or files and publishes these changes to WebSphere MQ. The changed data, or “events”, may be used for driving data replication, populating warehouse databases, or integrating with enterprise applications such as SAP. Typical implementations of this style of event publishing involve capturing changes made to legacy data and pushing them to a CRM application on another platform, pushing data to an ETL tool for updating a data warehouse, or using changes to inventory values to drive restocking and reordering processes. Integration accomplished through event publishing uses a loosely coupled approach that minimizes the impact of application changes.

There are three editions for Classic Event Publisher:

• WebSphere II Classic Event Publisher for CA-IDMS captures changes by reading the active and recovery log.

• WebSphere II Classic Event Publisher for IMS actively captures changes made to IMS databases using an IMS logger exit. For recovery purposes, changes can also be captured by reading IMS log files. Captured data is held until a commit triggers the reformatting and publication of all of the IMS data associated with that unit-of-work; a rollback causes the captured data to be discarded. Captured changes are first reformatted into a relational data format, based on a metadata mapping between the IMS segment layout(s) and WebSphere II Classic Event Publisher logical table definitions, and then packaged as a self-describing XML message.

• WebSphere II Classic Event Publisher for VSAM captures changes made to VSAM files through CICS from MVS logs. Captured data is held until a commit triggers the reformatting and publication of all of the VSAM data associated with that unit-of-work. Rollback causes captured data to be discarded. Captured changes are first reformatted into a relational data format based on a metadata mapping between the VSAM record layout and WebSphere II Classic Event Publisher logical table definitions and then packaged as a self-describing XML message.

WebSphere II Event Publisher for DB2 UDB for z/OS
WebSphere II Event Publisher for DB2 UDB for z/OS captures changes made in DB2 UDB for z/OS and publishes these changes to WebSphere MQ. Changed data items are captured using DB2 UDB recovery log mechanisms.

Event publishing can be used to provide information to central information brokers and Web applications or to trigger actions or processes that are based on updates, inserts, or deletions to source data.

Q Replication
New in the V8.2 Replication Edition of Information Integrator is a technology called Q Replication. This technology provides high-throughput, low-latency, reliable replication and uses WebSphere MQ as its transport mechanism.

Replication can be carried out within a single server or between multiple geographically dispersed servers.

Replication can be operated in three modes –

• Unidirectional : changes made to a master node are used to maintain slave copies; however, no changes to the slaves are replicated back to the master. This mode is most suitable for providing read-only copies of the master database, perhaps for ad hoc query workloads.
• Bidirectional : one node is defined as the master and is therefore given priority when carrying out conflict resolution. This mode is most suitable for maintaining a backup or standby system where only occasional updates take place on the servers that are not the master.
• Peer-to-peer : each node is afforded the same priority, and conflict resolution is based on the timestamp of the latest update. This mode is most suitable for running a production system in multiple locations, with each location having a local active database.

The latter two modes provide an alternative to HADR for a geographically remote, highly available environment. Q Replication currently has the advantage over HADR that all nodes can be “active” (i.e., processing business transactions). It is also capable of limiting replication to a subset of the tables in a database, whereas HADR operates at the database level only. However, complex conflict resolution at high transaction volumes may cause difficulty, and in that type of environment HADR would be a better choice. In addition, HADR is much simpler to configure.

In unidirectional mode filters can be applied to the data being extracted from the master to limit the amount of data (rows and columns) being extracted and to manipulate the data. This is not available for bidirectional and peer-to-peer modes.

Q replication consists of a number of components –

• The WebSphere MQ configuration : a number of queues and channels must be defined within WebSphere MQ to support Q Replication. IBM supplies sample setup scripts, but there are no tools to assist with this configuration. Many customers have made IBM aware of this gap, and IBM is seeking to address it in a future release.
• Replication Queue Maps : this component defines which MQ queues and channels are to be used for sending and receiving data between the source and target systems.
• Q Subscriptions : these are the definitions of the sources and targets for replication. Source data can either be replicated (moved to target tables) or published (made available on an MQ queue for third-party applications to access). Note that the term “subscription” covers both replicated and published data sources; despite the term, it has nothing to do with “publish and subscribe” architectures.
• Q Capture : uses the definitions in the source system Q Subscriptions and the Replication Queue Maps to extract changes from the database recovery logs and place them onto MQ message queues for transmission to target systems.
• Q Apply : uses the definitions in the target system Q Subscriptions and the Replication Queue Maps to retrieve information from MQ message queues and apply it to database tables on the target systems.

Q Replication is available on both DB2 for z/OS and DB2 for LUW platforms. In later releases of Information Integrator, support is also being extended to non-DB2 sources and targets: the initial additions to the framework are Oracle and IMS, but it is expected that over time a large number of data sources and targets will be supported.

WebSphere II OmniFind Edition
WebSphere Information Integrator OmniFind Edition is IBM's flagship product for searching structured and unstructured data across the enterprise. OmniFind is implemented as a WebSphere application that uses DB2 UDB to store metadata and searchable indexes of all data sources. Unlike database extenders, which require all content to be stored in a relational database, OmniFind scans remote data sources with crawler programs and stores the tokenized results in its own internal database. OmniFind employs specialized algorithms to parse and analyze its index so client programs can issue sophisticated linguistic searches and quickly receive a relevant set of matching results.

OmniFind comes with crawlers and document filters to handle a wide variety of information sources:

• Over 200 electronic document formats
• Web pages
• XML files
• NNTP/Usenet archives
• Lotus Notes
• MS Exchange
• DB2, Informix, Oracle, and MS SQL Server databases
• Documentum and FileNet (requires WebSphere II Content Edition, formerly Venetica VeniceBridge)

Enterprise Search
Organizations use OmniFind to build a comprehensive knowledge base that allows enterprise searching, which differs from Internet searching in the following ways:
• Enterprise search involves documents that are less likely to reference each other through linking, creating a need for new weighting algorithms to determine relevance to the user's search criteria.
• Documents accessed by enterprise search will require different levels of security that must be enforced in order to prevent unauthorized users from seeing confidential information.
Embedded with DB2 and WebSphere
WII OmniFind Edition is bundled with customized versions of WebSphere Application Server and DB2 Universal Database, both of which are necessary for OmniFind to run and are specifically limited for OmniFind use only.

OmniFind Components and APIs
Administrators define collections, which group information sources together to be accessible within a single search. Crawlers are programs that scan information sources within a collection at regular intervals. Crawlers are capable of skipping over documents and files that have not changed since they were last indexed.

OmniFind's Data Listener API contains methods that enable search administrators to add, update, or remove documents from the repository. The Data Listener API also allows custom crawlers to be developed to handle new types of information sources.

Java programmers add enterprise search features to their applications by using objects and methods in OmniFind's Search and Index API (SIAPI).

Improved web crawling and parsing in FixPak 1
The web crawler can use authentication information specified by the administrator in order to scan web pages that are protected by HTTP basic authentication, and can fully navigate web sites that use frames. The administrator can also instruct the web crawler to disregard the no-follow and no-link directives for specific web pages. The parser can be configured to treat specific META tags on web pages as fields that users can specify in their searches.

Improved Microsoft compatibility
FixPak 1 for WII OmniFind Edition V8.2 provides support for 32-bit versions of Windows Server 2003 and can crawl through MS SQL Server databases.

Crawling multiple Lotus Notes databases
It is no longer necessary to define a separate crawler for each Lotus Notes database.

OmniFind Redbook
Additional detail on WII OmniFind Edition is available in the recently published IBM Redbook WebSphere Information Integrator OmniFind Edition: Fast Track Implementation (ISBN 0738493562).

Summary
The various product editions that comprise WebSphere II V8.2 offer a Swiss army knife of tools to help DBAs meet the integration challenges presented by an ever-growing amount of heterogeneous and unstructured information. These tools can be applied individually or together to bridge the gaps between information silos across the enterprise, often by reading the information directly from its source.