Friday, September 12, 2008

Setting Up TPC-H: Part 2

Now that we have the tools to create the data, lets create a place in the database to put it and then insert the data.

Step 1: Create TPCH Tablespace and user/schema
Create both a dedicated tablespace and schema to contain/access this data - here it is simply called TPCH. As the SYS user

SQL> CREATE SMALLFILE TABLESPACE "TPCH" DATAFILE '/u01/app/oracle/oradata/BRS01/tpch.dbf' SIZE 1000M AUTOEXTEND ON NEXT 10M MAXSIZE UNLIMITED LOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;

SQL> CREATE USER "TPCH" PROFILE "DEFAULT" IDENTIFIED BY "password" DEFAULT TABLESPACE "TPCH" TEMPORARY TABLESPACE "TEMP" QUOTA UNLIMITED ON "TPCH" ACCOUNT UNLOCK;
SQL> GRANT "CONNECT" TO "TPCH";
SQL> GRANT CREATE TABLE TO "TPCH";
SQL> GRANT CREATE VIEW TO "TPCH";

Step 2: Create the tables
TPCH comes with two files (dss.ddl and dss.ri) that contain the DDL and referential integrity setup. However, since we will use "direct path"option of sqlloader to put the data into the database, it doesn't make sense to have any primary or foreign keys in place when loading. Run this script (as the tpch user) to create all the required tables.

Step 3: Generate and load data into database
Jeff Moss has put together a "wrapper" script that uses dbgen to create and store the data in flat files and then calls Oracle's sqlldr to put the data into the database - see details here. Here is what needs to be done
  • Download the control files (*.ctl) and the two scripts and put them in same directory as tpch i.e. where dbgen and qgen are located. Due the wiki tool used on Jeff's page the naming is a bit mangled - just rename using lower case and use .ctl or .sh extension.
  • Run the scripts, following Jeff's examples almost verbatim. Obviously, use a connection string appropriate for your own database and pay attention to the last parameter - it is the total number of (parent + child) processes created or parallel streams used to create and load the data. Too high a number here can very easily bring a system to it's knees - my rule of thumb is to make this equal to the number of CPU cores. The first parameter is the TPCH Scale Factor: 1 ~ 1 GB database, 10 ~ 10 GB database etc.
Step 4: Create primary keys, foreign keys and indexes
These constraints are specified in the dss.ri file of TPCH. Unfortunately, some syntactic idiosyncrasies and outdated schema names mean that this is not simply plug 'n play on Oracle. To make life easier I created a Oracle-compatible script that will setup all the primary and foreign keys - the script is here.

No comments: