Question and Answer
What the utilities available in nz/bin? Driver: libnzodbc.so , Utilities
: nzconvert, nzds, nzevent, nzhw, nzload, nzodbcsql, nzpassword, nzreclaim,
nzrev, nzsession, nzsql, nzstate, nzstats, nzsystem
Where is the Configuration
file available? Config file odbc.ini is available in nz/lib and it
contains the Database Source properties and Connection parameters.
What is NZSQL utility? This
utility can be run from any machine on which it is installed i.e. it does not
have be from netezza box itself. This utility can be used to run any sql
statement like select or insert statement on the netezza database. In the below
usage you see data extracted from the table into a output file and in the
second mzsql command the data is also gzipped. nzsql -host
<netezz_db_servername> -d <database> -u <username> -pw
<password> -c -c "select * from tablename" -o
/root/home/outputfilename.txt;
What is NZLOAD utility? This
utility can be used to load data from files to netezza target tables. The
nzload command usage is below. In the usage below the flat file which is
delimited by | is loaded to a table mentioned with in
<target_table_name>. Also user name and password can be provided for this
utility. nzload -host <netezzahost> -db <database> -u
<user name> -pw <password> -delim '|' -t <target_table_name>
-df /root/home/outputfilename.txt
Explain Field Programmable
Gate Arrays (FPGAs) They filter out extraneous data as early in the data stream as
possible andas fast as data streams off the disk. This process of data
elimination close to the data sourceremoves I/O bottlenecks and frees up
downstream components such as the CPU, memory,and network from processing
superfluous data, thus having a significant multiplier effect onsystem
performance.
Explain Balanced, massively
parallel architecture of Netezza? The Netezza architecture combines
the best elements of Symmetric Multiprocessing (SMP)and Massively Parallel
Processing (MPP) to create an appliance purpose-built for analyzingpetabytes of
data quickly. Every component of the architecture, including the
processor,FPGA, memory, and network, is carefully selected and optimized to
service data as fast asthe physics of the disk allows, while minimizing cost
and power consumption. More than a thousand of these customized MPP
streams work together to divide and conquer the workload.
What are the building
blocks of Netezza Appliance? Disk Enclosures (RAID protected)>> Snippet-Blades
(FPGA+Memory+CPU) : S-Blades are intelligent processing nodes that make up the
turbocharged MPP engine ofthe appliance. Each S-Blade is an independent server
containing powerful multi-core CPUs, multi-engine FPGAs, and gigabytes of RAM,
all balanced and working concurrentlyto deliver peak performance. >>
Network Fabric : The Netezza applianceruns a customized IP-based
protocol >> Host : The SMP hosts are high-performance Linux
servers set up in an active-passive configuration for high availability.
What is ZoneMap
acceleration? (The Netezza anti-index): ZoneMap acceleration exploits the
natural ordering of rows in a data warehouse to accelerate performance by
orders of magnitude. The technique avoids scanning rows with column values
outside the start and end range of a query. For example, if a table contains two
years of weekly records (~100weeks) and a query is looking for data for only
one week, ZoneMap acceleration can improve performance up to 100 times. Unlike
indexes, ZoneMaps are automatically created and updated for each database
table, without incurring any administrative overhead.
What are the partitioning
methods available in netezza? There are two partitioning methods
available in netezza: Random partitioning: Distributes the data randomly. Hash
Partitioning: Distributes the data on the specified columns.
How do you redistribute a
table? Use Create Table As (CTAS) to redistribute the data in a table.
While creating the new table specify the distribute on clause to distribute the
data on the new columns.
If you did not specify any
distribution clause, how the Create Table AS (CTAS) will distribute the rows? CTAS will
get distribution from the original table.
How do you check the rows
in a table are equally distributed in all SPU's or not? To check
the distribution of rows run the following query. SELECT datasliceid,
COUNT(*)FROM <table name> GROUP BY datasliceid
What is collocated join? When you
join tables which are distributed on the same key and used these key columns in
the join condition, then each SPU in netezza works 100% independent of the
other, as the required data is available in itself. This type of joins is
called collocated joins.
When netezza redistributes
a table and when it broadcasts a table?Whenever it is not possible to do
a collocated join, netezza either redistributes the tables or broadcasts the
table. When the table is a small one, then netezza broadcasts the table.
Otherwise netezza redistributes the table.
How do you remove logically
deleted records? Whenever you delete a row in a table, it is not physically
deleted. It is logically deleted by flagging the deletexid field in the table.
NZRECLAIM utility is used to remove the logically deleted records.
What is nzload? Nzload
utility is used load data from a file into a table. It is used to load bulk
data quickly and simultaneously rejects erroneous content.
What are the ways to load a
data from a table into a file? Create an external table. Use
nzsql utility with -o option.
What are the different ways
to load data from a file into a table? Use nzload to load the data from a
file into a table. Create an external table and then load the original table
using the external table.
How netezza updates a row
in a table? Netezza logically deletes the original row by flagging the
deletexid column with the current transaction id and inserts a new row with the
updated values.
What are the data warehouse
appliances you know?
Netezza/ Greenplum/ Vertica/ Teradata/ Dataupia/ Oracle Exadata
What are the environment
variables that are required to connect to netezza? The
environment variables required are: NZ_HOST, NZ_DATABASE, NZ_USER, NZ_PASSWORD
What are the different
states of Netezza? Online: Normal or usual state. > >Stopped: Netezza will
shut down after completing current queries, no new queries allowed. >>
Offline: Waits for completion of current queries, new queries in queue receives
error. >> Paused: Same as above, but no error displayed. Typically caused
during Netezza boot up or startup. >>Down: Just plain down, could be due
to Netezza server problem or user initiated.
What are the constraints on
a table are enforced? The only constraint Netezza supports is Not
null. There are no primary key and foreign key constraints in Netezza.
Can you insert duplicate
rows in Netezza table? Yes. As there are no primary key constraints
in netezza you can insert duplicate rows.
How the NOT NULL
specification on a column improves the netezza performance? Specifying
Not Null on each column in table results in better performance. Netezza tracks
the NULL values at row header level. Having NULL values results in storing
references to NULL values in header. If all columns are NOT NULL, then there is
no record header.
What is a snippet? A snippet
is a small unit of work that are carried out in SPU.
What are zonemaps? An extent
is the smallest unit of disk allocation on a SPU. Zonemaps is internal mapping
structures to the extents that take advantage of the internal ordering of data
to eliminate extents that do not need to be scanned. Zonemaps transparently
avoid scanning of unreferenced rows. Zonemaps are created for every column in
the table and contain the minimum and maximum values for every extent.
How the zonemaps are
created and updated? Zonemaps are created and refreshed for every SPU when you
Generate statistics, Nzload operation, Insert, update operations, Nzreclaim
operation.
What is the difference
between generate statistics and generate express statistics? Generate
statistics is used to gather statistics about each table column's proportion of
duplicate values, minimum values, maximum values, null values, unique values
and updates the system catalog tables >> The difference between 'generate
statistics' and 'generate express statistics' is based on how the column
uniqueness is calculates. The 'generate express statistics' calculates
estimated dispersion values based on the sampling of rows in the table.
'Generate express statistics' uses approximation in generating the stats where
as 'generate statistics' uses all the rows in the table.
What is the use of creating materialized views? A materialized view
reduces the width (number of columns) of data being scanned in the base table
by creating a thin version (fewer columns) of the base table that contains a
small subset of frequently queried columns.
What is the distribution of
materialized views? A materialized view has the same distribution key as the base
table.
What are the limitations of
materialized views?
You cannot insert, update, delete or truncate a materialized view. Any changes
on the base tables will reflect into materialized views >> You can
specify only one base table in the from clause >> Base table can't be a
external table, system table or a temporary table >> You cannot use a
where clause in the materialized view >> Expressions are not allowed as
columns.
What are the best practices of creating materialized views?
Create materialized views with few columns which are frequently queried
>> Specify order by clause on the most restrictive columns (columns used
in where clause) >> Periodically or manually refresh the materialized
views.
Up to how many columns you
can specify in distribute on clause? You can specify up to four columns
in the distribution clause?
If you did not specify any
distribute on clause while creating a table, what distribution netezza uses? Netezza
distributes the data on the first column and it uses Hash partitioning
Can you update the columns
used in distribution clause? No, the column that is used in
distribution clause cannot be used for updates.
What data types are most
suited for the columns specified in distribution clause? Integer
Create Netezza Table. To create
an explicit distribution key, the Netezza SQL syntax is: CREATE TABLE
<tablename> ( <col>,<col>… ) | AS
(<select_clause>) DISTRIBUTE ON HASH ( <col>,<col>,…
); The phrase distribute on specifies the distribution key, the
word hash is optional. To create a round-robin distribution key, the Netezza
SQL syntax is: CREATE TABLE <tablename> (<col int10>,<col
int20>…) DISTRIBUTE ON RANDOM;
Create External Table. CREATE EXTERNAL TABLE <table_name> (col1, col2, col3..) |
SAMEAS <table_name> USING (DELIM '<delimter>' DATAOBJECT
'<host_file_name>' LOGDIR '<log_dir_name>);
Explain FPGA and how is it useful for query performance.
FPGA: Field Programmable Gate Array (FPGA) is located on each SPU. Netezza is different from other architectures. Netezza can do a “hardware upgrade” through software by using FPGA. Hardware is reconfigured during install.
While reading data from disk, FPGA on each SPU also helps in ‘filtering’ unnecessary data before getting loaded into memory on each SPU. This way, FPGA does not overwhelm with all the data from disk.
What is a zone map?
Zone map in Netezza is similar (concept wise) to partitions in Oracle. Netezza maintains map for data so that it does relies on zone map to pull only the range it is interested in. For example, if we need to pull out data from Jan 2009 till June 2009 from a table that is distributed on date column, zone map helps us to achieve this. Zone map is maintained by Netezza automagically, no user intervention needed. Zone mapping is done at a block (extent) level. Netezza has zone maps for all columns (not just distributed column) and includes information such as minimum, maximum, total number of records.
What are different ways to load?
-nzload
-External tables
-Create table AS (aka, CTAS).
-Inserts (Eeeewee!!)
Does everything gets cached in Netezza (or any other data appliance)?
Typically only schema and other database objects are cached in appliances. Data is not cached, in general. In most cases, data is not saved any where (in any cache or on host computer) and is streamed directly from SPU to client software.
How is load achieved in Netezza and why is that quick / fast?
Loads by pass a few steps that typically a query would go through (a query goes through plan generation, optimization and transaction management). Loads are done in terms of “sets” and this set is based on underlying table structure (thus loads for two different tables are different as their sets are based on table structures). Data is processed to check format and distribution of records calculated very quickly (in one step), fills into ‘set’ structure and writes to storage structure. Storage also performs space availability and other admin tasks, all these operations go pretty quick (think of them as UNIX named pipes that streams data and SPU stores these records).
What databases are we able to configure pushdown optimization?
IBM DB2
Microsoft SQL Server
Netezza
Oracle
Sybase ASE
Teradata
Databases that use ODBC drivers
When are we likely to receive incorrect (aggregate) results?
Very rarely a driver may return aggregated results that are still getting processed back to client. In this case, client may assume that calculation is complete, instead of updating with latest or final results. Obviously, driver has to wait for Netezza to complete operation on host computer, before delivering results.
Desired to gain proficiency on Netezza? Explore the blog post on Netezza Training
to become a pro in Netezza.
What is real-time processing?
Data sources such as JMS, WebSphere MQ, TIBCO, webMethods, MSMQ, SQP, and webservices can publish data in real-time. These real-time sources can be leveraged by Informatica Power Centre to process data on-demand. A session can be specifically configured for real-time processing.
Explain how data gets stored in Netezza and how does SPU failover take place?
Data is stored based on a selected field(s) which are used for distibution.
==Data (A)==> Hash Function (B) ==> Logical SPU identifier list (C) ==> Physical SPU list (D) ==> Storage (E)
When data arrives, it is hased based on field(s) and a hash function (B) is used for this purpose. For example, for a hypothetical 32 node system system, logical SPU identifier list has 32 unique entries. If there are 1000 hashed data items from (B), there are 1000 entries in (C), all having only 32 SPU entries (a number of data items go to the same SPU, thus multiple (B) entries map to same (C)). For instance, (C) has values [3,19,30,7,20,25,11,3,22,19….]. This way, 1000 data entries are mapped. (D) has physical IP address of both primary and failover SPU. If there is a failover, this is the only place where Netezza need to update its entries. Same goes for a system that has a new SPU added. It is a little complicated, in principle, this is the concept.
What is a real-time processing terminating condition?
A real-time processing terminating condition determines when the Integration Service stops reading messages from a real-time source and ends the session.
What are 4 environment variables that are required. What are different states on Netezza?
Environment variables: NZ_HOST, NZ_DATABASE, NZ_USER and NZ_PASSWORD
-Online: Normal or usual state.
-Stopped: Netezza will shutdown after completing current queries, no new queries allowed.
-Offline: Waits for completion of current queries, new or queries in queue receive error.
-Paused: Same as above, but no error displayed. Typically caused during Netezza boot or startup.
-Down: Just plain down, could be due to Netezza server problem or user initiated.
What is a pipeline partition and how does provide a session with higher performance?
Within a mapping, a session can break apart different source qualifier to target pipelines into their own reader/transformation/writer thread(s). This allows the Integration Service to run the partition in parallel with other pipeline partitions in the same mapping. The parallelism creates a higher performing session.
What is the maximum number of partitions that can be defined for in a single pipeline?
You can define up to 64 partitions at any partition point in a pipeline.
What is a dynamic session partition?
A dynamic session partition is where the Integration Service scales the number of session partitions at runtime. The number of partitions is based on a number of factors including number of nodes in a grid or source database partitions.
List three dynamic partitioning configurations that cause a session to run with one partition?
– Set dynamic partitioning to the number of nodes in the grid, and the session does not run on a grid.
– Create a user-defined SQL statement or a user-defined source filter.
-Use dynamic partitioning with an Application Source Qualifier.
List the different types of pushdown optimization that can be configured?
Source-side pushdown optimization – The Integration Service pushes as much transformation logic as possible to the source database.
Target-side pushdown optimization – The Integration Service pushes as much transformation logic as possible to the target database.
Full pushdown optimization – The Integration Service attempts to push all transformation logic to the target database. If the Integration Service cannot push all transformation logic to the database, it performs both source-side and target-side pushdown optimization
Does Netezza support concurrent update of the same record?
In case of conflict in which the same record is set for modification, Netezza rolls back recent transaction that is attempted on the same record, in fact same table. This is generally acceptable in DW environments. Netezza does support serialization transactions and does not permit dirty reads.
What happens to records that are loaded during nzload process, but were not committed.
They are logically deleted and administrator can run nzreclaim, we may also truncate table.
What is a control task?
A control task is used to alter the normal processing of a workflow by stopping, aborting, or failing a workflow or worklet.
Can a group become a member of another group in Netezza user administration. Can we use same group name for databases?
In Netezza, public group is created automatically and every one is a member of this group by default. We can create as many groups and any user can be member of any group(s). Group can not be a member of another group. Group names, user names and database names are unique. That is, we can not have a database called sales and a group also called sales.
How can we give a global permission to user joe so that he can create table in any database?
Login into system database and give that permission to user by saying “grant create table to joe;”
What permission will you give to connect to a database?
List. Grand list, select on table to public (if logged into sales database, this allows all users to query tables in sales database).
Do we need to drop all tables and objects in the database, before dropping a database?
No, drop database will take care of it.
What constraints on a table are enforced?
Not null and default. Netezza does not apply PK and FK.
Why NOT NULL specification is better in Netezza?
Specifying not null results in better performance as NULL values are tracked at rowheader level. Having NULL values results in storing references to NULL values in header. If all columns are NOT NULL, then there is no record header.
Create Table AS (CTAS), does it distribute data randomly or based on table on which it received data.
Response: Newly created table from CTAS gets distribution from the original table.
Interested in mastering Netezza Certification? Learn more about Netezza Tutorials
in this blog post.
Why do you prefer truncate instead of drop table command?
Just empties data from table, keeping table structure and permission intact.
When no distribution clause is used while creating a table, what distribution is used by Netezza.
First column (same as in Teradata).
Can we update all columns in a Netezza table?
No, the column that is used in distribution clause cannot be used for updates. Remember, up to four columns can be used for distribution of data on SPU. From practical sense, updating distribution columns result in redistribution of data; the single most performance hit when large table is involved. This restriction makes sense.
What is data slice and SPU?
For me, they are the same! Of course, this answer is not accurate reply in your interview(s).
What data type works best for zone maps?
Zone maps work best for integer data types.
What feature in Netezza you do not like?
Of course, a large list, especially when compared to Oracle. PK and FK enforcement is a big drawback though this is typically enforced at ETL or ELT process [ELT: Extract, Transform and Load. Note that ‘Transform’ and ‘Load’ can happen within Netezza].
List three real-time processing terminating conditions?
Idle time – Time Integration Service waits to receive messages before it stops reading from the source.
Message count – Number of messages the Integration Service reads from a real-time source before it stops reading from the source.
Reader time limit – Amount of time in seconds that the Integration Service reads source messages from the real-time source before it stops reading from the source.
What is real-time processing message recovery?
Real-time processing message recovery allows the Integration Service to recover unprocessed messages from a failed session. Recovery files, tables, queues, or topics are used to recover the source messages or IDs. Recovery mode can be used to recover these unprocessed messaged.
Netezza/ Greenplum/ Vertica/ Teradata/ Dataupia/ Oracle Exadata
What is the use of creating materialized views? A materialized view reduces the width (number of columns) of data being scanned in the base table by creating a thin version (fewer columns) of the base table that contains a small subset of frequently queried columns.
You cannot insert, update, delete or truncate a materialized view. Any changes on the base tables will reflect into materialized views >> You can specify only one base table in the from clause >> Base table can't be a external table, system table or a temporary table >> You cannot use a where clause in the materialized view >> Expressions are not allowed as columns.
What are the best practices of creating materialized views?
Create materialized views with few columns which are frequently queried >> Specify order by clause on the most restrictive columns (columns used in where clause) >> Periodically or manually refresh the materialized views.