AGGREGATOR TRANSFORMATION



LET’S PLAY WITH TRANSFORMATIONS
As mentioned earlier, transformations are the business logic used to convert inconsistent data into a consistent format.

1) Aggregator Transformation
It is used to perform aggregations such as Sum, Average, Count, and Count Distinct (a count of the distinct, non-null values in a column).
This transformation does not generate an error output.
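Before walking through the editor options, here is a rough T-SQL equivalent of those operations, assuming a hypothetical Sales table with an Amount column (both names are illustrative only, not part of the transformation):

-- Whole-dataset aggregation, comparable to an Aggregate
-- transformation with no Group by columns.
SELECT
    SUM(Amount)            AS TotalAmount,      -- Sum
    AVG(Amount)            AS AverageAmount,    -- Average
    COUNT(*)               AS TotalRows,        -- Count
    COUNT(DISTINCT Amount) AS DistinctAmounts   -- Count Distinct (non-null only)
FROM Sales;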
Options
1)      Advanced / Basic
Display or hide options to configure multiple aggregations for multiple outputs. By default, the advanced options are hidden.
2)      Aggregation Name
In the Advanced display, type a friendly name for the aggregation.
3)      Group By Columns
In the Advanced display, select columns for grouping by using the Available Input Columns list as described below.
4)      Key Scale
In the Advanced display, optionally specify the approximate number of keys that the aggregation can write. By default, the value of this option is Unspecified. If both the Key Scale and Keys properties are set, the value of Keys takes precedence.
Unspecified - The Key Scale property is not used.
Low - Aggregation can write approximately 500,000 keys.
Medium - Aggregation can write approximately 5,000,000 keys.
High - Aggregation can write more than 25,000,000 keys.
5)      Keys
In the Advanced display, optionally specify the exact number of keys that the aggregation can write. If both Key Scale and Keys are specified, Keys takes precedence.
6)      Available Input Columns
Select from the list of available input columns by using the check boxes in this table.
7)      Input Column
Select from the list of available input columns.
8)      Output Alias
Type an alias for each column. The default is the name of the input column; however, you can choose any unique, descriptive name.
9)      Operation
Choose from the list of available operations, using the following table as a guide (a SQL sketch of grouped aggregation follows this list).
GroupBy - Divides datasets into groups. Columns with any data type can be used for grouping.
Sum - Sums the values in a column. Only columns with numeric data types can be summed.
Average - Returns the average of the values in a column. Only columns with numeric data types can be averaged.
Count - Returns the number of items in a group.
Count Distinct - Returns the number of unique, non-null values in a group.
Minimum - Returns the minimum value in a group. Restricted to numeric data types.
Maximum - Returns the maximum value in a group. Restricted to numeric data types.
10)  Comparison Flags
If you choose Group By, use the check boxes to control how the transformation performs the comparison. 
11)  Count Distinct Scale
Optionally specify the approximate number of distinct values that the aggregation can write. By default, the value of this option is Unspecified. If both CountDistinctScale and CountDistinctKeys are specified, CountDistinctKeys takes precedence.
Unspecified - The CountDistinctScale property is not used.
Low - Aggregation can write approximately 500,000 distinct values.
Medium - Aggregation can write approximately 5,000,000 distinct values.
High - Aggregation can write more than 25,000,000 distinct values.
12)  Count Distinct Keys
Optionally specify the exact number of distinct values that the aggregation can write. If both CountDistinctScale and CountDistinctKeys are specified, CountDistinctKeys takes precedence.
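As promised above, here is a minimal sketch of how the GroupBy operation combines with the other operations, again using the hypothetical Sales table, this time with a Region column added for grouping:

-- One output row per Region, comparable to marking Region as
-- GroupBy and aggregating Amount with several operations at once.
SELECT
    Region,                       -- GroupBy
    MIN(Amount) AS MinAmount,     -- Minimum
    MAX(Amount) AS MaxAmount,     -- Maximum
    SUM(Amount) AS TotalAmount,   -- Sum
    COUNT(*)    AS OrderCount     -- Count
FROM Sales
GROUP BY Region;

Each aggregated column here corresponds to one row in the transformation's column grid, with Output Alias playing the role of the SQL column alias.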

The Aggregate transformation applies aggregate functions, such as Average, to column values and copies the results to the transformation output. Besides aggregate functions, the transformation provides the GROUP BY clause, which you can use to specify groups to aggregate across.
You configure the Aggregate transformation at the transformation, output, and column levels.
At the transformation level, you configure the Aggregate transformation for performance by specifying the following values:
·         The number of groups that are expected to result from a Group by operation.
·         The number of distinct values that are expected to result from a Count distinct operation.
·         The percentage by which memory can be extended during the aggregation.
The Aggregate transformation can also be configured to generate a warning instead of failing when the value of a divisor is zero.

At the output level, you configure the Aggregate transformation for performance by specifying the number of groups that are expected to result from a Group by operation. The Aggregate transformation supports multiple outputs, and each can be configured differently.

At the column level, you specify the following values:
·         The aggregation that the column performs.
·         The comparison options of the aggregation (a sketch of case-insensitive grouping follows these lists).
You can also configure the Aggregate transformation for performance by specifying these values:
·         The number of groups that are expected to result from a Group by operation on the column.
·         The number of distinct values that are expected to result from a Count distinct operation on the column.
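The comparison options matter mostly when grouping on string columns. As a rough SQL analogue of the Ignore case comparison flag, you can group under an explicit case-insensitive collation; the collation name below is one common SQL Server choice, not something the transformation prescribes:

-- Case-insensitive grouping: 'east' and 'EAST' fall into one group,
-- similar to grouping on Region with the Ignore case flag set.
SELECT
    Region COLLATE Latin1_General_CI_AS AS Region,
    SUM(Amount) AS TotalAmount
FROM Sales
GROUP BY Region COLLATE Latin1_General_CI_AS;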
Performance Considerations
The Aggregate transformation includes a set of properties that you can set to enhance the performance of the transformation.
·         When performing a Group by operation, set the Keys or KeyScale properties of the component and the component outputs. Using Keys, you can specify the exact number of keys the transformation is expected to handle. (In this context, Keys refers to the number of groups that are expected to result from a Group by operation.) Using KeyScale, you can specify an approximate number of keys. When you specify an appropriate value for Keys or KeyScale, you improve performance because the transformation is able to allocate adequate memory for the data that it caches.
·         When performing a Count distinct operation, set the CountDistinctKeys or CountDistinctScale properties of the component. Using CountDistinctKeys, you can specify the exact number of distinct values the transformation is expected to handle. (In this context, CountDistinctKeys refers to the number of distinct values that are expected to result from a Count distinct operation.) Using CountDistinctScale, you can specify an approximate number of distinct values. When you specify an appropriate value for CountDistinctKeys or CountDistinctScale, you improve performance because the transformation is able to allocate adequate memory for the data that it caches.
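If you are not sure what numbers to supply, you can estimate them from the source data before configuring the package. A simple sketch, again against the hypothetical Sales table (substitute your own source table and columns):

-- Estimate Keys / KeyScale: how many groups a Group by on Region produces.
SELECT COUNT(DISTINCT Region) AS EstimatedKeys
FROM Sales;

-- Estimate CountDistinctKeys / CountDistinctScale: how many distinct
-- Amount values a Count distinct on Amount produces.
SELECT COUNT(DISTINCT Amount) AS EstimatedDistinctValues
FROM Sales;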
