cancel
Showing results for 
Search instead for 
Did you mean: 

SAP HANA - How system understands the columnar data relation?

Former Member
0 Kudos

Hi All,

SAP HANA database stores the data in columnar format. So from row based table perspective, each column will be treated as an individual table with distinct values.

How HANA database maintains the relationship of each row so the user gets correct data? In other words, how HANA database internally process the column based records?

Regards,

Mandar

Accepted Solutions (1)

Accepted Solutions (1)

Former Member
0 Kudos

Thank you Venkatesh and Lars for your reply.

Lars, first of all thank you!

I got the point that system stores distinct values of each column.. let me copy your statements

- per column you store a dictionary of every distinct value that occurs in this column

- for each distinct value you need to maintain a list or array or vector (however you call it) that tells you, for which row a specific value occurs.

My doubt is clear till this point.

now, when you say "when we want to put back together a full specific row, we need to go through every column and find the matching column value for each row id "

So how does system creates a row when there are distinct column values. As there are n permutations and combinations, how internal algorithm creates the right record?

Regards,

Mandar

lbreddemann
Active Contributor
0 Kudos

> My doubt is clear till this point.

>

> now, when you say "when we want to put back together a full specific row, we need to go through every column and find the matching column value for each row id "

>

> So how does system creates a row when there are distinct column values. As there are n permutations and combinations, how internal algorithm creates the right record?

> Mandar

Internal algorithm? => NO IDEA! And if I would have, I surely wouldn't be allowed to tell

But what would you do if you would need to develop such a column store?

You probably would take a rowid and iterate through each column to find the value it got assigned for this value.

Maybe you would implement a kind of inverse index to allow faster lookup [rowid -> value] or something like that.

Once you found the value for your rowid you move on to the next column until you got a value for all requested columns.

Since the columns are pretty independent from each other you can do this in parallel if enough free threads are available.

That's what I can tell - for more, it might be a good idea if you check the implementation of other column store databases.

There are many out there and most of them work pretty similar

regards,

Lars

Answers (4)

Answers (4)

rama_shankar3
Active Contributor
0 Kudos

Thanks - good points !

Former Member
0 Kudos

Lars, Can you please help me to understand how this inverse index works here?

Mandar,

Even determination of the correct rows, can be linked to these three compression techniques.

Run-length encoding

Cluster encoding

and Dictionary encoding

Run-length encoding. If data in a column is sorted there is a high probability that two or more

elements contain the same values. Run-length encoding counts the number of consecutive column

elements with the same values. To achieve this, the original column is replaced with a two-column

list. The first column contains the values as they appear in the original table column and the second

column contains the counts of consecutive occurrences of the respective value. From this information

the original column can easily be reconstructed.

Cluster encoding. This compression technique works by searching for multiple occurrences of the

same sequence of values within the original column. The compressed column consists of a two-column

list with the first column containing the elements of a particular sequence and the second

column containing the row numbers where the sequence starts in the original column. Many popular

data compression programs use this technique to compress text files.

Dictionary encoding. Table columns which contain only a comparably small number of distinct values

can be effectively be compressed by enumerating the distinct values and storing only their numbers.

This technique requires that an additional table, the dictionary, is maintained which in the first column

contains the original values and in the second one the numbers representing the values. This

technique leads to high compression rates, is very common, e. g. in country codes or customer numbers,

but is seldom regarded as a compression technique.

thanks,

Tilak

lbreddemann
Active Contributor
0 Kudos

Hi Tilak!

Sorry, but I fail to see the question in your post.

What exactly is it you want me to explain?

- Lars

Former Member
0 Kudos

Thank you Lars for a quick reply... I understand that even if you knew you would not be allowed to tell, but I couldn't resist myself since it is a very unconventional way of storage. Thanks again for your valuable inputs, at the same I will check out for other databases as well....

Regards,

Mandar

lbreddemann
Active Contributor
0 Kudos

Hi Mandar,

I guess the question how to put back together rows from column store data is something that puzzles many who are used to work with classic row-store databases.

I haven't seen the column store coding, but usually something like this is done:

- per column you store a dictionary of every distinct value that occurs in this column

- for each distinct value you need to maintain a list or array or vector (however you call it) that tells you, for which row a specific value occurs.

This is exactly what a classic index does:


[distinct value] --> [rowid_1, rowid_2, ... , rowid_n]

Now, when we want to put back together a full specific row, we need to go through every column and find the matching column value for each rowid - so we need to perform a kind of inverse index function.

As you can assume, this is something pretty 'expensive' and in fact: rebuilding a complete row is the most expensive operation of a column store.

This also should explain why column store tables are that good for aggregation and data analysis but terribly bad when it's about working on specific rows.

I hope this is what you wanted to know.

regards,

Lars

patrickbachmann
Active Contributor
0 Kudos

Hey Lars,

I am training some of my ABAP team and of course they are interested in this and I'm trying to explain this best I can and I drafted up this example and I was wondering if this would be a good visual?  Does it look about right?  ie: in my row vs column store example I want to query all values for KTM.  Yellow highlights the values i'm querying and on the column store inside the parenthesis would be the $rowid$.  I'd feel better showing this if I got your thoughts.

Thanks!

-Patrick

lbreddemann
Active Contributor
0 Kudos

Hi Patrick

this way surely is not wrong in the sense that the numbers in the brackets on the COLUMN STORE side seem to indicate the rows they belong to.

However, it gives the impression, that the column store data is also stored in a single structure, just like the row store.

But that (among other things) this is not so is one of the core enablers for many optimizations.

So, I'd always try to at least show the dictionary and the main vector separated.

Also, it's important to point out that there is something like a "logical" table structure - the way the database presents the data to the SQL users - and a "physical" structure - the way the database handles the data internally.

This is from Richard's and my book:

I decided to not include things like the $rowid$ as this is a sort-of tricky column and not required to explain the concept. It's rather an implementation detail.

Cheers,

Lars

patrickbachmann
Active Contributor
0 Kudos

Lars, thanks for the great feedback and diagram.  I will use this in my training and will get the book finally!

-Patrick

Former Member
0 Kudos

Hi

You can check the HANA Architecture to understand how system will differentiate Row and Column data storage.

http://www.erphowtos.com/sap-bw/33-sap-hana-overview-and-architecture.html

We have two different for data flows for row and column tables.

When it comes to data consistency, this will be achieved using MVCC(Multi Version Concurrency Control)

In this we have two types i) CID(commit ID) ii) TID(Transaction ID)

CID is used for row based storage where as TID is used for Column based storage.

Regards,

Venkatesh