What Is an OLAP Cube? An Exhaustive Explainer

by chotmat on 9/3/21, 7:07 AM with 19 comments

  • by erosenbe0 on 9/5/21, 3:42 AM

    An OLAP database is just a copy, replica, or archive of your data, with a schema designed for analytical queries rather than for transactional speed and robustness.

    Cube means you can answer questions across dimensions of time, region, and the like. Temporal queries are not always simple.

    You don't query your live transactional db to answer a question like, "are third week of February margins on Acme Widget Q better in place A or B."
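
    For a concrete feel, the analytical version of that question might look something like this pandas sketch (table and column names are hypothetical, and "third week" is approximated as days 15-21):

      import pandas as pd

      # hypothetical fact table: one row per sale, already denormalized
      sales = pd.DataFrame({
          "product": ["Widget Q", "Widget Q", "Widget Q"],
          "place":   ["A", "B", "A"],
          "date":    pd.to_datetime(["2021-02-16", "2021-02-17", "2021-02-18"]),
          "margin":  [120.0, 95.0, 130.0],
      })

      # "third week of February" margins on Widget Q, compared by place
      feb_wk3 = sales[(sales["date"].dt.month == 2)
                      & (sales["date"].dt.day.between(15, 21))
                      & (sales["product"] == "Widget Q")]
      print(feb_wk3.groupby("place")["margin"].sum())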

    That's really it.

  • by rstuart4133 on 9/5/21, 10:03 AM

    I remember hearing about OLAP cubes donkey's years ago (probably not far from 1993, as the article says) and read numerous descriptions of what they are (many sounding like this article). But technically there was nothing novel about the results they were producing; any programmer could have done it. There was nothing surprising about the underlying data structures. I walked away without much of a clue, but it wasn't just marketing fluff, because the idea kept hanging around.

    That changed recently. I was asked to load some data into an OLAP package. Nothing special about that, but it meant I got to sit in on the training session for the people using it. These were the same people our IT had to help on a daily basis with the simplest of things. Yet within an hour it had clicked, and they were happily exploring how each dimension in many-dimensional data affected the others.

    That was my light bulb moment. The reason OLAP is still a thing 20 years later isn't some special algorithm or data structure - it's that it's a GUI for exploring very complex data that non-computer people find intuitive. That's my take on it, anyway.

  • by rahimnathwani on 9/5/21, 1:30 AM

    If you've used pivot tables in Excel or Google Sheets, you'll know that you have:

    A) rows and columns

    B) values

    Dimensions are things you might include in A.

    Facts (or measures) are things you can aggregate, and would be part of B.

    In the old days, you often couldn't quickly (on demand) compute the answer to a question, due to memory and processing limitations.

    But, if you could decide ahead of time what dimensions and facts you wanted, you could compute all the little slices, e.g. the value of sales in the north region by salesperson Bob in July 1996.

    Then, when you needed some particular pivot table, your software could provide it to you just by summing these precomputed slices.
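
    A minimal sketch of that precompute-then-sum idea in pandas (all names made up; real OLAP engines stored these slices persistently):

      import pandas as pd

      sales = pd.DataFrame({
          "region":      ["north", "north", "south", "north"],
          "salesperson": ["Bob",   "Bob",   "Ann",   "Ann"],
          "month":       ["1996-07", "1996-08", "1996-07", "1996-07"],
          "value":       [100.0, 150.0, 80.0, 120.0],
      })

      # precompute one aggregated row per combination of dimension values
      slices = sales.groupby(["region", "salesperson", "month"],
                             as_index=False)["value"].sum()

      # any coarser pivot table is then just a sum over the precomputed slices
      print(slices.pivot_table(index="region", columns="month",
                               values="value", aggfunc="sum"))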

    The OL in OLAP stands for 'online', i.e. you'd do your analytical processing ('AP') live, rather than waiting for some long-running batch job.

  • by mr_toad on 9/5/21, 2:33 AM

    My experience of OLAP was that it took enterprises so long to plan, design, and adopt OLAP systems that they were obsolete before they could ever be used.

  • by sradman on 9/5/21, 3:46 AM

    OLAP Cubes have been disrupted by Column Stores. Unless you are interested in the history of Data Warehouses, the technology can safely be ignored.

  • by gfody on 9/5/21, 1:53 AM

    People can really get hung up on jargon like this. The real concept is dimensional modeling, which is a whole strategy and toolkit of ideas for doing online analytical processing. You can do it with anything; The Data Warehouse Toolkit called it "ROLAP" when you did it in a regular database. You would be dead lost trying to use a proper "OLAP" tool, with its special sparse-matrix data structures and MDX queries, without understanding concepts like the various types of dimension tables and how they can be nested, etc. Claire was basically right, but maybe actually complaining more about the marketing of the OLAP products than any actual gatekeeping.
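
    For a taste of what dimensional modeling means in practice, a toy star schema might look like this in pandas (all names hypothetical):

      import pandas as pd

      # dimension tables: one row per member, with descriptive attributes
      dim_date    = pd.DataFrame({"date_key": [1, 2],
                                  "date": ["2021-02-16", "2021-02-17"],
                                  "week": [7, 7]})
      dim_product = pd.DataFrame({"product_key": [10],
                                  "product": ["Widget Q"]})

      # fact table: foreign keys into the dimensions plus additive measures
      facts = pd.DataFrame({"date_key":    [1, 1, 2],
                            "product_key": [10, 10, 10],
                            "revenue":     [30.0, 10.0, 50.0]})

      # a ROLAP-style query is just joins plus a group-by across the star
      star = facts.merge(dim_date, on="date_key").merge(dim_product, on="product_key")
      print(star.groupby(["week", "product"])["revenue"].sum())
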
  • by vinay_ys on 9/5/21, 11:57 AM

    Very nice write-up. For the story from 2010 to now, I would add the emergence of MPP columnar processing systems like Vertica and in-memory distributed systems like MemSQL to the narrative. Of course, Kylin, ClickHouse, etc. are great open-source contenders (although when I looked into them, ~5-6 years ago, they were not mature enough).

    In my experience, people often underestimate the continuous effort it takes to maintain Kimball's "Enterprise Data Warehouse Bus Architecture" diagram, even with more powerful machines and modern distributed tooling.

    In today's fast-evolving world of Internet apps, the data use cases and scenarios change very quickly. That brings its own set of challenges.

    Having good, usable tools for managing the lifecycle of entity and event definitions (their variants such as emitted/logged vs. cleaned/processed/synthesized, their data quality checks, etc.), and ensuring they are easily discoverable and understandable by everyone in the org, is crucial, and it is significantly under-appreciated.

    Usually, the strong systems engineers in charge of the data platform focus on building the data infra (job scheduling, data pipeline execution, storage, etc.), but the crucial work of defining the data dictionaries, event and entity models, etc. is left out. The data producers and consumers spread throughout the organization have to muddle through this on their own, without any centralized tooling to support the activity. This makes data use difficult and siloed.

    Usually, there is a team of BI analysts tasked with getting answers out of the data for the questions asked by various data users. Funnily enough, these analysts also work in silos, assigned to those different data users. Inevitably, they become a super-inefficient intermediary between the data users and the data insights.

    The pre-cooked data insights are presented in spreadsheets and slides in review meetings, where a narrative has already been prepared by the analysts.

    This robs data users of the opportunity to explore and ask their own questions of the data in a fast iteration cycle, and so to improve their intuition and understanding of their product/business environment.

    IMO, these challenges remain largely unsolved to this day, across organizations of all sizes and scales.

  • by cm277 on 9/5/21, 3:04 PM

    Well, an OLAP cube is a kind of table/view. It has two core properties: the cube contains only one value ('column' in RDBMS-speak), and the data is defined across a set of 'dimensions' (levels in OLAP; unique indices with a foreign key in RDBMS).

    The core insight the article is missing, though, is that these dimensions are well-defined (and finite!) within the OLAP architecture, and so are the roll-ups (aggregations) to their parent dimensions. The classic example is a date: a date has to roll up to a week. Obvious. But the set of dates within the environment also has to be (or should be, anyway) well defined: you can't just use any date; it has to fall, e.g., between 2019 and 2021.

    Why? Because the structure of the dimensions makes algebra across OLAP cubes trivial. In fact, OLAP algebra is pretty much linear algebra, just defined slightly differently for a more fluid problem space. That's why I like to think OLAP isn't obsolete: in fact, most databases are just now becoming OLAP-y :)
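
    A sketch of that view in numpy, treating a cube as a dense array (real engines use sparse structures; names and shapes are made up):

      import numpy as np

      # one measure (sales) over three dimensions: 2 regions x 3 products x 4 weeks
      cube = np.arange(24, dtype=float).reshape(2, 3, 4)

      # a roll-up is just summation along an axis (here: weeks -> totals)
      by_region_product = cube.sum(axis=2)

      # slicing and dicing are ordinary indexing
      region0_by_week = cube[0].sum(axis=0)  # one region, totals per week
      print(by_region_product)
      print(region0_by_week)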

    Source: 20+ years of OLAP systems development.

  • by olapalo on 9/5/21, 12:21 AM

    This is more research notes on the evolution of the concept than an explainer, but it is still very good.
  • by philipodonnell on 9/5/21, 11:43 AM

    Anyone have experience with Apache Kylin? It seemed to me like a way to blend the cube concept with columnar stores.

    https://en.m.wikipedia.org/wiki/Apache_Kylin

  • by 7thaccount on 9/5/21, 4:18 AM

    Is there any advantage to an OLAP cube over plain SQL (a large historical database regularly updated with production data)?

    It sounds like certain common queries may be extra fast because the answers are kind of pre-computed?
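
    Something like the difference between these two, I imagine (a pandas sketch; names and numbers made up):

      import pandas as pd

      # hypothetical fact table standing in for the big historical database
      orders = pd.DataFrame({"region": ["north", "south"] * 500_000,
                             "value":  [1.0, 2.0] * 500_000})

      # plain SQL equivalent: scan a million rows every time the question is asked
      print(orders.groupby("region")["value"].sum())

      # cube-style: materialize the aggregate once, answer later queries from it
      agg = orders.groupby("region")["value"].sum()
      print(agg.loc["north"])  # reads 2 precomputed rows instead of 1,000,000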