Score:0

Duplicated rows of Google cloud billing data

cn flag

I extracted GCP billing data through BigQuery. I made queries to retrieve data for particular time duration and then I found two types of duplicate data:

  1. duplicate data resulted from unnesting the label column

  2. duplicate data results from normal query (Like as below) rows with fully the same value for all columns.

      billing_account_id AS id,
      service.id AS ServiceId,
      service.description AS ServiceDescription,
      sku.id AS SkuId,
      sku.description AS SkuDescription,
      usage_start_time,
      usage_end_time,
      project.id AS ProjectId,
      project.number AS ProjectNumber,
      project.name AS ProjectName,
      project.labels as ProjectLabels,
      project.ancestry_numbers AS ProjectAncestryNumbers,
      labels   AS Labels,
      system_labels  AS SystemLabels,
      location.location AS Location,
      location.country AS Country,
      location.region AS Region,
      location.zone AS Zone,
      SUBSTRING(cast(export_time as string), 1, 19) AS ExportTime,
      cost AS Cost,
      currency AS Currency,
      currency_conversion_rate AS CurrencyConversionRate,
      usage.amount AS UsageAmount,
      usage.unit AS UsageUnit,
      usage.amount_in_pricing_units AS UsageAmountInPricingUnits,
      usage.pricing_unit AS UsagePricingUnit,
      credits  AS Credits,
      invoice.month AS InvoiceMonth,
      cost_type AS CostType,
      adjustment_info.id AS adjustmentInfoId,
      adjustment_info.description AS adjustmentInfoDescription,
      adjustment_info.mode AS adjustmentInfoMode,
      adjustment_info.type AS adjustmentInfoType
    FROM 
    'NAME OF TABLE' 
    WHERE DATE(_PARTITIONTIME) BETWEEN subtractFromTime(utcnow(),60,'Day','yyyy-MM-dd')AND formatDateTime(utcnow(),'yyyy-MM-dd')```


**I'd like to know why there are duplicate data? How they can be handled and how can be sure that cost calculation has not been affected by duplicate data?**

Appreciate it if anyone can help me.


Bests,
Shokoufeh
Score:1
pk flag

1. Why rows get duplicated when un-nesting the labels field?

When you are un-nesting a repeated type field like labels, duplication of rows is expected. To be precise, each row gets duplicated based on the length of that row's labels array.

Unnest count vs total labels

2. Why are there duplicate rows even before un-nesting?

If you create 2 Compute Engine VMs of exact config and location, the idle usage rows in the billing export of those 2 VMs are exactly the same unless labelled. The export table doesn't have an explicitly exposed primary key.

The export table's granularity is only until service & SKUs and not per resource. This leads to duplicate looking data but they are actual valid usages.

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.