• 1a527dd5 3 hours ago

    Blimey, that is a lot of moving parts.

    Our data team currently has something similar and its costs are astronomical.

    On the other hand, our internal platform metrics are fired at BigQuery [1] and then we use scheduled queries that run daily (looking at the previous 24 hours) to aggregate and export to Parquet. And it's cheap as chips. From there it's just a flat file stored on GCS that can be pulled for analysis.
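
    A rough sketch of what such a daily aggregate-and-export query could look like, built as a string in Python. The table, bucket, path layout and column names (`metric_name`, `value`, `event_time`) are all made up for illustration, and "the previous 24 hours" is simplified here to "the previous calendar day". `EXPORT DATA` with `format = 'PARQUET'` is standard BigQuery SQL; in a real scheduled query you would more likely lean on the built-in `@run_date` parameter than on Python string formatting.

```python
from datetime import date, timedelta


def build_daily_export_sql(table: str, bucket: str, run_date: date) -> str:
    """Build a BigQuery EXPORT DATA statement covering the day before run_date.

    All identifiers passed in (table, bucket) and the column names used in the
    SELECT are hypothetical placeholders, not taken from any real setup.
    """
    day = run_date - timedelta(days=1)
    # BigQuery expects a single '*' wildcard in the export URI so it can
    # shard the output into multiple files if it needs to.
    uri = f"gs://{bucket}/platform_metrics/dt={day.isoformat()}/part-*.parquet"
    return f"""
EXPORT DATA OPTIONS (
  uri = '{uri}',
  format = 'PARQUET',
  overwrite = true
) AS
SELECT
  metric_name,
  COUNT(*) AS events,
  AVG(value) AS avg_value
FROM `{table}`
WHERE DATE(event_time) = DATE '{day.isoformat()}'
GROUP BY metric_name
""".strip()


# Example: the run on 2024-05-02 exports the aggregates for 2024-05-01.
sql = build_daily_export_sql(
    "my-project.metrics.events",  # hypothetical table
    "my-metrics-bucket",          # hypothetical GCS bucket
    date(2024, 5, 2),
)
print(sql)
```

    The resulting files under `dt=2024-05-01/` are plain Parquet, so anything that reads Parquet from GCS can pick them up without touching BigQuery again.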

    Do you have more thoughts on Preset/Superset? We looked at both (slightly leaning towards cloud hosted as we want to move away from on-prem) - but ended up going with Metabase.

    [1] https://cloud.google.com/bigquery/docs/write-api

    • zurfer 5 hours ago

      Kudos to the author who is responsible for the whole stack. A lot of effort goes into ingesting data into Iceberg tables to be queried via AWS Athena.

      But I think it's great that analytics and data transformation are distributed, so developers are also somewhat responsible for correct analytical numbers.

      In most companies there is a strong split between building the product and maintaining analytics for the product, which leads to all sorts of inefficiencies and errors.

      • valzam 3 hours ago

        I pity the developer who has to maintain tagless final plumbing code after the “functional programming enthusiast” moves on… in a Go-first org no less.

        • otter-in-a-suit 18 minutes ago

          Author here. This decision went through all proper architecture channels, including talks with our engineers, proofs of concept and the like.

          I’ve been doing this too long to shoehorn in my pet languages if I didn’t think they were a good fit. And I think that Scala/FP + Flink _is_ a good fit for this use case.

          We did also explore the Go ecosystem fwiw - the options there are limited (especially around data tooling like Iceberg) and Go is simply not a language that’s popular enough in the data world.

          Python’s typing system (or lack thereof) is a huge hindrance in this space in general (imo), and Java didn’t cause many happy faces on the Eng team either, but it’s certainly an option. I just find FP semantics a better fit for data / streaming work (lots of map and flat map anyway), and Scala makes that easy.

          Also no Cats/ZIO - just some tagless final _inspired_ composition and type classes. Not too difficult to reason about, not using any obscure patterns. I even mutate references sometimes. :-)

          • epgui an hour ago

            I would much rather inherit an FP data pipeline than anything else. You do realize data pipelines (and distributed computing) are an ideal use case for FP?

          • LoganDark 5 hours ago

            > Note that we do not store any data about the traffic content flowing through your tunnels—we only ever look at metadata. While you have the ability to enable full capture mode of all your traffic and can opt in to this service, we never store or analyze this data in our data platform. Instead, we use Clickhouse with a short data retention period in a completely separate platform and strong access controls to store this information and make it available to customers.

            Don't worry, your sensitive data isn't handled by our platform, we ship it to a third-party instead. This is for your protection!

            (I have no idea if Clickhouse is actually a third party, it sounds like one though?)

            • leosanchez 4 hours ago

              ClickHouse is a database. It has a cloud offering.

              • faangguyindia 3 hours ago

                What's the point of ClickHouse Cloud when you can just use BigQuery and run queries on billions of rows?

                I am genuinely curious what use case ClickHouse serves over BigQuery.

                • FridgeSeal 2 hours ago

                  It’s actually open source, you can self-host it easily enough, you can push a single instance pretty far too.

                  It’ll also happily read from disaggregated storage and is compatible with Parquet and friends and a stack of other formats. I’ve not really used BigQuery in anger, but the ClickHouse performance is really, really good.

                  I guess ultimately, all the same benefits, and a lot fewer downsides.

                  • tnolet 43 minutes ago

                    - non-proprietary

                    - open source

                    - run it locally

                    - SQL-like syntax

                    - tons of plugins

                    - not by Google

                • IanCal 4 hours ago

                  A different platform doesn't mean third party. It can just mean you have completely separated things so that none of the data tooling discussed here has any ability to access it.

                  • LoganDark 2 hours ago

                    Not sure what you mean... Do you mean they run software called Clickhouse on their own infra, just separated from the other parts of their backend? To me it reads like they were shipping the data off to a third-party named Clickhouse, especially with "we never store or analyze this data in our data platform" (does data platform refer to ngrok itself or what?).