• datadrivenangel 3 hours ago

    For most companies, the unstructured 'data' is barely information, let alone data, let alone valuable.

    Most companies have internal training videos, recordings of meetings, and PDFs of policies, and basically all of these are worthless within a year from a business perspective. Some things are useful for longer, or for reasons of historical interest, but the half-life is short. What actually holds value is what those artifacts could represent: decisions or events that would benefit from action. If a meeting produces a decision that some executive would want to know about, maybe a summary of the transcript could be useful?
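
    As a rough illustration of that transcript idea, here is a minimal sketch assuming the OpenAI Python client; the model name, prompt, and transcript file are placeholders, not a recommendation:

        # Sketch: pull decisions and action items out of a meeting transcript.
        # Assumes the OpenAI Python client; model name and prompt are illustrative.
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        def extract_decisions(transcript: str) -> str:
            """Ask an LLM to list only the decisions and action items in a transcript."""
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # any capable chat model would do
                messages=[
                    {"role": "system",
                     "content": "List the concrete decisions and action items, one per line. "
                                "If there are none, reply 'No decisions recorded.'"},
                    {"role": "user", "content": transcript},
                ],
            )
            return response.choices[0].message.content

        # e.g. extract_decisions(open("weekly_sync_transcript.txt").read())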

    Turning the random memoranda generated in the course of business into valuable insights without process redesign is a pipe dream in most scenarios. Not all, though.

    • danjl a day ago

      The trick is finding a problem that increases revenue or decreases costs after providing structure to the data. Sure, it would be great to bring structure to assets, but you can't just provide search or labeling. You have to figure out how providing the structure actually brings value to those companies. You'd hope they would do that for you, but you need to figure it all out, at least for one set of customers who will pay you, before you build the MVP. The details of how it benefits the company have a profound effect on the design of the MVP, including how to access the assets and how to expose the structure in the UX.

      • dark7 20 hours ago

        Yeah, good point. It would have to provide some value other than search. They can probably already do that to an extent with ChatGPT and their PDFs and whatnot.

      • evanjrowley a day ago

        It's an even bigger obstacle for data management, particularly classification and loss prevention. Comparatively, it's less of an obstacle for AI, and most likely AI will be the game changer for addressing those other issues.

        • edmundsauto a day ago

          I work for big tech. Our problem is not the unstructured nature of the data; it is the noise-to-signal ratio. Basic ranking and information retrieval are implemented, and we have LLM/RAG systems that can be queried. However, it’s hard to evaluate which information is good and up to date: 98% of the documents people kick out are not useful.
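
          One possible mitigation, sketched under assumptions (the Doc fields, the scores, and the 180-day half-life are all hypothetical): decay each retrieved document's relevance by its age so stale material falls out of the context before it reaches the LLM.

              # Sketch: penalize stale documents before they reach the LLM context.
              # The Doc fields and the half-life value are assumptions for illustration.
              from dataclasses import dataclass
              from datetime import datetime, timezone

              @dataclass
              class Doc:
                  doc_id: str
                  relevance: float          # score from the existing ranking/IR layer
                  last_modified: datetime   # timezone-aware timestamp

              def freshness_weight(doc: Doc, half_life_days: float = 180.0) -> float:
                  """Exponential decay: a doc loses half its weight every half_life_days."""
                  age_days = (datetime.now(timezone.utc) - doc.last_modified).days
                  return 0.5 ** (max(age_days, 0) / half_life_days)

              def rerank(docs: list[Doc], keep: int = 5) -> list[Doc]:
                  """Combine relevance with freshness and keep only the top few documents."""
                  scored = sorted(docs, key=lambda d: d.relevance * freshness_weight(d), reverse=True)
                  return scored[:keep]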

          • dark7 20 hours ago

            Interesting. So, for example, you’re saying you make a query and the information in the returned document is old and out of date?

            • yourapostasy 15 hours ago

              > ...and the information in the document is old and out of date?

              Not just old and out of date. Plain wrong, like referring to internal systems, processes, or intranet URLs that are (often poorly done) duplications of enterprise-standard services, processes, or pages, and that should be deprecated instead of incorporated into training data.

              Or the unstructured data sits within structured data. Some development teams stick giant strings with their own internal, proprietary formatting into a database blob column because...I dunno, they thought it was expedient or whatever, for Very Good [Undocumented] Reasons. Invariably, these fields hold all sorts of nastiness you either want to tease out or don't want to exist at all, because they expose in cleartext what you wish they wouldn't. And even though your company pays the vendors of these monstrosities (oh yes, my PTSD-suppressed memories of multiple vendors committing these same terrible, no good, very bad software sins) three commas a year in "support", they adamantly refuse to share their precious proprietary format with you, much less extend you the courtesy of a heads-up when they change a byzantine format that gives the three-body problem a run for its money in terms of having a deterministic solution.

              If you can make generative AI make sense of all that, then you can practically segment the market by how much the customer will spend before they start to put away the blank check, and you'll make bank. Everyone I know of in the Data Loss Prevention (DLP) and adjacent spaces is a deer in the headlights when the conversation turns toward these operational realities and what we might do with their solutions to put a dent in these multiverses of pain. No one has roadmaps that talk about iterative pipelines, search-outcome behavior analysis and automated training thereof, or really much beyond various degrees of pattern matching, whether that be regexes or LLMs.
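
              For concreteness, the crude pattern-matching end of that spectrum might look like the sketch below: a regex pass over a hypothetical blob column to flag cleartext that shouldn't be there. The database, table, column, and patterns are all made up for illustration; a real vendor blob format would need far more than this.

                  # Sketch: crude DLP-style scan of an opaque blob/text column for cleartext secrets.
                  # Database, table, and column names are hypothetical; the patterns are illustrative.
                  import re
                  import sqlite3

                  PATTERNS = {
                      "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
                      "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
                      "api_key_like": re.compile(r"\b[A-Za-z0-9]{32,}\b"),
                  }

                  def scan_blob_column(db_path: str, table: str, column: str):
                      """Yield (rowid, pattern name, match) for anything suspicious in the column."""
                      conn = sqlite3.connect(db_path)
                      for rowid, blob in conn.execute(f"SELECT rowid, {column} FROM {table}"):
                          text = blob.decode("utf-8", errors="replace") if isinstance(blob, bytes) else str(blob)
                          for name, pattern in PATTERNS.items():
                              for match in pattern.findall(text):
                                  yield rowid, name, match

                  # e.g. for hit in scan_blob_column("vendor_app.db", "orders", "payload"): print(hit)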

              • dark7 13 hours ago

                Very interesting. Definitely got me thinking. What’s the workflow like when these issues arise? At what point in that workflow could generative AI help?

          • aworks 15 hours ago

            Ben Thompson suggests Palantir as a company to leverage deep enterprise data with AI.

            https://stratechery.com/2024/enterprise-philosophy-and-the-f...

            • constantinum 12 hours ago

              Unstract is trying to solve this problem by fully leveraging the LLM stack. It is open source: https://github.com/Zipstack/unstract

              • theGnuMe a day ago

                There are companies working on this, but it is a wide-open space. There was a legal AI startup sold last year for a billion or so...