• doodlebugging 3 days ago

    Looks nice. I've had occasion to import/export, edit, etc. thousands of CSV files from multiple software platforms over the years, and this tool looks like a simple way for a user to determine whether there are issues in a CSV file that will cause problems on import into their application.

    One question I immediately have is how this compares to a spreadsheet CSV import tool such as the one in Excel, which is extremely flexible. It appears that this app requires a specific format (comma delimited, new line at end of each row) in order to work. I never tried to count the times a CSV file I had to work with required editing before it would import into Excel or another application; CSV is such a non-standard standard that the only way to know whether an import would succeed was to pop the file into an editor like Notepad++ and examine it first. Notepad++ was a critical tool in the chain for forcing compliance across all the different applications I used. Each application allowed CSV import/export, but some accepted almost any delimiter while others were strict about file format, and failing to understand the expected CSV format for each would definitely cause headaches, since some input errors leave a very subtle footprint that you may not catch until late in processing.

    Anyway, it appears that your definition of the CSV format is pretty strict, so how do you propose a user manage the import of files that do not fit your CSV definition? Notepad++ before import to verify compliance?

    I also see one thing on the main page under "Security" that looks like it could be worded differently.

    >No tracking or analytics software is used for privacy

    To me, this implies that no steps have been taken to manage user/data privacy.

    Perhaps a comma could be inserted so that it reads "...used, for privacy." or maybe it should read:

    For (user/data) privacy, there is no tracking or analytics software.

    • kengoa 3 days ago

      > how this compares to a spreadsheet CSV import tool such as the one in Excel, which is extremely flexible.

      I would say the data loading functionality compares very poorly to Excel's CSV import for all the reasons you pointed out, and I agree that users can face those formatting issues, which could be resolved in another tool: Excel or Google Sheets for non-technical users, and Notepad++ or other editors for more technical users. The assumption that CSV files are clean is a strong one, so I will at least try to surface import errors, and in the meantime point to different ways to format the data, since those tools will be complementary to Visprex.
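
      For surfacing those errors, a first step could be as simple as reading the errors array that Papa Parse already returns. A minimal sketch (not the current Visprex code; csvText stands in for the loaded file):

      ```typescript
      import Papa from 'papaparse';

      // Surface row-level import errors instead of assuming a clean file.
      const { errors } = Papa.parse(csvText, {
        header: true,
        skipEmptyLines: true,
      });

      if (errors.length > 0) {
        // e.g. list these in the UI rather than failing silently
        for (const err of errors) {
          console.warn(`row ${err.row}: ${err.code} - ${err.message}`);
        }
      }
      ```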

      > To me, this implies that no steps have been taken to manage user/data privacy.

      This is a good point. I fixed the wording and now it simply reads "No tracking or analytics software is used". Thanks!

    • paddy_m 3 days ago

      Nice work!

      Do you have any plans for data cleaning?

      I am working on a somewhat similar open source project. I intend to add heuristic data cleaning. With the UI I want to be able to toggle between different strategies quickly: strip characters from a column to treat it as numeric if less than 2% or 5% of values contain a character; fill NA with the mean; interpret dates in different formats and drop rows where the date doesn't parse. The idea is that if it's really quick to switch between different strategies, you can create more opinionated strategies and get to the right answer faster.
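
      Roughly what I have in mind, as a minimal sketch (names and thresholds are made up):

      ```typescript
      // Toggleable cleaning strategies; a strategy can decline a column.
      type NumericStrategy = (values: string[]) => (number | null)[] | null;

      // Treat a column as numeric, stripping stray characters, but only if
      // fewer than `threshold` (e.g. 0.02 or 0.05) of the values needed it.
      const coerceNumeric = (threshold: number): NumericStrategy => (values) => {
        const stripped = values.map((v) => v.replace(/[^0-9.+-]/g, ''));
        const touched = values.filter((v, i) => v !== stripped[i]).length;
        if (touched / values.length > threshold) return null; // decline
        return stripped.map((v) => (v === '' ? null : Number(v)));
      };

      // Fill missing values with the column mean.
      const fillNaWithMean = (nums: (number | null)[]): number[] => {
        const present = nums.filter((n): n is number => n !== null);
        const mean = present.reduce((a, b) => a + b, 0) / present.length;
        return nums.map((n) => n ?? mean);
      };
      ```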

      Happy to collaborate and talk tables with anyone who's interested.

      • kengoa 3 days ago

        Yes, I do have plans for data preprocessing using DuckDB WebAssembly (there is an upcoming features section in this blog post: https://kengoa.github.io/software/2024/11/03/small-software....), but this will require SQL, which some of the target audience might not be familiar with. I'm thinking of something like the visual query builder from Metabase.
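
        The DuckDB side would look roughly like this (a sketch using the @duckdb/duckdb-wasm package; the eventual wiring in Visprex may differ):

        ```typescript
        import * as duckdb from '@duckdb/duckdb-wasm';

        // Pick a bundle from jsDelivr and start the async DB in a web worker.
        const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
        const worker = new Worker(bundle.mainWorker!);
        const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
        await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

        // Register the user's CSV text and run SQL against it.
        const conn = await db.connect();
        await db.registerFileText('data.csv', csvText); // csvText: loaded file
        const table = await conn.query(
          "SELECT * FROM read_csv_auto('data.csv') LIMIT 10"
        );
        ```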

        > With the UI I want to be able to toggle between different strategies quickly: strip characters from a column to treat it as numeric if less than 2% or 5% of values contain a character; fill NA with the mean; interpret dates in different formats and drop rows where the date doesn't parse

        Those are really good examples, and I can make them available as predefined preprocessing toggles in the dataset tab. Thanks for the feedback!

        • remolacha 3 days ago

          Not quite what you're describing, but I open-sourced a fuzzy deduplication tool last week: https://dedupe.it. I'd be interested in expanding it to deal with data cleaning more broadly.

          • turtlebits 3 days ago

            Not sure if you have introduced an artificial delay, but deduping ~25 rows shouldn't take 5+ seconds...

            edit: I see you're using an LLM, but "~$8.40 per 1k records" sounds unsustainable.

        • TripleChecker 2 days ago

          Are you planning to add xlsx support?

          Also a few typos you might want to review: https://triplechecker.com/s/823563/docs.visprex.com

          • kengoa 2 days ago

            I will try to add support for other data formats like xlsx and parquet in the future. The current CSV parsing is also a bit limited (e.g. it cannot deal with timestamps), so I will try to update the parsers first.

            Also thanks for the error checker! I pushed the fixes in https://github.com/visprex/visprex.github.io/pull/4

          • teddyh 3 days ago

            I loaded a CSV with one date/time column and one numerical column. I then selected “Scatter Plot”, but got the message “Not enough numerical columns found. Load a CSV file with at least 2 numerical columns in the Datasets tab.” I would have thought that a date/time column would count?

            • kengoa 3 days ago

              Thanks for trying it out! This is unfortunately not possible as of now; parsing timestamps and datetimes is one of the high-priority tasks, as they are currently parsed incorrectly as strings (Categorical). I'm using Papa Parse to load CSV data and will likely need to add a custom parser on top of it.
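
              The custom layer might look something like this (a sketch on top of Papa Parse; the detection heuristic is illustrative, not what Visprex ships):

              ```typescript
              import Papa from 'papaparse';

              // After parsing, detect columns where every value parses as a
              // date, so they can be typed as Datetime rather than Categorical.
              const { data, meta } = Papa.parse<Record<string, unknown>>(csvText, {
                header: true,
                dynamicTyping: true,
                skipEmptyLines: true,
              });

              const isDatetimeColumn = (field: string): boolean =>
                data.every((row) => {
                  const v = row[field];
                  return typeof v === 'string' && !Number.isNaN(Date.parse(v));
                });

              const datetimeColumns = (meta.fields ?? []).filter(isDatetimeColumn);
              ```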

              Some of those plans are mentioned in my blog post reflecting on building this app: https://kengoa.github.io/software/2024/11/03/small-software....

              • nerdponx 3 days ago

                You might also want to support a Unix timestamp as input, i.e. an integer or decimal number of (milli|micro|nano)seconds since the Unix epoch. No need to worry about messy date parsing there.
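
                The unit can even be guessed from the magnitude (a heuristic sketch, assuming dates within a few centuries of now):

                ```typescript
                // Guess the epoch unit from the magnitude of the value.
                const epochToDate = (ts: number): Date => {
                  const abs = Math.abs(ts);
                  if (abs < 1e11) return new Date(ts * 1000); // seconds
                  if (abs < 1e14) return new Date(ts);        // milliseconds
                  if (abs < 1e17) return new Date(ts / 1e3);  // microseconds
                  return new Date(ts / 1e6);                  // nanoseconds
                };
                ```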

                • nyclounge 3 days ago

                  Maybe use dayjs to handle all kinds of weird string dates.

                  • kengoa 2 days ago

                    dayjs seems like exactly what I was looking for, thanks for the suggestion! I might have tried to write a parser myself otherwise.
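
                    With the customParseFormat plugin, strict multi-format parsing is only a few lines (a sketch; the format list is illustrative):

                    ```typescript
                    import dayjs from 'dayjs';
                    import customParseFormat from 'dayjs/plugin/customParseFormat';

                    dayjs.extend(customParseFormat);

                    // Candidate formats to try, in order.
                    const FORMATS = ['YYYY-MM-DD', 'DD/MM/YYYY', 'MM/DD/YYYY HH:mm:ss'];

                    const parseDate = (value: string): Date | null => {
                      for (const fmt of FORMATS) {
                        const d = dayjs(value, fmt, true); // strict parsing
                        if (d.isValid()) return d.toDate();
                      }
                      return null;
                    };
                    ```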

              • mosselman 3 days ago

                Cool! Does anyone know of any JavaScript libraries I could use to get this type of distribution visualisation from tabular data? Something I can run on my own site, that is.

              • parsimo2010 3 days ago

                I like this a lot- I am going to show it to my students!

                They seem to hate learning R, and while this doesn’t prevent them from having to build a model, this will speed up the exploration steps.

                • kengoa 3 days ago

                  I'm very glad to hear this as this is exactly the target audience and the use case I initially thought of! I hope your students find it useful.

                • jeffreygoesto 3 days ago

                  I typically fire up GnuPlot, where loading a CSV and producing a plot are one line each. What does Visprex do more or better?

                  • relistan 3 days ago

                    Not the author, but just from looking at it and playing with it, I’d say ease of use is the obvious one. Availability in the browser any time, anywhere seems nice.

                    • kengoa 3 days ago

                      Thanks for the comment. I would say speeding up the iteration between visualisation steps is the main benefit, as you might not want to be thinking about matplotlib syntax when trying to get a sense of data distributions.

                  • rrr_oh_man 3 days ago

                    Very cool stuff!

                    Maybe bar / beeswarm charts would be useful?

                    I was missing the possibility to show differences by category, e.g. mpg by make in the cars dataset.

                    • kengoa 3 days ago

                      I haven't considered beeswarm charts for this before; I will add them to the list of upcoming features. Thanks for the feedback :)

                    • imfing 3 days ago

                      cool project!

                      Visualizing tabular data has often presented challenges for me, as I had to rely on tools like Google Sheets or Colab + Pandas for quick cleaning and wrangling before exploring different visualizations.

                      I think having more client-side data cleaning capabilities would make it even more powerful.

                      • kengoa 3 days ago

                        > had to rely on tools like Google Sheets or Colab + Pandas for quick cleaning and wrangling before exploring different visualizations.

                        Yes, I had the same experience with analytics work some years ago. As others have pointed out, Visprex only works in the happy path where the data is a clean CSV file, so I will definitely need to work on data cleaning. I have a DuckDB integration planned, but I'm not sure if it is easy enough for the target audience. I will try to add some predefined functionalities, thanks for the feedback!