[Scilab-users] HDF5 save is super slow

Stéphane Mottelet stephane.mottelet at utc.fr
Mon Oct 15 15:35:04 CEST 2018


On 15/10/2018 at 15:07, Arvid Rosén wrote:
>
> Hi,
>
> Yeah, that makes sense. Or at least, it was about what I expected. It
> is a pity though, as handling thousands of filters isn’t necessarily a
> strange thing to do with software like Scilab, and making a special
> serialization like that would be nothing less than a hack.
>
> Do you think there is a way forward under the hood that could make big 
> deep list structures >10x faster in the future?
>
No. I think that HDF5 is not well suited to deeply nested data with
small leaves. Some interesting discussions can be found here:

https://cyrille.rossant.net/should-you-use-hdf5/
https://cyrille.rossant.net/moving-away-hdf5/

If you only need to read and write within your own software, serializing
should not be an issue. In the example you gave, the structure of each
leaf is always the same, so using an array of structs improves
performance a little:

clear
N = 4;
n = 1000;
for i=1:n
    G(i).a=rand(N,N);
    G(i).b=rand(N,1);
    G(i).c=rand(1,N);
    G(i).d=rand(1,1);
end
tic();
save('filters.dat', 'G');
disp(toc());
--> disp(toc());

    0.24133
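
To illustrate what serializing could look like here, a minimal sketch
follows. It assumes that all filters share the same state dimension N;
the file name filters_packed.dat and the variables M and restored are
only illustrative, and no timing is claimed for it.

```scilab
// Sketch: pack n state-space filters into one matrix before saving.
// Assumption: every filter has the same state dimension N.
N = 4;
n = 1000;
filters = list();
for i = 1:n
    filters($+1) = syslin('c', rand(N,N), rand(N,1), rand(1,N), rand(1,1));
end

// Each filter becomes one (N+1)x(N+1) block [A B; C D]; the blocks are
// stored side by side in a single (N+1) x n*(N+1) matrix.
M = zeros(N+1, n*(N+1));
for i = 1:n
    G = filters(i);
    cols = (i-1)*(N+1) + (1:N+1);
    M(:, cols) = [G.a G.b; G.c G.d];
end

tic();
save('filters_packed.dat', 'M', 'N', 'n');
disp(toc());

// Restore the list of syslin objects from the packed matrix.
clear M N n
load('filters_packed.dat');
restored = list();
for i = 1:n
    blk = M(:, (i-1)*(N+1) + (1:N+1));
    restored($+1) = syslin('c', blk(1:N,1:N), blk(1:N,N+1), blk(N+1,1:N), blk(N+1,N+1));
end
```

Only plain matrices and scalars reach save() this way, so no nested
container has to be written; the price is that the file layout is
specific to your own software.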


S.
>
> Otherwise, the whole object orientation part of Scilab (tlist and 
> mlist etc.) would be hard to use for anything that comes in large 
> numbers, which would be a shame, especially as it used to work just 
> fine (well, I can see how the old structure wasn’t “just fine” in 
> other ways, but still).
>
> Cheers,
>
> Arvid
>
> *From: *users <users-bounces at lists.scilab.org> on behalf of Stéphane 
> Mottelet <stephane.mottelet at utc.fr>
> *Organization: *Université de Technologie de Compiègne
> *Reply-To: *Users mailing list for Scilab <users at lists.scilab.org>
> *Date: *Monday, 15 October 2018 at 14:37
> *To: *"users at lists.scilab.org" <users at lists.scilab.org>
> *Subject: *Re: [Scilab-users] HDF5 save is super slow
>
> Hello,
>
> I looked a little bit in the sources: the obvious bottleneck is the
> nested creation of an HDF5 group each time a container variable
> is encountered.
> For the given example, this is particularly visible. If you replace
> the syslin structure by the corresponding [A,B;C,D] matrix, then save
> is almost ten times faster:
>
> N = 4;
> n = 1000;
> filters = list();
> for i=1:n
>   G=syslin('c', rand(N,N), rand(N,1), rand(1,N), rand(1,1));
>   filters($+1) = G;
> end
> tic();
> save('filters.dat', 'filters');
> disp(toc());
> --> disp(toc());
>
>    0.724754
>
> N = 4;
> n = 1000;
> filters = list();
> for i=1:n
>   G=syslin('c', rand(N,N), rand(N,1), rand(1,N), rand(1,1));
>   filters($+1) = [G.a G.b;G.c G.d];
> end
> tic();
> save('filters.dat', 'filters');
> disp(toc());
> --> disp(toc());
>
>    0.082302
>
> Serializing container objects seems to be the solution, but it runs
> counter to the portability spirit of HDF5.
>
> S.
>
>
> On 15/10/2018 at 12:22, Antoine Monmayrant wrote:
>
>     On 15/10/2018 at 11:55, Arvid Rosén wrote:
>
>         Hi,
>
>         Thanks for getting back to me!
>
>         Unfortunately, we used Scilab’s pretty cool way of doing
>         object orientation, so we have big nested tlist structures
>         with multiple instances of various lists of filters and other
>         structures, as in my example. Saving those structures in some
>         explicit manual way would be extremely complicated. Or is
>         there some way of writing explicit HDF5 saving/loading schemes
>         using overloading? That would be great! I am sure we could
>         find the main culprits and do something explicit for them, but
>         as they can be located wherever in a big nested structure, it
>         would be painful to do anything on the top level.
>
>         Another, related I guess, problem here is that the new file
>         format uses about 15 times as much disk space as the old
>         format (for a typical ill-behaved nested structure). That adds
>         to the save/load time too I guess, but is probably not the
>         main source here.
>
>     Argh, yes, I tested it, and with your example I get a file 8.5 times bigger.
>     I think that both increases in time and size are real issues and
>     should be reported as bugs.
>
>     By the way, I rewrote your script to run it under both 6.0 and 5.5:
>
>     /////////////////////////////////
>     N = 4;
>     n = 10000;
>     filters = list();
>
>     for i=1:n
>       G=syslin('c', rand(N,N), rand(N,1), rand(1,N), rand(1,1));
>       filters($+1) = G;
>     end
>
>     ver=getversion('scilab');
>
>     if ver(1)<6 then
>         tic();
>         save('filters_old.dat', filters);
>         ts1 = toc();
>     else
>         tic();
>         save('filters_new.dat', 'filters');
>         ts1 = toc();
>     end
>
>     printf("Time for save %.2fs\n", ts1);
>     /////////////////////////////////
>
>     Hope it helps,
>
>     Antoine
>
>
>         I think I might have reported this earlier using Bugzilla, but
>         I’m not sure. I’ll check and report it if not.
>
>         Cheers,
>
>         Arvid
>
>         *From: *users <users-bounces at lists.scilab.org> on behalf of
>         "amonmayr at laas.fr" <amonmayr at laas.fr>
>         *Reply-To: *"antoine.monmayrant at laas.fr"
>         <antoine.monmayrant at laas.fr>, Users mailing list for
>         Scilab <users at lists.scilab.org>
>         *Date: *Monday, 15 October 2018 at 11:08
>         *To: *"users at lists.scilab.org" <users at lists.scilab.org>
>         *Subject: *Re: [Scilab-users] HDF5 save is super slow
>
>         Hello,
>
>         I tried your code in 5.5.1 and the last nightly build of 6.0:
>         I see a slowdown by a factor of around 175 between the old save
>         in 5.5.1 and the new (and only) save in 6.0.
>         It's really related to the data structure, because we use hdf5
>         read/write a lot here and did not experience significant
>         slowdowns using 6.0.
>         I think the overhead might come from the translation of your
>         fairly complex variable (a long list of tlists) into the
>         corresponding HDF5 structure.
>         In the old save, this translation was not necessary.
>         Maybe you could try to save your data in a different way.
>         For example:
>         1) you could save each element of "filters" in a separate file.
>         2) you could bypass save and directly write your data in a
>         hdf5 file by using h5open(), h5write() directly. It means you
>         need to write your own load() for your custom file format. But
>         this way, you can try to find the best way to layout your data
>         in hdf5 format.
>         3) in addition to 2) you could try to save each entry of your
>         "filters" array as one dataset in a given hdf5 file.
>
>         Did you search on bugzilla whether this bug was already submitted?
>         Could you try to report it?
>
>
>         Antoine
>
>         On 15/10/2018 at 10:11, Arvid Rosén wrote:
>
>             /////////////////////////////////
>             N = 4;
>             n = 10000;
>             filters = list();
>             for i=1:n
>               G=syslin('c', rand(N,N), rand(N,1), rand(1,N), rand(1,1));
>               filters($+1) = G;
>             end
>             tic();
>             save('filters.dat', filters);
>             ts1 = toc();
>             tic();
>             save('filters.dat', 'filters');
>             ts2 = toc();
>             printf("old save %.2fs\n", ts1);
>             printf("new save %.2fs\n", ts2);
>             printf("slowdown %.1f\n", ts2/ts1);
>             /////////////////////////////////
>
>         -- 
>         +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>           Antoine Monmayrant LAAS - CNRS
>           7 avenue du Colonel Roche
>           BP 54200
>           31031 TOULOUSE Cedex 4
>           FRANCE
>           Tel:+33  5 61 33 64 59
>           email : antoine.monmayrant at laas.fr
>           permanent email : antoine.monmayrant at polytechnique.org
>         +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>     _______________________________________________
>     users mailing list
>     users at lists.scilab.org
>     https://antispam.utc.fr/proxy/1/c3RlcGhhbmUubW90dGVsZXRAdXRjLmZy/lists.scilab.org/mailman/listinfo/users
>
> -- 
> Stéphane Mottelet
> Ingénieur de recherche
> EA 4297 Transformations Intégrées de la Matière Renouvelable
> Département Génie des Procédés Industriels
> Sorbonne Universités - Université de Technologie de Compiègne
> CS 60319, 60203 Compiègne cedex
> Tel : +33(0)344234688
> http://www.utc.fr/~mottelet
>


