Centre de Calcul de l'Institut National de Physique Nucléaire et de Physique des Particules - ATLAS CAF, 9 Sep 2016, Vamvakopoulos E.

July-Aug 2016 IN2P3-CC T1 activity (wallclock)
[Monitoring plots: wallclock for federation FR-CCIN2P3 / site IN2P3-CC over the period, logical CPUs (all VOs) and HEPSPEC06 (all VOs) against the ATLAS pledges.]
- Pledge usage (in ATLAS): Q1: 90.31 %, Q2: 98.62 %, Q3: 73.84 % (!) -- Q3 is not finished yet.
[Further plots: wallclock per job type for IN2P3-CC_MCORE, IN2P3-CC_MCORE (total) and IN2P3-CC_MCORE_HIMEM.]

Memory and brokering
ATLAS job memory consumption -- terminology:
- VMEM is the total amount of virtual memory the process has mapped, regardless of whether it has been committed to physical memory.
- RSS is the amount of physical memory being mapped (resident set size).
- PSS is the proportional set size: the memory shared with other processes is divided by the number of processes sharing each page.
- USS is the unique set size: the amount of memory that is private to the process and not shared with any other.

Grid Engine is able to manage resource limits per job:
- CPU time / wallclock
- memory (vmem, rss and real memory consumption)
- file sizes

Traditional ulimits vs cgroups:
- ulimits (aggregate per job): vmem and RSS, verbose logs when a job fails, notion of hard and soft limits.
- cgroups: memory.soft_limit_in_bytes and memory.limit_in_bytes. These limits correspond to the physical amount of memory (which is close to the PSS metric).
- We performed some tests: hard limits work nicely; the lack of verbose logs needs to be checked; the soft limits need to be better understood. (Sketches of the PSS/RSS measurement and of these cgroup knobs follow this slide.)

RSS vs PSS
[Plots: RSS vs PSS for IN2P3-CC_MCORE and for IN2P3-CC.]
- ATLAS brokers jobs based on PSS memory. The PSS memory limits per queue are defined in AGIS:
  ANALY_IN2P3-CC: ... GB (min-max PSS)
  IN2P3-CC-all-ce-sge-long: ... GB
  IN2P3-CC_MCORE: ... GB
  IN2P3-CC_MCORE_HIMEM: ... GB
  IN2P3-CC_VVL: ... GB
- The memory profile of the jobs that end on the WN farm depends on the preselection of jobs done at the Panda-JEDI level against the PSS limits of the panda queue in AGIS.
- For ATLAS MULTICORE jobs the RSS memory consumption is too high (overestimated) and far from the real physical memory consumption, which is reflected by the PSS memory. ATLAS multicore jobs make heavy use of shared libraries (the gain from sharing depends on the type of job: simul, reco, repro, ..., etc.), which significantly reduces the memory footprint.
- For ATLAS single-core jobs, where there is no heavy sharing, maxpss and maxrss differ by only a few MBytes. Therefore, for ATLAS single-core jobs we can put a memory limit (soft/hard) based on RSS without any issue; for multicore jobs an RSS limit will cause a problem (see the measurement sketch after this slide).
- We should check the other physical-memory restrictions offered by GE cgroups (m_mem_free and m_mem_total). Defining dedicated multicore queues for the latter can also help in this direction.
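To make the RSS vs PSS gap above concrete, here is a minimal measurement sketch (Python, not from the original slides): it sums the Rss: and Pss: fields of /proc/<pid>/smaps over the processes of one payload. The PID list is a hypothetical placeholder; for a multicore payload that shares libraries, the summed RSS counts every shared page once per process, while the summed PSS splits it among the sharers.

    # Minimal sketch: compare summed RSS and PSS for the processes of one job.
    def smaps_totals(pid):
        """Return (rss_kb, pss_kb) summed over all mappings of one process."""
        rss = pss = 0
        with open("/proc/%d/smaps" % pid) as smaps:
            for line in smaps:
                if line.startswith("Rss:"):
                    rss += int(line.split()[1])
                elif line.startswith("Pss:"):
                    pss += int(line.split()[1])
        return rss, pss

    def job_totals(pids):
        """Sum RSS and PSS over all processes belonging to one payload."""
        tot_rss = tot_pss = 0
        for pid in pids:
            rss, pss = smaps_totals(pid)
            tot_rss += rss
            tot_pss += pss
        return tot_rss, tot_pss

    if __name__ == "__main__":
        worker_pids = [1234, 1235, 1236]   # hypothetical PIDs of one multicore payload
        rss_kb, pss_kb = job_totals(worker_pids)
        print("summed RSS: %.1f MB" % (rss_kb / 1024.0))   # shared pages counted once per process
        print("summed PSS: %.1f MB" % (pss_kb / 1024.0))   # shared pages split among the sharers

The cgroup knobs named above (memory.soft_limit_in_bytes, memory.limit_in_bytes) belong to the v1 memory controller; the sketch below shows the kind of manual test they allow. The mount point /sys/fs/cgroup/memory, the per-job group name and the limit values are assumptions; in production Grid Engine creates and populates these groups itself.

    # Minimal sketch (assumes the cgroup v1 memory controller is mounted at
    # /sys/fs/cgroup/memory; the job_<id> group name is our own choice).
    import os

    CG_MEM_ROOT = "/sys/fs/cgroup/memory"

    def limit_job_memory(job_id, pid, soft_bytes, hard_bytes):
        """Create a memory cgroup for one job, set soft/hard limits, attach a PID."""
        group = os.path.join(CG_MEM_ROOT, "job_%s" % job_id)
        if not os.path.isdir(group):
            os.mkdir(group)
        with open(os.path.join(group, "memory.soft_limit_in_bytes"), "w") as f:
            f.write(str(soft_bytes))   # reclaimed first under memory pressure
        with open(os.path.join(group, "memory.limit_in_bytes"), "w") as f:
            f.write(str(hard_bytes))   # hard ceiling: exceeding it triggers the OOM killer
        with open(os.path.join(group, "cgroup.procs"), "w") as f:
            f.write(str(pid))          # move the job's process into the group

    # Example (requires root, hypothetical values): 6 GB soft / 8 GB hard
    # limit_job_memory("42", 12345, 6 * 1024**3, 8 * 1024**3)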
At the next CC downtime (20 Sep 2016):
- Memory limits will switch to RSS for all VOs; no issue with serial ATLAS jobs.
- We propose to the batchmaster to introduce new dedicated queues in order to handle the RSS vs PSS situation for ATLAS multicore jobs.
- cgroups tests should be continued.

GPU at CC-IN2P3
GPU material:
- 10 brand-new machines, Dell C4130: 2x Xeon E5-2640v3, 16 physical cores, no HT, 128 GB RAM.
- 2 Nvidia Tesla K80, i.e. 4x Nvidia GK210 GPUs with 12 GB GDDR5 each.
- Infiniband QDR as a private interconnect between the workers (future feature).

Grid Engine:
- Four dedicated queues on Grid Engine support multicore jobs: mc_gpu_interactive, mc_gpu_long, mc_gpu_longlasting, mc_gpu_medium.
- Parallel jobs with GPU binding in the future.
- mc_gpu_interactive indicative limits: slots 8 (?), tmpdir /scratch, h_cpu 24h, h_fsize 20G, h_rss 8G.
- Please open an OTRS ticket in order to be granted access to those queues!
- Documentation: please avoid having multiple qlogin sessions for the same user (GE bug).
- CUDA_HOME=/opt/cuda-7.5/samples (a minimal device-visibility check is sketched at the end of this section).
- Details about the usage by the ATLAS VO should be discussed.
- Libraries already installed at the moment: cuda-cublas, cuda-cufft, cuda-cudart, cuda-curand, cuda-cusolver, cuda-cusparse, cuda-npp, cuda-nvrtc. Some basic OpenCL libraries will follow.

Various issues and transitions:
- Frontier and site squids: plan for the migration before the next downtime.
- Replacement of the FAX xrootd node: in progress.
- AFSGROUP: switching to read-only on Monday 12 Sep; later we will copy the data.
- SPS to HPSS development effort: in progress. There is no issue with the extra buffer space for the migration; at any time we can add an offset and/or split the graphs.
- New Panda movers: first test with IN2P3-CC-T3_VM01 (on test), https://twiki.cern.ch/twiki/bin/view/panda/sitemovers. We have started to run some I/O-demanding real jobs on IN2P3-CC-T3_VM02 (80 % CPU-demanding jobs, 20 % I/O).
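As a quick sanity check on the new GPU workers, the sketch below (not from the slides) loads the CUDA 7.5 runtime with ctypes and asks how many devices are visible. The toolkit root /opt/cuda-7.5 and its lib64/ subdirectory are assumptions based on the CUDA_HOME value quoted above; on the Dell C4130 nodes described in this section it should report 4, one per GK210 die.

    # Minimal sketch: ask the CUDA runtime how many GPUs this worker exposes.
    import ctypes
    import os

    cuda_root = "/opt/cuda-7.5"   # assumed toolkit root (the slide's CUDA_HOME points at its samples/)
    libcudart = ctypes.CDLL(os.path.join(cuda_root, "lib64", "libcudart.so"))

    count = ctypes.c_int(0)
    err = libcudart.cudaGetDeviceCount(ctypes.byref(count))
    if err != 0:
        raise RuntimeError("cudaGetDeviceCount failed with error code %d" % err)
    print("CUDA devices visible on this worker: %d" % count.value)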