MINT-1T: Scaling Open-Source Multimodal Data by 10x

Training frontier large multimodal models (LMMs) requires large-scale datasets containing free-form interleaved sequences of images and text. Open-source LMMs have evolved rapidly, but large-scale open-source multimodal interleaved datasets remain in very short supply. The importance of these datasets is hard to overstate, because they are the foundation for building advanced AI systems that can understand and generate content across modalities. Without an adequate supply of comprehensive interleaved datasets, the prospects for developing more sophisticated and capable LMMs are significantly hindered. These datasets let models learn from diverse inputs, which makes them more versatile and effective across applications. The scarcity also poses a challenge for the open-source community, which relies on shared resources to drive innovation and collaboration.
Open-source LMMs have made significant strides in recent years, but their growth is held back by the limited availability of large-scale interleaved datasets. Overcoming this obstacle requires a concerted effort to curate, annotate, and release more comprehensive datasets that can support the ongoing development and refinement of multimodal models. Creating and distributing these datasets also means clearing several technical and logistical hurdles. Data collection must be extensive and representative of the varied contexts in which LMMs will be deployed. Annotation requires care to ensure that interleaved image and text sequences are aligned in a way that strengthens what the model can learn. Keeping the datasets open source further requires addressing legal and ethical considerations around data privacy and usage rights. Expanding the availability of high-quality, large-scale multimodal interleaved datasets is essential for the future of AI research and development. By addressing the current scarcity, the AI community can foster greater innovation and collaboration, leading to more powerful and versatile LMMs that can tackle complex, real-world problems.
Against this background, MINT-1T was built as the largest and most diverse open-source multimodal interleaved dataset to date. MINT-1T is ten times the scale of existing open-source datasets, with one trillion text tokens and 3.4 billion images. It also introduces sources that had not previously been released in this form, such as PDF files and ArXiv papers. Because multimodal interleaved datasets do not scale easily, it matters that MINT-1T shares its data curation process so that others can experiment with such information-rich variants. MINT-1T also validates its method: LMMs trained on MINT-1T are competitive, if only modestly so, with models trained on the previous state of the art, OBELICS.
MINT-1T: A Multimodal Dataset with One Trillion Tokens
Large open-source pre-training datasets have been pivotal for the research community, both for exploring data engineering and for training transparent open-source models. In the text domain, early works such as C4 and The Pile played a key role in enabling the community to train the first open-source large language models, including GPT-J and GPT-Neo. These foundational efforts also paved the way for later improvements in data filtering methods and scaling. Similarly, in the image-text space, large-scale open-source datasets have spurred better data curation methods such as Data Filtering Networks and T-MARS. Frontier labs are now visibly shifting toward training large multimodal models (LMMs), which require extensive multimodal interleaved datasets made up of free-form sequences of images and text. As the capabilities of frontier models advance rapidly, a significant gap is opening in multimodal training data between closed-source and open-source models. Current open-source multimodal interleaved datasets are sourced primarily from HTML documents, so they are smaller and less diverse than their text-only counterparts, which limits the breadth and variety of the data. This limitation impedes the development of robust open-source LMMs and creates a capability gap between open-source and closed-source models.
To close this gap, MINT-1T was created as the largest and most diverse open-source multimodal interleaved dataset to date. MINT-1T contains a total of one trillion text tokens and 3.4 billion images, drawn from varied sources such as HTML, PDFs, and ArXiv. Before MINT-1T, the largest open-source dataset in this area was OBELICS, which contained 115 billion text tokens and 353 million images, all sourced from HTML.
The contributions of MINT-1T are as follows:
- Data engineering: Scaling multimodal interleaved data poses more engineering challenges than building text-only or image-text pair datasets. It is important to handle much larger document sizes and to preserve the original ordering of images and text.
- Diversity: MINT-1T is the first dataset in the multimodal interleaved space to gather high-quality multimodal documents at scale from sources such as CommonCrawl PDFs and ArXiv.
- Model experiments: Experiments show that LMMs trained on MINT-1T match, and potentially surpass, the performance of models trained on OBELICS, the best existing open-source dataset, while offering a tenfold increase in scale.
MINT-1T: Constructing the Dataset
MINT-1T curates a large-scale open-source dataset that draws on more diverse sources of interleaved documents, such as PDFs and ArXiv papers. This section describes how MINT-1T sources multimodal documents, filters low-quality content, deduplicates data, and removes not-safe-for-work (NSFW) and otherwise undesirable content. The final dataset comprises 922 billion (B) HTML tokens, 106B PDF tokens, and 9B ArXiv tokens.
Sourcing Large Quantities of Multimodal Documents
HTML Pipeline
MINT-1T follows the OBELICS method of extracting interleaved multimodal documents from CommonCrawl WARC files by parsing the DOM tree of each WARC entry. OBELICS only processed documents from the February 2020 to February 2023 CommonCrawl dumps, whereas MINT-1T expands the document pool to include HTML documents from May 2017 to April 2024 (full dumps from October 2018 to April 2024, plus partial dumps from earlier years). Like OBELICS, MINT-1T filters out documents that contain no images, documents with more than thirty images, and documents with any image whose URL includes an inappropriate substring such as logo, avatar, porn, or xxx.
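The document-level checks described above can be sketched in a few lines. This is a minimal illustration, not the MINT-1T implementation: the function name, the constants, and the case-insensitive substring match are assumptions made for the example, and the thirty-image limit and substring list are taken from the description of the HTML pipeline.

```python
from typing import List

# Thresholds and substrings follow the prose description of the HTML pipeline.
# The names below are illustrative and do not come from the MINT-1T code base.
MAX_IMAGES = 30
BLOCKED_SUBSTRINGS = ("logo", "avatar", "porn", "xxx")


def keep_html_document(image_urls: List[str]) -> bool:
    """Return True if a document passes the image-count and URL checks."""
    # Reject documents with no images or with more than MAX_IMAGES images.
    if len(image_urls) == 0 or len(image_urls) > MAX_IMAGES:
        return False
    for url in image_urls:
        lowered = url.lower()  # assumption: matching is case-insensitive
        if any(token in lowered for token in BLOCKED_SUBSTRINGS):
            return False
    return True
```

A document with a single ordinary photo URL passes, while an empty document, a document with 31 images, or one containing a URL such as `https://a.com/site-logo.png` is rejected.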
PDF Pipeline
MINT-1T sources PDF documents from CommonCrawl WAT files in the dumps from February 2023 to April 2024. First, all PDF links are extracted from these dumps. MINT-1T then attempts to download and read each PDF with PyMuPDF, discarding PDFs larger than 50 MB (which are likely to contain large images) and PDFs longer than 50 pages. Pages without text are excluded, and a reading order is established for the remaining pages. The reading order is determined by finding the bounding boxes of all text blocks on a page, clustering the blocks by column, and ordering them from top left to bottom right. Images are merged into the sequence based on their proximity to text blocks on the same page.
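The reading-order step can be illustrated with a small, dependency-free sketch. It assumes each text block is a tuple `(x0, y0, x1, y1, text)` with the origin at the top-left of the page, which matches the first five fields of the block tuples PyMuPDF returns from `page.get_text("blocks")`. The clustering rule (grouping blocks whose left edges lie within a hypothetical `column_gap` of the first block in the column) is an assumption for illustration; the article does not specify how columns are detected.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text), origin at the top-left corner of the page.
Block = Tuple[float, float, float, float, str]


def reading_order(blocks: List[Block], column_gap: float = 50.0) -> List[str]:
    """Cluster blocks into columns by left edge, then read columns left to
    right and blocks top to bottom inside each column."""
    columns: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b[0]):
        # Stay in the current column while the left edge is within column_gap
        # of that column's first block; otherwise start a new column.
        if columns and block[0] - columns[-1][0][0] <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered: List[str] = []
    for column in columns:
        for block in sorted(column, key=lambda b: b[1]):
            ordered.append(block[4])
    return ordered
```

For a two-column page, the blocks of the left column are emitted top to bottom before any block of the right column. Images could be slotted into the same sequence by treating their bounding boxes like text blocks, though that step is not shown here.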
ArXiv Pipeline
MINT-1T builds ArXiv interleaved documents from LaTeX source code, using TexSoup to find figure tags and interleave the images with the paper text. For multi-file papers, MINT-1T identifies the main TeX file and replaces each input tag with the contents of the referenced file. The LaTeX code is cleaned up by removing imports, bibliographies, tables, and citation tags. Because ArXiv is already a highly curated data source, no additional filtering or deduplication is performed.
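The multi-file step, replacing input tags with file contents, can be sketched as follows. The text names TexSoup for parsing; this sketch deliberately swaps in plain regular expressions so that it runs without extra dependencies, so it is an approximation rather than the authors' method. The function name, the dictionary-of-files representation, and the recursion depth guard are all assumptions.

```python
import re
from typing import Dict

# Matches \input{name}; real LaTeX also allows "\input name", ignored here.
INPUT_TAG = re.compile(r"\\input\{([^}]+)\}")


def expand_inputs(main_name: str, files: Dict[str, str], depth: int = 0) -> str:
    """Recursively replace \\input{name} with the contents of files[name]."""
    if depth > 10:  # guard against circular includes
        return files.get(main_name, "")

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1).strip()
        # LaTeX allows omitting the .tex extension.
        if name not in files and name + ".tex" in files:
            name = name + ".tex"
        if name not in files:
            return match.group(0)  # leave unresolved tags untouched
        return expand_inputs(name, files, depth + 1)

    return INPUT_TAG.sub(substitute, files.get(main_name, ""))
```

Unresolved tags are left in place rather than dropped, which makes missing files easy to spot in the flattened output.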
Text Quality Filtering
Following the practices established by RefinedWeb, Dolma, and FineWeb, MINT-1T avoids model-based heuristics for text filtering. First, non-English documents are eliminated using Fasttext's language identification model with a confidence threshold of 0.65. Documents whose URLs contain NSFW substrings are also removed to exclude pornographic and undesirable content. RefinedWeb's text filtering methods are then applied, specifically removing documents with too many duplicated n-grams and documents identified as low quality by the MassiveText rules.
Image Filtering
After curating the PDF and HTML files, MINT-1T attempts to download every image URL in the HTML dataset, discards links that cannot be retrieved, and removes documents that have no valid image links. Images smaller than 150 pixels are discarded to avoid noisy images such as logos and icons, and images larger than 20,000 pixels are also removed because they usually correspond to off-topic images. For HTML documents, images with an aspect ratio greater than two are removed to filter out low-quality images such as advertisement banners. For PDFs, the threshold is raised to three to preserve scientific figures and tables.
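The image rules above translate into a short predicate. This is a sketch under stated assumptions: the article does not say which dimension the 150-pixel and 20,000-pixel limits apply to, so the example applies them to each side, and the aspect ratio is computed as the longer side divided by the shorter side. The function name and the `source` argument are hypothetical.

```python
def keep_image(width: int, height: int, source: str) -> bool:
    """Apply size and aspect-ratio checks; source is "html" or "pdf"."""
    if width <= 0 or height <= 0:
        return False
    # Assumption: the pixel limits apply to each side of the image.
    if min(width, height) < 150 or max(width, height) > 20_000:
        return False
    # HTML images use a ratio limit of 2; PDF images use 3 so that wide
    # scientific figures and tables survive.
    max_ratio = 3.0 if source == "pdf" else 2.0
    ratio = max(width, height) / min(width, height)
    return ratio <= max_ratio
```

Under these assumptions a 1000 x 400 image (ratio 2.5) is dropped from an HTML document but kept in a PDF.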
The figure above shows how MINT-1T uniquely includes data from PDFs and ArXiv documents in addition to HTML sources.
Safety Filtering
- NSFW image filtering: MINT-1T applies an NSFW image detector to every image in the dataset. If a document contains even one NSFW image, the entire document is discarded.
- Removal of personally identifiable information: To reduce the risk of leaking personal data, email addresses and IP addresses in the text data are anonymized. Emails are replaced with a placeholder template, and IP addresses are replaced with randomly generated, non-functional IPs.
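The anonymization step can be sketched with two regular expressions. Everything specific here is an assumption for illustration: the placeholder address, the simplified email and IPv4 patterns, and the choice to draw replacement addresses from the 192.0.2.0/24 block, which RFC 5737 reserves for documentation and which therefore never routes. The article does not say how MINT-1T generates its non-functional IPs.

```python
import random
import re

# Simplified patterns; they are not a full RFC-compliant email or IP grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL_PLACEHOLDER = "email@example.com"  # hypothetical template


def random_reserved_ip(rng: random.Random) -> str:
    # 192.0.2.0/24 is reserved for documentation (RFC 5737).
    return f"192.0.2.{rng.randint(1, 254)}"


def anonymize(text: str, seed: int = 0) -> str:
    """Replace emails with a template and IPv4 addresses with random
    non-routable addresses."""
    rng = random.Random(seed)
    text = EMAIL_RE.sub(EMAIL_PLACEHOLDER, text)
    return IPV4_RE.sub(lambda _m: random_reserved_ip(rng), text)
```

Seeding the generator keeps the output reproducible for a given input, which is convenient when re-running a pipeline stage.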
Deduplication
MINT-1T performs paragraph-level and document-level text deduplication within each CommonCrawl snapshot, and image deduplication to remove repetitive, uninformative images such as icons and logos. All deduplication steps are run separately for each data source.
Paragraph and Document Deduplication
Following Dolma's methodology, MINT-1T uses a Bloom filter for efficient text deduplication, with the false positive rate set to 0.01, and deduplicates 13-gram paragraphs (delimited by double newlines) from each document. If more than 80% of a document's paragraphs are duplicates, the entire document is discarded.
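To make the Bloom-filter step concrete, here is a small self-contained sketch. It is a simplification under stated assumptions: it deduplicates whole paragraphs split on double newlines rather than 13-grams, it uses a hand-rolled Bloom filter instead of whatever library Dolma or MINT-1T use, and the class and function names are hypothetical. The 0.01 false positive rate and the 80% rule come from the description above.

```python
import hashlib
import math
from typing import List, Optional


class BloomFilter:
    def __init__(self, capacity: int, error_rate: float = 0.01) -> None:
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.size = max(8, int(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str) -> List[int]:
        # Double hashing derived from a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add_and_check(self, item: str) -> bool:
        """Insert item; return True if it was (probably) seen before."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def dedup_document(text: str, bloom: BloomFilter, max_dup: float = 0.8) -> Optional[str]:
    """Drop repeated paragraphs; return None if more than max_dup are duplicates."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return None
    kept = [p for p in paragraphs if not bloom.add_and_check(p.strip())]
    if (len(paragraphs) - len(kept)) / len(paragraphs) > max_dup:
        return None
    return "\n\n".join(kept)
```

Because a Bloom filter never forgets an item but can report false positives, a small fraction of unique paragraphs may be dropped; the configured error rate bounds how often that happens once the filter holds `capacity` items.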
Removing Common Boilerplate Text
After paragraph deduplication, MINT-1T removes short, common boilerplate sentences from HTML documents, such as "Skip to content" or "Blog Archive". This is done by running exact paragraph deduplication on 2% of each CommonCrawl snapshot, in line with CCNet's practice, which reliably removes most common boilerplate text.
The figure above shows the MINT-1T filtering process and how tokens are removed across the data pipelines for HTML, PDFs, and ArXiv papers.
Image Deduplication
Within each CommonCrawl snapshot, MINT-1T removes frequently occurring images based on their SHA256 hashes. Rather than applying strict deduplication, it follows the Multimodal-C4 practice of removing only images that appear more than ten times within a snapshot. As in OBELICS, images repeated within a single document are removed, and only the first occurrence is kept.
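The two image rules (drop images that are too frequent in a snapshot, keep only the first occurrence inside a document) can be sketched as below. The in-memory dictionary representation, the function names, and the choice to count every occurrence, including repeats inside one document, toward the snapshot frequency are assumptions for illustration.

```python
import hashlib
from collections import Counter
from typing import Dict, List

MAX_OCCURRENCES = 10  # images seen more often than this in a snapshot are dropped


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def dedup_images(snapshot: Dict[str, List[bytes]]) -> Dict[str, List[bytes]]:
    """snapshot maps a document id to its images (raw bytes) in reading order."""
    # First pass: count how often each image hash occurs in the snapshot.
    counts = Counter(
        sha256_hex(img) for images in snapshot.values() for img in images
    )
    result: Dict[str, List[bytes]] = {}
    # Second pass: drop frequent images and within-document repeats.
    for doc_id, images in snapshot.items():
        seen_in_doc = set()
        kept = []
        for img in images:
            digest = sha256_hex(img)
            if counts[digest] > MAX_OCCURRENCES or digest in seen_in_doc:
                continue
            seen_in_doc.add(digest)
            kept.append(img)
        result[doc_id] = kept
    return result
```

At real scale the counting pass would run as a distributed aggregation over hashes rather than in a single process, but the logic is the same.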
Infrastructure
Throughout data processing, MINT-1T had access to an average of 2,350 CPU cores, drawn from a mix of 190-processor and 90-processor nodes. In total, building the dataset used approximately 4.2 million CPU hours.
Comparing Document Composition in MINT-1T and OBELICS
When evaluating the composition of interleaved datasets, two key characteristics are examined: the distribution of text tokens per document and the number of images per document. For this analysis, 50,000 documents were randomly sampled from OBELICS and from each data source in MINT-1T. The GPT-2 tokenizer was used to count text tokens. Outliers were removed by excluding documents that fell outside 1.5 times the interquartile range for the number of text tokens and images. As the following figure shows, the HTML subset of MINT-1T closely matches the token distribution seen in OBELICS. Documents sourced from PDFs and ArXiv, however, tend to be longer on average than HTML documents, which highlights the benefit of sourcing data from varied sources. Figure 5 examines image density across all documents and shows that PDF and ArXiv documents contain more images than HTML documents, with ArXiv samples being the most image-dense.
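The outlier rule can be written out explicitly. This sketch assumes the usual Tukey fences, meaning a value is kept if it lies between Q1 - 1.5 x IQR and Q3 + 1.5 x IQR, and it assumes a document must fall inside the fences for both text tokens and image count. The article does not state these details, and the function names are hypothetical.

```python
import statistics
from typing import List, Tuple


def iqr_bounds(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) fences at k times the interquartile range."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(docs: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """docs holds (num_text_tokens, num_images); keep rows inside both fences."""
    tok_lo, tok_hi = iqr_bounds([d[0] for d in docs])
    img_lo, img_hi = iqr_bounds([d[1] for d in docs])
    return [
        d for d in docs
        if tok_lo <= d[0] <= tok_hi and img_lo <= d[1] <= img_hi
    ]
```

Note that `statistics.quantiles` uses the "exclusive" method by default, so the exact fences can differ slightly from those produced by other libraries.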
How Do Different Data Sources Improve Document Diversity?
A key motivation for expanding the pool of multimodal documents beyond HTML is better domain coverage. To quantify the diversity and depth of this coverage, a Latent Dirichlet Allocation (LDA) model was trained on 100,000 documents sampled from the OBELICS dataset, the HTML subset of MINT-1T, and the PDF subset of MINT-1T (excluding ArXiv), yielding 200 topics. GPT-4 was then used to classify each topic's word set and identify the dominant domains, such as Health and Medicine, Science, Business, Humanities, and History, based on the MMMU domains. The analysis reveals clear trends in the domain distribution:
- OBELICS: This dataset is markedly concentrated in "Humanities and Social Sciences". This may stem from its data construction process, which filters out documents that do not resemble Wikipedia articles and may therefore shift the distribution toward general knowledge and humanities-focused content.
- HTML subset of MINT-1T: In contrast to OBELICS, the HTML subset of MINT-1T is not strongly skewed toward any particular domain, which suggests a broader and more balanced domain representation.
- PDF subset of MINT-1T: The PDF documents in MINT-1T contain a higher share of "Science and Technology" documents. This trend likely reflects the nature of scientific communication, where PDF is the preferred format for sharing detailed research papers and technical reports.
MINT-1T: Results and Experiments
In all experiments, models are trained on 50% image-text captioning batches and 50% multimodal interleaved batches. Up to 2048 multimodal tokens are sampled from each interleaved document, and 340 tokens from each image-text sample. As in Flamingo, an "end" token is added to mark the end of an adjacent image-text sequence. During training, 50% of single-image interleaved documents are randomly dropped in order to upsample multi-image documents. The image-text dataset is a mixture of internally curated caption datasets. The model's ability to reason over multimodal interleaved sequences is assessed through its in-context learning ability and its multi-image reasoning performance.
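Dropping half of the single-image documents is a simple way to tilt the training stream toward multi-image documents. A minimal sketch, assuming each document is a dictionary with an `images` list (a hypothetical representation) and that the drop is an independent coin flip per document:

```python
import random
from typing import Dict, Iterable, Iterator


def upsample_multi_image(
    docs: Iterable[Dict], rng: random.Random, drop_prob: float = 0.5
) -> Iterator[Dict]:
    """Yield documents, randomly skipping single-image ones with drop_prob."""
    for doc in docs:
        if len(doc["images"]) == 1 and rng.random() < drop_prob:
            continue
        yield doc
```

With equal numbers of single-image and multi-image documents going in, roughly two thirds of the documents coming out contain multiple images.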
The figure above shows the share of documents from each MMMU domain for OBELICS and for the subsets of MINT-1T.
In-context learning: Models are evaluated on four-shot and eight-shot in-context learning performance across captioning benchmarks (COCO (Karpathy test) and TextCaps (validation)) and visual question answering datasets (VQAv2 (validation), OK-VQA (validation), TextVQA (validation), and VizWiz (validation)). Demonstrations are sampled at random from the training set. Scores are averaged over multiple evaluation runs with randomized demonstrations to account for sensitivity to the chosen prompts. Different prompts are ablated for each task, and the best-performing one is selected.
Multi-image reasoning: Models are evaluated on MMMU (which contains both single-image and multi-image questions) and Mantis-Eval (all multi-image questions) to probe multi-image reasoning ability beyond the in-context learning evaluations.
Training on HTML Documents
First, the HTML portion of MINT-1T is compared with OBELICS, since OBELICS is the previous leading interleaved dataset and is also curated from HTML documents. Two models are trained, one on the HTML portion of MINT-1T and one on OBELICS, each for a total of 10B multimodal tokens, and their in-context learning performance is evaluated. The following table shows 4-shot and 8-shot performance on common benchmarks. The model trained on MINT-1T HTML documents outperforms OBELICS on VQA tasks but falls behind on captioning benchmarks. On average, OBELICS performs slightly better than MINT-1T (HTML).
Adding PDF and ArXiv Documents
Next, training is run on the full MINT-1T data sources, mixing HTML, PDF, and ArXiv documents. Interleaved documents are sampled 50% from HTML, 45% from PDFs, and 5% from ArXiv. The model is trained for a total of 10B multimodal tokens. As the table above shows, the model trained on the full MINT-1T data mixture outperforms both OBELICS and MINT-1T (HTML) on most in-context learning benchmarks. On the more complex multimodal reasoning benchmarks, the MINT-1T model outperforms OBELICS on MMMU but performs worse on Mantis-Eval.
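The 50/45/5 source mixture amounts to weighted sampling over data sources. A minimal sketch, assuming each interleaved document is drawn by first picking a source independently with these weights; the dictionary and function name are hypothetical, and the article does not describe how the actual data loader implements the mixture.

```python
import random
from typing import List

# Sampling weights from the description of the full MINT-1T mixture.
SOURCE_WEIGHTS = {"html": 0.50, "pdf": 0.45, "arxiv": 0.05}


def sample_sources(num_docs: int, seed: int = 0) -> List[str]:
    """Pick a data source for each of num_docs interleaved documents."""
    rng = random.Random(seed)
    names = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=num_docs)
```

Over a large number of draws the empirical shares converge to the configured weights, so ArXiv documents make up about one in twenty interleaved samples.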
Fine-Grained Trends
How Does In-Context Learning Performance Scale with Demonstrations?
In-context learning performance is evaluated when the model is prompted with one to eight demonstrations. For each evaluation benchmark, a single trial is run per shot count. As the following figure shows, the model trained on MINT-1T outperforms the models trained on the HTML subset of MINT-1T and on OBELICS at every shot count. The MINT-1T (HTML) model performs slightly worse than OBELICS.
Performance on Captioning and Visual Question Answering Tasks
The following figure shows average in-context learning performance on captioning and visual question answering (VQA) benchmarks. OBELICS outperforms all MINT-1T variants on the four-shot captioning benchmarks and is slightly behind MINT-1T on eight-shot captioning. MINT-1T, however, significantly outperforms both baselines on VQA benchmarks, and MINT-1T (HTML) also outperforms OBELICS on VQA tasks.
Performance on Different Domains
MINT-1T includes diverse domains in order to improve model generalization. The earlier figure breaks down MMMU performance by domain. Except in the Business domain, MINT-1T outperforms both OBELICS and MINT-1T (HTML). The gains in the Science and Technology domains are attributed to the prevalence of these domains in ArXiv and PDF documents.
Final Thoughts
This article has covered MINT-1T, the largest and most diverse open-source multimodal interleaved dataset to date. MINT-1T is ten times the scale of existing open-source datasets, with one trillion text tokens and 3.4 billion images. It includes sources that had not previously been released in this form, such as PDF files and ArXiv papers. Because multimodal interleaved datasets do not scale easily, it matters that MINT-1T shares its data curation process so that others can experiment with such information-rich variants. MINT-1T also validates its method: LMMs trained on MINT-1T are competitive, if only modestly so, with models trained on the previous state of the art, OBELICS.