/robowaifu/ - DIY Robot Wives

Advancing robotics to a point where anime catgrill meidos in tiny miniskirts are a reality

Datasets for Training AI Robowaifu Technician 04/09/2020 (Thu) 21:36:12 No.2300
Training AI and robowaifus requires immense amounts of data. It'd be useful to curate books and datasets to feed into our models, or possibly build our own corpora to train on. The quality of data is really important: garbage in, garbage out. The GPT-2 pre-trained models, for example, are riddled with 'Advertisement' after paragraphs. Perhaps we can also discuss and share scripts for cleaning and preparing data here, and anything else related to datasets.

To start, here are some large datasets I've found useful for training chatbots:
>The Stanford Question Answering Dataset
https://rajpurkar.github.io/SQuAD-explorer/
>Amazon QA
http://jmcauley.ucsd.edu/data/amazon/qa/
>WikiText-103
https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
>Arxiv Data from 24,000+ papers
https://www.kaggle.com/neelshah18/arxivdataset
>NIPS papers
https://www.kaggle.com/benhamner/nips-papers
>Frontiers in Neuroscience Journal Articles
https://www.kaggle.com/markoarezina/frontiers-in-neuroscience-articles
>Ubuntu Dialogue Corpus
https://www.kaggle.com/rtatman/ubuntu-dialogue-corpus
>4plebs.org data dump
https://archive.org/details/4plebs-org-data-dump-2020-01
>The Movie Dialog Corpus
https://www.kaggle.com/Cornell-University/movie-dialog-corpus
>Common Crawl
https://commoncrawl.org/the-data/
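As a minimal sketch of the kind of cleaning script suggested above — stripping the stray 'Advertisement' lines that plague GPT-2-style web-scraped text. The exact boilerplate patterns are an assumption; extend the set for your own corpus:

```python
import re

# Lines consisting only of boilerplate tokens commonly left over from
# web scrapes (this pattern set is an assumption; extend as needed).
BOILERPLATE = re.compile(r"^\s*(Advertisement|ADVERTISEMENT|Continue reading.*)\s*$")

def clean_text(text):
    """Drop boilerplate-only lines and collapse runs of blank lines."""
    kept = [ln for ln in text.splitlines() if not BOILERPLATE.match(ln)]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(kept))
    return collapsed.strip()

sample = "First paragraph.\n\nAdvertisement\n\nSecond paragraph."
print(clean_text(sample))  # 'Advertisement' line is gone
```

Running a pass like this over each file before training is cheap, and the pattern list can grow as new artifacts show up in the corpus.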
>>2300
>digits
A propitious start.
>Perhaps we can also discuss and share scripts for cleaning and preparing data here and anything else related to datasets.
I once wrote a crawler that walked through all the bibles published on Biblehub & Biblegateway, and then stole 'read' all their thousands and thousands of pages (ie, downloaded them all), then parsed every one of them out into JSON-like databases specific to each translation. Once completed, this meant I had a way to directly correlate the various translations both with each other and with the original words in the extant manuscripts.

I say all this simply to point out that working with textual data is both important and complex for anything non-trivial. Many people just assume that dealing with text is easy and simple, yet both assumptions are incorrect. There is a lot of work required to clean & normalize textual data for our robowaifus first, because as you said, GIGO.

High performance of the processing system seems pretty important to me, so when I created the Biblebot software I wrote everything in standard C++, which could literally work through the entire processing load in just seconds once I had all the thousands of files downloaded locally. Slower methods such as Python would also suffice, I imagine, but the basic need would remain: the data must be cleaned and normalized.

Further, I began work on some graphical concepts to both display & refine the complex semantic and other interrelationships between words and phrases. This toolset hasn't materialized yet, but the ideas are basically valid, I believe. The format for the interrelationships was based on Terry Halpin's Object Role Modeling, as specified in pic related.

I imagine these experiences can at least be informative for this project, if not of some practical use. Good luck Anon.
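The cross-translation correlation described above can be sketched in a few lines: store each translation as a dict keyed by a canonical (book, chapter, verse) reference, and parallel lookup falls out for free. The translation names and verse text below are placeholder sample data, not the anon's actual database:

```python
# Each translation keyed by a canonical verse reference, so any verse
# can be pulled up across all translations in O(1) per translation.
# (Sample data is illustrative only.)
translations = {
    "KJV": {("John", 3, 16): "For God so loved the world..."},
    "WEB": {("John", 3, 16): "For God so loved the world, that he gave..."},
}

def parallel(ref):
    """Return {translation_name: text} for every translation containing ref."""
    return {name: verses[ref]
            for name, verses in translations.items()
            if ref in verses}

rows = parallel(("John", 3, 16))
print(len(rows))  # prints 2: both sample translations contain the verse
```

The same keying scheme extends naturally to aligning scraped posts, subtitles, or any other parallel corpora.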
>>2303 I just inadvertently discovered Dr. Halpin created a third book on the topic.
>
>Object-Role Modeling (ORM) is a fact-based approach to data modeling that expresses the information requirements of any business domain simply in terms of objects that play roles in relationships. All facts of interest are treated as instances of attribute-free structures known as fact types, where the relationship may be unary (e.g. Person smokes), binary (e.g. Person was born on Date), ternary (e.g. Customer bought Product on Date), or longer. Fact types facilitate natural expression, are easy to populate with examples for validation purposes, and have greater semantic stability than attribute-based structures such as those used in Entity Relationship Modeling (ER) or the Unified Modeling Language (UML).
>All relevant facts, constraints and derivation rules are expressed in controlled natural language sentences that are intelligible to users in the business domain being modeled. This allows ORM data models to be validated by business domain experts who are unfamiliar with ORM's graphical notation. For the data modeler, ORM's graphical notation covers a much wider range of constraints than can be expressed in industrial ER or UML class diagrams, and thus allows rich visualization of the underlying semantics.
>Suitable for both novices and experienced practitioners, this book covers the fundamentals of the ORM approach. Written in easy-to-understand language, it shows how to design an ORM model, illustrating each step with simple examples. Each chapter ends with a practical lab that discusses how to use the freeware NORMA tool to enter ORM models and use it to automatically generate verbalizations of the model and map it to a relational database.
>>2307 this book has a later companion workbook as well. >
>>2303 I'm crawling and scraping the web at the moment for imageboard threads and arranging the replies into sequences to feed in for training. Fortunately it's only a few thousand threads for now and I can use BeautifulSoup in Python to process the pages, but filtering the posts and connecting them is something else. People link posts in crazy ways and reply to multiple posts at the same time, and there are empty posts, cross-thread and board links, spam, nonsense and copypasta. Separating the good data from the bad is really difficult. I'd like to be able to do sentiment analysis too and really focus in on what I'm looking to train on. I'll have to check these books out and >>2251

When I make games I know how to structure everything, but when it comes to processing text I have no idea what I'm doing and end up going full spaghetti, with no code to reuse afterwards. What are some good libraries for parsing JSON and CSV in C/C++? Python isn't gonna cut it once I get to larger datasets. I know libxml2 for XML and just found the MeTA Toolkit for text analysis but don't know if it's any good: https://meta-toolkit.org/
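The reply-linking step described above — resolving >>12345 quote links into a graph while dropping dead, cross-thread and cross-board targets — can be prototyped with one regex once the posts are already extracted to plain text. A hedged sketch (the sample thread data is made up):

```python
import re

REPLY = re.compile(r">>(\d+)")

def reply_graph(posts):
    """Map each post id to the ids it replies to.

    `posts` is {post_id: post_text}. Quote links whose target isn't
    in this thread (dead links, cross-thread/board links) are dropped.
    """
    ids = set(posts)
    return {pid: [int(m) for m in REPLY.findall(text) if int(m) in ids]
            for pid, text in posts.items()}

thread = {
    2300: "Training AI requires immense amounts of data.",
    2303: ">>2300 A propitious start.",
    2309: ">>2303 I'm crawling imageboards. >>9999 dead link",
}
print(reply_graph(thread)[2309])  # [2303] -- the dead quote is filtered out
```

Walking this graph from each leaf back to the OP yields the reply sequences to feed in for training; posts that quote multiple targets simply branch.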
>>2309 >What are some good libraries for parsing JSON and CSV in C/C++? I don't know about CSV, but I used jsoncpp when I wrote BUMP, such as it is. When I wrote the biblebot, I simply wrote the parsers by hand using just the C++ standard libraries such as string, iostream, vector, & map. BeautifulSoup is amazing actually, and I'd advise you to at least try to use Python first before taking the plunge into Sepples. It's a beast.
>>2309 update: looks like the ACM book you linked from the robowaifu bread uses the meta-toolkit, so there's a nice convergence anon. might be worth looking into tbh.
>The Coursera course Text Retrieval and Search Engines uses MeTA in programming assignments available to thousands of students
>The Coursera course Text Mining and Analytics uses MeTA in its programming assignments as well
<An upcoming textbook Text Data Analysis and Management: A Practical Introduction to Text Mining and Information Retrieval showcases the MeTA toolkit with exercises and demos
>The UIUC course CS 410: Text Information Systems uses MeTA in some programming assignments
>The TIMAN Research Group from the UIUC Computer Science Department uses MeTA in their text mining research
Open file (37.49 KB 550x356 a cute.jpg)
Found this site full of anime subtitles: https://kitsunekko.net/dirlist.php?dir=subtitles%2F
They're in ASS format and don't have speakers labeled, but they'll save time transcribing dialog from anime. Soon we'll be the first in the world to talk with chatbots that imitate anime characters fairly well. What a time to be alive!

>>2310 I'm a big fan of Python, but once shit starts getting coded in a big loop everything goes to shit fast, even with libraries doing the heavy lifting. Processing these 2 million posts is gonna take a while and it sure isn't gonna scale well to 200+ million. We're gonna need all the beasts we can tame if we're gonna ride our robowaifus to the stars.
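Pulling plain dialog out of those ASS files is mostly line splitting: events live on `Dialogue:` lines, and with the default event format the text is the 10th comma-separated field, so a maxsplit handles commas inside the dialog itself. A sketch, assuming the default field order (files with a custom Format line would need the field index adjusted):

```python
import re

TAG = re.compile(r"{[^}]*}")  # ASS style-override tags like {\i1}

def extract_dialogue(ass_text):
    """Pull plain dialogue text out of an .ass subtitle file.

    Assumes the default event format, where the text is the 10th
    comma-separated field of each 'Dialogue:' line.
    """
    lines = []
    for line in ass_text.splitlines():
        if line.startswith("Dialogue:"):
            fields = line.split(",", 9)        # dialog text may contain commas
            if len(fields) == 10:
                text = TAG.sub("", fields[9])  # strip styling overrides
                lines.append(text.replace(r"\N", " ").strip())
    return lines

sample = r"Dialogue: 0,0:00:01.00,0:00:03.00,Default,,0,0,0,,{\i1}Hideki!{\i0} Wake up."
print(extract_dialogue(sample))  # ['Hideki! Wake up.']
```

Since speakers aren't labeled, the output is an unattributed line stream, which is still fine for plain dialog modeling.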
>>2312
>We're gonna need all the beasts we can tame if we're gonna ride our robowaifus to the stars.
OK, fair enough. I was just trying to steer you onto a less painful course. But yes, I figured out a few years ago that if we want to actually create real, usable robowaifus then C++ was the only practical choice, and I've been focused on it ever since. BTW I'm building the prereqs for MeTA rn on my box. I'm gonna give it a go, since I already have the text-processing textbook you mentioned.
>>2313
>prereqs
Arch Linux Build Guide

Arch Linux consistently has the most up to date packages due to its rolling release setup, so it's often the easiest platform to get set up on. To install the dependencies, run the following commands:

sudo pacman -Sy
sudo pacman -S clang cmake git icu libc++ make jemalloc zlib

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
CXX=clang++ cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.
https://meta-toolkit.org/setup-guide.html#arch-linux-build-guide
>>2313 Yeah, it's pretty much unavoidable once we get to microcontrollers.

>>2314 For some reason it wouldn't download this file (the server kept returning an empty response), but I found a mirror for it and dropped it into ./deps/icu-58.2/
https://ftp.osuosl.org/pub/blfs/conglomeration/icu/icu4c-58_2-src.tgz

Then it wouldn't build on Debian because it couldn't find xlocale.h, but I found a fix:

# from the build directory
sed -i 's/xlocale/locale/' deps/icu-58.2/src/ExternalICU/source/i18n/digitlst.cpp
https://github.com/meta-toolkit/meta/issues/195

Then it wouldn't build with Debian's default g++-8 because it needs g++-7:

rm CMakeCache.txt
CXX=g++-7 cmake ../ -DCMAKE_BUILD_TYPE=Release

Then everything built with all tests passed.
>>2317
>Yeah, it's pretty much unavoidable once we get to microcontrollers.
Actually, we'll probably need to wrap C code in libraries for C++ use for much of that particular use case. I meant it for the things C++ brings to the table: namely, great abstraction & generic programming at 99% of the performance of hand-coded assembler. And its ability to do this while still providing very good concurrency and parallelism is a mix that simply can't be beat. And trust me, we'll need every ounce of that power before we're finished rolling out the first prototypes.
>needing to downgrade the compiler
hmm. that's a surprise. good job figuring things out anon. btw i'd recommend you set up an Arch box (manjaro is a good choice for simplicity's sake) if you want to stay on the current edge of things.
>Datasets
>There are several public datasets that we've converted to the line_corpus format:
20newsgroups: https://meta-toolkit.org/data/20newsgroups.tar.gz
originally from here: http://qwone.com/~jason/20Newsgroups/
IMDB Large Movie Review Dataset: https://meta-toolkit.org/data/imdb.tar.gz
originally from here: http://ai.stanford.edu/~amaas/data/sentiment/
Any libsvm-formatted dataset can be used to create a forward_index: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
>sauce: https://meta-toolkit.org/overview-tutorial.html
word stemming tool used in the meta-toolkit:
https://snowballstem.org/
https://github.com/snowballstem
>>2320 located the 'official' sepples version, quite a bit simpler too tbh. https://github.com/smassung/porter2_stemmer
here's meta-toolkit's discourse for support, etc. https://forum.meta-toolkit.org/
Open file (128.20 KB 655x410 my-body-is-ready.png)
>>2312 >Soon we'll be the first in the world to talk with chatbots that imitate anime characters fairly well.
Alphabetical list of part-of-speech tags used in the Penn Treebank Project:

Number Tag Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
>sauce https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
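Once posts are tagged with these Penn Treebank tags (by any compatible tagger), the tag prefixes make content-word filtering trivial: NN* covers nouns, VB* verbs, JJ* adjectives, RB* adverbs. A sketch with a hand-made tagged sample for illustration:

```python
# Keep only content words from Penn-Treebank-tagged tokens.
# The tagged sample below is hand-made; in practice it would come
# from a tagger that emits this tagset.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs

def content_words(tagged):
    """Return words whose tag marks a content word."""
    return [word for word, tag in tagged if tag.startswith(CONTENT_PREFIXES)]

tagged = [("The", "DT"), ("robowaifu", "NN"), ("speaks", "VBZ"),
          ("fluently", "RB"), ("!", "SYM")]
print(content_words(tagged))  # ['robowaifu', 'speaks', 'fluently']
```

Filtering to content words before computing frequencies gives a much cleaner picture of what a post is actually about than raw token counts.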
Edited last time by Chobitsu on 04/11/2020 (Sat) 00:53:01.
>>2317
-I also had an issue with this dependency, namely the version. On my Arch box, the current pacman-installed ICU is ~v65, but the meta-toolkit's cmake config insists on an exact (lower) version.
-For Arch the fix is easy. In the cloned meta-toolkit repo's dir meta/deps/meta-cmake/ open the file FindOrBuildICU.cmake in a simple text editor and replace line #34 with:

if (BUILD_STATIC_ICU OR NOT ICU_VERSION OR NOT ICU_VERSION VERSION_GREATER_EQUAL "${FindOrBuildICU_VERSION}")

(ie, use 'VERSION_GREATER_EQUAL' instead of 'VERSION_EQUAL')
-That's it, just change that one statement. The cmake command does the right thing and everything builds and tests successfully afterwards.
-One other thing: I'd recommend using the 'develop' branch instead of the 'master' branch of the repo, as it somewhat accommodates the C++17 standard. From the repo dir:

git fetch
git checkout develop
git submodule update --init --recursive

>---
BTW, I think this is the latest release version of the ICU code:
https://github.com/unicode-org/icu/releases/download/release-66-1/icu4c-66_1-src.tgz
Noice, I installed Calibre and found it can convert epub to txt easily:

ebook-convert input.epub output.txt

They only need a tiny bit of editing to clean up for training.

>>2333 And thanks, I'm looking at src/tools/profile.cpp at the moment trying to figure out how to use MeTA. To start I'd like to discard shitposts, then remove stop words and analyze the word frequency to get an idea of the content of posts being collected. It seems like I have to create an index first to perform searches, although perhaps I can just use the word counts. Then I'd like to filter them by content, probably by exporting the data to do some semi-supervised learning on good and bad post examples until it can figure out the filtering automatically.
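The stop-word-removal and word-frequency step described above can be prototyped in a few lines before committing to a full MeTA index. A sketch — the stop list here is a toy stand-in for a real one:

```python
from collections import Counter
import re

# A tiny stop list for illustration; a real one would be much larger.
STOP = {"the", "a", "an", "is", "it", "to", "and", "of", "i"}

def word_freq(post):
    """Lowercased word counts with stop words removed."""
    words = re.findall(r"[a-z']+", post.lower())
    return Counter(w for w in words if w not in STOP)

freq = word_freq("The quality of the data is important, and the data is everything.")
print(freq.most_common(2))  # [('data', 2), ...]
```

Summing these Counters over a whole thread gives a quick content profile, and unusually low content-word counts per post are one cheap shitpost signal.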
>>2338 I don't know much about the tool yet, but I think they cover some of the basics here https://meta-toolkit.org/profile-tutorial.html thanks for the tip about calibre, i didn't know that.
Open file (115.92 KB 1000x687 Chobits.jpg)
Idea: it might be possible to build a system that can scrape dialog from manga for use in training chatbots. It'd be an interesting project since it would also need to be able to predict some of the context and who is speaking. It makes me wonder what other massive data resources could be tapped into with adequate tools.
>>2450 Certainly there is a mountain of content in mango Anon, it would be great to figure it out. >that scene Kek. The Madhouse animu did a really good job with that one.
>>2450 subtitle files, visual novels, light/web novel translations. Visual novels are particularly good because if you make a scraper for a commonly used engine (like renpy) you can access dialog from thousands of games. Also, associating lines with characters, context and even emotions should be easy to do by using the displayed artwork as a reference.
>>2464
>should be easy to do by using the displayed artwork as a reference.
can you define 'easy to do', anon? how would the process work? what algorithm would work, for example?
>>2465 It's ez, just invent the singularity :^) LSTMs are good at this sort of stuff where it needs to detect cute anime girls and flick a switch that stays on for long periods of time, although I'm not really sure what someone would do with a visual novel reading bot besides endlessly generating choose your own adventures like AI Dungeon. That'd be pretty crazy with a good voice synth. If anon knows some tricks for easy image recognition, please share.
>>2300 pdftotext, which comes with poppler-utils, can be used to convert PDFs. For some reason ebook-convert doesn't work on my system for PDFs. pdftotext has some problems, though, with spacing paragraphs and converting math symbols properly. Cleaning up a 500-page machine learning textbook is a bitch but I know it will be worth it. I'll post it here once it's complete.

It would be extremely useful if we could build a model that can construct questions for any given sentence, then format everything it reads into a Q&A and train chatbots on that. Like this:
<Q: What would help us collect an insane amount of chatbot training data?
>A: It would be extremely useful if we could build a model that can construct questions for any given sentence.
It would get really good at answering technical questions on any topic trained on, plus be able to intuit technical things not found in any of the training data. It'd be like having a group of researchers in your pocket you can query any time, making Google look like ancient dinosaur technology.
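Part of the paragraph-spacing cleanup mentioned above — pdftotext hard-wraps lines mid-paragraph — can be automated by treating only doubled newlines as real paragraph breaks and re-joining everything else. A heuristic sketch, not a general fix (it will merge lines that were genuinely separate but single-spaced):

```python
import re

def unwrap(text):
    """Re-join hard-wrapped lines within paragraphs.

    Heuristic: a newline is a real paragraph break only when doubled;
    single newlines inside a paragraph become spaces.
    """
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())

raw = "Machine learning is\nthe study of algorithms.\n\nIt improves with data."
print(unwrap(raw))  # the first two lines are merged into one paragraph
```

Math symbols and headings still need hand attention, but this knocks out the bulk of the line-wrap noise in a converted textbook.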
>>2472 > I'll post it here once it's complete. Please do Anon. >It would be extremely useful if we could build a model that can construct questions for any given sentence. That's a great idea. It's something I wish I could do well tbh. >and make Google look like ancient dinosaur technology. This brings up the question of industrial espionage. Ofc (((interests))) will stop at nothing to succeed. If you could literally make Jewgle then you would be the focus of much attention.
>>2467 >It's ez, just invent the singularity :^) You. I like the way you think Anon.
>>2465 oh, I wasn't thinking of using AI to recognize characters, but to parse the game data files directly. In renpy, dialog commands look like this:

define e = Character("Eileen", image="eileen")

label start:
    show eileen mad
    e "I'm a little upset at you."
    e happy "But it's just a passing thing."
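Dialog statements in that shape — a speaker alias, an optional emotion tag, then a quoted line — can be pulled out with a single regex. A hedged sketch: it handles only this simple statement form (narrator lines, multi-line strings and Character-name resolution from the define statements are omitted):

```python
import re

# Matches Ren'Py-style dialog statements like:
#   e "I'm a little upset at you."
#   e happy "But it's just a passing thing."
DIALOGUE = re.compile(r'^\s*(\w+)(?:\s+(\w+))?\s+"(.*)"\s*$')

def parse_script(script):
    """Return (speaker, emotion_or_None, line) tuples from a script."""
    out = []
    for line in script.splitlines():
        m = DIALOGUE.match(line)
        if m:
            out.append((m.group(1), m.group(2), m.group(3)))
    return out

script = '''
label start:
    show eileen mad
    e "I'm a little upset at you."
    e happy "But it's just a passing thing."
'''
print(parse_script(script))
```

The emotion field comes out as a plain token ('happy', None, ...), which is exactly the kind of label that would let a chatbot learn emotional register alongside the dialog.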
>>2478 Interesting syntax.
Found an archive of an old anime transcript website. This thing is a goldmine: https://www.dropbox.com/s/lmqa9hnciu1fbav/animetranscripts_backup_jul_2018.7z?dl=0 Mining the transcripts into text format is gonna take a bit of work but it'll be an excellent dataset for people to bootstrap their chatbots on.
>>2585 Thanks Anon, grabbing a copy now.
Open file (390.74 KB 2550x2227 funguy.png)
Some chatbot datasets from the ParlAI team that created >>3190
>PersonaChat
http://convai.io/
Paper: https://arxiv.org/pdf/1801.07243
>DailyDialog
Website: https://web.archive.org/web/20190917200842/yanran.li/dailydialog (defunct site, dataset archived here: https://files.catbox.moe/kr936x.zip)
Paper: https://arxiv.org/abs/1710.03957
>Wizard of Wikipedia
Paper: https://openreview.net/forum?id=r1l73iRqKm
>Empathetic Dialogues
Paper: https://arxiv.org/abs/1811.00207
>SQuAD
Website: https://rajpurkar.github.io/SQuAD-explorer/
>MS MARCO
Website: http://www.msmarco.org/
>QuAC
Website: https://www.aclweb.org/anthology/D18-1241
>HotpotQA
GitHub: https://hotpotqa.github.io/
>QACNN & QADailyMail
Paper: https://arxiv.org/abs/1506.03340
>CBT
Paper: https://arxiv.org/abs/1511.02301
>BookTest
Paper: https://arxiv.org/abs/1610.00956
>bAbI Dialogue tasks
Paper: https://arxiv.org/abs/1605.07683
>Ubuntu Dialogue
Paper: https://arxiv.org/abs/1506.08909
>OpenSubtitles
Website: http://opus.lingfil.uu.se/OpenSubtitles.php
>Image Chat
Paper: https://arxiv.org/abs/1811.00945
>VQA
Website: http://visualqa.org/
>VisDial
Paper: https://arxiv.org/abs/1611.08669
>CLEVR
Website: http://cs.stanford.edu/people/jcjohns/clevr/

To download these datasets, set up ParlAI:
https://github.com/facebookresearch/ParlAI#installing-parlai

Then run, with the dataset's appropriate task name:

python examples/display_data.py --task TASKNAME --datatype train

See the complete list of datasets and their task names here:
https://github.com/facebookresearch/ParlAI/blob/master/parlai/tasks/task_list.py

Example to download ConvAI2:

python examples/display_data.py --task convai2 --datatype train
>>3195 Thanks, Anon!
>>3195 >Example to download ConvAI2: python examples/display_data.py --task convai2 --datatype train Illegal instruction (core dumped) Why Does Python Hate Me!? :^)
Open file (381.59 KB 986x854 python cucked.png)
>>3200 How else do you expect software to run and package maintainers to maintain code when tech conferences clap roaringly when it's said calling connectors male and female is not inclusive because it implies that an outie is male and an innie is female? The download files for a task can be found in its folder at ParlAI/parlai/tasks/TASKNAME/build.py
>>3202 (((redhat)))
>>3202 Haha, good point. But in fact (of course) the issue is all on my end. If Kokubunji says it works, you can believe it works. I just screwed the pooch in the Python Plane of AI dubs is all. :^) I like that term Master, think I'll use that someday... >>3203 They're a real mixed-bag IMO. Some talented & good, some absolute evil & pure diversity.
