Data Science by ODS.ai 🦜
First Telegram Data Science channel. Covering all technical and popular stuff about anything related to Data Science: AI, Big Data, Machine Learning, Statistics, general Math and the applications of the former. To reach editors contact: @malev
Machine learning algorithms
Language Model on One Billion Word Benchmark


In this release, we open source a model trained on the One Billion Word Benchmark (http://arxiv.org/abs/1312.3005), a large English-language corpus which was released in 2013. This dataset contains about one billion words, and has a vocabulary size of about 800K words. It contains mostly news data. Since sentences in the training set are shuffled, models can ignore the context and focus on sentence-level language modeling.

In the original release and subsequent work, people have trained models on this dataset and evaluated them on the same test set, making it a standard benchmark for language modeling. Recently, we wrote an article (http://arxiv.org/abs/1602.02410) describing a hybrid model combining a character CNN, a large and deep LSTM, and a specific Softmax architecture, which allowed us to train the best model on this dataset thus far, almost halving the best perplexity previously obtained by others.
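
For readers new to the metric: perplexity is the exponential of the average negative log-likelihood per word, so lower is better. A minimal sketch in Python, assuming we already have the probability the model assigned to each word; the function name and the numbers are made up for illustration and do not come from the paper:

```python
import math

def perplexity(word_probs):
    """Perplexity of a text, given the model's probability for each word."""
    # Average negative log-likelihood per word, in nats.
    nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(nll)

# Hypothetical per-word probabilities from some language model.
# A model that is equally unsure among 4 words at every step
# has a perplexity of about 4.
probs = [0.25, 0.25, 0.25, 0.25]
print(perplexity(probs))
```

Roughly speaking, halving the perplexity means the model is, on average, choosing among half as many equally likely next words.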

Link for the repo: https://github.com/tensorflow/models/tree/master/lm_1b
Generative Visual Manipulation on the Natural Image Manifold

For more details, please visit the project webpage:
https://people.eecs.berkeley.edu/~junyanz/projects/gvm/

"Generative Visual Manipulation on the Natural Image Manifold", Jun-Yan Zhu, Philipp KrΓ€henbΓΌhl, Eli Shechtman and Alexei A. Efros. In European Conference on Computer Vision (ECCV). 2016.

(via Deep Learning community on vk.com)
Fully-Convolutional Siamese Networks for Object Tracking

A new state of the art for real-time tracking, running at 50-100 fps. It can be used to track objects in video.

http://www.gitxiv.com/posts/TvEcWEJabGu7pEHEa/fully-convolutional-siamese-networks-for-object-tracking
Stanford University report on how life will be different with AI by 2030.
Spoiler: no Skynet just yet.

https://ai100.stanford.edu/2016-report
There is an #opensource repository for automatic image captioning in #tensorflow.
As the article reports, the researchers have managed to significantly improve the quality of recognition.


https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html

#deeplearning
πŸ‘1
Google released a new dataset: like ImageNet, but for video.

YouTube-8M is a large-scale labeled video dataset that consists of 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities. It also comes with precomputed state-of-the-art vision features from billions of frames, which fit on a single hard disk. This makes it possible to train video models from hundreds of thousands of video hours in less than a day on 1 GPU!

https://research.google.com/youtube8m/
http://arxiv.org/pdf/1609.08675v1.pdf
Pointer Sentinel Mixture Models: use a pointer but back off to softmax vocab if uncertain
+ WikiText, new LM corpus.

Pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.
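
The "back off to softmax if uncertain" idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the scores and the three-word vocabulary are hypothetical, and in the paper the pointer and sentinel scores come from the LSTM state rather than being given by hand. One softmax runs over the recent context positions plus a sentinel; the mass on the sentinel becomes the weight of the vocabulary distribution, and the rest is copied onto the context words.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_sentinel_mix(p_vocab, context_words, pointer_scores, sentinel_score):
    """Mix a vocabulary distribution with a pointer over recent words.

    p_vocab: dict word -> probability from the softmax vocabulary model
    context_words: recent words the pointer may copy
    pointer_scores: one unnormalized score per context word
    sentinel_score: unnormalized score for backing off to the vocabulary
    """
    # One softmax over all context positions plus the sentinel.
    attn = softmax(list(pointer_scores) + [sentinel_score])
    gate = attn[-1]  # sentinel mass = weight of the vocabulary model
    mixed = {w: gate * p for w, p in p_vocab.items()}
    # The remaining mass is added to the words the pointer attends to.
    for word, a in zip(context_words, attn[:-1]):
        mixed[word] = mixed.get(word, 0.0) + a
    return mixed

# Toy example with made-up numbers.
p_vocab = {"the": 0.5, "bank": 0.3, "fed": 0.2}
context = ["fed", "said", "fed"]
mixed = pointer_sentinel_mix(p_vocab, context, [2.0, 0.0, 2.0], 1.0)
print(mixed)
```

Note that "said" is not in the toy vocabulary but still receives probability through the pointer, which is how this kind of mixture can handle rare words without a huge softmax.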

https://arxiv.org/abs/1609.07843
Google released Open Images, a dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories. They tried to make the dataset as practical as possible: the labels cover more real-life entities than the 1000 ImageNet classes, there are enough images to train a deep neural network from scratch and the images are listed as having a Creative Commons Attribution license.

https://research.googleblog.com/2016/09/introducing-open-images-dataset.html
Not Safe For Work!

Following Yahoo's release of a dataset for training a porn classifier, researchers used the trained networks to synthesize new pornographic images. Results are available at https://open_nsfw.gitlab.io (NSFW)
Andrew Ng wrote a letter about his upcoming book:

Dear Friends, 

You can now download the first 12 chapters of the Machine Learning Yearning book draft. These chapters discuss how good machine learning strategy will help you, and give new guidelines for setting up your datasets and evaluation metric in the deep learning era.

You can download the text here (5.3MB): https://gallery.mailchimp.com/dc3a7ef4d750c0abfc19202a3/files/Machine_Learning_Yearning_V0.5_01.pdf

Thank you for your patience. I ended up making many revisions before feeling this was ready to send to you. Additional chapters will be coming in the next week.

I would love to hear from you. To ask questions, discuss the content, or give feedback, please post on Reddit at:
http://www.reddit.com/r/mlyearning

You can also tweet at me at https://twitter.com/AndrewYNg . I hope this book will help you build highly effective AI and machine learning systems.

Andrew
Learning Deep Neural Networks with Massive Learned Knowledge, Z. Hu, Z. Yang, R. Salakhutdinov, E. Xing

https://www.cs.cmu.edu/~zhitingh/data/emnlp16deep.pdf

#paper #dl
πŸ‘1