These slides explain the paper "Leakage in Data Mining: Formulation, Detection, and Avoidance" (Kaufman, Shachar, et al., ACM Transactions on Knowledge Discovery from Data (TKDD) 6.4 (2012): 1-21), which discusses how to prevent "leakage," a problem that frequently arises in data mining and machine learning work. A toy code illustration of leakage follows the topic list.
The main topics are:
・Examples of leakage cases that have occurred in the past
・Two guiding ideas for preventing leakage
・Detecting leakage
・Fixing leakage
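To make concrete the kind of mistake the paper formalizes, here is a minimal sketch (my example, not one from the paper): fitting preprocessing on the full dataset before the train/test split leaks test-set statistics into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = rng.integers(0, 2, size=100)

# Leaky: the scaler's mean/std are computed over rows that will later
# land in the test set, so test information leaks into training.
X_leaky = StandardScaler().fit_transform(X)

# Safe: split first, then fit preprocessing on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```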
Assorted Object Detection Stories Disguised as an Explanation of You Only Look One-level Feature (Yusuke Uchida)
Presentation slides for the 7th All-Japan Computer Vision Study Group "CVPR2021 Reading Session" (Part 1).
https://kantocv.connpass.com/event/216701/
Covers an explanation of You Only Look One-level Feature, casual discussion of the YOLO family, and a broad look at related object detection methods.
Deep Reinforcement Learning from Scratch (NLP2018 Lecture Slides) / Introduction of Deep Reinforcement Learning (Preferred Networks)
An introduction to deep reinforcement learning, presented at a domestic NLP conference.
Lecture slides from the 24th Annual Meeting of the Association for Natural Language Processing (NLP2018).
http://www.anlp.jp/nlp2018/#tutorial
LLJVM is a library that translates LLVM bitcode to JVM bytecode. It was originally created to optimize Python UDFs in PySpark by compiling them to bitcode using Numba and then translating that to run on JVMs. However, LLJVM currently only supports a limited set of LLVM instructions and data types. It focuses on translating simple Numba-generated bitcode and providing runtime support functions. Translating more complex UDFs could improve PySpark performance significantly by avoiding serialization overhead and allowing whole-stage codegen.
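As a rough sketch of the front half of that pipeline, the snippet below uses Numba to compile a trivial UDF and pull out the LLVM IR that a translator like LLJVM would consume; the UDF itself is a hypothetical example, not code from the LLJVM repository.

```python
from numba import njit

# A simple UDF compiled ahead of time for a concrete signature, the
# kind of function the pipeline starts from (hypothetical example).
@njit("float64(float64, float64)")
def plus(a, b):
    return a + b

# Numba keeps the LLVM IR it generated for each compiled signature;
# this IR (or its bitcode form) is the input side of the translation.
for sig, ir in plus.inspect_llvm().items():
    print(sig)
    print(ir.splitlines()[0])  # first line of the module, for brevity
```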
The document introduces Apache Spark v2.3 and Hivemall-on-Spark v0.5.0. It discusses new features in Spark v2.3 including Structured Streaming, image support, and performance improvements for Pandas UDFs. It also provides an overview of Hivemall-on-Spark, which lets users run Hivemall machine learning functions on Spark DataFrames/SQL, along with utilities that make it easier to use. The author then demonstrates building a logistic regression model on sample data with Hivemall-on-Spark to classify documents. Ongoing work to further optimize feature selection by rewriting Spark plans before feature extraction is also discussed.
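A minimal sketch of that logistic regression step, assuming a Spark session with Hivemall's UDFs registered; the table and column names (training_docs, features, label) are hypothetical, and the merge-by-averaging pattern follows Hivemall's documented usage of train_logregr.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# train_logregr(features, label) emits (feature, weight) rows; models
# trained on separate partitions are merged by averaging the weights.
model = spark.sql("""
    SELECT feature, AVG(weight) AS weight
    FROM (
        SELECT train_logregr(features, label) AS (feature, weight)
        FROM training_docs
    ) t
    GROUP BY feature
""")
model.show()
```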
This document discusses integrating XGBoost machine learning with Spark and DataFrames. It provides examples of using XGBoost in Spark to train models on distributed data and make predictions on streaming data in parallel. It also discusses future work, such as using Rabit for parallel learning, adding support for more platforms like Windows, and integrating with Spark ML pipelines.
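The talk predates today's first-party Python bindings, but as an illustrative stand-in, the sketch below trains a distributed XGBoost model on a Spark DataFrame with the xgboost.spark estimator (available in xgboost 1.7+); the dataset and parameter choices are made up.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.getOrCreate()

# Tiny made-up dataset standing in for a distributed training table.
df = spark.createDataFrame(
    [(0.5, 1.2, 3.1, 0), (1.5, 0.2, 1.1, 1),
     (0.7, 1.0, 2.9, 0), (1.8, 0.1, 0.9, 1)],
    ["f0", "f1", "f2", "label"],
)
train = VectorAssembler(inputCols=["f0", "f1", "f2"],
                        outputCol="features").transform(df)

# Each worker runs one XGBoost process; Rabit coordinates the
# distributed gradient aggregation between them.
clf = SparkXGBClassifier(label_col="label", num_workers=2)
model = clf.fit(train)
model.transform(train).select("label", "prediction").show()
```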
An x86-optimized rank & select dictionary for bit sequences (Takeshi Yamamuro)
The document summarizes a technique for efficiently performing rank and select operations on bit sequences using succinct data structures. It describes splitting the bit sequence into blocks of logarithmic size and precomputing cumulative count values (stored in arrays L and S) so that rank and select can be performed in O(log N) time using only o(N) extra space, where N is the length of the bit sequence. This relies on a technique known as the "Four Russians" method. Performance test results show the optimized implementation outperforms existing libraries.
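A toy Python sketch of the two-level block idea (block sizes and layout chosen for illustration, not the x86-tuned values from the slides):

```python
LARGE, SMALL = 256, 32  # bits per large/small block (illustrative sizes)

def build(bits):
    """Precompute L (count of 1s before each large block) and
    S (count from the large-block start to each small block)."""
    L, S = [], []
    total = 0
    for i in range(0, len(bits), SMALL):
        if i % LARGE == 0:
            L.append(total)
        S.append(total - L[-1])
        total += sum(bits[i:i + SMALL])
    return L, S

def rank1(bits, L, S, i):
    """Number of 1 bits in bits[0:i]: two table lookups plus a
    popcount over at most one small block."""
    return L[i // LARGE] + S[i // SMALL] + sum(bits[(i // SMALL) * SMALL:i])

bits = [1, 0, 1, 1, 0, 1] * 100          # a 600-bit toy sequence
L, S = build(bits)
assert all(rank1(bits, L, S, i) == sum(bits[:i]) for i in range(len(bits)))
```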
1) VAST-Tree is a new data structure that uses vectorized processing and compression techniques to enable highly parallel tree traversal on modern hardware.
2) It classifies tree branches into different layers and applies different compression techniques such as prefix/suffix bit truncation, which allows multiple keys to be processed simultaneously using SIMD instructions (a toy sketch of this comparison step follows the list).
3) Experiments on real Twitter data show that VAST-Tree achieves better compression ratios and throughput than existing techniques like FAST by dynamically compressing branch nodes while minimizing comparison errors.
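As a toy illustration of the branch-comparison idea in 2), the NumPy snippet below compares a search key against all keys of a node at once, standing in for the packed SIMD comparisons VAST-Tree performs on truncated keys; the values are made up.

```python
import numpy as np

# Truncated keys of one branch node, packed into narrow integers.
node_keys = np.array([17, 42, 99, 150], dtype=np.uint8)
key = np.uint8(87)

# One vectorized comparison picks the child to descend into,
# analogous to a SIMD compare followed by a movemask/popcount.
child = int(np.count_nonzero(node_keys <= key))
print(child)  # 2: the key falls between 42 and 99
```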