This tutorial aims to give attendees a detailed understanding of an end-to-end evaluation pipeline based on human judgments (offline measurement). It surveys the state-of-the-art methods, techniques, and metrics needed at each stage of the evaluation process. We focus mostly on evaluating an information retrieval (search) system, but other tasks such as recommendation and classification are also discussed. Practical examples are drawn both from the literature and from real-world industry usage scenarios.
윤석진: Problems an Organization Must Overcome for a Data-Driven Culture
Talk video: https://youtu.be/X29liXyIo3s
---
Popcorn Season 1, prepared by PAP, collects the stories of data practitioners growing together with their products.
---
PAP (Product Analytics Playground) is a community for talking comfortably about product data analytics.
Our goal is to enable more people to lead a data-driven product culture from wherever they are.
Just as people from many roles come together to build a product, PAP is made up of diverse members and is shaped by your participation.
---
Official page: https://playinpap.oopy.io
Facebook group: https://www.facebook.com/groups/talkinpap
Team blog: https://playinpap.github.io
The 15th BOAZ Big Data Conference - [Team YouPlace]: Searching Jeju attractions in YouTube videos with Kafka and Spark (BOAZ Bigdata)
The YouPlace team carried out the following data engineering project.
Now even search belongs to the YouTube era.
When planning a trip to Jeju, you probably rely heavily on vlog videos.
Haven't you felt overwhelmed trying to track down, one by one, the spots scattered across countless videos?
For anyone with that problem, we built a service that shows at a glance, on a map, the travel spots YouTube vloggers have visited.
(GitHub: https://github.com/Boaz-Youplace)
16th cohort, Engineering: 고은서 | Chung-Ang University, School of Software
16th cohort, Engineering: 류정화 | Sungshin Women's University, Dept. of Convergence Security Engineering
16th cohort, Engineering: 송경민 | Kookmin University, Dept. of Software
This document discusses techniques for recommender systems including multi-armed bandit (MAB), Thompson sampling, user clustering, and using item features. It provides examples of how MAB works using the ε-greedy approach and explores the tradeoff between exploration and exploitation. User clustering is presented as a way to group users based on click-through rate to improve targeting. Finally, it suggests using different item features like images, text, and collaborative filtering data as inputs to recommendation models.
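The summary above mentions the ε-greedy approach and the exploration/exploitation tradeoff. As an illustration only (not taken from the slides), here is a minimal sketch of an ε-greedy bandit over three hypothetical arms with assumed click-through rates:

```python
import random

def epsilon_greedy(counts, rewards, epsilon=0.1):
    """Choose an arm: with probability epsilon explore a random arm,
    otherwise exploit the arm with the highest observed mean reward."""
    if random.random() < epsilon:
        return random.randrange(len(counts))  # explore
    means = [r / c if c > 0 else 0.0 for r, c in zip(rewards, counts)]
    return max(range(len(means)), key=means.__getitem__)  # exploit

# Simulate three arms with assumed (hypothetical) click-through rates.
true_ctr = [0.05, 0.10, 0.20]
counts = [0, 0, 0]   # pulls per arm
rewards = [0, 0, 0]  # accumulated clicks per arm
random.seed(42)
for _ in range(10_000):
    arm = epsilon_greedy(counts, rewards)
    counts[arm] += 1
    rewards[arm] += 1 if random.random() < true_ctr[arm] else 0

best = max(range(3), key=counts.__getitem__)
print(best, counts)  # after enough trials, the best arm (index 2) gets most pulls
```

With ε = 0.1, roughly 10% of trials keep exploring all arms, so the estimated means converge and exploitation concentrates on the arm with the highest true click-through rate.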
The document discusses building a recommender system using collaborative filtering approaches. It describes collecting usage and rating data, calculating item-item and user-user similarities, making predictions for unknown values using k-nearest neighbors, and evaluating the system using measures like precision, recall and root mean squared error. Implementation details like programming languages, databases and cloud infrastructure are also summarized.
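To make the collaborative filtering steps above concrete, here is a toy sketch (all data hypothetical, not from the document) of item-item similarity, a k-nearest-neighbor rating prediction, and an RMSE helper:

```python
import math

# Toy user-item rating matrix (0 = unknown). Rows are users, columns are items.
R = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
]

def cosine(a, b):
    """Cosine similarity computed over co-rated entries only."""
    pairs = [(x, y) for x, y in zip(a, b) if x > 0 and y > 0]
    if not pairs:
        return 0.0
    num = sum(x * y for x, y in pairs)
    den = (math.sqrt(sum(x * x for x, _ in pairs))
           * math.sqrt(sum(y * y for _, y in pairs)))
    return num / den if den else 0.0

def predict(u, i, k=2):
    """Predict R[u][i] as a similarity-weighted average of the user's
    ratings on the k items most similar to item i (item-item kNN)."""
    col_i = [row[i] for row in R]
    sims = []
    for j in range(len(R[0])):
        if j != i and R[u][j] > 0:
            col_j = [row[j] for row in R]
            sims.append((cosine(col_i, col_j), R[u][j]))
    top = sorted(sims, reverse=True)[:k]
    den = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / den if den else 0.0

def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

pred = predict(1, 1)  # user 1's unknown rating of item 1
print(round(pred, 2))  # a value inside the rating scale, pulled toward similar items
```

The prediction stays within the rating scale because it is a convex combination of the user's own ratings on the most similar items; RMSE would then be computed over held-out (predicted, actual) pairs.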
What You Need to Know for Trustworthy A/B Tests (Minho Lee)
Slides from a special lecture at 프롬 on 2021-09-04.
---
Many people say A/B testing is important.
But what exactly lets us trust A/B tests with our decisions?
An A/B test is not a magic tool that produces results just by being run.
We will look at what further thought is needed to get trustworthy experimental results.
(The original Google presentation is at http://goo.gl/uiX2UH)
- 권재명 (Jaimyoung Kwon)
1. Data companies in Silicon Valley
2. The online advertising business
3. Data scientists, data engineers, and machine learning scientists
4. A day in the life of a Silicon Valley data scientist
5. The data science toolchain
6. Data science best practices
7. Essential statistics concepts for data science
8. Introducing data science inside a company
The 코끼리 (BOAZ) Librarian's Book Recommendation Solution
: "This book suits my taste; how do I find books with similar content?"
For readers who choose books by their plot, or who want to read books by authors they follow,
the 코끼리 librarian suggests books that match your taste.
12th cohort: 강호석, 고은비, 고은지, 양태일, 이지인, 전준수, 정해원
[BOAZ, Korea's first intercollegiate big data club]
YouTube - https://www.youtube.com/channel/UCSniI26A56n2QZ71opJtTUg
Facebook - https://www.facebook.com/BOAZbigdata
Instagram - http://www.instagram.com/boaz_bigdata
Blog - https://blog.naver.com/boazbigdata
이윤희: Diving Straight into Causal Inference
Talk video: https://youtu.be/fShRiqe1Cf0
The 16th BOAZ Big Data Conference - [Team 기린그림]: A picture-diary generation service featuring the user's own handwriting (BOAZ Bigdata)
The 기린그림 team carried out the following data analysis project.
The 기린그림 team built a service that learns the user's handwriting so they can write a diary in their own font, and converts uploaded photos to look hand-drawn, so users can keep an illustrated picture diary.
16th cohort: 김유진 | Ewha Womans University, Dept. of Science Education
17th cohort: 김송성 | Korea University, Dept. of Statistics
17th cohort: 박종은 | Yonsei University, Underwood International College
17th cohort: 여해인 | Dongduk Women's University, Dept. of Computer Science
17th cohort: 이보림 | Chung-Ang University, School of Software
The 15th BOAZ Big Data Conference - [Team 쇼미더뮤직]: Song recommendation via emotion extraction from text (BOAZ Bigdata)
The 쇼미더뮤직 team carried out the following data analysis project.
How great would it be to get song recommendations based on the emotions of my day?
A collaboration of natural language processing and recommender system techniques:
we extract your emotions and recommend songs to match.
**쇼미더뮤직!**
16th cohort: 김양경 | Konkuk University, Dept. of Technology Management
15th cohort: 김은선 | Sejong University, Dept. of Data Science
16th cohort: 유수빈 | Dongduk Women's University, Dept. of Information Statistics
16th cohort: 이상민 | Kyung Hee University, Dept. of Software Convergence
16th cohort: 조하늘 | Dongduk Women's University, Dept. of International Business and Dept. of Information Statistics
16th cohort: 최리 | Konkuk University, Dept. of Applied Statistics
The 19th BOAZ Big Data Conference - [COLLABO-AZ]: A customer-segmentation-based personalized recommender system for 루빗 (BOAZ Bigdata)
The COLLABO-AZ team carried out the following data analysis project.
A customer-segmentation-based personalized recommender system for 루빗
20th cohort: 정지혜 | Ewha Womans University, Dept. of Statistics
20th cohort: 김지민 | Chung-Ang University, Dept. of Applied Statistics
20th cohort: 오태연 | Dankook University, Dept. of Information Statistics
20th cohort: 최은선 | Hanyang University ERICA, Dept. of Information Society and Media
The 15th BOAZ Big Data Conference - [Team MarketIN]: A digital marketing health-checking service (BOAZ Bigdata)
The MarketIN team carried out the following data visualization project.
- From running a small shop to an online mall, operating a business requires countless decisions. Connect your data to a dashboard template and find answers to your questions at a glance.
- Experience data-driven business with MarketIN.
16th cohort: 강민주 (Seoul National University of Science and Technology, Industrial Information Systems)
16th cohort: 김서연 (Sookmyung Women's University, Dept. of Public Relations and Advertising)
16th cohort: 오지원 (Sejong University, Dept. of Business Administration)
16th cohort: 윤해림 (Sejong University, Dept. of Business Administration)
16th cohort: 임성아 (Sejong University, Dept. of Business Administration)
16th cohort: 한주리 (Korea University, Dept. of Sociology)
Clean Code for AI Researchers - GDG DevFest Seoul 2019 (Kenneth Ceyer)
For researchers who worry about writing code properly: clean code is not flashy decoration that wastes your time by forcing complex patterns onto your code. While implementing a model and testing it, have you wondered whether your code is really sound? In this session we look at the code smells that frequently appear when writing research models, and how to eliminate them. Let's breathe soul into our code and become researchers who can implement any code without embarrassment!
The 15th BOAZ Big Data Conference - [Team 개미야 뭐하니?]: Real-time stock movement prediction from investor reactions (feat. Kafka) (BOAZ Bigdata)
The 개미야 뭐하니? team carried out the following data engineering project.
[Web notice] In 5 minutes, the stock you invested in will fall!
An AI that tells you the ups and downs of your stocks in real time?
With this, even a stock-market beginner can graduate.
The optimal sell and buy timing, together with the ants (retail investors).
Right now, at this very moment, you can see the future of your stocks.
(Sign-up: https://github.com/jayleenym/AYOA)
16th cohort: 강지수 | Dongduk Women's University, Dept. of Information Statistics
16th cohort: 김서민 | Sookmyung Women's University, Dept. of Computer Science
16th cohort: 김윤기 | Hanyang University Graduate School, Dept. of Computer Software
16th cohort: 문예진 | Sogang University, Economics / Big Data Science
Tutorial materials from PyCon Korea 2019. These are the Part 1 slides, in which Professor 최재식 presents what explainable AI is. Event information is available at the links below.
http://xai.unist.ac.kr/Tutorial/2018/
https://github.com/OpenXAIProject/PyConKorea2019-Tutorials
Part 1: https://www.slideshare.net/OpenXAI/2019-part-1
Part 2: https://www.slideshare.net/OpenXAI/2019-lrp-part-2
Part 3: https://www.slideshare.net/OpenXAI/2019-shap-part-3
This document discusses recommender systems:
1. It provides an overview of recommender systems, their history, and common problems such as top-N recommendation and rating prediction.
2. It then discusses what makes a good recommender system, covering experiment methods (offline experiments, user surveys, and online experiments) as well as evaluation metrics such as prediction accuracy, diversity, novelty, and user satisfaction.
3. It examines the key metrics for evaluating recommender systems, such as user satisfaction, prediction accuracy, coverage, diversity, novelty, serendipity, trust, robustness, and response time, and emphasizes selecting metrics based on business goals.
The 로깅줍깅 team carried out a data engineering project: they built a pipeline for collecting and processing log data, and ran experiments on the situations that can arise at each stage.
16th cohort, Engineering: 강하영 | Dongduk Women's University, Dept. of Information Statistics
16th cohort, Engineering: 임태빈 | Sangmyung University, Dept. of Computer Science
16th cohort, Engineering: 지유리 | Sookmyung Women's University, Dept. of Software Convergence
Recommender System Algorithm and Architecture (Liang Xiang)
1) The document discusses recommender system algorithms and architecture. It covers common recommendation techniques like collaborative filtering, content-based filtering, and graph-based recommendations.
2) It also discusses challenges like cold starts for new users and items. For new users, it recommends using demographic data or initial feedback to understand interests. For new items, it suggests using content information or initial user feedback.
3) The document proposes a feature-based recommendation framework that connects users, items, and latent features to address challenges like heterogeneous data and cold starts. This framework provides explanations but does not support user-based methods.
Google Analytics You Can Start Using Tonight (구글 애널리틱스, GA) (Yongho Ha)
Lecture slides for the Google Analytics course run at http://ga.yonghosee.com. These slides are a sample, but the opening sections are the actual course material as-is. I would be glad if they help you understand GA on their own. Thank you!
This document discusses recommender engines, which are systems that predict items a user may be interested in based on their preferences and behaviors. It describes several common recommendation techniques, including demographic filtering, content-based filtering, user-based collaborative filtering, and item-based collaborative filtering. Examples of recommender engines used by Amazon and Digg are provided to illustrate how these techniques are implemented on e-commerce and social news sites. The document concludes that recommender engines provide benefits to both businesses and users by enabling personalized recommendations at scale.
The attached summary highlights the benefits of the Cross Slot no-till seeding system over other, inferior seeding technologies.
In many cases, Cross Slot can make the difference between a farmer getting seed in the ground in wet or dry conditions and not; if another piece of seeding technology does not let the farmer get the crop in, the farmer has suffered a failure.
The document discusses symbols and symbolism in literature. It explains that a symbol is a simple thing that represents a deeper meaning, and gives an example of a red scarf symbolizing something important in a story. Symbolism helps authors reveal big concepts through everyday objects. To analyze a symbol, readers ask what the object means in daily life and how it applies to the situation in the story. Famous symbols in literature include colors, buildings, settings based on season or location, plants, weather, and animals.
The document discusses information retrieval (IR) research and introduces Jin Young Kim, a PhD student studying IR. Kim presents on designing retrieval models, recent trends in IR like personalized search and user modeling, and his own research projects in areas like structured document retrieval, personal search, and understanding book search behavior. The presentation aims to provide an overview of IR research and highlight some challenges and opportunities in the field.
The document provides an overview of plot structure, including key elements like exposition, rising action, climax, falling action, and denouement. It then prompts the reader to consider how these structural elements correspond to the plot of the novel Ethan Frome, asking the reader to identify what parts of the story represent each component of typical plot structure.
The document provides information about plot structures but does not describe the plot of a specific story. It defines a plot as the sequence of events in a story and notes that plots typically involve some process of change. It then outlines common structural elements of plots, including exposition, rising action, climax, falling action, and denouement/resolution.
Chemistry studies the transformations of matter through chemical reactions, which occur when substances come into contact. A chemistry laboratory uses various glassware, such as flasks, graduated cylinders, burettes, and funnels, to store and measure chemical substances and to carry out reactions controlled with heat.
TechSoup Global is a nonprofit that has built nonprofit sector capacity through technology donations for 25 years. It works with a global network of technology providers to deliver donated products to nonprofits worldwide. It also operates data services like GuideStar International, which aggregates data on civil society organizations around the world to increase transparency and facilitate connections between organizations. TechSoup Global aims to ensure every nonprofit has the technology, resources, and knowledge needed to reach their full potential.
This document provides an overview of the Twig templating engine. It discusses why Twig was created as an alternative to PHP for templating, its key features like inheritance, macros and filters. It also covers the basics of Twig syntax including variables, tags, filters and control structures like if/else and for loops. Finally it discusses how to install and use Twig including extending templates and including other templates.
The document provides information about plot structures in stories. It defines plot as the sequence of events in a story and describes typical plot patterns, including an exposition to introduce characters and settings, a rising action with complicating events, a climax of high drama or tension, a falling action as conflicts are resolved, and a dénouement or conclusion. It discusses the use of chronological order, flashbacks, and foreshadowing in storytelling.
Vegetables provide important vitamins, minerals, and carbohydrates. It is best to eat a variety of vegetables as they contain different nutrients. Leafy greens like kale and poke greens, as well as cabbages, plantains, and peppers are rich in vitamin C. Deep orange and dark green vegetables contain high amounts of vitamin A. Broccoli, spinach, and collards are dark green vegetables that are good sources of calcium and iron. Eating fresh vegetables is healthier than cooked or processed varieties because fresh vegetables naturally contain fewer calories, fat, and sodium.
Digital Still Camera Market in India by Abhinava Mishra
This document analyzes the digital still camera market in India. It discusses the market size and growth story, key drivers and challenges, product trends toward higher megapixels and optical zoom, and competitive positioning of major players like Canon, Nikon, Sony, and Panasonic in DSLR and mirrorless cameras. The document also examines the market share of leaders like Kodak and the distribution networks of brands through photo retail outlets and memory card partners.
The document discusses various aspects of characters in novels, including what characters are, character traits, analyzing characters, characterisation techniques, character types, and how settings can also take on characteristics of characters. It provides definitions and examples for each topic. Key points include that characters are created by authors and influenced by concepts/conventions, traits refer to a character's appearance and behavior, authors use techniques like speech, appearance, actions, and others' thoughts to create characters, and settings provide scenery and atmosphere that characters respond to.
Subtleties in Tracking Happiness -- Seattle QS#10 (Jin Young Kim)
This document summarizes Jin Young Kim's approach to tracking and measuring happiness over time. Some key points:
- Kim tracks happiness using a 5-point scale recorded 3 times per day, and also logs factors like sleep, events, and states that may influence happiness.
- Happiness is evaluated based on both successful achievement and sense of well-being. Metrics are analyzed to identify patterns and improve lifestyle.
- Past tracking revealed cyclical happiness and importance of structure like work/deadlines. Early mornings and avoiding home increased happiness.
- Lessons include the impact of self-rating and the need to turn insights into tangible results, like maintaining an average happiness score.
- Tracking improved self
Social Entrepreneur Meets Technology, by 황진솔 대표 (Jin Young Kim)
With the year-end approaching, many of you will be taking part in giving. Our guest, CEO 황진솔, runs The Bridge, a startup with a new business model that combines donation and investment. In this seminar he will talk about The Bridge's business, as well as appropriate technology suited to conditions in less-developed countries.
Hello Data Science: Stories of Improving Life and Work with Data Science (Startup Alliance talk) (Jin Young Kim)
Slides from a public talk on data science held at Startup Alliance on December 22. This is an extended version of the slides actually used, with slides cut for time and various links added back in.
- Myths and truths about data
- The data science process and things to watch out for
- Data science cases for business growth
- Writing a book with data science
You can find a variety of data science materials on my homepage and on Facebook, Twitter, and Brunch.
http://www.hellodatascience.com/
For more details about the event, see the Onoffmix link: http://onoffmix.com/event/59334
The document discusses the process of designing, creating, and evaluating human-computer interaction studies. It covers key aspects of study design including developing hypotheses, selecting populations and tasks, defining metrics, outlining procedures, analyzing data, and addressing confounds and biases. The goal is to empirically test hypotheses about interfaces through well-designed experimental studies.
Webinar: How to Conduct Unmoderated Remote Usability Testing (UserZoom)
The webinar covered how to conduct unmoderated remote usability testing in three parts: an introduction with a case study, then how to plan, design, and recruit for a remote unmoderated usability study, and finally how to analyze one. It discussed choosing goals and metrics, creating study scripts with tasks and questions, recruiting participants, and analyzing results, including task success rates, efficiency metrics, satisfaction scores, and behavioral data. The presentation provided examples and tips for each part of the process.
Ranking engineers at Google work to optimize search results by developing signals, metrics, and experiments. They:
1. Look for new signals that measure page quality and relevance and combine existing signals in new ways.
2. Optimize search results based on metrics like relevance, quality, and time to load; these metrics are measured through live experiments and ratings from human raters.
3. Address issues like systematically bad ratings or missing metrics by fixing rater guidelines, developing new metrics, or identifying patterns of losses to improve results.
This slide deck is for all the QA members who want to understand the methodology of test case design. These slides are not theoretical gyan but are designed based on experience.
Gain a deeper understanding of what Exploratory Testing (ET) is, the essential elements of the practice with practical tips and techniques, and finally, ideas for integrating ET into the cadence of an agile process
This document discusses testing on agile teams. It notes that quality is everyone's responsibility, and testing should begin early in iterations. Effective testing requires considering factors like risk and priority. Manual testing sessions should vary tests over time. Test documentation should only be created if it helps manage the testing project. Defects should be communicated constructively. Teams should continuously learn and improve. Feature maps, heuristics, and exploratory testing techniques are recommended. Automated testing of units, services and UIs can help teams test often. Lessons include collaborating on test ideas and problems, and questioning the value of all testing efforts.
The document discusses various techniques for project estimation including three point estimation, Delphi method, planning poker, function point analysis, use case points, and PERT diagrams. It provides details on each technique including how they are conducted, their advantages and disadvantages, and when each is best applied. The key aspects that estimators need to consider for large scale projects are work partitioning challenges, increasing communication overhead with larger teams, and understanding how fast the project can realistically be completed based on its size.
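Among the techniques listed above, three-point (PERT) estimation has a simple closed form. As an illustration with made-up numbers (not from the document), the beta-distribution weighted estimate and its standard deviation can be computed as:

```python
def pert_estimate(optimistic, most_likely, pessimistic):
    """PERT (beta-distribution) three-point estimate:
    E = (O + 4M + P) / 6, with standard deviation (P - O) / 6."""
    expected = (optimistic + 4 * most_likely + pessimistic) / 6
    std_dev = (pessimistic - optimistic) / 6
    return expected, std_dev

# Hypothetical task: 4 days best case, 6 days most likely, 14 days worst case.
expected, std_dev = pert_estimate(4, 6, 14)
print(expected)            # 7.0
print(round(std_dev, 2))   # 1.67
```

The 4x weight on the most likely value keeps one wild optimistic or pessimistic guess from dominating the estimate, which is why PERT is often preferred over a simple average of the three points.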
This document provides an introduction to Lean UX and UserTesting. It defines UX and Lean UX, discusses the benefits of user testing such as increased revenue and decreased costs, and outlines the UserTesting process including defining objectives, writing tasks, analyzing results, and using metrics and notes. UserTesting allows remote, unmoderated usability testing of digital products through video recordings of testers interacting with designs. The document provides tips for effective user testing through UserTesting.
The document discusses user experience evaluation and prototyping. It explains that evaluation is important to determine if a design goal of improving user experience has been accomplished. Both quantitative and qualitative data can be collected during formative evaluation with low-fidelity prototypes early in the design process or summative evaluation with high-fidelity prototypes later on. The type of prototype impacts what data can be collected and where the evaluation takes place. User experience design aims to create interfaces that are useful and usable, where usability is measured by effectiveness, efficiency and satisfaction in completing tasks. Various metrics are discussed for evaluating the usability dimensions.
Remote moderated testing was once out of reach for many organizations -- but not anymore!
Steve Schang of Midwood Usability shares his expert review of and advice for getting the most of remote testing tools.
Contact Steve and his team at MidwoodUsability.com.
Presented at Firecat Studio's monthly UX and Marketing Strategy gathering, Firecat First Friday, in November 2020.
Human Computation, Crowdsourcing and Social: An Industrial Perspective (oralonso)
SIGIR Tutorial on IR Evaluation: Designing an End-to-End Offline Evaluation Pipeline
1. IR Evaluation:
Designing an End-to-End
Offline Evaluation Pipeline (2)
Jin Young Kim, Microsoft
jink@microsoft.com
Emine Yilmaz, University College London
emine.yilmaz@ucl.ac.uk
2. Speaker Bio
• Graduated from UMass Amherst with Ph.D in 2012
• Spent past 3 years in Bing’s Relevance Measurement / Science Team
• Taught MSFT course on offline evaluation
• Passionate about working with data of all kinds
(search, personal, baseball, …)
3. Evaluating a Data Product
• How would you evaluate Web Search, App Recommendations, and
even an Intelligent Agent?
4. Better Evaluation = Better Data Product
• Investment decisions
• Shipping decisions
• Compensation decisions
• More effective ML models
5. Tutorial Objective
• Overview End-to-End process of how evaluation works
in a large-scale commercial web search engine
• Learn about various decisions and tips for each step
• Practice designing a judging interface for specific task
• Review related literature in various fronts
6. What Makes Evaluation in Industry different?
• Larger scale / team / business at stake
• More diverse signals for evaluation (online + offline)
• More diverse evaluation targets (not just documents)
• Need for a sustainable evaluation pipeline
7. Agenda: Steps for Offline Evaluation
• Preparing tasks
• Designing a judging interface
• Designing an experiment
• Running the experiment
• Evaluating the experiment
9. What constitutes a task?
• Goal
• Evaluate how well the target satisfies the task description provided
• Task description
• Some (expression of) information need
• Search query / user profile / …
• Target
• System response to satisfy the need
• SERP / webpage / answer / …
10. Sampling tasks (queries)
• A random sample of user queries is a common method
• What can go wrong in this approach?
• Sampling criteria
• Representative: Are the samples representative of the user traffic?
• Actionable: Are they targeted for what we’re trying to improve on?
• Need for more context
• Are queries specific enough for consistent judgment?
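For contrast, traffic-weighted vs. uniform query sampling can be sketched as follows (the tiny query log is made up for illustration):

```python
import random

random.seed(0)  # for reproducibility

# hypothetical query log: query -> impression count
query_log = {"weather": 500, "python tutorial": 120, "rare disease xyz": 3}

queries = list(query_log)
weights = [query_log[q] for q in queries]

# traffic-weighted sample: representative of what users actually experience
traffic_sample = random.choices(queries, weights=weights, k=5)

# uniform sample over unique queries: over-represents tail queries
uniform_sample = random.sample(queries, k=2)
```

A traffic-weighted sample is representative of user traffic, while a uniform sample over unique queries surfaces more tail queries, which may be more actionable for debugging.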
11. Add contexts if query alone is not enough
• Context examples:
• User’s location
• Task description
• Session history
• …
• Cost of contextual judging
• Potentially needs more judgments
• Increases the judge’s cognitive load
13. Goals in designing a judging interface
• Maximum information
• Minimum efforts
• Minimum errors
14. Designing a judging interface: SERP*
• Questions
• Responses
• Judging Target
Q: How would you rate
the search results?
Not Relevant
Fair
Good
Excellent
Q: Why do you think so?
*SERP: Search Engine Results Page
15. Practice: Design your own Judging Interface
• What can go wrong with the evaluation interface?
• How can you improve the evaluation interface?
16. What can go wrong here?
• Judges may like some part of the page, but not others
• Judges may not understand the query at all
• Each judge may understand the task differently
• Rating can be very subjective without a clear baseline
• …
17. Designing a judging interface: web result
Given ‘crowdsourcing’ as
a query, how would you
rate the webpage?
Not Relevant
Fair
Good
Excellent
Q: Why do you think so?
Now the judging target is specific enough
18. Judging Guideline
• A document for judges to read
before starting the task
• Need to keep it simple (e.g., one page), especially for crowd judges
• Can’t rely on the guideline for all
instructions: use training / tooltips
19. Designing a judging interface: side-by-side
Q: How would you
compare two results?
Left much better
Left better
About the same
Right better
Right much better
Q: Why do you think so?
The other page establishes a clear baseline for the judgment
21. Here or There: Preference Judgments for
Relevance [Carterette et al. 2008]
Preference judgments yield higher inter-judge agreement
22. Tips on judging interface design
• Use plain language (i.e., avoid jargon)
• Make the UI light and simple (e.g., no scrolling)
• Provide an ‘I don’t know’ (skip) option (to avoid random responses)
• Collect optional textual comments (for rationale or feedback)
• Collect judging time and behavioral log data (for quality control)
23. Using Hidden Tasks for Quality Control [Alonso ’15]
• Ask simple questions that
require judges to read the
contents
• This prepares the judge for the actual judging task
• This provides a way to verify whether a response is bogus
25. From judgments to an experiment
• Experiment
• A set of judgments collected with a particular goal
• A typical experiment consists of many tasks and judgments
• Multiple judgments are collected for each task (overlap)
• Types of goals
• Resource planning: where to invest in next few months?
• Feature debugging: what can go wrong with this feature?
• Shipping decision: should we ship the feature to production?
(Figure: grid of 9 tasks × 3 overlapping judgments)
26. Breakdown of Experimental Cost
• How much money (time) spent per task?
• How many (overlap) judgments per task?
• How many tasks within experiment?
$ (time)
per Judgment
# Judgments
per Task
# Tasks within
Experiment
10 cents = 30 seconds ($12/hr)
3 judgments per task
9 tasks
(Grid: 9 tasks × 3 judgments per task, 10¢ each = 27 judgments)
Total cost: $2.70
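The cost arithmetic on this slide can be sketched in a few lines (the helper name is my own):

```python
def experiment_cost(num_tasks, overlap, cost_per_judgment):
    """Total cost = #tasks x #judgments per task x $ per judgment."""
    return num_tasks * overlap * cost_per_judgment

# numbers from the slide: 9 tasks, 3 overlapping judgments, 10 cents each
total = experiment_cost(num_tasks=9, overlap=3, cost_per_judgment=0.10)
print(f"${total:.2f}")  # $2.70
```

Varying any one factor (pay, overlap, or task count) trades off against the other two for a fixed budget.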
27. Effect of Pay per Task
• Higher pay per task doesn’t improve judging quality, but it does increase throughput
[Mason and Watts, 2009]
28. Why overlap judgments?
• Better task understanding
• What’s the distribution of labels?
• What are judges’ collective feedback?
• Quality control for labels / judges
• What is the majority opinion for each task?
• Who tends to disagree with the majority opinion?
Majority opinion is not always right, especially before you have enough good judges
29. Majority Voting and Label Quality
• Ask multiple labellers, keep majority label as “true” label
• Quality is the probability of the majority label being correct
(p: probability of an individual labeller being correct)
[Kuncheva et al., PA&A, 2003]
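The relationship between individual labeller accuracy and majority-vote quality can be sketched with a small binomial computation (my own sketch of the effect the slide refers to, not the figure from Kuncheva et al.):

```python
from math import comb

def majority_quality(p, n):
    """P(majority of n labellers is correct), assuming each labeller is
    independently correct with probability p; use odd n to avoid ties."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# with good judges (p > 0.5) overlap helps; with bad judges it hurts
print(round(majority_quality(0.7, 3), 3))  # 0.784, up from 0.7
print(round(majority_quality(0.4, 3), 3))  # 0.352, down from 0.4
```

This matches the slide's caveat: majority voting only improves quality when individual judges are better than chance.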
30. High vs. Low overlap experiment
• High-overlap
• Early iteration stage
• Information-centric tasks
• Low-overlap
• Mature / production stage
• Number-centric tasks
(Figures: grid of 3 tasks × 9 overlapping judgments vs. grid of 9 tasks × 3 overlapping judgments)
31. Summary: Evaluation Goals & Guidelines
Evaluation Goal                   | Judgment Design  | Experiment Design
Feature Planning / Debugging      | Label + Comments | Information-centric (High overlap)
Training Data                     | Label + Comments | Specific to the algorithm
Shipping Decision (ExpA vs. ExpB) | Label + Comments | Number-centric (Low overlap)
33. Choosing judge pools
• Development Team
• In-house (managed) judges
• Crowdsourcing judges
(Figure: moving from the development team to crowd judges means less expertise, more judgments, and being closer to users; each pool collects ground-truth labels for the next stage)
34. Choosing judge within the pool
• Considerations
• Do judges have necessary knowledge?
• Do judge profiles match with target users?
• Can they perform the task with reasonable accuracy?
• Methods
• Pre-screen judges by profile
• Filter out judges by screening task
• Remove ‘bad’ judges regularly
35. Training judges: Training tasks
Given ‘crowdsourcing’ as a query, how would you rate the webpage?
Bad
Fair
Good
Excellent
Perfect
Q: Why do you think so?
The answer is ‘Excellent’: this document satisfies the user’s main intent by providing well-curated information about the topic
• Initial qualification task
• Interleaved training task
• Interleaved QA task
36. Crowd workers communicate with each other!
You need to manage your reputation as a requester (quick payment, being responsive to workers’ feedback).
Answers shared with one worker are likely shared with all.
37. Cost of Qualification Test [Alonso’13]
• Judges become an order of magnitude slower in the presence of qualification tasks
• However, depending on the type of task, the results may be worth the delay and cost
38. Tips on running an experiment
• Scale up judging tasks slowly
• Beware of the quality of golden hits
• Submit a big task in small batches
(for task debugging / judge engagement)
• Monitor & respond to judges’ feedback
40. Analyzing the judgment quality
• Agreement with ground truth (aka golden hits)
• Inter-rater agreement
• Behavioral signals (time, label distribution)
• Agreement with other metrics
41. Comparing Inter-rater Metrics
• Percentage agreement: the number of cases that received the same rating from both judges, divided by the total number of cases rated by the two judges.
• Cohen’s kappa: estimates the degree of consensus between two judges, correcting for agreement that would occur by chance alone.
• Fleiss’ kappa: generalization of Cohen’s kappa to n raters instead of just two.
• Krippendorff’s alpha: accepts any number of observers and is applicable to nominal, ordinal, interval, and ratio levels of measurement.
https://en.wikipedia.org/wiki/Inter-rater_reliability
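The simplest of these, percentage agreement, can be sketched in a few lines (function name and sample ratings are my own):

```python
def percentage_agreement(ratings_a, ratings_b):
    """Fraction of cases that received the same rating from both judges."""
    assert len(ratings_a) == len(ratings_b)
    same = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return same / len(ratings_a)

# two judges rating five documents on a 4-point scale
score = percentage_agreement(
    ["Good", "Fair", "Excellent", "Good", "Not Relevant"],
    ["Good", "Good", "Excellent", "Good", "Fair"],
)
print(score)  # 0.6
```

Its weakness, which the kappa statistics correct for, is that two judges answering at random on a small label set will still agree fairly often.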
42. Analyzing the judgment quality
Automating Crowdsourcing Tasks in an Industrial Environment
Vasilis Kandylas, Omar Alonso, Shiroy Choksey, Kedar Rudre, Prashant Jaiswal
43. Using Behavior of Crowd Judges for QA
• Predictive models of task performance can be built from behavioral traces, and these models generalize to related tasks.
Instrumenting the Crowd: Using Implicit Behavioral Measures to Predict
Task Performance, UIST’11, Jeffrey M. Rzeszotarski, Aniket Kittur
44. Case Study: Relevance Dimensions in
Preference-based IR Evaluation [Kim et al. ’13]
Q: How would you
compare two results?
Overall
Relevance
Diversity
Freshness
Authority
Caption
Q: Why do you think so?
Left Tie Right
Allow judges to break down their judgments along several dimensions
45. Case Study: Relevance Dimensions in
Preference-based IR Evaluation [Kim et al. ’13]
• Results shown: inter-judge agreement; correlation of preference judgments with delta in NDCG@{1,3}
• All achieved with a 10% increase in judging time
47. Building a Production Evaluation Pipeline
Omar Alonso, Implementing crowdsourcing-based relevance
experimentation: an industrial perspective. Inf. Retr. 16(2): 101-120 (2013)
48. Recap: Steps for Offline Evaluation
• Preparing tasks
• Designing a judging interface
• Designing an experiment
• Running the experiment
• Evaluating the experiment
49. Main References
• Implementing crowdsourcing-based relevance experimentation: an
industrial perspective. Omar Alonso
• Tutorial on Crowdsourcing Panos Ipeirotis
• Amazon Mechanical Turk: Requester Best Practices Guide
• Quantifying the User Experience. Sauro and Lewis. (book)
51. Impact of Highlights on Document Relevance
• Highlighted versions of the document were perceived to be more relevant than plain versions. [Alonso, 2013]
• Subtle interface change can affect the outcome significantly
54. • Statistic used for measuring inter-rater agreement
• Can be used to measure
• Agreement with gold data
• Agreement between two workers
• More robust than error rate as it takes into account agreement by
chance
Computing Quality Score: Cohen’s Kappa
Kappa = (Pr(a) − Pr(e)) / (1 − Pr(e))
Pr(a): Observed agreement among raters
Pr(e): Hypothetical probability of chance agreement (agreement due to chance)
55. Computing Cohen’s Kappa
• Computing probability of agreement (Pr(a))
• Generate the contingency table
• Compute the number of cases of agreement / total number of ratings

                 Worker 1
              a    b    c   Total
Worker 2  a   9    3    1    13
          b   4    8    2    14
          c   2    1    6     9
     Total   15   12    9    36

• Pr(a) = (9+8+6)/36 = 23/36
57. Computing Cohen’s Kappa
• Computing probability of agreement due to chance (Pr(e))
• Compute the expected frequency for agreements that would occur due to chance
• What is the probability that worker 1 & worker 2 both label any item as an a?
Pr(w1=a & w2=a) = (15/36) × (13/36)
• What is the expected number of items labelled as a by both worker 1 and worker 2?
E[w1=a & w2=a] = (15/36) × (13/36) × 36 = 5.42
• Similarly, E[w1=b & w2=b] = 4.67 and E[w1=c & w2=c] = 2.25
• Pr(e) = (5.42 + 4.67 + 2.25)/36 = 12.34/36
61. Computing Cohen’s Kappa
• Kappa = (Pr(a) − Pr(e)) / (1 − Pr(e)) = (23 − 12.34) / (36 − 12.34) = 0.45
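As a check, the whole worked example can be reproduced in a few lines of Python:

```python
# contingency table from the slides: rows = Worker 2, columns = Worker 1
table = [
    [9, 3, 1],
    [4, 8, 2],
    [2, 1, 6],
]

n = sum(map(sum, table))                        # 36 ratings in total
p_a = sum(table[i][i] for i in range(3)) / n    # observed agreement: 23/36
row_totals = [sum(row) for row in table]        # Worker 2 marginals: 13, 14, 9
col_totals = [sum(col) for col in zip(*table)]  # Worker 1 marginals: 15, 12, 9
# chance agreement: sum over labels of the product of marginal probabilities
p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
kappa = (p_a - p_e) / (1 - p_e)
print(round(kappa, 2))  # 0.45
```

Dividing each expected count by n instead of keeping raw frequencies gives the same Pr(e), which is why the single-line computation above matches the slide-by-slide derivation.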
62. What is a good value for Kappa?
• Kappa >= 0.70 => reliable inter-rater agreement
• For the above example, inter-rater reliability is not satisfactory
• If Kappa<0.70, need ways to improve worker quality
• Better incentives
• Better interface for the task
• Better guidelines/clarifications for the task
• Training before the task…
64. Drawing Conclusions
• Hypothesis testing (covered in Part I)
• How confident can we be about our conclusion?
• Confidence interval
• How big is the improvement?
• How precise is our estimate?
Both statistical significance and confidence interval
should be reported!
65. Confidence Interval and Hypothesis Testing
• Confidence Interval
• Does the 95% C.I. of sample mean include zero?
• Hypothesis Testing
• Does the 95% C.I. under H0 include the critical value?
(Figure: sample mean with its 95% confidence interval, and the 95% interval under H0 relative to the critical value)
66. Sampling Distribution and Confidence Interval
• 95% confidence interval: 95% of sample means will fall within this interval
• Equivalently, 95% of repeated samples will yield intervals that include the population mean
http://rpsychologist.com/d3/CI/
67. Computing the Confidence Interval
• Determine confidence level (typically 95%)
• Estimate a sampling distribution (sample mean & variance)
• Calculate the confidence interval:
ConfInterval95 = X̄ ± Z × (σ / √n)
Z: 1.96 (for 95% C.I.)
X̄: sample mean
σ: sample standard deviation
n: sample size
(Figure: sampling distribution with the 95% confidence interval around X̄)
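The formula can be turned into a small helper (the sample scores below are made up for illustration):

```python
import math

def confidence_interval_95(sample):
    """95% C.I. for the mean, using the normal approximation from the slide."""
    n = len(sample)
    mean = sum(sample) / n
    # sample standard deviation (n - 1 in the denominator)
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    margin = 1.96 * sd / math.sqrt(n)  # Z = 1.96 for a 95% interval
    return mean - margin, mean + margin

# hypothetical per-task relevance scores from one experiment
low, high = confidence_interval_95([3.1, 2.8, 3.4, 3.0, 2.9, 3.2, 3.3, 2.7])
```

Note the interval narrows as √n grows, which is why number-centric experiments favor many tasks over high overlap.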
Editor's Notes
Different from software evaluation:
Output depends on task & user / Subjective quality
Evaluation is critical in every stage of development
Harry Shum: ‘We are as good as having the perfect WSE if we perfect the evaluation’
Compared to Pt.1 where Emine focused on Academic IR evaluation, I’ll focus on what people in industry care about
For the rest of this talk, I’ll follow the steps for …
Mention TREC topic desc.
No ground for comparison / What if the judge doesn’t understand the intent?
No ground for comparison / What if the judge doesn’t understand the intent?
Should we use ‘about the same’ vs. ‘the same’?
One judgment is not enough!
Pay per task: how much of judges’ time do you want to borrow?
Different layout?
Dev team should definitely be the first judges
Screenshot?
Tasks with known answers are interleaved with regular tasks
Judges need regular stream of jobs to stick to
For the rest of this talk, I’ll follow the steps for …
Need to be careful if you want to change the judging interface suddenly…
These can be derived from the sampling distribution