Towards Automated Classification of Discussion Transcripts: A Cognitive Prese... - Vitomir Kovanovic
LAK'16 Conference paper presentation:
Abstract:
In this paper, we present the results of an exploratory study that examined the problem of automating content analysis of student online discussion transcripts. We looked at the problem of coding discussion transcripts for the levels of cognitive presence, one of the three main constructs in the Community of Inquiry (CoI) model of distance education. Using Coh-Metrix and LIWC features, together with a set of custom features developed to capture discussion context, we developed a random forest classification system that achieved 70.3% classification accuracy and 0.63 Cohen's kappa, significantly higher than the values reported in previous studies. Besides the improvement in classification accuracy, the developed system is also less sensitive to overfitting, as it uses only 205 classification features, around 100 times fewer than in similar systems based on bag-of-words features. We also provide an overview of the classification features most indicative of the different phases of cognitive presence, which gives additional insight into the nature of the cognitive presence learning cycle. Overall, our results show the great potential of the proposed approach, with the added benefit of providing further characterization of the cognitive presence coding scheme.
LinkedIn's Big Data Recommendation Products - Predicting Tomorrow from Yesterday's Data - Evion Kim
Presented at DEVIEW 2013 - http://deview.kr/2013/detail.nhn?topicSeq=36
The keyword behind LinkedIn's various Recommendation Products is 'Relevance'. The goal of LinkedIn's data team is to make users' lives easier and more convenient by serving the most relevant data. So how do we deliver the most relevant data to each user? Summed up in one sentence: 'analyze yesterday's data to predict tomorrow's user behavior.'
This talk unpacks that sentence, introducing how LinkedIn uses technologies such as Hadoop, key-value storage, and machine learning to build highly relevant Recommendation Products.
Slides presented by Hwang Junsik, a former student of the [머신러닝 CAMP], in the fourth session of the [데이터를 부탁해] open seminar hosted by Fast Campus on November 20, 2015.
http://www.fastcampus.co.kr/dab_openlecture_151120/
More about the [머신러닝 CAMP] ↓
http://www.fastcampus.co.kr/data_camp_mlearning/
Taken from the Future of Web Design, San Francisco 2015 Conference. https://futureofwebdesign.com/san-francisco-2015/
Site analytics. The quantified self. Big data. Human activity is creating more and more measurable data. But is more data really helping designers make better decisions? Human problems often require illogical approaches. In order to meet real human needs, we need to approach the data we collect with empathy and find the story in the facts.
Standardizing +113 million Merchant Names in Financial Services with Greenplu... - Data Science London
The document discusses standardizing over 113 million merchant names from transaction data using regular expressions and fuzzy matching. The work involved extracting features from merchant names, cleaning them with regular expressions, fuzzy matching to group similar names, and manual rules. This enabled a preliminary analysis showing that 90% of transactions and spending were concentrated in the top 7-8% of merchants. Customer segments were then identified based on relative value-added scores.
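The regex-cleaning-plus-fuzzy-matching pipeline described above can be sketched with the standard library. The sample merchant strings, the `clean`/`same_merchant` helpers, and the 0.8 similarity threshold are all illustrative assumptions, not details from the original work.

```python
import re
from difflib import SequenceMatcher

def clean(name):
    """Normalize a raw merchant string: drop store/transaction codes and punctuation."""
    name = re.sub(r"[\*#]\S*", "", name)      # "*1X2Y3", "#1234" -> ""
    name = re.sub(r"\b\d+\b", "", name)       # standalone numbers
    name = re.sub(r"[^A-Za-z ]", " ", name)   # remaining punctuation -> space
    return " ".join(name.lower().split())

def same_merchant(a, b, threshold=0.8):
    """Fuzzy-match two cleaned names; the threshold is an illustrative choice."""
    return SequenceMatcher(None, clean(a), clean(b)).ratio() >= threshold
```

Variants such as "AMZN Mktp US" would slip past a pure string-similarity check against "Amazon.com", which is why the pipeline described above also needed manual rules.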
Talk given at Fronteers 2015 in Amsterdam.
In a world where many of our digital spaces are becoming more closed than ever, open data is a concept that is rapidly on the rise.
In this talk we'll explore what open data is (and what it isn't), and why we should care about it. We'll look at how you can introduce it into your projects with regards to practical publication and consumption, and discuss some useful tools and reference points.
Open data isn't just dry and technical - it gives us great scope to be creative, and throughout this talk we'll go through some of the amazing things that it has been used for globally in the hope that it will inspire you to create something amazing yourself.
How to Create Surveys to Read Your Audience's Minds - Leslie Samuel
How do you know exactly what your audience wants? Ask them! Conducting surveys can help you provide more value, IF you do them the right way. This slide deck will show you how to create surveys well.
The What, Why and How of (Web) Analytics Testing (Web, IoT, Big Data) - Anand Bagmar
Learning Objectives:
The most used and talked-about buzzwords in the software industry today are … IoT and Big Data!
With IoT, given a creative mindset that looks for opportunities and ways to add value, the possibilities are infinite. Each such opportunity generates a huge volume of data, which, if analyzed and used correctly, can feed into creating more opportunities and stronger value propositions.
There are 2 types of analysis that one needs to think about.
1. How is the end user interacting with the product? This gives some understanding of how to reposition the product and focus on its true value-adding features.
2. With the huge volume of data generated by end-user interactions and captured by every device in the offering's food chain, it is important to identify patterns in what has happened and to find new product and value opportunities based on usage patterns.
Learn what Web Analytics is and why it is important, and see some techniques for testing it both manually and with automated validation.
Business today is starting to understand the value of data, and some organisations are outperforming their competition by putting data at the heart of their thinking. Leveraging data to change business models, understand customers and employees better, and deliver new revenue streams is the driving force of this new data-centric era.
Jon Woodward - MSFT
Dave Coplin - MSFT
Mike Bugembe - JustGiving
Gary Richardson - KPMG
This document discusses automating big data analytics processes. It describes the traditional "old school" approach of manually extracting, loading and transforming raw data. The document then presents two "new school" examples that automate these processes. The first automates extracting core website and social media data, enriching email addresses, and generating segments and metrics. The second integrates data from 70+ APIs in real-time, performs custom aggregations, and enables behavioral segmentation and messaging. The document concludes by soliciting questions and feedback on working with big data.
This document provides an overview of digital marketing trends, channels, tactics and tools. It is presented in four parts: emerging trends in the digital landscape, key digital marketing channels, tactics for driving engagement and conversion, and useful tools for digital marketing efforts. The presentation covers topics such as the shift from outbound to inbound marketing, the proliferation of digital channels, how to leverage search engine marketing, social media, email and other channels, and how tools like Google Analytics can be used to track metrics and analyze traffic. The overall message is the importance of staying aware of emerging trends and being curious to continuously improve digital marketing strategies.
Loss function discovery for object detection via convergence simulation drive... - taeseon ryu
Hello, this is the Deep Learning Paper Reading Group.
Today's video, presented by Song Heon of the Fundamentals team, covers the paper 'Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search'.
Contact: tfkeras@kakao.com
6. Thought Experiment
• Is it right for a robot grader to score essays? Is it ethical?
1. Human graders are not always fair.
2. Machines structure the situation; does that suppress creativity?
3. Is the goal of an essay to write a great essay, or to score well on a standardized test?
7. Feature Selection
• Choosing the subset of the data to feed into the model
• An important part of building algorithms and statistical models
• Remove redundant or highly correlated variables
• "Sometimes, more data is just more data"
8. Example: Chasing Dragons
• Suppose you have designed an app called Chasing Dragons
• Only 10% of new users remain after the first month
• Retaining existing users costs less than acquiring new ones
• How do you retain existing users?
9. User Retention
1. Collect the data
• Record every user action as a time-stamped event log
2. Transform it into a dataset
• Each row is a user, each column is a feature
• Brainstorm candidate features (feature extraction)
✤ Number of days the user visited in the first month
✤ Total time until the second visit
✤ Whether the user filled out a profile, etc.
• Watch out for redundancy and correlation between features
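The dataset-building step above can be sketched in plain Python. The event tuples, the feature names, and the `build_features` helper are hypothetical, chosen to match the brainstormed features on this slide.

```python
from collections import defaultdict
from datetime import datetime

# hypothetical time-stamped event log: (user_id, date, action)
events = [
    ("u1", "2015-01-01", "visit"),
    ("u1", "2015-01-03", "visit"),
    ("u1", "2015-01-03", "fill_profile"),
    ("u2", "2015-01-02", "visit"),
]

def build_features(events):
    """Turn the event log into one row per user, one column per feature."""
    raw = defaultdict(lambda: {"visit_days": set(), "has_profile": 0})
    for user, date, action in events:
        day = datetime.strptime(date, "%Y-%m-%d").date()
        if action == "visit":
            raw[user]["visit_days"].add(day)
        elif action == "fill_profile":
            raw[user]["has_profile"] = 1
    table = {}
    for user, r in raw.items():
        days = sorted(r["visit_days"])
        table[user] = {
            # days the user visited in the first month
            "n_visit_days": len(days),
            # time (in days) until the second visit, if there was one
            "days_to_second_visit": (days[1] - days[0]).days if len(days) > 1 else None,
            # did the user fill out a profile?
            "has_profile": r["has_profile"],
        }
    return table
```

Highly correlated columns (say, n_visit_days alongside a total-visit count) are exactly what the redundancy warning above says to catch and drop at this stage.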
10. User Retention
3. Logistic regression
• Estimate the probability that a user returns in the second month, given their first-month activity
• logit(P(cᵢ = 1 | xᵢ)) = α + βᵀ·xᵢ
• Select features and feed them into the logistic regression
• Feature selection methods: filter, wrapper, embedded
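A minimal sketch of the model on this slide, with α and β fitted by stochastic gradient ascent on the log-likelihood (real statistics packages use iteratively reweighted least squares instead); the toy features and labels are invented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logit(P(c=1|x)) = alpha + beta^T x by gradient ascent."""
    alpha, beta = 0.0, [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(alpha + sum(b * v for b, v in zip(beta, xi)))
            err = yi - p                                  # gradient of the log-likelihood
            alpha += lr * err
            beta = [b + lr * err * v for b, v in zip(beta, xi)]
    return alpha, beta

# toy first-month activity: [n_visit_days, has_profile] -> returned in month two?
X = [[1, 0], [2, 0], [8, 1], [9, 1], [1, 1], [7, 0]]
y = [0, 0, 1, 1, 0, 1]
alpha, beta = fit_logistic(X, y)

def prob_return(x):
    """P(c=1 | x): probability this user comes back in month two."""
    return sigmoid(alpha + sum(b * v for b, v in zip(beta, x)))
```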
11. Feature Selection Methods: Filter
• Selects features without considering model performance
• Ranks all features by some metric and keeps the highest-ranked ones
• Does not account for redundancy between features
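A filter can be as simple as ranking features by absolute correlation with the label. This sketch (made-up data) also exhibits the weakness named above: a duplicated feature ranks just as high as the original, because the filter never checks redundancy.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def filter_select(features, y, k=2):
    """Keep the k features best correlated with y; the model itself is never consulted."""
    ranked = sorted(features, key=lambda name: abs(pearson(features[name], y)), reverse=True)
    return ranked[:k]

features = {
    "visits":      [1, 2, 3, 4, 5, 6],
    "visits_copy": [1, 2, 3, 4, 5, 6],   # redundant duplicate; a filter keeps it anyway
    "noise":       [5, 1, 4, 2, 6, 3],
}
y = [0, 0, 0, 1, 1, 1]
```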
12. Feature Selection Methods: Wrapper
• Searches for the feature subset on which the model performs best
• Takes a long time
• The number of subsets grows exponentially, creating a risk of overfitting
• Requires choosing both a search algorithm and a selection criterion
13. Algorithms for Feature Selection
1. Forward selection
• Start with an empty set
• Incrementally add, one at a time, the feature that improves the model most
• Stop when adding a feature no longer improves the selection criterion
2. Backward elimination
• Start with all features included
• Incrementally remove the feature whose removal improves the model most
• Stop when removing a feature worsens the selection criterion
3. Combined approach
• Use forward selection and backward elimination together
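Forward selection as listed above might be sketched like this. The selection criterion here is an arbitrary stand-in (training accuracy of a nearest-centroid classifier), and the data and feature names are invented.

```python
def centroid_accuracy(X, y, cols):
    """Stand-in selection criterion: training accuracy of a nearest-centroid classifier."""
    pts = [[row[c] for c in cols] for row in X]
    centroids = {}
    for label in sorted(set(y)):
        members = [p for p, yi in zip(pts, y) if yi == label]
        centroids[label] = [sum(v) / len(members) for v in zip(*members)]
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    preds = [min(centroids, key=lambda lab: dist2(p, centroids[lab])) for p in pts]
    return sum(p == yi for p, yi in zip(preds, y)) / len(y)

def forward_select(X, y, names):
    """Start empty; add the single best feature each round; stop when nothing improves."""
    chosen, best = [], 0.0
    while len(chosen) < len(names):
        score, idx = max(
            (centroid_accuracy(X, y, chosen + [i]), i)
            for i in range(len(names)) if i not in chosen
        )
        if score <= best:            # criterion did not improve -> stop adding
            break
        chosen.append(idx)
        best = score
    return [names[i] for i in chosen], best

# toy data: the first column predicts y perfectly, the second is noise
X = [[0, 3], [1, 1], [0, 2], [1, 0], [0, 1], [1, 2]]
y = [0, 1, 0, 1, 0, 1]
selected, score = forward_select(X, y, ["n_visit_days", "noise"])
```

Backward elimination runs the same greedy loop in reverse: start from the full set and repeatedly drop whichever feature's removal helps the criterion most.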
14. Selection Criteria for Feature Selection
• Many selection criteria exist:
• R-squared (R²)
• p-values
• Akaike Information Criterion (AIC)
• Bayesian Information Criterion (BIC)
• Entropy
• Different criteria produce different models
• Apply several criteria and compare the results before choosing
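As a concrete illustration of "different criteria produce different models", here are AIC = 2k − 2 ln L̂ and BIC = k ln(n) − 2 ln L̂ computed for two hypothetical candidates; the feature counts and log-likelihoods are invented so that the two criteria disagree.

```python
import math

def aic(k, log_lik):
    """Akaike Information Criterion: 2k - 2 ln L; lower is better."""
    return 2 * k - 2 * log_lik

def bic(k, log_lik, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln L; the ln(n) penalty
    is harsher than AIC's factor of 2 once n > e^2 (about 7)."""
    return k * math.log(n) - 2 * log_lik

n = 200  # hypothetical number of users
# model A: 3 features, log-likelihood -120; model B: 10 features, log-likelihood -112
aic_a, aic_b = aic(3, -120), aic(10, -112)
bic_a, bic_b = bic(3, -120, n), bic(10, -112, n)
```

Here AIC prefers the larger model B, while BIC's stronger complexity penalty prefers the 3-feature model A, which is exactly why the slide recommends trying several criteria.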
15. Feature Selection Methods: Embedded
• Decision trees
• A classification algorithm
• Advantage: highly interpretable
• The key question is where to place each feature in the tree
• Place features based on the data: entropy
16. Entropy
• A measure of what is mixed together, and how much
• H(X) = −p(X=1)·log₂ p(X=1) − p(X=0)·log₂ p(X=0)
• If p(X=1) = 0 or p(X=0) = 0, then H(X) = 0
• H(X|a) = Σᵢ p(a=aᵢ)·H(X|a=aᵢ)
• How much do we learn about X once we know the value of attribute a?
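The two formulas on this slide translate directly into code. As a sanity check, a perfectly mixed binary variable has H(X) = 1 bit, and an attribute that splits it cleanly drives H(X|a) down to 0.

```python
import math

def entropy(p1):
    """H(X) for a binary X with P(X=1) = p1, in bits."""
    if p1 in (0.0, 1.0):                 # the p = 0 / p = 1 special case on the slide
        return 0.0
    p0 = 1.0 - p1
    return -p1 * math.log2(p1) - p0 * math.log2(p0)

def conditional_entropy(groups):
    """H(X|a): `groups` maps each attribute value a_i to the list of X values where a = a_i."""
    n = sum(len(g) for g in groups.values())
    return sum(len(g) / n * entropy(sum(g) / len(g)) for g in groups.values())
```

The information gain of an attribute is H(X) − H(X|a); a decision tree places the highest-gain attribute at each split, which is the data-driven feature placement the previous slide refers to.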
17. Pruning
• Cutting the tree off below a certain depth
• Learning from a huge dataset causes overfitting
• Pruning prevents overfitting and improves accuracy
18. Random Forest
1. Generalizes decision trees through bagging
• Compensates for the weakness of a single decision tree, whose results vary greatly with the training data
• Trains many trees on resampled versions of the training data and combines their predictions by voting
• Markedly higher accuracy; simple and fast to train and test
• Sacrifices interpretability: very hard to understand
2. Bootstrapping
• Sample with replacement, so the same data point can be drawn repeatedly
3. No pruning
• A big advantage: individual trees may pick up idiosyncratic noise, which averages out across the forest
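The bagging-plus-bootstrapping recipe above can be sketched end to end, with depth-1 stumps standing in for full unpruned trees to keep the code short; the toy data, tree count, and seed are all made up.

```python
import random

def fit_stump(X, y):
    """A tiny tree: the single-feature threshold split with the best training accuracy."""
    best = (0.0, 0, 0.0, 0, 1)                     # (acc, feature, threshold, left, right)
    for f in range(len(X[0])):
        for t in sorted(set(row[f] for row in X)):
            for left, right in ((0, 1), (1, 0)):
                preds = [left if row[f] <= t else right for row in X]
                acc = sum(p == yi for p, yi in zip(preds, y)) / len(y)
                if acc > best[0]:
                    best = (acc, f, t, left, right)
    return best[1:]

def predict_stump(stump, row):
    f, t, left, right = stump
    return left if row[f] <= t else right

def bagged_forest(X, y, n_trees=25, seed=0):
    """Bagging: each tree trains on its own bootstrap sample, drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # the same point can be drawn repeatedly
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict(forest, row):
    votes = [predict_stump(s, row) for s in forest]
    return max(set(votes), key=votes.count)        # majority vote across the trees

X = [[1], [2], [3], [10], [11], [12]]              # toy one-feature data
y = [0, 0, 0, 1, 1, 1]
forest = bagged_forest(X, y)
```

Any one stump may latch onto the quirks of its bootstrap sample; the majority vote is what smooths that out, which is why the forest can afford to skip pruning.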