Data Analysis @ Daum, Paul Kim (totworld@daumcorp.com), DevOn 2012
Data Analysis @ Daum | DevOn 2012

  1. Data Analysis @ Daum, Paul Kim (totworld@daumcorp.com), DevOn 2012
  2. Search Quality
  3. Search Quality = Satisfaction / Cost
  4. How?
  5. HTTP://WWW.GOOGLE.COM/ONCEUPONATIME/TECHNOLOGY/PIGEONRANK.HTML
  6. Understanding Users with Logs: BIG DATA!
  7. Data Analysis Process with Hadoop? Pipeline: HADOOP -> FEATURES -> TOOLS. Hadoop clusters: 2 quad-cores, 8GB RAM, 4TB HDD x 60 nodes; 4 quad-cores, 16GB RAM, 4TB HDD x 30 nodes. Tools: SAS, WEKA, R, etc.
  8. For example,
  9. The secret to cooking delicious ramen
  10. Most Viewed Posts. Mission: Reflect satisfying search experiences in the ranking. Target Data: Half-year search logs (about 40TB). Features (each computed by a GROUP-BY job): Query - Collection Relationship; Query - Document - Session Relationship; Session - Query Relationship; Session - Document Relationship.
  11. Most Viewed Posts. Modeling: Linear Regression with SAS. Batch Process: HADOOP -> FEATURES -> MODEL -> ENGINE, in less than 2 hours.
  12. Sea Story (바다 이야기)
  13. SEARCH SPAM INDEX. Mission: Understand the impact of spam on search users. Data: Search Log (delimited text); Post-Filtered Documents (JSON format); Operation-Deleted Documents (XML format). Task: For each Query - Session - Doc. 1 - Doc. 2 - Doc. 3 - Doc. 4 record, use an OUTER JOIN to determine Click? and Type? (Ham, Spam, OP Del.).
  14. SEARCH SPAM INDEX: Result Sample
  15. BLOG CLASSIFICATION
  16. BLOG CLASSIFICATION. Mission: Cluster bad blogs through unsupervised learning. Data: 30 days of blog documents. Task: Blog - Document feature analysis with a fixed interval.
  17. BLOG CLASSIFICATION. Modeling: Kohonen's SOM (Self-Organizing Map) with R.
  18. WHAT ELSE? Topic Analysis with PLSA; Query Chain Filtering; Reprocessing with Hadoop.
  19. In Conclusion,
  20. ADVANTAGE OF HADOOP. ADVANTAGE: Low analysis cost! No more sampling! Low operation cost! Programming-language independent. Various support tools. DISADVANTAGE: Conceptual change is needed. Project under active development. Version upgrade is not supported.
  21. THANK YOU!
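
Slides 10 and 13 describe two recurring job shapes: GROUP-BY jobs that turn raw search logs into relationship features, and an OUTER JOIN that attaches a document type (Ham, Spam, OP Del.) to clicked documents. The following is a minimal in-memory Python sketch of those two shapes, not the Hadoop jobs used at Daum. The record layout, the sample data, the function names, and the choice to treat unlabeled documents as "ham" are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical search-log records: (session_id, query, clicked_doc_id).
# The real log layout is not given in the slides beyond "text with delimiter".
SEARCH_LOG = [
    ("s1", "ramen recipe", "d1"),
    ("s1", "ramen recipe", "d2"),
    ("s2", "ramen recipe", "d1"),
    ("s2", "sea story", "d3"),
    ("s3", "sea story", "d4"),
]

# Hypothetical labels merged from the filtered and operator-deleted document sets.
DOC_TYPES = {"d3": "spam", "d4": "op_del", "d9": "spam"}


def group_by_count(records, key_fn):
    """Map each record to a key, then count records per key (one group-by job)."""
    return Counter(key_fn(record) for record in records)


def full_outer_join(clicks, doc_types, default_type="ham"):
    """Join click counts with document labels, keeping unmatched rows from both sides.

    Documents with clicks but no label get default_type (an assumption);
    labeled documents that were never clicked get a click count of 0.
    """
    joined = {}
    for doc_id in sorted(set(clicks) | set(doc_types)):
        joined[doc_id] = (clicks.get(doc_id, 0), doc_types.get(doc_id, default_type))
    return joined


if __name__ == "__main__":
    # Two of the relationship features named on slide 10.
    query_doc = group_by_count(SEARCH_LOG, lambda r: (r[1], r[2]))
    session_query = group_by_count(SEARCH_LOG, lambda r: (r[0], r[1]))
    print(query_doc)
    print(session_query)

    # Click counts per document, joined with document types as on slide 13.
    doc_clicks = group_by_count(SEARCH_LOG, lambda r: r[2])
    for doc_id, (click_count, doc_type) in full_outer_join(doc_clicks, DOC_TYPES).items():
        print(doc_id, click_count, doc_type)
```

In a MapReduce setting the `key_fn` step would correspond to the map phase and the counting to the reduce phase; the full outer join keeps never-clicked spam documents visible, which an inner join would drop.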
