Transcript

  • 1. PlatformDay2009: SNS Analysis using Cloud Computing Services. DHT-based Key-Value Storage and MapReduce-based Analysis. DongWoo Lee, oiko.cloud@gmail.com, Oiko Laboratory (SocialFlow, OikoLab). CloudKR
  • 2. Agenda ‣ Introduction • Social Network Service • Motivation : Visualization, Social Network Analysis • SocialFlow • Scale Out Technologies : Cloud Computing ‣ SNS Analysis Architecture based on Cloud • Overall Process • Crawling • DHT Storage (CouchDB) • MapReduce • Pair-Wise Similarity ‣ Cloud Computing Service • Amazon Web Service • EC2 / S3 / Elastic MapReduce • Tips ‣ References
  • 3. Introduction: Social Network, Cloud Computing, Mobile Device
  • 4. Social Network Service “Social Applications = Social Networks” “A social network is a collection of people bound together through a specific set of social relations.” “A collection of people is a social network if and only if it is possible for something to spread virally through that collection.”
  • 5. Social Network Services : Twitter, Facebook
  • 6. Social Applications
  • 7. Social Networks http://www.vincos.it/world-map-of-social-networks/
  • 8. Social Network Analysis ‣ Social Graph Analysis ‣ Visualization ‣ Person-to-Person Relationship ‣ Temporal Mind Mining (Content Clustering) ‣ Post-Mortem Log Processing
  • 9. Social Network Analysis : Visualization
  • 10. Social Network Analysis : Visualization
  • 11. Social Network Analysis : Visualization
  • 12. Social Network Analysis : Visualization
  • 13. SocialFlow ‣ Thoughts, Feelings, Interests, Relationships and Information of SNS ‣ Real-time Massive Social Data Streams ‣ Difficult to follow the Social Streams ‣ Need a way to get a summary or clustered information based on Common Interests
  • 14. SocialFlow ‣ Getting Common Flows of people through Content Similarities ‣ Reflecting Short-Term Interests of People ‣ Extracting Hot Issues ‣ Revealing Relationships among In/Out Resources ‣ Implementing Scale-Out Technologies ‣ Evolving toward Recommendation System based on Collective Intelligence
  • 15. Scale Out Technologies : Cloud Computing
  • 16. Why Cloud Computing? ‣ SPOF (Single Point of Failure) ‣ Cluster Administration (Who does this?) ‣ Initial Infrastructure Investment (Risk Management) ‣ Focus on Main Thing (Intelligence) ‣ Enable Highly Scalable Services (Source: New resource provision paradigms for Grid Infrastructures: Virtualization and Cloud / ISGC 2009 http://tinyurl.com/nacgu7)
  • 17. Cloud Computing: e.g. Storage Failure
  • 18. SNS Analysis Architecture based on Cloud
  • 19. Experimental Project: SocialFlow, OikoLab ‣ Python / Django / Boto ‣ ML / Data Mining ‣ DHT / CouchDB ‣ Cloud / AWS S3, EC2, Hadoop MapReduce
  • 20. Workflow (diagram labels): SNS, Crawler, MapReduce, Post-Processing, CDN, User, In-house Cluster (Local DataCenter), Cloud Service
  • 21. Technologies : Before (diagram labels): Crawler x 3; Consistent DHT (Hash_ring); CouchDB Key-Value Storage; MapReduce (CouchJS); Machine Learning (Home Made)
  • 22. Technologies : After (diagram labels): Crawler x 3; Consistent DHT (Hash_ring); CouchDB Key-Value Storage; Storage (S3); MapReduce (EC2, Hadoop); Machine Learning (Home Made)
  • 23. Crawling ‣ Fetching recent postings of SNS ‣ Storing fetched postings to CouchDB Storage through the DHT Layer (which selects a server) ‣ Pushing raw data into the Cloud to process them with MapReduce (Diagram labels: Crawler, DHT, DB, Replication, Indexer (Mapper), [ term, doc ], Index File)
  • 24. Consistent DHT (Distributed Hash Table) ‣ Uniform key distribution and load balancing with a good hash function ‣ Minimizing the effects of a storage crash or temporary downtime ‣ High availability with a replication scheme ‣ Notice: A real node has non-linear portions of the total key space. (Diagram: ring of nodes 0, 1, 2, ..., k-1, k, k+1, ..., N-1, with the replicas of node k on nodes k-1 and k+1: Replicate(k, k-1, k+1))
  • 25. Consistent DHT (Distributed Hash Table) (Diagram labels: Admin, Anonymous Traffic, User Traffic, View, Admin View, User View, Generated Contents (html, image), SNS Crawler, SNS Analysis, AWS S3, DHT Front End, Memory Cache, DHT ring of nodes 0, 1, 2, ..., k-1, k, k+1, ..., N-1)
  • 26. Consistent DHT : Replication (Replica = 2) (Diagram: ring of nodes A, B, C, D; node contents shown as A B, B C, C D, D A; B is marked with its two replicas)
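The consistent DHT and its replication scheme described in slides 24 to 26 can be sketched roughly as follows. This is a minimal illustration, not the Hash_ring code used in the project: the class name `HashRing`, the `vnodes` and `copies` parameters, the use of SHA-256, and the choice to place copies on the next distinct nodes clockwise are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (hypothetical helper, not the project's code)."""

    def __init__(self, nodes, vnodes=64):
        # Each real node owns several virtual points, so its share of the
        # key space is made of many small, non-contiguous portions.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [position for position, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def nodes_for(self, key, copies=2):
        """Return up to `copies` distinct nodes, walking clockwise from the key."""
        start = bisect.bisect(self._positions, self._hash(key)) % len(self._ring)
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == copies:
                    break
        return found


ring = HashRing(["A", "B", "C", "D"])
print(ring.nodes_for("posting:12345"))  # two distinct node names
```

Because every node's points depend only on its own name, removing one node leaves the primary owner of all other keys unchanged, which is the property that limits the effect of a storage crash.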
  • 27. CouchDB (Key-Value Storage) ‣ Erlang-based Key-Value Storage ‣ Storage Engine (MVCC, B-tree) ‣ RESTful API ‣ Server-side JavaScript Engine (MapReduce) ‣ View Engine ‣ Futon Web UI
  • 28. CouchDB: Server-side JavaScript ‣ Purpose • Local Computations on Local Data Sets ‣ Features • Mozilla’s SpiderMonkey • MapReduce Framework with JavaScript • Fork External Process (couchjs) ‣ Performance Enhancements Expected • Google’s V8 (Chrome’s JavaScript Engine / JIT) http://tinyurl.com/m76sx3
  • 29. CouchDB: MapReduce ‣ doc = (d1, d2, fq) ‣ dx: { di }
  • 30. Map & Reduce : Pair-Wise Similarity (Diagram: DB => Indexer (Mapper) => Index File [ term, doc ] => Grouper (Reducer) => Doc Group File [ term, { docs } ] => Combinator (Mapper) => Doc Candidate File [ doc1, doc2 ] => DocPair Counter (Reducer) => Doc Result File [ freq, doc1, doc2 ]) ‣ Indexer and Grouper for Processing Korean. ‣ No NLP and No Structural Analysis. ‣ Produce a pairwise similarity between two postings.
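The four stages named on the pair-wise similarity slide (Indexer, Grouper, Combinator, DocPair Counter) can be sketched in a single process as below. This is only an illustration of the data flow: the function names are hypothetical, whitespace splitting stands in for the Korean indexer, and the real jobs ran as MapReduce steps rather than as local generators.

```python
from collections import defaultdict
from itertools import combinations


def index(postings):
    """Indexer (map): emit [term, doc] for each distinct term in a posting."""
    for doc_id, text in postings.items():
        for term in set(text.lower().split()):
            yield term, doc_id


def group(term_doc_pairs):
    """Grouper (reduce): collect [term, {docs}]."""
    groups = defaultdict(set)
    for term, doc_id in term_doc_pairs:
        groups[term].add(doc_id)
    return groups


def combine(groups):
    """Combinator (map): emit every [doc1, doc2] pair that shares a term."""
    for docs in groups.values():
        for doc1, doc2 in combinations(sorted(docs), 2):
            yield doc1, doc2


def count(doc_pairs):
    """DocPair Counter (reduce): produce [freq, doc1, doc2], highest first."""
    freq = defaultdict(int)
    for pair in doc_pairs:
        freq[pair] += 1
    return sorted(((f, d1, d2) for (d1, d2), f in freq.items()), reverse=True)


def pairwise_similarity(postings):
    return count(combine(group(index(postings))))


postings = {
    "p1": "cloud computing with hadoop",
    "p2": "hadoop on the cloud",
    "p3": "weekend football",
}
print(pairwise_similarity(postings))  # [(2, 'p1', 'p2')]
```

The score is simply the number of shared terms, which matches the slide's point that no NLP or structural analysis is involved. A very common term produces a large document group and therefore a quadratic number of pairs, which is one reason the group size distribution matters.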
  • 31. Map & Reduce : Optimization ‣ Concerns • Consider Key Group Size Distribution • Data Load Balancing • Barrier Point ‣ Sample Data • Two months of postings from my friends • Reachable graph: 4,060 People • Total Postings: 206,115
  • 32. Pair-Wise Similarity and its TreeMap ‣ Postings: 110,008 ‣ Users: 2,691 ‣ Score >= 6
  • 33. Pair-Wise Similarity and its Cluster ➡ One issue and different opinions among people
  • 34. Pair-Wise Similarity and its Cluster ➡ Common Interest / Hot Issue
  • 35. Pair-Wise Similarity and its Cluster ➡ One person and the similar contents pattern (specialty)
  • 36. Pair-Wise Similarity and its Cluster ➡ Similar Structure of Sentences (trendy, parody)
  • 37. Deployment (diagram labels): EC2, S3/CloudFront, Flickr, www
  • 38. Cloud Computing Service
  • 39. Before the Cloud Age ‣ Smart Shell Guru’s Daily Work : Parallel Sort
      $ wc -l data
      $ split -l 1000k data
      (scp / NFS)
      $ nohup ./work.sh data1 > data1.processed
      $ nohup sort -r data1.processed > data1.sorted
      (scp / NFS)
      $ sort -rm data*.sorted > data.sorted
    Complexity: ➡ Need to prepare/maintain physical machines and resources ➡ Need to monitor job progress (wait and see job’s status) ➡ Need to cope with machine failure (slave nodes / storages / networks) ➡ Need to schedule multiple jobs
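The split, sort and merge steps of the shell workflow above can be mirrored in a few lines of Python. This is a sketch under simplifying assumptions: everything runs in one process, `parallel_sort` and `chunked` are hypothetical names, and the per-chunk sort stands in for the work that the slide distributes to separate machines over scp or NFS.

```python
import heapq


def chunked(lines, size):
    """Like `split -l`: cut the input into chunks of at most `size` lines."""
    for start in range(0, len(lines), size):
        yield lines[start:start + size]


def parallel_sort(lines, chunk_size):
    # Each chunk would be sorted on its own machine (`sort -r dataN`).
    sorted_chunks = [sorted(chunk, reverse=True) for chunk in chunked(lines, chunk_size)]
    # Like `sort -rm`: merge inputs that are already sorted in reverse order.
    return list(heapq.merge(*sorted_chunks, reverse=True))


print(parallel_sort(["b", "d", "a", "c", "e"], chunk_size=2))
# ['e', 'd', 'c', 'b', 'a']
```

The merge step only works because every chunk is already sorted in the same direction, which is why the shell version passes `-r` to both the per-chunk sort and the final `sort -rm`.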
  • 40. Amazon Web Service : Overview (diagram labels) ‣ EC2 (Elastic Compute Cloud): Auto Scaling, CloudWatch Monitoring, Elastic Load Balancing, AMI (Machine Image) ‣ EBS (Elastic Block Store): 1 GB to 1 TB, mounted on EC2 ‣ SQS (Simple Queue Service): Messages ‣ S3 (Simple Storage Service): Buckets, Objects, Header, Permissions, HTTP and API Clients ‣ SimpleDB: key-value ‣ CloudFront: Edges ‣ Import/Export: Offline, eSATA/USB ‣ Elastic MapReduce: Instant EC2 Hadoop Cluster ‣ Admin: Mgmt Console, EC2 CLI, SSH, Access Key ID, Secret Access Key, Key Pair
  • 41. Amazon Web Service ‣ AWS Management Console
  • 42. AWS : AMI (Amazon Machine Image)
  • 43. AWS : Paid AMI / The Cloud Market
  • 44. AWS : How to make an AMI (1)
      Loopback File
      # dd if=/dev/zero of=new_image.fs bs=1M count=1024
      Make ext3 file system
      # mke2fs -F -j new_image.fs
      # mkdir /mnt/ec2-fs
      # mount -o loop new_image.fs /mnt/ec2-fs
      # mkdir /mnt/ec2-fs/dev
      # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x console
      # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x null
      # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x zero
      # mkdir /mnt/ec2-fs/etc
      Create /mnt/ec2-fs/etc/fstab (Add /dev/sda1 --> /, /dev/pts, shm, /proc, /sys)
      Create yum-xen.conf
      # mkdir /mnt/ec2-fs/proc
      # mount -t proc none /mnt/ec2-fs/proc
      # yum -c yum-xen.conf --installroot=/mnt/ec2-fs -y groupinstall Base
      Edit /mnt/ec2-fs/etc/sysconfig/network-scripts/ifcfg-eth0
      Edit /mnt/ec2-fs/etc/sysconfig/network
      Edit /mnt/ec2-fs/etc/fstab (Add /dev/sda2 --> /mnt, /dev/sda3 --> swap)
      # chroot /mnt/ec2-fs /bin/sh
      Edit services
  • 45. AWS : How to make an AMI (2)
      Building an AMI
      # yum install ruby
      # rpm -i ec2-ami-tools-noarch.rpm (Download from public s3 bucket)
      # ec2-bundle-image -i new_image.fs -k my-private-key.key -u aws-user-id
      Local Machine Root File System
      # ec2-bundle-vol -k my-private-key.key -s 1000 -u aws-user-id
      Upload to S3
      # ec2-upload-bundle -b my-bucket -m image.manifest -a my-aws-access-key-id -s my-secret-key-id
      Register AMI
      # ec2-register my-bucket/image.manifest
      IMAGE ami-xxxx
      Testing
      # ec2-describe-images ami-xxxx
      Deregister AMI
      # ec2-deregister ami-xxxx
      Running AMI
      # ec2-run-instances ami-xxxx -n 1
      http://docs.amazonwebservices.com/AWSEC2/2006-06-26/DeveloperGuide/
  • 46. AWS : EC2 Running Instance ‣ AWS Management Console
  • 47. AWS : EC2 Running Instance
  • 48. Amazon Web Service: Access Methods ‣ Access Key ID / Secret Access Key / Key Pairs ‣ AWS Management Console ‣ EC2 API (WSDL) / EC2 CLI (Command Line Interface) ‣ SSH ‣ Firefox Extensions • S3 Firefox Organizer • Elasticfox ‣ S3 • DNS: s3 CNAME s3.amazonaws.com. (e.g. Bucket Name: s3.xyz.com, so that http://s3.xyz.com ---> S3’s s3.xyz.com bucket) ‣ s3cmd (python) ‣ s3cmd.rb / s3sync.rb (ruby) ‣ S3Hub (Mac)
  • 49. Amazon Web Service: Elasticfox ‣ Firefox Extension: Elasticfox
  • 50. Amazon Web Service: Elasticfox ‣ Key Pairs ‣ Private Key ‣ SSH
  • 51. Amazon Web Service: Elasticfox ‣ Security Groups ‣ Open Network Ports
  • 52. AWS: Elastic MapReduce ‣ EC2 + Hadoop ‣ Tools • Management Console • elastic-mapreduce CLI ‣ Preparation • Code --> S3 • Data --> S3 • Log Folder • Output Folder ‣ Job Flow • Streaming • Custom Jar ‣ Sample Applications
  • 53. AWS: Elastic MapReduce
  • 54. AWS: Elastic MapReduce : Web UI
  • 55. AWS: Elastic MapReduce : CLI for Workflow (Diagram: jobflow #id runs input/* => Step1 => output1/part-000** => Step2 => output2/part-000** => Step3 => output3/part-000**)
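The chaining shown in the workflow diagram, where each step reads the part files written by the previous one, can be expressed as plain data before any job is submitted. The sketch below is illustrative only: `chain_steps` and the dictionary keys are hypothetical names, not options of the elastic-mapreduce CLI, and the `part-*` pattern is an assumed shorthand for the `part-000**` files on the slide.

```python
def chain_steps(first_input, step_names):
    """Build a list of steps where step N reads what step N-1 wrote."""
    steps = []
    current_input = first_input
    for number, name in enumerate(step_names, start=1):
        output_dir = f"output{number}/"
        steps.append({"name": name, "input": current_input, "output": output_dir})
        # Hadoop writes its results as part files inside the output folder.
        current_input = output_dir + "part-*"
    return steps


for step in chain_steps("input/*", ["Step1", "Step2", "Step3"]):
    print(step["name"], step["input"], "=>", step["output"])
```

Keeping the plan as data makes it easy to check the input and output folders of every step before paying for a cluster.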
  • 56. AWS: Elastic MapReduce ‣ Failed tasks are rescheduled on other Hadoop slaves. ‣ When a task finishes, any other instance of the same task is killed by the tracker.
  • 57. AWS: Elastic MapReduce
  • 58. AWS: SocialFlow Automation (diagram labels): Home IDC, Amazon, Wild World, Local, Global, Results, Admin, DHT, S3, Users, Read/Write, Read Only, Renderer, boto (python), Launching EC2 pool
  • 59. AWS: EC2, EMR Price Model
      Service             Type                            Per Instance Hour      1 Week (7 Days)   1 Week (7 Days)
      EC2                 On-Demand                       $0.10 (S)              $16.8             KRW 20,865
                                                          $0.40 (L)              $67.2             KRW 83,462
                                                          $0.80 (E)              $134.4            KRW 166,924
      EC2                 Reserved (1yr $325, 3yr $500)   $0.03 (S)              $5.04             KRW 6,259
                                                          $0.12 (L)              $20.16            KRW 25,038
                                                          $0.24 (E)              $40.32            KRW 50,077
      Elastic MapReduce   On-Demand                       $0.10 (S) + $0.015     $19.32            KRW 23,995
                                                          $0.40 (L) + $0.06      $77.28            KRW 95,981
                                                          $0.80 (E) + $0.12      $154.56           KRW 191,963
      (S) = Small, (L) = Large, (E) = Extra Large; 1 USD = 1242 KRW
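The weekly figures in the price model follow from the hourly rate times 168 hours, with the Elastic MapReduce surcharge added per instance hour and the result converted at 1 USD = 1242 KRW. A small sketch of that arithmetic is shown below; the function name `weekly_cost` is hypothetical, and truncating the KRW amount to a whole number is an assumption inferred from the values on the slide.

```python
from decimal import Decimal

HOURS_PER_WEEK = 7 * 24          # 168 instance hours
KRW_PER_USD = Decimal("1242")    # exchange rate stated on the slide


def weekly_cost(hourly_usd, emr_surcharge_usd="0"):
    """Return (USD, KRW) for one instance running a full week."""
    usd = (Decimal(hourly_usd) + Decimal(emr_surcharge_usd)) * HOURS_PER_WEEK
    # Assumption: the slide's KRW values drop the fractional part.
    return usd, int(usd * KRW_PER_USD)


print(weekly_cost("0.10"))           # EC2 On-Demand, Small
print(weekly_cost("0.80", "0.12"))   # Elastic MapReduce, Extra Large
```

Using `Decimal` with string inputs avoids binary floating point rounding, so the results can be compared exactly against the table.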
  • 60. AWS: Performance http://tinyurl.com/qj6ao7
  • 61. AWS: Performance
  • 62. AWS: Performance http://tinyurl.com/p9jsyz
  • 63. AWS: Performance http://tinyurl.com/cqqxgl
  • 64. 10 Cent Tips ‣ AWS EC2 • Minimizing set-up time with prepared shell scripts • Use Boto for automating deployments • Use S3 (Free of Charge between S3 and EC2 in the same region) • $0.030 per GB through June 30, 2009 ($0.1 per GB normal price) ‣ AWS Elastic MapReduce • Enabling the SSH port (22) and Hadoop related ports (9100, 9101) • Access to Master Node: ssh -i keypair hadoop@public_dns_name • Double Check (PATH, etc) • Debug, Debug, Debug • Use EC2 for Hadoop (e.g. Cloudera’s Hadoop AMI) (No extra cost for Hadoop!)
  • 65. 10 Cent Tips ‣ AWS S3 • Setting HTTP header for images and static resources: Cache-Control: max-age=31536000 • Block Search Bots: robots.txt at the root of a Bucket (User-agent: * / Disallow: /) • Using BitTorrent for large files: http://s3.xyz.com/xfile.zip?torrent • Compress Rendered HTML with gzip: Content-Encoding: gzip
      $ s3cmd put index.html s3://s3.xyz.com/www --mime-type "text/html" --add-header "Content-Encoding: gzip" --acl-public
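The gzip and caching tips above imply that the rendered HTML has to be compressed before upload and that the matching headers have to be stored with the object. A minimal sketch of that preparation step is given below; `prepare_html` is a hypothetical helper, the upload itself is left out, and applying the one-year Cache-Control value to HTML is only for illustration since the slide recommends it for images and static resources.

```python
import gzip

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # 31536000, the max-age used on the slide


def prepare_html(html_text):
    """Return (gzip-compressed body, headers to store with the S3 object)."""
    body = gzip.compress(html_text.encode("utf-8"))
    headers = {
        "Content-Type": "text/html",
        "Content-Encoding": "gzip",
        # Illustration only: pick a shorter max-age for pages that change.
        "Cache-Control": f"max-age={ONE_YEAR_SECONDS}",
    }
    return body, headers


body, headers = prepare_html("<html><body>SocialFlow</body></html>")
print(len(body), headers)
```

S3 serves the stored bytes as they are, so the Content-Encoding header must be set at upload time for browsers to decompress the page.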
  • 66. Amazon Web Service : Limitations
  • 67. References ‣ 10 MapReduce Tips, Cloudera, http://tinyurl.com/pxuqup ‣ Christian Charras, Thierry Lecroq, Handbook of Exact String-Matching Algorithms ‣ Dan Pritchett (eBay), BASE: An ACID Alternative, p.48-55, ACM Queue May/June 2008 ‣ Edward Chang (Google Research), Mining Large Scale Social Networks, MMDS ’08 ‣ Edward Walker, Benchmarking Amazon EC2 for high-performance scientific computing ‣ Matei Zaharia et al, Improving MapReduce Performance in Heterogeneous Environments, OSDI ’08 ‣ Following Twitter • http://twitter.com/AmazonEC2 • http://twitter.com/AmazonS3S3