SNS Analysis using Cloud Computing Services

SNS Cloud AWS S3 Hadoop MapReduce

Published in: Technology, Business

  1. PlatformDay2009
     SNS Analysis using Cloud Computing Services
     DHT-based Key-Value Storage and MapReduce-based Analysis
     DongWoo Lee, oiko.cloud@gmail.com
     Oiko Laboratory, SocialFlow (OikoLab), CloudKR
  2. Agenda
     ‣ Introduction
       • Social Network Service
       • Motivation: Visualization, Social Network Analysis
       • SocialFlow
       • Scale-Out Technologies: Cloud Computing
     ‣ SNS Analysis Architecture based on Cloud
       • Overall Process
       • Crawling
       • DHT Storage (CouchDB)
       • MapReduce
       • Pair-Wise Similarity
     ‣ Cloud Computing Service
       • Amazon Web Service
       • EC2 / S3 / Elastic MapReduce
       • Tips
     ‣ References
  3. Introduction: Social Network, Cloud Computing, Mobile Device
  4. Social Network Service
     “Social Applications = Social Networks”
     “A social network is a collection of people bound together through a specific set of social relations.”
     “A collection of people is a social network if and only if it is possible for something to spread virally through that collection.”
  5. Social Network Services: Twitter, Facebook
  6. Social Applications
  7. Social Networks http://www.vincos.it/world-map-of-social-networks/
  8. Social Network Analysis
     ‣ Social Graph Analysis
     ‣ Visualization
     ‣ Person-to-Person Relationship
     ‣ Temporal Mind Mining (Content Clustering)
     ‣ Post-Mortem Log Processing
  9. Social Network Analysis: Visualization
  10. Social Network Analysis: Visualization
  11. Social Network Analysis: Visualization
  12. Social Network Analysis: Visualization
  13. SocialFlow
      ‣ Thoughts, Feelings, Interests, Relationships and Information of SNS
      ‣ Real-time Massive Social Data Streams
      ‣ Difficult to follow the Social Streams
      ‣ Need a way to get a summary or clustered information based on Common Interests
  14. SocialFlow
      ‣ Getting Common Flows of people through Content Similarities
      ‣ Reflecting Short-Term Interests of People
      ‣ Extracting Hot Issues
      ‣ Revealing Relationships among In/Out Resources
      ‣ Implementing Scale-Out Technologies
      ‣ Evolving toward a Recommendation System based on Collective Intelligence
  15. Scale-Out Technologies: Cloud Computing
  16. Why Cloud Computing?
      ‣ SPOF (Single Point of Failure)
      ‣ Cluster Administration (Who does this?)
      ‣ Initial Infrastructure Investment (Risk Management)
      ‣ Focus on the Main Thing (Intelligence)
      ‣ Enable Highly Scalable Services
      New resource provision paradigms for Grid Infrastructures: Virtualization and Cloud / ISGC 2009 http://tinyurl.com/nacgu7
  17. Cloud Computing: e.g. Storage Failure
  18. SNS Analysis Architecture based on Cloud
  19. Experimental Project
      ‣ Python / Django / Boto
      ‣ ML / Data Mining
      ‣ DHT / CouchDB
      ‣ Cloud / AWS S3, EC2, Hadoop MapReduce
  20. Workflow
      Diagram labels: SNS, Crawler, MapReduce, Post-Processing, CDN, User; In-house Cluster (Local DataCenter), Cloud Service
  21. Technologies: Before
      Diagram labels: Crawler (x3); Hash_ring Consistent DHT; CouchDB Key-Value Storage; CouchJS MapReduce; Home Made Machine Learning
  22. Technologies: After
      Diagram labels: Crawler (x3); Hash_ring Consistent DHT; CouchDB Key-Value Storage; S3 Storage; EC2 Hadoop MapReduce; Home Made Machine Learning
  23. Crawling
      ‣ Fetching recent postings from the SNS
      ‣ Storing fetched postings in CouchDB storage through the DHT layer (which selects a server)
      ‣ Pushing raw data into the Cloud to process it with MapReduce
      Diagram labels: Crawler (x4), DB (x4), DHT, Replication, Indexer (Mapper), Index File [ term, doc ]
  24. Consistent DHT (Distributed Hash Table)
      ‣ Uniform key distribution and load balancing with a good hash function
      ‣ Minimizing the effects of a storage crash or temporary downtime
      ‣ High availability with a replication scheme
      ‣ Notice: A real node has non-linear portions of the total key space.
      Diagram labels: ring of nodes 0, 1, 2, ..., k-1, k, k+1, ..., N-1; replicas of Node k on Node k-1 and Node k+1: Replicate(k, k-1, k+1)
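The key-to-server selection described on this slide can be sketched in a few lines of Python. This is a minimal illustration of consistent hashing in general, not the deck's actual `hash_ring` code: the `HashRing` class, the `couch-N` server names, the MD5 hash, and the 64 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib


def key_hash(value: str) -> int:
    """Map a string to a point on the ring (MD5 here; any uniform hash would do)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical helper class)."""

    def __init__(self, nodes, vnodes=64):
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            for i in range(vnodes):
                point = key_hash(f"{node}#{i}")
                self._owner[point] = node
                bisect.insort(self._points, point)

    def lookup(self, key: str) -> str:
        """Return the node that owns the first ring position clockwise from the key."""
        idx = bisect.bisect_right(self._points, key_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


servers = ["couch-0", "couch-1", "couch-2", "couch-3"]   # assumed server names
ring = HashRing(servers)
keys = [f"posting-{n}" for n in range(1000)]
before = {k: ring.lookup(k) for k in keys}

# Simulate a storage crash: rebuild the ring without couch-2.
smaller = HashRing([s for s in servers if s != "couch-2"])
after = {k: smaller.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only the keys that lived on the crashed server appear in `moved`; every other key keeps its server, which is the "minimizing the effects of a storage crash" property listed on the slide.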
  25. Consistent DHT (Distributed Hash Table)
      Diagram labels: Admin View, User View, Anonymous Traffic, User Traffic, Generated Contents (html, image), SNS, SNS Crawler, SNS Analysis, AWS S3, Front End, Memory Cache, DHT ring (Node 0, 1, 2, ..., k-1, k, k+1, ..., N-1)
  26. Consistent DHT: Replication
      * Replica = 2
      Diagram labels: ring of nodes A, B, C, D; stored pairs A B, B C, C D, D A; replicas of B
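One way to read the "Replica = 2" diagram is that each node's data is held by two consecutive nodes on the ring (A B, B C, C D, D A). The sketch below assumes that successor-based placement. It is an interpretation of the flattened diagram, not a confirmed description of the author's scheme, and both function names are hypothetical.

```python
def replica_nodes(ring, primary, replicas=2):
    """Return `replicas` consecutive nodes starting at `primary`, wrapping around the ring."""
    start = ring.index(primary)
    return [ring[(start + i) % len(ring)] for i in range(replicas)]


def surviving_copies(ring, primary, down, replicas=2):
    """Nodes that still hold the data when the nodes in `down` are unavailable."""
    return [n for n in replica_nodes(ring, primary, replicas) if n not in down]


ring = ["A", "B", "C", "D"]
placement = {node: replica_nodes(ring, node) for node in ring}
still_readable = surviving_copies(ring, "B", down={"B"})
```

With two copies per range, losing any single node leaves one readable copy on its neighbour.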
  27. CouchDB (Key-Value Storage)
      ‣ Erlang-based Key-Value Storage
      ‣ Storage Engine (MVCC, B-tree)
      ‣ RESTful API
      ‣ Server-side JavaScript Engine (MapReduce)
      ‣ View Engine
      ‣ Futon Web UI
  28. CouchDB: Server-side JavaScript
      ‣ Purpose
        ‣ Local Computations on Local Data Sets
      ‣ Features
        ‣ Mozilla's SpiderMonkey
        ‣ MapReduce Framework with JavaScript
        ‣ Fork External Process (couchjs)
      ‣ Performance Enhancements Expected
        ‣ Google's V8 (Chrome's JavaScript Engine / JIT) http://tinyurl.com/m76sx3
  29. CouchDB: MapReduce
      doc = (d1, d2, fq) dx: { di }
  30. Map & Reduce: Pair-Wise Similarity
      ‣ Indexer and Grouper for processing Korean.
      ‣ No NLP and no structural analysis.
      ‣ Produces a pairwise similarity between two postings.
      Diagram labels: DB => Indexer (Mapper) => Index File [ term, doc ] => Grouper (Reducer) => Doc Group File [ term, { docs } ] => Combinator (Mapper) => Doc Candidate File [ doc1, doc2 ] => DocPair Counter (Reducer) => Result File [ freq, doc1, doc2 ]
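The record shapes on this slide ([ term, doc ], [ term, { docs } ], [ doc1, doc2 ], [ freq, doc1, doc2 ]) suggest a two-pass map/reduce. The in-memory Python sketch below follows those shapes under stated assumptions: whitespace tokenization stands in for the Korean indexer, the function names are hypothetical, and the score is taken to be the number of terms two postings share.

```python
from collections import Counter, defaultdict
from itertools import combinations


def index_terms(postings):
    """Map: emit one (term, doc_id) pair per distinct term in each posting."""
    for doc_id, text in postings.items():
        for term in set(text.lower().split()):
            yield term, doc_id


def group_by_term(pairs):
    """Reduce: collect the set of documents that contain each term."""
    groups = defaultdict(set)
    for term, doc_id in pairs:
        groups[term].add(doc_id)
    return groups


def candidate_pairs(groups):
    """Map: emit every document pair (ordered by id) that shares a term."""
    for docs in groups.values():
        for doc1, doc2 in combinations(sorted(docs), 2):
            yield doc1, doc2


def count_pairs(pairs):
    """Reduce: score each pair by how many terms it shares, highest first."""
    counts = Counter(pairs)
    return sorted(((freq, d1, d2) for (d1, d2), freq in counts.items()), reverse=True)


postings = {
    "p1": "hadoop mapreduce on ec2",
    "p2": "running hadoop mapreduce jobs",
    "p3": "lunch photos",
}
scores = count_pairs(candidate_pairs(group_by_term(index_terms(postings))))
```

Here `p1` and `p2` share two terms, so the only result record is `(2, "p1", "p2")`; `p3` shares nothing and never becomes a candidate.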
  31. Map & Reduce: Optimization
      ‣ Concerns
        ‣ Consider Key Group Size Distribution
        ‣ Data Load Balancing
        ‣ Barrier Point
      ‣ Sample Data
        ‣ Two months of postings from my friends
        ‣ Reachable graph: 4,060 people
        ‣ Total Postings: 206,115
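Why key group size matters can be shown with a little arithmetic: a term shared by n postings produces n(n-1)/2 candidate pairs, so a single very common term can dominate a reducer. The sketch below is an illustration only. The group sizes are made up, and the greedy largest-first assignment is a generic load-balancing heuristic, not something the slides describe.

```python
import heapq


def pairs_for_group(size: int) -> int:
    """Number of document pairs produced by a key group of `size` documents."""
    return size * (size - 1) // 2


def balance(group_sizes, workers):
    """Greedy largest-first assignment of key groups to workers by pair count."""
    heap = [(0, w, []) for w in range(workers)]   # (load, worker id, assigned terms)
    heapq.heapify(heap)
    for term, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        load, worker, terms = heapq.heappop(heap)  # least-loaded worker so far
        terms.append(term)
        heapq.heappush(heap, (load + pairs_for_group(size), worker, terms))
    return sorted(heap, key=lambda entry: entry[1])


# Hypothetical key groups: term -> number of postings containing it.
group_sizes = {"the": 1000, "hadoop": 40, "ec2": 30, "couchdb": 20}
assignment = balance(group_sizes, workers=2)
```

Even with balancing, the worker that receives the 1000-posting group carries 499,500 pairs while the other carries 1,405, and the job cannot pass its barrier until the slow worker finishes.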
  32. Pair-Wise Similarity and its TreeMap
      Postings: 110,008, Users: 2,691, Score >= 6
  33. Pair-Wise Similarity and its Cluster
      ➡ One issue and different opinions among people
  34. Pair-Wise Similarity and its Cluster
      ➡ Common Interest / Hot Issue
  35. Pair-Wise Similarity and its Cluster
      ➡ One person and the similar contents pattern (specialty)
  36. Pair-Wise Similarity and its Cluster
      ➡ Similar structure of sentences (trendy, parody)
  37. Deployment
      Diagram labels: EC2, S3/CloudFront, Flickr, www
  38. Cloud Computing Service
  39. Before the Cloud Age
      ‣ Smart Shell Guru's Daily Work: Parallel Sort
        $ wc -l data
        $ split -l 1000k data
        (distribute the pieces via scp / NFS)
        $ nohup ./work.sh data1 > data1.processed
        $ nohup sort -r data1.processed > data1.sorted
        (collect the sorted pieces via scp / NFS)
        $ sort -rm data*.sorted > data.sorted
      Complexity
      ➡ Need to prepare/maintain physical machines and resources
      ➡ Need to monitor job progress (wait and see the job's status)
      ➡ Need to cope with machine failure (slave nodes / storage / networks)
      ➡ Need to schedule multiple jobs
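The shell routine on this slide is a split, sort-each-piece, merge pattern. A minimal Python sketch of the same idea follows, with hypothetical function names and an in-memory list standing in for the data file; the `./work.sh` processing step is left out.

```python
import heapq


def split(lines, chunk_size):
    """Like `split -l`: cut the input into fixed-size chunks of lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def sort_chunk(chunk):
    """Like `sort -r` on one machine: sort a single chunk in descending order."""
    return sorted(chunk, reverse=True)


def merge_sorted(chunks):
    """Like `sort -rm`: merge already-sorted chunks without re-sorting them."""
    return list(heapq.merge(*chunks, reverse=True))


data = ["pear", "apple", "fig", "kiwi", "plum", "date", "lime"]
chunks = [sort_chunk(c) for c in split(data, 3)]   # each chunk could run on its own node
result = merge_sorted(chunks)
```

The merge step only works because every chunk is already sorted in the same (descending) order, which is why the shell version passes `-r` to both `sort` invocations.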
  40. Amazon Web Service: Overview (diagram labels)
      ‣ EC2 (Elastic Compute Cloud): AMI (Machine Image), Key Pair, SSH, EC2 CLI, Mgmt Console, Admin
      ‣ Auto Scaling, CloudWatch Monitoring, Elastic Load Balancing
      ‣ EBS (Elastic Block Store): 1 GB to 1 TB, Mount
      ‣ SQS (Simple Queue Service): Messages
      ‣ S3 (Simple Storage Service): Buckets, Objects, Header, Permissions, API Clients, HTTP Clients
      ‣ Import/Export: Offline, eSATA/USB
      ‣ SimpleDB: key-value
      ‣ CloudFront: Edges, HTTP Clients
      ‣ Access Key ID, Secret Access Key
      ‣ Elastic MapReduce: Instant EC2 Hadoop Cluster
  41. Amazon Web Service
      ‣ Amazon Management Console
  42. AWS: AMI (Amazon Machine Image)
  43. AWS: Paid AMI / The Cloud Market
  44. AWS: How to make an AMI (1)
      Loopback File
        # dd if=/dev/zero of=new_image.fs bs=1M count=1024
      Make ext3 file system
        # mke2fs -F -j new_image.fs
        # mkdir /mnt/ec2-fs
        # mount -o loop new_image.fs /mnt/ec2-fs
        # mkdir /mnt/ec2-fs/dev
        # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x console
        # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x null
        # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x zero
        # mkdir /mnt/ec2-fs/etc
      Create /mnt/ec2-fs/etc/fstab (Add /dev/sda1 --> /, /dev/pts, /dev/shm, /proc, /sys)
      Create yum-xen.conf
        # mkdir /mnt/ec2-fs/proc
        # mount -t proc none /mnt/ec2-fs/proc
        # yum -c yum-xen.conf --installroot=/mnt/ec2-fs -y groupinstall Base
      Edit /mnt/ec2-fs/etc/sysconfig/network-scripts/ifcfg-eth0
      Edit /mnt/ec2-fs/etc/sysconfig/network
      Edit /mnt/ec2-fs/etc/fstab (Add /dev/sda2 --> /mnt, /dev/sda3 --> swap)
        # chroot /mnt/ec2-fs /bin/sh
      Edit services
  45. AWS: How to make an AMI (2)
      Building an AMI
        # yum install ruby
        # rpm -i ec2-ami-tools-noarch.rpm (Download from public s3 bucket)
        # ec2-bundle-image -i new_image.fs -k my-private-key.key -u aws-user-id
      Local Machine Root File System
        # ec2-bundle-vol -k my-private-key.key -s 1000 -u aws-user-id
      Upload to S3
        # ec2-upload-bundle -b my-bucket -m image.manifest -a my-aws-access-key-id -s my-secret-key-id
      Register AMI
        # ec2-register my-bucket/image.manifest
        IMAGE ami-xxxx
      Testing
        # ec2-describe-images ami-xxxx
      Deregister AMI
        # ec2-deregister ami-xxxx
      Running AMI
        # ec2-run-instances ami-xxxx -n 1
      http://docs.amazonwebservices.com/AWSEC2/2006-06-26/DeveloperGuide/
  46. AWS: EC2 Running Instance
      ‣ AWS Management Console
  47. AWS: EC2 Running Instance
  48. Amazon Web Service: Access Methods
      ‣ Access Key ID / Secret Access Key / Key Pairs
      ‣ Amazon Management Console
      ‣ EC2 API (WSDL) / EC2 CLI (Command Line Interface)
      ‣ SSH
      ‣ Firefox Extensions
        • S3 Firefox Organizer
        • Elasticfox
      ‣ S3
        • DNS: s3 CNAME s3.amazonaws.com.
          e.g. Bucket Name: s3.xyz.com, so that http://s3.xyz.com ---> S3's bucket s3.xyz.com
      ‣ s3cmd (python)
      ‣ s3cmd.rb / s3sync.rb (ruby)
      ‣ S3Hub (Mac)
  49. Amazon Web Service: Elasticfox
      ‣ Firefox's Extension: Elasticfox
  50. Amazon Web Service: Elasticfox
      ‣ Key Pairs
      ‣ Private Key
      ‣ SSH
  51. Amazon Web Service: Elasticfox
      ‣ Security Groups
      ‣ Open Network Ports
  52. AWS: Elastic MapReduce
      ‣ EC2 + Hadoop
      ‣ Tools
        ‣ Management Console
        ‣ elastic-mapreduce CLI
      ‣ Preparation
        ‣ Code --> S3
        ‣ Data --> S3
        ‣ Log Folder
        ‣ Output Folder
      ‣ Job Flow
        ‣ Streaming
        ‣ Custom Jar
        ‣ Sample Applications
  53. AWS: Elastic MapReduce
  54. AWS: Elastic MapReduce: Web UI
  55. AWS: Elastic MapReduce: CLI for Workflow
      jobflow #id: input/* => Step1 => output1/part-000** => Step2 => output2/part-000** => Step3 => output3/part-000**
  56. AWS: Elastic MapReduce
      ‣ Failed tasks will be rescheduled on other Hadoop slaves.
      ‣ If a task is finished, the same instance will be killed by a tracker.
  57. AWS: Elastic MapReduce
  58. AWS: SocialFlow Automation
      Diagram labels: Home, IDC, Amazon, Wild World, Local, Global, Results, Admin, DHT, S3, Users, Read/Write, Read Only, Renderer, boto (python), Launching EC2 pool
  59. AWS: EC2, EMR Price Model
      Service            Type        Per Instance Hour      1 Week (7 Days)
      EC2                On-Demand   $0.10 (S)              $16.80    KRW 20,865
                                     $0.40 (L)              $67.20    KRW 83,462
                                     $0.80 (E)              $134.40   KRW 166,924
      EC2                Reserved    $0.03 (S)              $5.04     KRW 6,259
                                     $0.12 (L)              $20.16    KRW 25,038
                                     $0.24 (E)              $40.32    KRW 50,077
      Elastic MapReduce  On-Demand   $0.10 + $0.015 (S)     $19.32    KRW 23,995
                                     $0.40 + $0.06 (L)      $77.28    KRW 95,981
                                     $0.80 + $0.12 (E)      $154.56   KRW 191,963
      Reserved one-time fee: 1yr $325, 3yr $500
      (S) = Small, (L) = Large, (E) = Extra Large; 1 USD = 1242 KRW
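The weekly figures on this slide follow from the hourly rate times 24 x 7 hours, converted at the quoted 1 USD = 1242 KRW. For Elastic MapReduce rows the EMR fee is added to the EC2 rate first. A small sketch of that arithmetic, using a hypothetical `weekly_cost` helper and `Decimal` to avoid floating-point rounding:

```python
from decimal import Decimal

HOURS_PER_WEEK = 24 * 7
KRW_PER_USD = Decimal("1242")   # exchange rate quoted on the slide


def weekly_cost(ec2_hourly, emr_hourly="0"):
    """Return (USD, KRW) for one instance running a full week."""
    usd = (Decimal(ec2_hourly) + Decimal(emr_hourly)) * HOURS_PER_WEEK
    return usd, usd * KRW_PER_USD


small_on_demand = weekly_cost("0.10")      # EC2 Small, on-demand
small_emr = weekly_cost("0.10", "0.015")   # EC2 Small plus the EMR fee
```

`small_on_demand` works out to 16.80 USD (20,865.6 KRW) and `small_emr` to 19.32 USD (23,995.44 KRW); the slide shows the KRW values with the fraction dropped.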
  60. AWS: Performance http://tinyurl.com/qj6ao7
  61. AWS: Performance
  62. AWS: Performance http://tinyurl.com/p9jsyz
  63. AWS: Performance http://tinyurl.com/cqqxgl
  64. 10 Cent Tips
      ‣ AWS EC2
        ‣ Minimize set-up time with prepared shell scripts
        ‣ Use Boto for automating deployments
        ‣ Use S3 (free of charge between S3 and EC2 in the same region)
          ‣ $0.030 per GB through June 30, 2009 ($0.1 per GB normal price)
      ‣ AWS Elastic MapReduce
        ‣ Enable the SSH port (22) and Hadoop-related ports (9100, 9101)
        ‣ Access to Master Node: ssh -i keypair hadoop@public_dns_name
        ‣ Double Check (PATH, etc.)
        ‣ Debug, Debug, Debug
      ‣ Use EC2 for Hadoop (e.g. Cloudera's Hadoop AMI) (No extra cost for Hadoop!)
  65. 10 Cent Tips
      ‣ AWS S3
        ‣ Setting HTTP headers for images and static resources
          ‣ Cache-Control: max-age=31536000
        ‣ Block Search Bots
          ‣ robots.txt at the root of a Bucket
            User-agent: *
            Disallow: /
        ‣ Using BitTorrent for large files
          ‣ http://s3.xyz.com/xfile.zip?torrent
        ‣ Compress Rendered HTML with gzip
          ‣ Content-Encoding: gzip
          $ s3cmd put index.html s3://s3.xyz.com/www --mime-type "text/html" --add-header "Content-Encoding: gzip" --acl-public
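The header tips on this slide can be prepared in Python before an upload. The sketch below only builds the bytes and header dictionaries; it makes no S3 call, and the function names are hypothetical. The resulting body and headers would be passed to whatever upload tool is in use (s3cmd, Boto, and so on).

```python
import gzip


def prepare_html(html: str):
    """Gzip rendered HTML and return (body, headers) ready for an upload call."""
    body = gzip.compress(html.encode("utf-8"))
    headers = {
        "Content-Type": "text/html",
        "Content-Encoding": "gzip",   # tells browsers to decompress transparently
    }
    return body, headers


def static_headers(max_age_days=365):
    """Cache-Control header for images and other static resources."""
    return {"Cache-Control": f"max-age={max_age_days * 24 * 60 * 60}"}


# robots.txt content that blocks all crawlers, to be stored at the bucket root.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

page = "<html><body>" + "SocialFlow " * 200 + "</body></html>"
body, headers = prepare_html(page)
```

One year of caching is 365 x 24 x 60 x 60 = 31,536,000 seconds, which matches the `max-age` value on the slide.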
  66. Amazon Web Service: Limitations
  67. References
      ‣ 10 MapReduce Tips, Cloudera, http://tinyurl.com/pxuqup
      ‣ Christian Charras, Thierry Lecroq, Handbook of Exact String-Matching Algorithms
      ‣ Dan Pritchett (eBay), BASE: An Acid Alternative, p.48-55, ACM Queue May/June 2008
      ‣ Edward Chang (Google Research), Mining Large Scale Social Networks, MMDS '08
      ‣ Edward Walker, Benchmarking Amazon EC2 for high-performance scientific computing
      ‣ Matei Zaharia et al., Improving MapReduce Performance in Heterogeneous Environments, OSDI '08
      ‣ Following Twitter
        ‣ http://twitter.com/AmazonEC2
        ‣ http://twitter.com/AmazonS3S3
