Building infrastructure for Big Data

This deck gives a sample overview of the pain points encountered while building infrastructure for big data, along with solutions to them.

Published in: Technology, Business

Transcript

  • 1. Building the Infrastructure for Big Data @ The Fifth Elephant, July 27th, 2012. Prashant Kumar, Founder, PromptCloud. © PromptCloud 2012, All rights reserved
  • 2. Agenda
      • About
      • Context
      • Machines, Installation & Cloud Automation
      • Building blocks of a system
      • Sample application sketch
      • Components left out for lack of time
  • 3. Section 0: About
  • 4. About PromptCloud: we provide data feeds and feed ourselves on data, since 2009. How?
      • Large-scale data crawl and extraction
      • Hosted indexing
      • Custom data analytics
      • Working round the clock
    About Me
      • PromptCloud’s Founder
      • Yahoo!, 2007-2008
      • IIT-Kanpur CS, 2007
  • 5. Deliverable
  • 6. Section 0.1: Context
  • 7. Generic Big Data Systems
      • Multiple nodes (incoherent set of coherent ones)
      • Compute layer: interdependent processes
      • Data storage layer & multiple middleware
      • Tools for installation, monitoring & scheduling
    *Meta: source control, code reviews, continuous integration
  • 8. Section 1: Machines, Installation & Cloud Automation
  • 9. Installation: create an image and install
    Pros:
      • Easy to install
      • No maintenance cost
      • 1 image for 1 purpose
    Cons:
      • Modifications? Difficult to save it back
      • Apt, yum, etckeeper-like systems exist, but they are difficult to scale
    Solutions??
  • 10. Enter the Magic! Not a panacea; analgesic though
  • 11. Virtual Machines (diagram)
      • Vagrant: init, up, ssh; shared directory; port forwarding
      • Platforms: AWS, Xen, VirtualBox, KVM, …
      • Installation
  • 12. Code the Installation using Chef: give the recipe, i.e. code what’s to be done
      • Chef Solo ("I’m Solo")
      • Roles, Data Files, Recipes, Templates, Run List
      • Chef Server
      • Knife
  • 13. Section 2: Building blocks
  • 14. To keep processes running
      • Option 1: install GOD to monitor processes and to keep them in place
      • Option 2 (for atheists): install MONIT
    (Image courtesy: BIT Mesra)
  • 15. God’s Snippet
      # watcher_name, start_command and stop_command are supplied by the caller.
      God.watch do |w|
        w.name = watcher_name
        w.start = start_command
        #w.restart = restart_command
        w.stop = stop_command
        w.behavior(:clean_pid_file)
        #w.group = "some group"
        w.log = "/tmp/god_monitoring_#{watcher_name}.log"
        w.keepalive                  # start the process again if it is not running
        w.stop_timeout = 10.seconds
      end
  • 16. Job Scheduling: Resque, Beanstalk, Gearman, Celery, + cron and queues
    Things to remember while making choices:
      • Persistence
      • Priorities
      • Tags
      • Option for retry
      • Ability to inspect the queue
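The checklist on this slide can be made concrete with a toy in-memory queue. This is only a sketch under stated assumptions: `ToyQueue`, `enqueue`, `peek`, and `work_one` are hypothetical names, not the API of Resque, Beanstalk, Gearman, or Celery, and persistence is deliberately left out (a real system would back the queue with disk or a database).

```ruby
# Toy job queue illustrating priorities, tags, retries, and inspection.
# All names are hypothetical; persistence is intentionally omitted.
Job = Struct.new(:name, :priority, :tags, :attempts, :seq)

class ToyQueue
  attr_reader :dead   # jobs that exhausted their retries

  def initialize(max_attempts: 3)
    @max_attempts = max_attempts
    @pending = []
    @dead = []
    @seq = 0
  end

  # Lower priority number means the job runs earlier.
  def enqueue(name, priority: 5, tags: [])
    @seq += 1
    @pending << Job.new(name, priority, tags, 0, @seq)
    self
  end

  # Inspect pending job names in run order, optionally filtered by tag.
  def peek(tag: nil)
    jobs = @pending.sort_by { |job| [job.priority, job.seq] }
    jobs = jobs.select { |job| job.tags.include?(tag) } if tag
    jobs.map(&:name)
  end

  # Run the next job. The block returns true on success; a falsy result or
  # an exception counts as failure and the job is retried up to max_attempts.
  def work_one
    job = @pending.min_by { |j| [j.priority, j.seq] }
    return nil unless job
    @pending.delete(job)
    job.attempts += 1
    ok = begin
      yield(job)
    rescue StandardError
      false
    end
    return :done if ok
    if job.attempts < @max_attempts
      @seq += 1
      job.seq = @seq          # go to the back of its priority class
      @pending << job
      :retried
    else
      @dead << job
      :dead
    end
  end
end

queue = ToyQueue.new(max_attempts: 2)
queue.enqueue("reindex", priority: 9, tags: ["search"])
queue.enqueue("crawl", priority: 1, tags: ["crawl"])
puts queue.peek.inspect
```

When comparing real schedulers, each method here maps to one question from the slide: does the tool order by priority, can jobs be tagged and filtered, what happens after a failure, and can pending work be listed without consuming it.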
  • 17. Data Storage Layer: SQL/NoSQL, key/value, document-based, graph databases
      • For large systems, maintenance cost is a primary overhead
      • Replication & Availability
      • Consistency guarantees
      • Full-text search
  • 18. Voldemort ("Not me!")
      • Distributed key/value store
      • Great performance
      • Easy to add/remove nodes
      • Alternatives: Mongo, Riak, HBase, Cassandra
    (Image courtesy: harrypotter.wikia.com)
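How cheap it is to add or remove nodes depends largely on how keys are mapped to nodes. One common technique for this is consistent hashing, sketched below. The slide does not say how Voldemort itself places keys, so treat this as an illustration of the general idea rather than Voldemort's implementation; `HashRing` and its methods are hypothetical names.

```ruby
require "digest/md5"

# Minimal consistent-hash ring (illustrative only, not Voldemort's code).
class HashRing
  def initialize(nodes = [], replicas: 64)
    @replicas = replicas   # virtual points per node, to smooth the distribution
    @ring = {}             # ring point => node name
    @points = []           # sorted ring points
    nodes.each { |node| add(node) }
  end

  def add(node)
    @replicas.times { |i| @ring[point("#{node}:#{i}")] = node }
    @points = @ring.keys.sort
  end

  def remove(node)
    @ring.delete_if { |_point, owner| owner == node }
    @points = @ring.keys.sort
  end

  # The owner is the node at the first ring point at or after the key's hash,
  # wrapping around to the first point when the hash is past the last one.
  def node_for(key)
    return nil if @points.empty?
    hash = point(key)
    index = @points.bsearch_index { |p| p >= hash } || 0
    @ring[@points[index]]
  end

  private

  # Map a string to a 32-bit position on the ring.
  def point(string)
    Digest::MD5.hexdigest(string)[0, 8].to_i(16)
  end
end

ring = HashRing.new(%w[node-a node-b node-c])
keys = (1..1000).map { |i| "key-#{i}" }
before = keys.map { |key| [key, ring.node_for(key)] }.to_h
ring.add("node-d")
moved = keys.count { |key| ring.node_for(key) != before[key] }
puts "#{moved} of #{keys.size} keys moved after adding a node"
```

The point of the design is that only the keys claimed by the new node change owner; with naive `hash(key) % node_count` placement, most keys would move whenever the node count changes.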
  • 19. Messaging Layer
      • RabbitMQ: most commonly used in high-load production systems
      • Implements AMQP
      • Robust exchange server
      • Multiple kinds of exchanges: direct, topic, fanout
      • Options for HA with Pacemaker/DRBD
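The three exchange kinds differ only in how a message's routing key is compared with each queue's binding key. The following plain-Ruby sketch models those rules without talking to a broker: a direct exchange needs an exact match, a fanout exchange ignores the key, and a topic exchange matches dot-separated words where `*` stands for exactly one word and `#` for zero or more. The function names and the `[binding_key, queue]` pair layout are assumptions made for this illustration, not a RabbitMQ client API.

```ruby
# Returns true when a topic binding pattern matches a routing key.
def topic_match?(pattern, routing_key)
  match_words(pattern.split("."), routing_key.split("."))
end

# Recursive word-by-word matcher used by topic_match?.
def match_words(pattern_words, key_words)
  return key_words.empty? if pattern_words.empty?
  head, *rest = pattern_words
  case head
  when "#"
    # "#" may absorb zero or more words.
    (0..key_words.length).any? { |n| match_words(rest, key_words.drop(n)) }
  when "*"
    # "*" must absorb exactly one word.
    !key_words.empty? && match_words(rest, key_words.drop(1))
  else
    key_words.first == head && match_words(rest, key_words.drop(1))
  end
end

# bindings is a list of [binding_key, queue_name] pairs.
# Returns the names of the queues that would receive the message.
def route(exchange_type, bindings, routing_key)
  matched = bindings.select do |binding_key, _queue|
    case exchange_type
    when :fanout then true
    when :direct then binding_key == routing_key
    when :topic  then topic_match?(binding_key, routing_key)
    else raise ArgumentError, "unknown exchange type: #{exchange_type}"
    end
  end
  matched.map { |_binding_key, queue| queue }.uniq
end

bindings = [["sentences", "workers"], ["logs.#", "all_logs"], ["logs.*.error", "alerts"]]
puts route(:direct, bindings, "sentences").inspect
puts route(:topic, bindings, "logs.web.error").inspect
puts route(:fanout, bindings, "ignored").inspect
```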
  • 20. Section 3: Demo
  • 21. Demo Sketch
      1. We’ll generate random sentences based on a Markov chain
      2. Store these in Voldemort
      3. Enqueue corresponding jobs in RabbitMQ
      4. Another set of workers will process these sentences
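The four steps of the demo can be sketched end to end in a single Ruby file. This is not the code shown in the talk: to keep it self-contained, a plain Hash stands in for Voldemort and Ruby's built-in thread-safe `Queue` stands in for RabbitMQ, and the "processing" done by the workers (counting words) is an assumption chosen only to have something to compute. All function names are hypothetical.

```ruby
# Order-1 Markov chain: map each word to the list of words seen right after it.
def build_chain(text)
  chain = Hash.new { |hash, word| hash[word] = [] }
  text.split.each_cons(2) { |word, follower| chain[word] << follower }
  chain
end

# Step 1: walk the chain from a start word, picking a random follower each time.
def generate_sentence(chain, start, max_words: 8, rng: Random.new)
  words = [start]
  while words.length < max_words
    followers = chain.fetch(words.last, [])
    break if followers.empty?
    words << followers.sample(random: rng)
  end
  words.join(" ")
end

def run_demo(corpus, start:, sentences: 5, workers: 2, rng: Random.new)
  chain = build_chain(corpus)
  store = {}          # step 2: key/value store (stand-in for Voldemort)
  jobs = Queue.new    # step 3: job queue (stand-in for RabbitMQ)
  results = Queue.new

  sentences.times do |i|
    key = "sentence:#{i}"
    store[key] = generate_sentence(chain, start, rng: rng)
    jobs << key
  end
  workers.times { jobs << :stop }   # one stop marker per worker

  # Step 4: workers pop keys, read the sentence back, and count its words.
  pool = Array.new(workers) do
    Thread.new do
      while (key = jobs.pop) != :stop
        results << [key, store.fetch(key).split.length]
      end
    end
  end
  pool.each(&:join)

  counts = {}
  counts.store(*results.pop) until results.empty?
  [store, counts]
end

corpus = "big data needs big infrastructure and big data needs good tools"
store, counts = run_demo(corpus, start: "big")
counts.sort.each { |key, n| puts "#{key}: #{n} words -> #{store[key]}" }
```

Note that the queue carries only keys, not the sentences themselves; the workers fetch the payload from the store, which mirrors the split between steps 2 and 3 on the slide.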
  • 22. Section 4: For the lack of time..
  • 23. Sensu & Graphite
      • Monitoring router
      • "Check scripts" on nodes
      • "Handler scripts" on servers
      • Output can be sent to PagerDuty, Graphite, Twitter or IRC
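The split between check scripts and handler scripts can be sketched in plain Ruby. Sensu checks follow the Nagios plugin convention: print one line of output and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The metric checked here (a queue depth), the thresholds, the function names, and the exact event layout passed to the handler are assumptions for illustration; a real check would end with `puts line; exit code`, and a real handler would read the event JSON from standard input.

```ruby
require "json"

STATUS_NAMES = { 0 => "OK", 1 => "WARNING", 2 => "CRITICAL", 3 => "UNKNOWN" }.freeze

# Check logic (runs on each node): classify a measured value against thresholds.
# Returns [exit_code, output_line].
def check_queue_depth(depth, warn: 100, crit: 500)
  code =
    if depth.nil? then 3
    elsif depth >= crit then 2
    elsif depth >= warn then 1
    else 0
    end
  [code, "QueueDepth #{STATUS_NAMES[code]}: depth=#{depth.inspect}"]
end

# Handler logic (runs on the server): turn an event into a one-line alert
# that could be forwarded to a pager, chat channel, or metrics store.
# The event layout used here is an assumption for this sketch.
def handle_event(event_json)
  event = JSON.parse(event_json)
  client = event.fetch("client").fetch("name")
  check = event.fetch("check")
  status = STATUS_NAMES.fetch(check.fetch("status"), "UNKNOWN")
  "[#{status}] #{client}/#{check.fetch('name')}: #{check.fetch('output')}"
end

code, line = check_queue_depth(250)
event = {
  "client" => { "name" => "worker-1" },
  "check" => { "name" => "queue_depth", "status" => code, "output" => line }
}
puts handle_event(JSON.generate(event))
```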
  • 24. Distributed Log Collection: Scribe, Flume, Splunk
    Flume:
      • Allows multiple topologies
      • Agent
      • Collector
      • Sink
  • 25. Feel free to reach out: info@promptcloud.com. Big Data made Small. Appreciate your time. Thanks to Arpan Jha for her help with the slides.