Building infrastructure for Big Data

This deck gives a brief overview of the pain points encountered while building infrastructure for big data, along with solutions to them.

Transcript

  • 1. Building the Infrastructure for Big Data @ The Fifth Elephant, July 27th, 2012. Prashant Kumar, Founder, PromptCloud. © PromptCloud 2012, All rights reserved
  • 2. Agenda
      • About
      • Context
      • Machines, Installation & Cloud Automation
      • Building blocks of a system
      • Sample application sketch
      • Lack of time components
  • 3. About (Section 0)
  • 4. About PromptCloud: We provide data feeds and feed ourselves on data, since 2009. How?
      • Large-scale data crawl and extraction
      • Hosted indexing
      • Custom data analytics
      • Working round the clock
    About Me
      • PromptCloud’s Founder
      • Yahoo!, 2007-2008
      • IIT-Kanpur CS, 2007
  • 5. Deliverable
  • 6. Context (Section 0.1)
  • 7. Generic Big Data Systems
      • Multiple nodes (incoherent set of coherent ones)
      • Compute layer: interdependent processes
      • Data storage layer & multiple middleware
      • Tools for installation, monitoring & scheduling
    *Meta: source control, code reviews, continuous integration
  • 8. Machines, Installation & Cloud Automation (Section 1)
  • 9. Installation: Create an image and install
      • Easy to install
      • No maintenance cost
      • 1 image for 1 purpose
      • Modifications? Difficult to save it back
      • Apt, yum, etckeeper-like systems, but difficult to scale
    Solutions??
  • 10. Enter the Magic! Not a panacea; analgesic though
  • 11. Virtual Machines (diagram labels): Vagrant (ssh, up, init, shared directory, port forwarding); AWS, Xen, VirtualBox, KVM, …; Installation
  • 12. Code the Installation using Chef. Give the recipe: code what’s to be done. (Diagram labels: "I’m Solo"; Roles, Data Files, Recipes, Templates, Run List; Chef Server; Knife)
  • 13. Building blocks (Section 2)
  • 14. To keep processes running
      • Option 1: Install God to monitor processes and to keep them in place
      • Option 2 (for atheists): Install Monit
    (Image courtesy: BIT Mesra)
  • 15. God’s Snippet

        # watcher_name, start_command and stop_command are set by the surrounding config.
        God.watch do |w|
          w.name = watcher_name
          w.start = start_command
          # w.restart = restart_command
          w.stop = stop_command
          # Remove a stale PID file before starting the process.
          w.behavior(:clean_pid_file)
          # w.group = "some group"
          w.log = "/tmp/god_monitoring_#{watcher_name}.log"
          # Restart the process whenever it is found not running.
          w.keepalive
          w.stop_timeout = 10.seconds
        end
  • 16. Job Scheduling: Resque, Beanstalk, Gearman, Celery, plus cron and queues. Things to remember while making choices:
      • Persistence
      • Priorities
      • Tags
      • Option for retry
      • Ability to inspect the queue
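
The selection criteria on this slide can be made concrete with a toy queue. The following is a minimal in-memory sketch, not how Resque, Beanstalk, Gearman or Celery are implemented; `TinyQueue`, `Job` and every method name are hypothetical, and persistence is deliberately left out because everything lives in process memory.

```ruby
# Hypothetical in-memory job queue showing priorities, tags, retry and inspection.
Job = Struct.new(:name, :priority, :tags, :attempts)

class TinyQueue
  def initialize(max_attempts: 3)
    @jobs = []
    @max_attempts = max_attempts
  end

  # Add a job; a lower priority number means it runs sooner.
  def enqueue(name, priority: 10, tags: [])
    @jobs << Job.new(name, priority, tags, 0)
  end

  # Inspect pending job names in the order they would run, optionally by tag.
  def pending(tag: nil)
    jobs = tag ? @jobs.select { |j| j.tags.include?(tag) } : @jobs
    jobs.each_with_index.sort_by { |j, i| [j.priority, i] }.map { |j, _| j.name }
  end

  # Run the most urgent job with the given block.
  # A job that raises is put back until its attempts are used up.
  def work_one
    pair = @jobs.each_with_index.min_by { |j, i| [j.priority, i] }
    return nil unless pair

    job = @jobs.delete_at(pair.last)
    job.attempts += 1
    begin
      yield job
      :done
    rescue StandardError
      if job.attempts < @max_attempts
        @jobs << job
        :retry
      else
        :failed
      end
    end
  end
end

queue = TinyQueue.new(max_attempts: 2)
queue.enqueue("crawl", priority: 5, tags: ["web"])
queue.enqueue("index", priority: 1, tags: ["search"])
puts queue.pending.inspect
```

A production queue would add durable storage behind `enqueue`, which is the persistence criterion from the list.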
  • 17. Data Storage Layer: SQL/NoSQL, key/value, document-based, graph databases
      • For large systems, maintenance cost is a primary overhead
      • Replication & availability
      • Consistency guarantees
      • Full-text search
  • 18. Voldemort ("Not me!")
      • Distributed key/value store
      • Great performance
      • Easy to add/remove nodes
      • Alternatives: Mongo, Riak, HBase, Cassandra
    (Image courtesy: harrypotter.wikia.com)
  • 19. Messaging Layer
      • RabbitMQ: most commonly used in high-load production systems
      • Implements AMQP
      • Robust exchange server
      • Multiple kinds of exchanges: direct, topic, fanout
      • Options for HA with Pacemaker/DRBD
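
The three exchange types differ only in how a message's routing key is compared with each queue's binding key. The sketch below models that comparison in plain Ruby, with no broker or client library involved; `route`, `topic_match?` and the binding hashes are hypothetical names for illustration. It assumes the usual AMQP topic wildcards, where `*` stands for exactly one dot-separated word and `#` for zero or more.

```ruby
# Match a topic binding pattern against a routing key, word by word.
def topic_match?(pattern, key)
  match_words(pattern.split("."), key.split("."))
end

def match_words(pat, words)
  return words.empty? if pat.empty?

  head, *rest = pat
  case head
  when "#"
    # '#' may absorb zero or more words.
    (0..words.length).any? { |n| match_words(rest, words.drop(n)) }
  when "*"
    # '*' must consume exactly one word.
    !words.empty? && match_words(rest, words.drop(1))
  else
    words.first == head && match_words(rest, words.drop(1))
  end
end

# Return the queue names a message would be delivered to.
def route(exchange_type, bindings, routing_key)
  matched =
    case exchange_type
    when :fanout then bindings                                   # key is ignored
    when :direct then bindings.select { |b| b[:key] == routing_key }
    when :topic  then bindings.select { |b| topic_match?(b[:key], routing_key) }
    else raise ArgumentError, "unknown exchange type: #{exchange_type}"
    end
  matched.map { |b| b[:queue] }.uniq
end

bindings = [
  { queue: "all_logs", key: "logs.#" },
  { queue: "errors",   key: "logs.*.error" },
  { queue: "exact",    key: "logs.web.error" }
]
puts route(:topic, bindings, "logs.web.error").inspect
```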
  • 20. Demo (Section 3)
  • 21. Demo Sketch
      1. We’ll generate random sentences based on a Markov chain
      2. Store these in Voldemort
      3. Enqueue corresponding jobs in RabbitMQ
      4. Another set of workers will process these sentences
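
The four demo steps can be approximated in a single process. This is a sketch under loud assumptions, not the code shown at the talk: a plain `Hash` stands in for Voldemort, a Ruby `Queue` stands in for RabbitMQ, the corpus is made up, and "processing" is just a word count. All function and variable names are hypothetical.

```ruby
# Step 1: a first-order Markov chain maps each word to the words seen after it.
def build_chain(text)
  chain = Hash.new { |h, k| h[k] = [] }
  text.split.each_cons(2) { |a, b| chain[a] << b }
  chain
end

# Walk the chain from a start word until a dead end or the length limit.
def generate_sentence(chain, start, max_words: 8, rng: Random.new)
  words = [start]
  while words.length < max_words
    followers = chain.fetch(words.last, [])
    break if followers.empty?
    words << followers.sample(random: rng)
  end
  words.join(" ")
end

corpus  = "big data needs big machines and big data needs good tools"
chain   = build_chain(corpus)
store   = {}          # stand-in for the Voldemort key/value store
jobs    = Queue.new   # stand-in for a RabbitMQ queue
results = Queue.new

# Steps 2 and 3: store each sentence under a key, then enqueue the key as a job.
rng = Random.new(42)
5.times do |i|
  key = "sentence:#{i}"
  store[key] = generate_sentence(chain, "big", rng: rng)
  jobs << key
end

# Step 4: workers pull keys, fetch the sentence and process it (word count here).
workers = 2.times.map do
  Thread.new do
    while (key = jobs.pop) != :stop
      results << [key, store.fetch(key).split.length]
    end
  end
end
2.times { jobs << :stop }   # one stop marker per worker
workers.each(&:join)
```

Passing only the key through the queue keeps messages small; the worker reads the payload from the store, which mirrors the split between the storage and messaging layers described earlier.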
  • 22. For the lack of time.. (Section 4)
  • 23. Sensu & Graphite
      • Monitoring router
      • "Check scripts" on nodes
      • "Handler scripts" on servers
      • Output can be sent to PagerDuty, Graphite, Twitter or IRC
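
The check/handler split can be sketched as two small functions. Sensu checks report their result through Nagios-style exit statuses (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); everything else below is an assumption for illustration: the `check_disk` and `handle` names, the thresholds, and the notifier lambda that stands in for PagerDuty, IRC and the like.

```ruby
# Exit statuses follow the Nagios plugin convention used by Sensu checks.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Check: classify a disk-usage percentage against hypothetical thresholds.
def check_disk(used_percent, warn: 80, crit: 90)
  return [UNKNOWN, "CheckDisk UNKNOWN: no reading"] if used_percent.nil?

  if used_percent >= crit
    [CRITICAL, "CheckDisk CRITICAL: #{used_percent}% used"]
  elsif used_percent >= warn
    [WARNING, "CheckDisk WARNING: #{used_percent}% used"]
  else
    [OK, "CheckDisk OK: #{used_percent}% used"]
  end
end

# Handler: receive a check result and forward anything that is not OK.
def handle(status, message, notifier: ->(line) { puts line })
  notifier.call(message) unless status == OK
end

status, message = check_disk(93)
handle(status, message)
# A standalone check script would print the message and then call `exit status`.
```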
  • 24. Distributed Log Collection: Scribe, Flume, Splunk. Flume:
      • Allows multiple topologies
      • Agent
      • Collector
      • Sink
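
The agent, collector and sink roles can be pictured as a small pipeline. This is a toy in-memory model of one possible topology, not Flume's API or configuration format; the class names, the batching rule and the event shape are all hypothetical.

```ruby
# Sink: final destination for events (an in-memory array here).
class MemorySink
  attr_reader :events

  def initialize
    @events = []
  end

  def write(event)
    @events << event
  end
end

# Collector: gathers events from many agents and forwards them in batches.
class Collector
  def initialize(sink, batch_size: 2)
    @sink = sink
    @batch_size = batch_size
    @buffer = []
  end

  def receive(event)
    @buffer << event
    flush if @buffer.length >= @batch_size
  end

  # Push everything buffered so far to the sink.
  def flush
    @buffer.each { |e| @sink.write(e) }
    @buffer.clear
  end
end

# Agent: runs on a node, tags each log line with its host and ships it onward.
class Agent
  def initialize(host, collector)
    @host = host
    @collector = collector
  end

  def tail(line)
    @collector.receive({ host: @host, line: line })
  end
end

sink      = MemorySink.new
collector = Collector.new(sink, batch_size: 2)
web       = Agent.new("web-1", collector)
db        = Agent.new("db-1", collector)
web.tail("GET /index 200")
db.tail("slow query: 2.3s")
collector.flush
```

Other topologies follow by changing what an agent points at, for example chaining one collector into another before the sink.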
  • 25. Feel free to reach out: info@promptcloud.com. Big Data made Small. Appreciate your time. Thanks to Arpan Jha for her help with the slides.