Building infrastructure for Big Data

This deck gives an overview of common pain points in building infrastructure for big data, along with solutions to them.

Published in: Technology, Business

  1. Building the Infrastructure for Big Data @ The Fifth Elephant, July 27th, 2012. Prashant Kumar, Founder, PromptCloud. © PromptCloud 2012, All rights reserved
  2. Agenda
     • About
     • Context
     • Machines, Installation & Cloud Automation
     • Building blocks of a system
     • Sample application sketch
     • Lack of time components
  3. About (Section 0)
  4. About PromptCloud: We provide data feeds and feed ourselves on data, since 2009. How?
     • Large-scale data crawl and extraction
     • Hosted indexing
     • Custom data analytics
     • Working round the clock
     About Me
     • PromptCloud’s Founder
     • Yahoo!, 2007-2008
     • IIT-Kanpur CS, 2007
  5. Deliverable
  6. Context (Section 0.1)
  7. Generic Big Data Systems
     • Multiple nodes (incoherent set of coherent ones)
     • Compute layer: interdependent processes
     • Data storage layer & multiple middleware
     • Tools for installation, monitoring & scheduling
     * Meta: source control, code reviews, continuous integration
  8. Machines, Installation & Cloud Automation (Section 1)
  9. Installation: Create an image and install
     • Easy to install
     • No maintenance cost
     • 1 image for 1 purpose
     • Modifications? Difficult to save it back
     • Apt, yum, etckeeper-like systems, but difficult to scale
     Solutions??
  10. Enter the Magic! Not a panacea; analgesic though
  11. Virtual Machines. Diagram labels: Vagrant; init, up, ssh; shared directory; port forwarding; AWS, Xen, VirtualBox, KVM, …; installation
  12. Code the Installation using Chef. Give the recipe: code what’s to be done. Diagram labels: Solo (“I’m Solo”); Roles, Data Files, Recipes, Templates, Run List; Chef Server; Knife
  13. Building blocks (Section 2)
  14. To keep processes running:
      • Option 1: Install God to monitor processes and keep them in place
      • Option 2 (for atheists): Install Monit
      Courtesy: BIT Mesra
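  15. God’s Snippet

          # watcher_name, start_command and stop_command are set elsewhere.
          God.watch do |w|
            w.name = watcher_name
            w.start = start_command
            # w.restart = restart_command
            w.stop = stop_command
            # Remove a stale PID file before starting the process.
            w.behavior(:clean_pid_file)
            # w.group = "some group"
            w.log = "/tmp/god_monitoring_#{watcher_name}.log"
            # Start the process whenever it is found not running.
            w.keepalive
            # Wait this long after the stop command before force-killing.
            w.stop_timeout = 10.seconds
          end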
  16. Job Scheduling: Resque, Beanstalk, Gearman, Celery, + cron and queues. Things to remember while making choices:
      • Persistence
      • Priorities
      • Tags
      • Option for retry
      • Ability to inspect the queue
  17. Data Storage Layer: SQL/NoSQL, key/value, document-based, graph databases
      • For large systems, maintenance cost is a primary overhead
      • Replication & Availability
      • Consistency guarantees
      • Full-text search
  18. Voldemort (Not me!!!!!!!!)
      • Distributed key/value store
      • Great performance
      • Easy to add/remove nodes
      • Alternatives: Mongo, Riak, HBase, Cassandra
      Courtesy: harrypotter.wikia.com
  19. Messaging Layer
      • RabbitMQ: most commonly used in high-load production systems
      • Implements AMQP
      • Robust exchange server
      • Multiple kinds of exchanges: direct, topic, fanout
      • Options for HA with Pacemaker/DRBD
  20. Demo (Section 3)
  21. Demo Sketch
      1. We’ll generate random sentences based on a Markov chain
      2. Store these in Voldemort
      3. Enqueue corresponding jobs in RabbitMQ
      4. Another set of workers will process these sentences
  22. For the lack of time.. (Section 4)
  23. Sensu & Graphite
      • Monitoring router
      • “Check scripts” on nodes
      • “Handler scripts” on servers
      • Output can be sent to PagerDuty, Graphite, Twitter or IRC
  24. Distributed Log Collection: Scribe, Flume, Splunk. Flume:
      • Allows multiple topologies
      • Agent
      • Collector
      • Sink
  25. Feel free to reach out: info@promptcloud.com. Big Data made Small. Appreciate your time. Thanks to Arpan Jha for her help with the slides.
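
The checklist on slide 16 (persistence, priorities, tags, retry, queue inspection) is easier to weigh with something concrete in hand. The following is only a toy in-memory sketch in plain Ruby: `JobQueue`, `Job`, and every method name are hypothetical and are not the API of Resque, Beanstalk, Gearman, or Celery. Persistence is deliberately left out, which is exactly the gap a real queue has to fill.

```ruby
# Toy job queue showing priorities, tags, retries and inspection.
# All names here are illustrative assumptions, not a real library's API.
Job = Struct.new(:id, :payload, :priority, :tags, :attempts)

class JobQueue
  def initialize(max_attempts: 3)
    @max_attempts = max_attempts
    @jobs = []
    @next_id = 0
  end

  # Lower number means higher priority.
  def enqueue(payload, priority: 10, tags: [])
    @next_id += 1
    @jobs << Job.new(@next_id, payload, priority, tags, 0)
    @next_id
  end

  # Inspect the queue without consuming it, optionally filtered by tag.
  def peek(tag: nil)
    jobs = tag ? @jobs.select { |j| j.tags.include?(tag) } : @jobs
    jobs.sort_by { |j| [j.priority, j.id] }
  end

  # Run the highest-priority job. On failure, requeue it until its
  # attempts run out. Returns :done, :retry, :failed, or nil when empty.
  def work_one
    job = peek.first
    return nil unless job
    @jobs.delete(job)
    job.attempts += 1
    begin
      yield job
      :done
    rescue StandardError
      if job.attempts < @max_attempts
        @jobs << job
        :retry
      else
        :failed
      end
    end
  end

  def size
    @jobs.size
  end
end
```

A worker loop would call `work_one` with a block that does the actual processing; `peek(tag: "crawl")` is the kind of inspection hook worth checking for in whichever real system is chosen.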
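
Slide 19 lists three exchange kinds: direct, topic, and fanout. The difference is only in how a message's routing key is compared with each queue's binding key, which the sketch below models in plain Ruby with no broker involved. `route`, `topic_match?`, and `match_words` are hypothetical helpers, not part of any RabbitMQ client; the wildcard rules assumed are the AMQP ones (`*` matches exactly one dot-separated word, `#` matches zero or more).

```ruby
# Toy model of AMQP exchange routing. Helper names are assumptions
# made for illustration; no RabbitMQ client library is used.

def topic_match?(pattern, key)
  match_words(pattern.split("."), key.split("."))
end

def match_words(pat, words)
  return words.empty? if pat.empty?
  head, *rest = pat
  case head
  when "#"
    # "#" may swallow zero or more words; try every split point.
    (0..words.size).any? { |n| match_words(rest, words.drop(n)) }
  when "*"
    # "*" must consume exactly one word.
    !words.empty? && match_words(rest, words.drop(1))
  else
    words.first == head && match_words(rest, words.drop(1))
  end
end

# bindings is an array of [queue_name, binding_key] pairs.
# Returns the names of the queues that would receive the message.
def route(exchange_type, bindings, routing_key)
  matched = bindings.select do |_queue, binding_key|
    case exchange_type
    when :fanout then true
    when :direct then binding_key == routing_key
    when :topic  then topic_match?(binding_key, routing_key)
    else raise ArgumentError, "unknown exchange type: #{exchange_type}"
    end
  end
  matched.map(&:first).uniq
end
```

For example, with the bindings `[["all_logs", "logs.#"], ["errors", "logs.*.error"]]`, a topic exchange delivers the key `logs.web.error` to both queues, while a fanout exchange ignores the key and delivers to every bound queue.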
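
The four steps of the demo sketch on slide 21 can be tried in a single process. This is not the code from the talk: a `Hash` stands in for Voldemort, Ruby's thread-safe `Queue` stands in for RabbitMQ, and `build_chain`, `generate_sentence`, and `run_pipeline` are hypothetical names. The "processing" done by the workers (counting words) is also just a placeholder.

```ruby
# In-process sketch of: generate sentences, store them, enqueue jobs,
# process them with workers. Stand-ins replace Voldemort and RabbitMQ.

# First-order Markov chain: map each word to the words seen right after it.
def build_chain(corpus)
  chain = Hash.new { |h, k| h[k] = [] }
  corpus.split.each_cons(2) { |a, b| chain[a] << b }
  chain
end

def generate_sentence(chain, start, max_words: 8, rng: Random.new)
  words = [start]
  while words.size < max_words
    followers = chain.fetch(words.last, [])
    break if followers.empty?
    words << followers.sample(random: rng)
  end
  words.join(" ")
end

def run_pipeline(corpus, start:, count:, workers: 2, rng: Random.new)
  chain = build_chain(corpus)
  store = {}           # stand-in for the key/value store
  jobs = Queue.new     # stand-in for the message queue
  results = Queue.new

  # Producer: store each sentence under a key, then enqueue only the key.
  count.times do |i|
    key = "sentence:#{i}"
    store[key] = generate_sentence(chain, start, rng: rng)
    jobs << key
  end
  jobs.close # pop returns nil once the closed queue is drained

  # Consumers: fetch the sentence by key and count its words.
  threads = Array.new(workers) do
    Thread.new do
      while (key = jobs.pop)
        results << [key, store.fetch(key).split.size]
      end
    end
  end
  threads.each(&:join)

  Array.new(results.size) { results.pop }.sort.to_h
end
```

Enqueuing only the key, and keeping the sentence in the store, mirrors the split the slide describes: the queue carries small job descriptions while the data lives in the storage layer.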
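
The “check scripts” on slide 23 are ordinary executables: Sensu reads the line they print and their exit status, following the Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below shows the decision logic as a plain Ruby function; the check name, the thresholds, and the metric path are made up for illustration, and `graphite_line` only formats a line in Graphite's plaintext `path value timestamp` form without sending it anywhere.

```ruby
# Sketch of the logic inside a Sensu-style check script.
# Thresholds and names are illustrative assumptions.
EXIT_CODES = { ok: 0, warning: 1, critical: 2, unknown: 3 }.freeze

# Returns [status_line, exit_code] for a given queue depth.
def check_queue_depth(depth, warn: 100, crit: 1000)
  status =
    if !depth.is_a?(Numeric)
      :unknown
    elsif depth >= crit
      :critical
    elsif depth >= warn
      :warning
    else
      :ok
    end
  ["QueueDepth #{status.to_s.upcase}: #{depth.inspect} jobs waiting", EXIT_CODES[status]]
end

# One metric in Graphite's plaintext format: "path value timestamp\n".
def graphite_line(path, value, time = Time.now)
  "#{path} #{value} #{time.to_i}\n"
end
```

A real check script would finish by printing the message and calling `exit` with the returned code, leaving it to a handler on the server side to decide where the result goes.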
