Building infrastructure for Big Data
This deck gives a brief overview of common pain points in building infrastructure for big data, along with solutions to them.

Building infrastructure for Big Data Presentation Transcript

  • 1. Building the Infrastructure for Big Data @ The Fifth Elephant, July 27th, 2012. Prashant Kumar, Founder, PromptCloud. © PromptCloud 2012, All rights reserved
  • 2. Agenda
      • About
      • Context
      • Machines, Installation & Cloud Automation
      • Building blocks of a system
      • Sample application sketch
      • Lack of time components
  • 3. About (Section 0)
  • 4. About PromptCloud
      We provide data feeds and feed ourselves on data, since 2009. How?
      • Large-scale data crawl and extraction
      • Hosted indexing
      • Custom data analytics
      • Working round the clock
      About Me
      • PromptCloud’s Founder
      • Yahoo!, 2007-2008
      • IIT-Kanpur CS, 2007
  • 5. Deliverable
  • 6. Context (Section 0.1)
  • 7. Generic Big Data Systems
      • Multiple nodes (incoherent set of coherent ones)
      • Compute layer: interdependent processes
      • Data storage layer & multiple middleware
      • Tools for installation, monitoring & scheduling
      * Meta: source control, code reviews, continuous integration
  • 8. Machines, Installation & Cloud Automation (Section 1)
  • 9. Installation: create an image and install
      • Easy to install
      • No maintenance cost
      • 1 image for 1 purpose
      • Modifications? Difficult to save them back into the image
      • Apt, yum, etckeeper-like systems exist, but are difficult to scale
      Solutions?
  • 10. Enter the Magic! Not a panacea; analgesic though
  • 11. Virtual Machines (diagram labels: Vagrant; ssh; Up; Init; Shared directory; Port Forwarding; AWS, Xen, VirtualBox, KVM, …; Installation)
  • 12. Code the Installation using Chef. Give the recipe: code what’s to be done. (Diagram labels: I’m Solo; Roles, Data Files, Recipes, Templates, Run List; Chef Server; Knife)
  • 13. Building blocks (Section 2)
  • 14. To keep processes running:
      • Option 1: Install God to monitor processes and to keep them in place
      • Option 2 (for atheists): Install Monit
      Courtesy: BIT Mesra
  • 15. God’s Snippet
      # watcher_name, start_command and stop_command must be defined before this block.
      God.watch do |w|
        w.name = watcher_name
        w.start = start_command
        # w.restart = restart_command
        w.stop = stop_command
        # Remove a stale PID file before starting the process.
        w.behavior(:clean_pid_file)
        # w.group = "some group"
        w.log = "/tmp/god_monitoring_#{watcher_name}.log"
        # Restart the process whenever it is found not running.
        w.keepalive
        w.stop_timeout = 10.seconds
      end
  • 16. Job Scheduling
      Resque, Beanstalk, Gearman, Celery, plus cron and queues
      Things to remember while making choices:
      • Persistence
      • Priorities
      • Tags
      • Option for retry
      • Ability to inspect the queue
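To make the checklist on the job scheduling slide concrete, here is a minimal in-memory sketch in Ruby of a queue with priorities, tags, bounded retries and inspection. It is not how Resque, Beanstalk, Gearman or Celery work internally; the names `Job`, `JobQueue`, `max_attempts`, `pending` and `work_one` are hypothetical, and persistence is deliberately left out (everything lives in process memory).

```ruby
# Hypothetical in-memory job queue: priorities, tags, bounded retries, inspection.
# No persistence: jobs are lost when the process exits.
Job = Struct.new(:id, :priority, :tags, :payload, :attempts)

class JobQueue
  attr_reader :dead # jobs that exhausted their retries

  def initialize(max_attempts: 3)
    @jobs = []
    @dead = []
    @max_attempts = max_attempts
    @next_id = 0
  end

  def enqueue(payload, priority: 0, tags: [])
    @next_id += 1
    @jobs << Job.new(@next_id, priority, tags, payload, 0)
    @next_id
  end

  # Inspect pending jobs without removing them, optionally filtered by tag.
  # Higher priority first; ties are broken by insertion order.
  def pending(tag: nil)
    jobs = tag ? @jobs.select { |j| j.tags.include?(tag) } : @jobs
    jobs.sort_by { |j| [-j.priority, j.id] }
  end

  # Run the highest-priority job with the given block.
  # On failure the job is re-enqueued until max_attempts is reached.
  def work_one
    job = pending.first
    return nil unless job

    @jobs.delete(job)
    job.attempts += 1
    begin
      yield job.payload
      :done
    rescue StandardError
      if job.attempts < @max_attempts
        @jobs << job
        :retried
      else
        @dead << job
        :failed
      end
    end
  end
end
```

A caller would enqueue with something like `queue.enqueue("crawl example.com", priority: 5, tags: ["crawl"])` and drain with `queue.work_one { |payload| ... }`. When evaluating a real scheduler, the same questions apply: what happens to `@jobs` on a crash, and can you list what is pending and what is dead?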
  • 17. Data Storage Layer
      SQL/NoSQL, key/value, document-based, graph databases
      • For large systems, maintenance cost is a primary overhead
      • Replication & Availability
      • Consistency guarantees
      • Full-text search
  • 18. Voldemort ("Not me!")
      • Distributed key/value store
      • Great performance
      • Easy to add/remove nodes
      • Alternatives: Mongo, Riak, HBase, Cassandra
      Courtesy: harrypotter.wikia.com
  • 19. Messaging Layer
      • RabbitMQ: most commonly used in high-load production systems
      • Implements AMQP
      • Robust exchange server
      • Multiple kinds of exchanges: direct, topic, fanout
      • Options for HA with Pacemaker/DRBD
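The three exchange kinds named on the messaging slide differ only in how a routing key is matched against queue bindings. The Ruby sketch below models that matching in memory, assuming the usual AMQP semantics (fanout ignores the key, direct needs an exact match, and in topic patterns `*` stands for exactly one dot-separated word and `#` for zero or more). It does not talk to RabbitMQ; `topic_match?`, `match_words`, `route` and the shape of `bindings` are hypothetical names chosen for illustration.

```ruby
# Does a topic binding pattern match a routing key?
# "*" matches exactly one dot-separated word, "#" matches zero or more words.
def topic_match?(pattern, key)
  match_words(pattern.split("."), key.split("."))
end

def match_words(pat, words)
  return words.empty? if pat.empty?

  head, *rest = pat
  case head
  when "#"
    # Try letting "#" consume 0, 1, ..., all remaining words.
    (0..words.length).any? { |n| match_words(rest, words.drop(n)) }
  when "*"
    !words.empty? && match_words(rest, words.drop(1))
  else
    words.first == head && match_words(rest, words.drop(1))
  end
end

# Names of the queues that would receive a message.
# bindings is an array of [queue_name, binding_key] pairs (hypothetical shape).
def route(type, bindings, routing_key)
  matched = bindings.select do |_queue, binding_key|
    case type
    when :fanout then true
    when :direct then binding_key == routing_key
    when :topic  then topic_match?(binding_key, routing_key)
    else raise ArgumentError, "unknown exchange type: #{type}"
    end
  end
  matched.map(&:first).uniq
end
```

For example, with bindings `[["all_logs", "logs.#"], ["errors", "logs.*.error"]]`, a topic exchange would deliver the key `logs.web.error` to both queues, while the key `logs.web.db.error` would reach only `all_logs`.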
  • 20. Demo (Section 3)
  • 21. Demo Sketch
      1. We’ll generate random sentences based on a Markov chain
      2. Store these in Voldemort
      3. Enqueue corresponding jobs in RabbitMQ
      4. Another set of workers will process these sentences
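The demo code itself is not part of the transcript. The four steps can be sketched in a single Ruby process, under loud assumptions: a plain `Hash` stands in for Voldemort, a thread-safe `Queue` stands in for a RabbitMQ queue, and "processing" a sentence just means counting its words. The helper names `build_chain`, `generate_sentence` and `run_pipeline` are hypothetical.

```ruby
# First-order Markov chain: word => list of words observed right after it.
def build_chain(corpus)
  chain = Hash.new { |h, k| h[k] = [] }
  corpus.split.each_cons(2) { |a, b| chain[a] << b }
  chain
end

# Random walk over the chain, starting at `start`, up to max_words words.
def generate_sentence(chain, start, max_words: 8, rng: Random.new)
  words = [start]
  while words.length < max_words
    followers = chain.fetch(words.last, [])
    break if followers.empty?

    words << followers.sample(random: rng)
  end
  words.join(" ")
end

# Steps 1-4 of the demo sketch, with in-memory stand-ins.
def run_pipeline(corpus, start_word, count, worker_count: 2, seed: 42)
  store = {}            # stand-in for Voldemort: key => sentence
  jobs = Queue.new      # stand-in for a RabbitMQ queue: carries keys only
  results = Queue.new
  chain = build_chain(corpus)
  rng = Random.new(seed)

  # Step 4: workers pull keys, look the sentence up, and "process" it.
  workers = Array.new(worker_count) do
    Thread.new do
      while (key = jobs.pop) != :stop
        results << [key, store.fetch(key).split.length]
      end
    end
  end

  count.times do |i|
    key = "sentence:#{i}"
    store[key] = generate_sentence(chain, start_word, rng: rng) # steps 1 and 2
    jobs << key                                                 # step 3
  end
  workers.size.times { jobs << :stop }
  workers.each(&:join)

  Array.new(results.size) { results.pop }.sort
end

out = run_pipeline("big data needs big infrastructure and big data needs care", "big", 5)
puts "processed #{out.size} sentences"
```

Passing only the key through the queue, and keeping the sentence in the store, mirrors the split in the slide: the message broker carries small job descriptions while the key/value store holds the data.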
  • 22. For the lack of time.. (Section 4)
  • 23. Sensu & Graphite
      • Monitoring router
      • "Check scripts" on nodes
      • "Handler scripts" on servers
      • Output can be sent to PagerDuty, Graphite, Twitter or IRC
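A check script, in the sense used on the Sensu slide, is typically just a program that prints one line and reports its result through its exit status. The Ruby sketch below assumes the Nagios-style convention that Sensu checks follow (0 for OK, 1 for WARNING, 2 for CRITICAL, anything else for UNKNOWN). The check name `queue_depth`, the thresholds and the helper names `check_status` and `check_output` are hypothetical.

```ruby
# Nagios-style exit statuses, as used by Sensu check scripts.
STATUSES = { 0 => "OK", 1 => "WARNING", 2 => "CRITICAL", 3 => "UNKNOWN" }.freeze

# Map a measured value to an exit status; non-numeric input is UNKNOWN.
def check_status(value, warn:, crit:)
  return 3 unless value.is_a?(Numeric)
  return 2 if value >= crit
  return 1 if value >= warn

  0
end

# Build the one-line message and exit status a check script would emit.
def check_output(name, value, warn:, crit:)
  code = check_status(value, warn: warn, crit: crit)
  ["#{name} #{STATUSES[code]}: value=#{value.inspect} (warn=#{warn}, crit=#{crit})", code]
end

# In an actual script (hypothetical usage), the last lines would be:
#   message, code = check_output("queue_depth", depth, warn: 100, crit: 500)
#   puts message
#   exit code
```

The handler side then decides where a non-zero result goes, for example to PagerDuty or IRC as listed on the slide.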
  • 24. Distributed Log Collection
      Scribe, Flume, Splunk
      Flume:
      • Allows multiple topologies
      • Agent
      • Collector
      • Sink
  • 25. Feel free to reach out. Big Data made Small. info@promptcloud.com. Appreciate your time. Thanks to Arpan Jha for her help with the slides