SXSWi Workshop: DevOps - Infrastructure as Code
Configuring, deploying, and managing Big Data infrastructure, Hadoop in particular, is time-consuming and expensive. Infochimps’ Ironfan is an open source systems configuration suite for the cloud that quickly and easily orchestrates an entire Big Data stack, including data ingestion, scraping, storage, computation, and monitoring. With Ironfan, you can spin up clusters when you need them and turn them off when you don’t, so you can spend your time, money, and engineering focus on finding insights and creating value rather than on getting your machines ready. These are the slides from the SXSWi workshop, where attendees learned how to go from a single development machine to a full-stack cloud deployment.

Published in: Technology
  • Part I. Big Data for Chimps (table of contents)
    • 1. Hello, Early Releasers: My Questions for You; Probable Contents; Not Contents; Feedback
    • 2. About: What this book covers; Who this book is for; Who this book is not for; How this book is being written
    • 3. Hello, Reviewers: Controversials; Style Nits
    • 4. First Exploration (ch. A): Where is Barbecue?; First Steps; Why?; Plot of this story; Exemplars and Touchstones; Data and features; Summarize every page on Wikipedia; Bin by Location; A pause, to think; Pulling signal from noise; Takeaways
    • 5. The Stream (ch. B): Exercises; Exercise 1.1: Running time; Exercise 1.2: A Petabyte-scale wc command
    • 6. Reshape Steps (ch. C): Locality of Reference; Locality: Examples; The Hadoop Haiku
    • 7. Chimpanzee and Elephant Save Christmas (ch. D): A Non-scalable approach; Letters to Toy Requests; Order Delivery; Toy Assembly; Why it’s efficient; Sorted Batches; The Map-Reduce Haiku; The Reducer Guarantee; Partition Key and Sort Key
    • 8. Geo Data: Spatial Data; Geographic Data Model; Geospatial JOIN using quadtiles; The Quadtile Grid System; Patterns in UFO Sightings; Mapper: dispatch objects to rendezvous at quadtiles; Reducer: combine objects on each quadtile; Comparing Distributions; Data Model; GeoJSON; Quadtile Practicalities; Converting points to quadkeys (quadtile indexes); Exploration; Interesting quadtile properties; Quadtile Ready Reference; Working with paths; Calculating Distances; Distributing Boundaries and Regions to Grid Cells; Adaptive Grid Size; Tree structure of Quadtile indexing; Map Polygons to Grid Tiles; Weather Near You; Find the Voronoi Polygon for each Weather Station; Break polygons on quadtiles; Map Observations to Grid Cells; K-means clustering to summarize; Keep Exploring; Exercises; References
    • 9. Log Processing: Data Model; Simple Log Parsing; Parser script; Histograms; User Paths through the site (“Sessionizing”); Page-Page similarity; Geo-IP Matching; Range Queries; Using Hadoop for website stress testing (“Benign DDoS”)
    • 10. Why Hadoop Works: Disk is the new tape; Hadoop is Secretly Fun; Economics; Notes
    • 11. Sampling: Consistent Random Sampling; Random Sampling using strides; Constant-Memory “Reservoir” Sampling; References
    • 12. Hadoop Execution in Detail: Launch; Split; Mappers; Choosing a file size; Jobs with Map and Reduce; Mapper-only jobs
    • 13. Pathology of Tuning (aka “when you should touch that dial”): Mapper; A few map tasks take noticeably longer than all the rest; Tons of tiny little mappers; Many non-local mappers; Map tasks “spill” multiple times; Job output files that are each slightly larger than an HDFS block; Reducer; Tons of data to a few reducers (high skew); Reducer merge (sort+shuffle) is longer than Reducer processing; Output Commit phase is longer than Reducer processing; Way more total data to reducers than cumulative cluster RAM; System; Excessive Swapping; Out of Memory / No C+B reserve; Stop-the-world (STW) Garbage collections; Checklist; Other; Basic Checks
    • 14. Hadoop Metrics: The USE Method applied to Hadoop; Look for the Bounding Resource; Resource List; See What’s Happening; JMX (Java Monitoring Extensions); Rough notes
    • 15. Data Formats and Schemata: Good Format 1: TSV (It’s simple); Good Format 2: JSON (It’s Generic and Ubiquitous); structured to model; Good Format #3: Avro (It does everything right); Other reasonable choices: tagged net strings and null-delimited documents; Crap format #1: XML; Writing XML; Crap Format #2: N3 triples; Crap Format #3: Flat format; Web log and Regexpable; Glyphing (string encoding), Unicode, UTF-8; ICSS; Schema.org Types; Munging
    • 16. HBase Data Model: Row Key, Column Family, Column Qualifier, Timestamp, Value; Keep it Stupidly Simple; Help HBase be Lazy; Row Locality and Compression; Simple Table; Airport Metadata; Airport Timezone; Range Lookup; Geographic Data; Multi-scale indexing; Wikipedia: Corpus and Graph; Graph Data; Web Logs: Rows-As-Columns; Column Families; Atomic Counters; Most-Frequent URLs; Most-Recent URLs; Rollup columns; Row Locality; adjacency is good; adjacency is bad; Vertical Partitioning (Column Families); Feature Set review; “Design for Reads”; References
    • 17. Semi-Structured Data: Wikipedia Metadata; Wikipedia Pageview Stats (importing TSV); Assembling the namespace join table; Getting file metadata in a Wukong (or any Hadoop streaming) Script; Wikipedia Article Metadata (importing a SQL Dump); Necessary Bullcrap #76: Bad encoding; Wikipedia Page Graph; Target Domain Models; XML Data (Wikipedia Corpus); Extract, Translate, Canonicalize
  • Transcript

    • 1. DevOps: Empowering Developers with Infrastructure
      • SXSW 2013 – Tuesday, March 12
      • Go here: http://infochim.ps/15INnv8
      • Nathan Eliot - @temujin9
      • Ryan Miller - @rmiller107
      • Amanda McGuckin-Hager - @shoogie
      • Tim Gasper - @timgasper
      • 3/12/2013 #ironfan #devops #sxsw #bigdata #chef
    • 2. Agenda (http://infochim.ps/15INnv8)
      1. Intros - Housekeeping (15 min – 15 total)
      2. Initial Setup (30 min – 45 total)
      3. Debug Initial Set Up (30-45 min – 1:15 total)
      4. Standing Up a Simple Cluster (30-60 min – 2:15 total)
      5. Hadoop! (30-60 min – 3:15 total)
      6. General Q&A (30-60 min – 4:00 total)
    • 3. Key Ironfan Contributors
      • Flip Kromer, @mrflip – CTO of Infochimps
      • Nathaniel Eliot, @temujin9 – Ops Engineer of Infochimps
      • Chris Howe – System Architect at Civitas Learning
    • 4. Infochimps Enterprise Cloud for Big Data
      • CUSTOMER APPLICATIONS
        – Custom Applications (Java, Python, etc.)
        – Business Intelligence (Cognos, BOBJ, Microstrategy)
        – Packaged Apps (ERP, CRM, etc.)
    • 5. Why We Love Chef
      • Infrastructure as Code
        – Version Control
        – Shareable
        – Testable
        – Recapitulable
    • 6. Why We Love Chef (diagram labels: MySQL, Nginx, SOLR, My Application)
    • 7. Why We Love Chef
    • 8. Why We Don’t Love Chef
      • Anything is possible
      • Nothing is simple
      • There’s not much repetition (not DRY)
    • 9. Why We Don’t Love Chef: Too much is hard-coded at development/upload time!
    • 10. Why We Don’t Love Chef: How do we make @server_ips dynamic?
    • 11. Why We Wrote Ironfan
      • Simplify, unify, and standardize our usage of the Chef toolset
      • Build further abstractions on top of Chef
      • Give us superpowers that Chef doesn’t have yet
      • http://github.com/infochimps-labs/ironfan
    • 12. What Does Ironfan Do: Simple helpers in the silverware cookbook abstract common Chef patterns and keep things DRY. (Diagram labels: Ironfan, Chef.)
    • 13. What Does Ironfan Do: Dynamic service discovery.
    • 14. What Does Ironfan Do: A simple DSL for defining clusters of machines.
    • 15. Big Data for Chimps, May 2013
    • 16. As we walk through Ironfan…
      • Shortlink: http://infochim.ps/15INnv8
      • FYI: We are hiring! (we have offices in Austin & SF)
        – careers@infochimps.com
        – infochimps.com/careers
      • Learn more about our enterprise product:
        – sales@infochimps.com
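
Slides 9, 10, and 13 describe the problem of values such as @server_ips being hard-coded at development/upload time, and name dynamic service discovery as the remedy. The following is a minimal plain-Ruby sketch of that idea under stated assumptions. It is not Ironfan's or the silverware cookbook's actual API: the ServiceRegistry class, its announce and discover_all methods, and the IP addresses are all hypothetical, and an in-memory hash stands in for whatever shared index a real deployment would query.

```ruby
# Hypothetical in-memory stand-in for a shared index of what each node provides.
# Illustration only; names and behavior are assumptions, not a real library API.
class ServiceRegistry
  def initialize
    # Maps [system, component] to the list of IPs that provide it.
    @announcements = Hash.new { |hash, key| hash[key] = [] }
  end

  # A node declares that it provides `component` of `system` at address `ip`.
  # Announcing the same address twice has no additional effect.
  def announce(system, component, ip)
    key = [system, component]
    @announcements[key] << ip unless @announcements[key].include?(ip)
  end

  # Clients ask who currently provides a component instead of hard-coding IPs.
  # Returns a sorted list, or an empty list if nobody has announced it.
  def discover_all(system, component)
    @announcements.fetch([system, component], []).sort
  end
end

registry = ServiceRegistry.new
registry.announce(:web, :app_server, "10.0.0.12")
registry.announce(:web, :app_server, "10.0.0.11")

# Computed when the code runs, so adding a server needs no edit to this file.
server_ips = registry.discover_all(:web, :app_server)
puts server_ips.inspect
```

The design point is that the consumer names a role (here, the hypothetical :web / :app_server pair) rather than a list of addresses, so the list can change between runs without touching the code that uses it.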
