
Hadoop Demystified + Automation Smackdown! Austin JUG June 24 2014

Hadoop Demystified + Automation Smackdown!
You want to learn Hadoop? We cover that. We also cover the automation process, so you can compare the two approaches. Includes code and references.



  1. Hadoop 101 ETL + Automation Smackdown Learning Big Data: Which approach makes me the most valuable as a developer?
  2. Bio - Pete Carapetyan • Java dev for the last 15 years, dev for 20 • Grew up automating in a different industry • Apparent obsession with systems & automation • Since 2000 as dataFundamentals, now a 2-man shop
  3. Special Skills - Special Snowflakes • Let me show you these Hadoop & Avro skills. • Then, we code for the special snowflakes. (data) • Thus we are more valuable, and can raise our bill rates! • This is Approach #1: Manual, or Special Snowflake
  4. My 2013 Manual Hadoop Story • 15 ETL jobs [partial scope] • Brilliant, ninja-level team • 1 year of competitive NIH* copy-paste spaghetti coding - AKA the special snowflake approach • Not a fun year *NIH: Not Invented Here
  5. [Demo: Basics of an ETL Job]
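The ETL demo itself lives in the linked videos and repos; as a minimal stand-in for its Extract/Transform step, here is a stdlib-only Java sketch that parses pipe-delimited rows into named-field records. The `SCHEMA` field names are hypothetical, for illustration only; the real jobs drive this from collected metadata.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch of the Extract/Transform step of a delimited-file ETL job:
 *  parse pipe-delimited rows into named-field records. */
public class DelimitedEtl {

    // Hypothetical schema for illustration; real jobs read theirs from metadata.
    static final String[] SCHEMA = {"id", "name", "amount"};

    static Map<String, String> toRecord(String row) {
        String[] cells = row.split("\\|", -1);   // -1 keeps trailing empty cells
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < SCHEMA.length; i++) {
            record.put(SCHEMA[i], i < cells.length ? cells[i] : "");
        }
        return record;
    }

    static List<Map<String, String>> extract(List<String> rows) {
        List<Map<String, String>> out = new ArrayList<>();
        for (String row : rows) out.add(toRecord(row));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract(List.of("1|widget|9.99", "2|gadget|12.50")));
    }
}
```

In the talk's pipeline the Transform target is an Avro record rather than a map, but the field-mapping shape is the same.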
  6. Special Snowflake Approach: Human drama! What limitations of this manual, special-skills, special-snowflake approach do we observe?
  7. How To Unpack Either Approach? What if we remove the human drama?
  8. Now, what happens if we automate? Automated Approach
  9. Carrie • Our own internal project for automating big data • Name inspired by the horror film…
  10. Also inspired by The Phoenix Project • Results, not drama • Focus only on the bottleneck • Brent as the bottleneck
  11. On Brent • Brent is a team’s best asset! Brent is a ninja. • Brent is my dark side only when treating every situation like a special snowflake. • Brent enjoys the attention. • Brent is not the drama queen; others bring the drama to him. Brent?
  12. Automation Basics 1. Brent spends time on clean design, not NIH* • [Camel] - Integration Server 2. Brent automates the rule, codes the exception • Apply metadata to templates • Automated VM dev infrastructure *NIH: Not Invented Here
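"Apply metadata to templates" can be sketched with nothing more than string substitution: per-job metadata fills placeholders in a code template, so Brent only hand-writes the exceptions. This is a stdlib-only sketch, not the Carrie generator itself; the `${inputDir}`/`${cluster}`/`${table}` keys and the Camel-style route template are hypothetical examples.

```java
import java.util.Map;

/** Sketch of "apply metadata to templates": ${key} placeholders in a
 *  template are replaced by per-job metadata values to generate code. */
public class TemplateFill {

    static String render(String template, Map<String, String> metadata) {
        String out = template;
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out = out.replace("${" + e.getKey() + "}", e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical route template and metadata keys, for illustration only.
        String template =
            "from(\"file:${inputDir}\").to(\"hdfs://${cluster}/${table}\");";
        System.out.println(render(template,
            Map.of("inputDir", "etl/in", "cluster", "nn1", "table", "orders")));
    }
}
```

The actual generator in this talk is written in Ruby (see the Carrie repo at the end of the deck); the idea, not the language, is the point here.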
  13. Demo Clean • Clean project folder • Clean Hadoop file system • Clean Hadoop DDL https://www.youtube.com/watch?v=qR7XTzv5P_M&index=2&list=PLO_T9AjxEaYeByfqBqHVCmg4GbLFkYCJe
  14. Later Demo Integration Server • Raw Linux OS (CentOS) • Java • Maven • Ruby • networking • Maven repo - binaries • [created with Vagrant] https://www.youtube.com/watch?v=xgheERvulqw&index=3&list=PLO_T9AjxEaYeByfqBqHVCmg4GbLFkYCJe
  15. Demo Metadata Collection • Simple properties • Collected using a cheesy UI • UI written in Ruby
  16. Demo Generated Code • Camel ETL binary • OSGi, versioned, modular jar • Only 3 primary outputs! • simple • clean • well designed (?) • JUnit/integration tested • Supporting scripting • messy
  17. Demo Server Deploy • One-line deploy/run command • Compiles on server with Maven • Also runnable as jar
  18. Does it work? • Make custom file • Drop into ETL folder • Inspect
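The "drop a file into the ETL folder" smoke test can be sketched as a single polling pass over an inbox directory: any file found is picked up, handled, and moved to a done directory. This stdlib-only sketch stands in for the Camel file consumer used in the demo; the directory names are arbitrary.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Sketch of the drop-folder check: one polling pass that picks up files
 *  from an inbox directory and moves them to a done directory. */
public class EtlInbox {

    static List<String> processOnce(Path inbox, Path done) throws IOException {
        Files.createDirectories(done);
        List<String> processed = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inbox)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) continue;
                // A real job would parse/transform here; we just move the file.
                Files.move(f, done.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
                processed.add(f.getFileName().toString());
            }
        }
        return processed;
    }

    public static void main(String[] args) throws IOException {
        Path inbox = Files.createTempDirectory("etl-in");
        Path done = Files.createTempDirectory("etl-done");
        Files.writeString(inbox.resolve("custom.txt"), "1|widget|9.99\n");
        System.out.println(processOnce(inbox, done));  // [custom.txt]
    }
}
```

"Inspect" then just means checking the done directory (or, in the real pipeline, the HDFS partition) for the expected output.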
  19. Demo - Review • Schema created • DDL run • Avro binary (JSON) transform • Data migration • FTP to server • Into HDFS partition • Alter Table: Date Partition
  20. Transform to Avro • Not detailed in this talk • Demo’d here as a binary • Code listed at end of talk
  21. Modular Binaries • Each ETL • Own binary, OSGi • Own codebase • Fully versioned • Fully customizable after generation • Runs alone or as part of Camel container(s) • Tests on build • Contains own supporting scripts
  22. Takeaways • Brent codes the exception manually, the rule by template. • Brent has time to focus on design. • Brent may lose some amount of desired attention :( • Resulting code is • clean • consistent, easy to maintain • But is there a Home Run? • defined as not possible via the special snowflake approach
  23. Home Run 1: Infrastructure As Code Demo • [Jeff]
  24. Home Run 2: Big Data, Beyond Hadoop! 1. Pick your provider • Hadoop • Cassandra • Couchbase • etc 2. Adapt your templates, VMs, etc
  25. Home Run 3: Idempotent Effort • Idempotent effort? Each subsequent run has no ill effect. • Walkup - The 10-minute test • Walkaway - Requirements • Features • Testing, technical debt, already in place for code • VMs and recipes for dev, test, prod • OSGi etc. modularity for binaries • Does what we see here pass this test?
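Idempotency in this sense can be shown in a few lines: an operation written so that running it a second time leaves the system exactly as one run left it. This sketch uses partition registration as the example (echoing the deck's "Alter Table: Date Partition" step); the class and method names are hypothetical, not part of the talk's codebase.

```java
import java.util.Set;
import java.util.TreeSet;

/** Sketch of idempotent effort: registering a date partition is safe to
 *  re-run, because the second call is a harmless no-op. */
public class IdempotentPartition {

    private final Set<String> partitions = new TreeSet<>();

    /** Returns true only when the partition was newly added. */
    boolean addPartition(String date) {
        return partitions.add(date);   // re-running has no ill effect
    }

    Set<String> partitions() {
        return partitions;
    }

    public static void main(String[] args) {
        IdempotentPartition table = new IdempotentPartition();
        table.addPartition("2014-06-24");
        table.addPartition("2014-06-24");   // subsequent run: same state
        System.out.println(table.partitions());  // [2014-06-24]
    }
}
```

The same property is what Hive's `IF NOT EXISTS` clauses and the deck's VM recipes aim for: provisioning or partitioning scripts that can be walked away from and re-run without damage.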
  26. What to leave with • Demystify: how to Avro/Hadoop a delimited file • Review motives for automating this process • Code automation basics • Infrastructure automation basics • Code for the above
  27. Further Hadoop Tutorial Resources • Hortonworks • best free stuff? Except networking vas • Cloudera • lots, but they appear to prefer to get paid • Apache Hadoop • haven’t tried, but it is Apache
  28. Wish To See More? • In-office demos • Your data
  29. Code, Content, Contacts • This Slide Deck: http://www.slideshare.net/datafundamentals/hadoop-big-data-35762308 • or just remember slideshare.net/datafundamentals; it may be the only one there • YouTube - 11-minute version of the code demo - https://www.youtube.com/playlist?list=PLO_T9AjxEaYeByfqBqHVCmg4GbLFkYCJe • Dev Code • Carrie (Ruby UI and generator) https://github.com/datafundamentals/df_ui_carrie • Avro from delimited https://bitbucket.org/datafundamentals/avro_from_delimited • Camel-Avro https://bitbucket.org/datafundamentals/camel-avro-etl • Ops Code - cookbook recipes • https://github.com/datafundamentals • Contact • pete@datafundamentals.com, jeff@datafundamentals.com Be careful out there!
