Data Engineering Quick Guide

Presented at the Galvanize Data Engineering Immersive Info Night on March 9, 2016, at Galvanize. A whirlwind tour of Big Data technologies.

Published in: Data & Analytics

  1. DATA ENGINEERING QUICK GUIDE, ASIM JALIS, GALVANIZE
  2. BIG DATA
  3. WHY HADOOP? How can we create a supercomputer using cheap Linux boxes?
  4. WHAT IS HADOOP? An operating system for a cluster of machines. It combines small, weak computers into one Big Data system with unified disk and processing power.
  5. HADOOP
  6. WHY HDFS? How can we store a petabyte-sized file using cheap Linux boxes?
  7. WHAT IS HDFS? Split the petabyte file into 128 MB blocks. Distribute the blocks across the Hadoop cluster. Make 3 copies of each block for insurance.
  8. HDFS
  9. WHY MAPREDUCE? How can we process the data in HDFS without pulling it out and pushing the result back?
  10. WHAT IS MAPREDUCE? Send the program to where the data is on HDFS. Process the petabyte file by processing each block, then combining the results.
  11. MAPREDUCE
  12. WHY HIVE? How can people who don’t know Java write MapReduce jobs?
  13. WHAT IS HIVE? Hive translates SQL to MapReduce jobs.
  14. HIVE: SELECT * FROM sales WHERE amount > 400;
  15. HIVE
  16. WHY PIG? How can people who don’t know Java or SQL write MapReduce jobs?
  17. WHAT IS PIG? Pig translates PigLatin to MapReduce jobs. PigLatin is a scripting language comparable to SQL.
  18. PIG: high_sales = FILTER sales_data BY amount > 400;
  19. PIG
  20. WHY SPARK? How can we make MapReduce faster and the API less clunky?
  21. WHAT IS SPARK? Spark is like MapReduce, with a cleaner API and better speed. It is faster because it keeps intermediate results in memory.
  22. SPARK:
      sc.textFile("shakespeare.txt").
        flatMap(line => line.split("\\W+")).
        map(word => (word, 1)).
        reduceByKey((count1, count2) => count1 + count2).
        saveAsTextFile("output")
  23. SPARK
  24. WHY SPARK SQL? How can people who don’t know Scala, Python, or Java write Spark code?
  25. WHAT IS SPARK SQL? Spark SQL is like Hive for Spark. Hive translates SQL to MapReduce; Spark SQL translates SQL to Spark.
  26. SPARK SQL: SELECT * FROM sales WHERE amount > 400;
  27. SPARK SQL
  28. WHAT IS SPARK MLLIB? Machine learning algorithms on Spark. Analyze data to extract insights.
  29. WHAT IS MACHINE LEARNING?
      Technique      | Question
      Regression     | Predict revenue next month
      Classification | Is the tumor cancerous or benign?
      Clustering     | Which customers are similar to each other?
      Recommendation | Which movie will you like?
  30. REAL-TIME TECHNOLOGIES
  31. WHAT IS THE DIFFERENCE BETWEEN REAL-TIME AND BATCH?
      Term      | Means                        | Example
      Real-Time | Process data when it arrives | Reject credit card transaction
      Batch     | Process data periodically    | Flag suspicious transaction at night
  32. BATCH
      Processing Layer | SQL Layer
      MapReduce        | Hive, Pig
      Spark            | Spark SQL
  33. REAL-TIME: HBase, Kafka, Spark Streaming, Lambda Architecture
  34. WHY HBASE? How can we store petabytes of data on HDFS and do fast reads and writes like a database?
  35. WHAT IS HBASE? HBase is a NoSQL database on top of HDFS. It can store petabytes of data. Reads and writes are much faster than with a traditional database or plain HDFS.
  36. HBASE
  37. WHY KAFKA? How can we hold onto incoming data and not lose it when we are getting a million messages per second?
  38. WHAT IS KAFKA? Kafka is TiVo for the cluster. It stores real-time data as it comes in and can retain a week of data. It acts as a queuing system for the Hadoop cluster.
  39. KAFKA
  40. WHY SPARK STREAMING? How can we process data as it comes in instead of every night (using Spark or MapReduce)?
  41. WHAT IS SPARK STREAMING? Spark Streaming is a library on top of Spark. It allows processing data as soon as it comes in. It typically consumes its input from a queue such as Kafka.
  42. SPARK STREAMING
  43. WHY LAMBDA ARCHITECTURE? How can we watch historical trends and what is happening right now? How can we show bestsellers from this year and from the last hour?
  44. WHAT IS LAMBDA ARCHITECTURE? A Big Data system that can handle both batch and real-time processing. It uses historical data as well as real-time data: the best of both worlds.
  45. LAMBDA ARCHITECTURE
  46. REVIEW
  47. BATCH REVIEW
      Technology | Description
      Hadoop     | Cluster operating system
      HDFS       | Stores petabytes of data on 100s or 1000s of machines
      MapReduce  | Processes data in HDFS
      Hive       | SQL to MapReduce
      Pig        | PigLatin to MapReduce
      Spark      | Faster MapReduce
      Spark SQL  | SQL to Spark
  48. REAL-TIME REVIEW
      Technology          | Description
      HBase               | Fast NoSQL database on top of HDFS
      Kafka               | Queues incoming data into the cluster
      Spark Streaming     | Processes data in real time
      Lambda Architecture | Combines real-time and batch
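
A worked example for the HDFS slides (6 to 8): a rough sketch of the block arithmetic, using the 128 MB block size and 3 copies stated on the slide. The function name hdfs_footprint is hypothetical, and treating a petabyte as 1024^3 MB (binary units) is an assumption made only for this illustration. It is plain Python and does not talk to a real cluster.

```python
# Block size and replication factor as stated on the "WHAT IS HDFS?" slide.
BLOCK_SIZE_MB = 128
REPLICATION = 3


def hdfs_footprint(file_size_mb: int) -> tuple[int, int]:
    """Return (number of blocks, number of stored block copies) for one file.

    Hypothetical helper for illustration only.
    """
    # Integer ceiling division: a partly filled last block still counts as a block.
    blocks = (file_size_mb + BLOCK_SIZE_MB - 1) // BLOCK_SIZE_MB
    return blocks, blocks * REPLICATION


# Assumption: one petabyte taken as 1024**3 MB (binary units).
ONE_PETABYTE_MB = 1024 ** 3

blocks, copies = hdfs_footprint(ONE_PETABYTE_MB)
print(f"{blocks:,} blocks, {copies:,} stored copies")
```

Under these assumptions a single petabyte file becomes roughly 8.4 million blocks and about 25 million stored copies, which is why the blocks are spread across many machines.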

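
A worked example for the MapReduce slides (9 to 11): a minimal single-machine sketch of "process each block, then combine the results", applied to the same word count as the Spark slide (22). The names map_phase, shuffle, reduce_phase and word_count are hypothetical, and the list of strings only stands in for HDFS blocks. Unlike the Spark snippet, this sketch also drops empty tokens. Real MapReduce runs the map step on the machines that hold each block; nothing here is distributed.

```python
import re
from collections import defaultdict


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block of text."""
    for word in re.split(r"\W+", block):
        if word:  # skip empty tokens produced by leading/trailing separators
            yield (word, 1)


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values for each key into one count."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(blocks):
    """Run map over every block independently, then shuffle and reduce."""
    pairs = (pair for block in blocks for pair in map_phase(block))
    return reduce_phase(shuffle(pairs))


# Two strings standing in for two blocks of a much larger file.
blocks = ["to be or not to be", "that is the question"]
counts = word_count(blocks)
print(counts)
```

Each block is mapped independently, which is what lets a cluster run the map step in parallel next to the data and ship only the small (word, count) pairs over the network.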