Build a Big Data Warehouse on the Cloud in 30 Minutes


Elliott Cordo, Chief Architect at Caserta Concepts, gives a live demo using Amazon's AWS to build a big data warehouse: S3 for data storage, Elastic MapReduce (EMR) for data manipulation, and Redshift for interactive queries.



  1. Big Data Warehousing Meetup - April 8, 2014. Building a Big Data Warehouse on the Cloud in 30 Minutes. Sponsored by:
  2. Agenda: 7:00 - 7:15 Networking (15 min): grab some food and drink, make some friends. 7:15 - 7:35 Bob Eilbacher, VP Sales, Caserta Concepts (20 min): Welcome + Intro - about the Meetup, about Caserta Concepts + swag. 7:35 - 8:20 Elliott Cordo, Chief Architect, Caserta Concepts (45 min): Building a Big Data Warehouse on the Cloud - live demo of Amazon's AWS, S3, EMR, and Redshift. 8:20 - 8:40 Ben Sgro, Sr. Software Engineer, Simulmedia (20 min): Implementing Redis on the Cloud - an ultra-low-latency customer segmentation tool with AWS ElastiCache. 8:40 - 9:00 Q&A (10 min), more networking (10 min): tell us what you're up to...
  3. Gathering music brought to you by... BIG DATA, a paranoid electronic music project from the Internet, formed out of a general distrust for technology and The Cloud (despite a growing dependence on them).
  4. About the BDW Meetup: Big Data is a complex, rapidly changing landscape. We want to share our stories and hear about yours. A great networking opportunity for like-minded data nerds, with opportunities to collaborate on exciting projects. Founded by Caserta Concepts - Big Data analytics, DW, and BI consulting.
  5. A BDW Meetup Milestone
  6. Next BDW Meetup: Real-World Data Science w/ Claudia Perlich. Date: Tuesday, May 27, 2014, 7:00 PM. Location: New Work City, Broadway & Canal. Sponsor: Revolution Analytics.
  7. Caserta Concepts: a technology innovation company with expertise in Big Data solutions, data warehousing, and business intelligence, plus data science & analytics, data on the cloud, and data interaction & visualization. Core focus industries: eCommerce/retail/marketing, financial services/insurance, healthcare/digital media. Established in 2001: increased growth year over year, an industry-recognized workforce, and consulting, writing, and education offerings.
  8. Innovation & Implementation: listed as one of the Top 20 Most Promising Data Analytics Consulting Companies. CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones at the forefront of tackling real analytics challenges. A distinguished panel comprising CEOs, CIOs, VCs, industry analysts, and the editorial board of CIOReview selected the final 20.
  9. Expertise & Offerings: strategic roadmap / assessment / education / implementation, across data warehousing/ETL/data integration, BI/visualization/analytics, and Big Data analytics.
  10. Caserta Partners: Hadoop distributions, platforms/ETL, analytics & BI.
  11. Client Portfolio: finance, healthcare & insurance; retail/eCommerce & manufacturing; education & services.
  12. Join Our Network: does this word cloud excite you? (Storm, Big Data Architect, HBase, Cassandra...) Speak with us about our open positions.
  13. SWAG
  14. "Big Data is like water. There is little point in debating how much there is. It's the flow and use that matters." - @dominiek, 3/20/2014, Gigaom Structure Data (#gigaomlive)
  15. BUILDING A BIG DATA WAREHOUSE IN THE CLOUD IN 30 MIN. Elliott Cordo, Chief Architect, Caserta Concepts
  16. What is a Big Data Warehouse? An enterprise system providing reliable ad hoc analytics, reporting, and decision support. Large scale - Big Data. Not confined to the traditional dimensional model.
  17. Big Data Warehouse: data governance is still important! Data quality; metadata (naming, lineage, etc.). Data cannot be governed until it is structured. Layers (top to bottom): Big Data Warehouse; Data Science Workspace; Data Lake - integrated sandbox; Landing - source data in "full fidelity".
  18. Cloud: infrastructure is not fun - months to server procurement, inability to handle growth, servers idling all day doing nothing. Cloud to the rescue: unlimited cheap storage, new servers provisioned in minutes, use of elastic services (EMR!). AWESOME for prototypes and POCs.
  19. About our sample data: consumer Yelp ratings, generated from a Kaggle dataset and scaled to ~100 million rows. The model is a star schema that looks something like this: f_reviews at the center, joined to d_date, d_business, and d_user.
  20. So let's get cooking: 1. Create an EMR cluster (on-demand Hadoop). 2. Provision a Redshift cluster (data warehouse).
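The two provisioning steps above can be sketched with Boto (the Python AWS module mentioned later in the deck); this uses the current boto3 API rather than the 2014-era Boto, and all cluster names, instance types, and node counts are illustrative assumptions, not values from the demo.

```python
def emr_cluster_config(name="bdw-demo-emr"):
    """Parameters for boto3's emr.run_job_flow -- on-demand Hadoop."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",              # any current EMR release
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,  # cluster dies when the job is done
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def redshift_cluster_config(name="bdw-demo-dw"):
    """Parameters for boto3's redshift.create_cluster -- the warehouse."""
    return {
        "ClusterIdentifier": name,
        "NodeType": "dc2.large",
        "NumberOfNodes": 2,
        "MasterUsername": "admin",
        "MasterUserPassword": "CHANGE_ME",         # placeholder
        "DBName": "bdw",
    }

# To actually launch (requires AWS credentials):
# import boto3
# boto3.client("emr").run_job_flow(**emr_cluster_config())
# boto3.client("redshift").create_cluster(**redshift_cluster_config())
```

Keeping the configs as plain dicts makes them easy to review and test before anything is launched.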
  21. Redshift: massively parallel processing (MPP). Columnar DBs that present themselves as relational. MPPs grew up in parallel to Hadoop (Impala and HAWQ are MPPs themselves). An OEM of Actian Matrix (formerly ParAccel): a modern MPP - clean, reliable, SCHEMA AGNOSTIC.
  22. Redshift is inexpensive: enterprise-grade EDW at $1,000/TB per year.
  23. MPP design considerations. Join strategies: shuffle - data is large and gets redistributed by key across servers; broadcast - data is small and gets copied to all servers; collocated - all data needed for the join is already on the same server. Design for MPP: a distribution key that gives collocated joins and even distribution of work across the cluster (a customer key will often work well); a sort key for the fastest scan operations (the primary date field is usually best).
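Those design choices land in the table DDL. A minimal sketch for the sample fact table, assuming the user key as distribution key and the review date as sort key (column names are illustrative, following the star schema on slide 19):

```python
# Redshift DDL kept as a string so the key choices are visible in one place.
FACT_DDL = """
CREATE TABLE f_reviews (
    review_date_key  INT      NOT NULL,          -- sort key: fast date-range scans
    business_key     INT      NOT NULL,
    user_key         INT      NOT NULL DISTKEY,  -- collocates joins on user
    stars            SMALLINT
)
SORTKEY (review_date_key);
"""
```

Running it is a single `execute` over any PostgreSQL-protocol connection to the Redshift cluster.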
  24. ETL - transform your data. S3 is the ultimate staging ground. Use EMR for the heavy lifting: run your ETL program and kill the cluster when done - pay just for processing. Pig, native MapReduce, or streaming. For the right use case, Hive or Impala can be used for ETL too (mainly for aggregates and summaries).
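The "streaming" option means any script that reads stdin can be a Hadoop job. A sketch of the logic for averaging stars per business (field layout and names are assumptions; with Hadoop streaming the mapper and reducer would run as separate scripts, and Hadoop itself performs the grouping):

```python
def mapper(line):
    """Emit (business_id, stars) from a tab-separated review record."""
    business_id, user_id, stars = line.rstrip("\n").split("\t")
    return business_id, float(stars)

def reducer(business_id, stars_list):
    """Average the stars seen for one business."""
    return business_id, sum(stars_list) / len(stars_list)

def run(lines):
    """Simulate the shuffle locally: group mapper output by key, then reduce."""
    grouped = {}
    for line in lines:
        key, value = mapper(line)
        grouped.setdefault(key, []).append(value)
    return dict(reducer(k, v) for k, v in grouped.items())
```

Because the functions are pure, the same pair can be tested locally on a few lines before being pointed at 100 million rows on EMR.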
  25. Smaller data - don't need EMR? Python ETL on EC2 (on demand); you can later "graduate" to big data using Hadoop streaming. Your favorite ETL tool is just fine too.
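For that single-machine case, the standard library is often enough. A sketch (column names assumed) that turns raw review CSV into the pipe-delimited form Redshift's COPY reads by default:

```python
import csv
import io

def transform(raw_csv_text):
    """CSV in, pipe-delimited out, with stars coerced to int."""
    out = io.StringIO()
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    for row in reader:
        out.write("|".join([row["business_id"],
                            row["user_id"],
                            str(int(float(row["stars"])))]) + "\n")
    return out.getvalue()
```

The same function body, minus the I/O wrapper, is what would later move into a streaming mapper when the data outgrows one box.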
  26. Presentation layer - the data warehouse. How do you get your ETL data in? Hadoop distcp for high-performance transfer of data from S3 to HDFS, and a distributed COPY from S3 into Redshift.
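The distributed COPY is one SQL statement; Redshift fans the load out across all slices itself. A sketch that builds it (bucket, table, and role names are placeholders; the slides predate IAM_ROLE auth, which replaced the older CREDENTIALS clause):

```python
def copy_statement(table, s3_prefix, iam_role):
    """Build a Redshift COPY that loads every gzipped file under an S3 prefix."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"DELIMITER '|' GZIP;"
    )
```

Splitting the staged data into multiple gzipped files under the prefix lets every slice load in parallel.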
  27. And how do you orchestrate all of this? AWS Data Pipeline, the AWS CLI, a driver program built using modules like Boto (Python), or cron / an external scheduler.
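The "driver program" option can be as small as a list of named stages run in order; stage names here are illustrative, and in practice each function would call Boto or shell out to the AWS CLI:

```python
def run_pipeline(stages):
    """Execute (name, fn) pairs in order; return the names that completed.

    Any exception stops the pipeline at that stage, so a failed EMR step
    never leads to a half-loaded warehouse.
    """
    completed = []
    for name, fn in stages:
        fn()
        completed.append(name)
    return completed

# Hypothetical wiring for this demo:
# stages = [
#     ("launch_emr", launch_emr),
#     ("run_etl", run_etl),
#     ("copy_to_redshift", copy_to_redshift),
#     ("terminate_emr", terminate_emr),
# ]
# run_pipeline(stages)
```

Cron (or any external scheduler) then only needs to invoke this one script.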
  28. Back to AWS: 1. Apply the Redshift DDL and load the tables. 2. Run some queries.
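A sketch of step 2, the interactive payoff: a star-schema join and aggregate. SQLite stands in for Redshift here so the query can run anywhere (the tiny dataset and business names are made up); the SQL itself is portable.

```python
import sqlite3

# Miniature version of the sample star schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d_business (business_key INT, name TEXT);
CREATE TABLE f_reviews  (business_key INT, user_key INT, stars INT);
INSERT INTO d_business VALUES (1, 'Joe''s Pizza'), (2, 'Cafe Luna');
INSERT INTO f_reviews  VALUES (1, 10, 4), (1, 11, 5), (2, 12, 3);
""")

# The kind of ad hoc query the warehouse exists to answer.
rows = conn.execute("""
SELECT b.name, AVG(f.stars) AS avg_stars, COUNT(*) AS n_reviews
FROM f_reviews f JOIN d_business b USING (business_key)
GROUP BY b.name ORDER BY avg_stars DESC
""").fetchall()
```

On Redshift, with the fact table distributed and sorted as above, the same shape of query stays interactive at the 100-million-row scale of the demo.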