Big Data Warehousing
January 20, 2014

Sponsored By:

Today’s Topic: Big Data 2.0: YARN
Distributed ETL & SQL with Hadoop

In our most recent Big Data Warehousing Meetup, we learned about the transition from Big Data 1.0, built on Hadoop 1.x and nascent technologies, to Hadoop 2.x with YARN, which enables distributed ETL, SQL, and analytics solutions. Caserta Concepts Chief Architect Elliott Cordo and an Actian engineer covered the complete data value chain of an enterprise-ready platform, including data connectivity, collection, preparation, optimization, and analytics with end-user access.

For more information on our services or upcoming events, please visit our website at http://www.casertaconcepts.com/.


Transcript of "Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop"

  1. Big Data Warehousing, January 20, 2014. Sponsored By: Today’s Topic: Big Data 2.0: YARN Distributed ETL & SQL with Hadoop
  2. Agenda 7:00 Networking (15 min) Grab some food and a drink... Make some friends. 7:15 Welcome + Intro President, Caserta Concepts 7:30 Joe Caserta (15 min) About the Meetup, about Caserta Concepts Elliott Cordo (20 min) Hadoop 2.0: The Evolution of Hadoop, SQL, and NoSQL Chief Architect, Caserta Concepts 7:50 Paul Dingman (20 min) Chief Technologist, Actian Innovation Lab Using Actian to process data in Hadoop The latest features of Actian to enable maximum throughput 8:10 Tyler Mitchell (35 min) See how it works! Senior Engineer, Actian Innovation Lab 8:45 Q&A, More Networking (15 min) Tell us what you’re up to…
  3. About the BDW Meetup • Big Data is a complex, rapidly changing landscape • We want to share our stories and hear about yours • Great networking opportunity for like-minded data nerds • Opportunities to collaborate on exciting projects • Founded by Caserta Concepts, DW, BI & Big Data Analytics Consulting • Next BDW Meetup: February 10, 2014 • Data Governance on Big Data with Cloudera
  4. About Caserta Concepts. Focused Expertise: • Big Data Analytics • Data Warehousing • Business Intelligence • Strategic Data Ecosystems. Industries Served: • Financial Services • Healthcare / Insurance • Retail / eCommerce • Digital Media / Marketing • K-12 / Higher Education. Founded in 2001 • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
  5. Implementation Expertise & Offerings: Strategic Roadmap/Assessment/Consulting • Big Data Analytics • Storm Database • BI/Visualization/Analytics • Master Data Management
  6. Client Portfolio: Finance & Insurance • Retail/eCommerce & Manufacturing • Education & Services
  7. Caserta Partners: Hadoop Distributions • Platforms/ETL • Analytics & BI
  8. Caserta Concepts Listed as a Top 20 Most Promising Data Analytics Consulting Company. CIOReview looked at hundreds of data analytics consulting companies and shortlisted the ones at the forefront of tackling real analytics challenges. A distinguished panel comprising CEOs, CIOs, VCs, industry analysts and the editorial board of CIOReview selected the final 20.
  9. Opportunities. Does this word cloud excite you? Speak with us about our open positions: jobs@casertaconcepts.com
  10. BIG DATA 2.0, EVOLUTION OF HADOOP, SQL, AND NOSQL. Elliott Cordo, Chief Architect, Caserta Concepts
  11. Hadoop 1.0: WHAT DID WE ACHIEVE • Established Hadoop’s place in analytic architecture • Realized cheap, reliable, scalable storage and processing • Made us more data driven • Store and process anything → new data types, structured, unstructured • New types of analysis, including machine learning → Mahout
  12. What did this mean to the Big Data Warehouse • Extending the Data Warehouse • Establish new facts and “projections” in Hadoop on unstructured and high-volume data sources • Hive, Impala • Datameer • BIG ETL → using MapReduce pipelines to process massive amounts of data • Using our favorite ELT tool, Pig • Data storage for staging • Reducing the costs and increasing the performance of our EDW
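The “BIG ETL” pattern above can be sketched locally. This is a minimal, pure-Python simulation of the MapReduce pipeline a Pig or Hadoop Streaming job runs at scale; all names here are illustrative, not Hadoop APIs.

```python
# MapReduce pattern behind "BIG ETL": map emits (key, value) pairs,
# the shuffle groups by key, and reduce aggregates each group.
from collections import defaultdict

def map_phase(records):
    """Emit (word, 1) for every word -- the classic word-count mapper."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group values by key, as the Hadoop shuffle/sort stage does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values -- the word-count reducer."""
    return {key: sum(values) for key, values in groups.items()}

staged = ["big data warehouse", "big etl pipelines", "data pipelines"]
counts = reduce_phase(shuffle(map_phase(staged)))
```

On a cluster, map and reduce run on many nodes in parallel and the shuffle moves data over the network; the logic per record is the same.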
  13. Where did it fall short • Pretty much only MapReduce • Batch oriented – not tuned to real-time or interactive processes → look at what was achieved with Impala side-stepping MR for SQL queries on Hadoop • Hive performance made users sad • Legacy vendors were slow to adopt due to the massive paradigm shift in their product architecture.
  14. Hadoop 2.0 – what is the big deal • YARN: “Yet Another Resource Negotiator” • The JobTracker and TaskTracker have been split up • Increases scalability • Removes MapReduce from the core architecture • Now there is a • Global ResourceManager • Per-application ApplicationMaster (MapReduce will have its own) • Per-node slave NodeManager (with per-application containers)
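The split described on this slide can be illustrated with a toy model. This is not the Hadoop API — the class names and the first-fit allocation below are simplifications — but it shows the core idea: a global resource manager grants containers out of per-node capacity, and each application asks for its own.

```python
# Toy model of the YARN split: a global ResourceManager hands out
# containers from per-node capacity, so MapReduce jobs and non-MR
# apps share one cluster instead of one monolithic JobTracker.
class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return {"app": app, "node": node.name, "mb": memory_mb}
        return None  # cluster is full; in real YARN the request queues

rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])
c1 = rm.allocate("mapreduce-wordcount", 3072)  # an MR app's container ask
c2 = rm.allocate("storm-topology", 3072)       # a non-MR app shares the cluster
c3 = rm.allocate("impala-query", 3072)         # no node has 3 GB free
```

The point of the real architecture is exactly this bookkeeping: every framework’s containers are visible to, and bounded by, one scheduler.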
  15. YARN – why is it significant • Provides a management layer between applications and Hadoop • These applications could still be MapReduce • Or all sorts of applications such as streaming, ETL engines, and new database engines, all running NATIVELY in Hadoop! • These applications can have access to HDFS and be safely contained by cluster resource limits • First-generation Impala ran OUTSIDE of this management layer and competed with cluster resources • More intelligent use of cluster resources → not just slots… more productivity out of the same hardware.
  16. Why is it important that we are moving beyond MapReduce? • MapReduce is a generalized computing framework • A query engine, for instance, can benefit from a “non-generalized” pattern; the full flexibility isn’t needed • In-memory/disk data access • Index usage • Serialization • Shuffling/data movement → again, look at the Impala approach • MapReduce is not well suited for other tasks such as real-time stream processing, iterative machine learning, and graph processing
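The “index usage” point can be made concrete with a small sketch. The counters below stand in for actual I/O cost; the data and structures are illustrative, not any engine’s internals.

```python
# Why a query engine gains from a "non-generalized" access path:
# a generalized scan (the MapReduce way) touches every record,
# while an index answers a point lookup with a single probe.
rows = [{"id": i, "city": "ny" if i % 2 else "sf"} for i in range(1000)]

def full_scan(rows, wanted_id):
    touched = 0
    for row in rows:
        touched += 1
        if row["id"] == wanted_id:
            return row, touched
    return None, touched

# Built once, reused for every subsequent query.
index = {row["id"]: row for row in rows}

def indexed_lookup(index, wanted_id):
    return index.get(wanted_id), 1  # one probe instead of a scan

_, scan_cost = full_scan(rows, 999)
_, index_cost = indexed_lookup(index, 999)
```

A generalized framework cannot assume the index exists; a purpose-built query engine can build and exploit it.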
  17. ETL can benefit from this approach too! • ETL has a broader scope than query engines, but gains can be made from a purpose-built processing framework • Batch is not the only way! Streaming apps can now interact with HDFS and be managed by cluster resources • Storm • Spark • Existing assets: SIGNIFICANT existing IP can be more easily leveraged from both open source and commercial software
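The batch-vs-streaming contrast can be sketched in a few lines. A generator stands in for a Storm spout or Spark stream here; this is an illustration of the processing model, not either framework’s API.

```python
# Streaming ETL vs batch: records are processed as they arrive,
# and running state is updated per event instead of per nightly batch.
def stream(records):
    for record in records:
        yield record  # in a real topology this is a socket or queue

running_total = {}
for city, amount in stream([("ny", 100), ("sf", 50), ("ny", 25)]):
    # State is current after every event -- no waiting for a batch window.
    running_total[city] = running_total.get(city, 0) + amount
```

Under YARN, a long-running topology doing exactly this kind of incremental update can share the cluster with batch jobs instead of living on separate hardware.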
  18. Back to query engines. MPP: Massively Parallel Processing – scalable, distributed processing engines. • Typically the underlying storage is columnar in nature (performance, compression, easier to distribute data) • Present themselves relationally and handle all the brutal work of aggregation and joins → ANSI-compliant SQL • Impala, HAWQ – the industry is really just taking the approach of building MPPs on Hadoop • Columnar storage: ORC, Parquet, proprietary • Advanced query optimizations
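The performance and compression claims for columnar storage can be shown with plain Python stand-ins for the two layouts (real formats like ORC and Parquet are far more sophisticated; this only illustrates the principle).

```python
# Why MPP engines favor columnar storage: aggregating one column
# reads only that column's values, and runs of repeated values
# compress well under run-length encoding.
row_store = [("ny", 100), ("ny", 150), ("sf", 200), ("sf", 250)]

# Columnar layout: one sequence per column.
col_store = {"city": ["ny", "ny", "sf", "sf"],
             "sales": [100, 150, 200, 250]}

total_row = sum(sales for _, sales in row_store)  # touches every full row
total_col = sum(col_store["sales"])               # touches one column only

def rle(values):
    """Run-length encode a column: [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

compressed_city = rle(col_store["city"])
```

Both layouts give the same answer; the columnar one reads less data per query and stores repeated values more compactly, which is exactly the trade MPP engines make.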
  19. MPPs leveraging dedicated storage • Modern MPPs like Actian’s Matrix are also taking advantage of Hadoop • Building tight integrations to Hadoop infrastructure: On-Demand Integration • Developing tools and frameworks that leverage YARN for heavy lifting → ETL
  20. So, about NoSQL • In “Big Data 1.0” NoSQL found its place as a mainstream analytic store: • Cassandra • HBase • Redis • Riak • They gave us raw, unbeatable performance for handling real-time analytic workloads
  21. NoSQL use cases – BDW • Highly scalable and flexible staging and ODS layers • High-performance analytic store → real-time data analytic systems • Recommendation and customer profile data → web-facing performance characteristics, flexible schema • BIG ETL components → reference data lookup cache, stream joins
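The “reference data lookup cache” pattern can be sketched as follows. A plain dict stands in for a key/value store like HBase or Redis; the field names are hypothetical.

```python
# ETL lookup-cache pattern: enrich a stream of events against
# reference data held in a fast key/value store, avoiding a
# per-record round trip to the warehouse.
reference = {  # e.g. loaded once from a customer dimension
    "c1": {"segment": "premium"},
    "c2": {"segment": "basic"},
}

def enrich(events, reference):
    """Stream-join each event against the lookup cache."""
    for event in events:
        profile = reference.get(event["customer"], {"segment": "unknown"})
        yield {**event, **profile}

events = [{"customer": "c1", "amount": 10},
          {"customer": "c3", "amount": 5}]
enriched = list(enrich(events, reference))
```

The default for unknown keys matters in practice: late-arriving reference data is a classic ETL failure mode, and the store’s answer (miss vs match) decides whether the event is enriched, parked, or flagged.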
  22. 2.0 NoSQL evolutions: SQL!!! • Easier adoption • Standardizing interfaces: Cassandra CQL3, Phoenix on HBase • Evolving: • Greater flexibility on in-memory/disk persistence • In-memory storage will also likely usher in more flexibility for server-side processing: MapReduce, aggregation, joins • Analytic support
  23. So... in conclusion: HADOOP IS THE NEW DATA OS? What we have: • A distributed file system • A robust multitenant resource manager • A generalized framework for distributed computing and data processing • Even greater mainstream adoption of NoSQL • SQL rules!
  24. THANK YOU. Elliott Cordo (elliott@casertaconcepts.com), Chief Architect, Caserta Concepts
