BIG ETL 
Elliott Cordo 
Chief Architect, Caserta Concepts
What do we mean by BIG ETL? 
• ETL at scale: 
• Machine data/Log data 
• Web/Social Media 
• Traditional Data -> full history, full fidelity 
• ERP 
• POS 
• CRM 
• Email 
• New Engines 
• MapReduce, Tez 
• Spark 
• Storm
Why do we need to do this? 
• BIG DATA 
• Parallelize ETL for large datasets 
• Increase throughput 
• Shrink ETL windows 
• Offload ill-fitting workloads from MPP and relational databases 
• Data types: 
• Semi-structured 
• Unstructured 
• Custom Engineering: control over code 
Traditional ETL Tools 
• GUI Based ETL tools have their place 
• Not all ETL needs to be BIG ETL 
• Many have integrated Data Quality and MDM → great for master data 
• Can be productive, many are easy to learn. 
• Many are “upgrading” to big data volumes 
• MR or Spark push down (Java code generators are well positioned 
for that) 
• YARN
About GUI based ETL 
• Modern programming languages and DSLs have come a long way 
• Is a GUI always more productive? 
• What if custom engineering is required? 
• Complex operations are 
often difficult to represent
ETL – What are we implementing? 
• Our favorites: 
• Pig 
• Hive 
• Streaming → especially with Python 
• Native MR → only if we have to 
• Spark – R&D, beginning adoption 
• Best fit? 
• Use case 
• Skillset
Pig 
• Procedural Programming 
• SQL Friendly 
• Many source and target connectors: 
• Flat files 
• JSON 
• NoSQL: Redis, Cassandra, HBase 
• And many more 
• Works well for pulling apart unstructured data and working with nested data 
• Large number of value-add libraries: DataFu, PiggyBank, ElephantBird 
• Now running on Tez! 
• Spark support hopefully coming – Spork?
A Pig script 
-- load Yelp business data as nested JSON maps (ElephantBird loader) 
business = LOAD 's3://caserta-bucket1/yelp-academic-dataset/yelp_academic_dataset_business.json' 
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad'); 
-- project the fields we need; strip pipes from names since '|' is the output delimiter 
business_cleaned = FOREACH business GENERATE 
$0#'business_id' as business_id, 
REPLACE($0#'name','|','-') as name, 
$0#'city' as city, 
$0#'state' as state, 
$0#'longitude' as longitude, 
$0#'latitude' as latitude, 
$0#'categories' as categories; 
-- load the pipe-delimited review extract with an explicit schema 
review = LOAD 's3://caserta-bucket1/yelp/in/reviews/' 
USING PigStorage('|') 
AS (review_id:chararray, business_id:chararray, user_id:chararray, rating:float, review_date:chararray); 
-- derive a date key and flag good (rating >= 3) vs. bad reviews 
review_cleaned = FOREACH review GENERATE 
review_id, 
REPLACE(review_date,'-','') as review_date_id, 
business_id, 
user_id, 
rating, 
(int)(rating >= 3 ? 1 : 0) as good_review_count, 
(int)(rating < 3 ? 1 : 0) as bad_review_count;
..continued 
-- join reviews to businesses (10 reducers), then aggregate per business name 
joined = JOIN review_cleaned BY business_id, business_cleaned BY business_id PARALLEL 10; 
grouped = GROUP joined BY name; 
results = FOREACH grouped GENERATE 
group, 
SUM(joined.review_cleaned::good_review_count), 
SUM(joined.review_cleaned::bad_review_count), 
COUNT(joined); 
STORE results INTO 's3://caserta-bucket1/yelp/out/joined/' 
USING PigStorage('|');
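Two notes on this script: it assumes the ElephantBird jars (and their dependencies) have been REGISTERed at the top, and the PARALLEL 10 clause on the JOIN sets the number of reducers for that step, which is the primary knob for scaling the expensive part of the job.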
Hive 
• SQL → approaching the latest ANSI spec 
• Great if you have tabular data, or data which can easily be projected as tabular (via functions or SerDes) 
• Great for joins and “relational operations” 
• Getting more SQL-ish all the time 
• ACID and insert/update/delete allegedly on the way?
Impala – honorable mention 
• Super fast, but not up to brutal ETL workloads 
• In-memory; does not spill to disk gracefully → OOM errors 
• May be useful for some set/relational operations that fit in memory → final aggregation, summary 
• Best saved as a query tool on data that has been prepared by upstream ETL
Streaming 
• For when you don’t want to write native Java MR 
• Great for code control and custom engineering 
• Leverage your existing skillset 
• Python is our favorite 
• Productive/terse 
• Helpful projects (see the sketch below): 
• Pydoop 
• mrjob (from Yelp)
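To make that concrete, here is a minimal streaming sketch: a Python mapper/reducer pair that tallies good vs. bad reviews per business from the same pipe-delimited review extract used in the Pig example. The field layout mirrors that script; the file names are illustrative. 
#!/usr/bin/env python 
# mapper.py - emit business_id <tab> good_flag for each pipe-delimited review 
import sys 

for line in sys.stdin: 
    fields = line.rstrip('\n').split('|') 
    if len(fields) < 5: 
        continue  # skip malformed rows 
    review_id, business_id, user_id, rating, review_date = fields[:5] 
    print('%s\t%d' % (business_id, 1 if float(rating) >= 3 else 0)) 

#!/usr/bin/env python 
# reducer.py - rows arrive sorted by key; roll up good/bad counts per business 
import sys 

current, good, bad = None, 0, 0 
for line in sys.stdin: 
    business_id, flag = line.rstrip('\n').split('\t') 
    if business_id != current: 
        if current is not None: 
            print('%s|%d|%d' % (current, good, bad)) 
        current, good, bad = business_id, 0, 0 
    good += int(flag) 
    bad += 1 - int(flag) 
if current is not None: 
    print('%s|%d|%d' % (current, good, bad)) 

Run it with the hadoop-streaming jar, passing -input and -output paths plus -mapper mapper.py and -reducer reducer.py (shipping both scripts with -file).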
Spark 
• A Swiss Army knife! 
• Streaming, batch, interactive, SQL, graph 
• RDD – Resilient Distributed Dataset 
• APIs for Java, Scala, Python 
• PySpark – turns Python into a true data DSL 
• Best of both worlds – use SQL and Python together 
• Use SQL when it makes sense, Python when it doesn’t
A Spark Script 
from pyspark.sql import SQLContext, Row 
sqlContext = SQLContext(sc) 
#------------------------------------------------ 
# load some users 
lines = sc.textFile("s3://caserta-bucket1/yelp/in/users/users.txt") 
parts = lines.map(lambda l: l.split(",")) 
users = parts.map(lambda p: Row(id=p[0], name=p[1], gender=p[2], age=p[3])) 
schemaUsers = sqlContext.inferSchema(users) 
schemaUsers.registerTempTable("users") 
sqlContext.sql("select count(1) from users").collect() 
#------------------------------------------------ 
# load some reviews from json, baby 
reviews = sqlContext.jsonFile("s3://caserta-bucket1/yelp-academic-dataset/yelp_academic_dataset_review.json") 
reviews.printSchema() 
reviews.registerTempTable("reviews") 
sqlContext.sql("select count(1) from reviews").collect() 
sqlContext.sql("select user_id, votes.cool as cool, votes.useful as useful from reviews").take(10) 
#------------------------------------------------ 
# let's join! 
genders_summary = sqlContext.sql(""" 
select gender, 
count(1) as cnt 
from reviews r 
join users u on u.id = r.user_id 
group by gender""") 
genders_summary.collect()
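As written, this assumes an interactive pyspark shell (for example on the EMR cluster used in the demo), where sc is already bound to a SparkContext; inferSchema and jsonFile are the Spark 1.x SQLContext APIs current at the time. A standalone job would create its own SparkContext and launch via spark-submit.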
Tip - Avoid low level coding: start by evaluating DSLs 
• Structured/tabular data, and practical to express in SQL? → Hive 
• Structured/tabular, but not practical in SQL? → Pig 
• Not tabular, but core/extended libraries or a custom UDF will help? → Pig 
• None of the above? → use Streaming or native MR 
• Spark is emerging as an option across all of these paths
Tip - Use Hive Metastore 
• The Hive Metastore can be leveraged by a wide array of applications: 
• Spark 
• Hive 
• Impala 
• Pig 
• Pig fans can use HCatLoader 
• Can ingest any format covered by Hive SerDes 
• Push-down predicate filters → partitioning 
• Spark uses HiveContext (see the sketch below)
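A minimal sketch of that last point, assuming a reviews table has already been defined in the metastore (the yelp.reviews name here is hypothetical): 
from pyspark.sql import HiveContext 

# HiveContext picks up table definitions, SerDes and partitions from the Hive Metastore 
hc = HiveContext(sc)  # sc provided by the pyspark shell 

# query a metastore-managed table directly; partition filters can be pushed down 
hc.sql("select count(1) from yelp.reviews").collect()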
Tip - Use that downstream MPP 
• MPPs are pretty darn good at processing data 
• A lot of ETL operations can be expressed in SQL 
• BEWARE of MPP abuse: 
• Hadoop is a lot cheaper than your MPP 
• ..and likely more scalable and elastic 
• “Forcing” operations that don’t fit 
• GOOD operations to keep in the MPP: 
• Data warehouse stuff! 
• Dimensionalization 
• Surrogate keys 
• Type 2 dimensions 
• Summary tables 
• Derived facts
DEMO Some Spark! 
…
Awesome collection of AWS-developed bootstrap actions: 
https://github.com/awslabs/emr-bootstrap-actions 
The scripts we shared: 
https://gist.github.com/elliottcordo/f5267a75defff468b757#file-yelp_pyspark_example-py 
https://gist.github.com/elliottcordo/45bdd460e2ec7148b6b9#file-yelp_pig_join-pig 
elliott@casertaconcepts.com
