Big Data Warehousing Meetup: BigETL: Trad Tool vs Pig vs Hive vs Python. What to use when. (Slide Set #2)
We discussed best practices for choosing the right tool for the Big ETL job, whether that's traditional ETL tools, MapReduce, Pig, Hive, or Python. We explained when, where, and how to apply the different steps of the ETL process in the Big Data ecosystem, which tool to use for each, and walked through a demo of a real use case.
Presenters included Joe Caserta, President, Caserta Concepts; Elliott Cordo, Chief Architect, Caserta Concepts; and Kyle Hubert, Principal Data Architect, Simulmedia.
The event was hosted by Caserta Concepts and Simulmedia and sponsored by O'Reilly.
For more information on Caserta Concepts, visit our website at http://www.casertaconcepts.com/.
For access to additional slide decks, visit our SlideShare site at http://www.slideshare.net/CasertaConcepts.
2. What do we mean by BIG ETL?
• ETL at scale:
• Machine data/Log data
• Web/Social Media
• Traditional Data -> full history, full fidelity
• ERP
• POS
• CRM
• Email
• New Engines
• Map Reduce, Tez
• Spark
• Storm
3. Why do we need to do this?
• BIG DATA
• Parallelize ETL for large datasets
• Increase throughput
• Shrink ETL windows
• Offload ill-fitting workloads from MPP and relational databases
• Data types:
• Semi-structured
• Unstructured
• Custom Engineering: control over code
4. Traditional ETL Tools
• GUI-based ETL tools have their place
• Not all ETL needs to be BIG ETL
• Many have integrated Data Quality and MDM, great for master data
• Can be productive; many are easy to learn
• Many are “upgrading” to big data volumes
• MR or Spark push-down (Java code generators are well positioned for that)
• YARN
5. About GUI-based ETL
• Modern programming languages and DSLs have come a long way
• Is a GUI always more productive?
• What if custom engineering is required?
• Complex operations are often difficult to represent
6. ETL – What are we implementing?
• Our favorites:
• Pig
• Hive
• Streaming, especially with Python
• Native MR only if we have to
• Spark – R&D, beginning adoption
• Best fit?
• Use case
• Skillset
7. Pig
• Procedural Programming
• SQL Friendly
• Many source and target connectors
• Flat files
• JSON
• NoSQL: Redis, Cassandra, HBase
• And many more
• Works well for pulling apart unstructured data and working with nested data
• Large number of value-add libraries: DataFu, PiggyBank, ElephantBird
• Now running on Tez!
• Spark support hopefully coming – Spork?
8. A Pig script
-- Load the Yelp business dataset from JSON using ElephantBird's nested loader
business = LOAD 's3://caserta-bucket1/yelp-academic-dataset/yelp_academic_dataset_business.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
-- Project and clean the business fields out of the nested JSON map
business_cleaned = FOREACH business GENERATE
    $0#'business_id' as business_id,
    REPLACE($0#'name','|','-') as name,
    $0#'city' as city,
    $0#'state' as state,
    $0#'longitude' as longitude,
    $0#'latitude' as latitude,
    $0#'categories' as categories;
-- Load pipe-delimited review records with an explicit schema
review = LOAD 's3://caserta-bucket1/yelp/in/reviews/'
    USING PigStorage('|')
    AS (review_id:chararray, business_id:chararray, user_id:chararray, rating:float, review_date:chararray);
-- Derive a date key and good/bad review flags from the rating
review_cleaned = FOREACH review GENERATE
    review_id,
    REPLACE(review_date,'-','') as review_date_id,
    business_id,
    user_id,
    rating,
    (int)(rating >= 3 ? 1 : 0) as good_review_count,
    (int)(rating < 3 ? 1 : 0) as bad_review_count;
9. ..continued
-- Join reviews to businesses (10-way parallel), then aggregate per business name
joined = JOIN review_cleaned BY business_id, business_cleaned BY business_id PARALLEL 10;
grouped = GROUP joined BY name;
results = FOREACH grouped GENERATE
    group,
    SUM(joined.review_cleaned::good_review_count),
    SUM(joined.review_cleaned::bad_review_count),
    COUNT(joined);
-- Write pipe-delimited results back to S3
STORE results INTO 's3://caserta-bucket1/yelp/out/joined/'
    USING PigStorage('|');
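As a usage note: assuming S3 credentials are configured for the cluster, a script like this is submitted in the usual way with pig yelp_join.pig (the file name here is illustrative), or tested against local files with pig -x local.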
10. Hive
• SQL approaching the latest ANSI spec
• Great if you have tabular data, or data which can easily be projected as tabular (via functions or SerDes)
• Great for joins and “relational operations”
• Getting more SQL-ish all the time
• ACID and insert/update/delete allegedly on the way?
11. Impala – honorable mention
• Super fast, but not up to brutal ETL workloads
• In-memory; does not spill over to disk gracefully (OOM errors)
• May be useful for some set/relational operations that can be done in memory (final aggregation, summaries)
• Best saved as a query tool on data that has been prepared by upstream ETL
12. Streaming
• Don’t want to write native Java MR?
• Great for code control and custom engineering
• Leverage existing skillset
• Python is our favorite
• Productive/Terse
• Helpful projects like:
• Pydoop
• mrjob (Yelp) – see the sketch below
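To make the streaming option concrete, here is a minimal sketch using mrjob (named above). The script name and the pipe-delimited review layout, borrowed from the Pig script on slide 8, are assumptions for illustration, not code from the talk.

# rating_counts.py -- a sketch assuming pipe-delimited review records
# (review_id|business_id|user_id|rating|review_date), as in the Pig example
from mrjob.job import MRJob

class RatingCounts(MRJob):
    # Count good (rating >= 3) vs. bad (rating < 3) reviews per business
    def mapper(self, _, line):
        fields = line.split('|')
        business_id, rating = fields[1], float(fields[3])
        yield business_id, (1 if rating >= 3 else 0, 1 if rating < 3 else 0)

    def reducer(self, business_id, counts):
        good, bad = map(sum, zip(*counts))
        yield business_id, {'good': good, 'bad': bad}

if __name__ == '__main__':
    RatingCounts.run()

mrjob runs the same script locally (python rating_counts.py reviews.txt) or on a cluster via its -r hadoop and -r emr runners.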
13. Spark
• A swiss army knife!
• Streaming, batch, interactive, SQL, Graph
• RDD – Resilient Distributed Dataset
• APIs for Java, Scala, Python
• PySpark – turns Python into a true data DSL
• Best of both worlds – use SQL and Python together
• Use SQL when it makes sense, Python when it doesn’t
14. A Spark Script
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
#------------------------------------------------
#load some users
lines = sc.textFile("s3://caserta-bucket1/yelp/in/users/users.txt")
parts = lines.map(lambda l: l.split(","))
users = parts.map(lambda p: Row(id=p[0],name=p[1],gender=p[2],age=p[3]))
schemaUsers = sqlContext.inferSchema(users)
schemaUsers.registerTempTable("users")
sqlContext.sql("select count(1) from users").collect()
#------------------------------------------------
#load some reviews from json baby
reviews = sqlContext.jsonFile("s3://caserta-bucket1/yelp-academic-dataset/yelp_academic_dataset_review.json")
reviews.printSchema()
reviews.registerTempTable("reviews")
sqlContext.sql("select count(1) from reviews").collect()
sqlContext.sql("select user_id, votes.cool as cool, votes.useful as useful from reviews").take(10)
#------------------------------------------------
#let's join!
genders_summary = sqlContext.sql("""
select gender,
count(1) as cnt
from reviews r
join users u on u.id = r.user_id
group by gender""" )
genders_summary.collect()
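As a usage note: pasted into a pyspark shell this runs as-is (the shell provides sc); as a standalone script it would first need a SparkContext created, and could then be launched with spark-submit.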
15. Tip – Avoid low-level coding: start by evaluating DSLs
[Decision flowchart: Is the data structured/tabular? Is it practical to express in SQL? Will core or extended libraries cover it, or would a custom UDF help? The yes/no answers route you to Hive, Pig, or Spark, falling back to Streaming or native MR.]
16. Tip - Use Hive Metastore
• The Hive Metastore can be leveraged by a wide array of applications:
• Spark
• Hive
• Impala
• Pig
• Pig fans can use HCatLoader
• Can ingest any format covered by Hive SerDes
• Push-down predicate filters and partitioning
• Spark uses HiveContext – see the sketch below
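As a concrete illustration of that last point, here is a minimal PySpark sketch using HiveContext (the Spark 1.x API, consistent with the script on slide 14). The reviews table name is an assumption; any table registered in the Hive Metastore works the same way.

# metastore_example.py -- a sketch assuming a 'reviews' table is already
# defined in the Hive Metastore (e.g. created in Hive or via HCatalog)
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="metastore-example")
hiveContext = HiveContext(sc)

# The metastore supplies the schema, so nothing is re-declared here, and
# filters on partition columns can be pushed down by the engine
hiveContext.sql("SELECT COUNT(1) FROM reviews").collect()

The same table definition is then visible, unchanged, from Hive, Impala, and Pig (via HCatLoader), which is the point of centralizing it in the metastore.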
17. Tip – That downstream MPP
• MPPs are pretty darn good at processing data
• A lot of ETL operations can be expressed in SQL
• BEWARE of MPP abuse
• Hadoop is a lot cheaper than your MPP
• …and likely more scalable and elastic
• “Forcing” operations that don’t fit
• GOOD operations to remain in MPP:
• Data warehouse stuff!
• Dimensionalization
• Surrogate keys
• Type 2 dimensions
• Summary tables
• Derived facts
19. Awesome collection of AWS-developed bootstrap actions:
https://github.com/awslabs/emr-bootstrap-actions
The scripts we shared:
https://gist.github.com/elliottcordo/f5267a75defff468b757#file-yelp_pyspark_example-py
https://gist.github.com/elliottcordo/45bdd460e2ec7148b6b9#file-yelp_pig_join-pig
elliott@casertaconcepts.com