Scaling ETL with Hadoop
Gwen Shapira
@gwenshap
gshapira@cloudera.com
Coming soon to a bookstore near you…
• Hadoop Application Architectures
How to build end-to-end solutions using Apache Hadoop and related tools
@hadooparchbook
www.hadooparchitecturebook.com
ETL is…
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
Hadoop Is…
• HDFS – Massive, redundant data storage
• MapReduce – Batch oriented data processing at scale
• Many, many ways to process data in parallel at scale
The Ecosystem
• High level languages and abstractions
• File, relational and streaming data integration
• Process Orchestration and Scheduling
• Libraries for data wrangling
• Low-latency query language
Why ETL with Hadoop?
Data Has Changed in the Last 30 Years
• Data growth since 1980 has been driven by end-user applications, the internet, mobile devices and sophisticated machines
• Structured data is now roughly 10% of the total; unstructured data roughly 90%
Volume, Variety, Velocity Cause Problems
Diagram: OLTP and enterprise applications feed an Extract-Transform-Load pipeline into the Data Warehouse, which serves Business Intelligence queries.
1. Slow data transformations. Missed SLAs.
2. Slow queries. Frustrated business and IT.
3. Must archive. Archived data can't provide value.
Got unstructured data?
• Traditional ETL:
• Text
• CSV
• XLS
• XML
• Hadoop:
• HTML
• XML, RSS
• JSON
• Apache Logs
• Avro, ProtoBuffs, ORC, Parquet
• Compression
• Office, OpenDocument, iWorks
• PDF, Epub, RTF
• Midi, MP3
• JPEG, Tiff
• Java Classes
• Mbox, RFC822
• Autocad
• TrueType Parser
• HDF / NetCDF
What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is distributed, fault tolerant and scalable.
Has the Flexibility to Store and Mine Any Type of Data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema
Excels at Processing Complex Data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Scales Economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
Core Hadoop system components:
• HDFS (Hadoop Distributed File System) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework
What I often see
• ETL Cluster
• ELT in DWH
• ETL in Hadoop
Best Practices
Arup Nanda taught me to ask:
1. Why is it better than the rest?
2. What happens if it is not followed?
3. When are they not applicable?
Extract
Let me count the ways
1. From Databases: Sqoop
2. Log Data: Flume
3. Copy data to HDFS
Data Loading Mistake #1
Hadoop is scalable.
Let's run as many Sqoop mappers as possible, to get the data from our DB faster!
— Famous last words
Result:
Lesson:
• Start with 2 mappers, add slowly
• Watch DB load and network utilization
• Use FairScheduler to limit the number of concurrent mappers
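A rough sketch of the first two lessons as a Sqoop command (the connection string, credentials and paths here are placeholders, not taken from the talk):

# Start with two mappers; raise --num-mappers only while DB load and network utilization stay healthy
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username etl_user -P \
  --table ORDERS \
  --target-dir /data/sales/orders/raw \
  --num-mappers 2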
Data Loading Mistake #2
Database specific connectors are complicated and scary.
Let's just use the default JDBC connector.
— Famous last words
Result:
Lesson:
1. There are connectors for Oracle, Netezza and Teradata
2. Download them
3. Read the documentation
4. Ask questions if anything is unclear
5. Follow the installation instructions
6. Use Sqoop with the connectors
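As an illustration, once a connector is installed the Sqoop invocation barely changes. Assuming the free OraOop-based Oracle connector mentioned in the notes below (newer Sqoop releases expose it via --direct), and with all hosts and names as placeholders:

# --direct switches Sqoop from generic JDBC to the Oracle-specific code path
sqoop import \
  --direct \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username etl_user -P \
  --table ORDERS \
  --target-dir /data/sales/orders/raw \
  --num-mappers 4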
Data Loading Mistake #3
Just copying files? This sounds too simple.
We probably need some cool whizzbang tool.
— Famous last words
Result
Lessons:
• Copying files is a legitimate solution
• In general, simple is good
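For instance, landing a nightly batch of exported files really can be a two-line shell job (paths are illustrative):

# Create the target directory and copy the files as-is; no extra tooling needed
hdfs dfs -mkdir -p /data/sales/orders/date=20131101
hdfs dfs -put /incoming/orders/2013-11-01/*.csv /data/sales/orders/date=20131101/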
Transform
Endless Possibilities
• Map Reduce
• Crunch / Cascading
• Spark
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java
Data Processing Mistake #0
Data Processing Mistake #1
This system must be ready in 12 months. We have to convert 100 data sources and 5000 transformations to Hadoop.
Let's spend 2 days planning a schedule and budget for the entire year and then just go and implement it.
Prototype? Who needs that?
— Famous last words
Result
Lessons
• Take learning curve into account
• You don’t know what you don’t know
• Hadoop will be difficult and frustrating for at least 3 months
Data Processing Mistake #2
Hadoop is all about MapReduce.
So I'll use MapReduce for all my data processing needs.
— Famous last words
Result:
Lessons:
MapReduce is the assembly language of Hadoop:
Simple things are hard.
Hard things are possible.
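To make the contrast concrete: the classic word count takes on the order of a hundred lines of MapReduce Java, but only a few lines of Hive. A sketch, assuming a hypothetical table raw_text with a single string column named line:

# Top 10 words via a single Hive query
hive -e "
  SELECT word, count(*) AS cnt
  FROM (SELECT explode(split(line, ' ')) AS word FROM raw_text) words
  GROUP BY word
  ORDER BY cnt DESC
  LIMIT 10;"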
Data Processing Mistake #3
I got 5000 tiny XMLs, and Hadoop is great at processing unstructured data.
So I'll just leave the data like that and parse the XML in every job.
— Famous last words
Result
Lessons
1. Consolidate small files
2. Don’t argue about #1
3. Convert files to easy-to-query formats
4. De-normalize
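A minimal sketch of lessons 1 and 3 together, assuming the small XMLs have already been parsed into a hypothetical staging table orders_staging: rewrite them once into a compact, columnar table and query that from then on.

# One-time rewrite of many small files into a single Parquet-backed table
hive -e "
  CREATE TABLE orders_parquet STORED AS PARQUET AS
  SELECT * FROM orders_staging;"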
Data Processing Mistake #4
Partitions are for relational databases
— Famous last words
Result
Lessons
1. Without partitions every query is a full table scan
2. Yes, Hadoop scans fast.
3. But the fastest read is the one you don't perform
4. Cheap storage lets you store the same dataset, partitioned multiple ways
5. Use partitions for fast data loading
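A minimal sketch of a date-partitioned table (the column names are invented; the location reuses the layout convention shown later in the deck), so that queries filtering on order_date scan only the matching directories:

# Each order_date value becomes its own directory under the table location
hive -e "
  CREATE EXTERNAL TABLE orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DOUBLE)
  PARTITIONED BY (order_date STRING)
  STORED AS PARQUET
  LOCATION '/data/pharmacy/fraud/orders';"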
Load
Technologies
• Sqoop
• Fuse-DFS
• Oracle Connectors
• Just copy files
• Query Hadoop
Data Loading Mistake #1
All of the data must end up in a relational DWH.
— Famous last words
Result
Lessons:
• Use Relational:
• To maintain tool compatibility
• DWH enrichment
• Stay in Hadoop for:
• Text search
• Graph analysis
• Reduce time in pipeline
• Big data & small network
• Congested database
Data Loading Mistake #2
We used Sqoop to get data out of Oracle. Let's use Sqoop to get it back in.
— Famous last words
Result
Lesson
Use Oracle direct connectors if you can afford them.
They are:
1. Faster than any alternative
2. Use Hadoop to make Oracle more efficient
3. Make *you* more efficient
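For contrast, this is the generic Sqoop export path (all names are placeholders). A direct connector such as Oracle Loader for Hadoop replaces it, pre-sorting and pre-partitioning on the Hadoop side and appending into Oracle, as the speaker notes below describe:

# Default JDBC-based export; slower than the direct connectors when those are available
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username etl_user -P \
  --table ORDERS_SUMMARY \
  --export-dir /data/sales/orders_summary \
  --num-mappers 4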
Workflow Management
Tools
• Oozie
• Pentaho, Talend, ActiveBatch, AutoSys, Informatica, UC4, Cron
Workflow Mistake #1
Workflow management is easy. I'll just write a few scripts.
— Famous last words
— Josh Wills
Lesson:
A workflow management tool should enable:
• Keeping track of metadata, components and integrations
• Scheduling and Orchestration
• Restarts and retries
• Cohesive System View
• Instrumentation, Measurement and Monitoring
• Reporting
Workflow Mistake #2
Schema? This is Hadoop. Why would we need a schema?
— Famous last words
Result
Lesson
/user/…
/user/gshapira/testdata/orders
/data/<database>/<table>/<partition>
/data/<biz unit>/<app>/<dataset>/<partition>
/data/pharmacy/fraud/orders/date=20131101
/etl/<biz unit>/<app>/<dataset>/<stage>
/etl/pharmacy/fraud/orders/validated
Workflow Mistake #3
Oozie was written for Hadoop, so the right solution will always use Oozie.
— Famous last words
Result
Lessons:
• Oozie has advantages
• Use the tool that works for you
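If Oozie does turn out to be the tool that works for you, submission is typically scripted like this (the server URL and properties file are placeholders; job.properties points at a workflow.xml in HDFS):

# Submit and start the workflow defined in job.properties
oozie job -oozie http://oozie-server:11000/oozie -config job.properties -run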
Hue + Oozie
— Neil Gaiman
— Esther Dyson
Should DBAs learn Hadoop?
• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by DWH team
• Big Data == Business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects
• No, you don't need to learn Java
Beginner Projects
• Take a class
• Download a VM
• Install a 5-node Hadoop cluster in AWS
• Load data:
• Complete works of Shakespeare
• Movielens database
• Find the 10 most common words in Shakespeare
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• Actual use case: XML ingestion, ETL process, DWH history
Books
More Books
Speaker notes

  • The data landscape looks totally different now than it did 30 years ago, when the fundamental concepts and technologies of data management were developed. Since 1980, the birth of personal computing, the internet, mobile devices and sophisticated electronic machines has led to an explosion in data volume, variety and velocity. Simply put, the data you're managing today looks nothing like the data you were managing in 1980. In fact, structured data represents only somewhere between 10% and 20% of the total data volume in any given enterprise.
  • Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware.
    Two primary components, HDFS and MapReduce. Based on software originally developed at Google.
    An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of other solutions) of any type of data, and places no constraints on how that data is processed.
    Allows companies to begin storing data that was previously thrown away.

    Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate into existing data infrastructure.
  • ETL clusters are easy to use and manage, but are often inefficient. ELT is efficient, but spends expensive DWH cycles on low-value transformations, wastes storage on temp data and can cause missed SLAs. The next step is moving the ETL to Hadoop.
  • There are many Hadoop technologies for ETL. It's easy to Google and see what each one does and how to use it. The trick is to use the right technology at the right time, and avoid some common mistakes.
    So, I’m not going to waste time telling you how to use Hadoop. You’ll easily find out. I’ll tell you what to use it for, and how to avoid pitfalls.
  • Database specific connectors make data ingest about 1000 times faster.
  • Important note: Sqoop’s Connector to Oracle is NOT Oracle connector for Hadoop. It is called OraOop and is both free and open source
  • If you get a bunch of files to a directory every day or hour, just copying them to Hadoop is fine. If you can do something in 1 line in shell, do it. Don’t overcomplicate.

    You are just starting to learn Hadoop, so make life easier on yourself.
  • Give yourself a fighting chance
  • These days there is hardly ever a reason to use plain MapReduce; there are so many other good tools.
  • Like walking the Huashan plank trail: not very stable and not very fast.
  • Mottainai – Japanese term for wasting resources or doing something in an inefficient manner
  • Use Hadoop cores to pre-sort, pre-partition the data and turn it into Oracle data format. Then load it as “append” to Oracle. Uses no redo, very little CPU, lots of parallelism.