Scaling ETL with Hadoop - Avoiding Failure

  • The data landscape looks totally different now than it did 30 years ago, when the fundamental concepts and technologies of data management were developed. Since 1980, the birth of personal computing, the internet, mobile devices and sophisticated electronic machines has led to an explosion in data volume, variety and velocity. Simply put, the data you’re managing today looks nothing like the data you were managing in 1980. In fact, structured data represents only somewhere between 10% and 20% of the total data volume in any given enterprise.
  • Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components: HDFS and MapReduce. Based on designs originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of the cost of other solutions) of any type of data, and places no constraints on how that data is processed. This allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into their existing data infrastructure.
  • ETL clusters are easy to use and manage, but are often inefficient. ELT is efficient, but spends expensive DWH cycles on low-value transformations, wastes storage on temp data and can cause missed SLAs. The next step is moving the ETL to Hadoop.
  • There are many Hadoop technologies for ETL. It’s easy to google and see what each one does and how to use it. The trick is to use the right technology at the right time, and avoid some common mistakes. So, I’m not going to waste time telling you how to use Hadoop. You’ll easily find out. I’ll tell you what to use it for, and how to avoid pitfalls.
  • Database-specific connectors make data ingest about 1000 times faster.
  • Important note: Sqoop’s connector to Oracle is NOT the Oracle Connector for Hadoop. It is called OraOop and is both free and open source.
  • If you get a bunch of files in a directory every day or hour, just copying them to Hadoop is fine. If you can do something in 1 line in shell, do it. Don’t overcomplicate. You are just starting to learn Hadoop, so make life easier on yourself.
  • Give yourself a fighting chance
  • These days there is hardly ever a reason to use plain MapReduce. There are so many other good tools.
  • Like walking the Huanan trail: not very stable and not very fast.
  • Mottainai – Japanese term for wasting resources or doing something in an inefficient manner
  • Use Hadoop cores to pre-sort, pre-partition the data and turn it into Oracle data format. Then load it as “append” to Oracle. Uses no redo, very little CPU, lots of parallelism.

Presentation Transcript

  • Scaling ETL with Hadoop. Gwen Shapira, @gwenshap
  • Coming soon to a bookstore near you… Hadoop Application Architectures: How to build end-to-end solutions using Apache Hadoop and related tools. @hadooparchbook
  • ETL is…
      • Extracting data from outside sources
      • Transforming it to fit operational needs
      • Loading it into the end target
      • (Wikipedia: Extract, transform, load)
  • Hadoop is…
      • HDFS – Massive, redundant data storage
      • MapReduce – Batch-oriented data processing at scale
      • Many, many ways to process data in parallel at scale
  • The Ecosystem
      • High-level languages and abstractions
      • File, relational and streaming data integration
      • Process orchestration and scheduling
      • Libraries for data wrangling
      • Low-latency query language
  • Why ETL with Hadoop?
  • Volume, Variety, Velocity Cause Problems
      • [Diagram labels: OLTP, Enterprise Applications, Extract, Transform, Load, Data Warehouse, Transform, Query, Business Intelligence]
      1. Slow data transformations. Missed SLAs.
      2. Slow queries. Frustrated business and IT.
      3. Must archive. Archived data can’t provide value.
  • Got unstructured data?
      • Traditional ETL: Text, CSV, XLS, XML
      • Hadoop: HTML; XML, RSS; JSON; Apache logs; Avro, Protocol Buffers, ORC, Parquet; compression; Office, OpenDocument, iWork; PDF, EPUB, RTF; MIDI, MP3; JPEG, TIFF; Java classes; Mbox, RFC 822; AutoCAD; TrueType parser; HDF / NetCDF
  • What is Apache Hadoop? Apache Hadoop is an open source platform for data storage and processing that is distributed, fault tolerant and scalable.
      • Core Hadoop system components:
          • Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
          • MapReduce: distributed computing framework
      • Has the flexibility to store and mine any type of data
          • Ask questions across structured and unstructured data that were previously impossible to ask or solve
          • Not bound by a single schema
      • Excels at processing complex data
          • Scale-out architecture divides workloads across multiple nodes
          • Flexible file system eliminates ETL bottlenecks
      • Scales economically
          • Can be deployed on commodity hardware
          • Open source platform guards against vendor lock-in
  • What I often see: ETL cluster, ELT in DWH, ETL in Hadoop
  • Best Practices. Arup Nanda taught me to ask:
      1. Why is it better than the rest?
      2. What happens if it is not followed?
      3. When is it not applicable?
  • Extract
  • Let me count the ways
      1. From databases: Sqoop
      2. Log data: Flume
      3. Copy data to HDFS
  • Data Loading Mistake #1: “Hadoop is scalable. Let’s run as many Sqoop mappers as possible, to get the data from our DB faster!” - Famous last words
  • Result: [image]
  • Lesson:
      • Start with 2 mappers, add slowly
      • Watch DB load and network utilization
      • Use FairScheduler to limit the number of mappers
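
The "start with 2 mappers, add slowly" lesson can be sketched in Python as follows. This is only an illustration under stated assumptions: the JDBC URL, table name, target directory, the `db_load` measurement (a 0 to 1 utilization figure you would have to collect yourself) and the 0.7 threshold are all hypothetical and not from the talk. The Sqoop options used (`--connect`, `--table`, `--target-dir`, `--num-mappers`) are standard `sqoop import` arguments.

```python
import shlex


def sqoop_import_command(jdbc_url, table, target_dir, num_mappers=2):
    """Build a `sqoop import` argument list with an explicit, small mapper count."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]


def next_mapper_count(current, db_load, max_load=0.7, ceiling=16):
    """Add one mapper while the observed DB load stays below max_load; back off otherwise.

    db_load, max_load and ceiling are hypothetical knobs for this sketch.
    """
    if db_load >= max_load:
        return max(1, current - 1)
    return min(ceiling, current + 1)


# Hypothetical connection details, for illustration only.
cmd = sqoop_import_command(
    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "ORDERS", "/data/sales/orders"
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The point of the second function is that the mapper count changes by one step per run, driven by what the database can actually absorb, instead of starting at the maximum the cluster allows.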
  • Data Loading Mistake #2: “Database-specific connectors are complicated and scary. Let’s just use the default JDBC connector.” - Famous last words
  • Result: [image]
  • Lesson:
      1. There are connectors for Oracle, Netezza and Teradata
      2. Download them
      3. Read the documentation
      4. Ask questions if anything is not clear
      5. Follow the installation instructions
      6. Use Sqoop with connectors
  • Data Loading Mistake #3: “Just copying files? This sounds too simple. We probably need some cool whizzbang tool.” - Famous last words
  • Result: [image]
  • Lessons:
      • Copying files is a legitimate solution
      • In general, simple is good
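
To show how little is needed for the "just copy files" approach, here is a minimal Python sketch that produces one `hdfs dfs -put` command per file in a landing directory. The directory names in the usage comment are hypothetical. If a shell one-liner such as `hdfs dfs -put /incoming/logs/* /data/raw/logs` already does the job, that is simpler still; a wrapper like this is only worth having when the copy is triggered from an existing Python job.

```python
from pathlib import Path


def hdfs_put_commands(local_dir, hdfs_dir):
    """Return one `hdfs dfs -put` argument list per regular file in local_dir.

    Subdirectories are skipped; files are processed in sorted order so
    repeated runs produce the same command sequence.
    """
    files = sorted(p for p in Path(local_dir).iterdir() if p.is_file())
    return [["hdfs", "dfs", "-put", str(p), hdfs_dir] for p in files]


# Usage (hypothetical paths), on a machine with the Hadoop client installed:
#   import subprocess
#   for cmd in hdfs_put_commands("/incoming/logs", "/data/raw/logs"):
#       subprocess.run(cmd, check=True)
```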
  • Transform
  • Endless Possibilities
      • MapReduce
      • Crunch / Cascading
      • Spark
      • Hive (i.e. SQL)
      • Pig
      • R
      • Shell scripts
      • Plain old Java
  • Data Processing Mistake #0: [image]
  • Data Processing Mistake #1: “This system must be ready in 12 months. We have to convert 100 data sources and 5000 transformations to Hadoop. Let’s spend 2 days planning a schedule and budget for the entire year and then just go and implement it. Prototype? Who needs that?” - Famous last words
  • Result: [image]
  • Lessons:
      • Take the learning curve into account
      • You don’t know what you don’t know
      • Hadoop will be difficult and frustrating for at least 3 months
  • Data Processing Mistake #2: “Hadoop is all about MapReduce. So I’ll use MapReduce for all my data processing needs.” - Famous last words
  • Result: [image]
  • Lessons: MapReduce is the assembly language of Hadoop: simple things are hard, hard things are possible.
  • Data Processing Mistake #3: “I got 5000 tiny XMLs, and Hadoop is great at processing unstructured data. So I’ll just leave the data like that and parse the XML in every job.” - Famous last words
  • Result: [image]
  • Lessons:
      1. Consolidate small files
      2. Don’t argue about #1
      3. Convert files to easy-to-query formats
      4. De-normalize
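
One way to picture "consolidate small files" and "convert to easy-to-query formats" is the Python sketch below, which parses each small XML file exactly once and writes all records into a single JSON Lines file. It is a sketch under assumptions, not the speaker's method: it assumes each file holds one flat record whose fields are the direct children of the root element, and it uses JSON Lines only as a dependency-free stand-in for formats such as Avro or Parquet.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_to_record(path):
    """Flatten one small XML file into a dict: one key per child element of the root."""
    root = ET.parse(str(path)).getroot()
    record = {child.tag: (child.text or "").strip() for child in root}
    record["source_file"] = Path(path).name  # keep lineage back to the original file
    return record


def consolidate(xml_dir, out_path):
    """Parse every *.xml file in xml_dir once and write one JSON record per line.

    Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(xml_dir).glob("*.xml")):
            out.write(json.dumps(xml_to_record(path), sort_keys=True) + "\n")
            count += 1
    return count
```

After this step, downstream jobs read one large file of already-parsed records instead of re-parsing thousands of tiny XML documents on every run.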
  • Data Processing Mistake #4: “Partitions are for relational databases.” - Famous last words
  • Result: [image]
  • Lessons:
      1. Without partitions, every query is a full table scan
      2. Yes, Hadoop scans fast
      3. But the fastest read is the one you don’t perform
      4. Cheap storage allows you to store the same dataset partitioned multiple ways
      5. Use partitions for fast data loading
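
The idea that "the fastest read is the one you don’t perform" can be illustrated with a small Python sketch of partition pruning over `date=YYYYMMDD` directories. The helper names are hypothetical, and real engines such as Hive do this pruning for you when a table is partitioned; the sketch only shows why the directory layout lets a query skip most of the data.

```python
def partition_dirs(base, dates):
    """Build `date=YYYYMMDD` partition directory paths under base."""
    return ["%s/date=%s" % (base, d) for d in dates]


def prune(partitions, start, end):
    """Keep only partition directories whose date falls in [start, end].

    YYYYMMDD strings sort in chronological order, so plain string
    comparison is enough here.
    """
    selected = []
    for path in partitions:
        key, _, value = path.rsplit("/", 1)[-1].partition("=")
        if key == "date" and start <= value <= end:
            selected.append(path)
    return selected


all_parts = partition_dirs(
    "/data/pharmacy/fraud/orders",
    ["20131030", "20131031", "20131101", "20131102"],
)
# Only two of the four directories need to be read for a November query.
november = prune(all_parts, "20131101", "20131130")
```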
  • Load
  • Technologies
      • Sqoop
      • Fuse-DFS
      • Oracle connectors
      • Just copy files
      • Query Hadoop
  • Data Loading Mistake #1: “All of the data must end up in a relational DWH.” - Famous last words
  • Result: [image]
  • Lessons:
      • Use relational:
          • To maintain tool compatibility
          • DWH enrichment
      • Stay in Hadoop for:
          • Text search
          • Graph analysis
          • Reduce time in pipeline
          • Big data & small network
          • Congested database
  • Data Loading Mistake #2: “We used Sqoop to get data out of Oracle. Let’s use Sqoop to get it back in.” - Famous last words
  • Result: [image]
  • Lesson: Use Oracle direct connectors if you can afford them. They:
      1. Are faster than any alternative
      2. Use Hadoop to make Oracle more efficient
      3. Make *you* more efficient
  • Workflow Management
  • Tools
      • Oozie
      • Pentaho, Talend, ActiveBatch, AutoSys, Informatica, UC4, Cron
  • Workflow Mistake #1: “Workflow management is easy. I’ll just write a few scripts.” - Famous last words
  • [quote image] - Josh Wills
  • Lesson: A workflow management tool should enable:
      • Keeping track of metadata, components and integrations
      • Scheduling and orchestration
      • Restarts and retries
      • Cohesive system view
      • Instrumentation, measurement and monitoring
      • Reporting
  • Workflow Mistake #2: “Schema? This is Hadoop. Why would we need a schema?” - Famous last words
  • Result: [image]
  • Lesson:
      • /user/…
          • /user/gshapira/testdata/orders
      • /data/<database>/<table>/<partition>
      • /data/<biz unit>/<app>/<dataset>/<partition>
          • /data/pharmacy/fraud/orders/date=20131101
      • /etl/<biz unit>/<app>/<dataset>/<stage>
          • /etl/pharmacy/fraud/orders/validated
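
The directory convention on this slide can be enforced with two small path builders, sketched below in Python. The function names and the validation rule (no empty components, no embedded slashes) are assumptions added for illustration; the layouts `/data/<biz unit>/<app>/<dataset>/<partition>` and `/etl/<biz unit>/<app>/<dataset>/<stage>` are the ones shown on the slide.

```python
def _join(root, *parts):
    """Join path components, rejecting empty parts and embedded slashes."""
    for part in parts:
        if not part or "/" in part:
            raise ValueError("invalid path component: %r" % (part,))
    return "/".join((root,) + parts)


def data_path(biz_unit, app, dataset, partition):
    """Location of finished, queryable data: /data/<biz unit>/<app>/<dataset>/<partition>."""
    return _join("/data", biz_unit, app, dataset, partition)


def etl_path(biz_unit, app, dataset, stage):
    """Location of in-flight ETL data: /etl/<biz unit>/<app>/<dataset>/<stage>."""
    return _join("/etl", biz_unit, app, dataset, stage)


print(data_path("pharmacy", "fraud", "orders", "date=20131101"))
print(etl_path("pharmacy", "fraud", "orders", "validated"))
```

Routing every job through helpers like these is one way to keep the layout consistent, so that any dataset's location can be derived from its business unit, application and name.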
  • Workflow Mistake #3: “Oozie was written for Hadoop, so the right solution will always use Oozie.” - Famous last words
  • Result: [image]
  • Lessons:
      • Oozie has advantages
      • Use the tool that works for you
  • Hue + Oozie
  • [quote image] - Neil Gaiman
  • [quote image] - Esther Dyson
  • Should DBAs learn Hadoop?
      • Hadoop projects are more visible
      • 48% of Hadoop clusters are owned by the DWH team
      • Big Data == Business pays attention to data
      • New skills, from coding to cluster administration
      • Interesting projects
      • No, you don’t need to learn Java
  • Beginner Projects
      • Take a class
      • Download a VM
      • Install a 5-node Hadoop cluster in AWS
      • Load data:
          • Complete works of Shakespeare
          • MovieLens database
      • Find the 10 most common words in Shakespeare
      • Find the 10 most recommended movies
      • Run TPC-H
      • Cloudera Data Science Challenge
      • Actual use case: XML ingestion, ETL process, DWH history
  • Books
  • More Books