The data landscape looks totally different now than it did 30 years ago, when the fundamental concepts and technologies of data management were developed. Since 1980, the birth of personal computing, the internet, mobile devices and sophisticated electronic machines has led to an explosion in data volume, variety and velocity. Simply put, the data you’re managing today looks nothing like the data you were managing in 1980. In fact, structured data represents only somewhere between 10% and 20% of the total data volume in any given enterprise.
Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components: HDFS and MapReduce. Based on designs originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of the cost of other solutions) of any type of data, and places no constraints on how that data is processed. This allows companies to begin storing data that was previously thrown away.
Hadoop has value as a stand-alone analytics system, but many, if not most, organizations will be looking to integrate it into their existing data infrastructure.
ETL clusters are easy to use and manage, but are often inefficient. ELT is efficient, but spends expensive DWH cycles on low-value transformations, wastes storage on temp data and can cause missed SLAs. The next step is moving the ETL to Hadoop.
There are many Hadoop technologies for ETL. It’s easy to Google each one and see what it does and how to use it. The trick is to use the right technology at the right time, and to avoid some common mistakes. So, I’m not going to waste time telling you how to use Hadoop. You’ll easily find out. I’ll tell you what to use it for, and how to avoid pitfalls.
Database-specific connectors make data ingest about 1000 times faster.
Important note: Sqoop’s connector to Oracle is NOT the Oracle connector for Hadoop. It is called OraOop and is both free and open source.
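A minimal sketch of what "use Sqoop with connectors" can look like, written as a Python helper that only assembles the command line and does not run it. The host, service name, user, table and paths are made-up placeholders. `--direct` is the Sqoop 1 switch that asks for a database's direct (connector) mode where one is available; whether that actually engages OraOop depends on your Sqoop version and installation, so treat it as an assumption to check against the connector's documentation.

```python
import shlex


def sqoop_import_command(jdbc_url, username, password_file, table,
                         target_dir, mappers=4, direct=True):
    """Assemble (but do not run) a Sqoop 1 import command line."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,   # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,         # HDFS directory for the imported files
        "--num-mappers", str(mappers),      # number of parallel import tasks
    ]
    if direct:
        # Request the database-specific fast path instead of generic JDBC.
        args.append("--direct")
    return args


# All values below are hypothetical placeholders.
cmd = sqoop_import_command(
    "jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL",
    "etl_user",
    "/user/etl/.oracle_password",
    "SALES.ORDERS",
    "/data/raw/orders",
    mappers=8,
)
print(" ".join(shlex.quote(a) for a in cmd))
```

Building the argument list in one place makes it easy to compare a generic JDBC run (`direct=False`) against a connector run on the same table before committing to either.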
If a bunch of files arrives in a directory every day or hour, just copying them to Hadoop is fine. If you can do something in one line of shell, do it. Don’t overcomplicate.
You are just starting to learn Hadoop, so make life easier on yourself.
Give yourself a fighting chance
These days there is hardly ever a reason to use plain MapReduce. There are so many other good tools.
Like walking the Huanan trail: not very stable and not very fast
Mottainai – Japanese term for wasting resources or doing something in an inefficient manner
Use Hadoop cores to pre-sort and pre-partition the data and turn it into Oracle data format. Then load it into Oracle as an “append”. This uses no redo, very little CPU and lots of parallelism.
Transcript of "Scaling ETL with Hadoop - Avoiding Failure"
Coming soon to a bookstore near you…
• Hadoop Application
How to build end-to-end solutions
using Apache Hadoop and related
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
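The three steps above can be sketched in a few lines of Python. This is only a toy illustration: the CSV text, the `customers` table and the cleanup rules are invented for the example, and an in-memory SQLite database stands in for the "end target".

```python
import csv
import io
import sqlite3

# Hypothetical raw input, standing in for an outside source.
RAW = "id,name,amount\n1, alice ,10.50\n2,BOB,3\n"


def extract(text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: fix types and normalize names to fit the target schema."""
    return [
        (int(r["id"]), r["name"].strip().title(), float(r["amount"]))
        for r in rows
    ]


def load(records, conn):
    """Load: write the cleaned records into the end target."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
result = conn.execute("SELECT name, amount FROM customers ORDER BY id").fetchall()
print(result)
```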
• HDFS – Massive, redundant data storage
• MapReduce – Batch oriented data processing at scale
• Many many ways to process data in parallel at scale
• High level languages and abstractions
• File, relational and streaming data integration
• Process Orchestration and Scheduling
• Libraries for data wrangling
• Low latency query language
Data Has Changed in the Last 30 Years
DATA GROWTH
STRUCTURED DATA – 10%
UNSTRUCTURED DATA – 90%
Volume, Variety, Velocity Cause Problems
1. Slow data transformations. Missed SLAs.
2. Slow queries. Frustrated business and IT.
3. Must archive. Archived data can’t provide value.
What is Apache Hadoop?
Has the Flexibility to Store and Mine Any Type of Data
Ask questions across structured and unstructured data that were previously impossible to ask or solve
Not bound by a single schema
Processing Complex Data
Scale-out architecture divides workloads across multiple nodes
Flexible file system eliminates ETL
Can be deployed on commodity
Open source platform guards against
File System (HDFS)
Apache Hadoop is an open source platform for data storage and processing
CORE HADOOP SYSTEM COMPONENTS
What I often see
1. There are connectors to: Oracle, Netezza and Teradata
2. Download them
3. Read documentation
4. Ask questions if not clear
5. Follow installation instructions
6. Use Sqoop with connectors
Data Loading Mistake #3
Just copying files?
This sounds too simple. We
probably need some cool
— Famous last words
Data Processing Mistake #1
This system must be ready in 12 months.
We have to convert 100 data sources and 5000
transformations to Hadoop.
Let’s spend 2 days planning a schedule and budget
for the entire year and then just go and implement
Prototype? Who needs that?
— Famous last words
1. Without partitions, every query is a full table scan.
2. Yes, Hadoop scans fast.
3. But the fastest read is the one you don’t perform.
4. Cheap storage allows you to store the same dataset partitioned multiple ways.
5. Use partitions for fast data loading.
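The partitioning points above can be illustrated with a small local sketch. It mimics the `key=value` directory layout that Hive uses for partitioned tables, but it runs on the local file system rather than HDFS; the sales rows, column names and directory names are invented for the example. The same dataset is written twice, partitioned two different ways, and each "query" opens only the one directory it needs.

```python
import csv
import os
import tempfile
from collections import defaultdict


def write_partitioned(rows, root, key):
    """Write rows into root/<key>=<value>/part-00000.csv, one directory per value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for value, group in groups.items():
        part_dir = os.path.join(root, f"{key}={value}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-00000.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(group[0]))
            writer.writeheader()
            writer.writerows(group)


def read_partition(root, key, value):
    """Read only the directory for one key value; return (rows, files_opened)."""
    part_dir = os.path.join(root, f"{key}={value}")
    rows, files_opened = [], 0
    if not os.path.isdir(part_dir):
        return rows, files_opened
    for name in sorted(os.listdir(part_dir)):
        with open(os.path.join(part_dir, name), newline="") as f:
            rows.extend(csv.DictReader(f))
        files_opened += 1
    return rows, files_opened


# Hypothetical sample data.
rows = [
    {"dt": "2014-01-01", "store": "A", "amount": "10"},
    {"dt": "2014-01-01", "store": "B", "amount": "20"},
    {"dt": "2014-01-02", "store": "A", "amount": "30"},
]
root = tempfile.mkdtemp()
by_date = os.path.join(root, "sales_by_dt")
by_store = os.path.join(root, "sales_by_store")
write_partitioned(rows, by_date, "dt")       # same dataset, partitioned by day
write_partitioned(rows, by_store, "store")   # and again, partitioned by store

day_rows, opened = read_partition(by_date, "dt", "2014-01-02")
store_rows, _ = read_partition(by_store, "store", "A")
print(len(day_rows), opened, len(store_rows))
```

Loading a new day then amounts to adding one more `dt=...` directory, which is why partitions also help with fast data loading.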
• Use Relational:
• To maintain tool compatibility
• DWH enrichment
• Stay in Hadoop for:
• Text search
• Graph analysis
• Reduce time in pipeline
• Big data & small network
• Congested database
Data Loading Mistake #2
We used Sqoop to get data out of
Oracle. Let’s use Sqoop to get it back in.
— Famous last words
Workflow management tool should enable:
• Keeping track of metadata, components and
• Scheduling and Orchestration
• Restarts and retries
• Cohesive System View
• Instrumentation, Measurement and Monitoring
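To make a few of these requirements concrete (orchestration of ordered steps, retries, restarts from a failed step, and basic instrumentation), here is a toy runner in plain Python. It is not how any particular Hadoop workflow tool is implemented; the function name `run_workflow`, its parameters and the step names are all invented for the illustration.

```python
import time


def run_workflow(steps, max_retries=2, start_at=None):
    """Run named steps in order, retrying failures and recording basic metrics.

    steps: list of (name, callable) pairs.
    start_at: optional step name to resume from after an earlier failed run.
    Returns one report entry (dict) per executed step.
    """
    names = [name for name, _ in steps]
    first = names.index(start_at) if start_at else 0
    report = []
    for name, action in steps[first:]:
        attempts = 0
        started = time.time()
        while True:
            attempts += 1
            try:
                action()
                status = "ok"
                break
            except Exception as exc:
                if attempts > max_retries:
                    status = f"failed: {exc}"
                    break
        report.append({
            "step": name,
            "status": status,
            "attempts": attempts,
            "seconds": time.time() - started,
        })
        if status != "ok":
            break  # stop here so the run can be restarted from this step
    return report


# Hypothetical steps; "load" fails once to exercise the retry path.
calls = {"load": 0}


def extract():
    pass


def transform():
    pass


def load():
    calls["load"] += 1
    if calls["load"] == 1:
        raise RuntimeError("database busy")


steps = [("extract", extract), ("transform", transform), ("load", load)]
report = run_workflow(steps)
for entry in report:
    print(entry["step"], entry["status"], entry["attempts"])
```

The per-step report is the kind of record that gives a cohesive view of the system: which step ran, how many attempts it took and how long it ran.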
Workflow Mistake #2
Schema? This is Hadoop. Why would
we need a schema?
— Famous last words
Should DBAs learn Hadoop?
• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by DWH team
• Big Data == Business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects
• No, you don’t need to learn Java
• Take a class
• Download a VM
• Install a 5-node Hadoop cluster in AWS
• Load data:
• Complete works of Shakespeare
• MovieLens database
• Find the 10 most common words in Shakespeare
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• Actual use-case:
XML ingestion, ETL process, DWH history
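For the "10 most common words in Shakespeare" exercise above, it can help to first write the logic locally in map and reduce form before running it on a cluster. The sketch below is plain single-process Python, not a Hadoop job; the two sample lines stand in for the full text, and the tokenizing rule (lowercase letters and apostrophes) is an assumption you may want to change.

```python
import re
from collections import Counter
from functools import reduce


def map_words(line):
    """Map step: one line of text -> counts of the lowercase words in it."""
    return Counter(re.findall(r"[a-z']+", line.lower()))


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    left.update(right)
    return left


def top_words(lines, n=10):
    """Combine the per-line counts and return the n most common words."""
    totals = reduce(reduce_counts, map(map_words, lines), Counter())
    return totals.most_common(n)


# Two sample lines standing in for the complete works.
sample = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
]
top3 = top_words(sample, n=3)
print(top3)
```

To try it on real text, pass an open file object as `lines`, since iterating over a file yields one line at a time. On a cluster, the same map and reduce steps would be spread across nodes instead of running in one process.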