Data Wrangling and Oracle Connectors for Hadoop

Wrangling Data
With Oracle Connectors for Hadoop
Gwen Shapira, Solutions Architect
gshapira@cloudera.com
@gwenshap

Data Has Changed in the Last 30 Years
[Figure: data growth from 1980 to 2013, driven by end-user applications, the Internet, mobile devices, and sophisticated machines. Structured data: 10%. Unstructured data: 90%.]

Hadoop Is…
• HDFS – Massive, redundant data storage
• MapReduce – Batch-oriented data processing at scale
[Figure: core Hadoop system components. Hadoop Distributed File System (HDFS): replicated, high-bandwidth, clustered storage. MapReduce: distributed computing framework.]

Hadoop and Databases
“Schema-on-Write”
• Schema must be created before any data can be loaded
• An explicit load operation has to take place, which transforms the data into the database's internal structure
• New columns must be added explicitly
• Pros: 1) Reads are fast 2) Standards and governance

“Schema-on-Read”
• Data is simply copied to the file store; no transformation is needed
• A serializer/deserializer is applied at read time to extract the required columns
• New data can start flowing at any time and will appear retroactively
• Pros: 1) Loads are fast 2) Flexibility and agility

Hadoop rocks Data Wrangling
• Cheap storage for messy data
• Tools to play with data:
• Acquire
• Clean
• Transform
• Flexibility where you need it most

Got unstructured data?
• Data Warehouse:
• Text
• CSV
• XLS
• XML
• Hadoop:
• HTML
• XML, RSS
• JSON
• Apache Logs
• Avro, Protocol Buffers, ORC, Parquet
• Compression
• Office, OpenDocument, iWork
• PDF, EPUB, RTF
• MIDI, MP3
• JPEG, TIFF
• Java Classes
• Mbox, RFC822
• AutoCAD
• TrueType Parser
• HDF / NetCDF

What Does Data Wrangling Look Like?
Source → Acquire → Clean → Transform → Load

Data Sources
• Internal:
• OLTP
• Log files
• Documents
• Sensors / network events
• External:
• Geo-location
• Demographics
• Public data sets
• Websites

Free External Data
Name                     URL
U.S. Census Bureau       http://factfinder2.census.gov/
U.S. Executive Branch    http://www.data.gov/
U.K. Government          http://data.gov.uk/
E.U. Government          http://publicdata.eu/
The World Bank           http://data.worldbank.org/
Freebase                 http://www.freebase.com/
Wikidata                 http://meta.wikimedia.org/wiki/Wikidata
Amazon Web Services      http://aws.amazon.com/datasets

Data for Sale
Source               Type                 URL
Gnip                 Social Media         http://gnip.com/
AC Nielsen           Media Usage          http://www.nielsen.com/
Rapleaf              Demographic          http://www.rapleaf.com/
ESRI                 Geographic (GIS)     http://www.esri.com/
eBay                 Auction              https://developer.ebay.com/
D&B                  Business Entities    http://www.dnb.com/
Trulia               Real Estate          http://www.trulia.com/
Standard & Poor’s    Financial            http://standardandpoors.com/


Getting Data into Hadoop
• Sqoop
• Flume
• Copy
• Write
• Scraping
• Data APIs

Sqoop Import Examples
• sqoop import \
  --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
  --username hr --table emp \
  --where "start_date > '01-01-2012'"
• sqoop import \
  --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
  --username myuser \
  --table shops --split-by shop_id \
  --num-mappers 16
Note: the --split-by column must be indexed or partitioned to avoid 16 full table scans.
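
To see why the split column matters, here is a minimal Python sketch of a range-based split: the minimum and maximum of the split column are looked up, and each mapper gets one slice of that range. The function name split_ranges and the sample bounds (1 to 1000) are hypothetical; this illustrates the idea and is not Sqoop's actual code.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive integer range [lo, hi] into contiguous slices, one per mapper."""
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = lo
    for i in range(num_mappers):
        # The first `extra` slices take one additional value each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more mappers than distinct values
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each slice becomes its own range query against the source table.
for start, end in split_ranges(1, 1000, 16):
    print(f"SELECT * FROM shops WHERE shop_id >= {start} AND shop_id <= {end}")
```

Each mapper runs one of these range queries, so without an index or partitioning on shop_id every one of the 16 queries has to scan the whole table.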

Or…
• hadoop fs -put myfile.txt /big/project/myfile.txt
• xargs -n 1 curl -O < list_of_urls.txt
• curl "https://api.twitter.com/1/users/show.json?screen_name=cloudera"
{ "id":16134540,
"name":"Cloudera",
"screen_name":"cloudera",
"location":"Palo Alto, CA",
"url":"http://www.cloudera.com",
"followers_count":11359 }

And even…
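$ cat scraper.py
# Print the text of every <h2> heading on a page.
# Python 3 version; requires the beautifulsoup4 package.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.example.com/").read()
soup = BeautifulSoup(html, "html.parser")
headings = soup.find_all("h2")
for heading in headings:
    print(heading.string)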


Data Quality Issues
• Given enough data – quality issues are inevitable
• Main issues:
• Inconsistent – “99” instead of “1999”
• Invalid – last_update: 2036
• Corrupt - #$%&@*%@

Happy families are all alike.
Each unhappy family is unhappy
in its own way.

Endless Inconsistencies
• Upper vs. lower case
• Date formats
• Times, time zones, 24h
• Missing values
• NULL vs. empty string vs. NA
• Variation in free format input
• 1 PATCH EVERY 24 HOURS
• Replace patches on skin daily
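
A few of these inconsistencies can be normalized mechanically. The sketch below is one possible approach in Python, not a complete cleaner: the list of missing-value spellings and the accepted date formats are assumptions chosen for illustration, and free-format text like the patch instructions above still needs case-by-case handling.

```python
from datetime import datetime

# Assumed spellings of "no value"; extend to match the data at hand.
MISSING_TOKENS = {"", "null", "na", "n/a", "none"}

# Assumed input date formats, tried in order; the first match wins.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_value(raw):
    """Trim whitespace, lower-case, and map missing-value spellings to None."""
    if raw is None:
        return None
    cleaned = raw.strip().lower()
    if cleaned in MISSING_TOKENS:
        return None
    return cleaned


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Ambiguous dates such as 01/02/2012 are resolved by the order of DATE_FORMATS, so that order should reflect what the source system actually produces.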

Hadoop Strategies
• A validation script is ALWAYS the first step
• But not always enough
• We have known unknowns and unknown unknowns

Known Unknowns
• Script to:
• Check number of columns per row
• Validate not-null
• Validate data type (“is number”)
• Date constraints
• Other business logic
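
The checks listed above could be sketched in Python roughly as follows. The four-column layout (id, name, amount, last_update) and the YYYY-MM-DD date format are assumptions made for the example; real business rules would be added alongside them.

```python
from datetime import date, datetime

EXPECTED_COLUMNS = 4  # hypothetical layout: id, name, amount, last_update


def validate_row(fields, today=None):
    """Return a list of problems found in one parsed row (empty list means valid)."""
    today = today or date.today()
    # Column count comes first: the other checks assume the layout is right.
    if len(fields) != EXPECTED_COLUMNS:
        return [f"expected {EXPECTED_COLUMNS} columns, got {len(fields)}"]

    problems = []
    row_id, name, amount, last_update = fields

    # Not-null check.
    if not row_id.strip():
        problems.append("id is empty")

    # Data type check ("is number").
    try:
        float(amount)
    except ValueError:
        problems.append("amount is not a number")

    # Date constraint: must parse and must not lie in the future.
    try:
        updated = datetime.strptime(last_update, "%Y-%m-%d").date()
        if updated > today:
            problems.append("last_update is in the future")
    except ValueError:
        problems.append("last_update is not a YYYY-MM-DD date")

    return problems
```

Returning a list of problems rather than a single true/false makes it easier to log why a row was rejected.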

Unknown Unknowns
• Bad records will happen
• Your job should move on
• Use counters in Hadoop job to count bad records
• Log errors
• Write bad records to re-loadable file
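
One way to apply these points is a Hadoop Streaming mapper written in Python. Streaming tasks can update a job counter by writing a line of the form reporter:counter:<group>,<counter>,<amount> to stderr. The function name run_mapper, the counter names, the three-column comma-separated layout, and the rejects stream are all assumptions for this sketch.

```python
def run_mapper(lines, out, err, rejects, expected_columns=3, delimiter=","):
    """Pass good records through; count, log, and set aside bad ones.

    lines: iterable of raw input lines (sys.stdin in a real job)
    out, err: output streams (sys.stdout and sys.stderr in a real job)
    rejects: stream that collects bad records unchanged for a later reload
    """
    good = bad = 0
    for line in lines:
        record = line.rstrip("\n")
        try:
            fields = record.split(delimiter)
            if len(fields) != expected_columns:
                raise ValueError(
                    f"expected {expected_columns} columns, got {len(fields)}"
                )
            out.write("\t".join(fields) + "\n")
            good += 1
        except ValueError as exc:
            bad += 1
            # Hadoop Streaming picks this line up and increments the counter.
            err.write("reporter:counter:DataQuality,BadRecords,1\n")
            # Log the error, then keep going with the next record.
            err.write(f"bad record skipped: {exc}\n")
            rejects.write(record + "\n")
    return good, bad
```

In a real job the function would be called with sys.stdin, sys.stdout, sys.stderr and an open rejects file; where that file lives is left open here.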

Solving Bad Data
• Can be done at many levels:
• Fix at source
• Improve acquisition process
• Pre-process before analysis
• Fix during analysis
• How many times will you analyze this data?
• 0, 1, many, lots


Endless Possibilities
• MapReduce (in any language)
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java

De-Identification
• Remove PII data
• Names, addresses, possibly more
• Remove columns
• Remove IDs *after* joins
• Hash
• Use partial data
• Create statistically similar fake data
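
Three of these techniques (removing columns, hashing, and keeping partial data) can be sketched in a few lines of Python. The field names, the choice of HMAC-SHA256 as the keyed hash, and the three-character zip prefix are assumptions for illustration; which fields count as identifying depends on the data set.

```python
import hashlib
import hmac

DROP_FIELDS = {"name", "address"}   # assumed direct identifiers: removed
HASH_FIELDS = {"patient_id"}        # assumed join keys: replaced by a keyed hash
TRUNCATE_FIELDS = {"zip": 3}        # assumed partial data: keep first N characters


def pseudonymize(value, secret_key):
    """Return a keyed SHA-256 digest; the same input always maps to the same token."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record, secret_key):
    """Return a copy of the record with identifiers dropped, hashed, or truncated."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        if field in HASH_FIELDS:
            value = pseudonymize(value, secret_key)
        elif field in TRUNCATE_FIELDS:
            value = value[: TRUNCATE_FIELDS[field]]
        cleaned[field] = value
    return cleaned
```

Because the hash is deterministic for a given key, hashed IDs from different tables still join with each other, which is one way to keep joins possible after the raw IDs are gone.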

87% of US population
can be identified from
gender, zip code and date of birth

Joins
• Do at source if possible
• Can be done with MapReduce
• Or with Hive (SQL on Hadoop)
• Joins are expensive:
• Do once and store results
• De-aggregate aggressively
• Everything a hospital knows about a patient
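
For readers who have not seen a MapReduce join, the sketch below simulates a reduce-side join in plain Python: the map phase tags each record with its source, the shuffle groups records by join key, and the reduce phase pairs them up. The patients and visits data sets and all function names are hypothetical, and a real job would run these phases on the cluster rather than in one process.

```python
from collections import defaultdict


def map_phase(patients, visits):
    """Tag each record with its source and emit (join_key, tagged_value) pairs."""
    for patient_id, name in patients:
        yield patient_id, ("patient", name)
    for patient_id, diagnosis in visits:
        yield patient_id, ("visit", diagnosis)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, combine every patient record with every visit record."""
    joined = []
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "patient"]
        diagnoses = [v for tag, v in groups[key] if tag == "visit"]
        for name in names:
            for diagnosis in diagnoses:
                joined.append((key, name, diagnosis))
    return joined


def join(patients, visits):
    return reduce_phase(shuffle(map_phase(patients, visits)))
```

Every record from both inputs crosses the shuffle, which is a large part of why joins are expensive and why storing the joined result pays off.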

Process Tips
• Keep track of data lineage
• Keep track of all changes to data
• Use source control for code


Sqoop
sqoop export
--connect jdbc:mysql://db.example.com/foo
--table bar
--export-dir /results/bar_data

FUSE-DFS
• Mount HDFS on Oracle server:
• sudo yum install hadoop-0.20-fuse
• hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>
• Use external tables to load data into Oracle

That’s nice.
But can you load data FAST?

Oracle Connectors
• SQL Connector for Hadoop
• Oracle Loader for Hadoop
• ODI with Hadoop
• OBIEE with Hadoop
• R connector for Hadoop
You don’t need the Big Data Appliance (BDA)

Oracle Loader for Hadoop
• Kind of like SQL*Loader
• Data is on HDFS
• Runs as a MapReduce job
• Partitions, sorts, and converts the data to Oracle block format
• Output is appended to database tables
• Or written to Data Pump files for later load

Oracle SQL Connector for HDFS
• Data is in HDFS
• Connector creates external table
• That automatically matches Hadoop data
• Control degree of parallelism
• You know External Tables, right?

Data Types Supported
• Data Pump
• Delimited text
• Avro
• Regular expressions
• Custom formats

Main Benefit:
Processing is done in Hadoop

Benefits
• High performance
• Reduce CPU usage on Database
• Automatic optimizations:
• Partitions
• Sort
• Load balance

Measuring Data Load
• Concerns:
• How much time?
• How much CPU?
• Bottlenecks:
• Disk
• CPU
• Network

Measuring Data Load
• Disks: ~300 MB/s each
• SSD: ~1.6 GB/s each
• Network:
• ~100 MB/s (1 GbE)
• ~1 GB/s (10 GbE)
• ~4 GB/s (InfiniBand)
• CPU: 1 CPU-second per second per core
• Need to know: CPU-seconds per GB

Let's walk through this…
We have 5TB to load
Each core: 3600 CPU-seconds per hour
5000GB will take:
With FUSE: 5000 * 150 cpu-sec = 750,000 cpu-sec / 3600 = 208 cpu-hours
With SQL Connector: 5000 * 40 cpu-sec = 200,000 cpu-sec / 3600 = 55 cpu-hours
Our X2-3 half rack has 84 cores.
So, around 40 minutes to load 5TB at 100% CPU (55 cpu-hours / 84 cores).
Assuming you use Exadata (InfiniBand + SSD = 8TB/h load rate)
And use all CPUs for loading
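
The walk-through above can be reproduced with a few lines of Python. The inputs (5000 GB, 150 and 40 CPU-seconds per GB, 84 cores) are the figures from the slide; the function name load_estimate is hypothetical, and the result assumes that CPU is the only bottleneck and that every core is used for loading.

```python
def load_estimate(data_gb, cpu_sec_per_gb, cores):
    """Return (total CPU-hours, wall-clock minutes) for a CPU-bound load."""
    cpu_hours = data_gb * cpu_sec_per_gb / 3600
    wall_minutes = cpu_hours / cores * 60
    return cpu_hours, wall_minutes


for method, cost in (("FUSE", 150), ("SQL Connector", 40)):
    cpu_hours, minutes = load_estimate(5000, cost, 84)
    print(f"{method}: {cpu_hours:.0f} cpu-hours, about {minutes:.0f} minutes on 84 cores")
```

Under these assumptions the SQL Connector load works out to roughly 40 minutes of wall-clock time, and the FUSE load to roughly two and a half hours.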

Given fast enough network and disks,
data loading will take all available CPU
This is a good thing
