Data Wrangling and Oracle Connectors for Hadoop

  • Data, especially from outside sources, is rarely in perfect condition to be useful to your business. Not only does it need to be processed into useful formats, it also needs: filtering for potentially useful information (99% of everything is crap); statistical analysis (is this data significant?); integration with existing data; entity resolution (is “Oracle Corp” the same as “Oracle” and “Oracle Corporation”?); and de-duplication. Good processing and filtering can reduce the volume and variety of data, and it is important to distinguish between true and accidental variety. All of this requires massive processing power. In a way, there is a trade-off between storage space and CPU: if you don’t invest CPU in filtering, de-duplication and entity resolution, you’ll need more storage.
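To make the entity-resolution and de-duplication point concrete, here is a minimal Python sketch (mine, not from the slides): it canonicalizes company names so variants like “Oracle Corp” and “Oracle Corporation” collapse to one entity before de-duplication. The suffix list and sample names are invented for illustration; real entity resolution needs far more than this.

# Sketch: canonicalize company names so "Oracle Corp", "Oracle Corporation"
# and "Oracle" collapse to one entity before de-duplication.
import re

SUFFIXES = {"corp", "corporation", "inc", "ltd", "llc", "co"}

def canonical_name(raw):
    # lower-case, strip punctuation, drop common legal suffixes
    tokens = re.sub(r"[^a-z0-9 ]", " ", raw.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

names = ["Oracle Corp.", "Oracle Corporation", "Oracle", "oracle corp"]
print({canonical_name(n) for n in names})   # {'oracle'} -- one entity, not four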
  • Oracle uses an acquire-organize-analyze model. We are looking at the acquire and organize phases in some detail.
  • Internal data sources are typically more valuable. Hadoop lets you utilize data that doesn’t make financial sense to load into an RDBMS. In a large enough organization, internal data effectively becomes external: no control over quality, format, or changes.
  • Example: find out how far people live from the nearest doctor and pharmacy, using zipcodes and a zipcode-to-longitude/latitude mapping.
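A small sketch of how that distance question could be answered once each zipcode is mapped to a latitude/longitude centroid; the haversine formula gives the great-circle distance. The zipcodes and coordinates below are placeholders, not real centroid data.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two lat/long points, in km
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# placeholder zipcode -> centroid mapping (not real data)
zip_centroid = {"94304": (37.37, -122.14), "94040": (37.38, -122.09)}
doctor_zips = ["94040"]

home = zip_centroid["94304"]
nearest = min(haversine_km(*home, *zip_centroid[z]) for z in doctor_zips)
print("Nearest doctor: %.1f km" % nearest)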
  • ESRI data is probably the most common geographic data source; heavily used in oil & gas and defense.
  • Inconsistent: the data is correct but has small formatting issues (1999 vs. 99, M vs. male, etc.). Invalid: the format is correct, but something is wrong with the data (an update dated 2036, or 1976). Corrupt: the format is completely unparsable. You can fix inconsistencies, identify invalid data and throw out corrupt data.
  • Data Wrangling and Oracle Connectors for Hadoop

    1. 1. 1 Wrangling Data With Oracle Connectors for Hadoop Gwen Shapira, Solutions Architect gshapira@cloudera.com @gwenshap
    2. 2. Data Has Changed in the Last 30 Years: data growth driven by end-user applications, the Internet, mobile devices and sophisticated machines. Structured data – 10%, unstructured data – 90% (1980 to 2013)
    3. 3. Data is Messy
    4. 4. 5
    5. 5. Hadoop Is… • HDFS – Massive, redundant data storage • Map-Reduce – Batch oriented data processing at scale 6 [Core Hadoop system components: Hadoop Distributed File System (HDFS) – replicated, high-bandwidth clustered storage; MapReduce – distributed computing framework]
    6. 6. Hadoop and Databases 7 “Schema-on-Write”: schema must be created before any data can be loaded; an explicit load operation transforms data into the DB-internal structure; new columns must be added explicitly. Pros: 1) reads are fast 2) standards and governance. “Schema-on-Read”: data is simply copied to the file store, no transformation is needed; a serializer/deserializer is applied at read time to extract the required columns; new data can start flowing anytime and will appear retroactively. Pros: 1) loads are fast 2) flexibility and agility
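To make the schema-on-read column concrete, here is a tiny Python sketch (mine, not from the deck): the “load” just keeps the raw lines, and a SerDe-like function applies column names and types only when the data is read. The sample rows and schema are invented.

# Sketch: schema-on-read in miniature. The "load" is just storing raw lines;
# the schema (column names/types) is applied only when the data is read.
raw_lines = [
    "1001,2013-01-15,42.50",
    "1002,2013-01-16,17.00",
]

SCHEMA = [("order_id", int), ("order_date", str), ("amount", float)]

def read_with_schema(line):
    # acts like a SerDe: parse and type the columns at read time
    values = line.split(",")
    return {name: cast(v) for (name, cast), v in zip(SCHEMA, values)}

for row in map(read_with_schema, raw_lines):
    print(row["order_id"], row["amount"])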
    7. 7. Hadoop rocks Data Wrangling • Cheap storage for messy data • Tools to play with data: • Acquire • Clean • Transform • Flexibility where you need it most 8
    8. 8. Got unstructured data? • Data Warehouse: • Text • CSV • XLS • XML • Hadoop: • HTML • XML, RSS • JSON • Apache Logs • Avro, ProtoBuf, ORC, Parquet • Compression • Office, OpenDocument, iWork • PDF, EPUB, RTF • Midi, MP3 • JPEG, Tiff • Java Classes • Mbox, RFC822 • Autocad • TrueType Parser • HDF / NetCDF 9
    9. 9. 10
    10. 10. What Does Data Wrangling Look Like? Source Acquire Clean Transform Load 11
    11. 11. Data Sources • Internal • OLTP • Log files • Documents • Sensors / network events • External: • Geo-location • Demographics • Public data sets • Websites 12
    12. 12. Free External Data 13
        Name                   URL
        U.S. Census Bureau     http://factfinder2.census.gov/
        U.S. Executive Branch  http://www.data.gov/
        U.K. Government        http://data.gov.uk/
        E.U. Government        http://publicdata.eu/
        The World Bank         http://data.worldbank.org/
        Freebase               http://www.freebase.com/
        Wikidata               http://meta.wikimedia.org/wiki/Wikidata
        Amazon Web Services    http://aws.amazon.com/datasets
    13. 13. Data for Sale 14
        Source             Type               URL
        Gnip               Social Media       http://gnip.com/
        AC Nielsen         Media Usage        http://www.nielsen.com/
        Rapleaf            Demographic        http://www.rapleaf.com/
        ESRI               Geographic (GIS)   http://www.esri.com/
        eBay               Auction            https://developer.ebay.com/
        D&B                Business Entities  http://www.dnb.com/
        Trulia             Real Estate        http://www.trulia.com/
        Standard & Poor’s  Financial          http://standardandpoors.com/
    14. 14. Source Acquire Clean Transform Load 15
    15. 15. Getting Data into Hadoop • Sqoop • Flume • Copy • Write • Scraping • Data APIs 16
    16. 16. Sqoop Import Examples • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date > '01-01-2012'" • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16 (the --split-by column must be indexed or partitioned to avoid 16 full table scans)
    17. 17. Or… • hadoop fs -put myfile.txt /big/project/myfile.txt • curl -i list_of_urls.txt • curl https://api.twitter.com/1/users/show.json?screen_name=cloudera { "id":16134540, "name":"Cloudera", "screen_name":"cloudera", "location":"Palo Alto, CA", "url":"http://www.cloudera.com", "followers_count":11359 } 18
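Along the same lines, a stdlib-only Python sketch of pulling JSON from a data API and trimming it down before landing it in Hadoop. The endpoint URL here is a placeholder, not a real API.

# Sketch: pull JSON from a (placeholder) REST endpoint and keep only the
# fields we care about before writing the record out.
import json
import urllib.request

URL = "https://api.example.com/users/show.json?screen_name=cloudera"  # placeholder

with urllib.request.urlopen(URL) as resp:
    user = json.load(resp)

record = {k: user.get(k) for k in ("id", "name", "location", "followers_count")}
print(json.dumps(record))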
    18. 18. And even…
$ cat scraper.py
# Python 2 / BeautifulSoup 3, as on the original slide
import urllib
from BeautifulSoup import BeautifulSoup

txt = urllib.urlopen("http://www.example.com/")
soup = BeautifulSoup(txt)
headings = soup.findAll("h2")
for heading in headings:
    print heading.string
    19. 19. Source Acquire Clean Transform Load 20
    20. 20. Data Quality Issues • Given enough data – quality issues are inevitable • Main issues: • Inconsistent – “99” instead of “1999” • Invalid – last_update: 2036 • Corrupt - #$%&@*%@ 21
    21. 21. 22 Happy families are all alike. Each unhappy family is unhappy in its own way.
    22. 22. Endless Inconsistencies • Upper vs. lower case • Date formats • Times, time zones, 24h • Missing values • NULL vs. empty string vs. NA • Variation in free format input • 1 PATCH EVERY 24 HOURS • Replace patches on skin daily 23
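A hedged sketch of what normalizing a few of these inconsistencies can look like in Python; the value mappings (two-digit years, gender codes, NULL-ish strings) are illustrative assumptions, not a standard.

# Sketch: normalize a handful of common inconsistencies before analysis.
MISSING = {"", "null", "na", "n/a", "none"}
GENDER = {"m": "male", "male": "male", "f": "female", "female": "female"}

def clean_year(value):
    v = value.strip()
    return "19" + v if len(v) == 2 else v        # "99" -> "1999" (assumed 19xx)

def clean_gender(value):
    v = value.strip().lower()
    return None if v in MISSING else GENDER.get(v, v)

print(clean_year("99"), clean_gender("M"), clean_gender("N/A"))
# 1999 male None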
    23. 23. Hadoop Strategies • A validation script is ALWAYS the first step • But not always enough • We have known unknowns and unknown unknowns 24
    24. 24. Known Unknowns • Script to: • Check number of columns per row • Validate not-null • Validate data type (“is number”) • Date constraints • Other business logic 25
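A minimal sketch of such a validation script in Python, covering the checks listed above: column count, not-null, “is number”, and a date sanity constraint. The field positions, delimiter and date range are assumptions for illustration.

# Sketch of per-row checks: column count, not-null, "is number", date range.
from datetime import datetime

EXPECTED_COLS = 4

def validate(row):
    errors = []
    cols = row.split("\t")
    if len(cols) != EXPECTED_COLS:
        return ["wrong column count"]
    if not cols[0]:                                    # not-null check
        errors.append("id is null")
    if not cols[2].replace(".", "", 1).isdigit():      # "is number" check
        errors.append("amount is not numeric")
    try:
        d = datetime.strptime(cols[3], "%Y-%m-%d")     # date constraint
        if not (1970 <= d.year <= 2013):
            errors.append("date out of range")
    except ValueError:
        errors.append("unparsable date")
    return errors

print(validate("42\tNY\t19.99\t2036-01-01"))   # ['date out of range']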
    25. 25. Unknown Unknowns • Bad records will happen • Your job should move on • Use counters in Hadoop job to count bad records • Log errors • Write bad records to re-loadable file 26
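For example, a Hadoop Streaming mapper (a sketch, assuming tab-delimited input with a numeric third column) can skip bad records instead of failing, bump a job counter by writing the reporter:counter line to stderr, and log the offending rows for later reload.

#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper that moves on past bad records,
# counts them with a job counter, and logs them.
import sys

for line in sys.stdin:
    try:
        cols = line.rstrip("\n").split("\t")
        amount = float(cols[2])                    # throws on bad data
        print("%s\t%s" % (cols[0], amount))
    except (IndexError, ValueError):
        # bump a job-level counter; stderr ends up in the task logs
        # (a real job might also write bad rows to a re-loadable side file)
        sys.stderr.write("reporter:counter:DataQuality,BadRecords,1\n")
        sys.stderr.write("bad record: %r\n" % line)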
    26. 26. Solving Bad Data • Can be done at many levels: • Fix at source • Improve acquisition process • Pre-process before analysis • Fix during analysis • How many times will you analyze this data? • 0,1, many, lots 27
    27. 27. Source Acquire Clean Transform Load 28
    28. 28. Endless Possibilities • Map Reduce (in any language) • Hive (i.e. SQL) • Pig • R • Shell scripts • Plain old Java 29
    29. 29. De-Identification • Remove PII data • Names, addresses, possibly more • Remove columns • Remove IDs *after* joins • Hash • Use partial data • Create statistically similar fake data 30
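A small Python sketch of a few of these de-identification moves: drop the name, hash the identifier with a secret salt so joins still work, and keep only partial data (zip3, birth year). Field names and the salt handling are illustrative assumptions.

# Sketch: drop direct identifiers, hash the join key with a secret salt,
# and keep only partial data. Field names are assumed.
import hashlib

SALT = b"keep-this-secret"   # placeholder; manage real salts/keys properly

def deidentify(patient):
    return {
        "patient_key": hashlib.sha256(SALT + patient["ssn"].encode()).hexdigest(),
        "zip3": patient["zipcode"][:3],              # partial zip code
        "birth_year": patient["birth_date"][:4],     # year only
        "diagnosis": patient["diagnosis"],
    }

print(deidentify({"ssn": "123-45-6789", "name": "Jane Doe",
                  "zipcode": "94304", "birth_date": "1976-05-01",
                  "diagnosis": "J45"}))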
    30. 30. 31 87% of US population can be identified from gender, zip code and date of birth
    31. 31. Joins • Do at source if possible • Can be done with MapReduce • Or with Hive (Hadoop SQL) • Joins are expensive: • Do once and store results • De-aggregate aggressively • Everything a hospital knows about a patient 32
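For intuition, a local Python sketch of a hash join, the same idea a map-side (replicated) join uses: load the small table into memory and stream the large one past it. The table contents are made up.

# Sketch: hash join -- small table in memory, big table streamed past it.
small = {"94304": "Palo Alto", "94040": "Mountain View"}   # zip -> city

big_rows = [                      # pretend this streams from HDFS
    ("order-1", "94304", 42.50),
    ("order-2", "94040", 17.00),
]

joined = [(oid, amount, small.get(zipc, "unknown"))
          for oid, zipc, amount in big_rows]
print(joined)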
    32. 32. DataWrangler 33
    33. 33. Process Tips • Keep track of data lineage • Keep track of all changes to data • Use source control for code 34
    34. 34. Source Acquire Clean Transform Load 35
    35. 35. Sqoop sqoop export --connect jdbc:mysql://db.example.com/foo --table bar --export-dir /results/bar_data 36
    36. 36. FUSE-DFS • Mount HDFS on Oracle server: • sudo yum install hadoop-0.20-fuse • hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point> • Use external tables to load data into Oracle 37
    37. 37. 38 That’s nice. But can you load data FAST?
    38. 38. Oracle Connectors • SQL Connector for Hadoop • Oracle Loader for Hadoop • ODI with Hadoop • OBIEE with Hadoop • R connector for Hadoop You don’t need BDA 39
    39. 39. Oracle Loader for Hadoop • Kinda like SQL Loader • Data is on HDFS • Runs as Map-Reduce job • Partitions, sorts, converts format to Oracle Blocks • Appended to database tables • Or written to Data Pump files for later load 40
    40. 40. Oracle SQL Connector for HDFS • Data is in HDFS • Connector creates external table • That automatically matches Hadoop data • Control degree of parallelism • You know External Tables, right? 41
    41. 41. Data Types Supported • Data Pump • Delimited text • Avro • Regular expressions • Custom formats 43
    42. 42. 44 Main Benefit: Processing is done in Hadoop
    43. 43. Benefits • High performance • Reduce CPU usage on Database • Automatic optimizations: • Partitions • Sort • Load balance 45
    44. 44. Measuring Data Load 46 • Concerns: How much time? How much CPU? • Bottlenecks: Disk, CPU, Network
    45. 45. I Know What This Means: 47
    46. 46. What does this mean? 48
    47. 47. Measuring Data Load • Disks: ~300 MB/s each • SSD: ~1.6 GB/s each • Network: • ~100 MB/s (1GbE) • ~1 GB/s (10GbE) • ~4 GB/s (IB) • CPU: 1 CPU-second per second per core • Need to know: CPU-seconds per GB 49
    48. 48. Let’s walk through this… We have 5TB to load. Each core provides 3600 CPU-seconds per hour. 5000GB will take: With FUSE: 5000 * 150 = 750,000 cpu-sec / 3600 = 208 cpu-hours. With SQL Connector: 5000 * 40 = 200,000 cpu-sec / 3600 = 55 cpu-hours. Our X2-3 half rack has 84 cores, so roughly 40 minutes to load 5TB at 100% CPU, assuming you use Exadata (InfiniBand + SSD = 8TB/h load rate) and use all CPUs for loading. 50
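The same arithmetic in a few lines of Python, using the per-GB CPU costs and core count from the slide (150 cpu-sec/GB via FUSE, 40 via the SQL Connector, 84 cores):

# Redo the slide's arithmetic: cpu-seconds per GB -> cpu-hours -> wall clock.
data_gb = 5000
cores = 84                      # half rack, per the slide

def load_hours(cpu_sec_per_gb):
    cpu_hours = data_gb * cpu_sec_per_gb / 3600.0
    return cpu_hours, cpu_hours / cores     # total cpu-hours, wall-clock hours

print(load_hours(150))   # FUSE:          ~208 cpu-hours, ~2.5 h wall clock
print(load_hours(40))    # SQL Connector: ~56 cpu-hours,  ~0.66 h (about 40 min)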
    49. 49. 51 Given fast enough network and disks, data loading will take all available CPU. This is a good thing.
    50. 50. 52
