
Data Wrangling and Oracle Connectors for Hadoop


  1. Wrangling Data With Oracle Connectors for Hadoop. Gwen Shapira, Solutions Architect, gshapira@cloudera.com, @gwenshap
  2. Data Has Changed in the Last 30 Years: data growth from 1980 to 2013, driven by end-user applications, the Internet, mobile devices, and sophisticated machines. Structured data is now roughly 10%; unstructured data is roughly 90%.
  3. Data is Messy
  5. Hadoop Is… • HDFS – massive, redundant data storage • MapReduce – batch-oriented data processing at scale. Core Hadoop system components: the Hadoop Distributed File System (HDFS), replicated high-bandwidth clustered storage, and MapReduce, a distributed computing framework.
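To make the MapReduce model concrete, here is a minimal word-count sketch in Python for Hadoop Streaming (a sketch only; the file names are illustrative and not from the slides):

    # mapper.py - reads raw text on stdin, emits one (word, 1) pair per word
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    # reducer.py - input arrives sorted by key, so counts per word can be summed in one pass
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

The two scripts would be passed to the Hadoop Streaming jar as the -mapper and -reducer steps, with HDFS input and output paths.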
  6. Hadoop and Databases. “Schema-on-Write” (databases): schema must be created before any data can be loaded; an explicit load operation transforms the data into the database's internal structure; new columns must be added explicitly. Pros: 1) reads are fast, 2) standards and governance. “Schema-on-Read” (Hadoop): data is simply copied to the file store, no transformation is needed; a serializer/deserializer is applied at read time to extract the required columns; new data can start flowing anytime and will appear retroactively. Pros: 1) loads are fast, 2) flexibility and agility.
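A minimal sketch of the schema-on-read idea in Python (the file name, delimiter, and column layout are assumptions for illustration): the raw file is stored untouched, and column names and types are applied only when the data is read.

    import csv

    # Hypothetical schema, declared at read time rather than at load time.
    SCHEMA = [("shop_id", int), ("name", str), ("city", str)]

    def read_with_schema(path):
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                # Apply names and types to each raw record as it is read.
                yield {name: cast(value) for (name, cast), value in zip(SCHEMA, row)}

    # Usage: for record in read_with_schema("shops.tsv"): ...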
  7. Hadoop Rocks Data Wrangling • Cheap storage for messy data • Tools to play with data: acquire, clean, transform • Flexibility where you need it most
  8. Got unstructured data? • Data Warehouse: text, CSV, XLS, XML • Hadoop: HTML; XML, RSS; JSON; Apache logs; Avro, Protobuf, ORC, Parquet; compressed data; Office, OpenDocument, iWork; PDF, EPUB, RTF; MIDI, MP3; JPEG, TIFF; Java classes; Mbox, RFC 822; AutoCAD; TrueType; HDF / NetCDF
  10. What Does Data Wrangling Look Like? Source → Acquire → Clean → Transform → Load
  11. Data Sources • Internal: OLTP, log files, documents, sensors / network events • External: geo-location, demographics, public data sets, websites
  12. Free External Data: U.S. Census Bureau (http://factfinder2.census.gov/), U.S. Executive Branch (http://www.data.gov/), U.K. Government (http://data.gov.uk/), E.U. Government (http://publicdata.eu/), The World Bank (http://data.worldbank.org/), Freebase (http://www.freebase.com/), Wikidata (http://meta.wikimedia.org/wiki/Wikidata), Amazon Web Services (http://aws.amazon.com/datasets)
  13. Data for Sale: Gnip, social media (http://gnip.com/); AC Nielsen, media usage (http://www.nielsen.com/); Rapleaf, demographics (http://www.rapleaf.com/); ESRI, geographic/GIS (http://www.esri.com/); eBay, auctions (https://developer.ebay.com/); D&B, business entities (http://www.dnb.com/); Trulia, real estate (http://www.trulia.com/); Standard & Poor’s, financial (http://standardandpoors.com/)
  14. Source → Acquire → Clean → Transform → Load
  15. Getting Data into Hadoop • Sqoop • Flume • Copy • Write • Scraping • Data APIs
  16. Sqoop Import Examples • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date > '01-01-2012'" • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16 (the split-by column must be indexed, or the table partitioned, to avoid 16 full table scans)
  17. Or… • hadoop fs -put myfile.txt /big/project/myfile.txt • curl -i list_of_urls.txt • curl https://api.twitter.com/1/users/show.json?screen_name=cloudera { "id":16134540, "name":"Cloudera", "screen_name":"cloudera", "location":"Palo Alto, CA", "url":"http://www.cloudera.com", "followers_count":11359 }
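The same kind of API pull can be scripted; a minimal sketch in Python (the endpoint URL and field names are illustrative, and most real APIs will also require authentication):

    import json
    import urllib.request

    # Hypothetical JSON endpoint; substitute the API you actually need.
    url = "https://api.example.com/users/show.json?screen_name=cloudera"

    with urllib.request.urlopen(url) as response:
        profile = json.load(response)

    # Keep only the fields of interest before writing the record out.
    print(profile.get("screen_name"), profile.get("followers_count"))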
  18. And even… $ cat scraper.py
      import urllib
      from BeautifulSoup import BeautifulSoup

      # Fetch the page and parse it (Python 2 / BeautifulSoup 3 syntax).
      txt = urllib.urlopen("http://www.example.com/")
      soup = BeautifulSoup(txt)

      # Print the text of every <h2> heading on the page.
      headings = soup.findAll("h2")
      for heading in headings:
          print heading.string
  19. Source → Acquire → Clean → Transform → Load
  20. Data Quality Issues • Given enough data, quality issues are inevitable • Main issues: • Inconsistent – “99” instead of “1999” • Invalid – last_update: 2036 • Corrupt – #$%&@*%@
  21. “Happy families are all alike. Each unhappy family is unhappy in its own way.”
  22. Endless Inconsistencies • Upper vs. lower case • Date formats • Times, time zones, 24h clocks • Missing values: NULL vs. empty string vs. NA • Variation in free-format input: “1 PATCH EVERY 24 HOURS” vs. “Replace patches on skin daily”
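A minimal sketch of normalizing these inconsistencies in Python (the missing-value markers and date formats are assumptions for illustration):

    from datetime import datetime

    MISSING = {"", "na", "n/a", "null", "none"}
    DATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y")

    def normalize_value(value):
        # Trim whitespace, map the various "missing" spellings to None,
        # and lowercase everything else for case consistency.
        value = value.strip()
        if value.lower() in MISSING:
            return None
        return value.lower()

    def normalize_date(value):
        # Try each known date format; return None if none of them match.
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value, fmt).date()
            except ValueError:
                pass
        return None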
  23. Hadoop Strategies • A validation script is ALWAYS the first step • But not always enough • We have known unknowns and unknown unknowns
  24. Known Unknowns • Script to: • Check the number of columns per row • Validate not-null • Validate data types (“is number”) • Check date constraints • Apply other business logic
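A minimal validation-script sketch along these lines, in Python (the record layout, expected column count, and date bounds are assumptions for illustration):

    import sys
    from datetime import datetime

    EXPECTED_COLUMNS = 4          # assumed layout: id, name, amount, start_date
    MIN_DATE = datetime(1980, 1, 1)
    MAX_DATE = datetime(2013, 12, 31)

    def validate(fields):
        """Return the list of problems found in one record (empty list = clean)."""
        problems = []
        if len(fields) != EXPECTED_COLUMNS:
            return ["wrong column count"]
        rec_id, name, amount, start_date = fields
        if not rec_id or not name:
            problems.append("not-null violation")
        try:
            float(amount)                              # "is number" check
        except ValueError:
            problems.append("amount is not numeric")
        try:
            d = datetime.strptime(start_date, "%Y-%m-%d")
            if not (MIN_DATE <= d <= MAX_DATE):
                problems.append("date out of range")
        except ValueError:
            problems.append("bad date format")
        return problems

    for line in sys.stdin:
        issues = validate(line.rstrip("\n").split("\t"))
        if issues:
            sys.stderr.write("bad record: %s (%s)\n" % (line.rstrip(), ", ".join(issues)))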
  25. Unknown Unknowns • Bad records will happen • Your job should move on • Use counters in the Hadoop job to count bad records • Log errors • Write bad records to a re-loadable file
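In a Hadoop Streaming job written in Python, counters are incremented by writing specially formatted lines to stderr. A minimal sketch (the counter group name, the validation rule, and the bad-records side file are illustrative assumptions):

    import sys

    def count_bad_record():
        # Hadoop Streaming picks up counter updates from stderr lines of this form.
        sys.stderr.write("reporter:counter:DataQuality,BadRecords,1\n")

    with open("bad_records.txt", "a") as bad_out:
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4:                # whatever validation applies
                count_bad_record()
                bad_out.write(line)             # keep the record for a later re-load
                continue                        # move on instead of failing the job
            print(line.rstrip("\n"))            # pass good records through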
  26. Solving Bad Data • Can be done at many levels: • Fix at the source • Improve the acquisition process • Pre-process before analysis • Fix during analysis • How many times will you analyze this data? 0, 1, many, lots
  27. Source → Acquire → Clean → Transform → Load
  28. Endless Possibilities • MapReduce (in any language) • Hive (i.e., SQL) • Pig • R • Shell scripts • Plain old Java
  29. De-Identification • Remove PII data: names, addresses, possibly more • Remove columns • Remove IDs *after* joins • Hash • Use partial data • Create statistically similar fake data
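A minimal sketch of hashing an identifier column in Python (the salt handling and field names are assumptions; a real de-identification scheme needs more care, e.g. keyed hashes and key management):

    import hashlib

    SALT = b"replace-with-a-secret-salt"   # assumption: a per-dataset secret salt

    def pseudonymize(value):
        # One-way hash: the same patient ID always maps to the same token,
        # but the original ID cannot be read back out of the data set.
        return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

    record = {"patient_id": "12345", "diagnosis": "flu"}
    record["patient_id"] = pseudonymize(record["patient_id"])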
  30. 87% of the US population can be identified from gender, zip code, and date of birth
  31. Joins • Do them at the source if possible • Can be done with MapReduce • Or with Hive (Hadoop SQL) • Joins are expensive: • Do them once and store the results • De-aggregate aggressively: everything a hospital knows about a patient
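For the MapReduce route, one common pattern is a map-side (replicated) join, where the smaller table is held in memory and the large table is streamed past it. A minimal sketch in plain Python (file names, delimiters, and column positions are assumptions for illustration):

    import csv

    # Load the small dimension table (e.g. patient demographics) into memory.
    patients = {}
    with open("patients.tsv", newline="") as f:
        for patient_id, name, zip_code in csv.reader(f, delimiter="\t"):
            patients[patient_id] = (name, zip_code)

    # Stream the large fact table (e.g. hospital visits), join on the fly, and
    # write the denormalized result once so the join never has to be repeated.
    with open("visits.tsv", newline="") as f, open("visits_joined.tsv", "w") as out:
        for patient_id, visit_date, diagnosis in csv.reader(f, delimiter="\t"):
            name, zip_code = patients.get(patient_id, ("unknown", "unknown"))
            out.write("\t".join([patient_id, name, zip_code, visit_date, diagnosis]) + "\n")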
  32. DataWrangler
  33. Process Tips • Keep track of data lineage • Keep track of all changes to the data • Use source control for code
  34. Source → Acquire → Clean → Transform → Load
  35. Sqoop export: sqoop export --connect jdbc:mysql://db.example.com/foo --table bar --export-dir /results/bar_data
  36. FUSE-DFS • Mount HDFS on the Oracle server: • sudo yum install hadoop-0.20-fuse • hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point> • Use external tables to load the data into Oracle
  37. That’s nice. But can you load data FAST?
  38. Oracle Connectors • SQL Connector for Hadoop • Oracle Loader for Hadoop • ODI with Hadoop • OBIEE with Hadoop • R Connector for Hadoop. You don’t need BDA (Big Data Appliance).
  39. Oracle Loader for Hadoop • Kinda like SQL*Loader • Data is on HDFS • Runs as a MapReduce job • Partitions, sorts, and converts the format to Oracle blocks • Appended to database tables • Or written to Data Pump files for a later load
  40. Oracle SQL Connector for HDFS • Data is in HDFS • The connector creates an external table • That automatically matches the Hadoop data • Control the degree of parallelism • You know external tables, right?
  41. Data Types Supported • Data Pump • Delimited text • Avro • Regular expressions • Custom formats
  42. Main Benefit: Processing is done in Hadoop
  43. Benefits • High performance • Reduced CPU usage on the database • Automatic optimizations: • Partitioning • Sorting • Load balancing
  44. Measuring Data Load • Concerns: How much time? How much CPU? • Bottlenecks: Disk, CPU, Network
  45. I Know What This Means:
  46. What does this mean?
  47. Measuring Data Load • Disks: ~300 MB/s each • SSD: ~1.6 GB/s each • Network: ~100 MB/s (1GbE), ~1 GB/s (10GbE), ~4 GB/s (IB) • CPU: 1 CPU-second per second per core • Need to know: CPU-seconds per GB
  48. Let’s walk through this… We have 5 TB to load, and each core provides 3,600 CPU-seconds per hour. Loading 5,000 GB: with FUSE, 5,000 × 150 CPU-sec = 750,000 CPU-sec / 3,600 ≈ 208 CPU-hours; with the SQL Connector, 5,000 × 40 = 200,000 CPU-sec / 3,600 ≈ 55 CPU-hours. Our X2-3 half rack has 84 cores, so around 40 minutes to load 5 TB at 100% CPU, assuming Exadata (InfiniBand + SSD ≈ 8 TB/h load rate) and all CPUs used for loading.
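The same back-of-the-envelope arithmetic as a small Python sketch (the CPU-seconds-per-GB figures are the ones quoted on the slide; everything else is plain unit conversion):

    DATA_GB = 5000          # 5 TB to load
    CORES = 84              # X2-3 half rack, per the slide
    CPU_SEC_PER_GB = {"fuse": 150, "sql_connector": 40}

    for method, cost in CPU_SEC_PER_GB.items():
        cpu_hours = DATA_GB * cost / 3600.0
        wall_clock_min = cpu_hours / CORES * 60
        print("%s: %.1f CPU-hours, ~%.0f minutes on %d cores at 100%% CPU"
              % (method, cpu_hours, wall_clock_min, CORES))

    # fuse: 208.3 CPU-hours, ~149 minutes; sql_connector: 55.6 CPU-hours, ~40 minutes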
  49. Given fast enough network and disks, data loading will take all available CPU. This is a good thing.
