Dirty Data by Friso van Vollenhoven

At a large European publisher, we faced the challenge of receiving external data from 10+ different sources, adding up to tens of GBs per day. On top of that, file formats sometimes change without notification, connections go bad, and files go missing. All the while, we tried to ensure that at least the amount of data is approximately correct, in an environment where the 'correct' amount of data for a source is often a hard-to-predict number somewhere between 20M and 50M records for a particular day.

We built an extract-and-load pipeline to get data into Hadoop and expose it via Hive tables. It includes scheduling, reporting, monitoring, transforming and, above all, the ability to respond to changes very quickly. After all, responding to a file format change within the same day or adding a new source in a day are very reasonable user requests (right). We focused on developer friendliness and rely on a fully open source stack, using Hadoop, Hive, Jenkins, various scripting languages and more. This is my talk about the setup and our lessons learned.

In our quest for data quality, we also worked on predicting the expected data volumes, based on seasonality and weather information, in order to proactively alert when a data import appears to fall short of the expected volume. I will include these results in the talk.
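
As a rough illustration of alerting on a volume shortfall, the sketch below compares a day's record count with a baseline. It is not the prediction model from the talk: instead of seasonality and weather features it uses a deliberately simple stand-in, the median count of past days that fell on the same weekday. The function names, the 0.8 tolerance and the sample counts are all hypothetical.

```python
from statistics import median


def expected_volume(history, weekday):
    """Median record count over past days that fell on the given weekday.

    `history` is a list of (weekday, record_count) pairs, weekday 0 = Monday.
    This is an assumed stand-in for a real seasonality model.
    """
    same_day = [count for day, count in history if day == weekday]
    if not same_day:
        raise ValueError("no history for weekday %d" % weekday)
    return median(same_day)


def falls_short(actual, expected, tolerance=0.8):
    """True when the import is below `tolerance` times the expected volume."""
    return actual < tolerance * expected


# Hypothetical history: three Mondays and two Saturdays.
history = [
    (0, 42_000_000),
    (0, 45_000_000),
    (0, 39_000_000),
    (5, 21_000_000),
    (5, 24_000_000),
]

baseline = expected_volume(history, weekday=0)
if falls_short(30_000_000, baseline):
    print("ALERT: import looks short of the expected volume")
```

A median is used here rather than a mean so that one faulty delivery in the history does not drag the baseline down.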

Transcript of "Dirty Data by Friso van Vollenhoven"

  1. GoDataDriven, proudly part of the Xebia Group. Dirty Data: It’s a mess. It’s your problem. Friso van Vollenhoven, @fzk, frisovanvollenhoven@godatadriven.com
  2. February 22, 2013
  3. A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.
  4. þ
  5. þ
  6. TSV == thorn separated values?
  7. þ == 0xFE
  8. or -2, in Hive:
     CREATE TABLE browsers (browser_id STRING, browser STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-2';
  9. What to expect:
     • The format will change
     • Faulty deliveries will occur
     • Your parser will break
     • Records will be mistakenly produced (over-logging)
     • Other people test in production too (and you get the data from it)
     • Etc., etc.
  10. Requirements:
     • Simple deployment of ETL code
     • Scheduling
     • Scalable
     • Independent jobs
     • Fixable data store
     • Incremental where possible
     • Metrics
  11. EXTRACT, TRANSFORM, LOAD
  12. Scripting:
     • No JVM startup overhead for Hadoop API usage
     • Relatively concise syntax (Python)
     • Mix Python standard library with any Java libs
  13. Scheduling:
     • Flexible scheduling with dependencies
     • Saves output
     • E-mails on errors
     • Scales to multiple nodes
     • REST API
     • Status monitor
     • Integrates with version control
  14. Deployment: git push jenkins master
  15. Independent jobs
     • Stages: source (external), staging (HDFS), hive-staging (HDFS), Hive
     • Steps between stages: HDFS upload + move in place; MapReduce + HDFS move; Hive map external table + SELECT INTO
  16. Out of order jobs
     • At any point, you don’t really know what ‘made it’ to Hive
     • Will happen anyway, because some days the data delivery is going to be three hours late
     • Or you get half in the morning and the other half later in the day
     • It really depends on what you do with the data
     • This is where metrics + fixable data store help...
  17. Fixable data store
     • Using Hive partitions
     • Jobs that move data from staging create partitions
     • When new data / insight about the data arrives, drop the partition and re-insert
     • Be careful to reset any metrics in this case
     • Basically: instead of trying to make everything transactional, repair afterwards
     • Use metrics to determine whether data is fit for purpose
  18. Metrics
  19. Metrics service
     • Job ran, so many units processed, took so much time
     • e.g. 10GB imported, took 1 hr
     • e.g. 60M records transformed, took 10 minutes
     • Dropped partition
     • Inserted X records into partition
  20. GoDataDriven: We’re hiring / Questions? / Thank you! Friso van Vollenhoven, @fzk, frisovanvollenhoven@godatadriven.com
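
The thorn delimiter from slides 6 to 8 can be made concrete with a short sketch. It shows why the byte 0xFE appears as -2 in Hive (a signed byte) and how a thorn-separated line might be split in Python. The field encoding (Latin-1), the helper names and the sample record are assumptions for illustration; the slides do not say how the files were encoded.

```python
import struct

THORN_BYTE = 0xFE  # the byte value of 'þ' in Latin-1


def signed_byte(value):
    """Reinterpret an unsigned byte (0..255) as a signed byte (-128..127)."""
    return struct.unpack("b", bytes([value]))[0]


def parse_thorn_line(raw, encoding="latin-1"):
    """Split one raw line on the 0xFE delimiter, then decode each field.

    Splitting on bytes before decoding avoids decode errors: 0xFE is never
    a valid byte in UTF-8, so decoding the whole line as UTF-8 would fail.
    """
    delimiter = bytes([THORN_BYTE])
    return [field.decode(encoding) for field in raw.rstrip(b"\r\n").split(delimiter)]


# Hypothetical record for the `browsers` table: browser_id þ browser
line = b"42\xfeMozilla/5.0 (X11; Linux)\n"

print(signed_byte(THORN_BYTE))        # -2
print("þ".encode("latin-1"))          # b'\xfe'
print(parse_thorn_line(line))         # ['42', 'Mozilla/5.0 (X11; Linux)']
```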
