Dirty data by friso van vollenhoven
At a large European publisher, we had the challenge of receiving external data from 10+ different sources, adding up to tens of GBs per day. On top of that, file formats sometimes change without notice, connections go bad, and files go missing. All the while, we try to ensure that at least the amount of data is roughly correct, in an environment where the 'correct' amount of data for a source is often a hard-to-predict number somewhere between 20M and 50M records for a given day.

We built an extract-and-load pipeline to get data into Hadoop and expose it via Hive tables. It includes scheduling, reporting, monitoring, transforming and, above all, the ability to respond to changes very quickly. After all, responding to a file format change within the same day or adding a new source in a day are very reasonable user requests (right). We focused on developer friendliness and rely on a fully open source stack, using Hadoop, Hive, Jenkins, various scripting languages and more. This is my talk about the setup and our lessons learned.

In our quest for data quality, we also worked on predicting the expected data volumes, based on seasonality and weather information, so that we can proactively alert when a data import appears to fall short of the expected volume. I will include these results in the talk.
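As a rough illustration of alerting on a short import, here is a minimal sketch. It is not the model from the talk, which used seasonality and weather information; it swaps in a much simpler baseline (the median of earlier days that fell on the same weekday). The function names, the 20% tolerance and the sample history are all assumptions made up for this example.

```python
from statistics import median


def expected_volume(history, weekday):
    """Median record count of past days that fell on the same weekday.

    history is a list of (weekday, record_count) pairs, weekday 0 = Monday.
    This is a hypothetical stand-in for a real seasonal model.
    """
    same_day = [count for day, count in history if day == weekday]
    if not same_day:
        raise ValueError("no history for weekday %d" % weekday)
    return median(same_day)


def falls_short(actual, expected, tolerance=0.2):
    """True when the import is more than `tolerance` below the expectation."""
    return actual < expected * (1 - tolerance)


# Made-up history: three Mondays and two Saturdays.
history = [
    (0, 40_000_000),
    (0, 44_000_000),
    (0, 42_000_000),
    (5, 22_000_000),
    (5, 24_000_000),
]

monday_expectation = expected_volume(history, 0)
if falls_short(30_000_000, monday_expectation):
    print("ALERT: Monday import looks short")
```

A per-weekday median only captures weekly seasonality; holidays, weather and trends would need a richer model.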

Published in: Technology, Spiritual
Transcript

  • 1. GoDataDriven, proudly part of the Xebia Group. Dirty Data, by Friso van Vollenhoven (@fzk, frisovanvollenhoven@godatadriven.com). It’s a mess. It’s your problem.
  • 2. februari-22 2013
  • 3. A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.
  • 4. þ
  • 5. þ
  • 6. TSV == thorn separated values?
  • 7. þ == 0xFE
  • 8. or -2, in Hive: CREATE TABLE browsers (browser_id STRING, browser STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-2';
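Why -2? The byte 0xFE is 254 when read as unsigned, but -2 when read as a signed 8-bit value, which is how Java (and therefore Hive) treats bytes. The sketch below checks that arithmetic and splits a thorn-separated line. The helper names and the sample record are made up for illustration, and it assumes the files are ISO-8859-1 (Latin-1) encoded.

```python
import struct

THORN_BYTE = b"\xfe"  # 'þ' in ISO-8859-1 (Latin-1)


def as_signed_byte(raw):
    """Interpret a single byte as a signed 8-bit integer, as Java does."""
    (value,) = struct.unpack("b", raw)
    return value


def split_thorn_line(line):
    """Split one raw thorn-separated record (bytes) into text fields."""
    fields = line.rstrip(b"\r\n").split(THORN_BYTE)
    return [field.decode("latin-1") for field in fields]


print("þ".encode("latin-1") == THORN_BYTE)
print(as_signed_byte(THORN_BYTE))
print(split_thorn_line(b"42\xfeFirefox\n"))
```

Splitting on the raw byte before decoding avoids surprises when a delivery turns out to be in a different encoding than expected.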
  • 9.
      • The format will change
      • Faulty deliveries will occur
      • Your parser will break
      • Records will be mistakenly produced (over-logging)
      • Other people test in production too (and you get the data from it)
      • Etc., etc.
  • 10.
      • Simple deployment of ETL code
      • Scheduling
      • Scalable
      • Independent jobs
      • Fixable data store
      • Incremental where possible
      • Metrics
  • 11. EXTRACT, TRANSFORM, LOAD
  • 12.
      • No JVM startup overhead for Hadoop API usage
      • Relatively concise syntax (Python)
      • Mix Python standard library with any Java libs
  • 13.
      • Flexible scheduling with dependencies
      • Saves output
      • E-mails on errors
      • Scales to multiple nodes
      • REST API
      • Status monitor
      • Integrates with version control
  • 14. Deployment: git push jenkins master
  • 15. Independent jobs
      • source (external) → staging (HDFS): HDFS upload + move in place
      • staging (HDFS) → hive-staging (HDFS): MapReduce + HDFS move
      • hive-staging (HDFS) → Hive: Hive map external table + SELECT INTO
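The "upload + move in place" step keeps half-written files out of sight of the next job: data is first written under a temporary name and only renamed to its final name once the copy is complete. The sketch below shows the pattern on a local filesystem as a stand-in for HDFS; it is not the pipeline's actual code, and the function name and the `.incoming-` prefix are assumptions for this example.

```python
import os
import shutil
import tempfile


def upload_then_move(source_path, target_dir, final_name):
    """Copy a file under a temporary name, then rename it into place.

    A job that lists target_dir for finished deliveries never picks up a
    partial file, provided it ignores the hypothetical '.incoming-' prefix.
    """
    os.makedirs(target_dir, exist_ok=True)
    # Create the temporary file in the target directory so that the final
    # rename stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".incoming-")
    os.close(fd)
    try:
        shutil.copyfile(source_path, tmp_path)
        final_path = os.path.join(target_dir, final_name)
        os.replace(tmp_path, final_path)  # the "move in place" step
    except BaseException:
        # Clean up the partial copy so a retry starts from scratch.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

Because each stage only ever sees complete files, the jobs can run independently and be retried without coordinating with each other.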
  • 16. Out of order jobs
      • At any point, you don’t really know what ‘made it’ to Hive
      • Will happen anyway, because some days the data delivery is going to be three hours late
      • Or you get half in the morning and the other half later in the day
      • It really depends on what you do with the data
      • This is where metrics + fixable data store help...
  • 17. Fixable data store
      • Using Hive partitions
      • Jobs that move data from staging create partitions
      • When new data / insight about the data arrives, drop the partition and re-insert
      • Be careful to reset any metrics in this case
      • Basically: instead of trying to make everything transactional, repair afterwards
      • Use metrics to determine whether data is fit for purpose
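The drop-and-re-insert repair can be sketched as a small helper that produces the two HiveQL statements for one day. This is a sketch under assumptions, not the talk's code: the table names, the `day` partition column, and the idea that the staging table also carries a `day` column are all hypothetical.

```python
import re


def repair_partition_statements(table, staging_table, columns, day):
    """Build the HiveQL to drop one day partition and re-insert it.

    Assumes a hypothetical string partition column named 'day' that holds
    dates formatted as YYYY-MM-DD, on both the target and staging tables.
    """
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", day):
        raise ValueError("day must look like YYYY-MM-DD, got %r" % day)
    cols = ", ".join(columns)
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION (day='{day}')",
        f"INSERT INTO TABLE {table} PARTITION (day='{day}') "
        f"SELECT {cols} FROM {staging_table} WHERE day = '{day}'",
    ]


for statement in repair_partition_statements(
    "browsers", "browsers_staging", ["browser_id", "browser"], "2013-02-22"
):
    print(statement + ";")
```

Any metrics recorded for the dropped partition would have to be reset alongside these statements, as the slide warns.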
  • 18. Metrics
  • 19. Metrics service
      • Job ran, so many units processed, took so much time
      • e.g. 10GB imported, took 1 hr
      • e.g. 60M records transformed, took 10 minutes
      • Dropped partition
      • Inserted X records into partition
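A minimal in-memory sketch of what such metric records could look like, including the reset that is needed when a partition is dropped. The talk does not show the service's data model, so the class names, fields and partition key format here are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class JobMetric:
    job: str          # e.g. "transform"
    partition: str    # hypothetical partition key, e.g. "2013-02-22"
    units: int        # how many units were processed
    unit: str         # what a unit is, e.g. "records" or "bytes"
    seconds: float    # wall-clock duration of the job

    def rate(self):
        """Units processed per second."""
        return self.units / self.seconds


@dataclass
class MetricsLog:
    entries: list = field(default_factory=list)

    def record(self, metric):
        self.entries.append(metric)

    def total(self, partition, unit):
        """Sum of recorded units of one kind for one partition."""
        return sum(
            m.units for m in self.entries
            if m.partition == partition and m.unit == unit
        )

    def reset_partition(self, partition):
        """Forget metrics for a partition that was dropped for repair."""
        self.entries = [m for m in self.entries if m.partition != partition]


log = MetricsLog()
# The slide's example: 60M records transformed in 10 minutes.
log.record(JobMetric("transform", "2013-02-22", 60_000_000, "records", 600))
print(log.entries[0].rate())
```

Keeping the partition on every record is what makes the reset cheap: dropping a partition and clearing its metrics become the same kind of operation.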
  • 20. GoDataDriven. We’re hiring / Questions? / Thank you! Friso van Vollenhoven (@fzk, frisovanvollenhoven@godatadriven.com)
