Dirty Data by Friso van Vollenhoven

At a large European publisher, we faced the challenge of receiving external data from 10+ different sources, adding up to tens of GBs per day. On top of that, file formats sometimes change without notification, connections sometimes go bad, and files go missing. Meanwhile, we try to ensure that at least the amount of data is approximately correct, in an environment where the 'correct' amount of data for a source is often a hard-to-predict number somewhere between 20M and 50M records for a particular day.

We built an extract-and-load pipeline to get data into Hadoop and expose it via Hive tables. It includes scheduling, reporting, monitoring, transforming and, above all, the ability to respond to changes very quickly. After all, responding to a file format change within the same day or adding a new source in a day are very reasonable user requests (right?). We focused on developer friendliness and rely on a fully open source stack, using Hadoop, Hive, Jenkins, various scripting languages and more. This is my talk about the setup and our lessons learned.

In our quest for data quality, we also worked on predicting the expected data volumes, based on seasonality and weather information, in order to proactively alert when a data import appears to fall short of the expected volume. I will include these results in the talk.

  • 1. GoDataDriven (PROUDLY PART OF THE XEBIA GROUP). @fzk, frisovanvollenhoven@godatadriven.com. Dirty Data. Friso van Vollenhoven. It’s a mess. It’s your problem.
  • 2. februari-22 2013
  • 3. A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.
  • 4. þ
  • 5. þ
  • 6. TSV == thorn separated values?
  • 7. þ == 0xFE
  • 8. or -2, in Hive: CREATE TABLE browsers (browser_id STRING, browser STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-2';
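The link between þ, 0xFE and -2 on the slides above can be checked in a few lines of Python. This is a minimal sketch, assuming the files are Latin-1 encoded (where þ is the single byte 0xFE); the sample record and the helper name `split_record` are made up for illustration and are not from the talk.

```python
from typing import List

# þ (thorn) is code point U+00FE; in Latin-1 it is the single byte 0xFE.
THORN = "þ".encode("latin-1")

# Read as a signed 8-bit integer, 0xFE is -2, which is the value used
# as the field terminator in the Hive table definition.
signed_value = int.from_bytes(THORN, byteorder="big", signed=True)


def split_record(line: bytes) -> List[str]:
    """Split one thorn-separated record into decoded fields."""
    stripped = line.rstrip(b"\r\n")
    return [field.decode("latin-1") for field in stripped.split(THORN)]


# Hypothetical sample record, not taken from the real data.
sample = b"42\xfeFirefox\n"
fields = split_record(sample)
```

Splitting on the raw byte before decoding avoids depending on how a text layer would interpret 0xFE under another encoding.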
  • 9.
      • The format will change
      • Faulty deliveries will occur
      • Your parser will break
      • Records will be mistakenly produced (over-logging)
      • Other people test in production too (and you get the data from it)
      • Etc., etc.
  • 10.
      • Simple deployment of ETL code
      • Scheduling
      • Scalable
      • Independent jobs
      • Fixable data store
      • Incremental where possible
      • Metrics
  • 12.
      • No JVM startup overhead for Hadoop API usage
      • Relatively concise syntax (Python)
      • Mix Python standard library with any Java libs
  • 13.
      • Flexible scheduling with dependencies
      • Saves output
      • E-mails on errors
      • Scales to multiple nodes
      • REST API
      • Status monitor
      • Integrates with version control
  • 14. Deployment: git push jenkins master
  • 15. Independent jobs
      • Stages: source (external) → staging (HDFS) → hive-staging (HDFS) → Hive
      • source → staging: HDFS upload + move in place
      • staging → hive-staging: MapReduce + HDFS move
      • hive-staging → Hive: Hive map external table + SELECT INTO
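The "upload + move in place" step is what lets independent jobs pick up only complete files. The sketch below illustrates the same pattern on a local filesystem, as a stand-in for HDFS and not the pipeline's actual code; the function name `upload_then_move` and the temporary-name prefix are assumptions made for this example.

```python
import os


def upload_then_move(data: bytes, final_path: str) -> None:
    """Write to a temporary name first, then rename into place.

    A job that looks for final_path finds either nothing or the
    complete file, never a half-written one.
    """
    directory = os.path.dirname(final_path) or "."
    # Hypothetical naming scheme for the in-progress file.
    tmp_path = os.path.join(directory, "._tmp_" + os.path.basename(final_path))
    with open(tmp_path, "wb") as handle:
        handle.write(data)
    # os.replace is an atomic rename when both paths are on the same
    # POSIX filesystem.
    os.replace(tmp_path, final_path)
```

The design choice is that the expensive part (the transfer) happens under a name no downstream job reads, and only the cheap rename makes the data visible.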
  • 16. Out of order jobs
      • At any point, you don’t really know what ‘made it’ to Hive
      • Will happen anyway, because some days the data delivery is going to be three hours late
      • Or you get half in the morning and the other half later in the day
      • It really depends on what you do with the data
      • This is where metrics + fixable data store help...
  • 17. Fixable data store
      • Using Hive partitions
      • Jobs that move data from staging create partitions
      • When new data / insight about the data arrives, drop the partition and re-insert
      • Be careful to reset any metrics in this case
      • Basically: instead of trying to make everything transactional, repair afterwards
      • Use metrics to determine whether data is fit for purpose
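The drop-and-re-insert repair can be sketched as a small Python helper that produces the HiveQL for one partition. This is an illustration under assumptions, not the talk's code: the tables are assumed to be partitioned by a string column named `day`, the staging table is assumed to expose the same `day` column, and all table and column names are hypothetical. Inputs are interpolated without escaping, so they must come from trusted configuration.

```python
from typing import List


def repair_partition_statements(
    table: str, staging_table: str, day: str, columns: List[str]
) -> List[str]:
    """Return the HiveQL to drop and rebuild one daily partition."""
    column_list = ", ".join(columns)
    drop = "ALTER TABLE {t} DROP IF EXISTS PARTITION (day='{d}')".format(
        t=table, d=day
    )
    insert = (
        "INSERT OVERWRITE TABLE {t} PARTITION (day='{d}') "
        "SELECT {c} FROM {s} WHERE day='{d}'"
    ).format(t=table, d=day, c=column_list, s=staging_table)
    return [drop, insert]


# Example: statements to rebuild a single day of a hypothetical table.
statements = repair_partition_statements(
    "clicks", "clicks_staging", "2013-02-22", ["browser_id", "browser"]
)
```

Because each repair touches exactly one partition, rerunning it after a late or corrected delivery is safe to repeat.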
  • 18. Metrics
  • 19. Metrics service
      • Job ran, so many units processed, took so much time
      • e.g. 10GB imported, took 1 hr
      • e.g. 60M records transformed, took 10 minutes
      • Dropped partition
      • Inserted X records into partition
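A minimal in-memory sketch of the bookkeeping such a metrics service implies: it counts records per partition, resets the count when a partition is dropped, and answers a simple "fit for purpose" question. The class, its method names and the threshold check are assumptions for illustration; the actual service is not shown in the slides.

```python
from collections import defaultdict


class PartitionMetrics:
    """Track how many records each partition currently holds."""

    def __init__(self):
        self._records = defaultdict(int)
        self.events = []  # chronological log of (event, partition, records)

    def inserted(self, partition, records):
        self._records[partition] += records
        self.events.append(("inserted", partition, records))

    def dropped(self, partition):
        # Dropping a partition resets its count, so a later re-insert
        # is not added on top of stale numbers.
        self._records.pop(partition, None)
        self.events.append(("dropped", partition, 0))

    def records(self, partition):
        return self._records.get(partition, 0)

    def fit_for_purpose(self, partition, expected_min):
        """True when the partition holds at least expected_min records."""
        return self.records(partition) >= expected_min
```

For example, after `inserted("2013-02-22", 30_000_000)`, `dropped("2013-02-22")` and `inserted("2013-02-22", 22_000_000)`, the partition reports 22M records, not 52M.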
  • 20. GoDataDriven. We’re hiring / Questions? / Thank you! @fzk, frisovanvollenhoven@godatadriven.com. Friso van Vollenhoven