Building a Big Data data warehouse - GOTO 2013


Content and talk by Friso van Vollenhoven (GoDataDriven)

Let's face it: data warehousing in the traditional sense has been tedious, lacking agility, and slow. Often designed and built around a static set of up-front questions, traditional warehouses are basically a big database tailored to the application of populating dashboards. The fact that the volume of data we need to deal with has recently exploded by several orders of magnitude isn't improving the situation. It's no surprise that we see a new class of data warehousing setups emerge, using big data technologies and NoSQL stores. A nice side effect is that these solutions are usually not only the domain of the BI crowd, but can also be developer friendly and allow development of more data-driven apps.

In this talk I will present experiences of using Hadoop and other tools from the Hadoop ecosystem, such as Hive, Pig, and bare MapReduce, to handle data that grows by tens of GBs per day. We create a system where data is captured, stored, and made available to different users and use cases, ranging from end users who write SQL queries to software developers who access the underlying data to create data-driven products. I will cover topics like ETL, querying, development, deployment, and reporting; using a fully open source stack, of course.

Published in: Data & Analytics, Technology


  1. GoDataDriven, proudly part of the Xebia Group. Building a Big Data DWH: Data Warehousing on Hadoop. Friso van Vollenhoven, CTO. @fzk / frisovanvollenhoven@godatadriven.com
  2. “In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis.” -- Wikipedia
  3. ETL
  4. How to:
     • Add a column to the facts table?
     • Change the granularity of dates from day to hour?
     • Add a dimension based on some aggregation of facts?
  5. Schemas are designed with questions in mind. Changing one requires redoing the ETL.
  6. Schemas are designed with questions in mind. Changing one requires redoing the ETL. Push things to the facts level. Keep all source data available at all times.
  7. And now?
     • MPP databases?
     • Faster / better / more SAN?
     • (RAC?)
  8. distributed storage
     distributed processing
     metadata + query engine
  10. • No JVM startup overhead for Hadoop API usage
      • Relatively concise syntax (Python)
      • Mix Python standard library with any Java libs
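The slide does not name the specific tool, so as a hedged illustration here is a Hadoop Streaming-style job written as two pure Python functions. In a real streaming job the mapper would read `sys.stdin` and print tab-separated key/value lines; writing the phases as plain functions keeps the job logic testable without a cluster.

```python
# Hypothetical sketch (the slide does not name the exact framework): a
# Hadoop Streaming-style job as two pure Python functions. The mapper
# emits (key, 1) pairs and the reducer sums them per key, relying on
# the sorted-by-key ordering Hadoop guarantees between map and reduce.
from itertools import groupby


def mapper(lines):
    # Emit (browser_id, 1) for each tab-separated log line.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield fields[0], 1


def reducer(pairs):
    # Sum counts per key; input must be sorted by key, as Hadoop
    # guarantees between the map and reduce phases.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)
```

Because both phases are ordinary generators, they can be unit-tested on a list of lines, then wired to stdin/stdout for the actual streaming job.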
  11. • Flexible scheduling with dependencies
      • Saves output
      • E-mails on errors
      • Scales to multiple nodes
      • REST API
      • Status monitor
      • Integrates with version control
  12. Deployment: git push jenkins master
  13. • Scheduling
      • Simple deployment of ETL code
      • Scalable
      • Developer friendly
  14. February 22, 2013
  15. A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.
  16. þ
  17. þ
  18. TSV == thorn separated values?
  19. þ == 0xFE
  20. Or -2, in Hive:

      CREATE TABLE browsers (
        browser_id STRING,
        browser STRING
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '-2';
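A short Python sketch of why '-2' and 0xFE name the same delimiter: Hive reads a numeric delimiter string as a single byte, and 0xFE reinterpreted as a signed 8-bit integer is -2.

```python
# Why '-2' selects the thorn byte: 0xFE reinterpreted as a *signed*
# 8-bit integer is -2, which is the numeric form the Hive DDL uses.
import struct

THORN = 0xFE  # the byte value of 'þ' in Latin-1

# Pack 0xFE as an unsigned byte, then unpack it as a signed byte.
signed = struct.unpack("b", struct.pack("B", THORN))[0]
assert signed == -2

# Splitting a thorn-separated record, assuming Latin-1 encoded input:
record = b"123\xfeFirefox".decode("latin-1")
assert record.split("\u00fe") == ["123", "Firefox"]
```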
  21. • The format will change
      • Faulty deliveries will occur
      • Your parser will break
      • Records will be mistakenly produced (over-logging)
      • Other people test in production too (and you get the data from it)
      • Etc., etc.
  22. • Simple deployment of ETL code
      • Scheduling
      • Scalable
      • Independent jobs
      • Fixable data store
      • Incremental where possible
      • Metrics
  23. Independent jobs:
      source (external) → staging (HDFS): HDFS upload + move in place
      staging (HDFS) → hive-staging (HDFS): MapReduce + HDFS move
      hive-staging (HDFS) → Hive: map external table + SELECT INTO
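The "upload + move in place" step can be sketched on a local filesystem (an assumption for illustration; on the cluster the same pattern would use a temp directory plus an HDFS rename): write to a temporary path first, then rename into the final location, so downstream jobs never observe a half-written file.

```python
# Local-filesystem sketch of "upload + move in place": write to a temp
# path in the target directory, then rename into the final location.
# The rename is atomic on POSIX within one filesystem, so consumers
# either see the old state or the complete new file, never a partial one.
import os
import tempfile


def move_in_place(data, final_path):
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.rename(tmp_path, final_path)  # atomic swap into place
    except Exception:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

Creating the temp file in the same directory as the target matters: a rename across filesystems would fall back to copy-then-delete and lose atomicity.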
  24. Out of order jobs
      • At any point, you don’t really know what ‘made it’ to Hive
      • Will happen anyway, because some days the data delivery is going to be three hours late
      • Or you get half in the morning and the other half later in the day
      • It really depends on what you do with the data
      • This is where metrics + fixable data store help...
  25. Fixable data store
      • Using Hive partitions
      • Jobs that move data from staging create partitions
      • When new data / insight about the data arrives, drop the partition and re-insert
      • Be careful to reset any metrics in this case
      • Basically: instead of trying to make everything transactional, repair afterwards
      • Use metrics to determine whether data is fit for purpose
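A minimal in-memory sketch of the drop-and-reinsert repair pattern described above, with a dict standing in for a partitioned Hive table (names are illustrative; in HiveQL the counterpart would be `ALTER TABLE ... DROP PARTITION` followed by a fresh insert into that partition):

```python
# Sketch of "repair afterwards" instead of transactional updates: a bad
# partition is dropped wholesale, its metrics are reset, and it is
# rebuilt from the staging data, so metrics always reflect the current
# partition contents. The dict is a stand-in for a Hive table.

class PartitionedTable:
    def __init__(self):
        self.partitions = {}  # partition key -> list of records
        self.metrics = {}     # partition key -> records inserted

    def insert(self, key, records):
        self.partitions.setdefault(key, []).extend(records)
        self.metrics[key] = self.metrics.get(key, 0) + len(records)

    def repair(self, key, records):
        # Drop the partition AND its metrics, then re-insert from source.
        self.partitions.pop(key, None)
        self.metrics.pop(key, None)
        self.insert(key, records)
```

The key point from the slide is the paired reset: dropping the partition without also resetting its metrics would leave the metrics overstating what the partition contains.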
  26. Metrics
  27. Metrics service
      • Job ran, so many units processed, took so much time
      • e.g. 10GB imported, took 1 hr
      • e.g. 60M records transformed, took 10 minutes
      • Dropped partition
      • Inserted X records into partition
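The talk doesn't specify the metrics service's interface, so here is a hedged sketch of the kind of record it describes: every job reports what it did, how much, and how long it took, which later checks can use to decide whether the data is fit for purpose.

```python
# Hypothetical sketch of a metrics service entry: job name, units of
# work, and duration per run, with a derived throughput figure. The
# shape of the record is an assumption; the slide only lists examples
# like "10GB imported, took 1 hr".
import time


class Metrics:
    def __init__(self):
        self.events = []

    def record(self, job, units, seconds):
        self.events.append({
            "job": job,
            "units": units,
            "seconds": seconds,
            "recorded_at": time.time(),
        })

    def throughput(self, job):
        # Units per second across all runs of a job.
        runs = [e for e in self.events if e["job"] == job]
        total_units = sum(e["units"] for e in runs)
        total_seconds = sum(e["seconds"] for e in runs)
        return total_units / total_seconds if total_seconds else 0.0
```

A downstream check could then compare a partition's recorded insert count or throughput against expectations before declaring the data usable.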
  28. GoDataDriven. We’re hiring / Questions? / Thank you! Friso van Vollenhoven, CTO. @fzk / frisovanvollenhoven@godatadriven.com