
Hadoop as a Data Refinery



Explores the notion of "Hadoop as a Data Refinery" within an organisation, whether or not it has an existing Business Intelligence system, and looks at 'agile data' as a benefit of using Hadoop as the store for historical, unstructured and very-large-scale datasets.

The final slides look at the challenge of an organisation becoming "data driven"



  1. Hadoop as a Data Refinery. Steve Loughran, Hortonworks (@steveloughran). London, October 2012. © Hortonworks Inc. 2012
  2. About me
     • HP Labs: deployment, cloud infrastructure, Hadoop-in-cloud
     • Apache member and committer: Ant, Axis; author of Ant in Action
     • Hadoop work: dynamic deployments, diagnostics on failures, cloud infrastructure integration
     • Joined Hortonworks in 2012; UK-based R&D
  3. What is Apache Hadoop?
     • A collection of open source projects
       – Apache Software Foundation (ASF)
       – Commercial and community development
       – One of the best examples of open source driving innovation and creating a market
     • Foundation for Big Data solutions
       – Stores petabytes of data reliably
       – Runs highly distributed computation
       – Commodity servers and storage
       – Powers data-driven business
  4. Why Hadoop?
     Business pressure:
     1. Opportunity to enable innovative new business models
     2. Potential new insights that drive competitive advantage
     Technical pressure:
     3. Data collected and stored continues to grow exponentially
     4. Data is increasingly everywhere and in many formats
     5. Traditional solutions not designed for the new requirements
     Financial pressure:
     6. Cost of data systems, as a % of IT spend, continues to grow
     7. Cost advantages of commodity hardware and open source
  5. The data refinery in an enterprise
     [Architecture diagram: new data sources (audio, video, images; docs, text, XML; web logs, clicks; social graphs and feeds; sensors, devices, RFID; spatial/GPS; events) flow into the Big Data Refinery (Apache Hadoop: HDFS, Pig, ETL). Business transactions and interactions (web, mobile, CRM, ERP, SCM) held in SQL/NoSQL/NewSQL systems exchange data with the refinery, which feeds EDW/MPP/NewSQL stores and, beyond them, Business Intelligence and analytics: dashboards, reports, visualization.]
  6. Modernising Business Intelligence
     • Before:
       – Current records and only a short history
       – Analytics/BI systems keep conformed / cleaned / digested data
       – Unstructured data locked in silos or archived offline
       – Inflexible: new questions require system redesigns
     • Now:
       – Keep raw data in Hadoop for a long time
       – Reprocess and enhance analytics/BI data on demand
       – Experiment directly on all the raw data
       – New products and services can be added very quickly
     Storage and agility justify the new infrastructure
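The "reprocess on demand" point is the heart of the refinery idea: because the raw records are kept, a question nobody thought to ask at collection time is just another pass over the same data. A toy Python sketch of that agility (the log layout and field positions are invented for illustration):

```python
# Raw, unaggregated events are kept; a new question becomes a new pass.
from collections import Counter

# Hypothetical raw web-log lines: timestamp, method, path, country code.
raw_log = [
    "2012-10-01T09:14 GET /product/42 uk",
    "2012-10-01T09:15 GET /product/7 de",
    "2012-10-01T09:16 GET /product/42 uk",
]

def hits_by_country(lines):
    """A question nobody asked at collection time: hits per country."""
    return Counter(line.split()[3] for line in lines)

print(hits_by_country(raw_log))  # Counter({'uk': 2, 'de': 1})
```

At Hadoop scale the same pass would be a Pig or Hive job over HDFS, but the argument is identical: no schema redesign, just a new query over data that was never thrown away.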
  7. Refineries pull in raw data
     • Internal: pipelines with Apache Flume
       – Web site logs
       – Real-world events: retail, financial, vehicle movements
       – New data sources you create
       – The data you couldn't afford to keep before
     • External: pipelines and bulk deliveries, e.g. with Apache Sqoop
       – Correlating data: weather, market, competition
       – New sources: Twitter feeds, Infochimps, open government data
       – Real-world events: retail, financial
       – Data that helps you understand your own data
  8. Refineries refine raw data
     • Clean up raw data
     • Filter "cleaned" data
     • Forward data to different destinations:
       – Existing BI infrastructure
       – New "Agile Data" infrastructures
     • Offload work from the core data warehouse:
       – ETL operations
       – Report and chart generation
       – Ad-hoc queries
     Needs: query, workflow and reporting tools
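The clean → filter → forward sequence is what refinery jobs (typically Pig) do at scale. A plain-Python sketch of the same shape, with a made-up record format:

```python
# Hypothetical refinery step: normalise raw CSV-ish records, drop
# malformed ones, then filter to the subset a downstream system wants.
raw = ["  Alice,2012-10-01,199 ", "bob,2012-10-02,-5", "broken-record"]

def clean(record):
    parts = [p.strip() for p in record.split(",")]
    if len(parts) != 3:
        return None  # malformed: a real job would route this to a reject file
    name, date, amount = parts
    return {"name": name.lower(), "date": date, "amount": int(amount)}

cleaned = [r for r in map(clean, raw) if r is not None]  # clean step
valid = [r for r in cleaned if r["amount"] > 0]          # filter step
print(valid)  # [{'name': 'alice', 'date': '2012-10-01', 'amount': 199}]
```

In the refinery proper, the "forward" step would then ship `valid` on to the BI warehouse while the full `cleaned` set stays archived in HDFS.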
  9. Refineries can store data
     • Retain historical transaction data and analyses
     • Store (cleaned, filtered, compressed) raw data
     • Provide the history for more advanced analysis in future applications and queries
     • Needs: storage and query tools
       – Storage: HDFS and HBase
       – Languages: Pig and Hive
       – Workflow for scheduled jobs: Oozie
       – Shared schema repository: HCatalog
     Hadoop makes storing bulk and historical data affordable
  10. What if I didn't have a data warehouse?
  11. Congratulations!
     1. HBase: scale, Hadoop integration
     2. MongoDB, CouchDB, Riak: good for web UIs
     3. Postgres, MySQL, …: transactions
  12. Agile Data
  13. Agile Data
     • SQL experts: Hive HQL queries
     • Ad-hoc queries: Pig
     • Statistics platform: R + Hadoop
     • Visualisation tools, including Excel
     • New web UI applications
     Because you don't know all that you are looking for when you collect the data
  14. [Image-only slide]
  15. Pig: an Agile Data language
     • Optimised for refining data
     • Dataflow-driven: much higher level than Java
     • Macros and user-defined functions
     • ILLUSTRATE aids development
     • For ad-hoc and production use
  16. Example: Packetpig

     snort_alerts = LOAD '$pcap'
         USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');
     countries = FOREACH snort_alerts GENERATE
         com.packetloop.packetpig.udf.geoip.Country(src) AS country,
         priority;
     countries = GROUP countries BY country;
     countries = FOREACH countries GENERATE
         group,
         AVG(countries.priority) AS average_severity;
     STORE countries INTO 'output/choropleth_countries' USING PigStorage(',');
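For readers who don't know Pig, the dataflow above is just a group-by followed by an average. Roughly the following, with made-up (country, priority) pairs standing in for the Snort alerts:

```python
# Group alert priorities by country, then average each group --
# the same shape as the Pig GROUP BY / FOREACH ... AVG dataflow.
from collections import defaultdict

alerts = [("US", 1), ("US", 3), ("DE", 2)]  # invented sample data

groups = defaultdict(list)
for country, priority in alerts:
    groups[country].append(priority)

average_severity = {c: sum(p) / len(p) for c, p in groups.items()}
print(average_severity)  # {'US': 2.0, 'DE': 2.0}
```

The Pig version earns its keep once the alerts number in the billions: the same few lines run as parallel jobs over HDFS rather than in one process.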
  17. Web UI: d3.js
  18. Analytics apps: it takes a team
     • A broad skill-set is needed to make useful apps
     • Basically nobody has all of those skills
     • Application development is inherently collaborative
  19. Developers: learn statistics via Pig
      Data scientists and statisticians: learn Pig (and R)
      Russ Jurney @ HUG UK
  20. Challenge: becoming a data-driven organisation
  21. Challenges
     • Thinking of the right questions to ask
     • Conducting valid experiments: A/B testing, surveys with effective sampling, …
       – Not: "try a new web design for a week"
       – Not: a "please do a site survey" pop-up dialog
     • Accepting negative results
       – "no design was better than the other"
     • Accepting results you don't agree with
       – "the trials imply the proposed strategy won't work"
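"Accepting negative results" has a concrete statistical face: a variant can look better in raw counts yet be indistinguishable from noise. A minimal two-proportion z-test sketch, with invented visitor and conversion numbers:

```python
# Two-proportion z-test: did variant B really convert better than A?
# All counts below are invented for illustration.
from math import sqrt

a_conv, a_n = 120, 2400  # variant A: conversions, visitors (5.0%)
b_conv, b_n = 135, 2500  # variant B: conversions, visitors (5.4%)

p_a, p_b = a_conv / a_n, b_conv / b_n
p_pool = (a_conv + b_conv) / (a_n + b_n)  # pooled conversion rate
se = sqrt(p_pool * (1 - p_pool) * (1 / a_n + 1 / b_n))
z = (p_b - p_a) / se  # about 0.63, far below 1.96 at the 5% level

significant = abs(z) > 1.96
print(significant)  # False: "no design was better than the other"
```

B's 5.4% beats A's 5.0% on the face of it, yet the honest conclusion is the negative one the slide warns about: with this sample size, no design was better than the other.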
  22. Example: Yahoo!
     • Online application logic driven by big lookup tables
     • Lookup data computed periodically on Hadoop
       – Machine learning and other expensive computation run offline
       – Personalization, classification, fraud, value analysis, …
     • Application development requires data science
       – Huge amounts of actually-observed data are key to modern apps
       – Hadoop used as the science platform
  23. Yahoo! homepage
     [Architecture diagram: user behaviour feeds a science Hadoop cluster, where machine learning builds ever-better categorization models on a weekly production cycle. A production Hadoop cluster applies those categorization models to identify user interests and rebuild the serving maps every five minutes. Serving systems use the maps to build customised home pages with the latest data for engaged users, at thousands of pages per second. Copyright Yahoo 2011]
  24. Conclusions
     Hadoop can live alongside existing BI systems, as a data refinery:
     • Store and refine bulk and unstructured data
     • Archive data for long-term analysis
     • Support ad-hoc queries over bulk data
     • Become the data-science platform
  25. Thank You! Questions & …