Hadoop as Data Refinery - Steve Loughran


Apache Hadoop is often described as a "Big Data Platform", but what does that mean? One way to understand Hadoop better is to talk about how it is used. This talk discusses using Hadoop as a "data refinery", a common use case. The concept is much like a traditional oil refinery, except with data: pulling in large quantities of "crude data" over pipelines, refining some of it into useful business intelligence, and refining other pieces into slightly less crude data that stays in the cluster until needed later. This metaphor proves useful when considering how Hadoop could be adopted in an organisation that already has data warehousing and business intelligence systems, and when contemplating how to hook up a Hadoop cluster to the sources of data inside and outside that organisation. A key point to remember is that storing data in Hadoop is not an end in itself, any more than storing data in a database is: the goal is extracting information from that data. Using Hadoop as a front-end "data refinery" means that it can integrate with existing business intelligence systems while providing the platform for new applications.

  • In the graphic above, Apache Hadoop acts as the Big Data Refinery. It's great at storing, aggregating, and transforming multi-structured data into more useful and valuable formats.
    Apache Hive is a Hadoop-related component that fits within the Business Intelligence & Analytics category, since it is commonly used for querying and analyzing data within Hadoop in a SQL-like manner. Apache Hadoop can also be integrated with other EDW, MPP, and NewSQL components such as Teradata, Aster Data, HP Vertica, IBM Netezza, EMC Greenplum, SAP HANA, Microsoft SQL Server PDW and many others.
    Apache HBase is a Hadoop-related NoSQL key/value store that is commonly used for building highly responsive next-generation applications. Apache Hadoop can also be integrated with other SQL, NoSQL, and NewSQL technologies such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2, MongoDB, DynamoDB, MarkLogic, Riak, Redis, Neo4j, Terracotta, GemFire, SQLFire, VoltDB and many others.
    Finally, data movement and integration technologies help ensure data flows seamlessly between the systems in the diagram; the lines in the graphic are powered by technologies such as WebHDFS, Apache HCatalog, Apache Sqoop, Talend Open Studio for Big Data, Informatica, Pentaho, SnapLogic, Splunk, Attunity and many others.
  • At the highest level, I describe three broad areas of data processing and outline how these areas interconnect. The three areas are:
    1. Business Transactions & Interactions
    2. Business Intelligence & Analytics
    3. Big Data Refinery
    The graphic illustrates a vision for how these three types of systems can interconnect in ways aimed at deriving maximum value from all forms of data.
    Enterprise IT has been connecting systems via classic ETL processing, as illustrated in Step 1, for many years in order to deliver structured and repeatable analysis. In this step, the business determines the questions to ask and IT collects and structures the data needed to answer those questions.
    The "Big Data Refinery", highlighted in Step 2, is a new system capable of storing, aggregating, and transforming a wide range of multi-structured raw data sources into usable formats that help fuel new insights for the business. It provides a cost-effective platform for unlocking the potential value within data and discovering the business questions worth answering with that data. A popular example of big data refining is processing web logs, clickstreams, social interactions, social feeds, and other user-generated data sources into more accurate assessments of customer churn or more effective creation of personalized offers.
    More interestingly, there are businesses deriving value from processing large video, audio, and image files. Retail stores, for example, are leveraging in-store video feeds to help them better understand how customers navigate the aisles as they find and purchase products. Retailers that provide optimized shopping paths and intelligent product placement within their stores are able to drive more revenue for the business. In this case, while the video files may be big in size, the refined output of the analysis is typically small in size but potentially big in value.
    The Big Data Refinery platform provides fertile ground for new types of tools and data processing workloads to emerge in support of rich multi-level data refinement solutions.
    With that as backdrop, Step 3 takes the model further by showing how the Big Data Refinery interacts with the systems powering Business Transactions & Interactions and Business Intelligence & Analytics. Interacting in this way opens up the ability for businesses to get a richer and more informed 360° view of customers, for example. By directly integrating the Big Data Refinery with existing Business Intelligence & Analytics solutions that contain much of the transactional information for the business, companies can enhance their ability to more accurately understand the customer behaviors that lead to transactions. Moreover, systems focused on Business Transactions & Interactions can also benefit from connecting with the Big Data Refinery: complex analytics and calculations of key parameters can be performed in the refinery and flow downstream to fuel the runtime models powering business applications, with the goal of more accurately targeting customers with the best and most relevant offers.
    Since the Big Data Refinery is good at retaining large volumes of data for long periods of time, the model is completed with the feedback loops illustrated in Steps 4 and 5. Retaining the past 10 years of historical "Black Friday" retail data, for example, can benefit the business, especially if it is blended with other data sources such as 10 years of weather data accessed from a third-party data provider. The point here is that the opportunities for creating value from multi-structured data sources available inside and outside the enterprise are virtually endless if you have a platform that can do it cost-effectively and at scale.
  • Real-world data is "dirty" – you need to clean it up. Examples:
    – Merge multiple events into one covering an extended period
    – Sanity-check events against your world view (how fast things move, how much things cost) – there is much danger here
    – Text cleanup; discard empty fields
    You may still want to retain the original data to see what was filtered – at the very least, log and sample the outliers.
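The cleanup steps above can be sketched in a few lines of Python. This is a minimal illustration, not production code: the record format, field names (`id`, `value`, `end`, `speed_kmh`) and the speed threshold are all invented for the example.

```python
MAX_SPEED_KMH = 200  # hypothetical sanity bound on how fast things can move

def clean_events(events):
    """Merge runs of identical events and set aside implausible ones.

    Returns (cleaned, outliers); outliers are retained, not discarded,
    so they can be logged and sampled later.
    """
    cleaned, outliers = [], []
    for event in events:
        # discard records with empty fields
        if not event.get("id") or not event.get("value"):
            continue
        # sanity-check against our world view; keep outliers for review
        if event.get("speed_kmh", 0) > MAX_SPEED_KMH:
            outliers.append(event)
            continue
        # merge repeated events into one covering an extended period
        if cleaned and cleaned[-1]["id"] == event["id"] \
                and cleaned[-1]["value"] == event["value"]:
            cleaned[-1]["end"] = event["end"]
        else:
            cleaned.append(dict(event))
    return cleaned, outliers
```

In a real refinery this logic would run inside a MapReduce, Pig or Hive job over the raw data, with the outlier stream written to its own HDFS directory.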
  • This is taking the metaphor beyond its limits: all that comes next is photos of Grangemouth or Milford Haven. Real-world refineries have giant storage tanks to buffer differences between ingress and egress rates. Here we are proposing keeping the data near the refinery.
  • RCFile (Record Columnar File): http://en.wikipedia.org/wiki/RCFile
    HCatalog is a table abstraction and a storage abstraction system that makes it easy for multiple tools to interact with the same underlying data. A common buzzword in the NoSQL world today is "polyglot persistence": basically, you pick the right tool for the job. In the Hadoop ecosystem, many tools might be used for data processing – Pig, Hive, your own custom MapReduce program, or that shiny new GUI-based tool that's just come out. Which one to use might depend on the user, on the type of query you're interested in, or on the type of job you want to run. From another perspective, you might want to store your data in columnar storage for efficient storage and retrieval for particular query types, or in text so that users can write data producers in scripting languages like Perl or Python, or you may want to hook up an HBase table as a data source. As an end-user, I want to use whatever data processing tool is available to me. As a data designer, I want to optimize how data is stored. As a cluster manager/data architect, I want the ability to share pieces of information across the board, and to move data back and forth fluidly. HCatalog's hopes and promises are the realization of all of the above.
  • This is an example that went up on our web site recently, using Pig to analyse NetFlow packets and so look at origins over time. That's the kind of thing you can only do with large datasets. Using a language like Pig helps you look at the numbers and decide what the next questions to ask are.
  • This is important: once you start becoming more aware of your customers, your potential customers, your internal state and the world outside, you have more information than ever before. Yet you still need to analyse it.
  • Conducting valid experiments: A/B testing of two different options must be conducted truly at random, to avoid selection bias or influence by external factors.
    Accepting negative results: it's OK to have an outcome that says "neither option is any better or worse than the other".
    Accepting results you don't agree with: evidence that your idea doesn't work. No. 3 is hard – and is why you need large, valid sample sets; otherwise you could dismiss the result as a bad experiment. Governments are classic examples of organisations that don't do this. Badger culling and drug policies are key examples – policy is driven by the beliefs of constituencies (farmers, the Daily Mail) rather than by recognising the evidence and trying to explain to those constituencies that they are mistaken. This isn't a critique of the current administration – the previous one was also belief-driven rather than fact-driven.
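The two requirements above – truly random assignment and a willingness to accept "no difference" – can be sketched with a standard two-proportion z-test. This is an illustrative sketch only; the user IDs, conversion counts and 5% significance level are invented for the example.

```python
import math
import random

def assign_variant(user_id, seed=42):
    """Assign each user to A or B uniformly at random (no self-selection)."""
    rng = random.Random(f"{user_id}:{seed}")  # deterministic per user
    return "A" if rng.random() < 0.5 else "B"

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# |z| < 1.96 means that, at the 5% level, the honest conclusion is
# "neither option is any better or worse than the other".
z = two_proportion_z(conv_a=480, n_a=10_000, conv_b=495, n_b=10_000)
print(abs(z) < 1.96)
```

Large sample sets matter here: with small samples the test has little power, and a genuinely better option can look like a negative result.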
  • This is just one scenario of how data can flow into a combined ecosystem of Hadoop, Aster Data, and Teradata. In this scenario, Hadoop is acting as a raw data store and transformation engine to load Aster Data. There are also scenarios where raw data could be loaded directly into the Aster Data and Teradata systems. The point is that each system has a role and focus for the customer. The goal is to understand how harnessing more (or all) of the data provides more value to the customer: more users on the Aster and Teradata systems, more data-driven applications, and more.

    1. Hadoop as a Data Refinery
       Steve Loughran – Hortonworks – @steveloughran
       London, October 2012
       © Hortonworks Inc. 2012
    2. About me:
       • HP Labs: deployment, cloud infrastructure, Hadoop-in-cloud
       • Apache member and committer: Ant, Axis; author of Ant in Action
       • Hadoop: dynamic deployments, diagnostics on failures, cloud infrastructure integration
       • Joined Hortonworks in 2012 – UK based: R&D
    3. What is Apache Hadoop?
       • A collection of open source projects
         – Apache Software Foundation (ASF)
         – Commercial and community development
         – One of the best examples of open source driving innovation and creating a market
       • Foundation for Big Data solutions
         – Stores petabytes of data reliably
         – Runs highly distributed computation
         – Commodity servers & storage
         – Powers data-driven business
    4. Why Hadoop?
       Business pressure:
       1. Opportunity to enable innovative new business models
       2. Potential new insights that drive competitive advantage
       Technical pressure:
       3. Data collected and stored continues to grow exponentially
       4. Data is increasingly everywhere and in many formats
       5. Traditional solutions not designed for new requirements
       Financial pressure:
       6. Cost of data systems, as a percentage of IT spend, continues to grow
       7. Cost advantages of commodity hardware & open source
    5. The data refinery in an enterprise
       [Diagram: new data sources (audio, video, images; docs, text, XML; web logs, clicks; social graphs and feeds; sensors, devices, RFID; spatial/GPS data; events) flow into the Big Data Refinery (Apache Hadoop: HDFS, Pig), which connects via ETL both to Business Transactions & Interactions systems (SQL, NoSQL, NewSQL) and to Business Intelligence & Analytics systems (EDW, MPP, NewSQL; dashboards, reports, visualization)]
    6. Modernising Business Intelligence
       • Before:
         – Current records & short history
         – Analytics/BI systems keep conformed/cleaned/digested data
         – Unstructured data locked in silos, archived offline
         – Inflexible: new questions require system redesigns
       • Now:
         – Keep raw data in Hadoop for a long time
         – Reprocess/enhance analytics/BI data on demand
         – Can experiment directly on all the raw data
         – New products/services can be added very quickly
       Storage and agility justify the new infrastructure
    7. Refineries pull in raw data
       • Internal: pipelines with Apache Flume
         – Web site logs
         – Real-world events: retail, financial, vehicle movements
         – New data sources you create
         – The data you couldn't afford to keep before
       • External: pipelines and bulk deliveries
         – Correlating data: weather, market, competition
         – New sources: Twitter feeds, Infochimps, open government data
         – Real-world events: retail, financial
         – Apache Sqoop
         – To help understand your own data
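A minimal sketch of such an internal pipeline, as a Flume 1.x agent configuration that tails a web server log into HDFS. Host names, paths and channel sizing here are placeholders, not a recommended setup.

```properties
# Hypothetical Flume agent: web server log -> memory channel -> HDFS
agent.sources  = weblog
agent.channels = mem
agent.sinks    = hdfs1

agent.sources.weblog.type     = exec
agent.sources.weblog.command  = tail -F /var/log/httpd/access_log
agent.sources.weblog.channels = mem

agent.channels.mem.type     = memory
agent.channels.mem.capacity = 10000

agent.sinks.hdfs1.type                   = hdfs
agent.sinks.hdfs1.hdfs.path              = hdfs://namenode:8020/refinery/weblogs/%Y-%m-%d
agent.sinks.hdfs1.hdfs.useLocalTimeStamp = true
agent.sinks.hdfs1.channel                = mem
```

The memory channel here is the refinery's "storage tank" in miniature: it buffers differences between the ingress rate of the source and the egress rate of the sink.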
    8. Refineries refine raw data
       • Clean up raw data
       • Filter "cleaned" data
       • Forward data to different destinations:
         – Existing BI infrastructure
         – New "agile data" infrastructures
       • Offload work from the core data warehouse:
         – ETL operations
         – Report and chart generation
         – Ad-hoc queries
       Needs: query, workflow and reporting tools
    9. Refineries can store data
       • Retain historical transaction data and analyses
       • Store (cleaned, filtered, compressed) raw data
       • Provide the history for more advanced analysis in future applications and queries
       • Needs: storage and query tools
         – Storage: HDFS and HBase
         – Languages: Pig & Hive
         – Workflow for scheduled jobs: Oozie
         – Shared schema repository: HCatalog
       Hadoop makes storing bulk & historical data affordable
    10. What if I didn't have a Data Warehouse?
    11. Congratulations!
        1. HBase: scale, Hadoop integration
        2. MongoDB, CouchDB, Riak: good for web UIs
        3. Postgres, MySQL, …: transactions
    12. Agile Data
    13. Agile Data
        • SQL experts: Hive HQL queries
        • Ad-hoc queries: Pig
        • Statistics platform: R + Hadoop
        • Visualisation tools, including Excel
        • New web UI applications
        Because you don't know all that you are looking for when you collect the data
    14. (image slide)
    15. Pig: an Agile Data language
        • Optimised for refining data
        • Dataflow-driven – much higher level than Java
        • Macros and user defined functions
        • ILLUSTRATE aids development
        • For ad-hoc and production use
    16. Example: Packetpig

        snort_alerts = LOAD '$pcap'
          USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

        countries = FOREACH snort_alerts GENERATE
          com.packetloop.packetpig.udf.geoip.Country(src) AS country,
          priority;

        countries = GROUP countries BY country;

        countries = FOREACH countries GENERATE
          group,
          AVG(countries.priority) AS average_severity;

        STORE countries INTO 'output/choropleth_countries' USING PigStorage(',');
    17. Web UI: d3.js
    18. Analytics apps: it takes a team
        • Broad skill-set needed to make useful apps
        • Basically nobody has them all
        • Application development is inherently collaborative
    19. Developers: learn statistics via Pig
        Data scientists & statisticians: learn Pig (and R)
        Russ Jurney @ HUG UK in November
        meetup.com/hadoop-users-group-uk/
    20. Challenge: becoming a data-driven organisation
    21. Challenges
        • Thinking of the right questions to ask
        • Conducting valid experiments: A/B testing, surveys with effective sampling, …
          – Not: "try a new web design for a week"
          – Not: a "please do a site survey" pop-up dialog
        • Accepting negative results
          – "no design was better than the other"
        • Accepting results you don't agree with
          – "trials imply the proposed strategy won't work"
    22. Example: Yahoo!
        • Online application logic driven by big lookup tables
        • Lookup data computed periodically on Hadoop
          – Machine learning and other expensive computation run offline
          – Personalization, classification, fraud, value analysis, …
        • Application development requires data science
          – Huge amounts of actually observed data are key to modern apps
          – Hadoop used as the science platform
    23. Yahoo! Homepage
        • Science Hadoop cluster: machine learning builds ever-better categorization models from user behaviour (weekly production runs)
        • Production Hadoop cluster: identifies user interests using those categorization models (every five minutes)
        • Serving systems: serving maps and user-interest data build customised home pages with the latest data (thousands of users/second)
        © Yahoo 2011
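The pattern on this slide – models rebuilt offline in batch, interests scored periodically, pages served from the resulting lookup table – can be sketched as three functions. All names and data here are invented for illustration; in the real system the first two steps would be Hadoop jobs writing to serving stores, not in-process calls.

```python
def build_categorization_model(history):
    """Weekly batch job (sketch): learn which category each item belongs to."""
    model = {}
    for user, item, category in history:
        model[item] = category
    return model

def score_user_interests(model, recent_clicks):
    """Frequent batch job (sketch): map users' recent clicks to categories."""
    interests = {}
    for user, item in recent_clicks:
        category = model.get(item)
        if category:
            interests.setdefault(user, []).append(category)
    return interests

def serve_homepage(interests, user):
    """Serving path: a cheap table lookup, no model evaluation per request."""
    return interests.get(user, ["default"])

model = build_categorization_model(
    [("u1", "item-a", "sport"), ("u2", "item-b", "finance")])
interests = score_user_interests(
    model, [("u1", "item-a"), ("u1", "item-b"), ("u3", "item-z")])
print(serve_homepage(interests, "u1"))
```

The design point is the split: expensive computation runs on the Hadoop clusters on their own schedules, while the request path only ever touches precomputed tables.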
    24. Conclusions
        Hadoop can live alongside existing BI systems – as a data refinery:
        • Store and refine bulk & unstructured data
        • Archive data for long-term analysis
        • Support ad-hoc queries over bulk data
        • Become the data-science platform
    25. Thank You! Questions & Answers
        hortonworks.com/download