Bigger Data For Your Budget
How to turn your Big Data into Big Insights without breaking the bank.

  • Big Data is made up of traditional structured data in databases, but increasingly it’s also coming in from unstructured sources – server logs, sensor logs, raw transaction logs – and if you’re analyzing Twitter for market sentiment or searching the web for signs of terrorist plots, you’re going to be digging through reams of Human-Quality Input.
  • Where’s it coming from? As computers and networks speed up, their ability to capture and store more of what’s happening in the real world has gone up, and it’s kicked off a feedback loop. As high-speed trading has taken over the finance industry, the volume of transactions has skyrocketed. More scientific data points were generated in the last five years than in the previous 100,000 years of human existence, and that’s likely to be true again in five years. And it’s not just MIT and Wall Street. We’re increasingly living our lives through machines that can capture and aggregate more of our actions than ever before.
  • Data, meaning structured or unstructured information collected and stored in computing systems, is increasing exponentially.
  • Big Data is literally promising to cure cancer and fight off drug-resistant tuberculosis. It found the Higgs Boson, and it’s going to find life on other planets. And of course it promises to let you see deeper into your business: insights into real-world problems that we never before had the data to collect, or the tools to analyze. Understanding everything the way Amazon understands your taste in movies. Google can track the flu better than the CDC. Big Data is promising to be a kind of Magical Insight Portal.
  • Image courtesy of http://www.greenbookblog.org. Of course, magic doesn’t pay the bills, so the question is: what can big data do for your business? I’d like to start with a very simple example:
  • Let’s say you’re a regional retail giant with an inventory system that tracks all of the transactions, then batch-processes them for your chief inventory manager overnight. Let’s say a radio DJ in Framingham plugs Widget A, and suddenly your Framingham location is sold out by 11 AM. Your inventory guy won’t find out about the unexpected spike until the next morning, and it’s probably day two before a truck can arrive, by which time the DJ is talking about something else. And that’s sort of okay, right? Waking up to discover your sales were through the roof yesterday is a sort of nice, 1990s-style victory.
  • But instead of overnight, let’s restructure our processing with Big Data techniques to run on an hourly cycle. The system can tell that Framingham is selling through widgets faster than normal by 10 AM, and it knows they’re out by noon. Before noon, the inventory guy gets an alert on his HTML5 dashboard and an email on his phone, and he’s got a truck en route from the warehouse in time to restock the shelves the next morning. He’s cut his response time from 24 hours to one, and he’s restocked the shelves in hours instead of days. Most importantly, you doubled your sales of Widget A. (A minimal sketch of such an hourly rollup job appears after these notes.)
  • The big data challenge is twofold: Collecting and storing the data, and then chewing through it to produce the valuable insights.
  • Existing solutions work great, but they’re costly: custom “enterprise”-grade hardware (which is code for expensive) running expensive licensed software. The regional retail giant can’t justify that kind of spend just to run its inventory reports hourly.
  • Here’s the promise we’re delivering today. You can have the same insights into your accumulating data at a fraction of the price.
  • Scaling for the same budget requires a paradigm shift. Enter Hadoop: free and open-source software running on commodity hardware like you’d pick up at Best Buy (a slight exaggeration). On a commodity hardware budget, the retail inventory system can run hourly, allowing dramatically faster reaction to inventory events.
  • It’s not just retail, and not just speeding processes up. Let’s review a couple of other use cases.
  • Still planning on having a better analogy for Wednesday. This one is really growing on me though.
  • I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop. What we now know as Hadoop really started back in 2005, when Eric Baldeschwieler – known as “E14” – started a project to build a large-scale data storage and processing technology that would allow Yahoo to store and process massive amounts of data to underpin its most critical application, Search. The initial focus was on building out the technology – the key components being HDFS and MapReduce – that would become the Core of what we think of as Hadoop today, and continuing to innovate it to meet the needs of this specific application. By 2008, Hadoop usage had greatly expanded inside Yahoo, to the point that many applications were using this data management platform, and as a result the team’s focus extended to include Operations: now that applications were propagating around the organization, sophisticated capabilities for operating it at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large-scale data processing and storage applications, necessitating a focus on operations to support what was by now a large variety of critical business applications. In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off and with an objective of facilitating it, the core team left – with the blessing of Yahoo – to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the Enterprise capabilities that would enable a larger number of organizations to adopt and expand their usage of Hadoop. [Note: if useful as a talk track, Cloudera was formed in 2008, well BEFORE the operational expertise of running Hadoop at scale was established inside Yahoo.]
  • At Hortonworks today, our focus is very clear: we develop, distribute and support a 100% open source distribution of Enterprise Apache Hadoop. We employ the core architects, builders and operators of Apache Hadoop and drive the innovation in the open source community. We distribute the only 100% open source Enterprise Hadoop distribution: the Hortonworks Data Platform. Given our operational expertise running some of the largest Hadoop infrastructure in the world at Yahoo, our team is uniquely positioned to support you. Our approach is also uniquely endorsed by some of the biggest vendors in the IT market. Yahoo is an investor, a customer, and most importantly a development partner: we partner to develop Hadoop, and no distribution of HDP is released without first being tested on Yahoo’s infrastructure, using the same regression suite they have used for years as they grew to have the largest production cluster in the world. Microsoft has partnered with Hortonworks to include HDP in both their off-premise offering on Azure and their on-premise offering under the product name HDInsight; this also includes integration with Visual Studio for application development and with System Center for operational management of the infrastructure. Teradata includes HDP in their products in order to provide the broadest possible range of options for their customers.
  • So how does this get brought together into our distribution? It is really pretty straightforward, but also very unique. We start with this group of open source projects that I described and that we are continually driving in the OSS community. [CLICK] We then package the appropriate versions of those open source projects, integrate and test them using a full suite (including all the IP for regression testing contributed by Yahoo), and [CLICK] contribute all of the bug fixes back to the open source tree. From there, we package and certify a distribution in the form of the Hortonworks Data Platform (HDP) that includes both Hadoop Core and the related projects required by the Enterprise user, and provide it to our customers. Through this application of Enterprise software development process to the open source projects, the result is a 100% open source distribution that has been packaged, tested and certified by Hortonworks – and that is 100% in sync with the open source trees.
  • At its core, Hadoop is about HDFS and MapReduce: two projects, for distributed storage and data processing, that are the underpinnings of Hadoop. (A minimal sketch of the HDFS client API appears after these notes.) In addition to Core Hadoop, we must identify and include the requisite “Platform Services” that are central to any piece of enterprise software. These include High Availability, Disaster Recovery, Security, etc., which enable use of the technology for a much broader (and mission-critical) problem set. This is accomplished not by introducing new open source projects, but rather by ensuring that these aspects are addressed within the existing projects.
  • Beyond Core and Platform Services, we must add a set of Data Services that enable the full data lifecycle: capabilities to store, process, and access data. For example: how do we maintain the consistent metadata required to determine how best to query data stored in HDFS? The answer is a project called Apache HCatalog. Or how do we access data stored in Hadoop from SQL-oriented tools? With projects such as Hive, the de facto standard for accessing data stored in HDFS (see the JDBC sketch after these notes). All of these are broadly captured under the category of “data services”.
  • Any data management platform operated at any reasonable scale requires a management technology – for example, SQL Server Management Studio for SQL Server, or Oracle Enterprise Manager for Oracle DB. Hadoop is no exception, and there it means Apache Ambari, which is increasingly being recognized as foundational to the operation of Hadoop infrastructures. It allows users to provision, manage and monitor a cluster, and provides a set of tools to visualize and diagnose operational issues (a sketch of its REST API appears after these notes). There are other projects in this category (such as Oozie), but Ambari is really the most influential.
  • And finally, because any enterprise runs a heterogeneous set of infrastructures, we ensure that HDP runs on your choice of infrastructure. Whether this is Linux, Windows (HDP is the only distribution certified for Windows), on a cloud platform such as Azure or Rackspace, or in an appliance, we ensure that all of them are supported and that this work is all contributed back to the open source community.
  • In summary, by addressing these elements we can provide an Enterprise Hadoop distribution that includes the Core, Platform, Data, and Operational Services required by the Enterprise user. All of this is done in 100% open source, tested at scale by our team (together with our partner Yahoo) to bring Enterprise process to an open source approach. And finally, this is the distribution endorsed by the ecosystem to ensure interoperability in your environment.
  • While overly simplistic, this graphic represents what we commonly see as a general data architecture: a set of data sources producing data; a set of data systems to capture and store that data, most typically a mix of RDBMS and data warehouses; and a set of applications that leverage the data stored in those systems. These could be packaged BI applications (Business Objects, Tableau, etc.), Enterprise Applications (e.g. SAP), or Custom Applications (e.g. custom web applications), ranging from ad-hoc reporting tools to mission-critical enterprise operations applications. Your environment is undoubtedly more complicated, but conceptually it is likely similar.
  • As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop – and HDP in particular – being introduced as a complement to the traditional approaches. It is not replacing the database; as a complement, it must integrate easily with existing tools and approaches. This means it must interoperate with: existing applications such as Tableau, SAS, Business Objects, etc.; existing databases and data warehouses, for loading data to and from the warehouse; the development tools used for building custom applications; and the operational tools used for managing and monitoring.
  • It is for that reason that we focus on HDP interoperability across all of these categories. Data systems: HDP is endorsed by and embedded with SQL Server, Teradata, and more. BI tools: HDP is certified for use with the packaged applications you already use, from Microsoft to Tableau, MicroStrategy, Business Objects, and more. Development tools: for .NET developers, Visual Studio – used to build more than half the custom applications in the world – certifies with HDP so Microsoft app developers can build custom apps with Hadoop; for Java developers, Spring for Apache Hadoop enables quick and easy development of Hadoop-based applications with HDP. Operational tools: integration with System Center and with Teradata Viewpoint.
  • Across all of our user base, we have identified just three usage patterns – sometimes more than one is used in concert during a complex project, but the patterns are distinct nonetheless. These are Refine, Explore, and Enrich. The first of these, the Refine case, is probably the most common today. It is about taking very large quantities of data and using Hadoop to distill it down into a more manageable data set that can then be loaded into a traditional data warehouse for use with existing tools. This is relatively straightforward and allows an organization to harness a much larger data set for its analytics applications while leveraging its existing data warehousing and analytics tools. Using the graphic here: in step 1, data is pulled from a variety of sources; in step 2 it lands in the Hadoop platform; and in step 3 it is loaded into a data warehouse for analysis by existing BI tools.
  • A second use case is what we refer to as Data Exploration – the use case most commonly in question when people talk about “Data Science”. In simplest terms, it is about using Hadoop as the primary data store rather than performing the secondary step of moving data into a data warehouse. To support this use case you’ve seen all the BI tool vendors rally to add support for Hadoop – and most commonly HDP – as a peer to the database, allowing rich analytics on extremely large datasets that would be both unwieldy and costly in a traditional data warehouse. Hadoop allows for interaction with a much richer dataset, and it has spawned a whole new generation of analytics tools that rely on Hadoop (HDP) as the data store. To use the graphic: in step 1, data is pulled into HDP; in step 2 it is stored and processed; and in step 3 it is surfaced directly into the analytics tools for the end user.
  • The final use case is called Application Enrichment. This is about incorporating data stored in HDP to enrich an existing application – for example, an online application in which we want to surface custom information to a user based on their particular profile. If a user has been searching the web for information on home renovations, in the context of your application you may want to use that knowledge to surface a custom offer for a related product that you sell. Large web companies such as Facebook are very sophisticated in their use of this approach. In the diagram: data is pulled from disparate sources into HDP in step 1, stored and processed in step 2, and then applications interact with it directly in step 3, typically in a bi-directional manner (e.g. request data, return data, store response). A sketch of that read/write loop against a NoSQL store follows these notes.
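
To ground the hourly-cycle retail example in the notes above, here is a minimal sketch of the kind of job it implies: a Java MapReduce program that sums units sold per store and SKU from an hour’s transaction logs. Everything concrete here is an assumption for illustration – the CSV layout (“store,sku,quantity,timestamp”), the class names, and the HDFS paths are not from the deck.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HourlySalesRollup {

  // Mapper: emits ("store,sku", quantity) for each transaction line.
  // Assumes clean CSV input of the form "store,sku,quantity,timestamp".
  public static class SalesMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length >= 3) {
        context.write(new Text(fields[0] + "," + fields[1]),
                      new IntWritable(Integer.parseInt(fields[2].trim())));
      }
    }
  }

  // Reducer (also used as combiner): totals units sold per (store, SKU).
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) {
        total += v.get();
      }
      context.write(key, new IntWritable(total)); // units sold this hour
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "hourly-sales-rollup");
    job.setJarByClass(HourlySalesRollup.class);
    job.setMapperClass(SalesMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /sales/raw/2013-06-12-10
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /sales/hourly/2013-06-12-10
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Scheduled hourly (via cron, Oozie, or similar), each run reads the previous hour’s log directory and writes a small rollup that a dashboard or alerting script can compare against normal sell-through rates.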
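For the “distributed storage” half of the core described in the HDFS/MapReduce note, here is a minimal sketch of the HDFS Java client API; the NameNode address and file path are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // assumed address
    FileSystem fs = FileSystem.get(conf);

    // Write a file; HDFS replicates its blocks across the cluster for us.
    Path path = new Path("/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello, distributed storage\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back through the same API.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
    fs.close();
  }
}
```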
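The SQL access path the data-services note attributes to Hive can be sketched with Hive’s standard JDBC driver (HiveServer2). The host, credentials, and the sales table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver.example.com:10000/default", "user", "");
         Statement stmt = conn.createStatement();
         // An ordinary SQL query, executed as distributed work on the cluster.
         ResultSet rs = stmt.executeQuery(
             "SELECT store, SUM(quantity) AS units "
                 + "FROM sales GROUP BY store ORDER BY units DESC")) {
      while (rs.next()) {
        System.out.println(rs.getString("store") + "\t" + rs.getLong("units"));
      }
    }
  }
}
```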
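As a taste of the Ambari REST APIs mentioned in the operations note, a hedged sketch that lists clusters with a single authenticated GET; the host and admin credentials are placeholders, though /api/v1/clusters is part of Ambari’s public API.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariClustersExample {
  public static void main(String[] args) throws Exception {
    // Assumed Ambari server host/port and default admin credentials.
    URL url = new URL("http://ambari.example.com:8080/api/v1/clusters");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    String auth = Base64.getEncoder()
        .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
    conn.setRequestProperty("Authorization", "Basic " + auth);

    // The API returns JSON describing each cluster Ambari manages.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```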
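Finally, the request-data / return-data / store-response loop in the enrichment note might look like this against HBase, the NoSQL store shown on the corresponding slide. This is a sketch under assumed names: the user_profiles table, its p column family, and the qualifiers are invented for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ProfileEnrichment {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table profiles = conn.getTable(TableName.valueOf("user_profiles"))) {

      // Request data: look up the interest category computed by batch jobs.
      Result row = profiles.get(new Get(Bytes.toBytes("user-42")));
      byte[] interest = row.getValue(Bytes.toBytes("p"), Bytes.toBytes("interest"));
      if (interest != null) {
        System.out.println("Surface offer for: " + Bytes.toString(interest));
      }

      // Store response: write back whether the user clicked the offer.
      Put click = new Put(Bytes.toBytes("user-42"));
      click.addColumn(Bytes.toBytes("p"), Bytes.toBytes("last_offer_clicked"),
          Bytes.toBytes("true"));
      profiles.put(click);
    }
  }
}
```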

Bigger Data For Your Budget: Presentation Transcript

  • Dave Porter – SproutCore Architect, Appnovation (davep@appnovation.com). Bigger Data For Your Budget: How to turn your Big Data into Big Insights without breaking the bank. CANADIAN HEADQUARTERS: 152 West Hastings Street, Vancouver BC, V6B 1G8. UNITED STATES OFFICE: 3414 Peachtree Road, #1600, Atlanta, Georgia 30326-1164. UNITED KINGDOM OFFICE: 3000 Hillswood Drive, Hillswood Business Park, Chertsey KT16 0RS, UK. www.appnovation.com | info@appnovation.com
  • Speakers: John Kreisa – VP Marketing, Hortonworks; Dave Porter – SproutCore Architect, Appnovation Technologies
  • Appnovation is one of the world’s TOP OPEN SOURCE DEVELOPMENT SHOPS.
  • LOCATIONS – VANCOUVER OFFICE: 152 West Hastings Street, Vancouver BC, V6B 1G8. ATLANTA OFFICE: 3414 Peachtree Road, #1600, Atlanta, Georgia 30326-1164. LONDON OFFICE: 3000 Hillswood Drive, Hillswood Business Park, Chertsey KT16 0RS, UK
  • Bigger Data For Your Budget
  • WHAT IS BIG DATA? Databases, server logs, raw transactional data, Human-Quality Input
  • WHERE IS IT COMING FROM? Website traffic patterns, financial transactions, science, people
  • THE PROMISE OF BIG DATA: Curing cancer, beating XDR-TB, finding Earth 2.0 in outer space, seeing deeper into your business
  • WHAT CAN BIG DATA DO FOR ME? Retail inventory system
  • WHAT CAN BIG DATA DO FOR ME? Retail inventory system – overnight batch cycle
  • WHAT CAN BIG DATA DO FOR ME? Retail inventory system – hourly cycle
  • THE BIG DATA CHALLENGES: Collecting & storing; processing & analyzing
  • THE BIG DATA CHALLENGES: Collecting & storing …on expensive hardware; processing & analyzing …with expensive software
  • Bigger Data For Your Budget
  • BIGGER DATA FOR YOUR BUDGET: Open source software, running on commodity hardware.
  • BIGGER DATA FOR YOUR BUDGET
  • HADOOP: BIGGER DATA FOR YOUR BUDGET – Gnomes … with flashlights (and notepads)
  • HADOOP: BIGGER DATA FOR YOUR BUDGET
  • A Brief History of Apache Hadoop – 2005: Yahoo! creates a team under E14 to work on Hadoop; focus on INNOVATION. 2008: the Yahoo team extends its focus to OPERATIONS to support multiple projects & growing clusters; Yahoo! begins to operate at scale. 2011: Hortonworks created to focus on “Enterprise Hadoop” and STABILITY, starting with 24 key Hadoop engineers from Yahoo. (Timeline, 2004–2013: Apache project established; Yahoo! operates at scale; Hortonworks Data Platform.)
  • Hortonworks Snapshot – We develop, distribute and support the ONLY 100% open source Enterprise Hadoop distribution. • We distribute the only 100% open source Enterprise Hadoop distribution: Hortonworks Data Platform • We engineer, test & certify HDP for enterprise usage • We employ the core architects, builders and operators of Apache Hadoop • We drive innovation within Apache Software Foundation projects • We are uniquely positioned to deliver the highest quality of Hadoop support • We enable the ecosystem to work better with Hadoop. Develop, Distribute, Support. Endorsed by strategic partners. Headquarters: Palo Alto, CA. Employees: 180+ and growing. Investors: Benchmark, Index, Yahoo
  • Hortonworks Process for Enterprise Hadoop – Upstream community projects (Apache Hadoop, Apache Pig, Apache HBase, Apache Hive, Apache HCatalog, Apache Ambari, other Apache projects): design & develop, release. Downstream enterprise product (Hortonworks Data Platform): integrate & test, package & certify, test & patch, distribute, with fixed issues contributed back upstream. No lock-in: an integrated, tested & certified distribution lowers risk by ensuring close alignment with Apache projects. A virtuous cycle when development & fixes are done upstream and stable project releases flow downstream.
  • Enhancing the Core of Apache Hadoop – Deliver high-scale storage & processing with enterprise-ready platform services. Unique focus areas: • Bigger, faster, more flexible: continued focus on speed & scale and enabling near-real-time apps • Tested & certified at scale: ~1300 system tests run on large Yahoo clusters for every release • Enterprise-ready services: high availability, disaster recovery, snapshots, security, … Hortonworkers are the architects, operators, and builders of core Hadoop. HADOOP CORE: Distributed Storage & Processing. PLATFORM SERVICES: Enterprise Readiness.
  • Data Services for Full Data Lifecycle – Provide data services to store, process & access data in many ways. Unique focus areas: • Apache HCatalog: metadata services for consistent table access to Hadoop data • Apache Hive: explore & process Hadoop data via SQL & ODBC-compliant BI tools. Hortonworks enables Hadoop data to be accessed via existing tools & systems. DATA SERVICES: Store, Process and Access Data.
  • Operational Services for Ease of Use – Include complete operational services for productive operations & management. Unique focus area: • Apache Ambari: provision, manage & monitor a cluster; complete REST APIs to integrate with existing operational tools; job & task visualizer to diagnose issues. Only Hortonworks provides a complete open source Hadoop management tool. OPERATIONAL SERVICES: Manage & Operate at Scale.
  • Deployable Across a Range of Options – Only Hortonworks allows you to deploy seamlessly across any deployment option: • Linux & Windows • Azure, Rackspace & other clouds • Virtual platforms • Big data appliances. HORTONWORKS DATA PLATFORM (HDP): Hadoop Core, Platform Services, Data Services, and Operational Services on your choice of OS, cloud, VM, or appliance.
  • HDP: Enterprise Hadoop Distribution – Hortonworks Data Platform (HDP), Enterprise Hadoop: • The ONLY 100% open source and complete distribution • Enterprise grade, proven and tested at scale • Ecosystem endorsed to ensure interoperability
  • Existing Data Architecture (diagram) – DATA SOURCES (traditional sources: RDBMS, OLTP, OLAP; OLTP & POS systems) feed DATA SYSTEMS (traditional repos: RDBMS, EDW, MPP), which serve APPLICATIONS (business analytics, custom applications, enterprise applications), alongside operational tools (manage & monitor) and dev & data tools (build & test).
  • An Emerging Data Architecture (diagram) – the same picture with new sources (web logs, email, sensor data, social media, mobile data) flowing into the HORTONWORKS DATA PLATFORM alongside the traditional repos.
  • Interoperating With Your Tools (diagram) – the HORTONWORKS DATA PLATFORM connects traditional repos, dev & data tools, operational tools (e.g. Teradata Viewpoint), and Microsoft applications, fed by the same traditional and new data sources.
  • Hadoop Patterns of Use (diagram) – Big Data (transactions, interactions, observations) flows through the HORTONWORKS DATA PLATFORM to the business case via three patterns: Refine, Explore, Enrich.
  • Operational Data Refinery (Refine) – Collect data and apply a known algorithm to it in a trusted operational process. 1 Capture: capture all data from traditional and new sources. 2 Process: parse, cleanse, apply structure & transform. 3 Exchange: push to the existing data warehouse (RDBMS, EDW, MPP) for use with existing analytic tools.
  • Big Data Exploration & Visualization (Explore) – Collect data and perform iterative investigation for value. 1 Capture: capture all data. 2 Process: parse, cleanse, apply structure & transform. 3 Exchange: explore and visualize with analytics tools supporting Hadoop.
  • Application Enrichment (Enrich) – Collect data, analyze and present salient results for online apps. 1 Capture: capture all data. 2 Process: parse, cleanse, apply structure & transform. 3 Exchange: incorporate data directly into applications (custom & enterprise apps, NoSQL stores).
  • Speakers: John Kreisa – VP Marketing, Hortonworks; Dave Porter – SproutCore Architect, Appnovation Technologies
  • Next Steps – LEARN: Hortonworks.com/sandbox, Hortonworks.com/hadoop-training, @hortonworks, @hortonworks_U. Blog: Appnovation.com/Blog, @Appnovation. Contact: DaveP@Appnovation.com, JKriesa@Hortonworks.com
  • Thank You For Your Participation! CANADIAN HEADQUARTERS: 152 West Hastings Street, Vancouver BC, V6B 1G8. UNITED STATES OFFICE: 3414 Peachtree Road, #1600, Atlanta, Georgia 30326-1164. UNITED KINGDOM OFFICE: 3000 Hillswood Drive, Hillswood Business Park, Chertsey KT16 0RS, UK. www.appnovation.com | info@appnovation.com