First, let me explain what Hadoop is: Apache Hadoop is an open-source software project, originally developed at Yahoo! and inspired by papers published by Google. It enables the distributed processing of large data sets across clusters of commodity servers. Hadoop provides an inexpensive, massively scalable solution for storing structured and unstructured raw data. The Hadoop Data Reservoir is a vision of what Hadoop can be for your enterprise.
Before I go any further, I'd like to describe what the HDR is not, since this sometimes causes confusion.
Upfront planning: What data will we collect? How will the data be modeled to answer our business questions? How will we make access to the data fast for all of our users? (The questions are almost endless.)
Ongoing maintenance: When will we refresh the data in the EDW? When datasets change, do we start over?
Self-service: It should be obvious that EDWs are the domain of the IT team. But the vision of the HDR implies self-service. When we see what is required, we'll see that this is no easy task.
How did we come to the concept of the HDR? The vision came out through the interviews.
Story: We developed a script of questions. People were at different places in their adoption cycle. These were not data scientists, and not people who had built their application on Hadoop (like LinkedIn's "People You May Know"). They were a cross-section of industries: online media, financial services (banks and credit cards), federal government, retail, ecommerce, etc.
But the reality was that none of these interviewees had reached the vision of the HDR. In fact, this is my image of the folks we were talking to. Talk about the enlightened IT user.
What is the thing that goes in between the HDR and the end user?
The challenges with the Hadoop Data Reservoir:
Missing link between the massive amount of raw data stored in the HDR and access for business users.
Access has been self-limited to expert users who know data modeling and SQL.
IT teams must perform expensive ad-hoc data extractions into existing infrastructure.
Access to the data in the HDR must be high-performance, self-service, and secure.
Should be about 1:40pm
Despite data size, queries must be fast. It's not that queries just needed to be fast; they needed to be consistently fast. Modern tools require the ability to ask successive questions. As the centralized resource, you have many, many questions being asked at once. The problem is that when someone asks the wrong question in Hadoop, it impacts everyone.
Explain the media company data. They want a 360-degree view of the customer on their site. A straightforward question such as the one posed here potentially requires touching tens of billions of records to process the answer.
Highly scalable architecture. Merv Adrian, a few months ago: "One of the biggest technical challenges for BI in the Big Data era is deciding what is in memory. Fractal Cache does that efficiently and automatically."
"The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains." – Ralph Kimball
You've heard of "drilling down" on something, or even drilling up. Use the example of Region -> States -> Metro -> City -> Stores.
Back to the Netflow example of our interviewee. He had 26B rows of raw data in Hadoop, per month. We built aggregate tables which reduced the grain and removed dimensionality, and made our work really fast.
But what happens if, in our self-service Data Reservoir, the end user wants to get more detail from the raw data in Hadoop? We can't just query it directly, because it will take too long, and I won't have a rich set of metrics or dimensions to use to answer questions. I need to be able to drill through the aggregation. And since the HDR is self-service, I need to be able to do this without involving my colleagues in IT.
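As a minimal sketch of what building an aggregate table over the Netflow data means (all record values here are invented for illustration; the real data is billions of rows in Hadoop), we can roll raw flows up to per-minute totals, dropping the IP dimensions:

```python
from collections import defaultdict

# Hypothetical raw netflow records: (minute, src_ip, dst_ip, bytes).
# Three rows are enough to show how aggregation reduces grain and
# removes dimensionality; the real reservoir held ~26B rows/month.
raw_flows = [
    ("10:03", "10.0.0.1", "10.0.9.9", 500),
    ("10:03", "10.0.0.2", "10.0.9.9", 700),
    ("10:04", "10.0.0.1", "10.0.9.7", 200),
]

def build_aggregate(records):
    """Roll flows up to one summary row per minute, dropping the
    source/destination IP dimensions, keeping flow count and bytes."""
    summary = defaultdict(lambda: {"flows": 0, "bytes": 0})
    for minute, _src, _dst, nbytes in records:
        summary[minute]["flows"] += 1
        summary[minute]["bytes"] += nbytes
    return dict(summary)

print(build_aggregate(raw_flows))
# {'10:03': {'flows': 2, 'bytes': 1200}, '10:04': {'flows': 1, 'bytes': 200}}
```

Three raw rows collapse into two summary rows here; at billions of rows, that reduction is what makes interactive query speeds possible.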
Example of making sure data doesn’t get away.
Platfora addresses the challenges of the HDR with the Interest-Driven Pipeline.
Platfora software instantly transforms raw data in Hadoop into interactive, in-memory business intelligence. No ETL or data warehouse required.
Platfora is a full stack of technology that spans from raw data in the Hadoop Data Reservoir all the way to BI and analytics for the end user. In the past this would have required at least three separate products. Platfora is the first product to completely rebuild the traditional business analytics stack from the ground up.
Platfora is made of three components, and none of these is more important than another; they all work together seamlessly.
Platfora puts a very pretty face on Hadoop: a stunningly beautiful web-based BI interface. MAKES HADOOP DATA BEAUTIFUL.
A scale-out, in-memory data processing engine. MAKES HADOOP DATA FAST.
Platfora drives Hadoop as its work engine, automatically generating and pushing jobs to Hadoop to do the heavy lifting without needing experts. MAKES HADOOP USABLE.
These components work together: based on what the user needs in the BI layer, the Lenses are automatically refined, and the Hadoop data refinery does the heavy lifting without needing programming.
Story: as we were working on the early designs for the product, we thought about the old world that users were complaining about: three separate layers, each with heavy expert intervention in between. It reminded us of the way phones used to work. Remember managing contacts? iPhone analogy: vertically integrated.
Outline
• What is the Hadoop Data Reservoir (HDR)?
• Requirements and Solutions
• Hadoop Data Reservoir in Practice
• Demo
• Q&A
What is the Hadoop Data Reservoir (HDR)?
• Central Hadoop cluster for the enterprise
• Serves as the Storage and the Source of data for self-service business analytics
• Provides Processing for data preparation and advanced analytics
The Hadoop Data Reservoir eliminates data silos, reduces costs, and makes business analytics agile.
HDR is Not a Replacement for the EDW
• EDWs require upfront planning
• EDWs require major ongoing IT maintenance and staffing
• EDWs are not self-service
HDR Origin: Interviews with Enterprise IT
• Platfora interviewed over 200 enterprise IT professionals working with Hadoop
• Summer 2011 through early 2012
• Topic of interview: challenges using Hadoop for business intelligence & analytics
What is Your Vision for Hadoop?
• “I want Hadoop to be the central repository of all the data people need.”
• “We shouldn’t have to plan too much before we store data.”
• “Cost should only be a minor factor in how long we kept data around.”
• “I want to give everyone access to the data and break down the existing silos. But it needs to be secure.”
• “IT would not have to be involved in day-to-day management.”
Out on a Limb
“I’m a bit out on a limb here. I pushed to use Hadoop to collect data that we were dropping before. But now it’s taking way more time to make use of it than I expected.”
The Missing Link to HDR
[Diagram: Unbounded Hadoop Data Reservoir → Flexible “Software-Defined” Data Marts → Automatic / Fast / Iterative Web-based Business Intelligence]
Performance, Self-Service, and Security
Queries Must Be Consistently Fast
• Modern BI applications are driving more and more queries all the time
• A single HDR user should not be able to impact other users simply because they asked the wrong question
[Figure: Modern Data Discovery BI — each move results in a new query]
“We’re addicted to sub-second. If it takes longer than that for any reason, something is wrong.”
Most Queries are Straightforward, but Big
“What’s the trend of female visitors clicking on ads on the autos channel over time?”
[Diagram: Traffic Logs, Advertising Logs, Clicks, User Demographics → Big Hadoop cluster → ???]
• 2.4 PB total, months of data
• 700M records/day, 400 GB/day
• 2B user records
Processing the answer could touch 10s of billions of records.
Solution: Aggregate Tables Stored In-Memory
• Pre-calculated summary tables, summarizing data to a coarser grain
 • Dramatically reduces data required to answer a question
 • Keeps redundant processing off the batch system (Hadoop)
• Keep summary data in memory to provide sub-second access
Finding Data in the Reservoir
• Hadoop Distributed File System (HDFS) is organized like other common file systems: a directory structure
• Datasets in HDFS could be a single file or 10,000+ files, commonly organized by directory
• Business users must be able to find data to answer their questions
[Diagram: datasets in HDFS — Sales, Shipments, Sentiment, Web Logs, Customer Info, Interactions, Demographics]
Aggregations Must Be Fully Automatic
• Building aggregate tables requires planning and up-front decisions
 • Must choose the metrics, dimensions, granularity
 • In practice, this is an iterative process, and the first attempt is usually wrong
• Aggregate tables must be maintained
 • Each time new data arrives
 • Sliding window tables (e.g., last 30 days): data in, data out
For HDR to be self-service, this must be automatic.
Drilling Through the Aggregation: Netflow Example
Raw Data in Hadoop (Slow — hours, days):
• Columns: Source IP Address, Destination IP Address, Application, Packets, Bytes
• 26B records/month, 400 GB compressed
Aggregate Tables (Fast — milliseconds):
• Columns: # of Machines, # of Flows, Total Flow Size (KB), Application
• 100 MB compressed
“What happened between 10:03-10:04am?”
Need to “drill through the aggregation” to get more detail, or add dimensionality. And it needs to be self-service.
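A hypothetical sketch of drill-through (all data and names invented): serve a question from the in-memory aggregate when it can answer it, and fall back to scanning the raw records, which at real scale would be a Hadoop job over the full-grain data, when the question needs dimensions the aggregate dropped.

```python
# Per-minute summary held in memory: answered in milliseconds.
AGGREGATE = {"10:03": {"flows": 2, "bytes": 1200}}

# Full-grain rows with the dimensions the aggregate dropped; at real
# scale these live in Hadoop and take hours or days to scan.
RAW_FLOWS = [
    {"minute": "10:03", "src": "10.0.0.1", "app": "dns",  "bytes": 500},
    {"minute": "10:03", "src": "10.0.0.2", "app": "http", "bytes": 700},
    {"minute": "10:04", "src": "10.0.0.1", "app": "http", "bytes": 200},
]

def flows_in(minute):
    """Answered straight from the in-memory aggregate."""
    return AGGREGATE[minute]["flows"]

def drill_through(minute):
    """'What happened between 10:03 and 10:04?' needs the src/app
    detail the aggregate lacks, so scan the raw data instead."""
    return [row for row in RAW_FLOWS if row["minute"] == minute]
```

For the HDR to stay self-service, the fallback has to be triggered automatically by the user's question, not by a request to IT.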
Augmenting Datasets
• Users must be able to augment data with sources outside of the HDR
 • E.g., market research or demographics
• Commonly needs to be combined at the raw level, before data is aggregated
Modern Data Security Requirements
• Hadoop provides:
 • File- and directory-based permissions (like Unix)
 • Secure authentication (via Kerberos)
• However, enterprises require a finer level of data security control:
 • Datasets – could be one or many files, spanning directories
 • Columns – datasets likely have many columns, with different security permissions
 • Rows – can span many files and directories
• Solution must abstract file-level security and enforce a finer level of control
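One way to picture enforcement above file-level permissions is a policy layer that filters columns and rows before data reaches the user. This sketch is purely illustrative; the policy shape, role name, and fields are all hypothetical:

```python
# Hypothetical dataset-level policy: the "analyst" role sees only
# three columns (no user_id, treated here as PII) and only rows
# from the autos channel.
POLICY = {
    "analyst": {
        "columns": {"timestamp", "channel", "clicks"},
        "row_filter": lambda row: row["channel"] == "autos",
    },
}

def authorized_view(role, rows):
    """Project away forbidden columns and drop forbidden rows before
    any data is returned, regardless of file-level permissions."""
    policy = POLICY[role]
    return [
        {col: val for col, val in row.items() if col in policy["columns"]}
        for row in rows
        if policy["row_filter"](row)
    ]

rows = [
    {"timestamp": 1, "channel": "autos", "clicks": 3, "user_id": "u42"},
    {"timestamp": 2, "channel": "news",  "clicks": 1, "user_id": "u43"},
]
print(authorized_view("analyst", rows))
# [{'timestamp': 1, 'channel': 'autos', 'clicks': 3}]
```

The point of the abstraction is that datasets, columns, and rows are the units of control, even though the underlying storage only understands files and directories.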
Strong and Secure; Collaborative Sharing
• In a self-service model, security must be strong and clear
 • End-users will need to understand what they can access and what they can’t
 • Security administrators must be able to enforce security centrally, down to the raw data
• As a centralized system, HDR must integrate with directory services for authentication and group membership
Platfora: Interest-Driven Pipeline™
[Diagram: Unbounded Hadoop Data Reservoir → Flexible “Software-Defined” Data Marts → Automatic / Fast / Iterative Web-based Business Intelligence]
Performance, Self-Service, and Security
Edmunds.com
• Beta participant since January 2013
• Moved to Hadoop because of explosive data growth and promise of agility
 • Web, mobile, visitor demographic data
• Use case: optimize the matching of visitors with the cars they are looking for
 • Correlating browsers with the cars they are actually buying
• Platfora has made big data accessible to the business
 • Increased access from 5 to 50 users
 • Decreased time to value from months to hours
[Sidebar] Founded in 1966 “for the purpose of publishing new and used automotive pricing guides to assist automobile buyers.” Online innovators: first auto information website; True Market Value®, True Cost to Own®, and My Car Match.
“Before, if we wanted access to Hadoop data, we wouldn’t even try. With Platfora our analysts can access anything they need.”
Introducing Platfora’s Integrated Platform
• Vizboard: web-based business intelligence application
• Lens: scale-out, in-memory data mart & processing engine
• Dataset: automated Hadoop data refinery
Powerful closed-loop analysis of big data
Summary
• The Hadoop Data Reservoir vision is driven by the requirements of enterprise Hadoop users
• HDR eliminates data silos, reduces costs, and makes business analytics agile
• To make HDR a reality, it needs to provide:
 • Performance
 • Self-service
 • Security