Big Data - How to Get Started


This webinar provides a step-by-step approach to getting started with a Big Data solution and what to expect in the first 90 days, including 10 steps to starting your Big Data project, 5 critical mistakes, and 2 success stories.

Published in: Technology
  • Kurt Lueck has over 20 years of experience in the Business Intelligence and Analytics field. During his consulting career he has worked with over 40 organizations in multiple industries on a variety of technologies. In his current role, Mr. Lueck manages the BI & Analytics practice for Pactera.
  • Good afternoon, and good morning on the West Coast. I appreciate everyone's attendance and sincerely hope that you gather some valuable information and insights from our presentation today. This is a very exciting topic. As promised, we have our 10 steps, 5 critical mistakes, and 2 success stories, but I also wanted to start with some quick definitions, drivers, and key predictions. This presentation was built as a primer, and we will be having several follow-up presentations in the coming weeks and months that dive deeper into industries (Financial Services and Retail, for example) and particular vendor solutions (for example: What is Oracle's Big Data solution?). Let's get started.
  • OK, I feel obligated to start with the 3 V's. The definition of Big Data has been a work in progress over the past few years. The established definition at this point always has the 3 V's somewhere in the mix. Most recently I have seen another V mentioned, but first the traditional 3 V's. Volume: This is probably the most mentioned. The sheer volume of data has been the biggest driver. Velocity: As the saying goes, speed kills. Social media put the bullet in most traditional attempts for retail organizations. Other industries, such as our Energy client, are getting overwhelmed by Smart Grid initiatives. Each industry has its own issues from some new technology. Variety: If it were just traditional data, there probably would not be a necessity for any of this discussion. However, the fact is we have many different types of data that are simply not handled well in traditional Oracle/DB2/SQL databases. Sure, they can store them, but they cannot do anything with them very efficiently. What is the fourth V? Value.
  • There is a general consensus that there are 3 drivers of Big Data. The first is something called Dark Data. This is the data that we stored because we had to or wanted to, but never used. The thought was that we had better store it, and at some point we might get some value out of it. We never did. This data volume has kept increasing.
  • Driver #2: Enterprise organizations have been slowly adding data to their enterprise data warehouses, but unfortunately at too slow a pace. Business cannot wait. The introduction of new technologies has just piled onto the problem. I was recently at a large bank and the CTO was explaining that the current state simply cannot continue: we cannot throw more people and more servers at this problem; we have to think about this differently. The sheer cost of adding one more Teradata platform was becoming too expensive. If you looked at the cost on a graph, it was obvious it was out of control.
  • The last driver is the variety of data. If the data were always a set of nice neat columns of text and numbers, we would all be in better shape. XML a few years ago was meant to help take semi-structured data and put it into a nice relational format. Unfortunately, the ball is so far off the track now with the variety of data that a NEW solution is required. Social media is a case in point. The information is completely unstructured, yet it holds HUGE value. What about audio and video?
  • I am always doing research and thought these predictions seemed very relevant to our presentation today. These are straight from Gartner. I won't read all of these predictions, but the bottom line is that Big Data IS in a hype cycle... but it IS here to stay. I was recently at the TDWI Conference on Big Data and the group was reminded that there have been a number of terms that in the beginning were used in front of every product, such as WEB-ENABLED. Today that capability is simply assumed. Big Data is here to stay for a number of reasons. Enterprise clients MUST engage Big Data as a competitive advantage today and later as an equalizer. The last point that I want to drive home is the number of jobs that will go unfilled in the Big Data arena. If you have any college-age kids, this is where you should push them. However, I believe it takes a very science-oriented mind to really engage in this profession.
  • This presentation was built as a primer, so I want to ensure that everyone gets some basic terms and some advanced ones as well. Let's talk about the basics of Hadoop. It consists of two basic components: the Hadoop Distributed File System (HDFS) and MapReduce. "Hadoop" is used generically to discuss the Big Data platform. More specifically, at its base Hadoop is a distributed file system. It's actually not a database, but it does share some commonality with one. Hadoop is the starting block for any Big Data solution. See the picture. MapReduce, according to Wikipedia, is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers. See the picture.
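To make the MapReduce programming model a little more concrete, here is a minimal single-machine sketch of its three phases (map, shuffle, reduce) applied to a word count. It is plain Python and does not use Hadoop at all; the function names and the sample input are assumptions chosen purely for illustration. On a real cluster the framework would run many map and reduce tasks in parallel across servers and perform the shuffle itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map over every split, then shuffle, then reduce."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input: each string stands in for one block of a large file.
splits = ["big data is here", "big data is big"]
counts = word_count(splits)
print(counts)
```

The point of the split into phases is that each map call only sees its own piece of the data and each reduce call only sees one key, which is what lets the work be spread across many machines.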
  • Most IT departments are simply feeling overwhelmed by the amount of data and the amount of pressure from the business to combine data to provide business insight. This can be an incredibly exciting opportunity for IT and business to work together. If you can gain an understanding of what a Big Data solution looks like, then and only then will you be able to determine how Big Data can actually help. The chart on this page shows in general which areas are most positively impacted by Big Data. Financial Services, as usual, is right up front on overall volume and velocity of data. Media Services, however, has a high variety of data. As an interesting side note, Pactera has worked with Microsoft to develop solutions that read in videos and decipher them into textual, hence searchable, output. This is just one of many examples where EVERYTHING is becoming searchable: pictures, videos, blogs, and the traditional data. Action Item: Look around your enterprise and identify scenarios where combining and analyzing diverse datasets will generate substantial business value.
  • OK, so this should be obvious. Reading about Big Data in CIO or CFO magazine does NOT mean that you need a Big Data solution. I was prepped before this call on the type of participants, so I am fairly comfortable with this next statement. You all need to understand the types of Big Data solutions and, more importantly, you need to understand the business problems. Big Data business problems have the following characteristics: combining multiple types of data together; combining LARGE volumes of data; a realistic ability to create value from the project. Let me list some examples from the Financial Services industry just to drive the point home about realistic problems. Finance: customer-service context-aware sales and service; segmentation; holistic customer view. Trading: algorithmic trading; surveillance; revenue generation; next best offer. In many cases, despite your significant size, you do not have expertise in Big Data solutions and will need to engage outside firms for at least the first project. WHICH BRINGS US TO THE NEXT SLIDE.
  • The main item that I am worried about for companies is this role called a data scientist. I believe most organizations simply do not have any, or enough. What are the key roles of a data scientist? To make a Big Data project, or any analytics project, succeed, you actually need a lot of skills. I think of it as a combination of functional skills and technical skills. Most people, when they think of data scientists, think of the technical side, and their minds immediately go to analytics, which is important, but it's not the whole story. To me it's two sides: Analytics and Design. On the analytics side, it's the things around statistics, operations research, and computer science; machine learning in particular is important for data science. But then there's technology, in the sense of being able to understand systems, particularly large systems, because you need to store data all over the place in distributed form, and the ability to program: to write code that acts as the glue to put all these pieces together. The second functional area is Design. There's the design side of things, which is basically being able to create an interface to the data so people will find it usable, and there's the data side, which is data manipulation, data modeling, and data cleansing. So if I got the numbers right, there should be two functional skill sets and four technical skill sets, and all of those need to be combined to make a good data science project work. This is a LOT to ask of ONE person. I believe this set of skills comes from teams of individuals who work on projects together and use each other's strengths.
  • I love this picture because it really drives home the complexity of the marketplace. OK, let's bring some sanity to an insane marketplace of products. Hopefully this at least looks a little more reasonable visually. My recommendations for the build-out of your solution are two-fold. First, ask what problem you are attempting to solve. If the problem is so complex that it requires a very specific solution, then you may want to purchase a product that addresses that specific industry problem. Second, if possible, I would recommend a big vendor, one of the usual suspects (IBM, Oracle, Cloudera), for the base platform of your Big Data solution. I say that because this market is changing so fast, and vendors are popping up, and in some cases down, faster than in a Whack-A-Mole game.
  • Again, I put this presentation together as a primer. When I first began looking at this market, one of my first questions was what a Big Data solution looks like from a technical or data architecture perspective. My other question was: DOES Big Data replace my BI solution? This picture is taken from one of our current projects before we introduced Big Data. We worked with this client for several years to build out a world-class BI solution. Starting from the bottom of the diagram, we have our typical source systems that feed into the data warehouse. The Analytics layer for developing real business answers was usually second. This layer would usually have our predictive analytic models. The third layer was typically a BI presentation layer of some form or another. Products such as SAP BusinessObjects, Cognos, QlikView, MicroStrategy, and Spotfire would typically play here. So what does Big Data look like? (CLICK) OK, as you saw from the number of vendors that we presented, this configuration could have a whole slew of different products. This is simply an example. For this particular client we have base Hadoop, Hive, and HBase in the Data Service layer. Hadoop: Remember, Hadoop is the distributed file system that spreads the data among potentially thousands of nodes, and in most cases involves thousands of terabytes. HBase: HBase is a sub-project of the Apache Hadoop project and is used to provide real-time read and write access to your big data. The primary objective of Apache HBase is the hosting of very large tables (billions of rows by millions of columns) on top of clusters of commodity hardware. Hive: Apache Hive is a data warehouse system for the open source Apache Hadoop project. Hive features a SQL-like language, HiveQL, that facilitates data analysis and summarization for large datasets stored in Hadoop-compatible file systems. Hive enables SQL geeks to continue to be productive. In the Analytics layer we used Mahout.
This particular product is used to perform our predictive analytic searches. What does Mahout do? Glad you asked. Currently Mahout mainly supports four use cases. Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes, for example, text documents and groups them into sets of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart contents) and identifies which individual items usually appear together. Think of this as portions of SAS, but in open source form.
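As a rough illustration of what frequent itemset mining means, the sketch below counts which pairs of items show up together in at least a minimum number of baskets. This is not Mahout's API or its algorithm; it is a brute-force, single-machine stand-in in plain Python, and the function name, the `min_support` parameter, and the sample baskets are all assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that co-occur in at least `min_support` baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair is counted once per basket
        # and always appears in the same (alphabetical) order.
        items = sorted(set(basket))
        pair_counts.update(combinations(items, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical shopping carts, loosely in the spirit of a travel site.
baskets = [
    ["flight", "hotel", "car"],
    ["flight", "hotel"],
    ["hotel", "car"],
    ["flight", "hotel", "insurance"],
]
result = frequent_pairs(baskets, min_support=2)
print(result)
```

Counting every pair like this does not scale to large catalogs, which is why a distributed library is used for this kind of mining in practice; the sketch only shows the shape of the input and output.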
  • This slide discusses the 5 most common mistakes that we are seeing in the marketplace, in no particular order. 1) Lack of expertise: I am actually not referring to the Hadoop or Java expertise that is required. If that were the issue, most projects would never even get started. I am referring more to data scientist type resources. There are projections from various "critical thinking" organizations such as Gartner that show a significant shortfall of data scientists. The truth is you may have to develop this expertise internally. I would suggest you do that now. I also suggest looking at your universities and hiring graduates from analytics programs. 2) Big Data projects without a problem: We are certainly in a hype cycle around Big Data. This is natural with any technology that can be a game changer. More than likely your company does have business problems that can be assisted with Big Data solutions. The alignment between savvy business users and technology-enabled IT departments is still in the works. 3) Lack of technology alignment: By this I am referring to the fact that it is very easy to begin purchasing point Big Data solutions for one specific problem. Watch out. This same problem has been happening for years with past hype cycles. Let's get a bit smarter in this cycle. This flows directly into my next caution. 4) Develop a longer-term roadmap: If you are going to start a Big Data project, that means you will be purchasing software and may be hiring resources. Before you start, it may be time for a short Big Data strategy. Understand what happens after the first project. I am absolutely in favor of starting with a POC and starting small. However, before large investments, think through the 2-year plan. Pactera's Big Data Strategy is a quick engagement to review each major business group in an organization and look for detailed problems that may be solved by Big Data solutions.
It's a great engagement that has the outcome of a 1-2 year plan for implementing Big Data. 5) Lack of critical evaluation: I feel like this has been missing in most IT projects. At the end of the project, did we achieve the expected business goals? If the answer is no, then let's figure out why and make improvements.
  • I now want to present two business cases from real-life projects. The first project is for one of the largest online travel organizations in the world. Let's call them Acme OnLine Travel (AOL). Pactera has had a relationship with AOL for over 6 years. We built the data warehouse. We understand the business very well and, frankly, we understand the weaknesses of the BI solution. The volumes of data were so high, and the cost to maintain them was growing. The data sources for this client were everything from traditional ERP systems and clickstream data to social media such as Facebook. It's not hard to see why the volumes were high. Petabytes are the norm. Part of the main driver for this project was to reduce the cost per TB, which was running at ten thousand USD. So a few years ago we suggested to AOL that a Big Data solution was most likely necessary if we want to continue to be competitive in this industry. It started with some POCs and then moved into BUILDING ONTO the current BI system at first. We are now beginning to see the natural death of some portions of the traditional BI system. I say natural death because our business users are simply not using some of the old methods. The most interesting and hard-hitting part is the predictive analytic functions that are being built on top of the base Hadoop file system. One of the most recent changes is our move to near REAL-TIME with a newer Big Data product called Impala. Our team has been working with Impala for the past year or so, even before it was officially released. This addresses one of the CRITICAL issues with Big Data, and that is the lack of real-time capabilities.
  • Let's talk about Impala for a moment. This graph shows our own testing at this client with petabytes of data. As you can see, the performance difference between Hive and Impala is quite stark. If you know anything about traditional... What is also interesting, and what I wanted to draw out, is that DESPITE our success with Big Data at this client, a large number of people use Hadoop only to get data so that they can process it in a traditional RDBMS. A lot of this is simply because people are more comfortable, and end-user tools are more user-friendly, on relational/traditional databases. Please keep in mind that when it says FASTER on that 3rd line, it is referring to the much smaller sets of data that we are placing into the RDBMS. In conclusion, the solution provided FASTER, more intelligent insights, and the cost is down to less than 2 thousand USD per TB in Hadoop, from 10 thousand USD.
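The cost figures quoted above can be turned into a quick back-of-the-envelope calculation. The sketch below uses the two per-TB numbers from the notes (about 10 thousand USD before, under 2 thousand USD after, treated here as an upper bound); the data volume of 1,000 TB is a purely hypothetical figure picked for illustration, and the notes do not say whether the per-TB cost is one-time or recurring.

```python
def cost_reduction_fraction(old_cost_per_tb, new_cost_per_tb):
    """Fraction of the per-TB cost that was removed."""
    return (old_cost_per_tb - new_cost_per_tb) / old_cost_per_tb


OLD_COST_PER_TB_USD = 10_000    # from the notes: about ten thousand USD per TB
NEW_COST_PER_TB_USD = 2_000     # from the notes: "less than 2 thousand", used as an upper bound
HYPOTHETICAL_VOLUME_TB = 1_000  # assumption for illustration only (roughly one petabyte)

reduction = cost_reduction_fraction(OLD_COST_PER_TB_USD, NEW_COST_PER_TB_USD)
savings_usd = (OLD_COST_PER_TB_USD - NEW_COST_PER_TB_USD) * HYPOTHETICAL_VOLUME_TB

print(f"Reduction per TB: at least {reduction:.0%}")
print(f"Savings on {HYPOTHETICAL_VOLUME_TB:,} TB: at least {savings_usd:,} USD")
```

Because the new cost is stated as "less than" 2 thousand USD, both results are lower bounds on the actual improvement.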
  • The final case study that I want to present is around Retail. The picture that you are seeing is the goal of most major retailers: to drive marketing, and the eventual sale, down to something so personalized that it feels like they know the customer on a one-on-one basis. Oh, and by the way, without crossing the "Creep Factor" line. That is the line where the customer feels violated. This was the case with our client. Our client had a mix of the following types of data: store POS; web clickstream; social media; financial; a BUNCH of spreadsheets; customer satisfaction data; call center data. Just as in the last case study, the volume of data was growing and the cost to manage it was growing even faster. The project started with a POC and has now reached into several departments. Examples of business problems and projects include: customer buying behaviour; price optimization, as in changing prices on the web based on behaviour; and space planning. All of these projects were accomplished with a theory, a model, and a lot of testing. Eventually, when good models were built and TESTED significantly, the models were embedded into the client's operational systems. What I have taken away from these projects and research is how much psychology is required to be successful. This particular client is actually using Big Data solutions combined with SAS and several other traditional BI tools.
  • Big Data is not the solution. The solution is some use of technology that enables business answers. The four bullets on here represent the 4 focus areas of our BI & Analytics practice in 2013. I believe Big Data is the foundation for many of these other solutions.
  • I love this story because it is so hard-hitting... especially if you have daughters like I do. Most of you have heard the story, so I won't go into all of the details. The basic gist goes something like this. Target started a predictive analytics project that was so successful and accurate that it predicted that a father's daughter was pregnant before the father knew. Google the story to find the full version if you have not heard it. I wanted to end on this because we all have a corporate responsibility to use our technology without crossing the privacy line with our customers.

    1. 1. Primer On Getting Started With Big Data Projects Kurt Lueck January 2013
    2. 2. Contact Presenter: Kurt Lueck, Managing Director, Business Intelligence & Analytics. Email: Desk: +1.704.944.3155 x240. 6100 Fairview Road, Suite 560, Charlotte, NC 28210. Visit our website: © Pactera. Confidential. All Rights Reserved.
    3. 3. Pactera Snapshot: • NASDAQ symbol: PACT • Based in Charlotte, NC & Beijing, China • 35 offices globally / 24,000 employees • Fortune 500 clients (Financial Services, High Tech, Retail) • Focus on driving innovation (Big Data, Analytics, Mobility, Cloud Solutions)
    4. 4. Global Footprint and Flexible Delivery Capabilities: Pactera is a global company strategically headquartered in China, enabling partnership with companies seeking to leverage one of the world's largest and fastest-growing technology markets. Global FTE: 24,000 (North America & EU: 500; Asia Pacific: 1,000; Greater China: 22,500). Locations: London, Seattle, Changchun, San Francisco, Barcelona, Beijing, Dalian, Tokyo, Silicon Valley, Charlotte, Tianjin, Qingdao, Xi'an, Nanjing, Wuxi, Osaka, Atlanta, Wuhan, San Diego, Shanghai, Chengdu, Changsha, Hangzhou, Guangzhou, Taiwan, Dongguan, Hong Kong, Shenzhen, Malaysia, Singapore, Melbourne, Sydney.
    5. 5. Primer on Big Data: (1) Definitions; (2) Drivers; (3) Predictions; (4) 10 Steps to Starting Your Big Data Project; (5) 5 Critical Mistakes; (6) 2 Practical Success Stories; (7) Next Steps
    6. 6. What Is Big Data? Volume, Velocity, Variety. Big Data is high-volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
    7. 7. Driver #1: Growth of Dark Data. Leveraging dark data represents largest opportunity to transform business.
    8. 8. Driver #2: Increasing Need to Process Data (Efficiently). Organizations must process increasing data, increasing types, and create real-time business decisions.
    9. 9. Driver #3: Explosion of Variety. Explosion of unstructured data to be analyzed creates opportunities.
    10. 10. Big Data Predictions: • Through 2014, 20% of enterprise warehouses will add distributed processes • By 2015, 20% of Global 1000 organizations will have a strategic focus on information infrastructure equal to that of application management • Beginning in 2015, the term 'big data' will no longer be a competitive differentiator for technology providers • By 2015, big data demand will reach 4.4 million jobs globally but only one third of those jobs will be filled. Source: Gartner
    11. 11. What Exactly Is Hadoop? Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers. MapReduce: Distributed Computing Across Physical Servers.
    12. 12. Getting Started – 9 Steps: Identify Problem; Develop Business Case; Identify Resource Needs; Evaluate / Select Hardware & Software; Fund POC; Create Small Solution; Evaluate Solution; Develop Long-Term Roadmap; Perform Project
    13. 13. Step 1: What's Your Problem?
    14. 14. Step 2: Develop Business Case. General Guidelines: 1. Follow Traditional Business Case Steps 2. Engage Organization – This is Not an IT project 3. Engage Experts (You May Not Have Them Yet) 4. Consider Team Carefully. Diagram labels: Proposed Business Solution, Business Case, Proposed Technology Solution.
    15. 15. Step 3: Identify Resource Needs. Potential Weaknesses: • Big Data Skills • Predictive Analytics • Data Scientist • Strong Business Analyst • Agile Methodology • Project Managers. Diagram labels: Business Expertise, New Resources?, Technology Expertise.
    16. 16. Step 4: Technical Architecture. Mega-Vendors – Big Data – Vertical Industry
    17. 17. Step 4: Technical Architecture. Architectures: • Move computing near to data • Online analysis & Offline analysis • Parallel ingestion/exchanges • SQL and NoSQL • Computing as well as storing. Business Value: • From statistic to explore & prediction • From period to near real time • From commercial to open source • From big data to big understanding
    18. 18. Critical Mistakes: • Lack of Expertise • Big Data is IT project without a problem • Lack of technology alignment • Lack of Long-Term Roadmap • Lack of critical evaluation
    19. 19. Story #1 – Travel Cloudera Style. Collecting Data: • Offline explorer, spiders • Web server log files and Web UI scripts • Data feed from tools, tealeaf, Omniture feed, etc • Data feed from external, such as facebook feed, etc • Upstream operational database. Analyzing and Exploiting Data: • Method, funnel analysis, shopping cart analysis, decision tree, etc • Tools, such as Omniture, Google analytics, SSAS, Unica, Weka, etc • Analytics of searching engine, such as SEO and SEM reporting. Empower Business with Intelligence: • Mini-batch • Near real time DW/DB • A/B and MVT Testing • Recommendation Engine • Finance projection. Originally, we implement Behavioral Search project intended to capture customer behavior on line. It captures search parameters from the customers using Tealeaf and persists this data in Hadoop. From it, an analyst would be able to re-tell a story of what the customer searched for, what he/she saw, and what he/she did based on the response. Next, we polished new customer data mart including full roll out of individualization, customer segmentations, customer lifetime value calc, and quick lookup of customer purchase details for longer period. Findings: • High margin comes from the lodging; • High degree of merchant hotels are sold in the 1st page of search result; • Larger families tend to book passenger vans instead of midsize cars
    20. 20. Story #1 – Lessons Learned. [Chart: Hive vs. Impala query time in seconds, data as of Nov. 2012, for six queries: One Day Query (21GB-24P), One Month Query (650GB-744P), Three Month Query (1.7TB-2047P), Six Month Query (2.9TB-2920P), One Year Query (3.8TB-2391P), Two and half Year Query (5.8TB-3500+P)] • Hadoop Use Cases Moving to Real-Time • 71% - Move data from Hadoop to RDBMS for faster and interactive SQL • 67% - already query Hadoop using Hive • Impala – Real-Time SQL Queries engine for Hadoop, officially release in Q1, 2013 • Query results 4-30x faster than Hive • Support HQL and 100% open source
    21. 21. Story #2 – Personalization With Big Data
    22. 22. 2013 Pactera Focus Area. (1) Putting Big Data To Work: Data volumes are growing fast. Customers, partners, and now even sensor-based systems are generating data so quickly that organizations across all industries need new technologies to stay ahead. Organizations must analyze this data to understand and improve their business. Example: Creating a Big Data Solution to analyze customer relationship and demand data. (2) Visual Performance Management Enabled: Clients who desire to tie individual accountability to business value drivers can utilize BPM services to identify metrics and BI & Analytics technology to enable the BPM Strategy. Example: Enabling BPM through Visual Analytic Mgmt Dashboards. (3) Voice of Customer: Large clients are still struggling with what to do with the other 85% of their data, which is unstructured. This unstructured data is made up of customer surveys, call center discussions, and most recently social media data. VOC strategies help companies manage and gain value from this data. Example: Creating Customer Buying Behavior Solutions. (4) Predict Your Future: Nobody can predict their future but using advanced predictive analytics financial services organizations can apply science to understanding fraudulent activity, customer buying behavior, and manage risk etc. Example: Embedding Predictive Analytics into Risk Management solutions.
    23. 23. Conclusions: How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did
    24. 24. Thank you. Kurt Lueck, Managing Director, Business Intelligence & Analytics. Email: Kurt.Lueck@pactera.com. Desk: +1.704.944.3155 x240. 6100 Fairview Road, Suite 560, Charlotte, NC 28210. Visit our website: