Big Data: An Overview

A high level semi-technical overview of Big Data (specifically Hadoop).
  • Not just a lot of information. [click] My working definition is anything so large it becomes very hard to manage with the usual tools. It’s not that you cannot work with big data using your traditional toolsets; it’s just that purpose-built tools can do it faster and cheaper.
  • CIOs see licensing as a barrier, so focus pricing on researchers. We are in a data management era with a number of new challenges. Volume has always been a problem, but more so now because of the increased opportunity to gather data: equipment has more and more monitors in it, generating more and more data. In the past, people typically grabbed the piece of information they wanted and ditched the rest; today, people are finding these streams of data more interesting and want to get hold of them, so the volume of data you would like to retain is growing rapidly. Linked to that is velocity: not only is the data growing, it is arriving a lot faster. Data collected from a machine, or any source these days, can come in at a phenomenal rate, terabytes per minute. And typically people are looking to dive into a lot more different data sources, whether data they generate themselves or data from external sources: LinkedIn, Twitter and others, scraping information and linking it into what they already have. The types of data are no longer just text and numbers, but images, pictures, graphs and video. Linked to all of that is the challenge of value. You have this huge collection of different types of data, and it holds huge value across multiple groups, yet only small pieces of data from each of these groups are relevant to your business or the research being done. These are the challenges; so how can Oracle help you get that value?
  • Data in transit: your phone call, or the email of your vacation photos, traveling over the network backbone. 1 GB of stored content can create 1 PB in transit. Stored data is doubling about every 2 years: 130 exabytes in 2005, 1,227 EB in 2010 (1.19 zettabytes), 7,910 EB in 2015 (7.72 ZB).
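    A quick sanity check of the doubling claim (my arithmetic, using the IDC figures above), modeling growth as \( N(t) = N_0 \, 2^{t/T} \):

      \[ T = \frac{t \ln 2}{\ln\bigl(N(t)/N_0\bigr)} = \frac{10 \ln 2}{\ln(7910/130)} \approx 1.7\ \text{years} \]

    So “about every 2 years” is a round-number fit; the quoted figures imply a slightly faster doubling.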
  • Big Data is driving significant data volume at the customers who are leveraging it. A wide variety of sources provide this type of data.
  • Definitions are from Peter Wood, Professor of Computer Science at the University of London
  • These definitions are solely my own
  • There are lots, but the main one (and the one on which we are going to focus today) is Hadoop
  • It costs a lot more money to build bandwidth than it does CPU
  • Meanwhile at Yahoo, Doug Cutting was working on Nutch, Yahoo’s next generation search tool. The elephant is important; trust me
  • Hadoop is basically a massively parallel, shared nothing, distributed processing algorithm
  • HDFS Distributes Files At The Block Level Across Multiple Commodity Devices For Redundancy On The Cheap Not RAID: Distribution Is Across Machines/Racks
  • By Default, HDFS Writes Into Blocks & The Blocks Are Distributed Three Times. The size of the blocks can be set by the user. Pay attention to the NameNode here; this server keeps track of where all the chunks have been distributed across the file system. If you lose it, you’re hosed and have to rebuild everything from scratch.
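    As a concrete illustration of those knobs, a minimal client-side sketch (my example, assuming a Hadoop 2.x client on the classpath; the paths and the 128 MB block size are illustrative, not values from the deck):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsWriteDemo {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // These mirror the cluster-wide defaults in hdfs-site.xml;
          // a client may override them when it writes.
          conf.set("dfs.replication", "3");        // each block lands on 3 machines
          conf.set("dfs.blocksize", "134217728");  // carve files into 128 MB blocks

          FileSystem fs = FileSystem.get(conf);
          // The client streams the file out; the NameNode decides which
          // DataNodes hold each block replica and keeps that map in memory --
          // which is exactly why losing the NameNode is so painful.
          fs.copyFromLocalFile(new Path("/tmp/sample.log"),          // illustrative paths
                               new Path("/data/raw/sample.log"));
          System.out.println("replication = "
              + fs.getFileStatus(new Path("/data/raw/sample.log")).getReplication());
        }
      }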
  • Data Is Written Once & (Basically) Never Erased
  • Data Is Read From The Stream In Large, Contiguous Chunks, Not Random Reads
  • Hadoop is just a programming paradigm. You can do MapReduce inside an Oracle database; you generally just don’t want to do so.
  • Basically, a way of measuring how important an attribute is to the whole: the number of times it appears within the item compared to the background environment.
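    What’s being described here is usually formalized as TF-IDF; a standard form of the weight (my formalization, not a formula from the deck) is

      \[ \operatorname{tfidf}(t,d) = \operatorname{tf}(t,d) \times \log\frac{N}{\operatorname{df}(t)} \]

    where tf(t,d) is how often term t appears in item d, N is the number of items in the background corpus, and df(t) is how many of those items contain t. Terms frequent in the item but rare in the background score highest.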
  • What does a given person do and how would they behave in a given situation
  • “80% of all network traffic (internet or otherwise) is one machine talking with another machine.” Mike Olson, Cloudera
  • Spam vs. Ham
  • Flume as an intake device, MapReduce as a transformation engine. Instead of the classic hub and spoke of Informatica, you can run your ETL across a few thousand nodes and massively increase the throughput. Facebook uses Hadoop as an underlying architecture (through lots of filtering) in its messaging application: 1.5M ops/sec at peak, 75B+ ops/day.
  • Includes Sentiment Analysis What is this person thinking? Is that a happy smile, a sarcastic smile, a sad smile?
  • A customer can ingest all the logs from every machine in their environment and data mine the results to find any machine out of compliance.
  • Monte Carlo simulations; complex derivative valuations; predicting when a customer is heading into credit problems and shortening their terms before you get caught in those problems; demand forecasting.
  • Credit risk scoring and analysis; parallelizing data access as well as computation. “A large financial institution combined their data warehouses into a single Hadoop environment. They then used that information to more accurately score their customer portfolio risk.” Inputs include social networking activity, bill payments (cell phone, for example) and how often you have moved.
  • When considering a new hire, an extended investigation may show risky behavior on the applicant’s part which may exclude him or her from some of the more sensitive areas.
  • “I hurt myself in the yards and you have to pay me workers’ comp.” Then he tells Twitter he’s going to his house in Belize for some waterskiing.
  • Look for bad actors within NGC: Nick Leeson at Barings in 1995, for example. Shrinkage detection. Enable the security people to better do their jobs in monitoring the activities of people in sensitive positions. The Petraeus scandal: one of the reasons the FBI was able to close in on the identities of the people involved is that they were able to geolocate the sender and receiver of the Gmail emails and then connect those IP addresses with known users having the same IP addresses.
  • Portfolio evaluation for existing holdings; portfolio evaluation for future activities; high speed arbitrage trading; simply keeping up. “Options were 4.55B contracts in 2011 -- 17% over 2010 and the 9th straight year in a row”; 10,000 credit card transactions per second. All stats here from ComputerWorld, April 25, 2012.
  • People either do not fill out these forms or they fill them out with inaccurate information. These same people usually will tell their friends not just the truth, but the whole truth. And they will do it on Facebook and Twitter.
  • Social networking is coming to NGC’s customers at some point in time. It won’t be Facebook, but it will be something internal for the Navy (and/or the military). Oracle uses a secured social network internally to great effect… Live Twitter demo: http://50.17.239.57:9704/analytics/saw.dll?dashboard&PortalPath=%2Fshared%2FSentiment%20Analysis%2F_portal%2FSentiment%20Analsysis weblogic/welcome1
  • Over 50% of all trades are done at the behest of a computer. As the #io6maps #fail tags trended on Twitter, a sell off of Apple occurred.
  • Advance Auto Parts
  • Use machine processing to “read” the press releases and blogs of your customers to learn when they are getting ready to cut their budget. NGC can then position themselves to best answer their customer needs. [click] This can also extend to picking opportunities [click] from other competitors when they fall short. For that matter, [click] have programs scouring your competitor’s site and then use their own information against them. “Gosh, Air Force, I don’t know if I’d trust Boeing right about now; aren’t they using some of the same Dreamliner tech on their avionics package? Maybe we could help out there….”
  • As of Monday, there are [click] 724 Hadoop postings open in the DC area. For each of those jobs, [click] you’ll have hundreds – if not thousands – of applicants. So, how can you determine [click] that she is the one you want? Not because she’s the most technically adept, but because she is going to fit with your corporate culture and existing team.
  • What do I mean by cultural fit? Well, the easiest way to get this across is what I call the airport test. When you’re thinking of hiring someone [click] and you have to sit in an airport [click] with them while the flight is delayed [click] for a few hours, would that make you happy, or would you cringe at the thought of hours of chit-chat and making conversation?
  • Instead of doing a simple keyword match in the resume, go beyond the resume and find out more about the person. Regardless of where their resume says they worked or went to school, language analysis can reveal details about where they grew up and where they experienced their formative years. [click] is that a faucet or a spigot? [click] A wallet or a billfold? [click] A dog, a hound or a hound dog? And it’s more than just regional. All these words basically mean the same thing, but come from a different cultural point in time. You can use all of this information – and you can get from Facebook, twitter, blog posts and the like – to help determine if a potential hire is going to work well within your team. And you can do this all before they ever set foot on your property for an interview.
  • Log analysis; improve uptimes through predictive failure analysis.
  • The machines on a manufacturing floor produce data exhaust: Use this exhaust to improve the efficiency of the production line.
  • Trash bins are not an item most would consider when it comes to the internet of things. Here’s how they could provide valuable intelligence
  • We make the trash bins smart. [advance] You can buy a consumer grade, wifi enabled scale for about $100 apiece; I’ve seen bulk quotes on the internet for as low as $40 a pop. Put one of these scales under each of the bins [advance] and now the bin will tell you when it’s full.
  • Currently, the custodian has to go [just start advancing 13 times] check each bin one at a time and then empty the bin if necessary. With a self-reporting bin, the custodian [advance to phone image] can check his smart phone [advance to next slide]. Walgreens did this, and cut $57M out of their bottom line in 2012.
  • And see where he needs to go. Less time on the floor, lower costs for cleanup, a more efficient waste management process. But, more importantly, we can now focus on what is happening when these bins are filling up. [advance] We can create a histogram for the amount of waste ingested at each bin. If you look [advance], you can see an outlier on the high side and [advance] an outlier on the low side. Take this one. [advance]
  • Why does this particular bin fill up so much faster than all the others? Is there something inefficient in the line which can be remedied? [advance] This is an example of data exhaust from before. Once we learn that this bin is filling up much faster than the other bins, we can start to look into the line around it and see if there is something about the manufacturing process which can be improved. After a bit of digging, we may discover that there is a problem with the machine cutting away too much metal; we refactor the line to send less metal down the pipe, saving on material costs and improving the efficiency of the line.
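    A sketch of that outlier check (my illustration; the bin names and fill rates are hypothetical, and a real implementation would use a robust statistic over many more readings):

      import java.util.Arrays;
      import java.util.LinkedHashMap;
      import java.util.Map;

      // Flags bins whose fill rate sits far from the fleet's median rate --
      // the "histogram with a high outlier and a low outlier" idea above.
      public class BinOutliers {
          public static void main(String[] args) {
              Map<String, Double> kgPerHour = new LinkedHashMap<>();
              kgPerHour.put("bin-01", 1.1);   // hypothetical readings
              kgPerHour.put("bin-02", 0.9);
              kgPerHour.put("bin-03", 1.0);
              kgPerHour.put("bin-04", 3.8);   // fills fast: inspect the line feeding it
              kgPerHour.put("bin-05", 0.1);   // barely used: maybe relocate it

              double[] rates = kgPerHour.values().stream()
                      .mapToDouble(Double::doubleValue).toArray();
              Arrays.sort(rates);
              double median = rates[rates.length / 2];  // fine for a small, odd-sized fleet

              // Deliberately simple rule: more than double (or less than a quarter
              // of) the median fill rate is worth a walk over to the line.
              kgPerHour.forEach((bin, rate) -> {
                  if (rate > 2 * median || rate < 0.25 * median) {
                      System.out.printf("%s: %.1f kg/h vs median %.1f kg/h -- investigate%n",
                              bin, rate, median);
                  }
              });
          }
      }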
  • Advance Auto Parts
  • No. Hadoop is amazing at processing, but lacks a number of features found in traditional RDBMS platforms (like, say, Oracle). These features include (but are not limited to): security, ad-hoc query support, SQL support, and readily available technical resources.
  • In general, do the data crunching in Hadoop, then import the results into a system like Oracle for more traditional BI analysis. Oracle Connectors; other options
  • Storage is the primary limiting factor, with one exception
  • If you remember from before, the NameNode controls the file distribution. It’s also the bottleneck for growth; you can only add nodes and files to the system if the NameNode can hold that information with its available RAM.
  • So, for the NameNode, load the machine up with as much memory as possible.
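    A rough sizing rule (the 150-byte figure is a common rule of thumb, not a number from the deck): each file, directory, or block costs on the order of 150 bytes of NameNode heap, so

      \[ \text{objects} \approx \frac{\text{heap}}{150\ \text{B}}, \qquad 64\ \text{GB of heap} \Rightarrow \frac{64 \times 2^{30}}{150} \approx 4.6 \times 10^{8}\ \text{files and blocks} \]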
  • FUSE-DFS is a utility that allows a user to mount the distributed file system as a traditional file system (e.g., you can connect it to another server as a remote disk).
  • Hue is the Cloudera analog to OEM
  • Here are some of the powerful capabilities of Cloudera Manager. Service health and performance: Cloudera Manager is the only Hadoop management application that gives you a real time view of the health of all the services running in the Hadoop stack; competitive products tend to focus primarily on the file system, which is only one piece of the solution. Host-level snapshots: a view into the status of each host or node in your cluster. Monitor and diagnose workloads: view and compare current and historical job performance for benchmarking, troubleshooting and optimization. View/search Hadoop logs: Cloudera Manager is the only Hadoop management application that provides comprehensive log management; each screen provides contextual log views, so you only see the logs relevant to what you’re looking at, and you can search logs by keyword, type and severity. Track events: Cloudera Manager creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities, and makes them available for alerting and searching. Usage/performance reports: visualize current and historical disk usage by user, group and directory, and track MapReduce activity on the cluster by job or user.
  • Mahout is a collection of machine learning libraries. Mahout is also the job title for elephant wranglers in India
  • Oozie manages workflow and dependencies between MR jobs
  • Flume supports massively fast intake of log files.
  • Sqoop is a very simple connector between Hadoop and any ANSI SQL database using JDBC.
  • Pig and Hive are helper languages to provide a more SQL like interface to the Hadoop environment. Both work with MapReduce behind the scenes.
  • Hbase supports read/write access in a columnar store style
  • Whirr is the deployment tool to push out new nodes into the Hadoop environment. Very similar to Chef or Puppet, if your customers are already familiar with either of those tools.
  • Zookeeper manages the coordination between all of the distributed services
  • BigTop is a test harness for Hadoop – both the environment as well as specific MapReduce jobs
  • Build slide. In Analytics, we start with [click] standard reports, move to [click] ad hoc, then [click] drill down. These are all [click] ways of analyzing what has happened or what is happening right now. Next are alerts [click] to let me know that action must be taken, [click] simulation to experiment with ways to shape the action, [click] forecasting to take a look at what is happening now and project it into the future, and [click] prediction to play “what if?” All of these are [click] predictive in nature: what’s going to happen next. The top tier is when you get into [click] various forms of [click] optimization, both when you believe you have a good handle on the circumstances and when you do not. These areas are [click] prescriptive: given what we expect to be next, what is the best course of action.
  • Big Data can play across all of these areas, but it is better suited for the higher level, more complex operations. It’s not that Big Data cannot support a more standard approach to reporting; it’s just that those areas are probably better served by existing, lower cost options.
  • This is what Oracle sees as the typical stages in analytics, ranging from initial data discovery to predictive analytics. [click] Many organizations are investing at the two ends of this spectrum today.
  • Our customers continue to evolve. [click] While there is a lot of hype and promise from Big Data, most are continuing to focus on aligning data warehouses with business needs, etc. [click] However, investments in Big Data are becoming much more common, often starting with proof of concepts.
  • "Big Data is not only about analytics, it's about the entire value chain. So when you think about Big Data solutions you have to think about all the different steps. In the first step, you need to actually acquire and store the data.
  • The next step is to organize the data – you will have acquired massive amounts of unstructured data, but it won’t be of use until you organize or transform and distill it such that it can be easily integrated into your data center.
  • Next, you will want to analyze the data: slice it and dice it, do data mining on it, look at it in tables and cubes, etc. Basically, you want to know what this means.
  • And lastly, you want to turn this into something useful: something that decision makers can see in their dashboards quickly so that they can act upon it in near real time.
  • There are a lot of new technologies out there that address the challenges at each stage of the process we just talked about.
  • We typically look at capabilities through people, process, and tools. We had a lot of discussion this morning on tools and products, so let me direct your attention to a few other dimensions of big data capability. First, the Big Data process is different. The development of traditional BI and DW is entirely different from Big Data. With traditional BI, you know the answer you are looking for; you simply define requirements and build to your objective. With Big Data (of course, not in all cases), you may have an idea or interest, but you don’t know what will come out of it. The answer to your initial question will trigger the next set of questions, so the development process is more fluid; it requires that you explore the data as you develop and refine your hypothesis. So this might be a process you go through with big data: Hypothesis (the big idea); Data sources (acquire, access, capture data: private weblogs, streams, public [data.gov]); Explore results (simple MapReduce results with Hive/QL or SQL, interactive query through search, visualization); Reduce ambiguity (apply statistical models to eliminate outliers, find concentrations, and make correlations). You interpret the outcome, continuously refine models, and establish an improved hypothesis. In the end, this analysis might lead to the creation of new theories and predictions based upon the data. Again, it’s very fluid and very different from traditional SDLC and BI development.
  • The comparison with the 30-node cloud based cluster shows a single 18-node BDA being 2.5x faster than an Amazon cluster almost twice as large. The reason this is only 2.5x is that a 30-node cluster has substantially more mappers and reducers running; on a normalized basis a BDA achieves 4x the throughput of the Amazon cluster.
  • Direct Connect: an optimized version of external tables for HDFS; fast, parallelized data movement with automatic load balancing. Loader: a MapReduce utility to load data from Hadoop into Oracle; handles data conversion on the Hadoop side, making loads very fast and efficient. ODI Adapter: works with ODI, creates MapReduce jobs behind the scenes, uses Hive (q.v.). R Connector: writes MR jobs behind the scenes; connects R, Oracle, the local file system and HDFS.
  • Embedded analytics focus: Oracle R Enterprise enabling R statistics programs to be run against data in the Oracle Database eliminating latency and improving data security.
  • Embedded analytics focus: Data Mining algorithms available via SQL as part of the Advanced Analytics Option.
  • Embedded analytics focus: What’s included in the Oracle Database at no charge.
  • Oracle Endeca Information Discovery provides the Endeca Server, which offers a “multi-faceted” data model that automatically provides drill paths through structured and unstructured data loaded into the server.
  • Support for mobile experience provided by the BI Foundation Suite for iOS (Apple) devices, here represented as being hosted on Exalytics.
  • Oracle’s goal is to reduce the amount of time required to implement these solutions. Simplify the support. Allow you to focus on delivering value – and not on maintaining infrastructure. And to provide the tools you need to effectively analyze data and generate insights. Let’s look at this picture from left to right. Twitter data streamed into the ..
  • This is a very simple equation for a Fourier transformation of a wave kernel at 0. If you think the data analysts with your customer would look at the above equation and cringe or hear the description I just gave and glaze over, then they are not ready for this.
  • A picture of one whiteboard at bit.ly
  • The demand for people with programming skills, math skills and business acumen is out of this world.
  • Many companies are opting to grow their own rather than hire from the outside. If this is your customer, they need to look for a programmer who liked Lisp in college, knows computational matrices, and knows his or her way around the business issues.
  • Big Data is a very powerful tool, but it is not the right tool for every problem.
  • You would never operate a POS system on Hadoop: you can sell that widget once and only once, and the batch processing nature of Hadoop doesn’t support this type of activity. If you remember from the technical overview, Hadoop reads data in contiguous streams, so [click] random access of data does not work very well in a Hadoop world.
  • The amount of data is the wrong measurement. <1/1-50/50-300/300-600/600+ is my yardstick, but only if I have to make a size determinant.
  • Caffeine was built by Google to address real time indexing (instant results when searching). This technology will be of high interest for organizations looking to access their quickly changing data in real time, but not as useful for longitudinal or historical introspection.
  • Use Hadoop to analyze all of the data within your corpus and then generate a mathematical model. This model can be as simple [click] as a hard knee waveform or as complex [click] as a multivariate linear regression
  • Once the model has been created (and properly vetted, of course), it can be used to determine resolution of events in real time – thereby getting around the batch bottleneck of Hadoop. And these real time events can be handled quite well in a system like Oracle’s Complex Event Processing (hand over)
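    A minimal sketch of that hand-off (my illustration: the coefficients and threshold stand in for a model actually fitted offline by Hadoop, and a real deployment would wire the scorer into an event-processing engine such as Oracle CEP rather than a main method):

      import java.util.Arrays;

      // Scores incoming events against a linear model whose coefficients were
      // fitted offline (e.g., by a MapReduce job over the full corpus), so each
      // event can be resolved in real time, outside any batch cycle.
      public class ModelScorer {

          private final double[] weights;   // hypothetical fitted coefficients
          private final double intercept;
          private final double threshold;   // decision boundary chosen during vetting

          public ModelScorer(double[] weights, double intercept, double threshold) {
              this.weights = Arrays.copyOf(weights, weights.length);
              this.intercept = intercept;
              this.threshold = threshold;
          }

          // Linear score: intercept + w . x
          public double score(double[] features) {
              double s = intercept;
              for (int i = 0; i < weights.length; i++) {
                  s += weights[i] * features[i];
              }
              return s;
          }

          // True when the event crosses the decision boundary and needs action.
          public boolean flag(double[] features) {
              return score(features) > threshold;
          }

          public static void main(String[] args) {
              // Hypothetical coefficients standing in for a real fitted model.
              ModelScorer scorer = new ModelScorer(new double[] {0.8, -1.2, 0.05}, 0.1, 1.0);
              double[] event = {1.5, 0.2, 10.0};  // one incoming event's feature vector
              System.out.println("score=" + scorer.score(event) + " flag=" + scorer.flag(event));
          }
      }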
  • Hilary Mason is the chief data scientist at bit.ly (a web service which shortens links for social media). They handle ~80M new URLs per day and ~300M clicks per day. She’s an excellent lecturer and instructor – you really should find time to listen to her speak – and I’ve learned quite a bit from her over the years. She views Big Data projects as moving across 5 distinct stages. Let’s go through them…. in reverse order. In other words, let’s start at the end. What do we want as the end result of a Big Data project?
  • The end goal of any Big Data solution is to provide data which can be interpreted into meaningful decisions. But, before we can interpret the data, we must first….
  • Model the data into a useful paradigm which will allow us to make sense of any new data based upon past experiences. But, before we can model the data, we must first
  • Explore the data we have and look for meaningful patterns from which we could extract a useful model. But before we can look through the data for a meaningful pattern, we first have to…
  • Clean and clarify the data we have to make it as neat as possible and easier to manipulate. But before we can clean the data, we have to start with…
  • Obtaining as much data as possible. Advances in technology coupled with Moore’s law mean that DASD is very, very cheap these days; so much so that you might as well hang on to as much data as you can, because you never know when it will prove useful. And here’s where the BDA comes back into play: able to ingest terabytes of data per hour, with the disk to store it (particularly when coupled with ZFS), it’s a great starting place.

Presentation Transcript

  • Big Data: An Overview
  • What Is Big Data?
  • What Is Big Data?
    • Big Data is not simply a huge pile of information
    • A good starting place is the following paraphrase: “Big Data describes datasets so large they become awkward to manage with traditional database tools at a reasonable cost.”
  • Volume, Velocity, Variety, Value: A Breakdown Of What Makes Up Big Data
  • Data Growth Explosion
    • 1 GB of stored content can create 1 PB of data in transit
    • The totality of stored data is doubling about every 2 years
    • This meant 130 EB in 2005
    • 1,227 EB in 2010 (1.19 ZB)
    • 7,910 EB in 2015 (7.72 ZB)
    (Data & image courtesy of IDC)
  • Growth Of Big Data: Harnessing Insight From Big Data Is Now Possible
    • 1.8 trillion gigabytes of data was created in 2011
    • More than 90% is unstructured data, managed outside relational databases
    • Approx. 500 quadrillion files
    • Quantity doubles every 2 years
  • So, Just Any Dataset?
    • Big Data can work with any dataset
    • However, Big Data shines when dealing with unstructured data
  • Structured Vs. Unstructured
    Structured data is any data to which a pre-defined data model can be applied in an automated fashion, producing a semantically meaningful result without referencing outside elements. In other words, if you can apply some template to a data set and have it instantly make sense to the average person, it’s structured. If you can’t, it’s unstructured.
  • Really? Only Two Categories?
    Okay, there’s also semi-structured data, which basically means that after the template is applied, some of the result will make sense and some will not. XML is a classic example of this kind of data.
  • Formal Definitions Of Data Types
    Structured Data: Entities in the same group have the same descriptions (or attributes), while descriptions for all entities in a group (or schema): a) have the same defined format; b) have a predefined length; c) are all present; and d) follow the same order. Structured data are what is normally associated with conventional databases, such as relational transactional ones, where information is organized into rows and columns within tables. Spreadsheets are another example. Nearly all understood database management systems (DBMS) are designed for structured data.
    Semi-Structured Data: Semi-structured data are intermediate between the two other forms, wherein “tags” or “structure” are associated or embedded within unstructured data. Semi-structured data are organized in semantic entities; similar entities are grouped together; entities in the same group may not have the same attributes; the order of attributes is not necessarily important; not all attributes may be required; and the size or type of the same attributes in a group may differ. To be organized and searched, semi-structured data should be provided electronically from database systems, file systems (e.g., bibliographic data, Web data) or via data exchange formats (e.g., EDI, scientific data, XML).
    Unstructured Data: Data can be of any type and do not necessarily follow any format or sequence, do not follow any rules, are not predictable, and can generally be described as “free form.” Examples of unstructured data include text, images, video or sound (the latter two also known as “streaming media”). Generally, “search engines” are used for retrieval of unstructured data via querying on keywords or tokens that are indexed at the time of data ingest.
  • Informal Definitions Of Data Types
    Structured Data: Fits neatly into a relational structure.
    Semi-Structured Data: Think documents or EDI.
    Unstructured Data: Can be anything: text, video, sound, images.
  • Tools For Dealing With Semi/Un-Structured Data
  • What Is Hadoop?
    “The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
  • The Paradigm Shift Of Hadoop: Centralized Processing Doesn’t Work
    Rather than moving the data to a central server for processing… Moving data to a central location for processing (like, say, Informatica) cannot scale. You can only buy a machine so big.
  • The Paradigm Shift Of Hadoop: Bandwidth Is The Bottleneck
    • Moving data around is expensive.
    • Bandwidth $$ > CPU $$
  • The Paradigm Shift Of Hadoop: Process The Data Locally Where It Lives
  • The Paradigm Shift Of Hadoop: Then Return Only The Results
    • You move much less data around this way
    • You also gain the advantage of greater parallel processing
  • Where Did Hadoop Originate?
    GFS: presented to the public in 2003. MapReduce: presented to the public in 2004.
  • Spreading Out From Google
    Doug Cutting was working on “Nutch”, Yahoo’s next generation search engine, at the same time; when he read the Google papers, he reverse engineered the technology. The elephant was his son’s toy named….
  • Going Open Source
    HDFS & MapReduce: released to the public in 2006.
  • A Bit More In Depth, Then A Lot More In Depth
    HDFS is primarily a data redundancy solution. MapReduce is where the work gets done.
  • How Hadoop Works
    Hadoop is basically a massively parallel, shared nothing, distributed processing algorithm.
  • GFS / HDFS
    HDFS distributes files at the block level across multiple commodity devices for redundancy on the cheap. Not RAID: distribution is across machines/racks.
  • Data Distribution
    By default, HDFS writes into blocks & the blocks are distributed x3.
  • WORM
    Data is written once & (basically) never erased.
  • How Is The Data Manipulated?
    Not random reads: data is read from the stream in large, contiguous chunks.
  • The Key To Hadoop Is MapReduce
    In a Shared Nothing architecture, programmers must break the work down into distinct segments that are:
    • Autonomous
    • Digestible
    • Can be processed independently
    • With the expectation of incipient failure at every step
  • A Canonical MapReduce Example (image credit: Martijn van Groningen)
  • A MapReduce Example: The Input
    The data arrives into the system.
  • A MapReduce Example: Splitting The Input Into Chunks
    The data is moved into the HDFS system and divided into blocks, each of which is copied multiple times for redundancy.
  • A MapReduce Example: Mapping The Chunks
    The Mapper picks up a chunk for processing. The MR framework ensures only one mapper will be assigned to a given chunk.
  • A MapReduce Example: Mapping The Chunks
    In this case, the Mapper emits a word with the number of times it was found.
  • A MapReduce Example: A Shuffle Sort
    The Shuffler can do a rough sort of like items (optional).
  • A MapReduce Example: Reducing The Emissions
    The Reducer combines the Mapper’s output into a total.
  • A MapReduce Example: The Output
    The job completes with a numeric index of words found within the original input.
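    For the curious, here is a minimal, self-contained version of the word-count job walked through above (my reconstruction against the stock Hadoop 2.x API, not code from the deck; the longer listing on a later slide adds case handling and skip patterns):

      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {

        // Map: for each input line, emit (word, 1) for every token on it.
        public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable offset, Text line, Context context)
              throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
              if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
              }
            }
          }
        }

        // Reduce: the framework groups the 1s by word; sum them into a total.
        public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
              sum += count.get();
            }
            result.set(sum);
            context.write(word, result);
          }
        }

        public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);  // pre-sum on each node
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }

    Packaged into a jar, it runs with: hadoop jar wordcount.jar WordCount <input dir> <output dir>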
  • MapReduce Is Not Only Hadoop
    MapReduce is a programming paradigm, not a language. You can do MapReduce within an Oracle database; it’s just usually not a good idea. A large MapReduce job would quickly exhaust the SGA of any Oracle environment.
    http://blogs.oracle.com/datawarehousing/2009/10/in-database_map-reduce.html
  • Problem Solving With MapReduce
    • The key feature is the Shared Nothing architecture.
    • Any MapReduce program has to understand and leverage that architecture.
    • This is usually a paradigm shift for most programmers, and one that many cannot overcome.
  • Programming With MapReduce
    • HDFS & MapReduce is written in Java:

      package org.myorg;

      import java.io.*;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.filecache.DistributedCache;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.*;
      import org.apache.hadoop.mapreduce.lib.input.*;
      import org.apache.hadoop.mapreduce.lib.output.*;
      import org.apache.hadoop.util.*;

      public class WordCount2 extends Configured implements Tool {

        public static class Map
            extends Mapper<LongWritable, Text, Text, IntWritable> {

          static enum Counters { INPUT_WORDS }

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          private boolean caseSensitive = true;
          private Set<String> patternsToSkip = new HashSet<String>();

          private long numRecords = 0;
          private String inputFile;

          // Read job-level switches and the optional skip-pattern cache files.
          public void setup(Context context) {
            Configuration conf = context.getConfiguration();
            caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
            inputFile = conf.get("mapreduce.map.input.file");

            if (conf.getBoolean("wordcount.skip.patterns", false)) {
              Path[] patternsFiles = new Path[0];
              try {
                patternsFiles = DistributedCache.getLocalCacheFiles(conf);
              } catch (IOException ioe) {
                System.err.println("Caught exception while getting cached files: "
                    + StringUtils.stringifyException(ioe));
              }
              for (Path patternsFile : patternsFiles) {
                parseSkipFile(patternsFile);
              }
            }
          }

          private void parseSkipFile(Path patternsFile) {
            try { … (the listing is truncated on the slide)

    • Will work with any language supporting STDIN/STDOUT
    • Lots of people using Python, R, Matlab, Perl, Ruby et al.
    • Is still very immature & requires low level coding
  • What Are Some Big Data Use Cases? Basically, Clustering And Targeting:
    • Inverse Frequency / Weighting
    • Co-Occurrence
    • Behavioral Discovery
    • “The Internet Of Things”
    • Classification / Machine Learning
    • Sorting
    • Indexing
    • Data Intake
    • Language Processing
  • Inverse Frequency Weighting: Recommendation Systems
  • Co-Occurrence: Fundamental Data Mining – People Who Did This Also Do That
  • Behavioral Discovery
  • Behavioral Discovery
    “The best minds of my generation are thinking about how to make people click ads.” Jeff Hammerbacher, former research scientist at Facebook, currently Chief Scientist at Cloudera
  • “The Internet Of Things”: “Data Exhaust”
  • Classification / Machine Learning
  • Sorting
    Record holders: a 10 PB sort across 8,000 nodes in 6 hours 27 minutes (September 7, 2011); 1.5 TB across 2,103 nodes in 59 seconds (February 26, 2013).
  • Indexing
  • Data Intake
    Hadoop can be used as a massively parallel ETL tool; Flume to ingest files, MapReduce to transform them.
  • Language Processing
    Includes sentiment analysis. How can you infer meaning from someone’s words? Does that smile mean happy? Sarcastic? Bemusement? Anticipation?
  • How Can Big Data Help You? 9 Use Cases:
    • Natural Language Processing
    • Internal Misconduct
    • Fraud Detection
    • Marketing
    • Risk Management
    • Compliance / Regulatory Reporting
    • Portfolio Management
    • IT Optimization
    • Predictive Analysis
  • Compliance / Regulatory Reporting
  • Predictive Analysis
    Think data mining on steroids. One of the main benefits Hadoop brings to the enterprise is the ability to analyze every piece of data, not just a statistical sample or an aggregated form of the entire datastream.
  • Risk Management
    (Photo credit: Guinness World Records; 88 catches, by the way)
  • Risk Management: Behavioral Analysis
    When considering a new hire, an extended investigation may show risky behavior on the applicant’s part which may exclude him or her from more sensitive positions.
  • Fraud Detection
    “Dear Company: I hurt myself working on the line and now I can’t walk without a cane.” Then he tells his Facebook friends he’s going to his house in Belize for some waterskiing.
  • Internal Misconduct
    One of the reasons why the FBI was able to close in on the identities of the people involved is that they geolocated the sender and recipient of the Gmail emails and connected those IP addresses with known users on those same IP addresses.
  • Portfolio Management
    • Evaluate portfolio performance on existing holdings
    • Evaluate portfolio for future activities
    • High speed arbitrage trading
    • Simply keeping up: “Options were 4.55B contracts in 2011 -- 17% over 2010 and the 9th straight year in a row”; 10,000 credit card transactions per second (statistics courtesy of ComputerWorld, April 2012)
  • Sentiment Analysis – Social Network Analysis
    Companies used to rely on warranty cards and the like to collect demographic data. People either did not fill out the forms or did so with inaccurate information.
  • Sentiment Analysis – Social Network Analysis
    People are much more likely to be truthful when talking to their friends.
  • Sentiment Analysis – Social Network Analysis
    This person – and 20 of their friends – are talking about the NFL. This person is a runner. Someone likes Kindle. Someone is current with pop music.
  • Sentiment Analysis – Social Network Analysis: Even Where You Least Expect It
    You might be thinking something like “My customer will never use social media for anything I care about. No sergeant is ever going to tweet ‘The straps on this new rucksack are so comfortable!!!’”
  • Sentiment Analysis – Social Network Analysis: Internal Social Networking At Customer Sites
    • Oracle already uses an internal social network to facilitate work.
    • The US military is beginning to explore a similar type of environment.
    • It is not unreasonable to plan for the DoD installing a network on base; your company could incorporate feedback from end users into design decisions.
  • Sentiment Analysis – Apple iOS6, Maps & Stock Price
    Apple released iOS6 with their own version of Maps. It has had some issues, to put it mildly. (Photo courtesy of http://theamazingios6maps.tumblr.com/)
  • Sentiment Analysis – Apple iOS6, Maps & Stock Price
    Over half of all trades in the US are initiated by a computer algorithm. (Source: Planet Money (NPR), Aug 2012)
  • Sentiment Analysis – Apple iOS6, Maps & Stock Price
    People started to tweet about the maps problem, and it went viral (to the point that someone created a Tumblr blog to make fun of Apple’s fiasco). (Photo courtesy of http://theamazingios6maps.tumblr.com/)
  • Sentiment Analysis – Apple iOS6, Maps & Stock Price
    As the twitter stream started to peak, Apple’s stock price took a short dip. I believe it likely that automatic trading algorithms started to sell off Apple based on the negative sentiment analysis from Twitter and Facebook.
  • Natural Language Processing
    Big, Huge, Blooming, Ample, Blimp, Gigantic, Abundant, Broad, Bulky, Capacious, Colossal, Comprehensive, Copious, Enormous, Excessive, Exorbitant, Extensive, Extravagant, Full, Generous, Giant, Goodly, Grand, Grandiose, Great, Hefty, Humongous, Immeasurable, Immense, Jumbo, Gargantuan, Massive, Monumental, Mountainous, Plentiful, Populous, Roomy, Sizable, Spacious, Stupendous, Substantial, Super, Sweeping, Vast, Voluminous, Whopping, Wide, Ginormous, Mongo, Badonka, Booku, Doozy
  • Natural Language Processing
    The same word cloud, resolved to a single concept: Large
  • Natural Language Processing: Anticipate Customer Need
  • Natural Language Processing: React To Competitor’s Missteps
  • Natural Language Processing: Cultural Fit For Hires
    As of Apr 22, there were 724 Hadoop openings in the DC area. There will be hundreds – if not thousands – of applicants for each position. How can you determine who is the most appropriate candidate, not just technically, but culturally?
  • Natural Language Processing: Cultural Fit?
    A good way to think of cultural fit is the “airport test.” If you’re thinking of hiring someone and you had to sit with them in an airport for a few hours because of a delayed flight, would that make you happy? Or would you cringe at the thought of hours of forced conversation?
  • Natural Language Processing: Analyze Their Writings For Cultural Fit
    Go beyond simple keyword searches to find out more about the person. Regardless of what their resume says, language analysis can reveal details about where they grew up and where they experienced their formative years.
  • Natural Language Processing: Analyze Their Writings For Cultural Fit
    Do they say “faucet” or “spigot”? “Wallet” or “billfold”? “Dog”, “hound” or “hound dog”? “Groovy”, “cool”, “sweet” or “off the hook”? While these words are synonyms, they carry cultural connotations with them. Find candidates with the same markers as your existing team for a more cohesive unit.
  • IT Optimization
  • IT Optimization – Enabling The Environment
    Machines reporting in: “I’m running out of supplies!” “I’m overheating!” “Everything is fine.” “Wheel 21 is out of alignment.” “I’m 42.4% full.”
  • IT Optimization – Enabling The Shop Floor: A More Specific Example
    “I’m 42.4% full.”
  • IT Optimization – Enabling The Shop Floor: Make The Trash Smart
    We can make the trash bins “smart” by putting a wifi enabled scale beneath each bin and using that to determine when the bins are reaching capacity.
  • IT Optimization – Enabling The Shop Floor: Cut Down On Clean Up Labor
    As of now, the custodian has to check each bin to see if it is full. With a “smart” bin, the custodian can check his smart phone and see what does and does not need to be done.
  • IT Optimization – Enabling The Shop Floor: Cut Down On Clean Up Labor
    More importantly, we can now focus on what is happening to the bins and how they are being used. For example, we may find outliers where one bin is filling much faster than all of the others.
  • IT Optimization – Enabling The Shop Floor: Drilling Into Waste Production
    “Data Exhaust”: We can drill into why that bin is filling faster, leverage the Six Sigma efficiency processes already in place and improve the overall performance of the line.
  • IT Optimization – Classify Legacy Data
    A customer can use a machine learning process to take unknown data and sort it into useful data elements. For example, a retail car part company might use this process to sort photos: is that circle a steering wheel, a hubcap or a tire?
  • So, All We Need Is Hadoop, Right?
    Hadoop is amazing at processing, but lacks a number of features found in traditional RDBMS platforms (like, say, Oracle): security, ad-hoc query support, SQL support, and readily available technical resources.
  • Then How Do We Fix Those Problems?
    In general, do the data crunching in Hadoop, then import the results into a system like Oracle for more traditional BI analysis.
  • Oracle’s Big Data Appliance
  • Oracle’s Big Data Appliance: In Depth
  • Big Data Appliance: The Specs Of The Machine
    Hardware:
    • 18 compute/storage nodes, each with two 6-core Intel processors, 48 GB memory (up to 144 GB) and 12 x 3 TB SAS disks
    • 3 InfiniBand switches
    • Ethernet switch, KVM, PDU
    • 42U rack
    • Totals: 216 cores, 864 GB RAM (2.5 TB max), 648 TB storage
    Software:
    • Oracle Linux
    • Java Virtual Machine
    • Cloudera Hadoop distribution
    • R (statistical programming language)
    • Oracle NoSQL Database
    Environmental:
    • 12.25 kVA (12.0 kW) power draw
    • 41k BTU/hr (42k kJ/hr) cooling
    • 1886 CFM airflow
  • Big Data Appliance: The Cloudera Distribution
  • The Analytics Evolution: What Is Happening In The Industry
    • Standard Reporting: What happened?
    • Ad Hoc Reporting: How many, how often, where?
    • Query/Drill Down: What exactly is the problem?
    • Alerts: What actions are needed?
    • Simulation: What could happen…?
    • Forecasting: What if these trends continue?
    • Predictive Modeling: What will happen next if…?
    • Optimization: How can we achieve the best outcome?
    • Stochastic Optimization: How can we achieve the best outcome, including the effects of variability?
    Descriptive: analyzing data to determine what has happened or is happening now. Predictive: examining data to discover whether trends will continue into the future. Prescriptive: studying data to elevate the best course of action for the future. Competitive advantage rises with the degree of complexity; some are here (the lower tiers), with growing investment here (the upper tiers). (Competing On Analytics: The New Science Of Winning; Thomas Davenport & Jeanne Harris, 2007)
  • The Analytics Evolution: Where Big Data Fits On This Model
    The same ladder, with the higher, more complex tiers marked as where Big Data best fits.
  • Typical Stages In Analytics: Choosing The Right Solutions For The Right Data Needs
    Growing investment at both ends of the spectrum.
  • The Data Warehouse Evolution: What Are Oracle’s Customers Deploying Today?
    Increasing business value with information architecture maturity and data & analytics diversity: data marts (what happened yesterday), then consolidated data in a data warehouse (what is happening today; most are here!), then Big Data (what could happen tomorrow; some are here, with growing investment).
  • What Is Your Big Data Strategy? Where Does Your Data Originate?
    ACQUIRE: How will you acquire live streams of unstructured data?
  • What Is Your Big Data Strategy? What Do You Do With It Once You Have It?
    ORGANIZE: How will you organize big data so it can be integrated into your data center?
  • What Is Your Big Data Strategy? How Do You Manipulate It Once You Have It?
    ANALYZE: What skill sets and tools will you use to analyze big data?
  • What Is Your Big Data Strategy? What Do You Do After You’re Done?
    DECIDE: How will you share the analysis in real time?
  • Big Data In Action
    Make better decisions using Big Data: Acquire → Organize → Analyze → Decide.
  • The Big Data Development Process
    Traditional BI iterates through change requests; Big Data iterates through Hypothesis → Identify Data Sources → Explore Results → Reduce Ambiguity → Refine Models → Improved Hypothesis.
  • Oracle’s Big Data Solution
    Acquire (Oracle Big Data Appliance) → InfiniBand → Organize & Discover (Oracle Exadata, with Endeca Information Discovery) → InfiniBand → Analyze (Oracle Exalytics) → Decide (Oracle Real-Time Decisions).
  • Oracle’s Big Data Solution: Pre-Built And Optimized Out Of The Box
    A custom configuration takes months to reach 100% performance achievement: assemble dozens of components, test & debug failure modes, measure, diagnose, tune and reconfigure, with multi-vendor fingerpointing along the way. The pre-built appliance reaches the same point in days.
  • Big Data Appliance Performance Comparisons
    • 6x faster than a custom 20-node Hadoop cluster for large batch transformation jobs
    • 2.5x faster than a 30-node Hadoop cluster for tagging and parsing text documents
  • Oracle Big Data Connectors
    • Oracle Loader for Hadoop (OLH): a MapReduce utility to optimize data loading from HDFS into Oracle Database
    • Oracle Direct Connector for HDFS: access data directly in HDFS using external tables
    • ODI Application Adapter for Hadoop: ODI Knowledge Modules optimized for Hive and OLH
    • Oracle R Connector for Hadoop
    Load results into Oracle Database at 12 TB/hour (BDA to Oracle Exadata over InfiniBand).
  • Oracle Database Advanced Analytics Option: Oracle R Enterprise
    • The R open source environment for statistical computing and graphics is growing in popularity for advanced analytics: widely taught in colleges and universities, popular among millions of statisticians
    • R programs can run unchanged against data residing in the Oracle Database: reduce latency, improve data security
    • Augment results with powerful graphics
    • Integrate R results and graphics with OBIEE dashboards
  • Oracle Database Advanced Analytics Option: Oracle Data Mining
    Problem / algorithm / applicability:
    • Classification: Logistic Regression (GLM), classical statistical technique; Decision Trees, popular / rules / transparency; Naïve Bayes, embedded app; Support Vector Machine, wide / narrow data / text
    • Regression: Multiple Regression (GLM), classical statistical technique; Support Vector Machine, wide / narrow data / text
    • Anomaly Detection: One Class Support Vector Machine (SVM), lack of examples
    • Attribute Importance: Minimum Description Length (MDL), attribute reduction, identify useful data, reduce data noise
    • Association Rules: Apriori, market basket analysis, link analysis
    • Clustering: Hierarchical K-Means, Hierarchical O-Cluster; product grouping, text mining, gene and protein analysis
    • Feature Extraction: Non-Negative Matrix Factorization (NMF), text analysis, feature reduction
  • Oracle Database SQL Analytics: Included In The Oracle Database
    • Ranking functions: rank, dense_rank, cume_dist, percent_rank, ntile
    • Window aggregate functions (moving and cumulative): avg, sum, min, max, count, variance, stddev, first_value, last_value
    • LAG/LEAD functions: direct inter-row reference using offsets
    • Reporting aggregate functions: sum, avg, min, max, variance, stddev, count, ratio_to_report
    • Statistical aggregates: correlation, linear regression family, covariance
    • Linear regression: fitting of an ordinary-least-squares regression line to a set of number pairs; frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
    • Descriptive statistics: DBMS_STAT_FUNCS summarizes numerical columns of a table and returns count, min, max, range, mean, median, stats_mode, variance, standard deviation, quantile values, +/- n sigma values, top/bottom 5 values
    • Correlations: Pearson’s correlation coefficients, Spearman’s and Kendall’s (both nonparametric)
    • Cross tabs: enhanced with % statistics: chi squared, phi coefficient, Cramer’s V, contingency coefficient, Cohen’s kappa
    • Hypothesis testing: Student t-test, F-test, binomial test, Wilcoxon Signed Ranks test, chi-square, Mann-Whitney test, Kolmogorov-Smirnov test, one-way ANOVA
    • Distribution fitting: Kolmogorov-Smirnov test, Anderson-Darling test, chi-squared test, Normal, Uniform, Weibull, Exponential
  • Oracle Big Data Ecosystem
    Acquire → Organize → Analyze → Decide, with Discover, Visualize and Stream alongside.
  • Having Said That…
  • Big Data Is More Than Just Hardware & Software
  • The Math Is The Hard Part
    This is a very simple equation for a Fourier transformation of a wave kernel at 0.
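    The equation image itself did not survive into this transcript; for reference, the general Fourier transform takes the form (my addition, not necessarily the exact kernel expression shown on the slide)

      \[ \hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i x \xi}\, dx \]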
  • The Math Is The Hard Part
    This is a photograph of a data scientist’s white board at bit.ly.
  • Data Scientists Are Expensive And Hard To Find
    • Typical job description: “Ph.D. in data mining, machine learning, statistical analysis, applied mathematics or equivalent; three-plus years hands-on practical experience with large-scale data analysis; and fluency in analytical tools such as SAS, R, etc.”
    • Looking for “baIT”: Business, Analytics, IT, all in the same person. These people exist, but are very expensive.
  • Growing Your Own Data Scientist
    • Business acumen
    • Familiarity with / likes computational linear algebra and matrix analysis
    • Interest in SAS, R, Matlab
    • Familiarity with / likes Lisp
  • Big Data Cannot Do Everything
  • Big Data Cannot Do Everything
    Big Data is a great tool, but not a silver bullet. You would never run a POS system on Hadoop; Hadoop is far too batch oriented to support this type of activity. Similarly, random access of data does not work well in the Hadoop world.
  • When Big Data? When Relational?
    Size of data (rough measure).
  • When Big Data? When Relational? RDBMS vs Hadoop: A Comparison
    • RDBMS: Fully SQL compliant; many RDBMS vendors extend SQL in useful ways. Hadoop: Helper languages (Hive, Pig); very useful but not as robust as SQL.
    • RDBMS: Optimized for query performance; tunable (input vs output, long running queries, etc.). Hadoop: Optimized for analytics operations, specifically those of a statistics nature.
    • RDBMS: Armies of trained and available resources. Hadoop: Resources are hard to find and expensive when found.
    • RDBMS: Requires more specialized hardware at performance extremes. Hadoop: Designed to work on commodity hardware at all levels.
    • RDBMS: OLTP, OLAP, ODS, DSS, hybrid -- more general purpose. Hadoop: Basically only for analytics.
    • RDBMS: Expensive to implement over wide geographical distribution. Hadoop: Designed to span data centers.
    • RDBMS: Very mature technology. Hadoop: Very new technology.
    • RDBMS: Real time or batch processing. Hadoop: Batch operations only.
    • RDBMS: Nontrivial licensing costs. Hadoop: Open source (“free” --ish).
    • RDBMS: About 2 PB as largest commercial cluster (telecom company). Hadoop: 100+ PB as largest commercial cluster (Facebook, as of March 2013).
    • RDBMS: Ad hoc operations common, if not encouraged. Hadoop: Ad hoc operations possible with HBase, but nontrivial.
  • It Is Not An “Either/Or” Choice
    RDBMS and Hadoop each solve different problems.
  • Where Are Things Heading?
  • A Quick Recap
    GFS: presented to the public in 2003. MapReduce: presented to the public in 2004.
  • Hadoop Is Already Dead? YES… sort of* (* = for a specific set of problems…)
  • The New Stuff In Overview (name, publication year, use, what it does, impact, open source?)
    • Colossus (n/a): GFS for realtime systems. Open source: No.
    • Caffeine (2009): Real time search; incremental updates of analytics and indexes in real time. Estimated to be 100x faster than Hadoop. Open source: No.
    • Pregel (2009): Social graphs, location graphs, learning & discovery, network optimization, Internet of Things; analyzes next-neighbor problems. Estimated to handle billions of nodes & trillions of edges. Open source: Alpha (Apache Giraph).
    • Percolator (2010): Large scale incremental processing using distributed transactions; makes transactional, atomic updates in a widely distributed data environment, eliminating the need to rerun a batch for a (relatively) small update. Data in the environment remains much more up to date with less effort.
    • Dremel (2010): SQL-like language for queries on the above technologies; interactive, ad hoc queries over trillion-row tables in subsecond time; works against Caffeine / Pregel / Colossus without requiring MapReduce. Easier for analysts and non-technical people to be productive (i.e., not as many data scientists are required). Open source: Very alpha (Apache Drill, incubator).
    • Spanner (Oct 2012): Fully consistent (?), transactional, horizontally scalable, distributed database spanning the globe; uses GPS sensors and atomic clocks to keep the clocks of servers in sync regardless of location or other factors. Transactional support on a global scale at a fraction of the cost, and where (many times) not technically possible otherwise. Open source: No, and unlikely to ever be.
    • Storm (2012): Real time Hadoop-like processing; the power of Hadoop in real time. Not from Google; from Twitter. Eliminates the requirement for batch processing. Open source: Yes (beta).
  • One Last Thing: …Is Just The Start Of The Equation
  • One Last Thing: Hadoop For Analytics And Determining Boundary Conditions Is Just The Start Of The Equation
    Use Hadoop to analyze all of the data in your environment and then generate mathematical models from that data.
  • One Last Thing: Acting On Boundary Conditions
    Once the model has been built (and vetted), it can be used to resolve events in real time, thereby getting around the batch bottleneck of Hadoop.
  • No Really. One More Last Thing
  • Who Is Hilary Mason?
    • Chief Data Scientist at bit.ly
    • One of the major innovators in data science
    • Scary smart and fun to be around
    • A heck of a teacher, to boot
    (Photo credit: Pinar Ozger, Strata 2011)
  • The Mason 5 Step Process For Big Data, In Reverse Order: Interpret
    The end goal of any Big Data solution is to provide data which can be interpreted into meaningful decisions. But, before we can interpret the data, we must first…
  • The Mason 5 Step Process For Big Data, In Reverse Order: Model
    Model the data into a useful paradigm which will allow us to make sense of any new data based on past experiences. But, before we can model the data, we must first….
  • The Mason 5 Step Process For Big Data, In Reverse Order: Explore
    Explore the data we have and look for meaningful patterns from which we could extract a useful model. But, before we can look through the data for meaningful patterns, we first have to…
  • The Mason 5 Step Process For Big Data, In Reverse Order: Scrub
    Clean and clarify the data we have to make it as neat as possible and easier to manipulate. But, before we can clean the data, we have to start with…
  • The Mason 5 Step Process For Big Data, In Reverse Order: Obtain
    Obtaining as much data as possible. Advances in technology, coupled with Moore’s law, mean that DASD is very, very cheap these days; so much so that you may as well hang on to as much data as you can, because you never know when it will prove useful.
  • Questions?
  • Some Resources
    White Papers:
    • An Architect’s Guide To Big Data
    • Big Data For The Enterprise
    • Big Data Gets Real Time
    • Build vs. Buy For Hadoop
    This Deck: Slideshare
    Web Resources:
    • Oracle Big Data
    • Oracle Big Data Appliance
    • Oracle Big Data Connectors
    Me: charles dot scyphers oracle dot com, @scyphers (twitter)