Big data, data science & fast data
  • Let's start by looking at some of the pioneers in the big data space. These well-known, and highly valuable, enterprises have built their businesses on big data. The numbers they support are staggering.
  • But big data is for more than just internet companies. This slide shows examples of Greenplum customers who are leveraging big data to transform their businesses and drive new revenue streams. We will talk about these in more detail today.
  • Think about what big data is for a moment. Share your thoughts with the group and write your notes in the space below. Is there a size threshold over which data becomes big data? How much does the complexity of its structure influence the designation as big data? How new are the analytical techniques?
  • There are multiple characteristics of big data, but three stand out as defining characteristics: huge volume of data (for instance, tools that can manage billions of rows and billions of columns); complexity of data types and structures, with an increasing volume of unstructured data (80-90% of the data in existence is unstructured), part of the digital shadow or "data exhaust"; and speed, or velocity, of new data creation. In addition, the data, due to its size or level of structure, cannot be efficiently analyzed using only traditional databases or methods. There are many examples of emerging big data opportunities and solutions: Netflix suggesting your next movie rental, dynamic monitoring of embedded sensors in bridges to detect real-time stresses and longer-term erosion, and retailers analyzing digital video streams to optimize product and display layouts and promotional spaces on a store-by-store basis are a few real examples of how big data is involved in our lives today. These kinds of big data problems require new tools and technologies to store, manage, and realize the business benefit. The new architectures this necessitates are supported by new tools, processes, and procedures that enable organizations to create, manipulate, and manage these very large data sets and the storage environments that house them.
  • Big data can come in multiple forms: everything from highly structured financial data, to text files, to multimedia files and genetic mappings. High volume is a consistent characteristic of big data. As a corollary, because of the complexity of the data itself, the preferred approach for processing big data is parallel computing and massively parallel processing (MPP) environments, which enable simultaneous, parallel ingest, data loading, and analysis. As we will see in the next slide, most big data is unstructured or semi-structured in nature, which requires different techniques and tools to process and analyze. Let us examine its most prominent characteristic: its structure.
  • The graphic shows different types of data structures, with 80-90% of future data growth coming from non-structured data types (semi-, quasi-, and unstructured). Although the image shows four separate types of data, in reality these can be mixed together. For instance, you may have a classic RDBMS storing call logs for a software support call center. In this case, you may have typical structured data such as date/time stamps, machine types, problem type, and operating system, probably entered by the support desk person from a pull-down menu GUI. In addition, you will likely have unstructured or semi-structured data, such as free-form call log information taken from an email ticket or an actual phone call description of a technical problem and its solution. The most salient information is often hidden in there. Another possibility would be voice logs or audio transcripts of the actual call associated with the structured data. Until recently, most analysts would NOT be able to analyze the textual information in this call log history, since mining it is very labor intensive and could not easily be automated.
  • Here are examples of what each of the four main types of data structures may look like. People tend to be most familiar with analyzing structured data, while semi-structured data (shown as XML here), quasi-structured data (shown as a clickstream string), and unstructured data present different challenges and require different techniques to analyze. For each data type shown, answer these questions: What type of analytics are performed on these data? Who analyzes this kind of data? What types of data repositories are suited for each, and what requirements might you have for storing and cataloguing this kind of data? Who consumes the data? Who manages and owns the data?
  • Here are four examples of common business problems that organizations contend with today, where they have an opportunity to leverage advanced analytics to create competitive advantage. Rather than doing standard reporting on these areas, organizations can apply advanced analytical techniques to optimize processes and derive more value from these typical tasks. The first three examples are not new problems: companies have been trying to reduce customer churn, increase sales, and cross-sell customers for many years. What is new is the opportunity to fuse advanced analytical techniques with big data to produce more impactful analyses for these old problems. The fourth example portrays emerging regulatory requirements. Many compliance and regulatory laws have been in existence for decades, but additional requirements are added every year, which means additional complexity and data requirements for organizations. These laws, such as anti-money laundering and fraud prevention, require advanced analytical techniques to manage well.
  • The graphic shows a typical data warehouse and some of the challenges it presents. (1) For source data to be loaded into the EDW, it needs to be well understood, structured, and normalized with the appropriate data type definitions. While this kind of centralization lets organizations enjoy the benefits of security, backup, and failover of highly critical data, it also means that data must go through significant pre-processing and checkpoints before it can enter this controlled environment, which does not lend itself to data exploration and iterative analytics. (2) As a result of this level of control on the EDW, shadow systems emerge in the form of departmental warehouses and local data marts that business users create to accommodate their need for flexible analysis. These local data marts do not have the same constraints for security and structure as the EDW, and allow users across the enterprise to do some level of analysis. However, these one-off systems reside in isolation, often are not networked or connected to other data stores, and are generally not backed up. (3) Once in the data warehouse, data is fed to enterprise applications for business intelligence and reporting purposes. These are high-priority operational processes getting critical data feeds from the EDW. (4) At the end of this workflow, analysts get data provisioned for their downstream analytics. Since users cannot run custom or intensive analytics on production databases, analysts create data extracts from the EDW to analyze offline in R or other local analytical tools. Often these tools are limited to in-memory analytics on desktops, analyzing samples of data rather than the entire population of a data set. Because these analyses are based on data extracts, they live in a separate location, and the results of the analysis, along with any insights on the quality of the data or anomalies, are rarely fed back into the main EDW repository. Lastly, because data slowly accumulates in the EDW due to the rigorous validation and data structuring process, data is slow to move into the EDW and the schema is slow to change. An EDW may have been originally designed for a specific purpose and set of business needs, but over time it evolves to house more and more data and to enable business intelligence and the creation of OLAP cubes for analysis and reporting. The EDW provides limited means to accomplish these goals, achieving the objective of reporting, and sometimes the creation of dashboards, but generally limiting the ability of analysts to iterate on the data in an environment separate from production, where they could conduct in-depth analytics or perform analysis on unstructured data.
  • Today's typical data architectures were designed for storing mission-critical data, supporting enterprise applications, and enabling enterprise-level reporting. These functions are still critical for organizations, but these architectures inhibit data exploration and more sophisticated analysis.
  • (Describe or refer to NoSQL and key-value pair stores.) Everyone and everything is leaving a digital footprint. The graphic above provides a perspective on the sources of big data generated by new applications and on the scale and growth rate of the data. These applications provide opportunities for new analytics and for driving value for organizations. The data come from multiple sources, including: medical information, such as genomic sequencing and MRIs; increased use of broadband on the Web, including the 2 billion photos each month that Facebook users currently upload as well as the innumerable videos uploaded to YouTube and other multimedia sites; video surveillance; increased global use of mobile devices (the torrent of texting is not likely to cease); smart devices, with sensor-based collection of information from smart electric grids, smart buildings, and much other public and industry infrastructure; and non-traditional IT devices, including RFID readers, GPS navigation systems, and seismic processing. The big data trend is generating an enormous amount of information that requires advanced analytics, and new market players, to take advantage of it.
  • Big data projects carry several considerations that you need to keep in mind to ensure this approach fits with what you are trying to achieve. Because of the characteristics of big data, these projects lend themselves to decision support for high-value, strategic decision making with high processing complexity. The analytic techniques used in this context need to be iterative and flexible (analysis flexibility), due to the high volume of data and its complexity. These conditions give rise to complex analytical projects (such as predicting customer churn rates) that can be performed with some latency (consider the speed of decision making needed), or to operationalized analytical techniques that combine advanced analytical methods, big data, and machine learning algorithms to provide real-time (requiring high throughput) or near-real-time analysis, such as recommendation engines that look at your recent web history and purchasing behavior. In addition, to be successful you will need a different approach to the data architecture than is seen in today's typical EDWs. Analysts need to partner with IT and DBAs to get the data they need within an analytic sandbox, which contains raw data, aggregated data, and data with multiple kinds of structure. The sandbox requires a more savvy user to take advantage of it and to explore data in a more robust way.
  • The loan process has been honed to a science over the past several decades. Unfortunately, today's realities require lenders to make better decisions with fewer resources than they have had in the past. The typical loan process uses a set of data on which pre-approval and underwriting approval are based, including: income data, such as pay and income tax records; employment history, to establish the ability to meet loan obligations; credit history, including credit scores and outstanding debt; and appraisal data associated with the asset for which the loan is made (such as a home, boat, or car). This model works, but it is not perfect; in fact, the loan crisis in the US is proof that using only these data points may not be enough to gauge the risk associated with making sound lending decisions and pricing loans properly. Case study exercise. Objectives: using additional data sources, dramatically improve the quality of the loan underwriting process, and streamline the process to yield results in less time. Directions: suggest kinds of publicly available data (big data) that you could leverage to supplement the traditional lending process, and suggest types of analysis you would perform with the data to reduce the bank's risk and expedite the lending process.
  • This is the standard format we will use for each representative example.
  • Check http://wiki.apache.org/hadoop/PoweredBy for examples of how people are using Hadoop. See this article on the large-scale image conversion: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/. And see this ad for a 'computer' from 1892: http://query.nytimes.com/mem/archive-free/pdf?res=9F07E0D81438E233A25751C0A9639C94639ED7CF
  • Use the space here to record your answers to these questions:
  • Greenplum is driving the future of big data analytics with the industry's first Unified Analytics Platform (UAP), which delivers: the award-winning Greenplum Database for structured data; Greenplum HD, the enterprise Hadoop offering, for the analysis and processing of unstructured data; and Greenplum Chorus, which acts as the productivity layer for the data science team. Greenplum UAP is more than just integrated software working together; it is a single, unified platform enabling powerful and agile analytics that can transform how your organization uses data. What sets this diagram apart from a typical vendor example is the inclusion of people. That is not a mistake. We have introduced the Unified Analytics Platform, but there is more to the story than technology, and I will talk more about that in a few minutes. UAP is about enabling an emerging group of talent, the new practitioners we refer to as the data science team. This team can include the data platform administrator, data scientists, analysts, engineers, BI teams, and, most importantly, the line-of-business users and how they participate on the data science team. We develop, package, and support this as a unified software platform available on your favorite commodity hardware, on cloud infrastructure, or as our modular Data Computing Appliance. Moore's Law (named after Gordon Moore, the founder of Intel) states that the number of transistors that can be placed in a processor will double approximately every two years, for half the cost. But trends in chip design are changing to face new realities. While we can still double the number of transistors per unit area at this pace, this does not necessarily result in faster single-threaded performance. Newer processors such as the Intel Core 2 and Itanium 2 architectures now focus on embedding many smaller CPUs, or "cores," onto the same physical device. This allows multiple threads to process twice as much data in parallel, but at the same speed at which they operated previously.
  • Greenplum Database's strengths are on the structured side of the house; its functionality is built around the fact that the data is structured. With Greenplum MapReduce and large text objects, Greenplum Database is also able to do some things that are considered unstructured data analysis.
  • Unfortunately, people may use the word "Hadoop" to mean multiple things. They may use it to describe the MapReduce paradigm, or they may use it to describe massive unstructured data storage using commodity hardware (although commodity doesn't mean inexpensive). On the other hand, they may be referring to the Java classes provided by Hadoop that support HDFS file types or provide MapReduce job management. Or they may be referring to HDFS, the Hadoop Distributed File System. And they might mean both HDFS and MapReduce. The point is that Hadoop enables the data scientist to create MapReduce jobs quickly and efficiently. As we shall see, one can utilize Hadoop at multiple levels: writing MapReduce modules in Java, leveraging streaming mode to write such functions in one of several scripting languages, or utilizing a higher-level interface such as Pig or Hive. The web site http://hadoop.apache.org/ provides a solid foundation for unstructured data mining and management. So what exactly is Hadoop? The quick answer is that Hadoop is a framework for performing big data analytics, and as such is an implementation of the MapReduce programming model. Hadoop comprises two main components: HDFS for storing big data and MapReduce for big data analytics. The storage function is provided by HDFS (the Hadoop Distributed File System), a reliable, redundant, distributed file system optimized for large files. The analytics function is provided by MapReduce, which consists of a Java API as well as software to implement the services that Hadoop needs to function. Hadoop glues the storage and analytics together in a framework that provides reliability, scalability, and management of the data.
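The note above mentions Hadoop's streaming mode, which lets mappers and reducers be written in a scripting language. Below is a minimal, hypothetical word-count job in Python as a sketch of that idea; the file name and layout are illustrative, not taken from the deck. Streaming delivers each reducer's input sorted by key, which is what makes the single-pass summing valid.

```python
#!/usr/bin/env python
# wordcount.py -- a toy Hadoop Streaming job: one script that acts as
# the mapper when invoked with "map" and as the reducer with "reduce".
import sys

def run_mapper(stream):
    # emit one tab-separated (word, 1) pair per word
    for line in stream:
        for word in line.strip().split():
            print("%s\t1" % word.lower())

def run_reducer(stream):
    # input arrives sorted by key, so counts for a word are contiguous
    current, total = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    run_mapper(sys.stdin) if sys.argv[1] == "map" else run_reducer(sys.stdin)
```

Such a script would typically be submitted with the streaming jar that ships with Hadoop, along the lines of `hadoop jar hadoop-streaming-*.jar -input /logs -output /counts -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py`; the exact jar name and flags vary by distribution.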
  • Let's look a little deeper at HDFS. Between MapReduce and HDFS, Hadoop supports four different node types (a node is a particular machine on the network). The NameNode and the DataNode are part of the HDFS implementation. Apache Hadoop has one NameNode and multiple DataNodes (there may be a secondary NameNode as well, but we won't consider that here). The NameNode service acts as a regulator/resolver between a client and the various DataNode servers: it manages the name space by determining which DataNode contains the data requested by the client and redirecting the client to that particular DataNode. DataNodes are (oddly enough) where the data is actually stored. Hadoop is "rack aware": the NameNode and the JobTracker node use a data structure that determines which DataNode is preferred based on the "network distance" between them, and nodes that are "closer" are preferred (same rack before a different rack in the same data center). The data itself is replicated across racks, which means that a failure in one rack will not halt data access, at the expense of possibly slower response. Since HDFS isn't intended for near-real-time access anyway, this is acceptable in the majority of cases.
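To make the "network distance" idea concrete, here is a small conceptual toy in Python (not Hadoop's actual implementation, and the distance values are illustrative): prefer a replica on the same node, then the same rack, then the same data center, falling back to anything else.

```python
# Toy sketch of rack-aware replica selection: smaller "distance" wins.
def distance(client, replica):
    if client["host"] == replica["host"]:
        return 0          # same node
    if client["rack"] == replica["rack"]:
        return 2          # same rack
    if client["dc"] == replica["dc"]:
        return 4          # same data center
    return 6              # remote data center

def pick_replica(client, replicas):
    return min(replicas, key=lambda r: distance(client, r))

client = {"host": "h1", "rack": "r1", "dc": "dc1"}
replicas = [
    {"host": "h7", "rack": "r1", "dc": "dc1"},   # same rack
    {"host": "h9", "rack": "r4", "dc": "dc1"},   # same data center
    {"host": "h2", "rack": "r8", "dc": "dc2"},   # remote
]
print(pick_replica(client, replicas))  # -> the same-rack replica, h7
```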
  • The MapReduce function within Hadoop depends on two different node types: the JobTracker and the TaskTracker. One JobTracker node exists for each MapReduce implementation. JobTracker nodes are responsible for distributing the mapper and reducer functions to available TaskTrackers and monitoring the results, while TaskTracker nodes actually run the jobs and communicate results back to the JobTracker. Communication between nodes is often through files and directories in HDFS, so inter-node (network) communication is minimized. Let's consider the example above. Initially (1), we have a very large data set containing log files, sensor data, and the like. HDFS stores replicas of that data (represented here by the blue, yellow, and beige icons) across DataNodes. In step 2, the client defines and executes a map job and a reduce job on a particular data set and sends them both to the JobTracker, where, in step 3, the jobs are in turn distributed to the TaskTracker nodes. Each TaskTracker runs the mapper, and the mapper produces output that is itself stored in the HDFS file system. Lastly, in step 4, the reduce job runs across the mapped data to produce the result. We have deliberately skipped much of the complexity involved in the MapReduce implementation, specifically the steps that provide the "sorted by key" guarantee the MapReduce framework offers to its reducers. Hadoop provides a web-based GUI for the NameNode, JobTracker, and TaskTracker nodes; we will see more of this in the lab associated with this lesson.
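As a single-process sketch of steps 2-4 above (my own toy, not Hadoop code), the following Python mirrors the flow: run a mapper over the input splits, shuffle/sort the intermediate pairs by key (the guarantee mentioned above), then run the reducer over each key group.

```python
# Toy map -> shuffle/sort -> reduce over simulated web-log records.
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # emit (status_code, 1) for each log record
    yield (record.split()[-1], 1)

def reducer(key, values):
    return (key, sum(values))

splits = ["GET /index 200", "GET /missing 404", "GET /home 200"]

intermediate = [pair for rec in splits for pair in mapper(rec)]
intermediate.sort(key=itemgetter(0))                  # the "shuffle/sort"
results = [reducer(k, (v for _, v in grp))
           for k, grp in groupby(intermediate, key=itemgetter(0))]
print(results)   # [('200', 2), ('404', 1)]
```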
  • In Pig and Hive, the presence of HDFS is very noticeable. Pig, for example, directly supports most of the Hadoop file system commands. Likewise, Hive can access data whether it is local or stored in HDFS. In either case, data can usually be specified via an HDFS URL (hdfs://<path>). In the case of HBase, however, Hadoop is mostly hidden inside the HBase framework, and HBase provides data to the client via a programmatic interface (usually Java). Via these interfaces, a data scientist can focus on manipulating large datasets without being concerned with the inner workings of Hadoop. Of course, a data scientist must be aware of the constraints associated with using Hadoop for data storage, but does not need to know the exact Hadoop command to check the file system.
  • Pig is a data flow language and an execution environment for accessing the MapReduce functionality of Hadoop (as well as HDFS). Pig consists of two main elements: a data flow language called Pig Latin (ig-pay atin-lay), and an execution environment, either standalone or using HDFS for data storage. A word of caution is in order: if you only want to touch a small portion of a given dataset, Pig is not for you, since it only knows how to read all the data presented to it. Pig also only supports batch processing of data, so if you need an interactive environment, Pig isn't for you either.
  • The Hive system is aimed at the data scientist with strong SQL skills. Think of Hive as occupying a space between Pig and a DBMS (although that DBMS does not have to be a relational DBMS [RDBMS]). In Hive, all data is stored in tables, and the schema for each table is managed by Hive itself. Tables can be populated via the Hive interface, or a Hive schema can be applied to existing data stored in HDFS.
  • HBase represents a further layer of abstraction on Hadoop. HBase has been described as "a distributed column-oriented database [data storage system]" built on top of HDFS. Note that HBase is described as managing structured data: each record in a table can be described as a key (treated as a byte stream) and a set of variables, each of which may be versioned. It is not structured in the same sense as an RDBMS is structured. HBase is a more complex system than what we have seen previously. It uses additional Apache Foundation open source frameworks: ZooKeeper as a coordination system to maintain consistency, Hadoop for MapReduce and HDFS, and Oozie for workflow management. As a data scientist, you probably won't be concerned overmuch with the implementation, but it is useful to at least know the names of all the moving parts. HBase can be run from the command line, but it also supports REST (Representational State Transfer, think HTTP) as well as Thrift and Avro interfaces via the Siteserver daemon. Thrift and Avro both provide an interface to send and receive serialized data (objects where the data is "flattened" into a byte stream).
  • Although HBase may look like a traditional DBMS, it isn't one. HBase is a "distributed, column-oriented data storage system that can scale tall (billions of rows), wide (billions of columns), and can be horizontally partitioned and replicated across thousands of commodity servers automatically." HBase table schemas mirror the physical storage for efficiency; an RDBMS schema doesn't (it is a logical description of the data and implies no specific physical structuring). Most RDBMS systems require that data be consistent after each transaction (the ACID properties). Systems like HBase don't carry these constraints and instead implement eventual consistency, which means that on some systems you cannot write a value into the database and immediately read it back. Strange, but true. Another of HBase's strengths is its wide-open view of data: HBase will accept almost anything it can cram into an HBase table.
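The data model described above (row key, column families, versioned cells) can be sketched with a plain Python nested dictionary; this is a conceptual illustration only, not the HBase API, and the names are made up for the example.

```python
# Conceptual sketch of the HBase data model:
# row key -> {column family: {qualifier: {timestamp: value}}}
table = {}

def put(row, family, qualifier, value, ts):
    table.setdefault(row, {}).setdefault(family, {}) \
         .setdefault(qualifier, {})[ts] = value

def get(row, family, qualifier):
    versions = table[row][family][qualifier]
    return versions[max(versions)]          # newest version wins by default

put("user#42", "profile", "name", "Ada", ts=1)
put("user#42", "profile", "name", "Ada L.", ts=2)   # a newer version
print(get("user#42", "profile", "name"))             # -> "Ada L."
```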
  • Mahout is a set of machine learning algorithms that leverages Hadoop for both data storage and the MapReduce implementation. The mahout command is itself a script that wraps the hadoop command and executes a requested algorithm from the Mahout job jar file (jar files are Java ARchives, very similar to Linux tar files [tape archives]). Parameters are passed from the command line to the class instance. Mahout mainly supports four use cases. Recommendation mining takes users' behavior and tries to find items users might like; an example of this is LinkedIn's "People You May Know" (PYMK). Classification learns from existing categorized documents what documents of a specific category look like, and assigns unlabelled documents to the (hopefully) correct category. Clustering takes documents and groups them into collections of topically related documents based on word occurrences. Frequent itemset mining takes a set of item groups (for example, terms in a query session or shopping cart contents) and identifies which individual items usually appear together. If you plan on using Mahout, remember that these distributions (Hadoop and Mahout) anticipate running on a *nix machine, although a Cygwin environment on Windows will work as well (or you can rewrite the command scripts in another language, say as a batch file on Windows). It goes without saying that a compatible working version of Hadoop is required. Lastly, Mahout requires that you program in Java; no interface other than the command line is supported.
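To give a feel for one of these use cases, here is a minimal k-means clustering sketch in plain Python over one-dimensional "customer value" scores; Mahout ships a distributed version of this kind of algorithm, so treat this purely as an illustration of the idea, with toy data.

```python
# Minimal k-means: assign points to the nearest center, then move each
# center to the mean of its cluster, and repeat.
import random

random.seed(0)

def kmeans(points, k, iters=20):
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

customer_value = [1, 2, 2, 3, 10, 11, 12, 30, 31, 33]
centers, clusters = kmeans(customer_value, k=3)
print(centers)
print(clusters)
```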
  • Greenplum Database utilizes a shared-nothing, massively parallel processing (MPP) architecture that has been designed for complex business intelligence (BI) and analytical processing. Most of today's general-purpose relational database management systems are designed for online transaction processing (OLTP) applications, but BI and analytical workloads are fundamentally different from OLTP transaction workloads and require a profoundly different architecture. The Greenplum Database is fully parallel and highly optimized for executing both SQL and MapReduce queries. Additionally, the system offers a new level of parallel analysis capabilities for data scientists, with support for SAS, R, linear algebra, and machine learning primitives, and includes extensibility for functions written in Java, C, Perl, or Python. Because of the shared-nothing MPP architecture, the system is linearly scalable: simply add nodes and the database's performance and capacity improve. Expansions are performed online, keeping the database available for production workloads.
  • Logical depiction (top portion): logically, gNet enables data in multiple formats that resides in the Hadoop HDFS file system to be used as though it were a table in Greenplum Database. This is the essence of co-processing: we can select, filter, join, modify, and aggregate, essentially all normal SQL operations, on the combination of RDBMS data in Greenplum Database and data stored in Hadoop, as though all the data were in the database. The results are: real-time access, with fast access to new data as it arrives and no waiting for reformatting and periodic movement processes to copy data into the database; space efficiency, with no duplication of data (big data makes any plan to duplicate data very expensive, even on so-called "cheap storage"); query efficiency, where moving frequently accessed data for local access in the database results in a desirable reduction in gNet traffic; and archival, supporting information lifecycles where data arrives on one platform but, as it ages, is moved to another platform to achieve a lower cost of retention. Consider the cost of HDFS storage: it is low, so some customers will generate and manipulate data in the database for simplicity but archive the data in Hadoop. With co-processing over gNet, the data remains available even after it has been archived in HDFS files.
  • The Greenplum Database was conceived, designed, and engineered to let customers take advantage of large clusters of increasingly powerful and economical general-purpose servers, storage, and Ethernet switches. With this approach, EMC Greenplum customers gain immediate benefit from the industry's latest computing innovations. Greenplum's MPP shared-nothing architecture delivers industry-leading performance on big data. You can compare the impact to finding a specific card, say the Ace of Spades, in a deck. If you do it yourself, it could take you up to 52 tries to find the Ace of Spades. If you distribute the deck across 26 people, it will take each of them at most 2 tries. Likewise, Greenplum distributes processing across nodes, and these nodes work independently and in parallel to quickly deliver answers.
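The arithmetic behind the card analogy is simple enough to show directly; this is just a toy calculation of worst-case scans per worker when a 52-card deck is split evenly across N workers, not anything Greenplum-specific.

```python
# Worst-case cards each worker must inspect when the deck is partitioned.
import math

def worst_case_scans(cards=52, workers=1):
    return math.ceil(cards / workers)

for n in (1, 2, 13, 26):
    print("%2d worker(s): at most %2d cards each" % (n, worst_case_scans(workers=n)))
# 1 worker -> 52 cards; 26 workers -> 2 cards, matching the analogy above
```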
  • Please take a moment to answer these questions. Record your answers here.
  • As background, it is important to understand that business intelligence is different from data science and analytics. BI deals with reporting on history: what happened last quarter? How many did we sell? Data science is about predicting the future and understanding why things happen: what is the optimal solution? What will happen next? For many companies, data science is a new approach to understanding the business, yet an important one to undertake today.
  • Here are five main competency and behavioral characteristics of data scientists. Quantitative skills, such as mathematics or statistics. Technical aptitude, such as software engineering, machine learning, and programming skills. Skepticism: this may be a counterintuitive trait, but it is important that data scientists can examine their work critically rather than one-sidedly. Curiosity and creativity: data scientists must be passionate about data and about finding creative ways to solve problems and portray information. Communication and collaboration: it is not enough to have strong quantitative or engineering skills; to make a project resonate, you must be able to articulate its business value clearly and work collaboratively with project sponsors and key stakeholders.
  • In using Greenplum as the foundation for lab work, we have started to converge on a standard set of tools for the various stages of our analyses. For data cleansing and transformation, we do most of our work in SQL; MapReduce is also useful, especially for unstructured data. For data exploration we also use SQL, as well as R, which is particularly useful for generating summary statistics, analyzing significance, and plotting data visualizations such as frequency distributions, densities, scatter plots, and so on. For model building, we typically use R. It operates very well on file extracts, but these can be cumbersome and can slow down the modeling process, so it is also useful to read data directly from the database into dataframes via RPostgreSQL (which uses the RDbi interface and is therefore considerably faster than RODBC). For very large data sets, it is often best to use Greenplum's built-in SQL analytics and the Analytics Library. Models built in R can be executed on file extracts, but in most cases it is desirable to run them on the complete set of records in the database; in that case, they can run in the database as PL/R after a simple conversion, and for optimal performance they can be converted to SQL. In many cases we work with legacy models that were built in SAS. We are developing methods to convert these to SQL or PL/Java, and we are also working with SAS Engineering to co-develop 'Accelerator' functions. (This note applies to the next several slides as well.)
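The note above describes pulling data from the database straight into R dataframes via RPostgreSQL. As a rough analogue in Python, the sketch below uses psycopg2 and pandas (my choice of libraries, not the tools named in the deck); Greenplum speaks the PostgreSQL wire protocol, so a standard PostgreSQL driver is assumed to work, and the connection details and table are placeholders.

```python
# Push heavy aggregation into the MPP database; pull back only the
# summarized result for local exploration and plotting.
import pandas as pd
import psycopg2

conn = psycopg2.connect(host="gp-master.example.com", dbname="analytics",
                        user="analyst", password="***")

df = pd.read_sql(
    """
    SELECT customer_segment,
           COUNT(*)     AS customers,
           AVG(balance) AS avg_balance
    FROM   accounts
    GROUP  BY customer_segment
    """,
    conn,
)
print(df.describe())
conn.close()
```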
  • We can have an *animated* view of color-coded traffic volumes on Google Earth over a user-specified period. The file that produces the animation is created within Greenplum. The Google Map display is similar to this, but it only provides traffic volume at a specific time.
  • Eight banks become one; branches across the US; consolidation of products and customers; employees faced with new products and customers; old does not necessarily equal new. What to recommend to customers? It needs to make the bank money and it needs to make the customer money; overlap with existing products is challenging; and the cost of acquiring a new customer is significantly higher than the cost of selling additional products to existing customers.
  • Here is an example in which we used clustering techniques (grouping similar objects together) and a form of "market basket analysis" (if you bought one set of products, you might be interested in another) to create a simple product recommendation engine. First, we defined a measurement of customer value. (This particular customer already had a way of computing that, but it took 20 hours to run in a separate database. Now it runs in Greenplum in less than an hour, so they run it regularly as part of their ETL process.) Next, we created groups of customers based on product usage. We did this by defining a "distance" between customers so that those who owned a similar assortment of products would be measured as being close, and we then used this notion of distance to identify clusters of customers.
  • Then we used various methods, including "association rules" (the technique used in market basket analysis on sites such as Amazon), to identify common product associations. In other words, by looking at product usage across millions of customers, we found that certain groups of products tended to occur together. By restricting our analysis to a certain segment of the population (in this case, based on customer value), we were more likely to find product groupings that made sense for that customer segment. A small sketch of the association-rules idea follows below.
  • We used these results to make product recommendations. For a given customer, we used the product associations to determine which new products made sense, then filtered out products that were disproportionately associated with customers of lower value. The remaining products were then more likely to move the customer into a higher value segment. The client referred to this as "filling incomplete baskets." Verticals: this applies to any organization that advertises to a sufficiently large number of customers.
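As referenced above, here is a minimal sketch of the association-rules idea in Python, using toy product baskets rather than the client's data: estimate support and confidence for item pairs, keep the strong rules, and recommend the consequents the customer does not already own. The thresholds and items are illustrative only.

```python
# Toy association rules over product "baskets": support = fraction of
# baskets containing both items; confidence = P(consequent | antecedent).
from itertools import permutations

baskets = [
    {"checking", "savings", "credit_card"},
    {"checking", "credit_card", "auto_loan"},
    {"checking", "savings", "cd"},
    {"checking", "savings", "credit_card", "cd"},
]
n = len(baskets)

def rule_stats(a, b):
    both = sum(1 for bk in baskets if a in bk and b in bk)
    has_a = sum(1 for bk in baskets if a in bk)
    return both / n, (both / has_a if has_a else 0.0)

rules = []
for a, b in permutations(set().union(*baskets), 2):
    support, confidence = rule_stats(a, b)
    if support >= 0.5 and confidence >= 0.7:
        rules.append((a, b, support, confidence))

def recommend(owned):
    # consequents of strong rules whose antecedent the customer already owns
    return sorted({b for a, b, _, _ in rules if a in owned and b not in owned})

print(recommend({"checking", "savings"}))   # -> ['credit_card']
```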
  • Modern applications need to respond faster and capture more information so the business can perform the analysis needed to make the best business decisions. By combining the best online transaction processing (OLTP) product and the best online analytical processing (OLAP) product, we can create a platform that lets businesses make the best use of historical and real-time data. By utilizing the strengths of both OLTP and OLAP systems, we can create a platform in which each covers the other's weaknesses. Traditionally, OLAP databases excel at handling petabytes of information but are not geared for fine-grained, low-latency access; similarly, OLTP systems excel at fine-grained, low-latency access but may fall short when handling large-scale data sets with ad hoc queries. To solve the OLTP aspect of this problem we have chosen vFabric SQLFire. SQLFire is a memory-optimized, shared-nothing, distributed SQL database delivering dynamic scalability and high performance for data-intensive modern applications. SQLFire's memory-optimized architecture minimizes time spent waiting for disk access, the main performance bottleneck in traditional databases. SQLFire achieves dramatic scaling by pooling memory, CPU, and network bandwidth across a cluster of machines, and can manage data across geographies. For the OLAP aspect we will be looking at EMC Greenplum. Greenplum was built to support big data analytics; Greenplum Database manages, stores, and analyzes terabytes to petabytes of data. Users experience 10 to 100 times better performance over traditional RDBMS products, a result of Greenplum's shared-nothing, massively parallel processing architecture, high-performance parallel dataflow engine, and advanced gNet software interconnect technology.

Presentation Transcript

  • Big Data Analytics, Data Science & Fast Data — Kunal Joshi (joshik@vmware.com). Copyright © 2012 EMC Corporation. All Rights Reserved.
  • Agenda: 1. Introduction to Big Data Analytics; 2. Big Data Analytics - Use Cases; 3. Technologies for Big Data Analytics; 4. Introduction to Data Science; 5. Data Science - Use Cases; 6. Introduction to Fast Data; 7. Fast Data - Use Cases; 8. Fast Data meets Big Data.
  • Big Data Pioneers: 1,000,000,000 queries a day; 250,000,000 new photos a day; 290,000,000 updates a day.
  • Other Companies Using Big Data: 4,000,000 claims a day; 2,800,000,000 trades a day; 31,000,000,000 interactions a day.
  • Moore's Law: Gordon Moore (founder of Intel). The number of transistors that can be placed in a processor DOUBLES approximately every TWO years.
  • Introduction to Big Data Analytics: What is Big Data? What makes data "Big" Data? Your thoughts?
  • Big Data Defined: "Big Data" is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value. It requires new data architectures and analytic sandboxes, new tools, new analytical methods, and the integration of multiple skills into the new role of the data scientist. Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real-time capabilities. (Source: McKinsey, May 2011, "Big Data: The next frontier for innovation, competition, and productivity.")
  • Key Characteristics of Big Data: 1. Data Volume - a 44x increase from 2010 to 2020 (1.2 zettabytes to 35.2 ZB). 2. Processing Complexity - changing data structures; use cases warranting additional transformations and analytical techniques. 3. Data Structure - a greater variety of data structures to mine and analyze.
  • Big Data Characteristics: Data Structures - data growth is increasingly unstructured. Structured: data containing a defined data type, format, and structure (example: transaction data and OLAP). Semi-structured: textual data files with a discernible pattern, enabling parsing (example: XML data files that are self-describing and defined by an XML schema). "Quasi"-structured: textual data with erratic data formats, which can be formatted with effort, tools, and time (example: web clickstream data that may contain some inconsistencies in data values and formats). Unstructured: data that has no inherent structure and is usually stored as different types of files (example: text documents, PDFs, images, and video).
  • Four Main Types of Data Structures: structured data; semi-structured data; quasi-structured data (example: a clickstream URL such as http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651); unstructured data (example: the text of "The Red Wheelbarrow" by William Carlos Williams, or a web page's View Source output).
  • Business Drivers for Big Data Analytics - current business problems provide opportunities for organizations to become more analytical and data driven. 1. Desire to optimize business operations: sales, pricing, profitability, efficiency. 2. Desire to identify business risk: customer churn, fraud, default. 3. Predict new business opportunities: upsell, cross-sell, best new customer prospects. 4. Comply with laws or regulatory requirements: anti-money laundering, fair lending, Basel II.
  • Challenges with a Traditional Data Warehouse (diagram): data sources feed the EDW through prioritized operational processes; departmental warehouses, "spread marts," and errant data marts grow up around it; analytics are siloed; static schemas accrete over time; data provisioning for analysts is non-prioritized; models are non-agile; enterprise applications and reporting draw from the EDW.
  • Implications of a Traditional Data Warehouse: high-value data is hard to reach and leverage; predictive analytics and data mining activities are last in line for data (queued after prioritized operational processes); data moves in batches from the EDW to local analytical tools (in-memory analytics such as R, SAS, SPSS, Excel; sampling can skew model accuracy); isolated, ad hoc analytic projects rather than centrally managed harnessing of analytics (non-standardized initiatives, frequently not aligned with corporate business goals). The result is slow "time-to-insight" and reduced business impact.
  • Opportunities for a New Approach to Analytics - new applications driving data volume: 1990s (RDBMS & data warehouse), measured in terabytes (1 TB = 1,000 GB); 2000s (content & digital asset management), measured in petabytes (1 PB = 1,000 TB); 2010s (NoSQL & key/value), will be measured in exabytes (1 EB = 1,000 PB).
  • Considerations for Big Data Analytics - criteria for big data projects: 1. Speed of decision making; 2. Throughput; 3. Analysis flexibility. New analytic architecture: the analytic sandbox - data assets gathered from multiple sources and technologies for analysis; enables high-performance analytics using in-database processing; reduces costs associated with data replication into "shadow" file systems; "analyst-owned" rather than "DBA-owned."
  • State of the Practice in Analytics: Mini-Case Study - big data enabled loan processing at XYZ bank. Compare the underwriting risk level with traditional data leveraged against the risk level with big data leveraged. Your thoughts?
  • Big Data Analytics: Industry Examples - 1. Health Care: reducing the cost of care; 2. Public Services: preventing pandemics; 3. Life Sciences: genomic mapping; 4. IT Infrastructure: unstructured data analysis; 5. Online Services: social media for professionals. Data collectors include retail, phone/TV, government, internet, medical, and financial sources.
  • Copyright © 2012 EMC Corporation. All Rights Reserved.EMC2PROVEN PROFESSIONALBig Data Analytics: HealthcareUse of Big DataKeyOutcomesSituation• Poor police response and problems with medical care, triggeredby shooting of a Rutgers student• The event drove local doctor to map crime data and examinelocal health care• Dr. Jeffrey Brenner generated his own crime maps from medicalbilling records of 3 hospitals• City hospitals & ER‟s provided expensive care, low quality care• Reduced hospital costs by 56% by realizing that 80% of city‟smedical costs came from 13% of its residents, mainly low-income or elderly• Now offers preventative care over the phone or through homevisits120Module 1: Introduction to BDA
  • Copyright © 2012 EMC Corporation. All Rights Reserved.EMC2PROVEN PROFESSIONALBig Data Analytics: Public ServicesUse of Big DataKeyOutcomesSituation• Threat of global pandemics has increased exponentially• Pandemics spreads at faster rates, more resistant to antibiotics• Created a network of viral listening posts• Combines data from viral discovery in the field, research indisease hotspots, and social media trends• Using Big Data to make accurate predications on spread of newpandemics• Identified a fifth form of human malaria, including its origin• Identified why efforts failed to control swine flu• Proposing more proactive approaches to preventing outbreaks2Module 1: Introduction to BDA 21
  • Copyright © 2012 EMC Corporation. All Rights Reserved.EMC2PROVEN PROFESSIONALBig Data Analytics: Life SciencesUse of Big DataKeyOutcomesSituation • Broad Institute (MIT & Harvard) mapping the Human Genome• In 13 yrs, mapped 3 billion genetic base pairs; 8 petabytes• Developed 30+ software packages, now shared publicly, alongwith the genomic data• Using genetic mappings to identify cellular mutations causingcancer and other serious diseases• Innovating how genomic research informs new pharmaceuticaldrugs3Module 1: Introduction to BDA 22
  • Copyright © 2012 EMC Corporation. All Rights Reserved.EMC2PROVEN PROFESSIONALBig Data Analytics: IT InfrastructureUse of Big DataKeyOutcomesSituation • Explosion of unstructured data required new technology toanalyze quickly, and efficiently• Doug Cutting created Hadoop to divide large processing tasksinto smaller tasks across many computers• Analyzes social media data generated by hundreds ofthousands of users• New York Times used Hadoop to transform its entire publicarchive, from 1851 to 1922, into 11 million PDF files in 24 hrs• Applications range from social media, sentiment analysis,wartime chatter, natural language processing4Module 1: Introduction to BDA 23
Big Data Analytics: Online Services
Situation
• Opportunity to create a social media space for professionals
Use of Big Data
• Collects and analyzes data from over 100 million users
• Adding 1 million new users per week
Key Outcomes
• LinkedIn Skills, InMaps, Job Recommendations, Recruiting
• Established a diverse data science group, as the founder believes this is the start of the Big Data revolution
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Greenplum Unified Analytic Platform
• GREENPLUM CHORUS: analytic productivity layer, with partner tools and services
• GREENPLUM DATABASE and GREENPLUM HD, connected by Greenplum gNet
• Supports the data science team: data scientist, data engineer, data analyst, BI analyst, LOB user, data platform admin
• Runs on cloud, x86 infrastructure, or an appliance
• Themes: unify your team, drive collaboration, keep your options open, the power of data co-processing
Greenplum Hadoop: toward the unstructured end of the spectrum
• XML, JSON, flat files, and SequenceFiles stored in plain directories
• Schema on load; no ETL required
• Processed with MapReduce (Java), Pig, and Hive

Greenplum Database: toward the structured end of the spectrum
• SQL over an RDBMS with tables and schemas
• Indexing and partitioning
• Greenplum MapReduce
• Works with standard BI tools
What Do We Mean by Hadoop?
• A framework for handling big data; an implementation of the MapReduce paradigm
• Hadoop glues the storage and analytics together and provides reliability, scalability, and management
Two main components
• Storage (Big Data): HDFS, the Hadoop Distributed File System - a reliable, redundant, distributed file system optimized for large files
• MapReduce (Analytics): a programming model for processing sets of data - mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answers
Hadoop Distributed File System
• [Figure: HDFS architecture]
MapReduce and HDFS
• A client submits a large data set (e.g., log files, sensor data) stored in HDFS
• The Job Tracker splits the work into Map and Reduce jobs and assigns them to Task Trackers
• Task Trackers run the Map and Reduce jobs on the nodes holding the data, and results are written back to HDFS
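To make the Map/Reduce model concrete, here is a minimal sketch in the Hadoop Streaming style, where the mapper and reducer are plain programs reading stdin and writing tab-separated key/value lines. The word-count task, file names, and the local test pipeline are illustrative assumptions, not taken from the slides.

```python
#!/usr/bin/env python3
"""Minimal word-count sketch in the Hadoop Streaming style.

Local test (assumption, for illustration):
    cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
"""
import sys
from itertools import groupby


def mapper(lines):
    # Map phase: emit (word, 1) for every word in the input split.
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer(lines):
    # Reduce phase: input arrives sorted by key, so consecutive lines with the
    # same word can be summed with groupby to produce one answer per word.
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{word}\t{total}")


if __name__ == "__main__":
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if phase == "map" else reducer)(sys.stdin)
```

In a real cluster, Hadoop runs many copies of the mapper near the HDFS blocks, sorts and shuffles the intermediate pairs, and feeds them to the reducers, which is exactly the Job Tracker / Task Tracker flow described above.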
Components of Hadoop
• Pig: a data flow language and execution environment
• Hive: an SQL-based language
• HBase: queries against defined tables
• As you move from Pig to Hive to HBase, you move increasingly away from the mechanics of Hadoop toward an RDBMS-like view of the Big Data world
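As a sketch of the "RDBMS-like view" that Hive offers, the snippet below issues a HiveQL query from Python. PyHive is assumed as the client library, and the HiveServer2 host, table, and columns are hypothetical.

```python
# Sketch: querying Hive from Python with the PyHive client.
# Assumptions: a HiveServer2 endpoint at hive-server:10000 and a hypothetical
# table page_views(url, user_id); both are for illustration only.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="analyst")
cur = conn.cursor()

# HiveQL looks like SQL, but Hive compiles it into Hadoop jobs behind the scenes.
cur.execute("""
    SELECT url, COUNT(DISTINCT user_id) AS visitors
    FROM page_views
    GROUP BY url
    ORDER BY visitors DESC
    LIMIT 10
""")
for url, visitors in cur.fetchall():
    print(url, visitors)

cur.close()
conn.close()
```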
Greenplum Database: Extreme Performance for Analytics
• Optimized for BI and analytics
  - Deep integration with statistical packages
  - High-performance parallel implementations
• Simple and automatic parallelization
  - Just load and query like any database
  - Tables are automatically distributed across nodes
  - No need for manual partitioning or tuning
• Extremely scalable
  - MPP (Massively Parallel Processing) shared-nothing architecture
  - All nodes can scan and process in parallel
  - Linear scalability by adding nodes, where each node adds storage, query, and load performance
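A small sketch of "load and query like any database": Greenplum speaks PostgreSQL's wire protocol, so a standard driver such as psycopg2 can be used from Python. The connection details, table, and columns below are hypothetical; DISTRIBUTED BY is Greenplum's clause for spreading rows across segments.

```python
# Sketch: creating and querying an automatically distributed table in Greenplum.
import psycopg2

conn = psycopg2.connect(host="gp-master", port=5432, dbname="analytics",
                        user="gpadmin", password="secret")
cur = conn.cursor()

# DISTRIBUTED BY tells Greenplum how to hash rows across segments; no manual
# per-node partitioning is needed.
cur.execute("""
    CREATE TABLE IF NOT EXISTS trips (
        trip_id   bigint,
        pickup_ts timestamp,
        fare_usd  numeric
    ) DISTRIBUTED BY (trip_id)
""")
conn.commit()

# Ordinary SQL: the master plans the query and every segment scans and
# aggregates its own slice of the data in parallel.
cur.execute("SELECT date_trunc('day', pickup_ts), sum(fare_usd) "
            "FROM trips GROUP BY 1 ORDER BY 1")
print(cur.fetchall())
cur.close()
conn.close()
```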
Greenplum DB & HD: Massively Parallel Access and Movement
• Access Hadoop data in real time from Greenplum DB via gNet over 10Gb Ethernet
• Import and export in text, binary, and compressed formats
• Custom formats via a user-written MapReduce Java program and GPDB format classes
• Maximize solution flexibility and minimize data duplication
• [Figure: Greenplum DB master host and segments exchanging data with Hadoop nodes through external tables]
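One way this access is commonly expressed is an external table over HDFS, which Greenplum supports through its gphdfs protocol; a sketch follows. The namenode address, path, and column layout are hypothetical, and the DDL is issued through the same Python driver as above.

```python
# Sketch: exposing a file set in Hadoop to Greenplum as a readable external table.
import psycopg2

conn = psycopg2.connect(host="gp-master", dbname="analytics", user="gpadmin")
cur = conn.cursor()

# The external table is only metadata; at query time the Greenplum segments
# pull their share of the HDFS blocks in parallel, so no separate ETL copy
# of the data is created.
cur.execute("""
    CREATE EXTERNAL TABLE clickstream_ext (
        ts      timestamp,
        user_ip inet,
        url     text
    )
    LOCATION ('gphdfs://namenode:8020/data/clickstream/*.tsv')
    FORMAT 'TEXT' (DELIMITER E'\\t')
""")
conn.commit()

# Hadoop-resident data can now be queried (or joined with in-database tables)
# directly from SQL.
cur.execute("SELECT count(*) FROM clickstream_ext WHERE url LIKE '%/checkout%'")
print(cur.fetchone())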
Analytical Software: Exploiting Parallelism with In-Database Analytics
• Math and statistical functions run inside the database, next to the data
• Each independent segment processor has its own memory and a direct storage connection
• The master segment processor coordinates the segments over the interconnect switch and returns the analytic results
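As a sketch of what "in-database analytics" looks like in practice, the snippet below calls Apache MADlib, an open-source machine-learning library frequently deployed inside Greenplum (and mentioned later in this deck). The training table, its columns, and the connection details are hypothetical.

```python
# Sketch: training a model in-database with Apache MADlib; the data never
# leaves the segments, which compute the partial aggregates in parallel.
import psycopg2

conn = psycopg2.connect(host="gp-master", dbname="analytics", user="gpadmin")
cur = conn.cursor()

# Hypothetical table churn_features(tenure, monthly_spend, churned boolean).
cur.execute("""
    SELECT madlib.logregr_train(
        'churn_features',                    -- source table
        'churn_model',                       -- output (model) table
        'churned',                           -- dependent variable
        'ARRAY[1, tenure, monthly_spend]'    -- independent variables
    )
""")
conn.commit()

cur.execute("SELECT coef FROM churn_model")
print(cur.fetchone())
```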
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Big Data Requires Data Science
• Business Intelligence: standard reporting - "What happened?" (looks at the past; lower business value)
• Data Science: predictive analysis - "What if?" (looks at the future; higher business value)
Data Science and Business Intelligence
• "Traditional BI": GBs to 10s of TBs; operational data; structured; repetitive analysis
• "Big Data Analytics": 10s of TBs to PBs; external plus operational data; mostly semi-structured; experimental, ad hoc analysis
Profile of a Data Scientist
• [Figure: typical skills and traits of a data scientist]
Data Science as a Process: People and Ecosystem
Process stages: Domain → Data Prep → Variable Selection → Model Building → Model Execution → Communication & Operationalization → Evaluate
• People: scientists/analysts, business analysts, consumers of analysis, stakeholders, EMC sales and services
• Ecosystem: sector (telecom, banking, security agency, etc.), modeling software and other tools used by analysts (MADlib, SAS, R, etc.), database (Greenplum) and data sources

Data Science as a Process: Domain
Discovery and prioritized identification of opportunities:
• Customer retention
• Fraud detection
• Pricing
• Marketing effectiveness and optimization
• Product recommendation
• Others

Data Science as a Process: Data Prep
• What are the data sources? Do we have access to them?
• How big are they? How often are they updated? How far back do they go?
• Which of these data sources are being used for analysis? Can we use a data source which is currently unused? What problems would that help us solve?
Data Science as a Process: Variable Selection
• Selection of raw variables which are potentially relevant to the problem being solved
• Transformations to create a set of candidate variables
• Clustering and other types of categorization which could provide insights (see the sketch below)
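A minimal sketch of the clustering idea above: the cluster label itself becomes a candidate variable for later modeling. scikit-learn's KMeans is used here, and the feature names, scaling choice, and cluster count are illustrative assumptions.

```python
# Sketch: deriving a categorical "segment" variable from raw variables.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical raw variables per customer: [tenure_months, monthly_spend, support_calls]
X = np.array([
    [ 3,  20.0, 5],
    [48,  80.0, 0],
    [12,  35.0, 2],
    [60, 120.0, 1],
    [ 6,  25.0, 4],
    [36,  70.0, 1],
])

# Standardize so no single raw variable dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# The cluster assignment is the new candidate variable.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
print("candidate variable 'segment':", kmeans.labels_)
```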
Data Science as a Process: Model Building
• Pick suitable statistics, or a suitable model form and algorithm, and build the model
Data Science as a Process: Model Execution
• The model needs to be executable in-database, on big data, with reasonable execution time

Data Science as a Process: Communication & Operationalization
• The model results need to be communicated and operationalized to have a measurable impact on the business
Data Science as a Process: Evaluate
• Accuracy of results and forecasts
• Analysis of real-world experiments
• A/B testing on target samples
• End-user and LOB feedback
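A short sketch of the A/B-testing step: a standard two-proportion z-test comparing conversion in a control group against a group receiving model-driven offers. The conversion counts are made-up numbers; the p-value is computed directly from the normal distribution in scipy.

```python
# Sketch: evaluating a model via an A/B test on target samples.
from math import sqrt
from scipy.stats import norm

# Group A: current offer logic; Group B: model-driven offers (counts are made up).
conv_a, n_a = 480, 10_000
conv_b, n_b = 545, 10_000

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))   # two-sided test

print(f"lift: {p_b - p_a:+.4f}, z = {z:.2f}, p-value = {p_value:.4f}")
```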
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Use Case 1: Trip Modeling
• Problem: analyze the behaviour of visitors to MakeMyTrip.com
• Particularly interested in unregistered visitors - about 99% of total visitor traffic
Applications of the Model
• Tailor promotions for popular types of trips - the most popular types are probably already well known; the potential lies in the next tier down
• ... and for different types of customers
• Present customised promotions to visitors based on their clicks
• Ad optimization: present ads based on modelled behaviour
Hypertargeting
• Serving content to customers based on individual characteristics and preferences, rather than broad generalizations

Available Data
• Data available from the server: date/time, IP address, parts of the site visited
• Geographic location can be obtained via a geo lookup on the IP address
• Personal information is available for registered visitors only
Approach
• Use clustering to identify trip/visitor types: sport (IPL, F1, football, etc.), festivals, other seasonal movements
• Decision trees to predict which type of trip a visitor is likely to make, based on successively more information as they move through the site (see the sketch below)
• Use registered-visitor info to augment the models
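A minimal sketch of the decision-tree part of this approach: the trip-type labels (e.g., produced earlier by clustering registered-visitor data) become the target, and a tree predicts the likely trip type from whatever click features are known so far. All feature names and the tiny training set are hypothetical.

```python
# Sketch: predicting the likely trip type from early clickstream features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Per-visit features: [viewed_flights, viewed_hotels, searched_weekend, days_to_travel]
X = np.array([
    [1, 0, 1,  3],
    [1, 1, 0, 45],
    [1, 0, 0,  1],
    [1, 1, 1,  5],
    [0, 1, 0, 60],
    [1, 0, 0,  2],
])
# Hypothetical trip-type labels from an earlier clustering step.
y = np.array(["weekend", "holiday", "business", "weekend", "holiday", "business"])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Early in a new visit only a few clicks are known; predict anyway and refine
# as the visitor moves through the site.
new_visit = np.array([[1, 0, 1, 4]])
print(tree.predict(new_visit))
```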
Use Case 2: Municipal Traffic Analysis
• Client domain: municipal city government
• Available data:
  - Cross-city loop detectors measuring traffic volume
  - Detailed city bus movement information from Bluetooth devices
  - Video detection of traffic volume and velocity
• Goal: exploit the available data for unrealized business insights and value
Data Loading and Manipulation
• Parallel data loading
  - Data is loaded from the local file system and distributed across Greenplum servers in parallel
  - Loading 9 months of traffic volume data (16 GB, 464 million rows) took 69.4 seconds
• SQL data manipulation
  - Standard SQL lets city personnel use their existing skill sets
  - Greenplum SQL extensions offer control over data distribution
  - Open-source packages (e.g. in Python, R) can be conveniently deployed within Greenplum for visualization and analytics (a loading sketch follows)
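A sketch of the load-then-query flow from Python. Greenplum's highest-throughput path is its gpfdist/gpload external-table loading, but the simpler PostgreSQL COPY protocol shown here illustrates the same idea; the file name, table, and columns are hypothetical.

```python
# Sketch: bulk-loading loop-detector readings and querying them with SQL.
import psycopg2

conn = psycopg2.connect(host="gp-master", dbname="traffic", user="gpadmin")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS loop_volume (
        detector_id int,
        measured_at timestamp,
        vehicles    int
    ) DISTRIBUTED BY (detector_id)
""")
conn.commit()

# Stream a CSV file through the COPY protocol (hypothetical file name).
with open("loop_volume_2011.csv") as f:
    cur.copy_expert(
        "COPY loop_volume FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.commit()

# Standard SQL afterwards: hourly volume per detector.
cur.execute("""
    SELECT detector_id, date_trunc('hour', measured_at) AS hour, sum(vehicles)
    FROM loop_volume GROUP BY 1, 2 ORDER BY 1, 2
""")
print(cur.fetchmany(5))
```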
Basic Reporting on Traffic Volume
• Easy generation of reports via straightforward user-defined functions
• Standard graphing utilities are called from within Greenplum to create figures
• Detector downtimes can be clearly spotted in the figure, or via an SQL query, which mitigates maintenance challenges
Basic Reporting on City Buses
Data from the Bluetooth devices holds a wealth of information on city buses that we can report on:
• Travel route of each bus
• Deviations of arrival times compared to the published timetable
• Occurrences of driver errors (e.g. taking a wrong turn) and their possible causes
• Occurrences where the same bus service arrives at the same stop within seconds of another
• Whether new bus services translate into lower traffic volume on the roads where they were introduced
Result Visualizations (Google Earth)
• [Figure: results rendered as overlays in Google Earth]

Applications for Traffic Network Modelling
• Compute the fastest path between any two locations at a future time point
• Identify potential bottlenecks in the traffic
• Identify phase-transition points for massive traffic congestion using simulation techniques
• Study the likely impact of new roads and traffic policies, without having to observe real disruptive events to determine the impact

Parallel Traffic Network Modelling
• Greenplum's parallel architecture permits traffic network analysis at city scale
• Travel time can be predicted via model learning, involving hundreds of thousands of optimizations in parallel across the entire traffic network
• Variables that can be considered include: distance between two locations, concurrent traffic volume, time of day, weather, construction work
• Computationally prohibitive for traditional non-parallel database environments
Use Case 3: Product Recommendation Analysis
• Eight banks became one, with branches across the US
• Consolidation of products and customers: employees were faced with new products and customers, and visibility into churn and retention was challenged
• Analytics focus was historically reporting-centric: descriptive "hindsight"
Customer Segmentation
• First, define a measurement of customer value
• Then create clusters of customers based on customer value, and then product profiles

Association Rules
• Now find products that are common in the segment but not owned by the given household
Product Recommendations
• Next best offer: filter down to products associated with high-value customers in the same segment (a co-occurrence sketch follows)
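A minimal sketch of the next-best-offer idea: within one segment, count how often products co-occur in high-value households, then recommend the highest-confidence products a given household does not yet own. The product letters and holdings are made-up illustration data, and the confidence measure is a simplified stand-in for a full association-rules run.

```python
# Sketch: co-occurrence-based next-best-offer within a customer segment.
from collections import Counter
from itertools import combinations

# Product holdings of high-value households in one segment (hypothetical).
households = [
    {"A", "B", "X"},
    {"A", "B", "Y"},
    {"A", "X", "Y"},
    {"A", "B", "X", "Z"},
    {"B", "Y"},
]

owned = Counter()       # how many households own each product
together = Counter()    # how many households own each unordered pair

for basket in households:
    owned.update(basket)
    together.update(frozenset(p) for p in combinations(sorted(basket), 2))

def recommend(current, top_n=3):
    """Rank not-yet-owned products by confidence P(candidate | owned product)."""
    scores = Counter()
    for have in current:
        for candidate in owned:
            if candidate in current or not owned[have]:
                continue
            pair = frozenset((have, candidate))
            scores[candidate] = max(scores[candidate],
                                    together[pair] / owned[have])
    return scores.most_common(top_n)

# A household in the segment that currently owns only products A and B.
print(recommend({"A", "B"}))
```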
Product Recommender: Increased Customer Value
Customer comments:
• "The Greenplum solution has scaled from 6 to 11 TB of data."
• Moved from taking 7 hours to process a month of data to 7.5 hours to process 2.5 years of data
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Ferrari vs. Freight Train
• 0-100 KMPH: Ferrari 2.3 seconds; freight train 100 seconds
• Top speed: Ferrari 360 KMPH; freight train 140 KMPH
• Stops per hour: Ferrari 1,000; freight train 5
• Horsepower: Ferrari 660 bhp; freight train 16,000 bhp
• Throughput: Ferrari 220 kg in 27 minutes; freight train 55,000,000 kg in 60 minutes
Fast Data vs. Big Data
• Transactions per second: Fast Data 100,000+; Big Data n/a
• Concurrent hits: Fast Data 10,000+ per second; Big Data 10 per second
• Update patterns: Fast Data read/write; Big Data appends
• Data complexity: Fast Data simple joins on a few tables; Big Data can be highly complex
• Data volumes: Fast Data GBs to TBs; Big Data PBs to ZBs
• Access tools: Fast Data GemFire / SQLFire; Big Data GP DB, GP Hadoop
Not a Fast OLTP DB!
• [Figure: application(s) in front of a single database]
Fast Data Is
• More than just an OLTP DB
• Super-fast access to data
• Server-side flexibility
• Data is highly available (HA)
• Supports transactions
• Setup is fault tolerant
• Can handle thousands of concurrent hits
• Distributed, hence horizontally scalable
• Runs on cheap x86 hardware
CAP Theorem
• A distributed system can only achieve TWO out of the three qualities of Consistency, Availability, and Partition tolerance
Fast Data =
Fast Data combines select features from all of these products into a low-latency, linearly scalable, memory-based data fabric:
• Database: storage, persistence, transactions, queries, high availability, load balancing, data replication, L1 caching
• + Service bus: service loose coupling, data transformation, system integration
• + Messaging system: guaranteed delivery, event propagation, data distribution
• + Complex event processor: event-driven architectures, real-time analysis, business event detection
• + Grid controller: Map-Reduce / scatter-gather, distributed task assignment, task decomposition, result summarization
A Typical Fast Data Setup
• Load balancer in front of the web tier, application tier, database tier, and storage tier
• Add or remove web, application, and data servers; add or remove storage
• Disks may be direct- or network-attached
• Optional reliable, asynchronous feed to a Big Data store
Memory-based Performance (Perform)
• Fast Data uses memory on a peer machine to make data updates durable, allowing the updating thread to return 10x to 100x faster than updates that must be written through to disk, without risking any data loss
• Typical latencies are in the few hundreds of microseconds instead of the tens to hundreds of milliseconds
• Updates can optionally be written to disk, a data warehouse, or a big data store asynchronously and reliably (see the write-behind sketch below)
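The asynchronous persistence described above is often called write-behind. Below is a minimal sketch of the idea: the caller's update is acknowledged as soon as it is safely in memory, while a background worker drains a queue to slower durable storage. A real Fast Data fabric replicates the update to a peer's memory for durability; here a local file stands in for the durable store, and all names are illustrative.

```python
# Sketch: an in-memory store with write-behind persistence.
import queue
import threading

class WriteBehindStore:
    def __init__(self, backing_path):
        self._data = {}                   # in-memory copy (the fast read/write path)
        self._pending = queue.Queue()     # updates awaiting durable write
        self._backing_path = backing_path
        threading.Thread(target=self._drain, daemon=True).start()

    def put(self, key, value):
        # Returns immediately after the in-memory update and enqueue;
        # the durable write happens asynchronously in the background.
        self._data[key] = value
        self._pending.put((key, value))

    def get(self, key):
        return self._data.get(key)

    def _drain(self):
        # Background writer: appends each queued update to the durable store.
        with open(self._backing_path, "a") as f:
            while True:
                key, value = self._pending.get()
                f.write(f"{key}={value}\n")
                f.flush()

store = WriteBehindStore("updates.log")
store.put("position:EURUSD", 1_000_000)
print(store.get("position:EURUSD"))
```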
WAN Distribution (Distribute)
• Fast Data can keep clusters that are distributed around the world synchronized in real time
• Operates reliably in disconnected, intermittent, and low-bandwidth network environments

Distributed Events (Notify)
• Targeted, guaranteed-delivery event notification and continuous queries
Parallel Queries (Compute)
• Scatter-gather (map-reduce) queries issued from a batch controller or client and executed across the data nodes in parallel
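A small sketch of the scatter-gather pattern: the controller scatters the same predicate to every partition and gathers the partial results. The partitions are simulated here as in-process dictionaries; in a real fabric each would live on a different node.

```python
# Sketch: scatter-gather query over partitioned data.
from concurrent.futures import ThreadPoolExecutor

# Three hypothetical partitions of an "orders" region, keyed by order id.
partitions = [
    {1: {"customer": "acme", "total": 120.0},
     2: {"customer": "zen",  "total": 40.0}},
    {3: {"customer": "acme", "total": 75.0}},
    {4: {"customer": "ion",  "total": 300.0},
     5: {"customer": "acme", "total": 10.0}},
]

def local_query(partition, customer):
    # Runs near the data: each node scans only its own entries.
    return [order for order in partition.values() if order["customer"] == customer]

# Scatter phase: run the predicate on every partition concurrently.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partials = pool.map(local_query, partitions, ["acme"] * len(partitions))

# Gather phase: combine (and optionally summarize) the partial results.
results = [order for part in partials for order in part]
print(results, sum(o["total"] for o in results))
```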
Data-Aware Routing (Execute)
• Fast Data provides "data-aware function routing" - moving the behavior to the correct data instead of moving the data to the behavior
• A batch controller or client submits a function, and the fabric executes it on the node that owns the relevant data
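A minimal sketch of data-aware function routing: rather than pulling the entry back to the client, the function is dispatched to whichever node owns the key. Ownership is decided here by a simple hash; a real fabric would use its own partitioning metadata, and the key and function names are illustrative.

```python
# Sketch: routing a function to the node that owns the data.
NODES = 3
node_data = [dict() for _ in range(NODES)]   # stand-ins for remote members

def owner(key):
    return hash(key) % NODES                 # which node holds this key

def put(key, value):
    node_data[owner(key)][key] = value

def execute_on_key(key, fn):
    # "Ship" the function to the owning node and run it next to the data.
    return fn(node_data[owner(key)], key)

put("account:42", {"balance": 100.0})

def apply_interest(local_region, key):
    entry = local_region[key]
    entry["balance"] *= 1.01                 # mutate in place on the owner
    return entry["balance"]

print(execute_on_key("account:42", apply_interest))
```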
Accessing Fast Data
GemFire
• Stores objects (Java, C++, C#, .NET) or unstructured data
• Key-value store with OQL queries
• Spring-GemFire integration, L2 cache plugin for Hibernate, HTTP session replication module
SQLFire
• Stores relational data with a SQL interface
• Supports JDBC, ODBC, Java, and .NET interfaces
• Uses existing relational tools
• [Figure: example schema - Order, Order Line Item (quantity, discount), Product (SKU, unit price)]
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Use Cases: Applying the Technology
• A few examples of Fast Data technology applied to real business cases
Mainframe Migration
• A mainframe-based, nightly customer account reconciliation batch run took about 120 minutes (9% I/O wait, 15% CPU busy, 76% CPU unavailable on the mainframe)
• On a COTS cluster, the batch now runs in 60 seconds
• 93% of that time is network wait - the run could have been reduced further with higher network bandwidth
Mainframe Migration: So What?
So the batch runs faster - who cares?
1. It ran on cheaper, modern, scalable hardware
2. If something goes wrong with the batch, you only wait 60 seconds to find out
3. The hardware and the data are now available to do other things in the remaining 119 minutes:
   • Fraud detection
   • Regulatory compliance
   • Re-run risk calculations with 119 different scenarios
   • Up-sell customers
4. You can move from batch to real-time processing!
Online Betting
• A popular online gambling site attracts new players through ads on affiliate sites
• A customized banner ad is requested via the affiliate's web server and served by the banner ad server
• In a fraction of a second, the banner ad server must:
  1. Generate a tracking id specific to the request
  2. Apply temporal, sequential, regional, contractual, and other policies to decide which banner to deliver
  3. Customize the banner
  4. Record that the banner ad was delivered
Online Betting (Contd.)
Their initial RDBMS-based system:
• Limited their ability to sign up new affiliates
• Limited their ability to add new products to their site
• Limited the delivery performance experienced by their affiliates and their customers
• Limited their ability to add additional internal applications and policies to the process
Their new Fast Data-based system:
• Responded with sub-millisecond latency
• Met their target of 2,500 banner ad deliveries per second
• Provides for future scalability
• Improved performance to the browser by 4x
• Cost less
Asset/Position Monitoring
• Needed a real-time situational awareness system to track assets, usable by war fighters in theatre
• Centralized data storage was not possible
• Multi-agency, multi-force integration
• Numerous applications needed access to multiple data sources simultaneously
• Networks constantly changing and unreliable; mobile deployments
• Upwards of 60,000 object updates each minute; over 70 data feeds
• Northrop Grumman (the integrator) investigated the following technologies before deciding on GemFire:
  - RDBMS: Oracle, Sybase, Postgres, TimesTen, MySQL
  - ODBMS: Objectivity
  - jCache: GemFire, Oracle Coherence
  - JMS: SonicMQ, BEA WebLogic, IBM, JBoss
  - TIBCO Rendezvous
  - Web Services
Asset/Position Monitoring
• 655 sites, 11 thousand users
• Real-time, 3-dimensional NASA World Wind user interface
• 60,000 position updates per minute
• Real-time information available on the desk of:
  - The President of the United States
  - The US Secretary of Defense
  - Each of the Joint Chiefs of Staff
  - Every commander in the US military
Global Foreign Exchange Trading System
The project achieved:
• Low-latency trade insertion
• Permanent archival of every trade
• Kept pace with fast-ticking market data
• Rapid, event-based position calculation
• Distribution of position updates globally
• Consistent global views of positions
• Pass the book
• Regional close-of-day
• High availability
• Disaster recovery
• Regional autonomy

Global Foreign Exchange Trading System
In that same application, Fast Data replaced:
• A Sybase database in every region (one instance is still needed for archival purposes)
• TIBCO Rendezvous for local area messaging
• IBM MQ Series for WAN distribution
• Veritas N+1 clustering for HA (in fact, the physical +1 node itself is saved)
• 3DNS or Wide IP
• Admin personnel were reduced from 1.5 to 0.5
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Application High-Level Overview
• A single database can't handle both OLTP and OLAP workloads
• [Figure: application(s) in front of one database serving both workloads]
Big Data Setup: How to Get the Best of Fast & Big Data
• Applications send their concurrent hits to the Fast Data setup
• If a record isn't available in the Fast Data layer, it is fetched from the Big Data setup