Explore the technology that enables real-time analytics and streaming data processing, and how it differs from the world of Hadoop and batch analytics.

For more information, check out http://infochimps.com

Transcript

  • 1. Why Real-Time Analytics?

    Explore the technology that enables real-time analytics and streaming data processing, and how it differs from the world of Hadoop and batch analytics.

    The Chimp Way: Using the right tool for each job

    At Infochimps, we abide by the philosophy that you should use the right tool for each job. Why lock in to one set of technologies or techniques? Depending on what you are trying to accomplish - the questions you want to ask of your data, or the applications and visualizations you build on top of that data - different technologies are best suited for each unique task. You should have all the best tools at your fingertips for each task. Infochimps excels at systems and technology integration -- we can take your existing tools, add powerful new ones from our kit, and glue them together into a unified whole.

    We also strongly embrace open source technologies as part of a complete data solution. Not only do you benefit from the active participation of the open source community -- you aren’t limited to a proprietary vendor’s finite feature set and integration connectors. We use Hadoop, Elasticsearch, Flume, Ironfan, and Wukong, among other world-class open source tools that work flexibly with each other and the rest of the tools in your enterprise.

    © 2012 Infochimps, Inc. All rights reserved.
  • 2. The Hadoop & NoSQL conundrum

    Hadoop is a powerful framework for Big Data analytics. It simplifies the analysis of massive sets of data by distributing the computation load across many processes and machines. Hadoop embraces a map/reduce framework, which means analytics are performed as batch processes. Depending on the quantity of data and the complexity of the computation, running a set of Hadoop jobs could take anywhere from a few minutes to many days. Batch analytics tool sets like Hadoop are great for doing one-off reports, a recurring schedule of periodic runs, or setting up dedicated data exploration environments. However, waiting hours for the analysis you need means you aren’t able to get real-time answers from your data. Hadoop analysis ends up being a rear-view mirror instead of a pulse on the moment.

    NoSQL databases are extremely powerful, but come with certain challenges of their own

    At Infochimps we use Hadoop to run map/reduce jobs against scalable NoSQL data stores like HBase, Cassandra, or Elasticsearch. These databases are extremely good at enabling fast queries against many terabytes of data, but each makes certain tradeoffs to enable this ability. One major tradeoff, common across all three of these examples, is the inability to do SQL-like joins -- that is, to combine data from one database table with data from another table.

    The usual way we work around this tradeoff is to practice denormalization. Imagine we’re asking a question such as “Find all posts that contain the phrase ‘Coca-Cola’ from all authors based in Spokane, Washington”. In a traditional relational (SQL) database, a table of “posts” would join against a table of “authors” using a shared key like an author’s ID number. In NoSQL databases, denormalization consists of inserting a copy of the author into each row of their posts. Rather than joining the posts table with the authors table during the query a la SQL, all the authors’ data is already contained within the posts table before the query.

    The question then becomes: when should the denormalization of our NoSQL database occur? One option is to use Hadoop to “backfill” denormalized data from normalized tables before running these kinds of queries. This approach is perfectly workable, but it suffers from the same “rear-view mirror” problem as Hadoop-based batch analytics -- we still cannot perform complex queries of real-time data. What if we could write denormalized data on the fly: write each incoming Twitter post into a row in the posts table, and augment that row with information on the author in real-time? This would keep all data denormalized at all times, always ready for downstream applications to run complex queries and generate rich, real-time business insights. Real-time analytics and stream processing make this possible.
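The write-time denormalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `authors` dictionary and `posts_table` list are hypothetical in-memory stand-ins for NoSQL tables, and the field names, sample records, and helper functions are invented for the example.

```python
# Hypothetical in-memory stand-ins for an "authors" table and a stream of posts.
authors = {
    101: {"name": "Ada", "city": "Spokane", "state": "WA"},
    102: {"name": "Ben", "city": "Austin", "state": "TX"},
}

incoming_posts = [
    {"post_id": 1, "author_id": 101, "text": "Nothing beats a cold Coca-Cola"},
    {"post_id": 2, "author_id": 102, "text": "Coca-Cola with lunch today"},
    {"post_id": 3, "author_id": 101, "text": "Rainy day in town"},
]


def denormalize(post, authors):
    """Return a copy of the post with the author's record embedded in it."""
    enriched = dict(post)
    enriched["author"] = dict(authors[post["author_id"]])
    return enriched


# Write path: each post is augmented as it arrives, before it is stored.
posts_table = [denormalize(post, authors) for post in incoming_posts]


def find_posts(posts, phrase, city, state):
    """Query path: a single-table scan, with no join against authors."""
    return [
        p["post_id"]
        for p in posts
        if phrase in p["text"]
        and p["author"]["city"] == city
        and p["author"]["state"] == state
    ]


matches = find_posts(posts_table, "Coca-Cola", "Spokane", "WA")
print(matches)
```

The cost of this design is duplicated author data in every post row; the benefit is that the query touches only one table.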
  • 3. Real-time + Big Data = Stream Processing

    In situations where you need to make well-informed, real-time decisions, good data isn’t enough. It must be timely and actionable. As a mutual fund operator, you can’t wait hours to analyze whether or not it’s the right moment to sell 200,000 stock shares. As CMO, you can’t wait days to see if there is a PR crisis occurring around your brand. The time window for data analysis is shrinking, and you need a different set of tools to get these on-the-fly answers.

    Batch Versus Streaming

    Consider two hypothetical sandwich makers. Each company makes great sandwiches, but chooses to deliver them to their customers either in batches or in near real-time.
  • 4. The Batch Sub Shop can provide large quantities of sandwiches by leveraging many people to accomplish the overall project. Similarly, batch analytics can leverage multiple machines to accomplish a set of analytics jobs. By adding more resources, we can increase the speed with which the tasks are accomplished, but at a higher cost.

    Contrast that with the Streaming Sub Shop, which doesn’t deliver a huge set of sandwiches all at once, but does quickly create sandwiches on the fly. The process aims to get a sandwich in the customer’s hand as soon as possible. Real-time analytics works the same way by processing data the moment it is collected. If the data is coming in too quickly, we can flexibly increase the resources that support our real-time workflow. Is the toasting process the bottleneck of our production line? We easily add a couple of additional toasters.

    As you can imagine, the ideal sandwich company probably combines both the ability to cater large orders ahead of time and in-store made-to-order business. Likewise, your organization can leverage both batch analytics and real-time analytics depending on your business needs. Batch analytics is the most efficient way to process a large quantity of data in a non-time-sensitive manner. Real-time analytics and stream processing are the answer when the timeliness of your insights is important, you need to scalably process a very large influx of live data, or if NoSQL databases cannot answer the questions you are asking.
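To make the batch/streaming contrast concrete, here is a small Python sketch that computes the same average both ways. The function and class names and the sample prices are assumptions made up for illustration; they do not come from any Infochimps tool.

```python
def batch_average(prices):
    """Batch style: wait until every price is collected, then compute once."""
    return sum(prices) / len(prices)


class StreamingAverage:
    """Streaming style: update the answer the moment each price arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, price):
        self.count += 1
        # Incremental mean: shift the old mean toward the new value.
        self.mean += (price - self.mean) / self.count
        return self.mean


prices = [10.0, 12.0, 11.0, 15.0]

stream = StreamingAverage()
running = [stream.update(p) for p in prices]  # an answer after every event

print(running)
print(batch_average(prices))  # one answer, only after all data is in
```

Both approaches agree on the final value; the streaming version simply has a current answer available after every event instead of only at the end of the run.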
  • 5. How Does Real-Time Analytics Work?

    1. Collect real-time data. Real-time data is being generated all the time. If you are a mutual fund operator, it’s real-time stock price data. If you are a CMO, it’s real-time social media posts and Google search results. Typically this data is live streaming data. That means the moment the stock price changes, we can grab that data point - like a faucet of running water. We collect live data by “hooking a hose up” to the faucet stream to capture that information in real-time. A lot of different vocabulary exists to describe these “hoses”, including scrapers, collectors, agents, and listeners.

    2. Process the data as it flows in. The key to real-time analytics is that we cannot wait until later to do things to our data; we must analyze it instantly. Stream processing (also known as streaming data processing) is the term used for doing things to data instantly as it’s collected. Actions that you can perform in real-time include splitting data, merging it, doing calculations, connecting it with outside data sources, forking data to multiple destinations, and more.

    3. Reports and dashboards access processed data. Once data has been processed, it is reliably delivered to the databases that power your reports, dashboards, and ad-hoc queries. Just seconds after the data was collected, it is visible in your charts and tables. Since real-time analytics and stream processing are flexible frameworks, you can utilize whatever tools you prefer, whether that’s Tableau, Pentaho, GoodData, a custom application, or something else. Integration is Infochimps’ forte.
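The three steps above can be sketched as a chain of Python generators. This is a minimal sketch, not how any particular product implements it: the `collect`, `process`, and `deliver` functions, the sample ticks, and the plain lists standing in for databases are all hypothetical.

```python
def collect(source):
    """Step 1: the 'hose'. Yield each event as soon as the source emits it."""
    for event in source:
        yield event


def process(events, reference):
    """Step 2: transform events in flight (calculate, then add outside data)."""
    for event in events:
        out = dict(event)
        out["value"] = event["price"] * event["shares"]  # calculation
        out["company"] = reference.get(event["symbol"], "unknown")  # outside data
        yield out


def deliver(events, destinations):
    """Step 3: fork every processed event to each destination store."""
    for event in events:
        for store in destinations:
            store.append(event)


ticks = [
    {"symbol": "AAA", "price": 10.0, "shares": 3},
    {"symbol": "BBB", "price": 2.5, "shares": 4},
]
company_names = {"AAA": "Triple A Corp"}

# Plain lists stand in for the databases behind a dashboard and an archive.
dashboard_db, archive_db = [], []
deliver(process(collect(iter(ticks)), company_names), [dashboard_db, archive_db])

print(dashboard_db)
```

Because generators are lazy, each event passes through all three stages before the next one is read, which mirrors the "process it as it flows in" idea.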
  • 6. What Can You Do With Stream Processing?

    Augment
      • Enhance your sales leads - IP addresses of visitors to your website are augmented by the “company name” associated with that visitor if they are coming from an enterprise. Email addresses get linked to Twitter handles and Facebook handles to help your sales team leverage social selling.
      • Real-time social media analytics - tweets that mention the brands you are tracking are augmented with a sentiment score (how positive or negative the comment was) and an influencer score (such as Klout). Know instantly if positive news breaks or a PR crisis arises. Instantly gain insight into how influential people are and on what topics.

    Process and Transform
      • On-the-fly analytics reporting - Reformat a tweet on the fly to fit into an agency’s data model so that the data is visible in our reporting application immediately upon landing in the database.
      • SQL-like data queries - Implement a denormalization policy to allow for doing complex JOIN-like queries in real-time in downstream analytics applications.
      • Stock price algorithms - Implement your stock analyzer algorithm mid-stream. Instantly after an updated stock price is received, the data is processed through the algorithm, and placed in your reporting database.

    Calculate
      • Usage monitoring - Track the number of social media posts mentioning your client company’s brand. See at any given moment how much a brand is buzzing, and even set up tiered pricing based on how many social posts you are collecting on a client’s behalf.
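As one illustration of the "Calculate" use case above, the following Python sketch counts brand mentions as posts arrive and maps the running count to a pricing tier. The tier boundaries, tier names, brand names, and the `UsageMonitor` class are all assumptions invented for this example.

```python
from collections import Counter

# Hypothetical tier boundaries: (minimum post count, tier name), highest first.
TIERS = [(1000, "enterprise"), (100, "standard"), (0, "starter")]


class UsageMonitor:
    """Keeps a running count of posts that mention each tracked brand."""

    def __init__(self, brands):
        self.brands = [b.lower() for b in brands]
        self.counts = Counter()

    def observe(self, post_text):
        """Update counts for one incoming post (case-insensitive substring match)."""
        text = post_text.lower()
        for brand in self.brands:
            if brand in text:
                self.counts[brand] += 1

    def tier(self, brand):
        """Return the first tier whose minimum the brand's count has reached."""
        n = self.counts[brand.lower()]
        for minimum, name in TIERS:
            if n >= minimum:
                return name


monitor = UsageMonitor(["Acme", "Globex"])
for post in ["I love Acme!"] * 120 + ["nothing relevant here"] * 5:
    monitor.observe(post)

print(monitor.counts["acme"], monitor.tier("Acme"))
print(monitor.counts["globex"], monitor.tier("Globex"))
```

A real deployment would likely need smarter matching than a substring test, but the running counter is the part that makes the number available at any given moment.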
  • 7. Real-time analytics with the Infochimps Platform

    Apache Flume

    While initially built for log collection and routing, Flume has evolved to confidently serve the roles of general data transport and streaming data processing. Flume does more than reliably deliver data from a source to a destination. With the right optimizations, a single Flume system can ingest many terabytes of data per day, from thousands of data sources. As data flows in, you can do things to that data, such as add additional data, do calculations, run algorithms, split data, merge data, etc. In Flume lingo, these actions are powered by scripts called decorators, which perform the stream processing required for real-time analytics.

    Infochimps Data Delivery Service

    Infochimps uses Apache Flume for the Data Delivery Service (DDS), our reliable data transport and real-time analytics engine for the Infochimps Platform. Infochimps DDS adds important enhancements to the Flume open-source tool including:
      • Seamless integrations with your existing environment and data sources
      • Optimizations for highly scalable data collection and distributed ETL (extract, transform, load)
      • Tool set for rapid development of decorators which perform the stream processing
      • Flexible delivery framework to send data to any type and quantity of databases or file systems
      • Rapid solution development and deployment, along with our expert Big Data methodology and best practices

    Infochimps has extensive experience implementing the DDS, both for clients and for our internal data flows including massive Twitter scrapes, the Foursquare firehose, customer purchase data, product pricing data, and much more.

    Single-purpose ETL solutions are rapidly being replaced with multi-node, multi-purpose data integration platforms -- the universal glue that connects systems together and makes Big Data analytics feasible. Today, companies are taking advantage of Amazon Web Services for a few processes, on-premise or outsourced data centers for others, NoSQL databases, relational databases, cloud storage -- the list goes on. Data Delivery Service is compatible with all of those environments, making your data transport needs an implementation detail, not an analytics bottleneck.
  • 8. About Infochimps

    Our mission is to make the world’s data more accessible. Infochimps helps companies understand their data. We provide tools and services that connect their internal data, leverage the power of cloud computing and new technologies such as Hadoop, and provide a wealth of external datasets, which organizations can connect to their own data.

    Contact Us

    Infochimps, Inc.
    1214 W 6th St. Suite 202
    Austin, TX 78703
    1-855-DATA-FUN (1-855-328-2386)
    www.infochimps.com
    info@infochimps.com
    Twitter: @infochimps

    Get a free Big Data consultation

    Let’s talk Big Data in the enterprise! Get a free conference with the leading big data experts regarding your enterprise big data project. Meet with leading data scientists Flip Kromer and/or Dhruv Bansal to talk shop about your project objectives, design, infrastructure, tools, etc. Find out how other companies are solving similar problems. Learn best practices and get recommendations -- free.