Expert Big Data Tips

Whether you are interested in healthcare data analytics or looking to get started with big data and marketing, these fundamental principles from data experts will contribute to your success.


  1. TIPS from the Experts
  2. Table of Contents: Setup is Key; Think wide; Tool integration; Evaluate and Adapt; Sharing; Encryption; A data science mindset; Innovation; Real-time action
  4. Grant Unlimited Access: Create a data lake and give your business and data analysts access to all your data (structured and unstructured) with SQL engines like Hive. They will surprise you with the insight and value they can extract, and your development team will have less work answering ad-hoc queries. - Christian Prokopp, Principal Consultant at Big Data Partnership
  5. Select the Right Tools: A common question is when to use MapReduce/Pig/Hive versus HBase/Cassandra/Impala. Non-functional requirements (NFRs) have to be considered when deciding on a framework. MapReduce/Pig/Hive are used for high-throughput, high-latency requirements, as in batch processing and ETL. HBase/Cassandra/Impala are used for low-throughput, low-latency requirements, as in the case of a customer filling out an online application. - Praveen Sripati, Hadoop trainer and author of Hadoop Tips
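The rule in this tip can be written down as a small lookup. The following is only a sketch under stated assumptions: the `Workload` class, the `needs_low_latency` flag, and the `candidate_frameworks` function are hypothetical names, and a real decision would weigh more requirements than latency alone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical description of a workload's non-functional requirements."""
    name: str
    needs_low_latency: bool  # True when a user is waiting on each response


# Framework groupings as stated in the tip above.
BATCH_FRAMEWORKS = ["MapReduce", "Pig", "Hive"]
LOW_LATENCY_FRAMEWORKS = ["HBase", "Cassandra", "Impala"]


def candidate_frameworks(workload: Workload) -> List[str]:
    """Return the framework family the tip suggests for this workload."""
    if workload.needs_low_latency:
        return LOW_LATENCY_FRAMEWORKS
    return BATCH_FRAMEWORKS


etl = Workload("nightly ETL", needs_low_latency=False)
form = Workload("online application form", needs_low_latency=True)
print(etl.name, "->", candidate_frameworks(etl))
print(form.name, "->", candidate_frameworks(form))
```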
  6. Use Presto: Improve query performance by considering Presto with the RCFile or ORC file format. - Minesh Patel, Qubole
  7. Incorporate Machine Learning: Use robust machine learning algorithms to extract the data. Data collection and massive storage are only the enabling infrastructure. You should leverage existing and proprietary machine learning algorithms that will discover hidden patterns and learn from the data what is important for the analyst to view and examine, and what is not. - Idan Tendler, CEO of Fortscale
  8. Automation is Key: There is a big need for automation in Big Data. Security is an important industry that has proven the value of Big Data, but it has just as quickly proved that Big Data is valueless without automation wrapped around it to make it practical. Only once you make Big Data practical can you begin to perform analytics, which is where the value of Big Data in the security industry really gets unlocked. - Sean Brady, VP of Product Management at Vorstack
  9. Identify Easy Wins: Segment the data based on demographic and/or firmographic information. This is an easy and inexpensive way to highlight trends in the primary customers and industries served. This information is very helpful when determining what new products and/or services should be offered. In addition, look for trends in behavioral transaction information and further optimize the customer’s experience with relevant marketing and messaging. - David Handmaker, CEO of Next Day Flyers
  10. Think Broad: Identify all of the data you have access to and/or will produce, and explore possible audiences and use cases for it. Oftentimes, big data plays are geared toward a fairly narrow audience and set of use cases based on the original inspiration for the solution. Or, there is not an active and explicit exploration of the full potential of what you have to offer. I can all but assure you that there are major opportunities for your offering that you haven’t even considered yet. The earlier you have a crisp view of the potential of your big data and offering, the better able you will be to build the right thing, in the right way, to exploit the potential of that idea. - Dirk Knemeyer, founder of Involution Studios
  11. Setup is Key
  12. Careful and Smart Integration with BI Tools: Big Data tools (MapReduce, Hive, etc.) are known for their latency problems, but on the other hand they are excellent for processing petabytes of data in a distributed computing environment. When it comes to integration with any BI/reporting tools, big data technologies should be used in an appropriate manner so that you can avoid the negatives and leverage the strengths of these technologies. For example, if you are building an integrated pipeline with BI tools, try to aggregate as much as you can and use the caching or cube technologies of the BI tools to make it a faster experience for the end user. Real-time connectivity with big data sources like Hive/HDFS is not a great end-user experience in the BI space, so it should be avoided. - Ashish Dubey, Solutions Architect at Qubole
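As a rough illustration of the "aggregate, then cache" advice in this tip, the sketch below collapses raw rows into a small summary once and serves repeated dashboard reads from memory. The row layout, `aggregate_daily_revenue`, and `AggregateCache` are assumptions made for the example; a real BI tool would use its own cube or caching layer.

```python
from collections import defaultdict


def aggregate_daily_revenue(rows):
    """Collapse raw (day, region, amount) rows into one total per (day, region)."""
    totals = defaultdict(float)
    for day, region, amount in rows:
        totals[(day, region)] += amount
    return dict(totals)


class AggregateCache:
    """Hypothetical cache: run the expensive aggregation once, then reuse it."""

    def __init__(self, loader):
        self._loader = loader  # callable that runs the slow big data query
        self._cached = None
        self.loads = 0         # how many times the slow path actually ran

    def get(self):
        if self._cached is None:
            self._cached = self._loader()
            self.loads += 1
        return self._cached


# Made-up sample rows standing in for a large fact table.
raw_rows = [
    ("day-1", "EU", 10.0),
    ("day-1", "EU", 20.0),
    ("day-1", "US", 5.0),
]

cache = AggregateCache(lambda: aggregate_daily_revenue(raw_rows))
first_read = cache.get()   # slow path: aggregates the raw rows
second_read = cache.get()  # fast path: served from memory
print(first_read, "slow loads:", cache.loads)
```

The dashboard only ever touches the small summary, so the latency of the underlying batch engine is paid once per refresh rather than once per click.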
  13. Invest in Your Pipeline: As a rule of thumb, invest 80% of your time in your data lake and data pipeline (mining, extracting, cleaning, transforming, loading), and 20% in the high-level data science and machine learning effort. Data in the wild is complex, wrong, contradictory, and hard to access and find. Consequently, more, faster, and more accurate data usually has a higher impact than more complex models and makes for a robust system. - Christian Prokopp, Principal Consultant at Big Data Partnership
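The pipeline stages named in this tip (extracting, cleaning, transforming, loading) can be sketched as four small functions. This is a minimal illustration, assuming a tiny in-memory CSV and a Python list as the load target; the column names and the `run_pipeline` helper are hypothetical.

```python
import csv
import io

# Made-up raw input with the kinds of problems data in the wild tends to have.
RAW = "user_id,amount\n1,10.5\n2,not_a_number\n,3.0\n3, 7 \n"


def extract(text):
    """Parse CSV text into a list of dict rows (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop rows with a missing or non-numeric user_id or an unparseable amount."""
    good = []
    for row in rows:
        if not row["user_id"].strip().isdigit():
            continue
        try:
            float(row["amount"])
        except ValueError:
            continue
        good.append(row)
    return good


def transform(rows):
    """Convert string fields into typed values."""
    return [{"user_id": int(r["user_id"]), "amount": float(r["amount"])} for r in rows]


def load(rows, sink):
    """Append rows to the sink (a list standing in for a real data store)."""
    sink.extend(rows)


def run_pipeline(text):
    sink = []
    load(transform(clean(extract(text))), sink)
    return sink


print(run_pipeline(RAW))
```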
  14. Don’t Rush Into Analysis: Everyone with a Big Data project wants to rush straight into analysis. That is where things usually fall apart, however, because there is simply too much data flowing across the network and it is mostly in a format that current analytics software cannot handle. - Rick Aguirre, president of Cirries Technologies
  15. Start with Heavy Lifting: Big Data success requires three steps of heavy lifting before you ever analyze the data. Step 1 is data capture. Most of the Big Data torrent is a big nothing and not relevant. Decide what data you want to analyze and set up algorithms to locate and corral it. Step 2 is data control. You want to capture the data you need as it comes across the network. It may not be relevant in just a few minutes, or you may need to store it for a number of years if, as one example, it is data that might be needed later for law enforcement purposes. Step 3 is data humanization. This is where you convert whatever format the data is in to a format that your analytics software can use. Only now, at this step, do you have the right data in the right format that you can then use for whatever kind of analytics you have in mind. - Rick Aguirre, president of Cirries Technologies
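The three steps in this tip (capture, control, humanization) could look something like the sketch below. Everything here is an assumption made for illustration: the event dictionaries, the age-based retention rule used for "control", and JSON as the format the analytics software accepts.

```python
import json


def capture(events, wanted_types):
    """Step 1, capture: keep only the event types worth analyzing."""
    return [e for e in events if e["type"] in wanted_types]


def control(events, now, max_age_seconds):
    """Step 2, control: apply a retention rule (here, drop events that are too old)."""
    return [e for e in events if now - e["ts"] <= max_age_seconds]


def humanize(events):
    """Step 3, humanization: convert to a format the analytics tool can read (JSON lines)."""
    return [json.dumps(e, sort_keys=True) for e in events]


# Made-up network events; "ts" is a timestamp in seconds.
events = [
    {"type": "login", "ts": 100, "user": "a"},
    {"type": "heartbeat", "ts": 101, "user": "a"},  # not relevant: dropped by capture
    {"type": "login", "ts": 10, "user": "b"},       # too old: dropped by control
]

kept = control(capture(events, {"login"}), now=120, max_age_seconds=60)
for line in humanize(kept):
    print(line)
```

A longer retention requirement, such as the multi-year case mentioned in the tip, would simply use a much larger `max_age_seconds` for the affected event types.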
  16. Think Wide: Once data is collected, you have easy access for advanced analytics. Don’t stop at analyzing only one log source or one dimension of data; analyze across log sources and multiple entities. For example, in order to discover advanced cyber attacks that leveraged users’ credentials, we profile across the behavioral activity of users, including their permissions configuration, their access to files and systems, and their web activity. We analyze their historical activity and compare them against their peers. - Idan Tendler, CEO of Fortscale
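One simple way to picture "analyze across log sources and compare against peers" is to merge per-user counts from several logs and score each user against everyone else. This is only a toy sketch: the log contents, `combine_sources`, and `peer_zscore` are hypothetical, and it is not a description of how any vendor's product works.

```python
from collections import Counter
from statistics import mean, pstdev


def combine_sources(*sources):
    """Sum per-user event counts across several log sources."""
    total = Counter()
    for source in sources:
        total.update(source)
    return total


def peer_zscore(user, totals):
    """How many standard deviations the user's activity sits from the peer mean."""
    peers = [count for name, count in totals.items() if name != user]
    if len(peers) < 2:
        return 0.0
    sigma = pstdev(peers)
    if sigma == 0:
        return 0.0
    return (totals[user] - mean(peers)) / sigma


# Made-up per-user counts from two different log sources.
file_access_log = {"alice": 10, "bob": 12, "carol": 11, "mallory": 200}
web_activity_log = {"alice": 5, "bob": 4, "carol": 6, "mallory": 90}

totals = combine_sources(file_access_log, web_activity_log)
for user in totals:
    print(user, round(peer_zscore(user, totals), 2))
```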
  17. Use the ODBC Driver: Perform BI analytics and visualization with the ODBC driver. - Minesh Patel, Qubole
  18. Use a Subsample: I always start by looking at a subsample of the data. You often get a very good impression of what the main focus of the data munging or cleaning will be just by looking at some numbers (or characters). - Benedikt Koehler, Data Scientist and Blogger at Beautiful Data
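When the data is too large to load in full, one common way to draw a uniform subsample in a single pass is reservoir sampling. The sketch below is one possible approach, not necessarily the one the author uses; the `reservoir_sample` name and the fixed `seed` default are choices made for the example.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Return up to k records chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


# Peek at 5 records out of a stream of 1000 without holding them all in memory.
print(reservoir_sample(range(1000), 5))
```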
  19. Evaluate and Adapt
  20. Measure Everything: Measure and record everything, and keep an eye on your key metrics. Things change and tests become obsolete, sometimes in surprising ways, especially when you depend on external data. For example, data sources you mine may introduce rolling changes, which are hard to catch as an error but easy to identify in metrics. - Christian Prokopp, Principal Consultant at Big Data Partnership
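A rolling change in an external source often shows up as a slow shift in a simple metric rather than as a hard failure. As a minimal sketch, assuming records are dictionaries and that the share of missing values in one field is the metric being watched, a check could look like this (the `null_rate` and `drifted` helpers and the 5% tolerance are hypothetical):

```python
def null_rate(records, field):
    """Fraction of records where the field is missing or empty."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)


def drifted(baseline, current, tolerance=0.05):
    """True when the metric moved more than the tolerance away from its baseline."""
    return abs(current - baseline) > tolerance


# Made-up batches: the source quietly stopped sending "price" for some records.
last_week = [{"price": 1} for _ in range(10)]
today = [{"price": 1} for _ in range(6)] + [{"price": None} for _ in range(4)]

baseline_rate = null_rate(last_week, "price")
current_rate = null_rate(today, "price")
print(baseline_rate, current_rate, drifted(baseline_rate, current_rate))
```

No individual record in the second batch is an error on its own, which is why the change is easier to see in the recorded metric than in a test.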
  21. Sharing is Caring: Measure and record everything, and keep an eye on your key metrics. Things change and tests become obsolete, sometimes in surprising ways, especially when you depend on external data. For example, data sources you mine may introduce rolling changes, which are hard to catch as an error but easy to identify in metrics. - Idan Tendler, CEO of Fortscale
  22. Encryption: Encrypting data at rest is a best practice. - Minesh Patel, Qubole
  23. Pick the Right Distribution: A common question is whether to go for a distribution from Apache or a vendor. When there is enough expertise in the organization to know the internals of the different frameworks for integrating and resolving any issues quickly, then go with Apache Hive. If that expertise is not available, use a distribution through a vendor and get commercial support to resolve any issues that may arise. - Praveen Sripati, Hadoop trainer and author of Dattamsha
  24. Start Small: Developing a Big Data strategy is all about starting small and taking gradual steps toward becoming more data-driven. Start by breaking down the data silos within your organization to gain the most insights from your data when you start analyzing it through a variety of tools. - Mark van Rijmenam, CEO / Founder, BigData-Startups
  25. Have a Business Intent: There is often a perception that there is gold in an organization’s data, and that if you just look hard enough, you will find it. In reality, this perception can lead to fruitless efforts with no real direction and no payoff. Instead, start with a business intent in mind. What are the actions you would take, and the value to your business, if data can provide the answer to a certain question? - Sean Stauth, Director, Client Services, Silicon Valley Data Science
  26. Update Your Strategy: Your data strategy should be a living document that helps you get the most value from your data. As your goals, your technical environment, or the market change, keep it updated to help you follow those changes and stay on course. - Scott Kurth, VP, Advisory Services, Silicon Valley Data Science
  27. A Data Science Mindset
  28. Data Science Mindset: Have an always-on data science mindset. Successful big data initiatives start with a holistic 360-degree view of the problem space. This includes understanding the inputs (data types, sources, features), the desired outputs (decisions, goals, predictions), and the constraints (model parameters, boundary conditions, optimization constraints). To achieve this perspective, one must think like a scientist from start to finish: collect data, infer a testable hypothesis, design an experiment, test and evaluate the results, refine your hypothesis, and repeat (if necessary). - Kirk Borne, Data Scientist, Astrophysicist and Big Data Science Consultant
  29. Return on Innovation: The most important ROI in Big Data Analytics projects is Return On Innovation. What are you doing that’s different and consequential? What sets you apart from the rest of the multitudes in this space? - Kirk Borne, Data Scientist, Astrophysicist and Big Data Science Consultant
  30. Focus on the Users: Developing a big data platform requires focusing on the users. Serve a few users well, and let their processing scale up with your capabilities. “Premature platformization,” or trying to satisfy too many use cases too early in the project, leads to failures. Make the initial users successful, and the ecosystem will thrive and grow. - Owen O’Malley, Sr. Architect and Co-founder of Hortonworks
  31. Use the API: Work from the API samples for the Java SDK, Python SDK, and REST. - Minesh Patel, Qubole
  32. Take Real-Time Action: If you cannot take real-time action, you have no need of real-time processing. There will always be batch processing workloads supporting the enterprise, and increasingly dynamic decision areas can be effectively supported by analytical systems because of advances in data architectures. - Sanjay Mathur, CEO, Silicon Valley Data Science
  33. Store Denormalized State: State (the full context of an event, like a customer visit or the completion of a step in a manufacturing process) can be expensive to reassemble after the fact. This is particularly true with highly relational systems: witness the complex ETL (extract, transform, load) workloads that enterprise data warehouse systems struggle to scale. Storing denormalized state, e.g. rich logs, for analysis has proven highly successful for the web businesses of Silicon Valley, and those techniques can be applied to industries across the economy. - John Akred, CTO, Silicon Valley Data Science
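To make "rich logs" concrete, the sketch below writes one self-contained log line per event, copying in the customer and product context at write time so that later analysis needs no joins. The field names, `build_visit_event`, and `to_log_line` are assumptions made for the example.

```python
import json


def build_visit_event(visit, customer, product):
    """Embed the full context of the event instead of only foreign keys."""
    return {
        "event": "product_view",
        "ts": visit["ts"],
        "customer": {"id": customer["id"], "segment": customer["segment"]},
        "product": {"id": product["id"], "category": product["category"]},
    }


def to_log_line(event):
    """Serialize the event as one JSON line, ready to append to a log file."""
    return json.dumps(event, sort_keys=True)


# Made-up records that would normally live in separate relational tables.
visit = {"ts": 1700000000, "customer_id": 7, "product_id": 42}
customer = {"id": 7, "segment": "smb"}
product = {"id": 42, "category": "flyers"}

print(to_log_line(build_visit_event(visit, customer, product)))
```

The trade-off is extra storage and the chance of stale copies of the context, in exchange for analysis that reads each line on its own.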
  34. Build a Common Platform: Whether you are thinking about migrating toward Big Data or are just starting out with data altogether, it helps to focus on building and maintaining a common platform. Similar to software development platforms, data platforms should also include source control, change management, and testing scenarios. This will help reduce future migration costs and will lead to long-term sustainable, competitive data capabilities. - Ryan Kirk, Sr. Data Scientist at Hipcricket
  35. Looking for additional big data tips and advice? Subscribe to Qubole's email newsletter. Sources: