Bi(G) data: opportunities for BI Professionals


Presentation given to a group of freelance BI professionals in October 2013. A description of big data from different views.

Published in: Technology, Education

  1. 1. BI(G) DATA: Opportunities for BI professionals in the Netherlands. Most companies mentioned are Dutch.
  2. 2. Our fantasy... At Last: an IT job is sexy
  3. 3. Agenda ● Big Data views ○ Scientific Method ○ Data Characteristics ○ New Technology ○ Business Opportunities ○ Culture ● Opportunities for BI professionals
  4. 4. Google Trends. The famous McKinsey report: Big data: The next frontier for innovation, competition, and productivity. Big data became a trending term because of McKinsey. Now it is correlated with Hadoop.
  5. 5. Wikipedia: Big Data. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.[19] Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. The target moves due to constant improvement in traditional DBMS technology as well as new databases like NoSQL and their ability to handle larger amounts of data.[20] With this difficulty, new platforms of "big data" tools are being developed to handle various aspects of large quantities of data. Note: the focus is on volume… instead of the other V’s.
  6. 6. BIG Data The Scientific method is changing
  7. 7. The Fourth Paradigm: Data-Intensive Scientific Discovery. Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. Implicit in the idea of a fourth paradigm is the ability, and the need, to share data. In sciences like physics and astronomy, the instruments are so expensive that data must be shared. Data analysis is the new microscope. Examples: Human Genome, Large Hadron Collider.
  8. 8. Jim Gray ● A thousand years ago: science was empirical, describing natural phenomena ● Last few hundred years: a theoretical branch, using models and generalizations ● Last few decades: a computational branch, simulating complex phenomena ● Today: data exploration (eScience), unifying theory, experiment, and simulation ○ Data captured by instruments or generated by simulators ○ Processed by software ○ Information/knowledge stored in computers ○ Scientist analyzes databases/files using data management and statistics. On Sunday, January 28, 2007, during a short solo sailing trip to the Farallon Islands near San Francisco to scatter his mother's ashes, Gray and his 40-foot yacht, Tenacious, were reported missing by his wife, Donna Carnes. The Coast Guard searched for four days using a C-130 plane, helicopters, and patrol boats but found no sign of the vessel.[10][11][12][13] Gray's boat was equipped with an automatically deployable EPIRB (Emergency Position-Indicating Radio Beacon), which should have deployed and begun transmitting the instant his vessel sank. The area around the Farallon Islands where Gray was sailing is well north of the East-West ship channel used by freighters entering and leaving San Francisco Bay. The weather was clear that day and no ships reported striking his boat, nor were any distress radio transmissions reported. On February 1, 2007, the DigitalGlobe satellite did a scan of the area, generating thousands of images.[14] The images were posted to Amazon Mechanical Turk in order to distribute the work of searching through them, in hopes of spotting his boat. In the immediate aftermath of the disappearance, many theories were put forward on how Gray disappeared.[15] On February 16, 2007, the family and Friends of Jim Gray Group suspended their search,[16]
  9. 9. but continue to follow any important leads. The family ended its underwater search May 31, 2007. Despite much effort and use of high-tech equipment above and below water, searches did not reveal any new clues.[17][18][19][20][21][22] Personal life: While at Berkeley, Gray and his first wife Loretta had a daughter; the couple later divorced.[2] He is survived by his wife, Donna Carnes, his daughter, three grandchildren, and his sister Gail. The University of California, Berkeley and Gray's family hosted a tribute to him on May 31, 2008. The conference included sessions delivered by Richard Rashid and David Vaskevitch.[23] Microsoft's WorldWide Telescope software is dedicated to Gray. In 2008, Microsoft opened a research center in Madison, Wisconsin, named after Jim Gray.[24] Having been missing for five years as of May 16, 2012, Gray is legally assumed to have died at sea.[4][25] Jim Gray Award: Each year, Microsoft Research presents the Jim Gray eScience Award[26] to a researcher who has made an outstanding contribution to the field of data-intensive computing. Award recipients are selected for their ground-breaking, fundamental contributions to the field of eScience. Previous award winners include Alex Szalay (2007), Carole Goble (2008), Jeff Dozier (2009), Phil Bourne (2010), Mark Abbott (2011) and Antony John Williams (2012). Books: ● Transaction Processing: Concepts and Techniques (with Andreas Reuter) (1993). ISBN 1-55860-190-2. ● The Benchmark Handbook: For Database and Transaction Processing Systems (1991). Morgan Kaufmann. ISBN 978-1-55860-159-8.
  10. 10. esciencecenter Projecten
  11. 11. Chris Anderson This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. The end of theory: Edge Wired
  12. 12. Cukier and Mayer-Schönberger. Shift 1: End of samples. Shift 2: End of exactitude. Shift 3: End of causality (patterns & correlations). If you know that your customers are going to buy more products by analyzing a data set or correlation, then the “why” doesn’t matter: you should try to exploit that. The technical equivalent in big data is the ability to survey a whole population instead of just sampling random portions of it. “With less error from sampling we can accept more measurement error.” According to the authors, science is obsessed with sampling and measurement error as a consequence of coping in a ‘small data’ world. The third and most radical shift implies “we won’t have to be fixated on causality [...] the idea of understanding the reasons behind all that happens.” This is a straw
  13. 13. Nate Silver “We're not that much smarter than we used to be, even though we have much more information - and that means the real skill now is learning how to pick out the useful information from all this noise.” “I came to realize that prediction in the era of Big Data was not going very well.” “If the quantity of information is increasing [exponentially]… Most of it is just noise.” “… numbers have no way of speaking for themselves. We speak for them.” Nate Silver has lived a preposterously interesting life. In 2002, while toiling away as a lowly consultant for the accounting firm KPMG, he hatched a revolutionary method for predicting the performance of baseball players, which the Web site Baseball Prospectus subsequently acquired. The following year, he took up poker in his spare time and quit his job after winning $15,000 in six months. (His annual poker winnings soon ran into the six-figures.)
  14. 14. Nassim Taleb: Big Data is bullshit. This is the tragedy of big data: The more variables, the more correlations that can show significance. Falsity also grows faster than information; it is nonlinear (convex) with respect to data. I am not saying here that there is no information in big data. There is plenty of information. The problem, the central issue, is that the needle comes in an increasingly larger haystack. Black Swan: 1. It is an outlier, as it lies outside the realm of regular expectations, because nothing in the past can convincingly point to its possibility. 2. It carries an extreme 'impact'. 3. In spite of its outlier status, human nature makes us concoct explanations for its occurrence after the fact, making it explainable and predictable. A small number of Black Swans explains almost everything in our world, from the success of ideas and religions, to the dynamics of historical events, to elements of our own personal lives.
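
The claim that more variables produce more correlations that look significant can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the columns are pure, independent Gaussian noise, and the sample size (50), the cut-off |r| > 0.3 and the function names are arbitrary choices made here, not taken from Taleb or from the slides.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_spurious(n_variables, n_observations=50, threshold=0.3, seed=42):
    """Count pairs of independent noise columns whose |r| exceeds the threshold."""
    rng = random.Random(seed)
    # Every column is unrelated noise, so any "strong" correlation is spurious.
    columns = [
        [rng.gauss(0, 1) for _ in range(n_observations)]
        for _ in range(n_variables)
    ]
    return sum(
        1
        for a, b in combinations(columns, 2)
        if abs(pearson(a, b)) > threshold
    )


if __name__ == "__main__":
    for k in (5, 20, 80):
        pairs = k * (k - 1) // 2
        print(k, "variables,", pairs, "pairs,", count_spurious(k), "above threshold")
```

The number of pairs grows roughly with the square of the number of variables, so the count of chance "findings" grows much faster than the number of columns added, even though there is nothing to find in this data.
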
  15. 15. Ludic Fallacy. The discovery of the Higgs particle was a disappointment for some physicists because now they know what they don’t know: no big things left to discover. The ludic fallacy is a term coined by Nassim Nicholas Taleb in his 2007 book The Black Swan. "Ludic" is from the Latin ludus, meaning "play, game, sport, pastime."[1] It is summarized as "the misuse of games to model real-life situations."[2] Taleb explains the fallacy as "basing studies of chance on the narrow world of games and dice."[3] It is a central argument in the book and a rebuttal of the predictive mathematical models used to predict the future, as well as an attack on the idea of applying naïve and simplified statistical models in complex domains. According to Taleb, statistics works only in some domains like casinos in which the odds are visible and defined. Taleb's argument centers on the idea that predictive models are based on platonified forms, gravitating towards mathematical purity and failing to take some key ideas into account: ● It is impossible to be in possession of all the information. ● Very small unknown variations in the data could have a huge impact. Taleb does differentiate his idea from that of mathematical notions in chaos theory, e.g. the butterfly effect. ● Theories/models based on empirical data are flawed, as they cannot predict events that have never happened before, but have tremendous impact, e.g. the 9/11 terrorist attacks, the invention of the automobile, etc.
  16. 16. Discover what you (don’t) know you don’t know?
  17. 17. BIG Data Data Characteristics are changing
  18. 18. BI community: colleagues... ● Data integration is already 20+ years old ● Just another source ● We do not have much data ● Small or big data: it has to be managed ● Big data = business analytics ● One-off projects (data is too varied) ● We know what data is all about. Nobody has to tell us what you can do with data.
  19. 19. Gartner’s definition (2001): Big Data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. ● Volume: relative size of data sources ● Velocity: speed at which data refresh is handled ● Variety: handling various data formats ● (Validity, Veracity (accuracy, correctness, applicability), Value, and Visibility)
  20. 20. Variety source: Hortonworks
  21. 21. Velocity: keeping history for clickpaths isn’t interesting if the site keeps changing over the years.
  22. 22. Volume
  23. 23. “Information was a pond and has become a river” Peter Hinssen. A fantastic, engaging speaker at the SAS Forum. Good presentation: filtering is becoming (or already is) very important.
  24. 24. Liquid Data: to keep data actionable, you have to react instantly. Fishing in a lake versus fishing in a river: so much water rushing past.
  25. 25. Barry Devlin: the true godfather of data warehousing. ● Human-sourced information ○ is now largely digitized and electronically stored everywhere, from tweets to movies ● Process-mediated data ○ includes transactions, reference tables and relationships, as well as the metadata that sets its context, all in a highly structured form ● Machine-generated data ○ from simple sensor records to complex computer logs
  26. 26. Impact on the DWH ● The central core business data pillar is the consistent, quality-assured data found in EDW and MDM systems ● Deep analytic information requires highly flexible, large-scale processing such as statistical analysis and text mining ● Fast analytic data requires such high-speed analytic processing that it must be done on data in flight ● Specialty analytic data uses specialized processing such as NoSQL, XML, graph and other databases and data stores. Note: Inmon now focuses on deep analytic information with his text mining.
  27. 27. BIG Data New Tools
  28. 28. Other BIG data related trends ● Elastic cloud ● NoSQL ● Data visualization
  29. 29. NoSQL: A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases. NoSQL systems are also referred to as "Not only SQL" to emphasize that they do in fact allow SQL-like query languages to be used. ● Document: MongoDB, Couchbase ● Key-value: Dynamo, Riak, Redis, Cache, Project Voldemort ● Graph: Neo4j, Allegro, Virtuoso
  30. 30. NoSQL: MongoDB ● How and Why Leading Investment Organizations are Migrating to MongoDB ● Real World MongoDB: Use Cases from Financial Services ● How Financial Firms Create Single Customer Views Using MongoDB ● How Banks Use MongoDB to Manage Risk ● How Banks Manage Reference Data with MongoDB ● How Banks Use MongoDB as a Tick Database ● Position and Trade Management with MongoDB
  31. 31. NoSQL: Neo4j, a graph database ● Nodes represent entities ● Properties are pertinent information that relates to nodes ● Edges are the lines that connect nodes to nodes or nodes to properties, and they represent the relationship between the two
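
The node/property/edge model described on the slide above can be sketched in a few lines of plain Python. This is not Neo4j's API: the `Node`, `Edge` and `Graph` classes, the relationship names and the example people are all hypothetical, invented here for illustration. As a simplification, properties are stored on the nodes themselves rather than attached through edges.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity, with its properties stored as key/value pairs."""
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, typed relationship between two nodes."""
    source: Node
    relation: str
    target: Node


class Graph:
    """A tiny in-memory graph: just a list of edges."""

    def __init__(self):
        self.edges = []

    def connect(self, source, relation, target):
        self.edges.append(Edge(source, relation, target))

    def neighbours(self, node, relation):
        """Nodes reachable from `node` over one edge of the given type."""
        return [
            e.target for e in self.edges
            if e.source is node and e.relation == relation
        ]


# Hypothetical example data, invented for illustration.
alice = Node("Alice", {"role": "BI consultant"})
bob = Node("Bob", {"role": "data scientist"})
acme = Node("Acme", {"sector": "retail"})

graph = Graph()
graph.connect(alice, "KNOWS", bob)
graph.connect(alice, "WORKS_FOR", acme)
graph.connect(bob, "WORKS_FOR", acme)

colleagues_of_alice = [n.name for n in graph.neighbours(alice, "KNOWS")]
```

The point of the model is that relationships are first-class data: a question such as "who does Alice know?" is a traversal over edges rather than a join between tables.
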
  32. 32. Dataviz: SynerScope. Ooh/aah strategy: first be amazed, then understand
  33. 33. Local intelligence: ORTEC/TSS. ORTEC Team Support Systems (ORTEC TSS) develops decision support and information ICT systems to analyze sports performance. These software systems are employed before, during and after matches. During a match, they are used to measure teams’ and players’ performances. These systems have brought the following of top athletes and talents by their clubs, teams, sponsors, unions and the public to a whole new dimension.
  34. 34. Internet of Things
  35. 35. Elastic cloud: Amazon Redshift, $999 per TB per year
  36. 36. Hadoop… ● The ecosystem isn’t stable. A lot of configurations are possible ● Hadoop is complex. Java expertise is needed ● Apache Hadoop: open source Hadoop framework in Java. Consists of the Hadoop Common package (filesystem and OS abstractions), a MapReduce engine (MapReduce or YARN), and the Hadoop Distributed File System (HDFS) ● Apache Mahout: machine learning algorithms for collaborative filtering, clustering, and classification using Hadoop ● Apache Hive: data warehouse infrastructure for Hadoop. Provides data summarization, query, and analysis using a SQL-like language called HiveQL. By default it stores its metadata in an embedded Apache Derby database ● Apache Pig: platform for creating MapReduce programs using a high-level “Pig Latin” language. Makes MapReduce programming similar to SQL. Can be extended by user-defined functions written in Java, Python, etc. ● Apache Avro: data serialization system. Avro IDL is the interface description language syntax for Avro
  37. 37. ● Apache HBase: non-relational DBMS, part of the Hadoop project. Designed for large quantities of sparse data (like BigTable). Provides a Java API for MapReduce jobs to access the data. Used by Facebook ● Apache ZooKeeper: distributed configuration service, synchronization service, and naming registry for large distributed systems like Hadoop ● Apache Cassandra: distributed database management system. Highly scalable ● Apache Ambari: a web-based tool for provisioning, managing and monitoring Apache Hadoop clusters ● Apache Chukwa: a data collection system for managing large distributed systems ● Apache Sqoop: tool for transferring bulk data between structured databases and Hadoop ● Apache Oozie: a workflow scheduler system to manage Apache Hadoop jobs
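
Several of the tools listed above (Hive, Pig, HBase) ultimately generate or feed MapReduce jobs. The sketch below shows the shape of such a job, the classic word count, in plain single-process Python. It uses no Hadoop API at all; the function names are chosen here for illustration, and the `shuffle` step stands in for the grouping that a real framework performs across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data is big", "data is liquid"])
```

Because each call to `map_phase` looks at one record only and each call to `reduce_phase` looks at one key only, both phases can be spread over many machines, which is what makes the model scale.
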
  38. 38. Hadoop jobs
  39. 39. From a single solution to an Ecosystem
  40. 40. BIG Data Business Opportunities
  41. 41. McKinsey’s big data report
  42. 42. "For big data, 2013 is the year of experimentation and early deployment," said Frank Buytendijk, research vice president at the research firm. "Adoption is still at the early stages with less than 8 percent of all respondents indicating their organization has deployed big data solutions. [Across the board], 20 percent are piloting and experimenting, 18 percent are developing a strategy, 19 percent are knowledge gathering, while the remainder has no plans or don't know."
  43. 43. Has "Big Data" significantly changed Data Science principles and practice? KDnuggets poll (Oct 29, 2013)
  44. 44. Analytics is BIG: analytics is hotter. The green line is Google Analytics; the blue line should be corrected for that.
  45. 45. Kaggle ● Platform for predictive analytics competitions ● The business hands over part of the data and keeps part of the data sets ● Contenders build models based on the available data ● Contenders predict the values of the kept data sets ● The best prediction wins the competition
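
The competition mechanics listed above amount to a holdout evaluation, which can be sketched as follows. This is a minimal illustration, not Kaggle's actual scoring code: the data is made up, the two "contenders" are deliberately simple models, and mean absolute error is just one possible scoring metric chosen here.

```python
import random


def split_holdout(rows, holdout_fraction=0.3, seed=0):
    """Shuffle rows and split them into a public part and a withheld part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]


def mean_absolute_error(model, rows):
    """Average absolute difference between predictions and true values."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


def fit_mean(train):
    """Baseline contender: always predict the mean target of the public data."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Contender: least-squares straight line through the public data."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in train)
        / sum((x - mean_x) ** 2 for x, _ in train)
    )
    return lambda x: mean_y + slope * (x - mean_x)


# Made-up data for illustration: y is roughly 2x plus noise.
noise = random.Random(1)
data = [(x, 2 * x + noise.gauss(0, 1)) for x in range(100)]

# The business publishes one part and keeps the other part back.
public, withheld = split_holdout(data)

# Contenders only ever see the public part.
contenders = {"mean": fit_mean(public), "line": fit_line(public)}

# The organizer scores every contender on the withheld part; lowest error wins.
leaderboard = sorted(
    (mean_absolute_error(model, withheld), name)
    for name, model in contenders.items()
)
winner = leaderboard[0][1]
```

Scoring on data the contenders never saw is what keeps the ranking honest: a model that merely memorizes the public rows gains nothing on the withheld ones.
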
  46. 46. Algoritmica
  47. 47. Science Rockstars
  48. 48. eWaterCycle: A global hydrological model will provide the international community with the best possible estimates of the state of water resources in the world. Assimilation of remotely sensed and in situ data will be a major mathematical and computational challenge. A successful implementation of the project will lead to a community model for hydrologists across the globe.
  49. 49. BIG Data Cultural shift in using data
  50. 50. “Perhaps the most important cultural trend today: The explosion of data about every aspect of our world and the rise of applied math gurus who know how to use it.” Chris Anderson
  51. 51. Sharing: Silk Since Silk first came out of stealth mode in 2011, there have been 300,000 interactive pages created on its cloud-based, web data-crunching platform designed for nontechnical “knowledge workers.” Taking less easy-to-read data sets and making them more digestible, results have ranged from the Guardian newspaper in the UK creating graphics of which countries have the most asylum seekers, through to charting what products Google has killed and dads mapping out the best playgrounds for his kid in Amsterdam (where Silk also happens to be founded). It’s been a popular, and free, tool, with pages created by some 16,000 people growing by 20 percent each month. Now, Silk is moving on to its next phase: its first paid product, Silk for Teams, aimed at groups of enterprise users who want to use the platform to produce cleaner internal data sets, and eventually to create data visualizations that work with paywalls.
  52. 52. Open Data: any ideas?
  53. 53. “Our research suggests that seven sectors alone could generate more than $3 trillion a year in additional value as a result of open data…” McKinsey
  54. 54. Open Data Open data: Unlocking innovation and performance with liquid information A new McKinsey report says that open data can help create $3 trillion a year of economic value across seven sectors. In a related podcast, the McKinsey Global Institute’s Michael Chui discusses the economic
  55. 55.
  56. 56. Cap Gemini
  57. 57. Data Journalism: New York Times, Guardian, Sargasso
  58. 58. Quantified Self
  59. 59. Quantified Self
  60. 60. Quantified Self
  61. 61. Quantified Self: combining all the sources of this and the previous 3 slides and finding correlations is the essence of (big) data analytics. Example: combining sunpower with sleepcycle and fitness and diet.
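
Combining personal data sources and looking for correlations, as the slide above describes, comes down to two steps: align the sources on a shared key (here the date) and compute a correlation for every pair. The sketch below assumes three hypothetical sources with made-up values; the variable names (`solar_kwh`, `sleep_hours`, `steps`) and all numbers are invented for illustration and do not come from the presentation.

```python
from itertools import combinations

# Made-up daily measurements from three hypothetical sources, keyed by date.
solar_kwh = {"2013-10-01": 4.1, "2013-10-02": 1.2, "2013-10-03": 3.8, "2013-10-04": 0.9}
sleep_hours = {"2013-10-01": 7.5, "2013-10-02": 6.0, "2013-10-03": 7.8, "2013-10-04": 6.2}
steps = {"2013-10-01": 9500, "2013-10-03": 11000, "2013-10-04": 4000}


def join_on_date(sources):
    """Keep only the dates present in every source and align the values."""
    common = sorted(set.intersection(*(set(s) for s in sources.values())))
    aligned = {name: [s[d] for d in common] for name, s in sources.items()}
    return common, aligned


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


dates, joined = join_on_date(
    {"solar_kwh": solar_kwh, "sleep_hours": sleep_hours, "steps": steps}
)

# One correlation per pair of sources, computed on the shared dates only.
correlations = {
    (a, b): pearson(joined[a], joined[b]) for a, b in combinations(joined, 2)
}
```

Only three dates survive the join in this toy example, far too few to conclude anything; in practice the join step, with its missing days and mismatched timestamps, tends to be most of the work.
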
  62. 62. BIG Data Opportunities for BI professionals
  63. 63. “The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that’s going to be a hugely important skill in the next decades.” Hal Varian, Google guru
  64. 64. “The illiterate of the 21st century will not be those who cannot read and write, but those who cannot learn, unlearn, relearn” Alvin Toffler
  65. 65. McKinsey report highlights: A significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data… Furthermore, this type of talent is difficult to produce, taking years of training in the case of someone with intrinsic mathematical abilities. (p.10)
  66. 66. Data Scientist: applying varying degrees of statistics, data visualization, computer programming, data mining, machine learning, and database engineering to solve complex data problems. ● Association rule learning ● Classification ● Cluster Analysis ● Crowd Sourcing ● Data Fusion and Integration ● Ensemble Learning ● Genetic Algorithms ● Machine Learning ● Natural Language Processing ● Neural Networks ● Pattern Recognition ● Predictive Modelling ● Regression ● Sentiment Analysis ● Signal Processing ● Simulation ● Supervised and Unsupervised Learning ● Time Series Analysis ● Visualization
  67. 67. Typical Big Data Job is not a BI Job. JOB OPENING: BIG DATA ARCHITECT. We are looking to expand our core product team with a Senior Java Developer/Architect that will contribute in the product design and development and take pride in the delivery of kick-a** products. Knowledge, Skills and Experience ● Minimum 4 years Java experience ● Experience with NoSQL Databases, preferably MongoDB (MapReduce, Sharding) ● Experience with Cloud-based infrastructure, esp. AWS ● Expertise with Hadoop eco-system is a plus (examples: Flume, Zookeeper, Ganglia, etc) ● Experience with Web services (REST/SOAP) ● Obsession with performance and big data ● Passion for elegant technical design and good programming practices (TDD, CI) ● Energetic “self-starter”, have the will to take ownership, and be accountable for deliverables ● A true defender of quality and (light-weight) documentation of the designs ● Relevant HBO/University education or experience ● Sense of humor is essential. Note: not typical BI, hardcore tech.
  68. 68. Personal Strategies ● Do nothing ○ Just sell your personal data ○ Wait until the big DM companies incorporate the Hadoop ecosystem ● Hadoop expert ○ Learn Java and the Hadoop ecosystem ● Data scientist ○ Learn Python/R ○ Learn statistics and all kinds of algorithms (especially Bayes) ● Data architect/manager ○ Learn the principles of Hadoop/NoSQL ○ Learn how to integrate (big) data in the enterprise DWH ○ Data governance / data stewardship / DQ / metadata ● BI(g) tool specialist ○ Adopt a big data dataviz or reporting tool (Splunk, Platfora) ○ Adopt a platform (Cloudera, Hortonworks, MapR, Azure, Google, Amazon) ● Data artist ○ Data visualization tools, design infographics ● Data storyteller ○ Data journalism course
  69. 69. Group Activities BI United ● Expert groups ○ Explore platforms ○ Explore tools ● Open data for personal and group branding ○ Start a project ○ Join open data sites ● Data journalism ○ Start a blog / join a blog ○ Make news with data ● Business cases ○ Scanning business cases ○ Almere Datacapital
  70. 70. Living in a big data augmented world