Data Science definition

  1. 1. Prolegomena to Any Future Statistics That Will Be Able to Present Itself as a (Data) Science. Carlo Lauro, Emeritus Professor of Statistics, University of Naples Federico II. Is there a Data Science? If yes, then what is Data Science? And what does Data Science mean in the “data revolution era”? What about new professions? What are the challenges for Statistics? (LET'S TALK ABOUT DATA SCIENCE) Scientific Meeting in Memory of Simona Balbi, Naples, February 19, 2019. Director of the Department of Economic & Managerial Sciences in Digital Era, SELF HOCHSCHULE, ZUG - CH
  2. 2. “Let’s talk about Data Science”: Data Science and Data Scientists
     • “Data Scientist: The Sexiest Job of the 21st Century” (T. Davenport & D.J. Patil)
     • “Data Scientist: Person who is better at statistics than any software engineer and better at software engineering than any statistician.” (Josh Wills, Cloudera)
     Is Data Science still a buzzword without a clear definition? Is Data Science just a rebranding of Statistics?
  3. 3. According to Sir Maurice Kendall, one of the issues on which statisticians do not agree is the definition of their own science. As a consequence, dictionaries and encyclopedias do not share a common idea of what Statistics is. Similar problems arise when analysing the scientific literature on the subject, as well as the various forums and blogs on social networks, where a common definition of Data Science is “the science of extracting knowledge from data”, the same one used for Statistics. As for Statistics, we also observed another common view: “Data Science is what data scientists do”. So far it is unclear whether Data Science is a science or a profession. The Data Science Association presents it as a profession. Probably Data Science is both. In fact, it has the peculiarity of a “methodological science” (Tosio Kitagawa): it has no object of its own, and its aim is to develop a unified methodology applicable to other categories of sciences. With the aim of proposing a definition that satisfies the different people who coexist in the colorful world of Data Science, we analysed about 150 definitions of Data Science and of data scientist by means of a lexical correspondence analysis and a social network analysis (SNA). What is even more relevant for us is to try to understand the possible threats and challenges for Statistics and statisticians arising from the current data revolution, characterized by large amounts of data (big data) of various types (numeric, ordinal, nominal, symbolic, texts, images, data streams, multi-way, networks, etc.) coming from disparate sources (surveys, administrative data, social media, sensors, transactions, open data).
  4. 4. A short history of Data Science (Forbes Magazine, May ’13)
     • 1962: John W. Tukey writes “The Future of Data Analysis”.
     • 1974: Peter Naur publishes “Concise Survey of Computer Methods” in Sweden.
     • 1977: The International Association for Statistical Computing (IASC) is established as a Section of the ISI. “It is the mission of the IASC to link traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge.”
     • 1989: Gregory Piatetsky-Shapiro organizes and chairs the first Knowledge Discovery in Databases (KDD) workshop.
     • 1993: J. Chambers presents the concept of learning from data as a challenge as well as an exciting opportunity for Statistics.
     • 1996: The International Federation of Classification Societies (IFCS) uses the term for the first time, in the conference “Data science, classification, and related methods”.
     • 1996: Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth publish “From Data Mining to Knowledge Discovery in Databases”.
     • 1997: C.F. Jeff Wu: “Statistics = Data Science?”
     • 2001: William S. Cleveland publishes “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics”.
     • 2002/2003: Launch of Data Science Journal / launch of Journal of Data Science.
     • 2007: The Research Center for Dataology and Data Science is set up at Fudan University, China.
     • 2010: Mike Loukides writes “What is Data Science?”. Drew Conway: “DS Venn diagram”.
     • 2012: Tom Davenport & D.J. Patil, “Data Scientist: The Sexiest Job of the 21st Century”.
  5. 5. Tukey 1962: “…my central interest is in data analysis, which I take to include, among other things: procedures for analysing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analysing data…”
     Tukey identified four driving forces in the new science: “Four major influences act on data analysis today:
     1. The formal theories of statistics
     2. Accelerating developments in computers and display devices
     3. The challenge, in many fields, of more and ever larger bodies of data
     4. The emphasis on quantification in an ever wider variety of disciplines”
  6. 6. The origin of Data Science: Benzécri’s 5 principles of Data Analysis. Forbes published “A Very Short History of Data Science”, but it may be too short, as it omits the fundamental contribution of J.P. Benzécri in the 1960s. In the book “L'analyse des données”, published by Dunod in 1973, Benzécri sets out for the first time the 5 major principles on which data analysis must be based.
     • The first principle states that “statistics is not probability; under the name of (mathematical) statistics a pompous discipline was built, based on theoretical assumptions that are rarely met in practice.”
     • The second principle states that “the models should follow the data, not vice versa.” In effect it asserts the priority of the data, that is, a data-driven approach to the extraction of knowledge.
     • The third specifies that “you must simultaneously process the information relating to the greatest possible number of dimensions, so as to provide a sufficiently complete representation of the phenomena of interest.” This principle seems to anticipate the role of “big data”.
     • Finally, the last two principles concern the essential use of the computer to process the data: “for the analysis of complex phenomena (facts) the computer is indispensable” and even “using the computer implies abandoning all the techniques designed before the advent of computing”. This latter principle advocates a change in the paradigm of classical statistics.
  7. 7. CHANGE OF PARADIGM IN SCIENCE
     Paradigm | Nature | Form | When
     First | Experimental science | Empiricism; describing natural phenomena | pre-Renaissance
     Second | Theoretical science | Modelling and generalization | pre-computers
     Third | Computational science | Simulation of complex phenomena | pre-big data
     Fourth | Exploratory science / Data Science | Data-intensive; statistical exploration and data mining | Now
     By Science (Wikipedia) «we mean a system of knowledge obtained through an organized research activity and with methodical and rigorous procedures (the scientific method), with the aim of reaching, through tests, a plausible, objective and predictive description of reality and of the laws that regulate the occurrence of phenomena».
     The data revolution, characterized by large amounts of data (big data) of various types (numeric, ordinal, nominal, symbolic, texts, images, data streams, multi-way, networks, etc.) coming from disparate sources (surveys, administrative and official data, social media, sensors, transactions, open data), offers great opportunities to enhance knowledge in many key research areas, and will bring a strong change in the paradigm of science.
  8. 8. Data revolution: more and new data. Stream data; symbolic data; multi-source data; text data; high-dimensional data; multimedia data; network data; complex data.
  9. 9. Do data scientists use the scientific method? To be termed scientific, a method of acquiring knowledge is commonly based on empirical or measurable evidence subject to specific principles of reasoning. The Oxford Dictionaries Online defines the scientific method as “a method or procedure that has characterized natural science since the 17th century”, consisting of:
     (1) systematic observation;
     (2) formulation of hypotheses;
     (3) performing an experiment;
     (4) collecting and analysing data to confirm (test) the hypotheses; if they are rejected, go back to (2) and refine or alter the hypotheses;
     (5) reporting the findings;
     (6) ensuring the reproducibility of the results, in order to develop a theory or take action.
     Experiments are an important tool of the scientific method. The best hypotheses lead to predictions that can be tested in various ways. The strongest tests of hypotheses come from carefully controlled experiments that gather empirical data.
  10. 10. The Data Science Method. How to carry out a data science project using a methodological approach similar to the scientific method, dubbed the Data Science Method:
      1. Problem Identification
      2. Data Collection, Organization, and Definitions
      3. Exploratory Data Analysis
      4. Pre-processing and Training Data Development
      5. Fit Models with Training Data Set
      6. Review Model Outcomes: iterate over additional models as needed
      7. Identify the Final Model
      8. Apply the Model to the Complete Data Set
      9. Review the Results: share your findings
      10. Finalize Code and Documentation
      The biggest difference between people who are successful as data scientists and those who are not is their ability to frame data science projects effectively and to communicate project outcomes.
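A minimal sketch of steps 2 and 4 to 8 of the list above, using only the Python standard library. Everything in it is an assumption made for illustration and none of it comes from the slides: the synthetic data, the 80/20 split, the two candidate models (a constant baseline and a least-squares line) and the function names.

```python
import random
import statistics


def fit_mean(points):
    """Baseline model: always predict the mean of y, i.e. y = a + 0*x."""
    return statistics.fmean(y for _, y in points), 0.0


def fit_line(points):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx = statistics.fmean(x for x, _ in points)
    my = statistics.fmean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b


def mse(model, points):
    """Mean squared prediction error of a model (a, b) on the given points."""
    a, b = model
    return statistics.fmean((y - (a + b * x)) ** 2 for x, y in points)


# Step 2 (data collection): synthetic data, y = 2 + 0.5*x + Gaussian noise.
rng = random.Random(0)
data = [(x, 2.0 + 0.5 * x + rng.gauss(0, 1)) for x in range(100)]

# Step 4 (training data development): shuffle, then hold out 20% for review.
rng.shuffle(data)
train, test = data[:80], data[80:]

# Steps 5-6 (fit models, review outcomes): score each candidate on held-out data.
candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: mse(fit(train), test) for name, fit in candidates.items()}

# Step 7 (identify the final model): lowest held-out error wins.
best = min(scores, key=scores.get)

# Step 8 (apply to the complete data set): refit the chosen model on all data.
final_model = candidates[best](data)
```

Scoring the candidates on held-out data rather than on the training set is one common way to carry out the review in step 6; other validation schemes, such as cross-validation, would fit the same outline.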
  12. 12. Data Science definitions database. Columns in the original table: year, definition, web page. Each entry below gives the definition followed by the values attached to it on the slide.
      • “A field of big data which seeks to provide meaningful information from large amounts of complex data. Data Science combines different fields of work in statistics and computation in order to interpret data for the purpose of decision making” (2, academic, 2014)
      • “A major goal of Data Science is to make it easier for others to find and coalesce data with greater ease. Data Science technologies impact how we access data and conduct research across various domains, including the biological sciences, medical informatics, social sciences and the humanities.” (2, academic, 2010)
      • “[Ability to] obtain, scrub, explore, model and interpret data, blending hacking, statistics, and machine learning” (1, professional, 2010)
      • “An unfortunate, unclear and misleading term that has emerged recently which refers to some subset of activities in the overall knowledge discovery process. What additional descriptive power data science provides beyond data mining and knowledge discovery is unclear.” (2, academic, 2017)
      • “Data Science aims to transform data into actionable knowledge to perform predictions as well to support and validate decisions. Computer Science represents the language of the Data Science whereas Statistics is the Logic of the Data Science itself. However, in this process the domain expertise constitutes the catalytic element in the absence of which the transformation cannot be achieved.” (2, academic, 2012)
      • “Data Science becomes clear pretty quickly that data science has two parents in traditional academia: statistics and computer science.(
  13. 13. Data Science through an SNA
  14. 14. A lexical correspondence analysis of 70 DS definitions. 1st axis: opposition between Research and Professional DS. 2nd axis: opposition between domain Data Sciences. A typology in 4 clusters: Epistemology DS, Methodology DS, Social DS, Business DS.
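As a rough illustration of the input to a lexical correspondence analysis, the sketch below builds a definitions-by-words contingency table and computes its total inertia (the chi-square statistic divided by the grand total), which is the quantity a correspondence analysis decomposes along its axes. It is not the analysis behind the slides: the two toy texts, the stop-word list and the absence of lemmatization are all assumptions, and no axes are extracted.

```python
import re
from collections import Counter

# Toy corpus (assumption): two short definitions, one per source type.
definitions = {
    "academic": "Data Science is the science of extracting knowledge from data",
    "professional": "A data scientist excels at analyzing data to help a business",
}

# Hypothetical stop-word list; a real study would use lemmatization as well.
STOPWORDS = {"a", "the", "is", "of", "from", "to", "at"}


def tokens(text):
    """Lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


# Contingency table: one row per definition, one column per word.
counts = {label: Counter(tokens(text)) for label, text in definitions.items()}
vocab = sorted(set().union(*counts.values()))
table = [[counts[label][w] for w in vocab] for label in definitions]

# Chi-square statistic of independence between rows and columns.
n = sum(map(sum, table))
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]
chi2 = sum(
    (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
    for i in range(len(table))
    for j in range(len(vocab))
)

# Total inertia of the table: the variance that the CA axes would share out.
inertia = chi2 / n
```

A full correspondence analysis would go on to take a singular value decomposition of the standardized residuals of this table to obtain the axes; that step is left out here.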
  15. 15. Cluster analysis of Data Science: central definitions
      First group: Data Science Epistemology
      • (18) “Dataology and Data Science emphasizes on both theories and technologies, more importantly, it studies the laws in datanature not only ones in nature. It would represent the future direction and have breakthrough in the near future”
      • (16) “Dataology and Data Science is an umbrella of theories, methods and technologies for studying phenomena and laws of datanature”
      Second group: Data driven (Social) Data Science
      • (3) “Whether... data is search terms, voice samples, or product reviews,... users are in a feedback loop in which they contribute to the products they use. That's the beginning of Data Science.”
      • (46) “Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured”
      Third group: Business Data Science
      • (21) “So far the main goal of Data Science is to provide a statistical framework for studying the problem of gaining knowledge, making predictions, making decisions or constructing models for specific domains.”
      • (20) “It may be helpful to think of Data Science and business intelligence as being on two ends of the same spectrum, with business intelligence focused on managing and reporting existing business data in order to monitor or manage various concerns within the enterprise. In contrast, Data Science applies advanced analytical tools and algorithms to generate predictive insights and new product innovations that are a direct result of the data”
      • (29) “Data Science aims to transform data into actionable knowledge to perform predictions as well to support and validate decisions. Computer Science represents the language of the Data Science whereas Statistics is the Logic of the Data Science itself. However, in this process the domain expertise constitutes the catalytic element in the absence of which the transformation cannot be achieved.”
      Fourth group: Data Science Methodology
      • (22) “Data Science incorporates varying elements and builds on techniques and theories from many fields, including mathematics, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.”
      • (49) “‘Data science’ is the general analysis of the creation of data. This means the comprehensive understanding of where data comes from, what data represents, and how to turn data into actionable information (something upon which we can base decisions). This encompasses statistics, hypothesis testing, predictive modeling, and understanding the effects of performing computations on data, among other things. Science in general has been armed with many of these tools, but data science pools the necessary tools together to provide a scientific discipline to the analysis and productizing of data.”
  16. 16. Summarizing: Data Science is… Data Science is an interdisciplinary approach, based mainly on the methods of Computational Science and Statistics, suitably supplemented by knowledge of the different domains, to meet the new challenges posed by the Information Society. Computational Science represents the language of Data Science, whereas Statistics is its logic. Knowledge of the various domains of interest constitutes the prerequisite of a Data Science. From this point of view, it would therefore be preferable to speak of DATA SCIENCES. Data Sciences adopt and/or develop appropriate methodologies for knowledge discovery, forecasting and decision-making in the face of an increasingly complex reality, often characterized by large amounts of data (big data) of various types (numeric, ordinal, nominal, symbolic data, texts, images, data streams, multi-way data, networks, etc.) coming from disparate sources (surveys, official data, social media, sensors, transactions, open data). The main novelty in the Data Sciences is the role played by KNOWLEDGE. Its encoding in the form of logical rules or hierarchies, graphs, metadata and ontologies will offer a new and more effective perspective on data analysis and the interpretation of results, if properly integrated into the methods of a Data Science. It is in this sense that a Data Science is a discipline whose methods result from the intersection of Statistics, Computer Science and a Knowledge Domain, and whose purpose is to give meaning to the data. Alternatively, a Data Science can be defined as a knowledge-based Computational Statistics, or as “intelligent” computational/statistical data analysis.
  17. 17. The Data Science curvilinear triangle: a DS definition by Carlo Lauro
      Data Science = knowledge-based or “intelligent” Computational Statistics = “intelligent” computational or statistical data analysis.
      • Some CS tools: data extraction and preparation; data warehousing; optimization and numerical algorithms; simulation; high performance computing; R; Hadoop; Python; SAS; RapidMiner; Tableau; visualization; data mining; A.I.; ANN; machine learning; …
      • Some Stat tools: exploratory methods; density estimation; regression; time series; causal models and SEM; Bayesian models; factorial analysis and PCA; cluster analysis; classification; SNA; …
      • Some knowledge representation tools: logical rules; hierarchical rules; probability models; graphs; networks; metadata; ontologies; …
      Diagram labels (the role of Knowledge in DS): STATISTICS, COMPUTATIONAL SCIENCE and KNOWLEDGE DOMAIN, with DS at the centre; Computational Statistics; Statistical Data Analysis (SDA -> Data = Model + Error); Computational Data Analysis (CDA -> Data = Algorithm + Accuracy) (the 2 cultures, Breiman).
      Data Science adopts and/or develops appropriate methodologies for knowledge discovery, prediction and decision-making in the face of an increasingly complex reality, often characterized by large amounts of data (big data) of various types (numeric, ordinal, nominal, symbolic, texts, images, data streams, multi-way, networks, etc.) coming from disparate sources (surveys, official data, social media, sensors, transactions, open data, etc.).
      Data Science (DS) is an interdisciplinary approach to meet the challenges of the Information Society, based on the methods of Computational Science and Statistics supplemented by knowledge of the different domains. Computational Science represents the language of Data Science, whereas Statistics is its logic. Knowledge of the various domains of interest constitutes the prerequisite of a Data Science.
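The two slogans in the triangle, Data = Model + Error and Data = Algorithm + Accuracy, can be spelled out a little more formally. The notation below is an assumption introduced for illustration and does not appear on the slides: observations $(x_i, y_i)$, a parametric model $f$ with parameters $\theta$, an error term $\varepsilon_i$, a fitted algorithm $A$, and a loss $L$ with values in $[0, 1]$ (for example the 0-1 loss) evaluated on $m$ held-out cases.

```latex
\begin{align*}
\text{SDA:}\quad & y_i = f(x_i;\theta) + \varepsilon_i,
  \qquad \mathbb{E}[\varepsilon_i] = 0, \\
\text{CDA:}\quad & \hat{y}_i = A(x_i),
  \qquad \text{accuracy} = 1 - \frac{1}{m}\sum_{i=1}^{m} L\bigl(y_i, A(x_i)\bigr).
\end{align*}
```

Read this way, the statistical view judges a model by how well the assumed structure $f$ and the error term describe the data-generating process, while the algorithmic view treats $A$ as a black box and judges it only by its predictive accuracy.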
  18. 18. Computational science (also scientific computing) is a rapidly growing multidisciplinary field that uses advanced computing capabilities to understand and solve complex problems. It is an area of science which spans many disciplines, but at its core it involves the development of models and simulations to understand natural systems. Computational science is now commonly considered a third mode of science, complementing and adding to experimentation/observation and theory. Substantial effort in computational sciences has been devoted to the development of algorithms (numerical and non-numerical), computer simulations, their efficient implementation in programming languages, and validation of the results to solve science, engineering, and humanities problems.
      A computational scientist should be capable of:
      - recognizing complex problems;
      - adequately conceptualising the system containing these problems;
      - designing algorithms suitable for studying this system;
      - choosing a suitable computing infrastructure (parallel computing / grid computing / supercomputers);
      - maximising the computational power of the simulation;
      - assessing to what level the output of the simulation resembles the system, i.e. whether the model is validated;
      - adjusting the conceptualisation of the system accordingly;
      - repeating the cycle until a suitable level of validation is obtained.
      The computational scientist trusts that the simulation generates adequately realistic results for the system, under the studied conditions. Computational science is not to be confused with computer and information science, which develops and optimizes the advanced system hardware, software, networking, and data management components needed to solve computationally demanding problems.
  19. 19. Data Scientist vs Statistician on Google citations (chart comparing the two series: Data Scientist, Statistician)
  20. 20. Data Scientist definitions database
      ID | AUTHOR | DEFINITION
      1 | DJ Patil | A data scientist is that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data
      2 | Mike Loukides | Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others
      3 | Jake Porway | A data scientist is a rare hybrid, a computer scientist with the programming abilities to build software to scrape, combine, and manage data from a variety of sources and a statistician who knows how to derive insights from the information within. She combines the skills to create new prototypes with the creativity and thoroughness to ask and answer the deepest questions about the data and what secrets it holds
      4 | Steve Hillion | analytically-minded, statistically and mathematically sophisticated data engineers who can infer insights into business and other complex systems out of large quantities of data
      5 | Hillary Mason | A data scientist is someone who blends math, algorithms, and an understanding of human behavior with the ability to hack systems together to get answers to interesting human questions from data
      6 | Anjul Bhambhri | A data scientist is part digital trendspotter and part storyteller stitching various pieces of information together
      7 | Malcolm Chisholm | A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to the organization
      8 | Pat Hanrahan | The definition of "data scientist" could be broadened to cover almost everyone who works with data in an organization. At the most basic level, you are a data scientist if you have the analytical skills and the tools to 'get' data, manipulate it and make decisions with it
      9 | Monica Rogati | By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It's Columbus meet Columbo – starry eyed explorers and skeptical detectives
      (next row, cut off on the slide) A data scientist is someone who can obtain, scrub, explore, model and interpret
  21. 21. Data Scientists through an SNA
  22. 22. A lexical correspondence analysis of 80 definitions of data scientist. 1st axis: opposition between Researcher and Professional Data Scientists. A typology of the lemmas in 4 groups makes it possible to identify different profiles of data scientist.
  23. 23. Data Scientists, Clusters 1 & 2: central definitions
      CLUSTER | DEFINITION | V.TEST | SEMANTIC AREA
      “ANALYZING DATA FOR KNOWLEDGE” | A data scientist basically needs to understand the data, extract information and create meaningful data products out of it. There are various technicalities involved in a data and despite software and hardware constraints, a scientist with all his expertise and knowledge has to crack the most complex data problems. Billions of people around the globe interact and utilize social media platforms. But have you ever wondered how so many accounts and the data are stored and kept secured? Ever wondered how many accounts have been left underutilized or unused? This is where the data scientist comes in and uses his skills of getting an insight to the data, understand theories and begin applying them. In this scenario, understanding the domain expertise becomes very crucial (Patrao N.) | 3.49 | Researcher
      “SKILLS FOR WORKING WITH (BIG) DATA” | Data Scientist is a job title for an employee who excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge. A data scientist possesses a combination of analytic, machine learning, data mining and statistical skills as well as experience with algorithms and coding (Ramakrishna N.) | 5.92 | Professional
  24. 24. Data Scientists, Clusters 3 & 4: central definitions
      CLUSTER | DEFINITION | V.TEST | SEMANTIC AREA
      “DEALING WITH NEW METHODOLOGICAL ISSUES” | Perform and interpret data studies and product experiments concerning new data sources or new uses for existing data sources. Develop prototypes, proof of concepts, algorithms, predictive models, and custom analysis. Design and build new data set processes for modeling, data mining, and production purposes. Determine new ways to improve data and search quality, and predictive capabilities (Castillo M.) | 5.14 | Researcher
      “IT’S A NEW JOB” | A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge. Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization (Ventura E.) | 7.32 | Professional
  25. 25. What do data scientists do?
  26. 26. From the point of view of the labour market, more and more data-related job titles are appearing. Some of the most prominent are:
      • Statistician
      • Data Scientist
      • Data Analyst
      • Business Analyst
      • Business Intelligence Manager
      • Data/Analytics Manager
      • Data Engineer
      • Data Architect
      • Data Administrator
  27. 27. DATA JOBS Data job trends
  28. 28. Data job descriptions
      Data Analyst. A Data Analyst works to interpret data to get actionable insights for the company. With a strong background in statistics and the ability to convert data from a raw form to a different format (data munging), the Data Analyst collects, processes and applies statistical algorithms to structured data.
      • Responsibilities: data collection and processing, programming, machine learning, data munging, data visualization, applying statistical analysis
      • Languages: R, Python, SQL, NoSQL, HTML, JavaScript, C/C++, SPSS
      Data Scientist. A Data Scientist’s mission is similar to that of a Data Analyst: find actionable insights that are key to a company’s growth and decision-making. However, a Data Scientist is needed in the case of big data, which requires more robust skills for sorting through a lot of unstructured data to identify questions and pull out critical information. The person then cleanses the data for proper analysis and creates new algorithms to run queries that relate data from disparate sources. On top of these skills, a Data Scientist also needs strong storytelling and visualization skills to share insights with peers across the company.
      • Responsibilities: identifying questions, running queries, data cleansing and processing, predictive modeling, machine learning, applying statistical analysis, correlating disparate data, storytelling and visualization
      • Languages: R, Python, SAS, Hive, MATLAB, SQL, Pig, Spark, Hadoop
  29. 29. Data job descriptions
      Data Architect. A Data Architect is the go-to person for data management, especially when dealing with any number of disparate data sources. With an extensive knowledge of how databases work, as well as how the acquired data relates to the business’s operations, the Data Architect, ideally, is able to speculate how changes will affect the company’s data use, then manipulate the data architecture to compensate for them.
      • Responsibilities: data warehousing, ETL, architecture development, modeling
      • Languages: Hive, SQL, Pig, Spark, XML
      Data Engineer. This role is closely related to the Data Architect. The Data Engineer also works on the management side of data, making some people think the titles are interchangeable. However, a Data Engineer, who usually has a strong background in software engineering, builds, tests and maintains the data architecture.
      • Responsibilities: ETL, installing data warehousing solutions, data modeling, data architecture and development, database architecture testing
      • Languages: R, Python, SAS, MATLAB, SQL, NoSQL, Pig, Hadoop, Java, C/C++
  30. 30. 40 techniques used by Data Scientists: Principal Component; Neural Networks; Support Vector Machine; Nearest Neighbors; Feature Selection; (Geo-)Spatial Modeling; Recommendation Engine; Search Engine; Attribution Modeling; Collaborative Filtering; Rule System; Linkage Analysis; Linear Regression; Logistic Regression; Jackknife Regression; Density Estimation; Confidence Interval; Test of Hypotheses; Pattern Recognition; Clustering; Supervised Learning (classification); Time Series; Decision Trees; Random Numbers; Monte-Carlo Simulation; Bayesian Statistics; Naive Bayes; Association Rules; Scoring Engine; Segmentation; Predictive Modeling; Graphs; Deep Learning; Game Theory; Imputation; Survival Analysis; Arbitrage; Lift Modeling; Yield Optimization; Cross-Validation; Model Fitting.
      “Data science without statistics is possible, even desirable” (Vincent Granville, @DSC 2014)!!!! “Statistics is Dead – Long Live Data Science…” (Lee Baker, @DSC 2016)!!!!!
  31. 31. Techniques used by Data Scientists (Source: KDNuggets 2017)
  32. 32. Data analysed by Data Scientist!!!!! (Source: KDNuggets)
  33. 33. Software used by Data Scientists (Source: KDNuggets)
  34. 34. Largest data analysed by Data Scientist (Source: KDNuggets)
  35. 35. How to become a data scientist? Meeting the need (Peter J Diggle, 2015):
      • School: core mathematics
      • BSc: continue to focus on single disciplines, especially mathematics (including probability) and computing
      • MSc: increase focus on statistics, begin to develop interdisciplinarity, but beware of “cut-and-paste data science” curricula
      • PhD: encourage interdisciplinary and team-based projects
      • PostDoc: focus on training fellowships, to include migrants from other disciplines
  36. 36. Suggestions for a MSc in Business Data Science
  37. 37. Data Science challenges for Statistics. According to a recent poll by KDnuggets, a large majority (68%) of the audience thought that in the era of big data Statistics will become more important, as the foundation of Data Science. The rise of Data Science could be seen as a potential threat to the long-term status of the statistics discipline, but there is also a much greater opportunity to re-emphasize the universal relevance of statistical thinking to the interpretation and exploitation of data, by improving links between statistics and information technology, and also with those communities characterized by new and big data. We hope that statisticians will be able to take this opportunity by developing new methods from a knowledge-domain perspective, i.e.:
      • knowledge-based Computational Statistics
      • intelligent statistical and algorithmic data analysis
      thereby also contributing to the Data Science needs of the different scientific and professional domains that involve new and big data. Cooperation between statisticians and computer scientists in the data revolution era will make it possible to properly address data management and preparation problems (data extraction, data and source integration, data cleaning and validation, knowledge coding). This task accounts for more than 70% of the whole data processing effort. It has a strong impact on data quality and consequently on data science results and actionable knowledge.
  38. 38. About Big Data… “Big data” is everywhere. The term was added to the Oxford English Dictionary in 2013. Now Gartner’s just-released 2017 Hype Cycle shows “big data” passing the “peak of inflated expectations” and moving on its way down into the “trough of disillusionment”. Big data is all the rage. But what does it actually mean? We analysed more than 45 definitions collected on a blog at Berkeley. A commonly repeated definition cites the three Vs: volume, velocity, and variety. But others argue that it is not the size of the data that counts, but the tools being used or the insights that can be drawn from a dataset.
  39. 39. ‘’Let’s talk about Data science’’ A Lexical Correspondence Analysis of 45 Big Data definitions. 1st axis: opposition between Academic Authors and Professional Data Scientists. A typology of the lemmas in 4 groups makes it possible to identify different profiles of data scientist definitions (Academics, Influencers, DS Managers, DS Professionals)
  40. 40. Big Data Definitions – 4 class Typology – The central definitions • The first group (which contains 50% of the lemmas) concerns definitions that aim to identify the characterizing traits of the concept of big data, hence "complex", "dataset", "large", and related concepts such as "analysis" and "technique". This group contains definitions that are, in a sense, mainstream: definitions in which the keywords usually used to describe the phenomenon abound. Among the definitions that represent this group is: "As computational efficiency continues to increase, "big data" will be less about the actual size of a particular dataset and more about the specific expertise needed to process it. "Big data" will ultimately describe any datasets large enough to need high-level programming skills and statistically defensible methodologies in order to transform the data asset into something of value". • The second group contains definitions that try to contextualize the phenomenon of big data; in this group we find many concepts related to the temporal dimension, such as "time", "now" and "new". Among the most representative of this group we find: "Big data, which started as a technological innovation in distributed computing, is now a cultural movement by which we continue to discover how humanity interacts with the world - and each other - at large-scale".
  41. 41. Big Data Definitions – 4 class Typology – The central definitions • In the third group, instead, we find definitions that also give extra-economic perspectives on big data, reflecting on how big data could be useful to humanity as a whole and not only in an economic sense. In this group we find concepts such as "world", "people" and "possibility". Among the most representative definitions of this group we find: Big data is an umbrella term that means a lot of different things, but to me, it means the possibility of doing extraordinary things using modern machine learning techniques on digital data. Whether it is predicting illness, the weather, the spread of infectious diseases, or what you will buy next, it offers a world of possibilities for improving people’s lives. • Finally, the last group contains very technical definitions, also aimed at promoting big data in an economic sense, such as this one: [Big data means] harnessing more sources of diverse data where “data variety” and “data velocity” are the key opportunities. (Each source represents “a signal” on what is happening in the business.) The opportunity is to harness data variety [and] automate “harmonization” of data sources to deliver fast-updating insights consumable by the line-of-business users.
  42. 42. 1st class (31). AnnaLee Saxenian, Dean, UC Berkeley School of Information (Academic) I’m not fond of the phrase “big data” because it focuses on the volume of data, obscuring the far-reaching changes that are making data essential to individuals and organizations in today’s world. But if I have to define it I’d say that “big data” is data that can’t be processed using standard databases because it is too big, too fast-moving, or too complex for traditional data processing tools. 2nd class (28). Gregory Piatetsky-Shapiro, President and Editor (Influencer) The best definition I saw is, “Data is big when data size becomes part of the problem.” However, this refers to the size only. Now the buzzword “big data” refers to the new data-driven paradigm of business, science and technology, where the huge data size and scope enables better and new services, products, and platforms. #BigData also generates a lot of hype and will probably be replaced by a new buzzword, like “Internet of Things,” but “big data”-enabled services companies, like Google, Facebook, 3rd class (5). Mike Cavaretta, Data Scientist Consultant (DS Consultant) You cannot give me too much data. I see big data as storytelling, whether it is through information graphics or other visual aids that explain it in a way that allows others to understand across sectors. I always push for the full scope of the data over averages and aggregations, and I like to go to the raw data because of the possibilities of things you can do with it. 4th class (22). Sharmila Mulligan, CEO and Founder, ClearStory Data (DS Manager) [Big data means] harnessing more sources of diverse data where “data variety” and “data velocity” are the key opportunities. (Each source represents “a signal” on what is happening in the business.) The opportunity is to harness data variety [and] automate “harmonization” of data sources to deliver fast-updating insights consumable by the line-of-business users. Big Data Definitions – 4 class Typology – The central definitions
  43. 43. ‘’Let’s talk about Data science’’ Big Data challenges for Statistics Big data problems, by their nature, usually require multidisciplinary teams. They typically require domain knowledge experts, computational experts, machine learning experts, data miners and statisticians. • In particular, statisticians help translate the scientific question into a statistical question, which includes carefully describing the data structure; the underlying system that generated the data (the model); and what we are trying to assess (the parameters we wish to estimate) or predict. What does Statistics bring to Big Data and where are the opportunities? • Statistics is fundamental to ensuring that meaningful, accurate information is extracted from Big Data, especially for the following: o Data quality; o Missing and incomplete data; o Quantification of the uncertainty of predictions, forecasts and models. Statisticians are skilled at validating and correcting for bias; measuring uncertainty; designing studies and sampling strategies; assessing and certifying data quality; enumerating limitations of studies; dealing with issues such as missing data and other sources of non-sampling error; developing models for the analysis of complex data structures; creating methods for causal inference and comparative effectiveness; eliminating redundant and uninformative variables; integrating data from multiple sources.
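To make the point about quantifying uncertainty in the presence of missing data concrete, here is a minimal sketch of a percentile bootstrap interval for a mean. It is an illustration only, not a method taken from the slides: the function name `bootstrap_ci`, the sample readings, and the choice to simply drop missing values (reasonable only if values are missing completely at random) are all assumptions.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat`, computed on the observed values only."""
    # Assumption: missing entries (None) are dropped, i.e. complete-case analysis.
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values")
    rng = random.Random(seed)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(observed, k=len(observed))) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return stat(observed), estimates[lo_idx], estimates[hi_idx]

# Hypothetical sensor readings with two missing entries
readings = [10.2, 9.8, None, 10.5, 10.1, 9.9, None, 10.4, 10.0, 9.7]
point, low, high = bootstrap_ci(readings)
print(f"mean = {point:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside the point estimate is the kind of uncertainty statement the slide refers to; a more careful treatment of the missing entries (for example, imputation) would replace the complete-case step.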
  44. 44. Data Engineer: Big Data? Data Scientist: No thanks! In order to conduct my business I need Informative Data. [Diagram: Big Data, Informative Data (ID), ID Processing, Information, Knowledge, Decision] The most important thing about data is not its size but its informative content. Some knowledge representation tools: intervals; histograms; logical rules; hierarchical rules; probability models; graphs; networks; metadata; ontologies ... Theory without data is blind. Data without knowledge is lame. A useful approach to ID processing: Symbolic Data Analysis. Big Data Challenges for Data Scientists/Statisticians
  45. 45. Informative Data as Symbolic Data Table (SDT). Fig. in: E. Diday, Thinking by classes in data science: The symbolic data analysis paradigm. WIREs, Vol. 8, Sept.-Oct. 2016. Symbolic Data Analysis tools such as descriptive statistics, PCA, regression, decision trees and clustering have been developed in order to analyze and discover new knowledge from data. «The quality of an SDT can be measured in terms of the explanatory and discriminatory power of its symbolic features». An SDT offers a representation of the variability we find in Big Data. F. Brambilla: «Statistics is the Science that studies the variability of phenomena»
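As a rough illustration of how individual records can be condensed into a symbolic data table, the sketch below describes each class by an interval for a numeric variable and a relative-frequency distribution for a categorical one. The function `build_symbolic_table`, the field names and the transaction records are hypothetical, and this is only one simple way to build such a table, not the procedure used in the cited paper.

```python
from collections import defaultdict

def build_symbolic_table(records, class_key, numeric_key, categorical_key):
    """Aggregate individual records into one symbolic description per class.

    Each class gets an interval (min, max) for the numeric variable and a
    relative-frequency distribution for the categorical variable.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[class_key]].append(rec)

    table = {}
    for cls, rows in groups.items():
        values = [r[numeric_key] for r in rows]
        counts = defaultdict(int)
        for r in rows:
            counts[r[categorical_key]] += 1
        table[cls] = {
            "n": len(rows),
            numeric_key: (min(values), max(values)),
            categorical_key: {k: c / len(rows) for k, c in sorted(counts.items())},
        }
    return table

# Hypothetical transaction records
records = [
    {"store": "A", "amount": 12.0, "payment": "card"},
    {"store": "A", "amount": 30.5, "payment": "cash"},
    {"store": "A", "amount": 18.0, "payment": "card"},
    {"store": "B", "amount": 5.0, "payment": "cash"},
    {"store": "B", "amount": 7.5, "payment": "cash"},
]
sdt = build_symbolic_table(records, "store", "amount", "payment")
for store, description in sdt.items():
    print(store, description)
```

Each row of the resulting table keeps the within-class variability (the spread of amounts, the mix of payment types) that a single average per store would discard.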
  46. 46. Knowledge Discovery is a sequential learning process. Supervised statistical methods allow investigators to produce new knowledge! Knowledge encoding & data integration. Knowledge Pyramid. Toward a Knowledge Based Data Science
  47. 47. Conclusion: toward a Knowledge Based Data Science Data Science is an interdisciplinary approach to meeting the new challenges of the Information Society. It is based mainly on the methods of Statistics and Computational Science, suitably supplemented by the Knowledge of the different domains. Computational Science represents the language of Data Science, whereas Statistics is the Logic of Data Science itself. The Knowledge of the various domains of interest constitutes the prerequisite of a Data Science. Thus, from this point of view, it would be preferable to speak about DATA SCIENCES. The main novelty in the Data Sciences is the role played by KNOWLEDGE. Its encoding in a proper way (intervals, histograms, functions, logical rules or hierarchies, graphs, metadata, ontologies, etc.) can be used in the different steps of a Data Science exercise: - to automate the step of (Big) Data cleaning and refinement (feature selection); - to obtain a new (Big) Data representation in terms of Informative Data; - to drive data processing methods in the right/expected direction, avoiding trivial results; - to allow coherent interpretation of results and enrich storytelling; - to make suitable decisions. For these reasons I like to call such an approach a Knowledge Based Data Science
  48. 48. Prolegomena to Any Future Statistics, that will be able to present itself as a (Data) Science Carlo Lauro Emeritus Professor of Statistics University of Naples Federico II THANK YOU FOR YOUR ATTENTION!! Scientific Meeting in Memory of Simona Balbi Naples, February 19, 2019 Director of the Department of Economic & Managerial Sciences in Digital Era SELF HOCHSCHULE, ZUG - CH