Choosing the Right Big Data Architecture for your Business
Published in: Technology, Business
Transcript

  • 1. Choosing the Right Data Architecture for Your Big Data Projects Presentation 1
  • 2. “There isn’t a cluster big enough to hold your ego!”
  • 3. Presentation 1 Choosing the Right Data Architecture for Your Big Data Projects AGENDA
  • 4. Choosing the Right Data Architecture for Your Big Data Projects Acknowledgements Planning Your Enterprise Data Strategy John Ladley President IMCue Solutions Metrics for Information Management Business Analysis Techniques for Data Professionals Alec Sharp Senior Consultant Clariteq Systems Consulting Steps to a Successful Enterprise Information Management Program Michael F. Jennings Executive Director - Data Governance Walgreens Meta Data Requirements for the Enterprise David Loshin President Knowledge Integrity Advanced MDM: Moving to the Next Level of MDM Success
  • 5. Choosing the Right Data Architecture for Your Big Data Projects Acknowledgements
  • 6. Choosing a Big Data Platform Big Data Platform
  • 7. Relational(SQL) Big Data Platform Choosing a Big Data Platform
  • 8. Big Data Platform (diagram labels: Data Grid, Graph, NewSQL, Analytics/MPP)
  • 9. http://arnon.me/2012/11/nosql-landscape-diagrams/
  • 10. Key Ideas: One Big Data database cannot accommodate all the Big Data types. One size DOES NOT fit all. You need to know the data type and data architecture to select the most appropriate Big Data database.
  • 11. Choosing a Big Data Architecture Big Data Platform Big Data Architecture
  • 12. Choosing a Big Data Architecture: What is Big Data? Big Data is about textual analytics (deriving data from unstructured content) [not dimension or fact tables]. Slide labels: web data; click stream data; social network data; semi-structured data; email; unstructured content; comments; sensor data; vertical industries; structured transaction data; tweets, text messages.
  • 13. Choosing a Big Data Architecture: What do we need to consider when classifying Big Data? Analysis Type: Real Time, Batch. Processing Methodology: Predictive Analytics, Analytical Querying & Reporting, Misc. Data Type: Meta Data, Master Data, Historical, Transactional. Data Frequency: On Demand Feeds, Continuous Feeds, Real Time Feeds, Time Series. Structured, Unstructured, Semi-Structured. Web and Social Media, Machine Generated, Human Generated, Internal Data Sources, Transaction Data, Biometric Data, Via Data Providers, Via Data Originators. Data Consumers: Human, Business Process, Other Enterprise Applications, Other Data Repositories. Hardware: Commodity Hardware, State of the Art Hardware.
  • 14. Choosing a Big Data Architecture
  • 15. Choosing a Big Data Architecture
  • 16. Choosing a Big Data Architecture: Classify Big Data Type According to the Business Needs. Big data business problems by type (business problem; Big Data type; description):
      • Utilities: Predict power consumption. Big Data type: Machine-generated data. Utility companies have rolled out smart meters to measure the consumption of water, gas, and electricity at regular intervals of one hour or less. These smart meters generate huge volumes of interval data that needs to be analyzed. Utilities also run big, expensive, and complicated systems to generate power. Each grid includes sophisticated sensors that monitor voltage, current, frequency, and other important operating characteristics. To gain operating efficiency, the company must monitor the data delivered by the sensor. A big data solution can analyze power generation (supply) and power consumption (demand) data using smart meters.
      • Telecommunications: Customer churn analytics. Big Data type: Web and social data; Transaction data. Telecommunications operators need to build detailed customer churn models that include social media and transaction data, such as CDRs, to keep up with the competition. The value of the churn models depends on the quality of customer attributes (customer master data such as date of birth, gender, location, and income) and the social behavior of customers. Telecommunications providers who implement a predictive analytics strategy can manage and predict churn by analyzing the calling patterns of subscribers.
      • Marketing: Sentiment analysis. Big Data type: Web and social data. Marketing departments use Twitter feeds to conduct sentiment analysis to determine what users are saying about the company and its products or services, especially after a new product or release is launched. Customer sentiment must be integrated with customer profile data to derive meaningful results. Customer feedback may vary according to customer demographics.
  • 17. Choosing a Big Data Architecture: Classify Big Data Type According to the Business Needs. Big data business problems by type (business problem; Big Data type; description):
      • Customer service: Call monitoring. Big Data type: Human-generated. IT departments are turning to big data solutions to analyze application logs to gain insight that can improve system performance. Log files from various application vendors are in different formats; they must be standardized before IT departments can use them.
      • Retail: Personalized messaging based on facial recognition and social media. Big Data type: Web and social data; Biometrics. Retailers can use facial recognition technology in combination with a photo from social media to make personalized offers to customers based on buying behavior and location. This capability could have a tremendous impact on retailers' loyalty programs, but it has serious privacy ramifications. Retailers would need to make the appropriate privacy disclosures before implementing these applications.
      • Retail and marketing: Mobile data and location-based targeting. Big Data type: Machine-generated data; Transaction data. Retailers can target customers with specific promotions and coupons based on location data. Solutions are typically designed to detect a user's location upon entry to a store or through GPS. Location data combined with customer preference data from social networks enable retailers to target online and in-store marketing campaigns based on buying history. Notifications are delivered through mobile applications, SMS, and email.
      • FSS, Healthcare: Fraud detection. Big Data type: Machine-generated data; Transaction data; Human-generated. Fraud management predicts the likelihood that a given transaction or customer account is experiencing fraud. Solutions analyze transactions in real time and generate recommendations for immediate action, which is critical to stopping third-party fraud, first-party fraud, and deliberate misuse of account privileges. Solutions are typically designed to detect and prevent myriad fraud and risk types across multiple industries, including: credit and debit payment card fraud; deposit account fraud; technical fraud; bad debt; healthcare fraud; Medicaid and Medicare fraud; property and casualty insurance fraud; worker compensation fraud; insurance fraud; telecommunications fraud.
  • 18. Key Idea: There are guidelines to help suggest the Big Data Types that are commonly used by each industry.
  • 19. Choosing a Big Data Architecture Classify Big Data Type According to the Business Needs
  • 20. Critical Success Factor: Validate that the data being collected has business value. 55% of Big Data projects don’t get completed, and many others fall short of their objectives. Source: Report: CIOs & Big Data: What Your IT Team Wants You to Know, http://www.infochimps.com/resources/report-cios-big-data-what-your-it-team-wants-you-to-know-6/
  • 21. Choosing a Big Data Architecture Big Data Platform Big Data Architecture Big Data Business Needs by type
  • 22. Ten Big Data Schemas Big Data Architecture
  • 23. Ten Big Data Schemas: Relational - Graph. A graph database stores data in a graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way. Graph databases can help you harvest more value from your data by looking at its relationships. They provide index-free adjacency: every element contains a direct pointer to its adjacent elements, so no index lookups are necessary.
  • 24. Ten Big Data Schemas: Relational - Graph
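A minimal Python sketch of the index-free adjacency idea from slide 23, assuming a toy in-memory model rather than any particular graph database. The `Node` class, `connect`, and `within_two_hops` are hypothetical names introduced only for illustration: each node keeps direct references to its neighbours, so a traversal follows pointers and never consults a global index.

```python
class Node:
    """A graph element that stores direct references to adjacent elements."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []  # pointers to adjacent Node objects, no index involved


def connect(a, b):
    """Create an undirected relationship by storing each node on the other."""
    a.neighbors.append(b)
    b.neighbors.append(a)


def within_two_hops(start):
    """Names reachable in one or two hops, found purely by pointer chasing."""
    found = set()
    for first in start.neighbors:
        found.add(first.name)
        for second in first.neighbors:
            found.add(second.name)
    found.discard(start.name)
    return sorted(found)


# Hypothetical sample data: a chain alice - bob - carol - dave.
alice, bob, carol, dave = (Node(n) for n in ("alice", "bob", "carol", "dave"))
connect(alice, bob)
connect(bob, carol)
connect(carol, dave)

print(within_two_hops(alice))  # ['bob', 'carol']
```

The cost of the traversal depends on how many relationships are followed, not on the total number of nodes stored, which is the property the slide attributes to graph databases.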
  • 25. Ten Big Data Schemas: Relational - Analytics / MPP Columnar. Column-oriented storage organization, which increases performance of sequential record access at the expense of common transactional operations such as single record retrieval, updates, and deletes. Shared nothing architecture, which reduces system contention for shared resources and allows gradual degradation of performance in the face of hardware failure.
  • 26. Ten Big Data Schemas: Relational - Analytics / MPP Columnar
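The trade-off described on slide 25 can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (plain lists in memory, hypothetical column names and values), not how any specific MPP product lays out data: an aggregate reads one column only, while fetching a single record has to visit every column.

```python
# Hypothetical row-oriented data, as a transactional system might hold it.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.0},
    {"order_id": 3, "region": "east", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def fetch_record(cols, position):
    """Single-record retrieval: every column must be visited at that position."""
    return {name: values[position] for name, values in cols.items()}


columns = to_columns(rows)

# Analytical scan: only the 'amount' column is read.
total_amount = sum(columns["amount"])

print(total_amount)              # 250.0
print(fetch_record(columns, 1))  # {'order_id': 2, 'region': 'west', 'amount': 80.0}
```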
  • 27. Ten Big Data Schemas: Relational - Analytics / MPP. Delivers extreme performance and scalability for all your database applications, including Online Transaction Processing (OLTP), data warehousing (DW), and mixed workloads.
  • 28. Ten Big Data Schemas: Relational - Analytics / MPP
  • 29. Ten Big Data Schemas: Relational - NewSQL. Scales out relational databases by virtualizing a distributed database environment. Gives organizations relational data integrity combined with the scalability and flexibility of a modern distributed, multi-site database, to support an unlimited number of users, larger data volumes, and extremely high TPS.
  • 30. Ten Big Data Schemas: Relational - NewSQL
  • 31. Ten Big Data Schemas: PolyStructured – Document Indexing. Provides full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Provides distributed search and index replication. Highly scalable.
  • 32. Ten Big Data Schemas: PolyStructured – Document Indexing
  • 33. Ten Big Data Schemas: PolyStructured - Document. Document databases completely embrace the web. Store data with JSON documents. Access documents and query indexes with web browsers, via HTTP. Index, combine, and transform documents with JavaScript. Works well with modern web and mobile apps. Serve web apps directly. On-the-fly document transformation and real-time change notifications.
  • 34. Ten Big Data Schemas: PolyStructured - Document. Document databases lack a schema, or rigid pre-defined data structures such as tables. Data stored in document databases commonly uses JSON documents. JavaScript is used for MapReduce indexes.
  • 35. Ten Big Data Schemas: PolyStructured – Key Value Stored – InMemory - Data Grid. In-Memory Accelerator for Apache Hadoop, high performance computing, streaming and database, HDFS and MongoDB. Eliminates MapReduce overhead. Dynamically caches, partitions, replicates, and manages application data and business logic across multiple servers. Fully elastic memory-based storage grid. Virtualizes the free memory of a potentially large number of Java virtual machines and makes them behave like a single key-addressable storage pool for application state. IBM WebSphere eXtreme Scale
  • 36. Ten Big Data Schemas: PolyStructured – Key Value Stored – InMemory - Data Grid
  • 37. Ten Big Data Schemas: PolyStructured – Key Value Stored – InMemory - Caching. Run atomic operations like appending to a string; incrementing the value in a hash; pushing to a list; computing set intersection, union and difference; or getting the member with highest ranking in a sorted set. With an in-memory dataset, depending on your use case, you can persist it either by dumping the dataset to disk every once in a while, or by appending each command to a log.
  • 38. Ten Big Data Schemas: PolyStructured – Key Value Stored – InMemory - Caching
  • 39. Ten Big Data Schemas: PolyStructured – Key Value Stored – Columnar. Random, real-time read/write access to your Big Data. Hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
  • 40. Ten Big Data Schemas: PolyStructured – Key Value Stored – Columnar
  • 41. Ten Big Data Schemas: PolyStructured – Distributed File System. Storage and large-scale processing of data sets on clusters of commodity hardware. Distributed, scalable, and portable file system.
  • 42. Ten Big Data Schemas: PolyStructured – Distributed File System
  • 43. Key Ideas: Hadoop is the #1 distributed file system used for Big Data projects. Hadoop is used as the shared data source platform to merge and standardize big data with legacy data.
  • 44. Data As A Service. Single System Management. APIs. Data as a Service. Applications (API) should be based on a single data source platform. Web and Social Media, Machine Generated, Human Generated, Internal Data Sources, Transaction Data, Biometric Data, Via Data Providers, Via Data Originators.
  • 45. Key Ideas: Hadoop is the #1 distributed file system used for Big Data projects. Hadoop is used as the shared data source platform to merge and standardize big data with legacy data. Hadoop is an excellent choice to start building your shared data source platform. Hadoop can become your System of Record (SOR) for Big Data and part of your Master Data Management system (MDM).
  • 46. Critical Success Factors: The date-time format must be standardized across the data platform. International Standard ISO 8601 specifies numeric representations of date and time; the format YYYY-MM-DDThh:mm:ss.sTZD (e.g., 1997-07-16T19:20:30.45+01:00) is suggested and preferred. Unique identifiers (domain keys) must be clearly described using friendly terminology. For example: ‘ID’ should never be a column name; ‘Sales ID’ is too generic; ‘Sales Representative Reporting ID’ is friendly and clearly named.
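One way to apply the date-time rule from slide 46 is to emit every timestamp through a single helper. The sketch below uses only Python's standard `datetime` module; the helper name `to_iso8601_utc` and the choice to convert to UTC before storing are assumptions made for illustration, not something the deck prescribes.

```python
from datetime import datetime, timedelta, timezone


def to_iso8601_utc(value: datetime) -> str:
    """Render a timezone-aware datetime as an ISO 8601 string in UTC."""
    if value.tzinfo is None:
        # A naive timestamp carries no offset, so its meaning is ambiguous.
        raise ValueError("timestamp has no time zone; refusing to guess")
    return value.astimezone(timezone.utc).isoformat()


# The example instant from the slide, expressed with a +01:00 offset.
local = datetime(1997, 7, 16, 19, 20, 30, 450000,
                 tzinfo=timezone(timedelta(hours=1)))

print(local.isoformat())      # 1997-07-16T19:20:30.450000+01:00
print(to_iso8601_utc(local))  # 1997-07-16T18:20:30.450000+00:00
```

Rejecting naive timestamps instead of assuming a zone is a deliberate choice here: silently guessed offsets are a common source of inconsistent data across a shared platform.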
  • 47. Key Idea: Hadoop is used as the shared analytical platform to merge and standardize analytics.
  • 48. Single System Management. Analytics should be based on a single data source platform. Analytics As A Service. IBM WebSphere eXtreme Scale. Analytics. Analytics as a Service.
  • 49. Key Ideas: Hadoop is used as the shared analytical platform to merge and standardize analytics. There are guidelines to help suggest the analytics, KPIs and Profit Drivers for Big Data that are commonly used by each industry.
  • 50. Examples of tasks and algorithms to use (2):
      • Predicting a discrete attribute. Examples: flag the customers in a prospective buyers list as good or poor prospects; calculate the probability that a server will fail within the next 6 months; categorize patient outcomes and explore related factors. Algorithms: Decision Trees Algorithm, Naive Bayes Algorithm, Clustering Algorithm, Neural Network Algorithm.
      • Predicting a continuous attribute. Examples: forecast next year's sales; predict site visitors given past historical and seasonal trends; generate a risk score given demographics. Algorithms: Decision Trees Algorithm, Time Series Algorithm, Linear Regression Algorithm.
      • Predicting a sequence. Examples: perform clickstream analysis of a company's Web site; analyze the factors leading to server failure; capture and analyze sequences of activities during outpatient visits, to formulate best practices around common activities. Algorithms: Sequence Clustering Algorithm.
      • Finding groups of common items in transactions. Examples: use market basket analysis to determine product placement; suggest additional products to a customer for purchase; analyze survey data from visitors to an event, to find which activities or booths were correlated, to plan future activities. Algorithms: Association Algorithm, Decision Trees Algorithm.
      • Finding groups of similar items. Examples: create patient risk profile groups based on attributes such as demographics and behaviors; analyze users by browsing and buying patterns; identify servers that have similar usage characteristics. Algorithms: Clustering Algorithm, Sequence Clustering Algorithm.
  • 51. Key Ideas: Hadoop is used as the shared analytical platform to merge and standardize analytics. There are guidelines to help suggest the analytics, KPIs and Profit Drivers for Big Data that are commonly used by each industry. You do not need to know how the algorithm works or is designed. You only need to know the parameters needed to run them.
  • 52. Data Mining Tasks (4) (task; description; algorithms):
      • Market Basket Analysis. Discover items sold together to create recommendations on-the-fly and to determine how product placement can directly contribute to your bottom line. Algorithms: Association, Decision Trees.
      • Churn Analysis. Anticipate customers who may be considering canceling their service and identify the benefits that will keep them from leaving. Algorithms: Decision Trees, Linear Regression, Logistic Regression.
      • Market Analysis. Define market segments by automatically grouping similar customers together. Use these segments to seek profitable customers. Algorithms: Clustering, Sequence Clustering.
      • Forecasting. Predict sales and inventory amounts and learn how they are interrelated to foresee bottlenecks and improve performance. Algorithms: Decision Trees, Time Series.
      • Data Exploration. Analyze profitability across customers, or compare customers that prefer different brands of the same product to discover new opportunities. Algorithms: Neural Network.
      • Unsupervised Learning. Identify previously unknown relationships between various elements of your business to inform your decisions. Algorithms: Neural Network.
      • Web Site Analysis. Understand how people use your Web site and group similar usage patterns to offer a better experience. Algorithms: Sequence Clustering.
      • Campaign Analysis. Spend marketing funds more effectively by targeting the customers most likely to respond to a promotion. Algorithms: Decision Trees, Naïve Bayes, Clustering.
      • Information Quality. Identify and handle anomalies during data entry or data loading to improve the quality of information. Algorithms: Linear Regression, Logistic Regression.
      • Text Analysis. Analyze feedback to find common themes and trends that concern your customers or employees, informing decisions with unstructured input. Algorithms: Text Mining.
  • 53. Data Mining Algorithms (Analysis Services - Data Mining): Choosing an Algorithm by Task. To help you select an algorithm for use with a specific task, the following table provides suggestions for the types of tasks for which each algorithm is traditionally used (examples of tasks; Microsoft algorithms to use):
      • Predicting a discrete attribute. Examples: flag the customers in a prospective buyers list as good or poor prospects; calculate the probability that a server will fail within the next 6 months; categorize patient outcomes and explore related factors. Algorithms: Microsoft Decision Trees Algorithm, Microsoft Naive Bayes Algorithm, Microsoft Clustering Algorithm, Microsoft Neural Network Algorithm.
      • Predicting a continuous attribute. Examples: forecast next year's sales; predict site visitors given past historical and seasonal trends; generate a risk score given demographics. Algorithms: Microsoft Decision Trees Algorithm, Microsoft Time Series Algorithm, Microsoft Linear Regression Algorithm.
      • Predicting a sequence. Examples: perform clickstream analysis of a company's Web site; analyze the factors leading to server failure; capture and analyze sequences of activities during outpatient visits, to formulate best practices around common activities. Algorithms: Microsoft Sequence Clustering Algorithm.
      • Finding groups of common items in transactions. Examples: use market basket analysis to determine product placement; suggest additional products to a customer for purchase; analyze survey data from visitors to an event, to find which activities or booths were correlated, to plan future activities. Algorithms: Microsoft Association Algorithm, Microsoft Decision Trees Algorithm.
      • Finding groups of similar items. Examples: create patient risk profile groups based on attributes such as demographics and behaviors; analyze users by browsing and buying patterns; identify servers that have similar usage characteristics. Algorithms: Microsoft Clustering Algorithm, Microsoft Sequence Clustering Algorithm.
  • 54. Analytic Algorithm Categories: Regression. A powerful and commonly used algorithm that evaluates the relationship of one variable, the dependent variable, with one or more other variables, called independent variables. By measuring exactly how large and significant each independent variable has historically been in its relation to the dependent variable, the future value of the dependent variable can be estimated. Regression models are widely used in applications such as seasonal forecasting, quality assurance, and credit risk analysis.
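As a small illustration of the regression description on slide 54, the sketch below fits a straight line by ordinary least squares with one independent variable and then estimates a future value. The data (`ad_spend`, `sales`) and the function name `fit_line` are hypothetical; a real project would normally use a statistics library and more than one independent variable.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: independent variable (ad spend) vs dependent (sales).
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
forecast = slope * 5.0 + intercept  # estimate sales for an unseen spend level

print(slope, intercept, forecast)  # 2.0 1.0 11.0
```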
  • 55. Analytic Algorithm Categories: Clustering / Segmentation. The process of grouping items together to form categories. You might look at a large collection of shopping baskets and discover that they are clustered corresponding to health food buyers, convenience food buyers, luxury food buyers, and so on. Once these characteristics have been grouped together, they can be used to find other customers with similar characteristics. This algorithm is used to create groups for applications such as customers for marketing campaigns, rate groups for insurance products, and crime statistics groups for law enforcement.
  • 56. Analytic Algorithm Categories: Nearest Neighbor. Quite similar to clustering, but it only looks at other records in the dataset that are “nearest” to a chosen unclassified record based on a “similarity” measure. Records that are “near” to each other tend to have similar predictive values as well. Thus, if you know the prediction value of one of the records, you can predict its nearest neighbor. This algorithm works similarly to the way that people think, by detecting closely matching examples. Nearest Neighbor is often used in retail and life sciences applications.
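The nearest-neighbor idea on slide 56 can be sketched as follows, assuming Euclidean distance as the "similarity" measure and a single neighbor (k = 1). The records, labels, and the function name `nearest_neighbor` are made up for the example; `math.dist` requires Python 3.8 or later.

```python
import math


def nearest_neighbor(labeled, query):
    """Return the label of the labeled record closest to the query point."""
    closest = min(labeled, key=lambda item: math.dist(item[0], query))
    return closest[1]


# Hypothetical records: (age, monthly spend) with a known spending segment.
customers = [
    ((25, 30), "low"),
    ((23, 28), "low"),
    ((45, 90), "high"),
    ((50, 85), "high"),
]

print(nearest_neighbor(customers, (48, 88)))  # high
print(nearest_neighbor(customers, (24, 31)))  # low
```

In practice the features would be scaled first, since an unscaled attribute with a large range dominates the distance.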
  • 57. Analytic Algorithm Categories: Association Rules. Detects related items in a dataset. Association analysis identifies and groups together similar records that would otherwise go unnoticed by a casual observer. This type of analysis is often used for market basket analysis to find popular bundles of products that are related by transaction, such as low-end digital cameras being associated with smaller capacity memory sticks to store the digital images.
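A rough sketch of the market basket analysis mentioned on slide 57, limited to item pairs. It computes two standard measures, support (share of baskets containing both items) and confidence (share of baskets with the first item that also contain the second). The baskets and function names are hypothetical, and a full association-rule algorithm such as Apriori would also handle larger item sets and pruning.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, one set of purchased items per basket.
baskets = [
    {"camera", "memory stick"},
    {"camera", "memory stick", "case"},
    {"camera", "case"},
    {"memory stick"},
]


def pair_support(transactions):
    """Fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: count / len(transactions) for pair, count in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Of the baskets containing the antecedent, the share with the consequent."""
    with_antecedent = [b for b in transactions if antecedent in b]
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(baskets)
print(support[("camera", "memory stick")])  # 0.5
camera_to_stick = confidence(baskets, "camera", "memory stick")
```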
  • 58. Analytic Algorithm Categories: Decision Tree. A tree-shaped graphical predictive algorithm that represents alternative sequential decisions and the possible outcomes for each decision. This algorithm provides alternative actions that are available to the decision maker, the probabilistic events that follow from and affect these actions, and the outcomes that are associated with each possible scenario of actions and consequences. Their applications range from credit card scoring to time series predictions of exchange rates.
  • 59. Analytic Algorithm Categories: Sequence Association. Detects causality and association between time-ordered events, although the associated events may be spread far apart in time and may seem unrelated. Tracking specific time-ordered records and linking these records to a specific outcome allows companies to predict a possible outcome based on a few occurring events. A sequence model can be used to reduce the number of clicks customers have to make when navigating a company’s website.
  • 60. Analytic Algorithm Categories: Neural Network. A sophisticated pattern detection algorithm that uses machine learning techniques to generate predictions. This technique models itself after the process of cognitive learning and the neurological functions of the brain capable of predicting new observations from other known observations. Neural networks are very powerful, complex, and accurate predictive models that are used in detecting fraudulent behavior, in predicting the movement of stocks and currencies, and in improving the response rates of direct marketing campaigns.
  • 61. Choosing a Big Data Architecture Big Data Platform Big Data Analytical Platform Big Data Analytics Big Data Business Needs by type Big Data Architecture
  • 62. Analytics Data Sources: Analytics should be based on a single data source platform. Analytics As A Service. Analytics as a Service. IBM WebSphere eXtreme Scale
  • 63. Analytics As A Service When you write data to a traditional database, either through loading external data, writing the output of a query, doing UPDATE statements, etc., the database has total control over the storage. The database is the "gatekeeper." An important implication of this control is that the database can enforce the schema as data is written. This is called schema on write. Hive has no such control over the underlying storage. There are many ways to create, modify, and even damage the data that Hive will query. Therefore, Hive can only enforce queries on read. This is called schema on read. So what if the schema doesn’t match the file contents? Hive does the best that it can to read the data. You will get lots of null values if there aren’t enough fields in each record to match the schema. If some fields are numbers and Hive encounters nonnumeric strings, it will return nulls for those fields. Above all else, Hive tries to recover from all errors as best it can.
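The schema-on-read behavior described on slide 63 can be imitated in plain Python to make it concrete. This is only a sketch of the idea, not Hive code: the schema, the field names, and `read_with_schema` are assumptions for illustration. The raw text is stored untouched, a schema is applied at read time, and any field that is missing or does not parse as its declared type comes back as a null (`None`).

```python
# Hypothetical schema applied only when the data is read: (name, type) pairs.
SCHEMA = [("user_id", int), ("country", str), ("amount", float)]


def read_with_schema(line, schema=SCHEMA, delimiter=","):
    """Interpret one raw text line against the schema, nulling bad fields."""
    fields = line.rstrip("\n").split(delimiter)
    record = {}
    for position, (name, cast) in enumerate(schema):
        if position >= len(fields):
            record[name] = None  # too few fields in this record
            continue
        try:
            record[name] = cast(fields[position])
        except ValueError:
            record[name] = None  # e.g. nonnumeric text in a numeric column
    return record


print(read_with_schema("42,US,19.99"))
# {'user_id': 42, 'country': 'US', 'amount': 19.99}
print(read_with_schema("abc,US"))
# {'user_id': None, 'country': 'US', 'amount': None}
```

Because the reader never rejects a line, downstream analytics have to tolerate nulls, which is the "dirty data" cost of schema on read.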
  • 64. http://www.sqlbiinfo.com/2014/02/schema-on-read-vs-schema-on-write.html Schema on Read vs Schema on Write... Analytics As A Service
  • 65. Analytics As A Service
      Benefits of schema on write:
        • Better type safety and data cleansing done for the data at rest
        • Typically more efficient (storage size and computationally) since the data is already parsed
      Downsides of schema on write:
        • You have to plan ahead of time what your schema is before you store the data (i.e., you have to do ETL)
        • Typically you throw away the original data, which could be bad if you have a bug in your ingest process
        • It's harder to have different views of the same data
      Benefits of schema on read:
        • Flexibility in defining how your data is interpreted at load time
        • This gives you the ability to evolve your "schema" as time goes on
        • This allows you to have different versions of your "schema"
        • This allows the original source data format to change without having to consolidate to one data format
        • You get to keep your original data
        • You can load your data before you know what to do with it (so you don't drop it on the ground)
        • Gives you flexibility in being able to store unstructured, unclean, and/or unorganized data
      Downsides of schema on read:
        • Generally it is less efficient because you have to reparse and reinterpret the data every time (this can be expensive with formats like XML)
        • The data is not self-documenting (i.e., you can't look at a schema to figure out what the data is)
        • More error prone and your analytics have to account for dirty data
      http://nosql.mypopescu.com/post/48638541973/schema-on-writes-vs-schema-on-reads-apache-hadoop-and
  • 66. Reporting users make their own schemas and naming standards Reporting users run their own analytics --- as many times as they want
  • 67. Key Ideas - Summary One Big Data database cannot accommodate all the Big Data types You need to know the data type and data architecture to select the most appropriate Big Data database. There are guidelines to help suggest the Big Data Types that are commonly used by each business type. Hadoop is used as the shared data source platform to merge and standardize big data with legacy data Hadoop is used as the shared analytical platform to merge and standardize analytics Hadoop is an excellent choice to start building your shared data source platform Hadoop can become your System of Record (SOR) for Big Data and part of your Master Data Management system (MDM) Hadoop is used to standardize and centralize the Key Performance Indicators (KPI) and Profit Drivers for an Enterprise Analytical Platform There are guidelines to help suggest the analytics, KPI’s and Profit Drivers for Big Data that are commonly used by each industry. Schema on read
  • 68. Critical Success Factors - Summary Validate that the data being collected has business value. The date-time format must be standardized across the data platform. Unique identifiers (domain keys) must be clearly described using friendly terminology.
  • 70. On the Internet, the World Wide Web Consortium (W3C) uses ISO 8601 in defining a profile of the standard that restricts the supported date and time formats to reduce the chance of error and the complexity of software. RFC 3339 defines a profile of ISO 8601 for use in Internet protocols and standards. It explicitly excludes durations and dates before the common era. The more complex formats, such as week numbers and ordinal days, are not permitted. RFC 3339 deviates from ISO 8601 in allowing a zero timezone offset to be specified as "-00:00", which ISO 8601 forbids. RFC 3339 intends "-00:00" to carry the connotation that it is not stating a preferred timezone, whereas the conforming "+00:00" or any non-zero offset connotes that the offset being used is preferred. This convention regarding "-00:00" is derived from earlier RFCs, such as RFC 2822, which uses it for timestamps in email headers. RFC 2822 made no claim that any part of its timestamp format conforms to ISO 8601, and so was free to use this convention without conflict. RFC 3339 errs in adopting this convention while also claiming conformance to ISO 8601. http://www.w3.org/TR/NOTE-datetime http://stackoverflow.com/questions/16307563/utc-time-explanation International Standard ISO 8601 specifies numeric representations of date and time: YYYY-MM-DDThh:mm:ss.sTZD (e.g., 1997-07-16T19:20:30.45+01:00), where: YYYY = four-digit year; MM = two-digit month (01 = January, etc.); DD = two-digit day of month (01 through 31); hh = two digits of hour (00 through 23; am/pm NOT allowed); mm = two digits of minute (00 through 59); ss = two digits of second (00 through 59); s = one or more digits representing a decimal fraction of a second; TZD = time zone designator (Z or +hh:mm or -hh:mm). Times are expressed either in UTC (Coordinated Universal Time), with the special UTC designator "Z", or in local time, together with a time zone offset in hours and minutes. 
A time zone offset of "+hh:mm" indicates that the date/time uses a local time zone which is "hh" hours and "mm" minutes ahead of UTC. A time zone offset of "-hh:mm" indicates that the date/time uses a local time zone which is "hh" hours and "mm" minutes behind UTC.
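As a rough illustration of the format described on this slide, the following Python sketch parses the example timestamp and re-expresses it in UTC with the "Z" designator. The names `W3C_FORMAT`, `parse_w3c`, and `to_utc_string` are made up for this example, and the sketch assumes Python 3.7 or later, where `%z` accepts "Z" and offsets written with a colon.

```python
from datetime import datetime, timedelta, timezone

# Layout matching YYYY-MM-DDThh:mm:ss.sTZD. On Python 3.7+ the %z directive
# accepts "Z", "+hh:mm", and "-hh:mm"; %f accepts one to six fraction digits.
W3C_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_w3c(text):
    """Parse a complete date-time string into a timezone-aware datetime."""
    return datetime.strptime(text, W3C_FORMAT)


def to_utc_string(moment):
    """Render an aware datetime in UTC, using the "Z" designator."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.%fZ")


# The slide's example: 19:20:30.45 local time, one hour ahead of UTC.
local = parse_w3c("1997-07-16T19:20:30.45+01:00")
as_utc = to_utc_string(local)
```

Storing every timestamp in one form such as the UTC string above is one way to meet the earlier requirement that the date-time format be standardized across the data platform. The original offset is lost in that conversion, so keep it separately if local time matters.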
