
Unraveling the mystery of big data


Curious about the Big Data hype? Want to find out just how big "big" is? Who's using Big Data, for what, and what could you use it for? How about the architectural underpinnings and technology stacks? Where might you fit in the stack? Maybe some gotchas to avoid? Lionel Silberman, a seasoned data architect, sheds some light on it all: a good and wholesome refresher on Big Data and what it can do.

Our guest speaker:
Lionel Silberman,
Senior Data Architect, Compuware

Lionel Silberman has over thirty years of experience in big data product development. He has expert knowledge of relational databases, both internals and applications, as well as performance tuning, modeling, and programming. His product and development experience encompasses the major RDBMS vendors; object-oriented, time-series, OLAP, transaction-driven, MPP, distributed, and federated database applications; data appliances; NoSQL systems such as Hadoop and Cassandra; and data-parallel and mathematical algorithm development. He is currently employed at Compuware, integrating enterprise products at the data level. All are welcome to join us.

Published in: Technology
Unraveling the mystery of big data

  1. Unraveling the Mystery of Big Data. Lionel Silberman. May 22, 2014. Copyleft – Share Alike
  2. Who Am I? Lionel Silberman, currently the Senior Data Architect at Compuware • 30 years in software development • Statistical modeling, DBMS, Big Data, data architecture; tech, product, and management roles • Diverse, deep data management experience: – all of the major RDBMS vendors and their internals – data modeling and data-parallelism techniques – OLAP, OLTP, MPP, and NoSQL systems like Hadoop and Cassandra – scaling and performance tuning of distributed and federated applications • Current interest: integrating products in the enterprise at the data level so they deliver more value than their individual pieces • Active interest in big data metadata privacy issues • Who are you? What's your interest in this talk tonight?
  3. (image-only slide)
  4. Unraveling the Mystery of Big Data: Agenda • What is Big Data?  Business Value  Technical Definitions  Sizes and Applications • What Big Data is Not (or why isn’t everything just “data”)? • Architectural Underpinnings • Some Useful Architectural Distinctions • Technology Stacks and Ecosystems • Data Modeling Example • Gotchas: 12 Things to Watch Out For • References and More Info • Questions?
  5. What is Big Data? Business Value • Enabling new products: sensors everywhere; nowcasting; ever-narrower segmentation of customers • Analytics – taking data from input through to decision: correlation in real time; new insights from previously hidden data (social, geographical); recommendations; finding needles in haystacks • In 2010 the industry was worth more than $100 billion, growing at almost 10 percent a year, or about twice as fast as the software business as a whole
  6. What is Big Data? A Technical Definition • Data that exceeds the processing capacity of conventional database systems in volume, velocity, or variety* – the 3Vs! Volume: sheer size and growth. Velocity: how fast it moves. Variety: the inability to derive structure, or the frequency of change. * META Group (now Gartner) analyst Doug Laney
  7. What is Big Volume? • 1970s: Megabytes • Now: many organizations are approaching an Exabyte • Examples: • Google – capacity for 15 Exabytes • NSA – capacity for a Yottabyte in Utah • AWS – 1 trillion objects in 2012 • Facebook – 500 Terabytes/day • Scientific pursuits: • Large Hadron Collider at CERN – 30 Petabytes last year • The NASA Center for Climate Simulation – 32 Petabytes of climate observations and simulations on the Discover supercomputing cluster • Sloan Digital Sky Survey (SDSS) – 140 Terabytes total, 200 GB per night; the Large Synoptic Survey Telescope is anticipated to acquire 140 TB every five days • - 90 Petabytes in data warehouses
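The slide's volume figures invite a quick sanity check. A minimal back-of-envelope sketch, assuming decimal units (1 EB = 1,000,000 TB) and taking Facebook's quoted 500 TB/day at face value:

```python
# Back-of-envelope check on the slide's volume figures.
# Assumption: decimal units, i.e. 1 Exabyte = 1,000,000 Terabytes.

TB_PER_EB = 1_000_000

def days_to_reach(target_eb: float, tb_per_day: float) -> float:
    """Days of ingest needed to accumulate target_eb exabytes."""
    return target_eb * TB_PER_EB / tb_per_day

days = days_to_reach(1, 500)       # Facebook's quoted rate
print(days, round(days / 365, 1))  # 2000.0 days, about 5.5 years
```

At that rate a single site accumulates an exabyte in roughly five and a half years, which is why "many organizations are approaching an Exabyte" is plausible for 2014.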
  8. What is Big Velocity? • Financial trading volume • Retail – Cyber Monday (every click and interaction, not just the final sales) • Government – Affordable Care Act • Smartphones – geolocated imagery and audio data • Fraud (complex event processing): o credit card traffic patterns o phone slamming/cramming • Streaming – Netflix, Snapfish • Retail – Walmart handles more than 1 million customer transactions per hour • Compuware APM (my firm) – 25K transactions per second • MMOG – Massively Multiplayer Online Games
  9. What is Big Variety? • Diverse sources and destinations: o Document backup or archival – HP, EMC, AWS o Pictures and video – Facebook, 50 billion photos o Sensor sources – GE, NetApp o Multi-device – Dropbox and SugarSync • Big Data is messy: o structure aids meaning, but can change frequently o multiple sources (e.g. financial feeds, browser incompatibilities) o application integration issues (e.g. Fitbit) o entity resolution issues (e.g. “Portland”, “dog”) o visualization is increasingly important
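The entity-resolution point deserves a concrete illustration: the same real-world entity arrives under different spellings from different feeds. A toy sketch using only string similarity (the stdlib's `difflib` as a stand-in; a real pipeline would also use context such as geography or co-occurring fields):

```python
# Toy illustration of the entity-resolution problem from the slide.
# Assumption: difflib string similarity is a stand-in for a real matcher.
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Crude duplicate test: case-folded string similarity above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Same city, different spellings -- caught:
print(similar("Portland, OR", "portland or"))        # True
# Different city entirely -- string distance happens to separate these,
# but "Portland" alone is ambiguous, which is the slide's point:
print(similar("Portland, OR", "Portland, Maine"))    # False
```

The threshold and the `similar` helper are illustrative, not from the talk; they only show why "Portland" without context is an unresolved entity.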
  10. Or a 4th or 5th V?
  11. Big Data and Visualization - Wikipedia
  12. What Big Data is Not (or why isn’t everything just “data”)? • Traditional systems may not need it: o Payroll o Human resources o Shop machine sampling? • The technology of Big Data requires some tradeoffs: o bleeding-edge vs. established technology o a subtler definition of consistency o complexity o new and hard-to-find skills • Make sure the business case warrants and can tolerate the tradeoffs…
  13. Big Data is Everywhere?
  14. Architectural Underpinnings: the CAP Theorem • Consistency (C): a single up-to-date copy of the data. • High Availability (A): availability of the data for writes. • Partition Tolerance (P): the system continues to operate despite arbitrary message loss or failure of parts of the system. (Diagram: a NoSQL DBMS can provide at most two of the three.)
  15. NoSQL and Eventual Consistency • Consistency is relaxed or weakened to achieve high availability and partition tolerance. • Eventual consistency: – an unbounded delay in propagating changes across partitions; – no ordering guarantees at all, thus lower-level transaction atomicity. From a system perspective this means an operational system is ALWAYS in an inconsistent state. • Many NoSQL systems (e.g. Cassandra, Hadoop): – support self-healing or restartability – allow ease of scaling and disaster recovery – are schema-free – have no standard way of retrieving data (e.g. no SQL)
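The eventual-consistency idea above can be sketched in a few lines. This is a minimal last-write-wins (LWW) model, the per-cell conflict rule Cassandra applies via write timestamps; the `Replica` class and the key names are illustrative, not from the talk:

```python
# Minimal sketch of eventual consistency with last-write-wins (LWW).
# Each replica stores key -> (timestamp, value), accepts writes
# independently, and converges when replicas exchange state.

class Replica:
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        # Accept the write only if it is newer than what we already hold.
        if key not in self.store or ts > self.store[key][0]:
            self.store[key] = (ts, value)

    def merge(self, other):
        # Anti-entropy: pull the other replica's newer versions.
        for key, (ts, value) in other.store.items():
            self.write(key, value, ts)

a, b = Replica(), Replica()
a.write("user:42", "alice@old.example", ts=1)  # write lands on replica a
b.write("user:42", "alice@new.example", ts=2)  # concurrent write on replica b
# While partitioned, readers of a and b see different values --
# the "ALWAYS in an inconsistent state" point from the slide.
a.merge(b); b.merge(a)
print(a.store == b.store)       # True: replicas have converged
print(a.store["user:42"])       # (2, 'alice@new.example')
```

Note what LWW gives up: the ordering comes only from timestamps, so concurrent writes with skewed clocks can silently lose data, which is exactly the "care in the application layer" that eventual consistency demands.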
  16. Business and Infrastructure Architecture Decisions Infrastructure: • Amazon Web Services (AWS) • Storage • Elasticity • Availability • Data division: • Parallelism (sharding) • Redundancy • Application servers • Data affinity • Stateless protocols Business issues: • Research-to-production pipeline? • 3rd-party integration needs? • Flexibility? • Radical transparency?
  17. Data Architecture Decisions Data in-transit: • high writes vs. high reads • Encryption and security • Distributed vs. centralized • Visualization tools Data stores: • Documents • Key/value pairs • Graphs • In-memory
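The "parallelism (sharding)" decision from the infrastructure slide can be made concrete with the simplest possible routing function: hash each key onto one of N shards. A sketch, assuming a stable hash (MD5 here rather than Python's process-randomized `hash()`) so placement is the same in every process:

```python
# Sketch of hash-based sharding: route each key deterministically to one
# of num_shards partitions. Assumption: MD5 is used only as a stable,
# well-distributed hash, not for security.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key onto one of num_shards partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Every process computes the same shard for the same key,
# so reads find the data where writes put it.
print(shard_for("user:42", 8))
```

The design caveat: plain modulo hashing reshuffles nearly every key when `num_shards` changes, which is why production stores (Cassandra, Riak, Dynamo) use consistent hashing instead.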
  18. Technology Stacks (Data Stores) • Hadoop: distributed file system, job scheduler, MapReduce programming model – Pros: fault-tolerant, disaster protection, data parallelism fits many applications, rich ecosystem. – Best used for: long-term data storage, research, the basis of other, more flexible data stores. • Cassandra: Big Table-style, key-value – Pros: fast writes, no single point of failure, fault tolerance, disaster recovery, columns and column families. – Best used for: when you write more than you read; the financial industry; real-time data analysis. • Riak: key-value – Pros: Cassandra-like but less complex; single-site scalability, availability, and fault tolerance. – Best used for: point-of-sale and factory control systems; high write volumes. • Redis: in-memory, key-value – Pros: fast, transactional, expiring values. – Best used for: rapidly changing data in memory (stock prices, analytics, real-time data collection). • Dynamo: Big Table-style, key-value – Pros: fast reads and writes, no single point of failure, fault tolerance, disaster recovery, eventually consistent. – Best used for: always-available services (e.g. Amazon). • CouchDB: documents – Pros: bi-directional replication, conflict detection, previous versions. – Best used for: accumulating occasionally changing data, pre-defined queries, versioning. • MongoDB: document store – Pros: update-in-place, defined indexes, built-in sharding, geospatial indexing. – Best used for: dynamic queries on schema-less data that changes a lot. • HBase: Big Table – Pros: huge datasets; map-reduce on the Hadoop/HDFS stack. – Best used for: analyzing log data. • Memcached/Membase: in-memory and multi-node – Pros: low latency, high concurrency and availability. – Best used for: online gaming (e.g. Zynga). • Neo4j: graph DB – Pros: highly scalable, robust, ACID. – Best used for: social, routing, and recommendation questions (e.g. How do I get to Linz?)
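The MapReduce programming model named on the Hadoop bullet reduces to three phases: map emits (key, value) pairs, the framework groups pairs by key, and reduce folds each group. A minimal in-process sketch of the canonical word count (Hadoop runs the same phases distributed across a cluster; everything here is one Python process):

```python
# In-process sketch of the MapReduce model: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(docs):
    # Map: emit (word, 1) for every word in every document.
    for doc in docs:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each group of values into one result per key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "Big Data is big"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'is': 1}
```

The point of the model is that map and reduce are pure per-key functions, so the framework is free to run them on thousands of machines and handle failures by re-running tasks, which is where Hadoop's fault tolerance comes from.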
  19. Related Data Management Ecosystems • Apache Hadoop stack (Cloudera) – MapReduce on HDFS – Pig, Hive, HBase – SQL-like DB interfaces on top of HDFS – Flume, RabbitMQ – data conduits and message queues – Splunk – operational analytics and log processing – Sqoop – bulk data transfer to DBs – Puppet, Chef – configuration management and DevOps orchestration – Visualization, BI and ETL – Informatica, Talend, Pentaho, Tableau • Cloud computing infrastructure (Amazon Web Services), e.g. EC2, Elastic MapReduce, RDS • Cassandra (DataStax) • High-scale, distributed and hybrid RDBMS: Teradata, Netezza, EMC/Greenplum, Aster Data, Vertica, VoltDB, RDF triple stores, Hadapt
  20. Data Modeling Example: Twitter Publishers and Subscribers • Relational DB: one table that holds the people relationships and tags each as publisher or subscriber. Pros: no duplicated data; ACID transactions. Cons: does not scale; single point of failure. • NoSQL: separate indices for subscribers and publishers. Pros: – partition independence enables scale-out and no single point of failure, for both reads and writes; – schemalessness allows quick development. Cons: – eventual consistency requires care in the application layer, in the presentation of the user experience, and in future evolution; – redundant storage.
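The NoSQL design above can be sketched concretely: two separate indices, one keyed by publisher and one by subscriber, with the application responsible for writing both. The names (`followers`, `following`, `follow`) are illustrative, not from the talk:

```python
# Sketch of the slide's two-index NoSQL model for Twitter-style follows.
# Each index can live on its own partition, which is the "partition
# independence" pro; the application must issue both writes, and under
# eventual consistency they may become visible at different times --
# the "care in the application layer" con.
from collections import defaultdict

followers = defaultdict(set)   # publisher  -> set of subscribers
following = defaultdict(set)   # subscriber -> set of publishers

def follow(subscriber: str, publisher: str) -> None:
    # Two independent writes; redundant storage buys independent scale-out.
    followers[publisher].add(subscriber)
    following[subscriber].add(publisher)

follow("alice", "bob")
follow("carol", "bob")
print(sorted(followers["bob"]))    # ['alice', 'carol']
print(sorted(following["alice"]))  # ['bob']
```

In the relational design both questions ("who follows bob?" and "whom does alice follow?") hit the same table; here each question has its own index, at the cost of duplicated data and the window where only one of the two writes has landed.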
  21. Gotchas – 12 Things to Watch Out For 1. Privacy 2. Abuse – e.g. HFT/front-running. 3. Immature technologies and companies. 4. The effect of business and product changes on the architecture. 5. Data in-transit vs. at-rest – replication, mirroring, streaming, reprocessing. 6. Data security, in-transit and at-rest. 7. Blurring of high availability, performance, and disaster recovery. 8. Replacing sampling and aggregation with ALL of the data! 9. Correlation is not causation – e.g. Google Flu. 10. Data snooping (or confirmation bias). 11. Irrelevance. 12. Veracity – how do you check and reproduce results? “The process of making is iterative” – Cesar A. Hidalgo
  22. References and More Info (link slide; URLs truncated) • Animation: Large Hadron Collider at CERN data processing • Copyleft – Share Alike
  23. Questions? Use Cases? Technology Adoption? Feedback or follow-up: