Contents

Preface

1. Introduction to High Performance Spark
  What Is Spark and Why Performance Matters
  What You Can Expect to Get from This Book
  Spark Versions
  Why Scala?
  To Be a Spark Expert You Have to Learn a Little Scala Anyway
  The Spark Scala API Is Easier to Use Than the Java API
  Scala Is More Performant Than Python
  Why Not Scala?
  Learning Scala
  Conclusion

2. How Spark Works
  How Spark Fits into the Big Data Ecosystem
  Spark Components
  Spark Model of Parallel Computing: RDDs
  Lazy Evaluation
  In-Memory Persistence and Memory Management
  Immutability and the RDD Interface
  Types of RDDs
  Functions on RDDs: Transformations Versus Actions
  Wide Versus Narrow Dependencies
  Spark Job Scheduling
  Resource Allocation Across Applications
  The Spark Application
  The Anatomy of a Spark Job
  The DAG
  Jobs
  Stages
  Tasks
  Conclusion

3. DataFrames, Datasets, and Spark SQL
  Getting Started with the SparkSession (or HiveContext or SQLContext)
  Spark SQL Dependencies
  Managing Spark Dependencies
  Avoiding Hive JARs
  Basics of Schemas
  DataFrame API
  Transformations
  Multi-DataFrame Transformations
  Plain Old SQL Queries and Interacting with Hive Data
  Data Representation in DataFrames and Datasets
  Tungsten
  Data Loading and Saving Functions
  DataFrameWriter and DataFrameReader
  Formats
  Save Modes
  Partitions (Discovery and Writing)
  Datasets
  Interoperability with RDDs, DataFrames, and Local Collections
  Compile-Time Strong Typing
  Easier Functional (RDD "like") Transformations
  Relational Transformations
  Multi-Dataset Relational Transformations
  Grouped Operations on Datasets
  Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
  Query Optimizer
  Logical and Physical Plans
  Code Generation
  Large Query Plans and Iterative Algorithms
  Debugging Spark SQL Queries
  JDBC/ODBC Server
  Conclusion

4. Joins (SQL and Core)
  Core Spark Joins
  Choosing a Join Type
  Choosing an Execution Plan
  Spark SQL Joins
  DataFrame Joins
  Dataset Joins
  Conclusion

5. Effective Transformations
  Narrow Versus Wide Transformations
  Implications for Performance
  Implications for Fault Tolerance
  The Special Case of coalesce
  What Type of RDD Does Your Transformation Return?
  Minimizing Object Creation
  Reusing Existing Objects
  Using Smaller Data Structures
  Iterator-to-Iterator Transformations with mapPartitions
  What Is an Iterator-to-Iterator Transformation?
  Space and Time Advantages
  An Example
  Set Operations
  Reducing Setup Overhead
  Shared Variables
  Broadcast Variables
  Accumulators
  Reusing RDDs
  Cases for Reuse
  Deciding if Recompute Is Inexpensive Enough
  Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
  Alluxio (née Tachyon)
  LRU Caching
  Noisy Cluster Considerations
  Interaction with Accumulators
  Conclusion

6. Working with Key/Value Data
  The Goldilocks Example
  Goldilocks Version 0: Iterative Solution
  How to Use PairRDDFunctions and OrderedRDDFunctions
  Actions on Key/Value Pairs
  What's So Dangerous About the groupByKey Function
  Goldilocks Version 1: groupByKey Solution
  Choosing an Aggregation Operation
  Dictionary of Aggregation Operations with Performance Considerations
  Multiple RDD Operations
  Co-Grouping
  Partitioners and Key/Value Data
  Using the Spark Partitioner Object
  Hash Partitioning
  Range Partitioning
  Custom Partitioning
  Preserving Partitioning Information Across Transformations
  Leveraging Co-Located and Co-Partitioned RDDs
  Dictionary of Mapping and Partitioning Functions PairRDDFunctions
  Dictionary of OrderedRDDOperations
  Sorting by Two Keys with SortByKey
  Secondary Sort and repartitionAndSortWithinPartitions
  Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
  How Not to Sort by Two Orderings
  Goldilocks Version 2: Secondary Sort
  A Different Approach to Goldilocks
  Goldilocks Version 3: Sort on Cell Values
  Straggler Detection and Unbalanced Data
  Back to Goldilocks (Again)
  Goldilocks Version 4: Reduce to Distinct on Each Partition
  Conclusion

7. Going Beyond Scala
  Beyond Scala within the JVM
  Beyond Scala, and Beyond the JVM
  How PySpark Works
  How SparkR Works
  Spark.jl (Julia Spark)
  How Eclair JS Works
  Spark on the Common Language Runtime (CLR): C# and Friends
  Calling Other Languages from Spark
  Using Pipe and Friends
  JNI
  Java Native Access (JNA)
  Underneath Everything Is FORTRAN
  Getting to the GPU
  The Future
  Conclusion

8. Testing and Validation
  Unit Testing
  General Spark Unit Testing
  Mocking RDDs
  Getting Test Data
  Generating Large Datasets
  Sampling
  Property Checking with ScalaCheck
  Computing RDD Difference
  Integration Testing
  Choosing Your Integration Testing Environment
  Verifying Performance
  Spark Counters for Verifying Performance
  Projects for Verifying Performance
  Job Validation
  Conclusion

9. Spark MLlib and ML
  Choosing Between Spark MLlib and Spark ML
  Working with MLlib
  Getting Started with MLlib (Organization and Imports)
  MLlib Feature Encoding and Data Preparation
  Feature Scaling and Selection
  MLlib Model Training
  Predicting
  Serving and Persistence
  Model Evaluation
  Working with Spark ML
  Spark ML Organization and Imports
  Pipeline Stages
  Explain Params
  Data Encoding
  Data Cleaning
  Spark ML Models
  Putting It All Together in a Pipeline
  Training a Pipeline
  Accessing Individual Stages
  Data Persistence and Spark ML
  Extending Spark ML Pipelines with Your Own Algorithms
  Model and Pipeline Persistence and Serving with Spark ML
  General Serving Considerations
  Conclusion

10. Spark Components and Packages
  Stream Processing with Spark
  Sources and Sinks
  Batch Intervals
  Data Checkpoint Intervals
  Considerations for DStreams
  Considerations for Structured Streaming
  High Availability Mode (or Handling Driver Failure or Checkpointing)
  GraphX
  Using Community Packages and Libraries
  Creating a Spark Package
  Conclusion

A. Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist

Index