Advanced Analytics with Spark (SPARK高级数据分析)

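Brand new, genuine edition, fast shipping

¥29.89 (53% of the ¥56 list price), brand new

Only 1 in stock

Guangzhou, Guangdong

Verified seller, escrow-protected transaction, fast shipping, after-sales guarantee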

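Author: Sandy Ryza (USA) et al.

Publisher: Southeast University Press

ISBN: 9787564159108

Publication date: 2015-09

Binding: Paperback

Trim size: 16开 (16mo)

List price: ¥56

Item number: 1201198159

Listed on: 2024-11-14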

Shop: 大智慧小美丽

Real-name verified, certified

Items on sale: none yet
Average shipping time: 17 hours
Positive rating: none yet

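Product details

Condition: Brand new

Product description:
About the author
Sandy Ryza is a senior data scientist at Cloudera and an active contributor to the Apache Spark project.

Table of contents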
Foreword
Preface
1.Analyzing Big Data
The Challenges of Data Science
Introducing Apache Spark
About This Book
2.Introduction to Data Analysis with Scala and Spark
Scala for Data Scientists
The Spark Programming Model
Record Linkage
Getting Started: The Spark Shell and SparkContext
Bringing Data from the Cluster to the Client
Shipping Code from the Client to the Cluster
Structuring Data with Tuples and Case Classes
Aggregations
Creating Histograms
Summary Statistics for Continuous Variables
Creating Reusable Code for Computing Summary Statistics
Simple Variable Selection and Scoring
Where to Go from Here
3.Recommending Music and the Audioscrobbler Data Set
Data Set
The Alternating Least Squares Recommender Algorithm
Preparing the Data
Building a First Model
Spot Checking Recommendations
Evaluating Recommendation Quality
Computing AUC
Hyperparameter Selection
Making Recommendations
Where to Go from Here
4.Predicting Forest Cover with Decision Trees
Fast Forward to Regression
Vectors and Features
Training Examples
Decision Trees and Forests
Covtype Data Set
Preparing the Data
A First Decision Tree
Decision Tree Hyperparameters
Tuning Decision Trees
Categorical Features Revisited
Random Decision Forests
Making Predictions
Where to Go from Here
5.Anomaly Detection in Network Traffic with K-means Clustering
Anomaly Detection
K-means Clustering
Network Intrusion
KDD Cup 1999 Data Set
A First Take on Clustering
Choosing k
Visualization in R
Feature Normalization
Categorical Variables
Using Labels with Entropy
Clustering in Action
Where to Go from Here
6.Understanding Wikipedia with Latent Semantic Analysis
The Term-Document Matrix
Getting the Data
Parsing and Preparing the Data
Lemmatization
Computing the TF-IDFs
Singular Value Decomposition
Finding Important Concepts
Querying and Scoring with the Low-Dimensional Representation
Term-Term Relevance
Document-Document Relevance
Term-Document Relevance
Multiple-Term Queries
Where to Go from Here
7.Analyzing Co-occurrence Networks with GraphX
The MEDLINE Citation Index: A Network Analysis
Getting the Data
Parsing XML Documents with Scala's XML Library
Analyzing the MeSH Major Topics and Their Co-occurrences
Constructing a Co-occurrence Network with GraphX
Understanding the Structure of Networks
Connected Components
Degree Distribution
Filtering Out Noisy Edges
Processing Edge Triplets
Analyzing the Filtered Graph
Small-World Networks
Cliques and Clustering Coefficients
Computing Average Path Length with Pregel
Where to Go from Here
8.Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
Getting the Data
Working with Temporal and Geospatial Data in Spark
Temporal Data with Joda Time and NScala Time
Geospatial Data with the Esri Geometry API and Spray
Exploring the Esri Geometry API
Intro to GeoJSON
Preparing the New York City Taxi Trip Data
Handling Invalid Records at Scale
Geospatial Analysis
Sessionization in Spark
Building Sessions: Secondary Sorts in Spark
Where to Go from Here
9.Estimating Financial Risk through Monte Carlo Simulation
Terminology
Methods for Calculating VaR
Variance-Covariance
Historical Simulation
Monte Carlo Simulation
Our Model
Getting the Data
Preprocessing
Determining the Factor Weights
Sampling
The Multivariate Normal Distribution
Running the Trials
Visualizing the Distribution of Returns
Evaluating Our Results
Where to Go from Here
10.Analyzing Genomics Data and the BDG Project
Decoupling Storage from Modeling
Ingesting Genomics Data with the ADAM CLI
Parquet Format and Columnar Storage
Predicting Transcription Factor Binding Sites from ENCODE Data
Querying Genotypes from the 1000 Genomes Project
Where to Go from Here
11.Analyzing Neuroimaging Data with PySpark and Thunder
Overview of PySpark
PySpark Internals
Overview and Installation of the Thunder Library
Loading Data with Thunder
Thunder Core Data Types
Categorizing Neuron Types with Thunder
Where to Go from Here
A.Deeper into Spark
B.Upcoming MLlib Pipelines API
Index

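Summary
In an era of rapidly growing network data, there is an urgent need for tools that can analyze and process data quickly and efficiently, and Spark emerged to meet that need. Written by Spark developers and core members, this book helps readers quickly master methods for collecting, computing, simplifying, and storing massive data sets with Spark, learn interactive, iterative, and incremental analysis, and solve problems such as partitioning, data locality, and custom serialization.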
