# `Text.WordCloud`
[🔗](https://github.com/kipcole9/text/blob/v0.6.1/lib/word_cloud.ex#L1)

Builds a weighted list of terms suitable for rendering as a word cloud.

The function returns a list of `%{term, weight, count, kind}` maps
sorted by `:weight` (descending). The top term always has weight
`1.0`; every other weight is normalised relative to it. Visual
layout — placing the words on a canvas — is handled separately by
`Text.WordCloud.Layout`.
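As a rough illustration of the normalisation described above, the sketch below divides each raw score by the largest one and sorts the result by weight. The module `WeightSketch`, the `:score` field, and the assumption that higher raw scores are better are all hypothetical; the library's backends may normalise differently.

```elixir
defmodule WeightSketch do
  # Hypothetical helper, not part of Text.WordCloud.
  # Assumes every entry carries a positive raw :score where higher is better.
  def normalise([]), do: []

  def normalise(entries) do
    top_score = entries |> Enum.map(& &1.score) |> Enum.max()

    entries
    |> Enum.map(fn entry ->
      entry
      |> Map.delete(:score)
      # `/` always yields a float, so the top entry gets exactly 1.0.
      |> Map.put(:weight, entry.score / top_score)
    end)
    |> Enum.sort_by(& &1.weight, :desc)
  end
end

scored = [
  %{term: "mat", score: 1, count: 1, kind: :word},
  %{term: "cat", score: 3, count: 3, kind: :word}
]

[top | _rest] = WeightSketch.normalise(scored)
IO.inspect(top)
```

Under these assumptions the highest-scoring entry always ends up first with `weight: 1.0`, which matches the shape the module documents.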

Supports several scoring algorithms via the `:scoring` option;
`:yake` (the default) requires no reference corpus and is
multilingual by construction. See the `Text.WordCloud.Backends.*`
modules for the catalogue.

Multilingual end-to-end:

* Tokenisation runs through `Text.Segment.words/2` (Unicode UAX #29).

* Sentence segmentation uses `Text.Segment.sentences/2`.

* Stopwords come from the bundled `Text.Stopwords` (~60 languages)
  via the `:stopwords` option.

* Language can be auto-detected with `Text.Language.Classifier.Fasttext`
  by passing `language: {:auto, model}`. When `:language` is unset, or
  the classifier is not available, no language-specific behaviour is
  applied.

# `term_entry`

```elixir
@type term_entry() :: %{
  term: String.t(),
  weight: float(),
  count: pos_integer(),
  kind: :word | :phrase
}
```

A scored term, ready for rendering.

# `terms`

```elixir
@spec terms(
  String.t() | [String.t()],
  keyword()
) :: [term_entry()]
```

Returns a weighted list of terms for `text` suitable for word-cloud rendering.

### Arguments

* `text` is a UTF-8 string or a list of strings. A list is treated
  as a corpus of independent documents.

### Options

* `:scoring` — `:yake` (default), `:frequency`, `:tf_idf`, `:rake`,
  `:text_rank`, `:key_bert`, or any module implementing
  `Text.WordCloud.Backend`.

* `:max_terms` — cap on returned entries. Default `100`.

* `:min_count` — drop terms occurring fewer times than this.
  Default `1`.

* `:ngram_range` — `{min, max}` number of tokens per candidate term.
  Default depends on the backend (`{1, 3}` for `:yake`, `{1, 1}` for
  `:frequency`).

* `:language` — atom, BCP-47 string, or `Localize.LanguageTag`.
  Default `nil` (no language-specific behaviour). Pass
  `{:auto, model}` to auto-detect via a pre-loaded
  `Text.Language.Classifier.Fasttext.Model` — the orchestrator
  does not load the fastText model itself, so callers wanting
  detection load it once at boot and hand it in.

* `:stopwords` — `:auto` (use the bundled list for the resolved
  language; default), `:none`, a list, a `MapSet`, or
  `{:extend, [extra]}` to add to the bundled list.

* `:case_fold` — boolean, default `true`.

* `:stem` — boolean, default `false`. When `true`, candidate terms
  are bucketed by their Snowball stem so morphological variants
  (`demolish`, `demolished`, `demolishing`, `demolition`) collapse
  into a single entry. The most-frequent surface form represents
  the bucket; counts and raw scores are summed across members.
  Requires the optional `:text_stemmer` dependency. The stemmer
  language defaults to the resolved `:language`; override with
  `:stem_language`.

* `:stem_language` — atom override for the stemmer language. Useful
  when the corpus language differs from the bucketing language
  (e.g. mixed-language text where you want only English variants
  consolidated). Defaults to `:language`.

* `:include` — `:all` (default), `:words` only, or `:phrases` only.

* `:reference_corpus` — used by `:tf_idf` and `:log_likelihood`.
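The `:stem` bucketing described in the options above can be pictured with the following sketch. It is an illustration under stated assumptions, not the library's code: `StemBucketSketch` is hypothetical, the `:score` field is an assumed name for the raw score, and `toy_stem/1` is a hard-coded lookup standing in for a real Snowball stemmer.

```elixir
defmodule StemBucketSketch do
  # Hypothetical stand-in for a Snowball stemmer: a fixed lookup table.
  @toy_stems %{
    "demolish" => "demolish",
    "demolished" => "demolish",
    "demolishing" => "demolish",
    "demolition" => "demolish"
  }

  # Terms missing from the table are treated as their own stem.
  def toy_stem(term), do: Map.get(@toy_stems, term, term)

  # Groups candidates by stem, sums counts and raw scores across the
  # bucket, and keeps the most frequent surface form as its label.
  def bucket(candidates) do
    candidates
    |> Enum.group_by(&toy_stem(&1.term))
    |> Enum.map(fn {_stem, members} ->
      representative = Enum.max_by(members, & &1.count)

      %{
        term: representative.term,
        count: members |> Enum.map(& &1.count) |> Enum.sum(),
        score: members |> Enum.map(& &1.score) |> Enum.sum()
      }
    end)
  end
end

candidates = [
  %{term: "demolished", count: 4, score: 0.5},
  %{term: "demolition", count: 2, score: 0.25},
  %{term: "bridge", count: 3, score: 0.75}
]

IO.inspect(StemBucketSketch.bucket(candidates))
```

Here `demolished` and `demolition` fall into one bucket labelled `demolished` (the more frequent form) with a combined count of 6, while `bridge` stays in a bucket of its own.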

### Returns

* A list of `%{term, weight, count, kind}` maps sorted by `:weight`
  descending. The top entry has `weight: 1.0`.

### Examples

    iex> text = "the cat sat on the mat. the cat ran. the cat slept."
    iex> [first | _] = Text.WordCloud.terms(text, scoring: :frequency, language: :en, max_terms: 3)
    iex> first.term
    "cat"

# `to_d3_cloud`

```elixir
@spec to_d3_cloud(
  [term_entry()],
  keyword()
) :: [
  %{
    text: String.t(),
    size: float(),
    weight: float(),
    count: pos_integer(),
    kind: :word | :phrase
  }
]
```

Converts scored terms into the shape consumed by [d3-cloud](https://github.com/jasondavies/d3-cloud).

d3-cloud expects an array of `{text, size}` records and runs its
Wordle-style layout in the browser. This adapter maps each entry's
`:weight` to a pixel font size using the same `:font_size_range`
vocabulary as `Text.WordCloud.Layout`, so a server-rendered SVG and a
client-rendered d3-cloud will scale identically.

The original `:weight`, `:count`, and `:kind` fields are passed through
unchanged. d3-cloud ignores them but exposes the full datum to its
`text`, `fontSize`, `fontWeight`, and `rotate` callbacks, so consumers
can read e.g. `d.count` for tooltips with no extra plumbing.

### Arguments

* `terms` is the output of `Text.WordCloud.terms/2` (or any list of
  `%{term, weight, count, kind}` maps).

### Options

* `:font_size_range` is a `{min, max}` pixel tuple. Weight `1.0` maps
  to `max`, weight `0.0` maps to `min`. Default `{12, 96}`.

* `:scale` is `:linear` (default) or `:sqrt`. `:sqrt` produces
  area-proportional sizing, which is the convention most d3-cloud
  examples use. `:linear` matches `Text.WordCloud.Layout`'s behaviour.
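The weight-to-size mapping these options describe can be sketched as a plain interpolation. `FontSizeSketch` is a hypothetical module, and the `:sqrt` branch assumes the square root is applied to the weight before interpolating; the library may compute it differently.

```elixir
defmodule FontSizeSketch do
  # Hypothetical helper, not the library's implementation.
  # Interpolates a weight in 0.0..1.0 into the {lo, hi} pixel range.
  def size(weight, {lo, hi}, scale \\ :linear) do
    lo + curve(weight, scale) * (hi - lo)
  end

  defp curve(weight, :linear), do: weight
  # Assumption: :sqrt takes the square root of the weight first.
  defp curve(weight, :sqrt), do: :math.sqrt(weight)
end

IO.inspect(FontSizeSketch.size(1.0, {10, 100}))
# 100.0
IO.inspect(FontSizeSketch.size(0.5, {10, 100}))
# 55.0
IO.inspect(FontSizeSketch.size(0.25, {10, 100}, :sqrt))
# 55.0
```

With `:linear`, a weight of `0.5` lands halfway between the two bounds. With `:sqrt`, lower weights are lifted towards the top of the range, so a weight of `0.25` reaches the same size that `0.5` does under `:linear`.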

### Returns

* A list of `%{text, size, weight, count, kind}` maps sorted by
  `:size` descending. The `:text` and `:size` keys are what d3-cloud
  consumes; the rest are passed through for callbacks.

### Examples

    iex> terms = [
    ...>   %{term: "elixir", weight: 1.0, count: 5, kind: :word},
    ...>   %{term: "phoenix", weight: 0.5, count: 2, kind: :word}
    ...> ]
    iex> Text.WordCloud.to_d3_cloud(terms, font_size_range: {10, 100})
    [
      %{text: "elixir", size: 100.0, weight: 1.0, count: 5, kind: :word},
      %{text: "phoenix", size: 55.0, weight: 0.5, count: 2, kind: :word}
    ]

